What is Amazon EMR (Elastic MapReduce)?
Enterprises today, small and large, are looking to ‘Big Data’ technologies to help solve their data problems. As explained in one of my earlier posts, Big Data is not a single technology but a term used to describe the large amounts of data generated in today’s increasingly online world.
One of the technologies frequently associated with Big Data is Apache Hadoop. In an earlier post, Veeraj compared various distributions of Hadoop.
In this post, I will describe the Amazon EMR (Elastic MapReduce) distribution in detail.
So, what is Amazon EMR?
Amazon EMR is a managed Hadoop distribution offered by Amazon Web Services. It helps users analyze and process large amounts of data by distributing the computation across multiple nodes in a cluster on the AWS Cloud.
Amazon EMR uses a customized Apache Hadoop framework to achieve large-scale distributed processing of data. The Hadoop framework uses a distributed data processing architecture known as MapReduce, in which a data processing task is mapped to a set of servers in a cluster for processing. The results of the computation performed by these servers are then reduced to a single output.
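To make the map and reduce steps concrete, here is a minimal single-machine sketch of a word count. It is written in plain Python for brevity and does not use Hadoop or EMR at all; the function names (map_phase, shuffle, reduce_phase, word_count) are my own illustrative choices, not part of any Hadoop API. On a real cluster, the framework would run the map calls on different nodes and handle the grouping step itself.

```python
from collections import defaultdict


def map_phase(line):
    """Map step: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Group the emitted values by key (the framework does this between map and reduce)."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reduce step: collapse all values for one key into a single result."""
    return key, sum(values)


def word_count(lines):
    """Run map, shuffle and reduce in sequence on one machine."""
    pairs = []
    for line in lines:  # on a cluster, each chunk of input could go to a different node
        pairs.extend(map_phase(line))
    grouped = shuffle(pairs)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


if __name__ == "__main__":
    sample = ["big data on EMR", "EMR runs Hadoop", "Hadoop processes big data"]
    print(word_count(sample))
```

The point of the split is that map_phase only ever looks at one line and reduce_phase only ever looks at one key, which is what lets the work be spread across many servers.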
Open source projects that work with Apache Hadoop, such as Hive and Pig, generally work with Amazon EMR as well. In addition, Amazon EMR is well integrated with various AWS services such as EC2 (used to launch master and slave nodes), S3 (used as an alternative to HDFS), CloudWatch (used to monitor jobs on EMR), Amazon RDS and DynamoDB.
Amazon EMR allows you to run your own custom MapReduce programs written in Java. You can launch as many EC2 instances as you need, in a range of server configurations. EMR also allows you to override the default Hadoop configuration to tune your job flows according to your specific needs (a job flow is a set of steps that process a specific data set using a cluster of EC2 instances).
EMR also supports bootstrap actions, which provide a way to run custom set-up before your job flow executes. Bootstrap actions can be used to install software or configure instances before a job flow runs.
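As a rough illustration of how a job flow, its steps and its bootstrap actions fit together, the sketch below builds a plain Python dictionary describing one. Nothing is sent to AWS. The field names follow the shape of the boto3 EMR run_job_flow request as I understand it, so treat them as an assumption and check the current API reference before relying on them. The bucket name, script path, JAR path and instance type are all hypothetical placeholders, and the dictionary is not a complete launch request.

```python
def build_job_flow(name, instance_count, log_uri):
    """Return a dict describing a job flow. This only builds data; it calls no AWS API."""
    return {
        "Name": name,
        "LogUri": log_uri,
        "Instances": {
            "MasterInstanceType": "m1.large",  # hypothetical choice of instance type
            "SlaveInstanceType": "m1.large",
            "InstanceCount": instance_count,
            "KeepJobFlowAliveWhenNoSteps": False,  # shut down once all steps finish
        },
        # Bootstrap actions run on the instances before any step starts.
        "BootstrapActions": [
            {
                "Name": "install-extra-software",
                "ScriptBootstrapAction": {
                    "Path": "s3://example-bucket/bootstrap/install.sh",  # placeholder
                    "Args": [],
                },
            }
        ],
        # Steps are the units of work that make up the job flow, run in order.
        "Steps": [
            {
                "Name": "word-count",
                "ActionOnFailure": "TERMINATE_JOB_FLOW",
                "HadoopJarStep": {
                    "Jar": "s3://example-bucket/jars/wordcount.jar",  # placeholder
                    "Args": [
                        "s3://example-bucket/input/",
                        "s3://example-bucket/output/",
                    ],
                },
            }
        ],
    }


def execution_order(job_flow):
    """List bootstrap action names followed by step names, in the order they would run."""
    bootstrap = [action["Name"] for action in job_flow["BootstrapActions"]]
    steps = [step["Name"] for step in job_flow["Steps"]]
    return bootstrap + steps


if __name__ == "__main__":
    flow = build_job_flow("demo-job-flow", 3, "s3://example-bucket/logs/")
    print(execution_order(flow))
```

Note that both the input and the output of the step point at S3 rather than HDFS, which matches the S3 integration described above.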
Overall, Amazon EMR provides a simple and cost-effective way to run a Hadoop cluster without the overhead of buying and maintaining your own hardware and deploying Hadoop yourself.
Use cases of Amazon EMR
Amazon EMR is suited to applications with data-intensive workloads.
Some common use cases for Amazon EMR are:
- Data mining
- Log file analysis
- Web indexing
- Machine learning
- Financial analysis
- Scientific simulations
- Data warehousing
- Bioinformatics research
Apart from these, your organization may have other specific use cases that require large-scale data computation. Amazon EMR can be used for those as well.
Advantages of Amazon EMR
- No upfront investments in hardware infrastructure
- Simple and managed cluster launching
- Easy to scale up or down
- Integration with other Amazon Web Services including S3 as an alternative to HDFS
- Integration with other Apache Hadoop projects, including Hive and Pig
- Multiple EC2 instance options for clusters, which give a lot of flexibility
- Integration with leading BI Tools
- Multiple management tools including CLI, SDKs and User Console
Limitations of Amazon EMR
- Amazon EMR is not open source, so you have limited control over the source code
- Latency is higher, because typical EMR jobs process data on EC2 that is stored in S3, and moving the data from S3 to EC2 takes time
- Amazon EMR does not support the latest version of Hadoop. At the time of writing, the versions supported by EMR are Hadoop 0.20.205 and Hadoop 1.0.3, with custom patches. If your application needs the latest features of Hadoop, EMR may not be the best option
If you have a Big Data requirement and are looking for expert help with it, please feel free to contact us.