Amazon EMR Guide - Optimize Big Data Analytics

How Master Slave Structure works

What is Amazon EMR (Elastic MapReduce)?

Enterprises today, small and large, are looking to ‘Big Data’ technologies to help solve their data problems. As explained in one of my earlier posts, Big Data is not a single technology but a term used to describe the large amount of data generated in today’s increasingly online world.

One of the technologies frequently associated with Big Data is Apache Hadoop. In an earlier post, Veeraj compared various distributions of Hadoop.

In this post, I will describe the Amazon EMR (Elastic MapReduce) distribution in detail.

So, what is Amazon EMR?

Amazon EMR is a managed Hadoop distribution from Amazon Web Services. It helps users analyze and process large amounts of data by distributing the computation across multiple nodes in a cluster on the AWS Cloud.

Amazon EMR uses a customized Apache Hadoop framework to achieve large-scale distributed processing of data. The Hadoop framework uses a distributed data processing architecture known as MapReduce, in which a data processing task is mapped to a set of servers in a cluster for processing. The results of the computation performed by these servers are then reduced to a single output.
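To make the map and reduce steps described above more concrete, here is a minimal single-machine sketch of a word count in Python. It only illustrates the idea and is not how EMR or Hadoop is implemented: the function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) are made up for this example, and on a real cluster the framework would run the map and reduce work on different servers.

```python
from collections import defaultdict


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Group all values by key, the step a framework performs between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reduce: collapse all counts for one word into a single total."""
    return key, sum(values)


def word_count(lines):
    """Run map, shuffle and reduce in sequence on a list of text lines."""
    pairs = []
    for line in lines:  # on a cluster, each line (or block) could go to a different node
        pairs.extend(map_phase(line))
    grouped = shuffle(pairs)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


if __name__ == "__main__":
    sample = ["big data on EMR", "big clusters process big data"]
    print(word_count(sample))
```

Each call to `map_phase` is independent of the others, which is what allows the map work to be spread across many servers before the results are combined.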

All the open source projects that work with Apache Hadoop also work seamlessly with Amazon EMR. In addition, Amazon EMR is well integrated with various AWS services such as EC2 (used to launch master and slave nodes), S3 (used as an alternative to HDFS), CloudWatch (used to monitor jobs on EMR), Amazon RDS, and DynamoDB.

Amazon EMR allows you to run your custom MapReduce programs, written in Java. You have the flexibility to launch any number of EC2 instances with various server configurations. EMR also allows you to update the default Hadoop configurations to tune your job flows according to your specific needs. (A job flow is a set of steps that process a specific data set using a cluster of EC2 instances.)
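As a rough illustration of a job flow, the sketch below builds a request description as a plain Python dictionary. The field names follow the shape of the EMR `RunJobFlow` request as exposed by the AWS SDK for Python (boto3), to the best of my knowledge. Everything else is an assumption made for this example: the bucket `my-example-bucket`, the JAR path, the class `com.example.LogAnalyzer`, the instance type and the helper `build_job_flow`. The code only builds the dictionary and does not call AWS.

```python
def build_job_flow(name, instance_type, instance_count, jar_path, main_class, args):
    """Describe a job flow: a cluster of EC2 instances plus the steps to run on it."""
    return {
        "Name": name,
        "Instances": {
            "MasterInstanceType": instance_type,  # node that coordinates the job
            "SlaveInstanceType": instance_type,   # nodes that run the map/reduce tasks
            "InstanceCount": instance_count,      # total number of EC2 instances
        },
        "Steps": [
            {
                "Name": "run-custom-jar",
                "ActionOnFailure": "TERMINATE_JOB_FLOW",
                "HadoopJarStep": {
                    "Jar": jar_path,          # custom MapReduce program written in Java
                    "MainClass": main_class,
                    "Args": list(args),       # e.g. input and output locations in S3
                },
            }
        ],
    }


if __name__ == "__main__":
    # All names below are hypothetical placeholders.
    job_flow = build_job_flow(
        name="log-analysis",
        instance_type="m1.large",
        instance_count=5,
        jar_path="s3://my-example-bucket/jars/log-analyzer.jar",
        main_class="com.example.LogAnalyzer",
        args=["s3://my-example-bucket/input/", "s3://my-example-bucket/output/"],
    )
    print(job_flow["Steps"][0]["HadoopJarStep"]["Jar"])
```

A dictionary like this could then be passed to an SDK client, for example as `client.run_job_flow(**job_flow)`. A real request would need further settings, such as IAM roles and a release or AMI version.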

EMR also supports bootstrap actions, which provide a way to run custom set-up before your job flow executes. Bootstrap actions can be used to install software or configure instances before running a job flow.

Overall, Amazon EMR provides a simpler and more cost-effective way to run a Hadoop cluster, without the overhead of buying and maintaining your own hardware and deploying Hadoop yourself.
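A bootstrap action is essentially a script stored in S3 that each instance runs during start-up. The sketch below shows how one might be described in Python, using what I understand to be the shape of the `BootstrapActions` field in the EMR `RunJobFlow` request. The script path, its arguments and the helper `bootstrap_action` are hypothetical, and the code only builds the data structure without contacting AWS.

```python
def bootstrap_action(name, script_path, args=()):
    """Describe one bootstrap action: a script run on each node before the job flow starts."""
    return {
        "Name": name,
        "ScriptBootstrapAction": {
            "Path": script_path,  # location of the set-up script, typically in S3
            "Args": list(args),   # arguments passed to the script
        },
    }


if __name__ == "__main__":
    # Hypothetical script that installs extra software on every instance.
    actions = [
        bootstrap_action(
            name="install-extra-tools",
            script_path="s3://my-example-bucket/bootstrap/install-tools.sh",
            args=["--with-monitoring"],
        )
    ]
    for action in actions:
        print(action["Name"], "->", action["ScriptBootstrapAction"]["Path"])
```

A list like `actions` would be supplied alongside the cluster and step definitions when the job flow is launched.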

Use cases of Amazon EMR

Amazon EMR can be used to run applications with data-intensive workloads.

Some common use cases for Amazon EMR are:

Data Mining
Log file analysis
Web indexing
Machine learning
Financial analysis
Scientific simulations
Data warehousing
Bioinformatics research

Apart from these, your organization may have other specific use cases that require large-scale data computation. Amazon EMR can be used for those as well.

Advantages of Amazon EMR

No upfront investments in hardware infrastructure
Simple and managed cluster launching
Easy to scale up or down
Integration with other Amazon Web Services including S3 as an alternative to HDFS
Integration with other Apache Hadoop projects, including Hive and Pig
Multiple EC2 instance options for clusters, which give a lot of flexibility
Integration with leading BI Tools
Multiple management tools including CLI, SDKs and User Console

Limitations of Amazon EMR

Amazon EMR is not open source, so you have limited control over the source code
Latencies are higher, because typical EMR jobs process data on EC2 that is stored in S3, and moving data from S3 to EC2 takes time
Amazon EMR does not support the latest version of Hadoop. At the time of writing, the versions supported by EMR are Hadoop 0.20.205 and Hadoop 1.0.3 with custom patches. If your application requires the latest features of Hadoop, EMR may not be the best option

If you have a Big Data requirement and are looking for expert help with it, please feel free to contact us.
