Amazon EMR

Amazon EMR (Elastic MapReduce) is a web service that enables businesses, researchers, data analysts, and developers to easily and cost-effectively process vast amounts of data. Amazon EMR uses Hadoop, an open source framework, to distribute your data and processing across a resizable cluster of Amazon EC2 instances.

  • Using EMR, run multiple clusters with different sizes, specs and node types
  • Transient Cluster: Shut down the cluster when the job is done. Use it when (data load time + processing time )*no.of jobs < 24 hrs
  • Alive Clusters: Cluster stays around after the job is done. Able to share data between multiple jobs. Use it when (data load time + processing time )*no.of jobs > 24 hrs
  • Core Nodes: Runs TaskTrackers and DataNode (HDFS). You can add core nodes, but cannot remove them.
  • Task Nodes: Runs Task Trackers but no Datanodes. Reads from Core node HDFS. Can add/remove nodes. Use for speeding up job processing. Need extra horse power to pull data from S3.
  • You can use Amazon S3 as HDFS. Permanent data (Results) are stored in S3, while intermediate results are in HDFS. This way when the job is done, you can delete the cluster.
    • This way, S3 can share data with multiple clusters. HDFS cannot do this.
    • Don't use S3 when you are processing the same data set more than once.
  • If job is CPU/Memory bounded data then locality doesn't make a difference
  • Use S3 and HDFS for I/O intensive work loads. Store data in S3, pull using s3distcp, and process in HDFS.

  • In order to add nodes elastically follow this architecture.
    • Monitor cluster capacity with Elastic Cloud Watch
    • Have a SNS topic to notify elastic bean stalk to deploy an application
    • The application adds the corresponding nodes to your cluster.

Best Practices

  • Use M1.xlarge and larger nodes for production work loads
  • Use CC2 for memory and CPU intensive
  • Use CC2/C1.xlarge for CPU intensive
  • Hs1 for HDFS workloads
  • Hi1 and Hs1 for Disk I/O
  • Prefer smaller cluster of larger nodes than larger cluster of small nodes.
  • Estimated no of nodes = (Total Mappers *Time to process sample files)/(Instance mapper capacity *Desired Processing time)

Video Links

EMR Best Practices and Deep Dive