MapReduce Fundamentals

  • This is a data analysis architecture capable of processing huge quantities of data in parallel. It suits workloads that can be broken into modular pieces with the same basic operations executed at scale. MapReduce is a process with two main stages, map and reduce: split the data into manageable chunks, perform operations on each chunk in parallel, then recombine the results for reporting.
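The split/map/recombine idea above can be sketched as a toy word count; the chunks run sequentially here for clarity, but in a real MapReduce system each map call would run on a different node:

```python
from collections import defaultdict

def map_phase(chunk):
    """Map: emit (word, 1) pairs for each word in a chunk of text."""
    return [(word, 1) for word in chunk.split()]

def reduce_phase(pairs):
    """Reduce: combine the counts for each key."""
    counts = defaultdict(int)
    for word, n in pairs:
        counts[word] += n
    return dict(counts)

# Split the data into manageable chunks...
chunks = ["the cat sat", "the dog sat"]
# ...map each chunk (in parallel on a real cluster)...
mapped = [pair for chunk in chunks for pair in map_phase(chunk)]
# ...then recombine for reporting.
result = reduce_phase(mapped)
print(result)  # {'the': 2, 'cat': 1, 'sat': 2, 'dog': 1}
```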

AWS EMR

  • Runs Apache Hadoop and Spark, along with other open-source projects such as Hive.
  • Ways to launch an EMR cluster:
    • As a transient cluster, if your objective is to migrate or move data using other AWS services such as Data Pipeline, with EMR as the computational infrastructure for that work.
    • As a long-running, general-purpose EMR cluster, created via the console, CLI, or SDK. You can also connect to the master node and perform tasks interactively on it. If you use Hive, which provides SQL-like queries, you need this method.
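Launching the long-running variant via the SDK might look like the following sketch of boto3's `run_job_flow` parameters; the cluster name, release label, instance types, and key-pair name are illustrative:

```python
# Parameters for a long-running EMR cluster with Hive installed.
cluster_params = {
    "Name": "analysis-cluster",          # illustrative name
    "ReleaseLabel": "emr-6.10.0",        # illustrative EMR release
    "Applications": [{"Name": "Hive"}],  # install Hive for SQL-like queries
    "Instances": {
        "MasterInstanceType": "m5.xlarge",
        "SlaveInstanceType": "m5.xlarge",
        "InstanceCount": 3,
        "KeepJobFlowAliveWhenNoSteps": True,  # long-running: stay up when idle
        "Ec2KeyName": "my-key-pair",          # illustrative, for SSH to master
    },
    "JobFlowRole": "EMR_EC2_DefaultRole",
    "ServiceRole": "EMR_DefaultRole",
}

# With AWS credentials configured, you would launch it like this:
# import boto3
# emr = boto3.client("emr", region_name="us-east-1")
# response = emr.run_job_flow(**cluster_params)
```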
  • Architecture of EMR clusters:
    • Master Node (HDFS NameNode): controls the whole cluster, distributes workloads, and monitors performance. This is the node you have direct access to. An EMR cluster runs on EC2 instances inside a VPC.
    • Core Nodes: a cluster has zero or more core nodes (a single-node cluster is just the master). These nodes perform tasks and are responsible for managing data replication within the HDFS file system across the cluster. If a core node fails, you can potentially lose data, which matters if you run core nodes on spot instances.
    • Task Nodes: execute tasks but have no involvement in HDFS, so there is far less risk of data loss within the cluster. They are well suited to spot instances because their failure causes no lasting issues. All task-node functionality is managed by the core nodes.
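The three node roles map onto an EMR instance-groups definition; because task nodes hold no HDFS data, they are the safe place for the SPOT market. Instance types, counts, and bid price below are illustrative:

```python
# Sketch of an InstanceGroups list for run_job_flow's Instances parameter.
instance_groups = [
    {"InstanceRole": "MASTER", "InstanceType": "m5.xlarge",
     "InstanceCount": 1, "Market": "ON_DEMAND"},
    {"InstanceRole": "CORE", "InstanceType": "m5.xlarge",   # holds HDFS data
     "InstanceCount": 2, "Market": "ON_DEMAND"},
    {"InstanceRole": "TASK", "InstanceType": "m5.xlarge",   # compute only, no HDFS
     "InstanceCount": 4, "Market": "SPOT", "BidPrice": "0.10"},
]

# Only roles with no HDFS involvement should be on spot.
spot_roles = [g["InstanceRole"] for g in instance_groups if g["Market"] == "SPOT"]
print(spot_roles)  # ['TASK']
```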
  • EMR can retrieve input data from S3 and store output back to S3. EMRFS (an S3-backed file system) offers persistence beyond the lifetime of the cluster.
  • By default, all of the nodes of the cluster are placed in the same AZ for low latency and high throughput.
  • Applications available within an EMR cluster: Hadoop, Ganglia, Hive, Mahout, Pig, etc.
  • EC2 instance size: master and core nodes would usually be m5 types. You cannot change the master node's instance type after the cluster has been created.
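The S3-backed pattern above shows up in step definitions as EMRFS `s3://` URIs, so the output survives cluster termination. A sketch of a Hadoop streaming step, with illustrative bucket names and scripts:

```python
# An EMR step that reads input from S3 and writes output to S3 via EMRFS.
step = {
    "Name": "wordcount-from-s3",
    "ActionOnFailure": "CONTINUE",
    "HadoopJarStep": {
        "Jar": "command-runner.jar",
        "Args": [
            "hadoop-streaming",
            "-input", "s3://my-input-bucket/logs/",      # EMRFS input path
            "-output", "s3://my-output-bucket/counts/",  # persists after the cluster ends
            "-mapper", "mapper.py",                      # illustrative scripts
            "-reducer", "reducer.py",
        ],
    },
}

# This dict would go in the Steps list of run_job_flow or add_job_flow_steps.
print(step["HadoopJarStep"]["Args"][2])  # s3://my-input-bucket/logs/
```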

  • Performance and Cost Optimisation

    • Provision the cluster as close as possible to the source and output data, i.e. in the same region as your S3 buckets. Although S3 presents a global namespace, every bucket actually lives in a specific region, and cross-region access has data-transfer cost implications. If you are using EMRFS, make sure the cluster is in the same region as the buckets.
    • Core nodes are the critical nodes to size wisely depending on the scenario. In most situations, general-purpose instance types such as m5.large are enough. For the master node, start with m4.large when the cluster has 50 or fewer nodes, and m4.xlarge when it has more. Analyse CloudWatch metrics to identify when additional capacity is needed.
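That CloudWatch analysis can be sketched as a `get_metric_statistics` request; EMR publishes metrics such as `YARNMemoryAvailablePercentage` under the `AWS/ElasticMapReduce` namespace, keyed by `JobFlowId`. The cluster ID below is illustrative:

```python
import datetime

# Request the average available YARN memory over the last hour for one cluster;
# a consistently low value suggests the cluster needs additional capacity.
metric_request = {
    "Namespace": "AWS/ElasticMapReduce",
    "MetricName": "YARNMemoryAvailablePercentage",
    "Dimensions": [{"Name": "JobFlowId", "Value": "j-EXAMPLE12345"}],  # illustrative ID
    "StartTime": datetime.datetime.utcnow() - datetime.timedelta(hours=1),
    "EndTime": datetime.datetime.utcnow(),
    "Period": 300,            # 5-minute datapoints
    "Statistics": ["Average"],
}

# With AWS credentials configured:
# import boto3
# cw = boto3.client("cloudwatch")
# stats = cw.get_metric_statistics(**metric_request)
```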