I want to know more than just “MapReduce looks at features of things and then sorts them to make computation easier” or “MapReduce: Divide and Conquer”. First I will look at Hadoop v1 and its associated pieces, and then v2. I will not be learning Java and how to write those functions properly.
Everything in the Hadoop ecosystem broadly aims to help with distributed computing — ie, “use two computers and you will halve your time”. MapReduce is the tool that directs and separates the data out to each node, or “computer”, and then reduces it down to the result you want out; it also provides some protection* in case of nodes failing.
*MapReduce can save the day.
Hadoop (through HDFS) keeps copies of the data: one on the node that should process it, another on a node in the same Rack, and one more copy on a node in a different Rack. (If you don’t recognise the Rack terminology I would recommend my other Hadoop post.)
This way, if the node goes down it is fine, and if the whole rack goes down it is still fine.
MapReduce v1
The MapReduce Engine employs a Master / Slave architecture, where a leader manages task allocation to its workers and the workers go off and complete the jobs. The leader is called the JobTracker and the workers are called TaskTrackers.
The JobTracker, the leader, accepts the MapReduce jobs we submit and tells the TaskTrackers what to do, which will either be “look through these and label them”, aka Map, or “calculate the final result from these intermediate results”, aka Reduce. To do this the JobTracker must also be aware of each TaskTracker’s capacity, so it knows when to give them work.
The TaskTrackers in turn do the Map and Reduce activities and then report their results and status back to the leader: “Here are the results, I am free now!”
To be as efficient as possible, the JobTracker/leader wants its workers to be as close to the raw material as possible, keeping data transfer between nodes to a minimum. The best way to do this is to stay within a Rack if possible — ideally on the very node that already holds the data.
A day in the life
Monday starts, and the data blocks are copied in from the HDFS as big unsorted packages on each node. The first step is always to Map, so the JobTracker tells each node’s worker what things we are interested in, and the TaskTrackers go to work, labelling every item with the relevant tag.
The JobTracker says we want to know how many items of each colour we have in the most recent block that arrived. The workers set to work looking through each item and labelling it. If something doesn’t fit the labelling scheme it is completely ignored. For example, if we asked for the red and blue colours only, the greens wouldn’t get labelled.
Tuesday and the mapping task has been completed. The warehouse floor is colourful with all the tags lying on the data.
We need to sort our stuff out. We need to Shuffle things around and get some order. Each worker takes their messy pile and, one by one, separates the items into piles of like colours. Then the workers each need to move those ordered piles to the passover area, so one by one they walk over to where the workers who do the next task sit and drop their piles off.
Wednesday, and everything is organised, so each of the counting workers can start, aiming to Reduce the bulk down to a simple number. The TaskTrackers get to work, counting up the items and creating a report, which then goes back into the warehouse. A worker can have one or more colours to count.
Don’t forget that any worker who is free can be asked to do any of the phases of work: Map, Shuffle or Reduce.
What is clever is that the larger the load, the more blocks there are and hence the more workers involved in the process. As such, there should be consistent performance in terms of speed. Similarly, the more categories used, the more reducer workers will be used to count and hence the faster the computation. Oh, and the number of workers doesn’t need to stay constant!
Also, MapReduce jobs can be chained together. You may have more blocks of coloured items being counted in a different factory/node/rack. Once a results file has been placed back in the HDFS, another MapReduce job may go through the results to add them to a more general total. Sometimes a Combiner can be used here to help, by doing some processing that makes the information simpler and easier to combine, but this is optional.
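To make the colour example a little more concrete, here is a rough, purely illustrative sketch of what the Map and Reduce halves could look like in Java. None of this comes from a real project — the input format (one item per line, colour first, comma separated) and the class names are just assumptions for the sketch, and each class would normally live in its own file.

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Map phase: label every item we care about with its colour and a count of 1.
class ColourCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text colour = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // Hypothetical input line: "red,ball" — the colour comes first.
        String c = value.toString().split(",")[0].trim().toLowerCase();
        // Only label the colours we were asked about; everything else is ignored.
        if (c.equals("red") || c.equals("blue")) {
            colour.set(c);
            context.write(colour, ONE);   // e.g. ("red", 1)
        }
    }
}

// Reduce phase: add up all the 1s for each colour this worker was handed.
class ColourCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable v : values) {
            sum += v.get();
        }
        context.write(key, new IntWritable(sum));  // e.g. ("red", 9041)
    }
}

The Shuffle in between is handled by Hadoop itself: it groups everything the mappers emitted by colour before handing the piles to the reducers.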
The one question I had was: who told the JobTracker what we wanted? The answer is the JobClient. Who is the JobClient? They are the salesman who puts the order of work in. So who is the customer? That is the developer, who wrote their order as a MapReduce program. This order is turned into the individual instructions that are put on the HDFS, which the workers later access once the JobTracker has planned who should do what job.
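Carrying on with the same illustrative sketch, the “order” the developer writes is essentially a small driver that wires the Mapper and Reducer together and submits the job — again, the class names here are just the made-up ones from the sketch above.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

// The "order form": describes the whole job and hands it over to be scheduled.
public class ColourCountJob {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "colour count");
        job.setJarByClass(ColourCountJob.class);

        job.setMapperClass(ColourCountMapper.class);
        job.setCombinerClass(ColourCountReducer.class); // the optional Combiner from earlier
        job.setReducerClass(ColourCountReducer.class);

        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        // Where the raw items live and where the counts should end up (both on the HDFS).
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

When waitForCompletion runs, it is the JobClient machinery that packages this order up and hands it over so the JobTracker can plan who does what.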
Also know that throughout the workers’ day they are constantly in contact with the JobTracker, sending updates on whether they are still working or not. These are called Heartbeat messages.
Outside of the analogy, there is also one step we have ignored. When a “worker” receives instructions on what to do, they “put on their gloves” and spin up a Java Virtual Machine to run the task in. Also, after the map function nothing is backed up — the intermediate data is still only on the local node. That means that pile of work will be lost if we fail there.
Factory Faults
Hopefully that analogy helps you; it does me. Now let’s run with it further to try to understand fault tolerance.
Workers Failing:
The manager / JobTracker wants constant communication with its workers / TaskTrackers. It does this with Heartbeat messages. Imagine these are actual heartbeat readings: if the pulse goes flat, the manager can remotely trigger a heart restart. The worker will not be asked to do the same activity again — if they were doing Activity 1, someone else will do that task and they will get a new Activity 2. But what about the work they had already done on Activity 1?
Well, if Activity 1 was a map or shuffle, we have to do it all again. This is because the data isn’t saved anywhere other than locally on the worker, and when the worker resets, so does their local pile of work.
But if Activity 1 was a reduce, those are done in sections. The worker might have counted all the Purple and Yellow but not finished the Orange. The Purple and Yellow sums are complete and saved in the HDFS; we only have to assign someone to do the Orange sum.
Hadoop organises this all automatically.
Writing MapReduce instructions
First you need to be in contact with your data. This means we need it to be on the distributed file system (our HDFS). One way to do this is to log in to your Hadoop system. Essentially we want to get hold of the computer that is running our system, which isn’t easy, especially if it is in the cloud (can you fly?).
One way to do it is to use a program called PuTTY. Essentially, it gives you a normal command line interface, visible on your computer, but running on that remote computer.
First let’s see what files we have on our Hadoop system’s file system (the HDFS):
hdfs dfs -ls
Let’s break that down: hdfs calls for functions from Hadoop, dfs asks for the functions that apply to the distributed file system, and -ls means list — in human language, list all the files.
What we want to do now is upload our data to it. First let’s create a directory for it…
hdfs dfs -mkdir sampledata
…and allow it to be accessible — ie don’t have any permission settings that could block it being used.
hdfs dfs -chmod 777 sampledata
This is new. -chmod is a command which allows you to set permissions. There are 3 levels you can define permissions for: Owner, Group, and Others. For each of these we define their abilities with a single digit, from 0 to 7:
0 — no permission
1 — execute
2 — write
3 — write and execute
4 — read
5 — read and execute
6 — read and write
7 — read, write, and execute
So 777 above gives the Owner, Group, and Others full rights: read, write, and execute. Next we want to move the data into our new directory using the -put method.
hdfs dfs -put ~/theDataFilesName.dat sampledata
Sweet. So once uploaded we could check that the file is the right one:
hdfs dfs -cat sampledata/SumnerCountyTemp.dat | more
Here -cat allows us to preview the file, and | more lets you press the spacebar to keep scrolling through it.
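As a side note — and purely as a hedged sketch, since these shell commands are all you really need — the same upload steps can also be driven from Java through Hadoop’s FileSystem API, which is roughly what the shell is doing underneath. The paths and names are the same placeholders as above.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.fs.permission.FsPermission;

// Mirrors the shell steps above: make the directory, open up its permissions,
// then copy the local data file into it.
class UploadSampleData {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());

        Path dir = new Path("sampledata");
        fs.mkdirs(dir);                                         // hdfs dfs -mkdir sampledata
        fs.setPermission(dir, new FsPermission((short) 0777));  // hdfs dfs -chmod 777 sampledata

        // hdfs dfs -put ~/theDataFilesName.dat sampledata
        fs.copyFromLocalFile(new Path(System.getProperty("user.home"), "theDataFilesName.dat"), dir);

        fs.close();
    }
}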
Next you need to write your own Map and Reduce instruction files. These need to be written in Java and then compiled into a Java Archive file (.jar) — the colour-count sketches earlier give a rough feel for what they contain. Then you push these to the HDFS so any node can read them.
Once you have the results you may want to run the job again — maybe you have new data or something. Hadoop will want an empty location to put its results in, so you will need to remove the last results it generated. You do this with -rm. To remove everything within the directory the command needs to be recursive, hence -R, and finally you add the location you want to remove.
hdfs dfs -rm -R sampledata/locationOfResults
The Hadoop documentation has other examples and tutorials to work through, as does YouTube.
Problems with MapReduce v1 and Hadoop v1
One issue is that there is only one leader / JobTracker. What happens when it is overloaded with MapReduce requests and heartbeats? Well, it becomes a bottleneck and restricts the whole cluster.
Basically, managers can get too much info and become useless.
YARN
YARN was introduced to help here in Hadoop v2 and MapReduce v2. It splits up the role of the leader / JobTracker: YARN provides a Resource Manager, Application Masters, and Node Managers.
The global / king Resource Manager sits at the top: it accepts the work that comes in and decides who gets what resources across the whole cluster.
An Application Master’s job is to look after a single incoming job: it works out what resources the job needs, asks for them, then monitors its team and restarts members if necessary. It also updates the Resource Manager when its team members have finished so they can be used elsewhere. Any node can host an Application Master at any time, if the Resource Manager tells it to, and the Resource Manager also defines what its team will be — HR does the actual hiring.
There is also a Node Manager. They are like HR: they hire and train / spin up team members / containers who can do the job, as directed by the Resource Manager. There is one HR person for every location, and hence one per node, and they report back when they have hired / spun up resources.
Another reason for YARN is that the Application Master and Resource Manager are multilingual. They can take orders from Spanish, Italians, anyone. What I actually mean is they can work with a variety of coding languages and frameworks.
YARN is also backwards compatible with MapReduce v1 programs, so moving over sees no visible difference for existing jobs.
YARN has the ability to direct a Node Manager to spin up a new container if needed. As such, utilization can be dynamic, scalable and efficient.
MapReduce is still very popular in its new role as one of the frameworks that runs on top of Hadoop, but there are others. One of those is Apache Spark, which I want to talk about in another post.