Let’s get some grounding.
Imagine I am a school teacher and I need to work out my class’s average. There are 30 kids in my class, so it is fairly simple. Excel sheet done.
Imagine I am a teacher on Udemy or something similar. My class size is 300,000. Painful Excel sheet done.
Now imagine I want more than just the average. Excel probably won’t be able to handle it anymore and certainly wouldn’t be the easiest to work with.
Imagine I am now marking the class’s multiple-choice tests. Suddenly each student has 60 data points. In my school class that means at least 1,800 data points. In my Udemy class that is 18,000,000.
Let’s say I have a brilliantly slow computer which takes a second to do every sum. How long would all these calculations take?
My old, old computer, which struggles with simple calculations in Excel, is… struggling. I don’t want to wait over a year for the marking of a test.
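Here is a rough back-of-the-envelope check of that wait, as a minimal Python sketch. The function name `marking_time_seconds` and the `sums_per_answer` knob are my own inventions for illustration. It assumes exactly one one-second sum per data point, which puts the Udemy class at about 208 days; counting two sums per answer would already push it past a year.

```python
SECONDS_PER_SUM = 1       # the "brilliantly slow" computer: one second per sum
QUESTIONS_PER_TEST = 60   # data points per student


def marking_time_seconds(students, sums_per_answer=1):
    """Total seconds needed if every data point costs `sums_per_answer` sums.

    `sums_per_answer` is an assumption for illustration; the real number
    depends on what you calculate for each answer.
    """
    return students * QUESTIONS_PER_TEST * sums_per_answer * SECONDS_PER_SUM


for name, students in [("School class", 30), ("Udemy class", 300_000)]:
    seconds = marking_time_seconds(students)
    minutes = seconds / 60
    days = seconds / 86_400  # 86,400 seconds in a day
    print(f"{name}: {seconds:,} s = {minutes:,.0f} min = {days:,.1f} days")
```

The time scales linearly with both the class size and the number of sums per answer, which is why the jump from 30 to 300,000 students hurts so much.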
Instead, I split my data file and send the pieces to the other laptops I have in my imaginary junk drawer. Now I have three similar laptops to calculate the results. That means if I can get the laptops to work together well, we can pretty much cut the time to a third!
Awesome, I can now get my exam results within a third of a year! Woo hoo! [Sarcasm]
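To make the three-laptop idea concrete, here is a minimal Python sketch of the split, work, combine pattern for the class average. Everything in it is an assumption for illustration: the names `split`, `partial_result` and `distributed_average` are made up, the scores are fake, and three threads stand in for the three laptops. Threads in standard Python won’t actually make a simple sum faster; the point is only to show how the work is divided and stitched back together.

```python
from concurrent.futures import ThreadPoolExecutor


def split(scores, n_workers):
    """Cut the list into n_workers nearly equal chunks, one per 'laptop'."""
    size, extra = divmod(len(scores), n_workers)
    chunks, start = [], 0
    for i in range(n_workers):
        end = start + size + (1 if i < extra else 0)
        chunks.append(scores[start:end])
        start = end
    return chunks


def partial_result(chunk):
    """What one laptop sends back: its total and how many scores it saw."""
    return sum(chunk), len(chunk)


def distributed_average(scores, n_workers=3):
    """Average the scores by combining the partial results of each worker."""
    chunks = split(scores, n_workers)
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        partials = list(pool.map(partial_result, chunks))
    total = sum(t for t, _ in partials)
    count = sum(c for _, c in partials)
    return total / count


# Fake scores out of 60 for a 300,000-student class (made-up data).
scores = [(i * 7) % 61 for i in range(300_000)]
print(distributed_average(scores))
```

Notice that each laptop sends back a total and a count rather than its own average. Averaging the three averages would only be correct if every chunk were exactly the same size.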
Have you lost it Alex?
Valid question. I realise the gain in the example above is minimal, but there are real examples:
- SETI@home: a screensaver that let the project use your computer to analyse signals in the Search for Extra-Terrestrial Intelligence.
- DreamLab: plug in your phone at night, run the app, and let it “solve cancer in your sleep”.
This way of speeding up computation goes by many names. Call it parallel computing, speed-up, distributed computing or just being a clever ol’ stick, I am not going to bite your arm off about what is correct.
Businesses use this too. Hadoop is one tool for doing it.
History of Hadoop
It is a long one. I don’t like long stories. MapReduce was created by Google. Google “open-sourced it”, or more precisely published a paper in 2004 describing how it works. Hadoop took that MapReduce idea and built an open-source implementation of it, tweaking it to be easier to use and more robust. Hadoop has kept growing and getting easier to use, and some now consider it the “preferred tool”.
Next: How to Hadoop — coming soon.