Mastering Hadoop 3
上QQ阅读APP看书,第一时间看更新

MapReduce origin

Doug and Mike started working on an algorithm that can process data stored on NDFS. They wanted a system whose performance can be doubled by just doubling the number of machines running the program. At the same time, Google published MapReduce: Simplified Data Processing on Large Clusters (https://research.google.com/archive/mapreduce.html).

The core idea behind the MapReduce model was to provide parallelism, fault tolerance, and data locality features. Data locality means a program is executed where data is stored instead of bringing the data to the program. MapReduce was integrated into Nutch in 2005. In 2006, Doug created a new incubating project that consisted of HDFS (Hadoop Distributed File System), named after NDFS, MapReduce, and Hadoop Common.

At that time, Yahoo! was struggling with its backend search performance. Engineers at Yahoo! already knew the benefits of Google File System and MapReduce implemented at Google. Yahoo! decided to adopt the capability of Hadoop and they employed Doug to help their engineering team to do so. In 2007, a few more companies who started contributing to Hadoop and Yahoo! reported that they were running 1,000 node Hadoop clusters at the same time.

NameNodes and DataNodes have a specific role in managing overall clusters. NameNodes are responsible for maintaining metadata information. MapReduce engines have a job tracker and task tracker whose scalability is limited to 40,000 nodes because the overall work of scheduling and tracking is handled by only the job tracker. YARN was introduced in Hadoop version 2 to overcome scalability issues and resource management jobs. It gave Hadoop a new lease of life and Hadoop became a more robust, faster, and more scalable system.