Hadoop is an open-source software framework for storing and processing Big Data.
By ‘Big Data’, I mean data sets that are too large to store and process easily
on a single machine. Hadoop runs in a distributed computing environment across
clusters of computers and facilitates rapid data transfer rates.
The core of Hadoop consists of four modules included in the basic framework:
1. Common – the libraries and utilities used by the other Hadoop modules. These
libraries provide filesystem and OS-level abstractions and contain the
necessary Java files and scripts required to start Hadoop.
2. Distributed File System (HDFS) – HDFS is a scalable and portable file system
written in Java, capable of storing data across thousands of commodity servers
to achieve high aggregate bandwidth across the cluster. It uses a master/slave
architecture in which the master is a single NameNode that manages the file
system metadata and the slaves are one or more DataNodes that store the actual
data.
3. YARN – YARN (Yet Another Resource Negotiator) provides resource management
for the processes running on Hadoop.
4. MapReduce – The MapReduce programming model is a parallel processing software
framework comprising two steps. The first step is the Map job, in which a
master node takes the input, partitions it into smaller subproblems, and
distributes them to worker nodes. After the Map job has taken place, the Reduce
job follows: the master node collects the answers to all of the subproblems and
combines them to produce the output.
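To make the two MapReduce steps concrete, here is a minimal single-machine sketch in plain Python. It does not use Hadoop or any of its APIs; the function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) and the sample input are my own assumptions, chosen only to illustrate how the input is split into subproblems, mapped to intermediate key/value pairs, grouped by key, and then reduced.

```python
from collections import defaultdict


def map_phase(chunk):
    """Map step: emit a (word, 1) pair for every word in one input chunk."""
    return [(word.lower(), 1) for word in chunk.split()]


def shuffle(pairs):
    """Group the intermediate values by key before they are reduced."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reduce step: combine all values for one key into a single result."""
    return key, sum(values)


def word_count(chunks):
    """Run map over every chunk, then shuffle and reduce (sequentially here)."""
    intermediate = []
    for chunk in chunks:  # on a real cluster each chunk would go to a worker node
        intermediate.extend(map_phase(chunk))
    grouped = shuffle(intermediate)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


# Hypothetical input, already partitioned into two subproblems.
chunks = ["big data is big", "hadoop handles big data"]
counts = word_count(chunks)
print(counts)  # {'big': 3, 'data': 2, 'is': 1, 'hadoop': 1, 'handles': 1}
```

Everything in this sketch runs in one process, so it shows only the programming model. The distribution of work across nodes is what Hadoop itself adds.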
The MapReduce framework consists of a single master JobTracker and one slave
TaskTracker per cluster node. The master is responsible for resource management,
tracking resource consumption and availability, scheduling each job's component
tasks on the slaves, monitoring them, and re-executing failed tasks. The slave
TaskTrackers execute the tasks as directed by the master and periodically report
task status back to it. The JobTracker is a single point of failure for the
Hadoop MapReduce service: if the JobTracker goes down, all running jobs are
halted.
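The master's scheduling and re-execution behaviour described above can be sketched as a toy simulation. This is not how the JobTracker is implemented and it uses no Hadoop API; `run_job`, `flaky_execute`, the round-robin assignment, the `max_attempts` limit, and the node names are all assumptions made for illustration.

```python
from collections import deque


def run_job(tasks, workers, execute, max_attempts=3):
    """Toy master: assign tasks round-robin and re-queue any task that fails."""
    pending = deque((task, 1) for task in tasks)  # (task, attempt number)
    results = {}
    slot = 0
    while pending:
        task, attempt = pending.popleft()
        worker = workers[slot % len(workers)]  # pick the next worker in turn
        slot += 1
        try:
            results[task] = execute(worker, task)
        except RuntimeError:
            if attempt >= max_attempts:
                raise  # give up on the job after too many failed attempts
            pending.append((task, attempt + 1))  # re-execute the task later
    return results


def flaky_execute(worker, task):
    """Hypothetical worker call: 'node-2' always fails, other nodes succeed."""
    if worker == "node-2":
        raise RuntimeError(f"{worker} failed while running {task}")
    return f"{task} done on {worker}"


results = run_job(["t0", "t1", "t2"], ["node-1", "node-2"], flaky_execute)
print(results["t1"])  # t1 done on node-1
```

Note that the master in this sketch lives in a single process: if it stops, every pending task stops with it, which is the single-point-of-failure problem in miniature.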
Uses and Applications of Hadoop
Hadoop has seen significant adoption in the past decade, which has made it one
of the most widely used frameworks for managing Big Data. Prominent deployments
include Yahoo! Inc., which in 2008 ran Hadoop on a Linux cluster with more than
10,000 cores, and Facebook, which in 2010 claimed a Hadoop cluster with 21
petabytes of storage. Hadoop can be used for any sort of work that is
batch-oriented rather than real-time, is very data-intensive, and benefits from
parallel processing of data. The
applications of Hadoop include marketing analytics, machine learning, data
mining, image processing, web crawling, processing of XML messages and general