Big data is the most important trend shaping the new generation of analytical tools. It has applications in many areas, such as traffic control, weather forecasting, fraud detection, security, education, and health care. Extracting knowledge from such massive data sets has become a challenging task. Big data usually refers to data sets whose size is beyond the ability of commonly used software tools to capture, store, and analyze within a tolerable elapsed time. Owing to the widespread use of computing devices such as smartphones, laptops, and wearables, the volume of data flowing over the Internet has grown beyond what conventional computers can handle; this high growth rate is what gave rise to the term Big Data. The rapid growth of such large data sets raises numerous challenges, including data inconsistency and incompleteness, scalability, timeliness, and security. The questions that arise are how to develop a high-performance platform to analyze big data efficiently and how to design appropriate mining algorithms to extract useful knowledge from it. This paper begins with a brief introduction to big data technology and its importance, focuses on the challenges and issues that need attention, and then discusses the tools used in big data technology in detail.
Key Words: Big Data, Hadoop, MapReduce, Pig, Hive
I. WHAT IS BIG DATA?
Statistics show that 500+ terabytes of new data are ingested into the databases of the social media site Facebook every day, mainly in the form of photo and video uploads, message exchanges, and comments. A single jet engine can generate 10+ terabytes of data in 30 minutes of flight time; with many thousands of flights per day, data generation reaches many petabytes. The New York Stock Exchange generates about one terabyte of new trade data per day. Table 1 shows the difference between traditional data and big data.
Table 1: Traditional vs. Big Data
1.2 Categories of Big Data

Big data can be found in three forms: structured, unstructured, and semi-structured.

Structured data: data that can be stored, accessed, and processed in a fixed format. An employee table in a database is an example of structured data.

Unstructured data: data with unknown form or structure. In addition to its sheer size, unstructured data poses multiple challenges when it comes to processing it to derive value. The output returned by a Google search is an example of unstructured data.

Semi-structured data: data that contains both of the above forms. Personal data stored in an XML file is an example of semi-structured data.
1.3 Characteristics of Big Data

Fig 2: Big Data with 3 V's

Volume: organizations collect data from a variety of sources, including business transactions, social media, and information from sensors or machine-to-machine data.

Velocity: data streams in at unprecedented speed and must be dealt with in a timely manner. RFID tags, sensors, and smart metering are driving the need to handle torrents of data in near-real time.

Variety: data comes in all types of formats, from structured numeric data in traditional databases to unstructured text documents, email, video, audio, stock ticker data, and financial transactions.
1.4 Why Is Big Data Important?

The importance of big data doesn't revolve around how much data you have, but what you do with it. You can take data from any source and analyze it to find answers that enable 1) cost reductions, 2) time reductions, 3) new product development and optimized offerings, and 4) smart decision making. When you combine big data with high-powered analytics, you can accomplish business-related tasks such as:

- determining root causes of failures, issues, and defects in near-real time;
- generating coupons at the point of sale based on the customer's buying habits;
- recalculating entire risk portfolios in minutes;
- detecting fraudulent behavior before it affects your organization.
II. BIG DATA CHALLENGES

Big data, owing to properties such as volume, velocity, variety, variability, value, and complexity, poses many challenges. Fig 3 shows various challenges in big data, and Fig 4 lists some of them along with their impacts and the risks involved.
Fig 3: Big Data, Big Challenges

Fig 4: Impacts and Risks in Big Data
III. TECHNIQUES FOR BIG DATA HANDLING

Big data tools must process massive amounts of data within a tolerable elapsed time. The following are some of the tools used to handle large data sets effectively.

3.1 Hadoop Distributed File System (HDFS)

The Hadoop Distributed File System (HDFS) is a distributed file system designed to run on commodity hardware. It has many similarities with existing distributed file systems, but the differences are significant: HDFS is highly fault-tolerant, is designed to be deployed on low-cost hardware, and provides high-throughput access to application data, making it suitable for applications that have large data sets.
Features of HDFS

- It is suitable for distributed storage and processing.
- Hadoop provides a command interface to interact with HDFS.
- The built-in servers of the name node and data node help users easily check the status of the cluster.
- It provides streaming access to file system data.
- HDFS provides file permissions and authentication.
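To make the storage model above concrete, the following Python sketch mimics how HDFS splits a file into fixed-size blocks and replicates each block across data nodes. It is illustrative only: the block size is scaled down from HDFS's 128 MB default, the node names are made up, and real HDFS uses a name node to track block locations and handle failures.

```python
# Illustrative sketch of HDFS-style block splitting and replication.
# Block size and node names are hypothetical; real HDFS uses 128 MB
# blocks and a name node that records where each replica lives.

BLOCK_SIZE = 8    # bytes, scaled down from HDFS's 128 MB default
REPLICATION = 3   # HDFS's default replication factor

def split_into_blocks(data: bytes, block_size: int = BLOCK_SIZE):
    """Split a file's contents into fixed-size blocks."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]

def place_replicas(num_blocks: int, nodes: list, replication: int = REPLICATION):
    """Assign each block to `replication` distinct data nodes (round-robin)."""
    placement = {}
    for b in range(num_blocks):
        placement[b] = [nodes[(b + r) % len(nodes)] for r in range(replication)]
    return placement

blocks = split_into_blocks(b"hello hdfs, this is a tiny file")
placement = place_replicas(len(blocks), ["node1", "node2", "node3", "node4"])
```

Because every block exists on three nodes, the loss of any single node leaves at least two replicas of each block available, which is the source of HDFS's fault tolerance.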
3.2 MapReduce

MapReduce is a processing technique and programming model for distributed computing, based on Java. A MapReduce algorithm contains two important tasks: Map and Reduce. Map takes a set of data and converts it into another set of data in which individual elements are broken down into key/value pairs. The Reduce task then takes the output of a map as its input and combines those key/value pairs into a smaller set of values. As the sequence of the name MapReduce implies, the reduce task is always performed after the map job. A MapReduce job usually splits the input data set into independent chunks that are processed by the map tasks in a completely parallel manner. The framework sorts the outputs of the maps, which are then input to the reduce tasks. Typically both the input and the output of the job are stored in a file system. The framework takes care of scheduling tasks, monitoring them, and re-executing any failed tasks.
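The map-sort-reduce flow described above can be sketched in a few lines of Python using the classic word-count example. This is a single-process simulation of the data flow only; a real Hadoop job runs the map and reduce tasks in parallel across a cluster.

```python
# Minimal single-process simulation of the MapReduce word-count flow:
# map -> sort/shuffle (group by key) -> reduce.

from itertools import groupby
from operator import itemgetter

def map_phase(lines):
    """Map: break each input line into (word, 1) key/value pairs."""
    for line in lines:
        for word in line.split():
            yield (word, 1)

def reduce_phase(pairs):
    """Reduce: combine the values for each key into a smaller set."""
    # The framework sorts map output by key before reducing.
    for word, group in groupby(sorted(pairs, key=itemgetter(0)), key=itemgetter(0)):
        yield (word, sum(count for _, count in group))

counts = dict(reduce_phase(map_phase(["big data", "big tools", "data"])))
# counts == {"big": 2, "data": 2, "tools": 1}
```

The sort between the two phases is what guarantees that all values for a given key reach the same reduce task.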
3.3 Pig

Pig is an open-source technology that offers a high-level mechanism for the parallel programming of MapReduce jobs to be executed on Hadoop clusters. Pig enables developers to create query-execution routines for analyzing large, distributed data sets without having to do low-level work in MapReduce, much like the way the Apache Hive data warehouse software provides a SQL-like interface for Hadoop that doesn't require direct MapReduce programming. The key parts of Pig are a compiler and a scripting language known as Pig Latin. Pig Latin is a data-flow language geared toward parallel processing. Managers of the Apache Software Foundation's Pig project position the language as sitting partway between declarative SQL and the procedural Java approach used in MapReduce applications.
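To make the data-flow idea concrete: a Pig Latin script such as `records = LOAD 'visits' AS (user, url); grouped = GROUP records BY user; counts = FOREACH grouped GENERATE group, COUNT(records);` is compiled into MapReduce jobs behind the scenes. A rough Python equivalent of that same dataflow (the data set and field names are hypothetical) looks like this:

```python
# Hypothetical data set standing in for Pig's LOAD of (user, url) tuples.
records = [
    ("alice", "/home"), ("bob", "/cart"),
    ("alice", "/cart"), ("alice", "/pay"),
]

# Equivalent of: grouped = GROUP records BY user
grouped = {}
for user, url in records:
    grouped.setdefault(user, []).append(url)

# Equivalent of: counts = FOREACH grouped GENERATE group, COUNT(records)
counts = {user: len(urls) for user, urls in grouped.items()}
# counts == {"alice": 3, "bob": 1}
```

The point of Pig is that the three-statement script expresses this whole pipeline declaratively, and the compiler decides how to parallelize it across the cluster.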
3.4 Hive

Hive is a data warehouse infrastructure tool for processing structured data in Hadoop. It resides on top of Hadoop to summarize big data and makes querying and analysis easy. Hive is an ETL and data warehousing tool developed on top of the Hadoop Distributed File System (HDFS). Hive makes jobs easy by supporting operations such as 1) data encapsulation, 2) ad-hoc queries, and 3) analysis of huge data sets. Hive consists mainly of three core parts: Hive Clients, Hive Services, and Hive Storage and Computing.
Hive Clients: Hive provides different drivers for communication with different types of applications. For Thrift-based applications, it provides a Thrift client for communication; for Java applications, it provides JDBC drivers.

Hive Services: client interactions with Hive are performed through Hive Services. If a client wants to perform any query-related operation in Hive, it has to communicate through Hive Services. The CLI is the command-line interface that acts as the Hive service for DDL (Data Definition Language) operations. All drivers communicate with the Hive server and with the main driver in Hive Services.

Hive Storage and Computing: Hive services such as the Meta store, File system, and Job Client in turn communicate with Hive storage and perform the following actions: 1) metadata information of tables created in Hive is stored in the Hive "meta-storage database"; 2) query results and data loaded into the tables are stored in the Hadoop cluster on HDFS.
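The kind of ad-hoc query a Hive client submits can be illustrated with plain SQL. The sketch below uses Python's built-in sqlite3 purely as a stand-in for Hive: in a real deployment the table data would live on HDFS and the query would be submitted through the CLI or a JDBC/Thrift client, and the table and column names here are made up for the example.

```python
import sqlite3

# sqlite3 stands in for a Hive table; the query shape is the point,
# not the engine. Table/column names are hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employee (name TEXT, dept TEXT, salary REAL)")
conn.executemany(
    "INSERT INTO employee VALUES (?, ?, ?)",
    [("A", "sales", 100.0), ("B", "sales", 200.0), ("C", "eng", 300.0)],
)

# Ad-hoc aggregation of the kind HiveQL supports (GROUP BY over a table).
rows = conn.execute(
    "SELECT dept, AVG(salary) FROM employee GROUP BY dept ORDER BY dept"
).fetchall()
# rows == [("eng", 300.0), ("sales", 150.0)]
```

In Hive, a query like this is translated into MapReduce jobs over the files on HDFS rather than executed against a local database file.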
3.5 HBase

HBase is an open-source, distributed database developed by the Apache Software Foundation. Its data model is similar to Google's Bigtable and is designed to provide quick random access to huge amounts of structured data. HBase runs on top of the Hadoop file system and can be accessed through the HBase shell or through a Java API for basic operations.
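HBase's data model can be pictured as a map from a row key to column families, each holding column/value cells. The toy sketch below shows that access pattern in Python; it omits everything that makes HBase a real database (versioned cells, region servers, persistence on HDFS), and the column-family names are made up for the example.

```python
# Toy sketch of HBase's data model: row key -> column family -> column -> value.
# Column families ("personal", "professional") are hypothetical examples.

table = {}

def put(row_key, family, column, value):
    """Write one cell, creating the row and family on demand."""
    table.setdefault(row_key, {}).setdefault(family, {})[column] = value

def get(row_key, family, column):
    """Random access by row key: the access pattern HBase is built for."""
    return table.get(row_key, {}).get(family, {}).get(column)

put("emp001", "personal", "name", "Ravi")
put("emp001", "professional", "role", "engineer")
# get("emp001", "personal", "name") == "Ravi"
```

Because lookups go straight from row key to cell, reads stay fast regardless of how many rows the table holds, which is what "quick random access" means in practice.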
IV. CONCLUSION

Huge volumes of data are produced every day, and with data of such size it becomes very challenging to achieve effective processing using traditional techniques. Big data is data that exceeds the processing capacity of conventional database systems. In this paper, fundamental concepts about big data were presented, including big data characteristics, challenges, and techniques for handling big data.