
Abstract

Big data is the most important trend defining the new generation of analytical tools. It has applications in many areas, such as traffic control, weather forecasting, fraud detection, security, education and health care. Extracting knowledge from massive data sets has become a challenging task. Big data usually refers to data sets whose size is beyond the ability of commonly used software tools to capture, store and analyze within a tolerable elapsed time. With the widespread use of computing devices such as smartphones, laptops and wearable devices, the volume of data processed over the internet now exceeds what conventional computers can handle. This high growth rate is what gave rise to the term Big Data. The fast growth of such large data also raises numerous challenges, such as data inconsistency and incompleteness, scalability, timeliness, and security. The questions that arise are how to build a high-performance platform to efficiently analyze big data, and how to design appropriate mining algorithms to extract useful knowledge from it. This paper begins with a brief introduction to big data technology and its importance, focuses on the various challenges and issues that need to be addressed, and discusses the tools used in big data technology in detail.


Key Words: Big Data, Hadoop, MapReduce, Pig, Hive, HBase

 

I. INTRODUCTION

1.1 What is Big Data?

Statistics show that 500+ terabytes of new data are ingested into the databases of the social media site Facebook every day. This data is mainly generated through photo and video uploads, message exchanges, comments, etc. A single jet engine can generate 10+ terabytes of data in 30 minutes of flight time; with many thousands of flights per day, data generation reaches many petabytes. The New York Stock Exchange generates about one terabyte of new trade data per day. The following table shows the difference between traditional data and big data.

Traditional Data: Documents, Finances, Stock Records, Personnel files, etc.

Big Data: Photos, Audio and Video, 3D Models, Simulations, Location data

Fig1: Traditional vs Big Data

1.2 Categories of Big Data

Big data is found in three forms:

Structured Data
Unstructured Data
Semi-structured Data

Structured

Any data that can be stored, accessed and processed in a fixed format is termed structured data.

Example: An Employee table in a database is an example of structured data.

Unstructured

Any data whose form or structure is unknown is classified as unstructured data. In addition to its huge size, unstructured data poses multiple challenges when it comes to processing it to derive value.

Example: The output returned by ‘Google Search’

Semi-structured

Semi-structured data contains elements of both of the above forms.

Example: Personal data stored in an XML file
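To illustrate, semi-structured data such as XML carries repeating, tagged structure while still allowing fields to vary from record to record. The short Python sketch below parses a small, made-up personal-data document (the tag and field names are hypothetical, chosen only for the example):

```python
import xml.etree.ElementTree as ET

# Hypothetical personal data: both records share a schema-like shape,
# but the second record carries an extra, optional field ("city").
xml_doc = """
<people>
  <person><name>Asha</name><age>31</age></person>
  <person><name>Ravi</name><age>28</age><city>Pune</city></person>
</people>
"""

root = ET.fromstring(xml_doc)
records = []
for person in root.findall("person"):
    # Collect whatever child tags each record happens to have.
    records.append({child.tag: child.text for child in person})

print(records)
```

Each record parses into a dictionary, but the dictionaries need not have identical keys; that mix of structure and flexibility is what makes the data semi-structured.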

1.3 Characteristics of
Big Data

Fig2: Big Data with 3V’s

Volume: Organizations collect data from a variety of sources, including business transactions, social media and information from sensor or machine-to-machine data.

Velocity:
Data streams in at an unprecedented speed and must be dealt with in a timely
manner. RFID tags, sensors and smart metering are driving the need to deal with
torrents of data in near-real time.

Variety:
Data comes in all types of formats from structured, numeric data in traditional
databases to unstructured text documents, email, video, audio, stock ticker
data and financial transactions.

1.4
Why Is Big Data Important?

The
importance of big data doesn’t revolve around how much data you have, but what
you do with it. You can take data from any source and analyze it to find answers
that enable 1) cost reductions, 2) time reductions, 3) new product development
and optimized offerings, and 4) smart decision making. When you combine big
data with high-powered analytics, you can accomplish business-related tasks
such as:

Determining root causes of failures, issues and
defects in near-real time.
Generating coupons at the point of sale based
on the customer’s buying habits.
Recalculating entire risk portfolios in
minutes.
Detecting fraudulent behavior before it affects
your organization.

II. BIG DATA CHALLENGES

Big data, due to its various properties such as volume, velocity, variety, variability, value and complexity, puts forward many challenges. Fig3 shows various challenges in big data. Fig4 lists some of the challenges along with their impacts and the risks involved.

 

Fig3: Big Data Big Challenges

Fig 4: Impacts and Risk in Big Data

III. TECHNIQUES FOR BIG DATA HANDLING

Big data tools must handle massive amounts of data within a tolerable elapsed time. The following are some of the tools used to handle large data sets effectively.

I. HDFS

The
Hadoop Distributed File System (HDFS) is a distributed file
system designed to run on commodity hardware. It has many similarities with
existing distributed file systems. However, the differences from other
distributed file systems are significant. HDFS is highly fault-tolerant and is
designed to be deployed on low-cost hardware. HDFS provides high throughput
access to application data and is suitable for applications that have large
data sets.

Features of HDFS

It is suitable for distributed storage and processing.
Hadoop provides a command interface to interact with HDFS.
The built-in servers of the name node and data nodes help users easily check the status of the cluster.
It offers streaming access to file system data.
HDFS provides file permissions and authentication.
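The block-based storage underlying these features can be sketched in a few lines of Python. This is only an illustration of the idea, not the HDFS implementation: the 32-byte block size, the node names and the round-robin placement policy are all made up for the example (real HDFS defaults to 128 MB blocks, a replication factor of 3, and rack-aware placement).

```python
BLOCK_SIZE = 32        # hypothetical; HDFS defaults to 128 MB
REPLICATION = 3        # HDFS default replication factor
DATA_NODES = ["node1", "node2", "node3", "node4"]

def split_into_blocks(data: bytes, block_size: int = BLOCK_SIZE):
    """Split a file's bytes into fixed-size blocks, as HDFS does."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]

def place_replicas(num_blocks: int):
    """Assign each block to REPLICATION distinct data nodes (round-robin).

    Real HDFS placement is rack-aware; this is a toy policy that only
    shows why losing one node does not lose any block.
    """
    placement = {}
    for b in range(num_blocks):
        placement[b] = [DATA_NODES[(b + r) % len(DATA_NODES)]
                        for r in range(REPLICATION)]
    return placement

file_bytes = b"x" * 100                   # a 100-byte "file"
blocks = split_into_blocks(file_bytes)    # 4 blocks: 32 + 32 + 32 + 4 bytes
placement = place_replicas(len(blocks))
print(len(blocks), placement[0])
```

Because every block lives on several nodes, the name node can re-replicate from a surviving copy when a node fails, which is the source of HDFS's fault tolerance on low-cost hardware.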

II. MAP REDUCE
FRAMEWORK

Map
Reduce is a processing technique and a program model for distributed computing
based on java. The Map Reduce algorithm contains two important tasks, namely
Map and Reduce. Map takes a set of data and converts it into another set of
data, where individual elements are broken down into key/value pairs. Secondly,
reduce task, which takes the output from a map as an input and combines those
data value pairs into a smaller set of values. As the sequence of the name Map Reduce
implies, the reduce task is always performed after the map job. A Map Reduce
job usually splits the input data-set into independent chunks which are
processed by the map tasks in a completely parallel manner. The framework sorts
the outputs of the maps, which are then input to the reduce tasks. Typically
both the input and the output of the job are stored in a file-system. The
framework takes care of scheduling tasks, monitoring them and re-executes the
failed tasks.
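The map / sort / reduce sequence described above can be sketched in pure Python. This is an in-memory illustration of the programming model, not Hadoop's distributed implementation; the word-count task and the two input lines are the usual textbook example:

```python
from itertools import groupby
from operator import itemgetter

def map_phase(lines):
    """Map: break each input record into (key, value) pairs."""
    for line in lines:
        for word in line.split():
            yield (word, 1)

def reduce_phase(pairs):
    """Reduce: combine the values for each key into a smaller set."""
    # The framework sorts map output by key before reducing; emulate that here.
    for key, group in groupby(sorted(pairs), key=itemgetter(0)):
        yield (key, sum(v for _, v in group))

lines = ["big data big tools", "data tools"]
counts = dict(reduce_phase(map_phase(lines)))
print(counts)   # {'big': 2, 'data': 2, 'tools': 2}
```

In Hadoop the map calls run in parallel over independent input chunks on different nodes, and the sort step becomes a distributed shuffle between mappers and reducers; the per-record logic, however, is exactly the two functions above.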

III. PIG

Apache Pig is an open-source technology that offers a high-level mechanism for the parallel programming of MapReduce jobs to be executed on Hadoop clusters. Pig enables developers to create query execution routines for analyzing large, distributed data sets without having to do low-level work in MapReduce, much like the way the Apache Hive data warehouse software provides a SQL-like interface for Hadoop that does not require direct MapReduce programming.

The key parts of Pig are a compiler and a scripting language known as Pig Latin, a data-flow language geared toward parallel processing. Managers of the Apache Software Foundation's Pig project position the language as sitting partway between declarative SQL and the procedural Java approach used in MapReduce applications.

IV. HIVE

Hive is a data warehouse infrastructure tool for processing structured data in Hadoop. It resides on top of Hadoop to summarize big data, and makes querying and analysis easy. Hive is an ETL and data warehousing tool developed on top of the Hadoop Distributed File System (HDFS). Hive simplifies operations such as 1) data encapsulation, 2) ad-hoc queries and 3) analysis of huge datasets.

Hive consists of three core parts:

Hive Clients
Hive Services
Hive Storage and Computing

Hive Clients:

Hive provides different drivers for communication with different types of applications. For Thrift-based applications, it provides a Thrift client for communication. For Java applications, it provides JDBC drivers.

Hive Services:

Client interactions with Hive are performed through Hive Services. If a client wants to perform any query-related operations in Hive, it has to communicate through Hive Services. The CLI (command line interface) acts as a Hive service for DDL (Data Definition Language) operations. All drivers communicate with the Hive server and then with the main driver in Hive Services.

Hive Storage and Computing:

Hive services such as the metastore, file system and job client in turn communicate with Hive storage and perform the following actions: 1) metadata for the tables created in Hive is stored in the Hive metastore database; 2) query results and data loaded into the tables are stored in the Hadoop cluster on HDFS.

V. HBASE

HBase is an open-source, distributed database developed by the Apache Software Foundation. HBase is a data model, similar to Google's Bigtable, designed to provide quick random access to huge amounts of structured data. HBase is set up on top of the Hadoop file system and can be used through the HBase shell, or from Java programs that connect to HBase and perform basic operations on it.

IV. CONCLUSION

Huge volumes of data are produced every day, and with data of such size it becomes very challenging to achieve effective processing using existing traditional techniques. Big data is data that exceeds the processing capacity of conventional database systems. In this paper fundamental concepts of big data were presented, including its characteristics, challenges, and techniques for handling it.
