Sunday, November 29, 2015

IT (7): What's Hadoop?

Cloud computing offers some great opportunities for science, but most cloud computing platforms are both I/O and memory limited, and hence are poor matches for data-intensive computing. 
The other day, I was applying for a software job and one of the criteria was knowledge of Hadoop. I had vaguely heard of the term in connection with big data management, but that was about it. To strengthen my position in the job market, I needed to know how it works, so I started gathering information. Here, I am presenting some of my understanding.

The colossal amount of data generated from diverse fields (search engines, grids, transport, industry, agriculture, health, genomics, stock exchanges, security, defense), collectively called 'Big Data', poses many problems: storage, handling, transfer, manipulation, etc. Further, the data types are diverse, classified as structured (relational data), semi-structured (XML), and unstructured (text, PDF, Word).

NoSQL databases such as MongoDB offered big data solutions at the operational level, while Massively Parallel Processing (MPP) and MapReduce worked at the analytical level. Database vendors like Oracle and IBM help address these issues, but their solutions have limitations.

To solve the above issues, many infrastructures have been developed, Hadoop being one of them. Inspired by Google's papers on MapReduce and the Google File System, this Apache open-source project has solved many challenges inherent to big data. Written in Java and built around the MapReduce programming model, Hadoop runs on anything from a single machine to a cluster of machines.

(Anyway, if you are curious how the name Hadoop came about: it was named after a stuffed yellow elephant belonging to the creator's son.)

Hadoop's storage system, the Hadoop Distributed File System (HDFS), is based on the Google File System (GFS). Other important components are Hadoop Common (shared Java libraries) and Hadoop YARN (Yet Another Resource Negotiator), which handles job scheduling and cluster resource management. Other Apache frameworks such as Pig, Hive, HBase, and Spark are often installed alongside Hadoop. Hadoop's working mechanism is based on 'division of labor', i.e. data is divided into comparably sized file blocks and assigned to cluster nodes.
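The block-splitting idea can be sketched in a few lines of Python (a toy illustration, not HDFS code — in Hadoop 2.x the real default block size is 128 MB, set by `dfs.blocksize`):

```python
def split_into_blocks(data: bytes, block_size: int):
    """Split a byte stream into fixed-size blocks, as HDFS does with large files."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]

# Toy example: a 10-byte "file" with a 4-byte block size.
# The last block is smaller, just as HDFS's final block can be.
blocks = split_into_blocks(b"0123456789", 4)
print([len(b) for b in blocks])  # → [4, 4, 2]
```

Each such block is then replicated (three copies by default) across different DataNodes, which is what makes the division of labor fault-tolerant.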

#Hadoop daemons: HDFS, YARN, MapReduce
#Hadoop operation modes: Local/Standalone Mode, Pseudo-Distributed Mode, Fully Distributed Mode
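For example, switching from standalone to pseudo-distributed mode (all daemons on one machine) involves pointing Hadoop at a local HDFS instance in core-site.xml; a minimal fragment might look like this (localhost:9000 is the conventional choice in the Hadoop 2.x setup docs):

```xml
<configuration>
  <!-- Tell Hadoop to use HDFS on this machine instead of the local filesystem -->
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://localhost:9000</value>
  </property>
</configuration>
```

In standalone mode this property is left at its default (the local filesystem), which is why no daemons need to run there.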
HDFS follows a master-slave architecture (one NameNode managing many DataNodes); its responsibilities include fault detection and recovery.
MapReduce is a programming model for distributed computing, implemented in Java in Hadoop. The reduce task is always performed after the map task.
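The two phases can be illustrated with a word count, the canonical MapReduce example. This plain-Python sketch mimics the model (map, shuffle, reduce), not Hadoop's actual Java API:

```python
from collections import defaultdict

def map_phase(documents):
    """Map: emit a (word, 1) pair for every word in every document."""
    for doc in documents:
        for word in doc.split():
            yield (word, 1)

def shuffle(pairs):
    """Shuffle: group values by key, as the framework does between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(grouped):
    """Reduce: sum the counts for each word."""
    return {word: sum(counts) for word, counts in grouped.items()}

docs = ["big data big cluster", "data node"]
counts = reduce_phase(shuffle(map_phase(docs)))
print(counts)  # → {'big': 2, 'data': 2, 'cluster': 1, 'node': 1}
```

In real Hadoop, the map tasks run in parallel on the nodes holding the data blocks, and the shuffle moves intermediate pairs across the network before the reducers start, which is why reduce always follows map.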
#Download Hadoop 2.4.1 from the Apache Software Foundation, then extract and install:
cd /usr/local 
tar xzf hadoop-2.4.1.tar.gz 
mkdir -p hadoop && mv hadoop-2.4.1/* hadoop/

