Top Menu

Monday, November 16, 2015



Of course, everyone knows Hadoop as the solution to Big Data. What’s the problem with Big Data? Well, mostly it’s just that Big Data is too big to access and process in a timely fashion on a conventional enterprise system. Even a really large, optimally tuned, enterprise-class database system has conventional limits in terms of its maximum I/O, and there is a scale of data that outstrips this model and requires parallelism at a system level to make it accessible. While Hadoop is associated in many ways with advanced transaction processing pipelines, analytic and data sciences, these applications are sitting on top of a much simpler paradigm… that being that we can spread our data across a cluster and provision I/O and processor in a tunable ratio along with it. The tune-ability is directly related to the hardware specifications of the cluster nodes, since each node has processing, I/O and storage capabilities in a specific ratio. At this level, we don’t need Java software architects and data scientists to take advantage of Hadoop. We’re solving a fundamental infrastructure engineering issue, which is “how can we scale our I/O and processing capability along with our storage capacity”? In other words, how can we access our data?

What Is Hadoop?

At the most basic level, Hadoop is an open-source software platform designed to store and process quantities of data that are too large for just one particular device or server. Hadoop’s strength lies in its ability to scale across thousands of commodity servers that don’t share memory or disk space.
Hadoop delegates tasks across these servers (called “worker nodes” or “slave nodes”), essentially harnessing the power of each device and running them together simultaneously. This is what allows massive amounts of data to be analyzed: splitting the tasks across different locations in this manner allows bigger jobs to be completed faster.
Hadoop can be thought of as an ecosystem—it’s comprised of many different components that all work together to create a single platform. There are two key functional components within this ecosystem: The storage of data (Hadoop Distributed File System, or HDFS) and the framework for running parallel computations on this data (MapReduce). Let’s take a closer look at each.

Core Components

Ø  HDFS (Hadoop Distributed File System) :– Store data on the cluster
Ø  MapReduce : - Process data on the cluster

A Simple Hadoop Cluster
A Hadoop cluster:

ü  a group of machines working together to store and process the data

ü  Any number of ‘slave’ & worker nodes can be exist

ü  HDFS to store data

ü  MapReduce to process data

ü  Two ‘master’ nodes

Name Node: To manages HDFS

Job Tracker: To manages MapReduce

HDFS Basic Concepts

Ø  HDFS is a Filesystem written in java -  Based on Google's GFS
Ø  Sits on top of a native Filesystem -Such as ext3, ext4 or xfs
Ø  Provides redundant storage for massive amount of data -   Using readily/available, industry/standard computers

How HDFS works

-          Files are divided into blocks
-          Blocks are replicated across nodes

Options for Accessing HDFS

ü  FsShell Command line: Hadoop fs
ü  Sub/commands: -get, -put, -ls, -cat, etc
ü  Java API
ü  Ecosystem Projects
– Flume: Collects data from network sources (e.g. system logs)
– Sqoop:  Transfers data between HDFS and RDBMS 
– Hue: Web based interactive UI. Can browse, upload, download and view files

Storing and Retrieving Files