Saturday, January 19, 2013

Apache Hadoop

Apache Hadoop is an open source Java framework for processing large amounts of distributed data. Hadoop began as a project heavily driven by Yahoo! to enable processing of very large data sets in a cost-effective way.
It is based on Google's MapReduce programming model, a parallel and distributed approach to processing large datasets. MapReduce has been used by both Google and Yahoo! for their web search. With MapReduce, input data is transformed into a set of key/value pairs, and these pairs are later aggregated to produce the final result.
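To make the key/value idea concrete, here is a minimal sketch of the classic word-count job written against Hadoop's Java MapReduce API. The class names are illustrative, not from a particular distribution: the mapper emits a (word, 1) pair for every word in a line, and the reducer sums those counts per word.

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Mapper: for each input line, emit (word, 1) pairs.
public class WordCountMapper
    extends Mapper<LongWritable, Text, Text, IntWritable> {

  private static final IntWritable ONE = new IntWritable(1);
  private final Text word = new Text();

  @Override
  protected void map(LongWritable offset, Text line, Context context)
      throws IOException, InterruptedException {
    for (String token : line.toString().split("\\s+")) {
      if (!token.isEmpty()) {
        word.set(token);
        context.write(word, ONE);   // key = word, value = 1
      }
    }
  }
}

// Reducer: sum the 1s emitted for each word to get its total count.
class WordCountReducer
    extends Reducer<Text, IntWritable, Text, IntWritable> {

  @Override
  protected void reduce(Text word, Iterable<IntWritable> counts, Context context)
      throws IOException, InterruptedException {
    int sum = 0;
    for (IntWritable c : counts) {
      sum += c.get();
    }
    context.write(word, new IntWritable(sum));  // key = word, value = total count
  }
}

A driver class would wire these two into a Job, point it at input and output paths in HDFS, and submit it to the cluster.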
Data in Hadoop is broken down into smaller blocks and distributed throughout the cluster. This allows the MapReduce functions to execute on smaller subsets of the large data set. Apart from Java, Hadoop also provides bindings for other languages such as C++ and Perl (for example, through Hadoop Pipes and Hadoop Streaming).
Hadoop is ideal for storing large amounts of data, on the order of terabytes and petabytes. The Hadoop Distributed File System (HDFS) is employed as the storage system; it is highly fault-tolerant and is designed to be deployed on low-cost hardware. It manages storage on the cluster by breaking incoming data into pieces, called "blocks," and storing each block redundantly across the pool of servers. By default, HDFS stores three complete copies of each file by copying each block to three different servers, but the number of copies (the replication factor) is configurable.
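As a rough illustration of the replication factor, the sketch below uses the HDFS Java client to write a small file and then change its replication. The path and values are placeholders, assuming the cluster configuration is on the classpath; when nothing is overridden, the default of three copies applies.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ReplicationDemo {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();      // picks up core-site.xml / hdfs-site.xml
    conf.set("dfs.replication", "2");              // override the default of 3 for new files

    FileSystem fs = FileSystem.get(conf);
    Path file = new Path("/tmp/replication-demo.txt");   // placeholder path

    // Write a small file; HDFS splits it into blocks and replicates each block.
    try (FSDataOutputStream out = fs.create(file)) {
      out.writeUTF("hello hdfs");
    }

    // Raise the replication factor of the existing file back to 3.
    fs.setReplication(file, (short) 3);
    System.out.println("replication = " + fs.getFileStatus(file).getReplication());
  }
}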
Hadoop is very flexible and scalable and can process large amounts of data in parallel. It can be deployed on large clusters of commodity hardware, which makes it very cost effective.
It is ideal to use Hadoop when there is a need to analyze enormous amounts of information to understand usage patterns and to predict demand.