Apache Hadoop – A brief overview

I want to add a little description of Apache Hadoop for those of us who never heard about the project.


“The Apache Hadoop software library is a framework that allows for the distributed processing of large data sets across clusters of computers using simple programming models. It is designed to scale up from single servers to thousands of machines, each offering local computation and storage. Rather than rely on hardware to deliver high-availability, the library itself is designed to detect and handle failures at the application layer, so delivering a highly-available service on top of a cluster of computers, each of which may be prone to failures.” (Apache Hadoop, 2014).

Simply said, the main advantage of Apache Hadoop lies in a fact that the technology allows its users to use distributed processing on large sets of data, and do this on computer clusters built from commodity hardware.  Plus, there are many services integrated into Hadoop such as services that do data storage, data processing, data access, data governance, but also security and operations.


The project started with a Google paper written in October 2003, which created quite a lot of stir and was soon followed by another Google research paper from Google (called: MapReduce: Simplified Data Processing on Large Clusters). An actual development started in the Apache Nutch project (http://nutch.apache.org), but it was moved to the new Hadoop subproject (http://hadoop.apache.org) in January 2006. Note: Apache Nutch(http://nutch.apache.org/) became a different product, a production-ready Web crawler, which runs on top of Apache Hadoop data structures.
But let’s go to a little bit of a detail and explore a couple of benefits and disadvantages of Hadoop.

The Benefits
Scalable – a highly scalable storage platform that can store and distribute enormous data sets across hundreds of inexpensive servers
Cost effective – Hadoop offers a cost effective storage solution for businesses’ exploding data sets.
Flexible –  companies can use Hadoop to derive valuable business insights from data sources such as social media, email conversations, log processing, data warehousing, market campaign analysis, etc.
Fast – Hadoop’s unique storage method is based on a distributed file system that basically ‘maps’ data wherever it is located on a cluster.
Resilient to failure – A key advantage of using Hadoop is its fault tolerance. When data is sent to an individual node, that data is also replicated to other nodes in the cluster, which means that in the event of failure, there is another copy available for use.

Security Concerns – Hadoop is missing encryption at the storage and network levels, which is a major selling point for government agencies and others that prefer to keep their data under wraps.
Vulnerable By Nature – The framework is written almost entirely in Java, one of the most exploited programming languages by cyber criminals.
Not Fit for Small Data – the Hadoop Distributed File System, lacks the ability to support the random reading of small files efficiently
Potential Stability Issues – Hadoop has had its fair share of stability issues, you need to run the latest stable version to handle most of the existing problems.
General Limitations – there are Apache Flume, MillWheel, and Google’s own Cloud Dataflow which seem to improve the efficiency and reliability of data collection, aggregation, and integration of Hadoop. The main point is that companies could be missing out on big benefits by using Hadoop alone.

Chemitiganti, V. (2016) What is Apache Hadoop?. Available at: http://hortonworks.com/apache/hadoop/ (Accessed: 12 December 2016).
Apache Hadoop (2014) Available at: http://hadoop.apache.org/ (Accessed: 12 December 2016).
admin (2015) Hadoop advantages and disadvantages. Available at: http://blogs.mindsmapped.com/bigdatahadoop/hadoop-advantages-and-disadvantages/ (Accessed: 12 December 2016).

Facebook Comments