Apache Hadoop Ecosystem Cheat Sheet

Hadoop is a free, open-source, Java-based framework for running applications that process large data sets in a distributed computing environment, on large clusters built from commodity hardware. Apache Hadoop has been in development for nearly 15 years. The term “Hadoop” is also used for the wider Hadoop ecosystem: the collection of additional software packages that can be installed on top of or alongside Hadoop.

  • Open source
  • Part of the Apache Software Foundation
  • Written in Java
  • Supported by major web companies

Hadoop Ecosystem

There are many add-on libraries on top of Apache Hadoop. You will feel like a zookeeper, surrounded and overwhelmed by exotic animals (Pig, Hive, Phoenix, Impala) and funny names such as Oozie, Tez, and Sqoop. What do Pig, Kangaroo, Eagle, and Phoenix have in common? Hadoop! The Hadoop ecosystem has some interesting technologies with curious names. Azkaban is bloody wicked. H2O and Sparkling Water compete in the same space. Rethink, Couch, Dynamo, and Gemfire would make you think you had just come out of a positive affirmations seminar. Bad jokes aside, the Hadoop ecosystem keeps growing, so I have made this cheat sheet to help you understand the technologies in it.

Hadoop Ecosystem interactions

The Apache Hadoop project consists of the following key parts:

  1. Hadoop Common: The common utilities that support the other Hadoop modules.
  2. Hadoop Distributed File System (HDFS™): A distributed file system that provides high-throughput access to application data.
  3. Hadoop YARN: A framework for job scheduling and cluster resource management.
  4. Hadoop MapReduce: A YARN-based system for parallel processing of large data sets.
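
To make the MapReduce model in the list above more concrete, here is a minimal single-process sketch of its three stages (map, shuffle, reduce) applied to a word count. It does not use the Hadoop API; the function names are illustrative assumptions, and a real job would run the map and reduce steps in parallel across the cluster.

```python
from collections import defaultdict


def map_phase(line):
    """Emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Group emitted values by key, as the framework does between the phases."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Collapse all values for one key into a single count."""
    return key, sum(values)


def word_count(lines):
    """Run map, shuffle, and reduce in sequence over an iterable of lines."""
    pairs = [pair for line in lines for pair in map_phase(line)]
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())


if __name__ == "__main__":
    print(word_count(["pig hive pig", "hive impala"]))
```

Because each map call only sees one line and each reduce call only sees one key, both phases can in principle be spread over many machines, which is the point of the model.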

The official documentation states: “The Hadoop Distributed File System (HDFS) is a distributed file system designed to run on commodity hardware. It has many similarities with existing distributed file systems. However, the differences from other distributed file systems are significant. HDFS is highly fault-tolerant and is designed to be deployed on low-cost hardware. HDFS provides high throughput access to application data and is suitable for applications that have large data sets.”

Key concepts:

  • Deployed across multiple servers, called a cluster
  • HDFS provides a write-once-read-many access model for files
  • Hardware failure is the norm rather than the exception
  • Files are spread across multiple nodes, for fault tolerance, HA (high availability), and to enable parallel processing
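
The key concepts above can be sketched with a toy model: a file is cut into fixed-size blocks and each block is copied to several nodes, so losing one node does not lose data. The 128 MB block size and replication factor of 3 are assumed here as common defaults, and the round-robin placement is a deliberate simplification (real HDFS placement is rack-aware). All names are hypothetical.

```python
import math

BLOCK_SIZE_MB = 128  # assumed block size (a common HDFS default)
REPLICATION = 3      # assumed replication factor (a common HDFS default)


def place_blocks(file_size_mb, nodes,
                 block_size_mb=BLOCK_SIZE_MB, replication=REPLICATION):
    """Return {block_index: [node, ...]} using naive round-robin placement."""
    if replication > len(nodes):
        raise ValueError("need at least as many nodes as replicas")
    num_blocks = math.ceil(file_size_mb / block_size_mb)
    return {
        block: [nodes[(block + r) % len(nodes)] for r in range(replication)]
        for block in range(num_blocks)
    }


def readable_blocks(placement, failed_node):
    """List the blocks that still have a live replica after one node fails."""
    return [
        block for block, replicas in placement.items()
        if any(node != failed_node for node in replicas)
    ]


if __name__ == "__main__":
    layout = place_blocks(300, ["n1", "n2", "n3", "n4"])
    print(layout)
    print(readable_blocks(layout, "n3"))
```

With three replicas per block, every block of the 300 MB file remains readable when any single node fails, and different blocks can be read or processed in parallel on different nodes.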

NoSQL Database

Apache HBase: A random, real-time read/write NoSQL database (wide column store) to access data in Hadoop. Apache HBase provides Bigtable-like capabilities on top of Hadoop and HDFS.
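
As a rough illustration of what "wide column store" means, the sketch below models a table as sorted row keys, each holding "family:qualifier" columns whose cells are versioned by timestamp. This is an in-memory toy, not the HBase client API; the class and method names are assumptions made for the example.

```python
from collections import defaultdict


class TinyWideColumnTable:
    """Toy model: row key -> "family:qualifier" -> {timestamp: value}."""

    def __init__(self):
        self._rows = defaultdict(lambda: defaultdict(dict))

    def put(self, row_key, column, value, timestamp):
        """Write one version of a cell."""
        self._rows[row_key][column][timestamp] = value

    def get(self, row_key, column):
        """Return the most recent version of a cell, or None if absent."""
        versions = self._rows.get(row_key, {}).get(column)
        if not versions:
            return None
        return versions[max(versions)]

    def scan(self, start_row, stop_row):
        """Yield row keys in sorted order within [start_row, stop_row)."""
        for key in sorted(self._rows):
            if start_row <= key < stop_row:
                yield key


if __name__ == "__main__":
    table = TinyWideColumnTable()
    table.put("user#1", "info:name", "Ada", timestamp=1)
    table.put("user#1", "info:name", "Ada L.", timestamp=2)
    print(table.get("user#1", "info:name"))
```

Rows need not share the same columns, and a range scan over sorted row keys is the main access pattern besides single-row reads and writes.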

Scripting

Apache Hive: Facilitates reading, writing, and managing large datasets residing in distributed storage using SQL. Queries written in its SQL-like language (HiveQL) are compiled into MapReduce jobs.
Apache Pig: A high-level scripting language (Pig Latin) for expressing data analysis programs as step-by-step data flows. It is often used for ETL, and its scripts are compiled into MapReduce jobs just as Hive queries are.
Apache Impala: An open source massively parallel processing (MPP) SQL query engine for data stored in a computer cluster running Apache Hadoop. Unlike Hive, Impala does not translate queries into MapReduce jobs but executes them natively, which typically makes it faster than Hive for interactive queries.
Apache Drill: An open-source software framework that supports data-intensive distributed applications for interactive analysis of large-scale datasets. It is a schema-free SQL query engine for Hadoop, NoSQL, and cloud storage.
Apache Phoenix: An open source, massively parallel, relational database engine supporting OLTP for Hadoop using Apache HBase as its backing store.
Presto: An open source distributed SQL query engine for running interactive analytic queries against data sources of all sizes ranging from gigabytes to petabytes.

Data Serialization

Apache Avro: A remote procedure call and data serialization framework developed within Apache’s Hadoop project. It uses JSON for defining data types and serializes data in a compact binary format.
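
To give a feel for "JSON schema in, compact binary out", the sketch below hand-encodes one record. It relies on two rules from the Avro specification: `long` values use variable-length zig-zag coding, and a `string` is its byte length (as a long) followed by UTF-8 bytes. The `User` schema and the helper names are hypothetical, only these two primitive types are handled, and in practice you would use an Avro library rather than this code.

```python
import json

# Hypothetical schema, written in JSON as Avro schemas are.
USER_SCHEMA = json.loads("""
{"type": "record", "name": "User",
 "fields": [{"name": "name", "type": "string"},
            {"name": "age", "type": "long"}]}
""")


def zigzag_encode_long(n):
    """Encode a signed 64-bit integer as zig-zag variable-length bytes."""
    if not -(1 << 63) <= n < (1 << 63):
        raise ValueError("value out of 64-bit range")
    z = ((n << 1) ^ (n >> 63)) & 0xFFFFFFFFFFFFFFFF
    out = bytearray()
    while z >= 0x80:
        out.append((z & 0x7F) | 0x80)  # low 7 bits, continuation bit set
        z >>= 7
    out.append(z)
    return bytes(out)


def zigzag_decode_long(data):
    """Decode bytes produced by zigzag_encode_long back to an integer."""
    z, shift = 0, 0
    for byte in data:
        z |= (byte & 0x7F) << shift
        if not byte & 0x80:
            break
        shift += 7
    return (z >> 1) ^ -(z & 1)


def encode_string(s):
    """Length-prefixed UTF-8 bytes."""
    raw = s.encode("utf-8")
    return zigzag_encode_long(len(raw)) + raw


ENCODERS = {"long": zigzag_encode_long, "string": encode_string}


def encode_record(schema, record):
    """Concatenate the encoded fields in schema order (no field names stored)."""
    return b"".join(
        ENCODERS[field["type"]](record[field["name"]])
        for field in schema["fields"]
    )


if __name__ == "__main__":
    print(encode_record(USER_SCHEMA, {"name": "Ada", "age": 36}).hex())
```

The encoded record carries no field names or type tags, which is why the reader needs the schema and why the format stays compact.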

Workflows

Apache Oozie: A workflow scheduler system that manages Apache Hadoop jobs. Oozie Workflow jobs are Directed Acyclic Graphs (DAGs) of actions. Oozie Coordinator jobs are recurrent Oozie Workflow jobs triggered by time (frequency) and data availability.
Apache Tez: An application framework that allows a complex directed acyclic graph (DAG) of tasks for processing data. It is currently built atop Apache Hadoop YARN. In some cases, it is used as an alternative to Hadoop MapReduce.
Apache Kafka: A unified, high-throughput, low-latency, scalable publish/subscribe messaging platform for handling real-time data feeds.
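
Oozie and Tez both describe work as a directed acyclic graph. The sketch below shows, under simplifying assumptions, what that buys a scheduler: given which actions depend on which, it can derive a valid run order and reject a graph that contains a cycle. The workflow and action names are hypothetical, and the code uses Python's standard `graphlib` module (Python 3.9+), not any Oozie or Tez API.

```python
from graphlib import CycleError, TopologicalSorter

# Hypothetical workflow: each action maps to the actions it depends on.
WORKFLOW = {
    "import_logs": set(),
    "clean_logs": {"import_logs"},
    "aggregate": {"clean_logs"},
    "load_dimensions": {"import_logs"},
    "publish_report": {"aggregate", "load_dimensions"},
}


def execution_order(workflow):
    """Return one valid run order, or raise ValueError if there is a cycle."""
    try:
        return list(TopologicalSorter(workflow).static_order())
    except CycleError as exc:
        raise ValueError(f"workflow is not acyclic: {exc.args[1]}") from exc


if __name__ == "__main__":
    print(execution_order(WORKFLOW))
```

Actions with no dependency on each other, such as `clean_logs` and `load_dimensions` here, may appear in either order and could be run in parallel.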

Connectors

Apache Flume: A distributed, reliable, and available service for efficiently collecting, aggregating, and moving large amounts of log data.
Apache Sqoop:  A tool designed for efficiently transferring bulk data (importing/exporting) between Apache Hadoop and structured datastores such as relational databases.

Data Processing

Apache Flink: A framework and distributed processing engine for stateful computations over unbounded and bounded data streams. Flink has been designed to run in all common cluster environments and to perform computations at in-memory speed and at any scale. It supports event-time semantics for out-of-order events, exactly-once semantics, backpressure control, and APIs optimized for writing both streaming and batch applications.
Apache Spark: A powerful open-source unified analytics engine built around speed, ease of use, and streaming analytics. Spark runs on Hadoop, Mesos, Kubernetes, standalone, or in the cloud. It has the following components:

Spark Core:  Dispatching, scheduling, and basic I/O functionalities
Spark SQL: A DSL (domain-specific language) to manipulate DataFrames. Thanks to its in-memory computing, it can outperform Apache Impala on some workloads.
Spark Streaming: Micro-batching to perform fast streaming
MLlib: Scalable and easy-to-use machine learning library
GraphX: Distributed graph processing framework
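
The micro-batching idea mentioned for Spark Streaming can be sketched without Spark: instead of handling each event on arrival, the stream is cut into small fixed-length intervals and each interval is processed as an ordinary batch. In this toy version the timestamps stand in for arrival times, and the function names and the sum-per-batch computation are assumptions chosen for illustration.

```python
def micro_batches(events, batch_interval):
    """Group (timestamp, value) events into consecutive fixed-length intervals."""
    batches = {}
    for timestamp, value in events:
        batch_id = int(timestamp // batch_interval)
        batches.setdefault(batch_id, []).append(value)
    return [batches[batch_id] for batch_id in sorted(batches)]


def process_stream(events, batch_interval):
    """Apply a batch computation (here: a sum) to every micro-batch."""
    return [sum(batch) for batch in micro_batches(events, batch_interval)]


if __name__ == "__main__":
    stream = [(0.1, 1), (0.9, 2), (1.2, 3), (2.5, 4), (2.7, 5)]
    print(process_stream(stream, batch_interval=1.0))
```

The trade-off is latency: a result is only available once its interval has closed, so a shorter interval gives fresher results at the cost of more, smaller batches.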

Apache Storm: A real-time computation system designed to handle large streams of data within Hadoop. It supports micro-batching and stateful stream processing through its Trident API.

Machine Learning

Apache Mahout:  A distributed linear algebra framework and mathematically expressive Scala DSL (domain-specific language) designed to perform predictive analytics on Hadoop data.
Apache MXNet: An acceleration library designed for building neural networks and other deep learning applications. MXNet automates common workflows and optimizes numerical computations.
Other options: Spark MLlib, SAMOA (on Storm), and Flink ML

Coordination

Apache ZooKeeper: A centralized service for maintaining configuration information, naming, providing distributed synchronization, and providing group services.

Management and Monitoring

HCatalog: A table storage management tool. It allows you to access Hive metastore tables within Pig, Spark SQL, and/or custom MapReduce applications.
Ganglia: A scalable distributed monitoring system for high-performance computing systems such as clusters and Grids.
Apache Ambari:  A web-based tool for provisioning, managing, and monitoring Apache Hadoop clusters. Allows configuration and management of a Hadoop cluster from one central web UI.
Hue (Hadoop User Experience): An open source analytics workbench for browsing, querying, and visualizing data in a web UI. It is developed by Cloudera.

Interactive Notebooks

Apache Zeppelin: A web-based notebook that enables data-driven, interactive data analytics and collaborative documents with SQL, Scala and more.
Jupyter Notebook:  An open-source web application that you can use to create and share documents that contain live code, equations, visualizations, and narrative text.

Security

Apache Ranger: A framework to enable, monitor and manage comprehensive data security across the Hadoop platform.
Apache Knox Gateway: A system that provides a single point of authentication and access for the Hadoop services in a cluster.

Hadoop Decision Matrix