Choose the right Hadoop solution


The Hadoop ecosystem is open source, with plenty of add-on packages. Commercial distributions and managed services let businesses enjoy the power of Hadoop without most of the headaches: they take over the infrastructure and software management side of the implementation, although this adds a dependency on the Hadoop vendor or host. The commercial element generally means you have to pay to get in the door, but the cost is usually worth it when you consider that you are handing tedious IT burdens such as deployment, configuration, and ongoing maintenance to someone who’s better suited to handle them. The number of Hadoop distributions is growing rapidly, but the same handful of vendors typically occupies the top spots.

Alternatively, you can set up a Hadoop environment on your own servers using one of the commercially or freely available Hadoop distributions. However, installing and configuring the system, and then managing it as it grows over time, is challenging and time-consuming.

There are two categories of Hadoop solutions: Hadoop distributions and Hadoop cloud services. The first section looks at the most popular Hadoop distributions: Cloudera, Hortonworks, and MapR. The second section looks at Hadoop on three cloud providers: Amazon AWS, Microsoft Azure, and Google GCP.

Hadoop Distributions

Cloudera

To begin with, Cloudera was the first company to release a commercial Hadoop distribution and continues to be a leader in the industry. Cloudera Distribution Including Apache Hadoop (CDH) is an open-source platform whose features include batch processing, interactive database querying, and interactive search powered by Apache Solr. The key feature is Cloudera Manager, a centralized interface that aims to make handling a complete Hadoop infrastructure simple regardless of size. Cloudera distributions are also available in free versions with limited features and no technical support. In addition, Cloudera offers software, services, and support in five bundles, available both on-premise and across multiple cloud providers:

Enterprise Data Hub: Cloudera’s comprehensive data management platform includes Data Science & Engineering, Operational DB, Data Warehouse, and Cloudera Essentials. The annual fee is $10,000.
Data Warehouse: High-performance data warehouse for BI and SQL analytics built on the core Cloudera Essentials platform.
Operational DB: Real-time data at scale for relational or NoSQL and structured or unstructured built on the core Cloudera Essentials platform. The annual fee is $8,000.
Data Science and Engineering: Accelerate exploratory big data processing (ETL), data science, and machine learning on top of the Core Essentials platform. The annual fee is $6,000.

Cloudera Essentials: Cloudera’s enterprise-ready management capabilities (Cloudera Manager) and open source platform distribution (CDH). CDH is the most popular Hadoop distribution with 100% open source components. The annual fee is $2,000.

Check Cloudera’s website for the latest features and pricing of each product. To get hands-on with CDH, you can download the Cloudera QuickStart VM. Cloudera also offers a free training course on Cloudera Essentials.

Cloudera also offers a managed service in the cloud:

Altus Data Engineering: Provides a cloud-native offering of Cloudera Data Engineering. You can deploy Cloudera on all major cloud providers. For example, the hourly charge is $0.08 on an AWS m4.xlarge instance. Check Cloudera’s hourly rate list for other cloud providers and instance types.

Hortonworks

Hortonworks is the other Hadoop heavyweight on the big data scene. The Palo Alto-based startup is the mastermind behind the Hortonworks Data Platform (HDP), an open-source data management platform optimized for enterprise applications and the first Hadoop distribution to support Windows. HDP offers the ability to capture data from multiple sources, process it, and easily share it across the organization at any scale. The software is known for its reliability, as the vendor commits to using the most thoroughly tested and stable versions of Hadoop. Hortonworks prides itself on offering a totally free version of the platform, with premium support available for those who need it. Specifically, HDP enables the creation of a secure enterprise data lake and delivers the analytics you need to innovate faster and power real-time business insights. A diagram of the HDP hybrid architecture is available on the Hortonworks website.

In addition, Hortonworks expanded its partnerships with the major cloud providers in June 2018. The Hortonworks 3.0 release comprises three products: HDP, HDF (Hortonworks DataFlow), and DPS (Hortonworks DataPlane Service). These products are now available on Azure as well as Amazon Web Services (AWS) and the Google Cloud Platform (GCP). Furthermore, a brand-new service, IBM Hosted Analytics with Hortonworks (IHAH), combines HDP, IBM’s Db2 Big SQL, and the IBM Data Science Experience, an AI-oriented offering. For a hands-on experience, see the Hortonworks tutorial on HDP.

MapR

MapR gets its name from MapReduce, the programming model in Hadoop that handles the filtering and sorting of data in networked computer clusters. This platform has some attributes that help it stand out in the increasingly crowded Hadoop distribution space, including a suite of features designed with disaster recovery in mind. Coupled with built-in data protection, MapR’s distributed network architecture improves availability by preventing data loss and downtime, while maintaining performance at virtually any scale. MapR claims world-record performance, including MapR-DB running 4-7x faster than HBase on other distributions. In addition, the DirectShuffle technology leverages the performance advantages of MapR-FS to deliver strong cluster performance, and Direct Access NFS simplifies data ingestion and access. A diagram of the MapR Converged Data Platform (CDP) architecture is available on the MapR website.

Similarly, you can deploy MapR on Amazon AWS, Microsoft Azure, and Google Compute Engine. Note that you can find the total cost of ownership through its TCO calculator. MapR offers CDP’s Converged Community Edition for free, so you can try it before you buy it.

Hadoop Cloud Providers

The following cloud providers offer fully managed Hadoop cloud services that make it easy, fast, and cost-effective to process massive amounts of data:

AWS Elastic MapReduce (EMR)

Amazon EMR provides a managed Hadoop framework as a web service. Amazon is leveraging its expertise in the cloud to tap into emerging opportunities in the big data space. Through EMR, Amazon offers a hosted Hadoop solution that delivers not only the data handling capabilities but also the capacity to store the data in the elastic resources of the cloud. The e-commerce giant takes on the burden of setting up the server clusters, provisioning the network points, and configuring the Hadoop application so you can focus solely on crunching numbers. Ease of use and scalability have made EMR a hit with high-profile clients such as Yelp, Razorfish, and Getty Images. Data is processed across dynamically scalable Amazon EC2 instances. You can choose from Amazon S3 (EMRFS), the Hadoop Distributed File System (HDFS), and Amazon DynamoDB as the data stores. EMR can run popular frameworks such as Apache Spark, HBase, Presto, Hue, Flink, and more, and it supports more Hadoop ecosystem frameworks than Azure HDInsight and GCP Dataproc. Hourly prices range from $0.011/hour to $0.27/hour (roughly $94/year to $2,367/year), plus the cost of EC2, EBS, and S3. Choosing reserved or spot EC2 instances can save you money. To find out the total cost of EMR, use the AWS Calculator. EMR is not free under the AWS Free Tier.
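To make the hourly-to-yearly conversion concrete, here is a minimal Python sketch that annualizes the EMR fee range quoted above. It assumes an instance that runs around the clock for a 365-day year (8,760 hours); the helper name annual_cost is made up for illustration. Under that assumption the results land close to, but not exactly on, the yearly figures quoted above, which presumably reflect rounding in the published hourly rates.

```python
# Assumption: an always-on instance in a non-leap year.
HOURS_PER_YEAR = 24 * 365  # 8,760 hours


def annual_cost(hourly_rate: float, hours: int = HOURS_PER_YEAR) -> float:
    """Return the yearly cost in dollars of one instance billed at hourly_rate."""
    return hourly_rate * hours


# EMR fee range from the text (per instance, excluding EC2, EBS, and S3).
EMR_FEE_LOW = 0.011
EMR_FEE_HIGH = 0.27

if __name__ == "__main__":
    for label, rate in (("low", EMR_FEE_LOW), ("high", EMR_FEE_HIGH)):
        print(f"{label}: ${rate}/hour is about ${annual_cost(rate):,.0f}/year")
```

The same function works for a partial workload: pass the number of hours the cluster actually runs, for example annual_cost(0.27, hours=2000) for a cluster that is only up during business hours.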

Azure HDInsight

Azure HDInsight enables a broad range of scenarios such as ETL, data warehousing, machine learning, IoT, and more. It supports popular open-source frameworks such as Hadoop, Spark, Hive, LLAP, Kafka, Storm, R, and more. You can choose Azure Blob storage instead of HDFS. Azure utilizes its own Azure Data Lake platform and its own cloud security framework. Hourly prices range from $0.074/hour to $1.496/hour. Microsoft offers a one-month free trial on Azure, so you may want to test a Hadoop cluster from HDInsight.

GCP Dataproc

Cloud Dataproc is a cloud-based managed Spark and Hadoop service offered on the Google Cloud Platform. Dataproc uses Google Compute Engine (GCE) to process data and integrates with Cloud Storage, BigQuery, Bigtable, Stackdriver Logging, and Stackdriver Monitoring. However, Dataproc itself only includes Apache Hadoop, Spark, Hive, and Pig. Hourly prices range from $0.001/hour to $0.640/hour. Google also offers a Free Tier (a 12-month free trial with $300 credit) that allows you to use any GCP product, so you can take this offer to try Dataproc and BigQuery.

Conclusion

In conclusion, weigh the components, deployment model, performance, security, and cost together when choosing a Hadoop solution. Cloudera has the largest user base; moreover, Cloudera Enterprise Data Hub is a comprehensive data management platform with everything you need. But do you genuinely need every component? If your organization doesn’t have many Hadoop experts, choosing a fully managed Hadoop cloud service will let you focus on development. If you need fast performance, take a look at MapR; if you want low cost, Hortonworks is an excellent choice. Each Hadoop solution takes a different approach to authentication, security policy management, and data encryption, so review how each one addresses your auditing policy and protection requirements. Also consider designing a hybrid Hadoop solution that keeps some jobs on-premise and runs noncritical jobs in the cloud (e.g. on AWS spot instances) to save cost. Finally, if your data is below 1.6PB, you may want to take a look at the Redshift data warehouse option.
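The rules of thumb in this conclusion can be written down as a small checklist. The Python sketch below is only an illustration: the Requirements fields, the shortlist function, and the fallback to Cloudera Enterprise Data Hub when no other rule applies are assumptions made for this example, not an official selection method from any vendor.

```python
from dataclasses import dataclass
from typing import List

# Threshold taken from the conclusion; everything else here is illustrative.
REDSHIFT_THRESHOLD_PB = 1.6


@dataclass
class Requirements:
    has_hadoop_experts: bool     # in-house staff who can run a cluster
    needs_top_performance: bool  # raw performance is the main concern
    cost_sensitive: bool         # low cost is the main concern
    data_size_pb: float          # total data volume in petabytes


def shortlist(req: Requirements) -> List[str]:
    """Map requirements to the options suggested in the conclusion."""
    options: List[str] = []
    if not req.has_hadoop_experts:
        options.append("fully managed cloud service (EMR, HDInsight, or Dataproc)")
    if req.needs_top_performance:
        options.append("MapR")
    if req.cost_sensitive:
        options.append("Hortonworks")
    if req.data_size_pb < REDSHIFT_THRESHOLD_PB:
        options.append("Redshift data warehouse")
    if not options:
        # Assumed fallback: the most comprehensive platform named above.
        options.append("Cloudera Enterprise Data Hub")
    return options


if __name__ == "__main__":
    team = Requirements(has_hadoop_experts=False, needs_top_performance=False,
                        cost_sensitive=True, data_size_pb=0.5)
    for option in shortlist(team):
        print(option)
```

A shortlist like this is a starting point only; security, auditing, and hybrid deployment needs still have to be reviewed per solution as described above.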