Hadoop Platforms: The Elephants in the Room

"When there’s an elephant in the room introduce him"
-Randy Pausch

When speaking about Big Data, two assumptions are commonly made:

One: Hadoop comes to mind right alongside it, and the two are often treated as synonyms, which they are not.

While Big Data is the umbrella concept that refers to the process of handling enormous amounts of data in different forms (structured and unstructured), independent of any particular technology or tool, Hadoop is a specific open source technology for dealing with these kinds of voluminous data sets.

But before we continue, and as a refresher, let’s remind ourselves what Hadoop is, using the project’s own definition:
The Apache Hadoop software library is a framework that allows for the distributed processing of large data sets across clusters of computers using simple programming models. It is designed to scale up from single servers to thousands of machines, each offering local computation and storage. Rather than rely on hardware to deliver high-availability, the library itself is designed to detect and handle failures at the application layer, so delivering a highly-available service on top of a cluster of computers, each of which may be prone to failures.
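To make the phrase "simple programming models" a bit more concrete, here is a minimal sketch of the map, shuffle and reduce steps behind MapReduce, the model Hadoop is best known for. It is plain Python running on a single machine, not Hadoop's actual API; the function names and sample input are illustrative assumptions.

```python
from collections import defaultdict


def map_phase(line):
    # Map step: emit a (word, 1) pair for every word in one input line.
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    # Shuffle step: group all emitted values by key (here, the word).
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    # Reduce step: collapse the values for one key into a single result.
    return key, sum(values)


def word_count(lines):
    # On a real cluster each step would run in parallel across many machines;
    # here everything runs sequentially in one process.
    pairs = [pair for line in lines for pair in map_phase(line)]
    grouped = shuffle(pairs)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


counts = word_count(["big data is big", "hadoop handles big data"])
```

In a real Hadoop job the framework takes care of splitting the input, moving intermediate pairs between machines and retrying failed tasks, which is why the developer only has to supply the map and reduce logic.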
Commercial Hadoop distributions assemble different combinations of various open source components from the Apache Software Foundation and more specifically from the Apache Hadoop stack.

These distributions integrate all components within a single product offering as an enterprise-ready commercial solution. In many cases, distributions also offer proprietary software, support, consulting services, and training as part of the package.

Two: When talking about Hadoop and its commercial use, three usual suspects often come to mind: Cloudera, Hortonworks and MapR, which have become major players due to their history and ties with the evolution of Hadoop.

While there’s no doubt these Hadoop-based data platforms are major players, nowadays a company can choose from a significant number of options. So, to follow Mr. Pausch’s advice, let’s take a look at a list of Hadoop-based data platforms available in the market and introduce them.

Alibaba Cloud
Solution: Alibaba E-MapReduce Service

The Alibaba Cloud Elastic MapReduce (E-MapReduce) is a cloud-based big data processing solution based on Apache Hadoop and Apache Spark. E-MapReduce’s flexibility allows the platform to be applied to different big data use cases, including trend analysis, data warehousing, and analysis of continuously streaming data.

Being in the cloud, E-MapReduce offers big data processing within a flexible and scalable platform of distributed Hadoop clusters, with seamless integration with the rest of the Alibaba Cloud offerings.

Amazon Web Services
Solution: Amazon EMR

With Amazon EMR, the company provides a cloud-based managed Hadoop framework to make it easy, fast, and cost-effective to process vast amounts of data across dynamically scalable Amazon EC2 instances.
With Amazon EMR it is also possible to deploy and run other open source distributed frameworks, including Spark, HBase, Presto, and Flink, and to interact with data stored in other AWS data stores like Amazon S3 and Amazon DynamoDB.

Amazon EMR supports use cases such as log analysis, web indexing, data transformations (ETL), machine learning, financial analysis, scientific simulation, and bioinformatics.

Arenadata
Solution:  Arenadata Hadoop (Open Analytical Platform)

The Arenadata Unified Data Platform is composed of a set of components along with Hadoop, including all the necessary software to access, manipulate, protect and analyze data.

Arenadata Hadoop (ADH) is an enterprise-ready, Apache Hadoop-based distribution aimed at handling semi-structured and unstructured data. Today, ADH is certified as fully compliant with the ODPI (Open Data Platform initiative) standard, and it is assembled and deployed as a complete set of Apache-based open source products, without proprietary software.

Arenadata Hadoop provides a full set of tools for autonomous installation on physical as well as virtual machines. Its monitoring and administration software helps optimize performance across all of the system’s components, while Apache Ambari provides the interfaces required for integration with existing administrative systems such as Microsoft System Center and Teradata ViewPoint.

Cloudera
Solution: Cloudera Enterprise Data Hub

The Enterprise Data Hub (EDH) is Cloudera’s Hadoop data platform distribution, a solution intended to make big data software fast, secure, and easy to use: from data science and engineering, to powering an operational database, to running large-scale analytics, all within the same product.

Offered in different flavors (Analytic DB, Operational DB, Data Science & Engineering, and an Essentials version), Cloudera’s EDH also offers, aside from its analytics and data management capabilities, features for running in the cloud, such as:

  • High-performance analytics. Run any analytics tool of choice against the cloud-native object store Amazon S3.
  • Elasticity and flexibility. Support for transient Hadoop clusters that scale up and down as needed, as well as permanent clusters for long-running BI and operational jobs.
  • Multi-cloud provisioning. Deploy and manage Cloudera Enterprise across AWS, Google Cloud Platform, Microsoft Azure, and private networks.
  • Automated metering and billing. Pay only for what a company needs, when it needs it.


Gluent
Solution:  Gluent Data Platform

Implemented around the world in large organizations across industries including finance, telecom, retail and healthcare, the Gluent Data Platform offers a Hadoop data platform for data offloading, access and analysis.

Some benefits and features offered by Gluent include, among others:

  • High parallelism in Hadoop using cheap Hadoop cluster hardware and software
  • No changes required to existing application code for connection with sources by using Gluent’s Smart Connector
  • Offers capability to choose from and use multiple data engines (like Impala, Hive and Spark) to process your data
  • No data conversion or export/import is needed when using new engines on Hadoop


Google Cloud Platform
Solution:  Cloud Dataproc

Google’s Cloud Dataproc is a fully-managed cloud service for running Apache Spark and Apache Hadoop clusters. Some of the features of Cloud Dataproc include:

  • Automated cluster management
  • Re-sizable clusters
  • Versioning
  • High availability
  • Integration with developer tools
  • Automatic or manual configuration
  • Flexible virtual machines

Cloud Dataproc also easily integrates with other Google Cloud Platform (GCP) services to provide a complete platform for data processing, analytics and machine learning.

Hortonworks
Solution:  Hortonworks Data Platform (HDP)

HDP is an enterprise-ready and secure Apache Hadoop distribution designed on a centralized architecture based on YARN. HDP aims to address the complete set of needs for data-at-rest, as well as to power real-time customer applications and deliver robust big data analytics solutions.

Whether on-premises or in the cloud, Hortonworks provides the flexibility to run the same industry-leading, open source platform to gain data insights in the data center as well as on the public cloud of choice (Microsoft Azure, Amazon Web Services or Google Cloud Platform).
Infosys
Solution:  Infosys Information Platform (IIP)

IIP is a data and analytics platform designed to help enterprises leverage their data assets for innovation and enhance business growth. The solution can easily integrate with proprietary software, to allow companies to maximize value from existing investments.

According to Infosys, IIP is a collaborative platform that enables data engineers, data analysts and data scientists to work jointly across business domains and verticals. IIP can be deployed with ease and without vendor lock-in.

With improved security through role-based access controls that include cell-level authorizations, IIP helps enterprises simplify their data management operations and understand their data better, accelerating the data-insight-action cycle.

IIP aims to be the right tool for organizations that want to gain real-time insights, get faster business value, stay compliant with updated governance and robust security, and reduce total cost of ownership with high availability.

MapR
Solution:  MapR Converged Data Platform

MapR’s Converged Data Platform integrates Hadoop, Spark, and Apache Drill along with real-time database capabilities, global event streaming, and scalable enterprise storage to provide a full enterprise-ready big data management platform with Hadoop.

The MapR Platform aims to deliver enterprise-grade security, reliability, and real-time performance while lowering both hardware and operational costs for applications and data.

The MapR Converged Data Platform can run analytics and applications simultaneously at high speed while providing scalability and reliability. The strategy is to converge all data within a data fabric that allows it to be stored, managed, processed, and analyzed as it is being generated.

Mastodon C
Solution:  Kixi

Mastodon C’s open source data platform Kixi uses Hadoop, Cassandra and a set of open source technologies to ingest and integrate batch and real-time data within a single repository, from which the platform can aggregate, model, and analyze it.

Some of Kixi’s main features include:

  • Handling of real-time and sensor data via Apache Kafka
  • ETL and batch processing capabilities
  • Data Science capabilities for advanced data analysis
  • Ongoing support to ensure efficient data processing and continuous review and improvement of customers’ data pipelines and models


Microsoft Azure
Solution:  Microsoft Azure HDInsight

Backed by Hortonworks, Azure’s HDInsight is, according to Microsoft, a fully-managed, full-spectrum open source analytics service for enterprises.

The Azure HDInsight service aims to provide a fully-managed cloud service to make it easy for organizations to process massive amounts of data via popular open source frameworks including Hadoop, Spark, Hive, LLAP, Kafka, Storm, R and others.

Azure HDInsight provides an architecture landscape for different use cases including ETL, Data Warehousing, Machine Learning, IoT and other services within an integrated platform.

NEC
Solution:  NEC Data Platform for Hadoop

Another offering powered by Hortonworks, NEC’s "Data Platform for Hadoop" is a pre-designed and pre-validated Hadoop appliance which integrates NEC's specialized hardware and Hortonworks’ Data Platform.

This NEC Hadoop-based appliance comes tuned to work with an enterprise-ready Hortonworks platform that is already certified for NEC’s server hardware.

Oracle
Solutions: Oracle Big Data Cloud Service and Oracle Big Data Cloud

Oracle has gone “big” with big data: the mega tech vendor offers a couple of Hadoop-based data management platforms, the Oracle Big Data Cloud Service and Oracle Big Data Cloud.

Derived from a partnership with Cloudera, the Oracle Big Data Cloud Service aims to enable organizations to launch their Big Data efforts by providing a data platform within a secure, automated and scalable service that can be fully integrated with existing enterprise data in Oracle Database. The service has been designed to:

  • Deliver high performance through dedicated instances
  • Allow dynamic scaling as needed
  • Reinforce and extend security to Hadoop and NoSQL processes
  • Deliver a comprehensive solution that includes robust data integration capabilities and integration with R, spatial and graph software

Oracle Big Data Cloud is an enterprise-ready Hadoop data platform intended for organizations that want to run big data workloads, including batch processing, streaming and/or machine learning, within a public or a private cloud configuration.

Qubole
Solution:  Qubole Data Service (Apache Hadoop as a Service)

Qubole offers an autonomous data platform implementation of Apache Hadoop in the cloud. Apache Hadoop as a Service, part of Qubole Data Service, offers a self-managing and self-optimizing implementation of Apache Hadoop that can run on different public cloud infrastructures including AWS, Azure and Oracle Cloud.

Qubole’s Hadoop service runs applications in MapReduce, Cascading, Pig, Hive, and Scalding. The service is optimized for faster workload performance and incorporates an enterprise-ready data security infrastructure.

SAP
Solution:  SAP Cloud Platform Big Data Services

SAP’s Big Data Services on its Cloud Platform is a full-service big data cloud-based Hadoop and Spark data platform.

The platform allows companies to utilize Apache Hadoop, Spark, Hive and Pig, as well as several third-party applications, to take advantage of the most recent innovations in big data and address the diverse set of use cases an organization might have.

Also worth mentioning is that the service integrates with SAP Leonardo, the company’s IoT and digital innovation platform, to support a systematic approach to digital innovation. According to SAP, the platform meets rigorous demands for reliability, scalability, and security.

Syncfusion
Solution:  Syncfusion Big Data Platform

Syncfusion Big Data Platform is a full-fledged Hadoop distribution designed for Windows, Linux, and Azure. One of the things that makes this Hadoop platform interesting, aside from its features for managing huge data loads, is its ability to easily create, deploy, and scale a secure Syncfusion Hadoop cluster with basic or Kerberos-enabled authentication in a Microsoft Azure Virtual Machines environment.

The Syncfusion cluster manager makes it possible to effectively manage resources in Microsoft Azure, with options to track billing details; to shut down, restart, and destroy virtual machines as required; or to start and stop the virtual machines with the Hadoop cluster at scheduled intervals.

Additionally, Syncfusion Big Data Platform includes support for creating and managing Hadoop clusters within Linux environments, Azure Blob storage for Azure VM-based Hadoop clusters, and Elasticsearch and MongoDB data access with Spark, among many other features.

T-Systems
Solution:  T-Systems Big Data Platform

The T-Systems Big Data Platform is a full Hadoop and in-memory based solution that comprises consultancy, planning, implementation and optimization of big data analysis solutions and processes.

Building on a partnership with Cloudera and SAP HANA, along with other best-of-breed data management tools, T-Systems provides organizations with a Hadoop ecosystem. Its big data solution offers a scalable big data platform in the cloud.

The solution offers a full set of functions for the collection, backup and processing of large sets of unstructured data.

Additionally, T-Systems’ big data solution includes capabilities for real-time analytics, done with SAP HANA's in-memory architecture, which allows all data to be directly stored in main memory (RAM).

Teradata
Solution:  Teradata Appliance for Hadoop

The Teradata Appliance for Hadoop is Teradata’s enterprise Hadoop implementation approach: a ready-to-run enterprise platform, pre-configured and optimized specifically to run enterprise-class big data workloads.

The appliance features optimized versions of either Hortonworks HDP or Cloudera CDH running on top of Teradata hardware and a comprehensive set of Teradata-developed software components. Some features of the Teradata Appliance for Hadoop include:

  • Optimized hardware and flexible configurations
  • High-speed connectors and enhanced software usability features
  • Systems monitoring and management portals
  • Continuous availability and linear scalability
  • Teradata's world-class service and support

TickSmith
Solution: TickVault

TickVault is a Hadoop-based big data platform with the purpose of collecting, storing, transforming, analyzing and providing insights from structured and unstructured financial data. This includes trade & quote history, news and events, research and corporate actions among others.

The platform has been designed to help organizations speed up the development and management of finance-related big data projects. It provides APIs and integrates with pre-existing business software solutions including Matlab, R, or Excel, to avoid business disruptions and speed up the analytics process.

Its unified web interface aims to provide easy data access and distribution within a secure environment, allowing flexible management of granular permissions.

Hadoop Platforms: Mature and Enterprise Ready Big Data Platforms

From the list above it’s easy to see that gone are the days when just a few vendors provided enterprise-ready options for undertaking a Hadoop-based big data project. The Hadoop space continues to evolve, and a more than decent number of vendors now offer reliable solutions for deploying Hadoop on-premises or in the cloud, covering most of the use cases an organization needs to address.

Granted, choosing which Hadoop data platform is best for an organization requires much more information, but this list can provide a place to start exploring the possibilities for new big data projects, small or large, involving Hadoop.

Finally, I wouldn’t be surprised to discover there are other Hadoop platforms I have not mentioned here. Please feel free to let me know about any other distribution missing from this list, or to drop me a comment or feedback below.

Notes:

  • During the writing of this piece, it wasn’t possible to gather links and information regarding Huawei’s FusionInsight Big Data Platform, which is why it does not appear as part of our list.
  • While IBM will continue to provide a Hadoop-based offering, it will do so by integrating Hortonworks into its analytics arsenal rather than through the existing IBM BigInsights. For more information read here.
  • All logos and trademarks are the property of their respective owners.


Hadoop Platforms: The Elephants in the Room. Reviewed by Jorge Garcia on May 04, 2018.