The BBBT Sessions: Hortonworks, Big Data and the Data Lake

Some of the perks of being an analyst are the opportunities to meet with vendors and hear about their offerings, their insight on the industry and best of all, to be part of great discussions and learn from those that are the players in the industry.

For some time now, I have had the privilege of being a member of the Boulder BI Brain Trust (BBBT), an amazing group consisting of Business Intelligence and Data Management analysts, consultants and practitioners covering various specific and general topics in the area. Almost every week, the BBBT engages a software provider to give us a briefing of their software solution. Aside from being a great occasion to learn about a solution, the session is also a tremendous source for discussion.

I will be commenting on these sessions here (in no particular order), providing information about the vendor presenting, giving my personal view, and highlighting any other discussion that might arise during the session.

I would like to start with Hortonworks, one of the key players in the Big Data space, and a company that has a strong influence on how Big Data is evolving in the IT industry.

The session

In a session conducted by David McJannet and Jim Walker, Hortonworks’ Marketing VP and Director of Product Marketing respectively, BBBT members had the chance to learn in more detail about Hortonworks’ offerings, strategy, and services aimed at bringing Hadoop to the enterprise. We also discussed Big Data and its place in the enterprise data management infrastructure, especially in relation to data warehousing, analytics, and governance. Here are some of the highlights of the session…

About Hortonworks

Hortonworks is a recently emerged company, but with a lot of experience in the Big Data space. Founded in 2011, it was formed by the original Hadoop development and operations team from Yahoo! Why is this so relevant? Well, because Hortonworks lives and breathes Hadoop, and the company makes a living by building its data solutions on top of Hadoop and many of its derivative projects. And Hadoop is arguably the most important open source software project of all time, or maybe just after Linux.

Hadoop is described on its Web page as follows:

The Apache™ Hadoop® project develops open-source software for reliable, scalable, distributed computing.
The Apache Hadoop software library is a framework that allows for the distributed processing of large data sets across clusters of computers using simple programming models. It is designed to scale up from single servers to thousands of machines, each offering local computation and storage. Rather than rely on hardware to deliver high-availability, the library itself is designed to detect and handle failures at the application layer, so delivering a highly-available service on top of a cluster of computers […].

Hortonworks focuses on driving innovation exclusively via the Apache Software Foundation, producing open source–based software that enables organizations to deal with their Big Data initiatives by delivering Apache Hadoop solutions ready for enterprise consumption. Hortonworks’ mission, as stated in the title of its presentation, is the following:

Our mission is to enable your Modern Data Architecture by delivering Enterprise Apache Hadoop.

Hortonworks’ commitment to Hadoop

One of the interesting aspects of Hortonworks is its commitment to Hadoop, in many regards, from the way it handles Hadoop offerings for corporate consumption, to the amount of effort Hortonworks’ team devotes to evolving and enhancing Hadoop’s capabilities. To this point, Hortonworks shared the following graph, in which it’s possible to see the level of contribution of Hortonworks to the famed Apache project in 2013.

Figure 1. List of contributors for Hadoop and number of lines contributed (Source: http://ajisakaa.blogspot.ca/2014/02/the-activities-of-apache-hadoop.html)

In the same vein, the Hortonworks team’s contributions to Hadoop extend across its multiple subprojects, including HBase (Hadoop’s distributed data store), Pig (Hadoop’s large data set analysis language), and Hive (Hadoop’s data warehouse infrastructure), among others (Figure 2). This makes Hortonworks a hub for some of the most important experts in Apache Hadoop, with a strong commitment to its open source nature.

Figure 2. List of contributors to Hadoop and number of lines contributed (Courtesy of: Hortonworks)

Hortonworks’ approach to the business market is quite interesting. While maintaining its commitment to both Hadoop and open source ecosystems, Hortonworks has also been able to:

Package corporate-ready solutions, and
Ensure strong partnerships with some important software companies such as Microsoft, Teradata, SAP, HP, RackSpace, and, most recently, Red Hat, extending Hortonworks’ reach and influence in the Big Data space and especially into corporate markets.

So what does Hortonworks offer?

Hortonworks clearly says it: They do Hadoop. What this means is that Hortonworks’ flagship product, the Hortonworks Data Platform (HDP2), is an enterprise solution based 100% on the open source Apache Hadoop platform. HDP2’s architecture uses the core set of Hadoop’s modules, architected and certified for enterprise use, and then adds fully tested and certified versions of Hadoop modules, as well as a complete set of professional services provided by Hortonworks for its customers.

Another offering from the company is the Hortonworks Sandbox, a Hadoop environment that includes interactive tutorials and the most recent Hadoop developments for learning and testing.

How does Hortonworks fit into an organization?

One of the main concerns of many organizations trying to embrace Big Data is how their Big Data initiative will fit within their existing data management infrastructure. More importantly, the organization needs to evolve its traditional data management infrastructure (Figure 3) so that Big Data adoption doesn’t generate more problems than solutions. Hortonworks is by no means the only software provider; vendors such as Cloudera and MapR also embrace Hadoop to solve an organization’s Big Data issues, but with a different approach.

Figure 3. A traditional data management approach (Courtesy of: Hortonworks)

Wayne Eckerson explains in The Battle for the Future of Hadoop:

Last November, Cloudera finally exposed its true sentiments by introducing the Enterprise Data Hub in which Hadoop replaces the data warehouse, among other things, as the center of an organization's data management strategy. In contrast, Hortonworks takes a hybrid approach, partnering with leading commercial data management and analytics vendors to create a data environment that blends the best of Hadoop and commercial software.

During the session, aside from the heated debates about whether or not to replace the data warehouse with new information hubs, both David McJannet and Jim Walker confirmed Hortonworks’ position, which (in contrast to Cloudera’s approach) consists of enabling companies to expand their existing data infrastructures and evolve without replacing their data management platforms (Figure 4).

Figure 4. Hortonworks expands an organization’s traditional data management capabilities for addressing Big Data (Courtesy of: Hortonworks)

The appealing part of Hortonworks’ scheme is that its Hadoop offerings act as an expansion of the rest of the data repository spectrum (relational databases, data warehouses, data marts, and so on). This makes sense in the context of coupling new data management strategies with existing ones; while Hadoop has proven to be effective for certain tasks and types of data, some problems still need to be handled with “traditional” methods and existing tools. According to Mark Madsen (What Hadoop Is. What Hadoop Isn’t.):

What it doesn’t resolve is aspects of a database catalog, strong schema support, robust SQL, interactive response times or reasonable levels of interactive concurrency—all things needed in a data warehouse environment that delivers traditional BI functions. In this type of workload, Hadoop doesn’t come close to what a parallel analytic database can achieve, including scaling this workload into the Petabyte range.

Yet Hadoop offers features the database can’t: extremely low cost storage and retrieval, albeit through a limited SQL interface; easy compatibility with parallel programming models; extreme scalability for storing and retrieving data, provided it isn’t for interactive, concurrent, complex query use; flexible concepts of schema (as in, there is no schema other than what you impose after the fact); processing over the stored data without the limitations of SQL, without any limitations other than the use of the MapReduce model; compatibility with public or private cloud infrastructures; and free, or support-only, so a price point far below that of databases.
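For readers who have not met the MapReduce model mentioned in the quote above, here is a minimal sketch of the idea in plain Python. It runs in a single process and does not use any Hadoop API; the function names (map_phase, shuffle, reduce_phase, word_count) are my own, chosen only for illustration. It shows the three steps a MapReduce job goes through: emit key/value pairs, group them by key, and reduce each group.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple


def map_phase(line: str) -> Iterator[Tuple[str, int]]:
    """Map step: emit a (word, 1) pair for every word in one input line."""
    for word in line.lower().split():
        yield word, 1


def shuffle(pairs: Iterable[Tuple[str, int]]) -> Dict[str, List[int]]:
    """Shuffle step: group all emitted values by their key."""
    grouped: Dict[str, List[int]] = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key: str, values: List[int]) -> Tuple[str, int]:
    """Reduce step: collapse the values of one key into a single result."""
    return key, sum(values)


def word_count(lines: Iterable[str]) -> Dict[str, int]:
    """Run map, shuffle, and reduce in sequence (single process, illustration only)."""
    pairs = (pair for line in lines for pair in map_phase(line))
    return dict(reduce_phase(key, values) for key, values in shuffle(pairs).items())


counts = word_count(["big data lake", "data lake", "data"])
print(counts)
```

In a real cluster the map and reduce steps run in parallel on many machines and the shuffle moves data across the network, which is where the scalability described in the quote comes from.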

Hortonworks’ approach is then to enable expansion and evolution of the existing data management platform by offering an enterprise-ready version of Hadoop, one that can be nicely integrated and fill those gaps between the data warehouse and the analysis of huge amounts of non-structured (polystructured) information.

What is Hortonworks for, anyway?

Despite the hype and eagerness about Big Data, many people still don’t have a clear idea about the context and use cases where a Hadoop approach can be useful. Hortonworks showed us a good list of examples of how some of their customers are using its platform. Their current deployments run mainly within the financial services, telecom, retail, and manufacturing industries and extend to applications such as fraud prevention, trading risk, call detail records, and infrastructure investment, as well as assembly-line quality assurance and many other potential uses.

How Hortonworks addresses its customers’ Big Data needs is demonstrated by how a customer typically embraces Hadoop in the context of working with increasing volumes of information.

The figure below correlates data volume with the value that data can bring to the organization by enhancing its capability to derive insight.

Figure 5. Described as a “Common journey to the data lake,” Hortonworks shows the relation between data volume and its potential value in the context of addressing specific problems (Courtesy of: Hortonworks)

Another interesting thing about this is the notion of the data lake. Pentaho CTO James Dixon, who’s credited with coining the term, describes it in the following simple terms:

If you think of a datamart as a store of bottled water—cleansed and packaged and structured for easy consumption—the data lake is a large body of water in a more natural state. The contents of the data lake stream in from a source to fill the lake, and various users of the lake can come to examine, dive in, or take samples.

Hortonworks uses Hadoop as the platform to provide a solution for the two main issues that this implies:

A new approach to analytics by enabling the expansion from a single query engine and a deterministic list of questions to a schema-on-read basis, enabling information analysis that addresses polystructured as well as real-time and batch data.
A means for data warehouse optimization, expanding the boundaries of strict data schemas.
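To make the schema-on-read idea above a bit more concrete, here is a minimal sketch in plain Python. It is not Hortonworks or Hadoop code; the sample records, the field names, and the read_with_schema helper are all hypothetical, made up for illustration. The point it tries to show is that the raw data is stored as it arrives, and each question brings its own schema at read time.

```python
import json
from typing import Any, Callable, Dict, Iterable, Iterator

# Hypothetical raw "lake" contents: records kept exactly as they arrived.
RAW_EVENTS = [
    '{"user": "ana", "amount": "12.50", "channel": "web"}',
    '{"user": "bob", "amount": 7, "device": "mobile"}',
    "not json at all",
]

# A schema here is simply a mapping from field name to a casting function.
Schema = Dict[str, Callable[[Any], Any]]


def read_with_schema(lines: Iterable[str], schema: Schema) -> Iterator[Dict[str, Any]]:
    """Apply a schema while reading; the stored data is never modified."""
    for line in lines:
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip lines this reader cannot interpret
        if not isinstance(record, dict):
            continue
        row: Dict[str, Any] = {}
        for field, cast in schema.items():
            value = record.get(field)
            # Missing fields become None instead of failing the whole read.
            row[field] = cast(value) if value is not None else None
        yield row


# Two different questions, two different schemas, the same raw data.
spend_schema: Schema = {"user": str, "amount": float}
channel_schema: Schema = {"user": str, "channel": str}

spend_rows = list(read_with_schema(RAW_EVENTS, spend_schema))
channel_rows = list(read_with_schema(RAW_EVENTS, channel_schema))
print(spend_rows)
print(channel_rows)
```

Compare this with a traditional warehouse load, where the second record would have to be reshaped (or rejected) before it could be stored at all.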

The Hortonworks Data Platform uses the full open source Hadoop platform. It provides an enterprise-ready basis for handling Big Data within an organization, and aims to fit and optimize, not disrupt, the existing data platform (Figure 6). Some of the challenges of Hadoop deployments have been coping with the often unfriendly environment and the lack of technical expertise needed to handle Hadoop projects properly, especially in hybrid and complex environments that mix and interconnect traditional and Hadoop deployments.

The recent addition of YARN (Hadoop’s new resource, job, and application manager) in Hadoop 2.0, and its inclusion in HDP2, enabled Hortonworks to provide a more robust processing platform, which can now run and manage workloads other than MapReduce, expanding HDP’s capabilities to manage both MapReduce and external applications and resources more efficiently. The Hortonworks website has a good summary of the use of YARN within HDP.

Figure 6. Hortonworks Data Platform General Architecture (Courtesy of: Hortonworks)

Open source software, especially projects based on Hadoop and big data, traditionally has a Linux orientation, so it’s worth mentioning that the HDP2 platform is available on both Linux and Windows operating systems.

Hortonworks Data Platform, enterprise tested

During the session, one thing David McJannet and Jim Walker emphasized was Hortonworks’ testing and quality assurance model, which includes testing HDP directly within Yahoo’s data environment. This gives Hortonworks a vast testing platform with complex, data-flooded scenarios, a good proving ground for any data application.

To conclude

I have no doubt that the new breed of solutions such as Hortonworks and others offer impressive and innovative approaches to the analysis and management of complex and big data problems. Clearly, frameworks such as the data warehouse need to adapt to these new conditions or die (I tend to believe they will not die).

Instead, it seems that data warehouse methodologies and platforms potentially have the necessary elements—such as enterprise readiness, methodology, and stability—to evolve and include these new computing paradigms, or at least live within these new ecosystems.

So some of the challenges of deploying Big Data solutions, aside from the natural technological issues, could come from how these new concepts fit within existing infrastructures. They need to avoid task duplication, actually streamline processes and data handling, and fit within complex IT and data governance initiatives, ultimately to deliver better results and return on investment for an organization.

Hortonworks takes an approach that should appeal to many organizations by fitting within their current infrastructures and enabling a smooth yet radical evolution of their existing data management platforms, whether via its HDP2 platform or delivered via Hortonworks’ strategic partners. It will be interesting to see what their big competitors have to offer.

But don’t take my word for it. You can replay the session with Hortonworks; just go to the BBBT web page and subscribe.

Have comments? Feel free to drop me a line and I’ll respond as soon as possible.