Not Your Father's Database: Interview with VoltDB’s John Piekos



As organizations deal with challenging times, both technologically and in business, managing growing volumes of data has become a key to success.

As data management rapidly evolves, the main Big Data paradigm has changed from just “big” to “big, fast, reliable and efficient”.

More than ever in the evolution of the big data and database markets, the pressure is on software companies to deliver new and improved database solutions that can not only handle increasing volumes of data but also do so faster, better and more reliably.

A number of companies have taken the market by storm, infusing the industry with new and highly advanced database software, for both transactional and non-transactional operations, that is rapidly changing the database software landscape.

One of these companies is VoltDB. This Massachusetts-based company has rapidly become a reference point for next-generation database solutions and has gained the favor of important customers in key industries such as communications, finance and gaming.

VoltDB was co-founded by none other than world-renowned database expert and 2014 ACM A.M. Turing Award recipient Dr. Michael Stonebraker, who has been key both to the development of a new generation of database solutions and to the formation of the talented team in charge of building it.

With the new VoltDB 7.0 already on the market, we had the opportunity to chat with VoltDB’s John Piekos about VoltDB’s key features and evolution.

John is Vice President of Engineering at VoltDB, where he heads up the company’s engineering operations, including product development, QA, technical support, documentation and field engineering.

John has more than 25 years of experience leading teams and building software, delivering both enterprise and Big Data solutions.

John has held tech leadership positions at several companies, most recently at Progress Software where he led the OpenEdge database, ObjectStore database and Orbix product lines. Previously, John was vice president of Web engineering at EasyAsk, and chief architect at Novera Software, where he led the effort to build the industry’s first Java application server.

John holds an MS in computer science from Worcester Polytechnic Institute and a BS in computer science from the University of Lowell.

Thank you John, please allow me to start with the obvious question:

What’s the idea behind VoltDB, the company, and what makes VoltDB, the database, different from other database offerings in the market?

What if you could build a database from the ground-up, re-imagine it, re-architect it, to take advantage of modern multi-core hardware and falling RAM prices, with the goal of making it as fast as possible for heavy write use cases like OLTP and the future sensor (IoT) applications?  That was the basis of the research Dr. Stonebraker set out to investigate.

Working with the folks at MIT, Yale, and Brown, they created the H-Store project and proved out the theory that if you eliminated the overhead of traditional databases (logging, latching, buffer management, etc.), ran an all in-memory workload, spread that workload across all the available CPUs on the machine and horizontally scaled that workload across multiple machines, you could get orders of magnitude better performance out of the database.

The commercial realization of that effort is VoltDB.  VoltDB is fully durable, able to process hundreds of thousands to millions of multi-statement SQL transactions per second, all while producing SQL-driven real-time analytics.
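The idea of spreading a workload across CPUs and machines can be illustrated with a minimal sketch of key-based partitioning. This is not VoltDB's actual partitioning scheme: the function names, the use of CRC32 as the hash, and the sample events are all assumptions made for illustration. The point is only that every event for a given key lands on the same partition, so each partition can process its own queue without coordinating with the others.

```python
import zlib

def partition_for(key, partition_count):
    """Map a key to a partition with a stable hash (CRC32 is an illustrative choice)."""
    return zlib.crc32(key.encode("utf-8")) % partition_count

def route(events, partition_count):
    """Place each (key, payload) event on the queue of the partition that owns its key."""
    queues = [[] for _ in range(partition_count)]
    for key, payload in events:
        queues[partition_for(key, partition_count)].append((key, payload))
    return queues

# Hypothetical events: two for user-1, one for user-2.
events = [("user-1", "login"), ("user-2", "login"), ("user-1", "purchase")]
queues = route(events, partition_count=4)
```

Because the hash is stable, both `user-1` events end up in the same queue and keep their arrival order.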

Today an increasing number of emerging databases work partially or totally in-memory, while existing ones are changing their design to incorporate this capability. In your view, what are the most relevant features users need to look for when choosing an in-memory database?

First and foremost, users should realize that not all in-memory databases are created equal.  In short, architecture choices require trade-offs.  Some IMDBs are created to process reads (queries) faster and others, like VoltDB, are optimized for fast writes.  It is impractical (impossible) to get both the fastest writes and the fastest reads at the same time on the same data, all while maintaining high consistency, because the underlying data organization and architecture are different for writes (row oriented) than they are for reads (columnar).

 It is possible to maintain two separate copies of the data, one in row format, the other in compressed column format, but that reduces the consistency level - data may not agree, or may take a while to agree between the copies.
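The row-versus-column trade-off described above can be seen in a toy sketch. The table, field names and helper functions below are hypothetical and say nothing about how any real engine stores data. In the row layout a write is a single append, while an aggregate has to visit every record. In the column layout a write touches one list per field, while an aggregate scans one contiguous list.

```python
# Two in-memory layouts for the same logical table (illustrative only).
rows = []                              # row-oriented: one dict per record
columns = {"id": [], "amount": []}     # column-oriented: one list per field

def insert_row_store(record):
    rows.append(record)                # one append per write

def insert_column_store(record):
    for field, value in record.items():
        columns[field].append(value)   # one append per field per write

def total_amount_row_store():
    return sum(r["amount"] for r in rows)   # visits every whole record

def total_amount_column_store():
    return sum(columns["amount"])           # scans a single column

for i in range(1, 4):
    record = {"id": i, "amount": i * 10}
    insert_row_store(record)
    insert_column_store(record)
```

Keeping both layouts, as in this sketch, is exactly the two-copy arrangement mentioned above: the copies agree only if every write reaches both of them.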

Legacy databases can be tweaked to run in memory, but realize that, short of a complete re-write, the underlying architecture may still be disk-based, and thus incur significant (needless) processing overhead.

VoltDB defines itself as an in-memory and operational database. What does this mean in the context of Big Data, and in the context of IT’s traditional separation between transactional and analytical workloads? How does VoltDB fit into or reshape this scheme?

VoltDB supports heavy write workloads - it is capable of ingesting never-ending streams of data at high ingestion rates (100,000+/second per machine, so a cluster of a dozen nodes can process over a million transactions a second).

While processing this workload, VoltDB can calculate (via standard SQL) and deliver strongly consistent real-time analytics, either ad hoc, or optimally, as pre-computed continuous queries via our Materialized View support.
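A pre-computed continuous query can be pictured as an aggregate that is updated as part of each write, so reading it never requires rescanning the base data. The sketch below is a plain Python stand-in, not VoltDB's Materialized View implementation; the class name, the region and seconds fields, and the sample data are assumptions for illustration.

```python
from collections import defaultdict

class CallStatsView:
    """Hypothetical stand-in for a view equivalent to
    SELECT region, COUNT(*), SUM(seconds) ... GROUP BY region."""

    def __init__(self):
        self.count = defaultdict(int)
        self.total_seconds = defaultdict(int)

    def on_insert(self, region, seconds):
        # Maintained incrementally, in the same step as the insert itself.
        self.count[region] += 1
        self.total_seconds[region] += seconds

    def read(self, region):
        # Reading is a lookup, not a scan of the underlying rows.
        return self.count[region], self.total_seconds[region]

view = CallStatsView()
for region, seconds in [("east", 30), ("west", 45), ("east", 15)]:
    view.on_insert(region, seconds)
```

After these three inserts, `view.read("east")` returns a count of 2 and a total of 45 seconds.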

These are capabilities simply not possible with traditional relational databases.  In the Big Data space, this places VoltDB at the front end, as the ingestion engine for feeds of data, from telco, digital ad tech, mobile, online gaming, IoT, Finance and numerous other application domains.

Just recently, with version 6.4, VoltDB passed the well-known Jepsen testing, which examines the safety of distributed databases. Could you share with us some details of the test, the challenges and the benefits it brought for VoltDB?

We have a nice landing page with this information, including blog posts by Kyle and by VoltDB’s founding engineer John Hugg.

In summary, distributed systems programming is hard. Implementing the happy path isn’t hard, but doing the correct thing (such as returning the correct answer) when things go wrong (nodes failing, networks dropping), is where most of the engineering work takes place. VoltDB prides itself on strong consistency, which means returning the correct answer at all times (or not returning an answer at all - if, for example, we don’t have all of the data available).

Kyle’s Jepsen test is one of the most stringent tests out there.  And while we hoped that VoltDB would pass on the first go-around, we knew Kyle was good at breaking databases (he’s done it to many before us!).  He found a couple of defects, thankfully finding them before any known customer found them, and we quickly went to work fixing them. Working with Kyle and eventually passing the Jepsen test was one of the 2016 engineering highlights at VoltDB. We’re quite proud of that effort.

One interesting aspect of VoltDB is that it’s a relational database that complies fully with ACID and brings native SQL support. What are the differences between this design and, for example, NoSQL and some so-called NewSQL offerings? Advantages, or trade-offs perhaps?

In general, NoSQL offerings favor availability over consistency - specifically, the database is always available to accept new content and can always provide content when queried, even if that content is not the most recent (i.e., correct) version written.

NoSQL solutions rely on non-standard query languages (some are SQL-like), to compute analytics. Additionally, NoSQL data stores do not offer rich transaction semantics, often providing “transactionality” on single key operations only.

Not all NewSQL databases are created equal. Some favor faster reads (over fast writes).  Some favor geo-distributed data sets, often resulting in high latency, or at least unpredictable latency in access and update patterns.  VoltDB’s focus is low and predictable OLTP (write) latency at high transactions/second scale, offering rich and strong transaction semantics.

Note that not all databases that claim to provide ACID transactions are equal. The most common place where ACID guarantees are weakened is isolation. VoltDB offers serializable isolation.

Other systems offer multiple levels of isolation, trading faster performance (weaker guarantees) against slower performance (stronger guarantees). Isolation models like Read-Committed and Read-Snapshot are examples; many systems default to one of these.
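Why the isolation level matters can be shown with a deterministic toy example of the lost-update anomaly, which weaker levels such as read committed can permit for read-modify-write transactions. The functions below simulate two transactions that each add 10 to a balance; they are illustrative only and do not model any particular database.

```python
def run_interleaved(balance):
    """Both transactions read before either writes, as weak isolation may allow."""
    t1_read = balance
    t2_read = balance
    balance = t1_read + 10   # transaction 1 writes
    balance = t2_read + 10   # transaction 2 overwrites, losing transaction 1's update
    return balance

def run_serial(balance):
    """One transaction runs to completion before the next starts."""
    for _ in range(2):
        read = balance
        balance = read + 10
    return balance
```

Starting from 100, the interleaved schedule ends at 110 while the serial schedule ends at 120. Serializable isolation guarantees an outcome equivalent to some serial order, so the 110 result cannot occur.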

VoltDB’s design trades off complex multi-dimensional (OLAP) style queries for high throughput OLTP-style transactions while maintaining an ACID multi-statement SQL programming interface. The system is capable of surviving single and multi-node failures.

Where failures force a choice between consistency and availability, VoltDB chooses consistency. The database supports transactionally rejoining failed nodes back to a surviving cluster and supports transactionally rebalancing existing data and processing to new nodes.

Real-world VoltDB applications achieve 99.9th-percentile latencies under 10ms at throughput exceeding 300,000 transactions per second on commodity Xeon-based 3-node clusters.
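A tail-latency figure like the one quoted above is usually read as a percentile over measured samples. The sketch below uses the nearest-rank definition and made-up sample values; how the quoted figure was actually measured is not stated here, so treat this purely as an illustration of the arithmetic. The percentile is passed in per-mille (999 for 99.9%) to keep the rank calculation in integers.

```python
def percentile_permille(samples, permille):
    """Nearest-rank percentile: the smallest sample with at least
    permille/1000 of all samples at or below it."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = -(-permille * len(ordered) // 1000)   # integer ceiling division
    return ordered[max(rank, 1) - 1]

# Hypothetical measurements in milliseconds: mostly fast, a few slow, one outlier.
latencies_ms = [2.0] * 9980 + [8.0] * 19 + [40.0]
p999 = percentile_permille(latencies_ms, 999)
```

With these samples the 99.9th percentile is 8.0 ms even though the single worst request took 40 ms, which is why a percentile target and a worst-case target are different claims.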

How about the handling of unstructured information within VoltDB? Is VoltDB expected to take care of it, or does it integrate with other solutions? What’s the common architectural scenario in those cases?

VoltDB supports the storage of JSON strings and can index, query and join on fields within those JSON values. Further, VoltDB can process streamed JSON data directly into the database using our Importers (described later in this interview) and custom formatters (custom decoding) - this makes it possible for VoltDB to transactionally process data in almost any format, and even to act as an ETL engine.
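Querying and indexing on a field inside a stored JSON string can be pictured as follows. This is a generic Python sketch, not VoltDB's JSON support: the table contents, the dotted-path helper and the index structure are all hypothetical.

```python
import json

# Hypothetical table: (row id, JSON document stored as a string).
sessions = [
    (1, '{"user": "ann", "device": {"os": "ios"}}'),
    (2, '{"user": "bob", "device": {"os": "android"}}'),
    (3, '{"user": "cy", "device": {"os": "ios"}}'),
]

def field(doc, path):
    """Extract a nested field from a JSON string using a dotted path."""
    value = json.loads(doc)
    for key in path.split("."):
        value = value[key]
    return value

# A simple index on device.os: field value -> list of matching row ids.
os_index = {}
for row_id, doc in sessions:
    os_index.setdefault(field(doc, "device.os"), []).append(row_id)
```

Looking up `os_index["ios"]` then finds the matching rows without parsing every document again.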

How does VoltDB interact with players in the Big Data space such as Hadoop, both open source and commercial distributions?

The VoltDB database supports directly exporting data into a downstream data lake.  This target could be Hadoop, Vertica, a JDBC source or even flat files.  VoltDB handles the real-time data storage and processing, as it is capable of transactionally ingesting (database “writes”) millions of events per second.

Typically the value of this data decreases with age - it becomes cold or stale - and eventually would be migrated to historical storage such as Hadoop, Spark, Vertica, etc.  Consider applications in the telco or online gaming space - the “hot data” may have a lifespan of one month in telco, or even one hour or less, in the case of game play.

Once the data becomes “historical” and is of less immediate value, it may be removed from VoltDB and stored on disk in the historical archive (such as Hadoop, Vertica, etc).
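The hot-versus-historical split described above amounts to a retention rule applied to each row's age. A minimal sketch follows, assuming each row carries a timestamp in seconds; the function name, the row shape and the threshold are hypothetical, and the export step itself (to Hadoop, Vertica or elsewhere) is left out.

```python
def age_out(hot_rows, now, max_age_seconds):
    """Split rows into those still hot and those old enough to export."""
    keep, export = [], []
    for row in hot_rows:
        if now - row["ts"] > max_age_seconds:
            export.append(row)   # would be handed to the historical store
        else:
            keep.append(row)     # stays in the in-memory database
    return keep, export

# Hypothetical rows with timestamps in seconds, and a 30-minute retention window.
hot_rows = [{"id": 1, "ts": 0}, {"id": 2, "ts": 3000}, {"id": 3, "ts": 3500}]
keep, export = age_out(hot_rows, now=3600, max_age_seconds=1800)
```

The retention window would be set per application, for example about a month for the telco case or an hour for game play.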

What capabilities does VoltDB offer, not just for database administration but for development on top of VoltDB with Python, R, or other languages?

While VoltDB offers traditional APIs such as JDBC, ODBC, Java and C++ native bindings, as well as Node.js, Go, Erlang, PHP, Python, etc., I think one of the more exciting next-generation features VoltDB offers is the ability to stream data directly into the database via our in-process Importers. VoltDB is a clustered database, meaning a database comprises one (1) or more processes (usually a machine, VM or container).

A database can be configured to have an “importer,” which is essentially a plug-in that listens to a source, reads incoming messages (events, perhaps) and transactionally processes them. If the VoltDB database is highly available, then the importer is highly available (surviving node failure).  VoltDB supports a Kafka Importer and a socket importer, as well as the ability to create your own custom importer.

Essentially this feature “eliminates the client application” and data can be transactionally streamed directly into VoltDB.  The data streamed can be JSON, CSV, TSV or any custom-defined format.  Further, the importer can choose which transactional behavior to apply to the incoming data.  This is how future applications will be designed: by hooking feeds, streams of data, directly to the database - eliminating much of the work of client application development.
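The importer idea described above, a plug-in that reads messages from a source, decodes them with a formatter and hands each one to a chosen transaction, can be sketched in a few lines. This is not VoltDB's importer API: the class, the formatter functions, the `record_purchase` procedure and the message layouts are all hypothetical, and the sketch omits the clustering and high-availability aspects entirely.

```python
import csv
import io
import json

def format_csv(line):
    """Decode a 'user,event,amount' CSV line into a row."""
    user, event, amount = next(csv.reader(io.StringIO(line)))
    return {"user": user, "event": event, "amount": int(amount)}

def format_json(line):
    """Decode a JSON message with the same three fields into a row."""
    msg = json.loads(line)
    return {"user": msg["user"], "event": msg["event"], "amount": int(msg["amount"])}

class Importer:
    """Pairs a formatter (decoding) with a procedure (the transactional behavior)."""

    def __init__(self, formatter, procedure):
        self.formatter = formatter
        self.procedure = procedure

    def on_message(self, line):
        self.procedure(self.formatter(line))

totals = {}

def record_purchase(row):
    # Stand-in for a stored procedure that updates a per-user total.
    totals[row["user"]] = totals.get(row["user"], 0) + row["amount"]

Importer(format_csv, record_purchase).on_message("ann,purchase,5")
Importer(format_json, record_purchase).on_message(
    '{"user": "ann", "event": "purchase", "amount": 7}'
)
```

Swapping the formatter changes the accepted wire format without touching the procedure, which is the separation the answer above describes.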

We have one customer who has produced one of the top 10 games in the app store - their application streams in-game events into VoltDB at a rate upwards of 700,000 events per second.  VoltDB hosts a Marketing Optimization application that analyzes these in-game events in an effort to boost revenue.

If you had a crystal ball, how would you visualize the database landscape five years from now? Major advancements?

Specialized databases will continue to carve out significant market share from established vendors.

IoT will be a major market, and will drive storage systems to support two activities: 1) Machine learning (historical analysis) on the Data Lake/Big Data; storage engines will focus on enabling data scientists to capture value from the vast increases of data, and 2) Real-time processing of streams of data. Batch processing of data is no longer acceptable - real-time becomes a “must have”.

Data creation continues to accelerate and capturing value from fresh data in real-time is the new revenue frontier.

Finally, could you tell us a song that is an important part of the soundtrack of your life?

I’m a passionate Bruce Springsteen fan (and also a runner), so it would have to be “Born to Run”.

Springsteen captures that youthful angst so perfectly, challenging us to break out of historic norms and create and experience new things, to challenge ourselves.

This perfectly captures the entrepreneurial spirit of both the personal and the professional self, and it matches the unbridled spirit of what we’re trying to accomplish with VoltDB. “Together we could break this trap / We'll run till we drop, baby we'll never go back.”


