Unveiled earlier this week, Google’s Cloud Dataflow service clearly competes against Amazon’s streaming-data processing service Kinesis and big data products like Hadoop — particularly since Cloud Dataflow is built on technology that Google claims replaces the algorithms behind Hadoop.
But on closer look, Cloud Dataflow is better thought of as a way for Google Cloud users to enrich the applications they develop — and the data they deposit — with analytics components. A Hadoop killer? Probably not.
Google bills the service as "the latest step in our effort to make data and analytics accessible to everyone," with an emphasis on the application you’re writing rather than the data you’re manipulating.
Significantly, Google Cloud Dataflow is meant to replace MapReduce, the software at the heart of Hadoop and other big data processing systems. MapReduce was originally developed by Google and later open-sourced, but Urs Hölzle, senior vice president of technical infrastructure, declared in the Google I/O keynote on Wednesday that "we [at Google] don’t really use MapReduce anymore."
In place of MapReduce, Google uses two other projects, Flume and MillWheel, that apparently influenced Dataflow’s design. The former lets you manage parallel pipelines for data processing, which MapReduce didn’t provide on its own. The latter is described as "a framework for building low-latency data-processing applications," and has apparently been in wide use at Google for some time.
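To see why pipeline management matters, consider that with raw MapReduce, a multi-stage job means manually stitching separate jobs together, while a pipeline abstraction lets stages chain naturally. The sketch below is purely illustrative (the class and method names are hypothetical, not the actual Flume or Dataflow API) and runs in memory on one machine, but it shows the shape of the abstraction:

```python
# Illustrative sketch only -- NOT the real FlumeJava/Dataflow API.
# A pipeline object lets transforms chain, where raw MapReduce would
# require wiring the output of one job into the input of the next by hand.

class Pipeline:
    """Hypothetical pipeline: transforms apply in order and chain."""
    def __init__(self, data):
        self.data = list(data)

    def map(self, fn):
        self.data = [fn(x) for x in self.data]
        return self  # returning self is what makes chaining work

    def filter(self, pred):
        self.data = [x for x in self.data if pred(x)]
        return self

    def run(self):
        return self.data

result = (Pipeline(range(10))
          .map(lambda x: x * x)
          .filter(lambda x: x % 2 == 0)
          .run())
print(result)  # [0, 4, 16, 36, 64]
```

In a real distributed pipeline framework, each stage would be parallelized across workers and the framework would optimize the chain as a whole; the chaining idiom is the point here.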
Most prominently, Cloud Dataflow is touted as superior to MapReduce in the amount of data that can be processed efficiently. Hölzle claimed MapReduce’s performance began to degrade once the amount of data reached the multipetabyte range. For perspective, Facebook claimed in 2012 it had a 100-petabyte Hadoop cluster, although the company did not go into detail about how much custom modification was used or even if MapReduce itself was still in operation.
Ovum analyst Tony Baer sees Google Cloud Dataflow as "part of an overriding trend where we are seeing an explosion of different frameworks and approaches for dissecting and analyzing big data. Where once big data processing was practically synonymous with MapReduce," he said in an email, "you are now seeing frameworks like Spark, Storm, Giraph, and others providing alternatives that allow you to select the approach that is right for the analytic problem."
Hadoop itself seems to be tilting away from MapReduce in favor of more advanced (if demanding) processing engines, such as Apache Spark. "Many problems do not lend themselves to the two-step process of map and reduce," explained InfoWorld’s Andy Oliver, "and for those that do, Spark can do map and reduce much faster than Hadoop can."
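The two-step process Oliver refers to can be shown in miniature with the classic word-count example. This is an in-memory toy, not Hadoop code; it only demonstrates the map, shuffle, and reduce shape that frameworks like Hadoop distribute across machines:

```python
# Toy illustration of the two-step MapReduce model (word count).
# In a real cluster, the map and reduce steps run on many workers
# and the "shuffle" grouping happens over the network.

from collections import defaultdict

docs = ["the cat sat", "the cat ran"]

# Map step: emit a (word, 1) pair for every word in every document.
mapped = [(word, 1) for doc in docs for word in doc.split()]

# Shuffle: group the emitted values by key, as the framework would
# do between the two steps.
grouped = defaultdict(list)
for word, count in mapped:
    grouped[word].append(count)

# Reduce step: sum the counts for each word.
counts = {word: sum(vals) for word, vals in grouped.items()}
print(counts)  # {'the': 2, 'cat': 2, 'sat': 1, 'ran': 1}
```

Problems that don't decompose into exactly this shape (iterative algorithms, multi-stage joins, streaming) are the ones where Spark-style and Dataflow-style pipelines have the advantage.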
Baer concurs: "From the looks of it, Google Cloud Dataflow seems to have a resemblance to Spark, which also leverages memory and avoids the overhead of MapReduce."
The single greatest distinction between Hadoop and Google Cloud Dataflow, though, lies in where and how each is most likely to be deployed. Data tends to be processed where it sits, and for that reason Hadoop has become a data store as much as a data processing system. Those eyeing Google Cloud Dataflow aren’t likely to migrate petabytes of data into it from an existing Hadoop installation. It’s more likely Cloud Dataflow will be used to enhance applications already written for Google Cloud, ones where the data already resides in Google’s system or is being collected there. That’s not where the majority of Hadoop projects, now or in the future, are likely to end up.
"I don’t see this as a migration play," said Baer.