Has Hadoop Lost Its Cool?
by Gary Orenstein
A recent article, 7 Big Data Tools To Ditch In 2017, rekindled my sense that something in the Hadoop ecosystem has to give. More specifically, the sheer number of projects that have emerged in and around Hadoop is likely to require a culling of the herd.
In the article, the author recommends saying farewell to the following tools:
- Java (the language, not the JVM)
Let’s ignore the language debate on item 4 and note that 6 of the 7 items are Hadoop-oriented.
I also recalled a chart from Strata + Hadoop World 2015 showcasing the ‘continually evolving and growing’ ecosystem.
In 2006, Hadoop comprised two critical tools: the Hadoop Distributed File System (HDFS) and MapReduce. By 2015, this Hadoop ecosystem list had ballooned to 22 tools or projects, and that counts only what Cloudera officially endorsed or supported. Other Hadoop vendors such as Hortonworks and MapR have launched or maintain several more projects that would make the list longer.
Another look at the diaspora is the Hadoop Project Commercial Support tracker by Merv Adrian at Gartner. In July 2016, Merv identified 15 projects supported by the 5 distributors he tracks: Amazon, Cloudera, Hortonworks, MapR and IBM.
The Transience of the Hadoop Ecosystem
Hadoop recently celebrated its 10-year anniversary. But things have changed. MapReduce has come and, to some degree, gone. While there are still many places where MapReduce can be valuable, industry attention and direction have moved to faster processing frameworks such as Spark.
On the HDFS side, many folks still struggle trying to put the data to use. No doubt HDFS has contributed to the rise of big data, simply by allowing companies to store vast amounts of information at relatively low cost. But HDFS in and of itself has not provided the out-of-the-box experience required for enterprises to benefit at scale. Overall this has left the return on investment for Hadoop low. According to Gartner,
Thru 2018, 70% of Hadoop deployments will not meet cost savings and revenue generation objectives due to skills and integration challenges (https://twitter.com/nheudecker/status/571139810879893504).
Tony Baer, in a recent piece titled “Have We Reached Peak Hadoop?”, notes,
The rap against Hadoop is that unlike databases, it is not a monolithic platform, but an assemblage of projects or components. The core HDFS file system was very much a bare bones invention designed for rapid scanning at scale on commodity hardware. Hadoop was never meant to be or designed as a database. And even after years of commercial packaging, Hadoop is still a vendor-curated collection of projects that is complicated to deploy.
Further contributing to HDFS tribulations are the cloud object stores such as AWS S3 or the Azure Blob Store. These storage repositories, tightly integrated with broader cloud-provider services, are gaining steam as the “store everything” repository to complement real-time workflows.
So has Hadoop lost its cool? Perhaps, but it is not going away. HDFS continues to be a trove of raw data. And companies still need help in order to access and process data from HDFS.
Hadoop has also proven to be a valuable ecosystem for experimentation. That role, however, changes how architects may want to plan for operational applications, which, once deployed, can run for years and years at large corporations.
Picking The Right Enterprise Data Strategy
No doubt Hadoop, especially HDFS, can help companies save money on raw storage costs. But large corporations need a combination of planning, flexibility, and enough structure to keep large groups of people in sync.
Here are a few ideas to help enterprises navigate Hadoop and also the larger data processing landscape.
- Pick products that celebrate stability, longevity, and feature enhancements. Take note of whether the overall product direction points toward common or disparate goals.
- Recognize that no single datastore or workflow solves all problems. Develop the ability to test and integrate datastores as application needs arise.
- Build best practices around file and object storage (HDFS, AWS S3, Azure Blob Store), NoSQL databases (Cassandra, MongoDB), and real-time, in-memory relational databases (SAP HANA, MemSQL).
- Architect for integration with popular enterprise business intelligence tools such as Tableau, Zoomdata, and Looker, and with real-time distributed systems such as Kafka, Spark, Cassandra, and MemSQL.
- Embrace organization, structure, and schemas. The time-tested approach of the Structured Query Language (SQL) has been proven yet again in the world of big data. Solutions that embrace SQL from the get-go with native engines and schema registries will foster organization and structure.
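To make the last point concrete, here is a minimal sketch of what declaring structure up front buys you. It uses Python's built-in sqlite3 module purely as a stand-in for any SQL engine; the page_views table, its columns, and the sample rows are hypothetical and not drawn from any product mentioned here.

```python
import sqlite3


def build_events_db():
    """Create an in-memory table with an explicit schema.

    The table name, columns, and rows are hypothetical illustration data.
    """
    conn = sqlite3.connect(":memory:")
    conn.execute(
        """
        CREATE TABLE page_views (
            user_id   INTEGER NOT NULL,
            page      TEXT    NOT NULL,
            view_secs REAL    NOT NULL CHECK (view_secs >= 0)
        )
        """
    )
    rows = [(1, "/home", 12.5), (1, "/pricing", 30.0), (2, "/home", 4.0)]
    conn.executemany("INSERT INTO page_views VALUES (?, ?, ?)", rows)
    return conn


def seconds_per_page(conn):
    """Aggregate with plain SQL; the schema tells every reader what each column means."""
    cur = conn.execute(
        "SELECT page, SUM(view_secs) FROM page_views GROUP BY page ORDER BY page"
    )
    return cur.fetchall()


def rejects_bad_row(conn):
    """Return True if the schema refuses a row that is missing its page value."""
    try:
        conn.execute("INSERT INTO page_views VALUES (?, ?, ?)", (3, None, 1.0))
    except sqlite3.IntegrityError:
        return True
    return False
```

With raw files in a data lake, a malformed record like the one in rejects_bad_row would typically surface only when someone tries to query it. With a declared schema, it is rejected at write time, and any analyst who knows SQL can run the aggregation without first learning a custom file layout.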
With these ideas in mind, enterprises can design for the long term, including optimizing the use of Hadoop and other critical data processing tools.
Gary Orenstein is the CMO at MemSQL.