A translation of the article was prepared specifically for students of the Data Engineer course.
A few weeks ago, both Cloudera and MapR announced that their businesses were going through difficult times, and I saw a stream of social media posts on the theme "Hadoop is Dead." These posts are nothing new, but in a sector where technical experts rarely produce high-quality material for social networks, the claims are getting louder and louder. I would like to examine some of the arguments about the state of Hadoop.
Competition with free
Cloudera has offerings that help Hadoop be a more complete solution. These tools appeared before DevOps became widespread, when automated deployment was rare.
Their tooling offers great value to more than 2,600 customers, but most of the software they offer is open source and free. Cloudera ultimately competes with free software. To top it all off, many Hadoop ecosystem developers have worked at Cloudera at one time or another, so in the end the company has effectively subsidized the free offerings it competes with.
Because they compete with free, Cloudera will never serve 100% of the Hadoop user base. I would not dare to use them as an indicator of Hadoop health precisely for this reason.
Other firms offering turnkey Spark and Presto solutions are trying to distance themselves from the Hadoop brand. Their offers may include hundreds of .jar files from various Hadoop projects, but nevertheless, these companies want to do everything possible to avoid competition with free offers, while lowering their development costs by using open source software. Sales are not so easy when your customer can legally download 80% of your offer without paying for it.
Competition with AWS
In 2012, I worked on a Hadoop implementation with 25 other contractors. Some of my colleagues came from Google, others went on to work for Cloudera. A significant budget was involved and the team produced a lot of billable hours, but very little of the Hadoop ecosystem was turnkey.
Within a few years, AWS EMR appeared and began to take market share. EMR lets you launch Hadoop clusters with a wide variety of software installed in just a couple of clicks. It can run on spot instances, which reduce hardware costs by ~80%, and it can store data on S3, which was and remains cheap and is designed for 99.999999999% (eleven nines) durability.
Suddenly, the need for 25 contractors on a project disappeared. On some projects it could be just me working full-time, plus a few others working part-time, setting up infrastructure in addition to our other responsibilities. There is still a need for consultants on projects that use AWS EMR, but the overall billing potential for this kind of work is much lower than it was a few years ago.
What share of Cloudera's potential business was lost to EMR? Cloudera did a good job of setting up and managing bare-metal clusters, but today most of the data world is in the cloud. You have to wonder how attractive Cloudera's Hadoop offering is to your business when AWS has a managed offering that runs on spot instances.
What is Hadoop?
If you asked me for a definition of Hadoop, I would say that it is a large collection of open source software that integrates with one another to some extent and shares several libraries. I see Hadoop as a database that has been broken up into parts, almost like an operating system distribution for data.
Not all Hadoop-compatible software projects are Apache projects. Presto is one such exception. Others, such as ClickHouse, with its upcoming support for HDFS and Parquet, will not be perceived by many as Hadoop projects, although they will soon tick the compatibility boxes.
Until 2012, there were no ORC or Parquet files. These formats made fast analytics possible on Hadoop. Before them, workloads were mostly row-oriented. If you need to transform terabytes of data and you can do it in parallel, Hadoop does the job perfectly. MapReduce was the framework often used for this purpose.
What columnar storage offered was the ability to analyze terabytes of data in a matter of seconds, which turned out to be a more valuable proposition for a larger number of businesses. Data scientists may need only a small amount of data to gain an insight, but first they may have to look through petabytes of data to pick the right subset. Columnar analytics is key to giving them the fluency in data wrangling they need to understand what should be selected.
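The difference between row-oriented and columnar layouts can be sketched in a few lines of plain Python. This is only an illustration of the idea, not how ORC or Parquet are implemented; the sample records and the function names (`to_columns`, `total_fare_row_oriented`, `total_fare_column_oriented`) are hypothetical.

```python
# Hypothetical sample records; the field names are illustrative only.
rows = [
    {"trip_id": 1, "city": "Tallinn", "fare": 12.5},
    {"trip_id": 2, "city": "London", "fare": 30.0},
    {"trip_id": 3, "city": "Tallinn", "fare": 7.5},
]


def to_columns(records):
    """Pivot a list of row dicts into a dict of column lists."""
    columns = {}
    for record in records:
        for name, value in record.items():
            columns.setdefault(name, []).append(value)
    return columns


def total_fare_row_oriented(records):
    # Every record is visited as a whole, even though only one field is used.
    return sum(record["fare"] for record in records)


def total_fare_column_oriented(columns):
    # Only the "fare" column is read; the other columns are never touched.
    return sum(columns["fare"])


columns = to_columns(rows)
print(total_fare_row_oriented(rows))
print(total_fare_column_oriented(columns))
```

Both functions return the same total. The point is that an aggregate over one field only needs to read that field's values in the columnar layout, which is what makes scanning a single column out of terabytes of data cheap.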
MapReduce has two functional data processing operators, map and reduce, and treats data as rows. Spark came right after it and has more functional operators, such as filter and union, and treats data as structured in a directed acyclic graph (DAG). These elements enabled Spark to run more complex workloads such as machine learning and graph analytics. Spark can still use YARN as a capacity scheduler, much as MapReduce jobs do. But the Spark team also began to build its own scheduler and later added support for Kubernetes.
At some point, the Spark community tried to distance itself from the Hadoop ecosystem. They did not want to be seen as something built on top of legacy software or as some kind of add-on for Hadoop. Given the level of integration that Spark has with the rest of the Hadoop ecosystem, and given the hundreds of libraries from other Hadoop projects that Spark uses, I disagree with the view that Spark is a stand-alone product.
MapReduce may not be the first choice for most workloads these days, but it is still the underlying framework when you use hadoop distcp, a software package that can transfer data between AWS S3 and HDFS faster than any other offering I have tested.
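The operators mentioned above can be imitated in plain Python to show what they do. This is a minimal single-machine sketch that uses neither the Hadoop nor the Spark API; the sample lines and the helper names (`map_line`, `reduce_counts`, `word_count`) are hypothetical.

```python
from functools import reduce
from itertools import chain

# Hypothetical input datasets.
lines_a = ["hadoop is dead", "hadoop is alive"]
lines_b = ["spark runs on yarn"]


def map_line(line):
    """Map step: emit a (word, 1) pair for every word in a line."""
    return [(word, 1) for word in line.split()]


def reduce_counts(counts, pair):
    """Reduce step: fold one (word, n) pair into the running totals."""
    word, n = pair
    counts[word] = counts.get(word, 0) + n
    return counts


def word_count(lines):
    # Apply the map step to each line, flatten the pairs, then reduce them.
    pairs = chain.from_iterable(map(map_line, lines))
    return reduce(reduce_counts, pairs, {})


# Two extra operators in the style the text attributes to Spark:
unioned = lines_a + lines_b                               # union of two datasets
filtered = [line for line in unioned if "hadoop" in line]  # filter by predicate

print(word_count(filtered))
```

In a real cluster the map and reduce steps run in parallel across machines, and an engine such as Spark arranges chained operators like these into a graph of stages before executing them; the sketch only shows the shape of the operators.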
Is every Hadoop tool successful?
No, some projects have already been overshadowed by newer ones.
For example, many workloads that were previously automated with Oozie are now automated with Airflow. Robert Kanter, the main developer of Oozie, contributed a significant part of the code base that exists today. Unfortunately, Robert has not been as active in the project since leaving Cloudera in 2018. Meanwhile, Airflow has more than 800 contributors, a number that has nearly doubled over the past year. Almost every client I have worked with since 2015 has used Airflow in at least one department of their organization.
Hadoop provides the various building blocks and elements that make up the data platform. Often, several projects compete for the provision of the same functionality. In the end, some of these projects fade out while others take the lead.
Back in 2010, several projects were being positioned as the first choice for various workloads even though they had only a few contributors or, in some cases, few significant deployments. The fact that such projects come and go has been used as evidence that the entire Hadoop ecosystem is dying, but I do not draw that conclusion.
I see this loose coupling of projects as a way to develop many powerful features that can be used without any significant end-user license fees. It is survival of the fittest, and it shows that more than one approach was considered for each problem.
UPDATE: I initially stated that Oozie had 17 contributors, based on what GitHub reports. In fact, Oozie has had direct commits and patches submitted by 152 developers, not just the 17 that appear in GitHub's count. Robert Kanter contacted me after the initial publication of this post with evidence of these additional 135 authors, and I thank him for the clarification.
Search traffic is down
One of the arguments for the "death" of Hadoop is that Google search traffic for various Hadoop technologies is down. Cloudera and a number of other consultancies did a good job of raising funds in past years and put significant effort into promoting their offerings. This, in turn, generated a lot of interest, and at some point a wave of people learning these technologies appeared in the technical community. This community is diverse, and at some point most people, as always, moved on to other things.
In the entire history of Hadoop, there has never been as rich a variety of functionality as there is today, and it has never been as stable and battle-tested.
Hadoop projects consist of millions of lines of code written by thousands of authors. Every week, hundreds of developers work on various projects. Most commercial database offerings are lucky if at least a handful of engineers make significant improvements to their codebases every week.
Why is Hadoop special?
First, there are HDFS clusters with a capacity of more than 600 PB. Because HDFS keeps its metadata in RAM, it can easily handle 60k operations per second.
AWS S3 broke a lot of what can be found on POSIX file systems in order to achieve scalability. Rapid file changes, such as those required when converting CSV files to Parquet files, are not possible on S3 and require something like HDFS if you want to distribute the workload. If the conversion software were modified to run such a workload on S3 alone, the tradeoffs in data locality would likely be significant.
Secondly, the Hadoop Ozone project aims to create an S3 API-compatible system that can store trillions of objects in a cluster without relying on a cloud service. The project aims to have built-in support for Spark and Hive, which gives it good integration with the rest of the Hadoop ecosystem. Once released, this software will be one of the first open source offerings that can store so many files in one cluster.
Third, even if you don't work with petabytes of data, the APIs available to you in the Hadoop ecosystem provide a consistent interface for processing gigabytes of data. Spark is the definitive solution for distributed machine learning. Once you get comfortable with the API, it does not matter whether your workload is measured in GB or PB: the code you write does not need to be rewritten, you just need more machines to run it. I would much rather teach someone how to write SQL and PySpark code than teach them how to distribute AWK commands across multiple machines.
Fourth, many of the features of the Hadoop ecosystem are ahead of what commercial vendors offer. Every failed sales pitch for a proprietary database teaches the sales team how many missing features, trade-offs and bottlenecks their offering has. Every failed POC teaches the sales team how reliable their internal software testing really is.
This concludes the first part of the translation. The continuation can be read here. We look forward to your comments and invite everyone to a free webinar on the topic "Principles of building streaming analytics systems".