Uber’s Big Data Revolution: From MySQL to Hadoop and Beyond
Volume: 100+ PB Data, Latency: Minutes
My name is Vu Trinh, and I am a data engineer.
I’m trying to make my life less dull by spending time learning and researching “how it works” in the data engineering field.
Here is a place where I share everything I’ve learned.
Intro
This week, we journey through the evolution of Uber’s Big Data infrastructure, exploring the challenges, solutions, and innovations that defined each phase of the transformation.
Note: This article primarily focuses on batch processing at Uber. For insights into their real-time processing, you can check my previous article here. Additionally, this article references Uber’s original piece from 2018, so some details and figures may have changed since then.
Growing Pains of Data
Generation 1
Data management was straightforward in Uber's early days. Before 2014, the company’s data was small enough to fit within a few MySQL and PostgreSQL databases.
Uber’s data was fragmented at this stage, spread across different databases. Users who wanted a holistic view had to consolidate data manually.
Although this setup worked for a time, Uber’s global expansion required a more robust data solution. The number of riders, drivers, and trips exploded across many cities. Suddenly, Uber needed to handle not just terabytes but potentially petabytes of data, and they needed to do so in a reliable and scalable way.
The First Data Warehouse
Uber made a significant leap by building their first data warehouse. This new system centralized all of Uber’s data in a single place. Thanks to its speed, scalability, and column-oriented design, they chose Vertica as the data warehouse solution.
With Vertica, Uber’s engineers standardized SQL as the primary interface for data access, making it easy for thousands of users across the company to run queries and extract valuable insights.
After a few months, Uber’s data warehouse grew to tens of terabytes, and hundreds of users were querying the system daily.
However, it didn’t come without challenges. Data reliability issues surfaced in the ETL jobs that ingested data into Vertica. In particular, the lack of a formal schema agreement between upstream data producers and downstream data consumers led to frequent ingestion failures when the source data format changed. Uber’s data was often stored in flexible JSON formats, which made it hard to enforce schema consistency and caused frequent breakdowns in data pipelines.
Furthermore, the lack of standardization in ingestion jobs led to the same data being ingested multiple times with different transformations, putting extra pressure on upstream data sources and increasing storage costs with duplicate data.
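To make the schema-agreement problem concrete, here is a minimal sketch (not Uber’s implementation) of what a formal contract between producers and consumers can look like: records are validated against an agreed schema at ingestion time, so format drift is caught at the boundary instead of breaking pipelines downstream. The schema and field names below are made up for illustration.

```python
# Minimal sketch of schema enforcement at ingestion time (illustrative only,
# not Uber's implementation). Requires the fastavro package.
from fastavro import parse_schema
from fastavro.validation import validate

# Hypothetical producer/consumer contract for trip events.
trip_schema = parse_schema({
    "type": "record",
    "name": "Trip",
    "fields": [
        {"name": "trip_id", "type": "string"},
        {"name": "rider_id", "type": "string"},
        {"name": "fare_usd", "type": "double"},
    ],
})

def ingest(record: dict) -> None:
    # Reject records that break the contract instead of letting malformed
    # JSON silently fail the warehouse load later.
    if not validate(record, trip_schema, raise_errors=False):
        raise ValueError(f"record violates schema contract: {record}")
    # ... load the validated record into the warehouse / data lake ...

ingest({"trip_id": "t-1", "rider_id": "r-42", "fare_usd": 13.5})
```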
Hadoop as the Data Lake
Generation 2
Uber needed a new solution. They chose Hadoop as the heart of their next-generation data platform; instead of loading data directly into the Vertica data warehouse, raw data was now ingested into a Hadoop-based data lake.
This brought a whole new paradigm to Uber’s Big Data platform, allowing Uber to ingest and store vast amounts of raw data from various sources without any transformation at ingestion time.
This shift reduced the load on Uber’s source data stores, as data could now be ingested into Hadoop in its native format without pre-processing. Once the data was in Hadoop, it could be transformed and analyzed using various tools.
For data consumption, the options were no longer limited to Vertica:
For interactive queries, Uber used Presto, an open-source distributed SQL engine allowing fast querying of large datasets.
Apache Spark was introduced for more complex data processing tasks, allowing teams to run large-scale jobs using SQL or programming languages.
Apache Hive was also deployed to handle large queries.
Uber ensured all data transformations happened in Hadoop; only critical tables needed for real-time SQL queries were transferred into the data warehouse. This allowed fast backfilling and recovery: when these operations were required, they only needed to process data already in Hadoop, without touching the source systems.
The most critical aspect of Uber’s transition to Hadoop was the adoption of Apache Parquet, a columnar storage format that offered significant storage savings and compute resource efficiency. Parquet’s columnar nature allowed Uber to compress data more effectively, reducing storage costs and speeding up query performance for analytical workloads.
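As a rough sketch of what this looks like in practice (paths and column names are illustrative, not Uber’s): raw data lands in the lake, gets rewritten as partitioned Parquet, and analytical queries then read only the columns they project.

```python
# Illustrative PySpark sketch: convert a raw JSON drop into partitioned
# Parquet and query it. Paths and column names are hypothetical.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("parquet-sketch").getOrCreate()

# Rewrite the raw ingest as compressed, columnar Parquet.
raw = spark.read.json("hdfs:///lake/raw/trips/2018-01-01/")
raw.write.mode("overwrite").partitionBy("city_id").parquet("hdfs:///lake/parquet/trips/")

# Analytical queries only touch the columns they project, so the columnar
# layout lets Spark skip the rest of the data on disk.
trips = spark.read.parquet("hdfs:///lake/parquet/trips/")
trips.select("city_id", "fare_usd").groupBy("city_id").avg("fare_usd").show()
```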
By the time the second generation of the data platform was fully onboarded, the company was ingesting tens of petabytes of data into its Hadoop data lake.
Challenges of the Second Generation
While Hadoop enabled Uber to scale its data operations, it wasn’t perfect. One of the biggest challenges was the large number of small files in HDFS, produced by ingestion jobs, ad hoc batch jobs, and ETL processes. The accumulation of these files began to put pressure on the HDFS NameNode, which is responsible for managing the file system’s metadata. As the number of files grew into the millions, the NameNode struggled to keep up.
Another major issue was data latency. At the time, Uber’s data was only made available to users once every 24 hours, which was far too slow for many of the company’s real-time business needs. This delay limited Uber’s ability to make timely decisions in many cases, such as demand forecasting and fraud detection.
Finally, while Hadoop solved many scalability issues, it didn’t support data updates or deletes. For example, rider and driver ratings, trip fare adjustments, and other frequently changing data had to be updated regularly to ensure accurate reporting and analysis. However, Hadoop’s snapshot-based ingestion model meant that Uber had to re-load entire datasets from the source every time a minor update was made, which was inefficient and time-consuming.
The Introduction of Hudi
Generation 3
To address the challenges of the second generation, Uber spent considerable time identifying four primary pain points that needed to be resolved in the next generation:
HDFS scalability limitation: HDFS’s NameNode struggles when data exceeds 10 petabytes, worsening beyond 50-100 petabytes. Solutions like ViewFS and HDFS NameNode Federation, along with moving data to separate clusters, mitigated these issues.
HDFS ViewFS: ViewFS provides a virtual filesystem in Hadoop, allowing users to access multiple HDFS clusters or directories through a unified namespace. It simplifies working with various HDFS locations by creating a seamless, single point of access.
HDFS NameNode Federation: NameNode Federation improves HDFS scalability by using multiple independent NameNodes, each managing a portion of the namespace. This reduces bottlenecks, enhances fault tolerance, and supports larger deployments.
Faster data in Hadoop: The 24-hour data latency of Uber’s second-generation platform was too slow for real-time needs. To speed up delivery, Uber had to redesign their pipelines for incremental ingestion of only updated and new data instead of loading full snapshots.
Support for updates and deletes in Hadoop/Parquet: Uber’s data involves frequent updates, but snapshot-based ingestion wasn’t efficient.
Faster ETL and modeling: Like raw data ingestion, ETL and modeling jobs rebuilt entire tables with each run. They shifted to incremental updates, pulling only changed data and updating derived tables without full rebuilds, reducing latency.
With that in mind, Uber developed an open-source project called Hudi (Hadoop Upserts and Incremental), which fundamentally transformed how data was ingested and managed in the Hadoop ecosystem.
Hudi introduced the ability to perform upserts (update-inserts) and incremental data ingestion, allowing Uber to move away from the snapshot-based ingestion approach. Instead of reloading entire datasets daily, Hudi enabled Uber to ingest only the changes—new records, incremental updates, and deletes—reducing data latency from 24 hours to under an hour.
This incremental approach improved data freshness and reduced the computational resources required to process updates. For example, instead of reprocessing an entire 100-terabyte dataset every time new data was added, Uber could now update only the relevant partitions, leading to significant efficiency gains.
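A minimal sketch of what an upsert looks like with the Hudi Spark datasource (assuming the Hudi bundle is on the classpath; the table name, keys, and paths are illustrative, not Uber’s): only the changed records are written, and Hudi merges them into the existing table instead of rewriting it.

```python
# Illustrative Hudi upsert via the Spark datasource; requires the
# hudi-spark bundle. Keys, paths, and table name are hypothetical.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("hudi-upsert-sketch").getOrCreate()

# Only the new and updated trip records, not a full snapshot.
changes = spark.read.json("hdfs:///lake/changelogs/trips/latest/")

(changes.write.format("hudi")
    .option("hoodie.table.name", "trips")
    .option("hoodie.datasource.write.operation", "upsert")
    .option("hoodie.datasource.write.recordkey.field", "trip_id")
    .option("hoodie.datasource.write.partitionpath.field", "city_id")
    .option("hoodie.datasource.write.precombine.field", "updated_at")
    .mode("append")
    .save("hdfs:///lake/hudi/trips/"))
```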
Besides creating Hudi, Uber also streamlined how data changes flow between storage systems using Apache Kafka. All upstream datastore events, as well as logs from various services, were sent to Kafka with a unified Avro encoding.
Marmaray, Uber’s data ingestion platform, runs in mini-batches and consumes changelogs from Kafka. It applies them to existing Hadoop data via Hudi, allowing records to be updated or deleted. Behind the scenes of Marmaray are Spark jobs that run every 10-15 minutes, ensuring data latency remains under 30 minutes.
By eliminating the need for transformations during the ingestion phase, Marmaray ensured that raw data could be ingested quickly and reliably, with any necessary transformations performed downstream in Hadoop. Data reliability has also improved because they could avoid error-prone transformations during ingestion.
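The pattern roughly looks like the sketch below (this is not Marmaray itself; the topic name, Avro schema, and paths are made up, and the spark-avro and Hudi packages are assumed to be available): a Spark job reads Avro-encoded changelog events from Kafka on a periodic trigger and upserts each mini-batch into the Hudi table.

```python
# Rough sketch of the mini-batch changelog pattern (not Marmaray itself).
# Assumes spark-avro and the hudi-spark bundle are on the classpath.
from pyspark.sql import SparkSession
from pyspark.sql.functions import col
from pyspark.sql.avro.functions import from_avro

spark = SparkSession.builder.appName("changelog-sketch").getOrCreate()

# Hypothetical Avro schema shared by all changelog producers.
change_schema = """{
  "type": "record", "name": "TripChange",
  "fields": [
    {"name": "trip_id", "type": "string"},
    {"name": "city_id", "type": "string"},
    {"name": "fare_usd", "type": "double"},
    {"name": "updated_at", "type": "long"}
  ]
}"""

events = (spark.readStream.format("kafka")
    .option("kafka.bootstrap.servers", "kafka:9092")
    .option("subscribe", "trip-changelog")
    .load()
    .select(from_avro(col("value"), change_schema).alias("change"))
    .select("change.*"))

def apply_batch(batch_df, batch_id):
    # Upsert one mini-batch of changes into the Hudi table (same write
    # options as the upsert sketch above).
    (batch_df.write.format("hudi")
        .option("hoodie.table.name", "trips")
        .option("hoodie.datasource.write.operation", "upsert")
        .option("hoodie.datasource.write.recordkey.field", "trip_id")
        .option("hoodie.datasource.write.precombine.field", "updated_at")
        .mode("append")
        .save("hdfs:///lake/hudi/trips/"))

(events.writeStream
    .foreachBatch(apply_batch)
    .trigger(processingTime="10 minutes")
    .start())
```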
Generalizing Data Ingestion
Uber’s number of upstream data stores increased over time, so they decided to build a unified ingestion platform to streamline raw data ingestion into Hadoop. With this platform, Hadoop tables can be updated incrementally with a latency of 10-15 minutes. Hudi plays a vital role here; it allows ETL jobs to fetch only the changed data from the source table. Transformation/modeling jobs only need to pass a checkpoint timestamp to the Hudi reader during each run to receive a stream of new or updated records from the raw source table.
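A hedged sketch of that incremental pull with the Hudi Spark datasource (the option names are Hudi’s; the checkpoint handling and paths are illustrative): the job passes the instant it last processed and gets back only records that changed since then.

```python
# Illustrative incremental read from a Hudi table; only records written
# after the saved checkpoint instant are returned. Paths are hypothetical.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("hudi-incremental-sketch").getOrCreate()

last_checkpoint = "20180601000000"  # commit instant saved by the previous run

changed = (spark.read.format("hudi")
    .option("hoodie.datasource.query.type", "incremental")
    .option("hoodie.datasource.read.begin.instanttime", last_checkpoint)
    .load("hdfs:///lake/hudi/trips/"))

# Recompute only the affected rows of the derived/modeled table.
changed.createOrReplaceTempView("trips_changed")
spark.sql("""
    SELECT city_id, COUNT(*) AS updated_trips
    FROM trips_changed
    GROUP BY city_id
""").show()
```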
I might have a deep-dive article on Hudi soon! ;)
Looking Ahead
Generation 4
With the third generation of its Big Data platform, Uber has reached a point where its data infrastructure is robust, scalable, and efficient. In the next phase of Uber’s data journey, they planned to focus on four key areas:
Data Quality: Uber was working to enforce stricter schema validation on upstream data sources.
Faster Data Access: They aimed to reduce data latency by bringing raw data latency down to five minutes and modeling data latency to ten minutes.
Operational Efficiency: Uber moved away from dedicated hardware and embraced containerization for its services. This would allow for greater flexibility in resource management and ensure that jobs can be scheduled and executed more efficiently across the company’s Hadoop and non-Hadoop services.
Scalability and Reliability: Uber continued optimizing its data ingestion platform to make it more resilient and scalable by standardizing changelogs across all upstream data sources and adopting a more unified approach to data ingestion.
In articles from May and early September 2024, Uber shared that they are migrating their batch processing infrastructure to Google Cloud.
Outro
In this article, we explored Uber’s data platform journey from the early MySQL databases to the Hadoop cluster and the creation of Apache Hudi. By learning about Uber’s data platform revolution, I hope we can gain valuable insights that can be applied to our own data projects, even if they don't yet reach Uber's scale.
See you in the next article.
References
[1] Reza Shiftehfar, Uber’s Big Data Platform: 100+ Petabytes with Minute Latency (2018)
Before you leave
If you want to discuss this further, please leave a comment or contact me via LinkedIn, Email, or Twitter.
It might take you five minutes to read, but it took me more than five days to prepare, so it would greatly motivate me if you considered subscribing to receive my writing.