To celebrate Lunar New Year (the true New Year holiday in Vietnam), I’m offering 50% off the annual subscription. The offer ends soon; grab it now to get full access to nearly 200 high-quality data engineering articles.
Intro
Today, we will explore how Airbnb builds and serves its semantic layer internally and what we can learn from it. More precisely, Airbnb did not just build a layer that “simplifies interactions between complex data storage systems and business users.” They created a complete platform.
Motivation
In 2010, Airbnb had only one data analyst, and his laptop was the Airbnb data warehouse. He often ran queries directly on production databases, and Airbnb.com sometimes went down because of the heavy queries.
In the early 2010s, they hired more data scientists, and the data kept growing. They upgraded their data infrastructure and built their own data orchestration tool, Airflow, which they later open-sourced.
Their top priority was to build a set of tables called core_data. These tables set the foundation for many data demands at Airbnb:
Airbnb’s experimentation platform for streamlining the A/B testing processes.
Dataportal — Airbnb's internal data catalog, which organizes and documents data assets.
Interactive data exploration with Apache Superset.
Data University — a program that teaches non-data scientists valuable skills to democratize data analysis at Airbnb.
However, the growth came with challenges:
More users wanted to consume core_data, so they created many tables on top of it. There was no way to tell whether a table meeting a given requirement already existed.
Because of the complexity of the growing warehouse, Airbnb found it challenging to track data. Data users could spend many hours debugging the data discrepancies.
For data consumption, decision-makers complained that different teams reported different numbers for simple business questions. As a result, business users and even data scientists lost trust in the data.
Airbnb Minerva
They revamped the data warehouse to improve data quality.
First, their data engineering team rebuilt key business data models, resulting in lean tables that eliminate redundant joins. These tables served as the new foundation for the analytics.
That still was not enough.
They needed to join these tables to extract insight, backfill data whenever logic changes, or present the data consistently and correctly in many different consumption tools.
Airbnb built Minerva for these purposes.
Minerva took fact and dimension tables as inputs, performed data denormalization, and served the aggregated data to downstream applications. Airbnb hoped the Minerva API would close the gap between upstream data and downstream consumption.
At the time of Airbnb’s sharing, Minerva contained more than 12,000 metrics and 4,000 dimensions, with 200+ data producers across different functions and teams.
Architecture
Airbnb built Minerva on top of open-source projects:
Airflow for workflow orchestration.
Apache Hive and Apache Spark for the compute engine.
Presto and Apache Druid for serving.
For a metric, Minerva has components to cover its whole life cycle:
Minerva defines metrics, dimensions, and metadata in a centralized GitHub repository. Anyone at Airbnb with proper permissions can update these definitions.
It has a development flow for code review, static validation, and test runs.
It executes data aggregation/denormalization by reusing data assets and intermediate joined results.
Minerva has a robust computation flow that can automatically retry after job failures, plus built-in data-quality checks.
It exposes a unified data API to serve metrics and metadata.
Because Minerva version-controls data definitions (via Git), it can detect and track changes and then execute data backfilling.
Its data management features include cost attribution, GDPR-based deletion, and data access control.
For data retention, Minerva supports clean-up of data based on usage; infrequently used datasets can be deleted to save cost.
Design principles
Airbnb built Minerva to be:
Standardized: Data is defined in a single place, which serves as the single entry point for anyone searching for definitions.
Declarative: Users define the output they want (as with SQL). Minerva handles everything else, from calculating metrics to storing and serving them.
Scalable: Minerva must be scalable to support Airbnb’s internal data demands.
Consistent: The data is always consistent. If the user changes the definitions, Minerva must perform data backfill and keep the data up-to-date.
Highly available: Dataset replacement must be handled with minimal impact on data consumption.
Well-tested: Users can prototype and validate their changes before they are merged into production.
Standardized
Unlike a database, which centers on tables and columns, Minerva focuses on metrics and dimensions.
When a metric is defined in Minerva, users must provide the necessary metadata, such as ownership, lineage, and a metric description. Before Minerva, Airbnb managed metadata inefficiently, as definitions were scattered across various business intelligence tools.
Regarding version control, Minerva treats all definitions as code that must go through a review process before merging to production, just like any code change.
At the core of Minerva’s configuration system are event sources and dimension sources, corresponding to fact tables and dimension tables in the data warehouse:
Event sources define the atomic events that are used to calculate metrics.
Dimension sources contain attributes that can be used with the metrics.
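To make the two source types concrete, here is a minimal Python sketch; all field, table, and metric names are hypothetical illustrations, and the post does not show Airbnb's actual configuration format:

```python
from dataclasses import dataclass

@dataclass
class EventSource:
    """Fact-table source: atomic events from which metrics are computed."""
    name: str              # e.g. "bookings_events" (hypothetical)
    table: str             # underlying warehouse fact table
    timestamp_column: str  # event time used for windowing
    metrics: dict          # metric name -> aggregation expression

@dataclass
class DimensionSource:
    """Dimension-table source: attributes that can slice the metrics."""
    name: str
    table: str
    key: str               # join key back to the event source
    dimensions: list

# Hypothetical definitions for a bookings metric sliced by listing attributes.
bookings = EventSource(
    name="bookings_events",
    table="core_data.fct_bookings",
    timestamp_column="booked_at",
    metrics={"bookings": "COUNT(1)", "gross_revenue": "SUM(price_usd)"},
)
listings = DimensionSource(
    name="listing_dims",
    table="core_data.dim_listings",
    key="listing_id",
    dimensions=["market", "room_type"],
)
```

The split mirrors a classic star schema: facts carry the numbers, dimensions carry the slicing attributes.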
Declarative

One of Minerva’s promises is to simplify the time-consuming workflow so that users can quickly turn data into insights. Users can define a dimension set, an analysis-friendly dataset created from Minerva metrics and dimensions. Unlike ad-hoc datasets, dimension sets have several advantages:
Users define what they want. Minerva abstracts all the technical implementation details and complexity of creating it from the users.
Dimension sets can benefit from Minerva’s existing features.
Minerva can store and optimize these dimension sets to reduce query times.
Minerva can reuse dimension sets, which help reduce dataset duplication.
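As an illustration of the declarative idea, the sketch below shows a hypothetical dimension set definition and the kind of join-and-aggregate SQL a platform like Minerva might generate from it; all names are made up, and real generated SQL would be far more involved:

```python
# A dimension set declares *what* to materialize -- the platform decides how.
bookings_by_market = {
    "name": "bookings_by_market",
    "metrics": ["bookings", "gross_revenue"],    # from an event source
    "dimensions": ["market", "room_type"],       # from dimension sources
    "aggregation_windows": ["1D", "7D", "28D"],  # pre-aggregated granularities
}

def to_sql(dimension_set: dict, event_table: str, dim_table: str, key: str) -> str:
    """Sketch of the join/aggregate SQL derived from a declarative spec."""
    dims = ", ".join(dimension_set["dimensions"])
    return (
        f"SELECT {dims}, COUNT(1) AS bookings "
        f"FROM {event_table} e JOIN {dim_table} d ON e.{key} = d.{key} "
        f"GROUP BY {dims}"
    )

sql = to_sql(bookings_by_market, "fct_bookings", "dim_listings", "listing_id")
```

The user only ever touches the dictionary; the join logic, storage, and serving are the platform's concern.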
Scalable
Minerva was serving 5,000+ datasets across hundreds of users and 80+ teams.
To ensure it can scale, Airbnb built Minerva’s computation around the DRY (Don’t Repeat Yourself) principle: they reuse materialized data as much as possible to avoid wasting computing resources.
The computational flow has several stages:
Ingestion Stage: Minerva sensors are triggered when new data is added to the table’s partitions. The latest data is then ingested into Minerva.
Data Check Stage: This stage ensures that upstream data is “right.” For example, a field should not be empty, and primary keys should be unique.
Join Stage: Minerva executes joins on the join keys to generate dimension sets. Calculations that appear in different dimension sets (e.g., the same city dimension) are computed with the same logic on the same source tables, which ensures consistent dataset computation at scale.
Post-processing and serving stage: Minerva further aggregates outputs for downstream consumption and can optionally optimize the data for end-user query performance.
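The stages above can be sketched as a simple function pipeline. This is a toy illustration of the control flow (check, then join, then post-process), not Airbnb's implementation:

```python
def non_empty(rows):
    """Upstream check: the partition must contain data."""
    return len(rows) > 0

def unique_ids(rows):
    """Upstream check: primary keys must be unique."""
    ids = [r["id"] for r in rows]
    return len(ids) == len(set(ids))

def run_stage_flow(partition, checks, join_fn, post_fn):
    """Ingest -> data check -> join -> post-process (hypothetical sketch)."""
    for check in checks:                      # Data Check Stage
        if not check(partition):
            raise ValueError(f"data check failed: {check.__name__}")
    joined = join_fn(partition)               # Join Stage
    return post_fn(joined)                    # Post-processing Stage

# Toy run: the "join" attaches a dimension, post-processing aggregates a count.
rows = [{"id": 1, "city": "SF"}, {"id": 2, "city": "SF"}]
result = run_stage_flow(
    rows,
    checks=[non_empty, unique_ids],
    join_fn=lambda rs: [{**r, "country": "US"} for r in rs],
    post_fn=lambda rs: {"US": len(rs)},
)
```

Failing a check halts the run before any join happens, which matches the intent of validating upstream data first.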
In addition, Airbnb included features to make Minerva operate efficiently. Some features are self-healing and automated backfilling.
Minerva tries to be data-aware: every job checks for missing data, and any missing data it finds is included in the current run. This means a single run’s date range can change dynamically (e.g., 3 days → 4 days of data), so users don’t need to reset tasks manually.
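One way to picture this data-aware behavior is a run window that expands backwards over missing partitions. The sketch below is hypothetical (parameter names and the lookback bound are made up), meant only to show how a 3-day run could become a 4-day run:

```python
from datetime import date, timedelta

def expand_run_window(scheduled_start, scheduled_end, existing_partitions,
                      max_lookback_days=30):
    """If partitions just before the scheduled window are missing, pull the
    window start back so one run also covers them (hypothetical sketch)."""
    start = scheduled_start
    for i in range(1, max_lookback_days + 1):
        day = scheduled_start - timedelta(days=i)
        if day in existing_partitions:   # data present -> stop looking back
            break
        start = day                      # missing -> include it in this run
    return start, scheduled_end

# A 3-day run grows to 4 days because 2021-01-03 was never materialized.
done = {date(2021, 1, 1), date(2021, 1, 2)}
window = expand_run_window(date(2021, 1, 4), date(2021, 1, 6), done)
```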
As for backfilling, a long backfill window (e.g., several months) may generate an expensive query, while splitting the window into many small pieces makes a large initial backfill too slow. To solve this, Airbnb introduced batch backfill for Minerva.
They still split the backfill window into smaller ones, sized according to how well the dataset scales. For example, a one-year window would be divided into twelve one-month windows, and the 12 jobs would run in parallel.
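The one-year example can be sketched as follows: split the window into calendar months, then fan the chunks out to parallel workers. This is a generic illustration of the batching idea, not Minerva's actual scheduler:

```python
from concurrent.futures import ThreadPoolExecutor
from datetime import date

def month_chunks(start, end):
    """Split the half-open window [start, end) into calendar-month windows."""
    chunks, cur = [], start
    while cur < end:
        # First day of the next month (rolls the year over in December).
        nxt = date(cur.year + (cur.month == 12), cur.month % 12 + 1, 1)
        chunks.append((cur, min(nxt, end)))
        cur = nxt
    return chunks

def batch_backfill(start, end, run_chunk):
    """Run one backfill job per month window in parallel (hypothetical)."""
    with ThreadPoolExecutor(max_workers=4) as pool:
        return list(pool.map(lambda w: run_chunk(*w), month_chunks(start, end)))

# A one-year backfill becomes twelve one-month jobs.
chunks = month_chunks(date(2020, 1, 1), date(2021, 1, 1))
```

Each chunk stays cheap enough to compute, while parallelism keeps the overall backfill fast.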
Consistent
Internal users frequently change Minerva's definitions. Airbnb introduced a data version to ensure that Minerva datasets are consistent and up-to-date.
The data version is a hash of all the essential fields specified in the definitions (e.g., data source). When users change any field used for the hashing, the data version is automatically updated.
Each dataset carries its data version; when the version changes, Minerva automatically creates and backfills a new dataset. This approach ensures that upstream changes propagate to all downstream datasets and that no Minerva dataset diverges from the source of truth.
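The versioning idea can be sketched as hashing only the definition fields that affect the computed data, so metadata edits don't trigger backfills but logic edits do. The field names below are hypothetical:

```python
import hashlib
import json

def data_version(definition, essential_fields=("source", "expression", "filters")):
    """Hash only the fields that affect the computed data (hypothetical
    field names). Changing any of them yields a new version, signaling
    that a new dataset must be created and backfilled."""
    payload = {k: definition.get(k) for k in essential_fields}
    blob = json.dumps(payload, sort_keys=True, default=str)  # stable ordering
    return hashlib.sha256(blob.encode()).hexdigest()[:12]

base = {"source": "fct_bookings", "expression": "COUNT(1)", "owner": "team-a"}
v1 = data_version(base)
v2 = data_version({**base, "owner": "team-b"})            # metadata-only change
v3 = data_version({**base, "expression": "SUM(nights)"})  # logic change
```

Sorting the keys before hashing keeps the version stable regardless of the order in which fields were written.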
Highly Available
Airbnb observed that backfills often could not keep up with user changes when an update affected many datasets. Given that Minerva promises consistent and up-to-date data, a frequently changing dataset could end up backfilling forever and cause data downtime.
Airbnb deployed a parallel computation environment called Staging, which replicates the Production environment. Data backfilling happens in Staging before being published to Production. The flow is as follows:
Users develop and test changes in the local environment.
They merge changes to the Staging environment.
The Staging environment loads the Staging configurations, retrieves any necessary Production configurations if needed, and starts backfilling modified datasets.
The Staging changes are merged into Production when the backfill is done.
Well-Tested
To help users validate data correctness, Minerva has a tool that reads from production but writes to a sandbox environment. The tool generates sample data on top of the user’s local modifications, allowing users to validate their changes.
The tool shows the step-by-step computation that Minerva follows to generate the output, which helps users debug issues as if they were running the logic themselves. Finally, it lets users configure date ranges to limit the test data size, saving a lot of time waiting for tests to finish.
Consumption
The Minerva teams partnered with other internal teams to create an ecosystem around Minerva:
Data catalog: They index all Minerva metrics and dimensions in Airbnb’s Dataportal. When a user searches for a metric, the Dataportal shows the result from Minerva.
Dataportal also offers a data exploration feature called Metric Explorer. Users can see metric trends with additional slicing and drill-down options, such as Group By and Filter. Users who want to dig deeper can switch to Superset to perform more advanced analytics.
They migrated the A/B test platform’s proprietary metric repo to Minerva, which helped achieve consistency between experimentation and analytics.
To enable executive reporting, they built a reporting framework that turns a set of user-specified Minerva metrics and dimensions into aggregated metric time series.
Minerva exposes an API for Airbnb’s R and Python clients, letting data scientists query Minerva data in a notebook environment. Data scientists now get metric calculation results identical to those of other tools such as Metric Explorer, saving them lots of time when investigating data discrepancies.
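Airbnb has not published the client API, but the shape of such a notebook client might look like the following sketch; the class, method, and parameter names are all hypothetical, and the transport is stubbed out:

```python
class MetricClient:
    """Hypothetical notebook-friendly client for a metrics API."""

    def __init__(self, fetch):
        self._fetch = fetch  # callable that hits the metrics data API

    def query(self, metric, dimensions=(), start=None, end=None, filters=None):
        """Return rows for a metric sliced by dimensions. The computation
        happens server-side, so results match other consumers of the API."""
        return self._fetch({
            "metric": metric,
            "dimensions": list(dimensions),
            "start": start,
            "end": end,
            "filters": filters or {},
        })

# Usage with a stubbed transport standing in for the real API call.
client = MetricClient(fetch=lambda req: [{"market": "SF", req["metric"]: 42}])
rows = client.query("bookings", dimensions=["market"], start="2021-01-01")
```

Because the client only forwards a declarative request, every consumer gets numbers computed by the same server-side logic.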
Outro
Thank you for reading this far.
In this article, we explored the motivation behind Airbnb’s semantic platform, its architecture and design principles, and finally, how Minerva serves downstream consumption.
Now it’s time to say goodbye. See you in my following articles.