How did Discord evolve to handle trillions of data points
From in-house solutions to the modern data stack
My name is Vu Trinh, and I am a data engineer.
I’m trying to make my life less dull by spending time learning and researching “how it works” in the data engineering field.
Here is a place where I share everything I’ve learned.
Intro
The series on how big companies handle data analytics continues with Discord, a chat app used by tens of millions.
Unlike companies such as Uber, LinkedIn, or Twitter, which rely on open-source projects like Kafka, Flink, and Spark alongside cloud services, Discord initially built an in-house orchestration solution to manage its analytics workload.
However, as their needs grew more complex, the limitations of their initial setup led them to rebuild their data orchestration infrastructure using modern, open-source tools.
Derived - The in-house solution.
In a 2021 article, Discord introduced its internal solution, Derived, which helped it transform petabytes of raw data into precomputed tables in its BigQuery data warehouse.
Their main requirement for this system was maintaining a complex Directed Acyclic Graph (DAG) of precomputed data.
Data flows from raw to cleaned tables, undergoing multiple transformation steps that may reference several tables.
At Discord, a derived table represents a data transformation using predecessor tables in the DAG as input. Essentially, it is an SQL SELECT statement referencing raw data or other derived tables.
They wanted to build a solution that satisfied the following criteria:
Users only need to know SQL to define a derived table (the transformation logic).
The system will infer the DAG from the SQL; thus, users do not need to consider the DAG structure.
Using Git for production configuration.
Seamless integration with their existing privacy systems and data governance policies.
The system must expose accessible metadata to build monitoring, lineage, and performance tooling.
Data backfilling must not be complex.
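The DAG-inference requirement above (users write only SQL; the system works out the dependency graph) can be sketched in a few lines of Python: scan each derived table's SELECT statement for referenced tables and topologically sort the result. This is a simplified illustration, not Discord's actual implementation; the regex and table names are hypothetical.

```python
import re
from graphlib import TopologicalSorter  # Python 3.9+

def referenced_tables(sql: str) -> set[str]:
    """Extract table names that appear after FROM or JOIN (simplified parser)."""
    return set(re.findall(r"\b(?:FROM|JOIN)\s+([\w.]+)", sql, re.IGNORECASE))

# Hypothetical derived-table definitions: name -> SELECT statement.
tables = {
    "messages_cleaned": "SELECT * FROM raw.messages",
    "daily_activity": "SELECT user_id, day FROM messages_cleaned JOIN raw.users",
}

# Infer the DAG: a derived table depends on the derived tables its SQL references.
dag = {name: referenced_tables(sql) & tables.keys() for name, sql in tables.items()}

# A topological order tells the scheduler which tables to build first.
order = list(TopologicalSorter(dag).static_order())
print(order)
```

A real implementation would use a proper SQL parser rather than a regex, but the principle is the same: the SQL itself is the only dependency declaration users ever write.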
Taking those requirements into account, they built the Derived system with the following technical highlights:
Users define a derived table's configuration in a YAML file, specifying the SQL logic and settings like refresh frequency, schema, description, data update strategy, update schedule, date range, and BigQuery optimizations such as cluster columns or partition schemes.
Users can use the CLI to load table configurations and validate dependencies across the entire DAG for development.
Discord leveraged Airflow for job scheduling, visualization, and monitoring.
The CLI also lets users create test versions of tables using shadow production data.
Once a pull request is made, CI deploys all new tables to a shadow production environment (a mimic of the production environment), allowing users to validate their changes with real data before merging to the production branch.
Each transformation run will be deployed separately as a Kubernetes Pod.
They store table metadata in a dedicated log store, accessible via BigQuery, which enables data observability. Discord can integrate this metadata with BigQuery's information schema for detailed insights like byte processing and slot usage.
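Based on the description above, a Derived table definition might look roughly like the following. The field names are illustrative assumptions; Discord has not published its actual configuration schema.

```yaml
# Hypothetical Derived table definition (field names are illustrative).
name: daily_message_counts
description: Messages sent per day, derived from the raw messages table.
schedule: "0 */6 * * *"        # refresh every 6 hours
update_strategy: incremental   # vs. a full rebuild on each run
partition_by: message_date     # BigQuery partition column
cluster_by: [guild_id]         # BigQuery clustering columns
sql: |
  SELECT DATE(sent_at) AS message_date, guild_id, COUNT(*) AS messages
  FROM raw.messages
  GROUP BY 1, 2
```

Everything the system needs, including the DAG edges, is recoverable from this one file: the SQL names the upstream tables, and the rest is scheduling and storage configuration.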
The Derived system served Discord well in its early days. Still, over time, they realized this in-house solution offered limited usability and flexibility for more complex use cases. A next iteration was needed.
New solution with Dagster and dbt
For the transformation layer, among options like Coalesce and SQLMesh, Discord chose dbt, thanks to its rich features and functionality.
Choosing an orchestrator was more of a headache, given the many available open-source options: Airflow, Argo, Prefect, Kestra, and Mage.
After thorough research and consideration, they chose Dagster. Here are the key reasons behind Discord's decision:
Declarative automation: Orchestrators like Airflow require an imperative workflow specification: a sequence of tasks that run on a pre-defined schedule (e.g., at 3:00 PM or every 6 hours). Dagster offers a different approach, allowing users to specify how up-to-date they expect each data asset to be; Dagster then takes care of the scheduling to ensure data arrives on time.
Dagster has a modern UI that provides self-service observability for data engineers and data scientists.
It is easy to run Dagster locally.
Dagster supports out-of-the-box deployment and execution on Kubernetes.
Dagster supports the seamless migration of Airflow jobs.
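The contrast between cron-style scheduling and declarative automation can be shown in plain Python: instead of "run at 3:00 PM," you state a freshness target and let the scheduler decide when a refresh is due. This is a stdlib-only illustration of the idea, not Dagster's actual implementation.

```python
from datetime import datetime, timedelta

def needs_refresh(last_materialized: datetime, max_lag: timedelta, now: datetime) -> bool:
    """Declarative check: is the asset staler than its freshness target?"""
    return now - last_materialized > max_lag

now = datetime(2024, 6, 1, 12, 0)
# The asset is expected to be at most 6 hours out of date.
stale = needs_refresh(datetime(2024, 6, 1, 4, 0), timedelta(hours=6), now)  # 8h old
fresh = needs_refresh(datetime(2024, 6, 1, 9, 0), timedelta(hours=6), now)  # 3h old
print(stale, fresh)  # True False
```

The scheduler's job reduces to evaluating this predicate for every asset and materializing the stale ones, which is what makes the approach declarative: users state the desired end state, not the steps.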
Next, we’ll take a glimpse at how Discord's data engineers operate their new data infrastructure with Dagster and dbt:
As mentioned, every Derived data transformation runs as an independent Kubernetes pod. Thanks to Dagster's efficient Kubernetes support, their engineers could get things up and running quickly.
They integrated dbt with Dagster using software-defined assets. A few words from Dagster about this feature:
An asset is an object in persistent storage, such as a table, file, or persisted machine learning model. Asset definitions enable a declarative approach to data management, in which code is the source of truth on what data assets should exist and how those assets are computed. — Source —
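The software-defined-asset idea, stripped of Dagster's machinery, is that code declares which assets exist and what they depend on, and the framework derives the graph from those declarations. The following is a stdlib-only sketch of that concept, not Dagster's actual API.

```python
ASSETS = {}  # asset name -> (upstream dependencies, compute function)

def asset(*, deps=()):
    """Register a function as a data asset; its deps define the graph edges."""
    def register(fn):
        ASSETS[fn.__name__] = (tuple(deps), fn)
        return fn
    return register

@asset()
def raw_messages():
    return ["hi", "hello"]

@asset(deps=("raw_messages",))
def message_count():
    return len(raw_messages())

# The declared registry is the source of truth for which assets exist
# and how they relate -- no separate DAG definition is needed.
graph = {name: deps for name, (deps, _) in ASSETS.items()}
print(graph)  # {'raw_messages': (), 'message_count': ('raw_messages',)}
```

In Dagster proper, a dbt project's models become assets automatically, so each dbt model slots into this declarative graph alongside other Python-defined assets.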
Instead of defining the transformation logic in YAML, they switched to dbt.
They schedule their entire DAG using Dagster's declarative automation method, triggered by scheduled runs monitoring their raw data layer.
To address a race condition in dbt incremental updates, where multiple runs would conflict by trying to delete the same temporary table, they modified dbt's logic for storing temporary data. This allowed for parallel execution of various partitions of the same asset.
Based on my experience with dbt, when using the dbt incremental model, it creates a temporary table named in the format destination_table__dbt_tmp. This temp table is used to update the destination table incrementally, and dbt deletes it after the update is complete. However, if multiple incremental runs execute against the same table, temp tables covering different date ranges are created with the same __dbt_tmp name, which can lead to conflicts when dbt tries to use or delete a temp table from a different run. I guess Discord adjusted dbt to include the specific date range in the temp table name, distinguishing each run from the others.
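If that guess is right, the fix could be as simple as suffixing the temp-table name with the run's partition range, so parallel runs over different partitions never collide. This sketch is purely illustrative; Discord's actual patch to dbt's logic is not public.

```python
def tmp_table_name(destination: str, partition_start: str, partition_end: str) -> str:
    """Build a per-run temp table name so parallel incremental runs don't collide."""
    # dbt's default would be f"{destination}__dbt_tmp" for every run,
    # which is what causes the race between concurrent runs.
    return f"{destination}__dbt_tmp_{partition_start}_{partition_end}".replace("-", "")

run_a = tmp_table_name("daily_activity", "2024-06-01", "2024-06-02")
run_b = tmp_table_name("daily_activity", "2024-06-02", "2024-06-03")
print(run_a)           # daily_activity__dbt_tmp_20240601_20240602
print(run_a != run_b)  # True: each partition range gets its own temp table
```

With unique names per partition range, each run creates, uses, and drops only its own temp table, which is what allows the parallel execution of different partitions of the same asset.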
Discord also creates customized dbt CLI commands to boost developer productivity.
They also implemented a robust CI/CD process to prevent disruptive changes across table logic, dbt macros, and dbt tests.
Discord engineers leverage various dbt packages to onboard new features and functionality faster than they could with the YAML-based Derived system.
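A customized dbt command can be as simple as a thin wrapper that bakes team defaults into the invocation. The flags below are standard dbt CLI flags, but the wrapper itself is a hypothetical sketch, not Discord's tooling.

```python
def build_dbt_command(models: str, *, target: str = "dev", full_refresh: bool = False) -> list[str]:
    """Assemble a dbt invocation with team defaults (target environment) baked in."""
    cmd = ["dbt", "run", "--select", models, "--target", target]
    if full_refresh:
        cmd.append("--full-refresh")
    return cmd  # pass to subprocess.run(cmd) to execute

print(build_dbt_command("daily_activity", full_refresh=True))
```

Wrappers like this keep day-to-day commands short and consistent across the team while leaving the underlying dbt CLI untouched.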
Outro
In this article, we learned some valuable lessons from the evolution of Discord's data orchestration. The initial Derived solution served them well at first but showed its limits when more flexibility and usability were required, so they rebuilt their data infrastructure with Dagster and dbt to continue providing seamless data analytics.
References
[1] Daniel Meas, How Discord Creates Insights From Trillions Of Data Points (2021)
[2] Zach Bluhm, How Discord Uses Open-Source Tools for Scalable Data Orchestration & Transformation (2024)
Before you leave
If you want to discuss this further, please leave a comment or contact me via LinkedIn, Email, or Twitter.
It might take you five minutes to read, but it took me more than five days to prepare, so it would greatly motivate me if you considered subscribing to receive my writing.