How Did LinkedIn Handle 7 Trillion Messages Daily With Apache Kafka?
Was adding more machines enough?
My name is Vu Trinh, and I am a data engineer.
I’m trying to make my life less dull by spending time learning and researching “how it works” in the data engineering field.
Here is a place where I share everything I’ve learned.
Intro
I spent a decent amount of time learning Apache Kafka's concepts, theory, and architecture. Observed from a skydiver's view at 10,000 feet, Kafka has a dead simple architecture: brokers host the topics, producers write the data, and consumers read it. Even with this simplicity, Kafka has become a core part of the infrastructure for companies of all sizes.
In this week's newsletter, we will learn how LinkedIn, the company that created Kafka, operates the messaging system to handle 7 trillion messages daily. (The number comes from a 2019 article, so it is likely even larger now.)
Overview
If you are not impressed with that statistic yet, here are some more numbers:
100 Kafka clusters
4,000 Kafka brokers
100,000 topics
7,000,000 partitions
At LinkedIn, Kafka is leveraged for a wide range of use cases; here are some broad categories:
Decoupling the sender and receiver: one part of the application produces messages, while another part consumes them (see the sketch after this list).
Monitoring: Kafka acts as the event bus that receives monitoring metrics from agents. LinkedIn installs agents on its servers to collect application-generated measurements such as CPU and RAM utilization.
Logging: LinkedIn routes application, system, and public access logs to Kafka.
Tracking: Tracking involves every action, whether by users or applications. This data is crucial for keeping search indices current, tracking paid service usage, and measuring growth in real time. LinkedIn uses stream processing systems like Samza to process action data from Kafka.
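To make the decoupling pattern concrete, here is a minimal sketch using the kafka-python client. The broker address, topic name, and consumer group are illustrative assumptions, not LinkedIn's actual setup:

```python
# Minimal decoupling sketch (assumed broker/topic names, kafka-python client).
import json
from kafka import KafkaProducer, KafkaConsumer

# One part of the application produces messages...
producer = KafkaProducer(
    bootstrap_servers="localhost:9092",          # assumed broker address
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)
producer.send("page-view-events", {"member_id": 42, "page": "/jobs"})
producer.flush()

# ...while another part consumes them, never talking to the producer directly.
consumer = KafkaConsumer(
    "page-view-events",
    bootstrap_servers="localhost:9092",
    group_id="search-indexer",                   # assumed consumer group
    auto_offset_reset="earliest",
    value_deserializer=lambda b: json.loads(b.decode("utf-8")),
)
for record in consumer:
    print(record.value)   # e.g., update a search index or usage counters
    break                 # stop after one message in this sketch
```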
LinkedIn needs to operate Kafka in the most reliable and scalable way to manage its vast data and support a variety of use cases. In the following sections, we'll explore how LinkedIn achieves these goals.
Tiers and Aggregation
An internet-scale company like LinkedIn runs its infrastructure across multiple data centers.
Some applications only care about what is happening in a single data center, while others, such as building search indexes, need to operate across multiple data centers.
LinkedIn deploys a local cluster in each data center for each message category, plus an aggregate cluster that consolidates messages from all local clusters for that category. With this strategy, producers and consumers can interact with the local Kafka cluster without reaching across data centers.
Initially, they used Kafka MirrorMaker to copy data from the local clusters to the aggregate cluster. Later, they encountered scaling issues with this replication tool, so they switched to Brooklin, an internal solution that streams data across different data stores.
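Conceptually, this local-to-aggregate replication is a consume-and-reproduce loop. Below is a simplified sketch of the idea with the kafka-python client; production tools like MirrorMaker and Brooklin add batching, offset checkpointing, and failure handling, and the cluster addresses and topic name here are assumptions:

```python
# Conceptual local-to-aggregate mirroring loop (assumed addresses and topic).
from kafka import KafkaConsumer, KafkaProducer

local = KafkaConsumer(
    "tracking-events",                            # assumed topic name
    bootstrap_servers="local-cluster:9092",       # brokers in this data center
    group_id="aggregate-mirror",
)
aggregate = KafkaProducer(
    bootstrap_servers="aggregate-cluster:9092",   # the aggregate cluster
)

for record in local:
    # Copy each record as-is into the same topic on the aggregate cluster.
    aggregate.send("tracking-events", key=record.key, value=record.value)
```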
When reading data, LinkedIn deploys consumers so that they consume from brokers in the same data center. This approach simplifies configuration and avoids cross-datacenter network issues.
We can now see the tiers of the Kafka deployment at LinkedIn:
First tier: Producer
Second tier: Local clusters (one in each data center)
Additional tiers: Aggregate clusters
Final tier: Consumer
Operating across this many tiers raises a concern: the completeness of Kafka's messages after they have passed through every tier. LinkedIn needs a way of auditing this.
Auditing Completeness
Kafka Audit is an internal tool at LinkedIn that ensures sent messages do not disappear when copied through the tiers.
When the producer sends messages to Kafka, it tracks the count of messages sent during the current time interval. Periodically, the producer sends this count as a message to a special auditing topic.
On the consumption side, audit consumers from the Kafka Console Auditor application consume messages from all topics, alongside the consumers from other applications.
Like the producer, audit consumers periodically send messages into the auditing topic, recording the number of messages they consume for each topic.
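A rough sketch of this bookkeeping is shown below; the audit topic name, the 10-minute bucket, and the record shape are assumptions, since the article does not specify them:

```python
# Producer-side auditing sketch: count sends per (topic, time bucket) and
# periodically publish the counts to a special audit topic (names assumed).
import json
import time
from collections import Counter
from kafka import KafkaProducer

BUCKET_SECONDS = 600  # assumed auditing interval (10 minutes)

producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)
sent_counts = Counter()

def send_audited(topic, payload):
    """Send a message and record it against the current time bucket."""
    producer.send(topic, payload)
    bucket = int(time.time()) // BUCKET_SECONDS * BUCKET_SECONDS
    sent_counts[(topic, bucket)] += 1

def flush_audit_counts():
    """Periodically emit one audit record per (topic, bucket) counted so far."""
    for (topic, bucket), count in sent_counts.items():
        producer.send("_audit", {          # assumed audit topic name
            "topic": topic,
            "bucket_start": bucket,
            "count": count,
            "tier": "producer",
        })
    sent_counts.clear()
```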
LinkedIn engineers then compare the message counts from producers and audit consumers to check whether every message landed in Kafka.
If the numbers differ, something went wrong with the producer, and their engineers can trace the issue back to the specific service and host responsible.
Tracing is possible because the Kafka message’s schema at LinkedIn contains a header that includes metadata like the timestamp, the originating physical server, and the service.
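The check itself then reduces to comparing producer-reported and consumer-reported counts per topic and time bucket. A minimal sketch, assuming the audit-record shape from the previous snippet:

```python
# Completeness check sketch: producer counts should equal consumer counts
# for every (topic, time bucket); any gap points at lost messages.
from collections import defaultdict

def check_completeness(audit_records):
    """audit_records: an iterable of dicts read from the audit topic."""
    totals = defaultdict(lambda: {"producer": 0, "consumer": 0})
    for rec in audit_records:
        key = (rec["topic"], rec["bucket_start"])
        totals[key][rec["tier"]] += rec["count"]

    for (topic, bucket), counts in totals.items():
        if counts["producer"] != counts["consumer"]:
            # Mismatch: use the per-message header metadata (timestamp,
            # origin host, service) to trace the responsible producer.
            print(f"possible loss in {topic} @ {bucket}: "
                  f"sent={counts['producer']} read={counts['consumer']}")
```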
LinkedIn Kafka release branches
LinkedIn maintains internal Kafka release branches for deploying to its production environment.
Their goal is to keep the internal branch close to the open-source Kafka release branch, which helps them leverage new features and hotfixes from the community and allows LinkedIn to contribute back to open-source Apache Kafka.
LinkedIn engineers create an internal release branch by branching from the associated Apache Kafka branch; they call this the upstream branch.
They have two approaches to committing Kafka patches developed at LinkedIn:
Upstream First: they commit changes to the upstream branch first, issuing a Kafka Improvement Proposal (KIP) if necessary, and then cherry-pick them to the current LinkedIn release branch. This method is suitable for changes of low to medium urgency.
LinkedIn First: they commit to the internal release branch first and push to the upstream later. This method is suitable for high-urgency scenarios.
Keeping their release branch close to the upstream branch is a two-way process: in addition to syncing their internal patches to the upstream branch, they also need to cherry-pick patches from upstream into their internal branches. A LinkedIn release branch contains the following types of patches:
Patches from the upstream Kafka branch up to the branch point.
Cherry-picked patches from the upstream branch after the branch point.
Hotfix patches that are committed to the internal branch first and are prepared to be committed to the upstream branch.
LinkedIn-only patches, which appear only in the internal release branches. These are patches they tried to commit upstream but that the open-source community rejected.
Here is the LinkedIn Kafka development workflow:
If there is a new issue:
If the patch exists in the open-source Apache Kafka branch, they can cherry-pick from the upstream branch or catch up with this patch later in the next rebase.
If the patch does not exist in the upstream branch, they attempt to commit it to both the upstream and internal branches.
If there is a new feature:
They attempt to commit the patch to both the upstream and internal branches, issuing a KIP for the upstream commit when needed.
The LinkedIn engineers choose the Upstream First or LinkedIn First route based on the urgency of the patch. Typically, patches addressing production issues are committed as hotfixes first, while feature patches for approved KIPs go to the upstream branch first.
Outro
Efficiently operating data infrastructure that processes data at a massive scale is not a simple task, and adding more machines cannot solve every problem. Throughout this article, we've learned how LinkedIn operates Kafka to handle trillions of messages daily: how they organize Kafka clusters across their data centers, how they ensure message completeness, and finally, their Kafka development workflow.
References
[1] Todd Palino, Running Kafka At Scale (2015)
[2] Jon Lee, How LinkedIn customizes Apache Kafka for 7 trillion messages per day (2019)
Before you leave
If you want to discuss this further, please leave a comment or contact me via LinkedIn, Email, or Twitter.
It might take you five minutes to read, but it took me more than five days to prepare, so it would greatly motivate me if you considered subscribing to receive my writing.