How Meta Solves Data Lineage At Scale
Meta’s Approach to Data Lineage: How They Did It and What We Can Learn
I’m making my life less dull by spending time learning and researching “how it works” in the data engineering field.
Here is a place where I share everything I’ve learned.
Intro
When Meta recently published an article titled How Meta discovers data flows via lineage at scale, it instantly caught my attention.
As data engineers, we often hear about data lineage, but how many of us deeply understand its implications or the challenges of implementing it at scale? Meta’s approach to solving data lineage problems within their privacy infrastructure offers fascinating lessons.
In this article, we’ll explore Meta's challenges with data lineage, their solutions, and the practical lessons we can adopt—even if we don’t operate at Meta’s scale.
A Little Bit About Meta
Even my low-tech mom uses Facebook.
With billions of users across Facebook, Instagram, WhatsApp, and more, the company handles petabytes of data daily. This data isn’t just about scale; it’s deeply interconnected. Every click, post, or message can flow through a complex web of systems—from user-facing apps to backend services and data warehouses. Managing and understanding these flows is no small feat, especially as Meta prioritizes user privacy.
At the heart of their efforts is the Privacy-Aware Infrastructure (PAI), a suite of technologies that ensures privacy controls across their systems. Data lineage is a cornerstone of PAI, allowing Meta to trace how data flows and ensure compliance with privacy requirements.
But what is data lineage?
Data lineage is the process of tracing data's journey through various systems, from its source to its final destination. It answers questions like: Where did this data originate? How has it been transformed? Where is it being used? It gives us:
Transparency and Trust: It clarifies how data flows through systems, essential for ensuring compliance with privacy regulations and building user trust.
Troubleshooting: Knowing the data's path helps engineers pinpoint the root cause when issues arise.
Impact Analysis: When making changes to systems, data lineage allows teams to assess potential downstream effects, minimizing unintended disruptions.
Compliance: In an era of stringent data privacy laws, like GDPR and CCPA, having a clear picture of data flows is mandatory to demonstrate compliance and protect user privacy.
Data lineage isn't just a "nice-to-have"—it's a foundational piece of modern data systems.
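To make the idea concrete, here is a minimal sketch of lineage as a directed graph. The asset names are made up for illustration, and this is just one simple way to model it, not how any particular tool stores lineage. Walking the graph downstream gives impact analysis; walking it upstream answers "where did this data originate?"

```python
from collections import defaultdict, deque

# Hypothetical asset names; each edge means "data flows from source to destination".
EDGES = [
    ("signup_form", "users_table"),
    ("users_table", "daily_user_stats"),
    ("users_table", "recommendation_features"),
    ("daily_user_stats", "growth_dashboard"),
]


def build_index(edges):
    """Index edges in both directions so we can walk upstream or downstream."""
    downstream, upstream = defaultdict(set), defaultdict(set)
    for src, dst in edges:
        downstream[src].add(dst)
        upstream[dst].add(src)
    return downstream, upstream


def reachable(start, neighbors):
    """Breadth-first search returning every asset reachable from `start`."""
    seen, queue = set(), deque([start])
    while queue:
        node = queue.popleft()
        for nxt in neighbors.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen


DOWNSTREAM, UPSTREAM = build_index(EDGES)

# Impact analysis: what could be affected if users_table changes?
impacted = reachable("users_table", DOWNSTREAM)
# Origin tracing: where does the dashboard's data come from?
origins = reachable("growth_dashboard", UPSTREAM)
```

Keeping both directions indexed costs a little memory but makes the two most common lineage questions equally cheap to answer.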
The Problem At Meta
Why Data Lineage Matters
For Meta, data lineage helps them understand how data—such as a user’s religious views on Facebook Dating—moves from the input stage to backend processing, storage, and usage in downstream systems.
This transparency is critical for implementing and validating privacy controls. Here is how Meta was already using data lineage:
Understanding the data flows across the system is crucial to establishing privacy controls in the PAI.
An important service is Policy Zones, which answers the question: “Where does my data come from, and where does it go?”
Internal users can use the lineage graphs to explain how data flows and where it is collected and processed.
Meta developed the Policy Zone Manager (PZM), a tool based on data lineage that lets developers identify multiple downstream assets from a set of sources. This accelerates the rollout of privacy controls.
Once they implement privacy requirements, data lineage helps monitor and validate data flows continuously and provides enforcement mechanisms.
However, as Meta scaled PAI across all its apps, its existing lineage solutions fell short.
Expanding PAI to all of Meta’s apps introduced a massive challenge: ensuring high-quality, detailed data lineage across diverse systems. Manual methods, such as authoring diagrams and spreadsheets by hand, couldn’t keep up with the pace of change, the complexity, or the sheer number of data flows.
Without robust lineage tools, Meta risked delays in implementing privacy controls, which could impact user trust and regulatory compliance.
Is This Problem Unique?
While Meta’s scale is unparalleled, the core problem—managing data lineage efficiently—is something many companies face. As organizations grow, they often grapple with fragmented systems and incomplete lineage. This impacts everything from troubleshooting to compliance, making it a universal challenge for data teams.
How Meta Solved It
Meta developed a comprehensive lineage solution integrated into their PAI to tackle their challenges. The Policy Zone Manager (PZM) is central to this effort. This tool builds on lineage data, enabling developers to trace data flows and implement privacy controls efficiently.
The solution has the following steps.
Collecting data flow signals from many data activities
For web system activities, Meta discovers data flows by employing static and runtime analysis tools. It focuses on sensitive data, such as religious views. For instance, when users input data on the app, this data is transmitted to a web endpoint, written to a logging table, and stored in a database.
Static analysis tools simulate code execution to map out potential data flows. Data at Meta can flow through stacks of function calls in different programming languages, such as C++ or Python, from web systems to backend services.
Static code analysis is a method of examining code without executing the program, often used to find bugs. In the lineage context, although it doesn't run the code, static analysis simulates the logical paths a program might take. This simulation helps identify potential data flows: data being read from a source (e.g., a form or API endpoint), processed or transformed by various functions, and written to a destination (e.g., a database table or log file).
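Here is a toy illustration of the idea using Python's built-in ast module. It is only a sketch under my own assumptions: the function names read_form and write_log are hypothetical, the analysis handles only straight-line code, and Meta's real tooling is far more sophisticated. The sample code is parsed but never executed, and the analyzer propagates "taint" from a source call through assignments until it reaches a sink call.

```python
import ast

# Hypothetical code under analysis. It is only parsed, never executed.
SAMPLE = '''
religion = read_form("religion")
normalized = religion.strip().lower()
unrelated = "constant"
write_log("profile_events", normalized)
write_log("debug_events", unrelated)
'''

SOURCES = {"read_form"}  # assumed names of functions that read sensitive input
SINKS = {"write_log"}    # assumed names of functions that persist data


def call_name(node):
    """Return the function name for a simple call like f(...), else None."""
    if isinstance(node, ast.Call) and isinstance(node.func, ast.Name):
        return node.func.id
    return None


def is_tainted(expr, tainted):
    """True if the expression calls a source or mentions a tainted variable."""
    for node in ast.walk(expr):
        if call_name(node) in SOURCES:
            return True
        if isinstance(node, ast.Name) and node.id in tainted:
            return True
    return False


def find_flows(code):
    """Walk top-level statements in order, propagating taint through assignments."""
    tainted, flows = set(), []
    for stmt in ast.parse(code).body:
        if isinstance(stmt, ast.Assign) and is_tainted(stmt.value, tainted):
            for target in stmt.targets:
                if isinstance(target, ast.Name):
                    tainted.add(target.id)
        elif isinstance(stmt, ast.Expr) and call_name(stmt.value) in SINKS:
            call = stmt.value
            if any(is_tainted(arg, tainted) for arg in call.args):
                # The first argument is assumed to be a literal destination name.
                flows.append((ast.literal_eval(call.args[0]), stmt.lineno))
    return flows


flows = find_flows(SAMPLE)
```

In this sample, only the write to profile_events is reported, because the value written to debug_events never touches the form input. Branches, loops, and calls across files are ignored here, which is exactly where real static analyzers earn their keep.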
However, the static approach is not enough. It does not account for runtime-specific data flows, such as conditional logic based on user input.
Meta collects real-time signals during request execution. It captures and compares payloads at source and sink points, categorizing data flow evidence into match sets (high-confidence matches) and complete sets (broader potential matches for human review).
For example, Meta collects two payloads, one from a source and one from a sink. If the source payload is {“data”: “Buddhist”} and the sink payload is {“data”: “Buddhist”, “event_timestamp”: “00:00:00”}, Meta concludes that the data likely flows from this source to this sink.
However, if the sink payload represents a “more compacted and abstracted” value such as {“religion_count”: 1}, Meta is not sure if this data flows from the source to this sink. In such cases, Meta requires humans to review the flow result.
Unfortunately, Meta doesn’t share detailed rules for defining the confidence level for a flow result.
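Since the real rules aren't public, the sketch below uses a rule I made up purely for illustration: if every value in the source payload also appears as a value in the sink payload, call it high confidence; otherwise send it to human review. The function name classify_flow and both labels are my own, not Meta's.

```python
import json


def classify_flow(source_payload: str, sink_payload: str) -> str:
    """Compare two JSON object payloads using an assumed, simplistic rule."""
    source = json.loads(source_payload)
    sink = json.loads(sink_payload)
    # Serialize values so nested lists and dicts can be compared as set members.
    source_values = {json.dumps(v, sort_keys=True) for v in source.values()}
    sink_values = {json.dumps(v, sort_keys=True) for v in sink.values()}
    if source_values and source_values <= sink_values:
        return "high_confidence"
    return "needs_review"


# The two payload pairs from the examples above.
exact = classify_flow(
    '{"data": "Buddhist"}',
    '{"data": "Buddhist", "event_timestamp": "00:00:00"}',
)
aggregated = classify_flow('{"data": "Buddhist"}', '{"religion_count": 1}')
```

With this rule, the first pair is classified as high_confidence and the aggregated pair as needs_review. A production system would also have to deal with common values that match by coincidence, encodings, and partial matches, none of which this sketch attempts.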
For data warehousing activities, they combine runtime instrumentation with static analysis of SQL queries (from engines like Presto and Spark). Contextual runtime information, such as job IDs, helps fill gaps where static analysis might miss connections.
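To give a feel for how the two signals combine, here is a rough sketch. The query log format, table names, and job ID are all hypothetical, and I use regular expressions only to keep the example short. A real implementation would need a proper SQL parser to handle subqueries, CTEs, and comments.

```python
import re

# Hypothetical query log entry: the job ID is runtime context,
# while the SQL text is analyzed statically.
QUERY_LOG = [
    {
        "job_id": "job_123",
        "sql": """
            INSERT INTO daily_religion_stats
            SELECT d.region, COUNT(*)
            FROM dating_profiles AS d
            JOIN user_regions r ON d.user_id = r.user_id
            GROUP BY d.region
        """,
    },
]

TARGET_RE = re.compile(
    r"\bINSERT\s+(?:INTO|OVERWRITE\s+TABLE)\s+([\w.]+)", re.IGNORECASE
)
SOURCE_RE = re.compile(r"\b(?:FROM|JOIN)\s+([\w.]+)", re.IGNORECASE)


def extract_edges(entry):
    """Return (source_table, target_table, job_id) edges for one logged query."""
    target = TARGET_RE.search(entry["sql"])
    if target is None:
        return []  # not a write query, so no lineage edge to record
    sources = sorted(set(SOURCE_RE.findall(entry["sql"])))
    return [(src, target.group(1), entry["job_id"]) for src in sources]


edges = [edge for entry in QUERY_LOG for edge in extract_edges(entry)]
```

Attaching the job ID to each edge is what lets you later tie a table-level flow back to the specific pipeline run that produced it.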
For AI systems, lineage tracking means recording the relationships between datasets, models, and workflows. These systems construct detailed lineage graphs by integrating runtime signals from libraries like PyTorch and workflow engines like FBLearner Flow.
Identifying Relevant Data Flows
After building comprehensive lineage graphs, Meta needed a way to focus on specific data flows, like those involving religious views.
They developed an iterative analysis tool that allows developers to filter and refine these graphs efficiently. This tool uses a process of discovery, exclusion, and iteration to identify the most relevant flows.
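Meta doesn't publish the internals of this tool, so the following is only my guess at the general shape of a discover, exclude, and iterate loop. The graph and asset names are hypothetical. The key detail is that excluding an asset also prunes everything that is reachable only through it.

```python
from collections import deque

# Hypothetical lineage graph: asset -> direct downstream assets.
GRAPH = {
    "religion_input": ["profile_db", "request_log"],
    "profile_db": ["match_features"],
    "request_log": ["latency_dashboard"],
    "match_features": ["match_model"],
}


def discover(graph, sources, excluded=frozenset()):
    """Return assets reachable from `sources`, never walking through excluded ones."""
    seen = set()
    queue = deque(s for s in sources if s not in excluded)
    while queue:
        node = queue.popleft()
        for nxt in graph.get(node, []):
            if nxt not in seen and nxt not in excluded:
                seen.add(nxt)
                queue.append(nxt)
    return seen


# Iteration 1: discover everything downstream of the sensitive input.
first_pass = discover(GRAPH, ["religion_input"])
# Suppose review shows request_log only stores timing data, so it is excluded.
# Iteration 2: rerun with the exclusion and get a tighter candidate set.
second_pass = discover(GRAPH, ["religion_input"], excluded={"request_log"})
```

The first pass returns five assets; after excluding request_log, the second pass drops both the log and the dashboard fed by it, leaving three assets that actually need privacy controls.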
How It Helps
The result? Developers can now confidently trace granular data flows and implement privacy controls quickly. This has significantly reduced the time and effort required to ensure compliance while maintaining Meta’s commitment to user privacy.
Lessons We Can Learn
Start Thinking About Data Lineage Early
I believe data lineage isn’t just for large companies. Even smaller teams can benefit from building lineage into their processes early. As your data ecosystem grows, having this foundation will save countless hours of debugging and compliance headaches.
Implementing Data Lineage
If you’re not working at Meta’s scale, start small. Tools like dbt lineage or metadata platforms like DataHub offer a solid foundation. If these tools fall short, consider Meta’s approach of embedding tracking logic into the code. Just remember, starting simple and iterating gradually will always outperform building a complex system that doesn’t fit your organization.
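If you go the "embed tracking in the code" route, one lightweight starting point is a decorator that records which datasets a function reads and writes. This is my own minimal sketch, not Meta's implementation: the decorator name tracks_lineage, the in-memory LINEAGE_EDGES list, and the dataset names are all hypothetical.

```python
import functools

# In a real setup these edges would be sent to a metadata store instead.
LINEAGE_EDGES = []


def tracks_lineage(reads, writes):
    """Decorator that records read -> write edges each time the function runs."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            result = func(*args, **kwargs)
            # Record edges only after the function succeeds.
            for src in reads:
                for dst in writes:
                    edge = (src, dst, func.__name__)
                    if edge not in LINEAGE_EDGES:
                        LINEAGE_EDGES.append(edge)
            return result
        return wrapper
    return decorator


@tracks_lineage(reads=["raw_orders"], writes=["daily_revenue"])
def build_daily_revenue(orders):
    """Toy transformation standing in for a real pipeline step."""
    return sum(order["amount"] for order in orders)


total = build_daily_revenue([{"amount": 10}, {"amount": 5}])
```

The declared reads and writes are only as accurate as the person who wrote them, which is why Meta pairs this kind of signal with static and runtime analysis rather than relying on annotations alone.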
Lineage Graphs Alone Aren’t Enough
Meta’s case study also highlights an important point:
Simply having a lineage graph isn’t enough. You need tools that empower end-users to interact with and extract actionable insights from these graphs.
Start by leveraging existing interfaces from tools like dbt documentation or DataHub UI/API. Use these as a foundation to gather user feedback and iteratively enhance or customize solutions. This iterative approach ensures the tools meet user needs effectively, maximizing the value of your lineage data.
Measure and Iterate
Data lineage, like any engineering effort, benefits from continuous improvement. Regularly measure the effectiveness of your lineage tools and processes, and iterate based on feedback.
Outro
Above are my notes after learning how Meta does data lineage at a mega scale.
Meta’s journey shows that lineage at scale comes from combining techniques: collecting data flow signals through static and runtime analysis, assembling them into lineage graphs, and giving developers tools to filter those graphs down to the flows that matter. Their approach provides valuable lessons for teams of all sizes.
As you reflect on these insights, consider how your organization handles data lineage. Are there gaps you can address? Tools you can adopt? Starting today can lead you to smoother operations and stronger compliance.
I’d love to hear from you if this has sparked ideas or questions.
Reference
[1] Facebook Engineering Blog, How Meta discovers data flows via lineage at scale (2025)