<?xml version="1.0" encoding="UTF-8"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0" xmlns:itunes="http://www.itunes.com/dtds/podcast-1.0.dtd" xmlns:googleplay="http://www.google.com/schemas/play-podcasts/1.0"><channel><title><![CDATA[VuTrinh.: Dimensions.]]></title><description><![CDATA[My blog-style writing about what I've learned from people smarter than me.]]></description><link>https://vutr.substack.com/s/dimensions</link><image><url>https://substackcdn.com/image/fetch/$s_!2JXp!,w_256,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffb91e59e-4df2-4ac8-b3b4-1f4e830f7b42_256x256.png</url><title>VuTrinh.: Dimensions.</title><link>https://vutr.substack.com/s/dimensions</link></image><generator>Substack</generator><lastBuildDate>Mon, 20 Apr 2026 06:10:26 GMT</lastBuildDate><atom:link href="https://vutr.substack.com/feed" rel="self" type="application/rss+xml"/><copyright><![CDATA[Vu Trinh]]></copyright><language><![CDATA[en]]></language><webMaster><![CDATA[vutr27@substack.com]]></webMaster><itunes:owner><itunes:email><![CDATA[vutr27@substack.com]]></itunes:email><itunes:name><![CDATA[Vu Trinh]]></itunes:name></itunes:owner><itunes:author><![CDATA[Vu Trinh]]></itunes:author><googleplay:owner><![CDATA[vutr27@substack.com]]></googleplay:owner><googleplay:email><![CDATA[vutr27@substack.com]]></googleplay:email><googleplay:author><![CDATA[Vu Trinh]]></googleplay:author><itunes:block><![CDATA[Yes]]></itunes:block><item><title><![CDATA[I spent 5 hours learning Unity Catalog. 
Here’s everything you need to know.]]></title><description><![CDATA[The famous catalog service from Databricks, and it was open-sourced]]></description><link>https://vutr.substack.com/p/i-spent-5-hours-learning-unity-catalog</link><guid isPermaLink="false">https://vutr.substack.com/p/i-spent-5-hours-learning-unity-catalog</guid><dc:creator><![CDATA[Vu Trinh]]></dc:creator><pubDate>Wed, 21 Jan 2026 05:15:26 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!IJhp!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb1af4045-1dfe-4d71-a7ca-3e232fcc8122_2000x1429.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" 
href="https://substackcdn.com/image/fetch/$s_!IJhp!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb1af4045-1dfe-4d71-a7ca-3e232fcc8122_2000x1429.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!IJhp!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb1af4045-1dfe-4d71-a7ca-3e232fcc8122_2000x1429.png 424w, https://substackcdn.com/image/fetch/$s_!IJhp!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb1af4045-1dfe-4d71-a7ca-3e232fcc8122_2000x1429.png 848w, https://substackcdn.com/image/fetch/$s_!IJhp!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb1af4045-1dfe-4d71-a7ca-3e232fcc8122_2000x1429.png 1272w, https://substackcdn.com/image/fetch/$s_!IJhp!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb1af4045-1dfe-4d71-a7ca-3e232fcc8122_2000x1429.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!IJhp!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb1af4045-1dfe-4d71-a7ca-3e232fcc8122_2000x1429.png" width="1456" height="1040" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/b1af4045-1dfe-4d71-a7ca-3e232fcc8122_2000x1429.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1040,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:336781,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://vutr.substack.com/i/182563955?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb1af4045-1dfe-4d71-a7ca-3e232fcc8122_2000x1429.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!IJhp!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb1af4045-1dfe-4d71-a7ca-3e232fcc8122_2000x1429.png 424w, https://substackcdn.com/image/fetch/$s_!IJhp!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb1af4045-1dfe-4d71-a7ca-3e232fcc8122_2000x1429.png 848w, https://substackcdn.com/image/fetch/$s_!IJhp!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb1af4045-1dfe-4d71-a7ca-3e232fcc8122_2000x1429.png 1272w, https://substackcdn.com/image/fetch/$s_!IJhp!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb1af4045-1dfe-4d71-a7ca-3e232fcc8122_2000x1429.png 1456w" sizes="100vw" fetchpriority="high"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" 
height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><div><hr></div><h2>Intro</h2><p>In any database, a catalog is a central repository that stores metadata for all database objects, such as tables, columns, views, users, and relationships, acting as a directory to help the system find, understand, and manage data.</p><p>In the lakehouse world, there is also a concept of a catalog.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!uBbl!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9b177046-639b-4814-9175-ba4ad060a9e9_1330x568.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" 
srcset="https://substackcdn.com/image/fetch/$s_!uBbl!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9b177046-639b-4814-9175-ba4ad060a9e9_1330x568.png 424w, https://substackcdn.com/image/fetch/$s_!uBbl!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9b177046-639b-4814-9175-ba4ad060a9e9_1330x568.png 848w, https://substackcdn.com/image/fetch/$s_!uBbl!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9b177046-639b-4814-9175-ba4ad060a9e9_1330x568.png 1272w, https://substackcdn.com/image/fetch/$s_!uBbl!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9b177046-639b-4814-9175-ba4ad060a9e9_1330x568.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!uBbl!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9b177046-639b-4814-9175-ba4ad060a9e9_1330x568.png" width="683" height="291.6872180451128" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/9b177046-639b-4814-9175-ba4ad060a9e9_1330x568.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:568,&quot;width&quot;:1330,&quot;resizeWidth&quot;:683,&quot;bytes&quot;:111388,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://vutr.substack.com/i/182563955?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9b177046-639b-4814-9175-ba4ad060a9e9_1330x568.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" 
class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!uBbl!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9b177046-639b-4814-9175-ba4ad060a9e9_1330x568.png 424w, https://substackcdn.com/image/fetch/$s_!uBbl!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9b177046-639b-4814-9175-ba4ad060a9e9_1330x568.png 848w, https://substackcdn.com/image/fetch/$s_!uBbl!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9b177046-639b-4814-9175-ba4ad060a9e9_1330x568.png 1272w, https://substackcdn.com/image/fetch/$s_!uBbl!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9b177046-639b-4814-9175-ba4ad060a9e9_1330x568.png 1456w" sizes="100vw"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" 
stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>It is the central metadata layer: a unified directory for discovering, governing, and managing data across diverse sources (such as data lakes and warehouses). It tracks tables, schemas, and access rules, so analytic engines and AI models can access data without moving it.</p><p>Databricks, the vendor that publicly introduced the lakehouse concept, has built a robust catalog service on top of its offering.</p><p>It&#8217;s called Unity Catalog. It was a proprietary Databricks product for a long time before the vendor open-sourced it in 2024.</p><p>In this week&#8217;s article, I share my insights on Unity Catalog after reading the Databricks paper, <a href="https://dl.acm.org/doi/abs/10.1145/3722212.3724459">Unity Catalog: Open and Universal Governance for the Lakehouse and Beyond.</a></p><div><hr></div><h2>Problems Unity Catalog was trying to solve</h2><p>According to the paper, Unity Catalog (UC) has been serving 9,000 Databricks customers since 2021, with some interesting numbers: 100 million tables under management, 400,000 machine learning (ML) models, and 60,000 API calls per second (yes, you read that right, per second). A good piece of software always starts with the user&#8217;s problems in mind. 
Here are the challenges that Unity Catalog was designed to address:</p><ul><li><p><strong>Uniform access control: </strong>A secure way for lakehouse users to access assets (e.g., tables or ML models) through either path: via the catalog or via a direct cloud storage path.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!-3Po!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5a948b78-f6ea-4ae3-92b3-e6d8687f3205_930x466.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!-3Po!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5a948b78-f6ea-4ae3-92b3-e6d8687f3205_930x466.png 424w, https://substackcdn.com/image/fetch/$s_!-3Po!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5a948b78-f6ea-4ae3-92b3-e6d8687f3205_930x466.png 848w, https://substackcdn.com/image/fetch/$s_!-3Po!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5a948b78-f6ea-4ae3-92b3-e6d8687f3205_930x466.png 1272w, https://substackcdn.com/image/fetch/$s_!-3Po!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5a948b78-f6ea-4ae3-92b3-e6d8687f3205_930x466.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!-3Po!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5a948b78-f6ea-4ae3-92b3-e6d8687f3205_930x466.png" width="530" height="265.5698924731183" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/5a948b78-f6ea-4ae3-92b3-e6d8687f3205_930x466.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:466,&quot;width&quot;:930,&quot;resizeWidth&quot;:530,&quot;bytes&quot;:64509,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://vutr.substack.com/i/182563955?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5a948b78-f6ea-4ae3-92b3-e6d8687f3205_930x466.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!-3Po!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5a948b78-f6ea-4ae3-92b3-e6d8687f3205_930x466.png 424w, https://substackcdn.com/image/fetch/$s_!-3Po!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5a948b78-f6ea-4ae3-92b3-e6d8687f3205_930x466.png 848w, https://substackcdn.com/image/fetch/$s_!-3Po!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5a948b78-f6ea-4ae3-92b3-e6d8687f3205_930x466.png 1272w, https://substackcdn.com/image/fetch/$s_!-3Po!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5a948b78-f6ea-4ae3-92b3-e6d8687f3205_930x466.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 
0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div></li><li><p><strong>Support for diverse asset types: </strong>The catalog must support a wide range of assets, adapting to the user&#8217;s needs. In addition to tables, users also need to govern ML models. 
</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!y_Y7!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F046ca6a3-ac53-45fe-9a3c-835a226634de_356x312.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!y_Y7!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F046ca6a3-ac53-45fe-9a3c-835a226634de_356x312.png 424w, https://substackcdn.com/image/fetch/$s_!y_Y7!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F046ca6a3-ac53-45fe-9a3c-835a226634de_356x312.png 848w, https://substackcdn.com/image/fetch/$s_!y_Y7!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F046ca6a3-ac53-45fe-9a3c-835a226634de_356x312.png 1272w, https://substackcdn.com/image/fetch/$s_!y_Y7!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F046ca6a3-ac53-45fe-9a3c-835a226634de_356x312.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!y_Y7!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F046ca6a3-ac53-45fe-9a3c-835a226634de_356x312.png" width="258" height="226.1123595505618" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/046ca6a3-ac53-45fe-9a3c-835a226634de_356x312.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:312,&quot;width&quot;:356,&quot;resizeWidth&quot;:258,&quot;bytes&quot;:18754,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://vutr.substack.com/i/182563955?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F046ca6a3-ac53-45fe-9a3c-835a226634de_356x312.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!y_Y7!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F046ca6a3-ac53-45fe-9a3c-835a226634de_356x312.png 424w, https://substackcdn.com/image/fetch/$s_!y_Y7!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F046ca6a3-ac53-45fe-9a3c-835a226634de_356x312.png 848w, https://substackcdn.com/image/fetch/$s_!y_Y7!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F046ca6a3-ac53-45fe-9a3c-835a226634de_356x312.png 1272w, https://substackcdn.com/image/fetch/$s_!y_Y7!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F046ca6a3-ac53-45fe-9a3c-835a226634de_356x312.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a></figure></div></li><li><p><strong>External access: </strong>Users need to share data through the catalog without copying it, because copies raise storage costs and add the complexity of keeping the two copies in sync.
</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!K2qS!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3a14c29d-5cdc-4fd6-82eb-a8a79dc499cc_844x310.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!K2qS!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3a14c29d-5cdc-4fd6-82eb-a8a79dc499cc_844x310.png 424w, https://substackcdn.com/image/fetch/$s_!K2qS!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3a14c29d-5cdc-4fd6-82eb-a8a79dc499cc_844x310.png 848w, https://substackcdn.com/image/fetch/$s_!K2qS!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3a14c29d-5cdc-4fd6-82eb-a8a79dc499cc_844x310.png 1272w, https://substackcdn.com/image/fetch/$s_!K2qS!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3a14c29d-5cdc-4fd6-82eb-a8a79dc499cc_844x310.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!K2qS!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3a14c29d-5cdc-4fd6-82eb-a8a79dc499cc_844x310.png" width="586" height="215.23696682464455" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/3a14c29d-5cdc-4fd6-82eb-a8a79dc499cc_844x310.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:310,&quot;width&quot;:844,&quot;resizeWidth&quot;:586,&quot;bytes&quot;:63712,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://vutr.substack.com/i/182563955?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3a14c29d-5cdc-4fd6-82eb-a8a79dc499cc_844x310.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!K2qS!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3a14c29d-5cdc-4fd6-82eb-a8a79dc499cc_844x310.png 424w, https://substackcdn.com/image/fetch/$s_!K2qS!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3a14c29d-5cdc-4fd6-82eb-a8a79dc499cc_844x310.png 848w, https://substackcdn.com/image/fetch/$s_!K2qS!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3a14c29d-5cdc-4fd6-82eb-a8a79dc499cc_844x310.png 1272w, https://substackcdn.com/image/fetch/$s_!K2qS!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3a14c29d-5cdc-4fd6-82eb-a8a79dc499cc_844x310.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a></figure></div></li><li><p><strong>Discovery support: </strong>The assets must be discoverable and understandable. 
Their lifecycle and lineage must also be transparent.</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!JGad!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe1aca805-6183-4c44-8dbc-70dec42aa8a8_490x296.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!JGad!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe1aca805-6183-4c44-8dbc-70dec42aa8a8_490x296.png 424w, https://substackcdn.com/image/fetch/$s_!JGad!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe1aca805-6183-4c44-8dbc-70dec42aa8a8_490x296.png 848w, https://substackcdn.com/image/fetch/$s_!JGad!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe1aca805-6183-4c44-8dbc-70dec42aa8a8_490x296.png 1272w, https://substackcdn.com/image/fetch/$s_!JGad!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe1aca805-6183-4c44-8dbc-70dec42aa8a8_490x296.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!JGad!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe1aca805-6183-4c44-8dbc-70dec42aa8a8_490x296.png" width="320" height="193.30612244897958" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/e1aca805-6183-4c44-8dbc-70dec42aa8a8_490x296.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:296,&quot;width&quot;:490,&quot;resizeWidth&quot;:320,&quot;bytes&quot;:51786,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://vutr.substack.com/i/182563955?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe1aca805-6183-4c44-8dbc-70dec42aa8a8_490x296.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!JGad!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe1aca805-6183-4c44-8dbc-70dec42aa8a8_490x296.png 424w, https://substackcdn.com/image/fetch/$s_!JGad!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe1aca805-6183-4c44-8dbc-70dec42aa8a8_490x296.png 848w, https://substackcdn.com/image/fetch/$s_!JGad!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe1aca805-6183-4c44-8dbc-70dec42aa8a8_490x296.png 1272w, https://substackcdn.com/image/fetch/$s_!JGad!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe1aca805-6183-4c44-8dbc-70dec42aa8a8_490x296.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a></figure></div></li><li><p><strong>Performance: </strong>Users don&#8217;t want it to be slow, even though it only involves metadata-related operations.</p><div class="captioned-image-container"><figure><a 
class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!6Z89!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff9d19039-78a2-427d-a81e-a1715cfe49fc_884x418.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!6Z89!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff9d19039-78a2-427d-a81e-a1715cfe49fc_884x418.png 424w, https://substackcdn.com/image/fetch/$s_!6Z89!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff9d19039-78a2-427d-a81e-a1715cfe49fc_884x418.png 848w, https://substackcdn.com/image/fetch/$s_!6Z89!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff9d19039-78a2-427d-a81e-a1715cfe49fc_884x418.png 1272w, https://substackcdn.com/image/fetch/$s_!6Z89!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff9d19039-78a2-427d-a81e-a1715cfe49fc_884x418.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!6Z89!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff9d19039-78a2-427d-a81e-a1715cfe49fc_884x418.png" width="426" height="201.4343891402715" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/f9d19039-78a2-427d-a81e-a1715cfe49fc_884x418.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:418,&quot;width&quot;:884,&quot;resizeWidth&quot;:426,&quot;bytes&quot;:50213,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://vutr.substack.com/i/182563955?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff9d19039-78a2-427d-a81e-a1715cfe49fc_884x418.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!6Z89!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff9d19039-78a2-427d-a81e-a1715cfe49fc_884x418.png 424w, https://substackcdn.com/image/fetch/$s_!6Z89!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff9d19039-78a2-427d-a81e-a1715cfe49fc_884x418.png 848w, https://substackcdn.com/image/fetch/$s_!6Z89!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff9d19039-78a2-427d-a81e-a1715cfe49fc_884x418.png 1272w, https://substackcdn.com/image/fetch/$s_!6Z89!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff9d19039-78a2-427d-a81e-a1715cfe49fc_884x418.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a></figure></div></li></ul><div><hr></div><h2>How does Unity Catalog fit into the Databricks lakehouse architecture?</h2><p>The architecture consists of three key components:</p><div 
class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!x0K1!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F403cffe4-7380-4ed1-931a-46b309f54709_1300x820.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!x0K1!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F403cffe4-7380-4ed1-931a-46b309f54709_1300x820.png 424w, https://substackcdn.com/image/fetch/$s_!x0K1!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F403cffe4-7380-4ed1-931a-46b309f54709_1300x820.png 848w, https://substackcdn.com/image/fetch/$s_!x0K1!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F403cffe4-7380-4ed1-931a-46b309f54709_1300x820.png 1272w, https://substackcdn.com/image/fetch/$s_!x0K1!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F403cffe4-7380-4ed1-931a-46b309f54709_1300x820.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!x0K1!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F403cffe4-7380-4ed1-931a-46b309f54709_1300x820.png" width="1300" height="820" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/403cffe4-7380-4ed1-931a-46b309f54709_1300x820.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:820,&quot;width&quot;:1300,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:147862,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://vutr.substack.com/i/182563955?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F403cffe4-7380-4ed1-931a-46b309f54709_1300x820.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!x0K1!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F403cffe4-7380-4ed1-931a-46b309f54709_1300x820.png 424w, https://substackcdn.com/image/fetch/$s_!x0K1!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F403cffe4-7380-4ed1-931a-46b309f54709_1300x820.png 848w, https://substackcdn.com/image/fetch/$s_!x0K1!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F403cffe4-7380-4ed1-931a-46b309f54709_1300x820.png 1272w, https://substackcdn.com/image/fetch/$s_!x0K1!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F403cffe4-7380-4ed1-931a-46b309f54709_1300x820.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" 
viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><ul><li><p><strong>The storage</strong>: As you might know, the lakehouse paradigm separates storage from compute for flexibility and interoperability. The storage is typically an object storage service from a major cloud vendor such as AWS or Google. In addition to the physical data stored in Parquet, a metadata layer on top of it provides the table abstraction. Delta Lake and Iceberg are used for this purpose.</p></li><li><p><strong>The Databricks Runtime (DBR)</strong>: Databricks&#8217; core execution engine, a <a href="https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/fork-a-repo">fork</a> of Apache Spark that provides the same interface but adds enhancements for reliability and performance. 
Databricks also built the Photon engine, a library that integrates closely with DBR to further speed up Spark analytical workloads. The engine acts as a new set of physical operators inside DBR, and a query plan can use these operators like any other Spark operator. Databricks&#8217; customers can continue to run their workloads without any changes and still benefit from Photon.</p></li></ul><blockquote><p><em>Here is my previous article about <a href="https://open.substack.com/pub/vutr/p/how-is-databricks-spark-different?utm_campaign=post-expanded-share&amp;utm_medium=web">How is Databricks&#8217; Spark different from Open-Source Spark?</a></em></p></blockquote><ul><li><p><strong>The Unity Catalog service</strong>: And finally, today&#8217;s primary focus, the Unity Catalog service. It&#8217;s a multi-tenant service that provides all UC functionality and APIs. Clients (in most cases, the query engines) call this service to deliver what the user needs.</p></li></ul><div><hr></div><h2>The SQL query&#8217;s journey</h2><p>Now that we have had a glimpse of the Databricks lakehouse architecture, let&#8217;s follow the journey of a typical SQL query:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!BZAN!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F602d96a4-9c6a-41fc-8237-566b09e79bce_1026x954.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!BZAN!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F602d96a4-9c6a-41fc-8237-566b09e79bce_1026x954.png 424w, https://substackcdn.com/image/fetch/$s_!BZAN!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F602d96a4-9c6a-41fc-8237-566b09e79bce_1026x954.png 848w, https://substackcdn.com/image/fetch/$s_!BZAN!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F602d96a4-9c6a-41fc-8237-566b09e79bce_1026x954.png 1272w, https://substackcdn.com/image/fetch/$s_!BZAN!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F602d96a4-9c6a-41fc-8237-566b09e79bce_1026x954.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!BZAN!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F602d96a4-9c6a-41fc-8237-566b09e79bce_1026x954.png" width="590" height="548.5964912280701" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/602d96a4-9c6a-41fc-8237-566b09e79bce_1026x954.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:954,&quot;width&quot;:1026,&quot;resizeWidth&quot;:590,&quot;bytes&quot;:173463,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://vutr.substack.com/i/182563955?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F602d96a4-9c6a-41fc-8237-566b09e79bce_1026x954.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!BZAN!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F602d96a4-9c6a-41fc-8237-566b09e79bce_1026x954.png 424w, https://substackcdn.com/image/fetch/$s_!BZAN!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F602d96a4-9c6a-41fc-8237-566b09e79bce_1026x954.png 848w, https://substackcdn.com/image/fetch/$s_!BZAN!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F602d96a4-9c6a-41fc-8237-566b09e79bce_1026x954.png 1272w, https://substackcdn.com/image/fetch/$s_!BZAN!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F602d96a4-9c6a-41fc-8237-566b09e79bce_1026x954.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" 
viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><ul><li><p>When DBR receives a SQL query from the user, it parses the query to extract references to data assets, such as table or view names.</p></li><li><p>DBR then sends a REST request to the Unity Catalog (UC) service to verify that the user has the required permissions on those assets. If the check passes, UC returns the assets&#8217; metadata, such as column definitions and constraints.</p></li><li><p>DBR uses the returned metadata for query planning.</p></li><li><p>When DBR needs to access the data in object storage, it issues another request to fetch a temporary credential.</p></li><li><p>DBR can then access the data using the returned credential.</p></li><li><p>The query can now be processed, from scanning data to filtering, joining, and aggregating. 
The result is then sent back to the user.</p></li></ul><div><hr></div><h2>The designs</h2><p>With a high-level picture of Unity Catalog in place, we will dive into its system design to see how Databricks addressed the challenges discussed at the beginning of the article.</p><h3>The disaggregation of the catalog and engine</h3><p>First, Databricks decided that Unity Catalog would be an independent service, not tied to any specific query engine. This helps with two things:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!DkwA!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F02cde7d2-fb9b-4bca-b2a8-4fd0adfe0346_1028x566.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!DkwA!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F02cde7d2-fb9b-4bca-b2a8-4fd0adfe0346_1028x566.png 424w, https://substackcdn.com/image/fetch/$s_!DkwA!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F02cde7d2-fb9b-4bca-b2a8-4fd0adfe0346_1028x566.png 848w, https://substackcdn.com/image/fetch/$s_!DkwA!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F02cde7d2-fb9b-4bca-b2a8-4fd0adfe0346_1028x566.png 1272w, https://substackcdn.com/image/fetch/$s_!DkwA!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F02cde7d2-fb9b-4bca-b2a8-4fd0adfe0346_1028x566.png 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!DkwA!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F02cde7d2-fb9b-4bca-b2a8-4fd0adfe0346_1028x566.png" width="693" height="381.5544747081712" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/02cde7d2-fb9b-4bca-b2a8-4fd0adfe0346_1028x566.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:566,&quot;width&quot;:1028,&quot;resizeWidth&quot;:693,&quot;bytes&quot;:77904,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://vutr.substack.com/i/182563955?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F02cde7d2-fb9b-4bca-b2a8-4fd0adfe0346_1028x566.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!DkwA!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F02cde7d2-fb9b-4bca-b2a8-4fd0adfe0346_1028x566.png 424w, https://substackcdn.com/image/fetch/$s_!DkwA!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F02cde7d2-fb9b-4bca-b2a8-4fd0adfe0346_1028x566.png 848w, https://substackcdn.com/image/fetch/$s_!DkwA!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F02cde7d2-fb9b-4bca-b2a8-4fd0adfe0346_1028x566.png 1272w, https://substackcdn.com/image/fetch/$s_!DkwA!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F02cde7d2-fb9b-4bca-b2a8-4fd0adfe0346_1028x566.png 1456w" 
sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><ul><li><p><strong>Security:</strong> Unity Catalog can act as a gatekeeper; only users with the required permissions can access assets managed by the catalog.</p></li><li><p><strong>Interoperability </strong>(the spirit of the lakehouse architecture)<strong>: </strong>Different query engines can work with data managed by the catalog. 
Each engine accesses the metadata through the REST API, so the metadata does not have to be replicated to every engine.</p></li></ul><h3>Managing diverse asset types</h3><p>As mentioned, users need Unity Catalog to manage various data asset types in addition to tables.</p><p>The foundation of this ability is the entity-relationship (ER) data model, which backs all metadata operations.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!pprJ!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbd27996e-adf6-43fe-83a9-9ae0d38edf52_1226x758.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!pprJ!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbd27996e-adf6-43fe-83a9-9ae0d38edf52_1226x758.png 424w, https://substackcdn.com/image/fetch/$s_!pprJ!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbd27996e-adf6-43fe-83a9-9ae0d38edf52_1226x758.png 848w, https://substackcdn.com/image/fetch/$s_!pprJ!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbd27996e-adf6-43fe-83a9-9ae0d38edf52_1226x758.png 1272w, https://substackcdn.com/image/fetch/$s_!pprJ!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbd27996e-adf6-43fe-83a9-9ae0d38edf52_1226x758.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!pprJ!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbd27996e-adf6-43fe-83a9-9ae0d38edf52_1226x758.png" width="530" 
height="327.68352365415984" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/bd27996e-adf6-43fe-83a9-9ae0d38edf52_1226x758.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:758,&quot;width&quot;:1226,&quot;resizeWidth&quot;:530,&quot;bytes&quot;:142015,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://vutr.substack.com/i/182563955?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbd27996e-adf6-43fe-83a9-9ae0d38edf52_1226x758.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!pprJ!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbd27996e-adf6-43fe-83a9-9ae0d38edf52_1226x758.png 424w, https://substackcdn.com/image/fetch/$s_!pprJ!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbd27996e-adf6-43fe-83a9-9ae0d38edf52_1226x758.png 848w, https://substackcdn.com/image/fetch/$s_!pprJ!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbd27996e-adf6-43fe-83a9-9ae0d38edf52_1226x758.png 1272w, https://substackcdn.com/image/fetch/$s_!pprJ!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbd27996e-adf6-43fe-83a9-9ae0d38edf52_1226x758.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg 
role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>The model exposes methods for tasks such as ID- or name-based lookup, parent-child relationship mapping, privilege grant management, and state management for resource provisioning and cleanup. Developers can extend the model for a specific asset type. This model is materialized in a relational database.</p><p>On top of the model is the adapter layer. 
The primary responsibility of this layer is to offer integration points for different asset types with various cloud providers.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!SCKM!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd0173e78-9ad6-4aa8-8c01-8f4456a6df68_1718x970.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!SCKM!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd0173e78-9ad6-4aa8-8c01-8f4456a6df68_1718x970.png 424w, https://substackcdn.com/image/fetch/$s_!SCKM!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd0173e78-9ad6-4aa8-8c01-8f4456a6df68_1718x970.png 848w, https://substackcdn.com/image/fetch/$s_!SCKM!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd0173e78-9ad6-4aa8-8c01-8f4456a6df68_1718x970.png 1272w, https://substackcdn.com/image/fetch/$s_!SCKM!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd0173e78-9ad6-4aa8-8c01-8f4456a6df68_1718x970.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!SCKM!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd0173e78-9ad6-4aa8-8c01-8f4456a6df68_1718x970.png" width="622" height="351.1565934065934" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/d0173e78-9ad6-4aa8-8c01-8f4456a6df68_1718x970.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:822,&quot;width&quot;:1456,&quot;resizeWidth&quot;:622,&quot;bytes&quot;:212987,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://vutr.substack.com/i/182563955?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd0173e78-9ad6-4aa8-8c01-8f4456a6df68_1718x970.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!SCKM!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd0173e78-9ad6-4aa8-8c01-8f4456a6df68_1718x970.png 424w, https://substackcdn.com/image/fetch/$s_!SCKM!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd0173e78-9ad6-4aa8-8c01-8f4456a6df68_1718x970.png 848w, https://substackcdn.com/image/fetch/$s_!SCKM!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd0173e78-9ad6-4aa8-8c01-8f4456a6df68_1718x970.png 1272w, https://substackcdn.com/image/fetch/$s_!SCKM!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd0173e78-9ad6-4aa8-8c01-8f4456a6df68_1718x970.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" 
viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>A developer can register an asset type in the catalog by adding a manifest to UC&#8217;s asset types registry. 
The manifest includes information such as the asset type&#8217;s location in the model hierarchy, the supported operations and privileges, the rules associated with each operation, and how the catalog will manage the asset lifecycle.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!ir7p!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F58c35891-6dfb-466a-b277-69e26c6fb59a_1854x908.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!ir7p!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F58c35891-6dfb-466a-b277-69e26c6fb59a_1854x908.png 424w, https://substackcdn.com/image/fetch/$s_!ir7p!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F58c35891-6dfb-466a-b277-69e26c6fb59a_1854x908.png 848w, https://substackcdn.com/image/fetch/$s_!ir7p!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F58c35891-6dfb-466a-b277-69e26c6fb59a_1854x908.png 1272w, https://substackcdn.com/image/fetch/$s_!ir7p!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F58c35891-6dfb-466a-b277-69e26c6fb59a_1854x908.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!ir7p!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F58c35891-6dfb-466a-b277-69e26c6fb59a_1854x908.png" width="1456" height="713" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/58c35891-6dfb-466a-b277-69e26c6fb59a_1854x908.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:713,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:178368,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://vutr.substack.com/i/182563955?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F58c35891-6dfb-466a-b277-69e26c6fb59a_1854x908.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!ir7p!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F58c35891-6dfb-466a-b277-69e26c6fb59a_1854x908.png 424w, https://substackcdn.com/image/fetch/$s_!ir7p!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F58c35891-6dfb-466a-b277-69e26c6fb59a_1854x908.png 848w, https://substackcdn.com/image/fetch/$s_!ir7p!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F58c35891-6dfb-466a-b277-69e26c6fb59a_1854x908.png 1272w, https://substackcdn.com/image/fetch/$s_!ir7p!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F58c35891-6dfb-466a-b277-69e26c6fb59a_1854x908.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" 
viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>In this way, multiple asset types can be added to the catalog.</p><p>Directly above the adapter sits the core features layer, which provides capabilities such as namespace management, lifecycle management, access control, and audit logging.</p><p>This layer exposes APIs for metadata management and credential requests.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!Uqh2!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fab919c20-0a04-4fda-a006-d349e71a9fee_1026x680.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" 
srcset="https://substackcdn.com/image/fetch/$s_!Uqh2!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fab919c20-0a04-4fda-a006-d349e71a9fee_1026x680.png 424w, https://substackcdn.com/image/fetch/$s_!Uqh2!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fab919c20-0a04-4fda-a006-d349e71a9fee_1026x680.png 848w, https://substackcdn.com/image/fetch/$s_!Uqh2!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fab919c20-0a04-4fda-a006-d349e71a9fee_1026x680.png 1272w, https://substackcdn.com/image/fetch/$s_!Uqh2!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fab919c20-0a04-4fda-a006-d349e71a9fee_1026x680.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!Uqh2!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fab919c20-0a04-4fda-a006-d349e71a9fee_1026x680.png" width="500" height="331.3840155945419" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/ab919c20-0a04-4fda-a006-d349e71a9fee_1026x680.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:680,&quot;width&quot;:1026,&quot;resizeWidth&quot;:500,&quot;bytes&quot;:112618,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://vutr.substack.com/i/182563955?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fab919c20-0a04-4fda-a006-d349e71a9fee_1026x680.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" 
class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!Uqh2!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fab919c20-0a04-4fda-a006-d349e71a9fee_1026x680.png 424w, https://substackcdn.com/image/fetch/$s_!Uqh2!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fab919c20-0a04-4fda-a006-d349e71a9fee_1026x680.png 848w, https://substackcdn.com/image/fetch/$s_!Uqh2!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fab919c20-0a04-4fda-a006-d349e71a9fee_1026x680.png 1272w, https://substackcdn.com/image/fetch/$s_!Uqh2!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fab919c20-0a04-4fda-a006-d349e71a9fee_1026x680.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" 
stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><h3>Access control</h3><p>Databricks wants to ensure that data access is secure, regardless of how it is accessed: via the catalog or directly via the object storage path.</p><p>First, Databricks protects all the assets at two distinct levels: metadata and data. Metadata security is handled directly through UC&#8217;s REST APIs, where the system validates a user's permissions based on the specific operation being performed.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!bWhp!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F84cd3879-23bc-4dfc-ad74-d54f8b88d8c5_1122x478.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!bWhp!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F84cd3879-23bc-4dfc-ad74-d54f8b88d8c5_1122x478.png 424w, https://substackcdn.com/image/fetch/$s_!bWhp!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F84cd3879-23bc-4dfc-ad74-d54f8b88d8c5_1122x478.png 848w, https://substackcdn.com/image/fetch/$s_!bWhp!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F84cd3879-23bc-4dfc-ad74-d54f8b88d8c5_1122x478.png 1272w, 
https://substackcdn.com/image/fetch/$s_!bWhp!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F84cd3879-23bc-4dfc-ad74-d54f8b88d8c5_1122x478.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!bWhp!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F84cd3879-23bc-4dfc-ad74-d54f8b88d8c5_1122x478.png" width="572" height="243.68627450980392" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/84cd3879-23bc-4dfc-ad74-d54f8b88d8c5_1122x478.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:478,&quot;width&quot;:1122,&quot;resizeWidth&quot;:572,&quot;bytes&quot;:127632,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://vutr.substack.com/i/182563955?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F84cd3879-23bc-4dfc-ad74-d54f8b88d8c5_1122x478.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!bWhp!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F84cd3879-23bc-4dfc-ad74-d54f8b88d8c5_1122x478.png 424w, https://substackcdn.com/image/fetch/$s_!bWhp!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F84cd3879-23bc-4dfc-ad74-d54f8b88d8c5_1122x478.png 848w, 
https://substackcdn.com/image/fetch/$s_!bWhp!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F84cd3879-23bc-4dfc-ad74-d54f8b88d8c5_1122x478.png 1272w, https://substackcdn.com/image/fetch/$s_!bWhp!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F84cd3879-23bc-4dfc-ad74-d54f8b88d8c5_1122x478.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>Rather than giving clients direct access to cloud storage, administrators grant storage permissions exclusively to the Unity Catalog 
service. When a client needs to read or write data, it requests temporary credentials from the UC.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!5VLF!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd281d211-6105-4aa4-b779-4bc0fdeaf6f4_1474x668.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!5VLF!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd281d211-6105-4aa4-b779-4bc0fdeaf6f4_1474x668.png 424w, https://substackcdn.com/image/fetch/$s_!5VLF!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd281d211-6105-4aa4-b779-4bc0fdeaf6f4_1474x668.png 848w, https://substackcdn.com/image/fetch/$s_!5VLF!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd281d211-6105-4aa4-b779-4bc0fdeaf6f4_1474x668.png 1272w, https://substackcdn.com/image/fetch/$s_!5VLF!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd281d211-6105-4aa4-b779-4bc0fdeaf6f4_1474x668.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!5VLF!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd281d211-6105-4aa4-b779-4bc0fdeaf6f4_1474x668.png" width="1456" height="660" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/d281d211-6105-4aa4-b779-4bc0fdeaf6f4_1474x668.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:660,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:141374,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://vutr.substack.com/i/182563955?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd281d211-6105-4aa4-b779-4bc0fdeaf6f4_1474x668.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!5VLF!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd281d211-6105-4aa4-b779-4bc0fdeaf6f4_1474x668.png 424w, https://substackcdn.com/image/fetch/$s_!5VLF!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd281d211-6105-4aa4-b779-4bc0fdeaf6f4_1474x668.png 848w, https://substackcdn.com/image/fetch/$s_!5VLF!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd281d211-6105-4aa4-b779-4bc0fdeaf6f4_1474x668.png 1272w, https://substackcdn.com/image/fetch/$s_!5VLF!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd281d211-6105-4aa4-b779-4bc0fdeaf6f4_1474x668.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" 
viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>If the engine accesses data by its object storage path, UC resolves the path to an asset identifier, checks whether the client has sufficient permissions, and only then returns a temporary credential.</p><p>For finer-grained access control, such as row- or column-level security, UC works with the engine to enforce access at two levels: UC protects the table&#8217;s metadata and data through the mechanisms above, while the engine is responsible for enforcing row and column security. 
</p><h3>A discoverable catalog</h3><p>Recall from the &#8220;Diverse Asset Types management&#8221; section that Unity Catalog has a layered architecture: the bottom layer is the entity-relation model stored in the relational database, followed by the core features layer and the API layer.</p><p>On top of the API layer sits a layer of background functions, such as discovery, search, and data lineage. These background services must be notified and updated whenever the metadata changes.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!e6G4!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fab22adb5-9002-44ec-a833-d94dbfdfbe74_1028x596.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!e6G4!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fab22adb5-9002-44ec-a833-d94dbfdfbe74_1028x596.png 424w, https://substackcdn.com/image/fetch/$s_!e6G4!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fab22adb5-9002-44ec-a833-d94dbfdfbe74_1028x596.png 848w, https://substackcdn.com/image/fetch/$s_!e6G4!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fab22adb5-9002-44ec-a833-d94dbfdfbe74_1028x596.png 1272w, https://substackcdn.com/image/fetch/$s_!e6G4!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fab22adb5-9002-44ec-a833-d94dbfdfbe74_1028x596.png 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!e6G4!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fab22adb5-9002-44ec-a833-d94dbfdfbe74_1028x596.png" width="1028" height="596" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/ab22adb5-9002-44ec-a833-d94dbfdfbe74_1028x596.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:596,&quot;width&quot;:1028,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:164283,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://vutr.substack.com/i/182563955?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fab22adb5-9002-44ec-a833-d94dbfdfbe74_1028x596.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!e6G4!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fab22adb5-9002-44ec-a833-d94dbfdfbe74_1028x596.png 424w, https://substackcdn.com/image/fetch/$s_!e6G4!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fab22adb5-9002-44ec-a833-d94dbfdfbe74_1028x596.png 848w, https://substackcdn.com/image/fetch/$s_!e6G4!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fab22adb5-9002-44ec-a833-d94dbfdfbe74_1028x596.png 1272w, https://substackcdn.com/image/fetch/$s_!e6G4!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fab22adb5-9002-44ec-a833-d94dbfdfbe74_1028x596.png 1456w" 
sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>Whenever changes occur, the core service sends change events so the background functions can consume them to update the indexes or lineage graphs. 
Databricks adopts this event-driven architecture to ensure that background functions stay up to date with the core service.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!McYF!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F36e5ea9b-58d3-4d61-9a0f-5779a323e467_1276x886.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!McYF!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F36e5ea9b-58d3-4d61-9a0f-5779a323e467_1276x886.png 424w, https://substackcdn.com/image/fetch/$s_!McYF!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F36e5ea9b-58d3-4d61-9a0f-5779a323e467_1276x886.png 848w, https://substackcdn.com/image/fetch/$s_!McYF!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F36e5ea9b-58d3-4d61-9a0f-5779a323e467_1276x886.png 1272w, https://substackcdn.com/image/fetch/$s_!McYF!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F36e5ea9b-58d3-4d61-9a0f-5779a323e467_1276x886.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!McYF!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F36e5ea9b-58d3-4d61-9a0f-5779a323e467_1276x886.png" width="588" height="408.282131661442" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/36e5ea9b-58d3-4d61-9a0f-5779a323e467_1276x886.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:886,&quot;width&quot;:1276,&quot;resizeWidth&quot;:588,&quot;bytes&quot;:157120,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://vutr.substack.com/i/182563955?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F36e5ea9b-58d3-4d61-9a0f-5779a323e467_1276x886.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!McYF!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F36e5ea9b-58d3-4d61-9a0f-5779a323e467_1276x886.png 424w, https://substackcdn.com/image/fetch/$s_!McYF!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F36e5ea9b-58d3-4d61-9a0f-5779a323e467_1276x886.png 848w, https://substackcdn.com/image/fetch/$s_!McYF!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F36e5ea9b-58d3-4d61-9a0f-5779a323e467_1276x886.png 1272w, https://substackcdn.com/image/fetch/$s_!McYF!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F36e5ea9b-58d3-4d61-9a0f-5779a323e467_1276x886.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" 
viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>Separating the core service from the background functions lets each scale independently and keeps a failure in one service from affecting the others.</p><p>The core service also provides access control for background functions, ensuring that only authorized clients can discover the assets.</p><h3>Performance</h3><p>Because UC is the entry point for all analytical workloads in Databricks, its access latency directly affects the user experience. 
Here are some optimizations that Databricks implements for the UC:</p><ul><li><p>Batching multiple asset requests.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!0GSI!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3da0eea0-dc1b-4dc4-87cf-36e7ae70f8a7_498x404.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!0GSI!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3da0eea0-dc1b-4dc4-87cf-36e7ae70f8a7_498x404.png 424w, https://substackcdn.com/image/fetch/$s_!0GSI!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3da0eea0-dc1b-4dc4-87cf-36e7ae70f8a7_498x404.png 848w, https://substackcdn.com/image/fetch/$s_!0GSI!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3da0eea0-dc1b-4dc4-87cf-36e7ae70f8a7_498x404.png 1272w, https://substackcdn.com/image/fetch/$s_!0GSI!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3da0eea0-dc1b-4dc4-87cf-36e7ae70f8a7_498x404.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!0GSI!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3da0eea0-dc1b-4dc4-87cf-36e7ae70f8a7_498x404.png" width="308" height="249.86345381526104" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/3da0eea0-dc1b-4dc4-87cf-36e7ae70f8a7_498x404.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:404,&quot;width&quot;:498,&quot;resizeWidth&quot;:308,&quot;bytes&quot;:31914,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://vutr.substack.com/i/182563955?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3da0eea0-dc1b-4dc4-87cf-36e7ae70f8a7_498x404.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!0GSI!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3da0eea0-dc1b-4dc4-87cf-36e7ae70f8a7_498x404.png 424w, https://substackcdn.com/image/fetch/$s_!0GSI!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3da0eea0-dc1b-4dc4-87cf-36e7ae70f8a7_498x404.png 848w, https://substackcdn.com/image/fetch/$s_!0GSI!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3da0eea0-dc1b-4dc4-87cf-36e7ae70f8a7_498x404.png 1272w, https://substackcdn.com/image/fetch/$s_!0GSI!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3da0eea0-dc1b-4dc4-87cf-36e7ae70f8a7_498x404.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 
0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div></li><li><p>Caching immutable metadata (e.g., temporary credentials) at both the UC service and the query engines.</p></li><li><p>For mutable metadata (e.g., table&#8217;s name, columns, permissions&#8230;), Databricks implements a write-through cache for the relational database that backs the metadata. 
The relational database&#8217;s data will be sharded across nodes; each node is responsible for its assigned subset of data in both the database&#8217;s storage and the cache.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!JtXV!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F302d155e-e60c-4a6a-9663-caa61eb93425_1376x490.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!JtXV!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F302d155e-e60c-4a6a-9663-caa61eb93425_1376x490.png 424w, https://substackcdn.com/image/fetch/$s_!JtXV!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F302d155e-e60c-4a6a-9663-caa61eb93425_1376x490.png 848w, https://substackcdn.com/image/fetch/$s_!JtXV!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F302d155e-e60c-4a6a-9663-caa61eb93425_1376x490.png 1272w, https://substackcdn.com/image/fetch/$s_!JtXV!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F302d155e-e60c-4a6a-9663-caa61eb93425_1376x490.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!JtXV!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F302d155e-e60c-4a6a-9663-caa61eb93425_1376x490.png" width="1376" height="490" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/302d155e-e60c-4a6a-9663-caa61eb93425_1376x490.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:490,&quot;width&quot;:1376,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:90537,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://vutr.substack.com/i/182563955?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F302d155e-e60c-4a6a-9663-caa61eb93425_1376x490.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!JtXV!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F302d155e-e60c-4a6a-9663-caa61eb93425_1376x490.png 424w, https://substackcdn.com/image/fetch/$s_!JtXV!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F302d155e-e60c-4a6a-9663-caa61eb93425_1376x490.png 848w, https://substackcdn.com/image/fetch/$s_!JtXV!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F302d155e-e60c-4a6a-9663-caa61eb93425_1376x490.png 1272w, https://substackcdn.com/image/fetch/$s_!JtXV!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F302d155e-e60c-4a6a-9663-caa61eb93425_1376x490.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" 
viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div></li><li><p>To prevent the cache from growing indefinitely, Databricks implements two strategies for cache eviction. First, standard algorithms such as LRU discard unpopular cached entities. Second, to cap the number of cached versions of popular entities, Databricks enforces a timeout on all UC API calls. The assumption is that once a request populates a new version of an asset into the cache, the older cached versions will keep serving in-flight requests for at most the timeout period.</p></li></ul><blockquote><p><em>In a write-through cache, data is written simultaneously to the cache and the primary storage system. This approach ensures that the cache always contains the most recent data. 
&#8212; from <a href="https://www.designgurus.io/answers/detail/what-is-read-through-vs-write-through-cache?gad_source=1&amp;gad_campaignid=23163907085&amp;gbraid=0AAAAADME9yoSi4tAWruR5iWGGAGnQZcX-&amp;gclid=Cj0KCQiAgbnKBhDgARIsAGCDdle2rdxFg-8SWJywUX7CAwptXKd4E6MaSTRchDZQdsrRhjrR_H-0U_oaAlzDEALw_wcB">What is Read-Through vs Write-Through Cache? by Design Guru</a></em></p></blockquote><ul><li><p>UC uses database versions to provide snapshot isolation and serializable isolation.</p></li></ul><blockquote><p><em>The idea of snapshot isolation (SI) is that each transaction reads a consistent snapshot of the database. The transaction sees only the changes committed before it started.</em></p><p><em>Serializable is the strongest isolation level: although transactions can run in parallel, their effects are the same as if they were run serially (one at a time). There are several approaches to implementing serializability, including serializable snapshot isolation (SSI).</em></p><p><em>SSI is still based on SI; reads are still served from consistent snapshots. However, SSI adds mechanisms to detect conflicts between writes. 
A common approach is to detect whether the reading snapshot is stale.</em></p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!r7uP!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3d5a2360-9105-4a47-a14a-bcd0c0492643_826x428.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!r7uP!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3d5a2360-9105-4a47-a14a-bcd0c0492643_826x428.png 424w, https://substackcdn.com/image/fetch/$s_!r7uP!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3d5a2360-9105-4a47-a14a-bcd0c0492643_826x428.png 848w, https://substackcdn.com/image/fetch/$s_!r7uP!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3d5a2360-9105-4a47-a14a-bcd0c0492643_826x428.png 1272w, https://substackcdn.com/image/fetch/$s_!r7uP!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3d5a2360-9105-4a47-a14a-bcd0c0492643_826x428.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!r7uP!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3d5a2360-9105-4a47-a14a-bcd0c0492643_826x428.png" width="826" height="428" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/3d5a2360-9105-4a47-a14a-bcd0c0492643_826x428.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:428,&quot;width&quot;:826,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!r7uP!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3d5a2360-9105-4a47-a14a-bcd0c0492643_826x428.png 424w, https://substackcdn.com/image/fetch/$s_!r7uP!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3d5a2360-9105-4a47-a14a-bcd0c0492643_826x428.png 848w, https://substackcdn.com/image/fetch/$s_!r7uP!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3d5a2360-9105-4a47-a14a-bcd0c0492643_826x428.png 1272w, https://substackcdn.com/image/fetch/$s_!r7uP!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3d5a2360-9105-4a47-a14a-bcd0c0492643_826x428.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path 
d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p><em>SI gives a transaction a consistent snapshot and ignores changes from ongoing or new transactions. With SSI, the database can check whether any of the ignored changes were committed during the transaction. If so, the database aborts the transaction unless it is read-only &#8212; from <a href="https://open.substack.com/pub/vutr/p/acid-for-data-engineers?utm_campaign=post-expanded-share&amp;utm_medium=web">ACID For Data Engineers by Vu Trinh</a>.</em></p></blockquote><div><hr></div><h2>Outro</h2><p>In this article, I&#8217;ve shared my insights after reading the paper <a href="https://dl.acm.org/doi/abs/10.1145/3722212.3724459">Unity Catalog: Open and Universal Governance for the Lakehouse and Beyond</a>. 
We covered the problems that Unity Catalog is trying to solve, how it fits into the Databricks lakehouse architecture, the query life cycle with the Catalog as the entry point, and finally, the system designs that make it a reliable, secure, and performant lakehouse catalog.</p><h2>Reference</h2><p><em>[1] Databricks, <a href="https://dl.acm.org/doi/abs/10.1145/3722212.3724459">Unity Catalog: Open and Universal Governance for the Lakehouse and Beyond</a>, 2025</em></p>]]></content:encoded></item><item><title><![CDATA[Why is Text-to-SQL so hard?]]></title><description><![CDATA[Why is there a need for it? What are its challenges? Is there a way to make it easier?]]></description><link>https://vutr.substack.com/p/why-is-text-to-sql-so-hard</link><guid isPermaLink="false">https://vutr.substack.com/p/why-is-text-to-sql-so-hard</guid><dc:creator><![CDATA[Vu Trinh]]></dc:creator><pubDate>Thu, 16 Oct 2025 03:15:26 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!ntvL!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faf489800-9e74-445b-bc01-5f055c6d801c_2000x1428.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<blockquote><p><em>I will publish a paid article every Tuesday. 
I write these with one goal in mind: to offer my readers, whether they feel overwhelmed at the start of the journey or are seeking a deeper understanding of the field, 15 minutes of practical lessons and insights on nearly everything related to data engineering.</em></p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://vutr.substack.com/subscribe?&quot;,&quot;text&quot;:&quot;Upgrade subscription&quot;,&quot;action&quot;:null,&quot;class&quot;:&quot;button-wrapper&quot;}" data-component-name="ButtonCreateButton"><a class="button primary button-wrapper" href="https://vutr.substack.com/subscribe?"><span>Upgrade subscription</span></a></p></blockquote><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!ntvL!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faf489800-9e74-445b-bc01-5f055c6d801c_2000x1428.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!ntvL!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faf489800-9e74-445b-bc01-5f055c6d801c_2000x1428.png 424w, https://substackcdn.com/image/fetch/$s_!ntvL!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faf489800-9e74-445b-bc01-5f055c6d801c_2000x1428.png 848w, https://substackcdn.com/image/fetch/$s_!ntvL!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faf489800-9e74-445b-bc01-5f055c6d801c_2000x1428.png 1272w, 
https://substackcdn.com/image/fetch/$s_!ntvL!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faf489800-9e74-445b-bc01-5f055c6d801c_2000x1428.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!ntvL!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faf489800-9e74-445b-bc01-5f055c6d801c_2000x1428.png" width="1456" height="1040" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/af489800-9e74-445b-bc01-5f055c6d801c_2000x1428.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1040,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:433190,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://vutr.substack.com/i/170275925?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faf489800-9e74-445b-bc01-5f055c6d801c_2000x1428.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!ntvL!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faf489800-9e74-445b-bc01-5f055c6d801c_2000x1428.png 424w, https://substackcdn.com/image/fetch/$s_!ntvL!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faf489800-9e74-445b-bc01-5f055c6d801c_2000x1428.png 848w, 
https://substackcdn.com/image/fetch/$s_!ntvL!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faf489800-9e74-445b-bc01-5f055c6d801c_2000x1428.png 1272w, https://substackcdn.com/image/fetch/$s_!ntvL!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faf489800-9e74-445b-bc01-5f055c6d801c_2000x1428.png 1456w" sizes="100vw" fetchpriority="high"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><div><hr></div><h2>Intro</h2><p>As Joe Reis and Matt Housley wrote in the well-known book, <a 
href="https://www.amazon.com/Fundamentals-Data-Engineering-Robust-Systems/dp/1098108302/ref=sr_1_1?adgrpid=116133839923&amp;dib=eyJ2IjoiMSJ9.lYwfG6Cki9cIzZbbw-FkBLEGg8qxUMl8FddVr7cn3e53N5udUjs7b4Xw8dLmLC6PGFLeiu__B-8NQ3wXIYhVyEPbcg8uack-H3mXmSlnlOq03C-h9r-vAqimHYUHjeWDK5M0PDMpMm1vRjNLyn0lNEyy1K1YC4wfv1rfBRuxkjD_dMF6_EGdjKUD3aRjguPjldg1wmleWvAJk8jOE30xBiy4UispBaZe5IfRIW05prE.MyZpTE-b63KM3R6ZHK5T7A1Nfdy7SjwIihQnUHj3w5U&amp;dib_tag=se&amp;hvadid=585479350700&amp;hvdev=c&amp;hvlocphy=9198559&amp;hvnetw=g&amp;hvqmt=e&amp;hvrand=16285984323524444922&amp;hvtargid=kwd-902459765949&amp;hydadcr=28046_14525503&amp;keywords=fundamentals+of+data+engineering&amp;mcid=1e34ef84df94373dafcb2867abec2b05&amp;qid=1754635317&amp;sr=8-1">Fundamentals of Data Engineering</a>:</p><blockquote><p><em>A data engineer manages the data engineering lifecycle, starting with extracting data from source systems and concluding with serving data for specific use cases.</em></p></blockquote><p>Data serving is the primary interface through which we provide our service to end users (e.g., data analysts, data scientists, business stakeholders). No matter how well we store, process, and manage the data, if users cannot access or use it reliably, we have failed.</p><p>Today, I want to discuss one of the hottest methods for serving data in the era of AI: natural language to SQL (text-to-SQL). 
We will first look at why text-to-SQL has been receiving so much attention recently and what its challenges are, and then attempt to find a solution that addresses them.</p><h2>Why Text-to-SQL</h2><p>In the past, business users who wanted to gain insight from the data had to go through the IT department so that technical experts could assist them.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!tTOS!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F16cb83ea-b843-4f1d-bf5b-7fa57ce034c5_484x462.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!tTOS!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F16cb83ea-b843-4f1d-bf5b-7fa57ce034c5_484x462.png 424w, https://substackcdn.com/image/fetch/$s_!tTOS!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F16cb83ea-b843-4f1d-bf5b-7fa57ce034c5_484x462.png 848w, https://substackcdn.com/image/fetch/$s_!tTOS!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F16cb83ea-b843-4f1d-bf5b-7fa57ce034c5_484x462.png 1272w, https://substackcdn.com/image/fetch/$s_!tTOS!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F16cb83ea-b843-4f1d-bf5b-7fa57ce034c5_484x462.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!tTOS!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F16cb83ea-b843-4f1d-bf5b-7fa57ce034c5_484x462.png" width="484" height="462" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/16cb83ea-b843-4f1d-bf5b-7fa57ce034c5_484x462.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:462,&quot;width&quot;:484,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!tTOS!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F16cb83ea-b843-4f1d-bf5b-7fa57ce034c5_484x462.png 424w, https://substackcdn.com/image/fetch/$s_!tTOS!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F16cb83ea-b843-4f1d-bf5b-7fa57ce034c5_484x462.png 848w, https://substackcdn.com/image/fetch/$s_!tTOS!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F16cb83ea-b843-4f1d-bf5b-7fa57ce034c5_484x462.png 1272w, https://substackcdn.com/image/fetch/$s_!tTOS!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F16cb83ea-b843-4f1d-bf5b-7fa57ce034c5_484x462.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" 
xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>The business intelligence tools have evolved since then. 
They now offer more functionality, a shinier UI, the ability to connect to more systems, and, most importantly, a friendlier experience for non-technical users.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!MOK9!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Feb9bc64c-e111-4c3f-beae-8866a57f3b5b_1032x456.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!MOK9!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Feb9bc64c-e111-4c3f-beae-8866a57f3b5b_1032x456.png 424w, https://substackcdn.com/image/fetch/$s_!MOK9!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Feb9bc64c-e111-4c3f-beae-8866a57f3b5b_1032x456.png 848w, https://substackcdn.com/image/fetch/$s_!MOK9!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Feb9bc64c-e111-4c3f-beae-8866a57f3b5b_1032x456.png 1272w, https://substackcdn.com/image/fetch/$s_!MOK9!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Feb9bc64c-e111-4c3f-beae-8866a57f3b5b_1032x456.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!MOK9!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Feb9bc64c-e111-4c3f-beae-8866a57f3b5b_1032x456.png" width="1032" height="456" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/eb9bc64c-e111-4c3f-beae-8866a57f3b5b_1032x456.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:456,&quot;width&quot;:1032,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:103159,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://vutr.substack.com/i/170275925?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Feb9bc64c-e111-4c3f-beae-8866a57f3b5b_1032x456.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!MOK9!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Feb9bc64c-e111-4c3f-beae-8866a57f3b5b_1032x456.png 424w, https://substackcdn.com/image/fetch/$s_!MOK9!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Feb9bc64c-e111-4c3f-beae-8866a57f3b5b_1032x456.png 848w, https://substackcdn.com/image/fetch/$s_!MOK9!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Feb9bc64c-e111-4c3f-beae-8866a57f3b5b_1032x456.png 1272w, https://substackcdn.com/image/fetch/$s_!MOK9!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Feb9bc64c-e111-4c3f-beae-8866a57f3b5b_1032x456.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" 
viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>Instead of asking the technical team for help, business users can now build their own charts and reports with modern business intelligence tools, which let them drag and drop the data fields they want.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!AvDi!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6be01640-b15b-40df-b7c5-a94231f1270c_556x288.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" 
srcset="https://substackcdn.com/image/fetch/$s_!AvDi!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6be01640-b15b-40df-b7c5-a94231f1270c_556x288.png 424w, https://substackcdn.com/image/fetch/$s_!AvDi!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6be01640-b15b-40df-b7c5-a94231f1270c_556x288.png 848w, https://substackcdn.com/image/fetch/$s_!AvDi!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6be01640-b15b-40df-b7c5-a94231f1270c_556x288.png 1272w, https://substackcdn.com/image/fetch/$s_!AvDi!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6be01640-b15b-40df-b7c5-a94231f1270c_556x288.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!AvDi!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6be01640-b15b-40df-b7c5-a94231f1270c_556x288.png" width="556" height="288" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/6be01640-b15b-40df-b7c5-a94231f1270c_556x288.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:288,&quot;width&quot;:556,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:43308,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://vutr.substack.com/i/170275925?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6be01640-b15b-40df-b7c5-a94231f1270c_556x288.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" 
srcset="https://substackcdn.com/image/fetch/$s_!AvDi!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6be01640-b15b-40df-b7c5-a94231f1270c_556x288.png 424w, https://substackcdn.com/image/fetch/$s_!AvDi!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6be01640-b15b-40df-b7c5-a94231f1270c_556x288.png 848w, https://substackcdn.com/image/fetch/$s_!AvDi!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6be01640-b15b-40df-b7c5-a94231f1270c_556x288.png 1272w, https://substackcdn.com/image/fetch/$s_!AvDi!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6be01640-b15b-40df-b7c5-a94231f1270c_556x288.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" 
stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>But that does not seem to be enough. The rise of AI chat interfaces such as ChatGPT and Gemini has made people realize that &#8220;oh, using natural language is even more productive than visual drag-and-drop.&#8221; BI tools on the market are starting to add the ability to answer users&#8217; questions with the help of AI models.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!tAiW!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0a3c88a6-7fbf-415c-a5fc-8ba6cef7a10e_630x296.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!tAiW!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0a3c88a6-7fbf-415c-a5fc-8ba6cef7a10e_630x296.png 424w, https://substackcdn.com/image/fetch/$s_!tAiW!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0a3c88a6-7fbf-415c-a5fc-8ba6cef7a10e_630x296.png 848w, https://substackcdn.com/image/fetch/$s_!tAiW!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0a3c88a6-7fbf-415c-a5fc-8ba6cef7a10e_630x296.png 1272w, https://substackcdn.com/image/fetch/$s_!tAiW!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0a3c88a6-7fbf-415c-a5fc-8ba6cef7a10e_630x296.png 
1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!tAiW!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0a3c88a6-7fbf-415c-a5fc-8ba6cef7a10e_630x296.png" width="630" height="296" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/0a3c88a6-7fbf-415c-a5fc-8ba6cef7a10e_630x296.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:296,&quot;width&quot;:630,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:70984,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://vutr.substack.com/i/170275925?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0a3c88a6-7fbf-415c-a5fc-8ba6cef7a10e_630x296.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!tAiW!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0a3c88a6-7fbf-415c-a5fc-8ba6cef7a10e_630x296.png 424w, https://substackcdn.com/image/fetch/$s_!tAiW!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0a3c88a6-7fbf-415c-a5fc-8ba6cef7a10e_630x296.png 848w, https://substackcdn.com/image/fetch/$s_!tAiW!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0a3c88a6-7fbf-415c-a5fc-8ba6cef7a10e_630x296.png 1272w, https://substackcdn.com/image/fetch/$s_!tAiW!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0a3c88a6-7fbf-415c-a5fc-8ba6cef7a10e_630x296.png 
1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>The key is to have the AI model translate the user&#8217;s input into a SQL query. The tool then sends that SQL to the database and builds a chart or report from the results.</p><p>Instead of picking the `total_sales` and `country` fields by hand, users can type a simple request such as &#8220;Show me the total sales breakdown by country in the last month,&#8221; which is more intuitive. 
Integrating with AI makes a solution more compelling.</p><h2>Challenges of Text-to-SQL</h2><blockquote><p><em>I refer to the paper <a href="https://arxiv.org/html/2408.05109v5">&#8220;A Survey of Text-to-SQL in the Era of LLMs: Where are we, and where are we going?&#8221;</a> for this section.</em></p></blockquote><p>Instructing AI models to accept natural language input and output a reliable SQL query is not easy to achieve. To better understand the challenges, let&#8217;s first revisit some steps that humans take to write SQL:</p><ul><li><p>We begin with the business question, the natural language query: for example, all countries with sales greater than 2,000 on Independence Day.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!r4RV!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb15250fb-05f2-487b-accb-65d51e605533_524x264.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!r4RV!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb15250fb-05f2-487b-accb-65d51e605533_524x264.png 424w, https://substackcdn.com/image/fetch/$s_!r4RV!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb15250fb-05f2-487b-accb-65d51e605533_524x264.png 848w, https://substackcdn.com/image/fetch/$s_!r4RV!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb15250fb-05f2-487b-accb-65d51e605533_524x264.png 1272w, 
https://substackcdn.com/image/fetch/$s_!r4RV!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb15250fb-05f2-487b-accb-65d51e605533_524x264.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!r4RV!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb15250fb-05f2-487b-accb-65d51e605533_524x264.png" width="524" height="264" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/b15250fb-05f2-487b-accb-65d51e605533_524x264.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:264,&quot;width&quot;:524,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:42408,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://vutr.substack.com/i/170275925?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb15250fb-05f2-487b-accb-65d51e605533_524x264.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!r4RV!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb15250fb-05f2-487b-accb-65d51e605533_524x264.png 424w, https://substackcdn.com/image/fetch/$s_!r4RV!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb15250fb-05f2-487b-accb-65d51e605533_524x264.png 848w, https://substackcdn.com/image/fetch/$s_!r4RV!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb15250fb-05f2-487b-accb-65d51e605533_524x264.png 1272w, 
https://substackcdn.com/image/fetch/$s_!r4RV!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb15250fb-05f2-487b-accb-65d51e605533_524x264.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div></li><li><p>In our heads, we identify the entities (the countries and the sales), the context (Independence Day), and the condition (sales greater than 2,000).</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" 
href="https://substackcdn.com/image/fetch/$s_!lT_I!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F06ea3302-444e-4221-9c27-e2887a677ccf_438x314.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!lT_I!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F06ea3302-444e-4221-9c27-e2887a677ccf_438x314.png 424w, https://substackcdn.com/image/fetch/$s_!lT_I!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F06ea3302-444e-4221-9c27-e2887a677ccf_438x314.png 848w, https://substackcdn.com/image/fetch/$s_!lT_I!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F06ea3302-444e-4221-9c27-e2887a677ccf_438x314.png 1272w, https://substackcdn.com/image/fetch/$s_!lT_I!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F06ea3302-444e-4221-9c27-e2887a677ccf_438x314.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!lT_I!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F06ea3302-444e-4221-9c27-e2887a677ccf_438x314.png" width="438" height="314" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/06ea3302-444e-4221-9c27-e2887a677ccf_438x314.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:314,&quot;width&quot;:438,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:45987,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://vutr.substack.com/i/170275925?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F06ea3302-444e-4221-9c27-e2887a677ccf_438x314.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!lT_I!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F06ea3302-444e-4221-9c27-e2887a677ccf_438x314.png 424w, https://substackcdn.com/image/fetch/$s_!lT_I!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F06ea3302-444e-4221-9c27-e2887a677ccf_438x314.png 848w, https://substackcdn.com/image/fetch/$s_!lT_I!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F06ea3302-444e-4221-9c27-e2887a677ccf_438x314.png 1272w, https://substackcdn.com/image/fetch/$s_!lT_I!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F06ea3302-444e-4221-9c27-e2887a677ccf_438x314.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 
0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div></li><li><p>We find the relevant tables, columns, and records by examining the database schema. Human interpretation is essential here: which kind of sales do we mean (assume the company has more than one product), and what date is Independence Day? (This varies by country.) 
This step may require us to revisit the business users to request additional information.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!j6_-!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F13545902-789c-4812-893c-d1516cd259a7_472x406.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!j6_-!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F13545902-789c-4812-893c-d1516cd259a7_472x406.png 424w, https://substackcdn.com/image/fetch/$s_!j6_-!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F13545902-789c-4812-893c-d1516cd259a7_472x406.png 848w, https://substackcdn.com/image/fetch/$s_!j6_-!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F13545902-789c-4812-893c-d1516cd259a7_472x406.png 1272w, https://substackcdn.com/image/fetch/$s_!j6_-!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F13545902-789c-4812-893c-d1516cd259a7_472x406.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!j6_-!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F13545902-789c-4812-893c-d1516cd259a7_472x406.png" width="472" height="406" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/13545902-789c-4812-893c-d1516cd259a7_472x406.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:406,&quot;width&quot;:472,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:47268,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://vutr.substack.com/i/170275925?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F13545902-789c-4812-893c-d1516cd259a7_472x406.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!j6_-!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F13545902-789c-4812-893c-d1516cd259a7_472x406.png 424w, https://substackcdn.com/image/fetch/$s_!j6_-!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F13545902-789c-4812-893c-d1516cd259a7_472x406.png 848w, https://substackcdn.com/image/fetch/$s_!j6_-!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F13545902-789c-4812-893c-d1516cd259a7_472x406.png 1272w, https://substackcdn.com/image/fetch/$s_!j6_-!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F13545902-789c-4812-893c-d1516cd259a7_472x406.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 
0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div></li><li><p>Then, we write SQL based on our understanding. We Select, Join, Group By, Where&#8230;</p></li></ul><p>Even we humans, who know what we are trying to do, still face hard problems in the &#8220;text-to-SQL&#8221; process: the uncertainty of natural language, the complexity of the database, and the translation from &#8220;flexible&#8221; natural language queries to &#8220;strict&#8221; SQL queries.</p><h3>Natural language uncertainty</h3><p>We use natural language from the day we learn to say our first words, such as &#8220;mama&#8221; or &#8220;papa&#8221;. We practice it every day, and the way we communicate depends significantly on who we are, how we grew up, and how we perceive the world.</p><p>It&#8217;s normal for us to say one thing and for others to understand it in different ways. This is called ambiguity. 
It could happen when a single word has multiple meanings, &#8230;</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!BwjF!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc1da933c-0b47-4313-a40b-3542764f09b3_620x394.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!BwjF!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc1da933c-0b47-4313-a40b-3542764f09b3_620x394.png 424w, https://substackcdn.com/image/fetch/$s_!BwjF!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc1da933c-0b47-4313-a40b-3542764f09b3_620x394.png 848w, https://substackcdn.com/image/fetch/$s_!BwjF!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc1da933c-0b47-4313-a40b-3542764f09b3_620x394.png 1272w, https://substackcdn.com/image/fetch/$s_!BwjF!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc1da933c-0b47-4313-a40b-3542764f09b3_620x394.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!BwjF!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc1da933c-0b47-4313-a40b-3542764f09b3_620x394.png" width="620" height="394" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/c1da933c-0b47-4313-a40b-3542764f09b3_620x394.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:394,&quot;width&quot;:620,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:72329,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://vutr.substack.com/i/170275925?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc1da933c-0b47-4313-a40b-3542764f09b3_620x394.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!BwjF!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc1da933c-0b47-4313-a40b-3542764f09b3_620x394.png 424w, https://substackcdn.com/image/fetch/$s_!BwjF!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc1da933c-0b47-4313-a40b-3542764f09b3_620x394.png 848w, https://substackcdn.com/image/fetch/$s_!BwjF!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc1da933c-0b47-4313-a40b-3542764f09b3_620x394.png 1272w, https://substackcdn.com/image/fetch/$s_!BwjF!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc1da933c-0b47-4313-a40b-3542764f09b3_620x394.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 
0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>&#8230;or a sentence can be parsed in various ways.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!jQNr!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F77c3e0b1-1adc-4ef9-9c9f-7334137e45d4_456x286.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!jQNr!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F77c3e0b1-1adc-4ef9-9c9f-7334137e45d4_456x286.png 424w, 
https://substackcdn.com/image/fetch/$s_!jQNr!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F77c3e0b1-1adc-4ef9-9c9f-7334137e45d4_456x286.png 848w, https://substackcdn.com/image/fetch/$s_!jQNr!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F77c3e0b1-1adc-4ef9-9c9f-7334137e45d4_456x286.png 1272w, https://substackcdn.com/image/fetch/$s_!jQNr!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F77c3e0b1-1adc-4ef9-9c9f-7334137e45d4_456x286.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!jQNr!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F77c3e0b1-1adc-4ef9-9c9f-7334137e45d4_456x286.png" width="456" height="286" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/77c3e0b1-1adc-4ef9-9c9f-7334137e45d4_456x286.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:286,&quot;width&quot;:456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:34476,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://vutr.substack.com/i/170275925?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F77c3e0b1-1adc-4ef9-9c9f-7334137e45d4_456x286.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!jQNr!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F77c3e0b1-1adc-4ef9-9c9f-7334137e45d4_456x286.png 424w, 
https://substackcdn.com/image/fetch/$s_!jQNr!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F77c3e0b1-1adc-4ef9-9c9f-7334137e45d4_456x286.png 848w, https://substackcdn.com/image/fetch/$s_!jQNr!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F77c3e0b1-1adc-4ef9-9c9f-7334137e45d4_456x286.png 1272w, https://substackcdn.com/image/fetch/$s_!jQNr!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F77c3e0b1-1adc-4ef9-9c9f-7334137e45d4_456x286.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line 
x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>The uncertainty also stems from under-specification, which occurs when an expression lacks the detail or context needed to convey its intended meaning. For example, Independence Day in Vietnam falls on a different date than Independence Day in the United States of America.</p><p>We humans can ask others, observe our surroundings, or draw on our experience and understanding to resolve the ambiguity. An AI model, meanwhile, might have only the natural language query to work with.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!32Up!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F22652ae7-c2e5-48e6-8c01-bbf560453772_480x248.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!32Up!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F22652ae7-c2e5-48e6-8c01-bbf560453772_480x248.png 424w, https://substackcdn.com/image/fetch/$s_!32Up!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F22652ae7-c2e5-48e6-8c01-bbf560453772_480x248.png 848w, https://substackcdn.com/image/fetch/$s_!32Up!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F22652ae7-c2e5-48e6-8c01-bbf560453772_480x248.png 1272w, https://substackcdn.com/image/fetch/$s_!32Up!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F22652ae7-c2e5-48e6-8c01-bbf560453772_480x248.png 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!32Up!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F22652ae7-c2e5-48e6-8c01-bbf560453772_480x248.png" width="480" height="248" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/22652ae7-c2e5-48e6-8c01-bbf560453772_480x248.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:248,&quot;width&quot;:480,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:56312,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://vutr.substack.com/i/170275925?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F22652ae7-c2e5-48e6-8c01-bbf560453772_480x248.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!32Up!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F22652ae7-c2e5-48e6-8c01-bbf560453772_480x248.png 424w, https://substackcdn.com/image/fetch/$s_!32Up!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F22652ae7-c2e5-48e6-8c01-bbf560453772_480x248.png 848w, https://substackcdn.com/image/fetch/$s_!32Up!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F22652ae7-c2e5-48e6-8c01-bbf560453772_480x248.png 1272w, https://substackcdn.com/image/fetch/$s_!32Up!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F22652ae7-c2e5-48e6-8c01-bbf560453772_480x248.png 1456w" sizes="100vw" 
loading="lazy"></picture></div></a></figure></div><h3>The database&#8217;s complexity</h3><p>It&#8217;s common for us, data engineers, to handle messy data systems. 
Typical symptoms include a lack of robust data modeling, complex relationships between tables, ambiguous columns, and more than one way to calculate the same metric.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!dT35!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcf58bb8f-7b5a-4e77-a840-081252c37bbd_602x382.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!dT35!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcf58bb8f-7b5a-4e77-a840-081252c37bbd_602x382.png 424w, https://substackcdn.com/image/fetch/$s_!dT35!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcf58bb8f-7b5a-4e77-a840-081252c37bbd_602x382.png 848w, https://substackcdn.com/image/fetch/$s_!dT35!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcf58bb8f-7b5a-4e77-a840-081252c37bbd_602x382.png 1272w, https://substackcdn.com/image/fetch/$s_!dT35!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcf58bb8f-7b5a-4e77-a840-081252c37bbd_602x382.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!dT35!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcf58bb8f-7b5a-4e77-a840-081252c37bbd_602x382.png" width="602" height="382" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/cf58bb8f-7b5a-4e77-a840-081252c37bbd_602x382.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:382,&quot;width&quot;:602,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:72392,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://vutr.substack.com/i/170275925?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcf58bb8f-7b5a-4e77-a840-081252c37bbd_602x382.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!dT35!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcf58bb8f-7b5a-4e77-a840-081252c37bbd_602x382.png 424w, https://substackcdn.com/image/fetch/$s_!dT35!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcf58bb8f-7b5a-4e77-a840-081252c37bbd_602x382.png 848w, https://substackcdn.com/image/fetch/$s_!dT35!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcf58bb8f-7b5a-4e77-a840-081252c37bbd_602x382.png 1272w, https://substackcdn.com/image/fetch/$s_!dT35!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcf58bb8f-7b5a-4e77-a840-081252c37bbd_602x382.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 
0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>Let&#8217;s be honest: it is tough for us to get things right the first time with such a data system. We might run around the company asking for clarification, introduce some bugs, and produce some weird reports before learning how to do it right. An AI model, somewhere on the internet, knows nothing about your company&#8217;s data system. How could we expect it to do better than us?</p><h3>Text-to-SQL Translation</h3><p>For a machine to run our Python or Java code, that code must first be translated into low-level machine language. 
This is a complex task, but at a high level, things are straightforward, as each language has a kind of dictionary to facilitate a one-to-one mapping between programming language code and machine code.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!ItYm!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F373d505e-7721-4ed1-a755-9722606301c3_650x270.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!ItYm!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F373d505e-7721-4ed1-a755-9722606301c3_650x270.png 424w, https://substackcdn.com/image/fetch/$s_!ItYm!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F373d505e-7721-4ed1-a755-9722606301c3_650x270.png 848w, https://substackcdn.com/image/fetch/$s_!ItYm!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F373d505e-7721-4ed1-a755-9722606301c3_650x270.png 1272w, https://substackcdn.com/image/fetch/$s_!ItYm!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F373d505e-7721-4ed1-a755-9722606301c3_650x270.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!ItYm!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F373d505e-7721-4ed1-a755-9722606301c3_650x270.png" width="650" height="270" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/373d505e-7721-4ed1-a755-9722606301c3_650x270.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:270,&quot;width&quot;:650,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:33734,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://vutr.substack.com/i/170275925?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F373d505e-7721-4ed1-a755-9722606301c3_650x270.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!ItYm!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F373d505e-7721-4ed1-a755-9722606301c3_650x270.png 424w, https://substackcdn.com/image/fetch/$s_!ItYm!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F373d505e-7721-4ed1-a755-9722606301c3_650x270.png 848w, https://substackcdn.com/image/fetch/$s_!ItYm!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F373d505e-7721-4ed1-a755-9722606301c3_650x270.png 1272w, https://substackcdn.com/image/fetch/$s_!ItYm!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F373d505e-7721-4ed1-a755-9722606301c3_650x270.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 
0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>However, converting text to SQL is more challenging than that, as it typically involves a one-to-many mapping between the input natural language query &#8592;&#8594; database entities and natural language query &#8592;&#8594; SQL query.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!Bvfj!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffbeb04c8-cc7c-4563-b268-cc56aea90cd3_654x446.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" 
srcset="https://substackcdn.com/image/fetch/$s_!Bvfj!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffbeb04c8-cc7c-4563-b268-cc56aea90cd3_654x446.png 424w, https://substackcdn.com/image/fetch/$s_!Bvfj!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffbeb04c8-cc7c-4563-b268-cc56aea90cd3_654x446.png 848w, https://substackcdn.com/image/fetch/$s_!Bvfj!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffbeb04c8-cc7c-4563-b268-cc56aea90cd3_654x446.png 1272w, https://substackcdn.com/image/fetch/$s_!Bvfj!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffbeb04c8-cc7c-4563-b268-cc56aea90cd3_654x446.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!Bvfj!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffbeb04c8-cc7c-4563-b268-cc56aea90cd3_654x446.png" width="654" height="446" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/fbeb04c8-cc7c-4563-b268-cc56aea90cd3_654x446.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:446,&quot;width&quot;:654,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:48985,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://vutr.substack.com/i/170275925?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffbeb04c8-cc7c-4563-b268-cc56aea90cd3_654x446.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" 
srcset="https://substackcdn.com/image/fetch/$s_!Bvfj!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffbeb04c8-cc7c-4563-b268-cc56aea90cd3_654x446.png 424w, https://substackcdn.com/image/fetch/$s_!Bvfj!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffbeb04c8-cc7c-4563-b268-cc56aea90cd3_654x446.png 848w, https://substackcdn.com/image/fetch/$s_!Bvfj!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffbeb04c8-cc7c-4563-b268-cc56aea90cd3_654x446.png 1272w, https://substackcdn.com/image/fetch/$s_!Bvfj!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffbeb04c8-cc7c-4563-b268-cc56aea90cd3_654x446.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" 
stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>Natural language is flexible, whereas SQL queries must adhere to a strict syntax. Even that syntax varies depending on the SQL standard and the database implementation.</p><p>We require queries to be not only executable but also readable, optimized, and reliable. That is a heavy responsibility to place on AI models, which may return slow queries, hard-to-debug queries, inaccurate results, or different SQL queries for the same prompt.</p><div class="pullquote"><p>This article is sponsored by <a href="http://holistics.io/">Holistics</a>, a self-service BI tool built for the AI era.</p></div><h2>So, is there a way for us to deal with these problems?</h2><p>It turns out that there is a promising approach.</p><p>In the paper &#8220;<a href="https://arxiv.org/pdf/2311.07509">A benchmark to understand the role of knowledge graphs on large language models&#8217; accuracy for question answering on enterprise SQL databases</a>&#8221;, the authors created a benchmark: a series of questions at different levels of complexity over a standardized insurance dataset. 
They asked ChatGPT to answer the questions in two ways:</p><ul><li><p>Generate the SQL directly</p></li><li><p>Generate the SQL with the help of a knowledge graph</p></li></ul><p>They observed that leveraging the knowledge graph indeed helps improve the accuracy of results:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!RgiW!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffc97906f-7dca-4473-98a0-dbb4a8a628f9_1004x266.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!RgiW!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffc97906f-7dca-4473-98a0-dbb4a8a628f9_1004x266.png 424w, https://substackcdn.com/image/fetch/$s_!RgiW!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffc97906f-7dca-4473-98a0-dbb4a8a628f9_1004x266.png 848w, https://substackcdn.com/image/fetch/$s_!RgiW!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffc97906f-7dca-4473-98a0-dbb4a8a628f9_1004x266.png 1272w, https://substackcdn.com/image/fetch/$s_!RgiW!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffc97906f-7dca-4473-98a0-dbb4a8a628f9_1004x266.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!RgiW!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffc97906f-7dca-4473-98a0-dbb4a8a628f9_1004x266.png" width="1004" height="266" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/fc97906f-7dca-4473-98a0-dbb4a8a628f9_1004x266.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:266,&quot;width&quot;:1004,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:73109,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://vutr.substack.com/i/170275925?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffc97906f-7dca-4473-98a0-dbb4a8a628f9_1004x266.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!RgiW!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffc97906f-7dca-4473-98a0-dbb4a8a628f9_1004x266.png 424w, https://substackcdn.com/image/fetch/$s_!RgiW!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffc97906f-7dca-4473-98a0-dbb4a8a628f9_1004x266.png 848w, https://substackcdn.com/image/fetch/$s_!RgiW!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffc97906f-7dca-4473-98a0-dbb4a8a628f9_1004x266.png 1272w, https://substackcdn.com/image/fetch/$s_!RgiW!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffc97906f-7dca-4473-98a0-dbb4a8a628f9_1004x266.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" 
viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">The benchmark results. The third column shows the accuracy when using the knowledge graph. <a href="https://arxiv.org/pdf/2311.07509">Source</a></figcaption></figure></div><p>Essentially, a knowledge graph is a structured way to represent knowledge about entities and their relationships using a graph-based data model. There is a popular solution in the data world that offers the same benefit.</p><h3>Yes, it is the semantic layer</h3><p>As a company&#8217;s business expands, the volume and variety of data increase; more decisions need to be made, more data must be stored, and more source data must be captured. No matter how well we prepare, data users may still struggle to understand what they need in order to use the data effectively. 
We need a better abstraction layer that can lower the barrier for people.</p><p>The semantic layer is an abstraction layer that sits between the underlying data (e.g., data warehouses) and end-user applications (e.g., BI tools, data applications, or business users). From a high level, a semantic layer solution requires us to map business-friendly concepts to underlying data assets and specify the relationships between them.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!ZbLj!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F747723bb-54b7-4675-ab12-2f6ff3bf26ad_1456x458.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!ZbLj!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F747723bb-54b7-4675-ab12-2f6ff3bf26ad_1456x458.png 424w, https://substackcdn.com/image/fetch/$s_!ZbLj!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F747723bb-54b7-4675-ab12-2f6ff3bf26ad_1456x458.png 848w, https://substackcdn.com/image/fetch/$s_!ZbLj!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F747723bb-54b7-4675-ab12-2f6ff3bf26ad_1456x458.png 1272w, https://substackcdn.com/image/fetch/$s_!ZbLj!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F747723bb-54b7-4675-ab12-2f6ff3bf26ad_1456x458.png 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!ZbLj!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F747723bb-54b7-4675-ab12-2f6ff3bf26ad_1456x458.png" width="1456" height="458" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/747723bb-54b7-4675-ab12-2f6ff3bf26ad_1456x458.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:458,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:147627,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://vutr.substack.com/i/170275925?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F747723bb-54b7-4675-ab12-2f6ff3bf26ad_1456x458.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!ZbLj!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F747723bb-54b7-4675-ab12-2f6ff3bf26ad_1456x458.png 424w, https://substackcdn.com/image/fetch/$s_!ZbLj!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F747723bb-54b7-4675-ab12-2f6ff3bf26ad_1456x458.png 848w, https://substackcdn.com/image/fetch/$s_!ZbLj!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F747723bb-54b7-4675-ab12-2f6ff3bf26ad_1456x458.png 1272w, https://substackcdn.com/image/fetch/$s_!ZbLj!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F747723bb-54b7-4675-ab12-2f6ff3bf26ad_1456x458.png 1456w" 
sizes="100vw" loading="lazy"></picture></div></a></figure></div><p>Thanks to that, the layer acts as a translator between the data and its users. It abstracts all the complexity to ensure that only understandable and business-friendly concepts are presented to users.</p><h2>Semantic layer&#8217;s role in Text-to-SQL tasks</h2><p>Recall that ambiguity and database complexity affect the accuracy of the text-to-SQL system. 
With the help of the semantic layer, the Text-to-SQL output could be more reliable:</p><ul><li><p>AI models don&#8217;t need to understand the database complexity anymore, as all the information they require is baked into the semantic layer, from the tables needed to the right way to join them. In other words, an AI model is enriched with context through the semantic layer.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!Jcto!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6f1f6979-810b-45f7-b8d0-5325c96ab039_530x486.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!Jcto!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6f1f6979-810b-45f7-b8d0-5325c96ab039_530x486.png 424w, https://substackcdn.com/image/fetch/$s_!Jcto!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6f1f6979-810b-45f7-b8d0-5325c96ab039_530x486.png 848w, https://substackcdn.com/image/fetch/$s_!Jcto!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6f1f6979-810b-45f7-b8d0-5325c96ab039_530x486.png 1272w, https://substackcdn.com/image/fetch/$s_!Jcto!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6f1f6979-810b-45f7-b8d0-5325c96ab039_530x486.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!Jcto!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6f1f6979-810b-45f7-b8d0-5325c96ab039_530x486.png" width="530" height="486" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/6f1f6979-810b-45f7-b8d0-5325c96ab039_530x486.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:486,&quot;width&quot;:530,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:109465,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://vutr.substack.com/i/170275925?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6f1f6979-810b-45f7-b8d0-5325c96ab039_530x486.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!Jcto!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6f1f6979-810b-45f7-b8d0-5325c96ab039_530x486.png 424w, https://substackcdn.com/image/fetch/$s_!Jcto!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6f1f6979-810b-45f7-b8d0-5325c96ab039_530x486.png 848w, https://substackcdn.com/image/fetch/$s_!Jcto!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6f1f6979-810b-45f7-b8d0-5325c96ab039_530x486.png 1272w, https://substackcdn.com/image/fetch/$s_!Jcto!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6f1f6979-810b-45f7-b8d0-5325c96ab039_530x486.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div></li><li><p>When a user requests &#8220;total sales,&#8221; the AI does not need to infer or guess the logic; it can simply reference the predefined &#8220;Total Sales&#8221; metric in the semantic layer, which already contains the calculation. This limits the ambiguity.</p></li></ul><h2>A real-world example</h2><p>The semantic layer has gained traction lately, thanks to its ability to abstract the complexity of the underlying data systems. As discussed, this benefits not only business users but also AI models. 
The layer is an indispensable part of modern BI tools, such as Tableau, Looker, and Power BI, as well as an interesting solution called <a href="https://www.holistics.io/">Holistics</a>.</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!ikWQ!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F12d2e1ce-c1b4-4ca9-afb0-49078a25cc5e_276x88.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!ikWQ!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F12d2e1ce-c1b4-4ca9-afb0-49078a25cc5e_276x88.png 424w, https://substackcdn.com/image/fetch/$s_!ikWQ!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F12d2e1ce-c1b4-4ca9-afb0-49078a25cc5e_276x88.png 848w, https://substackcdn.com/image/fetch/$s_!ikWQ!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F12d2e1ce-c1b4-4ca9-afb0-49078a25cc5e_276x88.png 1272w, https://substackcdn.com/image/fetch/$s_!ikWQ!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F12d2e1ce-c1b4-4ca9-afb0-49078a25cc5e_276x88.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!ikWQ!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F12d2e1ce-c1b4-4ca9-afb0-49078a25cc5e_276x88.png" width="374" height="119.2463768115942" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/12d2e1ce-c1b4-4ca9-afb0-49078a25cc5e_276x88.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:88,&quot;width&quot;:276,&quot;resizeWidth&quot;:374,&quot;bytes&quot;:14639,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://vutr.substack.com/i/170275925?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F12d2e1ce-c1b4-4ca9-afb0-49078a25cc5e_276x88.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!ikWQ!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F12d2e1ce-c1b4-4ca9-afb0-49078a25cc5e_276x88.png 424w, https://substackcdn.com/image/fetch/$s_!ikWQ!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F12d2e1ce-c1b4-4ca9-afb0-49078a25cc5e_276x88.png 848w, https://substackcdn.com/image/fetch/$s_!ikWQ!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F12d2e1ce-c1b4-4ca9-afb0-49078a25cc5e_276x88.png 1272w, https://substackcdn.com/image/fetch/$s_!ikWQ!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F12d2e1ce-c1b4-4ca9-afb0-49078a25cc5e_276x88.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a></figure></div><p>Established in 2015, the platform enables<strong> self-service data access for the entire organization</strong>. 
Unlike in other BI tools, users who want to extract insights in Holistics must first define the mapping between business concepts and the underlying tables via the semantic layer. Only then can they start presenting and organizing data using the concepts exposed by the semantic layer.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!Xb91!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7ac5e474-5047-46c7-8dfc-4a0c2cffb53f_626x388.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!Xb91!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7ac5e474-5047-46c7-8dfc-4a0c2cffb53f_626x388.png 424w, https://substackcdn.com/image/fetch/$s_!Xb91!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7ac5e474-5047-46c7-8dfc-4a0c2cffb53f_626x388.png 848w, https://substackcdn.com/image/fetch/$s_!Xb91!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7ac5e474-5047-46c7-8dfc-4a0c2cffb53f_626x388.png 1272w, https://substackcdn.com/image/fetch/$s_!Xb91!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7ac5e474-5047-46c7-8dfc-4a0c2cffb53f_626x388.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!Xb91!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7ac5e474-5047-46c7-8dfc-4a0c2cffb53f_626x388.png" width="626" height="388" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/7ac5e474-5047-46c7-8dfc-4a0c2cffb53f_626x388.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:388,&quot;width&quot;:626,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:59402,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://vutr.substack.com/i/170275925?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7ac5e474-5047-46c7-8dfc-4a0c2cffb53f_626x388.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!Xb91!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7ac5e474-5047-46c7-8dfc-4a0c2cffb53f_626x388.png 424w, https://substackcdn.com/image/fetch/$s_!Xb91!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7ac5e474-5047-46c7-8dfc-4a0c2cffb53f_626x388.png 848w, https://substackcdn.com/image/fetch/$s_!Xb91!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7ac5e474-5047-46c7-8dfc-4a0c2cffb53f_626x388.png 1272w, https://substackcdn.com/image/fetch/$s_!Xb91!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7ac5e474-5047-46c7-8dfc-4a0c2cffb53f_626x388.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><p>To work with the semantic layer, Holistics introduces the concept of a &#8220;model&#8221;, which is an abstract representation on top of a table/query. A model should have the source (a physical table or a SQL query), the dimensions and measures, and the relationships to other models. 
Holistics uses relationships for constructing the join.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!tasX!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffef594f9-382f-438e-97c1-413788587113_1078x746.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!tasX!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffef594f9-382f-438e-97c1-413788587113_1078x746.png 424w, https://substackcdn.com/image/fetch/$s_!tasX!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffef594f9-382f-438e-97c1-413788587113_1078x746.png 848w, https://substackcdn.com/image/fetch/$s_!tasX!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffef594f9-382f-438e-97c1-413788587113_1078x746.png 1272w, https://substackcdn.com/image/fetch/$s_!tasX!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffef594f9-382f-438e-97c1-413788587113_1078x746.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!tasX!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffef594f9-382f-438e-97c1-413788587113_1078x746.png" width="531" height="367.4638218923933" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/fef594f9-382f-438e-97c1-413788587113_1078x746.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:746,&quot;width&quot;:1078,&quot;resizeWidth&quot;:531,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:&quot;&quot;,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!tasX!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffef594f9-382f-438e-97c1-413788587113_1078x746.png 424w, https://substackcdn.com/image/fetch/$s_!tasX!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffef594f9-382f-438e-97c1-413788587113_1078x746.png 848w, https://substackcdn.com/image/fetch/$s_!tasX!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffef594f9-382f-438e-97c1-413788587113_1078x746.png 1272w, https://substackcdn.com/image/fetch/$s_!tasX!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffef594f9-382f-438e-97c1-413788587113_1078x746.png 1456w" sizes="100vw" loading="lazy"></picture></div></a><figcaption class="image-caption">An example of dimension and measure definitions in a Holistics model. 
<a href="https://docs.holistics.io/docs/model-fields">Source</a> </figcaption></figure></div><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!rwOO!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F04853595-b4ba-4bc0-933a-bfb135771589_604x454.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!rwOO!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F04853595-b4ba-4bc0-933a-bfb135771589_604x454.png 424w, https://substackcdn.com/image/fetch/$s_!rwOO!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F04853595-b4ba-4bc0-933a-bfb135771589_604x454.png 848w, https://substackcdn.com/image/fetch/$s_!rwOO!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F04853595-b4ba-4bc0-933a-bfb135771589_604x454.png 1272w, https://substackcdn.com/image/fetch/$s_!rwOO!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F04853595-b4ba-4bc0-933a-bfb135771589_604x454.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!rwOO!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F04853595-b4ba-4bc0-933a-bfb135771589_604x454.png" width="396" height="297.65562913907286" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/04853595-b4ba-4bc0-933a-bfb135771589_604x454.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:454,&quot;width&quot;:604,&quot;resizeWidth&quot;:396,&quot;bytes&quot;:51913,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://vutr.substack.com/i/170275925?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F04853595-b4ba-4bc0-933a-bfb135771589_604x454.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!rwOO!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F04853595-b4ba-4bc0-933a-bfb135771589_604x454.png 424w, https://substackcdn.com/image/fetch/$s_!rwOO!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F04853595-b4ba-4bc0-933a-bfb135771589_604x454.png 848w, https://substackcdn.com/image/fetch/$s_!rwOO!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F04853595-b4ba-4bc0-933a-bfb135771589_604x454.png 1272w, https://substackcdn.com/image/fetch/$s_!rwOO!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F04853595-b4ba-4bc0-933a-bfb135771589_604x454.png 1456w" sizes="100vw" loading="lazy"></picture></div></a><figcaption class="image-caption">An example of a relationship in a Holistics model. <a href="https://docs.holistics.io/docs/relationships">Source</a> </figcaption></figure></div><p>Since Holistics has embraced the semantic layer from the beginning, it is easier for them to develop the text-to-SQL feature. 
They&#8217;ve tried several approaches, including letting the AI models offload the generation of SQL to the semantic layer by translating the user&#8217;s natural language input to a format that the semantic layer could understand, such as a JSON payload.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!UIKG!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F23228cb8-31e4-4780-a222-b792ebe20319_930x498.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!UIKG!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F23228cb8-31e4-4780-a222-b792ebe20319_930x498.png 424w, https://substackcdn.com/image/fetch/$s_!UIKG!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F23228cb8-31e4-4780-a222-b792ebe20319_930x498.png 848w, https://substackcdn.com/image/fetch/$s_!UIKG!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F23228cb8-31e4-4780-a222-b792ebe20319_930x498.png 1272w, https://substackcdn.com/image/fetch/$s_!UIKG!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F23228cb8-31e4-4780-a222-b792ebe20319_930x498.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!UIKG!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F23228cb8-31e4-4780-a222-b792ebe20319_930x498.png" width="930" height="498" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/23228cb8-31e4-4780-a222-b792ebe20319_930x498.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:498,&quot;width&quot;:930,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:99488,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://vutr.substack.com/i/170275925?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F23228cb8-31e4-4780-a222-b792ebe20319_930x498.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!UIKG!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F23228cb8-31e4-4780-a222-b792ebe20319_930x498.png 424w, https://substackcdn.com/image/fetch/$s_!UIKG!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F23228cb8-31e4-4780-a222-b792ebe20319_930x498.png 848w, https://substackcdn.com/image/fetch/$s_!UIKG!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F23228cb8-31e4-4780-a222-b792ebe20319_930x498.png 1272w, https://substackcdn.com/image/fetch/$s_!UIKG!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F23228cb8-31e4-4780-a222-b792ebe20319_930x498.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><p>This way, the text-to-SQL process can become even more reliable: the SQL queries are now generated by the semantic layer from well-tested logic and predefined entities, rather than guessed by the AI model.</p><h2>Even with the semantic layer, it might not be enough for text-to-SQL</h2><p>Although relying entirely on the semantic layer could be beneficial, this approach may be limited by the fact that the input format, such as JSON, doesn&#8217;t provide users with the necessary flexibility in cases of complex analytics requirements.</p><p>For example, consider a pseudo-format like this:</p><pre><code>{ "metrics": ["total_sales"], "dimensions": ["country"]}</code></pre><p>It serves well for simple questions. 
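</p><p>To make the idea concrete, here is a minimal Python sketch of how a semantic layer could compile such a payload into SQL. Everything in it is an assumption for illustration: the <code>METRICS</code>, <code>DIMENSIONS</code>, <code>BASE_TABLE</code>, and <code>JOINS</code> definitions, the table and column names, and the <code>compile_payload</code> helper are hypothetical and are not the actual Holistics format or API.</p><pre><code>import json

# Hypothetical semantic-layer definitions, for illustration only.
METRICS = {"total_sales": "SUM(orders.amount)"}
DIMENSIONS = {"country": "customers.country"}
BASE_TABLE = "orders"
JOINS = ["JOIN customers ON orders.customer_id = customers.id"]


def compile_payload(payload):
    """Turn a metrics/dimensions payload into one SQL string."""
    # Unknown names raise KeyError instead of letting the model guess.
    dims = [DIMENSIONS[name] for name in payload["dimensions"]]
    metrics = [METRICS[name] + " AS " + name for name in payload["metrics"]]
    sql = "SELECT " + ", ".join(dims + metrics)
    sql += " FROM " + BASE_TABLE + " " + " ".join(JOINS)
    if dims:
        sql += " GROUP BY " + ", ".join(dims)
    return sql


payload = json.loads('{"metrics": ["total_sales"], "dimensions": ["country"]}')
print(compile_payload(payload))
</code></pre><p>In this sketch, the AI model only has to pick names that exist in the layer; an unknown metric or dimension fails with a <code>KeyError</code> instead of turning into guessed SQL.</p><p>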
However, the key-value formats could cause users trouble when expressing queries that require more advanced techniques, such as nested aggregation or period-over-period comparison.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!tvdf!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F58cf029a-e11b-4a85-9b70-cc99180d0104_758x434.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!tvdf!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F58cf029a-e11b-4a85-9b70-cc99180d0104_758x434.png 424w, https://substackcdn.com/image/fetch/$s_!tvdf!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F58cf029a-e11b-4a85-9b70-cc99180d0104_758x434.png 848w, https://substackcdn.com/image/fetch/$s_!tvdf!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F58cf029a-e11b-4a85-9b70-cc99180d0104_758x434.png 1272w, https://substackcdn.com/image/fetch/$s_!tvdf!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F58cf029a-e11b-4a85-9b70-cc99180d0104_758x434.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!tvdf!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F58cf029a-e11b-4a85-9b70-cc99180d0104_758x434.png" width="758" height="434" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/58cf029a-e11b-4a85-9b70-cc99180d0104_758x434.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:434,&quot;width&quot;:758,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:88153,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://vutr.substack.com/i/170275925?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F58cf029a-e11b-4a85-9b70-cc99180d0104_758x434.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!tvdf!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F58cf029a-e11b-4a85-9b70-cc99180d0104_758x434.png 424w, https://substackcdn.com/image/fetch/$s_!tvdf!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F58cf029a-e11b-4a85-9b70-cc99180d0104_758x434.png 848w, https://substackcdn.com/image/fetch/$s_!tvdf!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F58cf029a-e11b-4a85-9b70-cc99180d0104_758x434.png 1272w, https://substackcdn.com/image/fetch/$s_!tvdf!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F58cf029a-e11b-4a85-9b70-cc99180d0104_758x434.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 
0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>So, letting the AI model generate the SQL directly is less reliable, but interacting via the semantic layer with the intermediate format is less flexible. 
What do we do?</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!KX9I!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F95f45fc9-3374-4955-9ecd-765b60f55446_790x308.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!KX9I!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F95f45fc9-3374-4955-9ecd-765b60f55446_790x308.png 424w, https://substackcdn.com/image/fetch/$s_!KX9I!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F95f45fc9-3374-4955-9ecd-765b60f55446_790x308.png 848w, https://substackcdn.com/image/fetch/$s_!KX9I!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F95f45fc9-3374-4955-9ecd-765b60f55446_790x308.png 1272w, https://substackcdn.com/image/fetch/$s_!KX9I!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F95f45fc9-3374-4955-9ecd-765b60f55446_790x308.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!KX9I!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F95f45fc9-3374-4955-9ecd-765b60f55446_790x308.png" width="790" height="308" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/95f45fc9-3374-4955-9ecd-765b60f55446_790x308.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:308,&quot;width&quot;:790,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:96454,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://vutr.substack.com/i/170275925?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F95f45fc9-3374-4955-9ecd-765b60f55446_790x308.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!KX9I!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F95f45fc9-3374-4955-9ecd-765b60f55446_790x308.png 424w, https://substackcdn.com/image/fetch/$s_!KX9I!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F95f45fc9-3374-4955-9ecd-765b60f55446_790x308.png 848w, https://substackcdn.com/image/fetch/$s_!KX9I!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F95f45fc9-3374-4955-9ecd-765b60f55446_790x308.png 1272w, https://substackcdn.com/image/fetch/$s_!KX9I!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F95f45fc9-3374-4955-9ecd-765b60f55446_790x308.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 
0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>Holistics chooses to let the AI model generate the queries, but in a more reliable and controllable way. The model still relies on the semantic layer for business context; however, it has been trained to generate a different query language instead of SQL, which they call AQL. Let&#8217;s delve into this language before moving on.</p><h3>The AQL language</h3><p>When the platform was first built, the creator behind Holistics had already developed a proprietary language for analytics, known as <a href="https://docs.holistics.io/as-code/aql/">AQL</a>. This language is designed to leverage the defined semantic layer, allowing us to query data at a higher level of abstraction.</p><p>AQL treats metrics as first-class citizens, making metric definition composable and reusable. 
This differs from SQL, where everything is a query. To reuse a metric in SQL, you must save the query that calculates it somewhere, such as in a CTE, a view, or a table, and whenever the metric logic changes, you must modify that query.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!MuFh!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9bd6bc41-69d0-419e-8d18-59418541bd6e_804x920.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!MuFh!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9bd6bc41-69d0-419e-8d18-59418541bd6e_804x920.png 424w, https://substackcdn.com/image/fetch/$s_!MuFh!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9bd6bc41-69d0-419e-8d18-59418541bd6e_804x920.png 848w, https://substackcdn.com/image/fetch/$s_!MuFh!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9bd6bc41-69d0-419e-8d18-59418541bd6e_804x920.png 1272w, https://substackcdn.com/image/fetch/$s_!MuFh!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9bd6bc41-69d0-419e-8d18-59418541bd6e_804x920.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!MuFh!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9bd6bc41-69d0-419e-8d18-59418541bd6e_804x920.png" width="804" height="920" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/9bd6bc41-69d0-419e-8d18-59418541bd6e_804x920.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:920,&quot;width&quot;:804,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:221016,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://vutr.substack.com/i/170275925?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9bd6bc41-69d0-419e-8d18-59418541bd6e_804x920.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!MuFh!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9bd6bc41-69d0-419e-8d18-59418541bd6e_804x920.png 424w, https://substackcdn.com/image/fetch/$s_!MuFh!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9bd6bc41-69d0-419e-8d18-59418541bd6e_804x920.png 848w, https://substackcdn.com/image/fetch/$s_!MuFh!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9bd6bc41-69d0-419e-8d18-59418541bd6e_804x920.png 1272w, https://substackcdn.com/image/fetch/$s_!MuFh!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9bd6bc41-69d0-419e-8d18-59418541bd6e_804x920.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" 
viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>AQL queries are written using business concepts (dimensions and measures) defined in the semantic layer, not raw table and column names. A user can ask for <code>total_revenue</code> by <code>user_country</code> without writing complex JOIN statements. This abstraction simplifies query writing and drastically improves the readability and maintainability of analytics code.</p><p>Additionally, AQL introduces the pipe operator <code>|</code>, which takes the result of the expression on its left and uses it as the input for the function on its right. 
This creates a clear, sequential, top-to-bottom flow of logic.</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!5bUh!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F809780ae-dad0-401e-9865-35a721bfeed9_808x94.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!5bUh!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F809780ae-dad0-401e-9865-35a721bfeed9_808x94.png 424w, https://substackcdn.com/image/fetch/$s_!5bUh!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F809780ae-dad0-401e-9865-35a721bfeed9_808x94.png 848w, https://substackcdn.com/image/fetch/$s_!5bUh!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F809780ae-dad0-401e-9865-35a721bfeed9_808x94.png 1272w, https://substackcdn.com/image/fetch/$s_!5bUh!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F809780ae-dad0-401e-9865-35a721bfeed9_808x94.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!5bUh!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F809780ae-dad0-401e-9865-35a721bfeed9_808x94.png" width="808" height="94" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/809780ae-dad0-401e-9865-35a721bfeed9_808x94.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:94,&quot;width&quot;:808,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:16554,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://vutr.substack.com/i/170275925?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F809780ae-dad0-401e-9865-35a721bfeed9_808x94.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!5bUh!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F809780ae-dad0-401e-9865-35a721bfeed9_808x94.png 424w, https://substackcdn.com/image/fetch/$s_!5bUh!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F809780ae-dad0-401e-9865-35a721bfeed9_808x94.png 848w, https://substackcdn.com/image/fetch/$s_!5bUh!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F809780ae-dad0-401e-9865-35a721bfeed9_808x94.png 1272w, https://substackcdn.com/image/fetch/$s_!5bUh!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F809780ae-dad0-401e-9865-35a721bfeed9_808x94.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a><figcaption class="image-caption">Count the number of male users. 
<a href="https://docs.holistics.io/as-code/reference/metric-expression">Source</a></figcaption></figure></div><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!8f6p!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5ae9f3d8-c8de-44fd-90c9-ee33af08edc9_870x160.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!8f6p!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5ae9f3d8-c8de-44fd-90c9-ee33af08edc9_870x160.png 424w, https://substackcdn.com/image/fetch/$s_!8f6p!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5ae9f3d8-c8de-44fd-90c9-ee33af08edc9_870x160.png 848w, https://substackcdn.com/image/fetch/$s_!8f6p!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5ae9f3d8-c8de-44fd-90c9-ee33af08edc9_870x160.png 1272w, https://substackcdn.com/image/fetch/$s_!8f6p!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5ae9f3d8-c8de-44fd-90c9-ee33af08edc9_870x160.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!8f6p!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5ae9f3d8-c8de-44fd-90c9-ee33af08edc9_870x160.png" width="870" height="160" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/5ae9f3d8-c8de-44fd-90c9-ee33af08edc9_870x160.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:160,&quot;width&quot;:870,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:32746,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://vutr.substack.com/i/170275925?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5ae9f3d8-c8de-44fd-90c9-ee33af08edc9_870x160.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!8f6p!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5ae9f3d8-c8de-44fd-90c9-ee33af08edc9_870x160.png 424w, https://substackcdn.com/image/fetch/$s_!8f6p!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5ae9f3d8-c8de-44fd-90c9-ee33af08edc9_870x160.png 848w, https://substackcdn.com/image/fetch/$s_!8f6p!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5ae9f3d8-c8de-44fd-90c9-ee33af08edc9_870x160.png 1272w, https://substackcdn.com/image/fetch/$s_!8f6p!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5ae9f3d8-c8de-44fd-90c9-ee33af08edc9_870x160.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a><figcaption class="image-caption">The running total of the number of orders in 2023. 
<a href="https://docs.holistics.io/as-code/reference/expression">Source</a></figcaption></figure></div><p>Users express their metrics in AQL; Holistics then converts them to SQL queries and executes them on the defined database.</p><h2>The solution</h2><p>Back to Holistics: their text-to-SQL flow looks like this. They trained their AI models to accept natural-language input and output AQL queries with the help of the semantic layer. Each AQL query is then converted to a SQL query.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!Hy88!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe4fc26d3-4d41-4249-ae20-dcf47205bdd2_994x546.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!Hy88!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe4fc26d3-4d41-4249-ae20-dcf47205bdd2_994x546.png 424w, https://substackcdn.com/image/fetch/$s_!Hy88!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe4fc26d3-4d41-4249-ae20-dcf47205bdd2_994x546.png 848w, https://substackcdn.com/image/fetch/$s_!Hy88!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe4fc26d3-4d41-4249-ae20-dcf47205bdd2_994x546.png 1272w, https://substackcdn.com/image/fetch/$s_!Hy88!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe4fc26d3-4d41-4249-ae20-dcf47205bdd2_994x546.png 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!Hy88!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe4fc26d3-4d41-4249-ae20-dcf47205bdd2_994x546.png" width="994" height="546" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/e4fc26d3-4d41-4249-ae20-dcf47205bdd2_994x546.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:546,&quot;width&quot;:994,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:124177,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://vutr.substack.com/i/170275925?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe4fc26d3-4d41-4249-ae20-dcf47205bdd2_994x546.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!Hy88!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe4fc26d3-4d41-4249-ae20-dcf47205bdd2_994x546.png 424w, https://substackcdn.com/image/fetch/$s_!Hy88!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe4fc26d3-4d41-4249-ae20-dcf47205bdd2_994x546.png 848w, https://substackcdn.com/image/fetch/$s_!Hy88!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe4fc26d3-4d41-4249-ae20-dcf47205bdd2_994x546.png 1272w, https://substackcdn.com/image/fetch/$s_!Hy88!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe4fc26d3-4d41-4249-ae20-dcf47205bdd2_994x546.png 1456w" sizes="100vw" 
loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>The result is AI-generated queries that are more verifiable, reliable, and governed than those produced by systems that attempt direct text-to-SQL translation:</p><ul><li><p><strong>Verifiable &amp; Readable:</strong> Because AQL is a high-level language that operates on business logic, the queries it generates are far more compact and intuitive than raw SQL. 
A user can look at a piped AQL query, immediately understand the logical steps the AI is taking, and confirm that the AI has captured the intent of the question.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!IbOn!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F74702200-ed0d-4d2a-a8b4-e1c52caff799_552x380.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!IbOn!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F74702200-ed0d-4d2a-a8b4-e1c52caff799_552x380.png 424w, https://substackcdn.com/image/fetch/$s_!IbOn!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F74702200-ed0d-4d2a-a8b4-e1c52caff799_552x380.png 848w, https://substackcdn.com/image/fetch/$s_!IbOn!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F74702200-ed0d-4d2a-a8b4-e1c52caff799_552x380.png 1272w, https://substackcdn.com/image/fetch/$s_!IbOn!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F74702200-ed0d-4d2a-a8b4-e1c52caff799_552x380.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!IbOn!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F74702200-ed0d-4d2a-a8b4-e1c52caff799_552x380.png" width="552" height="380" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/74702200-ed0d-4d2a-a8b4-e1c52caff799_552x380.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:380,&quot;width&quot;:552,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:102101,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://vutr.substack.com/i/170275925?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F74702200-ed0d-4d2a-a8b4-e1c52caff799_552x380.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!IbOn!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F74702200-ed0d-4d2a-a8b4-e1c52caff799_552x380.png 424w, https://substackcdn.com/image/fetch/$s_!IbOn!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F74702200-ed0d-4d2a-a8b4-e1c52caff799_552x380.png 848w, https://substackcdn.com/image/fetch/$s_!IbOn!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F74702200-ed0d-4d2a-a8b4-e1c52caff799_552x380.png 1272w, https://substackcdn.com/image/fetch/$s_!IbOn!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F74702200-ed0d-4d2a-a8b4-e1c52caff799_552x380.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" 
viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><ul><li><p>This human-readability is critical for verification; it allows the model trainer or the end users to understand what the AI is doing. 
This is an improvement over reading through messy SQL queries.</p></li><li><p>The high-level abstraction that AQL provides reduces the risk of errors and hallucinations compared with having the AI write low-level SQL queries from scratch.</p></li><li><p>Because the AQL-to-SQL conversion is handled by Holistics&#8217; well-tested system, the generated SQL can be relied on as long as the AQL is correct.</p></li></ul></li><li><p><strong>Reliable:</strong> By abstracting away the most error-prone aspects of query generation (such as dialect-specific syntax, complex join logic, and the formulas for advanced analytics), the system significantly increases its reliability.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!hQJh!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F65fbdb8a-edd8-4d7c-b62f-320ed6d441bb_532x262.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!hQJh!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F65fbdb8a-edd8-4d7c-b62f-320ed6d441bb_532x262.png 424w, https://substackcdn.com/image/fetch/$s_!hQJh!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F65fbdb8a-edd8-4d7c-b62f-320ed6d441bb_532x262.png 848w, https://substackcdn.com/image/fetch/$s_!hQJh!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F65fbdb8a-edd8-4d7c-b62f-320ed6d441bb_532x262.png 1272w, 
https://substackcdn.com/image/fetch/$s_!hQJh!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F65fbdb8a-edd8-4d7c-b62f-320ed6d441bb_532x262.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!hQJh!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F65fbdb8a-edd8-4d7c-b62f-320ed6d441bb_532x262.png" width="532" height="262" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/65fbdb8a-edd8-4d7c-b62f-320ed6d441bb_532x262.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:262,&quot;width&quot;:532,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:80057,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://vutr.substack.com/i/170275925?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F65fbdb8a-edd8-4d7c-b62f-320ed6d441bb_532x262.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!hQJh!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F65fbdb8a-edd8-4d7c-b62f-320ed6d441bb_532x262.png 424w, https://substackcdn.com/image/fetch/$s_!hQJh!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F65fbdb8a-edd8-4d7c-b62f-320ed6d441bb_532x262.png 848w, https://substackcdn.com/image/fetch/$s_!hQJh!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F65fbdb8a-edd8-4d7c-b62f-320ed6d441bb_532x262.png 1272w, 
https://substackcdn.com/image/fetch/$s_!hQJh!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F65fbdb8a-edd8-4d7c-b62f-320ed6d441bb_532x262.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><ul><li><p>The AI&#8217;s task is simplified to mapping intent to predefined metrics and dimensions in AQL. 
This leads to more accurate and dependable results.</p></li></ul></li><li><p><strong>Governed:</strong> Because every AQL query must operate through the semantic layer, it automatically inherits the organization&#8217;s single source of truth for business definitions.</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!rg-C!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff3621057-2207-4b7d-a587-cc369c1d61c7_338x228.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!rg-C!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff3621057-2207-4b7d-a587-cc369c1d61c7_338x228.png 424w, https://substackcdn.com/image/fetch/$s_!rg-C!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff3621057-2207-4b7d-a587-cc369c1d61c7_338x228.png 848w, https://substackcdn.com/image/fetch/$s_!rg-C!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff3621057-2207-4b7d-a587-cc369c1d61c7_338x228.png 1272w, https://substackcdn.com/image/fetch/$s_!rg-C!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff3621057-2207-4b7d-a587-cc369c1d61c7_338x228.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!rg-C!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff3621057-2207-4b7d-a587-cc369c1d61c7_338x228.png" width="338" height="228" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/f3621057-2207-4b7d-a587-cc369c1d61c7_338x228.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:228,&quot;width&quot;:338,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:45618,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://vutr.substack.com/i/170275925?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff3621057-2207-4b7d-a587-cc369c1d61c7_338x228.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!rg-C!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff3621057-2207-4b7d-a587-cc369c1d61c7_338x228.png 424w, https://substackcdn.com/image/fetch/$s_!rg-C!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff3621057-2207-4b7d-a587-cc369c1d61c7_338x228.png 848w, https://substackcdn.com/image/fetch/$s_!rg-C!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff3621057-2207-4b7d-a587-cc369c1d61c7_338x228.png 1272w, https://substackcdn.com/image/fetch/$s_!rg-C!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff3621057-2207-4b7d-a587-cc369c1d61c7_338x228.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a></figure></div><ul><li><p>The AI won&#8217;t invent its metric calculations. 
Furthermore, access controls defined in the semantic layer are automatically enforced, ensuring that users can only query data they are authorized to access.</p></li></ul></li><li><p><strong>Flexible:</strong> AQL is designed to express complex metrics seamlessly, including when the query is generated by AI, so a text-to-SQL system built on it is not restricted to simple queries by the limitations of an intermediate format such as JSON.</p></li></ul><div><hr></div><h2>Outro</h2><p>In this article, we first explored why extracting data insights using natural language is gaining increasing attention. Next, we examined the challenges of Text-to-SQL and saw that a semantic layer offers a promising way to improve accuracy.</p><p>Finally, we looked at a real-life example: Holistics, which builds its Text-to-SQL solution on a semantic layer and its self-developed analytics language, AQL.</p><p>Thank you for reading this far. See you next time.</p><div><hr></div><h2>References</h2><p><em>[1] Phuc Nguyen, <a href="https://community.holistics.io/t/the-ideal-semantic-layer-and-metric-centric-paradigm-blog-post/1507">The Ideal Semantic Layer and Metric-Centric Paradigm</a>, 2023</em></p><p><em>[2] Tan Huynh, <a href="https://www.holistics.io/blog/metrics-deserve-better-composition/#composition-in-sql">Metrics Deserve Better Composition Than What SQL Allows</a>, 2024</em></p><p><em>[3] <a href="https://docs.holistics.io/docs/ai/architecture">Holistics AI Architecture</a></em></p><p><em>[4] <a href="https://docs.holistics.io/docs/">Holistics Official Documentation</a></em></p><p><em>[5] Justin Heinze, <a href="https://www.betterbuys.com/bi/history-of-business-intelligence/">History of Business Intelligence</a>, 2020</em></p>]]></content:encoded></item><item><title><![CDATA[Where does your task run in Apache Airflow?]]></title><description><![CDATA[Everything about the Airflow 
Executors]]></description><link>https://vutr.substack.com/p/where-does-your-task-run-in-apache</link><guid isPermaLink="false">https://vutr.substack.com/p/where-does-your-task-run-in-apache</guid><dc:creator><![CDATA[Vu Trinh]]></dc:creator><pubDate>Thu, 14 Aug 2025 03:15:20 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!EwIL!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F69edd9b1-4a25-485c-b327-1c9dc3d89725_2000x1428.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!EwIL!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F69edd9b1-4a25-485c-b327-1c9dc3d89725_2000x1428.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!EwIL!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F69edd9b1-4a25-485c-b327-1c9dc3d89725_2000x1428.png 424w, 
https://substackcdn.com/image/fetch/$s_!EwIL!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F69edd9b1-4a25-485c-b327-1c9dc3d89725_2000x1428.png 848w, https://substackcdn.com/image/fetch/$s_!EwIL!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F69edd9b1-4a25-485c-b327-1c9dc3d89725_2000x1428.png 1272w, https://substackcdn.com/image/fetch/$s_!EwIL!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F69edd9b1-4a25-485c-b327-1c9dc3d89725_2000x1428.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!EwIL!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F69edd9b1-4a25-485c-b327-1c9dc3d89725_2000x1428.png" width="1456" height="1040" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/69edd9b1-4a25-485c-b327-1c9dc3d89725_2000x1428.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1040,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:382377,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://vutr.substack.com/i/169443779?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F69edd9b1-4a25-485c-b327-1c9dc3d89725_2000x1428.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" 
srcset="https://substackcdn.com/image/fetch/$s_!EwIL!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F69edd9b1-4a25-485c-b327-1c9dc3d89725_2000x1428.png 424w, https://substackcdn.com/image/fetch/$s_!EwIL!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F69edd9b1-4a25-485c-b327-1c9dc3d89725_2000x1428.png 848w, https://substackcdn.com/image/fetch/$s_!EwIL!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F69edd9b1-4a25-485c-b327-1c9dc3d89725_2000x1428.png 1272w, https://substackcdn.com/image/fetch/$s_!EwIL!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F69edd9b1-4a25-485c-b327-1c9dc3d89725_2000x1428.png 1456w" sizes="100vw" fetchpriority="high"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" 
stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><div><hr></div><h2>Intro</h2><p>Orchestrating the data pipeline is as crucial as its task logic and performance. Luckily, we don&#8217;t have to do that from scratch; many available tools can help us.</p><p>Among them, Airflow appears to be the dominant solution, thanks to its openness and active community. However, as data engineers, writing DAG files is not enough; we need to understand the underlying concepts to operate the tool confidently.</p><p>This article takes a closer look at one of the most important aspects of Airflow: <a href="https://airflow.apache.org/docs/apache-airflow/stable/core-concepts/executor/index.html">the mechanism by which tasks are run, Airflow&#8217;s executor</a>.</p><p>We first briefly revisit Airflow, and then we explore the executor and its available options.</p><div><hr></div><h2>The History</h2><p>Apache Airflow was created in 2014 at <strong>Airbnb</strong> when the company was dealing with massive and increasingly complex data workflows. At the time, existing orchestration tools were too rigid, lacked scalability, or couldn&#8217;t accommodate the dynamic nature of data pipelines. To address this challenge, <strong>Maxime Beauchemin</strong>, a data engineer at Airbnb, spearheaded the creation of Airflow.</p><p>Airflow quickly gained traction and, in 2016, joined the <strong>Apache Software Foundation</strong> as an incubating project, backed by a robust and growing open-source community.</p><p>If you join a new company these days, you're likely to work with Airflow.</p><div><hr></div><h2>Overview</h2><p>Orchestrating a complete data pipeline presents numerous challenges. 
When should we schedule the data retrieval from a third-party API? How do we effectively manage dependencies between the API call and the data processing job? What happens in the event of a failure? Can we observe it? If so, can we retry?</p><p><strong>Apache Airflow</strong> simplifies this problem by allowing engineers to define workflows as code and automating their execution.</p><p>At its core, Airflow uses <strong>Directed Acyclic Graphs (DAGs)</strong> to model workflows. A DAG is essentially a roadmap for the workflow and contains two main components:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!b8-B!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe73baaa6-8c3d-46f0-ae1e-6c60fb3ba21d_708x324.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!b8-B!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe73baaa6-8c3d-46f0-ae1e-6c60fb3ba21d_708x324.png 424w, https://substackcdn.com/image/fetch/$s_!b8-B!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe73baaa6-8c3d-46f0-ae1e-6c60fb3ba21d_708x324.png 848w, https://substackcdn.com/image/fetch/$s_!b8-B!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe73baaa6-8c3d-46f0-ae1e-6c60fb3ba21d_708x324.png 1272w, https://substackcdn.com/image/fetch/$s_!b8-B!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe73baaa6-8c3d-46f0-ae1e-6c60fb3ba21d_708x324.png 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!b8-B!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe73baaa6-8c3d-46f0-ae1e-6c60fb3ba21d_708x324.png" width="708" height="324" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/e73baaa6-8c3d-46f0-ae1e-6c60fb3ba21d_708x324.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:324,&quot;width&quot;:708,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:74377,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://vutr.substack.com/i/169443779?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe73baaa6-8c3d-46f0-ae1e-6c60fb3ba21d_708x324.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!b8-B!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe73baaa6-8c3d-46f0-ae1e-6c60fb3ba21d_708x324.png 424w, https://substackcdn.com/image/fetch/$s_!b8-B!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe73baaa6-8c3d-46f0-ae1e-6c60fb3ba21d_708x324.png 848w, https://substackcdn.com/image/fetch/$s_!b8-B!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe73baaa6-8c3d-46f0-ae1e-6c60fb3ba21d_708x324.png 1272w, https://substackcdn.com/image/fetch/$s_!b8-B!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe73baaa6-8c3d-46f0-ae1e-6c60fb3ba21d_708x324.png 1456w" sizes="100vw" 
loading="lazy"></picture></div></a></figure></div><ul><li><p><strong>Tasks (Nodes)</strong>: Individual work units, such as running a query, copying data, executing a script, or calling an API.</p></li><li><p><strong>Dependencies (Edges)</strong>: The relationships between tasks that define their execution order (e.g., preprocessing is executed only after retrieving data from a third-party API).</p></li></ul><p>Airflow ensures tasks are executed sequentially or in parallel (based on their dependencies), automatically manages retries on failure (based on their retry configuration), and thoroughly logs task execution for monitoring and debugging 
purposes.</p><h2>The Internals</h2><p>There are several components inside Airflow:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!-xgl!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fde4b0932-6c5e-4c08-b028-9232cb66c5bb_666x448.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!-xgl!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fde4b0932-6c5e-4c08-b028-9232cb66c5bb_666x448.png 424w, https://substackcdn.com/image/fetch/$s_!-xgl!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fde4b0932-6c5e-4c08-b028-9232cb66c5bb_666x448.png 848w, https://substackcdn.com/image/fetch/$s_!-xgl!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fde4b0932-6c5e-4c08-b028-9232cb66c5bb_666x448.png 1272w, https://substackcdn.com/image/fetch/$s_!-xgl!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fde4b0932-6c5e-4c08-b028-9232cb66c5bb_666x448.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!-xgl!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fde4b0932-6c5e-4c08-b028-9232cb66c5bb_666x448.png" width="666" height="448" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/de4b0932-6c5e-4c08-b028-9232cb66c5bb_666x448.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:448,&quot;width&quot;:666,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:85413,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://vutr.substack.com/i/169443779?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fde4b0932-6c5e-4c08-b028-9232cb66c5bb_666x448.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!-xgl!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fde4b0932-6c5e-4c08-b028-9232cb66c5bb_666x448.png 424w, https://substackcdn.com/image/fetch/$s_!-xgl!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fde4b0932-6c5e-4c08-b028-9232cb66c5bb_666x448.png 848w, https://substackcdn.com/image/fetch/$s_!-xgl!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fde4b0932-6c5e-4c08-b028-9232cb66c5bb_666x448.png 1272w, https://substackcdn.com/image/fetch/$s_!-xgl!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fde4b0932-6c5e-4c08-b028-9232cb66c5bb_666x448.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 
0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><ol><li><p><strong>Scheduler</strong>: The component responsible for parsing DAG files, scheduling tasks, and queuing them for execution based on their dependencies and schedules. The <strong>executor</strong> logic runs inside the scheduler.</p></li><li><p><strong>Web Server</strong>: Provides the Airflow UI, allowing users to visualize workflows, monitor task execution, inspect logs, and trigger DAG runs.</p></li><li><p><strong>Metadata Database</strong>: A central database that stores all metadata, including DAG definitions, task states, execution logs, and schedules. 
It&#8217;s essential for tracking the history of workflows.</p></li><li><p><strong>DAG folder</strong>: The directory that contains the DAG files defined by users.</p></li><li><p><strong>Workers</strong>: Components that execute the tasks assigned by the executor.</p><blockquote><p><em>The executor is our main dish today, and we will discuss it very soon.</em></p></blockquote></li></ol><h3>Workflow Between Components</h3><p>The workflow between Airflow&#8217;s components can be broken down into the following steps:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!Q7ET!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc67cf488-c058-4236-b8a6-9b72475edf26_1438x986.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!Q7ET!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc67cf488-c058-4236-b8a6-9b72475edf26_1438x986.png 424w, https://substackcdn.com/image/fetch/$s_!Q7ET!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc67cf488-c058-4236-b8a6-9b72475edf26_1438x986.png 848w, https://substackcdn.com/image/fetch/$s_!Q7ET!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc67cf488-c058-4236-b8a6-9b72475edf26_1438x986.png 1272w, https://substackcdn.com/image/fetch/$s_!Q7ET!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc67cf488-c058-4236-b8a6-9b72475edf26_1438x986.png 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!Q7ET!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc67cf488-c058-4236-b8a6-9b72475edf26_1438x986.png" width="1438" height="986" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/c67cf488-c058-4236-b8a6-9b72475edf26_1438x986.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:986,&quot;width&quot;:1438,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:491265,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://vutr.substack.com/i/169443779?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc67cf488-c058-4236-b8a6-9b72475edf26_1438x986.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!Q7ET!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc67cf488-c058-4236-b8a6-9b72475edf26_1438x986.png 424w, https://substackcdn.com/image/fetch/$s_!Q7ET!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc67cf488-c058-4236-b8a6-9b72475edf26_1438x986.png 848w, https://substackcdn.com/image/fetch/$s_!Q7ET!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc67cf488-c058-4236-b8a6-9b72475edf26_1438x986.png 1272w, https://substackcdn.com/image/fetch/$s_!Q7ET!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc67cf488-c058-4236-b8a6-9b72475edf26_1438x986.png 1456w" 
sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><ol><li><p><strong>DAG Definition</strong>: Users define a DAG with the desired tasks and logic, including when it should start running and its schedule interval.</p></li><li><p><strong>DAG Parsing</strong>: The Scheduler scans the DAG directory, parses the DAG files, and loads them into the Metadata Database.</p></li><li><p><strong>Scheduling</strong>: Based on the DAG definitions and schedule intervals, the Scheduler determines which tasks are ready for execution and queues them.</p></li><li><p><strong>Task Execution</strong>: The Executor fetches the queued tasks and assigns them to 
available Workers. The Workers execute the tasks, and task states are updated in the Metadata Database.</p></li><li><p><strong>Monitoring</strong>: The Web Server queries the Metadata Database and visualizes DAG runs, task statuses, and logs in real-time. Users can monitor task progress, inspect logs, or trigger manual DAG runs from the UI.</p></li><li><p><strong>Retries and State Updates</strong>: If a task fails, the Scheduler ensures retries are handled according to the task configuration. The Executor updates task states in the database until all tasks are completed successfully or fail beyond retry limits.</p></li></ol><div><hr></div><p><em><strong>This article is sponsored by Astronomer</strong>.</em></p><div><hr></div><h2>Deployment </h2><p>Deploying Airflow ranges from running a lightweight local instance for testing and development to setting up a robust, scalable, and production-ready environment. Here's an overview of the deployment process:</p><h3>On a single machine </h3><p>Airflow can be deployed directly on a single machine (airflow standalone) for testing and development. 
This setup will initiate all the required components (scheduler, web server, and database) as separate processes on our machine.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!S9rA!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F645e0455-4c5e-4bfc-9ccd-841b6f94415b_844x442.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!S9rA!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F645e0455-4c5e-4bfc-9ccd-841b6f94415b_844x442.png 424w, https://substackcdn.com/image/fetch/$s_!S9rA!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F645e0455-4c5e-4bfc-9ccd-841b6f94415b_844x442.png 848w, https://substackcdn.com/image/fetch/$s_!S9rA!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F645e0455-4c5e-4bfc-9ccd-841b6f94415b_844x442.png 1272w, https://substackcdn.com/image/fetch/$s_!S9rA!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F645e0455-4c5e-4bfc-9ccd-841b6f94415b_844x442.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!S9rA!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F645e0455-4c5e-4bfc-9ccd-841b6f94415b_844x442.png" width="844" height="442" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/645e0455-4c5e-4bfc-9ccd-841b6f94415b_844x442.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:442,&quot;width&quot;:844,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:126808,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://vutr.substack.com/i/169443779?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F645e0455-4c5e-4bfc-9ccd-841b6f94415b_844x442.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!S9rA!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F645e0455-4c5e-4bfc-9ccd-841b6f94415b_844x442.png 424w, https://substackcdn.com/image/fetch/$s_!S9rA!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F645e0455-4c5e-4bfc-9ccd-841b6f94415b_844x442.png 848w, https://substackcdn.com/image/fetch/$s_!S9rA!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F645e0455-4c5e-4bfc-9ccd-841b6f94415b_844x442.png 1272w, https://substackcdn.com/image/fetch/$s_!S9rA!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F645e0455-4c5e-4bfc-9ccd-841b6f94415b_844x442.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" 
viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>Another way to deploy Airflow on a single machine is to run each component in its own container (via Docker or on a local Kubernetes cluster such as Minikube).</p><p>However, a single-machine deployment is insufficient for running Airflow in production, which requires scalability, availability, and fault tolerance.</p><h3><strong>Distributed Deployment</strong></h3><p>Airflow can also be deployed in a distributed architecture, where components are deployed independently and redundantly. Each component lives on its own machine and can optionally run as multiple instances on different machines (e.g., the scheduler and the web server run on two different machines, and the scheduler has three instances deployed on three machines).</p><p>This setup enables better load distribution, making it well-suited for handling large-scale workflows. 
The most common approach for deploying Airflow's distributed architecture that I observed is using Kubernetes.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!niF4!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe04bfc7e-245a-4edf-964c-4f9069820451_752x598.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!niF4!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe04bfc7e-245a-4edf-964c-4f9069820451_752x598.png 424w, https://substackcdn.com/image/fetch/$s_!niF4!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe04bfc7e-245a-4edf-964c-4f9069820451_752x598.png 848w, https://substackcdn.com/image/fetch/$s_!niF4!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe04bfc7e-245a-4edf-964c-4f9069820451_752x598.png 1272w, https://substackcdn.com/image/fetch/$s_!niF4!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe04bfc7e-245a-4edf-964c-4f9069820451_752x598.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!niF4!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe04bfc7e-245a-4edf-964c-4f9069820451_752x598.png" width="752" height="598" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/e04bfc7e-245a-4edf-964c-4f9069820451_752x598.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:598,&quot;width&quot;:752,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:177968,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://vutr.substack.com/i/169443779?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe04bfc7e-245a-4edf-964c-4f9069820451_752x598.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!niF4!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe04bfc7e-245a-4edf-964c-4f9069820451_752x598.png 424w, https://substackcdn.com/image/fetch/$s_!niF4!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe04bfc7e-245a-4edf-964c-4f9069820451_752x598.png 848w, https://substackcdn.com/image/fetch/$s_!niF4!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe04bfc7e-245a-4edf-964c-4f9069820451_752x598.png 1272w, https://substackcdn.com/image/fetch/$s_!niF4!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe04bfc7e-245a-4edf-964c-4f9069820451_752x598.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" 
viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>If you use Airflow managed by a cloud vendor like AWS or Google, the vendor runs the Airflow components on its own container infrastructure (Google&#8217;s Cloud Composer, for example, runs on a Kubernetes cluster), and all the DAG files are stored in object storage (S3 for AWS and GCS for Google Cloud).</p><h2>Executors</h2><p>Now, the main dish.</p><p>Executors in Airflow determine where and how tasks are executed. Different executors offer varying levels of scalability, isolation, and complexity.</p><h3>SequentialExecutor</h3><blockquote><p><em>Categorized as Local Executor; it is replaced by the LocalExecutor in Airflow 3</em></p></blockquote><p>This executor runs tasks sequentially (one after another) within a single process on the same machine as the scheduler. It is mostly used for development and local testing. 
It's simple but unsuitable for production due to its lack of parallelism.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!p733!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff47b6a58-fe6d-4b64-ae98-2ae80c89e88f_828x604.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!p733!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff47b6a58-fe6d-4b64-ae98-2ae80c89e88f_828x604.png 424w, https://substackcdn.com/image/fetch/$s_!p733!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff47b6a58-fe6d-4b64-ae98-2ae80c89e88f_828x604.png 848w, https://substackcdn.com/image/fetch/$s_!p733!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff47b6a58-fe6d-4b64-ae98-2ae80c89e88f_828x604.png 1272w, https://substackcdn.com/image/fetch/$s_!p733!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff47b6a58-fe6d-4b64-ae98-2ae80c89e88f_828x604.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!p733!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff47b6a58-fe6d-4b64-ae98-2ae80c89e88f_828x604.png" width="828" height="604" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/f47b6a58-fe6d-4b64-ae98-2ae80c89e88f_828x604.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:604,&quot;width&quot;:828,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:123289,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://vutr.substack.com/i/169443779?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff47b6a58-fe6d-4b64-ae98-2ae80c89e88f_828x604.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!p733!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff47b6a58-fe6d-4b64-ae98-2ae80c89e88f_828x604.png 424w, https://substackcdn.com/image/fetch/$s_!p733!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff47b6a58-fe6d-4b64-ae98-2ae80c89e88f_828x604.png 848w, https://substackcdn.com/image/fetch/$s_!p733!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff47b6a58-fe6d-4b64-ae98-2ae80c89e88f_828x604.png 1272w, https://substackcdn.com/image/fetch/$s_!p733!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff47b6a58-fe6d-4b64-ae98-2ae80c89e88f_828x604.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" 
viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>A critical operational detail is that this executor <a href="https://airflow.apache.org/docs/apache-airflow/2.3.2/production-deployment.html#multi-node-cluster">pauses the scheduler while a task is running</a>. 
This characteristic is a significant concern for production environments, as it prevents the scheduler from continuously monitoring or queuing new tasks.</p><p>The SequentialExecutor is also unique in its ability to operate with SQLite as its database backend, which fits its one-task-at-a-time nature, since SQLite does not support multiple concurrent connections.</p><h4><strong>Pros</strong></h4><ul><li><p>Its greatest strength is simplicity, requiring no external dependencies or complex configurations.</p></li></ul><h4><strong>Cons</strong></h4><ul><li><p>Can&#8217;t run tasks in parallel</p></li></ul><h3>LocalExecutor</h3><blockquote><p><em>Categorized as Local Executor</em></p></blockquote><p>The LocalExecutor improves on the SequentialExecutor by introducing parallelism while keeping a relatively simple single-machine setup. Concurrency is achieved through multiple processes on that machine. It is suitable for small- to medium-sized workflows that require concurrency but don't need distributed execution.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!EW0O!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F10e3bc81-d35d-4b98-a98c-ee488dc6ea21_712x542.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!EW0O!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F10e3bc81-d35d-4b98-a98c-ee488dc6ea21_712x542.png 424w, https://substackcdn.com/image/fetch/$s_!EW0O!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F10e3bc81-d35d-4b98-a98c-ee488dc6ea21_712x542.png 848w, 
https://substackcdn.com/image/fetch/$s_!EW0O!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F10e3bc81-d35d-4b98-a98c-ee488dc6ea21_712x542.png 1272w, https://substackcdn.com/image/fetch/$s_!EW0O!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F10e3bc81-d35d-4b98-a98c-ee488dc6ea21_712x542.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!EW0O!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F10e3bc81-d35d-4b98-a98c-ee488dc6ea21_712x542.png" width="712" height="542" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/10e3bc81-d35d-4b98-a98c-ee488dc6ea21_712x542.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:542,&quot;width&quot;:712,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:138711,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://vutr.substack.com/i/169443779?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F10e3bc81-d35d-4b98-a98c-ee488dc6ea21_712x542.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!EW0O!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F10e3bc81-d35d-4b98-a98c-ee488dc6ea21_712x542.png 424w, https://substackcdn.com/image/fetch/$s_!EW0O!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F10e3bc81-d35d-4b98-a98c-ee488dc6ea21_712x542.png 848w, 
https://substackcdn.com/image/fetch/$s_!EW0O!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F10e3bc81-d35d-4b98-a98c-ee488dc6ea21_712x542.png 1272w, https://substackcdn.com/image/fetch/$s_!EW0O!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F10e3bc81-d35d-4b98-a98c-ee488dc6ea21_712x542.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>To support this parallelism, a robust database backend such as MySQL or PostgreSQL is required, as SQLite does not handle the multiple 
connections necessary for concurrent operations.</p><p>The LocalExecutor has two modes:</p><ul><li><p><strong>Unlimited Parallelism (</strong><code>parallelism == 0</code><strong>):</strong> In this mode, a new process is spawned for every submitted task and terminates when the task completes. This is a direct, on-demand approach to task execution.</p></li><li><p><strong>Limited Parallelism (</strong><code>parallelism &gt; 0</code><strong>):</strong> This is the more common configuration for a production environment. A fixed number of worker processes (equal to the <code>parallelism</code> value) are pre-spawned at startup. These workers continuously pull tasks from the queue and remain active throughout the executor's lifecycle.</p></li></ul><p>When there are multiple Schedulers, each runs its own local executor, so tasks are distributed across the Schedulers&#8217; machines.</p><h4><strong>Pros</strong></h4><ul><li><p>Simple to set up</p></li><li><p>Can leverage multiple CPU cores on the host machine, giving higher concurrency than the SequentialExecutor</p></li></ul><h4><strong>Cons</strong></h4><ul><li><p>Limited by the resources (CPU, RAM, etc.) of the Scheduler machines. Adding task processing capacity means adding more Scheduler machines.</p></li></ul><h3>CeleryExecutor</h3><blockquote><p><em>Categorized as Remote Executor</em></p></blockquote><p>The CeleryExecutor takes us into distributed systems and horizontal scaling. It relies on <a href="https://docs.celeryq.dev/en/latest/getting-started/introduction.html">Celery, a robust distributed task queue</a>. Unlike the two executors above, the workers that run the tasks are separate from the scheduler machines. 
</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!xId_!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F27d1617d-0e8d-4ff1-9452-71a8a9ade295_1140x890.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!xId_!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F27d1617d-0e8d-4ff1-9452-71a8a9ade295_1140x890.png 424w, https://substackcdn.com/image/fetch/$s_!xId_!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F27d1617d-0e8d-4ff1-9452-71a8a9ade295_1140x890.png 848w, https://substackcdn.com/image/fetch/$s_!xId_!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F27d1617d-0e8d-4ff1-9452-71a8a9ade295_1140x890.png 1272w, https://substackcdn.com/image/fetch/$s_!xId_!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F27d1617d-0e8d-4ff1-9452-71a8a9ade295_1140x890.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!xId_!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F27d1617d-0e8d-4ff1-9452-71a8a9ade295_1140x890.png" width="1140" height="890" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/27d1617d-0e8d-4ff1-9452-71a8a9ade295_1140x890.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:890,&quot;width&quot;:1140,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:281135,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://vutr.substack.com/i/169443779?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F27d1617d-0e8d-4ff1-9452-71a8a9ade295_1140x890.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!xId_!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F27d1617d-0e8d-4ff1-9452-71a8a9ade295_1140x890.png 424w, https://substackcdn.com/image/fetch/$s_!xId_!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F27d1617d-0e8d-4ff1-9452-71a8a9ade295_1140x890.png 848w, https://substackcdn.com/image/fetch/$s_!xId_!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F27d1617d-0e8d-4ff1-9452-71a8a9ade295_1140x890.png 1272w, https://substackcdn.com/image/fetch/$s_!xId_!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F27d1617d-0e8d-4ff1-9452-71a8a9ade295_1140x890.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" 
viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>The CeleryExecutor setup involves:</p><ul><li><p>A message broker (most commonly RabbitMQ or Redis)</p></li><li><p>Celery workers</p></li></ul><p>Celery workers are typically long-running processes that continuously run to pick up tasks, allowing more than one task to run concurrently on a worker. To scale the task-running capability, we add more machines that run Celery worker processes. 
Similar to the LocalExecutor, it requires a robust, non-SQLite database (e.g., MySQL or PostgreSQL).</p><h4>Pros</h4><ul><li><p>Decouples task execution from the Scheduler.</p></li><li><p>Scales horizontally by adding more machines that run Celery workers.</p></li></ul><h4>Cons</h4><ul><li><p>More components than the local executors &#8594; more maintenance overhead.</p></li><li><p>Noisy neighbor: a heavy task could affect other tasks on the shared machine that runs the Celery worker.</p></li><li><p>Poor resource utilization: running a fixed number of Celery workers continuously leaves resources underutilized when few tasks are running.</p></li><li><p>The overhead of scaling worker machines.</p></li></ul><h3>KubernetesExecutor</h3><blockquote><p><em>Categorized as Containerized Executor</em></p></blockquote><p>This executor is designed for cloud-native and containerized environments. It dynamically creates a Kubernetes pod for each task. 
For me, this one provides the best resource isolation, scalability, and fault tolerance.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!-7pt!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd7d91a51-d11c-490a-aec2-1e2c555e54c0_862x476.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!-7pt!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd7d91a51-d11c-490a-aec2-1e2c555e54c0_862x476.png 424w, https://substackcdn.com/image/fetch/$s_!-7pt!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd7d91a51-d11c-490a-aec2-1e2c555e54c0_862x476.png 848w, https://substackcdn.com/image/fetch/$s_!-7pt!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd7d91a51-d11c-490a-aec2-1e2c555e54c0_862x476.png 1272w, https://substackcdn.com/image/fetch/$s_!-7pt!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd7d91a51-d11c-490a-aec2-1e2c555e54c0_862x476.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!-7pt!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd7d91a51-d11c-490a-aec2-1e2c555e54c0_862x476.png" width="862" height="476" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/d7d91a51-d11c-490a-aec2-1e2c555e54c0_862x476.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:476,&quot;width&quot;:862,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:144680,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://vutr.substack.com/i/169443779?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd7d91a51-d11c-490a-aec2-1e2c555e54c0_862x476.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!-7pt!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd7d91a51-d11c-490a-aec2-1e2c555e54c0_862x476.png 424w, https://substackcdn.com/image/fetch/$s_!-7pt!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd7d91a51-d11c-490a-aec2-1e2c555e54c0_862x476.png 848w, https://substackcdn.com/image/fetch/$s_!-7pt!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd7d91a51-d11c-490a-aec2-1e2c555e54c0_862x476.png 1272w, https://substackcdn.com/image/fetch/$s_!-7pt!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd7d91a51-d11c-490a-aec2-1e2c555e54c0_862x476.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" 
viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>When the Airflow scheduler senses that a task is ready for execution, it requests a new pod from the Kubernetes API.  This newly created pod then executes the task, reports its result back to the Airflow metadata database, and terminates upon task completion (users can choose to persist the pod for debugging later)</p><h4><strong>Pros</strong></h4><ul><li><p>Better resource utilization: resources are consumed only when tasks are actively running, leading to cost savings during idle periods.</p></li><li><p>Better isolation: each task can have its own pod with configurable resources. 
Unlike the above executors, KubernetesExecutor allows for better Python dependency management, as different tasks (pods) can have different sets of dependencies.</p></li></ul><h4><strong>Cons</strong></h4><ul><li><p>Cold start: a Kubernetes pod must be initialized (e.g., pulling the Docker image and starting the container) before it can run the task, so tasks may take longer to start than with the above executors.</p></li><li><p>Requires strong knowledge of containerization and Kubernetes, which potentially requires more resources to manage (e.g., SRE teams).</p></li><li><p>Hard to test, as it requires users to have a Kubernetes environment.</p></li></ul><h2>Multiple Executors</h2><p>Before Airflow 2.10, an Airflow environment was limited to a single executor for all its tasks. With the introduction of multiple executor support in Airflow 2.10, users can specify different executors for different tasks. </p><p>Most of the time, a single executor is sufficient. However, with diverse workloads, the &#8220;one size fits all&#8221; approach may not be effective. Multiple executors could help here. For example:</p><ul><li><p><strong>Short, Numerous Tasks:</strong> Some DAGs might consist of very small tasks. A CeleryExecutor with pre-warmed workers can excel here due to low task startup latency. Small tasks also release resources quickly, which limits noisy neighbor problems.</p></li><li><p><strong>Long-Running, Resource-Intensive Tasks:</strong> Long-running tasks consume significant CPU/memory on Celery workers, which might lead to "noisy neighbor" problems. 
With KubernetesExecutor, each task gets its own isolated pod with precisely allocated resources.</p></li></ul><div><hr></div><h2>Outro</h2><p>In this article, we first revisit the fundamentals of Airflow and then explore its common executor options, starting with the local ones, the Sequential and Local executors, which are simple to set up but limited in concurrency.</p><p>We then move on to the distributed option with CeleryExecutor, and finally learn about the most isolated and scalable option with KubernetesExecutor. However, more power comes with more responsibility, as this last one requires strong knowledge of Kubernetes to operate smoothly.</p><p>We also learn that Airflow allows us to specify more than one executor in a single environment.</p><p>Thank you for reading this far. See you next time.</p><div><hr></div><h2>References</h2><p><em>[1] <a href="https://airflow.apache.org/docs/apache-airflow/stable/index.html">Apache Airflow Official Documentation</a></em></p><p><em>[3] Airbnb Engineering, <a href="https://medium.com/airbnb-engineering/airflow-a-workflow-management-platform-46318b977fd8">Airflow: a workflow management platform</a> (2015)</em></p>]]></content:encoded></item><item><title><![CDATA[If you're learning Apache Spark, this article is for you]]></title><description><![CDATA[A baseline for your Spark learning and research.]]></description><link>https://vutr.substack.com/p/if-youre-learning-apache-spark-this</link><guid isPermaLink="false">https://vutr.substack.com/p/if-youre-learning-apache-spark-this</guid><dc:creator><![CDATA[Vu Trinh]]></dc:creator><pubDate>Thu, 26 Jun 2025 03:15:30 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!R2LW!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcd8884c0-bf56-4bcc-b3a1-6bb9398be232_2000x1429.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<blockquote><p>To celebrate Lunar New Year (the true New Year holiday in Vietnam), 
I&#8217;m offering <em><strong>50% off the annual subscription</strong></em>. The offer ends soon; grab it now to get full access to nearly 200 high-quality data engineering articles.</p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://vutr.substack.com/subscribe&quot;,&quot;text&quot;:&quot;50% off the annual subscription&quot;,&quot;action&quot;:null,&quot;class&quot;:&quot;button-wrapper&quot;}" data-component-name="ButtonCreateButton"><a class="button primary button-wrapper" href="https://vutr.substack.com/subscribe"><span>50% off the annual subscription</span></a></p></blockquote><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!R2LW!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcd8884c0-bf56-4bcc-b3a1-6bb9398be232_2000x1429.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!R2LW!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcd8884c0-bf56-4bcc-b3a1-6bb9398be232_2000x1429.png 424w, https://substackcdn.com/image/fetch/$s_!R2LW!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcd8884c0-bf56-4bcc-b3a1-6bb9398be232_2000x1429.png 848w, https://substackcdn.com/image/fetch/$s_!R2LW!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcd8884c0-bf56-4bcc-b3a1-6bb9398be232_2000x1429.png 1272w, https://substackcdn.com/image/fetch/$s_!R2LW!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcd8884c0-bf56-4bcc-b3a1-6bb9398be232_2000x1429.png 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!R2LW!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcd8884c0-bf56-4bcc-b3a1-6bb9398be232_2000x1429.png" width="1456" height="1040" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/cd8884c0-bf56-4bcc-b3a1-6bb9398be232_2000x1429.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1040,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:233467,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://vutr.substack.com/i/166248471?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcd8884c0-bf56-4bcc-b3a1-6bb9398be232_2000x1429.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!R2LW!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcd8884c0-bf56-4bcc-b3a1-6bb9398be232_2000x1429.png 424w, https://substackcdn.com/image/fetch/$s_!R2LW!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcd8884c0-bf56-4bcc-b3a1-6bb9398be232_2000x1429.png 848w, https://substackcdn.com/image/fetch/$s_!R2LW!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcd8884c0-bf56-4bcc-b3a1-6bb9398be232_2000x1429.png 1272w, https://substackcdn.com/image/fetch/$s_!R2LW!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcd8884c0-bf56-4bcc-b3a1-6bb9398be232_2000x1429.png 1456w" 
sizes="100vw" fetchpriority="high"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><div><hr></div><h2>Intro</h2><p>At the time of this writing, Apache Spark has been released in its fourth major version, which includes many improvements and innovations.</p><p>However, I believe its core and fundamentals won&#8217;t change soon.</p><p>I have written this article to help you establish a good baseline for learning and researching Spark. It distills everything I know about this infamous engine.</p><blockquote><p><em><strong>Note</strong>: This article contains illustrations with many details. 
I recommend reading it on a laptop or PC to get the full experience.</em></p></blockquote><div><hr></div><h2>Overview</h2><p>In 2004, Google released a paper introducing a programming paradigm called MapReduce to distribute the data processing to hundreds or thousands of machines.</p><p>In MapReduce, users have to explicitly define the Map and the Reduce functions:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!XNId!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd8ba3862-0acc-4964-8312-71fff4e278b8_684x626.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!XNId!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd8ba3862-0acc-4964-8312-71fff4e278b8_684x626.png 424w, https://substackcdn.com/image/fetch/$s_!XNId!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd8ba3862-0acc-4964-8312-71fff4e278b8_684x626.png 848w, https://substackcdn.com/image/fetch/$s_!XNId!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd8ba3862-0acc-4964-8312-71fff4e278b8_684x626.png 1272w, https://substackcdn.com/image/fetch/$s_!XNId!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd8ba3862-0acc-4964-8312-71fff4e278b8_684x626.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!XNId!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd8ba3862-0acc-4964-8312-71fff4e278b8_684x626.png" width="684" height="626" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/d8ba3862-0acc-4964-8312-71fff4e278b8_684x626.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:626,&quot;width&quot;:684,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:178771,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://vutr.substack.com/i/166248471?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd8ba3862-0acc-4964-8312-71fff4e278b8_684x626.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!XNId!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd8ba3862-0acc-4964-8312-71fff4e278b8_684x626.png 424w, https://substackcdn.com/image/fetch/$s_!XNId!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd8ba3862-0acc-4964-8312-71fff4e278b8_684x626.png 848w, https://substackcdn.com/image/fetch/$s_!XNId!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd8ba3862-0acc-4964-8312-71fff4e278b8_684x626.png 1272w, https://substackcdn.com/image/fetch/$s_!XNId!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd8ba3862-0acc-4964-8312-71fff4e278b8_684x626.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" 
viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><ul><li><p><strong>Map</strong>: It takes key/value pair inputs, processes them, and outputs intermediate key/value pairs. Then, all values of the same key will be grouped and passed to the Reduce tasks.</p></li><li><p><strong>Reduce</strong>: It receives intermediate values from Map tasks. It then merges the intermediate values from the same key using the defined logic (e.g., Count, Sum, ...)</p></li></ul><p>To ensure fault tolerance (e.g., a worker dies during the process), MapReduce relies on disk to exchange intermediate data between data tasks. 
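</p><p>The Map and Reduce steps above can be sketched as a tiny single-process word count in Python. This is only an illustration under my own assumptions: the names <code>map_fn</code>, <code>shuffle</code>, <code>reduce_fn</code>, and <code>word_count</code>, as well as the sample documents, are hypothetical, and everything stays in memory here, whereas a real MapReduce job runs across many machines and exchanges intermediate data through disk.</p><pre><code class="language-python">from collections import defaultdict


def map_fn(doc_id, text):
    """Map: emit an intermediate (word, 1) pair for every word in the input value."""
    return [(word.lower(), 1) for word in text.split()]


def shuffle(pairs):
    """Group all intermediate values that share a key before they reach the reducers."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_fn(word, values):
    """Reduce: merge the intermediate values of one key (here, by summing them)."""
    return word, sum(values)


def word_count(documents):
    """Run the map, shuffle, and reduce phases one after another in a single process."""
    intermediate = []
    for doc_id, text in documents.items():
        intermediate.extend(map_fn(doc_id, text))
    return dict(reduce_fn(key, values) for key, values in shuffle(intermediate).items())


# Hypothetical sample input: document id mapped to document text.
docs = {"d1": "spark builds on mapreduce", "d2": "mapreduce came before spark"}
counts = word_count(docs)
print(counts["spark"], counts["mapreduce"])
</code></pre><p>In a real cluster, each <code>map_fn</code> call would run as a separate Map task, and the grouping done by <code>shuffle</code> is the step where intermediate data is written to and read from disk.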
</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!7U2z!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F808a6733-819d-4d72-8148-9c4d3802bd0d_524x472.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!7U2z!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F808a6733-819d-4d72-8148-9c4d3802bd0d_524x472.png 424w, https://substackcdn.com/image/fetch/$s_!7U2z!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F808a6733-819d-4d72-8148-9c4d3802bd0d_524x472.png 848w, https://substackcdn.com/image/fetch/$s_!7U2z!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F808a6733-819d-4d72-8148-9c4d3802bd0d_524x472.png 1272w, https://substackcdn.com/image/fetch/$s_!7U2z!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F808a6733-819d-4d72-8148-9c4d3802bd0d_524x472.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!7U2z!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F808a6733-819d-4d72-8148-9c4d3802bd0d_524x472.png" width="524" height="472" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/808a6733-819d-4d72-8148-9c4d3802bd0d_524x472.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:472,&quot;width&quot;:524,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:85429,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://vutr.substack.com/i/166248471?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F808a6733-819d-4d72-8148-9c4d3802bd0d_524x472.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!7U2z!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F808a6733-819d-4d72-8148-9c4d3802bd0d_524x472.png 424w, https://substackcdn.com/image/fetch/$s_!7U2z!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F808a6733-819d-4d72-8148-9c4d3802bd0d_524x472.png 848w, https://substackcdn.com/image/fetch/$s_!7U2z!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F808a6733-819d-4d72-8148-9c4d3802bd0d_524x472.png 1272w, https://substackcdn.com/image/fetch/$s_!7U2z!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F808a6733-819d-4d72-8148-9c4d3802bd0d_524x472.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 
0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>Based on Google's paper, Yahoo released the open-sourced implementation of MapReduce, which soon became the go-to solution for distributed data processing. It rose and dominated, but it wouldn&#8217;t last long.</p><p>The strict Map and Reduce paradigm limits the flexibility, and the disk-based data exchange might not be suitable for use cases like machine learning or interactive queries.</p><p>UC Berkeley&#8217;s AMPLab saw a problem that needed to be solved. 
Although cluster computing had a lot of potential, they observed that MapReduce-style engines could be inefficient for multistep and interactive applications.</p><p>They created Apache Spark, which offers a functional programming-based API to simplify multistep applications and a new engine for efficient in-memory data sharing across computation steps.</p><div><hr></div><h2>Spark RDD</h2><p>Unlike MapReduce, Spark relies heavily on in-memory processing. Its creators introduced the Resilient Distributed Dataset (RDD) abstraction to manage Spark&#8217;s data in memory. Whichever higher-level abstraction you use, Dataset or DataFrame, your code is compiled into RDD operations behind the scenes.</p><p>An RDD represents an <strong>immutable</strong>, <strong>partitioned collection</strong> of records that can be operated on in parallel. Data inside an RDD is kept in memory for as long as possible. </p><h3>Why RDDs are immutable</h3><p>You might wonder why Spark RDDs are immutable. Here are some of my notes:</p><ul><li><p><strong>Concurrent Processing:</strong> Immutability keeps data consistent across multiple nodes and threads, avoiding complex synchronization and race conditions.</p></li><li><p><strong>Lineage and Fault Tolerance:</strong> Each transformation creates a new RDD, preserving the lineage and allowing Spark to recompute lost data reliably. 
Mutable RDDs would make this much harder.</p></li><li><p><strong>Functional Programming:</strong> RDDs follow principles that emphasize immutability, making handling failures easier and maintaining data integrity.</p></li></ul><h3>Properties</h3><p>Each RDD in Spark has five key properties:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!9C-B!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff829400a-c878-42d3-a5bb-255154f1fe5d_526x362.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!9C-B!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff829400a-c878-42d3-a5bb-255154f1fe5d_526x362.png 424w, https://substackcdn.com/image/fetch/$s_!9C-B!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff829400a-c878-42d3-a5bb-255154f1fe5d_526x362.png 848w, https://substackcdn.com/image/fetch/$s_!9C-B!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff829400a-c878-42d3-a5bb-255154f1fe5d_526x362.png 1272w, https://substackcdn.com/image/fetch/$s_!9C-B!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff829400a-c878-42d3-a5bb-255154f1fe5d_526x362.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!9C-B!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff829400a-c878-42d3-a5bb-255154f1fe5d_526x362.png" width="526" height="362" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/f829400a-c878-42d3-a5bb-255154f1fe5d_526x362.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:362,&quot;width&quot;:526,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:73188,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://vutr.substack.com/i/166248471?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff829400a-c878-42d3-a5bb-255154f1fe5d_526x362.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!9C-B!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff829400a-c878-42d3-a5bb-255154f1fe5d_526x362.png 424w, https://substackcdn.com/image/fetch/$s_!9C-B!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff829400a-c878-42d3-a5bb-255154f1fe5d_526x362.png 848w, https://substackcdn.com/image/fetch/$s_!9C-B!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff829400a-c878-42d3-a5bb-255154f1fe5d_526x362.png 1272w, https://substackcdn.com/image/fetch/$s_!9C-B!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff829400a-c878-42d3-a5bb-255154f1fe5d_526x362.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 
0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><ul><li><p><strong>List of Partitions:</strong> An RDD is divided into partitions, Spark's parallelism units. 
Each partition is a logical subset of the data and can be processed independently by a different executor (more on executors later).</p></li><li><p><strong>Computation Function:</strong> A function that determines how to compute the data for each partition.</p></li><li><p><strong>Dependencies:</strong> The RDD tracks its dependencies on other RDDs, which describe how it was created.</p></li><li><p><strong>Partitioner (Optional):</strong> For key-value RDDs, a partitioner specifies how records are assigned to partitions, for example by hashing the key.</p></li><li><p><strong>Preferred Locations (Optional):</strong> This property lists the preferred locations for computing each partition, such as the locations of the data blocks in HDFS.</p></li></ul><h3>Lazy</h3><p>When you define an RDD, its data is not loaded or transformed immediately; nothing is computed until an action triggers execution. This approach allows Spark to determine the most efficient way to execute the transformations. Speaking of transformations and actions: </p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!z37r!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F39ca1705-4f3d-46ed-8c77-2c3ff6962d11_998x554.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!z37r!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F39ca1705-4f3d-46ed-8c77-2c3ff6962d11_998x554.png 424w, https://substackcdn.com/image/fetch/$s_!z37r!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F39ca1705-4f3d-46ed-8c77-2c3ff6962d11_998x554.png 848w, 
https://substackcdn.com/image/fetch/$s_!z37r!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F39ca1705-4f3d-46ed-8c77-2c3ff6962d11_998x554.png 1272w, https://substackcdn.com/image/fetch/$s_!z37r!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F39ca1705-4f3d-46ed-8c77-2c3ff6962d11_998x554.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!z37r!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F39ca1705-4f3d-46ed-8c77-2c3ff6962d11_998x554.png" width="998" height="554" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/39ca1705-4f3d-46ed-8c77-2c3ff6962d11_998x554.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:554,&quot;width&quot;:998,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:164070,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://vutr.substack.com/i/166248471?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F39ca1705-4f3d-46ed-8c77-2c3ff6962d11_998x554.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!z37r!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F39ca1705-4f3d-46ed-8c77-2c3ff6962d11_998x554.png 424w, https://substackcdn.com/image/fetch/$s_!z37r!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F39ca1705-4f3d-46ed-8c77-2c3ff6962d11_998x554.png 848w, 
https://substackcdn.com/image/fetch/$s_!z37r!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F39ca1705-4f3d-46ed-8c77-2c3ff6962d11_998x554.png 1272w, https://substackcdn.com/image/fetch/$s_!z37r!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F39ca1705-4f3d-46ed-8c77-2c3ff6962d11_998x554.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><ul><li><p><strong>Transformations</strong>, such as <code>map</code> or <code>filter</code>, define how the data should be transformed, 
but they don't execute until an action forces the computation. Because an RDD is immutable, each transformation returns a new RDD instead of modifying the existing one.</p></li><li><p><strong>Actions</strong> are the commands that Spark runs to produce output or store data, thereby driving the actual execution of the transformations.</p></li></ul><h3>Fault Tolerance</h3><p>Spark RDDs achieve fault tolerance through <em><strong>lineage</strong></em>. </p><p>As mentioned, Spark keeps track of each RDD&#8217;s dependencies on other RDDs, that is, the series of transformations that created it.</p><p>Suppose a partition of an RDD is lost due to a node failure or another issue. Spark can reconstruct the lost data by reapplying the transformations recorded in the lineage to the original dataset. </p><p>This approach eliminates the need to replicate data across nodes or to write intermediate data to disk, as MapReduce does.</p><div><hr></div><h2>Architecture</h2><p>A Spark application consists of:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!HaHQ!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd88e701b-0b7f-4cfb-a7f9-ef49fbdba1a6_458x410.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!HaHQ!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd88e701b-0b7f-4cfb-a7f9-ef49fbdba1a6_458x410.png 424w, https://substackcdn.com/image/fetch/$s_!HaHQ!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd88e701b-0b7f-4cfb-a7f9-ef49fbdba1a6_458x410.png 848w, 
https://substackcdn.com/image/fetch/$s_!HaHQ!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd88e701b-0b7f-4cfb-a7f9-ef49fbdba1a6_458x410.png 1272w, https://substackcdn.com/image/fetch/$s_!HaHQ!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd88e701b-0b7f-4cfb-a7f9-ef49fbdba1a6_458x410.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!HaHQ!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd88e701b-0b7f-4cfb-a7f9-ef49fbdba1a6_458x410.png" width="458" height="410" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/d88e701b-0b7f-4cfb-a7f9-ef49fbdba1a6_458x410.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:410,&quot;width&quot;:458,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:75176,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://vutr.substack.com/i/166248471?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd88e701b-0b7f-4cfb-a7f9-ef49fbdba1a6_458x410.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!HaHQ!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd88e701b-0b7f-4cfb-a7f9-ef49fbdba1a6_458x410.png 424w, https://substackcdn.com/image/fetch/$s_!HaHQ!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd88e701b-0b7f-4cfb-a7f9-ef49fbdba1a6_458x410.png 848w, 
https://substackcdn.com/image/fetch/$s_!HaHQ!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd88e701b-0b7f-4cfb-a7f9-ef49fbdba1a6_458x410.png 1272w, https://substackcdn.com/image/fetch/$s_!HaHQ!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd88e701b-0b7f-4cfb-a7f9-ef49fbdba1a6_458x410.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><ul><li><p><strong>Driver:</strong> This JVM process manages the entire Spark application, from handling user input to distributing tasks 
to the executors.</p></li><li><p><strong>Cluster Manager:</strong> This component manages the cluster of machines running the Spark application. Spark can work with various cluster managers, including YARN, Kubernetes, Apache Mesos, and its own standalone manager.</p></li><li><p><strong>Executors:</strong> These processes execute the tasks the driver assigns and report their status and results back to it. Each Spark application has its own set of executors.</p></li></ul><p>The driver and executors of a Spark application are not the same thing as the cluster that hosts them. To run a Spark application, there must be a cluster of machines (or processes, if you&#8217;re running Spark locally) that provides resources to Spark applications.</p><p>The cluster manager manages this cluster. The machines that can host driver and executor processes are called workers.</p><div><hr></div><h2>Mode</h2><p>Spark has different modes of execution, which are distinguished mainly by where the driver process is located.</p><ul><li><p><strong>Cluster Mode:</strong> In this mode, the driver process is launched on a worker node inside the cluster, alongside the executor processes. 
The cluster manager handles all the processes related to the Spark application.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!jEcD!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F94ddfa3f-4f88-4105-a97d-3be793f3bbc9_452x410.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!jEcD!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F94ddfa3f-4f88-4105-a97d-3be793f3bbc9_452x410.png 424w, https://substackcdn.com/image/fetch/$s_!jEcD!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F94ddfa3f-4f88-4105-a97d-3be793f3bbc9_452x410.png 848w, https://substackcdn.com/image/fetch/$s_!jEcD!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F94ddfa3f-4f88-4105-a97d-3be793f3bbc9_452x410.png 1272w, https://substackcdn.com/image/fetch/$s_!jEcD!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F94ddfa3f-4f88-4105-a97d-3be793f3bbc9_452x410.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!jEcD!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F94ddfa3f-4f88-4105-a97d-3be793f3bbc9_452x410.png" width="452" height="410" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/94ddfa3f-4f88-4105-a97d-3be793f3bbc9_452x410.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:410,&quot;width&quot;:452,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:66437,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://vutr.substack.com/i/166248471?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F94ddfa3f-4f88-4105-a97d-3be793f3bbc9_452x410.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!jEcD!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F94ddfa3f-4f88-4105-a97d-3be793f3bbc9_452x410.png 424w, https://substackcdn.com/image/fetch/$s_!jEcD!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F94ddfa3f-4f88-4105-a97d-3be793f3bbc9_452x410.png 848w, https://substackcdn.com/image/fetch/$s_!jEcD!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F94ddfa3f-4f88-4105-a97d-3be793f3bbc9_452x410.png 1272w, https://substackcdn.com/image/fetch/$s_!jEcD!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F94ddfa3f-4f88-4105-a97d-3be793f3bbc9_452x410.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 
0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div></li><li><p><strong>Client Mode:</strong> The driver remains on the client machine that submitted the application. 
This setup requires the client machine to maintain the driver process throughout the application&#8217;s execution.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!g1qq!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8fb2a082-eaf6-4e72-9640-d52c723556d9_630x392.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!g1qq!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8fb2a082-eaf6-4e72-9640-d52c723556d9_630x392.png 424w, https://substackcdn.com/image/fetch/$s_!g1qq!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8fb2a082-eaf6-4e72-9640-d52c723556d9_630x392.png 848w, https://substackcdn.com/image/fetch/$s_!g1qq!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8fb2a082-eaf6-4e72-9640-d52c723556d9_630x392.png 1272w, https://substackcdn.com/image/fetch/$s_!g1qq!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8fb2a082-eaf6-4e72-9640-d52c723556d9_630x392.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!g1qq!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8fb2a082-eaf6-4e72-9640-d52c723556d9_630x392.png" width="630" height="392" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/8fb2a082-eaf6-4e72-9640-d52c723556d9_630x392.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:392,&quot;width&quot;:630,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:88112,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://vutr.substack.com/i/166248471?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8fb2a082-eaf6-4e72-9640-d52c723556d9_630x392.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!g1qq!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8fb2a082-eaf6-4e72-9640-d52c723556d9_630x392.png 424w, https://substackcdn.com/image/fetch/$s_!g1qq!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8fb2a082-eaf6-4e72-9640-d52c723556d9_630x392.png 848w, https://substackcdn.com/image/fetch/$s_!g1qq!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8fb2a082-eaf6-4e72-9640-d52c723556d9_630x392.png 1272w, https://substackcdn.com/image/fetch/$s_!g1qq!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8fb2a082-eaf6-4e72-9640-d52c723556d9_630x392.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 
0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div></li><li><p><strong>Local mode</strong>: This mode runs the entire Spark application on a single machine, achieving parallelism through multiple threads. 
It&#8217;s commonly used for learning Spark or testing applications in a simpler, local environment.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!2MhX!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F06ea8b90-9f06-4f6e-ad16-b52dedfec9e8_484x372.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!2MhX!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F06ea8b90-9f06-4f6e-ad16-b52dedfec9e8_484x372.png 424w, https://substackcdn.com/image/fetch/$s_!2MhX!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F06ea8b90-9f06-4f6e-ad16-b52dedfec9e8_484x372.png 848w, https://substackcdn.com/image/fetch/$s_!2MhX!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F06ea8b90-9f06-4f6e-ad16-b52dedfec9e8_484x372.png 1272w, https://substackcdn.com/image/fetch/$s_!2MhX!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F06ea8b90-9f06-4f6e-ad16-b52dedfec9e8_484x372.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!2MhX!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F06ea8b90-9f06-4f6e-ad16-b52dedfec9e8_484x372.png" width="484" height="372" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/06ea8b90-9f06-4f6e-ad16-b52dedfec9e8_484x372.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:372,&quot;width&quot;:484,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:54251,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://vutr.substack.com/i/166248471?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F06ea8b90-9f06-4f6e-ad16-b52dedfec9e8_484x372.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!2MhX!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F06ea8b90-9f06-4f6e-ad16-b52dedfec9e8_484x372.png 424w, https://substackcdn.com/image/fetch/$s_!2MhX!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F06ea8b90-9f06-4f6e-ad16-b52dedfec9e8_484x372.png 848w, https://substackcdn.com/image/fetch/$s_!2MhX!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F06ea8b90-9f06-4f6e-ad16-b52dedfec9e8_484x372.png 1272w, https://substackcdn.com/image/fetch/$s_!2MhX!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F06ea8b90-9f06-4f6e-ad16-b52dedfec9e8_484x372.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 
0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div></li></ul><div><hr></div><h2>Anatomy</h2><p>It&#8217;s crucial to understand how Spark manages the workload:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!Rh7Q!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9890fa7c-1050-4266-bbe2-33ae8cec7522_648x316.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!Rh7Q!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9890fa7c-1050-4266-bbe2-33ae8cec7522_648x316.png 424w, 
https://substackcdn.com/image/fetch/$s_!Rh7Q!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9890fa7c-1050-4266-bbe2-33ae8cec7522_648x316.png 848w, https://substackcdn.com/image/fetch/$s_!Rh7Q!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9890fa7c-1050-4266-bbe2-33ae8cec7522_648x316.png 1272w, https://substackcdn.com/image/fetch/$s_!Rh7Q!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9890fa7c-1050-4266-bbe2-33ae8cec7522_648x316.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!Rh7Q!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9890fa7c-1050-4266-bbe2-33ae8cec7522_648x316.png" width="648" height="316" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/9890fa7c-1050-4266-bbe2-33ae8cec7522_648x316.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:false,&quot;imageSize&quot;:&quot;normal&quot;,&quot;height&quot;:316,&quot;width&quot;:648,&quot;resizeWidth&quot;:648,&quot;bytes&quot;:81658,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://vutr.substack.com/i/166248471?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9890fa7c-1050-4266-bbe2-33ae8cec7522_648x316.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:&quot;center&quot;,&quot;offset&quot;:false}" class="sizing-normal" alt="" 
srcset="https://substackcdn.com/image/fetch/$s_!Rh7Q!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9890fa7c-1050-4266-bbe2-33ae8cec7522_648x316.png 424w, https://substackcdn.com/image/fetch/$s_!Rh7Q!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9890fa7c-1050-4266-bbe2-33ae8cec7522_648x316.png 848w, https://substackcdn.com/image/fetch/$s_!Rh7Q!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9890fa7c-1050-4266-bbe2-33ae8cec7522_648x316.png 1272w, https://substackcdn.com/image/fetch/$s_!Rh7Q!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9890fa7c-1050-4266-bbe2-33ae8cec7522_648x316.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" 
stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><ul><li><p><strong>Job:</strong> A job is the unit of work Spark launches when an action (for example, count or collect) is called. It covers the entire series of transformations needed to produce that result.</p></li><li><p><strong>Stage:</strong> A stage is a segment of a job that runs without shuffling data. Spark splits a job into stages wherever a transformation requires shuffling data across partitions.</p></li><li><p><strong>DAG: </strong>Spark uses RDD dependencies to build a Directed Acyclic Graph (DAG) of stages for each job. The DAG ensures that stages are scheduled in topological order, so a stage runs only after the stages it depends on have finished.</p></li><li><p><strong>Task:</strong> A task is the smallest unit of execution in Spark. Each stage is divided into tasks, one per partition, which run in parallel.</p></li></ul><p>You might wonder about the &#8220;data shuffling&#8221; mentioned in the <strong>Stage</strong> definition. 
To understand shuffling, it&#8217;s helpful to first understand narrow and wide dependencies:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!Y_go!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F81b2d727-8095-4700-93e3-ccc3c51bd1d9_638x298.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!Y_go!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F81b2d727-8095-4700-93e3-ccc3c51bd1d9_638x298.png 424w, https://substackcdn.com/image/fetch/$s_!Y_go!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F81b2d727-8095-4700-93e3-ccc3c51bd1d9_638x298.png 848w, https://substackcdn.com/image/fetch/$s_!Y_go!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F81b2d727-8095-4700-93e3-ccc3c51bd1d9_638x298.png 1272w, https://substackcdn.com/image/fetch/$s_!Y_go!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F81b2d727-8095-4700-93e3-ccc3c51bd1d9_638x298.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!Y_go!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F81b2d727-8095-4700-93e3-ccc3c51bd1d9_638x298.png" width="638" height="298" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/81b2d727-8095-4700-93e3-ccc3c51bd1d9_638x298.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:298,&quot;width&quot;:638,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:72627,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://vutr.substack.com/i/166248471?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F81b2d727-8095-4700-93e3-ccc3c51bd1d9_638x298.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!Y_go!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F81b2d727-8095-4700-93e3-ccc3c51bd1d9_638x298.png 424w, https://substackcdn.com/image/fetch/$s_!Y_go!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F81b2d727-8095-4700-93e3-ccc3c51bd1d9_638x298.png 848w, https://substackcdn.com/image/fetch/$s_!Y_go!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F81b2d727-8095-4700-93e3-ccc3c51bd1d9_638x298.png 1272w, https://substackcdn.com/image/fetch/$s_!Y_go!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F81b2d727-8095-4700-93e3-ccc3c51bd1d9_638x298.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 
0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><ul><li><p>Transformations with <strong>narrow dependencies</strong> are those where each partition in the child RDD has a limited number of dependencies on partitions in the parent RDD. These partitions may depend on a single parent (e.g., the map operator) or a specific subset of parent partitions known beforehand (such as with coalesce). </p></li><li><p>Transformations with <strong>wide dependencies</strong> require data to be partitioned in a specific way, where a single partition of a parent RDD contributes to multiple partitions of the child RDD. This typically occurs with operations like groupByKey, reduceByKey, or join, which involve shuffling data. 
Consequently, wide dependencies result in stage boundaries in Spark's execution plan.</p></li></ul><div><hr></div><h2>A typical journey of the Spark application</h2><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!PLHO!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa9a23c75-5ce2-4af8-889d-3dd803876574_1156x696.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!PLHO!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa9a23c75-5ce2-4af8-889d-3dd803876574_1156x696.png 424w, https://substackcdn.com/image/fetch/$s_!PLHO!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa9a23c75-5ce2-4af8-889d-3dd803876574_1156x696.png 848w, https://substackcdn.com/image/fetch/$s_!PLHO!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa9a23c75-5ce2-4af8-889d-3dd803876574_1156x696.png 1272w, https://substackcdn.com/image/fetch/$s_!PLHO!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa9a23c75-5ce2-4af8-889d-3dd803876574_1156x696.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!PLHO!,w_2400,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa9a23c75-5ce2-4af8-889d-3dd803876574_1156x696.png" width="1200" height="722.4913494809689" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/a9a23c75-5ce2-4af8-889d-3dd803876574_1156x696.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:false,&quot;imageSize&quot;:&quot;large&quot;,&quot;height&quot;:696,&quot;width&quot;:1156,&quot;resizeWidth&quot;:1200,&quot;bytes&quot;:427980,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://vutr.substack.com/i/166248471?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa9a23c75-5ce2-4af8-889d-3dd803876574_1156x696.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:&quot;center&quot;,&quot;offset&quot;:false}" class="sizing-large" alt="" srcset="https://substackcdn.com/image/fetch/$s_!PLHO!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa9a23c75-5ce2-4af8-889d-3dd803876574_1156x696.png 424w, https://substackcdn.com/image/fetch/$s_!PLHO!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa9a23c75-5ce2-4af8-889d-3dd803876574_1156x696.png 848w, https://substackcdn.com/image/fetch/$s_!PLHO!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa9a23c75-5ce2-4af8-889d-3dd803876574_1156x696.png 1272w, https://substackcdn.com/image/fetch/$s_!PLHO!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa9a23c75-5ce2-4af8-889d-3dd803876574_1156x696.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg 
role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><blockquote><p>To celebrate Lunar New Year (the true New Year holiday in Vietnam), I&#8217;m offering <em><strong>50% off the annual subscription</strong></em>. The offer ends soon; grab it now to get full access to nearly 200 high-quality data engineering articles.</p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://vutr.substack.com/subscribe&quot;,&quot;text&quot;:&quot;50% off the annual subscription&quot;,&quot;action&quot;:null,&quot;class&quot;:&quot;button-wrapper&quot;}" data-component-name="ButtonCreateButton"><a class="button primary button-wrapper" href="https://vutr.substack.com/subscribe"><span>50% off the annual subscription</span></a></p></blockquote>
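<p>To make the stage-splitting rule from the Anatomy section concrete, here is a minimal sketch in plain Python. It is a toy model, not Spark&#8217;s scheduler: the function name <code>split_into_stages</code>, the <code>WIDE_OPS</code> set, and the sample pipeline are all hypothetical, and the job is treated as a single linear chain of operators (a real join has two parents).</p>

```python
# Toy model only: this is NOT Spark's scheduler, just the stage-splitting idea.
# WIDE_OPS lists the wide-dependency examples named in the text (an assumption
# made for illustration; Spark has more shuffle-producing operators).
WIDE_OPS = {"groupByKey", "reduceByKey", "join"}


def split_into_stages(ops):
    """Group a linear chain of operator names into stages.

    A wide (shuffle) operator starts a new stage; narrow operators are
    pipelined into the current stage.
    """
    stages = [[]]
    for op in ops:
        if op in WIDE_OPS and stages[-1]:
            stages.append([])  # shuffle boundary: close the current stage
        stages[-1].append(op)
    return stages


# Hypothetical pipeline mixing narrow and wide operators.
pipeline = ["map", "filter", "reduceByKey", "map", "join", "coalesce"]
stages = split_into_stages(pipeline)
for i, stage in enumerate(stages):
    print(f"stage {i}: {stage}")
```

<p>On the sample pipeline, map and filter share the first stage, a second stage starts at reduceByKey, and a third starts at join, which mirrors how wide dependencies create stage boundaries.</p>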
      <p>
          <a href="https://vutr.substack.com/p/if-youre-learning-apache-spark-this">
              Read more
          </a>
      </p>
   ]]></content:encoded></item><item><title><![CDATA[How does Doordash evolve realtime processing platform with Iceberg]]></title><description><![CDATA[Apache Flink + Apache Iceberg]]></description><link>https://vutr.substack.com/p/how-do-doordash-evolve-realtime-processing</link><guid isPermaLink="false">https://vutr.substack.com/p/how-do-doordash-evolve-realtime-processing</guid><dc:creator><![CDATA[Vu Trinh]]></dc:creator><pubDate>Thu, 22 May 2025 03:15:23 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!EkON!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd89a8d66-ea08-4682-bab0-d8d80764b2d1_2000x1429.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<div><hr></div><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!EkON!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd89a8d66-ea08-4682-bab0-d8d80764b2d1_2000x1429.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" 
srcset="https://substackcdn.com/image/fetch/$s_!EkON!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd89a8d66-ea08-4682-bab0-d8d80764b2d1_2000x1429.png 424w, https://substackcdn.com/image/fetch/$s_!EkON!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd89a8d66-ea08-4682-bab0-d8d80764b2d1_2000x1429.png 848w, https://substackcdn.com/image/fetch/$s_!EkON!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd89a8d66-ea08-4682-bab0-d8d80764b2d1_2000x1429.png 1272w, https://substackcdn.com/image/fetch/$s_!EkON!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd89a8d66-ea08-4682-bab0-d8d80764b2d1_2000x1429.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!EkON!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd89a8d66-ea08-4682-bab0-d8d80764b2d1_2000x1429.png" width="1456" height="1040" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/d89a8d66-ea08-4682-bab0-d8d80764b2d1_2000x1429.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1040,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:212478,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://vutr.substack.com/i/163813438?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd89a8d66-ea08-4682-bab0-d8d80764b2d1_2000x1429.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" 
class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!EkON!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd89a8d66-ea08-4682-bab0-d8d80764b2d1_2000x1429.png 424w, https://substackcdn.com/image/fetch/$s_!EkON!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd89a8d66-ea08-4682-bab0-d8d80764b2d1_2000x1429.png 848w, https://substackcdn.com/image/fetch/$s_!EkON!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd89a8d66-ea08-4682-bab0-d8d80764b2d1_2000x1429.png 1272w, https://substackcdn.com/image/fetch/$s_!EkON!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd89a8d66-ea08-4682-bab0-d8d80764b2d1_2000x1429.png 1456w" sizes="100vw" fetchpriority="high"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" 
stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><div><hr></div><h2>Intro</h2><p>In the <a href="https://open.substack.com/pub/vutr/p/doordashs-real-time-processing-system?r=2rj6sg&amp;utm_campaign=post&amp;utm_medium=web&amp;showWelcomeOnShare=false">previous article</a>, we examined how <a href="https://www.doordash.com/">DoorDash</a>, one of the largest food delivery platforms in the United States, uses Apache Kafka, Apache Flink, and Snowflake for its real-time processing platform. Flink consumed Kafka messages and wrote them to S3, from which the data was later loaded into Snowflake to serve data users.</p><p>DoorDash recently shared how they improved this architecture by introducing Iceberg. Let&#8217;s dive into the motivation, challenges, and benefits behind this decision. </p><p>All credit for the technical details goes to the DoorDash Engineering Team. This article serves as my notes on their <a href="https://www.youtube.com/watch?v=_nnNHC90nMI&amp;t=541s">technical sharing resource</a>.</p><div><hr></div><h2>Background</h2><p>DoorDash built an internal streaming platform to process real-time events from its applications and support business decisions.</p><p>At peak, the platform can receive more than 30 million messages per second, approximately 5 GB of event data flowing into the system every second. 
These events originate from customers, dashers, merchants, or DoorDash's internal applications.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!FoxS!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F86f5c8a7-5d3d-4ddd-85ee-6980d7773f89_1188x402.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!FoxS!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F86f5c8a7-5d3d-4ddd-85ee-6980d7773f89_1188x402.png 424w, https://substackcdn.com/image/fetch/$s_!FoxS!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F86f5c8a7-5d3d-4ddd-85ee-6980d7773f89_1188x402.png 848w, https://substackcdn.com/image/fetch/$s_!FoxS!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F86f5c8a7-5d3d-4ddd-85ee-6980d7773f89_1188x402.png 1272w, https://substackcdn.com/image/fetch/$s_!FoxS!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F86f5c8a7-5d3d-4ddd-85ee-6980d7773f89_1188x402.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!FoxS!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F86f5c8a7-5d3d-4ddd-85ee-6980d7773f89_1188x402.png" width="1188" height="402" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/86f5c8a7-5d3d-4ddd-85ee-6980d7773f89_1188x402.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:402,&quot;width&quot;:1188,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:71432,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://vutr.substack.com/i/163813438?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F86f5c8a7-5d3d-4ddd-85ee-6980d7773f89_1188x402.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!FoxS!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F86f5c8a7-5d3d-4ddd-85ee-6980d7773f89_1188x402.png 424w, https://substackcdn.com/image/fetch/$s_!FoxS!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F86f5c8a7-5d3d-4ddd-85ee-6980d7773f89_1188x402.png 848w, https://substackcdn.com/image/fetch/$s_!FoxS!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F86f5c8a7-5d3d-4ddd-85ee-6980d7773f89_1188x402.png 1272w, https://substackcdn.com/image/fetch/$s_!FoxS!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F86f5c8a7-5d3d-4ddd-85ee-6980d7773f89_1188x402.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" 
viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>The streaming platform consumes these events, processes them, and writes them to the corresponding tables in the data warehouse. 
Some use cases require the data to be available in near real-time.</p><p>So, how did DoorDash ensure their platform is low-latency and highly scalable?</p><p>As we recall from the <a href="https://open.substack.com/pub/vutr/p/doordashs-real-time-processing-system?r=2rj6sg&amp;utm_campaign=post&amp;utm_medium=web&amp;showWelcomeOnShare=false">previous article</a>, DoorDash buffered incoming data with Kafka and used Flink to process and write the data to the sink.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!OJrC!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc0e8d270-eace-4beb-bbe8-693adb68119e_1588x698.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!OJrC!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc0e8d270-eace-4beb-bbe8-693adb68119e_1588x698.png 424w, https://substackcdn.com/image/fetch/$s_!OJrC!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc0e8d270-eace-4beb-bbe8-693adb68119e_1588x698.png 848w, https://substackcdn.com/image/fetch/$s_!OJrC!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc0e8d270-eace-4beb-bbe8-693adb68119e_1588x698.png 1272w, https://substackcdn.com/image/fetch/$s_!OJrC!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc0e8d270-eace-4beb-bbe8-693adb68119e_1588x698.png 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!OJrC!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc0e8d270-eace-4beb-bbe8-693adb68119e_1588x698.png" width="1456" height="640" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/c0e8d270-eace-4beb-bbe8-693adb68119e_1588x698.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:640,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:248014,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://vutr.substack.com/i/163813438?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc0e8d270-eace-4beb-bbe8-693adb68119e_1588x698.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!OJrC!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc0e8d270-eace-4beb-bbe8-693adb68119e_1588x698.png 424w, https://substackcdn.com/image/fetch/$s_!OJrC!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc0e8d270-eace-4beb-bbe8-693adb68119e_1588x698.png 848w, https://substackcdn.com/image/fetch/$s_!OJrC!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc0e8d270-eace-4beb-bbe8-693adb68119e_1588x698.png 1272w, https://substackcdn.com/image/fetch/$s_!OJrC!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc0e8d270-eace-4beb-bbe8-693adb68119e_1588x698.png 1456w" 
sizes="100vw" loading="lazy"></picture></div></a></figure></div><p><a href="https://flink.apache.org/">Apache Flink</a> is a framework and distributed processing engine for stateful computations over unbounded and bounded data streams. Spark, by contrast, treats bounded data as a first-class citizen and processes streaming data as micro-batches.
For Flink, everything is a stream; the batch is just a special case.</p><blockquote><p><em>If you want to learn more about Flink, check out <a href="https://open.substack.com/pub/vutr/p/apache-flink-overview?r=2rj6sg&amp;utm_campaign=post&amp;utm_medium=web&amp;showWelcomeOnShare=false">my article</a> to understand its architecture and how it can achieve fault-tolerance and provide stateful processing capability.</em></p></blockquote><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!WfEi!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0ba731f8-bbfe-4efc-9d0b-27e3847cf83f_514x278.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!WfEi!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0ba731f8-bbfe-4efc-9d0b-27e3847cf83f_514x278.png 424w, https://substackcdn.com/image/fetch/$s_!WfEi!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0ba731f8-bbfe-4efc-9d0b-27e3847cf83f_514x278.png 848w, https://substackcdn.com/image/fetch/$s_!WfEi!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0ba731f8-bbfe-4efc-9d0b-27e3847cf83f_514x278.png 1272w, https://substackcdn.com/image/fetch/$s_!WfEi!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0ba731f8-bbfe-4efc-9d0b-27e3847cf83f_514x278.png 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!WfEi!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0ba731f8-bbfe-4efc-9d0b-27e3847cf83f_514x278.png" width="514" height="278" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/0ba731f8-bbfe-4efc-9d0b-27e3847cf83f_514x278.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:278,&quot;width&quot;:514,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:50591,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://vutr.substack.com/i/163813438?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0ba731f8-bbfe-4efc-9d0b-27e3847cf83f_514x278.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!WfEi!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0ba731f8-bbfe-4efc-9d0b-27e3847cf83f_514x278.png 424w, https://substackcdn.com/image/fetch/$s_!WfEi!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0ba731f8-bbfe-4efc-9d0b-27e3847cf83f_514x278.png 848w, https://substackcdn.com/image/fetch/$s_!WfEi!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0ba731f8-bbfe-4efc-9d0b-27e3847cf83f_514x278.png 1272w, https://substackcdn.com/image/fetch/$s_!WfEi!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0ba731f8-bbfe-4efc-9d0b-27e3847cf83f_514x278.png 1456w" sizes="100vw" 
loading="lazy"></picture></div></a></figure></div><p>The Flink application consumed the data from Kafka and uploaded it to S3 in Parquet format. DoorDash then used <a href="https://docs.snowflake.com/en/user-guide/data-load-snowpipe-intro">Snowpipe</a> to copy the data from S3 to Snowflake.
Based on <a href="https://docs.snowflake.com/en/user-guide/data-load-snowpipe-auto-s3">notifications from Amazon SQS</a>, Snowpipe loaded data from S3 into Snowflake as soon as it became available, using the <a href="https://docs.snowflake.com/en/sql-reference/sql/copy-into-table">COPY statement</a>.</p><div><hr></div><h2>Challenges</h2><p>The Flink &#8594; S3 &#8594; Snowpipe &#8594; Snowflake pipeline had some challenges:</p><ul><li><p>Snowflake's cost increases as more users use the data platform. When first designing this solution, DoorDash only planned for hundreds of thousands of messages at peak, far fewer than the current peak workload (30 million messages).</p></li><li><p>The solution wrote the data twice: first when Flink wrote it to S3, and again when Snowpipe loaded it into Snowflake.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!g-oC!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0c73f114-0240-47cd-aaca-77bb27f8f360_366x292.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!g-oC!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0c73f114-0240-47cd-aaca-77bb27f8f360_366x292.png 424w, https://substackcdn.com/image/fetch/$s_!g-oC!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0c73f114-0240-47cd-aaca-77bb27f8f360_366x292.png 848w, https://substackcdn.com/image/fetch/$s_!g-oC!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0c73f114-0240-47cd-aaca-77bb27f8f360_366x292.png 1272w,
https://substackcdn.com/image/fetch/$s_!g-oC!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0c73f114-0240-47cd-aaca-77bb27f8f360_366x292.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!g-oC!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0c73f114-0240-47cd-aaca-77bb27f8f360_366x292.png" width="366" height="292" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/0c73f114-0240-47cd-aaca-77bb27f8f360_366x292.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:292,&quot;width&quot;:366,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:39359,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://vutr.substack.com/i/163813438?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0c73f114-0240-47cd-aaca-77bb27f8f360_366x292.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!g-oC!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0c73f114-0240-47cd-aaca-77bb27f8f360_366x292.png 424w, https://substackcdn.com/image/fetch/$s_!g-oC!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0c73f114-0240-47cd-aaca-77bb27f8f360_366x292.png 848w, https://substackcdn.com/image/fetch/$s_!g-oC!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0c73f114-0240-47cd-aaca-77bb27f8f360_366x292.png 1272w, 
https://substackcdn.com/image/fetch/$s_!g-oC!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0c73f114-0240-47cd-aaca-77bb27f8f360_366x292.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><p></p></li><li><p>It&#8217;s vendor lock-in (Snowflake)</p></li></ul><div><hr></div>
<h2>Solutions</h2><p>They chose Iceberg for the new real-time data sink. DoorDash also experimented with Delta Lake, but the table format didn&#8217;t meet their expectations in terms of operational and cost aspects.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!DRhq!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc457464c-692a-4b07-9062-c950e5e52fb5_272x264.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!DRhq!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc457464c-692a-4b07-9062-c950e5e52fb5_272x264.png 424w, https://substackcdn.com/image/fetch/$s_!DRhq!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc457464c-692a-4b07-9062-c950e5e52fb5_272x264.png 848w, https://substackcdn.com/image/fetch/$s_!DRhq!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc457464c-692a-4b07-9062-c950e5e52fb5_272x264.png 1272w,
https://substackcdn.com/image/fetch/$s_!DRhq!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc457464c-692a-4b07-9062-c950e5e52fb5_272x264.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!DRhq!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc457464c-692a-4b07-9062-c950e5e52fb5_272x264.png" width="272" height="264" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/c457464c-692a-4b07-9062-c950e5e52fb5_272x264.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:264,&quot;width&quot;:272,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:25900,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://vutr.substack.com/i/163813438?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc457464c-692a-4b07-9062-c950e5e52fb5_272x264.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!DRhq!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc457464c-692a-4b07-9062-c950e5e52fb5_272x264.png 424w, https://substackcdn.com/image/fetch/$s_!DRhq!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc457464c-692a-4b07-9062-c950e5e52fb5_272x264.png 848w, https://substackcdn.com/image/fetch/$s_!DRhq!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc457464c-692a-4b07-9062-c950e5e52fb5_272x264.png 1272w, 
https://substackcdn.com/image/fetch/$s_!DRhq!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc457464c-692a-4b07-9062-c950e5e52fb5_272x264.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><p>From their perspective, Iceberg can help because:</p><ul><li><p>The open table format has more mature support for Flink. In contrast, Delta Lake is more Spark-centric.</p></li><li><p>It offers flexible schema and partition evolution.</p></li><li><p>Iceberg has a very active community</p></li><li><p>It supports concurrent table writes.
From what I know, this feature is not exclusive to Iceberg, as the other major table formats also support concurrent writes with <a href="https://en.wikipedia.org/wiki/Optimistic_concurrency_control">optimistic concurrency control</a>.</p></li></ul><div><hr></div><h2>Architecture</h2><p>With the introduction of Iceberg, the DoorDash real-time processing platform remains the same, except for the S3 &#8594; Snowpipe &#8594; Snowflake pipeline. Flink still sinks data to S3 in Parquet format, but these files are now &#8220;wrapped&#8221; with the Iceberg metadata layer.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!X4aH!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8d1c62a6-1cdb-4123-8028-1cf77f12872c_1360x600.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!X4aH!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8d1c62a6-1cdb-4123-8028-1cf77f12872c_1360x600.png 424w, https://substackcdn.com/image/fetch/$s_!X4aH!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8d1c62a6-1cdb-4123-8028-1cf77f12872c_1360x600.png 848w, https://substackcdn.com/image/fetch/$s_!X4aH!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8d1c62a6-1cdb-4123-8028-1cf77f12872c_1360x600.png 1272w, https://substackcdn.com/image/fetch/$s_!X4aH!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8d1c62a6-1cdb-4123-8028-1cf77f12872c_1360x600.png 1456w" sizes="100vw"><img
src="https://substackcdn.com/image/fetch/$s_!X4aH!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8d1c62a6-1cdb-4123-8028-1cf77f12872c_1360x600.png" width="1360" height="600" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/8d1c62a6-1cdb-4123-8028-1cf77f12872c_1360x600.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:600,&quot;width&quot;:1360,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:194885,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://vutr.substack.com/i/163813438?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8d1c62a6-1cdb-4123-8028-1cf77f12872c_1360x600.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!X4aH!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8d1c62a6-1cdb-4123-8028-1cf77f12872c_1360x600.png 424w, https://substackcdn.com/image/fetch/$s_!X4aH!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8d1c62a6-1cdb-4123-8028-1cf77f12872c_1360x600.png 848w, https://substackcdn.com/image/fetch/$s_!X4aH!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8d1c62a6-1cdb-4123-8028-1cf77f12872c_1360x600.png 1272w, https://substackcdn.com/image/fetch/$s_!X4aH!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8d1c62a6-1cdb-4123-8028-1cf77f12872c_1360x600.png 1456w" 
sizes="100vw" loading="lazy"></picture></div></a></figure></div><p>The pipeline that writes data to Snowflake is no longer necessary, as Snowflake users can query Iceberg data directly on S3. This enables data consumers to continue using Snowflake to interact with the data without major changes or interruptions. In addition, DoorDash spins up Trino clusters to query this data with the help of the AWS Glue catalog.</p><p>To implement the new solution, DoorDash needs to adjust the Flink jobs.</p><p>A typical Flink application comprises three parts: the source, the transformation, and the sink.
For the new approach, DoorDash only needs to swap the application&#8217;s sink for one that writes data to S3 in Iceberg format.</p><p>An out-of-the-box Iceberg sink connector is available for Flink, so DoorDash only needs to make minor code changes to make things work.</p><div><hr></div><h2>Challenges when adopting Iceberg</h2><h3>Schema Evolution</h3><p>Although the Iceberg specification supports schema evolution, the Flink-Iceberg connector does not support it and requires the table schema to be static. If the schema changes, they have to stop the Flink job, adjust the logic, and restart it.</p><p>However, given all the benefits that Iceberg could bring (more on this later), DoorDash does not consider this a big deal.</p><h3>Query Performance</h3><p>Some users reported that their queries were very slow compared to the original solution. In these use cases, users typically query very large nested fields with hundreds of key-value pairs. Snowflake handled this well with its VARIANT type.</p><p>To address this, DoorDash flattens these fields in Iceberg and allocates more resources for the query workload.</p><div><hr></div><h2>Benefits</h2><blockquote><p>So, is it worth it?</p></blockquote><h3>Cost savings</h3><p>With Iceberg, DoorDash observed a 25-49% reduction in storage costs compared to native Snowflake storage, using only the default compression scheme (zstd).</p><p>The cost savings also come from eliminating the duplicate writes of the original solution, which wrote data first to S3 and later loaded it into Snowflake's native storage.</p><p>The resources previously used for Snowpipe are now allocated to Iceberg maintenance operations, such as table compaction.</p><h3>Reliability and availability</h3><p>The support for concurrent writes enables DoorDash to develop multiple pipelines for a single Iceberg table.
This allows them to write data from different sources or run different workloads against the same table at the same time, such as a standard data sink pipeline alongside a backfill pipeline.</p><p>DoorDash also enjoys Iceberg&#8217;s native support for time travel. Although Snowflake also supports this feature, <a href="https://docs.snowflake.com/en/user-guide/data-time-travel#data-retention-period">users must pay more to achieve higher data retention</a>. With Iceberg, DoorDash can achieve time travel capabilities with more control over data retention.</p><p>The Iceberg adoption aligns with their data-lake approach, which limits the dependency on any single vendor and thus gives them more flexibility. For example, they can now use other engines such as Trino to query the data.</p><h3>Hidden Partitioning</h3><p>Generally, partitioning a table by a transformation of a column requires materializing the transformed value as an extra column (e.g., partitioning by day requires deriving the day from the timestamp column and storing it in a new column). Users must filter on this exact column to benefit from partition pruning.</p><p>For example, if a table is partitioned by day, every record must have an extra <code>partition_day</code> column derived from the <code>created_timestamp</code> column. When users query the table, they must filter on the exact <code>partition_day</code> column so the query engine can prune unwanted partitions.
If the user isn&#8217;t aware of this and uses the <code>created_timestamp</code> column instead, the query engine will scan the whole table.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!PPNS!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffba22e19-185a-4371-87d6-31e56d637b42_1360x1042.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!PPNS!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffba22e19-185a-4371-87d6-31e56d637b42_1360x1042.png 424w, https://substackcdn.com/image/fetch/$s_!PPNS!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffba22e19-185a-4371-87d6-31e56d637b42_1360x1042.png 848w, https://substackcdn.com/image/fetch/$s_!PPNS!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffba22e19-185a-4371-87d6-31e56d637b42_1360x1042.png 1272w, https://substackcdn.com/image/fetch/$s_!PPNS!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffba22e19-185a-4371-87d6-31e56d637b42_1360x1042.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!PPNS!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffba22e19-185a-4371-87d6-31e56d637b42_1360x1042.png" width="1360" height="1042" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/fba22e19-185a-4371-87d6-31e56d637b42_1360x1042.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1042,&quot;width&quot;:1360,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!PPNS!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffba22e19-185a-4371-87d6-31e56d637b42_1360x1042.png 424w, https://substackcdn.com/image/fetch/$s_!PPNS!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffba22e19-185a-4371-87d6-31e56d637b42_1360x1042.png 848w, https://substackcdn.com/image/fetch/$s_!PPNS!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffba22e19-185a-4371-87d6-31e56d637b42_1360x1042.png 1272w, https://substackcdn.com/image/fetch/$s_!PPNS!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffba22e19-185a-4371-87d6-31e56d637b42_1360x1042.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" 
xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>This is where Iceberg&#8217;s hidden partitioning shines:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!NAED!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcffe3679-7b29-408f-9648-6ed4b4a51b18_1360x782.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!NAED!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcffe3679-7b29-408f-9648-6ed4b4a51b18_1360x782.png 424w, https://substackcdn.com/image/fetch/$s_!NAED!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcffe3679-7b29-408f-9648-6ed4b4a51b18_1360x782.png 848w, 
https://substackcdn.com/image/fetch/$s_!NAED!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcffe3679-7b29-408f-9648-6ed4b4a51b18_1360x782.png 1272w, https://substackcdn.com/image/fetch/$s_!NAED!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcffe3679-7b29-408f-9648-6ed4b4a51b18_1360x782.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!NAED!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcffe3679-7b29-408f-9648-6ed4b4a51b18_1360x782.png" width="1360" height="782" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/cffe3679-7b29-408f-9648-6ed4b4a51b18_1360x782.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:782,&quot;width&quot;:1360,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!NAED!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcffe3679-7b29-408f-9648-6ed4b4a51b18_1360x782.png 424w, https://substackcdn.com/image/fetch/$s_!NAED!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcffe3679-7b29-408f-9648-6ed4b4a51b18_1360x782.png 848w, 
https://substackcdn.com/image/fetch/$s_!NAED!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcffe3679-7b29-408f-9648-6ed4b4a51b18_1360x782.png 1272w, https://substackcdn.com/image/fetch/$s_!NAED!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcffe3679-7b29-408f-9648-6ed4b4a51b18_1360x782.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><ul><li><p>Instead of creating additional columns to partition based on transform values, Iceberg only records the transformation used on 
the column.</p></li><li><p>Thus, Iceberg saves storage costs because it doesn&#8217;t need to store extra columns.</p></li></ul><p>Another challenge with traditional partitioning is that it relies on the physical layout of files in subdirectories; changing how the table is partitioned requires rewriting the whole table.</p><p>Apache Iceberg solves this problem by storing all the historical partition schemes. If the table is first partitioned by scheme A and later by scheme B, Iceberg exposes this information so the query engine can create two separate execution plans, each evaluating the filter against one partition scheme.</p><p>Consider a table initially partitioned by the <code>created_timestamp</code> field at a monthly granularity: the transformation <code>month(created_timestamp)</code> is recorded as the first partitioning scheme. Later, the user updates the table to be partitioned by <code>created_timestamp</code> at a daily granularity, and the transformation <code>day(created_timestamp)</code> is recorded as the second partitioning scheme.</p><p>Data is organized according to the partition scheme in place at the time of writing.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!k8Uk!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7d022718-2fab-4b3b-9d71-622c25f639a2_2000x1600.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!k8Uk!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7d022718-2fab-4b3b-9d71-622c25f639a2_2000x1600.png 424w, 
https://substackcdn.com/image/fetch/$s_!k8Uk!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7d022718-2fab-4b3b-9d71-622c25f639a2_2000x1600.png 848w, https://substackcdn.com/image/fetch/$s_!k8Uk!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7d022718-2fab-4b3b-9d71-622c25f639a2_2000x1600.png 1272w, https://substackcdn.com/image/fetch/$s_!k8Uk!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7d022718-2fab-4b3b-9d71-622c25f639a2_2000x1600.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!k8Uk!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7d022718-2fab-4b3b-9d71-622c25f639a2_2000x1600.png" width="1456" height="1165" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/7d022718-2fab-4b3b-9d71-622c25f639a2_2000x1600.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1165,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!k8Uk!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7d022718-2fab-4b3b-9d71-622c25f639a2_2000x1600.png 424w, 
https://substackcdn.com/image/fetch/$s_!k8Uk!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7d022718-2fab-4b3b-9d71-622c25f639a2_2000x1600.png 848w, https://substackcdn.com/image/fetch/$s_!k8Uk!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7d022718-2fab-4b3b-9d71-622c25f639a2_2000x1600.png 1272w, https://substackcdn.com/image/fetch/$s_!k8Uk!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7d022718-2fab-4b3b-9d71-622c25f639a2_2000x1600.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" 
y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>When the application queries this table using <code>created_timestamp</code>, the query engine applies both the first and second transformations to <code>created_timestamp</code> to enable partition pruning. </p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!Q0zd!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa2d3932b-95cc-44e0-b54b-aca367a72cb2_1388x888.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!Q0zd!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa2d3932b-95cc-44e0-b54b-aca367a72cb2_1388x888.png 424w, https://substackcdn.com/image/fetch/$s_!Q0zd!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa2d3932b-95cc-44e0-b54b-aca367a72cb2_1388x888.png 848w, https://substackcdn.com/image/fetch/$s_!Q0zd!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa2d3932b-95cc-44e0-b54b-aca367a72cb2_1388x888.png 1272w, https://substackcdn.com/image/fetch/$s_!Q0zd!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa2d3932b-95cc-44e0-b54b-aca367a72cb2_1388x888.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!Q0zd!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa2d3932b-95cc-44e0-b54b-aca367a72cb2_1388x888.png" width="1388" height="888" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/a2d3932b-95cc-44e0-b54b-aca367a72cb2_1388x888.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:888,&quot;width&quot;:1388,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:205324,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://vutr.substack.com/i/163813438?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa2d3932b-95cc-44e0-b54b-aca367a72cb2_1388x888.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!Q0zd!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa2d3932b-95cc-44e0-b54b-aca367a72cb2_1388x888.png 424w, https://substackcdn.com/image/fetch/$s_!Q0zd!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa2d3932b-95cc-44e0-b54b-aca367a72cb2_1388x888.png 848w, https://substackcdn.com/image/fetch/$s_!Q0zd!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa2d3932b-95cc-44e0-b54b-aca367a72cb2_1388x888.png 1272w, https://substackcdn.com/image/fetch/$s_!Q0zd!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa2d3932b-95cc-44e0-b54b-aca367a72cb2_1388x888.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" 
viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>By leveraging Iceberg&#8217;s hidden partitioning, DoorDash spares users from having to know precisely which technical columns are used for partitioning.</p><div><hr></div><h2>My thought</h2><p>One piece of advice I remember most from the book <a href="https://www.oreilly.com/library/view/fundamentals-of-data/9781098108298/">Fundamentals of Data Engineering</a> is to choose common components wisely. </p><p>According to the authors, data engineers should select common components that facilitate team collaboration and break down silos. Examples include S3 for object storage, GitHub for version control, Airflow for orchestration, or Spark for processing.</p><p>These components act like a toolkit for solving problems and remove the need to reinvent the wheel. 
For lakehouse-specific problems, Iceberg is a strong candidate for your organization&#8217;s common component. It works well with many systems. If you open the documentation of any cloud data warehouse or data processing engine, there is a very high chance that you will see Iceberg support at some level. </p><p>This gives you more flexibility and makes your decisions more reversible. If you no longer like Snowflake and want to try BigQuery, or you want to return to a self-managed, open-source engine like Trino, Iceberg can help you with that.</p><p>This does not mean Iceberg is the go-to choice for every data project. Every technical decision has trade-offs, and data practitioners should evaluate and decide based on the organization&#8217;s needs rather than follow trending tools.</p><p>Compared to using managed cloud data warehouses like BigQuery or Snowflake with their native storage offerings, adopting Iceberg requires more effort to understand how the table format works behind the scenes. </p><p>I think DoorDash made a very good choice by storing data in S3 in the first place, rather than loading it directly into Snowflake. This might have come from the intention of having total control over their data, but over time, this choice has brought them many benefits. The most obvious one we see in this article is that it made onboarding Iceberg onto the platform easier.</p><p>Another observation is the advantage of Iceberg &#8220;working well with many systems&#8221;, which likely helped DoorDash run the Flink-Iceberg integration with only a few problems that were easy to debug and fix. In their talk, DoorDash mentions more than once that they had trouble getting Flink to work with Delta Lake. 
</p><div><hr></div><h2>Outro</h2><p>In this article, we explored DoorDash&#8217;s motivation for adopting Iceberg in its real-time processing platform, along with the architecture, challenges, and benefits of the new approach. Finally, I shared some thoughts on the trend of adopting Iceberg.</p><p>Thank you for reading this far. See you in my next article.</p><div><hr></div><h2>Reference</h2><p>[1] Tristan Culp, Gaurav Sharma, <a href="https://www.youtube.com/watch?v=_nnNHC90nMI&amp;t=541s">Iceberg with Flink at DoorDash</a> (2025)</p>]]></content:encoded></item><item><title><![CDATA[If you're learning Kafka, this article is for you]]></title><description><![CDATA[A baseline for your Kafka learning and research.]]></description><link>https://vutr.substack.com/p/if-youre-learning-kafka-this-article</link><guid isPermaLink="false">https://vutr.substack.com/p/if-youre-learning-kafka-this-article</guid><dc:creator><![CDATA[Vu Trinh]]></dc:creator><pubDate>Thu, 15 May 2025 03:15:16 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!f5Cu!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1f29d7a5-fd96-4bea-a7e7-33edab697c38_2000x1429.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<blockquote><p>To celebrate Lunar New Year (the true New Year holiday in Vietnam), I&#8217;m offering <em><strong>50% off the annual subscription</strong></em>. 
The offer ends soon; grab it now to get full access to nearly 200 high-quality data engineering articles.</p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://vutr.substack.com/subscribe&quot;,&quot;text&quot;:&quot;50% off the annual subscription&quot;,&quot;action&quot;:null,&quot;class&quot;:&quot;button-wrapper&quot;}" data-component-name="ButtonCreateButton"><a class="button primary button-wrapper" href="https://vutr.substack.com/subscribe"><span>50% off the annual subscription</span></a></p></blockquote><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!f5Cu!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1f29d7a5-fd96-4bea-a7e7-33edab697c38_2000x1429.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!f5Cu!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1f29d7a5-fd96-4bea-a7e7-33edab697c38_2000x1429.png 424w, https://substackcdn.com/image/fetch/$s_!f5Cu!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1f29d7a5-fd96-4bea-a7e7-33edab697c38_2000x1429.png 848w, https://substackcdn.com/image/fetch/$s_!f5Cu!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1f29d7a5-fd96-4bea-a7e7-33edab697c38_2000x1429.png 1272w, https://substackcdn.com/image/fetch/$s_!f5Cu!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1f29d7a5-fd96-4bea-a7e7-33edab697c38_2000x1429.png 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!f5Cu!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1f29d7a5-fd96-4bea-a7e7-33edab697c38_2000x1429.png" width="1456" height="1040" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/1f29d7a5-fd96-4bea-a7e7-33edab697c38_2000x1429.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1040,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:218508,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://vutr.substack.com/i/162735392?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1f29d7a5-fd96-4bea-a7e7-33edab697c38_2000x1429.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!f5Cu!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1f29d7a5-fd96-4bea-a7e7-33edab697c38_2000x1429.png 424w, https://substackcdn.com/image/fetch/$s_!f5Cu!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1f29d7a5-fd96-4bea-a7e7-33edab697c38_2000x1429.png 848w, https://substackcdn.com/image/fetch/$s_!f5Cu!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1f29d7a5-fd96-4bea-a7e7-33edab697c38_2000x1429.png 1272w, https://substackcdn.com/image/fetch/$s_!f5Cu!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1f29d7a5-fd96-4bea-a7e7-33edab697c38_2000x1429.png 1456w" 
sizes="100vw" fetchpriority="high"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><div><hr></div><h2>Intro</h2><p>Fourteen years ago, LinkedIn built <a href="https://kafka.apache.org/">Kafka</a> to handle its log processing demands.</p><p>The system combines the benefits of traditional log aggregators and publish/subscribe messaging systems. Kafka is designed to offer high throughput and scalability. It provides an API similar to a messaging system and allows applications to consume real-time log events.</p><p>Now, you see Kafka everywhere. Over the years, Kafka has continued to evolve with many changes and updates. 
Examples include <a href="https://cwiki.apache.org/confluence/display/KAFKA/KIP-405%3A+Kafka+Tiered+Storage">tiered storage</a>, <a href="https://developer.confluent.io/learn/kraft/">KRaft</a>, and <a href="https://cwiki.apache.org/confluence/display/KAFKA/KIP-932%3A+Queues+for+Kafka">queues</a>. </p><p>But the core has remained the same since day one. This article summarizes my learning and research on Kafka, hoping it will help you feel less overwhelmed when entering the Kafka world.</p><div><hr></div><h2>Overview</h2><h3>Messages</h3><p>Kafka&#8217;s unit of data is called a message. A message can have an optional piece of metadata called the <em>key</em>. Both the message and the key are just arrays of bytes. The key gives users more control over partitioning: with the default partitioner, Kafka hashes the key to choose a partition, so messages with the same key land on the same partition as long as the number of partitions does not change.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!TRk0!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F335d0ea2-1f02-4140-9d29-ff5cdf6dfacb_546x424.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!TRk0!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F335d0ea2-1f02-4140-9d29-ff5cdf6dfacb_546x424.png 424w, https://substackcdn.com/image/fetch/$s_!TRk0!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F335d0ea2-1f02-4140-9d29-ff5cdf6dfacb_546x424.png 848w, 
https://substackcdn.com/image/fetch/$s_!TRk0!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F335d0ea2-1f02-4140-9d29-ff5cdf6dfacb_546x424.png 1272w, https://substackcdn.com/image/fetch/$s_!TRk0!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F335d0ea2-1f02-4140-9d29-ff5cdf6dfacb_546x424.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!TRk0!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F335d0ea2-1f02-4140-9d29-ff5cdf6dfacb_546x424.png" width="546" height="424" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/335d0ea2-1f02-4140-9d29-ff5cdf6dfacb_546x424.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:424,&quot;width&quot;:546,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:148011,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://vutr.substack.com/i/162735392?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F335d0ea2-1f02-4140-9d29-ff5cdf6dfacb_546x424.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!TRk0!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F335d0ea2-1f02-4140-9d29-ff5cdf6dfacb_546x424.png 424w, https://substackcdn.com/image/fetch/$s_!TRk0!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F335d0ea2-1f02-4140-9d29-ff5cdf6dfacb_546x424.png 848w, 
https://substackcdn.com/image/fetch/$s_!TRk0!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F335d0ea2-1f02-4140-9d29-ff5cdf6dfacb_546x424.png 1272w, https://substackcdn.com/image/fetch/$s_!TRk0!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F335d0ea2-1f02-4140-9d29-ff5cdf6dfacb_546x424.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><p>A message stored in Kafka doesn&#8217;t have an explicit message ID. Instead, each message is addressed by its logical offset. 
This avoids the overhead of maintaining index structures that map message IDs to the actual message locations. In Kafka&#8217;s original design, the offset was a byte position in the log, so a consumer computed the offset of the following message by adding the length of the current message to its offset; current Kafka versions instead assign sequential integer offsets to the records of a partition as they are written.</p><h3><strong>Topics and Partitions</strong></h3><p>Messages in Kafka are organized into topics. A topic can be split into multiple <em>partitions</em>. Partitions are how Kafka offers redundancy and scalability. Each partition can be hosted on a different server, meaning a topic can be scaled horizontally across multiple servers.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!IvmA!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1334651a-5ac0-4a68-af95-95149808a8ac_436x408.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!IvmA!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1334651a-5ac0-4a68-af95-95149808a8ac_436x408.png 424w, https://substackcdn.com/image/fetch/$s_!IvmA!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1334651a-5ac0-4a68-af95-95149808a8ac_436x408.png 848w, https://substackcdn.com/image/fetch/$s_!IvmA!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1334651a-5ac0-4a68-af95-95149808a8ac_436x408.png 1272w, https://substackcdn.com/image/fetch/$s_!IvmA!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1334651a-5ac0-4a68-af95-95149808a8ac_436x408.png 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!IvmA!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1334651a-5ac0-4a68-af95-95149808a8ac_436x408.png" width="436" height="408" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/1334651a-5ac0-4a68-af95-95149808a8ac_436x408.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:408,&quot;width&quot;:436,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:91088,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://vutr.substack.com/i/162735392?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1334651a-5ac0-4a68-af95-95149808a8ac_436x408.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!IvmA!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1334651a-5ac0-4a68-af95-95149808a8ac_436x408.png 424w, https://substackcdn.com/image/fetch/$s_!IvmA!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1334651a-5ac0-4a68-af95-95149808a8ac_436x408.png 848w, https://substackcdn.com/image/fetch/$s_!IvmA!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1334651a-5ac0-4a68-af95-95149808a8ac_436x408.png 1272w, https://substackcdn.com/image/fetch/$s_!IvmA!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1334651a-5ac0-4a68-af95-95149808a8ac_436x408.png 1456w" sizes="100vw" 
loading="lazy"></picture></div></a></figure></div><p>Each partition of a topic corresponds to a logical log. Physically, a log is implemented as a set of segment files of approximately the same size (e.g., 1 GB). Whenever a message is written to the partition, the broker appends that message to the active segment file.</p><div><hr></div><h2>Designs</h2><h3>Kafka uses the filesystem</h3><p>Kafka lets the OS filesystem handle the storage layer. 
It leverages the kernel page cache mechanism to simplify the design.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!hgKX!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F853d9414-780f-4cb3-9b88-a75c4fa33fda_524x544.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!hgKX!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F853d9414-780f-4cb3-9b88-a75c4fa33fda_524x544.png 424w, https://substackcdn.com/image/fetch/$s_!hgKX!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F853d9414-780f-4cb3-9b88-a75c4fa33fda_524x544.png 848w, https://substackcdn.com/image/fetch/$s_!hgKX!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F853d9414-780f-4cb3-9b88-a75c4fa33fda_524x544.png 1272w, https://substackcdn.com/image/fetch/$s_!hgKX!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F853d9414-780f-4cb3-9b88-a75c4fa33fda_524x544.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!hgKX!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F853d9414-780f-4cb3-9b88-a75c4fa33fda_524x544.png" width="524" height="544" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/853d9414-780f-4cb3-9b88-a75c4fa33fda_524x544.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:544,&quot;width&quot;:524,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:102729,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://vutr.substack.com/i/162735392?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F853d9414-780f-4cb3-9b88-a75c4fa33fda_524x544.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!hgKX!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F853d9414-780f-4cb3-9b88-a75c4fa33fda_524x544.png 424w, https://substackcdn.com/image/fetch/$s_!hgKX!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F853d9414-780f-4cb3-9b88-a75c4fa33fda_524x544.png 848w, https://substackcdn.com/image/fetch/$s_!hgKX!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F853d9414-780f-4cb3-9b88-a75c4fa33fda_524x544.png 1272w, https://substackcdn.com/image/fetch/$s_!hgKX!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F853d9414-780f-4cb3-9b88-a75c4fa33fda_524x544.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><p>Modern operating systems usually borrow unused portions of memory (RAM) for the page cache. This cache holds frequently used disk data so the system does not have to touch the disk as often, which makes it much faster by avoiding the latency of disk seeks. If an application needs that memory, the kernel reclaims the portions used for the page cache, so the cache does not hurt the system&#8217;s performance.</p><p>Rather than implementing its own caching mechanism, Kafka relies on the OS: all data is written to the page cache first and flushed to disk later. 
This approach also benefits Kafka because it runs on the Java Virtual Machine, which has some pain points:</p><ul><li><p>The <a href="https://www.javamex.com/tutorials/memory/object_memory_usage.shtml#google_vignette">high memory overhead</a> of stored objects.</p></li><li><p>Garbage collection becomes slower as the number of in-heap objects grows.</p></li></ul><h3>Sequential access pattern</h3><p>&#8220;Because the disk is always slower than RAM, is that going to affect Kafka&#8217;s performance?&#8221; you might wonder.</p><p>The key here is the access pattern. There is no doubt that with random access, the disk is far slower than RAM, but sequential disk access can be comparable to, and in some cases even faster than, random memory access. Let&#8217;s take a look at these patterns:</p><ul><li><p>Random access is a method of retrieving or storing data in which the data can be accessed in any order.</p></li><li><p>Sequential access is a method of retrieving or storing data in which the data is accessed in a fixed, sequential order.</p></li></ul><p>Kafka is designed so that both writing (producers writing data) and reading (consumers consuming data) happen sequentially.</p><ul><li><p><strong>Write:</strong> As mentioned, Kafka manages messages as segment files internally. The broker <em><strong>appends</strong></em> each message to the active (last) segment. Appending at the end of the segment file ensures that writes in Kafka happen sequentially.</p></li><li><p><strong>Read:</strong> The consumer always consumes messages from a specific partition sequentially, with the help of two index files. The first index maps offsets to segment files and positions within the file, allowing brokers to find the message for a given offset quickly. 
The second maps timestamps to message offsets; this index is used when searching for messages by timestamp.</p></li></ul><h3>Zero-copy</h3><p>Using the filesystem also helps Kafka leverage the zero-copy optimization behind the scenes. Zero-copy doesn&#8217;t mean the data is never copied; it means no unnecessary copies are made. The optimization was not invented for Kafka; Kafka simply uses a technique the OS already provides.</p><p>Without zero-copy, when a process reads a file from disk and sends it over the network, the data is usually copied four times, with four <a href="https://www.geeksforgeeks.org/user-mode-and-kernel-mode-switching/">context switches</a> between user and kernel modes. The flow has the following steps:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!Jxp0!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0bcc47af-e585-4708-bacc-71654568fd19_1248x670.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!Jxp0!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0bcc47af-e585-4708-bacc-71654568fd19_1248x670.png 424w, https://substackcdn.com/image/fetch/$s_!Jxp0!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0bcc47af-e585-4708-bacc-71654568fd19_1248x670.png 848w, https://substackcdn.com/image/fetch/$s_!Jxp0!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0bcc47af-e585-4708-bacc-71654568fd19_1248x670.png 1272w, 
https://substackcdn.com/image/fetch/$s_!Jxp0!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0bcc47af-e585-4708-bacc-71654568fd19_1248x670.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!Jxp0!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0bcc47af-e585-4708-bacc-71654568fd19_1248x670.png" width="1248" height="670" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/0bcc47af-e585-4708-bacc-71654568fd19_1248x670.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:670,&quot;width&quot;:1248,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:137972,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://vutr.substack.com/i/162735392?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0bcc47af-e585-4708-bacc-71654568fd19_1248x670.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!Jxp0!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0bcc47af-e585-4708-bacc-71654568fd19_1248x670.png 424w, https://substackcdn.com/image/fetch/$s_!Jxp0!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0bcc47af-e585-4708-bacc-71654568fd19_1248x670.png 848w, https://substackcdn.com/image/fetch/$s_!Jxp0!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0bcc47af-e585-4708-bacc-71654568fd19_1248x670.png 
1272w, https://substackcdn.com/image/fetch/$s_!Jxp0!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0bcc47af-e585-4708-bacc-71654568fd19_1248x670.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><ol><li><p>The file content is read from disk into the OS page cache. This step requires a context switch from user mode to kernel mode.</p></li><li><p>Data is copied from the page cache into the application buffer. 
This requires a context switch from kernel mode to user mode.</p></li><li><p>Data is then copied to the <a href="https://flylib.com/books/en/3.475.1.30/1/?utm_source=2minutestreaming.beehiiv.com&amp;utm_medium=referral&amp;utm_campaign=zero-copy-basics">socket buffer</a>. Once again, this requires a context switch from user mode to kernel mode.</p></li><li><p>Once the data is in the socket buffer, the context switches back to user mode. The data is then copied from the socket buffer to the <a href="https://en.wikipedia.org/wiki/Network_interface_controller">network interface controller</a> (NIC).</p></li><li><p>The NIC sends data to the destination.</p></li></ol><p>With the zero-copy optimization, the data is copied directly from the page cache to the socket buffer. On Linux and other Unix-like systems, this is done with the <a href="https://man7.org/linux/man-pages/man2/sendfile.2.html">sendfile()</a> system call. It copies data from one <a href="https://en.wikipedia.org/wiki/File_descriptor">file descriptor</a> to another inside the kernel, avoiding the round trip to and from user space that the <a href="https://man7.org/linux/man-pages/man2/read.2.html">read()</a> and <a href="https://man7.org/linux/man-pages/man2/write.2.html">write()</a> system calls would require. 
Thus, this optimization can help Kafka bypass steps two and three from the original transfer flow:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!Ge_9!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F938001f2-0255-4b6a-a831-c2a88033278f_836x464.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!Ge_9!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F938001f2-0255-4b6a-a831-c2a88033278f_836x464.png 424w, https://substackcdn.com/image/fetch/$s_!Ge_9!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F938001f2-0255-4b6a-a831-c2a88033278f_836x464.png 848w, https://substackcdn.com/image/fetch/$s_!Ge_9!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F938001f2-0255-4b6a-a831-c2a88033278f_836x464.png 1272w, https://substackcdn.com/image/fetch/$s_!Ge_9!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F938001f2-0255-4b6a-a831-c2a88033278f_836x464.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!Ge_9!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F938001f2-0255-4b6a-a831-c2a88033278f_836x464.png" width="836" height="464" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/938001f2-0255-4b6a-a831-c2a88033278f_836x464.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:464,&quot;width&quot;:836,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:80436,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://vutr.substack.com/i/162735392?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F938001f2-0255-4b6a-a831-c2a88033278f_836x464.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!Ge_9!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F938001f2-0255-4b6a-a831-c2a88033278f_836x464.png 424w, https://substackcdn.com/image/fetch/$s_!Ge_9!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F938001f2-0255-4b6a-a831-c2a88033278f_836x464.png 848w, https://substackcdn.com/image/fetch/$s_!Ge_9!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F938001f2-0255-4b6a-a831-c2a88033278f_836x464.png 1272w, https://substackcdn.com/image/fetch/$s_!Ge_9!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F938001f2-0255-4b6a-a831-c2a88033278f_836x464.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><ol><li><p>The data is copied from the disk to the page cache.</p></li><li><p>Then, the data is copied directly from the page cache to the network controller via the sendfile() call.</p></li><li><p>The NIC sends data to the destination (the consumer).</p></li></ol><p>As a result, the number of context switches drops from four to two, and the data never needs to be copied into the Kafka application. </p><h3>Batching</h3><p>To make client-broker requests more efficient, the Kafka protocol has a message set abstraction that groups messages together. This amortizes the network round-trip overhead that would otherwise come from sending many single-message requests.</p><p>Batching also helps the broker write messages more efficiently; instead of appending messages one by one, the broker appends a chunk of messages at once. 
This allows Kafka to achieve larger sequential disk operations.</p><p>Moreover, Kafka supports the compression of batches of messages with an efficient batching format in case the network bandwidth is the bottleneck. A batch of messages can be grouped, compressed, and sent to the broker.</p><div class="pullquote"><p>This article is sponsored by <strong>Aiven</strong>. Their proposal, <a href="https://fnf.dev/43o0CWY">Apache Kafka&#174; KIP-1150: Diskless Topics</a>, is poised to be a game changer, aiming to reduce Kafka infrastructure costs by up to 80% through offloading disk replication to object storage. <a href="https://fnf.dev/43o0CWY">Learn more about the proposal here</a> and leave your feedback to help shape the future of Kafka.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!Qvmj!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7c067649-c433-4884-bab2-0d4a266c3f0e_1368x707.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!Qvmj!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7c067649-c433-4884-bab2-0d4a266c3f0e_1368x707.png 424w, https://substackcdn.com/image/fetch/$s_!Qvmj!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7c067649-c433-4884-bab2-0d4a266c3f0e_1368x707.png 848w, https://substackcdn.com/image/fetch/$s_!Qvmj!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7c067649-c433-4884-bab2-0d4a266c3f0e_1368x707.png 1272w, 
https://substackcdn.com/image/fetch/$s_!Qvmj!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7c067649-c433-4884-bab2-0d4a266c3f0e_1368x707.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!Qvmj!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7c067649-c433-4884-bab2-0d4a266c3f0e_1368x707.png" width="1368" height="707" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/7c067649-c433-4884-bab2-0d4a266c3f0e_1368x707.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:707,&quot;width&quot;:1368,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:164872,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://vutr.substack.com/i/162735392?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7c067649-c433-4884-bab2-0d4a266c3f0e_1368x707.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!Qvmj!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7c067649-c433-4884-bab2-0d4a266c3f0e_1368x707.png 424w, https://substackcdn.com/image/fetch/$s_!Qvmj!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7c067649-c433-4884-bab2-0d4a266c3f0e_1368x707.png 848w, https://substackcdn.com/image/fetch/$s_!Qvmj!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7c067649-c433-4884-bab2-0d4a266c3f0e_1368x707.png 
1272w, https://substackcdn.com/image/fetch/$s_!Qvmj!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7c067649-c433-4884-bab2-0d4a266c3f0e_1368x707.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div></div><div><hr></div><h2>Producer</h2><h3>The flow</h3><p>When you use the Kafka producer API, a few things happen:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" 
href="https://substackcdn.com/image/fetch/$s_!Ezx4!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F25d1ff9a-265a-4c83-8911-ea3e5ef1b9fa_1030x642.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!Ezx4!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F25d1ff9a-265a-4c83-8911-ea3e5ef1b9fa_1030x642.png 424w, https://substackcdn.com/image/fetch/$s_!Ezx4!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F25d1ff9a-265a-4c83-8911-ea3e5ef1b9fa_1030x642.png 848w, https://substackcdn.com/image/fetch/$s_!Ezx4!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F25d1ff9a-265a-4c83-8911-ea3e5ef1b9fa_1030x642.png 1272w, https://substackcdn.com/image/fetch/$s_!Ezx4!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F25d1ff9a-265a-4c83-8911-ea3e5ef1b9fa_1030x642.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!Ezx4!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F25d1ff9a-265a-4c83-8911-ea3e5ef1b9fa_1030x642.png" width="1030" height="642" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/25d1ff9a-265a-4c83-8911-ea3e5ef1b9fa_1030x642.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:642,&quot;width&quot;:1030,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:316997,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://vutr.substack.com/i/162735392?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F25d1ff9a-265a-4c83-8911-ea3e5ef1b9fa_1030x642.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!Ezx4!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F25d1ff9a-265a-4c83-8911-ea3e5ef1b9fa_1030x642.png 424w, https://substackcdn.com/image/fetch/$s_!Ezx4!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F25d1ff9a-265a-4c83-8911-ea3e5ef1b9fa_1030x642.png 848w, https://substackcdn.com/image/fetch/$s_!Ezx4!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F25d1ff9a-265a-4c83-8911-ea3e5ef1b9fa_1030x642.png 1272w, https://substackcdn.com/image/fetch/$s_!Ezx4!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F25d1ff9a-265a-4c83-8911-ea3e5ef1b9fa_1030x642.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" 
viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><ul><li><p>The application creates a ProducerRecord containing the message&#8217;s value and the destination topic. 
The ProducerRecord can also contain an optional key, partition, timestamp, and headers.</p></li><li><p>The producer serializes the ProducerRecord&#8217;s key and value objects into byte arrays so they can be sent over the network.</p></li><li><p>If no partition is specified, the record is routed to the partitioner, which chooses the message&#8217;s partition, usually based on the key.</p></li><li><p>Once the destination topic and partition are known, the producer adds the record to a batch of messages bound for the same topic and partition.</p></li><li><p>A separate thread sends these batches to the appropriate Kafka brokers.</p></li><li><p>When the broker receives the messages and writes them successfully, it returns a metadata object with the topic, partition, and record offset. If the write fails, it returns an error; in this case, the producer may retry a few times before giving up.</p></li></ul><h3>Sending method</h3><p>So, can we control how messages are sent? The answer is yes:</p><ul><li><p><strong>Fire-and-forget</strong>: The producer sends a message to the server and doesn&#8217;t check whether it arrives. 
In case of errors or timeout, messages will be lost, and the application won&#8217;t be notified.</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!BMXh!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fea4cb8ad-cc97-42e2-9143-c1c5937579b7_584x228.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!BMXh!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fea4cb8ad-cc97-42e2-9143-c1c5937579b7_584x228.png 424w, https://substackcdn.com/image/fetch/$s_!BMXh!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fea4cb8ad-cc97-42e2-9143-c1c5937579b7_584x228.png 848w, https://substackcdn.com/image/fetch/$s_!BMXh!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fea4cb8ad-cc97-42e2-9143-c1c5937579b7_584x228.png 1272w, https://substackcdn.com/image/fetch/$s_!BMXh!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fea4cb8ad-cc97-42e2-9143-c1c5937579b7_584x228.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!BMXh!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fea4cb8ad-cc97-42e2-9143-c1c5937579b7_584x228.png" width="584" height="228" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/ea4cb8ad-cc97-42e2-9143-c1c5937579b7_584x228.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:228,&quot;width&quot;:584,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:62019,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://vutr.substack.com/i/162735392?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fea4cb8ad-cc97-42e2-9143-c1c5937579b7_584x228.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!BMXh!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fea4cb8ad-cc97-42e2-9143-c1c5937579b7_584x228.png 424w, https://substackcdn.com/image/fetch/$s_!BMXh!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fea4cb8ad-cc97-42e2-9143-c1c5937579b7_584x228.png 848w, https://substackcdn.com/image/fetch/$s_!BMXh!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fea4cb8ad-cc97-42e2-9143-c1c5937579b7_584x228.png 1272w, https://substackcdn.com/image/fetch/$s_!BMXh!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fea4cb8ad-cc97-42e2-9143-c1c5937579b7_584x228.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a></figure></div></li></ul><ul><li><p><strong>Synchronously</strong>: Sending a message synchronously allows the producer to catch exceptions if Kafka returns an error or retries fail; the producer sends the message 
and waits for the response. This method is rare in production because it can impact the performance.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!DlgQ!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa81f3949-73dc-48b3-bdf4-470025f5ab64_564x248.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!DlgQ!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa81f3949-73dc-48b3-bdf4-470025f5ab64_564x248.png 424w, https://substackcdn.com/image/fetch/$s_!DlgQ!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa81f3949-73dc-48b3-bdf4-470025f5ab64_564x248.png 848w, https://substackcdn.com/image/fetch/$s_!DlgQ!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa81f3949-73dc-48b3-bdf4-470025f5ab64_564x248.png 1272w, https://substackcdn.com/image/fetch/$s_!DlgQ!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa81f3949-73dc-48b3-bdf4-470025f5ab64_564x248.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!DlgQ!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa81f3949-73dc-48b3-bdf4-470025f5ab64_564x248.png" width="564" height="248" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/a81f3949-73dc-48b3-bdf4-470025f5ab64_564x248.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:248,&quot;width&quot;:564,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:58142,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://vutr.substack.com/i/162735392?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa81f3949-73dc-48b3-bdf4-470025f5ab64_564x248.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!DlgQ!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa81f3949-73dc-48b3-bdf4-470025f5ab64_564x248.png 424w, https://substackcdn.com/image/fetch/$s_!DlgQ!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa81f3949-73dc-48b3-bdf4-470025f5ab64_564x248.png 848w, https://substackcdn.com/image/fetch/$s_!DlgQ!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa81f3949-73dc-48b3-bdf4-470025f5ab64_564x248.png 1272w, https://substackcdn.com/image/fetch/$s_!DlgQ!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa81f3949-73dc-48b3-bdf4-470025f5ab64_564x248.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 
0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div></li><li><p><strong>Asynchronously</strong>: The producers send all messages without waiting for replies. 
They support adding a callback to handle errors while executing an asynchronous send.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!6K95!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffcc5d51d-a419-4d8e-9430-f1599b8cabcc_580x286.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!6K95!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffcc5d51d-a419-4d8e-9430-f1599b8cabcc_580x286.png 424w, https://substackcdn.com/image/fetch/$s_!6K95!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffcc5d51d-a419-4d8e-9430-f1599b8cabcc_580x286.png 848w, https://substackcdn.com/image/fetch/$s_!6K95!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffcc5d51d-a419-4d8e-9430-f1599b8cabcc_580x286.png 1272w, https://substackcdn.com/image/fetch/$s_!6K95!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffcc5d51d-a419-4d8e-9430-f1599b8cabcc_580x286.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!6K95!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffcc5d51d-a419-4d8e-9430-f1599b8cabcc_580x286.png" width="580" height="286" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/fcc5d51d-a419-4d8e-9430-f1599b8cabcc_580x286.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:286,&quot;width&quot;:580,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:54349,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://vutr.substack.com/i/162735392?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffcc5d51d-a419-4d8e-9430-f1599b8cabcc_580x286.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!6K95!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffcc5d51d-a419-4d8e-9430-f1599b8cabcc_580x286.png 424w, https://substackcdn.com/image/fetch/$s_!6K95!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffcc5d51d-a419-4d8e-9430-f1599b8cabcc_580x286.png 848w, https://substackcdn.com/image/fetch/$s_!6K95!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffcc5d51d-a419-4d8e-9430-f1599b8cabcc_580x286.png 1272w, https://substackcdn.com/image/fetch/$s_!6K95!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffcc5d51d-a419-4d8e-9430-f1599b8cabcc_580x286.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 
0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div></li></ul><h3>Was the message delivered successfully?</h3><p>The producer exposes the <code>acks</code> parameter, which lets the user define when a message counts as successfully delivered. It controls how many partition replicas must receive the record before the producer considers the write successful:</p><ul><li><p><strong>acks=0:</strong> The producer doesn&#8217;t wait for a reply from the broker and assumes the message was sent successfully. This setting enables very high throughput. 
However, the risk of losing data is very high.</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!-JvT!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F64e2d8ea-da71-4238-a1d7-3be3d3abba46_578x240.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!-JvT!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F64e2d8ea-da71-4238-a1d7-3be3d3abba46_578x240.png 424w, https://substackcdn.com/image/fetch/$s_!-JvT!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F64e2d8ea-da71-4238-a1d7-3be3d3abba46_578x240.png 848w, https://substackcdn.com/image/fetch/$s_!-JvT!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F64e2d8ea-da71-4238-a1d7-3be3d3abba46_578x240.png 1272w, https://substackcdn.com/image/fetch/$s_!-JvT!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F64e2d8ea-da71-4238-a1d7-3be3d3abba46_578x240.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!-JvT!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F64e2d8ea-da71-4238-a1d7-3be3d3abba46_578x240.png" width="578" height="240" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/64e2d8ea-da71-4238-a1d7-3be3d3abba46_578x240.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:240,&quot;width&quot;:578,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:55518,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://vutr.substack.com/i/162735392?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F64e2d8ea-da71-4238-a1d7-3be3d3abba46_578x240.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!-JvT!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F64e2d8ea-da71-4238-a1d7-3be3d3abba46_578x240.png 424w, https://substackcdn.com/image/fetch/$s_!-JvT!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F64e2d8ea-da71-4238-a1d7-3be3d3abba46_578x240.png 848w, https://substackcdn.com/image/fetch/$s_!-JvT!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F64e2d8ea-da71-4238-a1d7-3be3d3abba46_578x240.png 1272w, https://substackcdn.com/image/fetch/$s_!-JvT!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F64e2d8ea-da71-4238-a1d7-3be3d3abba46_578x240.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a></figure></div></li><li><p><strong>acks=1:</strong> The producer receives a &#8220;yes&#8221; response once the leader replica gets the message. The message can still be lost if the leader crashes before the followers replicate it. 
</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!Xcot!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7a777c40-6b20-49b7-aeb0-36d6317d5076_648x474.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!Xcot!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7a777c40-6b20-49b7-aeb0-36d6317d5076_648x474.png 424w, https://substackcdn.com/image/fetch/$s_!Xcot!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7a777c40-6b20-49b7-aeb0-36d6317d5076_648x474.png 848w, https://substackcdn.com/image/fetch/$s_!Xcot!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7a777c40-6b20-49b7-aeb0-36d6317d5076_648x474.png 1272w, https://substackcdn.com/image/fetch/$s_!Xcot!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7a777c40-6b20-49b7-aeb0-36d6317d5076_648x474.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!Xcot!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7a777c40-6b20-49b7-aeb0-36d6317d5076_648x474.png" width="648" height="474" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/7a777c40-6b20-49b7-aeb0-36d6317d5076_648x474.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:474,&quot;width&quot;:648,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:110673,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://vutr.substack.com/i/162735392?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7a777c40-6b20-49b7-aeb0-36d6317d5076_648x474.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!Xcot!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7a777c40-6b20-49b7-aeb0-36d6317d5076_648x474.png 424w, https://substackcdn.com/image/fetch/$s_!Xcot!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7a777c40-6b20-49b7-aeb0-36d6317d5076_648x474.png 848w, https://substackcdn.com/image/fetch/$s_!Xcot!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7a777c40-6b20-49b7-aeb0-36d6317d5076_648x474.png 1272w, https://substackcdn.com/image/fetch/$s_!Xcot!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7a777c40-6b20-49b7-aeb0-36d6317d5076_648x474.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" 
viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div></li><li><p><strong>acks=all:</strong> The producer gets a &#8220;yes&#8221; response only after all in-sync replicas receive the message. This mode is the safest: the message survives even if a broker crashes. 
However, it increases latency.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!XxQH!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1d49348e-3ac3-4b3f-8566-572e7dd59e1e_542x402.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!XxQH!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1d49348e-3ac3-4b3f-8566-572e7dd59e1e_542x402.png 424w, https://substackcdn.com/image/fetch/$s_!XxQH!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1d49348e-3ac3-4b3f-8566-572e7dd59e1e_542x402.png 848w, https://substackcdn.com/image/fetch/$s_!XxQH!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1d49348e-3ac3-4b3f-8566-572e7dd59e1e_542x402.png 1272w, https://substackcdn.com/image/fetch/$s_!XxQH!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1d49348e-3ac3-4b3f-8566-572e7dd59e1e_542x402.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!XxQH!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1d49348e-3ac3-4b3f-8566-572e7dd59e1e_542x402.png" width="542" height="402" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/1d49348e-3ac3-4b3f-8566-572e7dd59e1e_542x402.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:402,&quot;width&quot;:542,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:90699,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://vutr.substack.com/i/162735392?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1d49348e-3ac3-4b3f-8566-572e7dd59e1e_542x402.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!XxQH!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1d49348e-3ac3-4b3f-8566-572e7dd59e1e_542x402.png 424w, https://substackcdn.com/image/fetch/$s_!XxQH!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1d49348e-3ac3-4b3f-8566-572e7dd59e1e_542x402.png 848w, https://substackcdn.com/image/fetch/$s_!XxQH!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1d49348e-3ac3-4b3f-8566-572e7dd59e1e_542x402.png 1272w, https://substackcdn.com/image/fetch/$s_!XxQH!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1d49348e-3ac3-4b3f-8566-572e7dd59e1e_542x402.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 
0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div></li></ul><h3>How do we distribute the message?</h3><p>Kafka messages can optionally have a key, which is null by default. The key is mainly used to decide the message&#8217;s destination partition. 
When the key is null, and no custom partitioner is defined, Kafka will use the following:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!o6uE!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6c90dedc-346e-45d5-bad3-092d682be57a_1058x968.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!o6uE!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6c90dedc-346e-45d5-bad3-092d682be57a_1058x968.png 424w, https://substackcdn.com/image/fetch/$s_!o6uE!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6c90dedc-346e-45d5-bad3-092d682be57a_1058x968.png 848w, https://substackcdn.com/image/fetch/$s_!o6uE!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6c90dedc-346e-45d5-bad3-092d682be57a_1058x968.png 1272w, https://substackcdn.com/image/fetch/$s_!o6uE!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6c90dedc-346e-45d5-bad3-092d682be57a_1058x968.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!o6uE!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6c90dedc-346e-45d5-bad3-092d682be57a_1058x968.png" width="1058" height="968" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/6c90dedc-346e-45d5-bad3-092d682be57a_1058x968.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:968,&quot;width&quot;:1058,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:195572,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://vutr.substack.com/i/162735392?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6c90dedc-346e-45d5-bad3-092d682be57a_1058x968.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!o6uE!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6c90dedc-346e-45d5-bad3-092d682be57a_1058x968.png 424w, https://substackcdn.com/image/fetch/$s_!o6uE!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6c90dedc-346e-45d5-bad3-092d682be57a_1058x968.png 848w, https://substackcdn.com/image/fetch/$s_!o6uE!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6c90dedc-346e-45d5-bad3-092d682be57a_1058x968.png 1272w, https://substackcdn.com/image/fetch/$s_!o6uE!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6c90dedc-346e-45d5-bad3-092d682be57a_1058x968.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" 
viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><ul><li><p><strong>Round-Robin partitioner</strong> (Kafka version &#8804; 2.3): It assigns messages to partitions cyclically, one partition after another, and then starts again from the first partition.</p></li><li><p><strong>Sticky partitioner</strong> (Kafka version &#8805; 2.4): It sticks to one partition for a batch of records, sending as many records as possible to that partition until a specific condition is met, such as the batch reaching its size limit. It then switches to another partition and continues.</p></li></ul><p>If the message&#8217;s key is not null, Kafka hashes the key (the default partitioner in the Java client uses murmur2) and takes the result modulo the number of partitions to pick the destination partition. 
Messages with the same key are routed to the same partition, as long as the topic&#8217;s partition count does not change. Kafka also lets users define a custom partitioner tailored to their needs.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!-SyL!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2b35bdd0-3649-401d-8083-137a2dc6d858_928x408.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!-SyL!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2b35bdd0-3649-401d-8083-137a2dc6d858_928x408.png 424w, https://substackcdn.com/image/fetch/$s_!-SyL!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2b35bdd0-3649-401d-8083-137a2dc6d858_928x408.png 848w, https://substackcdn.com/image/fetch/$s_!-SyL!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2b35bdd0-3649-401d-8083-137a2dc6d858_928x408.png 1272w, https://substackcdn.com/image/fetch/$s_!-SyL!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2b35bdd0-3649-401d-8083-137a2dc6d858_928x408.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!-SyL!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2b35bdd0-3649-401d-8083-137a2dc6d858_928x408.png" width="928" height="408" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/2b35bdd0-3649-401d-8083-137a2dc6d858_928x408.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:408,&quot;width&quot;:928,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:83690,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://vutr.substack.com/i/162735392?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2b35bdd0-3649-401d-8083-137a2dc6d858_928x408.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!-SyL!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2b35bdd0-3649-401d-8083-137a2dc6d858_928x408.png 424w, https://substackcdn.com/image/fetch/$s_!-SyL!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2b35bdd0-3649-401d-8083-137a2dc6d858_928x408.png 848w, https://substackcdn.com/image/fetch/$s_!-SyL!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2b35bdd0-3649-401d-8083-137a2dc6d858_928x408.png 1272w, https://substackcdn.com/image/fetch/$s_!-SyL!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2b35bdd0-3649-401d-8083-137a2dc6d858_928x408.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 
0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><div><hr></div><blockquote><p>To celebrate Lunar New Year (the true New Year holiday in Vietnam), I&#8217;m offering <em><strong>50% off the annual subscription</strong></em>. 
The offer ends soon; grab it now to get full access to nearly 200 high-quality data engineering articles.</p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://vutr.substack.com/subscribe&quot;,&quot;text&quot;:&quot;50% off the annual subscription&quot;,&quot;action&quot;:null,&quot;class&quot;:&quot;button-wrapper&quot;}" data-component-name="ButtonCreateButton"><a class="button primary button-wrapper" href="https://vutr.substack.com/subscribe"><span>50% off the annual subscription</span></a></p></blockquote><div><hr></div><h2>Consumer</h2><p>When Kafka was developed, other log-based systems, such as <a href="https://github.com/facebookarchive/scribe">Scribe</a> (from Facebook) or <a href="https://flume.apache.org/">Flume</a>, followed a push-based model where data is pushed to the consumers. However, LinkedIn engineers found the &#8220;pull&#8221; model more suitable for their applications because consumers can read the messages at a rate ideal for their capacity, allowing them to manage their workload effectively. 
The consumer can also avoid being flooded by messages pushed faster than they can manage.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!kuZ-!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb1d8e8aa-8f2f-4e3b-911a-9040062c408b_634x458.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!kuZ-!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb1d8e8aa-8f2f-4e3b-911a-9040062c408b_634x458.png 424w, https://substackcdn.com/image/fetch/$s_!kuZ-!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb1d8e8aa-8f2f-4e3b-911a-9040062c408b_634x458.png 848w, https://substackcdn.com/image/fetch/$s_!kuZ-!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb1d8e8aa-8f2f-4e3b-911a-9040062c408b_634x458.png 1272w, https://substackcdn.com/image/fetch/$s_!kuZ-!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb1d8e8aa-8f2f-4e3b-911a-9040062c408b_634x458.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!kuZ-!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb1d8e8aa-8f2f-4e3b-911a-9040062c408b_634x458.png" width="634" height="458" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/b1d8e8aa-8f2f-4e3b-911a-9040062c408b_634x458.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:458,&quot;width&quot;:634,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:110126,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://vutr.substack.com/i/162735392?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb1d8e8aa-8f2f-4e3b-911a-9040062c408b_634x458.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!kuZ-!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb1d8e8aa-8f2f-4e3b-911a-9040062c408b_634x458.png 424w, https://substackcdn.com/image/fetch/$s_!kuZ-!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb1d8e8aa-8f2f-4e3b-911a-9040062c408b_634x458.png 848w, https://substackcdn.com/image/fetch/$s_!kuZ-!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb1d8e8aa-8f2f-4e3b-911a-9040062c408b_634x458.png 1272w, https://substackcdn.com/image/fetch/$s_!kuZ-!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb1d8e8aa-8f2f-4e3b-911a-9040062c408b_634x458.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" 
viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>The pull model has the following advantages:</p><ul><li><p><strong>Catching up</strong>: If a consumer falls behind in processing messages, it can catch up at its own pace.</p></li><li><p><strong>Batching</strong>: Consumers can pull batches of messages when ready, enabling efficient data transfer.</p></li></ul><h3>The request</h3><p>A consumer always consumes messages from a particular partition sequentially. If the consumer acknowledges a message offset, it implies that the consumer has received all messages before that offset in the partition.</p><p>At its core, a consumer runs a loop that polls the broker for more data. It issues asynchronous pull requests to the broker to retrieve the data. 
Each request contains the offset of the message from which consumption begins.</p><p>The broker uses the offset to seek and return the desired data. After receiving a message, the consumer computes the offset of the next message (using the current message&#8217;s length and offset) and uses it for the subsequent pull request.</p><h3>Consumer groups</h3><p>Kafka has a concept of consumer groups. Each group has one or more consumers that jointly consume a set of subscribed topics. LinkedIn made a topic&#8217;s partition the smallest unit of parallelism; within a group, all messages from one partition are consumed by a single consumer. If the group has more consumers than the topic has partitions, some consumers will receive no messages.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!FhuK!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb7211252-efdf-4daf-8a5d-8365d306bcdd_868x896.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!FhuK!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb7211252-efdf-4daf-8a5d-8365d306bcdd_868x896.png 424w, https://substackcdn.com/image/fetch/$s_!FhuK!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb7211252-efdf-4daf-8a5d-8365d306bcdd_868x896.png 848w, https://substackcdn.com/image/fetch/$s_!FhuK!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb7211252-efdf-4daf-8a5d-8365d306bcdd_868x896.png 1272w, 
https://substackcdn.com/image/fetch/$s_!FhuK!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb7211252-efdf-4daf-8a5d-8365d306bcdd_868x896.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!FhuK!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb7211252-efdf-4daf-8a5d-8365d306bcdd_868x896.png" width="868" height="896" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/b7211252-efdf-4daf-8a5d-8365d306bcdd_868x896.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:896,&quot;width&quot;:868,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:178571,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://vutr.substack.com/i/162735392?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb7211252-efdf-4daf-8a5d-8365d306bcdd_868x896.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!FhuK!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb7211252-efdf-4daf-8a5d-8365d306bcdd_868x896.png 424w, https://substackcdn.com/image/fetch/$s_!FhuK!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb7211252-efdf-4daf-8a5d-8365d306bcdd_868x896.png 848w, https://substackcdn.com/image/fetch/$s_!FhuK!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb7211252-efdf-4daf-8a5d-8365d306bcdd_868x896.png 1272w, 
https://substackcdn.com/image/fetch/$s_!FhuK!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb7211252-efdf-4daf-8a5d-8365d306bcdd_868x896.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>Consumers in the same group share the same group ID. A consumer instance joins a group by being configured with that group&#8217;s ID, so any new instance started with the same ID becomes a member of the group.</p><p>Kafka uses the Group Coordinator (one of the brokers) to balance the load within the group. 
The coordinator, determined by the group ID, ensures that the partitions of the subscribed topics are evenly distributed among the group members. It also keeps the workload balanced when the group membership changes.</p><p>When a consumer wants to join a group, it sends a request to the coordinator. The first consumer to join the group becomes the leader. The leader gets a list of all active consumers from the coordinator and assigns a subset of partitions to each consumer. Consumers maintain their group membership and partition ownership by sending heartbeats to the group coordinator.</p><h3>Partition Assignment</h3><p>Each member of the consumer group is assigned partitions to consume. Kafka has the following assignment strategies:</p><ul><li><p><strong>Range</strong>: This is the default strategy, and it&#8217;s applied to each topic independently. It assigns a consecutive range of partitions from each topic to each consumer. The assignor divides the number of partitions in each topic by the number of consumers to determine how many partitions each consumer gets. 
If the division is not even, the first few consumers each get one extra partition (and therefore more load).</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!wLue!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbb12c8d6-7630-4d5a-9ff8-e457f793ff6e_588x492.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!wLue!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbb12c8d6-7630-4d5a-9ff8-e457f793ff6e_588x492.png 424w, https://substackcdn.com/image/fetch/$s_!wLue!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbb12c8d6-7630-4d5a-9ff8-e457f793ff6e_588x492.png 848w, https://substackcdn.com/image/fetch/$s_!wLue!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbb12c8d6-7630-4d5a-9ff8-e457f793ff6e_588x492.png 1272w, https://substackcdn.com/image/fetch/$s_!wLue!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbb12c8d6-7630-4d5a-9ff8-e457f793ff6e_588x492.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!wLue!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbb12c8d6-7630-4d5a-9ff8-e457f793ff6e_588x492.png" width="588" height="492" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/bb12c8d6-7630-4d5a-9ff8-e457f793ff6e_588x492.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:492,&quot;width&quot;:588,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:60424,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://vutr.substack.com/i/162735392?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbb12c8d6-7630-4d5a-9ff8-e457f793ff6e_588x492.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!wLue!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbb12c8d6-7630-4d5a-9ff8-e457f793ff6e_588x492.png 424w, https://substackcdn.com/image/fetch/$s_!wLue!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbb12c8d6-7630-4d5a-9ff8-e457f793ff6e_588x492.png 848w, https://substackcdn.com/image/fetch/$s_!wLue!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbb12c8d6-7630-4d5a-9ff8-e457f793ff6e_588x492.png 1272w, https://substackcdn.com/image/fetch/$s_!wLue!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbb12c8d6-7630-4d5a-9ff8-e457f793ff6e_588x492.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 
0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div></li><li><p><strong>Round Robin</strong>: This strategy works across all the subscribed topics and assigns their partitions to the group&#8217;s members one at a time, in turn. Its advantage is that it maximizes the number of consumers used. If we add one more consumer to the group, each consumer will have two partitions. 
However, this requires a lot of partition movement in case of rebalancing.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!4e_x!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F217be28a-d843-4dcd-85d3-f69ecc85a79f_576x496.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!4e_x!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F217be28a-d843-4dcd-85d3-f69ecc85a79f_576x496.png 424w, https://substackcdn.com/image/fetch/$s_!4e_x!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F217be28a-d843-4dcd-85d3-f69ecc85a79f_576x496.png 848w, https://substackcdn.com/image/fetch/$s_!4e_x!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F217be28a-d843-4dcd-85d3-f69ecc85a79f_576x496.png 1272w, https://substackcdn.com/image/fetch/$s_!4e_x!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F217be28a-d843-4dcd-85d3-f69ecc85a79f_576x496.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!4e_x!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F217be28a-d843-4dcd-85d3-f69ecc85a79f_576x496.png" width="576" height="496" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/217be28a-d843-4dcd-85d3-f69ecc85a79f_576x496.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:496,&quot;width&quot;:576,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:59187,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://vutr.substack.com/i/162735392?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F217be28a-d843-4dcd-85d3-f69ecc85a79f_576x496.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!4e_x!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F217be28a-d843-4dcd-85d3-f69ecc85a79f_576x496.png 424w, https://substackcdn.com/image/fetch/$s_!4e_x!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F217be28a-d843-4dcd-85d3-f69ecc85a79f_576x496.png 848w, https://substackcdn.com/image/fetch/$s_!4e_x!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F217be28a-d843-4dcd-85d3-f69ecc85a79f_576x496.png 1272w, https://substackcdn.com/image/fetch/$s_!4e_x!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F217be28a-d843-4dcd-85d3-f69ecc85a79f_576x496.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 
0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div></li><li><p><strong>Sticky</strong>: On the first assignment, this strategy behaves like round-robin, but it differs on reassignment. It tries to preserve as many existing assignments as possible when partitions are reassigned within the group. 
The strategy has two main goals: achieving a balanced assignment of partitions and minimizing the overhead during rebalancing by keeping as many assignments in place as possible.</p></li></ul><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!TxE-!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F20656751-f3bd-4d46-a973-367ba7433475_1816x806.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!TxE-!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F20656751-f3bd-4d46-a973-367ba7433475_1816x806.png 424w, https://substackcdn.com/image/fetch/$s_!TxE-!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F20656751-f3bd-4d46-a973-367ba7433475_1816x806.png 848w, https://substackcdn.com/image/fetch/$s_!TxE-!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F20656751-f3bd-4d46-a973-367ba7433475_1816x806.png 1272w, https://substackcdn.com/image/fetch/$s_!TxE-!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F20656751-f3bd-4d46-a973-367ba7433475_1816x806.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!TxE-!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F20656751-f3bd-4d46-a973-367ba7433475_1816x806.png" width="1456" height="646" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/20656751-f3bd-4d46-a973-367ba7433475_1816x806.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:646,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:275729,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://vutr.substack.com/i/162735392?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F20656751-f3bd-4d46-a973-367ba7433475_1816x806.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!TxE-!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F20656751-f3bd-4d46-a973-367ba7433475_1816x806.png 424w, https://substackcdn.com/image/fetch/$s_!TxE-!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F20656751-f3bd-4d46-a973-367ba7433475_1816x806.png 848w, https://substackcdn.com/image/fetch/$s_!TxE-!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F20656751-f3bd-4d46-a973-367ba7433475_1816x806.png 1272w, https://substackcdn.com/image/fetch/$s_!TxE-!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F20656751-f3bd-4d46-a973-367ba7433475_1816x806.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" 
viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><h3>Rebalancing</h3><p>When the number of consumers changes (a member joins or crashes), consumers in the group start consuming messages from partitions previously assigned to other consumers. The process of moving partition ownership between consumers is called rebalancing. There are two types:</p><ul><li><p><strong>Eager rebalancing</strong>: All consumers stop consuming, give up ownership of <strong>all</strong> their partitions, and rejoin the group to get a brand-new partition assignment. 
This causes a short window of unavailability for the entire consumer group.</p></li><li><p><strong>Cooperative rebalancing: </strong>This type only moves ownership of a subset of the partitions from one consumer to another and allows consumers to continue handling messages from partitions that are not reassigned.</p></li></ul><h3><strong>Consumption tracking and commit offset</strong></h3><p>A distinctive thing about Kafka is that the consumer does not have to persist which messages it has consumed; instead, it relies on the broker to store its consumed position. This process of updating the current position between the consumer and broker is called offset commit. The consumer sends a message to inform the broker that it has successfully processed messages up to a certain point. The broker then assumes that the consumer has processed all messages before this point. The broker stores this committed offset in an internal topic (<code>__consumer_offsets</code>).</p><div><hr></div><h2>The object storage trend</h2><p>We learned that the Kafka design relies on the broker&#8217;s local disk and the OS page cache for the storage system. This means compute and storage are tightly coupled. We can&#8217;t scale these two components independently. Scaling storage always requires adding more machines, leading to inefficient resource usage.</p><p>This shared-nothing architecture made sense in the past, when networks were not as fast as they are now and local data centers were more common than cloud resources. However, in the cloud era, Kafka&#8217;s design makes it hard to leverage pay-as-you-go pricing. In addition, a Kafka setup can incur high cross-availability-zone transfer costs due to Kafka data replication.</p><p>Although the initial design makes Kafka a very high-throughput and reliable system, it might not fit well with the cloud era. Many efforts are being made to solve these challenges. 
An early one is the tiered storage proposal from Uber, which allows Kafka to store messages in a two-tiered storage system:</p><ul><li><p>Local storage (broker disk) stores the most recent data.</p></li><li><p>Remote storage (HDFS/S3/GCS) stores historical data.</p></li></ul><p>However, with tiered storage, brokers are still not entirely stateless. Replication still happens, and messages still need to be moved around when the cluster&#8217;s membership changes.</p><p>More recently, a trend of using object storage for Kafka has emerged, with WarpStream, AutoMQ, Bufstream, and Redpanda. They make Kafka operate directly on object storage.</p><p>This approach has many benefits. Object storage is cheaper, compute and storage are separate, and broker-to-broker data replication is no longer needed because object storage itself ensures data availability and durability.</p><p>Recently, <a href="https://aiven.io/">Aiven</a> proposed <a href="https://lists.apache.org/thread/ljxc495nf39myp28pmf77sm2xydwjm6d">KIP-1150</a>, which could significantly change how we operate open-source Kafka deployments. The KIP proposes a new class of topics in Apache Kafka that delegates replication to object storage. 
Users can choose, per topic, whether Kafka stores the data on disk or in object storage.</p><div><hr></div><h2>Outro</h2><p>Thank you for reading this far.</p><p>In this article, we explored Kafka&#8217;s basics, its technical designs, how the producer and consumer interact with the broker, and finally took a glimpse at the current trend of using object storage to deal with Kafka's limitations when operating on the cloud.</p><p>Now, see you in my next article.</p><div><hr></div><h2>Reference</h2><p><em>[1] Jay Kreps, Neha Narkhede, Jun Rao, <a href="https://www.microsoft.com/en-us/research/wp-content/uploads/2017/09/Kafka.pdf">Kafka: a Distributed Messaging System for Log Processing</a> (2011)</em></p><p><em>[2] Gwen Shapira, Todd Palino, Rajini Sivaram, Krit Petty, <a href="https://www.confluent.io/resources/ebook/kafka-the-definitive-guide/">Kafka: The Definitive Guide: Real-Time Data and Stream Processing at Scale</a> (2021)</em></p><p><em>[3] <a href="https://kafka.apache.org/documentation/">Kafka Official Documentation</a></em></p><p><em>[4] Wikipedia - <a href="https://en.wikipedia.org/wiki/Memory-mapped_file">Memory-mapped file</a></em></p><p><em>[5] Wikipedia - <a href="https://en.wikipedia.org/wiki/Page_cache">Page cache</a></em></p><p><em>[6] <a href="https://www.linuxatemyram.com/">Linux ate my ram</a></em></p><p><em>[7] Andriy Zabolotnyy, <a href="https://andriymz.github.io/kafka/kafka-disk-write-performance/#">How Kafka Is so Performant If It Writes to Disk?</a> (2021)</em></p><p><em>[8] Stanislav Kozlovski, <a href="https://2minutestreaming.beehiiv.com/p/apache-kafka-zero-copy-operating-system-optimization">Zero Copy Basics</a> (2023)</em></p><p><em>[9] Travis Jeffery, <a href="https://medium.com/the-hoard/how-kafkas-storage-internals-work-3a29b02e026">How Kafka&#8217;s Storage Internals Work</a> (2016)</em></p><p><em>[10] Confluent Document, <a href="https://docs.confluent.io/kafka/design/consumer-design.html">Kafka Consumer Design: Consumers, 
Consumer Groups, and Offsets</a></em></p><p><em>[11] Conduktor Blog, <a href="https://www.conduktor.io/blog/kafka-partition-assignment-strategy/">Kafka Partition Assignment Strategy</a> (2022)</em></p><p><em>[12] Redpanda Blog, <a href="https://redpanda.com/guides/kafka-tutorial/kafka-partition-strategy">Kafka partition strategy</a></em></p><p><em>[13] Filip Yonov, <a href="https://fnf.dev/43o0CWY">Diskless Kafka: 80% Leaner, 100% Open</a> (2025)</em></p>]]></content:encoded></item><item><title><![CDATA[Deep dive into the challenges of building Kafka on top of S3.]]></title><description><![CDATA[It's really tough]]></description><link>https://vutr.substack.com/p/deep-dive-into-the-challenges-of</link><guid isPermaLink="false">https://vutr.substack.com/p/deep-dive-into-the-challenges-of</guid><dc:creator><![CDATA[Vu Trinh]]></dc:creator><pubDate>Thu, 08 May 2025 03:15:20 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!KBUa!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb65aa640-057f-4f3b-941d-0a96b03f600b_2000x1429.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<blockquote><p><em>I&#8217;m making my life less dull by spending time learning and researching &#8220;how it works&#8221; in the data engineering field. </em></p><p><em>Here is a place where I share everything I&#8217;ve learned. Not subscribed yet? 
Here you go:</em></p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://vutr.substack.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe now&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://vutr.substack.com/subscribe?"><span>Subscribe now</span></a></p></blockquote><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!KBUa!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb65aa640-057f-4f3b-941d-0a96b03f600b_2000x1429.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!KBUa!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb65aa640-057f-4f3b-941d-0a96b03f600b_2000x1429.png 424w, https://substackcdn.com/image/fetch/$s_!KBUa!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb65aa640-057f-4f3b-941d-0a96b03f600b_2000x1429.png 848w, https://substackcdn.com/image/fetch/$s_!KBUa!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb65aa640-057f-4f3b-941d-0a96b03f600b_2000x1429.png 1272w, https://substackcdn.com/image/fetch/$s_!KBUa!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb65aa640-057f-4f3b-941d-0a96b03f600b_2000x1429.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!KBUa!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb65aa640-057f-4f3b-941d-0a96b03f600b_2000x1429.png" width="1456" height="1040" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/b65aa640-057f-4f3b-941d-0a96b03f600b_2000x1429.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1040,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:303995,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://vutr.substack.com/i/161465275?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb65aa640-057f-4f3b-941d-0a96b03f600b_2000x1429.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!KBUa!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb65aa640-057f-4f3b-941d-0a96b03f600b_2000x1429.png 424w, https://substackcdn.com/image/fetch/$s_!KBUa!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb65aa640-057f-4f3b-941d-0a96b03f600b_2000x1429.png 848w, https://substackcdn.com/image/fetch/$s_!KBUa!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb65aa640-057f-4f3b-941d-0a96b03f600b_2000x1429.png 1272w, https://substackcdn.com/image/fetch/$s_!KBUa!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb65aa640-057f-4f3b-941d-0a96b03f600b_2000x1429.png 1456w" sizes="100vw" fetchpriority="high"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" 
height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><div><hr></div><h2>Intro</h2><p>Since its open-source release, Kafka has become the de facto standard for distributed messaging. It started as an internal LinkedIn system built to meet growing log-processing demands and now serves many companies worldwide for various use cases, including messaging, log aggregation, and stream processing.</p><p>However, it was designed at a time when local data centers were more widely adopted than cloud resources. Thus, there are challenges when operating Kafka on the cloud: compute and storage can&#8217;t scale independently, and data replication incurs cross-availability-zone transfer fees.</p><p>When searching for &#8220;Kafka alternative,&#8221; you can easily find four to five solutions that all promise to make your Kafka deployment cheaper and reduce the operational overhead. 
Each adds its own features to make the offer more attractive. However, one thing you might observe over and over again is that they all try to store Kafka data completely in object storage.</p><p>This article won&#8217;t explore Kafka's internal workings or why it is so popular. Instead, we will try to understand the challenges of building Kafka on top of S3.</p><div><hr></div><h2>Background</h2><p>But before we go further, let's ask a simple question: &#8220;Why do they want to offload data to S3?&#8221;</p><p>The answer is cost efficiency.</p><p>In Kafka, compute and storage are tightly coupled, which means that scaling storage requires adding more machines, often leading to inefficient resource usage.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!prmU!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffbf5a71c-2da4-4cd7-91dc-82b5af39e2a0_844x394.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!prmU!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffbf5a71c-2da4-4cd7-91dc-82b5af39e2a0_844x394.png 424w, https://substackcdn.com/image/fetch/$s_!prmU!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffbf5a71c-2da4-4cd7-91dc-82b5af39e2a0_844x394.png 848w, https://substackcdn.com/image/fetch/$s_!prmU!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffbf5a71c-2da4-4cd7-91dc-82b5af39e2a0_844x394.png 1272w, 
https://substackcdn.com/image/fetch/$s_!prmU!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffbf5a71c-2da4-4cd7-91dc-82b5af39e2a0_844x394.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!prmU!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffbf5a71c-2da4-4cd7-91dc-82b5af39e2a0_844x394.png" width="844" height="394" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/fbf5a71c-2da4-4cd7-91dc-82b5af39e2a0_844x394.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:394,&quot;width&quot;:844,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:114007,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://vutr.substack.com/i/161465275?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffbf5a71c-2da4-4cd7-91dc-82b5af39e2a0_844x394.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!prmU!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffbf5a71c-2da4-4cd7-91dc-82b5af39e2a0_844x394.png 424w, https://substackcdn.com/image/fetch/$s_!prmU!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffbf5a71c-2da4-4cd7-91dc-82b5af39e2a0_844x394.png 848w, https://substackcdn.com/image/fetch/$s_!prmU!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffbf5a71c-2da4-4cd7-91dc-82b5af39e2a0_844x394.png 1272w, 
https://substackcdn.com/image/fetch/$s_!prmU!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffbf5a71c-2da4-4cd7-91dc-82b5af39e2a0_844x394.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>Kafka's design also relies on replication for data durability. After storing messages, a leader must replicate data to followers. Because of the tightly coupled architecture, any change in cluster membership forces data to shift from one machine to another.</p><p>Another problem is cross-Availability Zone (AZ) transfer fees. 
Cloud vendors like AWS or GCP charge for data transferred between zones. Because producers can only produce messages to the partition leader, producers in a cloud deployment must write to a leader in a different zone approximately two-thirds of the time (given three brokers spread across three zones). A Kafka setup on the cloud also incurs significant cross-AZ transfer fees because the leader must replicate messages to followers in other zones.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!U2h8!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3a6f7f93-e41f-4da0-a7aa-9eea94744d5e_630x370.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!U2h8!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3a6f7f93-e41f-4da0-a7aa-9eea94744d5e_630x370.png 424w, https://substackcdn.com/image/fetch/$s_!U2h8!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3a6f7f93-e41f-4da0-a7aa-9eea94744d5e_630x370.png 848w, https://substackcdn.com/image/fetch/$s_!U2h8!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3a6f7f93-e41f-4da0-a7aa-9eea94744d5e_630x370.png 1272w, https://substackcdn.com/image/fetch/$s_!U2h8!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3a6f7f93-e41f-4da0-a7aa-9eea94744d5e_630x370.png 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!U2h8!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3a6f7f93-e41f-4da0-a7aa-9eea94744d5e_630x370.png" width="630" height="370" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/3a6f7f93-e41f-4da0-a7aa-9eea94744d5e_630x370.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:370,&quot;width&quot;:630,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:82718,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://vutr.substack.com/i/161465275?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3a6f7f93-e41f-4da0-a7aa-9eea94744d5e_630x370.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!U2h8!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3a6f7f93-e41f-4da0-a7aa-9eea94744d5e_630x370.png 424w, https://substackcdn.com/image/fetch/$s_!U2h8!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3a6f7f93-e41f-4da0-a7aa-9eea94744d5e_630x370.png 848w, https://substackcdn.com/image/fetch/$s_!U2h8!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3a6f7f93-e41f-4da0-a7aa-9eea94744d5e_630x370.png 1272w, https://substackcdn.com/image/fetch/$s_!U2h8!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3a6f7f93-e41f-4da0-a7aa-9eea94744d5e_630x370.png 1456w" sizes="100vw" 
loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>If you offload all the data to object storage like S3, you can:</p><ul><li><p>Save storage money because object storage is cheaper than disk media.</p></li><li><p>Scale computing and storage independently.</p></li><li><p>Avoid data replication because the object storage will ensure data durability and availability.</p></li><li><p>Allow any broker to serve reads and writes.</p></li><li><p>&#8230;</p></li></ul><p>The trend of building a Kafka-compatible solution on object storage is emerging. At least five vendors have introduced a solution like that since 2023. 
WarpStream and AutoMQ arrived in 2023, followed by Confluent Freight Clusters, Bufstream, and Redpanda Cloud Topics in 2024.</p><p>Beyond the hype, I am curious about the challenges of building such a solution that uses S3 for the storage layer. To support this research, I chose <a href="https://github.com/AutoMQ/automq">AutoMQ</a> because it&#8217;s the only open-source one. This allows me to dive deeper into understanding the challenges and solutions.</p><div><hr></div><h2>Brief introduction of AutoMQ</h2><p>AutoMQ is a 100% Kafka-compatible alternative solution. It is designed to run Kafka efficiently on the cloud by reusing Kafka&#8217;s codebase for the protocol and rewriting the storage layer so that, with the help of a write-ahead log (WAL), it can effectively offload data to object storage. For more details on AutoMQ, you can check <a href="https://open.substack.com/pub/vutr/p/how-do-we-run-kafka-100-on-the-object?r=2rj6sg&amp;utm_campaign=post&amp;utm_medium=web&amp;showWelcomeOnShare=false">my previous article</a>.</p><p>Next, we will discuss the potential challenges of building Kafka on object storage and then see how AutoMQ overcomes them.</p><div><hr></div><h2>Latency</h2><p>The first and most obvious challenge is latency. Here are <a href="https://tontinton.com/posts/new-age-data-intensive-apps/">some numbers</a> to give you a sense of scale: for GetObject requests to object storage, the median latency is ~15ms, and P90 is ~60ms. 
The latency of an NVMe SSD is 20&#8211;100 &#956;s, which is 1000x faster.</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!sXfz!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2a4d4249-640d-499b-a50f-42ba98772548_698x188.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!sXfz!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2a4d4249-640d-499b-a50f-42ba98772548_698x188.png 424w, https://substackcdn.com/image/fetch/$s_!sXfz!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2a4d4249-640d-499b-a50f-42ba98772548_698x188.png 848w, https://substackcdn.com/image/fetch/$s_!sXfz!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2a4d4249-640d-499b-a50f-42ba98772548_698x188.png 1272w, https://substackcdn.com/image/fetch/$s_!sXfz!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2a4d4249-640d-499b-a50f-42ba98772548_698x188.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!sXfz!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2a4d4249-640d-499b-a50f-42ba98772548_698x188.png" width="698" height="188" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/2a4d4249-640d-499b-a50f-42ba98772548_698x188.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:188,&quot;width&quot;:698,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:26102,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://vutr.substack.com/i/161465275?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2a4d4249-640d-499b-a50f-42ba98772548_698x188.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!sXfz!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2a4d4249-640d-499b-a50f-42ba98772548_698x188.png 424w, https://substackcdn.com/image/fetch/$s_!sXfz!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2a4d4249-640d-499b-a50f-42ba98772548_698x188.png 848w, https://substackcdn.com/image/fetch/$s_!sXfz!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2a4d4249-640d-499b-a50f-42ba98772548_698x188.png 1272w, https://substackcdn.com/image/fetch/$s_!sXfz!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2a4d4249-640d-499b-a50f-42ba98772548_698x188.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a></figure></div><p>Some vendors choose to sacrifice low latency. WarpStream and Bufstream believe this is a good trade-off for huge cost savings and ease of operation. 
These systems wait until the message persists in the object storage before sending the acknowledgment message to the producer.</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!H5dj!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2ee16f6c-ea9f-4abd-8a6d-e0c08b0add31_804x230.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!H5dj!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2ee16f6c-ea9f-4abd-8a6d-e0c08b0add31_804x230.png 424w, https://substackcdn.com/image/fetch/$s_!H5dj!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2ee16f6c-ea9f-4abd-8a6d-e0c08b0add31_804x230.png 848w, https://substackcdn.com/image/fetch/$s_!H5dj!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2ee16f6c-ea9f-4abd-8a6d-e0c08b0add31_804x230.png 1272w, https://substackcdn.com/image/fetch/$s_!H5dj!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2ee16f6c-ea9f-4abd-8a6d-e0c08b0add31_804x230.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!H5dj!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2ee16f6c-ea9f-4abd-8a6d-e0c08b0add31_804x230.png" width="804" height="230" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/2ee16f6c-ea9f-4abd-8a6d-e0c08b0add31_804x230.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:230,&quot;width&quot;:804,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:56312,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://vutr.substack.com/i/161465275?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2ee16f6c-ea9f-4abd-8a6d-e0c08b0add31_804x230.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!H5dj!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2ee16f6c-ea9f-4abd-8a6d-e0c08b0add31_804x230.png 424w, https://substackcdn.com/image/fetch/$s_!H5dj!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2ee16f6c-ea9f-4abd-8a6d-e0c08b0add31_804x230.png 848w, https://substackcdn.com/image/fetch/$s_!H5dj!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2ee16f6c-ea9f-4abd-8a6d-e0c08b0add31_804x230.png 1272w, https://substackcdn.com/image/fetch/$s_!H5dj!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2ee16f6c-ea9f-4abd-8a6d-e0c08b0add31_804x230.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a></figure></div><p>AutoMQ doesn&#8217;t do that. It achieves low latency through a WAL+S3 architecture. To keep the solution low latency (write latency P99 &lt; 10ms), the AutoMQ broker writes data to WAL. 
The WAL is essentially a disk device, such as AWS EBS. The brokers must ensure the message is already in the WAL before writing to S3; when the broker receives the message, it returns an &#8220;I got your message&#8221; response only when it persists in the WAL. The data is then later written to object storage asynchronously.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!qHqZ!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1f19cbb4-4d5a-4d29-a460-d96a621d3533_984x428.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!qHqZ!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1f19cbb4-4d5a-4d29-a460-d96a621d3533_984x428.png 424w, https://substackcdn.com/image/fetch/$s_!qHqZ!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1f19cbb4-4d5a-4d29-a460-d96a621d3533_984x428.png 848w, https://substackcdn.com/image/fetch/$s_!qHqZ!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1f19cbb4-4d5a-4d29-a460-d96a621d3533_984x428.png 1272w, https://substackcdn.com/image/fetch/$s_!qHqZ!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1f19cbb4-4d5a-4d29-a460-d96a621d3533_984x428.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!qHqZ!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1f19cbb4-4d5a-4d29-a460-d96a621d3533_984x428.png" width="984" height="428" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/1f19cbb4-4d5a-4d29-a460-d96a621d3533_984x428.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:428,&quot;width&quot;:984,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:89317,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://vutr.substack.com/i/161465275?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1f19cbb4-4d5a-4d29-a460-d96a621d3533_984x428.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!qHqZ!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1f19cbb4-4d5a-4d29-a460-d96a621d3533_984x428.png 424w, https://substackcdn.com/image/fetch/$s_!qHqZ!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1f19cbb4-4d5a-4d29-a460-d96a621d3533_984x428.png 848w, https://substackcdn.com/image/fetch/$s_!qHqZ!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1f19cbb4-4d5a-4d29-a460-d96a621d3533_984x428.png 1272w, https://substackcdn.com/image/fetch/$s_!qHqZ!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1f19cbb4-4d5a-4d29-a460-d96a621d3533_984x428.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 
0 20 20"></svg></button></div></div></div></a></figure></div><p>The idea is that the WAL can exploit the characteristics of different cloud storage media and be freely combined with S3 to <a href="https://open.substack.com/pub/vutr/p/how-automq-reduces-nearly-100-of?r=2rj6sg&amp;utm_campaign=post&amp;utm_medium=web&amp;showWelcomeOnShare=false">adapt to various scenarios</a>. For example:</p><ul><li><p>With an EBS WAL, latency is at its lowest. However, customers are still charged for cross-AZ data transfer when producers send messages to partition leaders.</p></li><li><p>With an S3 WAL (AutoMQ uses S3 as the WAL in addition to the primary storage), users can remove the cross-AZ cost completely, but latency increases in return.</p></li></ul><div><hr></div><h2>IOPS</h2><p>Closely related to latency is how frequently data is written to object storage. 
<a href="https://aws.amazon.com/s3/pricing/">S3 Standard PUT requests cost $0.005 per 1,000 requests</a>. A service issuing 10,000 writes per second makes about 25.9 billion PUT requests in a 30-day month, which would cost roughly $130,000.</p><p>If the brokers wrote every message to object storage right after receiving it from the producer, the number of PUT requests would be enormous.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!wY7s!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc3d78c39-3f10-4ca4-8cac-47578dd26b69_802x262.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!wY7s!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc3d78c39-3f10-4ca4-8cac-47578dd26b69_802x262.png 424w, https://substackcdn.com/image/fetch/$s_!wY7s!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc3d78c39-3f10-4ca4-8cac-47578dd26b69_802x262.png 848w, https://substackcdn.com/image/fetch/$s_!wY7s!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc3d78c39-3f10-4ca4-8cac-47578dd26b69_802x262.png 1272w, https://substackcdn.com/image/fetch/$s_!wY7s!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc3d78c39-3f10-4ca4-8cac-47578dd26b69_802x262.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!wY7s!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc3d78c39-3f10-4ca4-8cac-47578dd26b69_802x262.png" width="802" height="262" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/c3d78c39-3f10-4ca4-8cac-47578dd26b69_802x262.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:262,&quot;width&quot;:802,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:64765,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://vutr.substack.com/i/161465275?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc3d78c39-3f10-4ca4-8cac-47578dd26b69_802x262.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!wY7s!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc3d78c39-3f10-4ca4-8cac-47578dd26b69_802x262.png 424w, https://substackcdn.com/image/fetch/$s_!wY7s!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc3d78c39-3f10-4ca4-8cac-47578dd26b69_802x262.png 848w, https://substackcdn.com/image/fetch/$s_!wY7s!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc3d78c39-3f10-4ca4-8cac-47578dd26b69_802x262.png 1272w, https://substackcdn.com/image/fetch/$s_!wY7s!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc3d78c39-3f10-4ca4-8cac-47578dd26b69_802x262.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 
0 20 20"></svg></button></div></div></div></a></figure></div><p>To reduce the number of requests to object storage, all vendors have the brokers batch data before uploading it. The brokers buffer data for a set time or until the accumulated data reaches a specific size. 
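</p><p>The size-or-time rule above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the implementation of AutoMQ or any other vendor: the <code>UploadBatcher</code> class, its <code>max_bytes</code> and <code>max_wait_s</code> thresholds, and the <code>uploads</code> list that stands in for PUT requests are all hypothetical names.</p>
<pre>
```python
from dataclasses import dataclass, field


@dataclass
class UploadBatcher:
    """Buffers records and flushes when a size or age limit is reached."""

    max_bytes: int      # hypothetical size threshold
    max_wait_s: float   # hypothetical time threshold
    buffer: list = field(default_factory=list)
    buffered_bytes: int = 0
    first_arrival: float = 0.0
    uploads: list = field(default_factory=list)  # stands in for PUT requests

    def append(self, record: bytes, now: float) -> None:
        # Remember when the oldest buffered record arrived.
        if not self.buffer:
            self.first_arrival = now
        self.buffer.append(record)
        self.buffered_bytes += len(record)
        self.maybe_flush(now)

    def maybe_flush(self, now: float) -> None:
        if not self.buffer:
            return
        too_big = self.buffered_bytes >= self.max_bytes
        too_old = now - self.first_arrival >= self.max_wait_s
        if too_big or too_old:
            # One object (one PUT) carries every buffered record.
            self.uploads.append(b"".join(self.buffer))
            self.buffer = []
            self.buffered_bytes = 0


batcher = UploadBatcher(max_bytes=10, max_wait_s=0.25)
batcher.append(b"aaaa", now=0.0)   # 4 bytes buffered
batcher.append(b"bbbb", now=0.1)   # 8 bytes buffered
batcher.append(b"cccc", now=0.2)   # 12 bytes: size limit reached, flush
batcher.append(b"dd", now=1.0)     # 2 bytes buffered
batcher.maybe_flush(now=1.5)       # oldest record is 0.5 s old: flush
print(len(batcher.uploads))        # 2 uploads for 4 records
```
</pre>
<p>In this run, four records reach storage in two uploads instead of four.</p><p>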
Users can choose to reduce the buffer time for lower latency, but in return, they have to pay more for PUT requests.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!NYZ_!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F59d0295a-77a6-41e9-a332-5317c89ea066_818x300.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!NYZ_!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F59d0295a-77a6-41e9-a332-5317c89ea066_818x300.png 424w, https://substackcdn.com/image/fetch/$s_!NYZ_!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F59d0295a-77a6-41e9-a332-5317c89ea066_818x300.png 848w, https://substackcdn.com/image/fetch/$s_!NYZ_!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F59d0295a-77a6-41e9-a332-5317c89ea066_818x300.png 1272w, https://substackcdn.com/image/fetch/$s_!NYZ_!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F59d0295a-77a6-41e9-a332-5317c89ea066_818x300.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!NYZ_!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F59d0295a-77a6-41e9-a332-5317c89ea066_818x300.png" width="818" height="300" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/59d0295a-77a6-41e9-a332-5317c89ea066_818x300.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:300,&quot;width&quot;:818,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:65632,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://vutr.substack.com/i/161465275?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F59d0295a-77a6-41e9-a332-5317c89ea066_818x300.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!NYZ_!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F59d0295a-77a6-41e9-a332-5317c89ea066_818x300.png 424w, https://substackcdn.com/image/fetch/$s_!NYZ_!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F59d0295a-77a6-41e9-a332-5317c89ea066_818x300.png 848w, https://substackcdn.com/image/fetch/$s_!NYZ_!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F59d0295a-77a6-41e9-a332-5317c89ea066_818x300.png 1272w, https://substackcdn.com/image/fetch/$s_!NYZ_!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F59d0295a-77a6-41e9-a332-5317c89ea066_818x300.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 
0 20 20"></svg></button></div></div></div></a></figure></div><p>Brokers can batch data from different topics and partitions together, which lowers the per-partition cost of writing. 
In the process of batching data in AutoMQ, it may generate two types of objects:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!QYFB!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc296e281-79c8-4213-9b0d-89a62a2f1d15_2350x634.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!QYFB!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc296e281-79c8-4213-9b0d-89a62a2f1d15_2350x634.png 424w, https://substackcdn.com/image/fetch/$s_!QYFB!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc296e281-79c8-4213-9b0d-89a62a2f1d15_2350x634.png 848w, https://substackcdn.com/image/fetch/$s_!QYFB!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc296e281-79c8-4213-9b0d-89a62a2f1d15_2350x634.png 1272w, https://substackcdn.com/image/fetch/$s_!QYFB!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc296e281-79c8-4213-9b0d-89a62a2f1d15_2350x634.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!QYFB!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc296e281-79c8-4213-9b0d-89a62a2f1d15_2350x634.png" width="1456" height="393" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/c296e281-79c8-4213-9b0d-89a62a2f1d15_2350x634.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:393,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:223102,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://vutr.substack.com/i/161465275?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc296e281-79c8-4213-9b0d-89a62a2f1d15_2350x634.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!QYFB!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc296e281-79c8-4213-9b0d-89a62a2f1d15_2350x634.png 424w, https://substackcdn.com/image/fetch/$s_!QYFB!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc296e281-79c8-4213-9b0d-89a62a2f1d15_2350x634.png 848w, https://substackcdn.com/image/fetch/$s_!QYFB!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc296e281-79c8-4213-9b0d-89a62a2f1d15_2350x634.png 1272w, https://substackcdn.com/image/fetch/$s_!QYFB!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc296e281-79c8-4213-9b0d-89a62a2f1d15_2350x634.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" 
viewBox="0 0 20 20"></svg></button></div></div></div></a></figure></div><ul><li><p><strong>Stream Set Object</strong> (SSO): An object that contains consecutive data segments from different partitions.</p></li><li><p><strong>Stream Object</strong> (SO): An object that contains consecutive data segments from a single partition.</p></li></ul><p>When writing data to object storage, there are two scenarios:</p><ul><li><p>If data from a single stream is enough to fill the batch, it is uploaded as an SO.</p></li><li><p>Otherwise, data from several partitions&#8217; streams is combined to reach the batch size, and the broker uploads it as an SSO.</p></li></ul><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" 
href="https://substackcdn.com/image/fetch/$s_!apXk!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F68a26e66-76c9-4e96-9604-f9ee3e8e5a16_1150x542.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!apXk!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F68a26e66-76c9-4e96-9604-f9ee3e8e5a16_1150x542.png 424w, https://substackcdn.com/image/fetch/$s_!apXk!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F68a26e66-76c9-4e96-9604-f9ee3e8e5a16_1150x542.png 848w, https://substackcdn.com/image/fetch/$s_!apXk!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F68a26e66-76c9-4e96-9604-f9ee3e8e5a16_1150x542.png 1272w, https://substackcdn.com/image/fetch/$s_!apXk!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F68a26e66-76c9-4e96-9604-f9ee3e8e5a16_1150x542.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!apXk!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F68a26e66-76c9-4e96-9604-f9ee3e8e5a16_1150x542.png" width="1150" height="542" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/68a26e66-76c9-4e96-9604-f9ee3e8e5a16_1150x542.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:542,&quot;width&quot;:1150,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:136009,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://vutr.substack.com/i/161465275?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F68a26e66-76c9-4e96-9604-f9ee3e8e5a16_1150x542.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!apXk!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F68a26e66-76c9-4e96-9604-f9ee3e8e5a16_1150x542.png 424w, https://substackcdn.com/image/fetch/$s_!apXk!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F68a26e66-76c9-4e96-9604-f9ee3e8e5a16_1150x542.png 848w, https://substackcdn.com/image/fetch/$s_!apXk!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F68a26e66-76c9-4e96-9604-f9ee3e8e5a16_1150x542.png 1272w, https://substackcdn.com/image/fetch/$s_!apXk!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F68a26e66-76c9-4e96-9604-f9ee3e8e5a16_1150x542.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" 
viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">This does not reflect the actual implementation of the AutoMQ compaction process.</figcaption></figure></div><p>Thus, data from a single partition can be spread across multiple objects, which hurts read performance because the broker has to issue more requests. To deal with this, AutoMQ runs a background compaction process that asynchronously consolidates data from the same partition into as few objects as possible. 
This ensures that data within the same partition is stored physically close together, enabling sequential reads from object storage.</p><div><hr></div><h2>Cache Management</h2><p>Following up on the latency and IOPS challenges above, the simplest way to improve read performance on object storage is to make fewer GET requests to it.</p><p>Data caching helps with that; it serves two purposes: improving read performance and limiting the number of requests to object storage. But this raises another question: how can a solution manage its cache efficiently to improve the cache hit rate? (<em><a href="https://www.karlton.org/2017/12/naming-things-hard/">There are only two hard things in Computer Science: cache invalidation and naming things.</a></em>)</p><p>WarpStream distributes load across agents using a consistent hashing ring. Each agent is responsible for a subset of the data within a topic. When an agent receives a request from a client, it identifies which agent is in charge of the required file and routes the request accordingly.</p><p>AutoMQ tries to keep Kafka&#8217;s &#8220;data locality&#8221; characteristic, where brokers are still aware of the partitions they are in charge of. Thus, cache management in AutoMQ can be implemented by letting brokers cache data from the partitions they manage. 
(We will discuss the data locality later)</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!L75X!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb5f19901-7927-4029-b715-93140aff9b02_478x454.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!L75X!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb5f19901-7927-4029-b715-93140aff9b02_478x454.png 424w, https://substackcdn.com/image/fetch/$s_!L75X!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb5f19901-7927-4029-b715-93140aff9b02_478x454.png 848w, https://substackcdn.com/image/fetch/$s_!L75X!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb5f19901-7927-4029-b715-93140aff9b02_478x454.png 1272w, https://substackcdn.com/image/fetch/$s_!L75X!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb5f19901-7927-4029-b715-93140aff9b02_478x454.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!L75X!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb5f19901-7927-4029-b715-93140aff9b02_478x454.png" width="478" height="454" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/b5f19901-7927-4029-b715-93140aff9b02_478x454.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:454,&quot;width&quot;:478,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:64820,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://vutr.substack.com/i/161465275?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb5f19901-7927-4029-b715-93140aff9b02_478x454.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!L75X!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb5f19901-7927-4029-b715-93140aff9b02_478x454.png 424w, https://substackcdn.com/image/fetch/$s_!L75X!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb5f19901-7927-4029-b715-93140aff9b02_478x454.png 848w, https://substackcdn.com/image/fetch/$s_!L75X!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb5f19901-7927-4029-b715-93140aff9b02_478x454.png 1272w, https://substackcdn.com/image/fetch/$s_!L75X!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb5f19901-7927-4029-b715-93140aff9b02_478x454.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 
0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>AutoMQ manages two distinct caches for different needs: the log cache handles writes and hot reads (recent data), and the block cache handles cold reads (historical data). When brokers receive messages from producers, they write the data to the log cache as well as to the WAL, so that recent reads can be served from memory.</p><p>If data isn&#8217;t available in the log cache, it is read from the block cache instead. The block cache is filled by loading data from object storage. Using techniques like prefetching and batch reading, it improves the chances of hitting memory even for historical reads, which helps maintain performance during cold reads.</p><div><hr></div><h2>Metadata Management</h2><p>Systems built on object storage need more metadata than Kafka. For example, Kafka can scan the file system directory tree to list Segments under a Partition. 
The equivalent way to do this in S3 is to issue LIST requests, but unfortunately, these requests perform poorly. In addition, because of batching data, message ordering is not straightforward like in Kafka.</p><p>These new systems need more metadata to answer questions like &#8220;which objects hold this topic&#8217;s data?&#8221; or &#8220;how can I ensure the message ordering?&#8221;</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!FqgT!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbd5f932e-5987-4195-89ac-9d6859054714_1108x580.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!FqgT!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbd5f932e-5987-4195-89ac-9d6859054714_1108x580.png 424w, https://substackcdn.com/image/fetch/$s_!FqgT!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbd5f932e-5987-4195-89ac-9d6859054714_1108x580.png 848w, https://substackcdn.com/image/fetch/$s_!FqgT!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbd5f932e-5987-4195-89ac-9d6859054714_1108x580.png 1272w, https://substackcdn.com/image/fetch/$s_!FqgT!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbd5f932e-5987-4195-89ac-9d6859054714_1108x580.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!FqgT!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbd5f932e-5987-4195-89ac-9d6859054714_1108x580.png" width="1108" 
height="580" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/bd5f932e-5987-4195-89ac-9d6859054714_1108x580.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:580,&quot;width&quot;:1108,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:318587,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://vutr.substack.com/i/161465275?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbd5f932e-5987-4195-89ac-9d6859054714_1108x580.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!FqgT!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbd5f932e-5987-4195-89ac-9d6859054714_1108x580.png 424w, https://substackcdn.com/image/fetch/$s_!FqgT!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbd5f932e-5987-4195-89ac-9d6859054714_1108x580.png 848w, https://substackcdn.com/image/fetch/$s_!FqgT!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbd5f932e-5987-4195-89ac-9d6859054714_1108x580.png 1272w, https://substackcdn.com/image/fetch/$s_!FqgT!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbd5f932e-5987-4195-89ac-9d6859054714_1108x580.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" 
height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>The amount of metadata correlates with the total number of objects stored in S3. To keep it manageable, AutoMQ leverages the compaction technique from the IOPS section to combine multiple small objects into larger ones, thus limiting the amount of metadata.</p><p>In addition, Kafka uses ZooKeeper or <a href="https://developer.confluent.io/learn/kraft/">KRaft</a> to store cluster metadata such as broker registrations or topic configurations. 
WarpStream or Bufstream relies on a transactional database for this purpose.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!hN7_!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fced9b54d-5955-494e-8221-1f9b8e275615_1078x560.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!hN7_!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fced9b54d-5955-494e-8221-1f9b8e275615_1078x560.png 424w, https://substackcdn.com/image/fetch/$s_!hN7_!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fced9b54d-5955-494e-8221-1f9b8e275615_1078x560.png 848w, https://substackcdn.com/image/fetch/$s_!hN7_!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fced9b54d-5955-494e-8221-1f9b8e275615_1078x560.png 1272w, https://substackcdn.com/image/fetch/$s_!hN7_!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fced9b54d-5955-494e-8221-1f9b8e275615_1078x560.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!hN7_!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fced9b54d-5955-494e-8221-1f9b8e275615_1078x560.png" width="1078" height="560" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/ced9b54d-5955-494e-8221-1f9b8e275615_1078x560.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:560,&quot;width&quot;:1078,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:169356,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://vutr.substack.com/i/161465275?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fced9b54d-5955-494e-8221-1f9b8e275615_1078x560.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!hN7_!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fced9b54d-5955-494e-8221-1f9b8e275615_1078x560.png 424w, https://substackcdn.com/image/fetch/$s_!hN7_!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fced9b54d-5955-494e-8221-1f9b8e275615_1078x560.png 848w, https://substackcdn.com/image/fetch/$s_!hN7_!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fced9b54d-5955-494e-8221-1f9b8e275615_1078x560.png 1272w, https://substackcdn.com/image/fetch/$s_!hN7_!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fced9b54d-5955-494e-8221-1f9b8e275615_1078x560.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" 
viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">ZooKeeper mode vs. KRaft mode. <a href="https://developer.confluent.io/learn/kraft/">Source</a></figcaption></figure></div><p>In contrast, AutoMQ leverages KRaft. It also has a controller quorum that determines the controller leader. The cluster metadata, which includes the mapping between topics/partitions and data, the mapping between partitions and brokers, and so on, is stored in the leader. Only the leader can modify this metadata; a broker that wants to change it must communicate with the leader. 
The metadata is replicated to every broker; the controller propagates any change to all of them.</p><div><hr></div><h2>Kafka Compatibility</h2><p>Besides solving all the problems above, a Kafka alternative must provide one critical feature: letting users switch from Kafka to it effortlessly. In other words, the new solution must be Kafka-compatible.</p><p>The Kafka protocol is centered around an essential technical design: it relies on local disks to store data. This includes appending messages to the physical logs, dividing the topic into partitions, replicating them among brokers, load balancing, asking for leader information to produce messages, serving consumers by locating the offset in the segment files, and more.</p><p>Thus, developing a Kafka-compatible solution on object storage is extremely challenging. Setting performance aside, writing to object storage differs completely from how Kafka writes data to disk. We can&#8217;t open an immutable object and append data to the end as we do with a filesystem.</p><p>So, how can a solution built on object storage seamlessly replace one designed to work with local disks?</p><p>Some (e.g., WarpStream, Bufstream) decided to rewrite the Kafka protocol from scratch to adapt it to object storage. They believe this approach is more straightforward than leveraging the open-source Kafka protocol.</p><p>AutoMQ thinks the opposite. They focus solely on how to rewrite only Kafka&#8217;s storage layer so that the open-source protocol can be reused. Although the process might encounter many challenges, I think it is rewarding. They can confidently offer 100% Kafka compatibility to users; when Kafka releases new features, they merge them into the AutoMQ source code. But how did they develop the new storage layer to work with the object store? 
Let&#8217;s first revisit the Kafka internal.</p><p>In Kafka, there are essential components:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!_aZA!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F00b4a01b-ba6a-4898-ab95-d1c9bcf71336_380x394.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!_aZA!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F00b4a01b-ba6a-4898-ab95-d1c9bcf71336_380x394.png 424w, https://substackcdn.com/image/fetch/$s_!_aZA!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F00b4a01b-ba6a-4898-ab95-d1c9bcf71336_380x394.png 848w, https://substackcdn.com/image/fetch/$s_!_aZA!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F00b4a01b-ba6a-4898-ab95-d1c9bcf71336_380x394.png 1272w, https://substackcdn.com/image/fetch/$s_!_aZA!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F00b4a01b-ba6a-4898-ab95-d1c9bcf71336_380x394.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!_aZA!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F00b4a01b-ba6a-4898-ab95-d1c9bcf71336_380x394.png" width="380" height="394" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/00b4a01b-ba6a-4898-ab95-d1c9bcf71336_380x394.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:394,&quot;width&quot;:380,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:58606,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://vutr.substack.com/i/161465275?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F00b4a01b-ba6a-4898-ab95-d1c9bcf71336_380x394.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!_aZA!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F00b4a01b-ba6a-4898-ab95-d1c9bcf71336_380x394.png 424w, https://substackcdn.com/image/fetch/$s_!_aZA!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F00b4a01b-ba6a-4898-ab95-d1c9bcf71336_380x394.png 848w, https://substackcdn.com/image/fetch/$s_!_aZA!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F00b4a01b-ba6a-4898-ab95-d1c9bcf71336_380x394.png 1272w, https://substackcdn.com/image/fetch/$s_!_aZA!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F00b4a01b-ba6a-4898-ab95-d1c9bcf71336_380x394.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 
0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><ul><li><p><strong>The network</strong> is responsible for managing connections to and from Kafka clients.</p></li><li><p><strong>KafkaApis</strong> dispatches each request to a specific module based on the request&#8217;s API key.</p></li><li><p><strong>ReplicaManager</strong> is responsible for sending and receiving messages and for partition management; <strong>Coordinator</strong> is responsible for consumer management and transactional messages; <strong>KRaft</strong> is responsible for cluster metadata.</p></li><li><p><strong>Storage</strong>: This module provides reliable data storage and exposes the Partition abstraction to ReplicaManager, Coordinator, and KRaft. 
It is divided into multiple levels:</p><ul><li><p><strong>UnifiedLog</strong> ensures high data reliability through ISR multi-replica replication.</p></li><li><p><strong>LocalLog</strong> handles local data storage, offering an "infinite" stream storage abstraction.</p></li><li><p><strong>LogSegment</strong>, the smallest storage unit in Kafka, splits LocalLog into data segments and maps them to corresponding physical files.</p></li></ul></li></ul><p>To ensure 100% Kafka compatibility, AutoMQ reuses all of Kafka's logic except for the storage layer. The new implementation must still provide the partition abstraction so that other Kafka modules like ReplicaManager, Coordinator, or KRaft can integrate smoothly.</p><p>Although Kafka exposes a continuous stream abstraction through partitions, many operations must be performed at the segment level, such as the internal compaction process, Kafka's log recovery, transaction and timestamp indexing, or read operations.</p><p>AutoMQ still uses segments like Kafka, but it introduces the Stream abstraction over the segments to facilitate data offloading to object storage. 
The stream&#8217;s core methods at the API level are appending and fetching a stream.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!uhMA!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4a976859-1c9c-434d-99a6-4fffc909a9b3_912x588.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!uhMA!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4a976859-1c9c-434d-99a6-4fffc909a9b3_912x588.png 424w, https://substackcdn.com/image/fetch/$s_!uhMA!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4a976859-1c9c-434d-99a6-4fffc909a9b3_912x588.png 848w, https://substackcdn.com/image/fetch/$s_!uhMA!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4a976859-1c9c-434d-99a6-4fffc909a9b3_912x588.png 1272w, https://substackcdn.com/image/fetch/$s_!uhMA!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4a976859-1c9c-434d-99a6-4fffc909a9b3_912x588.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!uhMA!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4a976859-1c9c-434d-99a6-4fffc909a9b3_912x588.png" width="912" height="588" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/4a976859-1c9c-434d-99a6-4fffc909a9b3_912x588.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:588,&quot;width&quot;:912,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:137706,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://vutr.substack.com/i/161465275?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4a976859-1c9c-434d-99a6-4fffc909a9b3_912x588.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!uhMA!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4a976859-1c9c-434d-99a6-4fffc909a9b3_912x588.png 424w, https://substackcdn.com/image/fetch/$s_!uhMA!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4a976859-1c9c-434d-99a6-4fffc909a9b3_912x588.png 848w, https://substackcdn.com/image/fetch/$s_!uhMA!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4a976859-1c9c-434d-99a6-4fffc909a9b3_912x588.png 1272w, https://substackcdn.com/image/fetch/$s_!uhMA!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4a976859-1c9c-434d-99a6-4fffc909a9b3_912x588.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" 
viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>Compared to Kafka's Log, the Stream lacks indexing, a transaction index, a timestamp index, and compaction. 
To align with how Kafka organizes metadata and index, AutoMQ&#8217;s stream contains:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!D9Qu!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8254b9a8-a8ed-499a-83b8-13d5609e130e_478x414.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!D9Qu!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8254b9a8-a8ed-499a-83b8-13d5609e130e_478x414.png 424w, https://substackcdn.com/image/fetch/$s_!D9Qu!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8254b9a8-a8ed-499a-83b8-13d5609e130e_478x414.png 848w, https://substackcdn.com/image/fetch/$s_!D9Qu!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8254b9a8-a8ed-499a-83b8-13d5609e130e_478x414.png 1272w, https://substackcdn.com/image/fetch/$s_!D9Qu!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8254b9a8-a8ed-499a-83b8-13d5609e130e_478x414.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!D9Qu!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8254b9a8-a8ed-499a-83b8-13d5609e130e_478x414.png" width="478" height="414" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/8254b9a8-a8ed-499a-83b8-13d5609e130e_478x414.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:414,&quot;width&quot;:478,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:86632,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://vutr.substack.com/i/161465275?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8254b9a8-a8ed-499a-83b8-13d5609e130e_478x414.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!D9Qu!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8254b9a8-a8ed-499a-83b8-13d5609e130e_478x414.png 424w, https://substackcdn.com/image/fetch/$s_!D9Qu!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8254b9a8-a8ed-499a-83b8-13d5609e130e_478x414.png 848w, https://substackcdn.com/image/fetch/$s_!D9Qu!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8254b9a8-a8ed-499a-83b8-13d5609e130e_478x414.png 1272w, https://substackcdn.com/image/fetch/$s_!D9Qu!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8254b9a8-a8ed-499a-83b8-13d5609e130e_478x414.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 
0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><ul><li><p><strong>Meta stream</strong> provides KV-like semantics to store metadata at the partition level. Apache Kafka can scan the file system directory tree to list the segments under a partition. In AutoMQ, the meta stream uses ElasticLogMeta to record the segment list and the mapping between segments and streams. This also avoids sending LIST requests to object storage.</p></li><li><p><strong>Data stream</strong> maps a stream to segment data. It already supports querying data by logical offset, so it can replace xxx.log (the segment data file) and xxx.index in Kafka.</p></li><li><p><strong>Txn/Time streams</strong> are equivalent to xxx.txnindex and xxx.timeindex in Kafka.</p></li></ul><p>Unlike Kafka&#8217;s segment abstraction, which is limited to filesystem operations, a stream has more work to do: caching messages, writing them to a write&#8209;ahead log, and asynchronously offloading them to S3.</p><div><hr></div><h2>Convergence of Shared Nothing and Shared Disk</h2><p>Both shared-nothing and shared-disk architectures have advantages. The former has data locality, which lets it serve writes and cache data more efficiently. The latter makes it efficient to share data across nodes. In theory, any broker can read and write any partition when data is stored in object storage.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!qsIP!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3549d8c3-36da-433d-bc98-d0e642e670d3_542x438.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!qsIP!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3549d8c3-36da-433d-bc98-d0e642e670d3_542x438.png 424w, https://substackcdn.com/image/fetch/$s_!qsIP!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3549d8c3-36da-433d-bc98-d0e642e670d3_542x438.png 848w, https://substackcdn.com/image/fetch/$s_!qsIP!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3549d8c3-36da-433d-bc98-d0e642e670d3_542x438.png 1272w, https://substackcdn.com/image/fetch/$s_!qsIP!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3549d8c3-36da-433d-bc98-d0e642e670d3_542x438.png 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!qsIP!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3549d8c3-36da-433d-bc98-d0e642e670d3_542x438.png" width="542" height="438" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/3549d8c3-36da-433d-bc98-d0e642e670d3_542x438.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:438,&quot;width&quot;:542,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:72630,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://vutr.substack.com/i/161465275?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3549d8c3-36da-433d-bc98-d0e642e670d3_542x438.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!qsIP!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3549d8c3-36da-433d-bc98-d0e642e670d3_542x438.png 424w, https://substackcdn.com/image/fetch/$s_!qsIP!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3549d8c3-36da-433d-bc98-d0e642e670d3_542x438.png 848w, https://substackcdn.com/image/fetch/$s_!qsIP!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3549d8c3-36da-433d-bc98-d0e642e670d3_542x438.png 1272w, https://substackcdn.com/image/fetch/$s_!qsIP!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3549d8c3-36da-433d-bc98-d0e642e670d3_542x438.png 1456w" sizes="100vw" 
loading="lazy"></picture></div></a></figure></div><p>In Kafka's original shared-nothing design, partitions are bound to nodes: read and write requests for a partition can only go to the node that owns it. Kafka relies on this binding to route requests and to balance load. Vendors therefore still have to consider data locality when building an alternative on a shared-disk architecture.</p><p>WarpStream, for example, bypasses data locality on the write path: any agent in the same Availability Zone (AZ) as the client can serve a write. Read requests, however, must be served by the responsible agents
(as mentioned in the Cache Management section).</p><p>Although AutoMQ stores data entirely in object storage, it still wants each broker to know which partitions it is responsible for. AutoMQ keeps the &#8220;data locality&#8221; characteristic of Kafka by still assigning each partition to a specific broker.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!6a6t!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fde907376-6ad5-40fb-a28a-b7f011d7dbda_562x626.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!6a6t!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fde907376-6ad5-40fb-a28a-b7f011d7dbda_562x626.png 424w, https://substackcdn.com/image/fetch/$s_!6a6t!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fde907376-6ad5-40fb-a28a-b7f011d7dbda_562x626.png 848w, https://substackcdn.com/image/fetch/$s_!6a6t!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fde907376-6ad5-40fb-a28a-b7f011d7dbda_562x626.png 1272w, https://substackcdn.com/image/fetch/$s_!6a6t!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fde907376-6ad5-40fb-a28a-b7f011d7dbda_562x626.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!6a6t!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fde907376-6ad5-40fb-a28a-b7f011d7dbda_562x626.png" width="562" height="626" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/de907376-6ad5-40fb-a28a-b7f011d7dbda_562x626.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:626,&quot;width&quot;:562,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:104927,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://vutr.substack.com/i/161465275?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fde907376-6ad5-40fb-a28a-b7f011d7dbda_562x626.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!6a6t!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fde907376-6ad5-40fb-a28a-b7f011d7dbda_562x626.png 424w, https://substackcdn.com/image/fetch/$s_!6a6t!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fde907376-6ad5-40fb-a28a-b7f011d7dbda_562x626.png 848w, https://substackcdn.com/image/fetch/$s_!6a6t!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fde907376-6ad5-40fb-a28a-b7f011d7dbda_562x626.png 1272w, https://substackcdn.com/image/fetch/$s_!6a6t!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fde907376-6ad5-40fb-a28a-b7f011d7dbda_562x626.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" 
viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><div><hr></div><h2>Throughput</h2><p>A stateless broker has more to do than a Kafka broker. In Kafka, brokers let the operating system handle the storage aspects. With a Kafka-compatible solution that runs on object storage, the broker itself is responsible for buffering data in memory and for uploading, compacting, and parsing data in object storage.</p><p>If not carefully designed, this can cause a lot of overhead for the broker.
Compaction, for example, may slow down regular write requests if these flows are not managed effectively.</p><p>In AutoMQ, there are the following types of network traffic:</p><ul><li><p>Message-sending traffic: Producer -&gt; AutoMQ -&gt; S3</p></li><li><p>Tail-read consumption traffic: AutoMQ -&gt; Consumer</p></li><li><p>Historical consumption traffic: S3 -&gt; AutoMQ -&gt; Consumer</p></li><li><p>Compaction read traffic: S3 -&gt; AutoMQ</p></li><li><p>Compaction upload traffic: AutoMQ -&gt; S3</p></li></ul><p>To keep these traffic types from competing with each other under limited bandwidth, AutoMQ classifies them into tiers:</p><ul><li><p>Tier-0: Message-sending traffic</p></li><li><p>Tier-1: Catch-up read consumption traffic</p></li><li><p>Tier-2: Compaction read/write traffic</p></li><li><p>Tier-3: Chasing read consumption traffic</p></li></ul><p>AutoMQ implements an asynchronous multi-tier rate limiter based on a priority queue and a token bucket.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!UxvB!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb7bcff40-da1a-4457-9c11-c514073ee56d_1694x672.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!UxvB!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb7bcff40-da1a-4457-9c11-c514073ee56d_1694x672.png 424w, https://substackcdn.com/image/fetch/$s_!UxvB!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb7bcff40-da1a-4457-9c11-c514073ee56d_1694x672.png 848w, 
https://substackcdn.com/image/fetch/$s_!UxvB!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb7bcff40-da1a-4457-9c11-c514073ee56d_1694x672.png 1272w, https://substackcdn.com/image/fetch/$s_!UxvB!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb7bcff40-da1a-4457-9c11-c514073ee56d_1694x672.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!UxvB!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb7bcff40-da1a-4457-9c11-c514073ee56d_1694x672.png" width="1456" height="578" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/b7bcff40-da1a-4457-9c11-c514073ee56d_1694x672.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:578,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:240887,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://vutr.substack.com/i/161465275?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb7bcff40-da1a-4457-9c11-c514073ee56d_1694x672.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!UxvB!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb7bcff40-da1a-4457-9c11-c514073ee56d_1694x672.png 424w, https://substackcdn.com/image/fetch/$s_!UxvB!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb7bcff40-da1a-4457-9c11-c514073ee56d_1694x672.png 
848w, https://substackcdn.com/image/fetch/$s_!UxvB!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb7bcff40-da1a-4457-9c11-c514073ee56d_1694x672.png 1272w, https://substackcdn.com/image/fetch/$s_!UxvB!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb7bcff40-da1a-4457-9c11-c514073ee56d_1694x672.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><blockquote><p><em><strong>Token Bucket:</strong> A token bucket is a rate-limiting algorithm that periodically refills a 
&#8220;bucket&#8221; with tokens, each representing permission for a request to proceed. When the bucket is empty, requests are delayed or dropped to prevent system overload.</em></p></blockquote><ul><li><p>For Tier-0 requests, the rate limiter does not apply traffic control.</p></li><li><p>For Tier-1 to Tier-3 requests, if the available tokens are insufficient, the requests are placed into a priority queue ordered by tier. Each time tokens are added to the bucket, a callback thread wakes up and tries to fulfill the queued requests.</p></li></ul><div><hr></div><h2>Cross-AZ traffic cost</h2><p>As mentioned in the <strong>Background</strong> section, Kafka&#8217;s original design can drive up your cross-AZ data transfer bill for two main reasons:</p><ul><li><p>The producer may send messages to a leader in a different zone (1)</p></li><li><p>The leader must replicate data to two followers in different zones (2)</p></li></ul><p>With solutions built on S3, point (2) is resolved easily by letting the object storage take care of data replication. Point (1) is where things get interesting.</p><p>Solutions like WarpStream and Bufstream work around the Kafka service discovery protocol. Before a producer can send messages in Kafka, it must learn the partition&#8217;s leader by issuing a metadata request to a set of bootstrap servers. WarpStream and Bufstream respond to metadata requests with a broker in the same availability zone as the producer, because to them, any broker can serve writes; there is no concept of a &#8220;leader&#8221; here.</p><p>AutoMQ is different because it still wants to maintain data locality, like Kafka.</p><p>It introduces a solution where the WAL is implemented on S3 to eliminate cross-AZ data transfer costs. Imagine a scenario where the producer is in AZ1, and the leader (B2) of Partition 2 (P2) is in AZ2. 
In the AZ1, there is also a broker 1 (B1).</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!rA3K!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F42c47c54-22dc-4ec7-93f8-833215e3aa3f_766x936.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!rA3K!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F42c47c54-22dc-4ec7-93f8-833215e3aa3f_766x936.png 424w, https://substackcdn.com/image/fetch/$s_!rA3K!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F42c47c54-22dc-4ec7-93f8-833215e3aa3f_766x936.png 848w, https://substackcdn.com/image/fetch/$s_!rA3K!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F42c47c54-22dc-4ec7-93f8-833215e3aa3f_766x936.png 1272w, https://substackcdn.com/image/fetch/$s_!rA3K!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F42c47c54-22dc-4ec7-93f8-833215e3aa3f_766x936.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!rA3K!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F42c47c54-22dc-4ec7-93f8-833215e3aa3f_766x936.png" width="766" height="936" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/42c47c54-22dc-4ec7-93f8-833215e3aa3f_766x936.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:936,&quot;width&quot;:766,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:162510,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://vutr.substack.com/i/161465275?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F42c47c54-22dc-4ec7-93f8-833215e3aa3f_766x936.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!rA3K!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F42c47c54-22dc-4ec7-93f8-833215e3aa3f_766x936.png 424w, https://substackcdn.com/image/fetch/$s_!rA3K!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F42c47c54-22dc-4ec7-93f8-833215e3aa3f_766x936.png 848w, https://substackcdn.com/image/fetch/$s_!rA3K!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F42c47c54-22dc-4ec7-93f8-833215e3aa3f_766x936.png 1272w, https://substackcdn.com/image/fetch/$s_!rA3K!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F42c47c54-22dc-4ec7-93f8-833215e3aa3f_766x936.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" 
viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>The producer still makes the metadata request, including producer zone information, to the set of bootstrap brokers. On the AutoMQ side, brokers are mapped across different AZs using a consistent hash algorithm. Let&#8217;s assume AutoMQ places B2 in AZ2 and B1 in AZ1. Since AutoMQ knows that the producer is in AZ1 (based on the metadata request), it will return the information of B1. If the producer is in the same AZ as B2, it will return the information of B2. The core idea is to ensure the producer always communicates with a broker in the same AZ.</p><p>After receiving the information about B1 (keep in mind that this broker isn't responsible for the desired partition), the producer will begin sending messages to B1. 
This broker then buffers the messages in memory and asynchronously writes them to object storage as WAL data.</p><p>After successfully writing the messages to S3, B1 makes an RPC request to B2 (the actual leader of the partition) to inform it about the temporary data, including its location. This results in a small amount of cross-AZ traffic between the two brokers. B2 then reads this temporary data back and appends it to the destination partition (P2). Once B2 has finished writing the data to the partition, it responds to B1, which finally sends an acknowledgment to the producer.</p><div><hr></div><h2>Outro</h2><p>Thank you for reading this far.</p><p>We started this article with the trend of building Kafka-compatible solutions on top of object storage, and my curiosity about the challenges of building such a system. We then discussed some dimensions worth mentioning, such as latency, IOPS, and Kafka compatibility. After identifying the potential challenges in each dimension, we examined how AutoMQ tries to solve them.</p><p>A quick note: I&#8217;m not a Kafka expert at all; I&#8217;m just really interested in the system and want to share what I&#8217;ve learned with the community. 
So, feel free to correct me.</p><p>See you next time!</p><div><hr></div><h2>Reference</h2><p><em>[1] Tony Solomonik, <a href="https://tontinton.com/posts/new-age-data-intensive-apps/">The New Age of Data-Intensive Applications</a> (2024)</em></p><p><em>[2] AutoMQ <a href="https://www.automq.com/docs/automq/what-is-automq/overview">Doc</a>, <a href="https://www.automq.com/blog">Blog</a>, <a href="https://github.com/AutoMQ/automq">GitHub Repo</a></em></p><p><em>[3] WarpStream <a href="https://docs.warpstream.com/warpstream">Doc</a>, <a href="https://www.warpstream.com/blog">Blog</a></em></p><p><em>[4] Bufstream <a href="https://buf.build/docs/bufstream/">Doc</a></em></p>]]></content:encoded></item><item><title><![CDATA[I spent 6 hours learning how Google serves analytics applications ]]></title><description><![CDATA[10 GB/s throughput and sub-millisecond query latency]]></description><link>https://vutr.substack.com/p/i-spent-6-hours-learn-how-google</link><guid isPermaLink="false">https://vutr.substack.com/p/i-spent-6-hours-learn-how-google</guid><dc:creator><![CDATA[Vu Trinh]]></dc:creator><pubDate>Wed, 30 Apr 2025 03:15:25 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!lXHb!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9248d129-4c43-4dc0-9ffd-c30d35db1af4_2000x1428.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<blockquote><p><em>I&#8217;m making my life less dull by spending time learning and researching &#8220;how it works&#8221; in the data engineering field. </em></p><p><em>Here is a place where I share everything I&#8217;ve learned. Not subscribed yet? 
Here you go:</em></p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://vutr.substack.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe now&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://vutr.substack.com/subscribe?"><span>Subscribe now</span></a></p></blockquote><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!lXHb!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9248d129-4c43-4dc0-9ffd-c30d35db1af4_2000x1428.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!lXHb!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9248d129-4c43-4dc0-9ffd-c30d35db1af4_2000x1428.png 424w, https://substackcdn.com/image/fetch/$s_!lXHb!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9248d129-4c43-4dc0-9ffd-c30d35db1af4_2000x1428.png 848w, https://substackcdn.com/image/fetch/$s_!lXHb!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9248d129-4c43-4dc0-9ffd-c30d35db1af4_2000x1428.png 1272w, https://substackcdn.com/image/fetch/$s_!lXHb!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9248d129-4c43-4dc0-9ffd-c30d35db1af4_2000x1428.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!lXHb!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9248d129-4c43-4dc0-9ffd-c30d35db1af4_2000x1428.png" width="1456" height="1040" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/9248d129-4c43-4dc0-9ffd-c30d35db1af4_2000x1428.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1040,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:287787,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://vutr.substack.com/i/162034266?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9248d129-4c43-4dc0-9ffd-c30d35db1af4_2000x1428.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!lXHb!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9248d129-4c43-4dc0-9ffd-c30d35db1af4_2000x1428.png 424w, https://substackcdn.com/image/fetch/$s_!lXHb!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9248d129-4c43-4dc0-9ffd-c30d35db1af4_2000x1428.png 848w, https://substackcdn.com/image/fetch/$s_!lXHb!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9248d129-4c43-4dc0-9ffd-c30d35db1af4_2000x1428.png 1272w, https://substackcdn.com/image/fetch/$s_!lXHb!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9248d129-4c43-4dc0-9ffd-c30d35db1af4_2000x1428.png 1456w" sizes="100vw" fetchpriority="high"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg 
role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><div><hr></div><h2>Intro</h2><p>When it comes to &#8220;big data&#8221;, it&#8217;s hard not to mention Google.</p><p>From <a href="https://static.googleusercontent.com/media/research.google.com/en//archive/mapreduce-osdi04.pdf">MapReduce</a>, <a href="https://static.googleusercontent.com/media/research.google.com/en//archive/gfs-sosp2003.pdf">Google File System</a>, to <a href="https://static.googleusercontent.com/media/research.google.com/en//archive/bigtable-osdi06.pdf">BigTable</a>.</p><p>As the business evolves, Google must continuously innovate to adapt to more data requirements. 
They built <a href="https://static.googleusercontent.com/media/research.google.com/en//archive/spanner-osdi2012.pdf">Spanner</a> for transactional workload and <a href="https://static.googleusercontent.com/media/research.google.com/en//pubs/archive/36632.pdf">Dremel</a> for analytics workload. The latter is the core component of Google BigQuery, the service you might be more familiar with.</p><p>Google operates multiple services with more than a billion users worldwide. They extract insights from data produced by these services to provide better user experiences and improve quality.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!-wU7!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb2e764b1-b03d-4113-a541-c6e670cbd661_624x556.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!-wU7!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb2e764b1-b03d-4113-a541-c6e670cbd661_624x556.png 424w, https://substackcdn.com/image/fetch/$s_!-wU7!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb2e764b1-b03d-4113-a541-c6e670cbd661_624x556.png 848w, https://substackcdn.com/image/fetch/$s_!-wU7!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb2e764b1-b03d-4113-a541-c6e670cbd661_624x556.png 1272w, https://substackcdn.com/image/fetch/$s_!-wU7!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb2e764b1-b03d-4113-a541-c6e670cbd661_624x556.png 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!-wU7!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb2e764b1-b03d-4113-a541-c6e670cbd661_624x556.png" width="624" height="556" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/b2e764b1-b03d-4113-a541-c6e670cbd661_624x556.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:556,&quot;width&quot;:624,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:82665,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://vutr.substack.com/i/162034266?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb2e764b1-b03d-4113-a541-c6e670cbd661_624x556.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!-wU7!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb2e764b1-b03d-4113-a541-c6e670cbd661_624x556.png 424w, https://substackcdn.com/image/fetch/$s_!-wU7!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb2e764b1-b03d-4113-a541-c6e670cbd661_624x556.png 848w, https://substackcdn.com/image/fetch/$s_!-wU7!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb2e764b1-b03d-4113-a541-c6e670cbd661_624x556.png 1272w, https://substackcdn.com/image/fetch/$s_!-wU7!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb2e764b1-b03d-4113-a541-c6e670cbd661_624x556.png 1456w" 
sizes="100vw"></picture></div></a></figure></div><p>The users operating these services interact with this data through analytical interfaces to gain insights. To answer their questions, the system must process vast amounts of data and return results within strict time constraints. Some queries must even return within milliseconds.</p><p>In addition to internal analytics use, Google needs to serve user-facing analytics with demanding requirements for performance and data freshness.</p><p>This article will explore Napa, the Google data warehouse system behind these analytics use cases. 
</p><div><hr></div><h2>Background</h2><p>Before exploring Napa in detail, it would be helpful to understand the typical workloads the system is trying to serve. As mentioned, Napa doesn&#8217;t handle the same workloads you and I usually have in mind when we talk about a data warehouse system. </p><p>Systems like Snowflake or Databricks usually receive batches of historical data from other systems, although these vendors also support real-time ingestion. We typically expect a cloud data warehouse to help us extract insights by processing a large amount of data; we can tolerate results that arrive after a few minutes and rarely need millisecond-level latency. Even if we needed it, we usually couldn&#8217;t achieve it, given the characteristics of the workload: scanning massive amounts of data that must be aggregated and joined across many tables.</p><p>However, Napa typically serves a different type of workload. It is still analytical but less complicated (e.g., fewer joins), and it requires higher-throughput data ingestion, fresher data, and lower response times. 
Data usually flows to Napa from Google services with billions of users, and the end users of Napa usually require fast analytics results to operate these services properly.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!ZPWI!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F341bed4b-b0d2-41f9-bf4a-f42557c52449_736x310.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!ZPWI!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F341bed4b-b0d2-41f9-bf4a-f42557c52449_736x310.png 424w, https://substackcdn.com/image/fetch/$s_!ZPWI!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F341bed4b-b0d2-41f9-bf4a-f42557c52449_736x310.png 848w, https://substackcdn.com/image/fetch/$s_!ZPWI!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F341bed4b-b0d2-41f9-bf4a-f42557c52449_736x310.png 1272w, https://substackcdn.com/image/fetch/$s_!ZPWI!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F341bed4b-b0d2-41f9-bf4a-f42557c52449_736x310.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!ZPWI!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F341bed4b-b0d2-41f9-bf4a-f42557c52449_736x310.png" width="736" height="310" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/341bed4b-b0d2-41f9-bf4a-f42557c52449_736x310.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:310,&quot;width&quot;:736,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:66261,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://vutr.substack.com/i/162034266?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F341bed4b-b0d2-41f9-bf4a-f42557c52449_736x310.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!ZPWI!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F341bed4b-b0d2-41f9-bf4a-f42557c52449_736x310.png 424w, https://substackcdn.com/image/fetch/$s_!ZPWI!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F341bed4b-b0d2-41f9-bf4a-f42557c52449_736x310.png 848w, https://substackcdn.com/image/fetch/$s_!ZPWI!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F341bed4b-b0d2-41f9-bf4a-f42557c52449_736x310.png 1272w, https://substackcdn.com/image/fetch/$s_!ZPWI!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F341bed4b-b0d2-41f9-bf4a-f42557c52449_736x310.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 
0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>In my opinion, Napa is closer to systems like <a href="https://pinot.apache.org/">Apache Pinot</a> or <a href="https://druid.apache.org/">Apache Druid</a>, which are positioned as real-time OLAP.</p><div><hr></div><h2>Requirement</h2><p>Based on those typical workloads, it&#8217;s not a surprise that Napa must provide:</p><ul><li><p><strong>Robust Query Performance</strong>: The queries must have low latency performance and low variance in latency regardless of the query and data ingestion load.</p></li><li><p><strong>High-throughput Data Ingestion</strong>: All Napa functions must handle heavy ingestion loads. </p></li></ul><p>The third requirement is very exciting: <strong>flexibility</strong>. Google observed that Napa&#8217;s clients make a three-way tradeoff between data freshness, resource costs, and query performance. 
Some need data results to be highly fresh and are willing to pay more for that, while some clients can tolerate low query performance to save cost.  </p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!Fs9B!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe0ec769b-23e8-452e-bf87-40dd0de84b6c_412x340.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!Fs9B!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe0ec769b-23e8-452e-bf87-40dd0de84b6c_412x340.png 424w, https://substackcdn.com/image/fetch/$s_!Fs9B!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe0ec769b-23e8-452e-bf87-40dd0de84b6c_412x340.png 848w, https://substackcdn.com/image/fetch/$s_!Fs9B!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe0ec769b-23e8-452e-bf87-40dd0de84b6c_412x340.png 1272w, https://substackcdn.com/image/fetch/$s_!Fs9B!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe0ec769b-23e8-452e-bf87-40dd0de84b6c_412x340.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!Fs9B!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe0ec769b-23e8-452e-bf87-40dd0de84b6c_412x340.png" width="412" height="340" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/e0ec769b-23e8-452e-bf87-40dd0de84b6c_412x340.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:340,&quot;width&quot;:412,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:27777,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://vutr.substack.com/i/162034266?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe0ec769b-23e8-452e-bf87-40dd0de84b6c_412x340.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!Fs9B!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe0ec769b-23e8-452e-bf87-40dd0de84b6c_412x340.png 424w, https://substackcdn.com/image/fetch/$s_!Fs9B!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe0ec769b-23e8-452e-bf87-40dd0de84b6c_412x340.png 848w, https://substackcdn.com/image/fetch/$s_!Fs9B!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe0ec769b-23e8-452e-bf87-40dd0de84b6c_412x340.png 1272w, https://substackcdn.com/image/fetch/$s_!Fs9B!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe0ec769b-23e8-452e-bf87-40dd0de84b6c_412x340.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 
0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>To see how Napa can adapt to these requirements, let&#8217;s explore its architecture.</p><div><hr></div><h2>Architecture</h2><p>The Napa conceptual design consists of three main blocks: ingestion, storage, and serving. Each was designed to handle its responsibility independently, which is essential for helping Napa adapt to clients&#8217; flexible needs. 
We will revisit this point in the &#8220;Choose the trade-off&#8221; section. </p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!NZq2!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdc0242ef-c743-4ed4-811e-f196a7f8a16d_1098x344.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!NZq2!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdc0242ef-c743-4ed4-811e-f196a7f8a16d_1098x344.png 424w, https://substackcdn.com/image/fetch/$s_!NZq2!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdc0242ef-c743-4ed4-811e-f196a7f8a16d_1098x344.png 848w, https://substackcdn.com/image/fetch/$s_!NZq2!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdc0242ef-c743-4ed4-811e-f196a7f8a16d_1098x344.png 1272w, https://substackcdn.com/image/fetch/$s_!NZq2!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdc0242ef-c743-4ed4-811e-f196a7f8a16d_1098x344.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!NZq2!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdc0242ef-c743-4ed4-811e-f196a7f8a16d_1098x344.png" width="1098" height="344" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/dc0242ef-c743-4ed4-811e-f196a7f8a16d_1098x344.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:344,&quot;width&quot;:1098,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:52306,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://vutr.substack.com/i/162034266?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdc0242ef-c743-4ed4-811e-f196a7f8a16d_1098x344.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!NZq2!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdc0242ef-c743-4ed4-811e-f196a7f8a16d_1098x344.png 424w, https://substackcdn.com/image/fetch/$s_!NZq2!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdc0242ef-c743-4ed4-811e-f196a7f8a16d_1098x344.png 848w, https://substackcdn.com/image/fetch/$s_!NZq2!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdc0242ef-c743-4ed4-811e-f196a7f8a16d_1098x344.png 1272w, https://substackcdn.com/image/fetch/$s_!NZq2!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdc0242ef-c743-4ed4-811e-f196a7f8a16d_1098x344.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" 
viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>Napa leverages existing Google infrastructure components. It uses <a href="https://cloud.google.com/blog/products/storage-data-transfer/a-peek-behind-colossus-googles-file-system">Colossus</a> to store data, <a href="https://static.googleusercontent.com/media/research.google.com/en//archive/spanner-osdi2012.pdf">Spanner</a> for functions that require strict transaction semantics, such as metadata management, and <a href="https://research.google/pubs/f1-query-declarative-querying-at-scale/">F1 Query</a> for data serving. 
Napa has a control plane to coordinate work among the sub-services.</p><p>To optimize for its primary workload, Napa uses materialized views as the main technique to maximize query performance.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!QJiP!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F13cdb705-2c90-42ed-9a42-5ead7eacf9bf_856x386.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!QJiP!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F13cdb705-2c90-42ed-9a42-5ead7eacf9bf_856x386.png 424w, https://substackcdn.com/image/fetch/$s_!QJiP!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F13cdb705-2c90-42ed-9a42-5ead7eacf9bf_856x386.png 848w, https://substackcdn.com/image/fetch/$s_!QJiP!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F13cdb705-2c90-42ed-9a42-5ead7eacf9bf_856x386.png 1272w, https://substackcdn.com/image/fetch/$s_!QJiP!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F13cdb705-2c90-42ed-9a42-5ead7eacf9bf_856x386.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!QJiP!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F13cdb705-2c90-42ed-9a42-5ead7eacf9bf_856x386.png" width="856" height="386" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/13cdb705-2c90-42ed-9a42-5ead7eacf9bf_856x386.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:386,&quot;width&quot;:856,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:61920,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://vutr.substack.com/i/162034266?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F13cdb705-2c90-42ed-9a42-5ead7eacf9bf_856x386.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!QJiP!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F13cdb705-2c90-42ed-9a42-5ead7eacf9bf_856x386.png 424w, https://substackcdn.com/image/fetch/$s_!QJiP!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F13cdb705-2c90-42ed-9a42-5ead7eacf9bf_856x386.png 848w, https://substackcdn.com/image/fetch/$s_!QJiP!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F13cdb705-2c90-42ed-9a42-5ead7eacf9bf_856x386.png 1272w, https://substackcdn.com/image/fetch/$s_!QJiP!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F13cdb705-2c90-42ed-9a42-5ead7eacf9bf_856x386.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 
0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>This is different from systems like Snowflake or Databricks, which rely on the ability to efficiently prune unnecessary data (e.g., using a min-max index like the one in Parquet). Napa&#8217;s materialized views are sorted, indexed, and range-partitioned by primary key(s).</p><p>Napa implements LSM-trees for its storage engine to achieve high data ingestion throughput. </p><p>We will discuss these technical designs in more detail in the following sections.</p><div><hr></div><h2>Ingestion</h2><p>The goal of the ingestion component is straightforward: insert large volumes of data into Napa&#8217;s storage. 
It accepts data, performs some lightweight transformation, and makes the data durable (e.g., by writing it to disk or replicating it to other data centers).</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!vC1y!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7b3d1b0a-1e1a-4a53-a95b-bf1bfeed4859_640x436.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!vC1y!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7b3d1b0a-1e1a-4a53-a95b-bf1bfeed4859_640x436.png 424w, https://substackcdn.com/image/fetch/$s_!vC1y!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7b3d1b0a-1e1a-4a53-a95b-bf1bfeed4859_640x436.png 848w, https://substackcdn.com/image/fetch/$s_!vC1y!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7b3d1b0a-1e1a-4a53-a95b-bf1bfeed4859_640x436.png 1272w, https://substackcdn.com/image/fetch/$s_!vC1y!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7b3d1b0a-1e1a-4a53-a95b-bf1bfeed4859_640x436.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!vC1y!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7b3d1b0a-1e1a-4a53-a95b-bf1bfeed4859_640x436.png" width="640" height="436" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/7b3d1b0a-1e1a-4a53-a95b-bf1bfeed4859_640x436.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:436,&quot;width&quot;:640,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:45304,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://vutr.substack.com/i/162034266?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7b3d1b0a-1e1a-4a53-a95b-bf1bfeed4859_640x436.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!vC1y!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7b3d1b0a-1e1a-4a53-a95b-bf1bfeed4859_640x436.png 424w, https://substackcdn.com/image/fetch/$s_!vC1y!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7b3d1b0a-1e1a-4a53-a95b-bf1bfeed4859_640x436.png 848w, https://substackcdn.com/image/fetch/$s_!vC1y!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7b3d1b0a-1e1a-4a53-a95b-bf1bfeed4859_640x436.png 1272w, https://substackcdn.com/image/fetch/$s_!vC1y!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7b3d1b0a-1e1a-4a53-a95b-bf1bfeed4859_640x436.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 
0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>Napa allows users to increase or decrease the number of data ingestion workers.</p><div><hr></div><h2>Storage</h2><p>This block's main responsibility is to store the data, and because materialized views are used to boost query performance, it is also in charge of view maintenance.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!Nd4G!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fffd0c991-bdca-4c45-b626-62aec468d705_782x340.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" 
srcset="https://substackcdn.com/image/fetch/$s_!Nd4G!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fffd0c991-bdca-4c45-b626-62aec468d705_782x340.png 424w, https://substackcdn.com/image/fetch/$s_!Nd4G!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fffd0c991-bdca-4c45-b626-62aec468d705_782x340.png 848w, https://substackcdn.com/image/fetch/$s_!Nd4G!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fffd0c991-bdca-4c45-b626-62aec468d705_782x340.png 1272w, https://substackcdn.com/image/fetch/$s_!Nd4G!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fffd0c991-bdca-4c45-b626-62aec468d705_782x340.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!Nd4G!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fffd0c991-bdca-4c45-b626-62aec468d705_782x340.png" width="782" height="340" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/ffd0c991-bdca-4c45-b626-62aec468d705_782x340.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:340,&quot;width&quot;:782,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:52447,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://vutr.substack.com/i/162034266?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fffd0c991-bdca-4c45-b626-62aec468d705_782x340.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" 
srcset="https://substackcdn.com/image/fetch/$s_!Nd4G!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fffd0c991-bdca-4c45-b626-62aec468d705_782x340.png 424w, https://substackcdn.com/image/fetch/$s_!Nd4G!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fffd0c991-bdca-4c45-b626-62aec468d705_782x340.png 848w, https://substackcdn.com/image/fetch/$s_!Nd4G!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fffd0c991-bdca-4c45-b626-62aec468d705_782x340.png 1272w, https://substackcdn.com/image/fetch/$s_!Nd4G!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fffd0c991-bdca-4c45-b626-62aec468d705_782x340.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" 
stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>Google employs two important technical designs for storage. First, Napa uses <strong>two file formats</strong>: the write-optimized (WO) format serves high-throughput data writing, and the read-optimized (RO) format is designed for efficient data reading.</p><p>Second, Napa uses the <strong>LSM-tree</strong> (Log-Structured Merge-Tree) paradigm to handle table and view maintenance.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!mDUe!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1f67c887-3dd9-44a1-b44e-0f94e315011c_936x592.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!mDUe!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1f67c887-3dd9-44a1-b44e-0f94e315011c_936x592.png 424w, https://substackcdn.com/image/fetch/$s_!mDUe!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1f67c887-3dd9-44a1-b44e-0f94e315011c_936x592.png 848w, https://substackcdn.com/image/fetch/$s_!mDUe!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1f67c887-3dd9-44a1-b44e-0f94e315011c_936x592.png 1272w, 
https://substackcdn.com/image/fetch/$s_!mDUe!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1f67c887-3dd9-44a1-b44e-0f94e315011c_936x592.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!mDUe!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1f67c887-3dd9-44a1-b44e-0f94e315011c_936x592.png" width="936" height="592" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/1f67c887-3dd9-44a1-b44e-0f94e315011c_936x592.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:592,&quot;width&quot;:936,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:72137,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://vutr.substack.com/i/162034266?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1f67c887-3dd9-44a1-b44e-0f94e315011c_936x592.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!mDUe!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1f67c887-3dd9-44a1-b44e-0f94e315011c_936x592.png 424w, https://substackcdn.com/image/fetch/$s_!mDUe!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1f67c887-3dd9-44a1-b44e-0f94e315011c_936x592.png 848w, https://substackcdn.com/image/fetch/$s_!mDUe!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1f67c887-3dd9-44a1-b44e-0f94e315011c_936x592.png 1272w, 
https://substackcdn.com/image/fetch/$s_!mDUe!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1f67c887-3dd9-44a1-b44e-0f94e315011c_936x592.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>LSM-trees employ an <em>append-only</em> strategy optimized for write throughput; writes are buffered in an in-memory component (memtable) and periodically flushed to disk as immutable, sorted segments (SSTables). 
The design relies on a background <em>compaction</em> process to merge these segments, reconcile updates and deletes, and maintain sorted order across levels. This converts random writes into sequential disk I/O, at the cost of potentially higher read amplification (a read may need to check multiple segments) and write amplification (data is rewritten during compaction).</p><p>The other paradigm you might be more familiar with is the B-tree. B-trees are mutable, page-oriented structures that perform updates in place: a write (insert, update, or delete) navigates the tree to find the specific disk page containing the relevant key range and modifies it directly. This often incurs random I/O and can trigger page splits or merges to keep the tree balanced.</p><blockquote><p><em>I will not dive deep into LSM-trees or B-trees here because they deserve dedicated articles. See you in my future articles for these topics.</em></p></blockquote><p>Back to Napa: data is first ingested in the WO format. These files might not be made available for reading immediately because querying them can hurt query performance. Later, the data in the WO format is converted into the RO format. The RO format shares some characteristics with the PAX layout, which is also found in Parquet and in the file formats of BigQuery and Snowflake. Both the WO-to-RO conversion and the merging of multiple small RO files are implemented via the LSM-tree compaction process.</p><p>The LSM-tree relies heavily on the compaction process, which improves query performance and reduces storage consumption by:</p><ul><li><p>Sorting records based on key(s) to allow binary search.</p></li><li><p>Aggregating multiple updates to the same row so that a read does not have to jump around to find all of them. Because deletes and updates are treated as inserts in an LSM-tree, there is a high chance that updates for a single record are scattered across multiple files. 
This differs from a B-tree, where data is updated in place, so this kind of fragmentation is not a concern as it is in an LSM-tree.</p></li></ul><p>Because data is sorted in each segment, compaction is essentially a merge sort. Napa applies the compaction process to both table and materialized view updates. Note that updates for a record can also be aggregated at serving time if the LSM compaction process has not yet handled them before the query engine reads the data. </p><div><hr></div><h2>Serving</h2><p>For many Napa clients, obtaining query results within milliseconds is critical for their business use cases. Google has employed many techniques for Napa to achieve this:</p><ul><li><p>As mentioned, Napa aggressively leverages materialized views. Whenever possible, Napa answers a query from views instead of the base tables.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!cdZB!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2d6b1d6d-fd80-482d-bfe2-c20f4947c3f7_732x370.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!cdZB!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2d6b1d6d-fd80-482d-bfe2-c20f4947c3f7_732x370.png 424w, https://substackcdn.com/image/fetch/$s_!cdZB!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2d6b1d6d-fd80-482d-bfe2-c20f4947c3f7_732x370.png 848w, https://substackcdn.com/image/fetch/$s_!cdZB!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2d6b1d6d-fd80-482d-bfe2-c20f4947c3f7_732x370.png 1272w, 
https://substackcdn.com/image/fetch/$s_!cdZB!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2d6b1d6d-fd80-482d-bfe2-c20f4947c3f7_732x370.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!cdZB!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2d6b1d6d-fd80-482d-bfe2-c20f4947c3f7_732x370.png" width="732" height="370" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/2d6b1d6d-fd80-482d-bfe2-c20f4947c3f7_732x370.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:370,&quot;width&quot;:732,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:62201,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://vutr.substack.com/i/162034266?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2d6b1d6d-fd80-482d-bfe2-c20f4947c3f7_732x370.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!cdZB!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2d6b1d6d-fd80-482d-bfe2-c20f4947c3f7_732x370.png 424w, https://substackcdn.com/image/fetch/$s_!cdZB!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2d6b1d6d-fd80-482d-bfe2-c20f4947c3f7_732x370.png 848w, https://substackcdn.com/image/fetch/$s_!cdZB!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2d6b1d6d-fd80-482d-bfe2-c20f4947c3f7_732x370.png 1272w, 
https://substackcdn.com/image/fetch/$s_!cdZB!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2d6b1d6d-fd80-482d-bfe2-c20f4947c3f7_732x370.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div></li><li><p>Filters are pushed down to the storage layer to minimize the data transferred via the network.</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" 
href="https://substackcdn.com/image/fetch/$s_!Aoip!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F28d87d8d-7cb1-4800-8a51-5312c8be1c0d_508x194.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!Aoip!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F28d87d8d-7cb1-4800-8a51-5312c8be1c0d_508x194.png 424w, https://substackcdn.com/image/fetch/$s_!Aoip!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F28d87d8d-7cb1-4800-8a51-5312c8be1c0d_508x194.png 848w, https://substackcdn.com/image/fetch/$s_!Aoip!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F28d87d8d-7cb1-4800-8a51-5312c8be1c0d_508x194.png 1272w, https://substackcdn.com/image/fetch/$s_!Aoip!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F28d87d8d-7cb1-4800-8a51-5312c8be1c0d_508x194.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!Aoip!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F28d87d8d-7cb1-4800-8a51-5312c8be1c0d_508x194.png" width="508" height="194" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/28d87d8d-7cb1-4800-8a51-5312c8be1c0d_508x194.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:194,&quot;width&quot;:508,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:21252,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://vutr.substack.com/i/162034266?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F28d87d8d-7cb1-4800-8a51-5312c8be1c0d_508x194.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!Aoip!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F28d87d8d-7cb1-4800-8a51-5312c8be1c0d_508x194.png 424w, https://substackcdn.com/image/fetch/$s_!Aoip!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F28d87d8d-7cb1-4800-8a51-5312c8be1c0d_508x194.png 848w, https://substackcdn.com/image/fetch/$s_!Aoip!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F28d87d8d-7cb1-4800-8a51-5312c8be1c0d_508x194.png 1272w, https://substackcdn.com/image/fetch/$s_!Aoip!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F28d87d8d-7cb1-4800-8a51-5312c8be1c0d_508x194.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a></figure></div></li><li><p>Napa also relies on parallelism to reduce the data each subquery has to read. Each segment in the LSM-tree keeps a local index. 
Napa uses this index to partition an input query into thousands of subqueries that satisfy the filters.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!Gm4P!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F630b07fb-da09-4e68-8f47-924eead04375_774x394.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!Gm4P!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F630b07fb-da09-4e68-8f47-924eead04375_774x394.png 424w, https://substackcdn.com/image/fetch/$s_!Gm4P!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F630b07fb-da09-4e68-8f47-924eead04375_774x394.png 848w, https://substackcdn.com/image/fetch/$s_!Gm4P!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F630b07fb-da09-4e68-8f47-924eead04375_774x394.png 1272w, https://substackcdn.com/image/fetch/$s_!Gm4P!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F630b07fb-da09-4e68-8f47-924eead04375_774x394.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!Gm4P!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F630b07fb-da09-4e68-8f47-924eead04375_774x394.png" width="774" height="394" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/630b07fb-da09-4e68-8f47-924eead04375_774x394.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:394,&quot;width&quot;:774,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:50815,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://vutr.substack.com/i/162034266?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F630b07fb-da09-4e68-8f47-924eead04375_774x394.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!Gm4P!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F630b07fb-da09-4e68-8f47-924eead04375_774x394.png 424w, https://substackcdn.com/image/fetch/$s_!Gm4P!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F630b07fb-da09-4e68-8f47-924eead04375_774x394.png 848w, https://substackcdn.com/image/fetch/$s_!Gm4P!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F630b07fb-da09-4e68-8f47-924eead04375_774x394.png 1272w, https://substackcdn.com/image/fetch/$s_!Gm4P!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F630b07fb-da09-4e68-8f47-924eead04375_774x394.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 
0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div></li></ul><ul><li><p>Napa maintains two cache layers to limit disk accesses: the first is the local RAM of the workers who process the query, and the second is the distributed caching layer, which can be shared between workers.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!3up3!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4cd96585-8a89-4d09-8dec-62cc563b8ddc_484x456.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" 
srcset="https://substackcdn.com/image/fetch/$s_!3up3!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4cd96585-8a89-4d09-8dec-62cc563b8ddc_484x456.png 424w, https://substackcdn.com/image/fetch/$s_!3up3!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4cd96585-8a89-4d09-8dec-62cc563b8ddc_484x456.png 848w, https://substackcdn.com/image/fetch/$s_!3up3!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4cd96585-8a89-4d09-8dec-62cc563b8ddc_484x456.png 1272w, https://substackcdn.com/image/fetch/$s_!3up3!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4cd96585-8a89-4d09-8dec-62cc563b8ddc_484x456.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!3up3!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4cd96585-8a89-4d09-8dec-62cc563b8ddc_484x456.png" width="484" height="456" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/4cd96585-8a89-4d09-8dec-62cc563b8ddc_484x456.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:456,&quot;width&quot;:484,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:51663,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://vutr.substack.com/i/162034266?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4cd96585-8a89-4d09-8dec-62cc563b8ddc_484x456.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" 
srcset="https://substackcdn.com/image/fetch/$s_!3up3!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4cd96585-8a89-4d09-8dec-62cc563b8ddc_484x456.png 424w, https://substackcdn.com/image/fetch/$s_!3up3!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4cd96585-8a89-4d09-8dec-62cc563b8ddc_484x456.png 848w, https://substackcdn.com/image/fetch/$s_!3up3!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4cd96585-8a89-4d09-8dec-62cc563b8ddc_484x456.png 1272w, https://substackcdn.com/image/fetch/$s_!3up3!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4cd96585-8a89-4d09-8dec-62cc563b8ddc_484x456.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" 
stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div></li><li><p>Napa also prefetches data to reduce the number of disk accesses further.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!4fpH!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2bf9849d-153d-496e-98e6-b94bc40f6887_488x328.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!4fpH!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2bf9849d-153d-496e-98e6-b94bc40f6887_488x328.png 424w, https://substackcdn.com/image/fetch/$s_!4fpH!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2bf9849d-153d-496e-98e6-b94bc40f6887_488x328.png 848w, https://substackcdn.com/image/fetch/$s_!4fpH!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2bf9849d-153d-496e-98e6-b94bc40f6887_488x328.png 1272w, https://substackcdn.com/image/fetch/$s_!4fpH!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2bf9849d-153d-496e-98e6-b94bc40f6887_488x328.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!4fpH!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2bf9849d-153d-496e-98e6-b94bc40f6887_488x328.png" 
width="488" height="328" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/2bf9849d-153d-496e-98e6-b94bc40f6887_488x328.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:328,&quot;width&quot;:488,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:55279,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://vutr.substack.com/i/162034266?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2bf9849d-153d-496e-98e6-b94bc40f6887_488x328.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!4fpH!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2bf9849d-153d-496e-98e6-b94bc40f6887_488x328.png 424w, https://substackcdn.com/image/fetch/$s_!4fpH!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2bf9849d-153d-496e-98e6-b94bc40f6887_488x328.png 848w, https://substackcdn.com/image/fetch/$s_!4fpH!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2bf9849d-153d-496e-98e6-b94bc40f6887_488x328.png 1272w, https://substackcdn.com/image/fetch/$s_!4fpH!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2bf9849d-153d-496e-98e6-b94bc40f6887_488x328.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" 
width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div></li><li><p>The system combines small I/Os as much as possible to improve efficiency.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!GW7m!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa3e8dab6-40d7-4d4f-b99b-70cd0e6eb4bd_488x296.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!GW7m!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa3e8dab6-40d7-4d4f-b99b-70cd0e6eb4bd_488x296.png 424w, 
https://substackcdn.com/image/fetch/$s_!GW7m!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa3e8dab6-40d7-4d4f-b99b-70cd0e6eb4bd_488x296.png 848w, https://substackcdn.com/image/fetch/$s_!GW7m!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa3e8dab6-40d7-4d4f-b99b-70cd0e6eb4bd_488x296.png 1272w, https://substackcdn.com/image/fetch/$s_!GW7m!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa3e8dab6-40d7-4d4f-b99b-70cd0e6eb4bd_488x296.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!GW7m!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa3e8dab6-40d7-4d4f-b99b-70cd0e6eb4bd_488x296.png" width="488" height="296" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/a3e8dab6-40d7-4d4f-b99b-70cd0e6eb4bd_488x296.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:296,&quot;width&quot;:488,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:18676,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://vutr.substack.com/i/162034266?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa3e8dab6-40d7-4d4f-b99b-70cd0e6eb4bd_488x296.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!GW7m!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa3e8dab6-40d7-4d4f-b99b-70cd0e6eb4bd_488x296.png 424w, 
https://substackcdn.com/image/fetch/$s_!GW7m!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa3e8dab6-40d7-4d4f-b99b-70cd0e6eb4bd_488x296.png 848w, https://substackcdn.com/image/fetch/$s_!GW7m!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa3e8dab6-40d7-4d4f-b99b-70cd0e6eb4bd_488x296.png 1272w, https://substackcdn.com/image/fetch/$s_!GW7m!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa3e8dab6-40d7-4d4f-b99b-70cd0e6eb4bd_488x296.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line 
x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div></li></ul><div><hr></div><h2>Choose the trade-off</h2><p>As mentioned, Napa lets users sacrifice one of performance, freshness, or cost to achieve the other two. Let's first understand how Napa achieves each dimension:</p><ul><li><p><strong>High Freshness</strong>: As mentioned above, data written in the WO format is not available to queries immediately. If users need fresher data, Napa must speed up the conversion and compaction of newly written data so it can be served sooner. Sometimes a query needs files that the background compaction process should already have merged but has not (e.g., because compaction was given fewer resources); in that case, the merge must run at query time to meet the freshness requirement.</p></li><li><p><strong>High Performance</strong>: Napa maintains more views to speed up queries. More views mean more workers are needed to update them. Another requirement is to optimize the data's physical layout; too many small files can harm performance. Query performance also degrades when file merging has to happen at query time rather than in the background.</p></li><li><p><strong>Low Cost</strong>: The system's resources go roughly to three workloads: compaction, view maintenance, and data ingestion. The fewer resources these workloads use, the lower the cost.</p></li></ul><p>Given Napa's decoupled architecture, tuning it to optimize two dimensions while sacrificing the third is straightforward:</p><ul><li><p>If the client wants to <strong>sacrifice data freshness</strong> for <strong>moderate</strong> <strong>query performance and cost</strong>, Napa can maintain a moderate number of views and keep fewer files to merge at query execution time. 
It uses fewer workers and cheaper resources (like spot VM instances on AWS or GCP) for view maintenance to keep costs low. View maintenance therefore runs more slowly, so the data is less fresh. However, clients still get fairly good query performance while keeping resource costs down.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!IdPI!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F87e437a1-621e-4750-8304-713ed60c4fd2_1060x310.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!IdPI!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F87e437a1-621e-4750-8304-713ed60c4fd2_1060x310.png 424w, https://substackcdn.com/image/fetch/$s_!IdPI!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F87e437a1-621e-4750-8304-713ed60c4fd2_1060x310.png 848w, https://substackcdn.com/image/fetch/$s_!IdPI!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F87e437a1-621e-4750-8304-713ed60c4fd2_1060x310.png 1272w, https://substackcdn.com/image/fetch/$s_!IdPI!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F87e437a1-621e-4750-8304-713ed60c4fd2_1060x310.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!IdPI!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F87e437a1-621e-4750-8304-713ed60c4fd2_1060x310.png" width="1060" height="310" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/87e437a1-621e-4750-8304-713ed60c4fd2_1060x310.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:310,&quot;width&quot;:1060,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:47780,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://vutr.substack.com/i/162034266?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F87e437a1-621e-4750-8304-713ed60c4fd2_1060x310.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!IdPI!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F87e437a1-621e-4750-8304-713ed60c4fd2_1060x310.png 424w, https://substackcdn.com/image/fetch/$s_!IdPI!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F87e437a1-621e-4750-8304-713ed60c4fd2_1060x310.png 848w, https://substackcdn.com/image/fetch/$s_!IdPI!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F87e437a1-621e-4750-8304-713ed60c4fd2_1060x310.png 1272w, https://substackcdn.com/image/fetch/$s_!IdPI!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F87e437a1-621e-4750-8304-713ed60c4fd2_1060x310.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" 
viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div></li><li><p>If the client wants to <strong>sacrifice query performance</strong> for <strong>high data freshness and low cost</strong>, Napa can maintain fewer materialized views but allow more files to be merged during the query runtime. Napa coordinates more workers for the ingestion process because the view maintenance effort is low. 
As a result, clients get fresher data at the cost of query performance: the query engine has fewer materialized views to draw on and must merge more files at execution time.</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!Vz_k!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3d98712a-7cad-4256-a107-a18e7e864258_678x226.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!Vz_k!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3d98712a-7cad-4256-a107-a18e7e864258_678x226.png 424w, https://substackcdn.com/image/fetch/$s_!Vz_k!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3d98712a-7cad-4256-a107-a18e7e864258_678x226.png 848w, https://substackcdn.com/image/fetch/$s_!Vz_k!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3d98712a-7cad-4256-a107-a18e7e864258_678x226.png 1272w, https://substackcdn.com/image/fetch/$s_!Vz_k!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3d98712a-7cad-4256-a107-a18e7e864258_678x226.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!Vz_k!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3d98712a-7cad-4256-a107-a18e7e864258_678x226.png" width="678" height="226" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/3d98712a-7cad-4256-a107-a18e7e864258_678x226.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:226,&quot;width&quot;:678,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:45581,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://vutr.substack.com/i/162034266?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3d98712a-7cad-4256-a107-a18e7e864258_678x226.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!Vz_k!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3d98712a-7cad-4256-a107-a18e7e864258_678x226.png 424w, https://substackcdn.com/image/fetch/$s_!Vz_k!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3d98712a-7cad-4256-a107-a18e7e864258_678x226.png 848w, https://substackcdn.com/image/fetch/$s_!Vz_k!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3d98712a-7cad-4256-a107-a18e7e864258_678x226.png 1272w, https://substackcdn.com/image/fetch/$s_!Vz_k!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3d98712a-7cad-4256-a107-a18e7e864258_678x226.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a></figure></div></li><li><p>If the client can tolerate <strong>high resource cost</strong>s for both <strong>good query performance</strong> and <strong>high data freshness</strong>, Napa can direct more workers to 
ingestion, data compaction, and the view maintenance process. Data is ingested with higher throughput, data in the LSM-tree is merged faster, and the materialized views are updated more frequently, so clients get the desired query performance and data freshness. In return, the resource cost is relatively high.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!7Qs4!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2474cefc-09ac-45cf-8303-95e77dc690ab_838x366.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!7Qs4!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2474cefc-09ac-45cf-8303-95e77dc690ab_838x366.png 424w, https://substackcdn.com/image/fetch/$s_!7Qs4!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2474cefc-09ac-45cf-8303-95e77dc690ab_838x366.png 848w, https://substackcdn.com/image/fetch/$s_!7Qs4!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2474cefc-09ac-45cf-8303-95e77dc690ab_838x366.png 1272w, https://substackcdn.com/image/fetch/$s_!7Qs4!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2474cefc-09ac-45cf-8303-95e77dc690ab_838x366.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!7Qs4!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2474cefc-09ac-45cf-8303-95e77dc690ab_838x366.png" width="838" height="366" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/2474cefc-09ac-45cf-8303-95e77dc690ab_838x366.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:366,&quot;width&quot;:838,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:57119,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://vutr.substack.com/i/162034266?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2474cefc-09ac-45cf-8303-95e77dc690ab_838x366.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!7Qs4!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2474cefc-09ac-45cf-8303-95e77dc690ab_838x366.png 424w, https://substackcdn.com/image/fetch/$s_!7Qs4!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2474cefc-09ac-45cf-8303-95e77dc690ab_838x366.png 848w, https://substackcdn.com/image/fetch/$s_!7Qs4!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2474cefc-09ac-45cf-8303-95e77dc690ab_838x366.png 1272w, https://substackcdn.com/image/fetch/$s_!7Qs4!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2474cefc-09ac-45cf-8303-95e77dc690ab_838x366.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 
0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div></li></ul><div><hr></div><h2>Outro</h2><p>Thank you for reading this far.</p><p>In this article, we learned the motivation behind Napa and the requirements it must meet. We then explored the system's architecture and the design decisions Google made to make Napa a highly scalable, high-throughput, robust, and extremely flexible system.</p><p>It&#8217;s not hard to spot characteristics that Napa shares with other analytics systems:</p><ul><li><p><a href="https://open.substack.com/pub/vutr/p/how-does-vortex-the-bigquery-storage?r=2rj6sg&amp;utm_campaign=post&amp;utm_medium=web&amp;showWelcomeOnShare=false">Vortex, the BigQuery storage engine</a>, also has two file formats, one serving writes and one serving reads. 
It also uses an LSM-tree to manage table data, as Napa does.</p></li><li><p>ClickHouse uses an <a href="https://open.substack.com/pub/vutr/p/i-spent-8-hours-learning-the-clickhouse?r=2rj6sg&amp;utm_campaign=post&amp;utm_medium=web&amp;showWelcomeOnShare=false">LSM-tree-like mechanism</a> to achieve high write throughput in its MergeTree engine.</p></li><li><p><a href="https://open.substack.com/pub/vutr/p/i-spent-5-hours-exploring-the-story?r=2rj6sg&amp;utm_campaign=post&amp;utm_medium=web&amp;showWelcomeOnShare=false">Apache Hudi</a> also has two file formats.</p></li><li><p>A newer table format, <a href="https://paimon.apache.org/docs/master/primary-key-table/overview/#lsm-trees">Apache Paimon</a>, adopts an LSM-tree for file storage.</p></li><li><p>RisingWave has its own LSM storage engine, <a href="https://risingwave.com/blog/hummock-a-storage-engine-designed-for-stream-processing/">Hummock</a>.</p></li><li><p>&#8230;</p></li></ul><p>All these systems are designed for high-throughput writes and low-latency queries. 
Vortex is a special case: Google built it for BigQuery after years of operation, when they wanted to offer real-time analytics to users and concluded that the legacy batch storage engine couldn&#8217;t provide what they needed.</p><p>I think we will see more systems like this in the near future, given that real-time analytics is getting more and more attention.</p><p>What do you think about this observation?</p><p>&#8212; </p><p>Now, see you in my next articles!</p><div><hr></div><h2>References</h2><p><em>[1] Google, <a href="https://research.google/pubs/napa-powering-scalable-data-warehousing-with-robust-query-performance-at-google/">Napa: Powering Scalable Data Warehousing with Robust Query Performance at Google</a> (2021)</em></p><p><em>[2] Jagan Sankaranarayanan, <a href="https://www.youtube.com/watch?v=dtWwUWB5JyQ">Google Napa: Scalable Data Warehousing with Robust Query Performance</a> (2021)</em></p>]]></content:encoded></item><item><title><![CDATA[Let's use Orchestra to build an end-to-end data pipeline in 10 minutes]]></title><description><![CDATA[Spoiler: You don't have to manage the infrastructure.]]></description><link>https://vutr.substack.com/p/lets-use-orchestra-to-build-an-end</link><guid isPermaLink="false">https://vutr.substack.com/p/lets-use-orchestra-to-build-an-end</guid><dc:creator><![CDATA[Vu Trinh]]></dc:creator><pubDate>Thu, 24 Apr 2025 03:15:17 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!kcj-!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F403b657a-3166-4829-9e9f-c3caf179f3ee_2000x1428.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p></p><blockquote><p><em>I&#8217;m making my life less dull by spending time learning and researching &#8220;how it works&#8221; in the data engineering field. </em></p><p><em>Here is a place where I share everything I&#8217;ve learned. Not subscribed yet? 
Here you go:</em></p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://vutr.substack.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe now&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://vutr.substack.com/subscribe?"><span>Subscribe now</span></a></p></blockquote><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!kcj-!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F403b657a-3166-4829-9e9f-c3caf179f3ee_2000x1428.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!kcj-!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F403b657a-3166-4829-9e9f-c3caf179f3ee_2000x1428.png 424w, https://substackcdn.com/image/fetch/$s_!kcj-!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F403b657a-3166-4829-9e9f-c3caf179f3ee_2000x1428.png 848w, https://substackcdn.com/image/fetch/$s_!kcj-!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F403b657a-3166-4829-9e9f-c3caf179f3ee_2000x1428.png 1272w, https://substackcdn.com/image/fetch/$s_!kcj-!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F403b657a-3166-4829-9e9f-c3caf179f3ee_2000x1428.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!kcj-!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F403b657a-3166-4829-9e9f-c3caf179f3ee_2000x1428.png" width="1456" height="1040" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/403b657a-3166-4829-9e9f-c3caf179f3ee_2000x1428.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1040,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:595505,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://vutr.substack.com/i/159654666?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F403b657a-3166-4829-9e9f-c3caf179f3ee_2000x1428.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!kcj-!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F403b657a-3166-4829-9e9f-c3caf179f3ee_2000x1428.png 424w, https://substackcdn.com/image/fetch/$s_!kcj-!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F403b657a-3166-4829-9e9f-c3caf179f3ee_2000x1428.png 848w, https://substackcdn.com/image/fetch/$s_!kcj-!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F403b657a-3166-4829-9e9f-c3caf179f3ee_2000x1428.png 1272w, https://substackcdn.com/image/fetch/$s_!kcj-!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F403b657a-3166-4829-9e9f-c3caf179f3ee_2000x1428.png 1456w" sizes="100vw" fetchpriority="high"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" 
height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><div><hr></div><h2>Intro</h2><p>We&#8217;re living in a time when it&#8217;s getting easier for data practitioners to build data pipelines. Cloud data warehouses are getting more and more powerful. The introduction of dbt streamlines data transformation using SQL.</p><p>However, that does not mean the above pattern has no challenges. We must set up the traditional orchestrator environments and determine how to schedule dbt tasks. 
If you use the free version of dbt, you must write a custom operator yourself, as only the dbt Cloud operator is supported.</p><p>None of this is easy.</p><p>Recognizing these hassles, <a href="https://getorchestra.io/">Orchestra</a>, a complete Data and AI workflow solution, offers a more efficient way to operate an end-to-end data pipeline.</p><div><hr></div><h2>Motivation</h2><p>The idea of Orchestra is simple:</p><p>Give everyone the power to build and manage Data and AI workflows, even if they have little engineering experience.</p><p>Orchestra aims to democratize the ability to build, deploy, and monitor pipelines, which used to be the responsibility of highly skilled data engineers.</p><p>With Orchestra, we only need to log in to the platform and configure connections to external systems such as dbt or a cloud data warehouse; then we can start building our first data pipeline.</p><p>It&#8217;s an efficient, declarative framework for defining DAGs, with the option to use Python/dbt. Orchestra abstracts away the complexity for users and exposes a modern UI for all operations.</p><div><hr></div><h2>Let&#8217;s build a data pipeline</h2><p>This section will walk you through the step-by-step process of building a data pipeline on Orchestra. 
We will also explore Orchestra concepts and features.</p><h3>A brief on the data pipeline</h3><blockquote><p><em>You can find all the related code in this <a href="https://github.com/vutrinh274/dbt_example">repo</a>.</em> </p></blockquote><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!PIGf!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F89b694f7-2fcb-450a-bb34-716ae8c9cbe5_676x374.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!PIGf!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F89b694f7-2fcb-450a-bb34-716ae8c9cbe5_676x374.png 424w, https://substackcdn.com/image/fetch/$s_!PIGf!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F89b694f7-2fcb-450a-bb34-716ae8c9cbe5_676x374.png 848w, https://substackcdn.com/image/fetch/$s_!PIGf!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F89b694f7-2fcb-450a-bb34-716ae8c9cbe5_676x374.png 1272w, https://substackcdn.com/image/fetch/$s_!PIGf!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F89b694f7-2fcb-450a-bb34-716ae8c9cbe5_676x374.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!PIGf!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F89b694f7-2fcb-450a-bb34-716ae8c9cbe5_676x374.png" width="676" height="374" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/89b694f7-2fcb-450a-bb34-716ae8c9cbe5_676x374.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:374,&quot;width&quot;:676,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:82309,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://vutr.substack.com/i/159654666?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F89b694f7-2fcb-450a-bb34-716ae8c9cbe5_676x374.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!PIGf!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F89b694f7-2fcb-450a-bb34-716ae8c9cbe5_676x374.png 424w, https://substackcdn.com/image/fetch/$s_!PIGf!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F89b694f7-2fcb-450a-bb34-716ae8c9cbe5_676x374.png 848w, https://substackcdn.com/image/fetch/$s_!PIGf!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F89b694f7-2fcb-450a-bb34-716ae8c9cbe5_676x374.png 1272w, https://substackcdn.com/image/fetch/$s_!PIGf!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F89b694f7-2fcb-450a-bb34-716ae8c9cbe5_676x374.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 
0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption"></figcaption></figure></div><p>To follow along with this project, you need an <a href="https://app.getorchestra.io/signup">Orchestra account</a>, a <a href="https://signup.snowflake.com/">Snowflake (trial) account</a>, and an S3 bucket.</p><p>We will use a Python script to upload CSV files to S3.</p><blockquote><p><em>We use five AdventureWorks sample datasets: product, product_category, product_subcategory, sale, and territories. You can find these files in the repo.</em></p></blockquote><p>Then, we load this data into Snowflake and set up a <a href="https://docs.getdbt.com/docs/core/connect-data-platform/snowflake-setup">dbt-snowflake</a> project for the transformation.</p><p>All the tasks will be scheduled using Orchestra. 
You can check the Python script <a href="https://github.com/vutrinh274/dbt_example/blob/main/python/upload_to_s3.py">here</a> and the dbt-snowflake project <a href="https://github.com/vutrinh274/dbt_example/tree/main/dbt_example">here</a>.</p><p>In this article, I won&#8217;t dive deep into how to set up your Snowflake warehouse or the dbt project. If you want to learn, here are some good resources to get started:</p><ul><li><p><a href="https://docs.snowflake.com/en/user-guide-getting-started">Snowflake quick start</a></p></li><li><p><a href="https://docs.getdbt.com/guides/manual-install?step=4">dbt Core quick start</a></p></li><li><p><a href="https://docs.getdbt.com/docs/core/connect-data-platform/snowflake-setup">dbt-snowflake setup</a></p></li></ul><h3>Set up the integrations</h3><p>As with a traditional orchestrator such as Airflow, one of the first things to do before building a data pipeline is to set up the connections to external systems. In Airflow, you would add some connections and maybe write some custom <a href="https://airflow.apache.org/docs/apache-airflow/stable/authoring-and-scheduling/connections.html">hooks</a>, which can be time-consuming.</p><p>Orchestra provides managed integrations that take care of auth, error handling, triggering, polling, and metadata gathering out of the box. We set up &#8220;<a href="https://docs.getorchestra.io/docs/core-concepts/integrations">integrations</a>,&#8221; which are connections to external systems. 
They support <a href="https://www.getorchestra.io/integrations">a wide range</a> of integrations:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!nqCr!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F356ca6a7-e3cf-46b9-bd48-0427567d8215_460x300.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!nqCr!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F356ca6a7-e3cf-46b9-bd48-0427567d8215_460x300.png 424w, https://substackcdn.com/image/fetch/$s_!nqCr!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F356ca6a7-e3cf-46b9-bd48-0427567d8215_460x300.png 848w, https://substackcdn.com/image/fetch/$s_!nqCr!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F356ca6a7-e3cf-46b9-bd48-0427567d8215_460x300.png 1272w, https://substackcdn.com/image/fetch/$s_!nqCr!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F356ca6a7-e3cf-46b9-bd48-0427567d8215_460x300.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!nqCr!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F356ca6a7-e3cf-46b9-bd48-0427567d8215_460x300.png" width="460" height="300" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/356ca6a7-e3cf-46b9-bd48-0427567d8215_460x300.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:300,&quot;width&quot;:460,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:36412,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://vutr.substack.com/i/159654666?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F356ca6a7-e3cf-46b9-bd48-0427567d8215_460x300.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!nqCr!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F356ca6a7-e3cf-46b9-bd48-0427567d8215_460x300.png 424w, https://substackcdn.com/image/fetch/$s_!nqCr!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F356ca6a7-e3cf-46b9-bd48-0427567d8215_460x300.png 848w, https://substackcdn.com/image/fetch/$s_!nqCr!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F356ca6a7-e3cf-46b9-bd48-0427567d8215_460x300.png 1272w, https://substackcdn.com/image/fetch/$s_!nqCr!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F356ca6a7-e3cf-46b9-bd48-0427567d8215_460x300.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 
0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption"></figcaption></figure></div><ul><li><p>ETL tools: Airbyte, Fivetran</p></li><li><p>Data warehouses: Databricks, Snowflake, or BigQuery</p></li><li><p>Cloud services: common tools in AWS, Azure, and GCP</p></li><li><p>BI tools: Power BI, Tableau, Sigma, Lightdash, etc.</p></li><li><p>Transformation: dbt Core, dbt Cloud, or Coalesce</p></li><li><p>Utility functions: Python, HTTP, etc.</p></li></ul><p>To create a new integration, we click <strong>&#8220;Integrations&#8221;</strong> in the sidebar and then choose the integration we need from the UI.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" 
href="https://substackcdn.com/image/fetch/$s_!DIhB!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8061791b-9b00-416a-ad70-7fc18adcad3b_1454x404.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!DIhB!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8061791b-9b00-416a-ad70-7fc18adcad3b_1454x404.png 424w, https://substackcdn.com/image/fetch/$s_!DIhB!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8061791b-9b00-416a-ad70-7fc18adcad3b_1454x404.png 848w, https://substackcdn.com/image/fetch/$s_!DIhB!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8061791b-9b00-416a-ad70-7fc18adcad3b_1454x404.png 1272w, https://substackcdn.com/image/fetch/$s_!DIhB!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8061791b-9b00-416a-ad70-7fc18adcad3b_1454x404.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!DIhB!,w_2400,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8061791b-9b00-416a-ad70-7fc18adcad3b_1454x404.png" width="972" height="270.0742778541953" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/8061791b-9b00-416a-ad70-7fc18adcad3b_1454x404.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:false,&quot;imageSize&quot;:&quot;large&quot;,&quot;height&quot;:404,&quot;width&quot;:1454,&quot;resizeWidth&quot;:972,&quot;bytes&quot;:145327,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://vutr.substack.com/i/159654666?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8061791b-9b00-416a-ad70-7fc18adcad3b_1454x404.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:&quot;center&quot;,&quot;offset&quot;:false}" class="sizing-large" alt="" srcset="https://substackcdn.com/image/fetch/$s_!DIhB!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8061791b-9b00-416a-ad70-7fc18adcad3b_1454x404.png 424w, https://substackcdn.com/image/fetch/$s_!DIhB!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8061791b-9b00-416a-ad70-7fc18adcad3b_1454x404.png 848w, https://substackcdn.com/image/fetch/$s_!DIhB!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8061791b-9b00-416a-ad70-7fc18adcad3b_1454x404.png 1272w, https://substackcdn.com/image/fetch/$s_!DIhB!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8061791b-9b00-416a-ad70-7fc18adcad3b_1454x404.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" 
width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>For this project, we need to define three integrations: Snowflake, dbt, and Python.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!Tud2!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbfd66b31-21ec-476b-9a5b-c90ab68c3fab_746x626.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!Tud2!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbfd66b31-21ec-476b-9a5b-c90ab68c3fab_746x626.png 424w, 
https://substackcdn.com/image/fetch/$s_!Tud2!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbfd66b31-21ec-476b-9a5b-c90ab68c3fab_746x626.png 848w, https://substackcdn.com/image/fetch/$s_!Tud2!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbfd66b31-21ec-476b-9a5b-c90ab68c3fab_746x626.png 1272w, https://substackcdn.com/image/fetch/$s_!Tud2!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbfd66b31-21ec-476b-9a5b-c90ab68c3fab_746x626.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!Tud2!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbfd66b31-21ec-476b-9a5b-c90ab68c3fab_746x626.png" width="746" height="626" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/bfd66b31-21ec-476b-9a5b-c90ab68c3fab_746x626.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:626,&quot;width&quot;:746,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:171702,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://vutr.substack.com/i/159654666?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbfd66b31-21ec-476b-9a5b-c90ab68c3fab_746x626.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!Tud2!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbfd66b31-21ec-476b-9a5b-c90ab68c3fab_746x626.png 424w, 
https://substackcdn.com/image/fetch/$s_!Tud2!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbfd66b31-21ec-476b-9a5b-c90ab68c3fab_746x626.png 848w, https://substackcdn.com/image/fetch/$s_!Tud2!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbfd66b31-21ec-476b-9a5b-c90ab68c3fab_746x626.png 1272w, https://substackcdn.com/image/fetch/$s_!Tud2!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbfd66b31-21ec-476b-9a5b-c90ab68c3fab_746x626.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line 
x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><ul><li><p>The Python integration requires the connection name, the repo containing the Python scripts, and a token that gives Orchestra read access to the repo. You can check this <a href="https://docs.getorchestra.io/docs/integrations/utility/python/#github">guide</a> to get the GitHub token.</p></li><li><p>The dbt-core integration requires the connection name, the repo containing the dbt project, the GitHub token, and the dbt profile.</p></li><li><p>The Snowflake integration requires the connection name and the Snowflake warehouse information.</p></li></ul><h3>Set up the pipeline</h3><p>As in Airflow, Orchestra has a concept called a <strong>&#8220;task,&#8221;</strong> its most basic execution unit. Tasks leverage integrations to interact with external systems and execute the user-defined logic.</p><p>Users arrange tasks into <a href="https://docs.getorchestra.io/docs/core-concepts/pipelines/">Pipelines</a>, specifying the upstream and downstream dependencies. In other words, a Pipeline tells Orchestra the order in which to run your tasks.</p><p>To build the pipeline, we choose <strong>Pipelines</strong> in the sidebar &#8594; <strong>New pipeline &#8594; + Create new Pipeline:</strong></p><ul><li><p>After this, you will see two options: <strong>Orchestra</strong> and <strong>GitHub</strong>. For now, we will go with <strong>Orchestra</strong>. 
This option lets you build the data pipeline directly on Orchestra&#8217;s UI.</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!xOrh!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F017bdb74-6130-4a8e-808a-c6b3c027e883_500x148.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!xOrh!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F017bdb74-6130-4a8e-808a-c6b3c027e883_500x148.png 424w, https://substackcdn.com/image/fetch/$s_!xOrh!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F017bdb74-6130-4a8e-808a-c6b3c027e883_500x148.png 848w, https://substackcdn.com/image/fetch/$s_!xOrh!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F017bdb74-6130-4a8e-808a-c6b3c027e883_500x148.png 1272w, https://substackcdn.com/image/fetch/$s_!xOrh!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F017bdb74-6130-4a8e-808a-c6b3c027e883_500x148.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!xOrh!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F017bdb74-6130-4a8e-808a-c6b3c027e883_500x148.png" width="500" height="148" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/017bdb74-6130-4a8e-808a-c6b3c027e883_500x148.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:148,&quot;width&quot;:500,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:15801,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://vutr.substack.com/i/159654666?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F017bdb74-6130-4a8e-808a-c6b3c027e883_500x148.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!xOrh!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F017bdb74-6130-4a8e-808a-c6b3c027e883_500x148.png 424w, https://substackcdn.com/image/fetch/$s_!xOrh!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F017bdb74-6130-4a8e-808a-c6b3c027e883_500x148.png 848w, https://substackcdn.com/image/fetch/$s_!xOrh!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F017bdb74-6130-4a8e-808a-c6b3c027e883_500x148.png 1272w, https://substackcdn.com/image/fetch/$s_!xOrh!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F017bdb74-6130-4a8e-808a-c6b3c027e883_500x148.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a></figure></div></li><li><p>In the next step, Orchestra prompts you to choose the type of trigger you want. 
<a href="https://docs.getorchestra.io/docs/core-concepts/triggers/">Available options</a> are manual, triggering by webhook, triggering by other pipelines, triggering by sensor, or cron jobs. For the sake of simplicity, we will go with the manual option. We can change or add trigger types later.</p></li><li><p>Next, we will add the first task by clicking &#8220;<strong>Add task&#8221; </strong>in the <strong>Task Group</strong>. Tasks in the same Task group will be run in parallel.</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!MoF9!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F080c46c9-44d2-416e-996e-d3a08737a676_474x196.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!MoF9!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F080c46c9-44d2-416e-996e-d3a08737a676_474x196.png 424w, https://substackcdn.com/image/fetch/$s_!MoF9!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F080c46c9-44d2-416e-996e-d3a08737a676_474x196.png 848w, https://substackcdn.com/image/fetch/$s_!MoF9!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F080c46c9-44d2-416e-996e-d3a08737a676_474x196.png 1272w, https://substackcdn.com/image/fetch/$s_!MoF9!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F080c46c9-44d2-416e-996e-d3a08737a676_474x196.png 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!MoF9!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F080c46c9-44d2-416e-996e-d3a08737a676_474x196.png" width="534" height="220.81012658227849" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/080c46c9-44d2-416e-996e-d3a08737a676_474x196.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:196,&quot;width&quot;:474,&quot;resizeWidth&quot;:534,&quot;bytes&quot;:35641,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://vutr.substack.com/i/159654666?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F080c46c9-44d2-416e-996e-d3a08737a676_474x196.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!MoF9!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F080c46c9-44d2-416e-996e-d3a08737a676_474x196.png 424w, https://substackcdn.com/image/fetch/$s_!MoF9!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F080c46c9-44d2-416e-996e-d3a08737a676_474x196.png 848w, https://substackcdn.com/image/fetch/$s_!MoF9!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F080c46c9-44d2-416e-996e-d3a08737a676_474x196.png 1272w, https://substackcdn.com/image/fetch/$s_!MoF9!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F080c46c9-44d2-416e-996e-d3a08737a676_474x196.png 1456w" 
sizes="100vw" loading="lazy"></picture><div></div></div></a></figure></div></li><li><p>The first task is the Python task, which uses the defined Python integration. Orchestra will ask us to <strong>&#8220;Choose an integration job.&#8221; </strong>For this task, we will go with the &#8220;<strong>Python-execute script.&#8221;</strong></p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!xaSj!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe1b4622f-9268-4672-ae72-49484b908010_486x130.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!xaSj!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe1b4622f-9268-4672-ae72-49484b908010_486x130.png 424w, https://substackcdn.com/image/fetch/$s_!xaSj!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe1b4622f-9268-4672-ae72-49484b908010_486x130.png 848w, https://substackcdn.com/image/fetch/$s_!xaSj!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe1b4622f-9268-4672-ae72-49484b908010_486x130.png 1272w, https://substackcdn.com/image/fetch/$s_!xaSj!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe1b4622f-9268-4672-ae72-49484b908010_486x130.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!xaSj!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe1b4622f-9268-4672-ae72-49484b908010_486x130.png" width="522" height="139.62962962962962" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/e1b4622f-9268-4672-ae72-49484b908010_486x130.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:130,&quot;width&quot;:486,&quot;resizeWidth&quot;:522,&quot;bytes&quot;:21569,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://vutr.substack.com/i/159654666?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe1b4622f-9268-4672-ae72-49484b908010_486x130.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!xaSj!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe1b4622f-9268-4672-ae72-49484b908010_486x130.png 424w, https://substackcdn.com/image/fetch/$s_!xaSj!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe1b4622f-9268-4672-ae72-49484b908010_486x130.png 848w, https://substackcdn.com/image/fetch/$s_!xaSj!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe1b4622f-9268-4672-ae72-49484b908010_486x130.png 1272w, https://substackcdn.com/image/fetch/$s_!xaSj!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe1b4622f-9268-4672-ae72-49484b908010_486x130.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a></figure></div></li><li><p>The task will run this <a href="https://github.com/vutrinh274/dbt_example/blob/main/python/upload_to_s3.py">Python script</a> to upload data from the local to the S3 bucket. 
This task also needs some environment variables to work with the S3 client (e.g., AWS_ACCESS_KEY_ID). We enter the required information for this task and put the values of the environment variables in the <strong>&#8220;Environment Variables JSON&#8221;</strong> section. Orchestra will encode these variables when we save them.</p></li></ul><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!REEd!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff88704f9-8f5a-4f6a-93a6-c5d8bd5dc46f_1144x686.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!REEd!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff88704f9-8f5a-4f6a-93a6-c5d8bd5dc46f_1144x686.png 424w, https://substackcdn.com/image/fetch/$s_!REEd!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff88704f9-8f5a-4f6a-93a6-c5d8bd5dc46f_1144x686.png 848w, https://substackcdn.com/image/fetch/$s_!REEd!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff88704f9-8f5a-4f6a-93a6-c5d8bd5dc46f_1144x686.png 1272w, https://substackcdn.com/image/fetch/$s_!REEd!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff88704f9-8f5a-4f6a-93a6-c5d8bd5dc46f_1144x686.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!REEd!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff88704f9-8f5a-4f6a-93a6-c5d8bd5dc46f_1144x686.png" width="1144" height="686"
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/f88704f9-8f5a-4f6a-93a6-c5d8bd5dc46f_1144x686.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:686,&quot;width&quot;:1144,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:238419,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://vutr.substack.com/i/159654666?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff88704f9-8f5a-4f6a-93a6-c5d8bd5dc46f_1144x686.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!REEd!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff88704f9-8f5a-4f6a-93a6-c5d8bd5dc46f_1144x686.png 424w, https://substackcdn.com/image/fetch/$s_!REEd!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff88704f9-8f5a-4f6a-93a6-c5d8bd5dc46f_1144x686.png 848w, https://substackcdn.com/image/fetch/$s_!REEd!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff88704f9-8f5a-4f6a-93a6-c5d8bd5dc46f_1144x686.png 1272w, https://substackcdn.com/image/fetch/$s_!REEd!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff88704f9-8f5a-4f6a-93a6-c5d8bd5dc46f_1144x686.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><ul><li><p>Next, we set up a Snowflake task that runs queries to load the CSV from the S3 bucket. We click <strong>&#8220;Add task&#8220; </strong>in the next task group to set the dependencies between this task and the Python task.
Before this task, <a href="https://docs.snowflake.com/en/user-guide/data-load-s3-config-storage-integration">we had to set up a few things from Snowflake</a> so it could read the data in S3.</p></li><li><p>We choose <strong>Run Query (Snowflake)</strong></p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!RLQi!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd0600dbf-9c13-4da0-9f2f-3069557c89c6_534x162.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!RLQi!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd0600dbf-9c13-4da0-9f2f-3069557c89c6_534x162.png 424w, https://substackcdn.com/image/fetch/$s_!RLQi!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd0600dbf-9c13-4da0-9f2f-3069557c89c6_534x162.png 848w, https://substackcdn.com/image/fetch/$s_!RLQi!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd0600dbf-9c13-4da0-9f2f-3069557c89c6_534x162.png 1272w, https://substackcdn.com/image/fetch/$s_!RLQi!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd0600dbf-9c13-4da0-9f2f-3069557c89c6_534x162.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!RLQi!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd0600dbf-9c13-4da0-9f2f-3069557c89c6_534x162.png" width="534" height="162" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/d0600dbf-9c13-4da0-9f2f-3069557c89c6_534x162.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:162,&quot;width&quot;:534,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:26952,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://vutr.substack.com/i/159654666?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd0600dbf-9c13-4da0-9f2f-3069557c89c6_534x162.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!RLQi!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd0600dbf-9c13-4da0-9f2f-3069557c89c6_534x162.png 424w, https://substackcdn.com/image/fetch/$s_!RLQi!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd0600dbf-9c13-4da0-9f2f-3069557c89c6_534x162.png 848w, https://substackcdn.com/image/fetch/$s_!RLQi!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd0600dbf-9c13-4da0-9f2f-3069557c89c6_534x162.png 1272w, https://substackcdn.com/image/fetch/$s_!RLQi!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd0600dbf-9c13-4da0-9f2f-3069557c89c6_534x162.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a></figure></div></li><li><p>Enter the task&#8217;s name, the SQL needed to be run, and the defined Snowflake connection. 
You can check the SQL script <a href="https://github.com/vutrinh274/dbt_example/blob/main/snowflake/load_data_from_S3.sql">here</a>.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!ola5!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F65bb92a0-e282-4bd5-bbd6-b2cec5ccd6e0_602x530.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!ola5!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F65bb92a0-e282-4bd5-bbd6-b2cec5ccd6e0_602x530.png 424w, https://substackcdn.com/image/fetch/$s_!ola5!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F65bb92a0-e282-4bd5-bbd6-b2cec5ccd6e0_602x530.png 848w, https://substackcdn.com/image/fetch/$s_!ola5!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F65bb92a0-e282-4bd5-bbd6-b2cec5ccd6e0_602x530.png 1272w, https://substackcdn.com/image/fetch/$s_!ola5!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F65bb92a0-e282-4bd5-bbd6-b2cec5ccd6e0_602x530.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!ola5!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F65bb92a0-e282-4bd5-bbd6-b2cec5ccd6e0_602x530.png" width="602" height="530" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/65bb92a0-e282-4bd5-bbd6-b2cec5ccd6e0_602x530.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:530,&quot;width&quot;:602,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:102016,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://vutr.substack.com/i/159654666?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F65bb92a0-e282-4bd5-bbd6-b2cec5ccd6e0_602x530.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!ola5!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F65bb92a0-e282-4bd5-bbd6-b2cec5ccd6e0_602x530.png 424w, https://substackcdn.com/image/fetch/$s_!ola5!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F65bb92a0-e282-4bd5-bbd6-b2cec5ccd6e0_602x530.png 848w, https://substackcdn.com/image/fetch/$s_!ola5!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F65bb92a0-e282-4bd5-bbd6-b2cec5ccd6e0_602x530.png 1272w, https://substackcdn.com/image/fetch/$s_!ola5!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F65bb92a0-e282-4bd5-bbd6-b2cec5ccd6e0_602x530.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div></li><li><p>The two final tasks will be dbt tasks. The first one will run all staging models to clean the data loaded from the S3 bucket. The latter will run all curated models to transform data from staging into fact and dimension tables.
For the dbt staging task, we choose the dbt Core command task.</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!4yrk!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F580dad14-f460-4f4c-a05d-a36430867e04_594x158.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!4yrk!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F580dad14-f460-4f4c-a05d-a36430867e04_594x158.png 424w, https://substackcdn.com/image/fetch/$s_!4yrk!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F580dad14-f460-4f4c-a05d-a36430867e04_594x158.png 848w, https://substackcdn.com/image/fetch/$s_!4yrk!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F580dad14-f460-4f4c-a05d-a36430867e04_594x158.png 1272w, https://substackcdn.com/image/fetch/$s_!4yrk!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F580dad14-f460-4f4c-a05d-a36430867e04_594x158.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!4yrk!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F580dad14-f460-4f4c-a05d-a36430867e04_594x158.png" width="594" height="158" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/580dad14-f460-4f4c-a05d-a36430867e04_594x158.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:158,&quot;width&quot;:594,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:24690,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://vutr.substack.com/i/159654666?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F580dad14-f460-4f4c-a05d-a36430867e04_594x158.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!4yrk!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F580dad14-f460-4f4c-a05d-a36430867e04_594x158.png 424w, https://substackcdn.com/image/fetch/$s_!4yrk!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F580dad14-f460-4f4c-a05d-a36430867e04_594x158.png 848w, https://substackcdn.com/image/fetch/$s_!4yrk!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F580dad14-f460-4f4c-a05d-a36430867e04_594x158.png 1272w, https://substackcdn.com/image/fetch/$s_!4yrk!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F580dad14-f460-4f4c-a05d-a36430867e04_594x158.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a></figure></div></li><li><p>Then, we enter some required information for this task:</p></li></ul><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" 
href="https://substackcdn.com/image/fetch/$s_!f0zQ!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd41ec739-62dc-4aef-859a-e5279fde1304_1454x656.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!f0zQ!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd41ec739-62dc-4aef-859a-e5279fde1304_1454x656.png 424w, https://substackcdn.com/image/fetch/$s_!f0zQ!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd41ec739-62dc-4aef-859a-e5279fde1304_1454x656.png 848w, https://substackcdn.com/image/fetch/$s_!f0zQ!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd41ec739-62dc-4aef-859a-e5279fde1304_1454x656.png 1272w, https://substackcdn.com/image/fetch/$s_!f0zQ!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd41ec739-62dc-4aef-859a-e5279fde1304_1454x656.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!f0zQ!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd41ec739-62dc-4aef-859a-e5279fde1304_1454x656.png" width="1454" height="656" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/d41ec739-62dc-4aef-859a-e5279fde1304_1454x656.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:656,&quot;width&quot;:1454,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:174586,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://vutr.substack.com/i/159654666?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd41ec739-62dc-4aef-859a-e5279fde1304_1454x656.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!f0zQ!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd41ec739-62dc-4aef-859a-e5279fde1304_1454x656.png 424w, https://substackcdn.com/image/fetch/$s_!f0zQ!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd41ec739-62dc-4aef-859a-e5279fde1304_1454x656.png 848w, https://substackcdn.com/image/fetch/$s_!f0zQ!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd41ec739-62dc-4aef-859a-e5279fde1304_1454x656.png 1272w, https://substackcdn.com/image/fetch/$s_!f0zQ!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd41ec739-62dc-4aef-859a-e5279fde1304_1454x656.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><ul><li><p>For the dbt curated task, we will configure it the same as the dbt staging task, except for the dbt commands, which need to be changed to <code>dbt build -s tag:curated</code></p></li></ul><p>And that&#8217;s it; we built a pipeline with Orchestra.
To recap what we&#8217;ve done, you can check the video here:</p><div class="native-video-embed" data-component-name="VideoPlaceholder" data-attrs="{&quot;mediaUploadId&quot;:&quot;5b52efd4-cbd2-468d-abf2-b6d53a356ee1&quot;,&quot;duration&quot;:null}"></div><h3>Run the pipeline</h3><p>Once the pipeline is in place, it can be run in one of the following ways:</p><ul><li><p>Run based on the trigger configuration.</p></li><li><p>Run the whole pipeline manually.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!ovcL!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbd6d08c8-e6d8-4c66-9033-8fe17db291ab_792x406.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!ovcL!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbd6d08c8-e6d8-4c66-9033-8fe17db291ab_792x406.png 424w, https://substackcdn.com/image/fetch/$s_!ovcL!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbd6d08c8-e6d8-4c66-9033-8fe17db291ab_792x406.png 848w, https://substackcdn.com/image/fetch/$s_!ovcL!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbd6d08c8-e6d8-4c66-9033-8fe17db291ab_792x406.png 1272w, https://substackcdn.com/image/fetch/$s_!ovcL!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbd6d08c8-e6d8-4c66-9033-8fe17db291ab_792x406.png 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!ovcL!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbd6d08c8-e6d8-4c66-9033-8fe17db291ab_792x406.png" width="792" height="406" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/bd6d08c8-e6d8-4c66-9033-8fe17db291ab_792x406.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:406,&quot;width&quot;:792,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:107560,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://vutr.substack.com/i/159654666?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbd6d08c8-e6d8-4c66-9033-8fe17db291ab_792x406.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!ovcL!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbd6d08c8-e6d8-4c66-9033-8fe17db291ab_792x406.png 424w, https://substackcdn.com/image/fetch/$s_!ovcL!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbd6d08c8-e6d8-4c66-9033-8fe17db291ab_792x406.png 848w, https://substackcdn.com/image/fetch/$s_!ovcL!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbd6d08c8-e6d8-4c66-9033-8fe17db291ab_792x406.png 1272w, https://substackcdn.com/image/fetch/$s_!ovcL!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbd6d08c8-e6d8-4c66-9033-8fe17db291ab_792x406.png 1456w" sizes="100vw" 
loading="lazy"></picture></div></a></figure></div></li><li><p>Run a specific task.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!b30R!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7c40d0bd-619e-4cc7-9273-e076a96e26ba_792x272.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" 
srcset="https://substackcdn.com/image/fetch/$s_!b30R!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7c40d0bd-619e-4cc7-9273-e076a96e26ba_792x272.png 424w, https://substackcdn.com/image/fetch/$s_!b30R!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7c40d0bd-619e-4cc7-9273-e076a96e26ba_792x272.png 848w, https://substackcdn.com/image/fetch/$s_!b30R!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7c40d0bd-619e-4cc7-9273-e076a96e26ba_792x272.png 1272w, https://substackcdn.com/image/fetch/$s_!b30R!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7c40d0bd-619e-4cc7-9273-e076a96e26ba_792x272.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!b30R!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7c40d0bd-619e-4cc7-9273-e076a96e26ba_792x272.png" width="792" height="272" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/7c40d0bd-619e-4cc7-9273-e076a96e26ba_792x272.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:272,&quot;width&quot;:792,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:67691,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://vutr.substack.com/i/159654666?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7c40d0bd-619e-4cc7-9273-e076a96e26ba_792x272.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" 
srcset="https://substackcdn.com/image/fetch/$s_!b30R!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7c40d0bd-619e-4cc7-9273-e076a96e26ba_792x272.png 424w, https://substackcdn.com/image/fetch/$s_!b30R!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7c40d0bd-619e-4cc7-9273-e076a96e26ba_792x272.png 848w, https://substackcdn.com/image/fetch/$s_!b30R!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7c40d0bd-619e-4cc7-9273-e076a96e26ba_792x272.png 1272w, https://substackcdn.com/image/fetch/$s_!b30R!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7c40d0bd-619e-4cc7-9273-e076a96e26ba_792x272.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" 
stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div></li></ul><h3>Observability</h3><p>A very cool feature of Orchestra is that the platform will aggregate all the metadata for us.</p><p>After running the pipeline, Orchestra will display the status of that run for us; in the screenshot below, we can check the status of each task. If a task has any issues, we can detect them from here, fix them, and re-run the pipeline.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!nj98!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa8350531-4cee-4bff-a012-91dafa39c9e3_794x276.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!nj98!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa8350531-4cee-4bff-a012-91dafa39c9e3_794x276.png 424w, https://substackcdn.com/image/fetch/$s_!nj98!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa8350531-4cee-4bff-a012-91dafa39c9e3_794x276.png 848w, https://substackcdn.com/image/fetch/$s_!nj98!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa8350531-4cee-4bff-a012-91dafa39c9e3_794x276.png 1272w, 
https://substackcdn.com/image/fetch/$s_!nj98!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa8350531-4cee-4bff-a012-91dafa39c9e3_794x276.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!nj98!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa8350531-4cee-4bff-a012-91dafa39c9e3_794x276.png" width="794" height="276" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/a8350531-4cee-4bff-a012-91dafa39c9e3_794x276.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:276,&quot;width&quot;:794,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:81179,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://vutr.substack.com/i/159654666?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa8350531-4cee-4bff-a012-91dafa39c9e3_794x276.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!nj98!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa8350531-4cee-4bff-a012-91dafa39c9e3_794x276.png 424w, https://substackcdn.com/image/fetch/$s_!nj98!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa8350531-4cee-4bff-a012-91dafa39c9e3_794x276.png 848w, https://substackcdn.com/image/fetch/$s_!nj98!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa8350531-4cee-4bff-a012-91dafa39c9e3_794x276.png 1272w, 
https://substackcdn.com/image/fetch/$s_!nj98!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa8350531-4cee-4bff-a012-91dafa39c9e3_794x276.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><p>Once the pipeline finishes, we can click the <strong>&#8220;Explore lineage&#8221;</strong> button to view its lineage.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" 
href="https://substackcdn.com/image/fetch/$s_!q11_!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe3b91ae6-da31-4670-9ba5-24fd60b95872_812x266.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!q11_!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe3b91ae6-da31-4670-9ba5-24fd60b95872_812x266.png 424w, https://substackcdn.com/image/fetch/$s_!q11_!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe3b91ae6-da31-4670-9ba5-24fd60b95872_812x266.png 848w, https://substackcdn.com/image/fetch/$s_!q11_!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe3b91ae6-da31-4670-9ba5-24fd60b95872_812x266.png 1272w, https://substackcdn.com/image/fetch/$s_!q11_!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe3b91ae6-da31-4670-9ba5-24fd60b95872_812x266.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!q11_!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe3b91ae6-da31-4670-9ba5-24fd60b95872_812x266.png" width="812" height="266" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/e3b91ae6-da31-4670-9ba5-24fd60b95872_812x266.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:266,&quot;width&quot;:812,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:72539,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://vutr.substack.com/i/159654666?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe3b91ae6-da31-4670-9ba5-24fd60b95872_812x266.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!q11_!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe3b91ae6-da31-4670-9ba5-24fd60b95872_812x266.png 424w, https://substackcdn.com/image/fetch/$s_!q11_!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe3b91ae6-da31-4670-9ba5-24fd60b95872_812x266.png 848w, https://substackcdn.com/image/fetch/$s_!q11_!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe3b91ae6-da31-4670-9ba5-24fd60b95872_812x266.png 1272w, https://substackcdn.com/image/fetch/$s_!q11_!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe3b91ae6-da31-4670-9ba5-24fd60b95872_812x266.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 
0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>Orchestra also keeps the history of every pipeline&#8217;s run.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!6s5v!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3cbba241-282b-4b8d-bee9-795642bb031b_894x274.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!6s5v!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3cbba241-282b-4b8d-bee9-795642bb031b_894x274.png 424w, 
https://substackcdn.com/image/fetch/$s_!6s5v!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3cbba241-282b-4b8d-bee9-795642bb031b_894x274.png 848w, https://substackcdn.com/image/fetch/$s_!6s5v!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3cbba241-282b-4b8d-bee9-795642bb031b_894x274.png 1272w, https://substackcdn.com/image/fetch/$s_!6s5v!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3cbba241-282b-4b8d-bee9-795642bb031b_894x274.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!6s5v!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3cbba241-282b-4b8d-bee9-795642bb031b_894x274.png" width="894" height="274" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/3cbba241-282b-4b8d-bee9-795642bb031b_894x274.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:274,&quot;width&quot;:894,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:81854,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://vutr.substack.com/i/159654666?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3cbba241-282b-4b8d-bee9-795642bb031b_894x274.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!6s5v!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3cbba241-282b-4b8d-bee9-795642bb031b_894x274.png 424w, 
https://substackcdn.com/image/fetch/$s_!6s5v!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3cbba241-282b-4b8d-bee9-795642bb031b_894x274.png 848w, https://substackcdn.com/image/fetch/$s_!6s5v!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3cbba241-282b-4b8d-bee9-795642bb031b_894x274.png 1272w, https://substackcdn.com/image/fetch/$s_!6s5v!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3cbba241-282b-4b8d-bee9-795642bb031b_894x274.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><p>Orchestra also collects metadata about our data assets. In our project, we created five Snowflake tables and eight dbt models.</p><p>For the native Snowflake tables, Orchestra collects the metadata automatically; for the dbt models, we must explicitly allow it. We can go back to the pipeline and edit the dbt tasks so that they collect the metadata:</p><div class="native-video-embed" data-component-name="VideoPlaceholder" data-attrs="{&quot;mediaUploadId&quot;:&quot;e594cd92-c13e-4558-9446-3802ca9c787d&quot;,&quot;duration&quot;:null}"></div><p>We can browse all the assets from the Data Assets section in the sidebar. For my pipeline, this view shows the Snowflake table structure (on the left), the number of assets, the asset coverage, the asset health, and the asset listing.</p><div class="native-video-embed" data-component-name="VideoPlaceholder" data-attrs="{&quot;mediaUploadId&quot;:&quot;c676c0e3-ff74-40bc-a4f6-22a0e174a8f8&quot;,&quot;duration&quot;:null}"></div><h3>Environment</h3><p>When we build a pipeline, we usually want separate environments (e.g., dev, staging, and prod). With Airflow, this typically means setting up multiple Airflow instances and multiple compute instances (e.g., two Spark clusters for the dev and prod environments).</p><p>Orchestra provides a simpler way: an environment is just a configuration. When defining a pipeline, users can add this configuration to each task to specify the environments in which it can run.</p><p>For our project, imagine we need two separate environments, &#8220;develop&#8221; and &#8220;production&#8221;, for the Snowflake and dbt tasks. We want the same pipeline to run on two different Snowflake warehouses. 
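</p><p>To make &#8220;an environment is just a configuration&#8221; concrete, here is a minimal Python sketch. It only illustrates the idea and is not Orchestra&#8217;s API: the <code>Environment</code> class, the <code>resolve_integration</code> helper, and the integration names are all hypothetical.</p><pre><code class="language-python">from dataclasses import dataclass


@dataclass(frozen=True)
class Environment:
    """A named bundle of integrations (all names here are hypothetical)."""
    name: str
    snowflake_integration: str
    dbt_integration: str


# One entry per environment; switching environments only swaps configuration.
ENVIRONMENTS = {
    "develop": Environment("develop", "snowflake_dev", "dbt_dev"),
    "production": Environment("production", "snowflake_prod", "dbt_prod"),
}


def resolve_integration(task_type, environment="develop"):
    """Return the integration a task of the given type would use in an environment."""
    env = ENVIRONMENTS[environment]
    if task_type == "snowflake":
        return env.snowflake_integration
    if task_type == "dbt":
        return env.dbt_integration
    raise ValueError("unknown task type: " + task_type)


# The same pipeline definition, resolved against both environments.
pipeline = ["snowflake", "dbt", "dbt"]
for env_name in ENVIRONMENTS:
    print(env_name, [resolve_integration(task, env_name) for task in pipeline])
</code></pre><p>The pipeline definition stays the same; only the environment lookup changes which Snowflake and dbt integrations the tasks would use.</p><p>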
To achieve this in Orchestra, we need:</p><div class="native-video-embed" data-component-name="VideoPlaceholder" data-attrs="{&quot;mediaUploadId&quot;:&quot;238f1f69-4acf-40e2-be8a-04287525d224&quot;,&quot;duration&quot;:null}"></div><ul><li><p>Two Snowflake integrations and two dbt integrations, one of each for the &#8220;develop&#8221; and &#8220;production&#8221; environments.</p></li><li><p>The &#8220;develop&#8221; and &#8220;production&#8221; environment configurations, defined in the Orchestra UI, each with its associated integrations.</p></li><li><p>These configurations applied to the desired tasks.</p></li></ul><p>After this, whenever you run the pipeline or a single task, it will run in a specific environment: either the default one or the one you choose. Orchestra will spin up separate resources and align these across environments behind the scenes for us.</p><p>With this approach, we can control which tasks should run in different environments and which can run in a single environment.</p><p>Orchestra also allows us to trigger the pipeline from a GitHub workflow, which helps us integrate it deeply with CI/CD. Let&#8217;s say the current pipeline has two environments, &#8220;develop&#8221; and &#8220;production&#8221;:</p><ul><li><p>Whenever I push dbt changes to the &#8220;develop&#8221; branch, I want the Orchestra pipeline to run in the &#8220;develop&#8221; environment to validate the change.</p></li><li><p>If there are no issues, I will merge the changes into the &#8220;main&#8221; branch and run the Orchestra pipeline in the &#8220;production&#8221; environment.</p></li></ul><p>Orchestra provides the <a href="https://github.com/marketplace/actions/orchestra-run-pipeline">orchestra-hq/run-pipeline</a> GitHub Action so that users can trigger a pipeline from a GitHub workflow. 
All we need is the <a href="https://github.com/marketplace/actions/orchestra-run-pipeline">Orchestra API key and the pipeline&#8217;s ID</a> (from the URL).</p><p>For this project, I prepared <a href="https://github.com/vutrinh274/dbt_example/blob/main/.github/workflows/main.yml">a GitHub workflow</a> as follows:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!DRHz!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F040b49e8-8ed4-41b2-aafe-08d5bd597d5a_762x1056.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!DRHz!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F040b49e8-8ed4-41b2-aafe-08d5bd597d5a_762x1056.png 424w, https://substackcdn.com/image/fetch/$s_!DRHz!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F040b49e8-8ed4-41b2-aafe-08d5bd597d5a_762x1056.png 848w, https://substackcdn.com/image/fetch/$s_!DRHz!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F040b49e8-8ed4-41b2-aafe-08d5bd597d5a_762x1056.png 1272w, https://substackcdn.com/image/fetch/$s_!DRHz!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F040b49e8-8ed4-41b2-aafe-08d5bd597d5a_762x1056.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!DRHz!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F040b49e8-8ed4-41b2-aafe-08d5bd597d5a_762x1056.png" width="762" height="1056" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/040b49e8-8ed4-41b2-aafe-08d5bd597d5a_762x1056.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1056,&quot;width&quot;:762,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:146854,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://vutr.substack.com/i/159654666?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F040b49e8-8ed4-41b2-aafe-08d5bd597d5a_762x1056.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!DRHz!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F040b49e8-8ed4-41b2-aafe-08d5bd597d5a_762x1056.png 424w, https://substackcdn.com/image/fetch/$s_!DRHz!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F040b49e8-8ed4-41b2-aafe-08d5bd597d5a_762x1056.png 848w, https://substackcdn.com/image/fetch/$s_!DRHz!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F040b49e8-8ed4-41b2-aafe-08d5bd597d5a_762x1056.png 1272w, https://substackcdn.com/image/fetch/$s_!DRHz!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F040b49e8-8ed4-41b2-aafe-08d5bd597d5a_762x1056.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" 
viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>The workflow has two jobs. The first checks whether there is a pull request from the develop branch; if so, it runs my pipeline in the &#8220;develop&#8221; environment. 
The second checks whether there is a push to the &#8220;main&#8221; branch; if so, it runs the pipeline in the &#8220;production&#8221; environment.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!VbE5!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F164aed37-a5b9-4bce-a98c-84f004473a50_888x458.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!VbE5!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F164aed37-a5b9-4bce-a98c-84f004473a50_888x458.png 424w, https://substackcdn.com/image/fetch/$s_!VbE5!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F164aed37-a5b9-4bce-a98c-84f004473a50_888x458.png 848w, https://substackcdn.com/image/fetch/$s_!VbE5!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F164aed37-a5b9-4bce-a98c-84f004473a50_888x458.png 1272w, https://substackcdn.com/image/fetch/$s_!VbE5!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F164aed37-a5b9-4bce-a98c-84f004473a50_888x458.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!VbE5!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F164aed37-a5b9-4bce-a98c-84f004473a50_888x458.png" width="888" height="458" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/164aed37-a5b9-4bce-a98c-84f004473a50_888x458.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:458,&quot;width&quot;:888,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:99185,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://vutr.substack.com/i/159654666?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F164aed37-a5b9-4bce-a98c-84f004473a50_888x458.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!VbE5!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F164aed37-a5b9-4bce-a98c-84f004473a50_888x458.png 424w, https://substackcdn.com/image/fetch/$s_!VbE5!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F164aed37-a5b9-4bce-a98c-84f004473a50_888x458.png 848w, https://substackcdn.com/image/fetch/$s_!VbE5!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F164aed37-a5b9-4bce-a98c-84f004473a50_888x458.png 1272w, https://substackcdn.com/image/fetch/$s_!VbE5!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F164aed37-a5b9-4bce-a98c-84f004473a50_888x458.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 
0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption"></figcaption></figure></div><p>A very cool thing is that the output of the pipeline run is streamed to the UI when we observe the running workflow from GitHub.</p><h3>Version Control</h3><p>Orchestra&#8217;s integration with Git does not stop there. When we create a pipeline, Orchestra records the pipeline definition in a YAML file. 
Whenever we make changes to the pipeline in the UI, it will show changes just like Git:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!hA6x!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F43714c6d-7b83-4acf-b88e-d009a9053129_1204x368.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!hA6x!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F43714c6d-7b83-4acf-b88e-d009a9053129_1204x368.png 424w, https://substackcdn.com/image/fetch/$s_!hA6x!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F43714c6d-7b83-4acf-b88e-d009a9053129_1204x368.png 848w, https://substackcdn.com/image/fetch/$s_!hA6x!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F43714c6d-7b83-4acf-b88e-d009a9053129_1204x368.png 1272w, https://substackcdn.com/image/fetch/$s_!hA6x!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F43714c6d-7b83-4acf-b88e-d009a9053129_1204x368.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!hA6x!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F43714c6d-7b83-4acf-b88e-d009a9053129_1204x368.png" width="1204" height="368" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/43714c6d-7b83-4acf-b88e-d009a9053129_1204x368.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:368,&quot;width&quot;:1204,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:74256,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://vutr.substack.com/i/159654666?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F43714c6d-7b83-4acf-b88e-d009a9053129_1204x368.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!hA6x!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F43714c6d-7b83-4acf-b88e-d009a9053129_1204x368.png 424w, https://substackcdn.com/image/fetch/$s_!hA6x!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F43714c6d-7b83-4acf-b88e-d009a9053129_1204x368.png 848w, https://substackcdn.com/image/fetch/$s_!hA6x!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F43714c6d-7b83-4acf-b88e-d009a9053129_1204x368.png 1272w, https://substackcdn.com/image/fetch/$s_!hA6x!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F43714c6d-7b83-4acf-b88e-d009a9053129_1204x368.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" 
viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>Orchestra lets us version control this YAML file with Git. To do this for our project, we need to:</p><ul><li><p>Allow Orchestra to access the repo. We go to Settings &#8594; User Git-control settings &#8594; Connect. It will ask us to grant Orchestra some specific permissions.</p></li><li><p>Edit the pipeline, click the <strong>Cogwheel</strong> symbol, and choose <strong>GitHub</strong> for pipeline storage. 
We then fill in the Git repository details, click <strong>update,</strong> and click <strong>save.</strong></p></li><li><p>There will be a pull request to add the Orchestra pipeline&#8217;s YAML file to your repo.</p></li><li><p>From now on, we can change the YAML file in the repo, and those changes will be reflected in the Orchestra pipeline UI.</p></li></ul><p>In the video below, I configure version control for the pipeline YAML file in my repo and edit the file from my IDE to add another dbt task. Let&#8217;s see how it works:</p><div class="native-video-embed" data-component-name="VideoPlaceholder" data-attrs="{&quot;mediaUploadId&quot;:&quot;f45cd891-ca0b-4d6a-9e72-de37556596a1&quot;,&quot;duration&quot;:null}"></div><h3>Access Control</h3><p>To implement access control, we create a group with associated permissions. Orchestra lets us define permissions on anything from high-level resources like account settings to low-level ones like pipelines or environments.</p><p>After that, we add users to suitable groups. Users in a group are granted all the permissions associated with that group.</p><div class="native-video-embed" data-component-name="VideoPlaceholder" data-attrs="{&quot;mediaUploadId&quot;:&quot;cb72a13f-108a-43d0-873b-2f982a012a18&quot;,&quot;duration&quot;:null}"></div><div><hr></div><h2>My thoughts</h2><p>Orchestra is built for scale and makes building modern data workflows super easy. It provides a much more seamless way to build data pipelines, especially the &#8220;modern&#8221; ones with dbt + cloud data warehouse; everything can be done in its intuitive UI. The ability to collect, aggregate, and display metadata is also really valuable.</p><p>Regarding operating in different environments, Orchestra also does a very good job of abstracting the complexity of infrastructure management behind the scenes, so users only need to deal with a few configurations. 
The GitHub integration also impresses me a lot.</p><p>These are only some of my first impressions of Orchestra. There is a lot of flexibility and many features I did not explore fully, such as supporting multiple domains and effectively governing access to different resources in large enterprise organizations; these are very difficult with traditional workflow orchestration platforms.</p><p>Of course, every tool has pros and cons, and I believe that if I spend more time with Orchestra, I will spot some points that need improvement.</p><p>With this platform, you certainly won&#8217;t get the flexibility of Airflow, where you can write custom operators and construct the DAG in Python; in return, Orchestra abstracts the complexity away, and you can still build a robust pipeline from the UI or in code.</p><p>I think Orchestra is really worth <a href="https://www.getorchestra.io/pricing">trying</a>, especially if your team has limited resources or you just want to spend time on business logic instead of maintaining a workflow orchestration system.</p><div><hr></div><h2>Outro</h2><p>Thank you for reading this far.</p><p>In the last 10 minutes, we have explored Orchestra and its motivation. We have also built a pipeline and tried out some very cool features of the platform. Finally, I shared my naive thoughts on Orchestra.</p><p>Now, it&#8217;s time to say goodbye. 
See you in my next article.</p><div><hr></div><h2>Reference</h2><p><em>[1] <a href="https://docs.getorchestra.io/docs/quick-start/">Orchestra Documentation</a></em></p><p><em>[2] <a href="https://getorchestra.io">Orchestra website</a></em></p><p><em>[3] <a href="https://www.youtube.com/channel/UC562ybrRtpDC9gNQTx6nYKg">Product-Demos</a></em></p>]]></content:encoded></item><item><title><![CDATA[I spent 5 hours understanding how Uber built their ETL pipelines.]]></title><description><![CDATA[Spoiler: They don't use batch or stream pipelines]]></description><link>https://vutr.substack.com/p/i-spent-5-hours-understanding-how</link><guid isPermaLink="false">https://vutr.substack.com/p/i-spent-5-hours-understanding-how</guid><dc:creator><![CDATA[Vu Trinh]]></dc:creator><pubDate>Thu, 10 Apr 2025 03:57:45 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!IVAi!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3379e6b1-e3d4-40d7-b9bc-5f7edd1be0c5_2000x1429.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<blockquote><p><em>My ultimate goal is to help you break into the data engineering field and become a more impactful data engineer. 
To take this a step further and dedicate even more time to creating in-depth, practical content, I&#8217;m excited to introduce a paid membership option.</em></p><p><em>This will allow me to produce even higher-quality articles, diving deeper into the topics that matter most for your growth and making this whole endeavor more sustainable.</em></p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://vutr.substack.com/subscribe?&quot;,&quot;text&quot;:&quot;Upgrade subscription&quot;,&quot;action&quot;:null,&quot;class&quot;:&quot;button-wrapper&quot;}" data-component-name="ButtonCreateButton"><a class="button primary button-wrapper" href="https://vutr.substack.com/subscribe?"><span>Upgrade subscription</span></a></p></blockquote><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!IVAi!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3379e6b1-e3d4-40d7-b9bc-5f7edd1be0c5_2000x1429.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!IVAi!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3379e6b1-e3d4-40d7-b9bc-5f7edd1be0c5_2000x1429.png 424w, https://substackcdn.com/image/fetch/$s_!IVAi!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3379e6b1-e3d4-40d7-b9bc-5f7edd1be0c5_2000x1429.png 848w, https://substackcdn.com/image/fetch/$s_!IVAi!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3379e6b1-e3d4-40d7-b9bc-5f7edd1be0c5_2000x1429.png 1272w, 
https://substackcdn.com/image/fetch/$s_!IVAi!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3379e6b1-e3d4-40d7-b9bc-5f7edd1be0c5_2000x1429.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!IVAi!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3379e6b1-e3d4-40d7-b9bc-5f7edd1be0c5_2000x1429.png" width="1456" height="1040" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/3379e6b1-e3d4-40d7-b9bc-5f7edd1be0c5_2000x1429.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1040,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:326437,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://vutr.substack.com/i/160751145?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3379e6b1-e3d4-40d7-b9bc-5f7edd1be0c5_2000x1429.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!IVAi!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3379e6b1-e3d4-40d7-b9bc-5f7edd1be0c5_2000x1429.png 424w, https://substackcdn.com/image/fetch/$s_!IVAi!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3379e6b1-e3d4-40d7-b9bc-5f7edd1be0c5_2000x1429.png 848w, 
https://substackcdn.com/image/fetch/$s_!IVAi!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3379e6b1-e3d4-40d7-b9bc-5f7edd1be0c5_2000x1429.png 1272w, https://substackcdn.com/image/fetch/$s_!IVAi!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3379e6b1-e3d4-40d7-b9bc-5f7edd1be0c5_2000x1429.png 1456w" sizes="100vw" fetchpriority="high"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Image created by the author.</figcaption></figure></div><div><hr></div><h2>Intro</h2><p>This week, we will 
explore how Uber engineers build ETL pipelines to support their internet-scale business.</p><p>Uber is the tech company that transformed the taxi market in the early 2010s. <a href="https://www.businessofapps.com/data/uber-statistics/">In 2023, 137 million people used Uber or Uber Eats once a month, and Uber drivers completed 9.44 billion trips.</a></p><p>This article first looks at the business requirements for Uber&#8217;s data pipelines and then at how Uber delivered the solution.</p><div><hr></div><h2>Business Requirement</h2><p>At Uber, data is unified in a centralized petabyte-scale data lake. The Global Data Warehouse team is in charge of building foundational fact and dimension tables on this lake, which act as Lego pieces for all data use cases, from reporting to machine learning.</p><p>The data is not only used for common analytics cases; Uber also uses it to power critical functions of its services and applications, such as rider safety, ETA predictions, and fraud detection.</p><p>For Uber, data freshness is the backbone of the business; they invested heavily in the ability to process data as soon as it&#8217;s captured to reflect changes in the real world.</p><p>They build, evolve, and manage their data lakehouse to ensure it can do one thing efficiently: handle data incrementally.</p><p>Let&#8217;s review a typical use case at Uber to understand why incremental data processing is essential.</p><p>The use case is <strong>driver and courier earnings. </strong>Imagine Uber has a dataset containing daily earnings for every driver. A rider can tip the driver hours after a trip is completed. For example, a driver earns $10 for a trip on Monday night and gets an extra $1 tip on Tuesday morning, so Monday&#8217;s earnings for that driver must be updated from $10 to $11. 
</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!VZGY!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F44ef386d-f37c-4b7d-bf0a-33fe27edab5b_476x328.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!VZGY!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F44ef386d-f37c-4b7d-bf0a-33fe27edab5b_476x328.png 424w, https://substackcdn.com/image/fetch/$s_!VZGY!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F44ef386d-f37c-4b7d-bf0a-33fe27edab5b_476x328.png 848w, https://substackcdn.com/image/fetch/$s_!VZGY!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F44ef386d-f37c-4b7d-bf0a-33fe27edab5b_476x328.png 1272w, https://substackcdn.com/image/fetch/$s_!VZGY!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F44ef386d-f37c-4b7d-bf0a-33fe27edab5b_476x328.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!VZGY!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F44ef386d-f37c-4b7d-bf0a-33fe27edab5b_476x328.png" width="476" height="328" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/44ef386d-f37c-4b7d-bf0a-33fe27edab5b_476x328.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:328,&quot;width&quot;:476,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:31397,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://vutr.substack.com/i/160751145?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F44ef386d-f37c-4b7d-bf0a-33fe27edab5b_476x328.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!VZGY!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F44ef386d-f37c-4b7d-bf0a-33fe27edab5b_476x328.png 424w, https://substackcdn.com/image/fetch/$s_!VZGY!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F44ef386d-f37c-4b7d-bf0a-33fe27edab5b_476x328.png 848w, https://substackcdn.com/image/fetch/$s_!VZGY!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F44ef386d-f37c-4b7d-bf0a-33fe27edab5b_476x328.png 1272w, https://substackcdn.com/image/fetch/$s_!VZGY!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F44ef386d-f37c-4b7d-bf0a-33fe27edab5b_476x328.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 
0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>With batch processing, Uber doesn&#8217;t know whether a driver&#8217;s earnings data has changed. They have to assume that &#8220;data was changed in the last X days&#8221; and reprocess the partitions for all X days to update the driver earnings. 
A small change can cost them a lot of time and resources, for example to re-process a whole month of data.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!OYIP!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F61ea3802-19a6-4246-a7e7-48d010619f8f_410x438.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!OYIP!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F61ea3802-19a6-4246-a7e7-48d010619f8f_410x438.png 424w, https://substackcdn.com/image/fetch/$s_!OYIP!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F61ea3802-19a6-4246-a7e7-48d010619f8f_410x438.png 848w, https://substackcdn.com/image/fetch/$s_!OYIP!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F61ea3802-19a6-4246-a7e7-48d010619f8f_410x438.png 1272w, https://substackcdn.com/image/fetch/$s_!OYIP!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F61ea3802-19a6-4246-a7e7-48d010619f8f_410x438.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!OYIP!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F61ea3802-19a6-4246-a7e7-48d010619f8f_410x438.png" width="410" height="438" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/61ea3802-19a6-4246-a7e7-48d010619f8f_410x438.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:438,&quot;width&quot;:410,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:50421,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://vutr.substack.com/i/160751145?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F61ea3802-19a6-4246-a7e7-48d010619f8f_410x438.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!OYIP!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F61ea3802-19a6-4246-a7e7-48d010619f8f_410x438.png 424w, https://substackcdn.com/image/fetch/$s_!OYIP!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F61ea3802-19a6-4246-a7e7-48d010619f8f_410x438.png 848w, https://substackcdn.com/image/fetch/$s_!OYIP!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F61ea3802-19a6-4246-a7e7-48d010619f8f_410x438.png 1272w, https://substackcdn.com/image/fetch/$s_!OYIP!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F61ea3802-19a6-4246-a7e7-48d010619f8f_410x438.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 
0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>In another use case, merchants can update their menus whenever needed, and Uber has to ensure these changes are reflected in the Uber Eats app. 
For a given day, Uber observed 408 million delta changes compared to 11 billion total entities, roughly 3.7%.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!M-QQ!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F66a106d9-0d2e-40af-9cda-b3db0f54a4ac_396x288.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!M-QQ!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F66a106d9-0d2e-40af-9cda-b3db0f54a4ac_396x288.png 424w, https://substackcdn.com/image/fetch/$s_!M-QQ!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F66a106d9-0d2e-40af-9cda-b3db0f54a4ac_396x288.png 848w, https://substackcdn.com/image/fetch/$s_!M-QQ!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F66a106d9-0d2e-40af-9cda-b3db0f54a4ac_396x288.png 1272w, https://substackcdn.com/image/fetch/$s_!M-QQ!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F66a106d9-0d2e-40af-9cda-b3db0f54a4ac_396x288.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!M-QQ!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F66a106d9-0d2e-40af-9cda-b3db0f54a4ac_396x288.png" width="396" height="288" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/66a106d9-0d2e-40af-9cda-b3db0f54a4ac_396x288.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:288,&quot;width&quot;:396,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!M-QQ!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F66a106d9-0d2e-40af-9cda-b3db0f54a4ac_396x288.png 424w, https://substackcdn.com/image/fetch/$s_!M-QQ!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F66a106d9-0d2e-40af-9cda-b3db0f54a4ac_396x288.png 848w, https://substackcdn.com/image/fetch/$s_!M-QQ!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F66a106d9-0d2e-40af-9cda-b3db0f54a4ac_396x288.png 1272w, https://substackcdn.com/image/fetch/$s_!M-QQ!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F66a106d9-0d2e-40af-9cda-b3db0f54a4ac_396x288.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" 
xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>The batch approach runs into the same problem as the use case above: even a small fraction of updates forces the pipeline to rerun over a large amount of data, wasting time and resources and leading to data freshness SLA violations.</p><p>What if they could extract only the update (e.g., the event where the rider tipped $1) and apply it to the target table?</p><div><hr></div><h2>Apache Hudi</h2><p>To bring incremental processing to the lakehouse, Uber developed the Apache Hudi table format. In the scope of this article, I won&#8217;t dive deep into the story behind Hudi. 
If you want to read more about its story and features, check out my previous article:</p><div class="digest-post-embed" data-attrs="{&quot;nodeId&quot;:&quot;ec5fb449-9653-4b01-a24b-48a008a8eb9e&quot;,&quot;caption&quot;:&quot;My name is Vu Trinh, and I am a data engineer.&quot;,&quot;cta&quot;:null,&quot;showBylines&quot;:true,&quot;size&quot;:&quot;sm&quot;,&quot;isEditorNode&quot;:true,&quot;title&quot;:&quot;I spent 5 hours exploring the story behind Apache Hudi.&quot;,&quot;publishedBylines&quot;:[{&quot;id&quot;:167177248,&quot;name&quot;:&quot;Vu Trinh&quot;,&quot;bio&quot;:&quot;My mom read my articles to support her son. Now, she can design a data architecture and write ETL scripts. &quot;,&quot;photo_url&quot;:&quot;https://substackcdn.com/image/fetch/f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4805f673-db97-4f7c-85c4-44b345a8de80_256x256.png&quot;,&quot;is_guest&quot;:false,&quot;bestseller_tier&quot;:null}],&quot;post_date&quot;:&quot;2024-10-08T11:00:55.420Z&quot;,&quot;cover_image&quot;:&quot;https://substackcdn.com/image/fetch/f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F580ba064-c120-4b4a-866f-c8da4c754c1c_2000x1429.png&quot;,&quot;cover_image_alt&quot;:null,&quot;canonical_url&quot;:&quot;https://vutr.substack.com/p/i-spent-5-hours-exploring-the-story&quot;,&quot;section_name&quot;:&quot;Dimensions.&quot;,&quot;video_upload_id&quot;:null,&quot;id&quot;:149755728,&quot;type&quot;:&quot;newsletter&quot;,&quot;reaction_count&quot;:11,&quot;comment_count&quot;:0,&quot;publication_id&quot;:null,&quot;publication_name&quot;:&quot;VuTrinh.&quot;,&quot;publication_logo_url&quot;:&quot;https://substackcdn.com/image/fetch/f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffb91e59e-4df2-4ac8-b3b4-1f4e830f7b42_256x256.png&quot;,&quot;belowTheFold&quot;:true,&quot;youtube_
url&quot;:null,&quot;show_links&quot;:null,&quot;feed_url&quot;:null}"></div><p>In short, Hudi&#8217;s design differs markedly from that of Iceberg or Delta Lake. Its ultimate goal is the one you see over and over again in this article: processing data incrementally as efficiently as possible. To achieve this, Hudi has several characteristics that we need to be aware of:</p><ul><li><p><strong>Two file formats</strong>: The <strong>base files</strong> store the table&#8217;s records. To optimize data reading, Hudi uses a column-oriented file format (e.g., Parquet) for the base files. The <strong>log files</strong> capture changes to records on top of their associated base file. To optimize data writing, Hudi uses a row-oriented file format (e.g., Avro) for the log files.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!8j5l!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F26fddf5a-45df-49bf-94f6-ab736e57a299_430x406.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!8j5l!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F26fddf5a-45df-49bf-94f6-ab736e57a299_430x406.png 424w, https://substackcdn.com/image/fetch/$s_!8j5l!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F26fddf5a-45df-49bf-94f6-ab736e57a299_430x406.png 848w, https://substackcdn.com/image/fetch/$s_!8j5l!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F26fddf5a-45df-49bf-94f6-ab736e57a299_430x406.png 1272w, 
https://substackcdn.com/image/fetch/$s_!8j5l!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F26fddf5a-45df-49bf-94f6-ab736e57a299_430x406.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!8j5l!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F26fddf5a-45df-49bf-94f6-ab736e57a299_430x406.png" width="430" height="406" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/26fddf5a-45df-49bf-94f6-ab736e57a299_430x406.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:406,&quot;width&quot;:430,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:46058,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://vutr.substack.com/i/160751145?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F26fddf5a-45df-49bf-94f6-ab736e57a299_430x406.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!8j5l!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F26fddf5a-45df-49bf-94f6-ab736e57a299_430x406.png 424w, https://substackcdn.com/image/fetch/$s_!8j5l!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F26fddf5a-45df-49bf-94f6-ab736e57a299_430x406.png 848w, https://substackcdn.com/image/fetch/$s_!8j5l!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F26fddf5a-45df-49bf-94f6-ab736e57a299_430x406.png 1272w, 
https://substackcdn.com/image/fetch/$s_!8j5l!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F26fddf5a-45df-49bf-94f6-ab736e57a299_430x406.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div></li><li><p><strong>Timeline</strong>: Hudi Timeline records all actions performed on the table at different times, which helps provide instantaneous views of the table while also efficiently supporting data retrieval in the order of arrival.</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" 
href="https://substackcdn.com/image/fetch/$s_!RYDn!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc14ab00c-314d-48ff-baa2-1a635f436f47_564x236.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!RYDn!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc14ab00c-314d-48ff-baa2-1a635f436f47_564x236.png 424w, https://substackcdn.com/image/fetch/$s_!RYDn!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc14ab00c-314d-48ff-baa2-1a635f436f47_564x236.png 848w, https://substackcdn.com/image/fetch/$s_!RYDn!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc14ab00c-314d-48ff-baa2-1a635f436f47_564x236.png 1272w, https://substackcdn.com/image/fetch/$s_!RYDn!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc14ab00c-314d-48ff-baa2-1a635f436f47_564x236.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!RYDn!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc14ab00c-314d-48ff-baa2-1a635f436f47_564x236.png" width="564" height="236" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/c14ab00c-314d-48ff-baa2-1a635f436f47_564x236.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:236,&quot;width&quot;:564,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:28850,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://vutr.substack.com/i/160751145?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc14ab00c-314d-48ff-baa2-1a635f436f47_564x236.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!RYDn!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc14ab00c-314d-48ff-baa2-1a635f436f47_564x236.png 424w, https://substackcdn.com/image/fetch/$s_!RYDn!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc14ab00c-314d-48ff-baa2-1a635f436f47_564x236.png 848w, https://substackcdn.com/image/fetch/$s_!RYDn!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc14ab00c-314d-48ff-baa2-1a635f436f47_564x236.png 1272w, https://substackcdn.com/image/fetch/$s_!RYDn!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc14ab00c-314d-48ff-baa2-1a635f436f47_564x236.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a></figure></div></li><li><p><strong>Primary key</strong>: Each record in a Hudi table has a unique identifier called a primary key, consisting of a record key paired with the location of the partition to which the 
record belongs. Using primary keys, Hudi ensures no duplicate records across partitions and enables fast updates and deletes on records. Hudi maintains an index to allow quick record lookups.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!dErX!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8576c88e-62bb-45f3-86ff-c777843cf508_436x310.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!dErX!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8576c88e-62bb-45f3-86ff-c777843cf508_436x310.png 424w, https://substackcdn.com/image/fetch/$s_!dErX!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8576c88e-62bb-45f3-86ff-c777843cf508_436x310.png 848w, https://substackcdn.com/image/fetch/$s_!dErX!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8576c88e-62bb-45f3-86ff-c777843cf508_436x310.png 1272w, https://substackcdn.com/image/fetch/$s_!dErX!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8576c88e-62bb-45f3-86ff-c777843cf508_436x310.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!dErX!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8576c88e-62bb-45f3-86ff-c777843cf508_436x310.png" width="436" height="310" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/8576c88e-62bb-45f3-86ff-c777843cf508_436x310.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:310,&quot;width&quot;:436,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:26920,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://vutr.substack.com/i/160751145?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8576c88e-62bb-45f3-86ff-c777843cf508_436x310.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!dErX!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8576c88e-62bb-45f3-86ff-c777843cf508_436x310.png 424w, https://substackcdn.com/image/fetch/$s_!dErX!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8576c88e-62bb-45f3-86ff-c777843cf508_436x310.png 848w, https://substackcdn.com/image/fetch/$s_!dErX!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8576c88e-62bb-45f3-86ff-c777843cf508_436x310.png 1272w, https://substackcdn.com/image/fetch/$s_!dErX!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8576c88e-62bb-45f3-86ff-c777843cf508_436x310.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 
0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Image created by the author.</figcaption></figure></div></li></ul><div><hr></div><h2>Hudi at Uber</h2><blockquote><p><em>This section will explore in detail how Uber implements Hudi for their Lakehouse.</em></p></blockquote><h3>Data Read</h3><p>Hudi supports these types of queries:</p><ul><li><p><strong>Snapshot</strong>: The queries will see the latest snapshot of the table.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!JEL6!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1336b1d2-3d9b-4124-a199-9b9108dce1bf_308x386.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" 
srcset="https://substackcdn.com/image/fetch/$s_!JEL6!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1336b1d2-3d9b-4124-a199-9b9108dce1bf_308x386.png 424w, https://substackcdn.com/image/fetch/$s_!JEL6!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1336b1d2-3d9b-4124-a199-9b9108dce1bf_308x386.png 848w, https://substackcdn.com/image/fetch/$s_!JEL6!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1336b1d2-3d9b-4124-a199-9b9108dce1bf_308x386.png 1272w, https://substackcdn.com/image/fetch/$s_!JEL6!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1336b1d2-3d9b-4124-a199-9b9108dce1bf_308x386.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!JEL6!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1336b1d2-3d9b-4124-a199-9b9108dce1bf_308x386.png" width="308" height="386" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/1336b1d2-3d9b-4124-a199-9b9108dce1bf_308x386.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:386,&quot;width&quot;:308,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:17754,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://vutr.substack.com/i/160751145?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1336b1d2-3d9b-4124-a199-9b9108dce1bf_308x386.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" 
srcset="https://substackcdn.com/image/fetch/$s_!JEL6!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1336b1d2-3d9b-4124-a199-9b9108dce1bf_308x386.png 424w, https://substackcdn.com/image/fetch/$s_!JEL6!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1336b1d2-3d9b-4124-a199-9b9108dce1bf_308x386.png 848w, https://substackcdn.com/image/fetch/$s_!JEL6!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1336b1d2-3d9b-4124-a199-9b9108dce1bf_308x386.png 1272w, https://substackcdn.com/image/fetch/$s_!JEL6!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1336b1d2-3d9b-4124-a199-9b9108dce1bf_308x386.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" 
stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div></li><li><p><strong>Time Travel</strong>: The queries will read a snapshot of the past.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!kHeN!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F252857d5-fac4-4a2f-b7ce-33298ef6e5b5_350x378.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!kHeN!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F252857d5-fac4-4a2f-b7ce-33298ef6e5b5_350x378.png 424w, https://substackcdn.com/image/fetch/$s_!kHeN!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F252857d5-fac4-4a2f-b7ce-33298ef6e5b5_350x378.png 848w, https://substackcdn.com/image/fetch/$s_!kHeN!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F252857d5-fac4-4a2f-b7ce-33298ef6e5b5_350x378.png 1272w, https://substackcdn.com/image/fetch/$s_!kHeN!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F252857d5-fac4-4a2f-b7ce-33298ef6e5b5_350x378.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!kHeN!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F252857d5-fac4-4a2f-b7ce-33298ef6e5b5_350x378.png" 
width="350" height="378" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/252857d5-fac4-4a2f-b7ce-33298ef6e5b5_350x378.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:378,&quot;width&quot;:350,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:18702,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://vutr.substack.com/i/160751145?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F252857d5-fac4-4a2f-b7ce-33298ef6e5b5_350x378.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!kHeN!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F252857d5-fac4-4a2f-b7ce-33298ef6e5b5_350x378.png 424w, https://substackcdn.com/image/fetch/$s_!kHeN!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F252857d5-fac4-4a2f-b7ce-33298ef6e5b5_350x378.png 848w, https://substackcdn.com/image/fetch/$s_!kHeN!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F252857d5-fac4-4a2f-b7ce-33298ef6e5b5_350x378.png 1272w, https://substackcdn.com/image/fetch/$s_!kHeN!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F252857d5-fac4-4a2f-b7ce-33298ef6e5b5_350x378.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" 
width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div></li><li><p><strong>Read Optimized</strong>: This one is similar to the snapshot query but performs better because Hudi reads only the columnar base files and skips the log files, so the results may lag behind the latest writes. 
</p></li><li><p><strong>Incremental (Latest State)</strong>: The queries only return new data written since an instant on the timeline.</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!L5Sa!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F26353abd-cf32-4e67-8719-87630d3ad6a8_504x222.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!L5Sa!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F26353abd-cf32-4e67-8719-87630d3ad6a8_504x222.png 424w, https://substackcdn.com/image/fetch/$s_!L5Sa!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F26353abd-cf32-4e67-8719-87630d3ad6a8_504x222.png 848w, https://substackcdn.com/image/fetch/$s_!L5Sa!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F26353abd-cf32-4e67-8719-87630d3ad6a8_504x222.png 1272w, https://substackcdn.com/image/fetch/$s_!L5Sa!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F26353abd-cf32-4e67-8719-87630d3ad6a8_504x222.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!L5Sa!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F26353abd-cf32-4e67-8719-87630d3ad6a8_504x222.png" width="504" height="222" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/26353abd-cf32-4e67-8719-87630d3ad6a8_504x222.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:222,&quot;width&quot;:504,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:38442,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://vutr.substack.com/i/160751145?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F26353abd-cf32-4e67-8719-87630d3ad6a8_504x222.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!L5Sa!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F26353abd-cf32-4e67-8719-87630d3ad6a8_504x222.png 424w, https://substackcdn.com/image/fetch/$s_!L5Sa!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F26353abd-cf32-4e67-8719-87630d3ad6a8_504x222.png 848w, https://substackcdn.com/image/fetch/$s_!L5Sa!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F26353abd-cf32-4e67-8719-87630d3ad6a8_504x222.png 1272w, https://substackcdn.com/image/fetch/$s_!L5Sa!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F26353abd-cf32-4e67-8719-87630d3ad6a8_504x222.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a></figure></div><blockquote><p><em>A <strong>Hudi instant</strong> is a point-in-time marker in Apache Hudi&#8217;s timeline that captures a single atomic action (such as a data commit or 
compaction)</em></p></blockquote></li><li><p><strong>Incremental (CDC): </strong>This is a variant of the <strong>Incremental </strong>query that provides database-like change data capture log streams out of Hudi tables.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!nl44!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F720e18e4-4331-44ae-99cb-781aa9c5c702_428x268.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!nl44!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F720e18e4-4331-44ae-99cb-781aa9c5c702_428x268.png 424w, https://substackcdn.com/image/fetch/$s_!nl44!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F720e18e4-4331-44ae-99cb-781aa9c5c702_428x268.png 848w, https://substackcdn.com/image/fetch/$s_!nl44!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F720e18e4-4331-44ae-99cb-781aa9c5c702_428x268.png 1272w, https://substackcdn.com/image/fetch/$s_!nl44!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F720e18e4-4331-44ae-99cb-781aa9c5c702_428x268.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!nl44!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F720e18e4-4331-44ae-99cb-781aa9c5c702_428x268.png" width="428" height="268" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/720e18e4-4331-44ae-99cb-781aa9c5c702_428x268.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:268,&quot;width&quot;:428,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:23027,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://vutr.substack.com/i/160751145?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F720e18e4-4331-44ae-99cb-781aa9c5c702_428x268.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!nl44!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F720e18e4-4331-44ae-99cb-781aa9c5c702_428x268.png 424w, https://substackcdn.com/image/fetch/$s_!nl44!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F720e18e4-4331-44ae-99cb-781aa9c5c702_428x268.png 848w, https://substackcdn.com/image/fetch/$s_!nl44!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F720e18e4-4331-44ae-99cb-781aa9c5c702_428x268.png 1272w, https://substackcdn.com/image/fetch/$s_!nl44!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F720e18e4-4331-44ae-99cb-781aa9c5c702_428x268.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 
0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div></li></ul><p>Uber uses <strong>Incremental (Latest State)</strong> most of the time to handle many types of reads and joins with Hudi:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!8Ciy!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F11ce9e44-1074-4060-8f6f-a8d81d554b3b_532x424.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!8Ciy!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F11ce9e44-1074-4060-8f6f-a8d81d554b3b_532x424.png 424w, 
https://substackcdn.com/image/fetch/$s_!8Ciy!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F11ce9e44-1074-4060-8f6f-a8d81d554b3b_532x424.png 848w, https://substackcdn.com/image/fetch/$s_!8Ciy!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F11ce9e44-1074-4060-8f6f-a8d81d554b3b_532x424.png 1272w, https://substackcdn.com/image/fetch/$s_!8Ciy!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F11ce9e44-1074-4060-8f6f-a8d81d554b3b_532x424.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!8Ciy!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F11ce9e44-1074-4060-8f6f-a8d81d554b3b_532x424.png" width="532" height="424" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/11ce9e44-1074-4060-8f6f-a8d81d554b3b_532x424.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:424,&quot;width&quot;:532,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:37324,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://vutr.substack.com/i/160751145?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F11ce9e44-1074-4060-8f6f-a8d81d554b3b_532x424.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!8Ciy!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F11ce9e44-1074-4060-8f6f-a8d81d554b3b_532x424.png 424w, 
https://substackcdn.com/image/fetch/$s_!8Ciy!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F11ce9e44-1074-4060-8f6f-a8d81d554b3b_532x424.png 848w, https://substackcdn.com/image/fetch/$s_!8Ciy!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F11ce9e44-1074-4060-8f6f-a8d81d554b3b_532x424.png 1272w, https://substackcdn.com/image/fetch/$s_!8Ciy!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F11ce9e44-1074-4060-8f6f-a8d81d554b3b_532x424.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line 
x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><ul><li><p><strong>Incremental update from a single table</strong>: Uber reads data incrementally from the Hudi source table and uses this data to update the target table.</p></li><li><p><strong>Consolidation from single table incremental update and other raw tables</strong>: To prepare the updated data for the target table, Uber reads data incrementally from the Hudi source table and performs a left outer join on other raw data tables with T-24 hr incremental pull data.</p></li><li><p><strong>Consolidation from single table incremental update and other derived and lookup tables</strong>: Uber reads data incrementally from the Hudi source table and performs a left outer join on other derived tables with only the affected partitions.</p></li><li><p><strong>Backfilling</strong>: Uber leverages Hudi&#8217;s snapshot read on single or multiple source tables using the backfill start and end date boundaries.</p></li></ul><h3>Data Write</h3><p>In Hudi, write operations can be classified into two types:</p><ul><li><p><strong>Incremental</strong>: Hudi applies only incremental changes to the table/partition.</p></li><li><p><strong>Batch</strong>: Hudi overwrites entire tables and/or partitions every few hours.</p></li></ul><p>Within each type, Hudi further divides writes into these operations:</p><ul><li><p><strong>Upsert (Incremental): </strong>Hudi first looks up the index to tag each record as an insert (new) or an update (existing). Hudi then determines how to pack the records into storage. 
The target table will never show duplicates.</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!On9Z!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5356614c-7261-4c47-ad4a-e13add8fb8e8_454x194.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!On9Z!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5356614c-7261-4c47-ad4a-e13add8fb8e8_454x194.png 424w, https://substackcdn.com/image/fetch/$s_!On9Z!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5356614c-7261-4c47-ad4a-e13add8fb8e8_454x194.png 848w, https://substackcdn.com/image/fetch/$s_!On9Z!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5356614c-7261-4c47-ad4a-e13add8fb8e8_454x194.png 1272w, https://substackcdn.com/image/fetch/$s_!On9Z!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5356614c-7261-4c47-ad4a-e13add8fb8e8_454x194.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!On9Z!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5356614c-7261-4c47-ad4a-e13add8fb8e8_454x194.png" width="454" height="194" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/5356614c-7261-4c47-ad4a-e13add8fb8e8_454x194.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:194,&quot;width&quot;:454,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:23783,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://vutr.substack.com/i/160751145?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5356614c-7261-4c47-ad4a-e13add8fb8e8_454x194.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!On9Z!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5356614c-7261-4c47-ad4a-e13add8fb8e8_454x194.png 424w, https://substackcdn.com/image/fetch/$s_!On9Z!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5356614c-7261-4c47-ad4a-e13add8fb8e8_454x194.png 848w, https://substackcdn.com/image/fetch/$s_!On9Z!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5356614c-7261-4c47-ad4a-e13add8fb8e8_454x194.png 1272w, https://substackcdn.com/image/fetch/$s_!On9Z!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5356614c-7261-4c47-ad4a-e13add8fb8e8_454x194.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a></figure></div></li><li><p><strong>Insert (Incremental): </strong>This one resembles <strong>Upsert, </strong>but Hudi skips the index-look-up step. 
This option is faster than Upsert; however,  the target table can show duplicates.</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!pRWj!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F25b8be2d-99df-4e99-8a1e-b67ba69f820d_394x182.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!pRWj!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F25b8be2d-99df-4e99-8a1e-b67ba69f820d_394x182.png 424w, https://substackcdn.com/image/fetch/$s_!pRWj!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F25b8be2d-99df-4e99-8a1e-b67ba69f820d_394x182.png 848w, https://substackcdn.com/image/fetch/$s_!pRWj!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F25b8be2d-99df-4e99-8a1e-b67ba69f820d_394x182.png 1272w, https://substackcdn.com/image/fetch/$s_!pRWj!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F25b8be2d-99df-4e99-8a1e-b67ba69f820d_394x182.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!pRWj!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F25b8be2d-99df-4e99-8a1e-b67ba69f820d_394x182.png" width="394" height="182" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/25b8be2d-99df-4e99-8a1e-b67ba69f820d_394x182.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:182,&quot;width&quot;:394,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:19517,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://vutr.substack.com/i/160751145?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F25b8be2d-99df-4e99-8a1e-b67ba69f820d_394x182.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!pRWj!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F25b8be2d-99df-4e99-8a1e-b67ba69f820d_394x182.png 424w, https://substackcdn.com/image/fetch/$s_!pRWj!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F25b8be2d-99df-4e99-8a1e-b67ba69f820d_394x182.png 848w, https://substackcdn.com/image/fetch/$s_!pRWj!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F25b8be2d-99df-4e99-8a1e-b67ba69f820d_394x182.png 1272w, https://substackcdn.com/image/fetch/$s_!pRWj!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F25b8be2d-99df-4e99-8a1e-b67ba69f820d_394x182.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a></figure></div></li><li><p><strong>Delete (Incremental): </strong>Hudi supports two types of deletes on Hudi table data. 
Based on the record key, Hudi can perform a <strong>soft delete, </strong>which retains only the record key and sets all the other fields to null. The other approach is a <strong>hard delete, </strong>which entirely clears all evidence of a record.</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!VGC2!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0466a0b7-2a75-40a7-b0ff-f58aa4c49492_240x220.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!VGC2!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0466a0b7-2a75-40a7-b0ff-f58aa4c49492_240x220.png 424w, https://substackcdn.com/image/fetch/$s_!VGC2!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0466a0b7-2a75-40a7-b0ff-f58aa4c49492_240x220.png 848w, https://substackcdn.com/image/fetch/$s_!VGC2!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0466a0b7-2a75-40a7-b0ff-f58aa4c49492_240x220.png 1272w, https://substackcdn.com/image/fetch/$s_!VGC2!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0466a0b7-2a75-40a7-b0ff-f58aa4c49492_240x220.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!VGC2!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0466a0b7-2a75-40a7-b0ff-f58aa4c49492_240x220.png" width="240" height="220" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/0466a0b7-2a75-40a7-b0ff-f58aa4c49492_240x220.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:220,&quot;width&quot;:240,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:15233,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://vutr.substack.com/i/160751145?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0466a0b7-2a75-40a7-b0ff-f58aa4c49492_240x220.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!VGC2!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0466a0b7-2a75-40a7-b0ff-f58aa4c49492_240x220.png 424w, https://substackcdn.com/image/fetch/$s_!VGC2!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0466a0b7-2a75-40a7-b0ff-f58aa4c49492_240x220.png 848w, https://substackcdn.com/image/fetch/$s_!VGC2!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0466a0b7-2a75-40a7-b0ff-f58aa4c49492_240x220.png 1272w, https://substackcdn.com/image/fetch/$s_!VGC2!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0466a0b7-2a75-40a7-b0ff-f58aa4c49492_240x220.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a></figure></div></li><li><p><strong>Bulk Insert (Batch): </strong>Insert or Upsert keeps data in the memory to speed up computations, which can cause some problems for initial data loading. 
Bulk insert has the same semantics as insert but uses a sort-based data writing algorithm, which scales well for the initial data load.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!iE-9!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0f7c4ad5-92c9-4c8e-8cde-7276554deb50_316x310.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!iE-9!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0f7c4ad5-92c9-4c8e-8cde-7276554deb50_316x310.png 424w, https://substackcdn.com/image/fetch/$s_!iE-9!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0f7c4ad5-92c9-4c8e-8cde-7276554deb50_316x310.png 848w, https://substackcdn.com/image/fetch/$s_!iE-9!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0f7c4ad5-92c9-4c8e-8cde-7276554deb50_316x310.png 1272w, https://substackcdn.com/image/fetch/$s_!iE-9!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0f7c4ad5-92c9-4c8e-8cde-7276554deb50_316x310.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!iE-9!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0f7c4ad5-92c9-4c8e-8cde-7276554deb50_316x310.png" width="316" height="310" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/0f7c4ad5-92c9-4c8e-8cde-7276554deb50_316x310.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:310,&quot;width&quot;:316,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:18895,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://vutr.substack.com/i/160751145?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0f7c4ad5-92c9-4c8e-8cde-7276554deb50_316x310.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!iE-9!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0f7c4ad5-92c9-4c8e-8cde-7276554deb50_316x310.png 424w, https://substackcdn.com/image/fetch/$s_!iE-9!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0f7c4ad5-92c9-4c8e-8cde-7276554deb50_316x310.png 848w, https://substackcdn.com/image/fetch/$s_!iE-9!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0f7c4ad5-92c9-4c8e-8cde-7276554deb50_316x310.png 1272w, https://substackcdn.com/image/fetch/$s_!iE-9!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0f7c4ad5-92c9-4c8e-8cde-7276554deb50_316x310.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 
0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div></li><li><p><strong>Insert Overwrite (Batch): </strong>Hudi will rewrite all the partitions that are present in the input.</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!CEg2!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F16dd6dc2-1d85-4721-8871-1ea442b32925_512x216.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!CEg2!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F16dd6dc2-1d85-4721-8871-1ea442b32925_512x216.png 424w, 
https://substackcdn.com/image/fetch/$s_!CEg2!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F16dd6dc2-1d85-4721-8871-1ea442b32925_512x216.png 848w, https://substackcdn.com/image/fetch/$s_!CEg2!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F16dd6dc2-1d85-4721-8871-1ea442b32925_512x216.png 1272w, https://substackcdn.com/image/fetch/$s_!CEg2!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F16dd6dc2-1d85-4721-8871-1ea442b32925_512x216.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!CEg2!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F16dd6dc2-1d85-4721-8871-1ea442b32925_512x216.png" width="512" height="216" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/16dd6dc2-1d85-4721-8871-1ea442b32925_512x216.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:216,&quot;width&quot;:512,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:30976,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://vutr.substack.com/i/160751145?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F16dd6dc2-1d85-4721-8871-1ea442b32925_512x216.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!CEg2!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F16dd6dc2-1d85-4721-8871-1ea442b32925_512x216.png 424w, 
https://substackcdn.com/image/fetch/$s_!CEg2!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F16dd6dc2-1d85-4721-8871-1ea442b32925_512x216.png 848w, https://substackcdn.com/image/fetch/$s_!CEg2!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F16dd6dc2-1d85-4721-8871-1ea442b32925_512x216.png 1272w, https://substackcdn.com/image/fetch/$s_!CEg2!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F16dd6dc2-1d85-4721-8871-1ea442b32925_512x216.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a></figure></div></li><li><p><strong>Insert Overwrite Table (Batch)</strong>: Hudi will rewrite the whole table.</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!y9dA!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F69aaa51c-8b4e-4669-bd6b-86d40b521fbf_396x146.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!y9dA!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F69aaa51c-8b4e-4669-bd6b-86d40b521fbf_396x146.png 424w, https://substackcdn.com/image/fetch/$s_!y9dA!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F69aaa51c-8b4e-4669-bd6b-86d40b521fbf_396x146.png 848w, https://substackcdn.com/image/fetch/$s_!y9dA!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F69aaa51c-8b4e-4669-bd6b-86d40b521fbf_396x146.png 1272w, 
https://substackcdn.com/image/fetch/$s_!y9dA!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F69aaa51c-8b4e-4669-bd6b-86d40b521fbf_396x146.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!y9dA!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F69aaa51c-8b4e-4669-bd6b-86d40b521fbf_396x146.png" width="396" height="146" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/69aaa51c-8b4e-4669-bd6b-86d40b521fbf_396x146.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:146,&quot;width&quot;:396,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!y9dA!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F69aaa51c-8b4e-4669-bd6b-86d40b521fbf_396x146.png 424w, https://substackcdn.com/image/fetch/$s_!y9dA!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F69aaa51c-8b4e-4669-bd6b-86d40b521fbf_396x146.png 848w, https://substackcdn.com/image/fetch/$s_!y9dA!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F69aaa51c-8b4e-4669-bd6b-86d40b521fbf_396x146.png 1272w, 
https://substackcdn.com/image/fetch/$s_!y9dA!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F69aaa51c-8b4e-4669-bd6b-86d40b521fbf_396x146.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a></figure></div></li></ul><p>To write data to Hudi tables, Uber has to handle it differently based on whether the table is partitioned or not:</p><blockquote><p><em>Hudi stores data files under partition paths for partitioned tables (like Hive table) or under the base path for non-partitioned tables. For example, Hudi organizes table_1, partitioned by date, in folders like table_1/date=2025-04-01, table_1/date=2025-04-02,&#8230;.For non-partitioned tables, Hudi stores it using only the base path: table_2/.</em></p></blockquote><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!o_mK!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F618ef1c0-109d-4299-81d8-039a42edee4e_350x298.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!o_mK!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F618ef1c0-109d-4299-81d8-039a42edee4e_350x298.png 424w, https://substackcdn.com/image/fetch/$s_!o_mK!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F618ef1c0-109d-4299-81d8-039a42edee4e_350x298.png 848w, https://substackcdn.com/image/fetch/$s_!o_mK!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F618ef1c0-109d-4299-81d8-039a42edee4e_350x298.png 1272w, 
https://substackcdn.com/image/fetch/$s_!o_mK!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F618ef1c0-109d-4299-81d8-039a42edee4e_350x298.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!o_mK!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F618ef1c0-109d-4299-81d8-039a42edee4e_350x298.png" width="350" height="298" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/618ef1c0-109d-4299-81d8-039a42edee4e_350x298.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:298,&quot;width&quot;:350,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:32826,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://vutr.substack.com/i/160751145?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F618ef1c0-109d-4299-81d8-039a42edee4e_350x298.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!o_mK!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F618ef1c0-109d-4299-81d8-039a42edee4e_350x298.png 424w, https://substackcdn.com/image/fetch/$s_!o_mK!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F618ef1c0-109d-4299-81d8-039a42edee4e_350x298.png 848w, https://substackcdn.com/image/fetch/$s_!o_mK!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F618ef1c0-109d-4299-81d8-039a42edee4e_350x298.png 1272w, 
https://substackcdn.com/image/fetch/$s_!o_mK!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F618ef1c0-109d-4299-81d8-039a42edee4e_350x298.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><ul><li><p><strong>Partitioned tables: </strong>Uber uses upserts to apply the incremental updates. For backfilling, they use insert_overwrite to rewrite the affected partition. For non-incremental columns, they use Spark SQL targeted merge/update statements. 
</p><blockquote><p><em>A non-incremental column is any column whose updates do not drive how a record changes over time from the perspective of incremental data loads (e.g., a restaurant that was located in Las Vegas last year and later moved to New York).</em></p></blockquote></li><li><p><strong>Non-partitioned tables: </strong>Uber also uses upserts to apply the incremental updates. To update both incremental and non-incremental columns, they full-outer-join the incremental rows with the target table and write the result with insert_overwrite.</p></li></ul><h3><strong>The actual implementation</strong></h3><p>Uber runs its incremental ETL pipelines on Hudi, Spark, and Piper, its internal workflow manager (think Airflow). They built a Spark ETL framework to manage ETL pipelines at scale, powered by DeltaStreamer, Hudi&#8217;s incremental data processing tool.</p><blockquote><p><em>Uber initially contributed to <a href="https://github.com/apache/hudi/tree/master/hudi-utilities/src/main/java/org/apache/hudi/utilities/deltastreamer">DeltaStreamer</a>, and many organizations have used it to streamline incremental data processing with Hudi. <a href="https://hudi.apache.org/docs/hoodie_streaming_ingestion/#hudi-streamer">In more detail</a>, the tool provides ways to ingest from different sources, such as Kafka.</em></p></blockquote><p>The Spark ETL framework abstracts away the complexity and lets users configure how their pipeline should run in a few simple steps. 
Users must give the framework a few inputs: the table definition, the DeltaStreamer YAML configs, and the SQL or Java/Scala transformation.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!fhcB!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fba0dc2ca-23a0-4d0b-b7ee-86e222f18a3d_432x506.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!fhcB!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fba0dc2ca-23a0-4d0b-b7ee-86e222f18a3d_432x506.png 424w, https://substackcdn.com/image/fetch/$s_!fhcB!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fba0dc2ca-23a0-4d0b-b7ee-86e222f18a3d_432x506.png 848w, https://substackcdn.com/image/fetch/$s_!fhcB!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fba0dc2ca-23a0-4d0b-b7ee-86e222f18a3d_432x506.png 1272w, https://substackcdn.com/image/fetch/$s_!fhcB!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fba0dc2ca-23a0-4d0b-b7ee-86e222f18a3d_432x506.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!fhcB!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fba0dc2ca-23a0-4d0b-b7ee-86e222f18a3d_432x506.png" width="432" height="506" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/ba0dc2ca-23a0-4d0b-b7ee-86e222f18a3d_432x506.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:506,&quot;width&quot;:432,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:57190,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://vutr.substack.com/i/160751145?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fba0dc2ca-23a0-4d0b-b7ee-86e222f18a3d_432x506.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!fhcB!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fba0dc2ca-23a0-4d0b-b7ee-86e222f18a3d_432x506.png 424w, https://substackcdn.com/image/fetch/$s_!fhcB!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fba0dc2ca-23a0-4d0b-b7ee-86e222f18a3d_432x506.png 848w, https://substackcdn.com/image/fetch/$s_!fhcB!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fba0dc2ca-23a0-4d0b-b7ee-86e222f18a3d_432x506.png 1272w, https://substackcdn.com/image/fetch/$s_!fhcB!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fba0dc2ca-23a0-4d0b-b7ee-86e222f18a3d_432x506.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 
0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><ul><li><p><strong>Table definition</strong>: A DDL file that declares the table&#8217;s schema and specifies the Apache Hudi format.</p></li><li><p><strong>DeltaStreamer YAML configs: </strong>This file provides the configurations that the DeltaStreamer Spark application expects. One important setting is <code>hoodie.datasource.write.recordkey.field</code>, which declares the target table&#8217;s primary key. As mentioned, Hudi uses the primary key to deduplicate data (with the upsert write operation). Another is <code>hoodie.datasource.write.operation</code>, which expects one of the values listed in the &#8220;Data Write&#8221; section above.</p></li><li><p><strong>Transformation logic</strong>: The user provides a file with the SQL transformation logic, which DeltaStreamer executes using Spark SQL. 
Users must specify the incremental source from which DeltaStreamer performs the incremental read. The tool reads from the latest checkpoint in the target table&#8217;s Hudi metadata to capture the new data. For more advanced use cases, users can express the transformation logic in Spark using Scala or Java.</p></li></ul><div><hr></div><h2>Impact</h2><h3><strong>Performance and Cost Savings</strong></h3><p>By migrating all the batch ETL pipelines to the incremental solution with Hudi, Uber cut pipeline run time by 50%. I captured the table from Uber&#8217;s article to show how efficient the new solution is:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!MFmH!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3bcc4461-8220-482c-9042-7c915be4deec_1138x936.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!MFmH!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3bcc4461-8220-482c-9042-7c915be4deec_1138x936.png 424w, https://substackcdn.com/image/fetch/$s_!MFmH!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3bcc4461-8220-482c-9042-7c915be4deec_1138x936.png 848w, https://substackcdn.com/image/fetch/$s_!MFmH!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3bcc4461-8220-482c-9042-7c915be4deec_1138x936.png 1272w, https://substackcdn.com/image/fetch/$s_!MFmH!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3bcc4461-8220-482c-9042-7c915be4deec_1138x936.png 1456w" 
sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!MFmH!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3bcc4461-8220-482c-9042-7c915be4deec_1138x936.png" width="1138" height="936" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/3bcc4461-8220-482c-9042-7c915be4deec_1138x936.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:936,&quot;width&quot;:1138,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:128807,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://vutr.substack.com/i/160751145?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3bcc4461-8220-482c-9042-7c915be4deec_1138x936.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!MFmH!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3bcc4461-8220-482c-9042-7c915be4deec_1138x936.png 424w, https://substackcdn.com/image/fetch/$s_!MFmH!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3bcc4461-8220-482c-9042-7c915be4deec_1138x936.png 848w, https://substackcdn.com/image/fetch/$s_!MFmH!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3bcc4461-8220-482c-9042-7c915be4deec_1138x936.png 1272w, https://substackcdn.com/image/fetch/$s_!MFmH!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3bcc4461-8220-482c-9042-7c915be4deec_1138x936.png 
1456w" sizes="100vw" loading="lazy"></picture></div></a><figcaption class="image-caption">Image captured from the article Setting Uber&#8217;s Transactional Data Lake in Motion with Incremental ETL Using Apache Hudi (2023). <a href="https://www.uber.com/en-VN/blog/ubers-lakehouse-architecture/">Source</a></figcaption></figure></div><p>For the Dimensional Driver Table, the incremental pipeline used 59.06% fewer CPU cores and 73.01% less memory than the batch ETL approach. 
Previously, the pipeline took roughly 3.7 hours to finish; the incremental pipeline finishes in 39 minutes.</p><h3>Data Consistency</h3><p>To achieve high availability, Uber stores data redundantly across multiple data centers. Strong data consistency across tables in different data centers is therefore critical to Uber&#8217;s business operations.</p><p>Hudi helps Uber consistently replicate data across data lakes in many data centers. After computing the table in the primary data center, Uber replicates the data by using the Hudi metadata to move only the incrementally changed files across data centers.</p><h3>Data Quality</h3><p>Uber implements the <a href="https://vutr.substack.com/p/how-does-netflix-ensure-the-data">write-audit-publish (WAP) pattern</a> with Hudi to prevent low-quality data from entering the production environment. This approach requires users to run SQL-based data quality checks on the data before it gets pushed to the production dataset.</p><h3><strong>Observability</strong></h3><p>Hudi&#8217;s DeltaStreamer <a href="https://github.com/apache/hudi/blob/master/hudi-utilities/src/main/java/org/apache/hudi/utilities/deltastreamer/HoodieDeltaStreamerMetrics.java">outputs valuable metrics</a> that provide insight into the incremental ETL processes. For example, Uber can observe the number of Hudi commits in progress or the total records inserted, updated, or deleted. </p><div><hr></div><h2>Outro</h2><p>Thank you for reading this far. </p><p>In this article, we explored why incremental processing is critical to Uber&#8217;s business and how Uber solves the problems with Apache Hudi.</p><p>For me, Hudi is an exciting table format with many interesting technical designs. Although it is not as widely adopted as Iceberg or Delta Lake, Hudi shines in the <a href="https://vutr.substack.com/p/why-walmart-chose-apache-hudi-for">use cases it was originally designed for</a>.</p><p>Would you like to read more Hudi articles? 
If yes, please let me know in the comment section or leave a reaction to this article.</p><p>Now, it&#8217;s time to say goodbye.</p><p>See you in my following articles.</p><div><hr></div><h2>Reference</h2><p><em>[1] Uber Engineering Blog, <a href="https://www.uber.com/en-VN/blog/ubers-lakehouse-architecture/">Setting Uber&#8217;s Transactional Data Lake in Motion with Incremental ETL Using Apache Hudi</a> (2023)</em></p><p><em>[2] <a href="https://hudi.apache.org/docs/write_operations/">Hudi Write Operations</a></em></p><p><em>[3] <a href="https://hudi.apache.org/docs/table_types">Table &amp; Query Types</a></em></p>]]></content:encoded></item><item><title><![CDATA[How did Meta modernize their lakehouse?]]></title><description><![CDATA[The new approach enabled Meta to innovate faster.]]></description><link>https://vutr.substack.com/p/how-did-meta-modernize-their-lakehouse</link><guid isPermaLink="false">https://vutr.substack.com/p/how-did-meta-modernize-their-lakehouse</guid><dc:creator><![CDATA[Vu Trinh]]></dc:creator><pubDate>Thu, 03 Apr 2025 03:15:43 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!MsmX!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe656b89e-865d-467d-aeea-7f4a28f7ef67_2000x1428.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<blockquote><p>To celebrate Lunar New Year (the true New Year holiday in Vietnam), I&#8217;m offering <em><strong>50% off the annual subscription</strong></em>. 
The offer ends soon; grab it now to get full access to nearly 200 high-quality data engineering articles.</p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://vutr.substack.com/subscribe&quot;,&quot;text&quot;:&quot;50% off the annual subscription&quot;,&quot;action&quot;:null,&quot;class&quot;:&quot;button-wrapper&quot;}" data-component-name="ButtonCreateButton"><a class="button primary button-wrapper" href="https://vutr.substack.com/subscribe"><span>50% off the annual subscription</span></a></p></blockquote><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!MsmX!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe656b89e-865d-467d-aeea-7f4a28f7ef67_2000x1428.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!MsmX!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe656b89e-865d-467d-aeea-7f4a28f7ef67_2000x1428.png 424w, https://substackcdn.com/image/fetch/$s_!MsmX!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe656b89e-865d-467d-aeea-7f4a28f7ef67_2000x1428.png 848w, https://substackcdn.com/image/fetch/$s_!MsmX!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe656b89e-865d-467d-aeea-7f4a28f7ef67_2000x1428.png 1272w, https://substackcdn.com/image/fetch/$s_!MsmX!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe656b89e-865d-467d-aeea-7f4a28f7ef67_2000x1428.png 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!MsmX!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe656b89e-865d-467d-aeea-7f4a28f7ef67_2000x1428.png" width="1456" height="1040" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/e656b89e-865d-467d-aeea-7f4a28f7ef67_2000x1428.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1040,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:489350,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://vutr.substack.com/i/160061351?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe656b89e-865d-467d-aeea-7f4a28f7ef67_2000x1428.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!MsmX!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe656b89e-865d-467d-aeea-7f4a28f7ef67_2000x1428.png 424w, https://substackcdn.com/image/fetch/$s_!MsmX!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe656b89e-865d-467d-aeea-7f4a28f7ef67_2000x1428.png 848w, https://substackcdn.com/image/fetch/$s_!MsmX!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe656b89e-865d-467d-aeea-7f4a28f7ef67_2000x1428.png 1272w, https://substackcdn.com/image/fetch/$s_!MsmX!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe656b89e-865d-467d-aeea-7f4a28f7ef67_2000x1428.png 1456w" 
sizes="100vw" fetchpriority="high"></picture></div></a><figcaption class="image-caption">Image created by the author.</figcaption></figure></div><div><hr></div><h2>Intro</h2><p>In this article, we will explore how Meta, one of the world&#8217;s biggest tech companies, re-architected its data lakehouse. We will not cover the detailed components of Meta&#8217;s lakehouse. 
Instead, we will see how Meta&#8217;s initial approach caused problems and how the company worked to fix them at organizational scale.</p><p>For this article, I referred to material from the Meta paper released in 2023 called <a href="https://www.cidrdb.org/cidr2023/papers/p77-chattopadhyay.pdf">Shared Foundations: Modernizing Meta&#8217;s Data Lakehouse</a>.</p><div><hr></div><h2>The initial approach and its problems</h2><p>Meta started its data journey about 20 years ago.</p><p>They began bringing the query engines to the data stored in object storage with&nbsp;<a href="https://en.wikipedia.org/wiki/Apache_Hive">Hive</a>&nbsp;in 2010, eleven years before Databricks released the paper introducing the lakehouse architecture.</p><p>Since then, Meta&#8217;s warehouse system has grown from tens to hundreds of petabytes, and in 2023, it reached multiple exabytes. With Hive, Meta&#8217;s high-level warehouse solution can be described as follows:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!Hcdg!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbd7c480d-0f15-40c3-8c3a-1a0b3f1851bf_1912x780.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!Hcdg!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbd7c480d-0f15-40c3-8c3a-1a0b3f1851bf_1912x780.png 424w, https://substackcdn.com/image/fetch/$s_!Hcdg!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbd7c480d-0f15-40c3-8c3a-1a0b3f1851bf_1912x780.png 848w, 
https://substackcdn.com/image/fetch/$s_!Hcdg!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbd7c480d-0f15-40c3-8c3a-1a0b3f1851bf_1912x780.png 1272w, https://substackcdn.com/image/fetch/$s_!Hcdg!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbd7c480d-0f15-40c3-8c3a-1a0b3f1851bf_1912x780.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!Hcdg!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbd7c480d-0f15-40c3-8c3a-1a0b3f1851bf_1912x780.png" width="1456" height="594" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/bd7c480d-0f15-40c3-8c3a-1a0b3f1851bf_1912x780.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:594,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:189809,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://vutr.substack.com/i/160061351?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbd7c480d-0f15-40c3-8c3a-1a0b3f1851bf_1912x780.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!Hcdg!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbd7c480d-0f15-40c3-8c3a-1a0b3f1851bf_1912x780.png 424w, https://substackcdn.com/image/fetch/$s_!Hcdg!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbd7c480d-0f15-40c3-8c3a-1a0b3f1851bf_1912x780.png 
848w, https://substackcdn.com/image/fetch/$s_!Hcdg!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbd7c480d-0f15-40c3-8c3a-1a0b3f1851bf_1912x780.png 1272w, https://substackcdn.com/image/fetch/$s_!Hcdg!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbd7c480d-0f15-40c3-8c3a-1a0b3f1851bf_1912x780.png 1456w" sizes="100vw" loading="lazy"></picture></div></a><figcaption class="image-caption">Image created by the author.</figcaption></figure></div><ul><li><p>They managed data, metadata, and computing 
independently.</p></li><li><p>They stored data in HDFS, which let them scale the storage layer independently from the computing layer. In recent years, Meta has replaced HDFS with an in-house file system called Tectonic, which helped them achieve operational efficiency.</p></li><li><p>They stored metadata in a MySQL database. With Hive, users could store partition information in the Hive Metastore.</p></li><li><p>They stored data in a columnar format. They first created RCFile and later enhanced it into ORC. They also developed an ORC variant called DWRF to better support nested data and encryption.</p></li><li><p>Internal users could bring their favorite compute engine to join the party, from Spark and Presto to Meta&#8217;s deployment of <a href="https://giraph.apache.org/">Giraph</a>, an iterative graph processing system.</p></li></ul><p>However, this architecture caused Meta some problems:</p><ul><li><p>This architecture did not support stream processing. This pushed Meta to build various streaming systems over time, and they were not well integrated with Hive.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!DTBH!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F80925394-9d2c-493a-b5a7-67c4eabd73aa_266x306.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!DTBH!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F80925394-9d2c-493a-b5a7-67c4eabd73aa_266x306.png 424w, https://substackcdn.com/image/fetch/$s_!DTBH!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F80925394-9d2c-493a-b5a7-67c4eabd73aa_266x306.png 848w, 
https://substackcdn.com/image/fetch/$s_!DTBH!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F80925394-9d2c-493a-b5a7-67c4eabd73aa_266x306.png 1272w, https://substackcdn.com/image/fetch/$s_!DTBH!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F80925394-9d2c-493a-b5a7-67c4eabd73aa_266x306.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!DTBH!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F80925394-9d2c-493a-b5a7-67c4eabd73aa_266x306.png" width="266" height="306" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/80925394-9d2c-493a-b5a7-67c4eabd73aa_266x306.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:306,&quot;width&quot;:266,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:19037,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://vutr.substack.com/i/160061351?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F80925394-9d2c-493a-b5a7-67c4eabd73aa_266x306.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!DTBH!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F80925394-9d2c-493a-b5a7-67c4eabd73aa_266x306.png 424w, https://substackcdn.com/image/fetch/$s_!DTBH!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F80925394-9d2c-493a-b5a7-67c4eabd73aa_266x306.png 848w, 
https://substackcdn.com/image/fetch/$s_!DTBH!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F80925394-9d2c-493a-b5a7-67c4eabd73aa_266x306.png 1272w, https://substackcdn.com/image/fetch/$s_!DTBH!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F80925394-9d2c-493a-b5a7-67c4eabd73aa_266x306.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div></li><li><p>The architecture did not support real-time data ingestion to Hive. 
They ended up using Scuba for this purpose, although it was initially built for log analytics.</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!DbLL!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb3b4a3b7-0aa0-4db0-acff-e29dd640ffea_318x222.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!DbLL!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb3b4a3b7-0aa0-4db0-acff-e29dd640ffea_318x222.png 424w, https://substackcdn.com/image/fetch/$s_!DbLL!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb3b4a3b7-0aa0-4db0-acff-e29dd640ffea_318x222.png 848w, https://substackcdn.com/image/fetch/$s_!DbLL!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb3b4a3b7-0aa0-4db0-acff-e29dd640ffea_318x222.png 1272w, https://substackcdn.com/image/fetch/$s_!DbLL!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb3b4a3b7-0aa0-4db0-acff-e29dd640ffea_318x222.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!DbLL!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb3b4a3b7-0aa0-4db0-acff-e29dd640ffea_318x222.png" width="318" height="222" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/b3b4a3b7-0aa0-4db0-acff-e29dd640ffea_318x222.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:222,&quot;width&quot;:318,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:19368,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://vutr.substack.com/i/160061351?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb3b4a3b7-0aa0-4db0-acff-e29dd640ffea_318x222.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!DbLL!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb3b4a3b7-0aa0-4db0-acff-e29dd640ffea_318x222.png 424w, https://substackcdn.com/image/fetch/$s_!DbLL!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb3b4a3b7-0aa0-4db0-acff-e29dd640ffea_318x222.png 848w, https://substackcdn.com/image/fetch/$s_!DbLL!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb3b4a3b7-0aa0-4db0-acff-e29dd640ffea_318x222.png 1272w, https://substackcdn.com/image/fetch/$s_!DbLL!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb3b4a3b7-0aa0-4db0-acff-e29dd640ffea_318x222.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a></figure></div></li><li><p>There were a lot of programming languages. Most of the data stack in Meta was written in Java, but most of the other systems in Meta use C++. 
Java is also not primarily supported at Meta.</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!K-Jh!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffd9ca203-c681-45e1-8325-612cb670c978_348x166.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!K-Jh!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffd9ca203-c681-45e1-8325-612cb670c978_348x166.png 424w, https://substackcdn.com/image/fetch/$s_!K-Jh!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffd9ca203-c681-45e1-8325-612cb670c978_348x166.png 848w, https://substackcdn.com/image/fetch/$s_!K-Jh!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffd9ca203-c681-45e1-8325-612cb670c978_348x166.png 1272w, https://substackcdn.com/image/fetch/$s_!K-Jh!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffd9ca203-c681-45e1-8325-612cb670c978_348x166.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!K-Jh!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffd9ca203-c681-45e1-8325-612cb670c978_348x166.png" width="348" height="166" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/fd9ca203-c681-45e1-8325-612cb670c978_348x166.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:166,&quot;width&quot;:348,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:27951,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://vutr.substack.com/i/160061351?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffd9ca203-c681-45e1-8325-612cb670c978_348x166.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!K-Jh!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffd9ca203-c681-45e1-8325-612cb670c978_348x166.png 424w, https://substackcdn.com/image/fetch/$s_!K-Jh!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffd9ca203-c681-45e1-8325-612cb670c978_348x166.png 848w, https://substackcdn.com/image/fetch/$s_!K-Jh!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffd9ca203-c681-45e1-8325-612cb670c978_348x166.png 1272w, https://substackcdn.com/image/fetch/$s_!K-Jh!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffd9ca203-c681-45e1-8325-612cb670c978_348x166.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a></figure></div></li><li><p>Hive was too slow for interactive queries. Meta had to create new engines to address this problem. They wrote some in Java and some in C++. 
Even though some engines were written in the same language, they did not share any components, resulting in solution fragmentation.</p></li><li><p>At first, Meta stored data in HDFS storage nodes, mostly using HDDs for local disks. For interactive queries, fetching data from HDDs over the network is slow. To improve query latency, Meta had to develop many interactive query engines with tightly coupled compute and storage. This made the solution fragmentation and data duplication worse.</p></li><li><p>The fragmentation did not stop there. At Meta, there were at least <strong>six</strong> SQL dialects, <strong>three</strong> implementations of Metastore client and ORC codecs, about <strong>twelve</strong> different engines targeting similar workloads, and many copies of the same data in various locations and formats.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!exLy!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6783fbb2-1eea-425a-b278-e5ed8238eda9_542x370.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!exLy!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6783fbb2-1eea-425a-b278-e5ed8238eda9_542x370.png 424w, https://substackcdn.com/image/fetch/$s_!exLy!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6783fbb2-1eea-425a-b278-e5ed8238eda9_542x370.png 848w, https://substackcdn.com/image/fetch/$s_!exLy!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6783fbb2-1eea-425a-b278-e5ed8238eda9_542x370.png 1272w, 
https://substackcdn.com/image/fetch/$s_!exLy!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6783fbb2-1eea-425a-b278-e5ed8238eda9_542x370.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!exLy!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6783fbb2-1eea-425a-b278-e5ed8238eda9_542x370.png" width="542" height="370" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/6783fbb2-1eea-425a-b278-e5ed8238eda9_542x370.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:370,&quot;width&quot;:542,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:37183,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://vutr.substack.com/i/160061351?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6783fbb2-1eea-425a-b278-e5ed8238eda9_542x370.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!exLy!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6783fbb2-1eea-425a-b278-e5ed8238eda9_542x370.png 424w, https://substackcdn.com/image/fetch/$s_!exLy!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6783fbb2-1eea-425a-b278-e5ed8238eda9_542x370.png 848w, https://substackcdn.com/image/fetch/$s_!exLy!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6783fbb2-1eea-425a-b278-e5ed8238eda9_542x370.png 1272w, 
https://substackcdn.com/image/fetch/$s_!exLy!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6783fbb2-1eea-425a-b278-e5ed8238eda9_542x370.png 1456w" sizes="100vw" loading="lazy"></picture></div></a><figcaption class="image-caption">Image created by the author.</figcaption></figure></div></li></ul><p>Meta lacked standardization and reusable components. Engineers carried a heavier operational burden, and users had to deal with different SQL dialects and inconsistent semantics.</p><p>As a result, they couldn&#8217;t put most of their effort into innovation. 
</p><div><hr></div><h2>The new paradigm shift</h2><p>So, how did Meta solve those problems?</p><p>They started an organization-wide effort that Meta called Shared Foundations. Its purpose was to re-architect the data lakehouse.</p><p>The Shared Foundations program involved hundreds of engineers throughout Meta. The program has the following principles:</p><ul><li><p><strong>Using fewer systems</strong>: Systems that serve the same use cases with overlapping functionality should be merged into one system. For example, Meta aimed to have a single query engine for each area: batch, streaming, interactive, and machine learning. </p></li><li><p><strong>Reusable components:&nbsp;</strong>Meta can still provide different compute engines when use cases and requirements are distinct. In those cases, they focused on reusing as many components as possible. For example, interactive and batch engines can share storage encodings or data formats.</p></li><li><p><strong>Consistent APIs</strong>: Consistent APIs lower the learning curve for users and make integrating components more straightforward, paving the way for modularization and reusability.</p></li></ul><p>With these principles, Meta aimed to achieve:</p><ul><li><p><strong>Engineering efficiency</strong>: Engineers can work on a smaller number of systems. The principles also reduce duplication and keep teams from re-inventing the wheel. </p></li><li><p><strong>Faster innovation</strong>: Having fewer systems means less operational burden. 
This allows Meta to focus on new features and other improvements.</p></li><li><p><strong>Better user experience: </strong>End users can expect consistent syntax, features, and semantics across systems, lowering the barrier to using these systems and increasing productivity.</p></li></ul><p>Meta implemented the Shared Foundations in areas such as storage, metadata, execution, language, and engine.</p><div><hr></div><h3>Compute Engine</h3><p>As mentioned, internal teams built different query engines to adapt to different workloads and performance requirements. </p><p>Presto, Raptor, Cubrick and Scuba for interactive queries.</p><p>Presto and Spark for batch execution.</p><p>Puma, Stylus, XStream, and MRT for stream processing.</p><p>Let&#8217;s dive into each area.</p><p><strong>For the interactive engines</strong>, the ideal one would have the best features from Presto, Raptor, Cubrick, and Scuba. 
This engine should provide:</p><ul><li><p>Full SQL support, complex queries, and data models.</p></li><li><p>The ability to process data directly on the lakehouse.</p></li><li><p>Low-latency performance, achieved by managing data in memory or on SSD.</p></li><li><p>Support for real-time data.</p></li></ul><p>In the end, Meta built the converged engine on Presto because it met most of the requirements above. Meta closed the performance gap between Presto and the other interactive systems through local caching.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!iY5o!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9a500f18-5015-4c7c-adf3-51bd8f28fdbd_518x296.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!iY5o!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9a500f18-5015-4c7c-adf3-51bd8f28fdbd_518x296.png 424w, https://substackcdn.com/image/fetch/$s_!iY5o!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9a500f18-5015-4c7c-adf3-51bd8f28fdbd_518x296.png 848w, https://substackcdn.com/image/fetch/$s_!iY5o!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9a500f18-5015-4c7c-adf3-51bd8f28fdbd_518x296.png 1272w, https://substackcdn.com/image/fetch/$s_!iY5o!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9a500f18-5015-4c7c-adf3-51bd8f28fdbd_518x296.png 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!iY5o!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9a500f18-5015-4c7c-adf3-51bd8f28fdbd_518x296.png" width="518" height="296" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/9a500f18-5015-4c7c-adf3-51bd8f28fdbd_518x296.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:296,&quot;width&quot;:518,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:56011,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://vutr.substack.com/i/160061351?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9a500f18-5015-4c7c-adf3-51bd8f28fdbd_518x296.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!iY5o!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9a500f18-5015-4c7c-adf3-51bd8f28fdbd_518x296.png 424w, https://substackcdn.com/image/fetch/$s_!iY5o!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9a500f18-5015-4c7c-adf3-51bd8f28fdbd_518x296.png 848w, https://substackcdn.com/image/fetch/$s_!iY5o!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9a500f18-5015-4c7c-adf3-51bd8f28fdbd_518x296.png 1272w, https://substackcdn.com/image/fetch/$s_!iY5o!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9a500f18-5015-4c7c-adf3-51bd8f28fdbd_518x296.png 1456w" sizes="100vw" 
loading="lazy"></picture></div></a><figcaption class="image-caption">Image created by the author.</figcaption></figure></div><p>They developed a smart hierarchical caching mechanism that stores the most frequently used data and metadata in the local memory and SSDs of Presto&#8217;s workers and coordinator.</p><p>This mechanism improved latency by an order of magnitude for most of Meta&#8217;s common interactive query patterns. 
This speedup even exceeded the performance of existing systems, which use less hardware.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!quoX!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fea9db25e-03f9-4a4d-be64-869b6e073ba9_586x542.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!quoX!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fea9db25e-03f9-4a4d-be64-869b6e073ba9_586x542.png 424w, https://substackcdn.com/image/fetch/$s_!quoX!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fea9db25e-03f9-4a4d-be64-869b6e073ba9_586x542.png 848w, https://substackcdn.com/image/fetch/$s_!quoX!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fea9db25e-03f9-4a4d-be64-869b6e073ba9_586x542.png 1272w, https://substackcdn.com/image/fetch/$s_!quoX!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fea9db25e-03f9-4a4d-be64-869b6e073ba9_586x542.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!quoX!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fea9db25e-03f9-4a4d-be64-869b6e073ba9_586x542.png" width="586" height="542" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/ea9db25e-03f9-4a4d-be64-869b6e073ba9_586x542.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:542,&quot;width&quot;:586,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:131108,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://vutr.substack.com/i/160061351?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fea9db25e-03f9-4a4d-be64-869b6e073ba9_586x542.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!quoX!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fea9db25e-03f9-4a4d-be64-869b6e073ba9_586x542.png 424w, https://substackcdn.com/image/fetch/$s_!quoX!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fea9db25e-03f9-4a4d-be64-869b6e073ba9_586x542.png 848w, https://substackcdn.com/image/fetch/$s_!quoX!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fea9db25e-03f9-4a4d-be64-869b6e073ba9_586x542.png 1272w, https://substackcdn.com/image/fetch/$s_!quoX!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fea9db25e-03f9-4a4d-be64-869b6e073ba9_586x542.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" 
viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Image created by the author.</figcaption></figure></div><p>Although Presto can query near-real-time data on Hive directly, the engine can only read a partition once all of that partition&#8217;s data is available. Hive tables are typically partitioned hourly or daily, which limits the near-real-time capability.</p><p>To address this, Meta introduced the <code>open</code> partition state in the Hive Metastore, so that systems can register a partition as soon as its data starts arriving. 
Presto can now access data immediately after it lands in the storage layer.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!-bsu!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8bfd03c1-3f20-4b69-abf3-96786af55ed0_762x372.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!-bsu!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8bfd03c1-3f20-4b69-abf3-96786af55ed0_762x372.png 424w, https://substackcdn.com/image/fetch/$s_!-bsu!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8bfd03c1-3f20-4b69-abf3-96786af55ed0_762x372.png 848w, https://substackcdn.com/image/fetch/$s_!-bsu!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8bfd03c1-3f20-4b69-abf3-96786af55ed0_762x372.png 1272w, https://substackcdn.com/image/fetch/$s_!-bsu!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8bfd03c1-3f20-4b69-abf3-96786af55ed0_762x372.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!-bsu!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8bfd03c1-3f20-4b69-abf3-96786af55ed0_762x372.png" width="762" height="372" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/8bfd03c1-3f20-4b69-abf3-96786af55ed0_762x372.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:372,&quot;width&quot;:762,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:63395,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://vutr.substack.com/i/160061351?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8bfd03c1-3f20-4b69-abf3-96786af55ed0_762x372.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!-bsu!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8bfd03c1-3f20-4b69-abf3-96786af55ed0_762x372.png 424w, https://substackcdn.com/image/fetch/$s_!-bsu!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8bfd03c1-3f20-4b69-abf3-96786af55ed0_762x372.png 848w, https://substackcdn.com/image/fetch/$s_!-bsu!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8bfd03c1-3f20-4b69-abf3-96786af55ed0_762x372.png 1272w, https://substackcdn.com/image/fetch/$s_!-bsu!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8bfd03c1-3f20-4b69-abf3-96786af55ed0_762x372.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 
0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Image created by the author.</figcaption></figure></div><p>Meta took two years to migrate all workloads from the other interactive systems to Presto. When migrating queries to Presto, Meta had to address syntactic incompatibilities and implement function mappings between the old systems and Presto, which proved flexible enough to map every query to a supported Presto query.</p><p>Because systems like Raptor, Cubrick, and Scuba load their data from the lakehouse, data migration was not a challenge: users can read the same lakehouse data directly with Presto. 
At the end of the migration, Meta completely deprecated Raptor and Cubrick, saving several hundred thousand lines of code and several thousand machines.</p><p><strong>For the batch engines,</strong> Meta also decided to migrate most of the batch pipelines to Presto.</p><p>Meta created the Hive engine for all batch processing in the late 2000s and later replaced it with SparkSQL. When migrating to Presto, Meta faced a problem: Presto&#8217;s architecture at that time was less resilient to machine failures than Spark&#8217;s when executing long-running pipelines.</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!OfEj!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0d4aeb6b-a305-4b32-97bb-0853688834e1_946x114.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!OfEj!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0d4aeb6b-a305-4b32-97bb-0853688834e1_946x114.png 424w, https://substackcdn.com/image/fetch/$s_!OfEj!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0d4aeb6b-a305-4b32-97bb-0853688834e1_946x114.png 848w, https://substackcdn.com/image/fetch/$s_!OfEj!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0d4aeb6b-a305-4b32-97bb-0853688834e1_946x114.png 1272w, https://substackcdn.com/image/fetch/$s_!OfEj!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0d4aeb6b-a305-4b32-97bb-0853688834e1_946x114.png 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!OfEj!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0d4aeb6b-a305-4b32-97bb-0853688834e1_946x114.png" width="946" height="114" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/0d4aeb6b-a305-4b32-97bb-0853688834e1_946x114.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:114,&quot;width&quot;:946,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:34650,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://vutr.substack.com/i/160061351?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0d4aeb6b-a305-4b32-97bb-0853688834e1_946x114.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!OfEj!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0d4aeb6b-a305-4b32-97bb-0853688834e1_946x114.png 424w, https://substackcdn.com/image/fetch/$s_!OfEj!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0d4aeb6b-a305-4b32-97bb-0853688834e1_946x114.png 848w, https://substackcdn.com/image/fetch/$s_!OfEj!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0d4aeb6b-a305-4b32-97bb-0853688834e1_946x114.png 1272w, https://substackcdn.com/image/fetch/$s_!OfEj!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0d4aeb6b-a305-4b32-97bb-0853688834e1_946x114.png 1456w" 
sizes="100vw" loading="lazy"></picture><div></div></div></a></figure></div><p>They solved this problem by combining the scalability of the Spark engine with Presto&#8217;s cleaner, standards-compliant SQL dialect, PrestoSQL; the result is Presto on Spark. The solution achieved this by refactoring the Presto front-end (parser, analyzer, optimizer, planner) and back-end (evaluation and I/O) libraries and embedding them in the Spark driver and workers.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!3gq6!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb112d8e5-32c1-468b-9fd3-17d7450db247_436x490.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!3gq6!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb112d8e5-32c1-468b-9fd3-17d7450db247_436x490.png 424w, https://substackcdn.com/image/fetch/$s_!3gq6!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb112d8e5-32c1-468b-9fd3-17d7450db247_436x490.png 848w, https://substackcdn.com/image/fetch/$s_!3gq6!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb112d8e5-32c1-468b-9fd3-17d7450db247_436x490.png 1272w, https://substackcdn.com/image/fetch/$s_!3gq6!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb112d8e5-32c1-468b-9fd3-17d7450db247_436x490.png 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!3gq6!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb112d8e5-32c1-468b-9fd3-17d7450db247_436x490.png" width="436" height="490" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/b112d8e5-32c1-468b-9fd3-17d7450db247_436x490.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:490,&quot;width&quot;:436,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:72701,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://vutr.substack.com/i/160061351?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb112d8e5-32c1-468b-9fd3-17d7450db247_436x490.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!3gq6!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb112d8e5-32c1-468b-9fd3-17d7450db247_436x490.png 424w, https://substackcdn.com/image/fetch/$s_!3gq6!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb112d8e5-32c1-468b-9fd3-17d7450db247_436x490.png 848w, https://substackcdn.com/image/fetch/$s_!3gq6!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb112d8e5-32c1-468b-9fd3-17d7450db247_436x490.png 1272w, https://substackcdn.com/image/fetch/$s_!3gq6!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb112d8e5-32c1-468b-9fd3-17d7450db247_436x490.png 1456w" sizes="100vw" 
loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Image created by the author.</figcaption></figure></div><p>With interactive queries already running on Presto and Presto on Spark offering 100% compatibility with PrestoSQL, users can move from interactive queries to batch pipelines without rewriting their queries.</p><p>At Meta, Presto on Spark is currently in production and running thousands of pipelines daily.</p><p><strong>The streaming engines were fragmented as well.</strong> The two main reasons were:</p><ul><li><p>Programming language divergence (C++ vs. Java vs. 
PHP).</p></li><li><p>Abstraction level divergence (low-level procedural vs. high-level declarative APIs).</p></li></ul><p>The legacy stacks included Puma (Java, declarative), Stylus (C++, low-level), and others with different combinations of abstraction levels (declarative, procedural) and implementation languages (C++, Java, PHP).</p><p>To deal with this, Meta built XStream, its next-generation stream processing platform. Meta promoted SQL as the primary way to interact with XStream by integrating it with CoreSQL. They also made it more efficient with a Velox-based execution engine.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!tvFx!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F37345064-cc53-4717-9234-13d11966b628_478x382.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!tvFx!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F37345064-cc53-4717-9234-13d11966b628_478x382.png 424w, https://substackcdn.com/image/fetch/$s_!tvFx!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F37345064-cc53-4717-9234-13d11966b628_478x382.png 848w, https://substackcdn.com/image/fetch/$s_!tvFx!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F37345064-cc53-4717-9234-13d11966b628_478x382.png 1272w, https://substackcdn.com/image/fetch/$s_!tvFx!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F37345064-cc53-4717-9234-13d11966b628_478x382.png 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!tvFx!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F37345064-cc53-4717-9234-13d11966b628_478x382.png" width="478" height="382" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/37345064-cc53-4717-9234-13d11966b628_478x382.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:382,&quot;width&quot;:478,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:50546,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://vutr.substack.com/i/160061351?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F37345064-cc53-4717-9234-13d11966b628_478x382.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!tvFx!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F37345064-cc53-4717-9234-13d11966b628_478x382.png 424w, https://substackcdn.com/image/fetch/$s_!tvFx!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F37345064-cc53-4717-9234-13d11966b628_478x382.png 848w, https://substackcdn.com/image/fetch/$s_!tvFx!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F37345064-cc53-4717-9234-13d11966b628_478x382.png 1272w, https://substackcdn.com/image/fetch/$s_!tvFx!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F37345064-cc53-4717-9234-13d11966b628_478x382.png 1456w" sizes="100vw" 
loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Image created by the author.</figcaption></figure></div><blockquote><p><em>We will explore CoreSQL and Velox later.</em></p></blockquote><p>XStream today supports various use cases, from SQL queries and machine learning workloads to function-as-a-service.</p><h3>SQL Dialect</h3><p>Meta had more than six SQL variants in use internally. Users who wanted to work with a different system often had to learn a different SQL dialect. Meta decided to narrow these down to two dialects: MySQL and PrestoSQL. 
The former is for OLTP workloads, and the latter for OLAP workloads.</p><p>However, Meta found it challenging to achieve compatibility across the different engines. They looked around and found that Google&#8217;s approach to the same problem, <a href="https://github.com/google/zetasql">ZetaSQL</a>, could help them. They needed two components:</p><ul><li><p>A SQL parser and analyzer that parses and analyzes queries and creates and validates query plans. Meta already had a Java implementation (Presto) and a Python implementation (used by developer tools). They rewrote the Python implementation in C++ for better performance and better integration with the C++ engines, and they are working on binding the Java implementation to the C++ library.</p></li><li><p>A library of query functions and operators. Meta initially reused the Java implementation from Presto and then set out to replace it with Velox to maximize performance. We will explore Velox in the <strong>Execution Engine</strong> section. </p></li></ul><p>Meta called this solution CoreSQL. It acts as the standard dialect across engines, from Presto to XStream.</p><h3>Storage</h3><p>Meta used ORC as the columnar format for the lakehouse. Later, they developed DWRF, an ORC variant that better supports deeply nested data and finer-grained encryption. Meta had fragmented codec implementations for this format: one Java implementation for Spark, one Java implementation for Presto, and one C++ implementation for ML applications.</p><p>Because of its higher performance, Meta chose the Presto codec as the base and added the necessary features to it. Then, they migrated all codecs used in Spark and other systems to the new one. In addition, Meta refactored the C++ DWIO library into Velox, added some features, and open-sourced the library as part of Velox.</p><h3>Execution Engine</h3><p>As in all the areas above, the lakehouse&#8217;s evolution created fragmentation in the execution engines. 
More than twelve specialized engines that shared little to nothing with each other were written in different languages and developed by different teams.</p><p>To address this challenge, Meta created Velox, a C++ database acceleration library (think Databricks&#8217;s Photon in the case of Spark). Velox aimed to unify execution engines across different compute engines.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!Mzn0!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa5076fe3-709f-42c3-b42e-ef86a6e9dcc5_488x402.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!Mzn0!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa5076fe3-709f-42c3-b42e-ef86a6e9dcc5_488x402.png 424w, https://substackcdn.com/image/fetch/$s_!Mzn0!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa5076fe3-709f-42c3-b42e-ef86a6e9dcc5_488x402.png 848w, https://substackcdn.com/image/fetch/$s_!Mzn0!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa5076fe3-709f-42c3-b42e-ef86a6e9dcc5_488x402.png 1272w, https://substackcdn.com/image/fetch/$s_!Mzn0!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa5076fe3-709f-42c3-b42e-ef86a6e9dcc5_488x402.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!Mzn0!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa5076fe3-709f-42c3-b42e-ef86a6e9dcc5_488x402.png" width="488" height="402" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/a5076fe3-709f-42c3-b42e-ef86a6e9dcc5_488x402.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:402,&quot;width&quot;:488,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:38438,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://vutr.substack.com/i/160061351?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa5076fe3-709f-42c3-b42e-ef86a6e9dcc5_488x402.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!Mzn0!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa5076fe3-709f-42c3-b42e-ef86a6e9dcc5_488x402.png 424w, https://substackcdn.com/image/fetch/$s_!Mzn0!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa5076fe3-709f-42c3-b42e-ef86a6e9dcc5_488x402.png 848w, https://substackcdn.com/image/fetch/$s_!Mzn0!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa5076fe3-709f-42c3-b42e-ef86a6e9dcc5_488x402.png 1272w, https://substackcdn.com/image/fetch/$s_!Mzn0!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa5076fe3-709f-42c3-b42e-ef86a6e9dcc5_488x402.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 
0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Velox aimed to unify execution engines across different compute engines. 
Image created by the author.</figcaption></figure></div><p>Typically, Velox receives the fully optimized query plans and performs the computation using the resources in the local machine.</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!xdLe!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0608d0a3-2d29-4a2e-8877-c1eac785af95_656x188.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!xdLe!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0608d0a3-2d29-4a2e-8877-c1eac785af95_656x188.png 424w, https://substackcdn.com/image/fetch/$s_!xdLe!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0608d0a3-2d29-4a2e-8877-c1eac785af95_656x188.png 848w, https://substackcdn.com/image/fetch/$s_!xdLe!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0608d0a3-2d29-4a2e-8877-c1eac785af95_656x188.png 1272w, https://substackcdn.com/image/fetch/$s_!xdLe!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0608d0a3-2d29-4a2e-8877-c1eac785af95_656x188.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!xdLe!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0608d0a3-2d29-4a2e-8877-c1eac785af95_656x188.png" width="656" height="188" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/0608d0a3-2d29-4a2e-8877-c1eac785af95_656x188.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:188,&quot;width&quot;:656,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:32133,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://vutr.substack.com/i/160061351?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0608d0a3-2d29-4a2e-8877-c1eac785af95_656x188.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!xdLe!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0608d0a3-2d29-4a2e-8877-c1eac785af95_656x188.png 424w, https://substackcdn.com/image/fetch/$s_!xdLe!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0608d0a3-2d29-4a2e-8877-c1eac785af95_656x188.png 848w, https://substackcdn.com/image/fetch/$s_!xdLe!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0608d0a3-2d29-4a2e-8877-c1eac785af95_656x188.png 1272w, https://substackcdn.com/image/fetch/$s_!xdLe!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0608d0a3-2d29-4a2e-8877-c1eac785af95_656x188.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a><figcaption class="image-caption">Image created by the author.</figcaption></figure></div><p>As Meta claimed, Velox democratizes the optimizations that are only found in individual engines, which reduces 
duplication, offers reusability, and improves consistency.</p><p>By the time of the paper&#8217;s release, Meta had integrated Velox into many systems. Meta also provided the implementation of the CoreSQL dialect for Velox.</p><div><hr></div><h2>Outro</h2><p>Thank you for reading this far.</p><p>In this article, we explored the limitations of Meta&#8217;s legacy approach for their lakehouse, how they addressed them with the Shared Foundations, and how they implemented it across different areas, from the compute engine, SQL dialect, and storage format to the execution engine.</p><p>Now, it&#8217;s time to say goodbye. See you in the next article.</p><div><hr></div><h2>Reference</h2><p><em>[1] Biswapesh Chattopadhyay, Pedro Pedreira, Sameer Agarwal, Yutian "James" Sun, Suketu Vakharia, Peng Li, Weiran Liu, Sundaram Narayanan, <a href="https://www.cidrdb.org/cidr2023/papers/p77-chattopadhyay.pdf">Shared Foundations: Modernizing Meta&#8217;s Data Lakehouse</a> (2023)</em></p>]]></content:encoded></item><item><title><![CDATA[Bufstream: Stream Kafka Messages to Iceberg Tables in Minutes]]></title><description><![CDATA[8x cheaper than Kafka + native support for data quality and seamless transformation of Kafka topics into Iceberg tables.]]></description><link>https://vutr.substack.com/p/bufstream-stream-kafka-messages-to</link><guid isPermaLink="false">https://vutr.substack.com/p/bufstream-stream-kafka-messages-to</guid><dc:creator><![CDATA[Vu Trinh]]></dc:creator><pubDate>Thu, 27 Mar 2025 03:15:31 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!ivxg!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbabe6bb8-3d15-45a2-8cc2-b07aabb70eff_2000x1429.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p></p><blockquote><p><em>I&#8217;m making my life less dull by spending time learning and researching &#8220;how it works&#8221; in the data engineering field. 
</em></p><p><em>Here is a place where I share everything I&#8217;ve learned. Not subscribed yet? Here you go:</em></p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://vutr.substack.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe now&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://vutr.substack.com/subscribe?"><span>Subscribe now</span></a></p></blockquote><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!ivxg!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbabe6bb8-3d15-45a2-8cc2-b07aabb70eff_2000x1429.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!ivxg!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbabe6bb8-3d15-45a2-8cc2-b07aabb70eff_2000x1429.png 424w, https://substackcdn.com/image/fetch/$s_!ivxg!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbabe6bb8-3d15-45a2-8cc2-b07aabb70eff_2000x1429.png 848w, https://substackcdn.com/image/fetch/$s_!ivxg!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbabe6bb8-3d15-45a2-8cc2-b07aabb70eff_2000x1429.png 1272w, https://substackcdn.com/image/fetch/$s_!ivxg!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbabe6bb8-3d15-45a2-8cc2-b07aabb70eff_2000x1429.png 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!ivxg!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbabe6bb8-3d15-45a2-8cc2-b07aabb70eff_2000x1429.png" width="1456" height="1040" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/babe6bb8-3d15-45a2-8cc2-b07aabb70eff_2000x1429.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1040,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:402912,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://vutr.substack.com/i/157438538?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbabe6bb8-3d15-45a2-8cc2-b07aabb70eff_2000x1429.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!ivxg!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbabe6bb8-3d15-45a2-8cc2-b07aabb70eff_2000x1429.png 424w, https://substackcdn.com/image/fetch/$s_!ivxg!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbabe6bb8-3d15-45a2-8cc2-b07aabb70eff_2000x1429.png 848w, https://substackcdn.com/image/fetch/$s_!ivxg!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbabe6bb8-3d15-45a2-8cc2-b07aabb70eff_2000x1429.png 1272w, https://substackcdn.com/image/fetch/$s_!ivxg!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbabe6bb8-3d15-45a2-8cc2-b07aabb70eff_2000x1429.png 1456w" 
sizes="100vw" fetchpriority="high"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Image created by the author</figcaption></figure></div><div><hr></div><h2>Intro</h2><p><a href="https://enlyft.com/tech/products/apache-kafka">Nearly 50,000 companies use Apache Kafka.</a></p><p>Fourteen years ago, a team led by Jay Kreps built Kafka to meet LinkedIn's growing log processing demands. Since its open-source release, Kafka has become the de facto standard for distributed messaging.</p><p>But here&#8217;s the catch: Kafka&#8217;s design isn&#8217;t optimized for the cloud era. 
Compute and storage can&#8217;t scale independently, and data replication incurs cross-availability-zone transfer fees. Other challenges persist whether you run Kafka on-premises or in the cloud:</p><ul><li><p><strong>Data quality concerns</strong>: Kafka brokers treat messages as raw byte sequences, leaving schema validation up to producers and consumers. If someone skips this step, downstream systems suffer.</p></li><li><p><strong>Pipeline complexity</strong>: Once data lands in a Kafka topic, you need an entire pipeline to move it to a data lake before analytics engines can query it.</p></li></ul><p>What if there were a solution that helped you manage Kafka more efficiently in the cloud, ensured data quality, and transformed Kafka messages into an Iceberg table in just a few minutes?</p><p>Today, we explore Bufstream, the solution that promises all of this.</p><div><hr></div><h2>A bit of Kafka</h2><p>Kafka achieves high throughput by leveraging the OS page cache and a sequential disk access pattern. 
It simplifies the system by relying on the OS for storage management; all read and write operations must pass through the page cache.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!zO5F!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff6cc7d4e-2ceb-41e2-96a6-58fcdb874137_556x546.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!zO5F!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff6cc7d4e-2ceb-41e2-96a6-58fcdb874137_556x546.png 424w, https://substackcdn.com/image/fetch/$s_!zO5F!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff6cc7d4e-2ceb-41e2-96a6-58fcdb874137_556x546.png 848w, https://substackcdn.com/image/fetch/$s_!zO5F!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff6cc7d4e-2ceb-41e2-96a6-58fcdb874137_556x546.png 1272w, https://substackcdn.com/image/fetch/$s_!zO5F!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff6cc7d4e-2ceb-41e2-96a6-58fcdb874137_556x546.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!zO5F!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff6cc7d4e-2ceb-41e2-96a6-58fcdb874137_556x546.png" width="556" height="546" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/f6cc7d4e-2ceb-41e2-96a6-58fcdb874137_556x546.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:546,&quot;width&quot;:556,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:112133,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://vutr.substack.com/i/157438538?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff6cc7d4e-2ceb-41e2-96a6-58fcdb874137_556x546.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!zO5F!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff6cc7d4e-2ceb-41e2-96a6-58fcdb874137_556x546.png 424w, https://substackcdn.com/image/fetch/$s_!zO5F!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff6cc7d4e-2ceb-41e2-96a6-58fcdb874137_556x546.png 848w, https://substackcdn.com/image/fetch/$s_!zO5F!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff6cc7d4e-2ceb-41e2-96a6-58fcdb874137_556x546.png 1272w, https://substackcdn.com/image/fetch/$s_!zO5F!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff6cc7d4e-2ceb-41e2-96a6-58fcdb874137_556x546.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" 
viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Image created by the author.</figcaption></figure></div><p>This tightly coupled design means that scaling storage requires adding more machines, often leading to inefficient resource usage. 
To address this limitation, Uber proposed Kafka Tiered Storage (<a href="https://cwiki.apache.org/confluence/display/KAFKA/KIP-405%3A+Kafka+Tiered+Storage">KIP-405</a>), introducing a two-tiered storage system:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!g_Pl!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F15a7535d-99b6-411e-ba4f-0f9fff801383_550x330.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!g_Pl!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F15a7535d-99b6-411e-ba4f-0f9fff801383_550x330.png 424w, https://substackcdn.com/image/fetch/$s_!g_Pl!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F15a7535d-99b6-411e-ba4f-0f9fff801383_550x330.png 848w, https://substackcdn.com/image/fetch/$s_!g_Pl!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F15a7535d-99b6-411e-ba4f-0f9fff801383_550x330.png 1272w, https://substackcdn.com/image/fetch/$s_!g_Pl!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F15a7535d-99b6-411e-ba4f-0f9fff801383_550x330.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!g_Pl!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F15a7535d-99b6-411e-ba4f-0f9fff801383_550x330.png" width="550" height="330" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/15a7535d-99b6-411e-ba4f-0f9fff801383_550x330.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:330,&quot;width&quot;:550,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:68432,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://vutr.substack.com/i/157438538?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F15a7535d-99b6-411e-ba4f-0f9fff801383_550x330.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!g_Pl!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F15a7535d-99b6-411e-ba4f-0f9fff801383_550x330.png 424w, https://substackcdn.com/image/fetch/$s_!g_Pl!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F15a7535d-99b6-411e-ba4f-0f9fff801383_550x330.png 848w, https://substackcdn.com/image/fetch/$s_!g_Pl!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F15a7535d-99b6-411e-ba4f-0f9fff801383_550x330.png 1272w, https://substackcdn.com/image/fetch/$s_!g_Pl!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F15a7535d-99b6-411e-ba4f-0f9fff801383_550x330.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 
0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Kafka Tiered Storage. Image created by the author.</figcaption></figure></div><ul><li><p>Local storage (broker disk) stores the most recent data.</p></li><li><p>Remote storage (HDFS/S3/GCS) stores historical data.</p></li></ul><p>However, even with tiered storage, brokers are not entirely stateless, because they still keep the most recent data on local disk.</p><p>Kafka's design also relies on replication for message durability. Each Kafka partition has a single leader and zero or more followers (brokers that store replicas of the partition). All writes must go to the partition&#8217;s leader, and reads can be served by the leader or by the partition's followers.</p><p>When the producer writes messages to the leader, the leader replicates them to followers. This lets Kafka fail over to another replica when a broker fails. 
Because Kafka storage and compute are tightly coupled, any change in cluster membership forces data to move around the network.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!l9Ym!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2f302006-652e-4b40-9399-073e43b4149e_1024x572.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!l9Ym!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2f302006-652e-4b40-9399-073e43b4149e_1024x572.png 424w, https://substackcdn.com/image/fetch/$s_!l9Ym!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2f302006-652e-4b40-9399-073e43b4149e_1024x572.png 848w, https://substackcdn.com/image/fetch/$s_!l9Ym!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2f302006-652e-4b40-9399-073e43b4149e_1024x572.png 1272w, https://substackcdn.com/image/fetch/$s_!l9Ym!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2f302006-652e-4b40-9399-073e43b4149e_1024x572.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!l9Ym!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2f302006-652e-4b40-9399-073e43b4149e_1024x572.png" width="1024" height="572" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/2f302006-652e-4b40-9399-073e43b4149e_1024x572.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:572,&quot;width&quot;:1024,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:114921,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://vutr.substack.com/i/157438538?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2f302006-652e-4b40-9399-073e43b4149e_1024x572.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!l9Ym!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2f302006-652e-4b40-9399-073e43b4149e_1024x572.png 424w, https://substackcdn.com/image/fetch/$s_!l9Ym!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2f302006-652e-4b40-9399-073e43b4149e_1024x572.png 848w, https://substackcdn.com/image/fetch/$s_!l9Ym!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2f302006-652e-4b40-9399-073e43b4149e_1024x572.png 1272w, https://substackcdn.com/image/fetch/$s_!l9Ym!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2f302006-652e-4b40-9399-073e43b4149e_1024x572.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" 
viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Image created by the author.</figcaption></figure></div><p>However, Kafka's design becomes less efficient when operating in the cloud:</p><ul><li><p>It can&#8217;t fully leverage the cloud's pay-as-you-go pricing model, as compute and storage cannot be scaled independently.</p></li><li><p>It can incur significant cross-availability-zone (AZ) data transfer fees because messages are replicated across different AZs.</p></li></ul><p>That&#8217;s why many solutions are emerging to offer a cloud-native alternative to Kafka, and Bufstream stands out as a compelling contender.</p><div><hr></div><h2>Bufstream</h2><h3>The motivation</h3><p>Bufstream was developed by <a href="https://buf.build/">Buf</a>, a software company founded in 2020 to bring schema-driven development to companies through Protobuf and 
gRPC.</p><blockquote><p><em>Protocol Buffers (Protobuf) is an efficient binary serialization format developed by Google. Unlike JSON, Protobuf enforces strict schemas using .proto files, where fields are assigned unique numbers for efficient encoding. It supports schema evolution by allowing new fields to be added without breaking existing consumers, ensuring backward and forward compatibility.</em></p></blockquote><p>Buf has been building the <a href="https://buf.build/product/bsr">Buf Schema Registry (BSR)</a>, which serves as both a complete Protobuf schema registry and a robust Protobuf package manager. As BSR grew, Buf saw more customers wanting these capabilities for data streaming use cases, specifically customers sending Protobuf payloads over Kafka.</p><p>These customers wanted to tie Kafka topics to specific Protobuf message formats, enable broker-side validation, automatically envelop raw data, and leverage BSR&#8217;s support for custom Protobuf options to enforce field-level RBAC at the gateway.</p><p>At first, Buf built only the Buf Kafka Gateway, a Kafka proxy that leveraged BSR to provide validation, automatic enveloping, and field-level RBAC. 
As Buf developed the gateway, object-store-based Kafka emerged (e.g., WarpStream, AutoMQ).</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!HWg9!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fed0c5736-ec1e-49e8-bfcd-43390b8f3293_1004x550.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!HWg9!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fed0c5736-ec1e-49e8-bfcd-43390b8f3293_1004x550.png 424w, https://substackcdn.com/image/fetch/$s_!HWg9!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fed0c5736-ec1e-49e8-bfcd-43390b8f3293_1004x550.png 848w, https://substackcdn.com/image/fetch/$s_!HWg9!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fed0c5736-ec1e-49e8-bfcd-43390b8f3293_1004x550.png 1272w, https://substackcdn.com/image/fetch/$s_!HWg9!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fed0c5736-ec1e-49e8-bfcd-43390b8f3293_1004x550.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!HWg9!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fed0c5736-ec1e-49e8-bfcd-43390b8f3293_1004x550.png" width="1004" height="550" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/ed0c5736-ec1e-49e8-bfcd-43390b8f3293_1004x550.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:550,&quot;width&quot;:1004,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:135843,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://vutr.substack.com/i/157438538?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fed0c5736-ec1e-49e8-bfcd-43390b8f3293_1004x550.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!HWg9!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fed0c5736-ec1e-49e8-bfcd-43390b8f3293_1004x550.png 424w, https://substackcdn.com/image/fetch/$s_!HWg9!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fed0c5736-ec1e-49e8-bfcd-43390b8f3293_1004x550.png 848w, https://substackcdn.com/image/fetch/$s_!HWg9!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fed0c5736-ec1e-49e8-bfcd-43390b8f3293_1004x550.png 1272w, https://substackcdn.com/image/fetch/$s_!HWg9!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fed0c5736-ec1e-49e8-bfcd-43390b8f3293_1004x550.png 1456w" sizes="100vw" loading="lazy"></picture></div></a><figcaption class="image-caption">Image created by the author</figcaption></figure></div><p>Realizing they could offer an end-to-end solution, they built a Kafka-compatible message queue with native support for features like directly writing Iceberg tables to S3 while bringing the same reliability and developer experience to data streaming as they did with Protobuf and gRPC.</p><p>The result was Bufstream, an enterprise-grade, object storage-based Kafka-compatible message queue <a href="https://jepsen.io/analyses/bufstream-0.1.0">verified by Jepsen.</a></p><blockquote><p><em>Jepsen is the gold standard for distributed systems testing, and Bufstream is the only cloud-native Kafka implementation that has been independently tested by Jepsen.</em></p></blockquote><p>But how is Bufstream different as a Kafka replacement?</p><h3>Replacing local disks with object 
storage</h3><p>Buf designed Bufstream from scratch to ensure 100% Kafka compatibility while storing all data in object storage. For the Kafka protocol, <a href="https://buf.build/docs/bufstream/kafka-compatibility/conformance/">Bufstream supports</a> the latest version of each Kafka API (as of Kafka 3.7.1) while striving to maintain compatibility with all previous endpoint versions.</p><p>For storage, instead of writing to local disks, Bufstream writes directly to object storage such as AWS S3, Google Cloud Storage, or Azure Blob Storage, leaving data durability and availability to these services.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!luX2!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb5b708c5-66ad-4eb0-8e21-876db9a50137_554x452.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!luX2!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb5b708c5-66ad-4eb0-8e21-876db9a50137_554x452.png 424w, https://substackcdn.com/image/fetch/$s_!luX2!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb5b708c5-66ad-4eb0-8e21-876db9a50137_554x452.png 848w, https://substackcdn.com/image/fetch/$s_!luX2!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb5b708c5-66ad-4eb0-8e21-876db9a50137_554x452.png 1272w, https://substackcdn.com/image/fetch/$s_!luX2!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb5b708c5-66ad-4eb0-8e21-876db9a50137_554x452.png 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!luX2!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb5b708c5-66ad-4eb0-8e21-876db9a50137_554x452.png" width="554" height="452" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/b5b708c5-66ad-4eb0-8e21-876db9a50137_554x452.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:452,&quot;width&quot;:554,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:65251,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://vutr.substack.com/i/157438538?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb5b708c5-66ad-4eb0-8e21-876db9a50137_554x452.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!luX2!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb5b708c5-66ad-4eb0-8e21-876db9a50137_554x452.png 424w, https://substackcdn.com/image/fetch/$s_!luX2!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb5b708c5-66ad-4eb0-8e21-876db9a50137_554x452.png 848w, https://substackcdn.com/image/fetch/$s_!luX2!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb5b708c5-66ad-4eb0-8e21-876db9a50137_554x452.png 1272w, https://substackcdn.com/image/fetch/$s_!luX2!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb5b708c5-66ad-4eb0-8e21-876db9a50137_554x452.png 1456w" sizes="100vw" 
loading="lazy"></picture></div></a><figcaption class="image-caption">Image created by the author</figcaption></figure></div><p>Unlike the tiered storage approach, which maintains local and remote storage, Bufstream stores messages entirely in the object storage. This allows users to scale computing and storage independently. Need more computing power? Add RAM and CPUs. Need more storage? 
Object storage enables you to expand capacity without limits (except for your budget.)</p><p>With <a href="https://buf.build/docs/bufstream/cost/#the-benchmark-setup">the same setup</a> of a single topic with 288 partitions, 1 GiB/s of symmetric reads and writes, and a 7-day data retention period on both Kafka on AWS and Bufstream, the Kafka cluster's EBS volumes cost <strong>$42,025</strong> per month. For the Bufstream storage, it only costs <strong>$4,625</strong> per month. The cost savings are mainly due to:</p><ul><li><p>Object storage is cheaper than disk media like AWS EBS.</p></li><li><p>The actual data stored in Bufstream is smaller than Kafka because it doesn&#8217;t need to replicate the data between brokers.</p></li></ul><p>With object storage, here is a typical message-writing process of Bufstream:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!GaCI!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F006d01c8-3387-4f9a-adde-68696b9a923b_1370x814.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!GaCI!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F006d01c8-3387-4f9a-adde-68696b9a923b_1370x814.png 424w, https://substackcdn.com/image/fetch/$s_!GaCI!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F006d01c8-3387-4f9a-adde-68696b9a923b_1370x814.png 848w, https://substackcdn.com/image/fetch/$s_!GaCI!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F006d01c8-3387-4f9a-adde-68696b9a923b_1370x814.png 1272w, 
https://substackcdn.com/image/fetch/$s_!GaCI!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F006d01c8-3387-4f9a-adde-68696b9a923b_1370x814.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!GaCI!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F006d01c8-3387-4f9a-adde-68696b9a923b_1370x814.png" width="1370" height="814" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/006d01c8-3387-4f9a-adde-68696b9a923b_1370x814.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:814,&quot;width&quot;:1370,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:299889,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://vutr.substack.com/i/157438538?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F006d01c8-3387-4f9a-adde-68696b9a923b_1370x814.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!GaCI!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F006d01c8-3387-4f9a-adde-68696b9a923b_1370x814.png 424w, https://substackcdn.com/image/fetch/$s_!GaCI!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F006d01c8-3387-4f9a-adde-68696b9a923b_1370x814.png 848w, https://substackcdn.com/image/fetch/$s_!GaCI!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F006d01c8-3387-4f9a-adde-68696b9a923b_1370x814.png 
1272w, https://substackcdn.com/image/fetch/$s_!GaCI!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F006d01c8-3387-4f9a-adde-68696b9a923b_1370x814.png 1456w" sizes="100vw" loading="lazy"></picture></div></a><figcaption class="image-caption">Image created by the author.</figcaption></figure></div><ul><li><p>Brokers write messages into the object storage as intake files and acknowledge the write to the producers.</p></li><li><p>Intake files include messages from many topics and partitions and are grouped according to a time boundary.</p></li><li><p>This message batching can 
help reduce the cost of writing for a single partition.</p></li><li><p>A background process organizes the unordered messages from intake files into archive files, using message-ordering metadata from the metadata store, which can be etcd, Postgres, Google Spanner, and others.</p></li></ul><h3>Reducing the cross-availability zone transfer fee</h3><p>The benefit of leveraging object storage does not stop there.</p><p><a href="https://www.confluent.io/blog/understanding-and-optimizing-your-kafka-costs-part-1-infrastructure/#networking">According to Confluent</a>, cross-AZ replication can account for more than 50% of total infrastructure costs when self-managing Apache Kafka, making it a significant financial consideration for cloud deployments.</p><p>In the same benchmark, the Kafka setup requires users to pay <strong>$34,732 monthly</strong> for the cross-availability zone transfer fee, <strong>three times the cost of the Bufstream clusters.</strong></p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!Hf9U!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff652d52a-ba9a-4e5a-8b13-76412b7c9c77_786x490.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!Hf9U!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff652d52a-ba9a-4e5a-8b13-76412b7c9c77_786x490.png 424w, https://substackcdn.com/image/fetch/$s_!Hf9U!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff652d52a-ba9a-4e5a-8b13-76412b7c9c77_786x490.png 848w, 
https://substackcdn.com/image/fetch/$s_!Hf9U!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff652d52a-ba9a-4e5a-8b13-76412b7c9c77_786x490.png 1272w, https://substackcdn.com/image/fetch/$s_!Hf9U!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff652d52a-ba9a-4e5a-8b13-76412b7c9c77_786x490.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!Hf9U!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff652d52a-ba9a-4e5a-8b13-76412b7c9c77_786x490.png" width="786" height="490" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/f652d52a-ba9a-4e5a-8b13-76412b7c9c77_786x490.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:490,&quot;width&quot;:786,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:134209,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://vutr.substack.com/i/157438538?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff652d52a-ba9a-4e5a-8b13-76412b7c9c77_786x490.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!Hf9U!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff652d52a-ba9a-4e5a-8b13-76412b7c9c77_786x490.png 424w, https://substackcdn.com/image/fetch/$s_!Hf9U!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff652d52a-ba9a-4e5a-8b13-76412b7c9c77_786x490.png 848w, 
https://substackcdn.com/image/fetch/$s_!Hf9U!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff652d52a-ba9a-4e5a-8b13-76412b7c9c77_786x490.png 1272w, https://substackcdn.com/image/fetch/$s_!Hf9U!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff652d52a-ba9a-4e5a-8b13-76412b7c9c77_786x490.png 1456w" sizes="100vw" loading="lazy"></picture></div></a><figcaption class="image-caption">How does a Kafka deployment cost users so much in Cross-AZ transfer fees? 
Image created by the author.</figcaption></figure></div><p>This high cost is primarily driven by:</p><ul><li><p>Kafka producers must always write to the partition leader. If a Kafka cluster spreads partition leaders across three availability zones, producers will write to a leader in a different zone approximately two-thirds of the time.</p></li><li><p>The leader replicates the data to brokers in the other two availability zones.</p></li></ul><p>With Bufstream, the cross-availability zone transfer fee is only <strong>$226</strong>, and it comes from metadata communication. This huge saving comes mainly from the following:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!v871!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F04369771-1c9f-42c9-b84a-0d09a66c6aa1_1406x664.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!v871!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F04369771-1c9f-42c9-b84a-0d09a66c6aa1_1406x664.png 424w, https://substackcdn.com/image/fetch/$s_!v871!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F04369771-1c9f-42c9-b84a-0d09a66c6aa1_1406x664.png 848w, https://substackcdn.com/image/fetch/$s_!v871!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F04369771-1c9f-42c9-b84a-0d09a66c6aa1_1406x664.png 1272w, https://substackcdn.com/image/fetch/$s_!v871!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F04369771-1c9f-42c9-b84a-0d09a66c6aa1_1406x664.png 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!v871!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F04369771-1c9f-42c9-b84a-0d09a66c6aa1_1406x664.png" width="1406" height="664" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/04369771-1c9f-42c9-b84a-0d09a66c6aa1_1406x664.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:664,&quot;width&quot;:1406,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:193844,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://vutr.substack.com/i/157438538?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F04369771-1c9f-42c9-b84a-0d09a66c6aa1_1406x664.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!v871!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F04369771-1c9f-42c9-b84a-0d09a66c6aa1_1406x664.png 424w, https://substackcdn.com/image/fetch/$s_!v871!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F04369771-1c9f-42c9-b84a-0d09a66c6aa1_1406x664.png 848w, https://substackcdn.com/image/fetch/$s_!v871!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F04369771-1c9f-42c9-b84a-0d09a66c6aa1_1406x664.png 1272w, https://substackcdn.com/image/fetch/$s_!v871!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F04369771-1c9f-42c9-b84a-0d09a66c6aa1_1406x664.png 1456w" 
sizes="100vw" loading="lazy"></picture></div></a><figcaption class="image-caption">Image created by the author.</figcaption></figure></div><ul><li><p>Bufstream stores data in object storage and relies on it for data durability, so it doesn&#8217;t need to replicate data between brokers as Kafka does.</p></li><li><p>Bufstream brokers are stateless. When brokers are added or removed, no data needs to move over the network as it would in Kafka. Instead, Bufstream only needs to update the metadata that maps the brokers and partitions in the object storage.</p></li><li><p>Bufstream brokers are leaderless; any broker can serve reads and writes. 
To limit cross-availability zone traffic, Bufstream uses flags to identify the client&#8217;s availability zone (AZ) and returns only brokers within that AZ during service discovery, avoiding cross-zone data transfer.</p></li></ul><h3>Deployment</h3><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!ePif!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff59f182f-03b6-4afb-9d0b-8c6b55ac3039_972x536.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!ePif!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff59f182f-03b6-4afb-9d0b-8c6b55ac3039_972x536.png 424w, https://substackcdn.com/image/fetch/$s_!ePif!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff59f182f-03b6-4afb-9d0b-8c6b55ac3039_972x536.png 848w, https://substackcdn.com/image/fetch/$s_!ePif!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff59f182f-03b6-4afb-9d0b-8c6b55ac3039_972x536.png 1272w, https://substackcdn.com/image/fetch/$s_!ePif!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff59f182f-03b6-4afb-9d0b-8c6b55ac3039_972x536.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!ePif!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff59f182f-03b6-4afb-9d0b-8c6b55ac3039_972x536.png" width="972" height="536" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/f59f182f-03b6-4afb-9d0b-8c6b55ac3039_972x536.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:536,&quot;width&quot;:972,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:125141,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://vutr.substack.com/i/157438538?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff59f182f-03b6-4afb-9d0b-8c6b55ac3039_972x536.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!ePif!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff59f182f-03b6-4afb-9d0b-8c6b55ac3039_972x536.png 424w, https://substackcdn.com/image/fetch/$s_!ePif!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff59f182f-03b6-4afb-9d0b-8c6b55ac3039_972x536.png 848w, https://substackcdn.com/image/fetch/$s_!ePif!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff59f182f-03b6-4afb-9d0b-8c6b55ac3039_972x536.png 1272w, https://substackcdn.com/image/fetch/$s_!ePif!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff59f182f-03b6-4afb-9d0b-8c6b55ac3039_972x536.png 1456w" sizes="100vw" loading="lazy"></picture></div></a><figcaption class="image-caption">Image created by the author.</figcaption></figure></div><p>Deploying Bufstream is straightforward. All you need is a Helm chart, and you&#8217;re good to go; Bufstream grants the customer complete control over the deployment. While <a href="https://open.substack.com/pub/vutr/p/i-spent-8-hours-researching-warpstream?r=2rj6sg&amp;utm_campaign=post&amp;utm_medium=web&amp;showWelcomeOnShare=false">WarpStream</a> claims data sovereignty via BYOC, it lets users secure data within their private VPC but still requires routing metadata back to the WarpStream Cloud. With Bufstream, <strong>no data</strong> is sent back to Buf. 
A Bufstream deployment runs entirely within the customer&#8217;s VPC.</p><p>For a typical Bufstream deployment, you only need the following tech stack:</p><ul><li><p>A Kubernetes cluster</p></li><li><p>Object storage (S3, GCS, or Azure Blob Storage)</p></li><li><p>A metadata store (etcd, PostgreSQL, Google Cloud Spanner, or AWS Aurora)</p></li></ul><h3>Pricing</h3><p>So, we've explored Bufstream as a much cheaper alternative to Kafka, but how does its pricing model work?</p><p>It&#8217;s straightforward: $0.002 per uncompressed GiB written (about $2 per TiB).</p><p><a href="https://www.linkedin.com/in/stanislavkozlovski/">Stanislav Kozlovski</a>, a Kafka expert and writer, <a href="https://www.linkedin.com/posts/stanislavkozlovski_bufstream-activity-7296172586965635072-zfIp?utm_source=share&amp;utm_medium=member_desktop&amp;rcm=ACoAACaI7mQBV1xJYGEQ7HhOYLsECQJDDi_X1-4">gives some juicy numbers</a> for setups that achieve 256 MiB/s of throughput with 7-day retention and 4x compression (1 GiB/s uncompressed):</p><ul><li><p>A Kafka setup costs <strong>$1,077,922</strong></p></li><li><p>An optimized Kafka setup costs <strong>$554,958.</strong> It uses tiered storage and lets consumers fetch data from followers to save on cross-AZ transfer fees.</p></li><li><p>A Bufstream setup costs only <strong>$128,136,</strong> about <strong>8.4 times</strong> less than the Kafka setup and about <strong>4.3 times</strong> less than the optimized setup.</p></li></ul><p>As <a href="https://www.linkedin.com/posts/stanislavkozlovski_bufstream-activity-7296172586965635072-zfIp?utm_source=share&amp;utm_medium=member_desktop&amp;rcm=ACoAACaI7mQBV1xJYGEQ7HhOYLsECQJDDi_X1-4">Stanislav confidently said</a>, Bufstream is the lowest-cost Kafka replacement on the market.</p><h3>Ensuring data quality</h3><p>In addition to cutting costs, Bufstream provides first-class schema support at the broker level to help users with data quality issues. 
Before discovering how Bufstream can help, let&#8217;s understand how users perform data quality checks in Kafka.</p><p>Kafka sees your message as just an array of bytes. The broker has no way to check whether a message has all the expected fields or whether a field holds a string value instead of an integer.</p><p>Schema validation must therefore occur outside the brokers, with the help of the Schema Registry, a centralized service that manages and enforces data schemas for Kafka topics, ensuring consistency and compatibility between producers and consumers.</p><p>The Schema Registry operates independently of the Kafka brokers and interacts with producers and consumers through a RESTful API. The topic schemas are stored and referenced by unique schema IDs. A typical process is:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!jQkg!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F419aaf53-2e40-45d9-9d15-10b4c540d02f_1056x524.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!jQkg!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F419aaf53-2e40-45d9-9d15-10b4c540d02f_1056x524.png 424w, https://substackcdn.com/image/fetch/$s_!jQkg!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F419aaf53-2e40-45d9-9d15-10b4c540d02f_1056x524.png 848w, https://substackcdn.com/image/fetch/$s_!jQkg!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F419aaf53-2e40-45d9-9d15-10b4c540d02f_1056x524.png 1272w, 
https://substackcdn.com/image/fetch/$s_!jQkg!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F419aaf53-2e40-45d9-9d15-10b4c540d02f_1056x524.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!jQkg!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F419aaf53-2e40-45d9-9d15-10b4c540d02f_1056x524.png" width="1056" height="524" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/419aaf53-2e40-45d9-9d15-10b4c540d02f_1056x524.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:524,&quot;width&quot;:1056,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:146143,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://vutr.substack.com/i/157438538?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F419aaf53-2e40-45d9-9d15-10b4c540d02f_1056x524.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!jQkg!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F419aaf53-2e40-45d9-9d15-10b4c540d02f_1056x524.png 424w, https://substackcdn.com/image/fetch/$s_!jQkg!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F419aaf53-2e40-45d9-9d15-10b4c540d02f_1056x524.png 848w, https://substackcdn.com/image/fetch/$s_!jQkg!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F419aaf53-2e40-45d9-9d15-10b4c540d02f_1056x524.png 
1272w, https://substackcdn.com/image/fetch/$s_!jQkg!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F419aaf53-2e40-45d9-9d15-10b4c540d02f_1056x524.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Image created by the author.</figcaption></figure></div><ul><li><p>The producer has two client instances, one for the Kafka cluster and another for the Schema Registry.</p></li><li><p>The producer checks whether the schema is already in the Schema Registry. 
If it isn't, the producer sends a POST request to register it.</p></li><li><p>The producer retrieves the schema ID from the Schema Registry.</p></li><li><p>The producer serializes the message with the schema ID and sends the serialized message to the Kafka broker.</p></li><li><p>The consumer also has two client instances, one for the Kafka cluster and another for the Schema Registry.</p></li><li><p>The consumer polls the Kafka broker for new messages.</p></li><li><p>It extracts the schema ID from the first few bytes of the serialized message.</p></li><li><p>It then sends a GET request to the Schema Registry, using the schema ID, to retrieve the schema.</p></li><li><p>The consumer deserializes the message according to the schema, converting the binary data to its original format.</p></li></ul><p>This approach has some problems:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!efEC!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdd31b0f8-e33b-4fe3-bea9-5a43473addca_928x568.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!efEC!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdd31b0f8-e33b-4fe3-bea9-5a43473addca_928x568.png 424w, https://substackcdn.com/image/fetch/$s_!efEC!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdd31b0f8-e33b-4fe3-bea9-5a43473addca_928x568.png 848w, https://substackcdn.com/image/fetch/$s_!efEC!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdd31b0f8-e33b-4fe3-bea9-5a43473addca_928x568.png 1272w, 
https://substackcdn.com/image/fetch/$s_!efEC!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdd31b0f8-e33b-4fe3-bea9-5a43473addca_928x568.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!efEC!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdd31b0f8-e33b-4fe3-bea9-5a43473addca_928x568.png" width="928" height="568" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/dd31b0f8-e33b-4fe3-bea9-5a43473addca_928x568.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:568,&quot;width&quot;:928,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:143116,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://vutr.substack.com/i/157438538?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdd31b0f8-e33b-4fe3-bea9-5a43473addca_928x568.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!efEC!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdd31b0f8-e33b-4fe3-bea9-5a43473addca_928x568.png 424w, https://substackcdn.com/image/fetch/$s_!efEC!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdd31b0f8-e33b-4fe3-bea9-5a43473addca_928x568.png 848w, https://substackcdn.com/image/fetch/$s_!efEC!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdd31b0f8-e33b-4fe3-bea9-5a43473addca_928x568.png 1272w, 
https://substackcdn.com/image/fetch/$s_!efEC!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdd31b0f8-e33b-4fe3-bea9-5a43473addca_928x568.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Image created by the author.</figcaption></figure></div><ul><li><p>A misconfigured producer can send malformed or unregistered messages.</p></li><li><p>Bad data can still enter the system if a producer forgets to validate the schema.</p></li><li><p>The producer and consumer clients become thick. 
They must handle the schema validation logic, which adds code complexity, dependency-management overhead, and inconsistency across teams.</p></li></ul><p>Bufstream takes a different approach: it treats the schema as a first-class citizen, supporting Protobuf messages in both the binary format and the ProtoJSON format. Buf is working to support Avro and JSON messages in the future.</p><p>The broker can check and reject messages that don't match the topic's schema. It achieves this by integrating with any schema registry that implements the Confluent Schema Registry API, including the Confluent Schema Registry itself and the Buf Schema Registry (BSR). The BSR serves as a single source of truth for all the Protobuf assets, including the .proto files that define the data schema.</p><p>Whenever the Bufstream broker receives a Protobuf message from the producer, it checks whether the message&#8217;s schema matches the topic schema defined in the BSR. If it does, the broker accepts the message and prepares it to be written to object storage. 
If not, it rejects the message and informs the producer.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!Q_6S!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F029d98e6-6ac2-4895-97e0-dd5aef0fdac5_1054x574.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!Q_6S!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F029d98e6-6ac2-4895-97e0-dd5aef0fdac5_1054x574.png 424w, https://substackcdn.com/image/fetch/$s_!Q_6S!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F029d98e6-6ac2-4895-97e0-dd5aef0fdac5_1054x574.png 848w, https://substackcdn.com/image/fetch/$s_!Q_6S!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F029d98e6-6ac2-4895-97e0-dd5aef0fdac5_1054x574.png 1272w, https://substackcdn.com/image/fetch/$s_!Q_6S!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F029d98e6-6ac2-4895-97e0-dd5aef0fdac5_1054x574.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!Q_6S!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F029d98e6-6ac2-4895-97e0-dd5aef0fdac5_1054x574.png" width="1054" height="574" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/029d98e6-6ac2-4895-97e0-dd5aef0fdac5_1054x574.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:574,&quot;width&quot;:1054,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:141187,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://vutr.substack.com/i/157438538?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F029d98e6-6ac2-4895-97e0-dd5aef0fdac5_1054x574.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!Q_6S!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F029d98e6-6ac2-4895-97e0-dd5aef0fdac5_1054x574.png 424w, https://substackcdn.com/image/fetch/$s_!Q_6S!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F029d98e6-6ac2-4895-97e0-dd5aef0fdac5_1054x574.png 848w, https://substackcdn.com/image/fetch/$s_!Q_6S!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F029d98e6-6ac2-4895-97e0-dd5aef0fdac5_1054x574.png 1272w, https://substackcdn.com/image/fetch/$s_!Q_6S!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F029d98e6-6ac2-4895-97e0-dd5aef0fdac5_1054x574.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" 
viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Image created by the author.</figcaption></figure></div><p>In Kafka, client-side validation isn&#8217;t really validation; clients have to opt in to it. A trusted, centralized validation point is needed, which, in this case, is the broker. Since all clients connect to the broker, validation can be enforced there. 
Relying on client-side validation is risky because clients can simply skip it.</p><p>Additionally, Bufstream offers a more robust way to ensure data quality. Schema validation helps prevent bad data, but it is sometimes insufficient:</p><ul><li><p>You expect the &#8220;age&#8221; field to be an integer, but what if a value of 350 arrives?</p></li><li><p>You expect the &#8220;email&#8221; field to be a string, but what if the value &#8220;abc&#8221; arrives?</p></li></ul><p>Schema validation cannot catch these cases, because both values have the expected type. Bufstream lets you implement semantic validation of Protobuf messages at runtime <a href="https://buf.build/bufbuild/protovalidate">based on user-defined validation rules</a>. For example, an age field must have a value from 0 to 120, or an email must contain an &#8220;@&#8221;.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!tvvk!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F535ccddb-2e4d-4a3f-b8b0-afc5f96e9263_916x274.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!tvvk!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F535ccddb-2e4d-4a3f-b8b0-afc5f96e9263_916x274.png 424w, https://substackcdn.com/image/fetch/$s_!tvvk!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F535ccddb-2e4d-4a3f-b8b0-afc5f96e9263_916x274.png 848w, https://substackcdn.com/image/fetch/$s_!tvvk!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F535ccddb-2e4d-4a3f-b8b0-afc5f96e9263_916x274.png 1272w, 
https://substackcdn.com/image/fetch/$s_!tvvk!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F535ccddb-2e4d-4a3f-b8b0-afc5f96e9263_916x274.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!tvvk!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F535ccddb-2e4d-4a3f-b8b0-afc5f96e9263_916x274.png" width="916" height="274" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/535ccddb-2e4d-4a3f-b8b0-afc5f96e9263_916x274.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:274,&quot;width&quot;:916,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!tvvk!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F535ccddb-2e4d-4a3f-b8b0-afc5f96e9263_916x274.png 424w, https://substackcdn.com/image/fetch/$s_!tvvk!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F535ccddb-2e4d-4a3f-b8b0-afc5f96e9263_916x274.png 848w, https://substackcdn.com/image/fetch/$s_!tvvk!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F535ccddb-2e4d-4a3f-b8b0-afc5f96e9263_916x274.png 1272w, 
https://substackcdn.com/image/fetch/$s_!tvvk!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F535ccddb-2e4d-4a3f-b8b0-afc5f96e9263_916x274.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Image created by the author.</figcaption></figure></div><p>Beyond data quality, Bufstream provides granular access control: it can redact Protobuf fields on the fly so that only selected fields are exposed to consumers. 
Currently, this logic is static, but Buf plans to introduce field-level RBAC, enabling producers to tag sensitive fields in Protobuf schemas; consumers would then receive only the data they are authorized to see.</p><h3>Kafka topic &#8594; Iceberg table</h3><p>Suppose we want to run analytics on Kafka messages, such as ad-hoc exploration or reporting. We must build a pipeline with Kafka Connect, Spark, or Flink to consume messages from the Kafka topic, write them into files, and push these files to the data lake.</p><p>We have to take care of everything, from managing the pipeline to ensuring the physical layout of the files is optimized for downstream consumption (e.g., too many small files can hurt read performance).</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!Q836!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F92432e97-77c5-4f04-bc37-981cc3a9309e_1084x516.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!Q836!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F92432e97-77c5-4f04-bc37-981cc3a9309e_1084x516.png 424w, https://substackcdn.com/image/fetch/$s_!Q836!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F92432e97-77c5-4f04-bc37-981cc3a9309e_1084x516.png 848w, https://substackcdn.com/image/fetch/$s_!Q836!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F92432e97-77c5-4f04-bc37-981cc3a9309e_1084x516.png 1272w, 
https://substackcdn.com/image/fetch/$s_!Q836!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F92432e97-77c5-4f04-bc37-981cc3a9309e_1084x516.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!Q836!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F92432e97-77c5-4f04-bc37-981cc3a9309e_1084x516.png" width="1084" height="516" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/92432e97-77c5-4f04-bc37-981cc3a9309e_1084x516.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:516,&quot;width&quot;:1084,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:163566,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://vutr.substack.com/i/157438538?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F92432e97-77c5-4f04-bc37-981cc3a9309e_1084x516.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!Q836!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F92432e97-77c5-4f04-bc37-981cc3a9309e_1084x516.png 424w, https://substackcdn.com/image/fetch/$s_!Q836!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F92432e97-77c5-4f04-bc37-981cc3a9309e_1084x516.png 848w, https://substackcdn.com/image/fetch/$s_!Q836!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F92432e97-77c5-4f04-bc37-981cc3a9309e_1084x516.png 
1272w, https://substackcdn.com/image/fetch/$s_!Q836!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F92432e97-77c5-4f04-bc37-981cc3a9309e_1084x516.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Image created by the author.</figcaption></figure></div><p>Since Bufstream already stores the topic&#8217;s messages in object storage, it can transform the data in transit so that it lands in S3 as Parquet files with Iceberg metadata on top. Users don&#8217;t have to deploy, monitor, or manage a dedicated data pipeline; Bufstream handles all of that. 
With schema awareness, Bufstream can synchronously update the user's Iceberg catalog so that it reflects field changes and new files.</p><p>Here&#8217;s an interesting point: the way Bufstream stores the Iceberg table is unusual. Other systems, such as <a href="https://www.confluent.io/blog/introducing-tableflow/">Tableflow from Confluent</a>, promise to write Kafka messages to an Iceberg table by reading Kafka data and copying it over, thereby duplicating data for two different purposes: serving consumers and handling analytics workloads. In contrast, Bufstream stores <strong>only</strong> the Iceberg tables.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!sOK9!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F21519013-7ff4-42d5-962e-f7bbafca1774_566x504.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!sOK9!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F21519013-7ff4-42d5-962e-f7bbafca1774_566x504.png 424w, https://substackcdn.com/image/fetch/$s_!sOK9!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F21519013-7ff4-42d5-962e-f7bbafca1774_566x504.png 848w, https://substackcdn.com/image/fetch/$s_!sOK9!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F21519013-7ff4-42d5-962e-f7bbafca1774_566x504.png 1272w, https://substackcdn.com/image/fetch/$s_!sOK9!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F21519013-7ff4-42d5-962e-f7bbafca1774_566x504.png 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!sOK9!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F21519013-7ff4-42d5-962e-f7bbafca1774_566x504.png" width="566" height="504" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/21519013-7ff4-42d5-962e-f7bbafca1774_566x504.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:504,&quot;width&quot;:566,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:82730,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://vutr.substack.com/i/157438538?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F21519013-7ff4-42d5-962e-f7bbafca1774_566x504.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!sOK9!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F21519013-7ff4-42d5-962e-f7bbafca1774_566x504.png 424w, https://substackcdn.com/image/fetch/$s_!sOK9!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F21519013-7ff4-42d5-962e-f7bbafca1774_566x504.png 848w, https://substackcdn.com/image/fetch/$s_!sOK9!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F21519013-7ff4-42d5-962e-f7bbafca1774_566x504.png 1272w, https://substackcdn.com/image/fetch/$s_!sOK9!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F21519013-7ff4-42d5-962e-f7bbafca1774_566x504.png 1456w" sizes="100vw" 
loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Image created by the author.</figcaption></figure></div><p>Remember the Bufstream message-writing process mentioned above? Initially, it writes messages into intake files and later rewrites them into archive files. With Iceberg integrations, Bufstream will rewrite the intake files directly into Iceberg tables. It uses the Iceberg table for both Kafka and the lakehouse storage layer. 
The query engine can tap into this layer to process data, while the broker will read data from these Iceberg tables and return it row by row to consumers when they poll for messages.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!pRIZ!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F92e7f988-ad1d-4457-a209-2ef027234879_1038x464.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!pRIZ!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F92e7f988-ad1d-4457-a209-2ef027234879_1038x464.png 424w, https://substackcdn.com/image/fetch/$s_!pRIZ!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F92e7f988-ad1d-4457-a209-2ef027234879_1038x464.png 848w, https://substackcdn.com/image/fetch/$s_!pRIZ!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F92e7f988-ad1d-4457-a209-2ef027234879_1038x464.png 1272w, https://substackcdn.com/image/fetch/$s_!pRIZ!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F92e7f988-ad1d-4457-a209-2ef027234879_1038x464.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!pRIZ!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F92e7f988-ad1d-4457-a209-2ef027234879_1038x464.png" width="1038" height="464" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/92e7f988-ad1d-4457-a209-2ef027234879_1038x464.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:464,&quot;width&quot;:1038,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:121811,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://vutr.substack.com/i/157438538?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F92e7f988-ad1d-4457-a209-2ef027234879_1038x464.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!pRIZ!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F92e7f988-ad1d-4457-a209-2ef027234879_1038x464.png 424w, https://substackcdn.com/image/fetch/$s_!pRIZ!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F92e7f988-ad1d-4457-a209-2ef027234879_1038x464.png 848w, https://substackcdn.com/image/fetch/$s_!pRIZ!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F92e7f988-ad1d-4457-a209-2ef027234879_1038x464.png 1272w, https://substackcdn.com/image/fetch/$s_!pRIZ!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F92e7f988-ad1d-4457-a209-2ef027234879_1038x464.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" 
viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Image created by the author.</figcaption></figure></div><p>Using the Iceberg table as a 2-for-1 solution like this can lead to massive storage savings. 
With this Bufstream feature, users can reuse the storage already allocated for Iceberg tables in the lakehouse, effectively eliminating the cost of Kafka storage altogether.</p><p>With support for popular Iceberg catalogs such as REST Catalog and BigQuery Metastore, plus upcoming support for Databricks Unity Catalog, Snowflake Polaris, and AWS Glue, you can use any Iceberg-compatible query engine to access the Iceberg tables that Bufstream writes.</p><p>Here is how Bufstream transforms Kafka messages into an Iceberg table:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!iv1w!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb12f15e3-d356-4664-8ce5-2e296a8561be_1060x748.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!iv1w!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb12f15e3-d356-4664-8ce5-2e296a8561be_1060x748.png 424w, https://substackcdn.com/image/fetch/$s_!iv1w!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb12f15e3-d356-4664-8ce5-2e296a8561be_1060x748.png 848w, https://substackcdn.com/image/fetch/$s_!iv1w!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb12f15e3-d356-4664-8ce5-2e296a8561be_1060x748.png 1272w, https://substackcdn.com/image/fetch/$s_!iv1w!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb12f15e3-d356-4664-8ce5-2e296a8561be_1060x748.png 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!iv1w!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb12f15e3-d356-4664-8ce5-2e296a8561be_1060x748.png" width="1060" height="748" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/b12f15e3-d356-4664-8ce5-2e296a8561be_1060x748.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:748,&quot;width&quot;:1060,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:236996,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://vutr.substack.com/i/157438538?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb12f15e3-d356-4664-8ce5-2e296a8561be_1060x748.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!iv1w!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb12f15e3-d356-4664-8ce5-2e296a8561be_1060x748.png 424w, https://substackcdn.com/image/fetch/$s_!iv1w!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb12f15e3-d356-4664-8ce5-2e296a8561be_1060x748.png 848w, https://substackcdn.com/image/fetch/$s_!iv1w!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb12f15e3-d356-4664-8ce5-2e296a8561be_1060x748.png 1272w, https://substackcdn.com/image/fetch/$s_!iv1w!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb12f15e3-d356-4664-8ce5-2e296a8561be_1060x748.png 1456w" 
sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Image created by the author.</figcaption></figure></div><ul><li><p>To enable the Iceberg integration, the user sets the archive format to <code>iceberg</code>.</p></li><li><p>First, the writer contacts the BSR to fetch the latest schema and caches it in memory for later messages on the same topic.</p></li><li><p>The writer uses this schema to form the Iceberg table schema. 
To handle schema evolution, Bufstream keeps the Iceberg schema state in the metadata store.</p></li><li><p>After the writer forms the schema, it checks with the Iceberg catalog whether the schema has changed. If it has, the writer updates the schema in the metadata store. If the destination table does not exist, the writer creates the table and sets the schema ID to 0.</p></li><li><p>The writer derives the Parquet schema from the Iceberg schema to prepare to write the data files.</p></li><li><p>After writing the Parquet data files, the writer writes the manifest files, the manifest lists, and the metadata files.</p></li><li><p>Finally, the writer tells the catalog to update the table&#8217;s current metadata pointer to the new metadata file.</p></li></ul><div><hr></div><h2>My Thoughts</h2><p>Although storing data in object storage can make Bufstream far cheaper than Kafka, it sacrifices the low latency of local disks. In their benchmark, the median end-to-end latency <a href="https://buf.build/docs/bufstream/cost/#the-benchmark-setup">was 260 milliseconds, and the p99 latency was 500 milliseconds</a>. Still, these numbers are considerably better than those of other solutions, such as WarpStream.</p><p>Bufstream offers a way to tune latency. It batches messages before writing to object storage to limit the number of PUT requests. Users can therefore reduce the batch size to lower latency, but the more frequent PUT requests to object storage will increase the cost.</p><p>Given the vast cost savings compared to Kafka, Bufstream&#8217;s latency is acceptable. Unless your use case requires very low latency, the trade-off has little impact.</p><p>But if we set latency aside, Bufstream presents a strong alternative to Kafka in the cloud. 
Beyond cost efficiency, it offers a straightforward deployment model, built-in schema awareness for data quality enforcement, and the seamless transformation of Kafka&#8217;s storage layer into a lakehouse.</p><p>The native Iceberg support is especially valuable to me. In data engineering, transforming message queue data into analytics tables is inevitable. By transforming Kafka topics into Iceberg tables, Bufstream significantly reduces the burden on data engineers. The Iceberg format ensures broad compatibility, letting us run our favorite query engine on top of it, from Databricks, Snowflake, and BigQuery to Spark or Trino. Avoiding vendor lock-in is a big win for any company.</p><div><hr></div><h2>Outro</h2><p>Thank you for reading this far!</p><p>Throughout this article, we&#8217;ve explored why Kafka may be inefficient in the cloud, how Bufstream offers a more cost-effective alternative by storing data in object storage, how it enhances data quality by making the broker schema-aware, and how Bufstream seamlessly transforms topic messages into Iceberg tables. We wrapped up the article with some of my naive thoughts.</p><p>Now it&#8217;s time to say goodbye. 
See you in my next articles :)</p><div><hr></div><h2>Reference</h2><p><em>[1] Gwen Shapira, Todd Palino, Rajini Sivaram, Krit Petty, <a href="https://www.confluent.io/resources/ebook/kafka-the-definitive-guide/">Kafka: The Definitive Guide: Real-Time Data and Stream Processing at Scale</a> (2021)</em></p><p><em>[2] Confluent, <a href="https://www.confluent.io/blog/how-schema-registry-clients-work/">Schema Registry Clients in Action</a> (2024)</em></p><p><em>[3] <a href="https://buf.build/docs/bufstream/">Bufstream Documentation</a></em></p><p><em>[4] <a href="https://www.linkedin.com/posts/stanislavkozlovski_bufstream-activity-7296172586965635072-zfIp?utm_source=share&amp;utm_medium=member_desktop&amp;rcm=ACoAACaI7mQBV1xJYGEQ7HhOYLsECQJDDi_X1-4">Stanislav Kozlovski&#8217;s Bufstream post on LinkedIn</a></em></p>]]></content:encoded></item><item><title><![CDATA[Bauplan: Operate your lakehouse with zero infrastructure]]></title><description><![CDATA[FaaS data pipelines on S3]]></description><link>https://vutr.substack.com/p/bauplan-operate-your-lakehouse-with</link><guid isPermaLink="false">https://vutr.substack.com/p/bauplan-operate-your-lakehouse-with</guid><dc:creator><![CDATA[Vu Trinh]]></dc:creator><pubDate>Thu, 20 Mar 2025 03:15:37 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!B8NP!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F630ecef6-7c95-46f0-a849-bc57654fa14b_2000x1428.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p></p><blockquote><p><em>I&#8217;m making my life less dull by spending time learning and researching &#8220;how it works&#8221; in the data engineering field. </em></p><p><em>Here is a place where I share everything I&#8217;ve learned. Not subscribed yet? 
Here you go:</em></p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://vutr.substack.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe now&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://vutr.substack.com/subscribe?"><span>Subscribe now</span></a></p></blockquote><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!B8NP!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F630ecef6-7c95-46f0-a849-bc57654fa14b_2000x1428.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!B8NP!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F630ecef6-7c95-46f0-a849-bc57654fa14b_2000x1428.png 424w, https://substackcdn.com/image/fetch/$s_!B8NP!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F630ecef6-7c95-46f0-a849-bc57654fa14b_2000x1428.png 848w, https://substackcdn.com/image/fetch/$s_!B8NP!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F630ecef6-7c95-46f0-a849-bc57654fa14b_2000x1428.png 1272w, https://substackcdn.com/image/fetch/$s_!B8NP!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F630ecef6-7c95-46f0-a849-bc57654fa14b_2000x1428.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!B8NP!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F630ecef6-7c95-46f0-a849-bc57654fa14b_2000x1428.png" width="1456" height="1040" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/630ecef6-7c95-46f0-a849-bc57654fa14b_2000x1428.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1040,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:733529,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://vutr.substack.com/i/158296262?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F630ecef6-7c95-46f0-a849-bc57654fa14b_2000x1428.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!B8NP!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F630ecef6-7c95-46f0-a849-bc57654fa14b_2000x1428.png 424w, https://substackcdn.com/image/fetch/$s_!B8NP!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F630ecef6-7c95-46f0-a849-bc57654fa14b_2000x1428.png 848w, https://substackcdn.com/image/fetch/$s_!B8NP!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F630ecef6-7c95-46f0-a849-bc57654fa14b_2000x1428.png 1272w, https://substackcdn.com/image/fetch/$s_!B8NP!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F630ecef6-7c95-46f0-a849-bc57654fa14b_2000x1428.png 1456w" sizes="100vw" fetchpriority="high"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" 
height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Image created by the author.</figcaption></figure></div><div><hr></div><h2>Intro</h2><p>AWS Lambda is a fascinating service.</p><p>I first used it in 2021, and the experience was seamless. I wrote some code and configured how it should be triggered, and that was it.</p><p>Whenever a new file arrived in S3, my Lambda function would wake up, execute some logic, and then go back to sleep. I didn&#8217;t have to worry about the infrastructure at all.</p><p>That made me wonder: could I achieve the same simplicity with my data pipelines?</p><p>What if there was no need to set up an Airflow environment or provision a Spark cluster? 
What if I could define the pipeline logic&#8212;similar to an AWS Lambda function&#8212;and somehow, the input data would transform into the desired output?</p><p>This week, we&#8217;re diving into Bauplan, a solution that makes that wish come true.</p><div><hr></div><h2>Overview</h2><p>Function-as-a-Service (FaaS) is a cloud computing model that allows developers to run code in response to events without managing the infrastructure. It enables a serverless approach, where the cloud provider handles provisioning, scaling for bursty workloads, and execution, allowing engineers to focus on writing logic. A well-known example is AWS Lambda.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!mVa3!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F59eacd31-cb8b-4088-a62b-278a62a4a2e2_922x446.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!mVa3!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F59eacd31-cb8b-4088-a62b-278a62a4a2e2_922x446.png 424w, https://substackcdn.com/image/fetch/$s_!mVa3!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F59eacd31-cb8b-4088-a62b-278a62a4a2e2_922x446.png 848w, https://substackcdn.com/image/fetch/$s_!mVa3!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F59eacd31-cb8b-4088-a62b-278a62a4a2e2_922x446.png 1272w, https://substackcdn.com/image/fetch/$s_!mVa3!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F59eacd31-cb8b-4088-a62b-278a62a4a2e2_922x446.png 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!mVa3!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F59eacd31-cb8b-4088-a62b-278a62a4a2e2_922x446.png" width="922" height="446" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/59eacd31-cb8b-4088-a62b-278a62a4a2e2_922x446.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:446,&quot;width&quot;:922,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:105809,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://vutr.substack.com/i/158296262?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F59eacd31-cb8b-4088-a62b-278a62a4a2e2_922x446.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!mVa3!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F59eacd31-cb8b-4088-a62b-278a62a4a2e2_922x446.png 424w, https://substackcdn.com/image/fetch/$s_!mVa3!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F59eacd31-cb8b-4088-a62b-278a62a4a2e2_922x446.png 848w, https://substackcdn.com/image/fetch/$s_!mVa3!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F59eacd31-cb8b-4088-a62b-278a62a4a2e2_922x446.png 1272w, https://substackcdn.com/image/fetch/$s_!mVa3!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F59eacd31-cb8b-4088-a62b-278a62a4a2e2_922x446.png 1456w" sizes="100vw" 
loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Image created by the author.</figcaption></figure></div><p>FaaS keeps things simple for developers.</p><p><a href="https://www.bauplanlabs.com/">Bauplan</a>, a team from New York and San Francisco, believes that the FaaS model can also simplify work for data engineers, analysts, data scientists, and anyone else who wants to work with data.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" 
href="https://substackcdn.com/image/fetch/$s_!Ol9W!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2f0b7487-df79-412f-bcfa-471eb1448ab9_592x326.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!Ol9W!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2f0b7487-df79-412f-bcfa-471eb1448ab9_592x326.png 424w, https://substackcdn.com/image/fetch/$s_!Ol9W!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2f0b7487-df79-412f-bcfa-471eb1448ab9_592x326.png 848w, https://substackcdn.com/image/fetch/$s_!Ol9W!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2f0b7487-df79-412f-bcfa-471eb1448ab9_592x326.png 1272w, https://substackcdn.com/image/fetch/$s_!Ol9W!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2f0b7487-df79-412f-bcfa-471eb1448ab9_592x326.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!Ol9W!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2f0b7487-df79-412f-bcfa-471eb1448ab9_592x326.png" width="592" height="326" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/2f0b7487-df79-412f-bcfa-471eb1448ab9_592x326.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:326,&quot;width&quot;:592,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:69456,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://vutr.substack.com/i/158296262?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2f0b7487-df79-412f-bcfa-471eb1448ab9_592x326.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!Ol9W!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2f0b7487-df79-412f-bcfa-471eb1448ab9_592x326.png 424w, https://substackcdn.com/image/fetch/$s_!Ol9W!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2f0b7487-df79-412f-bcfa-471eb1448ab9_592x326.png 848w, https://substackcdn.com/image/fetch/$s_!Ol9W!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2f0b7487-df79-412f-bcfa-471eb1448ab9_592x326.png 1272w, https://substackcdn.com/image/fetch/$s_!Ol9W!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2f0b7487-df79-412f-bcfa-471eb1448ab9_592x326.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 
0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Image created by the author.</figcaption></figure></div><p>Still, the FaaS solutions available on the market do not adapt well to data workloads. 
We usually define a data pipeline as a Directed Acyclic Graph (DAG), in which each node is a function that receives data from previous nodes, applies logic, and outputs results for the following nodes.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!QQqp!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0a801450-9a24-4fda-abee-9fd67ca4bfb9_898x354.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!QQqp!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0a801450-9a24-4fda-abee-9fd67ca4bfb9_898x354.png 424w, https://substackcdn.com/image/fetch/$s_!QQqp!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0a801450-9a24-4fda-abee-9fd67ca4bfb9_898x354.png 848w, https://substackcdn.com/image/fetch/$s_!QQqp!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0a801450-9a24-4fda-abee-9fd67ca4bfb9_898x354.png 1272w, https://substackcdn.com/image/fetch/$s_!QQqp!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0a801450-9a24-4fda-abee-9fd67ca4bfb9_898x354.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!QQqp!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0a801450-9a24-4fda-abee-9fd67ca4bfb9_898x354.png" width="898" height="354" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/0a801450-9a24-4fda-abee-9fd67ca4bfb9_898x354.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:354,&quot;width&quot;:898,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:64569,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://vutr.substack.com/i/158296262?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0a801450-9a24-4fda-abee-9fd67ca4bfb9_898x354.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!QQqp!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0a801450-9a24-4fda-abee-9fd67ca4bfb9_898x354.png 424w, https://substackcdn.com/image/fetch/$s_!QQqp!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0a801450-9a24-4fda-abee-9fd67ca4bfb9_898x354.png 848w, https://substackcdn.com/image/fetch/$s_!QQqp!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0a801450-9a24-4fda-abee-9fd67ca4bfb9_898x354.png 1272w, https://substackcdn.com/image/fetch/$s_!QQqp!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0a801450-9a24-4fda-abee-9fd67ca4bfb9_898x354.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 
0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Image created by the author.</figcaption></figure></div><p>Modularizing the business logic into nodes makes development, collaboration, and testing more convenient. But implementing a DAG data pipeline on the available FaaS solutions raises some challenges:</p><ul><li><p><strong>Scaling: </strong>Existing FaaS runtimes are designed for simple, independent functions that produce small outputs (e.g., a webhook), which limits them when applied to data pipelines. Additionally, these platforms usually reuse instances for subsequent triggers, so the same function instance may receive 10 GB of input on one run and 1 TB on the next. 
&#8220;Out of memory&#8221; errors are common.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!1R5W!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa80f11e1-700d-4cd7-b038-10cae424b629_472x268.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!1R5W!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa80f11e1-700d-4cd7-b038-10cae424b629_472x268.png 424w, https://substackcdn.com/image/fetch/$s_!1R5W!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa80f11e1-700d-4cd7-b038-10cae424b629_472x268.png 848w, https://substackcdn.com/image/fetch/$s_!1R5W!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa80f11e1-700d-4cd7-b038-10cae424b629_472x268.png 1272w, https://substackcdn.com/image/fetch/$s_!1R5W!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa80f11e1-700d-4cd7-b038-10cae424b629_472x268.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!1R5W!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa80f11e1-700d-4cd7-b038-10cae424b629_472x268.png" width="472" height="268" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/a80f11e1-700d-4cd7-b038-10cae424b629_472x268.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:268,&quot;width&quot;:472,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:47764,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://vutr.substack.com/i/158296262?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa80f11e1-700d-4cd7-b038-10cae424b629_472x268.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!1R5W!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa80f11e1-700d-4cd7-b038-10cae424b629_472x268.png 424w, https://substackcdn.com/image/fetch/$s_!1R5W!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa80f11e1-700d-4cd7-b038-10cae424b629_472x268.png 848w, https://substackcdn.com/image/fetch/$s_!1R5W!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa80f11e1-700d-4cd7-b038-10cae424b629_472x268.png 1272w, https://substackcdn.com/image/fetch/$s_!1R5W!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa80f11e1-700d-4cd7-b038-10cae424b629_472x268.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 
0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Image created by the author.</figcaption></figure></div></li><li><p><strong>Large intermediate I/O</strong>: To implement the DAG concept, users must chain functions. Each function acts as a &#8220;node&#8221; that receives input from the previous functions and produces output for the following ones. Data functions typically have large inputs and outputs, which increases the cost of serializing and moving the data payload between functions. 
The chaining patterns that popular FaaS platforms recommend are limiting, because intermediate data frames can only be passed between functions through object storage.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!yLbq!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc5cd9f4d-13b9-47e4-8760-0a49b403df70_552x330.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!yLbq!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc5cd9f4d-13b9-47e4-8760-0a49b403df70_552x330.png 424w, https://substackcdn.com/image/fetch/$s_!yLbq!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc5cd9f4d-13b9-47e4-8760-0a49b403df70_552x330.png 848w, https://substackcdn.com/image/fetch/$s_!yLbq!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc5cd9f4d-13b9-47e4-8760-0a49b403df70_552x330.png 1272w, https://substackcdn.com/image/fetch/$s_!yLbq!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc5cd9f4d-13b9-47e4-8760-0a49b403df70_552x330.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!yLbq!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc5cd9f4d-13b9-47e4-8760-0a49b403df70_552x330.png" width="552" height="330" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/c5cd9f4d-13b9-47e4-8760-0a49b403df70_552x330.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:330,&quot;width&quot;:552,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:55564,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://vutr.substack.com/i/158296262?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc5cd9f4d-13b9-47e4-8760-0a49b403df70_552x330.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!yLbq!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc5cd9f4d-13b9-47e4-8760-0a49b403df70_552x330.png 424w, https://substackcdn.com/image/fetch/$s_!yLbq!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc5cd9f4d-13b9-47e4-8760-0a49b403df70_552x330.png 848w, https://substackcdn.com/image/fetch/$s_!yLbq!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc5cd9f4d-13b9-47e4-8760-0a49b403df70_552x330.png 1272w, https://substackcdn.com/image/fetch/$s_!yLbq!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc5cd9f4d-13b9-47e4-8760-0a49b403df70_552x330.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 
0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Image created by the author.</figcaption></figure></div></li><li><p><strong>Slow feedback loop</strong>: Data science projects are exploratory and require rapid iteration to validate hypotheses. Current FaaS platforms lack the interactivity needed for these projects due to their slow build times and lack of interactive logging. 
AWS Lambda, for instance, provides observability only through CloudWatch.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!x87c!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7faaab81-6f83-427f-9e99-0e97902ba7af_384x282.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!x87c!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7faaab81-6f83-427f-9e99-0e97902ba7af_384x282.png 424w, https://substackcdn.com/image/fetch/$s_!x87c!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7faaab81-6f83-427f-9e99-0e97902ba7af_384x282.png 848w, https://substackcdn.com/image/fetch/$s_!x87c!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7faaab81-6f83-427f-9e99-0e97902ba7af_384x282.png 1272w, https://substackcdn.com/image/fetch/$s_!x87c!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7faaab81-6f83-427f-9e99-0e97902ba7af_384x282.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!x87c!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7faaab81-6f83-427f-9e99-0e97902ba7af_384x282.png" width="384" height="282" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/7faaab81-6f83-427f-9e99-0e97902ba7af_384x282.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:282,&quot;width&quot;:384,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:42918,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://vutr.substack.com/i/158296262?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7faaab81-6f83-427f-9e99-0e97902ba7af_384x282.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!x87c!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7faaab81-6f83-427f-9e99-0e97902ba7af_384x282.png 424w, https://substackcdn.com/image/fetch/$s_!x87c!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7faaab81-6f83-427f-9e99-0e97902ba7af_384x282.png 848w, https://substackcdn.com/image/fetch/$s_!x87c!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7faaab81-6f83-427f-9e99-0e97902ba7af_384x282.png 1272w, https://substackcdn.com/image/fetch/$s_!x87c!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7faaab81-6f83-427f-9e99-0e97902ba7af_384x282.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" 
height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Image created by the author.</figcaption></figure></div></li></ul><p>So, how does Bauplan promise to solve these problems?</p><div><hr></div><h2>The Bauplan FaaS</h2><p>Bauplan is a FaaS platform designed for data pipelines. Unlike other platforms, Bauplan starts and scales independent instances for every run. 
It also promises to speed up intermediate data exchange and to let users modify and run DAGs interactively.</p><p>Its design principles are:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!gNGs!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F65db2a00-ba35-4896-96ac-6211159d7be2_568x242.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!gNGs!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F65db2a00-ba35-4896-96ac-6211159d7be2_568x242.png 424w, https://substackcdn.com/image/fetch/$s_!gNGs!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F65db2a00-ba35-4896-96ac-6211159d7be2_568x242.png 848w, https://substackcdn.com/image/fetch/$s_!gNGs!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F65db2a00-ba35-4896-96ac-6211159d7be2_568x242.png 1272w, https://substackcdn.com/image/fetch/$s_!gNGs!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F65db2a00-ba35-4896-96ac-6211159d7be2_568x242.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!gNGs!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F65db2a00-ba35-4896-96ac-6211159d7be2_568x242.png" width="568" height="242" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/65db2a00-ba35-4896-96ac-6211159d7be2_568x242.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:242,&quot;width&quot;:568,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:82393,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://vutr.substack.com/i/158296262?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F65db2a00-ba35-4896-96ac-6211159d7be2_568x242.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!gNGs!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F65db2a00-ba35-4896-96ac-6211159d7be2_568x242.png 424w, https://substackcdn.com/image/fetch/$s_!gNGs!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F65db2a00-ba35-4896-96ac-6211159d7be2_568x242.png 848w, https://substackcdn.com/image/fetch/$s_!gNGs!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F65db2a00-ba35-4896-96ac-6211159d7be2_568x242.png 1272w, https://substackcdn.com/image/fetch/$s_!gNGs!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F65db2a00-ba35-4896-96ac-6211159d7be2_568x242.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 
0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Image created by the author.</figcaption></figure></div><ul><li><p>Bauplan aims to make each execution stateless: its instances live only for the duration of that run. Starting with new instances each time lets Bauplan adapt to different resource requirements; the same pipeline can run on a 10 GB dataset and later scale up to a 100 GB dataset.</p></li><li><p>For infrastructure, Bauplan pipelines run on cloud virtual machines (VMs), which offer the highest level of customization. 
Using cloud VMs also allows Bauplan to offer multiple deployment models, such as Bring Your Own Cloud (BYOC), where customers control where data is stored and processed.</p></li><li><p>Bauplan differs from other tools in having both data and runtime awareness (in contrast, serverless runtimes like AWS Lambda don&#8217;t know about the data, and orchestration tools don&#8217;t know about the runtime). We&#8217;ll explore this design more deeply when we run some code later.</p></li><li><p>Bauplan brings an interactive experience to the developer; although the pipeline runs in the cloud, users can develop as if it were running on their laptops. Bauplan provides a CLI tool and a Python SDK for users to interact with the system.</p></li><li><p>Users define a function in Bauplan by specifying tables as input and output.</p></li></ul><h3>Architecture</h3><p>Bauplan has a Control Plane (CP) and a Data Plane (DP):</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!-ftB!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F62a36f2c-7bfc-428d-9bc3-35495a10dd8f_896x430.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!-ftB!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F62a36f2c-7bfc-428d-9bc3-35495a10dd8f_896x430.png 424w, https://substackcdn.com/image/fetch/$s_!-ftB!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F62a36f2c-7bfc-428d-9bc3-35495a10dd8f_896x430.png 848w, https://substackcdn.com/image/fetch/$s_!-ftB!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F62a36f2c-7bfc-428d-9bc3-35495a10dd8f_896x430.png 
1272w, https://substackcdn.com/image/fetch/$s_!-ftB!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F62a36f2c-7bfc-428d-9bc3-35495a10dd8f_896x430.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!-ftB!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F62a36f2c-7bfc-428d-9bc3-35495a10dd8f_896x430.png" width="896" height="430" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/62a36f2c-7bfc-428d-9bc3-35495a10dd8f_896x430.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:430,&quot;width&quot;:896,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:110946,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://vutr.substack.com/i/158296262?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F62a36f2c-7bfc-428d-9bc3-35495a10dd8f_896x430.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!-ftB!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F62a36f2c-7bfc-428d-9bc3-35495a10dd8f_896x430.png 424w, https://substackcdn.com/image/fetch/$s_!-ftB!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F62a36f2c-7bfc-428d-9bc3-35495a10dd8f_896x430.png 848w, https://substackcdn.com/image/fetch/$s_!-ftB!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F62a36f2c-7bfc-428d-9bc3-35495a10dd8f_896x430.png 
1272w, https://substackcdn.com/image/fetch/$s_!-ftB!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F62a36f2c-7bfc-428d-9bc3-35495a10dd8f_896x430.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Image created by the author.</figcaption></figure></div><ul><li><p>The CP exposes multi-tenant APIs. It only deals with metadata. The CP lives in Bauplan&#8217;s VPC.</p></li><li><p>Each customer has a DP, a fleet of one or more cloud VMs that can be deployed in the customer VPCs. 
A Golang binary is installed in each VM to spawn workers. These workers are the only Bauplan components that can access customer data.</p></li></ul><p>To give developers an interactive experience, there is a bidirectional gRPC connection between customers and workers. Users can add <code>print</code> or <code>logging</code> statements to see what happens during a pipeline run; although the code runs inside cloud VMs, the output is immediately visible to them through this connection.</p><h3>Planning</h3><p>So the CP deals only with metadata, but what exactly is it responsible for?</p><p>Bauplan acts like a database: it translates Python and SQL code into an execution plan when it begins running the pipeline. When the user requests a run, the code is routed to the control plane (CP). The CP parses the code and reconstructs the DAG topology from the functions, producing a logical plan.</p><p>This plan only represents the dependencies between steps and the required packages as specified by the user. 
Importantly, Bauplan will refuse to run DAGs that refer to non-existing tables (unlike dbt, for example), point to wrong snapshots, or ship Python code with invalid formatting.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!sOEK!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F76939d02-3465-44f3-8483-85bae5a751da_1448x544.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!sOEK!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F76939d02-3465-44f3-8483-85bae5a751da_1448x544.png 424w, https://substackcdn.com/image/fetch/$s_!sOEK!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F76939d02-3465-44f3-8483-85bae5a751da_1448x544.png 848w, https://substackcdn.com/image/fetch/$s_!sOEK!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F76939d02-3465-44f3-8483-85bae5a751da_1448x544.png 1272w, https://substackcdn.com/image/fetch/$s_!sOEK!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F76939d02-3465-44f3-8483-85bae5a751da_1448x544.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!sOEK!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F76939d02-3465-44f3-8483-85bae5a751da_1448x544.png" width="1448" height="544" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/76939d02-3465-44f3-8483-85bae5a751da_1448x544.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:544,&quot;width&quot;:1448,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:211151,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://vutr.substack.com/i/158296262?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F76939d02-3465-44f3-8483-85bae5a751da_1448x544.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!sOEK!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F76939d02-3465-44f3-8483-85bae5a751da_1448x544.png 424w, https://substackcdn.com/image/fetch/$s_!sOEK!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F76939d02-3465-44f3-8483-85bae5a751da_1448x544.png 848w, https://substackcdn.com/image/fetch/$s_!sOEK!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F76939d02-3465-44f3-8483-85bae5a751da_1448x544.png 1272w, https://substackcdn.com/image/fetch/$s_!sOEK!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F76939d02-3465-44f3-8483-85bae5a751da_1448x544.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" 
viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Image created by the author.</figcaption></figure></div><p>To give the workers running instructions, the CP derives a physical plan from the logical one. The physical plan contains instructions for the containerized runtime of the transformation functions and maps each dataframe to a physical table in object storage. Once the physical plan is ready, the CP sends it to the workers to start execution.</p><blockquote><p><em>Data in Bauplan is stored in Iceberg tables in object storage; we will explore the storage layer soon.</em></p></blockquote><h3>The cache</h3><p>As mentioned, function instances exist only during execution. Two runs of the same data pipeline therefore use different sets of instances. 
To reduce latency when re-running the pipeline, Bauplan developed a robust package caching mechanism that avoids re-installing packages across runs, thus avoiding the overhead of calls to PyPI.</p><p>For data caching, Bauplan&#8217;s data awareness makes database-like optimizations possible:</p><ul><li><p><strong>Re-using intermediate data</strong>: Functions produce intermediate dataframes, and Bauplan tracks changes in code and data so it can cache and reuse them.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!5Ly-!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4069726b-749e-4a41-8b9b-eab2e9bd5de0_564x272.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!5Ly-!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4069726b-749e-4a41-8b9b-eab2e9bd5de0_564x272.png 424w, https://substackcdn.com/image/fetch/$s_!5Ly-!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4069726b-749e-4a41-8b9b-eab2e9bd5de0_564x272.png 848w, https://substackcdn.com/image/fetch/$s_!5Ly-!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4069726b-749e-4a41-8b9b-eab2e9bd5de0_564x272.png 1272w, https://substackcdn.com/image/fetch/$s_!5Ly-!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4069726b-749e-4a41-8b9b-eab2e9bd5de0_564x272.png 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!5Ly-!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4069726b-749e-4a41-8b9b-eab2e9bd5de0_564x272.png" width="564" height="272" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/4069726b-749e-4a41-8b9b-eab2e9bd5de0_564x272.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:272,&quot;width&quot;:564,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:47090,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://vutr.substack.com/i/158296262?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4069726b-749e-4a41-8b9b-eab2e9bd5de0_564x272.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!5Ly-!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4069726b-749e-4a41-8b9b-eab2e9bd5de0_564x272.png 424w, https://substackcdn.com/image/fetch/$s_!5Ly-!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4069726b-749e-4a41-8b9b-eab2e9bd5de0_564x272.png 848w, https://substackcdn.com/image/fetch/$s_!5Ly-!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4069726b-749e-4a41-8b9b-eab2e9bd5de0_564x272.png 1272w, https://substackcdn.com/image/fetch/$s_!5Ly-!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4069726b-749e-4a41-8b9b-eab2e9bd5de0_564x272.png 1456w" sizes="100vw" 
loading="lazy"></picture></div></a><figcaption class="image-caption">Image created by the author.</figcaption></figure></div></li><li><p><strong>Retrieving only missing columns</strong>: The first run reads four columns from the table, and the second run requires exactly these four columns plus column X. 
Bauplan will reuse the four columns from the cache and only download one additional column X from the data source.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!ltQ4!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fff42229c-c448-45de-a1ff-75ab84417447_616x300.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!ltQ4!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fff42229c-c448-45de-a1ff-75ab84417447_616x300.png 424w, https://substackcdn.com/image/fetch/$s_!ltQ4!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fff42229c-c448-45de-a1ff-75ab84417447_616x300.png 848w, https://substackcdn.com/image/fetch/$s_!ltQ4!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fff42229c-c448-45de-a1ff-75ab84417447_616x300.png 1272w, https://substackcdn.com/image/fetch/$s_!ltQ4!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fff42229c-c448-45de-a1ff-75ab84417447_616x300.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!ltQ4!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fff42229c-c448-45de-a1ff-75ab84417447_616x300.png" width="616" height="300" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/ff42229c-c448-45de-a1ff-75ab84417447_616x300.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:300,&quot;width&quot;:616,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:63954,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://vutr.substack.com/i/158296262?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fff42229c-c448-45de-a1ff-75ab84417447_616x300.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!ltQ4!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fff42229c-c448-45de-a1ff-75ab84417447_616x300.png 424w, https://substackcdn.com/image/fetch/$s_!ltQ4!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fff42229c-c448-45de-a1ff-75ab84417447_616x300.png 848w, https://substackcdn.com/image/fetch/$s_!ltQ4!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fff42229c-c448-45de-a1ff-75ab84417447_616x300.png 1272w, https://substackcdn.com/image/fetch/$s_!ltQ4!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fff42229c-c448-45de-a1ff-75ab84417447_616x300.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 
0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Image created by the author.</figcaption></figure></div></li><li><p><strong>Cache invalidation:</strong> Because the physical data is stored in immutable files tracked by Iceberg metadata, every change to a dataframe is identified by a data commit, so the cache knows when its data needs an update.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!9v_w!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3b69f655-fc9f-46b7-9589-ba10d5850208_620x280.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" 
srcset="https://substackcdn.com/image/fetch/$s_!9v_w!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3b69f655-fc9f-46b7-9589-ba10d5850208_620x280.png 424w, https://substackcdn.com/image/fetch/$s_!9v_w!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3b69f655-fc9f-46b7-9589-ba10d5850208_620x280.png 848w, https://substackcdn.com/image/fetch/$s_!9v_w!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3b69f655-fc9f-46b7-9589-ba10d5850208_620x280.png 1272w, https://substackcdn.com/image/fetch/$s_!9v_w!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3b69f655-fc9f-46b7-9589-ba10d5850208_620x280.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!9v_w!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3b69f655-fc9f-46b7-9589-ba10d5850208_620x280.png" width="620" height="280" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/3b69f655-fc9f-46b7-9589-ba10d5850208_620x280.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:280,&quot;width&quot;:620,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:50708,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://vutr.substack.com/i/158296262?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3b69f655-fc9f-46b7-9589-ba10d5850208_620x280.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" 
srcset="https://substackcdn.com/image/fetch/$s_!9v_w!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3b69f655-fc9f-46b7-9589-ba10d5850208_620x280.png 424w, https://substackcdn.com/image/fetch/$s_!9v_w!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3b69f655-fc9f-46b7-9589-ba10d5850208_620x280.png 848w, https://substackcdn.com/image/fetch/$s_!9v_w!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3b69f655-fc9f-46b7-9589-ba10d5850208_620x280.png 1272w, https://substackcdn.com/image/fetch/$s_!9v_w!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3b69f655-fc9f-46b7-9589-ba10d5850208_620x280.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" 
stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Image created by the author.</figcaption></figure></div></li></ul><h3>The data exchange</h3><p>To make data exchange between functions efficient, Bauplan represents intermediate dataframes as Arrow tables. From the official documentation:</p><blockquote><p>The Arrow columnar format includes a language-agnostic in-memory data structure specification, metadata serialization, and a protocol for serialization and generic data transport.</p></blockquote><p>Unlike file formats such as Parquet or CSV, which specify how data is organized on disk, Arrow focuses on how data is organized in memory.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!dR-h!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F75b893c6-34ec-4b48-86a2-087ac813d9d4_536x262.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!dR-h!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F75b893c6-34ec-4b48-86a2-087ac813d9d4_536x262.png 424w, https://substackcdn.com/image/fetch/$s_!dR-h!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F75b893c6-34ec-4b48-86a2-087ac813d9d4_536x262.png 848w, 
https://substackcdn.com/image/fetch/$s_!dR-h!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F75b893c6-34ec-4b48-86a2-087ac813d9d4_536x262.png 1272w, https://substackcdn.com/image/fetch/$s_!dR-h!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F75b893c6-34ec-4b48-86a2-087ac813d9d4_536x262.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!dR-h!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F75b893c6-34ec-4b48-86a2-087ac813d9d4_536x262.png" width="536" height="262" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/75b893c6-34ec-4b48-86a2-087ac813d9d4_536x262.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:262,&quot;width&quot;:536,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:42550,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://vutr.substack.com/i/158296262?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F75b893c6-34ec-4b48-86a2-087ac813d9d4_536x262.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!dR-h!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F75b893c6-34ec-4b48-86a2-087ac813d9d4_536x262.png 424w, https://substackcdn.com/image/fetch/$s_!dR-h!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F75b893c6-34ec-4b48-86a2-087ac813d9d4_536x262.png 848w, 
https://substackcdn.com/image/fetch/$s_!dR-h!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F75b893c6-34ec-4b48-86a2-087ac813d9d4_536x262.png 1272w, https://substackcdn.com/image/fetch/$s_!dR-h!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F75b893c6-34ec-4b48-86a2-087ac813d9d4_536x262.png 1456w" sizes="100vw" loading="lazy"></picture></div></a><figcaption class="image-caption">Image created by the author.</figcaption></figure></div><p>Arrow stores the values of each column contiguously in memory. 
This design is highly advantageous for data analytics workloads, which focus on a subset of columns when dealing with large datasets.</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!Px5H!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F773dea41-6a3e-4000-811e-bcfe8fceeb24_460x216.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!Px5H!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F773dea41-6a3e-4000-811e-bcfe8fceeb24_460x216.png 424w, https://substackcdn.com/image/fetch/$s_!Px5H!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F773dea41-6a3e-4000-811e-bcfe8fceeb24_460x216.png 848w, https://substackcdn.com/image/fetch/$s_!Px5H!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F773dea41-6a3e-4000-811e-bcfe8fceeb24_460x216.png 1272w, https://substackcdn.com/image/fetch/$s_!Px5H!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F773dea41-6a3e-4000-811e-bcfe8fceeb24_460x216.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!Px5H!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F773dea41-6a3e-4000-811e-bcfe8fceeb24_460x216.png" width="460" height="216" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/773dea41-6a3e-4000-811e-bcfe8fceeb24_460x216.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:216,&quot;width&quot;:460,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!Px5H!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F773dea41-6a3e-4000-811e-bcfe8fceeb24_460x216.png 424w, https://substackcdn.com/image/fetch/$s_!Px5H!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F773dea41-6a3e-4000-811e-bcfe8fceeb24_460x216.png 848w, https://substackcdn.com/image/fetch/$s_!Px5H!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F773dea41-6a3e-4000-811e-bcfe8fceeb24_460x216.png 1272w, https://substackcdn.com/image/fetch/$s_!Px5H!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F773dea41-6a3e-4000-811e-bcfe8fceeb24_460x216.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a><figcaption class="image-caption">Image created by the author.</figcaption></figure></div><p>Before Arrow, each system used its internal memory format. When two systems communicate, each converts its data into a standard format before transferring it, incurring serialization and deserialization costs. Apache Arrow aims to provide a highly efficient format for processing within a single system. 
As more systems adopt it, they can share data at a very low cost, potentially even through shared memory at zero cost.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!Mv-A!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0eceb864-8e2c-427f-8b8e-f13df4380060_1240x528.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!Mv-A!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0eceb864-8e2c-427f-8b8e-f13df4380060_1240x528.png 424w, https://substackcdn.com/image/fetch/$s_!Mv-A!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0eceb864-8e2c-427f-8b8e-f13df4380060_1240x528.png 848w, https://substackcdn.com/image/fetch/$s_!Mv-A!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0eceb864-8e2c-427f-8b8e-f13df4380060_1240x528.png 1272w, https://substackcdn.com/image/fetch/$s_!Mv-A!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0eceb864-8e2c-427f-8b8e-f13df4380060_1240x528.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!Mv-A!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0eceb864-8e2c-427f-8b8e-f13df4380060_1240x528.png" width="1240" height="528" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/0eceb864-8e2c-427f-8b8e-f13df4380060_1240x528.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:528,&quot;width&quot;:1240,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:95715,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://vutr.substack.com/i/158296262?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0eceb864-8e2c-427f-8b8e-f13df4380060_1240x528.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!Mv-A!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0eceb864-8e2c-427f-8b8e-f13df4380060_1240x528.png 424w, https://substackcdn.com/image/fetch/$s_!Mv-A!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0eceb864-8e2c-427f-8b8e-f13df4380060_1240x528.png 848w, https://substackcdn.com/image/fetch/$s_!Mv-A!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0eceb864-8e2c-427f-8b8e-f13df4380060_1240x528.png 1272w, https://substackcdn.com/image/fetch/$s_!Mv-A!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0eceb864-8e2c-427f-8b8e-f13df4380060_1240x528.png 1456w" sizes="100vw" loading="lazy"></picture></div></a><figcaption class="image-caption">Image created by the author.</figcaption></figure></div><p>When Bauplan executes the pipeline, it will pick the sharing mechanism: memory or local disk (functions in the same worker) or Arrow Flight (across workers). 
Because other solutions only support S3-backed data exchange, moving data between functions in Bauplan can be hundreds of times faster, thanks to Arrow.</p><p>With Arrow, functions can read tables from shared memory, memory-map them, or stream them over gRPC (with Flight), which gives each function more flexibility when its inputs arrive through different data transfer mechanisms.</p><blockquote><p><em>A <a href="https://en.wikipedia.org/wiki/Memory-mapped_file">memory-mapped file</a> is a segment of virtual memory that has been assigned a direct byte-for-byte correlation with some portion of a file or file-like resource. The benefit of memory mapping a file is increasing I/O performance, especially when used on large files.</em></p></blockquote><p>Moreover, if a downstream function runs on the same worker as the upstream function, it can read the Arrow intermediate data on the worker without copying it. Given 10 GB of intermediate data and four functions that need to read it, the worker uses only 10 GB of physical RAM instead of 4 x 10 = 40 GB.</p><div><hr></div><h2>The storage</h2><p>Bauplan does not stop there; beyond the FaaS data pipeline, it also aims to provide a complete lakehouse solution by offering a storage layer built on Iceberg and Project Nessie.</p><p>If you have some Parquet files in object storage, Bauplan can help you turn them into an Iceberg table in a single line of code. The data stays in your VPC; it doesn&#8217;t need to move anywhere.</p><p>Netflix created Apache Iceberg to achieve better table correctness and faster query planning than Hive. 
An Apache Iceberg table has three layers organized hierarchically:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!qXMo!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0da0451b-2a2b-4f26-8aad-63ba94220266_474x620.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!qXMo!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0da0451b-2a2b-4f26-8aad-63ba94220266_474x620.png 424w, https://substackcdn.com/image/fetch/$s_!qXMo!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0da0451b-2a2b-4f26-8aad-63ba94220266_474x620.png 848w, https://substackcdn.com/image/fetch/$s_!qXMo!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0da0451b-2a2b-4f26-8aad-63ba94220266_474x620.png 1272w, https://substackcdn.com/image/fetch/$s_!qXMo!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0da0451b-2a2b-4f26-8aad-63ba94220266_474x620.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!qXMo!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0da0451b-2a2b-4f26-8aad-63ba94220266_474x620.png" width="474" height="620" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/0da0451b-2a2b-4f26-8aad-63ba94220266_474x620.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:620,&quot;width&quot;:474,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:101742,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://vutr.substack.com/i/158296262?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0da0451b-2a2b-4f26-8aad-63ba94220266_474x620.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!qXMo!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0da0451b-2a2b-4f26-8aad-63ba94220266_474x620.png 424w, https://substackcdn.com/image/fetch/$s_!qXMo!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0da0451b-2a2b-4f26-8aad-63ba94220266_474x620.png 848w, https://substackcdn.com/image/fetch/$s_!qXMo!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0da0451b-2a2b-4f26-8aad-63ba94220266_474x620.png 1272w, https://substackcdn.com/image/fetch/$s_!qXMo!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0da0451b-2a2b-4f26-8aad-63ba94220266_474x620.png 1456w" sizes="100vw" loading="lazy"></picture></div></a><figcaption class="image-caption">Image created by the author.</figcaption></figure></div><ul><li><p>The data layer stores the table&#8217;s actual data, including the data files and delete files.</p></li><li><p>Manifest files track the data files in the data layer.</p></li><li><p>A manifest list captures the snapshot of an Iceberg table at a specific moment.</p></li><li><p>Metadata files contain information about an Iceberg table at a specific time, such as the schema or the latest snapshot.</p></li><li><p>The catalog is where every Iceberg data operation begins. 
It stores the pointer to the table&#8217;s current metadata file, which tells the engine where to go first.</p></li></ul><p>Like other table formats, Iceberg's ultimate goal is to bring data warehouse capabilities to the data lake; one of the most important is ACID guarantees.</p><p>On its own, Iceberg only ensures atomic transactions at the table level. To bring the software development experience to the lakehouse, Bauplan uses Project Nessie as the Iceberg table catalog. Nessie is an open-source versioned metadata catalog that enables cross-table transactions for Iceberg. Users can update multiple tables together and guarantee that all changes occur atomically: an all-or-nothing commit across tables.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!Swfv!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F40b50529-0184-419f-b5f4-1fc167498e2e_884x328.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!Swfv!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F40b50529-0184-419f-b5f4-1fc167498e2e_884x328.png 424w, https://substackcdn.com/image/fetch/$s_!Swfv!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F40b50529-0184-419f-b5f4-1fc167498e2e_884x328.png 848w, https://substackcdn.com/image/fetch/$s_!Swfv!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F40b50529-0184-419f-b5f4-1fc167498e2e_884x328.png 1272w, 
https://substackcdn.com/image/fetch/$s_!Swfv!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F40b50529-0184-419f-b5f4-1fc167498e2e_884x328.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!Swfv!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F40b50529-0184-419f-b5f4-1fc167498e2e_884x328.png" width="884" height="328" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/40b50529-0184-419f-b5f4-1fc167498e2e_884x328.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:328,&quot;width&quot;:884,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:88829,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://vutr.substack.com/i/158296262?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F40b50529-0184-419f-b5f4-1fc167498e2e_884x328.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!Swfv!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F40b50529-0184-419f-b5f4-1fc167498e2e_884x328.png 424w, https://substackcdn.com/image/fetch/$s_!Swfv!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F40b50529-0184-419f-b5f4-1fc167498e2e_884x328.png 848w, https://substackcdn.com/image/fetch/$s_!Swfv!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F40b50529-0184-419f-b5f4-1fc167498e2e_884x328.png 1272w, 
https://substackcdn.com/image/fetch/$s_!Swfv!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F40b50529-0184-419f-b5f4-1fc167498e2e_884x328.png 1456w" sizes="100vw" loading="lazy"></picture></div></a><figcaption class="image-caption">Image created by the author.</figcaption></figure></div><p>Bauplan makes it easier for users to seamlessly work with Nessie and Iceberg tables by providing CLI commands and Python SDKs. 
Working with them feels much like working with a Git repository.</p><p>In the next section, we'll run some code to explore the Bauplan features mentioned above.</p><div class="pullquote"><p>This post was written in collaboration with the <a href="https://www.bauplanlabs.com/">Bauplan team</a>. The final wording and opinions are mine.</p></div><h2>Run some code</h2><p>We will run some Python code and CLI commands; I prepared a <a href="https://github.com/vutrinh274/bauplan_example">Git repo</a> so you can follow along. Make sure you clone the repo locally and enter the <a href="https://github.com/vutrinh274/bauplan_example">bauplan_example</a> folder.</p><p>First, we need to set up a Python virtual environment with the <a href="https://github.com/vutrinh274/bauplan_example/blob/master/requirements.txt">requirements.txt</a> file from the repo. It installs the <code>bauplan</code>, <code>streamlit</code>, and <code>duckdb</code> packages.</p><p>Next, we need a Bauplan API key, which gives you access to the Bauplan sandbox environment. You can contact Bauplan <a href="https://www.bauplanlabs.com/#join">here</a> for the key.</p><pre><code><code>bauplan config set api_key "your_bauplan_key"</code></code></pre><p>Then, we will run some bash scripts to set things up; let&#8217;s make those scripts executable:</p><pre><code><code>chmod -R +x scripts/</code></code></pre><p>Bauplan is designed to operate exclusively in the cloud to ensure a fully auditable and secure data development cycle. It requires us to store data in object storage so that it can be imported into Iceberg tables.</p><p>Currently, Bauplan only supports S3 as the data source and Parquet and CSV as file formats. We will run a script that creates an S3 bucket, uploads some CSV files, checks out a branch, creates a namespace, and then imports these files into Iceberg tables in the Bauplan sandbox. 
But first, make sure you configure your AWS CLI:</p><pre><code><code>aws configure # Enter your AWS access key, secret key, and default region</code></code></pre><p>Then, run <a href="https://github.com/vutrinh274/bauplan_example/blob/master/scripts/setup.sh">setup.sh</a> with the name of the S3 bucket you want to create and the Bauplan branch. The branch name must follow the pattern <code>&lt;your-user-name&gt;.&lt;something&gt;</code>:</p><pre><code><code>./scripts/setup.sh &lt;bucket name&gt; &lt;bauplan branch&gt;</code></code></pre><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!q6l7!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F882583fa-5ec7-44f3-8e24-ecf4ce66b5f5_1008x692.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!q6l7!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F882583fa-5ec7-44f3-8e24-ecf4ce66b5f5_1008x692.png 424w, https://substackcdn.com/image/fetch/$s_!q6l7!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F882583fa-5ec7-44f3-8e24-ecf4ce66b5f5_1008x692.png 848w, https://substackcdn.com/image/fetch/$s_!q6l7!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F882583fa-5ec7-44f3-8e24-ecf4ce66b5f5_1008x692.png 1272w, https://substackcdn.com/image/fetch/$s_!q6l7!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F882583fa-5ec7-44f3-8e24-ecf4ce66b5f5_1008x692.png 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!q6l7!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F882583fa-5ec7-44f3-8e24-ecf4ce66b5f5_1008x692.png" width="1008" height="692" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/882583fa-5ec7-44f3-8e24-ecf4ce66b5f5_1008x692.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:692,&quot;width&quot;:1008,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:125935,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://vutr.substack.com/i/158296262?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F882583fa-5ec7-44f3-8e24-ecf4ce66b5f5_1008x692.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!q6l7!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F882583fa-5ec7-44f3-8e24-ecf4ce66b5f5_1008x692.png 424w, https://substackcdn.com/image/fetch/$s_!q6l7!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F882583fa-5ec7-44f3-8e24-ecf4ce66b5f5_1008x692.png 848w, https://substackcdn.com/image/fetch/$s_!q6l7!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F882583fa-5ec7-44f3-8e24-ecf4ce66b5f5_1008x692.png 1272w, https://substackcdn.com/image/fetch/$s_!q6l7!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F882583fa-5ec7-44f3-8e24-ecf4ce66b5f5_1008x692.png 1456w" 
sizes="100vw" loading="lazy"></picture></div></a><figcaption class="image-caption">Image created by the author.</figcaption></figure></div><p>After a short wait, you will have five tables in your catalog. The script also creates a namespace called <code>adventure</code>. A namespace in Bauplan is a logical container that helps organize tables within a data catalog. From now on, we will work on the branch you provided and in the <code>adventure</code> namespace. Let&#8217;s list the input tables:</p><pre><code><code>bauplan table --namespace adventure</code></code></pre><p>Before we move on, let&#8217;s understand the input data. 
There are five input tables from the <a href="https://github.com/vutrinh274/bauplan_example/tree/master/adventure_works_data">AdventureWorks sample dataset</a>: product, product_category, product_subcategory, sale, and territories.</p><blockquote><p><em><a href="https://dataedo.com/samples/html/AdventureWorks/doc/AdventureWorks_2/home.html">AdventureWorks</a> database supports standard online transaction processing scenarios for a fictitious bicycle manufacturer - <strong>Adventure Works Cycles</strong>.</em></p></blockquote><p>The tables relate to each other as follows:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!f1Er!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa3563684-51d2-4652-b59f-bf2c48013a86_1822x830.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!f1Er!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa3563684-51d2-4652-b59f-bf2c48013a86_1822x830.png 424w, https://substackcdn.com/image/fetch/$s_!f1Er!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa3563684-51d2-4652-b59f-bf2c48013a86_1822x830.png 848w, https://substackcdn.com/image/fetch/$s_!f1Er!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa3563684-51d2-4652-b59f-bf2c48013a86_1822x830.png 1272w, https://substackcdn.com/image/fetch/$s_!f1Er!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa3563684-51d2-4652-b59f-bf2c48013a86_1822x830.png 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!f1Er!,w_2400,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa3563684-51d2-4652-b59f-bf2c48013a86_1822x830.png" width="1200" height="546.4285714285714" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/a3563684-51d2-4652-b59f-bf2c48013a86_1822x830.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:false,&quot;imageSize&quot;:&quot;large&quot;,&quot;height&quot;:663,&quot;width&quot;:1456,&quot;resizeWidth&quot;:1200,&quot;bytes&quot;:275363,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://vutr.substack.com/i/158296262?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa3563684-51d2-4652-b59f-bf2c48013a86_1822x830.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-large" alt="" srcset="https://substackcdn.com/image/fetch/$s_!f1Er!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa3563684-51d2-4652-b59f-bf2c48013a86_1822x830.png 424w, https://substackcdn.com/image/fetch/$s_!f1Er!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa3563684-51d2-4652-b59f-bf2c48013a86_1822x830.png 848w, https://substackcdn.com/image/fetch/$s_!f1Er!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa3563684-51d2-4652-b59f-bf2c48013a86_1822x830.png 1272w, 
https://substackcdn.com/image/fetch/$s_!f1Er!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa3563684-51d2-4652-b59f-bf2c48013a86_1822x830.png 1456w" sizes="100vw" loading="lazy"></picture></div></a><figcaption class="image-caption">Image created by the author.</figcaption></figure></div><p>In this project, we will write a Bauplan pipeline to transform these input tables into a dimensional data model with a <code>fact_sale</code>, <code>dim_product</code>, and <code>dim_country</code>:</p><div class="captioned-image-container"><figure><a class="image-link image2 
is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!q0I0!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F21590eee-d4c7-4d17-a949-bf418182603c_1706x786.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!q0I0!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F21590eee-d4c7-4d17-a949-bf418182603c_1706x786.png 424w, https://substackcdn.com/image/fetch/$s_!q0I0!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F21590eee-d4c7-4d17-a949-bf418182603c_1706x786.png 848w, https://substackcdn.com/image/fetch/$s_!q0I0!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F21590eee-d4c7-4d17-a949-bf418182603c_1706x786.png 1272w, https://substackcdn.com/image/fetch/$s_!q0I0!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F21590eee-d4c7-4d17-a949-bf418182603c_1706x786.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!q0I0!,w_2400,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F21590eee-d4c7-4d17-a949-bf418182603c_1706x786.png" width="1200" height="553.021978021978" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/21590eee-d4c7-4d17-a949-bf418182603c_1706x786.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:false,&quot;imageSize&quot;:&quot;large&quot;,&quot;height&quot;:671,&quot;width&quot;:1456,&quot;resizeWidth&quot;:1200,&quot;bytes&quot;:212869,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://vutr.substack.com/i/158296262?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F21590eee-d4c7-4d17-a949-bf418182603c_1706x786.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-large" alt="" srcset="https://substackcdn.com/image/fetch/$s_!q0I0!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F21590eee-d4c7-4d17-a949-bf418182603c_1706x786.png 424w, https://substackcdn.com/image/fetch/$s_!q0I0!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F21590eee-d4c7-4d17-a949-bf418182603c_1706x786.png 848w, https://substackcdn.com/image/fetch/$s_!q0I0!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F21590eee-d4c7-4d17-a949-bf418182603c_1706x786.png 1272w, https://substackcdn.com/image/fetch/$s_!q0I0!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F21590eee-d4c7-4d17-a949-bf418182603c_1706x786.png 1456w" sizes="100vw" loading="lazy"></picture></div></a><figcaption class="image-caption">Image created by the author.</figcaption></figure></div><p>We implement the transformation pipeline using Bauplan models. A model function takes tabular data as input and produces tabular data. 
We will write a DAG to transform the data using the duckdb engine:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!N1w_!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe6b713fd-4d67-4f21-b2f6-280c4268b53a_672x506.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!N1w_!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe6b713fd-4d67-4f21-b2f6-280c4268b53a_672x506.png 424w, https://substackcdn.com/image/fetch/$s_!N1w_!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe6b713fd-4d67-4f21-b2f6-280c4268b53a_672x506.png 848w, https://substackcdn.com/image/fetch/$s_!N1w_!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe6b713fd-4d67-4f21-b2f6-280c4268b53a_672x506.png 1272w, https://substackcdn.com/image/fetch/$s_!N1w_!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe6b713fd-4d67-4f21-b2f6-280c4268b53a_672x506.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!N1w_!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe6b713fd-4d67-4f21-b2f6-280c4268b53a_672x506.png" width="672" height="506" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/e6b713fd-4d67-4f21-b2f6-280c4268b53a_672x506.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:506,&quot;width&quot;:672,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:83815,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://vutr.substack.com/i/158296262?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe6b713fd-4d67-4f21-b2f6-280c4268b53a_672x506.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!N1w_!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe6b713fd-4d67-4f21-b2f6-280c4268b53a_672x506.png 424w, https://substackcdn.com/image/fetch/$s_!N1w_!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe6b713fd-4d67-4f21-b2f6-280c4268b53a_672x506.png 848w, https://substackcdn.com/image/fetch/$s_!N1w_!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe6b713fd-4d67-4f21-b2f6-280c4268b53a_672x506.png 1272w, https://substackcdn.com/image/fetch/$s_!N1w_!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe6b713fd-4d67-4f21-b2f6-280c4268b53a_672x506.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 
0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Image created by the author.</figcaption></figure></div><p>We define these models in the <a href="https://github.com/vutrinh274/bauplan_example/blob/master/pipeline/models.py">models.py</a> file. The key point is that a Bauplan model is aware of both its data and its runtime.</p><p>For runtime awareness, you can use Bauplan&#8217;s decorators to specify the runtime for each model (the Python version and packages) and how to materialize the output. Bauplan also allows explicit column selection and filter pushdown.</p><p>For data awareness, each model must have inputs, which can be tables in the catalog or other models. We declare and use them like Python function parameters.</p><p>Here is the code of the <code>dim_product</code> model. 
As you can see, Bauplan knows that it must run this model with <code>Python 3.11</code> and <code>duckdb 1.0.0</code>, and that the model takes <code>product</code>, <code>product_category</code>, and <code>product_subcategory</code> as input data.</p><p>You can check the code for all the models <a href="https://github.com/vutrinh274/bauplan_example/blob/master/pipeline/models.py">here</a>.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!oABQ!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F906352e5-0569-4cb1-836a-3cfd58b11396_1362x994.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!oABQ!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F906352e5-0569-4cb1-836a-3cfd58b11396_1362x994.png 424w, https://substackcdn.com/image/fetch/$s_!oABQ!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F906352e5-0569-4cb1-836a-3cfd58b11396_1362x994.png 848w, https://substackcdn.com/image/fetch/$s_!oABQ!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F906352e5-0569-4cb1-836a-3cfd58b11396_1362x994.png 1272w, https://substackcdn.com/image/fetch/$s_!oABQ!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F906352e5-0569-4cb1-836a-3cfd58b11396_1362x994.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!oABQ!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F906352e5-0569-4cb1-836a-3cfd58b11396_1362x994.png" width="1362" 
height="994" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/906352e5-0569-4cb1-836a-3cfd58b11396_1362x994.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:994,&quot;width&quot;:1362,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:237600,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://vutr.substack.com/i/158296262?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F906352e5-0569-4cb1-836a-3cfd58b11396_1362x994.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!oABQ!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F906352e5-0569-4cb1-836a-3cfd58b11396_1362x994.png 424w, https://substackcdn.com/image/fetch/$s_!oABQ!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F906352e5-0569-4cb1-836a-3cfd58b11396_1362x994.png 848w, https://substackcdn.com/image/fetch/$s_!oABQ!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F906352e5-0569-4cb1-836a-3cfd58b11396_1362x994.png 1272w, https://substackcdn.com/image/fetch/$s_!oABQ!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F906352e5-0569-4cb1-836a-3cfd58b11396_1362x994.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" 
height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Screenshot from my <a href="https://github.com/vutrinh274/bauplan_example">code</a>.</figcaption></figure></div><p>After having the models, we run the pipeline:</p><pre><code>bauplan run --project-dir pipeline --namespace adventure </code></pre><p>We submit the code to Bauplan. It will plan and execute it. 
Any errors are displayed in real time in our terminal, thanks to the bidirectional gRPC connection between us and the Bauplan workers.</p><div class="native-video-embed" data-component-name="VideoPlaceholder" data-attrs="{&quot;mediaUploadId&quot;:&quot;50736b7f-8ba2-4c03-8569-fc78f6cff120&quot;,&quot;duration&quot;:null}"></div><p>To run the pipeline, we need a <a href="https://github.com/vutrinh274/bauplan_example/blob/master/pipeline/bauplan_project.yml">bauplan_project.yml</a> file containing the project&#8217;s unique ID and name. I placed both the bauplan_project.yml and models.py files in the pipeline folder.</p><p>After the pipeline finishes, you can check the output tables by listing the namespace adventure again:</p><pre><code>bauplan table --namespace adventure</code></pre><p>Finally, to have more fun with the project, I created a small Streamlit app with a world-class SQL editor (:D) to query the output table:</p><pre><code>streamlit run streamlit/app.py </code></pre><p>Here is a quick demo video to showcase my world-class SQL editor:</p><div class="native-video-embed" data-component-name="VideoPlaceholder" data-attrs="{&quot;mediaUploadId&quot;:&quot;bf2ac0b7-59ee-480a-95f5-86c07e49eae2&quot;,&quot;duration&quot;:null}"></div><p>To clean up, you can run the <a href="https://github.com/vutrinh274/bauplan_example/blob/master/scripts/clean_up.sh">clean_up.sh</a> script, which cleans up the S3 bucket and the Bauplan tables automatically:</p><pre><code>./scripts/clean_up.sh &lt;bucket name&gt; &lt;bauplan branch&gt;</code></pre><div><hr></div><h2>My thoughts</h2><p>In a world where data is the new gold, every company wants the ability to capture, store, process, and serve data to drive business decisions. However, not every company has a dedicated data team. In many cases, you might be the team's first and only data person.</p><p>At the beginning of this article, I had a wish, and it came true. 
Bauplan handles tasks that would typically require an entire infrastructure team. Its goal is to provide a seamless, developer-friendly way to work with large-scale data directly in Python, eliminating infrastructure bottlenecks.</p><p>When running some code with Bauplan, I was truly impressed by how seamlessly it imports data files from object storage into Iceberg tables. Setting up an Iceberg catalog, configuring the Iceberg writer, and managing the physical layout are usually painful, but Bauplan simplifies them significantly.</p><p>Defining data transformations is also a pleasant experience, thanks to Bauplan&#8217;s concept of a &#8220;model.&#8221; I can run transformations with Python 3.9 or 3.10 simply by changing a few lines in a decorator. The model&#8217;s data-awareness makes it incredibly intuitive to write transformation logic: I specify a model&#8217;s inputs just as I would define function parameters in Python.</p><p>Bauplan is truly innovative. It&#8217;s well worth trying, especially if you&#8217;re a data engineer, data analyst, or data scientist, or if you simply love working with data.</p><p>Personally, I hope Bauplan will expand to support data processing runtimes like Spark or Trino. A serverless Spark or Trino cluster would be a game-changer. A robust SQL editor for querying data in the catalog would also be a valuable addition.</p><div><hr></div><h2>Outro</h2><p>Thank you for reading this far.</p><p>In this article, we explored the challenges of implementing a data pipeline with available FaaS solutions, how Bauplan promises to solve them, Bauplan&#8217;s design goals and architecture, and how Bauplan offers a complete zero-infrastructure lakehouse with the Iceberg + Project Nessie storage layer. Finally, we wrote some code to build a Bauplan pipeline.</p><p>Now, it&#8217;s time to say goodbye. 
See you in my following articles.</p><div><hr></div><h2>Reference</h2><p><em>[1] <a href="https://docs.bauplanlabs.com/en/latest/">Bauplan Documents</a></em></p><p><em>[2] Jacopo Tagliabue, Tyler Caraza-Harter, Ciro Greco, <a href="https://arxiv.org/pdf/2410.17465">Bauplan: zero-copy, scale-up FaaS for data pipelines</a> (2024)</em></p>]]></content:encoded></item><item><title><![CDATA[How did Airbnb build their semantic layer?]]></title><description><![CDATA[Minerva, the Airbnb metric platform]]></description><link>https://vutr.substack.com/p/how-did-airbnb-build-their-semantic</link><guid isPermaLink="false">https://vutr.substack.com/p/how-did-airbnb-build-their-semantic</guid><dc:creator><![CDATA[Vu Trinh]]></dc:creator><pubDate>Thu, 13 Mar 2025 03:15:41 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!AGhi!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5a5d0850-6a8e-4267-bb9b-bfedf491721e_2000x1429.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<blockquote><p>To celebrate Lunar New Year (the true New Year holiday in Vietnam), I&#8217;m offering <em><strong>50% off the annual subscription</strong></em>. 
The offer ends soon; grab it now to get full access to nearly 200 high-quality data engineering articles.</p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://vutr.substack.com/subscribe&quot;,&quot;text&quot;:&quot;50% off the annual subscription&quot;,&quot;action&quot;:null,&quot;class&quot;:&quot;button-wrapper&quot;}" data-component-name="ButtonCreateButton"><a class="button primary button-wrapper" href="https://vutr.substack.com/subscribe"><span>50% off the annual subscription</span></a></p></blockquote><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!AGhi!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5a5d0850-6a8e-4267-bb9b-bfedf491721e_2000x1429.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!AGhi!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5a5d0850-6a8e-4267-bb9b-bfedf491721e_2000x1429.png 424w, https://substackcdn.com/image/fetch/$s_!AGhi!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5a5d0850-6a8e-4267-bb9b-bfedf491721e_2000x1429.png 848w, https://substackcdn.com/image/fetch/$s_!AGhi!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5a5d0850-6a8e-4267-bb9b-bfedf491721e_2000x1429.png 1272w, https://substackcdn.com/image/fetch/$s_!AGhi!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5a5d0850-6a8e-4267-bb9b-bfedf491721e_2000x1429.png 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!AGhi!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5a5d0850-6a8e-4267-bb9b-bfedf491721e_2000x1429.png" width="1456" height="1040" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/5a5d0850-6a8e-4267-bb9b-bfedf491721e_2000x1429.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1040,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:428843,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://vutr.substack.com/i/158825951?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5a5d0850-6a8e-4267-bb9b-bfedf491721e_2000x1429.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!AGhi!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5a5d0850-6a8e-4267-bb9b-bfedf491721e_2000x1429.png 424w, https://substackcdn.com/image/fetch/$s_!AGhi!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5a5d0850-6a8e-4267-bb9b-bfedf491721e_2000x1429.png 848w, https://substackcdn.com/image/fetch/$s_!AGhi!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5a5d0850-6a8e-4267-bb9b-bfedf491721e_2000x1429.png 1272w, https://substackcdn.com/image/fetch/$s_!AGhi!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5a5d0850-6a8e-4267-bb9b-bfedf491721e_2000x1429.png 1456w" 
sizes="100vw" fetchpriority="high"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Image created by the author.</figcaption></figure></div><div><hr></div><h2>Intro</h2><p>Today, we will explore how Airbnb builds and serves its semantic layer internally and what we can learn from it. 
More precisely, Airbnb did not just build a layer that <a href="https://www.ibm.com/think/topics/semantic-layer">&#8220;</a><em><a href="https://www.ibm.com/think/topics/semantic-layer">simplifies interactions between complex data storage systems and business users.</a></em><a href="https://www.ibm.com/think/topics/semantic-layer">&#8221;</a> They created a complete platform.</p><div><hr></div><h2>Motivation</h2><p>In 2010, Airbnb had only one data analyst. His laptop was Airbnb&#8217;s data warehouse. He often ran queries directly on production databases, and the heavy queries took Airbnb.com down for some time.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!pXa7!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa73cad80-f143-4b45-a033-aa7999a9ed97_532x320.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!pXa7!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa73cad80-f143-4b45-a033-aa7999a9ed97_532x320.png 424w, https://substackcdn.com/image/fetch/$s_!pXa7!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa73cad80-f143-4b45-a033-aa7999a9ed97_532x320.png 848w, https://substackcdn.com/image/fetch/$s_!pXa7!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa73cad80-f143-4b45-a033-aa7999a9ed97_532x320.png 1272w, https://substackcdn.com/image/fetch/$s_!pXa7!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa73cad80-f143-4b45-a033-aa7999a9ed97_532x320.png 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!pXa7!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa73cad80-f143-4b45-a033-aa7999a9ed97_532x320.png" width="532" height="320" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/a73cad80-f143-4b45-a033-aa7999a9ed97_532x320.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:320,&quot;width&quot;:532,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:42336,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://vutr.substack.com/i/158825951?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa73cad80-f143-4b45-a033-aa7999a9ed97_532x320.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!pXa7!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa73cad80-f143-4b45-a033-aa7999a9ed97_532x320.png 424w, https://substackcdn.com/image/fetch/$s_!pXa7!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa73cad80-f143-4b45-a033-aa7999a9ed97_532x320.png 848w, https://substackcdn.com/image/fetch/$s_!pXa7!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa73cad80-f143-4b45-a033-aa7999a9ed97_532x320.png 1272w, https://substackcdn.com/image/fetch/$s_!pXa7!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa73cad80-f143-4b45-a033-aa7999a9ed97_532x320.png 1456w" 
sizes="100vw"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Image created by the author.</figcaption></figure></div><p>In the early 2010s, they hired more data scientists, and data kept growing. 
They upgraded their data infrastructure and built their own data orchestration tool, Airflow, which they later open-sourced.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!TrTB!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2fb011b4-fb23-4c26-baa5-9a599c4f59d7_804x568.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!TrTB!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2fb011b4-fb23-4c26-baa5-9a599c4f59d7_804x568.png 424w, https://substackcdn.com/image/fetch/$s_!TrTB!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2fb011b4-fb23-4c26-baa5-9a599c4f59d7_804x568.png 848w, https://substackcdn.com/image/fetch/$s_!TrTB!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2fb011b4-fb23-4c26-baa5-9a599c4f59d7_804x568.png 1272w, https://substackcdn.com/image/fetch/$s_!TrTB!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2fb011b4-fb23-4c26-baa5-9a599c4f59d7_804x568.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!TrTB!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2fb011b4-fb23-4c26-baa5-9a599c4f59d7_804x568.png" width="804" height="568" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/2fb011b4-fb23-4c26-baa5-9a599c4f59d7_804x568.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:568,&quot;width&quot;:804,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:128771,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://vutr.substack.com/i/158825951?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2fb011b4-fb23-4c26-baa5-9a599c4f59d7_804x568.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!TrTB!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2fb011b4-fb23-4c26-baa5-9a599c4f59d7_804x568.png 424w, https://substackcdn.com/image/fetch/$s_!TrTB!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2fb011b4-fb23-4c26-baa5-9a599c4f59d7_804x568.png 848w, https://substackcdn.com/image/fetch/$s_!TrTB!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2fb011b4-fb23-4c26-baa5-9a599c4f59d7_804x568.png 1272w, https://substackcdn.com/image/fetch/$s_!TrTB!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2fb011b4-fb23-4c26-baa5-9a599c4f59d7_804x568.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" 
viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Their upgraded architecture. Image created by the author</figcaption></figure></div><p>Their top priority was to build a set of tables called <strong>core_data. 
</strong>These tables laid the foundation for many data needs at Airbnb:</p><ul><li><p>Airbnb&#8217;s experimentation platform, which streamlines A/B testing.</p></li><li><p><a href="https://medium.com/airbnb-engineering/democratizing-data-at-airbnb-852d76c51770">Dataportal</a>, Airbnb&#8217;s internal data catalog, which organizes and documents data assets.</p></li><li><p>Interactive data exploration with Apache Superset.</p></li><li><p><a href="https://medium.com/airbnb-engineering/how-airbnb-democratizes-data-science-with-data-university-3eccc71e073a">Data University</a>, a program that teaches non-data scientists valuable skills to democratize data analysis at Airbnb.</p></li></ul><p>However, the growth came with challenges:</p><ul><li><p>More users wanted to consume core_data, so they created many tables on top of it. There was no way to tell whether a table that met a given requirement already existed.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!ro2Z!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5ade169a-b90f-4c35-8564-a3eff150a2ae_532x390.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!ro2Z!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5ade169a-b90f-4c35-8564-a3eff150a2ae_532x390.png 424w, https://substackcdn.com/image/fetch/$s_!ro2Z!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5ade169a-b90f-4c35-8564-a3eff150a2ae_532x390.png 848w, 
https://substackcdn.com/image/fetch/$s_!ro2Z!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5ade169a-b90f-4c35-8564-a3eff150a2ae_532x390.png 1272w, https://substackcdn.com/image/fetch/$s_!ro2Z!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5ade169a-b90f-4c35-8564-a3eff150a2ae_532x390.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!ro2Z!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5ade169a-b90f-4c35-8564-a3eff150a2ae_532x390.png" width="532" height="390" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/5ade169a-b90f-4c35-8564-a3eff150a2ae_532x390.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:390,&quot;width&quot;:532,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:40253,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://vutr.substack.com/i/158825951?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5ade169a-b90f-4c35-8564-a3eff150a2ae_532x390.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!ro2Z!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5ade169a-b90f-4c35-8564-a3eff150a2ae_532x390.png 424w, https://substackcdn.com/image/fetch/$s_!ro2Z!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5ade169a-b90f-4c35-8564-a3eff150a2ae_532x390.png 848w, 
https://substackcdn.com/image/fetch/$s_!ro2Z!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5ade169a-b90f-4c35-8564-a3eff150a2ae_532x390.png 1272w, https://substackcdn.com/image/fetch/$s_!ro2Z!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5ade169a-b90f-4c35-8564-a3eff150a2ae_532x390.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Image created by the author.</figcaption></figure></div></li><li><p>Because of the complexity of the growing warehouse, 
Airbnb found it challenging to track data. Data users could spend many hours debugging the data discrepancies.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!hIKQ!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8a827e55-201d-488d-b23c-f0dfde28c9bc_1002x322.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!hIKQ!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8a827e55-201d-488d-b23c-f0dfde28c9bc_1002x322.png 424w, https://substackcdn.com/image/fetch/$s_!hIKQ!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8a827e55-201d-488d-b23c-f0dfde28c9bc_1002x322.png 848w, https://substackcdn.com/image/fetch/$s_!hIKQ!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8a827e55-201d-488d-b23c-f0dfde28c9bc_1002x322.png 1272w, https://substackcdn.com/image/fetch/$s_!hIKQ!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8a827e55-201d-488d-b23c-f0dfde28c9bc_1002x322.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!hIKQ!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8a827e55-201d-488d-b23c-f0dfde28c9bc_1002x322.png" width="1002" height="322" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/8a827e55-201d-488d-b23c-f0dfde28c9bc_1002x322.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:322,&quot;width&quot;:1002,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:61392,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://vutr.substack.com/i/158825951?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8a827e55-201d-488d-b23c-f0dfde28c9bc_1002x322.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!hIKQ!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8a827e55-201d-488d-b23c-f0dfde28c9bc_1002x322.png 424w, https://substackcdn.com/image/fetch/$s_!hIKQ!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8a827e55-201d-488d-b23c-f0dfde28c9bc_1002x322.png 848w, https://substackcdn.com/image/fetch/$s_!hIKQ!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8a827e55-201d-488d-b23c-f0dfde28c9bc_1002x322.png 1272w, https://substackcdn.com/image/fetch/$s_!hIKQ!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8a827e55-201d-488d-b23c-f0dfde28c9bc_1002x322.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" 
viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Image created by the author.</figcaption></figure></div></li><li><p>For data consumption, decision-makers complained that different teams reported different numbers for simple business questions. 
As a result, business users and even data scientists lost trust in the data.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!QFR7!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F831d263a-e59d-4643-bea5-a8795d0d19a2_470x338.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!QFR7!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F831d263a-e59d-4643-bea5-a8795d0d19a2_470x338.png 424w, https://substackcdn.com/image/fetch/$s_!QFR7!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F831d263a-e59d-4643-bea5-a8795d0d19a2_470x338.png 848w, https://substackcdn.com/image/fetch/$s_!QFR7!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F831d263a-e59d-4643-bea5-a8795d0d19a2_470x338.png 1272w, https://substackcdn.com/image/fetch/$s_!QFR7!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F831d263a-e59d-4643-bea5-a8795d0d19a2_470x338.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!QFR7!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F831d263a-e59d-4643-bea5-a8795d0d19a2_470x338.png" width="470" height="338" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/831d263a-e59d-4643-bea5-a8795d0d19a2_470x338.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:338,&quot;width&quot;:470,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:59364,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://vutr.substack.com/i/158825951?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F831d263a-e59d-4643-bea5-a8795d0d19a2_470x338.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!QFR7!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F831d263a-e59d-4643-bea5-a8795d0d19a2_470x338.png 424w, https://substackcdn.com/image/fetch/$s_!QFR7!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F831d263a-e59d-4643-bea5-a8795d0d19a2_470x338.png 848w, https://substackcdn.com/image/fetch/$s_!QFR7!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F831d263a-e59d-4643-bea5-a8795d0d19a2_470x338.png 1272w, https://substackcdn.com/image/fetch/$s_!QFR7!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F831d263a-e59d-4643-bea5-a8795d0d19a2_470x338.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 
0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Image created by the author.</figcaption></figure></div></li></ul><div><hr></div><blockquote><p>To celebrate Lunar New Year (the true New Year holiday in Vietnam), I&#8217;m offering <em><strong>50% off the annual subscription</strong></em>. 
The offer ends soon; grab it now to get full access to nearly 200 high-quality data engineering articles.</p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://vutr.substack.com/subscribe&quot;,&quot;text&quot;:&quot;50% off the annual subscription&quot;,&quot;action&quot;:null,&quot;class&quot;:&quot;button-wrapper&quot;}" data-component-name="ButtonCreateButton"><a class="button primary button-wrapper" href="https://vutr.substack.com/subscribe"><span>50% off the annual subscription</span></a></p></blockquote><div><hr></div><h2>Airbnb Minerva</h2><p>They revamped the data warehouse to improve data quality.</p><p>First, their data engineering team rebuilt key business data models, resulting in lean tables that eliminate redundant joins. These tables served as the new foundation for the analytics.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!UjzG!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F93b2e7c6-973e-4165-82eb-55c933c1a7ed_370x486.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!UjzG!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F93b2e7c6-973e-4165-82eb-55c933c1a7ed_370x486.png 424w, https://substackcdn.com/image/fetch/$s_!UjzG!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F93b2e7c6-973e-4165-82eb-55c933c1a7ed_370x486.png 848w, https://substackcdn.com/image/fetch/$s_!UjzG!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F93b2e7c6-973e-4165-82eb-55c933c1a7ed_370x486.png 1272w, 
https://substackcdn.com/image/fetch/$s_!UjzG!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F93b2e7c6-973e-4165-82eb-55c933c1a7ed_370x486.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!UjzG!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F93b2e7c6-973e-4165-82eb-55c933c1a7ed_370x486.png" width="370" height="486" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/93b2e7c6-973e-4165-82eb-55c933c1a7ed_370x486.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:486,&quot;width&quot;:370,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:64600,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://vutr.substack.com/i/158825951?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F93b2e7c6-973e-4165-82eb-55c933c1a7ed_370x486.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!UjzG!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F93b2e7c6-973e-4165-82eb-55c933c1a7ed_370x486.png 424w, https://substackcdn.com/image/fetch/$s_!UjzG!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F93b2e7c6-973e-4165-82eb-55c933c1a7ed_370x486.png 848w, https://substackcdn.com/image/fetch/$s_!UjzG!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F93b2e7c6-973e-4165-82eb-55c933c1a7ed_370x486.png 1272w, 
https://substackcdn.com/image/fetch/$s_!UjzG!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F93b2e7c6-973e-4165-82eb-55c933c1a7ed_370x486.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Image created by the author.</figcaption></figure></div><p>That still was not enough.</p><p>They needed to join these tables to extract insight, backfill data whenever logic changes, or present the data consistently and correctly in many different consumption tools.</p><p>Airbnb built Minerva for these purposes.</p><p>Minerva took fact and 
dimension tables as inputs, performed data denormalization, and served the aggregated data to downstream applications. Airbnb hoped the Minerva API would close the gap between upstream data and downstream consumption.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!zod-!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5297e02f-6bb2-49f2-8542-a402866085b6_1296x398.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!zod-!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5297e02f-6bb2-49f2-8542-a402866085b6_1296x398.png 424w, https://substackcdn.com/image/fetch/$s_!zod-!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5297e02f-6bb2-49f2-8542-a402866085b6_1296x398.png 848w, https://substackcdn.com/image/fetch/$s_!zod-!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5297e02f-6bb2-49f2-8542-a402866085b6_1296x398.png 1272w, https://substackcdn.com/image/fetch/$s_!zod-!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5297e02f-6bb2-49f2-8542-a402866085b6_1296x398.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!zod-!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5297e02f-6bb2-49f2-8542-a402866085b6_1296x398.png" width="1296" height="398" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/5297e02f-6bb2-49f2-8542-a402866085b6_1296x398.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:398,&quot;width&quot;:1296,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:149831,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://vutr.substack.com/i/158825951?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5297e02f-6bb2-49f2-8542-a402866085b6_1296x398.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!zod-!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5297e02f-6bb2-49f2-8542-a402866085b6_1296x398.png 424w, https://substackcdn.com/image/fetch/$s_!zod-!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5297e02f-6bb2-49f2-8542-a402866085b6_1296x398.png 848w, https://substackcdn.com/image/fetch/$s_!zod-!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5297e02f-6bb2-49f2-8542-a402866085b6_1296x398.png 1272w, https://substackcdn.com/image/fetch/$s_!zod-!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5297e02f-6bb2-49f2-8542-a402866085b6_1296x398.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" 
viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Image created by the author.</figcaption></figure></div><p>At the time of Airbnb&#8217;s sharing, Minerva contained more than 12,000 metrics and 4,000 dimensions, with 200+ data producers across different functions and teams.</p><div><hr></div><h2>Architecture</h2><p>Airbnb built Minerva on top of open-source projects:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!TfL3!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcf146ebe-8c5d-4cfe-9258-7d017fc099e0_392x338.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" 
srcset="https://substackcdn.com/image/fetch/$s_!TfL3!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcf146ebe-8c5d-4cfe-9258-7d017fc099e0_392x338.png 424w, https://substackcdn.com/image/fetch/$s_!TfL3!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcf146ebe-8c5d-4cfe-9258-7d017fc099e0_392x338.png 848w, https://substackcdn.com/image/fetch/$s_!TfL3!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcf146ebe-8c5d-4cfe-9258-7d017fc099e0_392x338.png 1272w, https://substackcdn.com/image/fetch/$s_!TfL3!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcf146ebe-8c5d-4cfe-9258-7d017fc099e0_392x338.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!TfL3!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcf146ebe-8c5d-4cfe-9258-7d017fc099e0_392x338.png" width="392" height="338" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/cf146ebe-8c5d-4cfe-9258-7d017fc099e0_392x338.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:338,&quot;width&quot;:392,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:52422,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://vutr.substack.com/i/158825951?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcf146ebe-8c5d-4cfe-9258-7d017fc099e0_392x338.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" 
srcset="https://substackcdn.com/image/fetch/$s_!TfL3!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcf146ebe-8c5d-4cfe-9258-7d017fc099e0_392x338.png 424w, https://substackcdn.com/image/fetch/$s_!TfL3!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcf146ebe-8c5d-4cfe-9258-7d017fc099e0_392x338.png 848w, https://substackcdn.com/image/fetch/$s_!TfL3!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcf146ebe-8c5d-4cfe-9258-7d017fc099e0_392x338.png 1272w, https://substackcdn.com/image/fetch/$s_!TfL3!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcf146ebe-8c5d-4cfe-9258-7d017fc099e0_392x338.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" 
stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Image created by the author.</figcaption></figure></div><ul><li><p>Airflow for workflow orchestration.</p></li><li><p>Apache Hive and Apache Spark for the compute engine.</p></li><li><p>Presto and Apache Druid for serving.</p></li></ul><p>For a metric, Minerva has components to cover its whole life cycle:</p><ul><li><p>Minerva defines metrics, dimensions, and metadata in a centralized GitHub repository. Anyone at Airbnb with proper permissions can update these definitions.</p></li><li><p>It has a development flow for code review, static validation, and test runs.</p></li><li><p>It executes data aggregation/denormalization by reusing data assets and intermediate joined results.</p></li><li><p>Minerva has a robust computation flow that automatically retries after job failures and includes built-in data-quality checks.</p></li><li><p>It exposes a unified data API to serve metrics and metadata.</p></li><li><p>Because Minerva version-controls data definitions (via Git), it can detect and track changes and then execute data backfilling.</p></li><li><p>Its data management features include cost attribution, GDPR-based deletion, and data access control.</p></li><li><p>For data retention, Minerva supports clean-up of data based on usage; infrequently used datasets can be deleted to save cost.</p></li></ul><div><hr></div><h2>Design principles</h2><p>Airbnb built Minerva to be:</p><ul><li><p><strong>Standardized</strong>: Data is defined in a single place. 
It must serve as a single entry point for anyone searching for definitions.</p></li><li><p><strong>Declarative: </strong>Users define the output they want (as in SQL), and Minerva handles everything from computing metrics to storing and serving them.</p></li><li><p><strong>Scalable</strong>: Minerva must scale to support Airbnb&#8217;s internal data demands.</p></li><li><p><strong>Consistent</strong>: The data is always consistent. If a user changes the definitions, Minerva must backfill the data and keep it up to date.</p></li><li><p><strong>Highly available</strong>: Dataset replacement must be handled with minimal impact on data consumption.</p></li><li><p><strong>Well-tested</strong>: Users can prototype and validate their changes before they are merged into production.</p></li></ul><h3><strong>Standardized</strong></h3><p>Minerva focuses on metrics and dimensions rather than on tables and columns, as databases do.</p><p>When a metric is defined in Minerva, users must provide the necessary metadata, such as ownership, lineage, and a metric description. 
Before Minerva, Airbnb managed metadata inefficiently because definitions were scattered across various business intelligence tools.</p><p>For version control, Minerva treats all definitions as code, so every change must go through a review process before it is merged to production.</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!3ahR!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fee832728-d4fe-4615-bef3-93183053891a_500x194.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!3ahR!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fee832728-d4fe-4615-bef3-93183053891a_500x194.png 424w, https://substackcdn.com/image/fetch/$s_!3ahR!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fee832728-d4fe-4615-bef3-93183053891a_500x194.png 848w, https://substackcdn.com/image/fetch/$s_!3ahR!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fee832728-d4fe-4615-bef3-93183053891a_500x194.png 1272w, https://substackcdn.com/image/fetch/$s_!3ahR!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fee832728-d4fe-4615-bef3-93183053891a_500x194.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!3ahR!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fee832728-d4fe-4615-bef3-93183053891a_500x194.png" width="500" height="194" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/ee832728-d4fe-4615-bef3-93183053891a_500x194.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:194,&quot;width&quot;:500,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:25181,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://vutr.substack.com/i/158825951?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fee832728-d4fe-4615-bef3-93183053891a_500x194.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!3ahR!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fee832728-d4fe-4615-bef3-93183053891a_500x194.png 424w, https://substackcdn.com/image/fetch/$s_!3ahR!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fee832728-d4fe-4615-bef3-93183053891a_500x194.png 848w, https://substackcdn.com/image/fetch/$s_!3ahR!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fee832728-d4fe-4615-bef3-93183053891a_500x194.png 1272w, https://substackcdn.com/image/fetch/$s_!3ahR!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fee832728-d4fe-4615-bef3-93183053891a_500x194.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a><figcaption class="image-caption">Image created by the author.</figcaption></figure></div><p>Minerva&#8217;s configuration system cores are event and dimension sources, corresponding to fact tables and dimension 
tables in the data warehouse:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!xLNP!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbe35058b-e2a0-437a-bc76-a62cf8bf7a6a_364x270.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!xLNP!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbe35058b-e2a0-437a-bc76-a62cf8bf7a6a_364x270.png 424w, https://substackcdn.com/image/fetch/$s_!xLNP!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbe35058b-e2a0-437a-bc76-a62cf8bf7a6a_364x270.png 848w, https://substackcdn.com/image/fetch/$s_!xLNP!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbe35058b-e2a0-437a-bc76-a62cf8bf7a6a_364x270.png 1272w, https://substackcdn.com/image/fetch/$s_!xLNP!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbe35058b-e2a0-437a-bc76-a62cf8bf7a6a_364x270.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!xLNP!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbe35058b-e2a0-437a-bc76-a62cf8bf7a6a_364x270.png" width="364" height="270" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/be35058b-e2a0-437a-bc76-a62cf8bf7a6a_364x270.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:270,&quot;width&quot;:364,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:30128,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://vutr.substack.com/i/158825951?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbe35058b-e2a0-437a-bc76-a62cf8bf7a6a_364x270.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!xLNP!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbe35058b-e2a0-437a-bc76-a62cf8bf7a6a_364x270.png 424w, https://substackcdn.com/image/fetch/$s_!xLNP!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbe35058b-e2a0-437a-bc76-a62cf8bf7a6a_364x270.png 848w, https://substackcdn.com/image/fetch/$s_!xLNP!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbe35058b-e2a0-437a-bc76-a62cf8bf7a6a_364x270.png 1272w, https://substackcdn.com/image/fetch/$s_!xLNP!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbe35058b-e2a0-437a-bc76-a62cf8bf7a6a_364x270.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 
0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Image created by the author.</figcaption></figure></div><ul><li><p>Event sources define the atomic events which are used to calculate metrics.</p></li><li><p>Dimension sources contain attributes that can be used with the metrics.</p></li></ul><h3><strong>Declarative</strong></h3><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!9W9C!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1ac7d462-4571-4b25-aa26-f700dc19c597_1400x606.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" 
srcset="https://substackcdn.com/image/fetch/$s_!9W9C!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1ac7d462-4571-4b25-aa26-f700dc19c597_1400x606.png 424w, https://substackcdn.com/image/fetch/$s_!9W9C!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1ac7d462-4571-4b25-aa26-f700dc19c597_1400x606.png 848w, https://substackcdn.com/image/fetch/$s_!9W9C!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1ac7d462-4571-4b25-aa26-f700dc19c597_1400x606.png 1272w, https://substackcdn.com/image/fetch/$s_!9W9C!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1ac7d462-4571-4b25-aa26-f700dc19c597_1400x606.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!9W9C!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1ac7d462-4571-4b25-aa26-f700dc19c597_1400x606.png" width="1400" height="606" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/1ac7d462-4571-4b25-aa26-f700dc19c597_1400x606.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:606,&quot;width&quot;:1400,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" 
srcset="https://substackcdn.com/image/fetch/$s_!9W9C!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1ac7d462-4571-4b25-aa26-f700dc19c597_1400x606.png 424w, https://substackcdn.com/image/fetch/$s_!9W9C!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1ac7d462-4571-4b25-aa26-f700dc19c597_1400x606.png 848w, https://substackcdn.com/image/fetch/$s_!9W9C!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1ac7d462-4571-4b25-aa26-f700dc19c597_1400x606.png 1272w, https://substackcdn.com/image/fetch/$s_!9W9C!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1ac7d462-4571-4b25-aa26-f700dc19c597_1400x606.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" 
stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Before Minerva, extracting insight took many steps (left). Minerva saves users much of that time (right). <a href="https://medium.com/airbnb-engineering/airbnb-metric-computation-with-minerva-part-2-9afe6695b486">How Airbnb Standardized Metric Computation at Scale</a> (2021)</figcaption></figure></div><p>One of Minerva&#8217;s promises is to simplify this time-consuming workflow so that users can quickly turn data into insights. Users can define a dimension set, an analysis-friendly dataset created from Minerva metrics and dimensions. Compared with ad-hoc datasets, dimension sets have several advantages:</p><ul><li><p>Users define only what they want. Minerva hides the implementation details and the complexity of creating the dataset.</p></li><li><p>Dimension sets benefit from Minerva&#8217;s existing features, such as data-quality checks and automated backfilling.</p></li><li><p>Minerva can store and optimize these dimension sets to reduce query times.</p></li><li><p>Minerva can reuse dimension sets, which helps reduce dataset duplication.</p></li></ul><h3><strong>Scalable</strong></h3><p>Minerva was serving 5,000+ datasets across hundreds of users and 80+ teams.</p><p>To ensure it can scale, Airbnb built Minerva&#8217;s computation around the DRY (Don&#8217;t Repeat Yourself) principle. 
They tried to reuse materialized data as much as possible to reduce wasted computing resources.</p><p>The computational flow has several stages:</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!Hvc2!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb1ad2314-3603-4207-813c-d1477a2fe59d_1228x306.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!Hvc2!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb1ad2314-3603-4207-813c-d1477a2fe59d_1228x306.png 424w, https://substackcdn.com/image/fetch/$s_!Hvc2!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb1ad2314-3603-4207-813c-d1477a2fe59d_1228x306.png 848w, https://substackcdn.com/image/fetch/$s_!Hvc2!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb1ad2314-3603-4207-813c-d1477a2fe59d_1228x306.png 1272w, https://substackcdn.com/image/fetch/$s_!Hvc2!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb1ad2314-3603-4207-813c-d1477a2fe59d_1228x306.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!Hvc2!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb1ad2314-3603-4207-813c-d1477a2fe59d_1228x306.png" width="1228" height="306" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/b1ad2314-3603-4207-813c-d1477a2fe59d_1228x306.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:306,&quot;width&quot;:1228,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:94480,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://vutr.substack.com/i/158825951?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb1ad2314-3603-4207-813c-d1477a2fe59d_1228x306.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!Hvc2!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb1ad2314-3603-4207-813c-d1477a2fe59d_1228x306.png 424w, https://substackcdn.com/image/fetch/$s_!Hvc2!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb1ad2314-3603-4207-813c-d1477a2fe59d_1228x306.png 848w, https://substackcdn.com/image/fetch/$s_!Hvc2!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb1ad2314-3603-4207-813c-d1477a2fe59d_1228x306.png 1272w, https://substackcdn.com/image/fetch/$s_!Hvc2!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb1ad2314-3603-4207-813c-d1477a2fe59d_1228x306.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a><figcaption class="image-caption">Image created by the author.</figcaption></figure></div><ul><li><p><strong>Ingestion Stage</strong>: Minerva sensors are triggered when new data is added to the 
table&#8217;s partitions. The latest data is then ingested into Minerva.</p></li><li><p><strong>Data Check Stage</strong>: This stage ensures that upstream data is &#8220;right.&#8221; For example, a field should not be empty, and primary keys should be unique.</p></li><li><p><strong>Join Stage</strong>: Minerva executes the joins based on join keys to generate dimension sets. When the same calculation (e.g., the same city dimension) appears in different dimension sets, Minerva computes it with the same logic on the same source tables. This ensures consistent dataset computation at scale.</p></li><li><p><strong>Post-processing and Serving Stage</strong>: Minerva further aggregates outputs for downstream consumption. It can optionally optimize the data for end-user query performance.</p></li></ul><p>In addition, Airbnb included features to make Minerva operate efficiently, such as self-healing and automated backfilling.</p><p>For self-healing, Minerva tries to be data-aware. On every job, it checks for missing data, and if any is identified, it is included in the current run. This means the date range of a single run can change dynamically (e.g., 3 days &#8594; 4 days of data), so users don&#8217;t need to reset tasks manually.</p><p>For backfilling, a long backfill window (e.g., several months) may generate an expensive query. If they instead split the backfill window into smaller ones, the backfill will be too slow for a large initial window. 
To solve this, Airbnb introduced the batch backfill for Minerva.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!VfqX!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc26fab7e-42c9-4607-bd17-0632f293b540_466x390.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!VfqX!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc26fab7e-42c9-4607-bd17-0632f293b540_466x390.png 424w, https://substackcdn.com/image/fetch/$s_!VfqX!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc26fab7e-42c9-4607-bd17-0632f293b540_466x390.png 848w, https://substackcdn.com/image/fetch/$s_!VfqX!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc26fab7e-42c9-4607-bd17-0632f293b540_466x390.png 1272w, https://substackcdn.com/image/fetch/$s_!VfqX!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc26fab7e-42c9-4607-bd17-0632f293b540_466x390.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!VfqX!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc26fab7e-42c9-4607-bd17-0632f293b540_466x390.png" width="466" height="390" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/c26fab7e-42c9-4607-bd17-0632f293b540_466x390.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:390,&quot;width&quot;:466,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:45970,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://vutr.substack.com/i/158825951?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc26fab7e-42c9-4607-bd17-0632f293b540_466x390.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!VfqX!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc26fab7e-42c9-4607-bd17-0632f293b540_466x390.png 424w, https://substackcdn.com/image/fetch/$s_!VfqX!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc26fab7e-42c9-4607-bd17-0632f293b540_466x390.png 848w, https://substackcdn.com/image/fetch/$s_!VfqX!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc26fab7e-42c9-4607-bd17-0632f293b540_466x390.png 1272w, https://substackcdn.com/image/fetch/$s_!VfqX!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc26fab7e-42c9-4607-bd17-0632f293b540_466x390.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 
0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Image created by the author.</figcaption></figure></div><p>They still split the backfill window into smaller ones, sized according to how well the dataset scales. For example, a one-year window would be divided into twelve one-month windows, and the twelve jobs then run in parallel.</p><h3><strong>Consistent</strong></h3><p>Internal users frequently change Minerva&#8217;s definitions. Airbnb introduced a data version to ensure that Minerva datasets stay consistent and up to date.</p><p>The data version is a hash of all the essential fields specified in the definitions (e.g., data source). 
When users change any field used for the hashing, the data version is automatically updated.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!Migc!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F320eeca1-73fe-46d9-8390-3e89f2704146_528x332.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!Migc!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F320eeca1-73fe-46d9-8390-3e89f2704146_528x332.png 424w, https://substackcdn.com/image/fetch/$s_!Migc!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F320eeca1-73fe-46d9-8390-3e89f2704146_528x332.png 848w, https://substackcdn.com/image/fetch/$s_!Migc!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F320eeca1-73fe-46d9-8390-3e89f2704146_528x332.png 1272w, https://substackcdn.com/image/fetch/$s_!Migc!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F320eeca1-73fe-46d9-8390-3e89f2704146_528x332.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!Migc!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F320eeca1-73fe-46d9-8390-3e89f2704146_528x332.png" width="528" height="332" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/320eeca1-73fe-46d9-8390-3e89f2704146_528x332.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:332,&quot;width&quot;:528,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:53265,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://vutr.substack.com/i/158825951?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F320eeca1-73fe-46d9-8390-3e89f2704146_528x332.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!Migc!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F320eeca1-73fe-46d9-8390-3e89f2704146_528x332.png 424w, https://substackcdn.com/image/fetch/$s_!Migc!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F320eeca1-73fe-46d9-8390-3e89f2704146_528x332.png 848w, https://substackcdn.com/image/fetch/$s_!Migc!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F320eeca1-73fe-46d9-8390-3e89f2704146_528x332.png 1272w, https://substackcdn.com/image/fetch/$s_!Migc!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F320eeca1-73fe-46d9-8390-3e89f2704146_528x332.png 1456w" sizes="100vw" loading="lazy"></picture></div></a><figcaption class="image-caption">Image created by the author.</figcaption></figure></div><p>Each dataset carries a data version; when that version changes, Minerva automatically creates and backfills a new dataset. This approach ensures that upstream changes are propagated to all downstream datasets, and no Minerva dataset will diverge from the source of truth.</p><h3><strong>Highly Available</strong></h3><p>Airbnb observed that backfills often could not catch up with user changes when updates affected many datasets. Because Minerva promises consistent and up-to-date data, a frequently changing dataset could keep backfilling indefinitely and cause data downtime.</p><p>Airbnb deployed a parallel computation environment called the Staging environment. The Staging environment replicates the Production environment. 
They backfill data in Staging before publishing it to Production. The flow for the Staging environment is as follows:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!L24r!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5db21156-a49d-4b73-acac-b58c84f6c5ba_708x396.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!L24r!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5db21156-a49d-4b73-acac-b58c84f6c5ba_708x396.png 424w, https://substackcdn.com/image/fetch/$s_!L24r!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5db21156-a49d-4b73-acac-b58c84f6c5ba_708x396.png 848w, https://substackcdn.com/image/fetch/$s_!L24r!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5db21156-a49d-4b73-acac-b58c84f6c5ba_708x396.png 1272w, https://substackcdn.com/image/fetch/$s_!L24r!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5db21156-a49d-4b73-acac-b58c84f6c5ba_708x396.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!L24r!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5db21156-a49d-4b73-acac-b58c84f6c5ba_708x396.png" width="708" height="396" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/5db21156-a49d-4b73-acac-b58c84f6c5ba_708x396.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:396,&quot;width&quot;:708,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:35299,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://vutr.substack.com/i/158825951?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5db21156-a49d-4b73-acac-b58c84f6c5ba_708x396.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!L24r!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5db21156-a49d-4b73-acac-b58c84f6c5ba_708x396.png 424w, https://substackcdn.com/image/fetch/$s_!L24r!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5db21156-a49d-4b73-acac-b58c84f6c5ba_708x396.png 848w, https://substackcdn.com/image/fetch/$s_!L24r!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5db21156-a49d-4b73-acac-b58c84f6c5ba_708x396.png 1272w, https://substackcdn.com/image/fetch/$s_!L24r!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5db21156-a49d-4b73-acac-b58c84f6c5ba_708x396.png 1456w" sizes="100vw" loading="lazy"></picture></div></a><figcaption class="image-caption">Image created by the author.</figcaption></figure></div><ol><li><p>Users develop and test changes in their local environment.</p></li><li><p>They merge the changes into the Staging environment.</p></li><li><p>The Staging environment loads the Staging configurations, retrieves any Production configurations it needs, and starts backfilling the modified datasets.</p></li><li><p>The Staging changes are merged into Production once the backfill is done.</p></li></ol><h3><strong>Well-Tested</strong></h3><p>To help users validate data correctness, Minerva has a tool that reads from production but writes to a sandbox environment. The tool generates sample data on top of the user&#8217;s local modifications, allowing users to validate their changes.</p><p>The tool shows the step-by-step computation that Minerva follows to generate the output. 
This feature helps users debug issues as if they were running the logic themselves. Finally, it also lets users configure date ranges to limit the test data size, which saves them a lot of time waiting for tests to finish.</p><div><hr></div><h2>Consumption</h2><p>The Minerva teams partnered with other internal teams to create an ecosystem around Minerva:</p><ul><li><p>Data catalog: They index all Minerva metrics and dimensions in Airbnb&#8217;s Dataportal. When a user searches for a metric, the Dataportal shows the result from Minerva.</p></li><li><p>Dataportal also offers a data exploration feature called Metric Explorer. Users can see metric trends with additional slicing and drill-down options, such as Group By and Filter. Users who want to dig deeper can switch to Superset to perform more advanced analytics.</p></li><li><p>They migrated the A/B test platform&#8217;s proprietary metric repo to Minerva, which helps achieve consistency across experimentation and analytics.</p></li><li><p>To enable executive reporting, they built a reporting framework that turns a set of user-specified Minerva metrics and dimensions into aggregated metric time series.</p></li><li><p>Minerva exposes an API for Airbnb&#8217;s R and Python clients. This lets data scientists query Minerva data in a notebook environment. Data scientists now get the same metric results as other tools such as Metric Explorer, which saves them a lot of time when investigating data discrepancies.</p></li></ul><div><hr></div><h2>Outro</h2><p>Thank you for reading this far.</p><p>In this article, we explored the motivation behind Airbnb&#8217;s semantic platform, its architecture and design principles, and finally, how Minerva serves downstream consumption.</p><p>Now it&#8217;s time to say goodbye. 
See you in my next articles.</p><div><hr></div><h2>Reference</h2><p><em>[1] The Airbnb Tech Blog, <a href="https://medium.com/airbnb-engineering/how-airbnb-achieved-metric-consistency-at-scale-f23cc53dea70">How Airbnb Achieved Metric Consistency at Scale</a> (2021)</em></p><p><em>[2] The Airbnb Tech Blog, <a href="https://medium.com/airbnb-engineering/airbnb-metric-computation-with-minerva-part-2-9afe6695b486">How Airbnb Standardized Metric Computation at Scale</a> (2021)</em></p>]]></content:encoded></item><item><title><![CDATA[How is Databricks' Spark different from Open-Source Spark?]]></title><description><![CDATA[Why don't they just use the open-sourced Apache Spark?]]></description><link>https://vutr.substack.com/p/how-is-databricks-spark-different</link><guid isPermaLink="false">https://vutr.substack.com/p/how-is-databricks-spark-different</guid><dc:creator><![CDATA[Vu Trinh]]></dc:creator><pubDate>Thu, 06 Mar 2025 03:15:59 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!6sgM!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F568aa09c-1a49-4d6a-8a84-4d5ef37aaa3b_2000x1429.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<blockquote><p>To celebrate Lunar New Year (the true New Year holiday in Vietnam), I&#8217;m offering <em><strong>50% off the annual subscription</strong></em>. 
The offer ends soon; grab it now to get full access to nearly 200 high-quality data engineering articles.</p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://vutr.substack.com/subscribe&quot;,&quot;text&quot;:&quot;50% off the annual subscription&quot;,&quot;action&quot;:null,&quot;class&quot;:&quot;button-wrapper&quot;}" data-component-name="ButtonCreateButton"><a class="button primary button-wrapper" href="https://vutr.substack.com/subscribe"><span>50% off the annual subscription</span></a></p></blockquote><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!6sgM!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F568aa09c-1a49-4d6a-8a84-4d5ef37aaa3b_2000x1429.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!6sgM!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F568aa09c-1a49-4d6a-8a84-4d5ef37aaa3b_2000x1429.png 424w, https://substackcdn.com/image/fetch/$s_!6sgM!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F568aa09c-1a49-4d6a-8a84-4d5ef37aaa3b_2000x1429.png 848w, https://substackcdn.com/image/fetch/$s_!6sgM!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F568aa09c-1a49-4d6a-8a84-4d5ef37aaa3b_2000x1429.png 1272w, https://substackcdn.com/image/fetch/$s_!6sgM!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F568aa09c-1a49-4d6a-8a84-4d5ef37aaa3b_2000x1429.png 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!6sgM!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F568aa09c-1a49-4d6a-8a84-4d5ef37aaa3b_2000x1429.png" width="1456" height="1040" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/568aa09c-1a49-4d6a-8a84-4d5ef37aaa3b_2000x1429.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1040,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:281815,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://vutr.substack.com/i/156976428?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F568aa09c-1a49-4d6a-8a84-4d5ef37aaa3b_2000x1429.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!6sgM!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F568aa09c-1a49-4d6a-8a84-4d5ef37aaa3b_2000x1429.png 424w, https://substackcdn.com/image/fetch/$s_!6sgM!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F568aa09c-1a49-4d6a-8a84-4d5ef37aaa3b_2000x1429.png 848w, https://substackcdn.com/image/fetch/$s_!6sgM!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F568aa09c-1a49-4d6a-8a84-4d5ef37aaa3b_2000x1429.png 1272w, https://substackcdn.com/image/fetch/$s_!6sgM!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F568aa09c-1a49-4d6a-8a84-4d5ef37aaa3b_2000x1429.png 1456w" 
sizes="100vw" fetchpriority="high"></picture></div></a><figcaption class="image-caption">Image created by the author.</figcaption></figure></div><div><hr></div><h2>Intro</h2><p>This week, we will explore the differences between open-source Spark and Databricks Spark, why the creators originally developed Spark, why Spark alone is insufficient for Databricks' Lakehouse solution, and how Databricks makes Spark significantly more efficient.</p><div><hr></div><h2>Apache Spark</h2><h3>Why it was created</h3><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" 
href="https://substackcdn.com/image/fetch/$s_!t_H3!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd5fc448d-eb0b-4307-9528-c5f9b82e2858_640x490.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!t_H3!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd5fc448d-eb0b-4307-9528-c5f9b82e2858_640x490.png 424w, https://substackcdn.com/image/fetch/$s_!t_H3!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd5fc448d-eb0b-4307-9528-c5f9b82e2858_640x490.png 848w, https://substackcdn.com/image/fetch/$s_!t_H3!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd5fc448d-eb0b-4307-9528-c5f9b82e2858_640x490.png 1272w, https://substackcdn.com/image/fetch/$s_!t_H3!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd5fc448d-eb0b-4307-9528-c5f9b82e2858_640x490.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!t_H3!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd5fc448d-eb0b-4307-9528-c5f9b82e2858_640x490.png" width="640" height="490" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/d5fc448d-eb0b-4307-9528-c5f9b82e2858_640x490.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:490,&quot;width&quot;:640,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:160502,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://vutr.substack.com/i/156976428?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd5fc448d-eb0b-4307-9528-c5f9b82e2858_640x490.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!t_H3!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd5fc448d-eb0b-4307-9528-c5f9b82e2858_640x490.png 424w, https://substackcdn.com/image/fetch/$s_!t_H3!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd5fc448d-eb0b-4307-9528-c5f9b82e2858_640x490.png 848w, https://substackcdn.com/image/fetch/$s_!t_H3!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd5fc448d-eb0b-4307-9528-c5f9b82e2858_640x490.png 1272w, https://substackcdn.com/image/fetch/$s_!t_H3!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd5fc448d-eb0b-4307-9528-c5f9b82e2858_640x490.png 1456w" sizes="100vw"></picture></div></a><figcaption class="image-caption">Image created by the author.</figcaption></figure></div><p>Apache Spark is an open-source distributed computing system designed to quickly process volumes of data too large for a single machine to handle. Spark distributes data and computations across multiple machines.</p><p>It was first developed at UC Berkeley&#8217;s AMPLab in 2009.</p><p>At the time, Hadoop MapReduce was the popular choice for processing big datasets across multiple machines. AMPLab collaborated with early MapReduce users to identify its strengths and limitations. 
They also worked closely with Hadoop users at UC Berkeley, who focused on large-scale machine learning requiring iterative algorithms and multiple data passes.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!yS5J!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffb9953d0-e5fd-4c1d-864d-5e12c7d4b582_586x476.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!yS5J!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffb9953d0-e5fd-4c1d-864d-5e12c7d4b582_586x476.png 424w, https://substackcdn.com/image/fetch/$s_!yS5J!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffb9953d0-e5fd-4c1d-864d-5e12c7d4b582_586x476.png 848w, https://substackcdn.com/image/fetch/$s_!yS5J!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffb9953d0-e5fd-4c1d-864d-5e12c7d4b582_586x476.png 1272w, https://substackcdn.com/image/fetch/$s_!yS5J!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffb9953d0-e5fd-4c1d-864d-5e12c7d4b582_586x476.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!yS5J!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffb9953d0-e5fd-4c1d-864d-5e12c7d4b582_586x476.png" width="586" height="476" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/fb9953d0-e5fd-4c1d-864d-5e12c7d4b582_586x476.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:476,&quot;width&quot;:586,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:176970,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://vutr.substack.com/i/156976428?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffb9953d0-e5fd-4c1d-864d-5e12c7d4b582_586x476.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!yS5J!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffb9953d0-e5fd-4c1d-864d-5e12c7d4b582_586x476.png 424w, https://substackcdn.com/image/fetch/$s_!yS5J!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffb9953d0-e5fd-4c1d-864d-5e12c7d4b582_586x476.png 848w, https://substackcdn.com/image/fetch/$s_!yS5J!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffb9953d0-e5fd-4c1d-864d-5e12c7d4b582_586x476.png 1272w, https://substackcdn.com/image/fetch/$s_!yS5J!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffb9953d0-e5fd-4c1d-864d-5e12c7d4b582_586x476.png 1456w" sizes="100vw" loading="lazy"></picture></div></a><figcaption class="image-caption">Hadoop was famous back then. Image created by the author.</figcaption></figure></div><p>These discussions surfaced two insights. First, cluster computing had significant potential. Second, MapReduce made building large applications inefficient, especially for machine learning tasks that require multiple passes over the data. For example, a machine learning algorithm might need to make many passes over the same dataset. 
With MapReduce, each pass must be written as a separate job and launched individually on the cluster.</p><p>To address this, the Spark team created a functional programming-based API to simplify multistep applications and developed a new engine for efficient in-memory data sharing across computation steps.</p><h3>Spark SQL</h3><p>Spark was designed as a general-purpose cluster computing engine rather than a dedicated database query engine. Recognizing the need for relational processing over big datasets, the people behind Apache Spark introduced a new module, Spark SQL, in 2014. This module lets Spark programmers leverage the benefits of relational processing (e.g., declarative queries and optimized storage). Spark SQL introduces two significant enhancements.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!zBOo!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4c9865cb-aa88-466d-a6ef-49075dcaf57a_508x516.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!zBOo!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4c9865cb-aa88-466d-a6ef-49075dcaf57a_508x516.png 424w, https://substackcdn.com/image/fetch/$s_!zBOo!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4c9865cb-aa88-466d-a6ef-49075dcaf57a_508x516.png 848w, https://substackcdn.com/image/fetch/$s_!zBOo!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4c9865cb-aa88-466d-a6ef-49075dcaf57a_508x516.png 1272w, 
https://substackcdn.com/image/fetch/$s_!zBOo!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4c9865cb-aa88-466d-a6ef-49075dcaf57a_508x516.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!zBOo!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4c9865cb-aa88-466d-a6ef-49075dcaf57a_508x516.png" width="508" height="516" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/4c9865cb-aa88-466d-a6ef-49075dcaf57a_508x516.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:516,&quot;width&quot;:508,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:56240,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://vutr.substack.com/i/156976428?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4c9865cb-aa88-466d-a6ef-49075dcaf57a_508x516.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!zBOo!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4c9865cb-aa88-466d-a6ef-49075dcaf57a_508x516.png 424w, https://substackcdn.com/image/fetch/$s_!zBOo!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4c9865cb-aa88-466d-a6ef-49075dcaf57a_508x516.png 848w, https://substackcdn.com/image/fetch/$s_!zBOo!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4c9865cb-aa88-466d-a6ef-49075dcaf57a_508x516.png 1272w, 
https://substackcdn.com/image/fetch/$s_!zBOo!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4c9865cb-aa88-466d-a6ef-49075dcaf57a_508x516.png 1456w" sizes="100vw" loading="lazy"></picture></div></a><figcaption class="image-caption">Image created by the author. 
<a href="https://people.csail.mit.edu/matei/papers/2015/sigmod_spark_sql.pdf">Reference</a></figcaption></figure></div><ul><li><p>First, it integrates relational and procedural processing through a declarative DataFrame API.</p></li><li><p>Second, it incorporates a highly extensible optimizer, Catalyst, which leverages Scala's features to facilitate the addition of composable rules and manage code generation.</p></li></ul><p>The goals of Spark SQL are to:</p><ul><li><p>Support relational processing of Spark&#8217;s native RDDs and external data sources using a convenient API.</p></li><li><p>Offer high performance using DBMS techniques.</p></li><li><p>Efficiently support new data sources.</p></li><li><p>Enable extension with advanced analytics algorithms such as graph processing and machine learning.</p></li></ul><p>The people behind Spark aim to make it a viable option as a query engine.</p><div><hr></div><h2>Databricks</h2><p>The Apache Spark team founded Databricks in 2013. The company aims to simplify the process of building and deploying Spark applications for organizations. In 2019, Databricks introduced Delta Lake, a table format that brings warehouse capabilities to data lakes.</p><p>In 2021, they&nbsp;<a href="https://www.cidrdb.org/cidr2021/papers/cidr2021_paper17.pdf">released a paper</a>&nbsp;introducing a new data management paradigm, the Lakehouse. 
This paradigm combines the best of both worlds: the warehouse's robust management features with the lake's theoretically unlimited scalability.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!R3Oj!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1e7c61f6-1431-4efe-b45e-c563de9612e1_1496x1040.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!R3Oj!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1e7c61f6-1431-4efe-b45e-c563de9612e1_1496x1040.png 424w, https://substackcdn.com/image/fetch/$s_!R3Oj!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1e7c61f6-1431-4efe-b45e-c563de9612e1_1496x1040.png 848w, https://substackcdn.com/image/fetch/$s_!R3Oj!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1e7c61f6-1431-4efe-b45e-c563de9612e1_1496x1040.png 1272w, https://substackcdn.com/image/fetch/$s_!R3Oj!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1e7c61f6-1431-4efe-b45e-c563de9612e1_1496x1040.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!R3Oj!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1e7c61f6-1431-4efe-b45e-c563de9612e1_1496x1040.png" width="724.5703125" height="503.61617874313185" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/1e7c61f6-1431-4efe-b45e-c563de9612e1_1496x1040.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:false,&quot;imageSize&quot;:&quot;normal&quot;,&quot;height&quot;:1012,&quot;width&quot;:1456,&quot;resizeWidth&quot;:724.5703125,&quot;bytes&quot;:405902,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://vutr.substack.com/i/156976428?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1e7c61f6-1431-4efe-b45e-c563de9612e1_1496x1040.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!R3Oj!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1e7c61f6-1431-4efe-b45e-c563de9612e1_1496x1040.png 424w, https://substackcdn.com/image/fetch/$s_!R3Oj!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1e7c61f6-1431-4efe-b45e-c563de9612e1_1496x1040.png 848w, https://substackcdn.com/image/fetch/$s_!R3Oj!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1e7c61f6-1431-4efe-b45e-c563de9612e1_1496x1040.png 1272w, https://substackcdn.com/image/fetch/$s_!R3Oj!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1e7c61f6-1431-4efe-b45e-c563de9612e1_1496x1040.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg 
role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Lakehouse. Image created by the author. 
</figcaption></figure></div><p>Databricks aimed to solve several problems with the two-tier data architecture: warehouse data going stale relative to the lake, the difficulty and cost of consolidating the data lake and the warehouse, and users paying for storage twice because data is duplicated across the two.</p><p>They offer a managed lakehouse solution with Delta Lake as the storage layer and Spark as the query engine.</p><h2>The challenges</h2><p>Databricks does not just want to offer a data management system; it must also deliver high performance to compete with other solutions in the market, such as Snowflake, BigQuery, and Redshift.</p><p>At that time, all of these solutions positioned themselves primarily as cloud data warehouses. The lakehouse paradigm caused Databricks some problems because Spark was not originally developed to be a native query engine:</p><ul><li><p>Lakehouse query engines deal with a greater variety of data than traditional warehouses. From organized datasets to raw data with messy layouts, many small files, many columns, and no useful statistics, the execution engine must be flexible enough to deliver good performance on a wide range of data.</p></li><li><p>Databricks initially offered Spark as the lakehouse engine. Any enhancement to the query engine had to avoid disrupting the many customers already using Spark.</p></li></ul><p>They needed a more efficient query engine but couldn&#8217;t replace Spark. So, what did they do? Simple: they enhanced Spark in place.</p><div><hr></div><blockquote><p>To celebrate Lunar New Year (the true New Year holiday in Vietnam), I&#8217;m offering <em><strong>50% off the annual subscription</strong></em>. 
The offer ends soon; grab it now to get full access to nearly 200 high-quality data engineering articles.</p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://vutr.substack.com/subscribe&quot;,&quot;text&quot;:&quot;50% off the annual subscription&quot;,&quot;action&quot;:null,&quot;class&quot;:&quot;button-wrapper&quot;}" data-component-name="ButtonCreateButton"><a class="button primary button-wrapper" href="https://vutr.substack.com/subscribe"><span>50% off the annual subscription</span></a></p></blockquote><div><hr></div><h2>Their effort</h2><p>An important thing to note is that before this effort to enhance Apache Spark, Databricks already built their own Spark runtime, the Databricks Runtime (DBR), which is a <a href="https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/fork-a-repo">fork</a> of Apache Spark that provides the same interface but has enhancements for reliability and performance.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!XmZS!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa07aa832-0aa2-4bfb-8c16-73d66c501506_936x258.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!XmZS!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa07aa832-0aa2-4bfb-8c16-73d66c501506_936x258.png 424w, https://substackcdn.com/image/fetch/$s_!XmZS!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa07aa832-0aa2-4bfb-8c16-73d66c501506_936x258.png 848w, 
https://substackcdn.com/image/fetch/$s_!XmZS!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa07aa832-0aa2-4bfb-8c16-73d66c501506_936x258.png 1272w, https://substackcdn.com/image/fetch/$s_!XmZS!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa07aa832-0aa2-4bfb-8c16-73d66c501506_936x258.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!XmZS!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa07aa832-0aa2-4bfb-8c16-73d66c501506_936x258.png" width="936" height="258" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/a07aa832-0aa2-4bfb-8c16-73d66c501506_936x258.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:258,&quot;width&quot;:936,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:45423,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://vutr.substack.com/i/156976428?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa07aa832-0aa2-4bfb-8c16-73d66c501506_936x258.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!XmZS!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa07aa832-0aa2-4bfb-8c16-73d66c501506_936x258.png 424w, https://substackcdn.com/image/fetch/$s_!XmZS!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa07aa832-0aa2-4bfb-8c16-73d66c501506_936x258.png 848w, 
https://substackcdn.com/image/fetch/$s_!XmZS!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa07aa832-0aa2-4bfb-8c16-73d66c501506_936x258.png 1272w, https://substackcdn.com/image/fetch/$s_!XmZS!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa07aa832-0aa2-4bfb-8c16-73d66c501506_936x258.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Image created by the author.</figcaption></figure></div><p>But they need a little more than that.</p><p>They built the 
Photon engine, a library that integrates closely with the DBR. The engine acts as a new set of physical operators inside the DBR, and a query plan can use these operators like any other Spark operator. Databricks&#8217;s customers can continue to run their workloads without any changes and still benefit from Photon.</p><p>The system can run a query partially in Photon; when the query needs an operation that Photon does not support, execution falls back to Spark SQL. Databricks tests Photon to ensure its semantics are consistent with Spark SQL&#8217;s.</p><p>Databricks built Photon using a <a href="https://www.youtube.com/watch?v=FrspnYbFSxQ">vectorized model</a> instead of <a href="https://www.youtube.com/watch?v=UPQ53hM6AWE">the code generation</a> approach that Apache Spark implements. Vectorized execution enables runtime adaptivity: Photon discovers, maintains, and leverages micro-batch data characteristics with specialized code paths to adapt to the properties of Lakehouse data.</p><p>Another essential design decision Databricks made when developing Photon was writing it in&nbsp;<a href="https://vi.wikipedia.org/wiki/C%2B%2B">C++</a>&nbsp;instead of following the Spark approach, which runs on the&nbsp;<a href="https://en.wikipedia.org/wiki/Java_virtual_machine">Java Virtual Machine (JVM)</a>. Databricks observed that&nbsp;<em>&#8220;the Spark applications were hitting performance ceilings with the existing JVM-based engine.&#8221;</em>&nbsp;Moreover, they found that the performance of native code was easier to explain than that of the JVM engine, because they could explicitly control aspects like&nbsp;<a href="https://isocpp.org/wiki/faq/freestore-mgmt">memory management</a>&nbsp;and&nbsp;<a href="https://en.wikipedia.org/wiki/Single_instruction,_multiple_data">SIMD</a>&nbsp;in C++.</p><div><hr></div><h2>The Photon Designs</h2><h3>JVM vs. 
Native Execution</h3><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!Wpr1!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1443b02b-96e4-4fc8-a5dc-f93477f1f4a9_554x338.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!Wpr1!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1443b02b-96e4-4fc8-a5dc-f93477f1f4a9_554x338.png 424w, https://substackcdn.com/image/fetch/$s_!Wpr1!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1443b02b-96e4-4fc8-a5dc-f93477f1f4a9_554x338.png 848w, https://substackcdn.com/image/fetch/$s_!Wpr1!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1443b02b-96e4-4fc8-a5dc-f93477f1f4a9_554x338.png 1272w, https://substackcdn.com/image/fetch/$s_!Wpr1!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1443b02b-96e4-4fc8-a5dc-f93477f1f4a9_554x338.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!Wpr1!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1443b02b-96e4-4fc8-a5dc-f93477f1f4a9_554x338.png" width="554" height="338" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/1443b02b-96e4-4fc8-a5dc-f93477f1f4a9_554x338.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:338,&quot;width&quot;:554,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:67008,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://vutr.substack.com/i/156976428?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1443b02b-96e4-4fc8-a5dc-f93477f1f4a9_554x338.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!Wpr1!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1443b02b-96e4-4fc8-a5dc-f93477f1f4a9_554x338.png 424w, https://substackcdn.com/image/fetch/$s_!Wpr1!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1443b02b-96e4-4fc8-a5dc-f93477f1f4a9_554x338.png 848w, https://substackcdn.com/image/fetch/$s_!Wpr1!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1443b02b-96e4-4fc8-a5dc-f93477f1f4a9_554x338.png 1272w, https://substackcdn.com/image/fetch/$s_!Wpr1!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1443b02b-96e4-4fc8-a5dc-f93477f1f4a9_554x338.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 
0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Image created by the author.</figcaption></figure></div><p>Databricks decided to move away from the JVM and implement a native code execution engine. Integrating the new engine with the existing JVM-based runtime is challenging for Databricks. 
Here are several reasons that led Databricks to develop a new native execution engine:</p><ul><li><p>The Lakehouse paradigm demands processing a wide range of workloads that stress the JVM engine's in-memory performance.</p></li><li><p>Improving the JVM engine's performance requires deep knowledge of JVM internals.</p></li><li><p>Databricks found they lacked control over lower-level optimizations such as custom <a href="https://en.wikipedia.org/wiki/Single_instruction,_multiple_data">SIMD</a> kernels.</p></li><li><p>They also observed that garbage collection performance degraded on heaps larger than 64 GB. Databricks had to manually manage off-heap memory in the JVM-based engine, which made the codebase more complex.</p></li></ul><h3>Interpreted Vectorization vs. Code Generation</h3><p>Modern OLAP systems build high-performance engines predominantly using two approaches: an interpreted vectorized design inspired by the MonetDB/X100 system, or a code-generated design used in systems like&nbsp;<a href="https://spark.apache.org/docs/latest/sql-programming-guide.html">Spark SQL</a>&nbsp;or&nbsp;<a href="https://impala.apache.org/">Apache Impala</a>.</p><p>Vectorized engines use a dynamic dispatch mechanism like <a href="https://www.geeksforgeeks.org/virtual-function-cpp/">virtual function calls</a> to choose the code block to execute; they process data in batches, which amortizes the virtual function call overhead and lets the engine use SIMD. Code generation, on the other hand, uses a compiler at runtime to generate specific code for each query, so it doesn&#8217;t have to deal with virtual function call overhead at all. Databricks prototyped both approaches; here are their observations:</p><ul><li><p>Code generation is more complicated to build and debug because the approach generates the executing code at runtime; Databricks engineers had to add extra code manually to find issues. 
In contrast, the interpreted approach only deals with native C++ code; print debugging was much more manageable. As a result, their engineers only needed a couple of weeks to prototype the vectorized approach, while it took them two months with the code-generated approach.</p></li><li><p>Code generation removes interpretation and function call overheads by collapsing and inlining operators into a few functions. Despite the performance boost, this makes observability difficult. Operator collapsing prevents the engineers from observing metrics on how much time is spent in each operator, &#8220;given that the operator code may be fused into a row-at-a-time processing loop.&#8221; In contrast, the vectorized approach maintains clear boundaries between operators.</p></li><li><p>Photon can adapt to data properties by choosing a code path at runtime based on the input&#8217;s type. This is critical in the Lakehouse context because constraints and statistics may not be available for all queries.</p></li><li><p>Databricks found they can achieve code-generated specialization with vectorized engines by creating <a href="https://dl.acm.org/doi/10.14778/3151113.3151114">specialized fused operators</a> for the most common cases.</p></li></ul><p>For these reasons, Databricks chose the vectorized approach for the Photon engine.</p><blockquote><p><em>If you want to learn more about vectorization and code generation, here are the two resources you should check out:</em></p></blockquote><div id="youtube2-yU1S8gwjGEw" class="youtube-wrap" data-attrs="{&quot;videoId&quot;:&quot;yU1S8gwjGEw&quot;,&quot;startTime&quot;:null,&quot;endTime&quot;:null}" data-component-name="Youtube2ToDOM"><div class="youtube-inner"><iframe src="https://www.youtube-nocookie.com/embed/yU1S8gwjGEw?rel=0&amp;autoplay=0&amp;showinfo=0&amp;enablejsapi=0" frameborder="0" loading="lazy" gesture="media" allow="autoplay; fullscreen" allowautoplay="true" allowfullscreen="true" width="728" 
height="409"></iframe></div></div><div id="youtube2-UPQ53hM6AWE" class="youtube-wrap" data-attrs="{&quot;videoId&quot;:&quot;UPQ53hM6AWE&quot;,&quot;startTime&quot;:null,&quot;endTime&quot;:null}" data-component-name="Youtube2ToDOM"><div class="youtube-inner"><iframe src="https://www.youtube-nocookie.com/embed/UPQ53hM6AWE?rel=0&amp;autoplay=0&amp;showinfo=0&amp;enablejsapi=0" frameborder="0" loading="lazy" gesture="media" allow="autoplay; fullscreen" allowautoplay="true" allowfullscreen="true" width="728" height="409"></iframe></div></div><h3>Row vs. Column-Oriented Execution</h3><p>Traditionally, Spark SQL represents records in memory with a row-oriented format. Since the Lakehouse execution engine mainly deals with columnar files like Parquet, scanning data from disk to memory requires expensive column-to-row pivoting when using the Spark SQL engine.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!dgKh!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe6478278-2536-4ba8-818d-d40f34a2432c_1156x614.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!dgKh!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe6478278-2536-4ba8-818d-d40f34a2432c_1156x614.png 424w, https://substackcdn.com/image/fetch/$s_!dgKh!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe6478278-2536-4ba8-818d-d40f34a2432c_1156x614.png 848w, https://substackcdn.com/image/fetch/$s_!dgKh!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe6478278-2536-4ba8-818d-d40f34a2432c_1156x614.png 1272w, 
https://substackcdn.com/image/fetch/$s_!dgKh!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe6478278-2536-4ba8-818d-d40f34a2432c_1156x614.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!dgKh!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe6478278-2536-4ba8-818d-d40f34a2432c_1156x614.png" width="1156" height="614" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/e6478278-2536-4ba8-818d-d40f34a2432c_1156x614.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:614,&quot;width&quot;:1156,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:254020,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://vutr.substack.com/i/156976428?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe6478278-2536-4ba8-818d-d40f34a2432c_1156x614.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!dgKh!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe6478278-2536-4ba8-818d-d40f34a2432c_1156x614.png 424w, https://substackcdn.com/image/fetch/$s_!dgKh!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe6478278-2536-4ba8-818d-d40f34a2432c_1156x614.png 848w, https://substackcdn.com/image/fetch/$s_!dgKh!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe6478278-2536-4ba8-818d-d40f34a2432c_1156x614.png 
1272w, https://substackcdn.com/image/fetch/$s_!dgKh!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe6478278-2536-4ba8-818d-d40f34a2432c_1156x614.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Image created by the author.</figcaption></figure></div><p>In contrast, Photon adopts columnar in-memory data representation; the system stores values of a particular column contiguously in memory. 
This layout is more convenient for SIMD and enables more efficient data&nbsp;<a href="https://en.wikipedia.org/wiki/Pipeline_(computing)">pipelining</a>&nbsp;and&nbsp;<a href="https://en.wikipedia.org/wiki/Prefetching">pre-fetching</a>. Moreover, it lets the engine work efficiently with columnar data on disk, eliminating the column-to-row pivoting step and making it easier for the columnar engine to write data to disk.</p><div><hr></div><h2>Outro</h2><p>Based on my observation, many projects out there try to do what Databricks has done with Spark: make it more efficient as a query engine by implementing state-of-the-art techniques for OLAP systems while keeping it compatible with Spark.</p><ul><li><p><a href="https://datafusion.apache.org/comet/">Apache DataFusion Comet</a> uses Apache DataFusion as a runtime for Spark to improve query efficiency and query runtime.</p></li><li><p><a href="https://gluten.apache.org/">Apache Gluten (incubating)</a> is a middle layer that offloads JVM-based SQL engines&#8217; execution to native engines.</p></li></ul><p>Even in the community version, contributors actively work to make Spark more efficient as an OLAP query engine. One significant improvement is the introduction of Adaptive Query Execution (AQE), which allows query plans to be adjusted based on runtime statistics collected during execution.</p><p>Your turn: What&#8217;s your experience with Databricks&#8217; Spark? Do you think the open-source version of Spark will catch up with Databricks&#8217; version at any point in the future?</p><p>&#8212;</p><p>Thank you for reading this far. If you notice any logical gaps, please let me know.</p><p>It&#8217;s time to say goodbye; see you in my next article! 
;)</p><div><hr></div><h2>Reference</h2><p><em>[1] Databricks, <a href="https://people.eecs.berkeley.edu/~matei/papers/2022/sigmod_photon.pdf">Photon: A Fast Query Engine for Lakehouse Systems</a> (2022).</em></p><p>[2] <em>Michael Armbrust, Reynold S. Xin, Cheng Lian, Yin Huai, Davies Liu, Joseph K. Bradley, Xiangrui Meng, Tomer Kaftan, Michael J. Franklin, Ali Ghodsi, Matei Zaharia <a href="https://people.csail.mit.edu/matei/papers/2015/sigmod_spark_sql.pdf">Spark SQL: Relational Data Processing in Spark</a> (2015)</em></p><p>[3] Liz Elfman, <a href="https://www.bigeye.com/blog/a-brief-history-of-databricks">A brief history of Databricks</a> (2023)</p>]]></content:encoded></item><item><title><![CDATA[Why is dbt so popular?]]></title><description><![CDATA[The motivation behind dbt and why it's becoming a transformation standard(?)]]></description><link>https://vutr.substack.com/p/why-is-dbt-so-popular</link><guid isPermaLink="false">https://vutr.substack.com/p/why-is-dbt-so-popular</guid><dc:creator><![CDATA[Vu Trinh]]></dc:creator><pubDate>Thu, 27 Feb 2025 03:15:40 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!-fOd!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe4f4d8f5-99a4-4e56-85e0-eee14465a613_2000x1429.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<blockquote><p>To celebrate Lunar New Year (the true New Year holiday in Vietnam), I&#8217;m offering <em><strong>50% off the annual subscription</strong></em>. 
The offer ends soon; grab it now to get full access to nearly 200 high-quality data engineering articles.</p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://vutr.substack.com/subscribe&quot;,&quot;text&quot;:&quot;50% off the annual subscription&quot;,&quot;action&quot;:null,&quot;class&quot;:&quot;button-wrapper&quot;}" data-component-name="ButtonCreateButton"><a class="button primary button-wrapper" href="https://vutr.substack.com/subscribe"><span>50% off the annual subscription</span></a></p></blockquote><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!-fOd!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe4f4d8f5-99a4-4e56-85e0-eee14465a613_2000x1429.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!-fOd!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe4f4d8f5-99a4-4e56-85e0-eee14465a613_2000x1429.png 424w, https://substackcdn.com/image/fetch/$s_!-fOd!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe4f4d8f5-99a4-4e56-85e0-eee14465a613_2000x1429.png 848w, https://substackcdn.com/image/fetch/$s_!-fOd!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe4f4d8f5-99a4-4e56-85e0-eee14465a613_2000x1429.png 1272w, https://substackcdn.com/image/fetch/$s_!-fOd!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe4f4d8f5-99a4-4e56-85e0-eee14465a613_2000x1429.png 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!-fOd!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe4f4d8f5-99a4-4e56-85e0-eee14465a613_2000x1429.png" width="1456" height="1040" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/e4f4d8f5-99a4-4e56-85e0-eee14465a613_2000x1429.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1040,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:2556888,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!-fOd!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe4f4d8f5-99a4-4e56-85e0-eee14465a613_2000x1429.png 424w, https://substackcdn.com/image/fetch/$s_!-fOd!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe4f4d8f5-99a4-4e56-85e0-eee14465a613_2000x1429.png 848w, https://substackcdn.com/image/fetch/$s_!-fOd!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe4f4d8f5-99a4-4e56-85e0-eee14465a613_2000x1429.png 1272w, https://substackcdn.com/image/fetch/$s_!-fOd!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe4f4d8f5-99a4-4e56-85e0-eee14465a613_2000x1429.png 1456w" sizes="100vw" fetchpriority="high"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" 
class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Image created by the author.</figcaption></figure></div><div><hr></div><h2>Intro</h2><p>In 2021, my company reorganized the data team.</p><p>The data engineering team got a new leader. </p><p>After the initial greetings, he announced, "I&#8217;m bringing dbt into our stack. Do you know dbt? It&#8217;s popular right now."</p><p>I shook my head. Most of my teammates did, too.</p><p>Despite our attentive expressions, my first thought was: <em>Here we go again&#8212;another leader introducing a trendy tool to prove a point.</em></p><p>Fast forward four years, and dbt has proven it&#8217;s far more than just hype. 
<a href="https://www.getdbt.com/product/what-is-dbt">It&#8217;s becoming a standard(?)</a></p><p>If you work in data, you&#8217;ve probably heard of dbt at least once, or you may even have used it yourself.</p><p>This week's article explores dbt: what it is, why people created it, and why it has been so widely adopted.</p><blockquote><p><em>This is not a dbt tutorial article, and all the research is solely driven by my curiosity.</em></p></blockquote><div><hr></div><h2>What is dbt</h2><p>dbt is a CLI tool that lets us efficiently transform data with SQL.</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!fLrG!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F387c295c-7d92-4013-8e9c-5479715bec03_542x190.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!fLrG!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F387c295c-7d92-4013-8e9c-5479715bec03_542x190.png 424w, https://substackcdn.com/image/fetch/$s_!fLrG!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F387c295c-7d92-4013-8e9c-5479715bec03_542x190.png 848w, https://substackcdn.com/image/fetch/$s_!fLrG!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F387c295c-7d92-4013-8e9c-5479715bec03_542x190.png 1272w, https://substackcdn.com/image/fetch/$s_!fLrG!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F387c295c-7d92-4013-8e9c-5479715bec03_542x190.png 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!fLrG!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F387c295c-7d92-4013-8e9c-5479715bec03_542x190.png" width="542" height="190" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/387c295c-7d92-4013-8e9c-5479715bec03_542x190.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:190,&quot;width&quot;:542,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:23104,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!fLrG!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F387c295c-7d92-4013-8e9c-5479715bec03_542x190.png 424w, https://substackcdn.com/image/fetch/$s_!fLrG!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F387c295c-7d92-4013-8e9c-5479715bec03_542x190.png 848w, https://substackcdn.com/image/fetch/$s_!fLrG!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F387c295c-7d92-4013-8e9c-5479715bec03_542x190.png 1272w, https://substackcdn.com/image/fetch/$s_!fLrG!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F387c295c-7d92-4013-8e9c-5479715bec03_542x190.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a><figcaption class="image-caption">Image created by the author.</figcaption></figure></div><p>That&#8217;s it. 
</p><p>It&#8217;s not an engine like Spark; it&#8217;s not a database like Postgres or Snowflake; it&#8217;s a tool that helps you manage your SQL data transformations.</p><p>At the heart of dbt is the concept of a model. A model is an SQL query saved in a <code>.sql</code> file. Each model defines a transformation that turns data into a desired output inside your data warehouse. When dbt runs, it executes these queries and materializes the transformed data as a table or view. Models give us a tangible form of the SQL transformation logic.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!V59Q!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F09df9f99-56a4-4e63-beb5-8da710d0be82_556x264.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!V59Q!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F09df9f99-56a4-4e63-beb5-8da710d0be82_556x264.png 424w, https://substackcdn.com/image/fetch/$s_!V59Q!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F09df9f99-56a4-4e63-beb5-8da710d0be82_556x264.png 848w, https://substackcdn.com/image/fetch/$s_!V59Q!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F09df9f99-56a4-4e63-beb5-8da710d0be82_556x264.png 1272w, https://substackcdn.com/image/fetch/$s_!V59Q!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F09df9f99-56a4-4e63-beb5-8da710d0be82_556x264.png 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!V59Q!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F09df9f99-56a4-4e63-beb5-8da710d0be82_556x264.png" width="556" height="264" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/09df9f99-56a4-4e63-beb5-8da710d0be82_556x264.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:264,&quot;width&quot;:556,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:32060,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!V59Q!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F09df9f99-56a4-4e63-beb5-8da710d0be82_556x264.png 424w, https://substackcdn.com/image/fetch/$s_!V59Q!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F09df9f99-56a4-4e63-beb5-8da710d0be82_556x264.png 848w, https://substackcdn.com/image/fetch/$s_!V59Q!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F09df9f99-56a4-4e63-beb5-8da710d0be82_556x264.png 1272w, https://substackcdn.com/image/fetch/$s_!V59Q!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F09df9f99-56a4-4e63-beb5-8da710d0be82_556x264.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset 
pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Image created by the author.</figcaption></figure></div><p>dbt has two components: a <strong>compiler</strong> and a <strong>runner</strong>. We write dbt models and run a few commands in the terminal. dbt then compiles each model&#8217;s code into SQL statements and executes them on a data warehouse such as Snowflake, BigQuery, or Databricks, or on an engine like Spark or Trino. dbt doesn&#8217;t load your data or even see its contents (apart from the schema and some metadata); everything stays inside your warehouse.</p><p>A model&#8217;s code is not pure SQL; it combines SQL and Jinja. 
</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!oG1k!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F31558400-e5a1-4e39-9e8b-acb2ca51dec6_338x248.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!oG1k!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F31558400-e5a1-4e39-9e8b-acb2ca51dec6_338x248.png 424w, https://substackcdn.com/image/fetch/$s_!oG1k!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F31558400-e5a1-4e39-9e8b-acb2ca51dec6_338x248.png 848w, https://substackcdn.com/image/fetch/$s_!oG1k!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F31558400-e5a1-4e39-9e8b-acb2ca51dec6_338x248.png 1272w, https://substackcdn.com/image/fetch/$s_!oG1k!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F31558400-e5a1-4e39-9e8b-acb2ca51dec6_338x248.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!oG1k!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F31558400-e5a1-4e39-9e8b-acb2ca51dec6_338x248.png" width="338" height="248" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/31558400-e5a1-4e39-9e8b-acb2ca51dec6_338x248.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:248,&quot;width&quot;:338,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:26866,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!oG1k!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F31558400-e5a1-4e39-9e8b-acb2ca51dec6_338x248.png 424w, https://substackcdn.com/image/fetch/$s_!oG1k!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F31558400-e5a1-4e39-9e8b-acb2ca51dec6_338x248.png 848w, https://substackcdn.com/image/fetch/$s_!oG1k!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F31558400-e5a1-4e39-9e8b-acb2ca51dec6_338x248.png 1272w, https://substackcdn.com/image/fetch/$s_!oG1k!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F31558400-e5a1-4e39-9e8b-acb2ca51dec6_338x248.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" 
xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Image created by the author.</figcaption></figure></div><p>dbt&#8217;s Jinja provides two special functions, <em>source()</em> and <em>ref()</em>: the former lets us reference a physical table in the data warehouse, and the latter lets us reference other dbt models. 
Together, these functions let dbt build a complete data transformation lineage: the leftmost model points to the physical table (using <em>source</em>), and each subsequent model uses <em>ref</em> to refer to the models before it.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!ZUEW!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F17f5b8e4-1da2-466d-8f80-8a167d1300a5_1156x432.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!ZUEW!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F17f5b8e4-1da2-466d-8f80-8a167d1300a5_1156x432.png 424w, https://substackcdn.com/image/fetch/$s_!ZUEW!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F17f5b8e4-1da2-466d-8f80-8a167d1300a5_1156x432.png 848w, https://substackcdn.com/image/fetch/$s_!ZUEW!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F17f5b8e4-1da2-466d-8f80-8a167d1300a5_1156x432.png 1272w, https://substackcdn.com/image/fetch/$s_!ZUEW!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F17f5b8e4-1da2-466d-8f80-8a167d1300a5_1156x432.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!ZUEW!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F17f5b8e4-1da2-466d-8f80-8a167d1300a5_1156x432.png" width="1156" height="432" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/17f5b8e4-1da2-466d-8f80-8a167d1300a5_1156x432.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:432,&quot;width&quot;:1156,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:102544,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!ZUEW!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F17f5b8e4-1da2-466d-8f80-8a167d1300a5_1156x432.png 424w, https://substackcdn.com/image/fetch/$s_!ZUEW!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F17f5b8e4-1da2-466d-8f80-8a167d1300a5_1156x432.png 848w, https://substackcdn.com/image/fetch/$s_!ZUEW!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F17f5b8e4-1da2-466d-8f80-8a167d1300a5_1156x432.png 1272w, https://substackcdn.com/image/fetch/$s_!ZUEW!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F17f5b8e4-1da2-466d-8f80-8a167d1300a5_1156x432.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" 
xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Image created by the author.</figcaption></figure></div><p><a href="https://docs.getdbt.com/docs/build/jinja-macros">dbt also lets us write modular and reusable model structures</a>, which allow teams to break down transformations into smaller, maintainable components.</p><p>Before a production run, you can test your dbt models to ensure they produce the expected results. dbt also auto-generates documentation, providing a clear overview of your data transformations and lineage.</p><p>A dbt model is purely code at its core, making it naturally compatible with Git for version control. 
Teams can track changes, collaborate via pull requests, roll back to previous versions, and implement CI/CD pipelines&#8212;just like software engineers do with application code.</p><div><hr></div><h2>Why did people create it?</h2><p>The creators of dbt encourage data analysts (DA) to take more responsibility for managing data transformations by adopting software engineering best practices.</p><p>In 2016, while at RJMetrics, Tristan Handy developed dbt to address the challenges of complex data transformation pipelines. It enabled analysts to write modular SQL code, implement version control, and conduct testing, thereby enhancing the efficiency and reliability of data workflows.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!7mJm!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F94f51c0d-4729-4f13-9858-a1bae2328aa8_546x302.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!7mJm!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F94f51c0d-4729-4f13-9858-a1bae2328aa8_546x302.png 424w, https://substackcdn.com/image/fetch/$s_!7mJm!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F94f51c0d-4729-4f13-9858-a1bae2328aa8_546x302.png 848w, https://substackcdn.com/image/fetch/$s_!7mJm!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F94f51c0d-4729-4f13-9858-a1bae2328aa8_546x302.png 1272w, 
https://substackcdn.com/image/fetch/$s_!7mJm!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F94f51c0d-4729-4f13-9858-a1bae2328aa8_546x302.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!7mJm!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F94f51c0d-4729-4f13-9858-a1bae2328aa8_546x302.png" width="546" height="302" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/94f51c0d-4729-4f13-9858-a1bae2328aa8_546x302.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:302,&quot;width&quot;:546,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:63580,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!7mJm!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F94f51c0d-4729-4f13-9858-a1bae2328aa8_546x302.png 424w, https://substackcdn.com/image/fetch/$s_!7mJm!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F94f51c0d-4729-4f13-9858-a1bae2328aa8_546x302.png 848w, https://substackcdn.com/image/fetch/$s_!7mJm!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F94f51c0d-4729-4f13-9858-a1bae2328aa8_546x302.png 1272w, 
https://substackcdn.com/image/fetch/$s_!7mJm!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F94f51c0d-4729-4f13-9858-a1bae2328aa8_546x302.png 1456w" sizes="100vw" loading="lazy"></picture></div></a><figcaption class="image-caption">Image created by the author.</figcaption></figure></div><p>But what motivates data analysts (DAs) to become more involved in data transformation, which was previously the primary responsibility of data engineers (DEs)?</p><p>Setting up a pipeline to move data from A to B is no trivial task. 
Data engineers must understand the data sources and the expected output format and manage the underlying infrastructure. This includes scaling Spark clusters, maintaining Airflow environments, and optimizing transformation logic&#8212;all while ensuring data quality across potentially hundreds of pipelines. The complexity and effort required to keep these systems running efficiently can quickly become overwhelming.</p><p>A team of highly skilled data engineers is often required for large-scale systems. However, not every company has the resources to build such a team. It&#8217;s common to find organizations&#8212;especially medium-sized businesses and startups&#8212;operating with one or two data engineers.</p><p>A data engineer can manage two or three pipelines well, but what happens when that number grows to fifty? The workflow begins to slow down because pipelines need time to develop, test, and deploy. The ability to maintain data quality, implement necessary changes, and deliver timely insights starts to deteriorate. 
The data engineer becomes a bottleneck, and the data team struggles to keep up with business demands.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!EXB7!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb30ca4c7-2879-459e-a484-2ebe3b3204e9_366x358.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!EXB7!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb30ca4c7-2879-459e-a484-2ebe3b3204e9_366x358.png 424w, https://substackcdn.com/image/fetch/$s_!EXB7!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb30ca4c7-2879-459e-a484-2ebe3b3204e9_366x358.png 848w, https://substackcdn.com/image/fetch/$s_!EXB7!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb30ca4c7-2879-459e-a484-2ebe3b3204e9_366x358.png 1272w, https://substackcdn.com/image/fetch/$s_!EXB7!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb30ca4c7-2879-459e-a484-2ebe3b3204e9_366x358.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!EXB7!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb30ca4c7-2879-459e-a484-2ebe3b3204e9_366x358.png" width="366" height="358" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/b30ca4c7-2879-459e-a484-2ebe3b3204e9_366x358.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:358,&quot;width&quot;:366,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:51890,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!EXB7!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb30ca4c7-2879-459e-a484-2ebe3b3204e9_366x358.png 424w, https://substackcdn.com/image/fetch/$s_!EXB7!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb30ca4c7-2879-459e-a484-2ebe3b3204e9_366x358.png 848w, https://substackcdn.com/image/fetch/$s_!EXB7!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb30ca4c7-2879-459e-a484-2ebe3b3204e9_366x358.png 1272w, https://substackcdn.com/image/fetch/$s_!EXB7!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb30ca4c7-2879-459e-a484-2ebe3b3204e9_366x358.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" 
xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Image created by the author.</figcaption></figure></div><p></p><p>The ultimate goal is to make data more meaningful to the organization. A better way to manage the raw-to-usable-data process is to democratize the data transformation instead of requiring a small group of people (data engineers) to know how to do it.</p><p>Imagine an alternative scenario: What if data analysts could take a more active role in data transformation? Instead of waiting for a data engineer to implement every transformation they need, analysts could self-serve, define, and build transformations independently. 
Since data analysts deeply understand the business domain, they could ensure that the final datasets align with business needs from the start rather than relying on multiple handoffs.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!JMCX!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F61683589-e308-4b60-858c-54b40371b272_380x390.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!JMCX!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F61683589-e308-4b60-858c-54b40371b272_380x390.png 424w, https://substackcdn.com/image/fetch/$s_!JMCX!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F61683589-e308-4b60-858c-54b40371b272_380x390.png 848w, https://substackcdn.com/image/fetch/$s_!JMCX!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F61683589-e308-4b60-858c-54b40371b272_380x390.png 1272w, https://substackcdn.com/image/fetch/$s_!JMCX!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F61683589-e308-4b60-858c-54b40371b272_380x390.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!JMCX!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F61683589-e308-4b60-858c-54b40371b272_380x390.png" width="380" height="390" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/61683589-e308-4b60-858c-54b40371b272_380x390.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:390,&quot;width&quot;:380,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:63703,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!JMCX!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F61683589-e308-4b60-858c-54b40371b272_380x390.png 424w, https://substackcdn.com/image/fetch/$s_!JMCX!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F61683589-e308-4b60-858c-54b40371b272_380x390.png 848w, https://substackcdn.com/image/fetch/$s_!JMCX!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F61683589-e308-4b60-858c-54b40371b272_380x390.png 1272w, https://substackcdn.com/image/fetch/$s_!JMCX!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F61683589-e308-4b60-858c-54b40371b272_380x390.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" 
xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Image created by the author.</figcaption></figure></div><div><hr></div><h2>Why is it so popular?</h2><p>2016: Launched with 3 companies using dbt.</p><p>2017: Adoption grew to 100 companies.</p><p>2018: Expanded to 280 companies.</p><p>2019: Still 280 companies, steady growth.</p><p>2021: Surpassed 5,000 companies.</p><p>2022: Exceeded 9,000 companies.</p><p><a href="https://www.getdbt.com/case-studies/mcdonalds-nordics">McDonald's</a>,&nbsp;<a href="https://www.getdbt.com/case-studies/nasdaq">Nasdaq</a>,&nbsp;<a href="https://discord.com/blog/how-discord-uses-open-source-tools-for-scalable-data-orchestration-transformation">Discord</a>,&nbsp;<a href="https://www.dataengineeringpodcast.com/episodepage/shopify-data-warehouse-with-dbt-episode-171">Shopify,</a>&nbsp;and many other companies use it. 
If your company uses SQL for data transformation, there is a good chance that dbt is already part of its tech stack.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!UMTv!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fda7e8ac3-aba3-4eb0-a5b9-33dcfd1a4954_3152x2274.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!UMTv!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fda7e8ac3-aba3-4eb0-a5b9-33dcfd1a4954_3152x2274.png 424w, https://substackcdn.com/image/fetch/$s_!UMTv!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fda7e8ac3-aba3-4eb0-a5b9-33dcfd1a4954_3152x2274.png 848w, https://substackcdn.com/image/fetch/$s_!UMTv!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fda7e8ac3-aba3-4eb0-a5b9-33dcfd1a4954_3152x2274.png 1272w, https://substackcdn.com/image/fetch/$s_!UMTv!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fda7e8ac3-aba3-4eb0-a5b9-33dcfd1a4954_3152x2274.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!UMTv!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fda7e8ac3-aba3-4eb0-a5b9-33dcfd1a4954_3152x2274.png" width="1456" height="1050" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/da7e8ac3-aba3-4eb0-a5b9-33dcfd1a4954_3152x2274.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1050,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:316927,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!UMTv!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fda7e8ac3-aba3-4eb0-a5b9-33dcfd1a4954_3152x2274.png 424w, https://substackcdn.com/image/fetch/$s_!UMTv!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fda7e8ac3-aba3-4eb0-a5b9-33dcfd1a4954_3152x2274.png 848w, https://substackcdn.com/image/fetch/$s_!UMTv!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fda7e8ac3-aba3-4eb0-a5b9-33dcfd1a4954_3152x2274.png 1272w, https://substackcdn.com/image/fetch/$s_!UMTv!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fda7e8ac3-aba3-4eb0-a5b9-33dcfd1a4954_3152x2274.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" 
xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">dbt Core&#8217;s GitHub star history. Generated from star-history.com</figcaption></figure></div><p>So why is it so popular?</p><p>The obvious answer is that dbt found product-market fit: its creators identified real problems and solved them. It turns out that many organizations wanted to solve the same problems and were looking for a &#8220;right&#8221; way to manage, democratize, and collaborate on data transformation using SQL.</p><p>However, solving a problem and solving it efficiently are two different things. I believe the way dbt solves problems is crucial to its success. 
To explain this aspect, I borrow criteria from the <a href="https://en.wikipedia.org/wiki/Unified_theory_of_acceptance_and_use_of_technology">Unified Theory of Acceptance and Use of Technology (UTAUT), a model that aims to explain user intentions to use an information system and subsequent usage behavior</a>:</p><ul><li><p><strong>Performance Expectancy</strong>: dbt enables data analysts and engineers to transform data within their warehouses effectively. It provides a framework for unifying how they write, test, and document SQL transformation logic.</p></li><li><p><strong>Effort Expectancy</strong>: Using dbt does not require much effort. If you&#8217;re familiar with SQL (as it happens, DEs, DAs, and DSs all communicate via SQL), about 30 minutes of learning dbt&#8217;s Jinja templating is enough to get you ready to build your first dbt model. Furthermore, it is easy to install with pip, and because of its simple nature, containerizing a whole dbt project is straightforward: running it with Airflow on Kubernetes or implementing CI/CD with a GitLab runner are both convenient.</p></li><li><p><strong>Social Influence</strong>: The growing adoption of dbt within the data community and endorsements from reputable organizations contribute to its perceived importance and encourage others to adopt it.</p></li><li><p><strong>Facilitating Conditions</strong>: The prerequisites for running dbt? An IDE, a data warehouse that can run SQL, and the willpower to write SQL transformations. That&#8217;s it. You don&#8217;t need to prepare dedicated hardware, plan storage capacity, or estimate CPU and RAM for it. The documentation and community support are sufficient to raise a company&#8217;s standards for writing, testing, and documenting transformations to the next level (in most cases). Like most tools, dbt started with a limited set of integration points; over time, as more and more organizations adopted it, more integration options appeared. 
Its interoperability with existing systems lowers the barrier to adoption.</p></li></ul><p>Beyond the points above, I believe an essential factor behind dbt&#8217;s wide adoption is the emergence of the cloud data warehouse.</p><p>In the past, data was transformed before it was loaded into the warehouse. Raw data never reached the destination; only clean, structured data did.</p><p>Data warehouse systems were expensive, and companies had to purchase servers and licenses from vendors. Storage disks were also expensive, and networks weren&#8217;t as fast as they are today. Compute and storage were tightly coupled, and scaling the system was difficult.</p><p>Additionally, storing data in a columnar format wasn&#8217;t common then, and row-oriented databases didn&#8217;t perform well for analytics workloads.</p><p>All of these factors made ETL (extract, transform, load) the natural solution. Data had to be carefully extracted and transformed so that only a relatively small, curated subset was loaded into the warehouse for analysis.</p><p>But things have changed.</p><p>The rise of cloud data warehouses has made warehousing much more accessible. Pay-as-you-go pricing models, cheaper storage, faster networks, and columnar storage and processing as the standard have commoditized high-performance, cost-efficient data warehouses.</p><p>A shiny new warehouse can be up and running with just a few clicks.</p><p>People soon realized they didn&#8217;t have to transform the data before loading it into the warehouse. They could dump data straight from the source (perhaps with some lightweight processing) and let the transformation happen directly in the warehouse later, a pattern commonly called ELT (extract, load, transform).</p><p>So, if my data is already in the warehouse, why not just use SQL to transform it? 
All cloud data warehouse query engines let you execute queries on enormous amounts of historical data with state-of-the-art techniques applied to both storage and processing, and these systems only get better.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!u9K2!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6ebba336-8bdc-4145-8727-47704ce69c86_922x350.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!u9K2!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6ebba336-8bdc-4145-8727-47704ce69c86_922x350.png 424w, https://substackcdn.com/image/fetch/$s_!u9K2!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6ebba336-8bdc-4145-8727-47704ce69c86_922x350.png 848w, https://substackcdn.com/image/fetch/$s_!u9K2!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6ebba336-8bdc-4145-8727-47704ce69c86_922x350.png 1272w, https://substackcdn.com/image/fetch/$s_!u9K2!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6ebba336-8bdc-4145-8727-47704ce69c86_922x350.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!u9K2!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6ebba336-8bdc-4145-8727-47704ce69c86_922x350.png" width="922" height="350" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/6ebba336-8bdc-4145-8727-47704ce69c86_922x350.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:350,&quot;width&quot;:922,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:63238,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!u9K2!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6ebba336-8bdc-4145-8727-47704ce69c86_922x350.png 424w, https://substackcdn.com/image/fetch/$s_!u9K2!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6ebba336-8bdc-4145-8727-47704ce69c86_922x350.png 848w, https://substackcdn.com/image/fetch/$s_!u9K2!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6ebba336-8bdc-4145-8727-47704ce69c86_922x350.png 1272w, https://substackcdn.com/image/fetch/$s_!u9K2!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6ebba336-8bdc-4145-8727-47704ce69c86_922x350.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" 
xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Data can now be transformed directly in the warehouse using SQL&#8212; image created by the author.</figcaption></figure></div><p>So, again, why invest in a separate, expensive system for data transformation when we've already spent significant resources on a platform that enables us to process massive amounts of data using SQL?</p><p>And guess which tool helps you streamline your SQL transformation scripts?</p><div><hr></div><h2>My thoughts</h2><p>Firstly, dbt encourages DAs to be more involved in data transformation, which does not mean DAs will completely replace the DEs in this process. My opinion on dbt is it allows DEs and DAs to collaborate efficiently. 
The DAs can contribute business domain knowledge, and the DEs can contribute expertise in optimizing SQL queries for the underlying engine and in other engineering practices, such as organizational standards for writing modular, reusable dbt macros.</p><p>Secondly, giving a wide range of users the ability to write SQL transformations does not mean transformations can be done arbitrarily. Typically, data transformation must serve the organization&#8217;s data model. dbt is a tool that helps us manage SQL transformations; whether a transformation is meaningful depends on us, and how data is transformed, organized, and served depends on how we model it. If you just dump data into the warehouse without modeling it, adopting dbt is pointless. Many people also think that writing dbt models is the same as doing data modeling. It is not. A data model defines how data is structured and related, ensuring consistency; it&#8217;s tool-agnostic. A dbt model is a SQL-based transformation script that shapes raw data into a structured format inside the data warehouse.</p><p>&#8212;</p><p>All of the above is my research and thinking on the popularity of dbt. We explored what dbt is, why it&#8217;s so popular, and some of my thoughts on it.</p><p>I independently consolidate, analyze, and present this information. 
Please let me know if you spot any logical gaps.</p><p>Also, your feedback on what works well and what can be improved is invaluable in helping me create higher-quality content.</p><p>Thank you for reading this far.</p><p>Now, it&#8217;s your turn: Are you a fan of dbt?</p><div><hr></div><h2>Reference</h2><p><em>[1] Tristan Handy, <a href="https://www.getdbt.com/blog/what-exactly-is-dbt">What, exactly, is dbt?</a> (2017)</em></p><p><em>[2] Connor McArthur, <a href="https://www.youtube.com/watch?v=qqlbYDfqeI4">DBT: Powerful, Open Source Data Transformations | Fishtown Analytics / DBT</a> (2017)</em></p><p><em>[3] Wikipedia, <a href="https://en.wikipedia.org/wiki/Unified_theory_of_acceptance_and_use_of_technology">Unified theory of acceptance and use of technology</a></em></p>]]></content:encoded></item><item><title><![CDATA[Why Walmart Chose Apache Hudi for Their Lakehouse]]></title><description><![CDATA[What can we learn.]]></description><link>https://vutr.substack.com/p/why-walmart-chose-apache-hudi-for</link><guid isPermaLink="false">https://vutr.substack.com/p/why-walmart-chose-apache-hudi-for</guid><dc:creator><![CDATA[Vu Trinh]]></dc:creator><pubDate>Thu, 20 Feb 2025 03:15:43 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!aKW9!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6f5d0b14-95f8-4110-beed-45144e0b61b3_2000x1429.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!aKW9!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6f5d0b14-95f8-4110-beed-45144e0b61b3_2000x1429.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!aKW9!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6f5d0b14-95f8-4110-beed-45144e0b61b3_2000x1429.png 424w, https://substackcdn.com/image/fetch/$s_!aKW9!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6f5d0b14-95f8-4110-beed-45144e0b61b3_2000x1429.png 848w, https://substackcdn.com/image/fetch/$s_!aKW9!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6f5d0b14-95f8-4110-beed-45144e0b61b3_2000x1429.png 1272w, https://substackcdn.com/image/fetch/$s_!aKW9!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6f5d0b14-95f8-4110-beed-45144e0b61b3_2000x1429.png 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!aKW9!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6f5d0b14-95f8-4110-beed-45144e0b61b3_2000x1429.png" width="1456" height="1040" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/6f5d0b14-95f8-4110-beed-45144e0b61b3_2000x1429.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1040,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:225956,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!aKW9!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6f5d0b14-95f8-4110-beed-45144e0b61b3_2000x1429.png 424w, https://substackcdn.com/image/fetch/$s_!aKW9!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6f5d0b14-95f8-4110-beed-45144e0b61b3_2000x1429.png 848w, https://substackcdn.com/image/fetch/$s_!aKW9!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6f5d0b14-95f8-4110-beed-45144e0b61b3_2000x1429.png 1272w, https://substackcdn.com/image/fetch/$s_!aKW9!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6f5d0b14-95f8-4110-beed-45144e0b61b3_2000x1429.png 1456w" sizes="100vw" fetchpriority="high"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" 
class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Image created by the author.</figcaption></figure></div><div><hr></div><h2>Intro</h2><p>Apache Hudi often flies under the radar compared to Delta Lake and Iceberg. While both of these formats are popular in modern data lakes, Hudi has a unique design that prioritizes incremental and real-time processing. This makes it particularly valuable for organizations with constantly changing data. However, it doesn't get as much attention in discussions about modern data architectures.</p><p>Curious about its adoption, I scoured the internet for real-world implementations of Hudi. That&#8217;s when I came across Walmart&#8217;s case study. 
</p><p>Walmart, one of the largest retailers globally, decided to use Hudi for its lakehouse transformation. Their journey provides valuable insights into how large enterprises select, implement, and optimize data formats for real-time data processing.</p><p>In this issue, we&#8217;ll explore Walmart&#8217;s decision to use Hudi, their challenges, and the lessons we can learn from their experience. By the end, you&#8217;ll gain practical takeaways to help you make informed decisions when selecting a data format for your lakehouse.</p><div><hr></div><h2>About Walmart</h2><p>Before diving into the technical details, let&#8217;s get a sense of Walmart&#8217;s scale:</p><ul><li><p>10,000+ stores worldwide</p></li><li><p>Millions of transactions per hour</p></li><li><p>600K+ compute cores across Hadoop and Spark clusters</p></li></ul><p>The company needed a solution to keep up with their evolving data needs.</p><div><hr></div><h2>Evolving to a Near Real-Time Lakehouse</h2><p>Walmart wanted to transition from a <strong>batch-oriented data lake</strong> to a <strong>modern lakehouse</strong> that supports near real-time data processing. This transformation would allow them to:</p><ul><li><p>Make faster decisions with fresher data</p></li><li><p>Improve operational efficiency</p></li><li><p>Enable real-time analytics</p></li></ul><p>Additionally, they needed to maintain <strong>complete control</strong> over their infrastructure while operating across <strong>multiple cloud providers</strong>. 
They wanted an open-source format that would prevent vendor lock-in and allow them to optimize performance across their diverse tech stack.</p><p>The main issue was that Walmart&#8217;s existing batch-oriented system could not support the requirements below; without addressing them, the company risked falling behind in real-time analytics and operational intelligence:</p><ul><li><p>Low-latency updates for operational and analytical queries</p></li><li><p>Efficient handling of late-arriving data</p></li><li><p>Optimized ingestion performance across multiple workloads</p></li></ul><p>We live in an era of unprecedented data generation, making real-time insights and decision-making more critical than ever. While not every organization operates at Walmart's scale, many face similar challenges when transitioning from batch processing to real-time data architectures. Companies exploring streaming and incremental processing technologies can draw valuable lessons from Walmart&#8217;s approach.</p><p>Let's dive into Walmart's detailed approach.</p><div><hr></div><h2>How did Walmart choose the table format?</h2><p>They spent a lot of time evaluating and benchmarking Delta Lake, Iceberg, and Hudi to select the table format that best fit their needs. The format had to help them evolve from batch to real-time analytics and provide complete control without vendor lock-in.</p><p>For the benchmark, Walmart abstracted its two most common workloads:</p><ul><li><p>The batch workload deals with partitioned tables (by year, month, day, or hour). It suffers from late-arriving records, which force Spark to read and rewrite many past partitions (e.g., data arriving one week late causes Spark to update a partition from one week ago). The workload characteristics are &lt; 0.1% updates and &gt; 99.9% inserts.</p></li><li><p>The streaming workload deals with low-latency, row-level upserts. A multi-TB Cassandra table produces the updated data via change data capture. 
The workload characteristics are &gt; 99.999% Updates and &lt; 0.001% inserts.</p></li></ul><p>They started by addressing the ingestion aspect of this workload. To prepare for the benchmarking, Walmart isolated the three separate environments. Then, they deployed ingestion jobs (Delta, Hudi, Iceberg, Legacy) in these environments, giving them time to reach a steady state.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!LzFq!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0b49db29-ec7a-44fd-84ce-7b8fb85aa0e0_591x343.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!LzFq!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0b49db29-ec7a-44fd-84ce-7b8fb85aa0e0_591x343.png 424w, https://substackcdn.com/image/fetch/$s_!LzFq!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0b49db29-ec7a-44fd-84ce-7b8fb85aa0e0_591x343.png 848w, https://substackcdn.com/image/fetch/$s_!LzFq!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0b49db29-ec7a-44fd-84ce-7b8fb85aa0e0_591x343.png 1272w, https://substackcdn.com/image/fetch/$s_!LzFq!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0b49db29-ec7a-44fd-84ce-7b8fb85aa0e0_591x343.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!LzFq!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0b49db29-ec7a-44fd-84ce-7b8fb85aa0e0_591x343.png" width="591" height="343" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/0b49db29-ec7a-44fd-84ce-7b8fb85aa0e0_591x343.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:343,&quot;width&quot;:591,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!LzFq!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0b49db29-ec7a-44fd-84ce-7b8fb85aa0e0_591x343.png 424w, https://substackcdn.com/image/fetch/$s_!LzFq!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0b49db29-ec7a-44fd-84ce-7b8fb85aa0e0_591x343.png 848w, https://substackcdn.com/image/fetch/$s_!LzFq!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0b49db29-ec7a-44fd-84ce-7b8fb85aa0e0_591x343.png 1272w, https://substackcdn.com/image/fetch/$s_!LzFq!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0b49db29-ec7a-44fd-84ce-7b8fb85aa0e0_591x343.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path 
d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Ingestion Benchmark Scores (GB-ingested * Time [min]) / Cores. Source: <a href="https://medium.com/walmartglobaltech/lakehouse-at-fortune-1-scale-480bcb10391b">Lakehouse at Fortune 1 Scale (2023)</a></figcaption></figure></div><p>The Hudi + Spark 3.x combination was the most performant for the batch workload, more than five times faster than the legacy systems.</p><p>Delta Lake was 27% faster than Hudi for the streaming workload. However, Hudi&#8217;s compaction process was faster because it did not perform the ZOrdering optimization that the Delta pipeline included. That Delta optimization pays off later by significantly improving query performance.</p><p>When it comes to query performance, Walmart leverages <a href="https://medium.com/walmartglobaltech/lakehouse-at-fortune-1-scale-480bcb10391b">TPC-H</a> for benchmarking. 
They use Queries 1 to 7 for the batch workload and Queries 1 to 10 for the streaming workload.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!tG-x!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F172bf639-43a7-4540-9644-680ffc24baba_555x277.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!tG-x!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F172bf639-43a7-4540-9644-680ffc24baba_555x277.png 424w, https://substackcdn.com/image/fetch/$s_!tG-x!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F172bf639-43a7-4540-9644-680ffc24baba_555x277.png 848w, https://substackcdn.com/image/fetch/$s_!tG-x!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F172bf639-43a7-4540-9644-680ffc24baba_555x277.png 1272w, https://substackcdn.com/image/fetch/$s_!tG-x!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F172bf639-43a7-4540-9644-680ffc24baba_555x277.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!tG-x!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F172bf639-43a7-4540-9644-680ffc24baba_555x277.png" width="555" height="277" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/172bf639-43a7-4540-9644-680ffc24baba_555x277.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:277,&quot;width&quot;:555,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!tG-x!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F172bf639-43a7-4540-9644-680ffc24baba_555x277.png 424w, https://substackcdn.com/image/fetch/$s_!tG-x!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F172bf639-43a7-4540-9644-680ffc24baba_555x277.png 848w, https://substackcdn.com/image/fetch/$s_!tG-x!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F172bf639-43a7-4540-9644-680ffc24baba_555x277.png 1272w, https://substackcdn.com/image/fetch/$s_!tG-x!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F172bf639-43a7-4540-9644-680ffc24baba_555x277.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path 
d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Query Benchmark Scores: median query times [min] across typical workloads. Source: <a href="https://medium.com/walmartglobaltech/lakehouse-at-fortune-1-scale-480bcb10391b">Lakehouse at Fortune 1 Scale</a> (2023)</figcaption></figure></div><p>Delta Lake came out ahead in most queries by about 40%, primarily due to its ZOrdering feature, which optimized query performance. However, Hudi excelled in real-time deduplication, providing faster access to the latest record. Since the benchmark, Hudi has introduced ZOrdering and improved filegroup metadata management, likely narrowing the performance gap significantly.</p><p>Regarding Iceberg, Walmart ran into challenges with the cleanup needed to maintain an optimal file size during the ingestion job. 
As a result, they skipped the ingestion and query benchmarks for Iceberg.</p><p>In the end, Walmart chose Hudi.</p><p>Hudi integrates seamlessly into Walmart&#8217;s highly diverse tech stack, which spans&nbsp;<strong>600K+ cores on Hadoop and Spark</strong>&nbsp;across&nbsp;<strong>Google Cloud and Azure</strong>.</p><p>Hudi has several strengths:</p><ul><li><p>It supports both batch and streaming workloads.</p></li><li><p>It offers incremental processing capabilities, reducing the need for full table rewrites.</p></li><li><p>It enables efficient upserts and deletes using unique keys and indexing.</p></li><li><p>It offers standout features, including Bloom filters, commit notifications, and monitoring interfaces.</p></li></ul><p>But how does Hudi do all of these things? Its architecture and design play a crucial role here.</p><div><hr></div><h2>How Hudi Works: Architecture and Design</h2><h3>Metadata Management</h3><p>Metadata files are stored in the &lt;base_path&gt;/.hoodie/ directory. 
Here, a file called hoodie.properties contains Hudi table configurations, such as table name, version, partition scheme, file format, or table type.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!ZRrS!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F17d9715b-f55a-4610-983e-57f9aad15110_1460x946.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!ZRrS!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F17d9715b-f55a-4610-983e-57f9aad15110_1460x946.png 424w, https://substackcdn.com/image/fetch/$s_!ZRrS!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F17d9715b-f55a-4610-983e-57f9aad15110_1460x946.png 848w, https://substackcdn.com/image/fetch/$s_!ZRrS!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F17d9715b-f55a-4610-983e-57f9aad15110_1460x946.png 1272w, https://substackcdn.com/image/fetch/$s_!ZRrS!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F17d9715b-f55a-4610-983e-57f9aad15110_1460x946.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!ZRrS!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F17d9715b-f55a-4610-983e-57f9aad15110_1460x946.png" width="1456" height="943" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/17d9715b-f55a-4610-983e-57f9aad15110_1460x946.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:943,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!ZRrS!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F17d9715b-f55a-4610-983e-57f9aad15110_1460x946.png 424w, https://substackcdn.com/image/fetch/$s_!ZRrS!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F17d9715b-f55a-4610-983e-57f9aad15110_1460x946.png 848w, https://substackcdn.com/image/fetch/$s_!ZRrS!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F17d9715b-f55a-4610-983e-57f9aad15110_1460x946.png 1272w, https://substackcdn.com/image/fetch/$s_!ZRrS!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F17d9715b-f55a-4610-983e-57f9aad15110_1460x946.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path 
d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Screenshot of the hoodie.properties</figcaption></figure></div><p>Besides hoodie.properties, metadata files record transactional actions on the table, constructing the table's Timeline.</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!Cdb-!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffac3770a-e22f-43a0-8b60-712f8ebae607_622x136.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!Cdb-!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffac3770a-e22f-43a0-8b60-712f8ebae607_622x136.png 424w, 
https://substackcdn.com/image/fetch/$s_!Cdb-!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffac3770a-e22f-43a0-8b60-712f8ebae607_622x136.png 848w, https://substackcdn.com/image/fetch/$s_!Cdb-!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffac3770a-e22f-43a0-8b60-712f8ebae607_622x136.png 1272w, https://substackcdn.com/image/fetch/$s_!Cdb-!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffac3770a-e22f-43a0-8b60-712f8ebae607_622x136.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!Cdb-!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffac3770a-e22f-43a0-8b60-712f8ebae607_622x136.png" width="622" height="136" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/fac3770a-e22f-43a0-8b60-712f8ebae607_622x136.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:136,&quot;width&quot;:622,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!Cdb-!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffac3770a-e22f-43a0-8b60-712f8ebae607_622x136.png 424w, 
https://substackcdn.com/image/fetch/$s_!Cdb-!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffac3770a-e22f-43a0-8b60-712f8ebae607_622x136.png 848w, https://substackcdn.com/image/fetch/$s_!Cdb-!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffac3770a-e22f-43a0-8b60-712f8ebae607_622x136.png 1272w, https://substackcdn.com/image/fetch/$s_!Cdb-!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffac3770a-e22f-43a0-8b60-712f8ebae607_622x136.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a><figcaption class="image-caption">Screenshot of the Hudi transactional metadata files.</figcaption></figure></div><h3>Hudi Timeline</h3><p>Hudi Timeline records all actions performed on the table at different instants, providing instantaneous views of the table while efficiently supporting the retrieval of data in the order of arrival.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!tVIv!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa942fe10-04b8-479e-b8c1-a12a9fed4b2d_1470x614.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!tVIv!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa942fe10-04b8-479e-b8c1-a12a9fed4b2d_1470x614.png 424w, https://substackcdn.com/image/fetch/$s_!tVIv!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa942fe10-04b8-479e-b8c1-a12a9fed4b2d_1470x614.png 848w, 
https://substackcdn.com/image/fetch/$s_!tVIv!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa942fe10-04b8-479e-b8c1-a12a9fed4b2d_1470x614.png 1272w, https://substackcdn.com/image/fetch/$s_!tVIv!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa942fe10-04b8-479e-b8c1-a12a9fed4b2d_1470x614.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!tVIv!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa942fe10-04b8-479e-b8c1-a12a9fed4b2d_1470x614.png" width="1456" height="608" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/a942fe10-04b8-479e-b8c1-a12a9fed4b2d_1470x614.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:608,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!tVIv!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa942fe10-04b8-479e-b8c1-a12a9fed4b2d_1470x614.png 424w, https://substackcdn.com/image/fetch/$s_!tVIv!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa942fe10-04b8-479e-b8c1-a12a9fed4b2d_1470x614.png 848w, 
https://substackcdn.com/image/fetch/$s_!tVIv!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa942fe10-04b8-479e-b8c1-a12a9fed4b2d_1470x614.png 1272w, https://substackcdn.com/image/fetch/$s_!tVIv!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa942fe10-04b8-479e-b8c1-a12a9fed4b2d_1470x614.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Image created by the author.</figcaption></figure></div><p>A Hudi instant consists of the following components. 
Each transactional metadata file is associated with an instant. The file name follows this pattern:</p><p>&lt;instant timestamp&gt;.&lt;instant action&gt;[.&lt;instant state&gt;]</p><p>The three parts of the name map to the instant&#8217;s components:</p><ul><li><p>Instant timestamp: The instant time is typically a timestamp (e.g., 20241004000131320 from the screenshot) that increases monotonically in the order in which the actions began.</p></li><li><p>Instant action: The type of action performed on the table. COMMITS denotes an atomic write of a batch of records. CLEANS removes outdated file versions. DELTA_COMMIT denotes an atomic write to a MergeOnRead table, where some or all of the data may be written only to delta logs. COMPACTION reconciles differential data structures, such as moving updates from row-based log files into columnar files; it appears on the timeline as a special commit. ROLLBACK indicates that a commit failed and was rolled back, removing any partial files it produced. Lastly, SAVEPOINT marks specific file groups as preserved for potential recovery, preventing their deletion by cleaners.</p></li><li><p>State: At any given moment, an instant can be in one of three states: REQUESTED, indicating the action has been scheduled but not yet started; INFLIGHT, showing the action is currently in progress; and COMPLETED, marking the action as finished. Note: The metadata file associated with the COMPLETED state has no state suffix. 
</p></li></ul><p>Hudi splits the timeline into an active timeline and an archived timeline:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!QW_0!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F619f7620-3ffa-4724-b341-670b18e13885_1838x512.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!QW_0!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F619f7620-3ffa-4724-b341-670b18e13885_1838x512.png 424w, https://substackcdn.com/image/fetch/$s_!QW_0!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F619f7620-3ffa-4724-b341-670b18e13885_1838x512.png 848w, https://substackcdn.com/image/fetch/$s_!QW_0!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F619f7620-3ffa-4724-b341-670b18e13885_1838x512.png 1272w, https://substackcdn.com/image/fetch/$s_!QW_0!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F619f7620-3ffa-4724-b341-670b18e13885_1838x512.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!QW_0!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F619f7620-3ffa-4724-b341-670b18e13885_1838x512.png" width="1456" height="406" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/619f7620-3ffa-4724-b341-670b18e13885_1838x512.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:406,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!QW_0!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F619f7620-3ffa-4724-b341-670b18e13885_1838x512.png 424w, https://substackcdn.com/image/fetch/$s_!QW_0!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F619f7620-3ffa-4724-b341-670b18e13885_1838x512.png 848w, https://substackcdn.com/image/fetch/$s_!QW_0!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F619f7620-3ffa-4724-b341-670b18e13885_1838x512.png 1272w, https://substackcdn.com/image/fetch/$s_!QW_0!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F619f7620-3ffa-4724-b341-670b18e13885_1838x512.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path 
d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Image created by the author.</figcaption></figure></div><ul><li><p>Active Timeline: It is consulted to serve valid data files. It is bounded in the number of instants (metadata files) it holds, so read requests don&#8217;t incur unnecessary latency as the timeline grows.</p></li><li><p>Archived Timeline: Hudi moves older timeline events to the archived timeline once certain thresholds are crossed. Generally, the archived timeline is not used for regular table operations but is kept for bookkeeping and debugging. Instants directly under the ".hoodie" directory belong to the active timeline, while archived instants are moved to the ".hoodie/archived" folder.</p></li></ul><h3>Data Storage in Hudi</h3><p>Hudi stores data as Base Files (in a columnar format like Parquet) and Log Files (in a row-based format like Avro). 
These files are structured into File Groups, each with multiple File Slices.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!pP5o!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F01bb900f-a7cc-4687-aaf1-e81febf2fa20_1550x694.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!pP5o!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F01bb900f-a7cc-4687-aaf1-e81febf2fa20_1550x694.png 424w, https://substackcdn.com/image/fetch/$s_!pP5o!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F01bb900f-a7cc-4687-aaf1-e81febf2fa20_1550x694.png 848w, https://substackcdn.com/image/fetch/$s_!pP5o!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F01bb900f-a7cc-4687-aaf1-e81febf2fa20_1550x694.png 1272w, https://substackcdn.com/image/fetch/$s_!pP5o!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F01bb900f-a7cc-4687-aaf1-e81febf2fa20_1550x694.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!pP5o!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F01bb900f-a7cc-4687-aaf1-e81febf2fa20_1550x694.png" width="1456" height="652" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/01bb900f-a7cc-4687-aaf1-e81febf2fa20_1550x694.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:652,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:288884,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!pP5o!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F01bb900f-a7cc-4687-aaf1-e81febf2fa20_1550x694.png 424w, https://substackcdn.com/image/fetch/$s_!pP5o!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F01bb900f-a7cc-4687-aaf1-e81febf2fa20_1550x694.png 848w, https://substackcdn.com/image/fetch/$s_!pP5o!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F01bb900f-a7cc-4687-aaf1-e81febf2fa20_1550x694.png 1272w, https://substackcdn.com/image/fetch/$s_!pP5o!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F01bb900f-a7cc-4687-aaf1-e81febf2fa20_1550x694.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" 
xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Image created by the author.</figcaption></figure></div><ul><li><p>Base Files: Optimized for read efficiency.</p></li><li><p>Log Files: Capture incremental changes for write optimization.</p></li></ul><p>A Hudi table is divided into multiple file groups, similar to database sharding, where each group contains a subset of the table&#8217;s data. A File Group is uniquely identified by a fileId, and each group contains File Slices. Each slice has a single Base File (Parquet/ORC) and associated Log Files (Avro). 
A slice represents a version of the group at a specific time.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!umub!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8e07c506-36f1-46b1-a356-b228bf942492_800x618.jpeg" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!umub!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8e07c506-36f1-46b1-a356-b228bf942492_800x618.jpeg 424w, https://substackcdn.com/image/fetch/$s_!umub!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8e07c506-36f1-46b1-a356-b228bf942492_800x618.jpeg 848w, https://substackcdn.com/image/fetch/$s_!umub!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8e07c506-36f1-46b1-a356-b228bf942492_800x618.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!umub!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8e07c506-36f1-46b1-a356-b228bf942492_800x618.jpeg 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!umub!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8e07c506-36f1-46b1-a356-b228bf942492_800x618.jpeg" width="800" height="618" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/8e07c506-36f1-46b1-a356-b228bf942492_800x618.jpeg&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:618,&quot;width&quot;:800,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;Image preview&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Image preview" title="Image preview" srcset="https://substackcdn.com/image/fetch/$s_!umub!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8e07c506-36f1-46b1-a356-b228bf942492_800x618.jpeg 424w, https://substackcdn.com/image/fetch/$s_!umub!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8e07c506-36f1-46b1-a356-b228bf942492_800x618.jpeg 848w, https://substackcdn.com/image/fetch/$s_!umub!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8e07c506-36f1-46b1-a356-b228bf942492_800x618.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!umub!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8e07c506-36f1-46b1-a356-b228bf942492_800x618.jpeg 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" 
xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Image created by the author.</figcaption></figure></div><p>Hudi adopts Multiversion Concurrency Control (MVCC): the compaction action merges log files and base files to produce new file slices, and the cleaning action removes unused or older file slices to reclaim space on the file system.</p><p>With this design, Hudi achieves:</p><ul><li><p>Read and write efficiency: The columnar Base File format efficiently supports large data scans, while the row-based Log File format provides high performance for data writing.</p></li><li><p>Data versioning: Each File Slice is tied to a specific timestamp on the Timeline, which makes it possible to track how records within a File Group evolve.</p></li></ul><h3>Indexing for Fast Record Lookups</h3><p>Each record in a Hudi table has a unique identifier called a primary key, which consists of the record key paired with the partition path to which the record belongs.</p><div class="captioned-image-container"><figure><a class="image-link image2 
is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!6vjO!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F55de7846-bc38-42fb-8ad3-14dbd5641177_1198x880.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!6vjO!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F55de7846-bc38-42fb-8ad3-14dbd5641177_1198x880.png 424w, https://substackcdn.com/image/fetch/$s_!6vjO!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F55de7846-bc38-42fb-8ad3-14dbd5641177_1198x880.png 848w, https://substackcdn.com/image/fetch/$s_!6vjO!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F55de7846-bc38-42fb-8ad3-14dbd5641177_1198x880.png 1272w, https://substackcdn.com/image/fetch/$s_!6vjO!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F55de7846-bc38-42fb-8ad3-14dbd5641177_1198x880.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!6vjO!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F55de7846-bc38-42fb-8ad3-14dbd5641177_1198x880.png" width="1198" height="880" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/55de7846-bc38-42fb-8ad3-14dbd5641177_1198x880.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:880,&quot;width&quot;:1198,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!6vjO!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F55de7846-bc38-42fb-8ad3-14dbd5641177_1198x880.png 424w, https://substackcdn.com/image/fetch/$s_!6vjO!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F55de7846-bc38-42fb-8ad3-14dbd5641177_1198x880.png 848w, https://substackcdn.com/image/fetch/$s_!6vjO!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F55de7846-bc38-42fb-8ad3-14dbd5641177_1198x880.png 1272w, https://substackcdn.com/image/fetch/$s_!6vjO!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F55de7846-bc38-42fb-8ad3-14dbd5641177_1198x880.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path 
d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Image created by the author.</figcaption></figure></div><p>Using primary keys, Hudi ensures no duplicate records (primary keys) across partitions and enables fast updates and deletes on records. For non-partitioned tables, the primary key includes only the record key, which means Hudi enforces a record uniqueness constraint over the entire table.</p><p>Primary keys in Hudi are also referred to as "hoodie keys." Recalling that Uber faced challenges with data updates and deletions over HDFS, Hudi introduces a feature that sets it apart from Delta Lake or Iceberg&#8212;the index.</p><p>Hudi maintains an index to enable quick record lookups. 
This index maps hoodie keys to file groups (fileIds), and the mapping never changes once the first version of a record has been written to a file.</p><div><hr></div><h2>Lessons Learned: What Data Engineers Can Apply</h2><h3><strong>The Most Popular Tool Isn't Always the Best for Your Needs</strong></h3><ul><li><p>Delta Lake and Iceberg are widely used, but Hudi fits Walmart&#8217;s requirements best.</p></li><li><p>Choose the right tool based on <strong>workload characteristics and your company&#8217;s needs</strong>, not popularity.</p></li></ul><h3><strong>Benchmarking and Benchmark Setup Are Crucial</strong></h3><ul><li><p>Walmart ensured a <strong>fair comparison</strong> between Hudi, Delta Lake, and Iceberg.</p></li><li><p>It is crucial to run performance tests fairly and in isolation. This yields more accurate results, which in turn lead to better-informed decisions.</p></li></ul><blockquote><p><em>I don&#8217;t have hands-on experience with setting up benchmarks like this, so if you do, I&#8217;d love to hear from you! Feel free to share your insights and experiences in the comments; the readers and I would greatly appreciate it.</em></p></blockquote><h3>Open vs. Closed</h3><ul><li><p>Vendor solutions are cool. They take care of everything. However, vendors will try to keep you on their platform as long as possible to maximize lifetime value; this can limit your control over the technology and force you to depend on the vendor.</p></li><li><p>If you want complete control, self-managed open-source deployments are a viable option. The trade-off is that you have to manage everything yourself.</p></li><li><p>Once again, organizations must make this kind of decision based on their own needs. Simply comparing against what other companies do will not help.</p></li></ul><div><hr></div><h2>Outro</h2><p>We explored how Walmart tackled their transition to a near real-time lakehouse by choosing Apache Hudi. 
Their decision was driven by the need for efficient batch and streaming processing, ensuring that they could seamlessly handle large-scale batch workloads and real-time data updates.</p><p>Another critical factor was maintaining control over their tech stack across multiple clouds. Walmart needed an open-source solution that would prevent vendor lock-in while allowing them to optimize their architecture across Google Cloud and Azure.</p><p>Finally, Walmart conducted careful benchmarking against Delta Lake and Iceberg, evaluating ingestion performance, query speed, and operational overhead. This thorough comparison helped them make an informed decision tailored to their unique needs.</p><p>Your Turn</p><p>What&#8217;s your experience with Hudi, Delta Lake, or Iceberg? Have you encountered challenges when deciding on a data lake format? Let&#8217;s discuss&#8212;reply to this email or share your thoughts in the comments!</p><div><hr></div><h2>Reference</h2><p><em>[1] Samuel Guleff, <a href="https://medium.com/walmartglobaltech/lakehouse-at-fortune-1-scale-480bcb10391b">Lakehouse at Fortune 1 Scale</a> (2023)</em></p><p><em>[2] <a href="https://www.onehouse.ai/blog/enabling-walmarts-data-lakehouse-with-apache-hudi">Enabling Walmart's Data Lakehouse With Apache Hudi</a> (2024)</em></p>]]></content:encoded></item><item><title><![CDATA[How Meta Solves Data Lineage At Scale]]></title><description><![CDATA[Meta&#8217;s Approach to Data Lineage: How They Did It and What We Can Learn]]></description><link>https://vutr.substack.com/p/how-meta-solves-data-lineage-at-scale</link><guid isPermaLink="false">https://vutr.substack.com/p/how-meta-solves-data-lineage-at-scale</guid><dc:creator><![CDATA[Vu Trinh]]></dc:creator><pubDate>Thu, 13 Feb 2025 03:15:44 GMT</pubDate><enclosure 
url="https://substackcdn.com/image/fetch/$s_!qH69!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6494f0a7-8e18-4770-b797-f0ee64c75517_2000x1429.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!qH69!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6494f0a7-8e18-4770-b797-f0ee64c75517_2000x1429.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!qH69!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6494f0a7-8e18-4770-b797-f0ee64c75517_2000x1429.png 424w, https://substackcdn.com/image/fetch/$s_!qH69!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6494f0a7-8e18-4770-b797-f0ee64c75517_2000x1429.png 848w, 
https://substackcdn.com/image/fetch/$s_!qH69!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6494f0a7-8e18-4770-b797-f0ee64c75517_2000x1429.png 1272w, https://substackcdn.com/image/fetch/$s_!qH69!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6494f0a7-8e18-4770-b797-f0ee64c75517_2000x1429.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!qH69!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6494f0a7-8e18-4770-b797-f0ee64c75517_2000x1429.png" width="1456" height="1040" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/6494f0a7-8e18-4770-b797-f0ee64c75517_2000x1429.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1040,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:326947,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!qH69!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6494f0a7-8e18-4770-b797-f0ee64c75517_2000x1429.png 424w, https://substackcdn.com/image/fetch/$s_!qH69!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6494f0a7-8e18-4770-b797-f0ee64c75517_2000x1429.png 848w, 
https://substackcdn.com/image/fetch/$s_!qH69!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6494f0a7-8e18-4770-b797-f0ee64c75517_2000x1429.png 1272w, https://substackcdn.com/image/fetch/$s_!qH69!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6494f0a7-8e18-4770-b797-f0ee64c75517_2000x1429.png 1456w" sizes="100vw" fetchpriority="high"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Image created by the author.</figcaption></figure></div><div><hr></div><h2>Intro</h2><p>When Meta recently 
published an article titled <a href="https://engineering.fb.com/2025/01/22/security/how-meta-discovers-data-flows-via-lineage-at-scale/">How Meta discovers data flows via lineage at scale</a>, it instantly caught my attention.</p><p>As data engineers, we often hear about data lineage, but how many of us deeply understand its implications or the challenges of implementing it at scale? Meta&#8217;s approach to solving data lineage problems within their privacy infrastructure offers fascinating lessons.</p><p>In this article, we&#8217;ll explore Meta&#8217;s challenges with data lineage, their solutions, and the practical lessons we can adopt, even if we don&#8217;t operate at Meta&#8217;s scale.</p><div><hr></div><h2>A Little Bit About Meta</h2><blockquote><p><em>Even my low-tech mom uses Facebook.</em></p></blockquote><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!4p_-!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9f63999c-fad1-42c8-b6ec-2061042ea8f1_376x362.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!4p_-!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9f63999c-fad1-42c8-b6ec-2061042ea8f1_376x362.png 424w, https://substackcdn.com/image/fetch/$s_!4p_-!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9f63999c-fad1-42c8-b6ec-2061042ea8f1_376x362.png 848w, https://substackcdn.com/image/fetch/$s_!4p_-!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9f63999c-fad1-42c8-b6ec-2061042ea8f1_376x362.png 1272w, 
https://substackcdn.com/image/fetch/$s_!4p_-!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9f63999c-fad1-42c8-b6ec-2061042ea8f1_376x362.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!4p_-!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9f63999c-fad1-42c8-b6ec-2061042ea8f1_376x362.png" width="376" height="362" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/9f63999c-fad1-42c8-b6ec-2061042ea8f1_376x362.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:362,&quot;width&quot;:376,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:44926,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!4p_-!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9f63999c-fad1-42c8-b6ec-2061042ea8f1_376x362.png 424w, https://substackcdn.com/image/fetch/$s_!4p_-!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9f63999c-fad1-42c8-b6ec-2061042ea8f1_376x362.png 848w, https://substackcdn.com/image/fetch/$s_!4p_-!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9f63999c-fad1-42c8-b6ec-2061042ea8f1_376x362.png 1272w, 
https://substackcdn.com/image/fetch/$s_!4p_-!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9f63999c-fad1-42c8-b6ec-2061042ea8f1_376x362.png 1456w" sizes="100vw" loading="lazy"></picture></div></a><figcaption class="image-caption">Image created by the author.</figcaption></figure></div><p>With billions of users across Facebook, Instagram, WhatsApp, and more, the company handles petabytes of data daily. This data isn&#8217;t just about scale; it&#8217;s deeply interconnected. 
Every click, post, or message can flow through a complex web of systems&#8212;from user-facing apps to backend services and data warehouses. Managing and understanding these flows is no small feat, especially as Meta prioritizes user privacy.</p><p>At the heart of their efforts is the Privacy-Aware Infrastructure (PAI), a suite of technologies that ensures privacy controls across their systems. Data lineage is a cornerstone of PAI, allowing Meta to trace how data flows and ensure compliance with privacy requirements.</p><div><hr></div><h2>But what is data lineage?</h2><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!UFgu!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F98dcd58d-1caa-469b-b9ab-bb5310d5135d_1278x586.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!UFgu!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F98dcd58d-1caa-469b-b9ab-bb5310d5135d_1278x586.png 424w, https://substackcdn.com/image/fetch/$s_!UFgu!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F98dcd58d-1caa-469b-b9ab-bb5310d5135d_1278x586.png 848w, https://substackcdn.com/image/fetch/$s_!UFgu!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F98dcd58d-1caa-469b-b9ab-bb5310d5135d_1278x586.png 1272w, https://substackcdn.com/image/fetch/$s_!UFgu!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F98dcd58d-1caa-469b-b9ab-bb5310d5135d_1278x586.png 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!UFgu!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F98dcd58d-1caa-469b-b9ab-bb5310d5135d_1278x586.png" width="1278" height="586" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/98dcd58d-1caa-469b-b9ab-bb5310d5135d_1278x586.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:586,&quot;width&quot;:1278,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!UFgu!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F98dcd58d-1caa-469b-b9ab-bb5310d5135d_1278x586.png 424w, https://substackcdn.com/image/fetch/$s_!UFgu!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F98dcd58d-1caa-469b-b9ab-bb5310d5135d_1278x586.png 848w, https://substackcdn.com/image/fetch/$s_!UFgu!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F98dcd58d-1caa-469b-b9ab-bb5310d5135d_1278x586.png 1272w, https://substackcdn.com/image/fetch/$s_!UFgu!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F98dcd58d-1caa-469b-b9ab-bb5310d5135d_1278x586.png 1456w" sizes="100vw" loading="lazy"></picture></div></a><figcaption class="image-caption">Image created by the author.</figcaption></figure></div><p>Data lineage is the process of tracing data's journey through various systems, from its source to its final destination. It answers questions like: Where did this data originate? How has it been transformed? Where is it being used? 
It gives us:</p><ol><li><p><strong>Transparency and Trust</strong>: It clarifies how data flows through systems, essential for ensuring compliance with privacy regulations and building user trust.</p></li><li><p><strong>Troubleshooting</strong>: Knowing the data's path helps engineers pinpoint the root cause when issues arise.</p></li><li><p><strong>Impact Analysis</strong>: When making changes to systems, data lineage allows teams to assess potential downstream effects, minimizing unintended disruptions.</p></li><li><p><strong>Compliance</strong>: In an era of stringent data privacy laws, like GDPR and CCPA, having a clear picture of data flows is mandatory to demonstrate compliance and protect user privacy.</p></li></ol><p>Data lineage isn't just a "nice-to-have"&#8212;it's a foundational piece of modern data systems.</p><div><hr></div><h2>The Problem At Meta</h2><h3>Why Data Lineage Matters</h3><p>For Meta, data lineage helps them understand how data&#8212;such as a user&#8217;s religious views on Facebook Dating&#8212;moves from the input stage to backend processing, storage, and usage in downstream systems.</p><p>This transparency is critical for implementing and validating privacy controls. 
Here is how data lineage fit into Meta&#8217;s privacy work at the outset:</p><ul><li><p>Understanding how data flows across its systems is crucial to establishing privacy controls in PAI.</p></li><li><p>An important service is Policy Zones, which answers the question: &#8220;Where does my data come from, and where does it go?&#8221;</p></li><li><p>Internal users can use the lineage graphs to explain how data flows and where it is collected and processed.</p></li><li><p>Meta developed the Policy Zone Manager (PZM), a tool based on data lineage that lets developers identify multiple downstream assets from a set of sources. This accelerates the rollout of privacy controls.</p></li><li><p>Once privacy requirements are implemented, data lineage helps continuously monitor and validate data flows and provides enforcement mechanisms.</p></li></ul><p>However, as Meta scaled PAI across all its apps, its existing lineage solutions fell short.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!JPye!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F260a060b-b217-4b3d-ac6f-3f31436c5e35_1468x702.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!JPye!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F260a060b-b217-4b3d-ac6f-3f31436c5e35_1468x702.png 424w, https://substackcdn.com/image/fetch/$s_!JPye!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F260a060b-b217-4b3d-ac6f-3f31436c5e35_1468x702.png 848w, 
https://substackcdn.com/image/fetch/$s_!JPye!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F260a060b-b217-4b3d-ac6f-3f31436c5e35_1468x702.png 1272w, https://substackcdn.com/image/fetch/$s_!JPye!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F260a060b-b217-4b3d-ac6f-3f31436c5e35_1468x702.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!JPye!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F260a060b-b217-4b3d-ac6f-3f31436c5e35_1468x702.png" width="1456" height="696" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/260a060b-b217-4b3d-ac6f-3f31436c5e35_1468x702.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:696,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:126122,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!JPye!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F260a060b-b217-4b3d-ac6f-3f31436c5e35_1468x702.png 424w, https://substackcdn.com/image/fetch/$s_!JPye!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F260a060b-b217-4b3d-ac6f-3f31436c5e35_1468x702.png 848w, 
https://substackcdn.com/image/fetch/$s_!JPye!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F260a060b-b217-4b3d-ac6f-3f31436c5e35_1468x702.png 1272w, https://substackcdn.com/image/fetch/$s_!JPye!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F260a060b-b217-4b3d-ac6f-3f31436c5e35_1468x702.png 1456w" sizes="100vw" loading="lazy"></picture></div></a><figcaption class="image-caption">Image created by the author.</figcaption></figure></div><p>Expanding PAI to all of Meta&#8217;s apps introduced a 
massive challenge: ensuring high-quality, detailed data lineage across diverse systems. Manual methods, such as hand-authored diagrams and spreadsheets, couldn&#8217;t keep up with the pace of change or the sheer volume and complexity of data flows.</p><p>Without robust lineage tools, Meta risked delays in implementing privacy controls, which could affect user trust and regulatory compliance.</p><h3>Is This Problem Unique?</h3><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!aA_m!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F47ce6dcd-9c71-4f28-9227-07ae33f0cc8f_796x462.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!aA_m!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F47ce6dcd-9c71-4f28-9227-07ae33f0cc8f_796x462.png 424w, https://substackcdn.com/image/fetch/$s_!aA_m!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F47ce6dcd-9c71-4f28-9227-07ae33f0cc8f_796x462.png 848w, https://substackcdn.com/image/fetch/$s_!aA_m!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F47ce6dcd-9c71-4f28-9227-07ae33f0cc8f_796x462.png 1272w, https://substackcdn.com/image/fetch/$s_!aA_m!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F47ce6dcd-9c71-4f28-9227-07ae33f0cc8f_796x462.png 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!aA_m!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F47ce6dcd-9c71-4f28-9227-07ae33f0cc8f_796x462.png" width="796" height="462" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/47ce6dcd-9c71-4f28-9227-07ae33f0cc8f_796x462.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:462,&quot;width&quot;:796,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:102218,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!aA_m!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F47ce6dcd-9c71-4f28-9227-07ae33f0cc8f_796x462.png 424w, https://substackcdn.com/image/fetch/$s_!aA_m!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F47ce6dcd-9c71-4f28-9227-07ae33f0cc8f_796x462.png 848w, https://substackcdn.com/image/fetch/$s_!aA_m!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F47ce6dcd-9c71-4f28-9227-07ae33f0cc8f_796x462.png 1272w, https://substackcdn.com/image/fetch/$s_!aA_m!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F47ce6dcd-9c71-4f28-9227-07ae33f0cc8f_796x462.png 1456w" sizes="100vw" loading="lazy"></picture></div></a><figcaption class="image-caption">Image created by the author.</figcaption></figure></div><p>While Meta&#8217;s scale is unparalleled, the core problem&#8212;managing data lineage efficiently&#8212;is something many companies face. As organizations grow, they often grapple with fragmented systems and incomplete lineage. This impacts everything from troubleshooting to compliance, making it a universal challenge for data teams.</p><div><hr></div><h2>How Meta Solved It</h2><p>Meta developed a comprehensive lineage solution integrated into their PAI to tackle their challenges. The Policy Zone Manager (PZM) is central to this effort. 
This tool builds on lineage data, enabling developers to trace data flows and implement privacy controls efficiently.</p><p>The solution involves the following steps.</p><h3><strong>Collecting data flow signals</strong> from many data activities</h3><ol><li><p><strong>For web system activities</strong>, Meta discovers data flows by employing static and runtime analysis tools, focusing on sensitive data such as religious views. For instance, when a user enters data in the app, it is transmitted to a web endpoint, written to a logging table, and stored in a database.</p><p>Static analysis tools simulate code execution to map out potential data flows. Data at Meta can flow through stacks of function calls written in different programming languages, such as C++ or Python, from web systems to backend services.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!QfZu!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1eb8237f-4875-4a33-8046-fc4ac25f9dde_1284x668.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!QfZu!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1eb8237f-4875-4a33-8046-fc4ac25f9dde_1284x668.png 424w, https://substackcdn.com/image/fetch/$s_!QfZu!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1eb8237f-4875-4a33-8046-fc4ac25f9dde_1284x668.png 848w, https://substackcdn.com/image/fetch/$s_!QfZu!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1eb8237f-4875-4a33-8046-fc4ac25f9dde_1284x668.png 1272w, 
https://substackcdn.com/image/fetch/$s_!QfZu!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1eb8237f-4875-4a33-8046-fc4ac25f9dde_1284x668.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!QfZu!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1eb8237f-4875-4a33-8046-fc4ac25f9dde_1284x668.png" width="1284" height="668" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/1eb8237f-4875-4a33-8046-fc4ac25f9dde_1284x668.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:668,&quot;width&quot;:1284,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:141544,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!QfZu!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1eb8237f-4875-4a33-8046-fc4ac25f9dde_1284x668.png 424w, https://substackcdn.com/image/fetch/$s_!QfZu!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1eb8237f-4875-4a33-8046-fc4ac25f9dde_1284x668.png 848w, https://substackcdn.com/image/fetch/$s_!QfZu!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1eb8237f-4875-4a33-8046-fc4ac25f9dde_1284x668.png 1272w, 
https://substackcdn.com/image/fetch/$s_!QfZu!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1eb8237f-4875-4a33-8046-fc4ac25f9dde_1284x668.png 1456w" sizes="100vw" loading="lazy"></picture></div></a><figcaption class="image-caption">Image created by the author.</figcaption></figure></div><p>Static code analysis is a method of examining code without executing the program. 
In the lineage context, static analysis simulates the logical paths a program might take. This simulation helps identify potential data flows, such as data being read from a source (e.g., a form or API endpoint), processed or transformed by various functions, and written to a destination (e.g., a database table or log file).</p><p>However, static analysis alone is not enough: it does not account for runtime-specific data flows, such as conditional logic that depends on user input.</p><p>To fill this gap, Meta collects real-time signals during request execution. It captures and compares payloads at source and sink points, categorizing data flow evidence into match sets (high-confidence matches) and complete sets (broader potential matches for human review).</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!KObj!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0886fa36-c6a4-4d18-b5cd-3ef9e9c8e080_676x580.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!KObj!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0886fa36-c6a4-4d18-b5cd-3ef9e9c8e080_676x580.png 424w, https://substackcdn.com/image/fetch/$s_!KObj!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0886fa36-c6a4-4d18-b5cd-3ef9e9c8e080_676x580.png 848w, https://substackcdn.com/image/fetch/$s_!KObj!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0886fa36-c6a4-4d18-b5cd-3ef9e9c8e080_676x580.png 1272w, 
https://substackcdn.com/image/fetch/$s_!KObj!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0886fa36-c6a4-4d18-b5cd-3ef9e9c8e080_676x580.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!KObj!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0886fa36-c6a4-4d18-b5cd-3ef9e9c8e080_676x580.png" width="676" height="580" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/0886fa36-c6a4-4d18-b5cd-3ef9e9c8e080_676x580.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:580,&quot;width&quot;:676,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:91076,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!KObj!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0886fa36-c6a4-4d18-b5cd-3ef9e9c8e080_676x580.png 424w, https://substackcdn.com/image/fetch/$s_!KObj!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0886fa36-c6a4-4d18-b5cd-3ef9e9c8e080_676x580.png 848w, https://substackcdn.com/image/fetch/$s_!KObj!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0886fa36-c6a4-4d18-b5cd-3ef9e9c8e080_676x580.png 1272w, 
https://substackcdn.com/image/fetch/$s_!KObj!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0886fa36-c6a4-4d18-b5cd-3ef9e9c8e080_676x580.png 1456w" sizes="100vw" loading="lazy"></picture></div></a><figcaption class="image-caption">Image created by the author.</figcaption></figure></div><p>For example, suppose Meta collects two payloads, one from a source and one from a sink. If the source payload is {&#8220;data&#8221;: &#8220;Buddhist&#8221;} and 
the sink payload is {&#8220;data&#8221;: &#8220;Buddhist&#8221;, &#8220;event_timestamp&#8221;: &#8220;00:00:00&#8221;}, Meta treats this as a likely data flow from the source to the sink.</p><p>However, if the sink payload holds a &#8220;more compacted and abstracted&#8221; value, such as {&#8220;religion_count&#8221;: 1}, Meta cannot be sure that the data flowed from the source to this sink. In such cases, a human needs to review the flow result.</p><p>Unfortunately, Meta doesn&#8217;t share detailed rules for defining the confidence level of a flow result.</p></li><li><p><strong>For data warehousing activities</strong>, Meta combines runtime instrumentation with static analysis of SQL queries (from engines like Presto and Spark). Contextual runtime information, such as job IDs, helps fill gaps where static analysis might miss connections.</p></li><li><p><strong>For AI systems</strong>, lineage tracking involves capturing relationships between datasets, models, and workflows. These systems construct detailed lineage graphs by integrating runtime signals from libraries like PyTorch and workflow engines like FBLearner Flow.</p></li></ol><h3><strong>Identifying Relevant Data Flows</strong></h3><p>After building comprehensive lineage graphs, Meta needed a way to focus on specific data flows, like those involving religious views.</p><p>They developed an iterative analysis tool that allows developers to filter and refine these graphs efficiently. This tool uses a process of discovery, exclusion, and iteration to identify the most relevant flows.</p><h3>How It Helps</h3><p>The result? Developers can now confidently trace granular data flows and implement privacy controls quickly. 
This has significantly reduced the time and effort required to ensure compliance while maintaining Meta&#8217;s commitment to user privacy.</p><div><hr></div><h2>Lessons We Can Learn</h2><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!AF4d!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fea76604b-abc4-4633-b2ce-7c87c41750d9_548x540.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!AF4d!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fea76604b-abc4-4633-b2ce-7c87c41750d9_548x540.png 424w, https://substackcdn.com/image/fetch/$s_!AF4d!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fea76604b-abc4-4633-b2ce-7c87c41750d9_548x540.png 848w, https://substackcdn.com/image/fetch/$s_!AF4d!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fea76604b-abc4-4633-b2ce-7c87c41750d9_548x540.png 1272w, https://substackcdn.com/image/fetch/$s_!AF4d!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fea76604b-abc4-4633-b2ce-7c87c41750d9_548x540.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!AF4d!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fea76604b-abc4-4633-b2ce-7c87c41750d9_548x540.png" width="548" height="540" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/ea76604b-abc4-4633-b2ce-7c87c41750d9_548x540.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:540,&quot;width&quot;:548,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:72028,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!AF4d!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fea76604b-abc4-4633-b2ce-7c87c41750d9_548x540.png 424w, https://substackcdn.com/image/fetch/$s_!AF4d!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fea76604b-abc4-4633-b2ce-7c87c41750d9_548x540.png 848w, https://substackcdn.com/image/fetch/$s_!AF4d!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fea76604b-abc4-4633-b2ce-7c87c41750d9_548x540.png 1272w, https://substackcdn.com/image/fetch/$s_!AF4d!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fea76604b-abc4-4633-b2ce-7c87c41750d9_548x540.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" 
xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Image created by the author.</figcaption></figure></div><h3>Start Thinking About Data Lineage Early</h3><p>I believe data lineage isn&#8217;t just for large companies. Even smaller teams can benefit from building lineage into their processes early. As your data ecosystem grows, having this foundation will save countless hours of debugging and spare you compliance headaches.</p><h3>Implementing Data Lineage</h3><p>If you&#8217;re not working at Meta&#8217;s scale, start small. Tools like dbt&#8217;s lineage graph or metadata platforms like DataHub offer a solid foundation. If these tools fall short, consider Meta&#8217;s approach of embedding tracking logic into the code. 
Just remember, starting simple and iterating gradually usually beats building a complex system that doesn&#8217;t fit your organization.</p><h3>Lineage Graphs Alone Aren&#8217;t Enough</h3><p>Meta&#8217;s case study also highlights an important point: </p><p>Simply having a lineage graph isn&#8217;t enough. You need tools that let end users interact with these graphs and extract actionable insights from them. </p><p>Start with the interfaces that existing tools already provide, such as the dbt documentation site or the DataHub UI and API. Use them as a foundation to gather user feedback, then enhance or customize your solution iteratively. This keeps the tools aligned with what users actually need and gets more value out of your lineage data. </p><h3>Measure and Iterate</h3><p>Data lineage, like any engineering effort, benefits from continuous improvement. Regularly measure the effectiveness of your lineage tools and processes, and iterate based on feedback.</p><div><hr></div><h2>Outro</h2><p>These are my notes from learning how Meta does data lineage at massive scale.</p><p>Meta&#8217;s journey with data lineage shows practical ways to tackle complex challenges. From scalable data flow collection to user-friendly tools, their approach provides valuable lessons for teams of all sizes.</p><p>As you reflect on these insights, consider how your organization handles data lineage. Are there gaps you can address? Tools you can adopt? 
Starting today can lead you to smoother operations and stronger compliance.</p><p>I&#8217;d love to hear from you if this has sparked ideas or questions.</p><div><hr></div><h2>Reference</h2><p><em>[1] Facebook Engineering Blog, <a href="https://engineering.fb.com/2025/01/22/security/how-meta-discovers-data-flows-via-lineage-at-scale/">How Meta discovers data flows via lineage at scale</a> (2025)</em></p>]]></content:encoded></item><item><title><![CDATA[Kimball Dimensional Modeling Overview]]></title><description><![CDATA[Is it still valid?]]></description><link>https://vutr.substack.com/p/dimensional-modeling-overview</link><guid isPermaLink="false">https://vutr.substack.com/p/dimensional-modeling-overview</guid><dc:creator><![CDATA[Vu Trinh]]></dc:creator><pubDate>Thu, 06 Feb 2025 03:15:48 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!h1wl!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F49f4af46-1a0a-4d18-8f81-d513dba88b3e_2000x1429.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<blockquote><p>To celebrate Lunar New Year (the true New Year holiday in Vietnam), I&#8217;m offering <em><strong>50% off the annual subscription</strong></em>. 
The offer ends soon; grab it now to get full access to nearly 200 high-quality data engineering articles.</p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://vutr.substack.com/subscribe&quot;,&quot;text&quot;:&quot;50% off the annual subscription&quot;,&quot;action&quot;:null,&quot;class&quot;:&quot;button-wrapper&quot;}" data-component-name="ButtonCreateButton"><a class="button primary button-wrapper" href="https://vutr.substack.com/subscribe"><span>50% off the annual subscription</span></a></p></blockquote><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!h1wl!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F49f4af46-1a0a-4d18-8f81-d513dba88b3e_2000x1429.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!h1wl!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F49f4af46-1a0a-4d18-8f81-d513dba88b3e_2000x1429.png 424w, https://substackcdn.com/image/fetch/$s_!h1wl!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F49f4af46-1a0a-4d18-8f81-d513dba88b3e_2000x1429.png 848w, https://substackcdn.com/image/fetch/$s_!h1wl!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F49f4af46-1a0a-4d18-8f81-d513dba88b3e_2000x1429.png 1272w, https://substackcdn.com/image/fetch/$s_!h1wl!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F49f4af46-1a0a-4d18-8f81-d513dba88b3e_2000x1429.png 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!h1wl!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F49f4af46-1a0a-4d18-8f81-d513dba88b3e_2000x1429.png" width="1456" height="1040" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/49f4af46-1a0a-4d18-8f81-d513dba88b3e_2000x1429.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1040,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:316191,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!h1wl!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F49f4af46-1a0a-4d18-8f81-d513dba88b3e_2000x1429.png 424w, https://substackcdn.com/image/fetch/$s_!h1wl!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F49f4af46-1a0a-4d18-8f81-d513dba88b3e_2000x1429.png 848w, https://substackcdn.com/image/fetch/$s_!h1wl!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F49f4af46-1a0a-4d18-8f81-d513dba88b3e_2000x1429.png 1272w, https://substackcdn.com/image/fetch/$s_!h1wl!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F49f4af46-1a0a-4d18-8f81-d513dba88b3e_2000x1429.png 1456w" sizes="100vw" fetchpriority="high"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" 
class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Image created by the author.</figcaption></figure></div><div><hr></div><h2>Intro</h2><p>I started my data engineering career in 2019.</p><p>Spark was released 5 years earlier.</p><p>Vendors released BigQuery and Snowflake earlier in that same decade.</p><p>Hadoop was released 13 years earlier.</p><p>I was lucky enough to start in an era with plenty of technologies and tools that help data engineers streamline &#8220;big data&#8221; storage and processing.</p><p>I was lucky enough to start in an era in which all that stands between a company and a robust data system is a few clicks in a cloud console, instead of months of planning and setting up on-premises servers.</p><p>But everything has a price.</p><p>Hardware in the past was 
expensive, software licenses and servers required upfront spending, and a robust data infrastructure took time to plan and implement. Teams had to ensure that data was organized and managed in a way that supported the business efficiently. They couldn&#8217;t throw data into the system and hope for the best. They modeled the data carefully.</p><p>I live in an era where people belittle data modeling because they need to move fast and because &#8220;adding more resources&#8221; will supposedly fix slow, messy queries.</p><p>I only realized the importance of data modeling a year ago, and since then, I&#8217;ve tried to learn this fundamental skill. Following the advice you&#8217;ve probably seen online, I started with <em><a href="https://www.amazon.com/Data-Warehouse-Toolkit-Definitive-Dimensional/dp/1118530802">The Data Warehouse Toolkit: The Definitive Guide to Dimensional Modeling</a></em>.</p><p>This article summarizes what I learned from the book&#8217;s first two chapters.</p><div><hr></div><h2>Data Warehousing</h2><p>Since Bill Inmon laid the foundation of data warehousing in the late 1980s, separating the systems that produce data from the systems that offer analytic capabilities has become the norm.</p><p>The &#8220;left&#8221; side records sign-up information, web tracking events, or orders worldwide. On this side, companies use systems optimized for transactional point queries with very high concurrency (OLTP).</p><p>The &#8220;right&#8221; side gathers and organizes information from the systems on the &#8220;left&#8221; side; it helps users answer questions like &#8220;How many users visited our website last week?&#8221; or &#8220;How many orders came from Vietnam in the previous 3 months?&#8221; On this side, companies use systems optimized for high-performance queries over vast amounts of historical data, which might not need as much concurrency (OLAP).</p><p>The two sides serve different needs. 
This article focuses on the &#8220;right&#8221; side, the data warehouse, which should meet the following requirements:</p><ul><li><p>The system should be intuitive for business users, not just developers.</p></li><li><p>Data from various sources must be presented with consistent labels and definitions.</p></li><li><p>The system should adapt to changing needs.</p></li><li><p>It must safeguard sensitive information.</p></li><li><p>The data warehouse team and business users should agree on delivery timelines, especially when time limits restrict data cleaning or validation.</p></li><li><p><strong>It must have the right data to support decision-making.</strong></p></li><li><p><strong>The business users must accept the DW/BI system.</strong> If you think you built an excellent data warehousing system but nobody uses it, your solution is not that great. </p></li></ul><p>Kimball believes that dimensional modeling can help us build a data warehousing solution that meets all the above criteria.</p><div><hr></div><h2>Dimensional Modeling</h2><h3>Overview</h3><p>Dimensional modeling was popularized by Ralph Kimball&#8217;s 1996 book, <em>The Data Warehouse Toolkit</em> (1st edition). Organizations have widely adopted it to present analytic data. The approach aims for simplicity, which aligns with how most business users intuitively think.</p><p>Business users naturally think about their operations in terms of measurable metrics and the contexts in which those metrics are observed. For example, a retail manager might want to analyze sales performance by product categories, across different regions, and over time. This way of thinking is inherently dimensional: products, regions, and time are all distinct perspectives or dimensions through which performance can be evaluated. </p><p>Kimball&#8217;s approach promises to align with how business users think. This alignment gives users a tangible way to think about the data. 
Clear thinking leads to simple data modeling.</p><h3>Star Schema</h3><p>Dimensional modeling differs from third normal form (3NF) modeling. Normalization&#8217;s ultimate goal is to ensure data integrity by removing redundancies. Normalized 3NF structures divide data into many entities, each stored as a relational table. For example, we store users&#8217; information separately from products&#8217; information. This approach is helpful in operational processing, where data integrity is the priority.</p><p>However, it is too complicated for data warehousing. Figuring out how to calculate the January revenue of users from India can be overwhelming when business users look at entity-relationship diagrams (ERDs) with hundreds of entities.</p><p>People implement dimensional models by organizing data in star schemas. Named for its resemblance to a star, the schema consists of a central fact table surrounded by multiple dimension tables.</p>
<h3>Fact</h3><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!fXEn!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Feb0701d9-afa8-416e-8a5c-d617a45adab1_1116x590.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!fXEn!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Feb0701d9-afa8-416e-8a5c-d617a45adab1_1116x590.png 424w, https://substackcdn.com/image/fetch/$s_!fXEn!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Feb0701d9-afa8-416e-8a5c-d617a45adab1_1116x590.png 848w, https://substackcdn.com/image/fetch/$s_!fXEn!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Feb0701d9-afa8-416e-8a5c-d617a45adab1_1116x590.png 1272w, https://substackcdn.com/image/fetch/$s_!fXEn!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Feb0701d9-afa8-416e-8a5c-d617a45adab1_1116x590.png 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!fXEn!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Feb0701d9-afa8-416e-8a5c-d617a45adab1_1116x590.png" width="1116" height="590" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/eb0701d9-afa8-416e-8a5c-d617a45adab1_1116x590.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:590,&quot;width&quot;:1116,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:94432,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!fXEn!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Feb0701d9-afa8-416e-8a5c-d617a45adab1_1116x590.png 424w, https://substackcdn.com/image/fetch/$s_!fXEn!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Feb0701d9-afa8-416e-8a5c-d617a45adab1_1116x590.png 848w, https://substackcdn.com/image/fetch/$s_!fXEn!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Feb0701d9-afa8-416e-8a5c-d617a45adab1_1116x590.png 1272w, https://substackcdn.com/image/fetch/$s_!fXEn!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Feb0701d9-afa8-416e-8a5c-d617a45adab1_1116x590.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft 
pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Image created by the author.</figcaption></figure></div><p>The fact table is the central table in the star schema. It stores the performance measurements resulting from an organization&#8217;s business process events. Kimball encourages us to store measurements at the lowest level of detail to keep the most flexibility.</p><p>Each row in a fact table corresponds to a measurement event. The data on each row is at a specific level of detail, referred to as the grain; all rows in a fact table must be at the same grain. 
For example, each row in the event-tracking fact table corresponds to a user&#8217;s event, such as clicking a button or purchasing an item.</p><p>A fact table row contains:</p><ul><li><p><strong>Foreign Keys</strong>: Links to the related dimension tables.</p></li><li><p><strong>Measures</strong>: Numerical values, such as revenue, quantity sold, or profit.</p></li></ul><p>When all the keys in the fact table correctly match their respective primary keys in the corresponding dimension tables, the tables satisfy referential integrity. Users can find insights by joining the fact table to a dimension table, matching the fact table&#8217;s foreign key to the dimension table&#8217;s primary key.</p><p>For example, users&#8217; revenue in Europe can be calculated by joining the revenue fact table (user grain) with the country dimension table, matching the fact table&#8217;s country-code foreign key to the dimension table&#8217;s primary key. The country dimension records each country&#8217;s continent, so the result can be filtered to Europe.</p><h3>Dimension</h3><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!6lV-!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F504495ea-97bf-4820-876a-8aeb99f8d519_1256x690.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!6lV-!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F504495ea-97bf-4820-876a-8aeb99f8d519_1256x690.png 424w, https://substackcdn.com/image/fetch/$s_!6lV-!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F504495ea-97bf-4820-876a-8aeb99f8d519_1256x690.png 848w, 
https://substackcdn.com/image/fetch/$s_!6lV-!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F504495ea-97bf-4820-876a-8aeb99f8d519_1256x690.png 1272w, https://substackcdn.com/image/fetch/$s_!6lV-!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F504495ea-97bf-4820-876a-8aeb99f8d519_1256x690.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!6lV-!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F504495ea-97bf-4820-876a-8aeb99f8d519_1256x690.png" width="1256" height="690" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/504495ea-97bf-4820-876a-8aeb99f8d519_1256x690.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:690,&quot;width&quot;:1256,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:127233,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!6lV-!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F504495ea-97bf-4820-876a-8aeb99f8d519_1256x690.png 424w, https://substackcdn.com/image/fetch/$s_!6lV-!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F504495ea-97bf-4820-876a-8aeb99f8d519_1256x690.png 848w, 
https://substackcdn.com/image/fetch/$s_!6lV-!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F504495ea-97bf-4820-876a-8aeb99f8d519_1256x690.png 1272w, https://substackcdn.com/image/fetch/$s_!6lV-!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F504495ea-97bf-4820-876a-8aeb99f8d519_1256x690.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Image created by the author.</figcaption></figure></div><p>Dimension tables provide descriptive context for the facts. 
They describe the &#8220;who, what, where, when, how, and why.&#8221; Each table focuses on a business dimension, such as product, country, or date.</p><p>Dimension tables play a crucial role in the data warehousing system because they provide a context for measurements. A skyrocketing revenue number alone does not give insight into the business.</p><p>Kimball suggests that the data warehouse is only as good as the dimensions. We must model the dimensions&#8217; attributes (columns) to ensure they are as close to the business terminology as possible. </p><blockquote><p><em>Robust dimension attributes deliver robust analytic slicing-and-dicing capabilities.</em></p></blockquote><h3>The process</h3><p>There are four steps in the dimensional design process:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!39Wk!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe7b48202-dec3-4635-ad95-e7e246cb6155_832x842.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!39Wk!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe7b48202-dec3-4635-ad95-e7e246cb6155_832x842.png 424w, https://substackcdn.com/image/fetch/$s_!39Wk!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe7b48202-dec3-4635-ad95-e7e246cb6155_832x842.png 848w, https://substackcdn.com/image/fetch/$s_!39Wk!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe7b48202-dec3-4635-ad95-e7e246cb6155_832x842.png 1272w, 
https://substackcdn.com/image/fetch/$s_!39Wk!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe7b48202-dec3-4635-ad95-e7e246cb6155_832x842.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!39Wk!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe7b48202-dec3-4635-ad95-e7e246cb6155_832x842.png" width="832" height="842" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/e7b48202-dec3-4635-ad95-e7e246cb6155_832x842.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:842,&quot;width&quot;:832,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:136828,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!39Wk!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe7b48202-dec3-4635-ad95-e7e246cb6155_832x842.png 424w, https://substackcdn.com/image/fetch/$s_!39Wk!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe7b48202-dec3-4635-ad95-e7e246cb6155_832x842.png 848w, https://substackcdn.com/image/fetch/$s_!39Wk!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe7b48202-dec3-4635-ad95-e7e246cb6155_832x842.png 1272w, 
https://substackcdn.com/image/fetch/$s_!39Wk!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe7b48202-dec3-4635-ad95-e7e246cb6155_832x842.png 1456w" sizes="100vw" loading="lazy"></picture></div></a><figcaption class="image-caption">Image created by the author.</figcaption></figure></div><ul><li><p>The process begins with&nbsp;<strong>selecting the business process,&nbsp;</strong>a step in which you identify the key activity or operation to analyze, such as sales, inventory management, or customer interactions.&nbsp;</p></li><li><p>Next comes <strong>declaring the 
grain</strong>, which defines the level of detail for your analysis: are you tracking sales by individual transaction, daily summary, or monthly aggregate? This clarity is the foundation for a consistent and scalable design.</p></li><li><p>Once we define the grain, we&nbsp;<strong>identify dimensions</strong>&nbsp;that capture the process's descriptive attributes, such as product details, time, or customer demographics.</p></li><li><p>Finally, we focus on&nbsp;<strong>identifying facts</strong>, the quantitative metrics or measures tied to the process, such as sales revenue, quantity sold, or discount amounts.</p></li></ul><p>Each step builds on the last, ensuring the design supports the business's analytical needs from the bottom up while remaining easy to query and understand.</p><div><hr></div><h2>My thoughts</h2><p>Although I don&#8217;t have much experience with dimensional data modeling, my neurons still formed some thoughts on this topic while I worked at my previous companies, where data modeling was a luxury. I will write down my thoughts (not only about dimensional modeling) here, hoping to learn from experts in this field.</p><ul><li><p>The Kimball dimensional modeling approach is well-suited to how people observe their business: a measurement of a business process (fact) with its context (dimensions).</p></li><li><p>It might take less time to deliver than other approaches. If you are a newly hired data engineer on a team lacking time and resources, Kimball dimensional modeling seems a good choice. </p></li><li><p>Because the model is designed for specific analytical requirements, there is a chance that a Kimball dimensional model can&#8217;t adapt to a new requirement, and the modeler needs to model new facts and dimensions.</p></li><li><p>Adopting a proven modeling approach like the dimensional one is far better than dumping all your data into a cloud data warehouse. 
These established modeling frameworks are designed and tested to keep data understandable and manageable. If you encounter challenges in managing your data, a widely used approach lets you tap into a wealth of community knowledge and solutions. In contrast, choosing a strategy that only you or your team understands makes troubleshooting and scaling much harder.</p></li><li><p>Despite the wide adoption of dimensional modeling, companies also use other approaches, such as Inmon or Data Vault, to organize their analytics data. You must decide how to model your data based on the organization and its business; you can&#8217;t select Kimball when building a data warehouse from scratch just because you have already read <em>The Data Warehouse Toolkit</em> three times.</p></li><li><p>My experience with One Big Table (OBT) is that it proves its value only when there is a careful data modeling layer beneath it. Putting all the data in one table from the start makes you trade data understandability for query performance, which is a terrible trade.</p></li><li><p>Cloud data warehouses like BigQuery encourage users to denormalize using nested or array fields to improve performance by avoiding joins. This indirectly leads people to think that joins are bad. Data modeling puts each piece of information where it belongs, so queries need joins in the end. Consequently, people also conclude that data modeling hurts query performance. 
I have observed that BigQuery, Snowflake, and Databricks have all introduced Primary Key and Foreign Key constraints in the last few years, along with techniques that use them to optimize joins. By encouraging us to set these constraints on our tables, they encourage us to organize data properly.</p></li></ul><p>I&#8217;d love to hear from you if this has sparked ideas or questions.</p><div><hr></div><h2>Outro</h2><p>In this article, I summarized the key insights I gained from reading the first two chapters of <em>The Data Warehouse Toolkit</em>. We explored the purpose of data warehousing systems, the approach and process of dimensional modeling, and the basics of facts and dimensions. Finally, I shared some of my thoughts on this topic.</p><p>Thank you for reading this far.</p><p>See you in my next piece!</p><div><hr></div><h2>Reference</h2><p><em>[1] Ralph Kimball, Margy Ross, <a href="https://www.amazon.com/Data-Warehouse-Toolkit-Definitive-Dimensional/dp/1118530802">The Data Warehouse Toolkit: The Definitive Guide to Dimensional Modeling</a> (3rd Edition)</em></p>]]></content:encoded></item><item><title><![CDATA[8 minutes to understand Presto]]></title><description><![CDATA[Uber, Netflix, Airbnb, and LinkedIn use this query engine.]]></description><link>https://vutr.substack.com/p/8-minutes-to-understand-presto</link><guid isPermaLink="false">https://vutr.substack.com/p/8-minutes-to-understand-presto</guid><dc:creator><![CDATA[Vu Trinh]]></dc:creator><pubDate>Thu, 30 Jan 2025 03:15:46 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!gyaQ!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbebcf596-7b61-431c-b45e-c40de96bc0cd_2000x1429.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!gyaQ!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbebcf596-7b61-431c-b45e-c40de96bc0cd_2000x1429.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!gyaQ!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbebcf596-7b61-431c-b45e-c40de96bc0cd_2000x1429.png 424w, 
https://substackcdn.com/image/fetch/$s_!gyaQ!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbebcf596-7b61-431c-b45e-c40de96bc0cd_2000x1429.png 848w, https://substackcdn.com/image/fetch/$s_!gyaQ!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbebcf596-7b61-431c-b45e-c40de96bc0cd_2000x1429.png 1272w, https://substackcdn.com/image/fetch/$s_!gyaQ!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbebcf596-7b61-431c-b45e-c40de96bc0cd_2000x1429.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!gyaQ!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbebcf596-7b61-431c-b45e-c40de96bc0cd_2000x1429.png" width="1456" height="1040" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/bebcf596-7b61-431c-b45e-c40de96bc0cd_2000x1429.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1040,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:245025,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!gyaQ!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbebcf596-7b61-431c-b45e-c40de96bc0cd_2000x1429.png 424w, 
https://substackcdn.com/image/fetch/$s_!gyaQ!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbebcf596-7b61-431c-b45e-c40de96bc0cd_2000x1429.png 848w, https://substackcdn.com/image/fetch/$s_!gyaQ!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbebcf596-7b61-431c-b45e-c40de96bc0cd_2000x1429.png 1272w, https://substackcdn.com/image/fetch/$s_!gyaQ!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbebcf596-7b61-431c-b45e-c40de96bc0cd_2000x1429.png 1456w" sizes="100vw" fetchpriority="high"></picture></div></a><figcaption class="image-caption">Image created by the author.</figcaption></figure></div><div><hr></div><h2>Intro</h2><p>Apache Spark is the king of data processing.</p><p>It originated at UC Berkeley in 2009 in response to the limitations of <a href="https://en.wikipedia.org/wiki/MapReduce">MapReduce</a>.</p><p>People first adopted Spark for ETL processes. However, in 2015, the Spark team introduced SQL capability, making it an attractive option as a relational query engine.</p><p>In 2020, Databricks introduced the lakehouse paradigm. <a href="https://open.substack.com/pub/vutr/p/why-did-databricks-build-the-photon?r=2rj6sg&amp;utm_campaign=post&amp;utm_medium=web&amp;showWelcomeOnShare=false">They equipped Spark with the Photon engine</a> to make it more efficient as the query engine over the data lake.</p><p>A robust query engine operating on vast amounts of unseen data can provide many advantages.</p><p>Databricks is not the only one to realize this.</p><p>BigQuery is a query engine (Dremel) that operates on a giant storage system (Colossus).</p><p>Snowflake is a set of workers that operate on S3.</p><p>Aside from cloud data warehouses, a big tech company also joined the party.</p><p>Facebook developed an interactive SQL query engine with the same vision in 2012.</p><p>They called it Presto, with the promise of &#8220;<a href="https://trino.io/Presto_SQL_on_Everything.pdf">SQL on everything</a>.&#8221;</p><div><hr></div><h2>Overview</h2><p>Facebook developed Presto to address the growing need to extract insights from large amounts of data. 
The goal was to use SQL to make data analytics accessible to more people in the organization.</p><p>In late 2018, Facebook's data professionals used Presto for most SQL analytic workloads, including interactive/BI queries and long-running batch ETL jobs.</p><p>Presto is a distributed SQL query engine that processes hundreds of petabytes of data and quadrillions of rows daily at Facebook.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!Puiv!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F99a88a84-7109-496e-a54e-72242076a098_1090x322.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!Puiv!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F99a88a84-7109-496e-a54e-72242076a098_1090x322.png 424w, https://substackcdn.com/image/fetch/$s_!Puiv!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F99a88a84-7109-496e-a54e-72242076a098_1090x322.png 848w, https://substackcdn.com/image/fetch/$s_!Puiv!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F99a88a84-7109-496e-a54e-72242076a098_1090x322.png 1272w, https://substackcdn.com/image/fetch/$s_!Puiv!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F99a88a84-7109-496e-a54e-72242076a098_1090x322.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!Puiv!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F99a88a84-7109-496e-a54e-72242076a098_1090x322.png" width="1090" 
height="322" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/99a88a84-7109-496e-a54e-72242076a098_1090x322.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:322,&quot;width&quot;:1090,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:43689,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!Puiv!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F99a88a84-7109-496e-a54e-72242076a098_1090x322.png 424w, https://substackcdn.com/image/fetch/$s_!Puiv!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F99a88a84-7109-496e-a54e-72242076a098_1090x322.png 848w, https://substackcdn.com/image/fetch/$s_!Puiv!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F99a88a84-7109-496e-a54e-72242076a098_1090x322.png 1272w, https://substackcdn.com/image/fetch/$s_!Puiv!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F99a88a84-7109-496e-a54e-72242076a098_1090x322.png 1456w" sizes="100vw" loading="lazy"></picture></div></a><figcaption class="image-caption">Image created by the author.</figcaption></figure></div><p>Here are its characteristics:</p><ul><li><p>It can run hundreds of resource-intensive queries at the same time.</p></li><li><p>It can scale to thousands of workers.</p></li><li><p>It can query multiple data sources, even in the same query.</p></li><li><p>It can support many use cases with different constraints and performance characteristics.</p></li><li><p>It promises to operate at high performance.</p></li></ul><p>Some use cases at Facebook are:</p><ul><li><p><strong>Interactive Analytics:</strong> Engineers and data scientists use Presto to examine small amounts of data, test hypotheses, and build visualizations or dashboards.</p></li><li><p><strong>Batch ETL:</strong> Presto supports users migrating from legacy batch processing systems for ETL queries. 
These queries are more resource-intensive than interactive ones.</p></li><li><p><strong>A/B Testing:</strong>&nbsp;Presto supports Facebook's A/B testing infrastructure. It helps join multiple large datasets to produce experiment details or population information.</p></li><li><p><strong>Developer/Advertiser Analytics:</strong>&nbsp;Presto supports custom reporting tools, such as <a href="https://www.facebook.com/business/help/966883707418907">Facebook Analytics</a>, for external developers and advertisers.</p></li></ul><div><hr></div><h2>Presto or Trino</h2><p>Before learning about Presto&#8217;s architecture, let&#8217;s look at its history.</p><p>As mentioned, Facebook started developing Presto in 2012 and open-sourced it in 2013.</p><p>In 2014, Netflix shared that they used Presto on 10 petabytes of S3 data. </p><p>In 2016, Amazon announced the famous service <a href="https://en.wikipedia.org/wiki/Amazon_Athena">Athena</a>. They built Athena based on Presto.</p><p>In 2017, Starburst Data was founded to support Presto commercially.</p><p>In 2018, the original Presto developers left Facebook after a policy change gave Facebook committers more commit privileges than the open-source community.</p><p>In 2019, Presto development forked into PrestoDB, maintained by Facebook, and PrestoSQL, maintained by the Presto Software Foundation.</p><p>In the same year, Facebook donated PrestoDB to the Linux Foundation.</p><p>In December 2020, PrestoSQL was rebranded as Trino because Facebook had obtained a trademark for "Presto."</p><div><hr></div><h2>Architecture</h2><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!3w6A!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7b1d4259-ce97-4b8a-8957-065537b120bc_1302x1002.png" data-component-name="Image2ToDOM"><div 
class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!3w6A!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7b1d4259-ce97-4b8a-8957-065537b120bc_1302x1002.png 424w, https://substackcdn.com/image/fetch/$s_!3w6A!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7b1d4259-ce97-4b8a-8957-065537b120bc_1302x1002.png 848w, https://substackcdn.com/image/fetch/$s_!3w6A!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7b1d4259-ce97-4b8a-8957-065537b120bc_1302x1002.png 1272w, https://substackcdn.com/image/fetch/$s_!3w6A!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7b1d4259-ce97-4b8a-8957-065537b120bc_1302x1002.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!3w6A!,w_2400,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7b1d4259-ce97-4b8a-8957-065537b120bc_1302x1002.png" width="1200" height="923.5023041474655" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/7b1d4259-ce97-4b8a-8957-065537b120bc_1302x1002.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:false,&quot;imageSize&quot;:&quot;large&quot;,&quot;height&quot;:1002,&quot;width&quot;:1302,&quot;resizeWidth&quot;:1200,&quot;bytes&quot;:252928,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-large" alt="" 
srcset="https://substackcdn.com/image/fetch/$s_!3w6A!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7b1d4259-ce97-4b8a-8957-065537b120bc_1302x1002.png 424w, https://substackcdn.com/image/fetch/$s_!3w6A!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7b1d4259-ce97-4b8a-8957-065537b120bc_1302x1002.png 848w, https://substackcdn.com/image/fetch/$s_!3w6A!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7b1d4259-ce97-4b8a-8957-065537b120bc_1302x1002.png 1272w, https://substackcdn.com/image/fetch/$s_!3w6A!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7b1d4259-ce97-4b8a-8957-065537b120bc_1302x1002.png 1456w" sizes="100vw" loading="lazy"></picture></div></a><figcaption class="image-caption">Image created by the author.</figcaption></figure></div><p>A Presto cluster has a coordinator node and a set of worker nodes:</p><ul><li><p>The coordinator parses, plans, and orchestrates queries.</p></li><li><p>The workers execute the query.</p></li></ul><p>Here is a typical flow:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!aoi7!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F31e9fd29-66f7-46de-b5eb-ddbea5844658_1580x1004.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!aoi7!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F31e9fd29-66f7-46de-b5eb-ddbea5844658_1580x1004.png 424w, https://substackcdn.com/image/fetch/$s_!aoi7!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F31e9fd29-66f7-46de-b5eb-ddbea5844658_1580x1004.png 848w, https://substackcdn.com/image/fetch/$s_!aoi7!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F31e9fd29-66f7-46de-b5eb-ddbea5844658_1580x1004.png 1272w, https://substackcdn.com/image/fetch/$s_!aoi7!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F31e9fd29-66f7-46de-b5eb-ddbea5844658_1580x1004.png 1456w" 
sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!aoi7!,w_2400,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F31e9fd29-66f7-46de-b5eb-ddbea5844658_1580x1004.png" width="1200" height="762.3626373626373" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/31e9fd29-66f7-46de-b5eb-ddbea5844658_1580x1004.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:false,&quot;imageSize&quot;:&quot;large&quot;,&quot;height&quot;:925,&quot;width&quot;:1456,&quot;resizeWidth&quot;:1200,&quot;bytes&quot;:299993,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-large" alt="" srcset="https://substackcdn.com/image/fetch/$s_!aoi7!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F31e9fd29-66f7-46de-b5eb-ddbea5844658_1580x1004.png 424w, https://substackcdn.com/image/fetch/$s_!aoi7!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F31e9fd29-66f7-46de-b5eb-ddbea5844658_1580x1004.png 848w, https://substackcdn.com/image/fetch/$s_!aoi7!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F31e9fd29-66f7-46de-b5eb-ddbea5844658_1580x1004.png 1272w, https://substackcdn.com/image/fetch/$s_!aoi7!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F31e9fd29-66f7-46de-b5eb-ddbea5844658_1580x1004.png 1456w" sizes="100vw" loading="lazy"></picture></div></a><figcaption class="image-caption">Image created by the author.</figcaption></figure></div><ul><li><p>The client sends an HTTP request with the SQL statement to the coordinator.</p></li><li><p>The coordinator parses and analyzes the SQL.</p></li><li><p>It then creates and optimizes the execution plan.</p></li><li><p>The coordinator sends the plan to the workers.</p></li><li><p>Workers start executing the tasks, operating on splits, which are chunks of data in an external storage system.</p></li><li><p>Workers' inputs are remote splits or intermediate results from upstream workers. 
Workers store intermediate data in memory as much as possible.</p></li></ul><p>Facebook designed Presto with extensibility in mind and introduced a plugin interface for it. The interface lets users make many customizations:</p><ul><li><p>Custom data types</p></li><li><p>Custom functions</p></li><li><p>Custom access control implementations</p></li><li><p>Custom queuing policies</p></li><li><p>Custom connectors, which enable Presto to communicate with external data stores through the Connector API. This API has four parts: the Metadata API, Data Location API, Data Source API, and Data Sink API.</p></li></ul><div><hr></div><h2>Key Design Decisions</h2><h3>SQL Dialect</h3><p>Presto adheres to ANSI SQL to achieve broad compatibility. Facebook also added selected extensions to ANSI SQL, such as lambda expressions and higher-order functions, to improve usability with complex data types like maps and arrays.</p><h3>Client Interface</h3><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!Vl3A!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F74f41092-e0b5-452f-8a31-15248b319c38_474x320.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!Vl3A!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F74f41092-e0b5-452f-8a31-15248b319c38_474x320.png 424w, https://substackcdn.com/image/fetch/$s_!Vl3A!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F74f41092-e0b5-452f-8a31-15248b319c38_474x320.png 848w, 
https://substackcdn.com/image/fetch/$s_!Vl3A!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F74f41092-e0b5-452f-8a31-15248b319c38_474x320.png 1272w, https://substackcdn.com/image/fetch/$s_!Vl3A!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F74f41092-e0b5-452f-8a31-15248b319c38_474x320.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!Vl3A!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F74f41092-e0b5-452f-8a31-15248b319c38_474x320.png" width="474" height="320" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/74f41092-e0b5-452f-8a31-15248b319c38_474x320.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:320,&quot;width&quot;:474,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:25894,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!Vl3A!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F74f41092-e0b5-452f-8a31-15248b319c38_474x320.png 424w, https://substackcdn.com/image/fetch/$s_!Vl3A!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F74f41092-e0b5-452f-8a31-15248b319c38_474x320.png 848w, 
https://substackcdn.com/image/fetch/$s_!Vl3A!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F74f41092-e0b5-452f-8a31-15248b319c38_474x320.png 1272w, https://substackcdn.com/image/fetch/$s_!Vl3A!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F74f41092-e0b5-452f-8a31-15248b319c38_474x320.png 1456w" sizes="100vw" loading="lazy"></picture></div></a><figcaption class="image-caption">Image created by the author.</figcaption></figure></div><p>Presto provides multiple client interfaces:</p><ul><li><p>A 
RESTful HTTP interface for clients.</p></li><li><p>A command-line interface.</p></li><li><p>A JDBC client, enabling compatibility with BI tools like Tableau.</p></li></ul><h3>Query Planning And Optimization</h3><p>The logical planner generates an intermediate representation (IR) of the query plan from the syntax tree. The IR is a tree of plan nodes. Each node represents a physical or logical operation and takes the output of its children as input.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!DfM_!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffdbbac06-461c-405a-9ae5-d1b21a74e3db_516x314.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!DfM_!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffdbbac06-461c-405a-9ae5-d1b21a74e3db_516x314.png 424w, https://substackcdn.com/image/fetch/$s_!DfM_!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffdbbac06-461c-405a-9ae5-d1b21a74e3db_516x314.png 848w, https://substackcdn.com/image/fetch/$s_!DfM_!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffdbbac06-461c-405a-9ae5-d1b21a74e3db_516x314.png 1272w, https://substackcdn.com/image/fetch/$s_!DfM_!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffdbbac06-461c-405a-9ae5-d1b21a74e3db_516x314.png 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!DfM_!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffdbbac06-461c-405a-9ae5-d1b21a74e3db_516x314.png" width="516" height="314" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/fdbbac06-461c-405a-9ae5-d1b21a74e3db_516x314.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:314,&quot;width&quot;:516,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:28851,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!DfM_!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffdbbac06-461c-405a-9ae5-d1b21a74e3db_516x314.png 424w, https://substackcdn.com/image/fetch/$s_!DfM_!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffdbbac06-461c-405a-9ae5-d1b21a74e3db_516x314.png 848w, https://substackcdn.com/image/fetch/$s_!DfM_!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffdbbac06-461c-405a-9ae5-d1b21a74e3db_516x314.png 1272w, https://substackcdn.com/image/fetch/$s_!DfM_!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffdbbac06-461c-405a-9ae5-d1b21a74e3db_516x314.png 1456w" sizes="100vw" loading="lazy"></picture></div></a><figcaption class="image-caption">Image created by the author.</figcaption></figure></div><p>The query optimizer turns the logical plan into a physical plan by applying a set of transformation rules, such as predicate and limit pushdown, column pruning, and decorrelation.</p><h3>Data Layouts</h3><p>Presto uses the physical layout of the data, exposed through the connector's Data Layout API, to optimize queries. Layout information includes where the data is located and how it is partitioned, indexed, sorted, and grouped. 
</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!qlbG!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6a8e619d-c25e-4d35-99a7-4154750ba594_646x262.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!qlbG!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6a8e619d-c25e-4d35-99a7-4154750ba594_646x262.png 424w, https://substackcdn.com/image/fetch/$s_!qlbG!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6a8e619d-c25e-4d35-99a7-4154750ba594_646x262.png 848w, https://substackcdn.com/image/fetch/$s_!qlbG!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6a8e619d-c25e-4d35-99a7-4154750ba594_646x262.png 1272w, https://substackcdn.com/image/fetch/$s_!qlbG!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6a8e619d-c25e-4d35-99a7-4154750ba594_646x262.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!qlbG!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6a8e619d-c25e-4d35-99a7-4154750ba594_646x262.png" width="646" height="262" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/6a8e619d-c25e-4d35-99a7-4154750ba594_646x262.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:262,&quot;width&quot;:646,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:33868,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!qlbG!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6a8e619d-c25e-4d35-99a7-4154750ba594_646x262.png 424w, https://substackcdn.com/image/fetch/$s_!qlbG!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6a8e619d-c25e-4d35-99a7-4154750ba594_646x262.png 848w, https://substackcdn.com/image/fetch/$s_!qlbG!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6a8e619d-c25e-4d35-99a7-4154750ba594_646x262.png 1272w, https://substackcdn.com/image/fetch/$s_!qlbG!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6a8e619d-c25e-4d35-99a7-4154750ba594_646x262.png 1456w" sizes="100vw" loading="lazy"></picture></div></a><figcaption class="image-caption">Image created by the author.</figcaption></figure></div><p>A connector can return more than one layout for a table, and the optimizer selects the most efficient one for the query (e.g., using the partitioning while ignoring the sort order).</p><h3>Predicate Pushdown</h3><p>Presto can push predicates down to the data source through the connector to improve filtering efficiency. 
The optimizer will talk with the connector to decide when to execute this technique.</p><h3>Inter-node parallelism</h3><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!HBuo!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc59d26a3-ef00-4e83-af56-c2aacacd4ef0_392x346.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!HBuo!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc59d26a3-ef00-4e83-af56-c2aacacd4ef0_392x346.png 424w, https://substackcdn.com/image/fetch/$s_!HBuo!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc59d26a3-ef00-4e83-af56-c2aacacd4ef0_392x346.png 848w, https://substackcdn.com/image/fetch/$s_!HBuo!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc59d26a3-ef00-4e83-af56-c2aacacd4ef0_392x346.png 1272w, https://substackcdn.com/image/fetch/$s_!HBuo!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc59d26a3-ef00-4e83-af56-c2aacacd4ef0_392x346.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!HBuo!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc59d26a3-ef00-4e83-af56-c2aacacd4ef0_392x346.png" width="392" height="346" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/c59d26a3-ef00-4e83-af56-c2aacacd4ef0_392x346.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:346,&quot;width&quot;:392,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:55712,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!HBuo!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc59d26a3-ef00-4e83-af56-c2aacacd4ef0_392x346.png 424w, https://substackcdn.com/image/fetch/$s_!HBuo!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc59d26a3-ef00-4e83-af56-c2aacacd4ef0_392x346.png 848w, https://substackcdn.com/image/fetch/$s_!HBuo!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc59d26a3-ef00-4e83-af56-c2aacacd4ef0_392x346.png 1272w, https://substackcdn.com/image/fetch/$s_!HBuo!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc59d26a3-ef00-4e83-af56-c2aacacd4ef0_392x346.png 1456w" sizes="100vw" loading="lazy"></picture></div></a><figcaption class="image-caption">Image created by the author.</figcaption></figure></div><p>The optimizer also identifies which plan stages can run in parallel across workers. A stage can have many tasks, each executing the same logic on a different subset of the input data. Stages exchange data through shuffles, which add latency and use a lot of CPU and memory. 
Thus, the optimizer must consider the number of shuffles in a plan.</p><h3>Intra-node parallelism</h3><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!7Aj9!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc25897c4-6915-431d-938c-57aca43ae351_404x250.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!7Aj9!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc25897c4-6915-431d-938c-57aca43ae351_404x250.png 424w, https://substackcdn.com/image/fetch/$s_!7Aj9!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc25897c4-6915-431d-938c-57aca43ae351_404x250.png 848w, https://substackcdn.com/image/fetch/$s_!7Aj9!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc25897c4-6915-431d-938c-57aca43ae351_404x250.png 1272w, https://substackcdn.com/image/fetch/$s_!7Aj9!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc25897c4-6915-431d-938c-57aca43ae351_404x250.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!7Aj9!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc25897c4-6915-431d-938c-57aca43ae351_404x250.png" width="404" height="250" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/c25897c4-6915-431d-938c-57aca43ae351_404x250.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:250,&quot;width&quot;:404,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:31820,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!7Aj9!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc25897c4-6915-431d-938c-57aca43ae351_404x250.png 424w, https://substackcdn.com/image/fetch/$s_!7Aj9!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc25897c4-6915-431d-938c-57aca43ae351_404x250.png 848w, https://substackcdn.com/image/fetch/$s_!7Aj9!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc25897c4-6915-431d-938c-57aca43ae351_404x250.png 1272w, https://substackcdn.com/image/fetch/$s_!7Aj9!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc25897c4-6915-431d-938c-57aca43ae351_404x250.png 1456w" sizes="100vw" loading="lazy"></picture></div></a><figcaption class="image-caption">Image created by the author.</figcaption></figure></div><p>The optimizer can also identify sections within a plan stage that can be parallelized across threads on a single worker. This is much more efficient than inter-node parallelism because threads can share in-memory data, such as hash tables or dictionaries, with less overhead.</p><h3>Scheduling</h3><p>To execute a query, the engine makes two scheduling decisions:</p><ul><li><p><strong>Stage Scheduling</strong>: Presto supports two policies: all-at-once and phased. The all-at-once policy schedules all stages concurrently, which benefits latency-sensitive use cases such as Interactive Analytics. The phased policy executes stages in topological order. For example, in a hash join, Presto does not schedule the probe-phase tasks until the build phase has finished. 
The phased policy improves memory efficiency for the batch use case.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!iYqJ!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F065e51ba-6cca-4475-b5c2-415e5b8a2af7_642x294.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!iYqJ!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F065e51ba-6cca-4475-b5c2-415e5b8a2af7_642x294.png 424w, https://substackcdn.com/image/fetch/$s_!iYqJ!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F065e51ba-6cca-4475-b5c2-415e5b8a2af7_642x294.png 848w, https://substackcdn.com/image/fetch/$s_!iYqJ!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F065e51ba-6cca-4475-b5c2-415e5b8a2af7_642x294.png 1272w, https://substackcdn.com/image/fetch/$s_!iYqJ!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F065e51ba-6cca-4475-b5c2-415e5b8a2af7_642x294.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!iYqJ!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F065e51ba-6cca-4475-b5c2-415e5b8a2af7_642x294.png" width="642" height="294" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/065e51ba-6cca-4475-b5c2-415e5b8a2af7_642x294.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:294,&quot;width&quot;:642,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:47816,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!iYqJ!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F065e51ba-6cca-4475-b5c2-415e5b8a2af7_642x294.png 424w, https://substackcdn.com/image/fetch/$s_!iYqJ!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F065e51ba-6cca-4475-b5c2-415e5b8a2af7_642x294.png 848w, https://substackcdn.com/image/fetch/$s_!iYqJ!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F065e51ba-6cca-4475-b5c2-415e5b8a2af7_642x294.png 1272w, https://substackcdn.com/image/fetch/$s_!iYqJ!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F065e51ba-6cca-4475-b5c2-415e5b8a2af7_642x294.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" 
xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Image created by the author.</figcaption></figure></div></li></ul><blockquote><p><em>In a hash join, the build phase creates a lookup table (by hashing) from one dataset. The probe phase then scans the other dataset and uses this table to find matching rows.</em></p></blockquote><ul><li><p><strong>Task Scheduling</strong>: The task scheduler categorizes stages into leaf and intermediate. <strong>Leaf stages </strong>read data from connectors; placement considers network and connector constraints. 
<strong>Intermediate Stages </strong>process intermediate results; they can be placed on any worker node.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!IAY9!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F92eae417-2b1e-4249-99ee-14b0c0704f79_544x360.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!IAY9!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F92eae417-2b1e-4249-99ee-14b0c0704f79_544x360.png 424w, https://substackcdn.com/image/fetch/$s_!IAY9!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F92eae417-2b1e-4249-99ee-14b0c0704f79_544x360.png 848w, https://substackcdn.com/image/fetch/$s_!IAY9!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F92eae417-2b1e-4249-99ee-14b0c0704f79_544x360.png 1272w, https://substackcdn.com/image/fetch/$s_!IAY9!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F92eae417-2b1e-4249-99ee-14b0c0704f79_544x360.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!IAY9!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F92eae417-2b1e-4249-99ee-14b0c0704f79_544x360.png" width="544" height="360" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/92eae417-2b1e-4249-99ee-14b0c0704f79_544x360.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:360,&quot;width&quot;:544,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:33968,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!IAY9!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F92eae417-2b1e-4249-99ee-14b0c0704f79_544x360.png 424w, https://substackcdn.com/image/fetch/$s_!IAY9!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F92eae417-2b1e-4249-99ee-14b0c0704f79_544x360.png 848w, https://substackcdn.com/image/fetch/$s_!IAY9!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F92eae417-2b1e-4249-99ee-14b0c0704f79_544x360.png 1272w, https://substackcdn.com/image/fetch/$s_!IAY9!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F92eae417-2b1e-4249-99ee-14b0c0704f79_544x360.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" 
xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Image created by the author.</figcaption></figure></div></li></ul><p>In a leaf stage, a task receives one or more splits (chunks of data) from the external system. The coordinator must assign at least one split to a leaf-stage task before it becomes eligible to run. Intermediate-stage tasks are always eligible to run, and they finish only when all their upstream tasks have completed.</p><p>The coordinator starts assigning splits once Presto has set up the tasks on the worker nodes. Presto asks connectors to enumerate splits in small batches and assigns them to tasks lazily. This has several benefits:</p><ul><li><p>Queries that don't need to process all the data, like those with filters or LIMIT clauses, can finish or be canceled early.</p></li><li><p>It decouples the time to the first result from the total time the connector needs to enumerate all splits. 
This is useful when connectors like Hive take significant time to list all partitions and files.</p></li><li><p>Lazy enumeration avoids holding all split metadata in memory; with the Hive connector, a single query may involve millions of splits.</p></li><li><p>Each worker keeps a queue of assigned splits. The coordinator assigns new splits to the tasks with the shortest queue, which keeps queues small and absorbs variations in processing time across splits and in performance across workers.</p></li></ul><h3>Query Execution</h3><p>Once a split is assigned to a thread, the driver loop executes it. The unit of data the driver loop operates on is a page, a columnar encoding of a sequence of rows. </p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!WuPn!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd6ab8f96-d25d-41d2-a934-63b9d44010c1_1528x658.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!WuPn!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd6ab8f96-d25d-41d2-a934-63b9d44010c1_1528x658.png 424w, https://substackcdn.com/image/fetch/$s_!WuPn!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd6ab8f96-d25d-41d2-a934-63b9d44010c1_1528x658.png 848w, https://substackcdn.com/image/fetch/$s_!WuPn!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd6ab8f96-d25d-41d2-a934-63b9d44010c1_1528x658.png 1272w, https://substackcdn.com/image/fetch/$s_!WuPn!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd6ab8f96-d25d-41d2-a934-63b9d44010c1_1528x658.png 1456w" 
sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!WuPn!,w_2400,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd6ab8f96-d25d-41d2-a934-63b9d44010c1_1528x658.png" width="1200" height="516.7582417582418" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/d6ab8f96-d25d-41d2-a934-63b9d44010c1_1528x658.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:false,&quot;imageSize&quot;:&quot;large&quot;,&quot;height&quot;:627,&quot;width&quot;:1456,&quot;resizeWidth&quot;:1200,&quot;bytes&quot;:200199,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-large" alt="" srcset="https://substackcdn.com/image/fetch/$s_!WuPn!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd6ab8f96-d25d-41d2-a934-63b9d44010c1_1528x658.png 424w, https://substackcdn.com/image/fetch/$s_!WuPn!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd6ab8f96-d25d-41d2-a934-63b9d44010c1_1528x658.png 848w, https://substackcdn.com/image/fetch/$s_!WuPn!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd6ab8f96-d25d-41d2-a934-63b9d44010c1_1528x658.png 1272w, https://substackcdn.com/image/fetch/$s_!WuPn!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd6ab8f96-d25d-41d2-a934-63b9d44010c1_1528x658.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button 
tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Image created by the author.</figcaption></figure></div><p>Presto uses in-memory buffered shuffles over HTTP to exchange data between worker nodes. Workers keep the data they produce in memory buffers, and other workers consume it by polling over HTTP. The engine tunes parallelism to maintain target utilization rates for output and input buffers. Full output buffers cause split execution to stall and tie up memory, while underutilized input buffers add unnecessary processing overhead.</p><p>When writing results, Presto takes an adaptive approach and increases writer concurrency dynamically. 
</p><h3>Resource management</h3><p>Presto is well suited to multitenant deployments because of its fine-grained resource management system; a single cluster can handle hundreds of queries at the same time.</p><p>Facebook designed Presto's CPU scheduling mechanism to maximize overall cluster throughput; it prioritizes the total CPU time spent processing data.</p><p>Presto uses a cooperative multitasking model and schedules concurrent tasks on every worker node to achieve multi-tenancy. A given split can only run on a thread for a maximum execution time slice, called a quantum. After that time, the split must give up the thread, whether it has finished or not. This approach ensures that no single split takes all the resources and allows for efficient sharing among multiple queries.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!XvzE!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbea56aff-4e23-4f8e-b6d4-3fae287dd2aa_528x328.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!XvzE!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbea56aff-4e23-4f8e-b6d4-3fae287dd2aa_528x328.png 424w, https://substackcdn.com/image/fetch/$s_!XvzE!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbea56aff-4e23-4f8e-b6d4-3fae287dd2aa_528x328.png 848w, https://substackcdn.com/image/fetch/$s_!XvzE!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbea56aff-4e23-4f8e-b6d4-3fae287dd2aa_528x328.png 1272w, 
https://substackcdn.com/image/fetch/$s_!XvzE!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbea56aff-4e23-4f8e-b6d4-3fae287dd2aa_528x328.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!XvzE!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbea56aff-4e23-4f8e-b6d4-3fae287dd2aa_528x328.png" width="528" height="328" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/bea56aff-4e23-4f8e-b6d4-3fae287dd2aa_528x328.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:328,&quot;width&quot;:528,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:28823,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!XvzE!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbea56aff-4e23-4f8e-b6d4-3fae287dd2aa_528x328.png 424w, https://substackcdn.com/image/fetch/$s_!XvzE!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbea56aff-4e23-4f8e-b6d4-3fae287dd2aa_528x328.png 848w, https://substackcdn.com/image/fetch/$s_!XvzE!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbea56aff-4e23-4f8e-b6d4-3fae287dd2aa_528x328.png 1272w, 
https://substackcdn.com/image/fetch/$s_!XvzE!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbea56aff-4e23-4f8e-b6d4-3fae287dd2aa_528x328.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Image created by the author.</figcaption></figure></div><blockquote><p><em><a href="https://www.devx.com/terms/cooperative-multitasking/">Cooperative multitasking</a> is a multitasking method used by operating systems where each running process must periodically signal that it has completed its task or that it no longer needs CPU resources to allow 
other processes to execute. This approach relies on the voluntary cooperation of each process to yield control of system resources to other processes.</em></p></blockquote><p>To cope with long-running computations in a cooperative multitasking environment, Presto gives operators a way to yield control voluntarily. If an operator exceeds its quantum, the scheduler &#8220;charges&#8221; the task with the thread time it actually used, temporarily reducing its future execution frequency. This adaptability ensures efficient resource sharing even with diverse query shapes.</p><p>Instead of predicting resource needs in advance, Presto classifies tasks based on their accumulated CPU time. As a task uses more CPU, it moves to higher queue levels, each of which receives a configurable fraction of the available CPU time. This strategy ensures that less demanding queries still receive resources, because they accumulate less CPU time and remain in the lower queue levels. It reflects the expectation that users want fast responses for interactive queries and are less sensitive to the turnaround time of intensive jobs.</p><p>With CPU covered, let's look at how Presto manages memory.</p><p>Presto categorizes memory allocations as user or system memory. User memory is usage that users can estimate from their understanding of the query and the data. System memory is usage that results from implementation choices, such as shuffle buffers.</p><p>Presto enforces separate limits on user memory and total memory (user + system). It kills a query that exceeds a cluster-wide limit or a per-node limit. These separate limits provide flexibility in managing diverse workloads.</p><p>When a worker node's memory is exhausted, Presto halts task processing on that node. 
Presto employs several strategies to address memory pressure and prevent cluster instability:</p><ul><li><p><strong>Spilling</strong>: Presto can revoke memory from eligible tasks when a node runs out of memory by writing their in-memory state to disk. Presto prioritizes the process based on task execution time, starting with the longest-running tasks. Of course, spilling to disk will increase the overall query response time. At Facebook, they don&#8217;t enable spilling by default because users appreciate the predictable latency of  in-memory execution.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!vfQN!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3ef71d04-cf8d-4f85-a133-81b5f430fd8e_400x298.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!vfQN!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3ef71d04-cf8d-4f85-a133-81b5f430fd8e_400x298.png 424w, https://substackcdn.com/image/fetch/$s_!vfQN!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3ef71d04-cf8d-4f85-a133-81b5f430fd8e_400x298.png 848w, https://substackcdn.com/image/fetch/$s_!vfQN!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3ef71d04-cf8d-4f85-a133-81b5f430fd8e_400x298.png 1272w, https://substackcdn.com/image/fetch/$s_!vfQN!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3ef71d04-cf8d-4f85-a133-81b5f430fd8e_400x298.png 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!vfQN!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3ef71d04-cf8d-4f85-a133-81b5f430fd8e_400x298.png" width="400" height="298" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/3ef71d04-cf8d-4f85-a133-81b5f430fd8e_400x298.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:298,&quot;width&quot;:400,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:83539,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!vfQN!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3ef71d04-cf8d-4f85-a133-81b5f430fd8e_400x298.png 424w, https://substackcdn.com/image/fetch/$s_!vfQN!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3ef71d04-cf8d-4f85-a133-81b5f430fd8e_400x298.png 848w, https://substackcdn.com/image/fetch/$s_!vfQN!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3ef71d04-cf8d-4f85-a133-81b5f430fd8e_400x298.png 1272w, https://substackcdn.com/image/fetch/$s_!vfQN!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3ef71d04-cf8d-4f85-a133-81b5f430fd8e_400x298.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset 
pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Image created by the author.</figcaption></figure></div></li><li><p><strong>Reserved Pool:&nbsp;</strong>Another mechanism is the reserved memory pool. Presto divides each node&#8217;s memory pool into a general pool and a reserved pool. When the general pool on a worker is exhausted, Presto promotes the query using the most memory on that worker to the reserved pool. 
The system counts this query's memory usage against the reserved pool, preventing it from competing with other queries for the general pool.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!J-ns!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6b3bee58-2c52-4a7d-8721-5a3509a073a0_392x304.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!J-ns!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6b3bee58-2c52-4a7d-8721-5a3509a073a0_392x304.png 424w, https://substackcdn.com/image/fetch/$s_!J-ns!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6b3bee58-2c52-4a7d-8721-5a3509a073a0_392x304.png 848w, https://substackcdn.com/image/fetch/$s_!J-ns!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6b3bee58-2c52-4a7d-8721-5a3509a073a0_392x304.png 1272w, https://substackcdn.com/image/fetch/$s_!J-ns!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6b3bee58-2c52-4a7d-8721-5a3509a073a0_392x304.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!J-ns!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6b3bee58-2c52-4a7d-8721-5a3509a073a0_392x304.png" width="392" height="304" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/6b3bee58-2c52-4a7d-8721-5a3509a073a0_392x304.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:304,&quot;width&quot;:392,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:82974,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!J-ns!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6b3bee58-2c52-4a7d-8721-5a3509a073a0_392x304.png 424w, https://substackcdn.com/image/fetch/$s_!J-ns!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6b3bee58-2c52-4a7d-8721-5a3509a073a0_392x304.png 848w, https://substackcdn.com/image/fetch/$s_!J-ns!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6b3bee58-2c52-4a7d-8721-5a3509a073a0_392x304.png 1272w, https://substackcdn.com/image/fetch/$s_!J-ns!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6b3bee58-2c52-4a7d-8721-5a3509a073a0_392x304.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" 
xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Image created by the author.</figcaption></figure></div></li></ul><h3>Fault Tolerance</h3><p>Here is what happens when a component fails:</p><ul><li><p><strong>Coordinator:</strong> If the coordinator fails, the cluster becomes unavailable.</p></li><li><p><strong>Worker Node: </strong>If a worker node crashes, every query with tasks running on that node fails.</p></li></ul><p>To mitigate the impact of these failures, Presto relies on external mechanisms:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!1FXB!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4ad35fc9-f2f8-4b74-b58c-a452afa8cb2c_942x374.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" 
srcset="https://substackcdn.com/image/fetch/$s_!1FXB!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4ad35fc9-f2f8-4b74-b58c-a452afa8cb2c_942x374.png 424w, https://substackcdn.com/image/fetch/$s_!1FXB!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4ad35fc9-f2f8-4b74-b58c-a452afa8cb2c_942x374.png 848w, https://substackcdn.com/image/fetch/$s_!1FXB!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4ad35fc9-f2f8-4b74-b58c-a452afa8cb2c_942x374.png 1272w, https://substackcdn.com/image/fetch/$s_!1FXB!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4ad35fc9-f2f8-4b74-b58c-a452afa8cb2c_942x374.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!1FXB!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4ad35fc9-f2f8-4b74-b58c-a452afa8cb2c_942x374.png" width="942" height="374" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/4ad35fc9-f2f8-4b74-b58c-a452afa8cb2c_942x374.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:374,&quot;width&quot;:942,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:80074,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" 
srcset="https://substackcdn.com/image/fetch/$s_!1FXB!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4ad35fc9-f2f8-4b74-b58c-a452afa8cb2c_942x374.png 424w, https://substackcdn.com/image/fetch/$s_!1FXB!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4ad35fc9-f2f8-4b74-b58c-a452afa8cb2c_942x374.png 848w, https://substackcdn.com/image/fetch/$s_!1FXB!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4ad35fc9-f2f8-4b74-b58c-a452afa8cb2c_942x374.png 1272w, https://substackcdn.com/image/fetch/$s_!1FXB!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4ad35fc9-f2f8-4b74-b58c-a452afa8cb2c_942x374.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" 
stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Image created by the author.</figcaption></figure></div><ul><li><p><strong>Standby Coordinator:</strong> Facebook keeps a backup coordinator ready to take over if the primary one fails.</p></li><li><p><strong>Multiple Active Clusters:</strong> Facebook runs multiple active Presto clusters. If one cluster fails, queries can run on another available cluster.</p></li><li><p><strong>External Monitoring:</strong> External systems monitor Presto clusters, identify failing nodes, and remove them from the cluster.</p></li></ul><p>While these mechanisms reduce downtime, they can't eliminate it. Implementing traditional fault tolerance methods like checkpointing or replication is challenging and resource-intensive. At the time the paper was written, Facebook was working to improve fault tolerance for long-running queries.</p><div><hr></div><h2>Optimization</h2><blockquote><p><em>Facebook implemented several techniques to optimize Presto's query processing.</em></p></blockquote><h3>JVM and Code Generation</h3><p>Because Presto is written in Java, it leverages the strengths of the Java Virtual Machine (JVM) while minimizing the impact of its limitations. Presto relies on the JVM's Just-In-Time (JIT) compiler to optimize performance-critical code. 
</p><p>Presto avoids allocating large objects to prevent performance issues and uses flat memory arrays for critical data structures, reducing garbage collection overhead.</p><h3>File Format Features</h3><p>Presto utilizes features of columnar file formats to optimize data processing:</p><ul><li><p><strong>Data Skipping</strong>: Custom readers for formats like ORC and Parquet use statistics in file headers and footers (e.g., min-max ranges, Bloom filters) to efficiently skip irrelevant data sections.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!9MHK!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9cd5b2b0-8474-41b2-9043-04847867e5eb_622x426.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!9MHK!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9cd5b2b0-8474-41b2-9043-04847867e5eb_622x426.png 424w, https://substackcdn.com/image/fetch/$s_!9MHK!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9cd5b2b0-8474-41b2-9043-04847867e5eb_622x426.png 848w, https://substackcdn.com/image/fetch/$s_!9MHK!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9cd5b2b0-8474-41b2-9043-04847867e5eb_622x426.png 1272w, https://substackcdn.com/image/fetch/$s_!9MHK!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9cd5b2b0-8474-41b2-9043-04847867e5eb_622x426.png 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!9MHK!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9cd5b2b0-8474-41b2-9043-04847867e5eb_622x426.png" width="622" height="426" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/9cd5b2b0-8474-41b2-9043-04847867e5eb_622x426.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:426,&quot;width&quot;:622,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:109877,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!9MHK!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9cd5b2b0-8474-41b2-9043-04847867e5eb_622x426.png 424w, https://substackcdn.com/image/fetch/$s_!9MHK!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9cd5b2b0-8474-41b2-9043-04847867e5eb_622x426.png 848w, https://substackcdn.com/image/fetch/$s_!9MHK!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9cd5b2b0-8474-41b2-9043-04847867e5eb_622x426.png 1272w, https://substackcdn.com/image/fetch/$s_!9MHK!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9cd5b2b0-8474-41b2-9043-04847867e5eb_622x426.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset 
pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Image created by the author.</figcaption></figure></div></li><li><p><strong>Direct Block Conversion</strong>: The readers can directly convert compressed data into Presto's native block format, enabling efficient processing without decompression overhead.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!_uE-!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F291eeeef-a905-44b2-add0-a7c0291005e1_474x332.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" 
srcset="https://substackcdn.com/image/fetch/$s_!_uE-!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F291eeeef-a905-44b2-add0-a7c0291005e1_474x332.png 424w, https://substackcdn.com/image/fetch/$s_!_uE-!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F291eeeef-a905-44b2-add0-a7c0291005e1_474x332.png 848w, https://substackcdn.com/image/fetch/$s_!_uE-!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F291eeeef-a905-44b2-add0-a7c0291005e1_474x332.png 1272w, https://substackcdn.com/image/fetch/$s_!_uE-!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F291eeeef-a905-44b2-add0-a7c0291005e1_474x332.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!_uE-!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F291eeeef-a905-44b2-add0-a7c0291005e1_474x332.png" width="474" height="332" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/291eeeef-a905-44b2-add0-a7c0291005e1_474x332.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:332,&quot;width&quot;:474,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:74665,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" 
srcset="https://substackcdn.com/image/fetch/$s_!_uE-!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F291eeeef-a905-44b2-add0-a7c0291005e1_474x332.png 424w, https://substackcdn.com/image/fetch/$s_!_uE-!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F291eeeef-a905-44b2-add0-a7c0291005e1_474x332.png 848w, https://substackcdn.com/image/fetch/$s_!_uE-!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F291eeeef-a905-44b2-add0-a7c0291005e1_474x332.png 1272w, https://substackcdn.com/image/fetch/$s_!_uE-!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F291eeeef-a905-44b2-add0-a7c0291005e1_474x332.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" 
stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Image created by the author.</figcaption></figure></div></li></ul><h3>Working with Compressed Data</h3><p>Presto processes data in its compressed form whenever possible:</p><ul><li><p><strong>Dictionary and Run-Length-Encoded (RLE) Block Processing</strong>: Presto operates directly on dictionary and RLE blocks, taking advantage of their structure for efficient processing. It processes dictionaries in fast, unconditional loops and also exploits their structure when building hash tables for joins and aggregations.</p></li><li><p><strong>Compressed Intermediate Results</strong>: Presto produces compressed intermediate results, minimizing data movement and storage. For instance, the join processor generates dictionary or RLE blocks for output data, leveraging the existing compressed structures.</p></li></ul><h3>Lazy Data Loading</h3><p>Presto supports lazy materialization, loading and processing data only when required: it decompresses and decodes the data in a compressed block (dictionary or RLE) only when the block&#8217;s cells are accessed. 
This minimizes the amount of data fetched and processed, which can lead to significant performance gains.</p><div><hr></div><h2>Outro</h2><p>These are my notes from reading the paper <a href="https://research.facebook.com/publications/presto-sql-on-everything/">Presto: SQL on Everything</a> from Facebook.</p><p>We explored why Facebook created Presto, its history, architecture, key decisions made during its development, and the optimization techniques built into the query engine.</p><p>Thank you for reading this far.</p><p>See you in my next pieces.</p><div><hr></div><h2><strong>References</strong></h2><p><em>[1] Facebook, <a href="https://research.facebook.com/publications/presto-sql-on-everything/">Presto: SQL on Everything</a> (2019)</em></p><p><em>[2] Wikipedia, <a href="https://en.wikipedia.org/wiki/Presto_(SQL_query_engine)">Presto (SQL query engine)</a></em></p>]]></content:encoded></item></channel></rss>