<?xml version="1.0" encoding="UTF-8"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0" xmlns:itunes="http://www.itunes.com/dtds/podcast-1.0.dtd" xmlns:googleplay="http://www.google.com/schemas/play-podcasts/1.0"><channel><title><![CDATA[VuTrinh.: GroupBy.]]></title><description><![CDATA[Weekly curated data engineering resource.]]></description><link>https://vutr.substack.com/s/groupby</link><image><url>https://substackcdn.com/image/fetch/$s_!2JXp!,w_256,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffb91e59e-4df2-4ac8-b3b4-1f4e830f7b42_256x256.png</url><title>VuTrinh.: GroupBy.</title><link>https://vutr.substack.com/s/groupby</link></image><generator>Substack</generator><lastBuildDate>Sun, 03 May 2026 11:20:25 GMT</lastBuildDate><atom:link href="https://vutr.substack.com/feed" rel="self" type="application/rss+xml"/><copyright><![CDATA[Vu Trinh]]></copyright><language><![CDATA[en]]></language><webMaster><![CDATA[vutr27@substack.com]]></webMaster><itunes:owner><itunes:email><![CDATA[vutr27@substack.com]]></itunes:email><itunes:name><![CDATA[Vu Trinh]]></itunes:name></itunes:owner><itunes:author><![CDATA[Vu Trinh]]></itunes:author><googleplay:owner><![CDATA[vutr27@substack.com]]></googleplay:owner><googleplay:email><![CDATA[vutr27@substack.com]]></googleplay:email><googleplay:author><![CDATA[Vu Trinh]]></googleplay:author><itunes:block><![CDATA[Yes]]></itunes:block><item><title><![CDATA[GroupBy #44: Meta | The Data Stack]]></title><description><![CDATA[Plus: A Brief History of Modern Data Stack, How Canva collects 25 billion events per day]]></description><link>https://vutr.substack.com/p/groupby-44-meta-the-data-stack</link><guid isPermaLink="false">https://vutr.substack.com/p/groupby-44-meta-the-data-stack</guid><dc:creator><![CDATA[Vu Trinh]]></dc:creator><pubDate>Tue, 16 Jul 2024 
11:01:45 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!sv65!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbc212358-5d3e-4daf-a791-344ac3ee08ad_1400x1000.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p><em>This is <strong>GroupBy</strong>, the weekly compiled resources for data engineers.</em></p><p><em>Not subscribed yet? Here you go:</em></p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://vutr.substack.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe now&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://vutr.substack.com/subscribe?"><span>Subscribe now</span></a></p><blockquote><p><em>&#128075; Hi, my name is Vu Trinh, a data engineer.</em></p><p><em>I share my lesson and excellent resources to read in this newsletter.</em></p><p><em>Hope this issue finds you well.</em></p></blockquote><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!sv65!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbc212358-5d3e-4daf-a791-344ac3ee08ad_1400x1000.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!sv65!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbc212358-5d3e-4daf-a791-344ac3ee08ad_1400x1000.png 424w, https://substackcdn.com/image/fetch/$s_!sv65!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbc212358-5d3e-4daf-a791-344ac3ee08ad_1400x1000.png 848w, 
https://substackcdn.com/image/fetch/$s_!sv65!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbc212358-5d3e-4daf-a791-344ac3ee08ad_1400x1000.png 1272w, https://substackcdn.com/image/fetch/$s_!sv65!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbc212358-5d3e-4daf-a791-344ac3ee08ad_1400x1000.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!sv65!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbc212358-5d3e-4daf-a791-344ac3ee08ad_1400x1000.png" width="1400" height="1000" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/bc212358-5d3e-4daf-a791-344ac3ee08ad_1400x1000.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1000,&quot;width&quot;:1400,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:650198,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!sv65!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbc212358-5d3e-4daf-a791-344ac3ee08ad_1400x1000.png 424w, https://substackcdn.com/image/fetch/$s_!sv65!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbc212358-5d3e-4daf-a791-344ac3ee08ad_1400x1000.png 848w, 
https://substackcdn.com/image/fetch/$s_!sv65!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbc212358-5d3e-4daf-a791-344ac3ee08ad_1400x1000.png 1272w, https://substackcdn.com/image/fetch/$s_!sv65!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbc212358-5d3e-4daf-a791-344ac3ee08ad_1400x1000.png 1456w" sizes="100vw" fetchpriority="high"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Image created by the author.</figcaption></figure></div><div><hr></div><h1><strong>Meta: Overview of the 
internal data stack</strong></h1><h2>Intro</h2><p>Planet-scale companies like Meta, Twitter, and LinkedIn process enormous volumes of data every day to run their operations, so they must build robust data infrastructure. Today, we will take a glimpse at Meta&#8217;s internal data stack.</p><blockquote><p><em>You can find the original article from Meta <a href="https://medium.com/@AnalyticsAtMeta/data-engineering-at-meta-high-level-overview-of-the-internal-tech-stack-a200460a44fe">here</a>.</em></p></blockquote><h2>Data Warehouse</h2><p>Meta&#8217;s data warehouse is the repository that serves the company&#8217;s analytics needs. It has millions of Hive tables, physically stored using an internal fork of ORC. </p><p>Data is spread across namespaces. A namespace is either a physical (geographical) or a logical partition of the warehouse: tables are grouped into a namespace so they can be used efficiently in the same queries without moving data around. As a result, data must be replicated when a query needs tables from two different namespaces.</p><p>Data in Meta's systems is kept only as long as needed. Tables in the warehouse usually have a set retention period, such as 90 days, after which older data is either archived or deleted.</p><p>Each table is linked to an on-call group, which designates the responsible team and the point of contact for any issues or questions about the data.</p><h3>Data warehouse ingestion</h3><ul><li><p>Snapshots from operational databases (Meta&#8217;s graph database).</p></li><li><p>Logs from clients or servers.</p></li><li><p>Output of Dataswarm pipelines, mostly computed by querying other warehouse tables.</p></li></ul><h2><strong>Data discovery, data catalog</strong></h2><p>Engineers at Meta developed a web-based tool called iData, which lets users search data assets (tables, dashboards, &#8230;) by keyword. It also includes lineage features that help users trace upstream and downstream data assets. 
</p><h2><strong>The Query Engine</strong></h2><p>Meta's data warehouse is queried mainly through Presto and Spark. Most pipelines and queries are written in either Spark SQL or Presto SQL. </p><p>For complex transformations, engineers also take an imperative approach with Spark&#8217;s Java, Scala, and Python APIs.</p><p>The choice between Presto and Spark depends on the workload. Presto is more efficient for most queries, while Spark handles heavier workloads that need more memory or involve complex joins.</p><p>Presto clusters are sized to handle day-to-day ad-hoc queries, scanning billions of rows in seconds or minutes, depending on complexity.</p><h2><strong>Scuba: Real-time analytics</strong></h2><p>Scuba is Meta's real-time data analytics framework. Data and software engineers use it widely to analyze trends in real-time logging data, and software and production engineers use it for debugging.</p><h2><strong>Daiquery &amp; Bento: The notebooks</strong></h2><p>At Meta, data engineers use Daiquery daily. This web-based notebook is a single entry point for querying various data sources, including the warehouse (via Presto or Spark), Scuba, and more. Users can upgrade their Daiquery notebooks to Bento notebooks for more complex analysis. Bento is Meta&#8217;s managed Jupyter Notebook implementation, supporting Python or R code and a variety of visualization libraries in addition to queries.</p><h2><strong>Unidash: Dashboarding</strong></h2><p>Unidash is the internal tool data engineers use to create dashboards (think of <a href="https://superset.apache.org/">Apache Superset</a>). It integrates with Daiquery and many other tools; for example, engineers can write a query in Daiquery, create a graph there, and then export it to a new or existing Unidash dashboard.</p><h2><strong>Software development</strong></h2><p>Most engineers at Meta use a customized version of <a href="https://code.visualstudio.com/">Visual Studio Code</a> as an IDE. 
It has many custom plugins maintained by internal teams. They also use a fork of <a href="https://www.mercurial-scm.org/">Mercurial</a> for source control and a monorepo structure: all data pipelines and most internal tools live in a single repository.</p><h2><strong>Pipeline development</strong></h2><p>At Meta, engineers mainly write the business logic of data pipelines in SQL and wrap it in Python code for orchestration and scheduling.</p><p>Dataswarm, the internal Python library for orchestrating and scheduling pipelines, is a predecessor to <a href="https://airflow.apache.org/">Airflow</a> and is developed and maintained in-house.</p><h2><strong>Monitoring &amp; operations</strong></h2><p>Pipeline monitoring is done via a web-based tool called CDM (Central Data Manager), which can be seen as the <em>Dataswarm UI</em>.</p><p>It is the entry point to a broader set of operations:</p><ul><li><p>Identify failing tasks and find the corresponding logs</p></li><li><p>Define and run backfills</p></li><li><p>Navigate to upstream dependencies</p></li><li><p>Identify upstream blockers</p></li><li><p>Notifications</p></li><li><p>Set up and monitor data quality checks</p></li></ul><h2><strong>Outro</strong></h2><p>Thank you for reading my note. 
</p><p>Now, let&#8217;s check some curated resources I found on the internet last week.</p><div><hr></div><h1>&#128203; The list</h1><p>&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;</p><p><a href="https://discord.com/blog/how-discord-uses-open-source-tools-for-scalable-data-orchestration-transformation">How Discord uses open-source tools for scalable data orchestration &amp; transformation</a>  &#8212; 11 mins, by Zach Bluhm</p><blockquote><p><em>To continue delivering seamless service and insightful data analytics, we embraced an ambitious project: <strong>to overhaul our data orchestration infrastructure using modern, open-source tools</strong>.&nbsp;</em></p></blockquote><p><a href="https://duckdb.org/2024/07/09/memory-management">Memory Management in DuckDB</a> &#8212; 8 mins, by Mark Raasveldt</p><blockquote><p><em>In this blog post, we will cover aspects of memory management within DuckDB &#8211; and provide examples of where they are utilized.</em></p></blockquote><p><a href="https://www.canva.dev/blog/engineering/product-analytics-event-collection/">How Canva collects 25 billion events per day</a> &#8212; 10 mins, by Long Nguyen</p><blockquote><p><em>The architecture of our product analytics event delivery pipeline.</em></p></blockquote><p><a href="https://www.dataengineeringweekly.com/p/a-brief-history-of-modern-data-stack">A Brief History of Modern Data Stack</a> &#8212; 7 mins, by Ananth Packkildurai</p><blockquote><p><em>A rise &amp; fall of Modern Data Stack and what comes next?</em></p></blockquote><p><a href="https://medium.com/agoda-engineering/booking-deduplication-how-agoda-manages-duplicate-bookings-across-multiple-data-centers-08ddbe9e22f1">Booking Deduplication: How Agoda Manages Duplicate Bookings Across Multiple Data Centers (Part 1)</a> &#8212; 13 mins, Agoda Engineering Blog</p><blockquote><p><em>Agoda introduced the booking deduplication feature many years ago to prevent the creation of duplicate 
bookings.</em></p></blockquote><p><a href="https://medium.com/@mikldd/how-data-observability-fits-into-the-different-stages-in-the-data-pipeline-70d47aba8cbd">How data observability fits into the different stages in the data pipeline</a> &#8212; 9 mins, by Mikkel Dengs&#248;e</p><blockquote><p><em>In this article, we&#8217;ll look into data observability tools&#8217; role in different parts of the data pipeline and their limitations.</em></p></blockquote><div><hr></div><h2>&#128521; Previously on Dimension</h2><blockquote><p><em>Dimension is my sub-newsletter, where I note things I learn from people smarter than me in data engineering. Here is the latest article</em></p></blockquote><div class="digest-post-embed" data-attrs="{&quot;nodeId&quot;:&quot;e3b60bbe-02e2-4aee-a1d1-d8778ab1ab82&quot;,&quot;caption&quot;:&quot;My name is Vu Trinh, and I am a data engineer. I&#8217;m trying to make my life less dull by spending time learning and researching &#8220;how it works&#8220; in the data engineering field. Here is a place where I share everything I&#8217;ve learned. Not subscribe yet? 
Here you go:&quot;,&quot;cta&quot;:null,&quot;showBylines&quot;:true,&quot;size&quot;:&quot;lg&quot;,&quot;isEditorNode&quot;:true,&quot;title&quot;:&quot;Apache Kafka - Important Designs&quot;,&quot;publishedBylines&quot;:[{&quot;id&quot;:167177248,&quot;name&quot;:&quot;Vu Trinh&quot;,&quot;bio&quot;:&quot;I research and write weekly deep-dive content at vutr.substack.com&quot;,&quot;photo_url&quot;:&quot;https://substackcdn.com/image/fetch/f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4805f673-db97-4f7c-85c4-44b345a8de80_256x256.png&quot;,&quot;is_guest&quot;:false,&quot;bestseller_tier&quot;:null}],&quot;post_date&quot;:&quot;2024-07-13T11:01:27.424Z&quot;,&quot;cover_image&quot;:&quot;https://substackcdn.com/image/fetch/f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F88080c69-93ac-4db9-aedd-07208e67c91c_1397x1000.png&quot;,&quot;cover_image_alt&quot;:null,&quot;canonical_url&quot;:&quot;https://vutr.substack.com/p/apache-kafka-important-designs&quot;,&quot;section_name&quot;:&quot;Dimensions.&quot;,&quot;video_upload_id&quot;:null,&quot;id&quot;:146010668,&quot;type&quot;:&quot;newsletter&quot;,&quot;reaction_count&quot;:7,&quot;comment_count&quot;:2,&quot;publication_id&quot;:null,&quot;publication_name&quot;:&quot;VuTrinh.&quot;,&quot;publication_logo_url&quot;:&quot;https://substackcdn.com/image/fetch/f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffb91e59e-4df2-4ac8-b3b4-1f4e830f7b42_256x256.png&quot;,&quot;belowTheFold&quot;:true,&quot;youtube_url&quot;:null,&quot;show_links&quot;:null,&quot;feed_url&quot;:null}"></div><div class="pullquote"><p>Let me hear your voice, for example: </p><p>'Your newsletter is so terrible, I can't handle it anymore.'</p><p class="button-wrapper" 
data-attrs="{&quot;url&quot;:&quot;https://vutr.substack.com/p/groupby-44-meta-the-data-stack/comments&quot;,&quot;text&quot;:&quot;Leave a comment&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://vutr.substack.com/p/groupby-44-meta-the-data-stack/comments"><span>Leave a comment</span></a></p></div><div class="subscription-widget-wrap-editor" data-attrs="{&quot;url&quot;:&quot;https://vutr.substack.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe&quot;,&quot;language&quot;:&quot;en&quot;}" data-component-name="SubscribeWidgetToDOM"><div class="subscription-widget show-subscribe"><div class="preamble"><p class="cta-caption">Thanks for scrolling this far! There's a convenient subscribe box here if you want me to annoy you every week. &#128516;</p></div><form class="subscription-widget-subscribe"><input type="email" class="email-input" name="email" placeholder="Type your email&#8230;" tabindex="-1"><input type="submit" class="button primary" value="Subscribe"><div class="fake-input-wrapper"><div class="fake-input"></div><div class="fake-button"></div></div></form></div></div><p></p>]]></content:encoded></item><item><title><![CDATA[GroupBy #43: Uber | Kafka - The Tiered Storage]]></title><description><![CDATA[Plus: Notion&#8217;s data lake - Building and Scaling]]></description><link>https://vutr.substack.com/p/groupby-43-uber-kafka-the-tiered</link><guid isPermaLink="false">https://vutr.substack.com/p/groupby-43-uber-kafka-the-tiered</guid><dc:creator><![CDATA[Vu Trinh]]></dc:creator><pubDate>Tue, 09 Jul 2024 11:00:28 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!CKi-!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4dbc9812-a96c-40c4-b747-0a817b03c59e_2000x1429.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p><em>This is <strong>GroupBy</strong>, the weekly compiled resources 
for data engineers.</em></p><p><em>Not subscribed yet? Here you go:</em></p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://vutr.substack.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe now&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://vutr.substack.com/subscribe?"><span>Subscribe now</span></a></p><blockquote><p><em>&#128075; Hi, my name is Vu Trinh, a data engineer.</em></p><p><em>I share my lesson and excellent resources to read in this newsletter.</em></p><p><em>Hope this issue finds you well.</em></p></blockquote><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!CKi-!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4dbc9812-a96c-40c4-b747-0a817b03c59e_2000x1429.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!CKi-!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4dbc9812-a96c-40c4-b747-0a817b03c59e_2000x1429.png 424w, https://substackcdn.com/image/fetch/$s_!CKi-!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4dbc9812-a96c-40c4-b747-0a817b03c59e_2000x1429.png 848w, https://substackcdn.com/image/fetch/$s_!CKi-!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4dbc9812-a96c-40c4-b747-0a817b03c59e_2000x1429.png 1272w, https://substackcdn.com/image/fetch/$s_!CKi-!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4dbc9812-a96c-40c4-b747-0a817b03c59e_2000x1429.png 1456w" 
sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!CKi-!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4dbc9812-a96c-40c4-b747-0a817b03c59e_2000x1429.png" width="1456" height="1040" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/4dbc9812-a96c-40c4-b747-0a817b03c59e_2000x1429.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1040,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:2219060,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!CKi-!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4dbc9812-a96c-40c4-b747-0a817b03c59e_2000x1429.png 424w, https://substackcdn.com/image/fetch/$s_!CKi-!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4dbc9812-a96c-40c4-b747-0a817b03c59e_2000x1429.png 848w, https://substackcdn.com/image/fetch/$s_!CKi-!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4dbc9812-a96c-40c4-b747-0a817b03c59e_2000x1429.png 1272w, https://substackcdn.com/image/fetch/$s_!CKi-!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4dbc9812-a96c-40c4-b747-0a817b03c59e_2000x1429.png 1456w" sizes="100vw" fetchpriority="high"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" 
type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Image created by the author.</figcaption></figure></div><div><hr></div><h1><strong>Kafka Tiered Storage at Uber</strong></h1><h2>Intro</h2><p>Last week, Uber released an article introducing Kafka Tiered Storage; the following texts will cover some insights from that article.</p><blockquote><p><em>You can find the original article from Uber <a href="https://www.uber.com/en-SG/blog/kafka-tiered-storage/">here</a>.</em></p></blockquote><h2>The Kafka original</h2><p>At first, Kafka stores the messages in segment files on the broker&#8217;s file system. The only way to scale the storage capacity is by adding more machines. 
Because Kafka couples compute and storage, each added machine also brings memory and CPUs that go unused when only more storage is needed.</p><h2>De-coupling effort</h2><p>Uber proposed Kafka Tiered Storage (<a href="https://cwiki.apache.org/confluence/display/KAFKA/KIP-405%3A+Kafka+Tiered+Storage?uclick_id=15b6739c-0acd-406e-bdf6-884992beefa0">KIP-405</a>) to de-couple Kafka&#8217;s storage from its processing power. The proposal suggests that instead of a single storage system, Kafka has two storage tiers: local (the original design) and remote.</p><h2><strong>Architecture</strong></h2><p>A Kafka cluster with tiered storage uses two tiers: local and remote. The local tier consists of the broker's current local storage, while the remote tier extends storage to systems such as HDFS, S3, GCS, or Azure. Each tier has its own retention configuration based on size and time. The local tier's retention can be reduced to hours, whereas the remote tier can retain data for days or months. Latency-sensitive applications read from the local tier, leveraging the efficient page cache. In contrast, applications needing older data, such as backfill or failure recovery, access the remote tier.</p><p>This brings some advantages:</p><ul><li><p>Allowing Kafka&#8217;s storage capacity to scale independently.</p></li><li><p>Reducing the local storage burden on brokers, which minimizes data transfer during recovery and rebalancing.</p></li><li><p>Letting consumers access messages on remote storage directly, without loading data back to the broker.</p></li><li><p>Allowing for longer data retention.</p></li></ul><p>Tiered storage divides a topic partition&#8217;s log into two components: the local log and the remote log. The former consists of local log segments, the latter of remote log segments. The remote log subsystem copies eligible segments from local storage to remote storage. 
A segment becomes eligible for copying when its end offset is less than the partition's <a href="https://docs.confluent.io/platform/current/installation/configuration/consumer-configs.html#isolation-level">LastStableOffset</a>.</p><blockquote><p><em>The last stable offset (LSO) is the offset in a user partition such that all lower offsets have been decided and are always present - <a href="https://strimzi.io/blog/2023/05/03/kafka-transactions/">Source</a>.</em></p></blockquote><h2>Copying to the remote storage</h2><p>The leader broker for a topic partition will process the copying of the eligible log segments to the remote storage. It copies the log segments from the earliest to the latest in a sequence.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!bhma!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbda7948e-292d-404b-be7e-5158eafb705a_899x333.jpeg" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!bhma!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbda7948e-292d-404b-be7e-5158eafb705a_899x333.jpeg 424w, https://substackcdn.com/image/fetch/$s_!bhma!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbda7948e-292d-404b-be7e-5158eafb705a_899x333.jpeg 848w, https://substackcdn.com/image/fetch/$s_!bhma!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbda7948e-292d-404b-be7e-5158eafb705a_899x333.jpeg 1272w, 
https://substackcdn.com/image/fetch/$s_!bhma!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbda7948e-292d-404b-be7e-5158eafb705a_899x333.jpeg 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!bhma!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbda7948e-292d-404b-be7e-5158eafb705a_899x333.jpeg" width="899" height="333" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/bda7948e-292d-404b-be7e-5158eafb705a_899x333.jpeg&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:333,&quot;width&quot;:899,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;Image&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Image" title="Image" srcset="https://substackcdn.com/image/fetch/$s_!bhma!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbda7948e-292d-404b-be7e-5158eafb705a_899x333.jpeg 424w, https://substackcdn.com/image/fetch/$s_!bhma!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbda7948e-292d-404b-be7e-5158eafb705a_899x333.jpeg 848w, https://substackcdn.com/image/fetch/$s_!bhma!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbda7948e-292d-404b-be7e-5158eafb705a_899x333.jpeg 1272w, 
https://substackcdn.com/image/fetch/$s_!bhma!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbda7948e-292d-404b-be7e-5158eafb705a_899x333.jpeg 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption"> Some local segments before the &#8220;300&#8221; were deleted based on the local retention configuration, but those segments are available remotely because they were copied to the remote storage. 
<a href="https://www.uber.com/en-SG/blog/kafka-tiered-storage/">Source</a></figcaption></figure></div><h2><strong>Fetching from Remote Storage</strong></h2><p>When a consumer fetch request targets data in remote storage, it is handled by a dedicated thread pool. If the requested offset is available in the broker&#8217;s local storage, it is served using the local fetch mechanism. This ensures that local and remote reads do not block each other.</p><h2><strong>Follower Replication</strong></h2><p>With tiered storage, followers replicate segments from the leader's local storage and must build auxiliary data (e.g., leader epoch state, producer-ID snapshots) before fetching messages. The follower fetch protocol ensures message consistency and order across replicas, even during cluster changes like broker replacements or failures.</p><h2><strong>Outro</strong></h2><p>Speaking of Kafka, I&#8217;m currently writing a series of articles about my Kafka learning journey; you can find the first article <a href="https://vutr.substack.com/p/apache-kafka-part-1-overview?r=2rj6sg">here</a>. It gives an overview of Kafka and its use at LinkedIn. The next article will cover some of Kafka&#8217;s important design choices and will be released this Saturday.</p><div><hr></div><h1>&#128203; The list</h1><p>&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;</p><p><a href="https://www.notion.so/blog/building-and-scaling-notions-data-lake">Building and scaling Notion&#8217;s data lake</a> &#8212; 12 mins, by Notion Blog</p><blockquote><p><em>In the past three years, Notion's data has expanded 10x due to user and content growth, with a doubling rate of 6-12 months. Managing this rapid growth while meeting the ever-increasing data demands of critical product and analytics use cases, especially our recent Notion AI features, meant building and scaling Notion&#8217;s data lake. 
Here&#8217;s how we did it.</em></p></blockquote><p>&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;</p><p><a href="https://slack.engineering/unlocking-efficiency-and-performance-navigating-the-spark-3-and-emr-6-upgrade-journey-at-slack/">Unlocking Efficiency and Performance: Navigating the Spark 3 and EMR 6 Upgrade Journey at Slack</a> &#8212; 10 mins, by Slack Engineering Blog</p><blockquote><p><em>Slack Data Engineering recently underwent data workload migration from <a href="https://aws.amazon.com/emr/">AWS</a> <a href="https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-release-5x.html">EMR 5</a> (Spark 2/<a href="https://hive.apache.org/">Hive 2</a> processing engine) to <a href="https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-release-6x.html">EMR 6 </a>(<a href="https://spark.apache.org/news/spark-3-0-0-released.html">Spark 3</a> processing engine). In this blog, we will share our migration journey, challenges, and the performance gains we observed in the process. This blog aims to assist Data Engineers, Data Infrastructure Engineers, and Product Managers who may be considering migrating to EMR 6/Spark 3.</em></p></blockquote><p>&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;</p><p><a href="https://datagibberish.com/p/etl-elt-basics">Demystifying Data Flow: ETL and ELT Explained Simply</a> &#8212; 8 mins, by <span class="mention-wrap" data-attrs="{&quot;name&quot;:&quot;Yordan Ivanov&quot;,&quot;id&quot;:40945395,&quot;type&quot;:&quot;user&quot;,&quot;url&quot;:null,&quot;photo_url&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/76f52904-5428-4d97-82a5-3faa722b8d46_2234x1253.jpeg&quot;,&quot;uuid&quot;:&quot;a1d87795-72df-4067-8212-83f0b6609ca0&quot;}" data-component-name="MentionToDOM">Yordan Ivanov</span> </p><blockquote><p><em>I'll explain ETL and ELT in a clear and understandable way. 
By the end of this article, you'll have a clear grasp of these essential data processing methods and know precisely when to use each one.</em></p></blockquote><p>&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;</p><p><a href="https://juhache.substack.com/p/data-pipelines-and-scds">Data pipelines and SCDs</a> &#8212; 4 mins, by <span class="mention-wrap" data-attrs="{&quot;name&quot;:&quot;Julien Hurault&quot;,&quot;id&quot;:35734446,&quot;type&quot;:&quot;user&quot;,&quot;url&quot;:null,&quot;photo_url&quot;:&quot;https://substackcdn.com/image/fetch/f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd5c58ce4-f4f1-4eac-854c-a32157cb7b5a_499x579.png&quot;,&quot;uuid&quot;:&quot;d18efdcf-76c1-4928-9863-ab9824013a8e&quot;}" data-component-name="MentionToDOM">Julien Hurault</span> </p><blockquote><p><em>The recommended approach for backfilling is not to write ad-hoc SQL but to re-run the pipeline over a specified interval. This is done by designing a pipeline with idempotent transformation code. 
But what about SCDs?</em></p></blockquote><p>&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;</p><p><a href="https://www.junaideffendi.com/p/data-modelling-using-complex-data">Data Modelling Using Complex Data Types</a> &#8212; 6 mins, by <span class="mention-wrap" data-attrs="{&quot;name&quot;:&quot;Junaid Effendi&quot;,&quot;id&quot;:21393641,&quot;type&quot;:&quot;user&quot;,&quot;url&quot;:null,&quot;photo_url&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/b06559f3-ee33-46f8-bfa0-50964179f235_1200x1200.png&quot;,&quot;uuid&quot;:&quot;8369fe13-08cf-47f6-9486-f7395ec932c4&quot;}" data-component-name="MentionToDOM">Junaid Effendi</span> </p><blockquote><p><em>Complex data types like struct, array, map in modern warehouses are game changer, learn the useful aspects from a Data Engineer.</em></p></blockquote><p>&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;</p><p><a href="https://www.startdataengineering.com/post/sql-v-python/">SQL or Python for Data Transformations?</a> &#8212; 9 mins, by Start Data Engineering</p><blockquote><p><em>By the end of this post, you will understand how the underlying execution engine impacts your pipeline performance. You will have a list of criteria to consider when using Python or SQL for a data processing task. With this checklist, you can use each tool to its benefit.</em></p></blockquote><div><hr></div><h2>&#128521; Previously on Dimension</h2><blockquote><p><em>Dimension is my sub-newsletter, where I note down things I learn from people smarter than me in the data engineering field. Here is the latest article:</em></p></blockquote><div class="digest-post-embed" data-attrs="{&quot;nodeId&quot;:&quot;5a340804-da6c-437a-ad40-b944e7add22d&quot;,&quot;caption&quot;:&quot;My name is Vu Trinh, and I am a data engineer. I&#8217;m trying to make my life less dull by spending time learning and researching &#8220;how it works&#8220; in the data engineering field. 
Here is a place where I share everything I&#8217;ve learned. Not subscribe yet? Here you go:&quot;,&quot;cta&quot;:null,&quot;showBylines&quot;:true,&quot;size&quot;:&quot;lg&quot;,&quot;isEditorNode&quot;:true,&quot;title&quot;:&quot;Apache Kafka - Overview&quot;,&quot;publishedBylines&quot;:[{&quot;id&quot;:167177248,&quot;name&quot;:&quot;Vu Trinh&quot;,&quot;bio&quot;:&quot;I research and write weekly deep-dive content at vutr.substack.com&quot;,&quot;photo_url&quot;:&quot;https://substackcdn.com/image/fetch/f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4805f673-db97-4f7c-85c4-44b345a8de80_256x256.png&quot;,&quot;is_guest&quot;:false,&quot;bestseller_tier&quot;:null}],&quot;post_date&quot;:&quot;2024-07-06T11:01:53.013Z&quot;,&quot;cover_image&quot;:&quot;https://substackcdn.com/image/fetch/f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3308d4f0-eb14-4b4b-9e37-207b499a5489_1397x1001.png&quot;,&quot;cover_image_alt&quot;:null,&quot;canonical_url&quot;:&quot;https://vutr.substack.com/p/apache-kafka-part-1-overview&quot;,&quot;section_name&quot;:&quot;Dimensions.&quot;,&quot;video_upload_id&quot;:null,&quot;id&quot;:145861741,&quot;type&quot;:&quot;newsletter&quot;,&quot;reaction_count&quot;:9,&quot;comment_count&quot;:3,&quot;publication_id&quot;:null,&quot;publication_name&quot;:&quot;VuTrinh.&quot;,&quot;publication_logo_url&quot;:&quot;https://substackcdn.com/image/fetch/f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffb91e59e-4df2-4ac8-b3b4-1f4e830f7b42_256x256.png&quot;,&quot;belowTheFold&quot;:true,&quot;youtube_url&quot;:null,&quot;show_links&quot;:null,&quot;feed_url&quot;:null}"></div><div class="pullquote"><p>Let me hear your voice, for example: </p><p>'Your newsletter is so terrible, I can't handle it anymore.'</p><p class="button-wrapper" 
data-attrs="{&quot;url&quot;:&quot;https://vutr.substack.com/p/groupby-43-uber-kafka-the-tiered/comments&quot;,&quot;text&quot;:&quot;Leave a comment&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://vutr.substack.com/p/groupby-43-uber-kafka-the-tiered/comments"><span>Leave a comment</span></a></p></div>]]></content:encoded></item><item><title><![CDATA[GroupBy #42: Paypal - Scaling Kafka]]></title><description><![CDATA[Plus: Introduction to Kafka Tiered Storage at Uber, Modern Good Practices for Python Development]]></description><link>https://vutr.substack.com/p/groupby-42-paypal-scaling-kafka</link><guid isPermaLink="false">https://vutr.substack.com/p/groupby-42-paypal-scaling-kafka</guid><dc:creator><![CDATA[Vu Trinh]]></dc:creator><pubDate>Tue, 02 Jul 2024 11:02:19 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!qm74!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6aef1d57-4862-440e-8b04-ed52b8b8ed73_1400x1000.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p><em>This is 
<strong>GroupBy</strong>, the weekly compiled resources for data engineers.</em></p><p><em>Not subscribed yet? Here you go:</em></p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://vutr.substack.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe now&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://vutr.substack.com/subscribe?"><span>Subscribe now</span></a></p><blockquote><p><em>&#128075; Hi, my name is Vu Trinh, a data engineer.</em></p><p><em>I share my lesson and excellent resources to read in this newsletter.</em></p><p><em>Hope this issue finds you well.</em></p></blockquote><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!qm74!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6aef1d57-4862-440e-8b04-ed52b8b8ed73_1400x1000.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!qm74!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6aef1d57-4862-440e-8b04-ed52b8b8ed73_1400x1000.png 424w, https://substackcdn.com/image/fetch/$s_!qm74!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6aef1d57-4862-440e-8b04-ed52b8b8ed73_1400x1000.png 848w, https://substackcdn.com/image/fetch/$s_!qm74!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6aef1d57-4862-440e-8b04-ed52b8b8ed73_1400x1000.png 1272w, 
https://substackcdn.com/image/fetch/$s_!qm74!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6aef1d57-4862-440e-8b04-ed52b8b8ed73_1400x1000.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!qm74!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6aef1d57-4862-440e-8b04-ed52b8b8ed73_1400x1000.png" width="1400" height="1000" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/6aef1d57-4862-440e-8b04-ed52b8b8ed73_1400x1000.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1000,&quot;width&quot;:1400,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:3523569,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!qm74!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6aef1d57-4862-440e-8b04-ed52b8b8ed73_1400x1000.png 424w, https://substackcdn.com/image/fetch/$s_!qm74!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6aef1d57-4862-440e-8b04-ed52b8b8ed73_1400x1000.png 848w, https://substackcdn.com/image/fetch/$s_!qm74!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6aef1d57-4862-440e-8b04-ed52b8b8ed73_1400x1000.png 1272w, 
https://substackcdn.com/image/fetch/$s_!qm74!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6aef1d57-4862-440e-8b04-ed52b8b8ed73_1400x1000.png 1456w" sizes="100vw" fetchpriority="high"></picture></div></a></figure></div><div><hr></div><h1><strong>PayPal - Scaling Kafka</strong></h1><blockquote><p><em>This week, we will see how PayPal manages and operates Kafka to support its data growth. 
This mini-blog is based on the article <strong><a href="https://medium.com/paypal-tech/scaling-kafka-to-support-paypals-data-growth-a0b4da420fab">Scaling Kafka to Support PayPal&#8217;s Data Growth (2023)</a>.</strong></em></p></blockquote><h2>Kafka at PayPal</h2><p>At the time the article was written, PayPal&#8217;s Kafka fleet consisted of more than 85 clusters with 1,500 brokers that host over 20,000 topics and close to 2,000 MirrorMakers (used to mirror data among the clusters). During the 2022 Retail Friday, Kafka traffic volume peaked at about <strong>1.3 trillion messages per day</strong>.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!wiFg!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbdfb92f5-fef8-43ab-b7f1-d54a109640ad_956x444.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!wiFg!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbdfb92f5-fef8-43ab-b7f1-d54a109640ad_956x444.png 424w, https://substackcdn.com/image/fetch/$s_!wiFg!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbdfb92f5-fef8-43ab-b7f1-d54a109640ad_956x444.png 848w, https://substackcdn.com/image/fetch/$s_!wiFg!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbdfb92f5-fef8-43ab-b7f1-d54a109640ad_956x444.png 1272w, https://substackcdn.com/image/fetch/$s_!wiFg!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbdfb92f5-fef8-43ab-b7f1-d54a109640ad_956x444.png 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!wiFg!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbdfb92f5-fef8-43ab-b7f1-d54a109640ad_956x444.png" width="956" height="444" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/bdfb92f5-fef8-43ab-b7f1-d54a109640ad_956x444.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:444,&quot;width&quot;:956,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;Fig 4. Kafka cluster deployments in security zones within a data center&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Fig 4. Kafka cluster deployments in security zones within a data center" title="Fig 4. 
Kafka cluster deployments in security zones within a data center" srcset="https://substackcdn.com/image/fetch/$s_!wiFg!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbdfb92f5-fef8-43ab-b7f1-d54a109640ad_956x444.png 424w, https://substackcdn.com/image/fetch/$s_!wiFg!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbdfb92f5-fef8-43ab-b7f1-d54a109640ad_956x444.png 848w, https://substackcdn.com/image/fetch/$s_!wiFg!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbdfb92f5-fef8-43ab-b7f1-d54a109640ad_956x444.png 1272w, https://substackcdn.com/image/fetch/$s_!wiFg!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbdfb92f5-fef8-43ab-b7f1-d54a109640ad_956x444.png 1456w" sizes="100vw" loading="lazy"></picture></div></a><figcaption class="image-caption">Kafka cluster deployments in security zones &#8212; Scaling Kafka to Support PayPal&#8217;s Data Growth (2023). <a href="https://medium.com/paypal-tech/scaling-kafka-to-support-paypals-data-growth-a0b4da420fab">Source</a></figcaption></figure></div><p>PayPal's infrastructure is spread across multiple geographical data centers and security zones. They deploy Kafka clusters across these zones and use MirrorMakers to replicate data between clusters. Client applications connect to the topics on these brokers to publish (write) or subscribe to (read) data in the same or a different zone. PayPal internally supports various Kafka clients, such as Java, Python, Spark, and Node, among others.</p><p>Operating Kafka at PayPal had its own set of challenges. With different frameworks and tech stacks, they must invest in building robust tools that help them reduce operational overhead. Each section below describes an area in which PayPal invested.</p><h2>Cluster Management</h2><p>PayPal introduced a few improvements: </p><ul><li><p><strong>Kafka Config Service: </strong>Previously, clients that wanted to interact with a Kafka cluster had to hardcode the broker IPs in their code. When brokers were replaced due to upgrades, patching, disk failures, etc., the clients had to change the broker IPs manually. Kafka Config Service pushes information about a set of bootstrap servers (brokers that host the topics) to all the Kafka clients during initialization. 
If the broker details change, the Kafka application only needs to restart, and the config service pushes the new configuration to it.</p></li><li><p><strong>Kafka Access Control Lists (ACLs): </strong>Initially, Kafka allowed connections on plaintext ports, and any application could connect to any existing topic. PayPal adopted ACLs to control access to Kafka clusters via the Simple Authentication and Security Layer (SASL) port.</p></li><li><p><strong>PayPal Kafka Libraries: </strong>PayPal introduced a set of libraries to improve security, interoperability, and user experience: </p><ul><li><p><strong>Resilient Client Library: </strong>The resilient client library integrates with the discovery service.</p></li><li><p><strong>Monitoring Library</strong>: The monitoring library publishes critical metrics for client applications, which helps monitor the applications&#8217; health.</p></li><li><p><strong>Kafka Security Library</strong>: All applications need SSL authentication to connect to the Kafka clusters. This library pulls the required certificates and tokens to authenticate the application during startup.</p></li></ul></li><li><p><strong>Kafka QA Platform: </strong>The older QA environment had many ad-hoc topics, all hosted on a handful of clusters. PayPal redesigned it and introduced a new QA platform that provides a one-to-one mapping between production and QA clusters, following the same security standards as the production setup.</p></li></ul><h2><strong>Monitoring and Alerting</strong></h2><p>PayPal's Kafka platform is tightly integrated with its monitoring and alerting systems. Although Apache Kafka provides many metrics by default, they picked an optimized subset that helps identify issues quickly with minimal overhead. Key metrics from brokers, ZooKeeper, MirrorMakers, and clients are used to monitor the health of the applications and the underlying machines, triggering alerts at abnormal thresholds. 
PayPal also developed a custom Kafka Metrics library to filter metrics.</p><h2><strong>Enhancements and Automation</strong></h2><p>PayPal automated CRUD operations for clusters and topics, as well as metadata management:</p><ul><li><p><strong>Patching Security Vulnerabilities</strong>: All hosts in the Kafka platform must be patched frequently to resolve any security vulnerabilities. Patching usually requires broker restarts, risking under-replicated partitions and data loss. To prevent this, they developed a plugin that checks for under-replicated partitions before patching, allowing clusters to be patched in parallel while ensuring only one broker per cluster is patched at a time.</p></li><li><p><strong>Topic Onboarding: </strong>Application teams must submit a request via the Onboarding Dashboard to create a new topic. The team reviews the capacity requirements and assigns the topic to an available cluster, determined by the capacity analysis tool integrated into the workflow. A unique token is generated for each new application to authenticate access to the Kafka topic, and ACLs are created based on roles. 
The application can then successfully connect to the Kafka topic.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!Zfig!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F68d57f13-605d-446c-9f6f-60cba03b7eb6_960x600.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!Zfig!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F68d57f13-605d-446c-9f6f-60cba03b7eb6_960x600.png 424w, https://substackcdn.com/image/fetch/$s_!Zfig!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F68d57f13-605d-446c-9f6f-60cba03b7eb6_960x600.png 848w, https://substackcdn.com/image/fetch/$s_!Zfig!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F68d57f13-605d-446c-9f6f-60cba03b7eb6_960x600.png 1272w, https://substackcdn.com/image/fetch/$s_!Zfig!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F68d57f13-605d-446c-9f6f-60cba03b7eb6_960x600.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!Zfig!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F68d57f13-605d-446c-9f6f-60cba03b7eb6_960x600.png" width="960" height="600" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/68d57f13-605d-446c-9f6f-60cba03b7eb6_960x600.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:600,&quot;width&quot;:960,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!Zfig!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F68d57f13-605d-446c-9f6f-60cba03b7eb6_960x600.png 424w, https://substackcdn.com/image/fetch/$s_!Zfig!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F68d57f13-605d-446c-9f6f-60cba03b7eb6_960x600.png 848w, https://substackcdn.com/image/fetch/$s_!Zfig!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F68d57f13-605d-446c-9f6f-60cba03b7eb6_960x600.png 1272w, https://substackcdn.com/image/fetch/$s_!Zfig!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F68d57f13-605d-446c-9f6f-60cba03b7eb6_960x600.png 1456w" sizes="100vw" loading="lazy"></picture></div></a><figcaption class="image-caption">Topic Onboarding workflow &#8212; Scaling Kafka to Support PayPal&#8217;s Data Growth (2023). 
<a href="https://medium.com/paypal-tech/scaling-kafka-to-support-paypals-data-growth-a0b4da420fab">Source</a></figcaption></figure></div></li><li><p><strong>MirrorMaker Onboarding</strong>: </p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!9kgE!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcf8f5eb2-cbfa-4711-a6e4-e6af0d4325de_1400x924.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!9kgE!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcf8f5eb2-cbfa-4711-a6e4-e6af0d4325de_1400x924.png 424w, https://substackcdn.com/image/fetch/$s_!9kgE!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcf8f5eb2-cbfa-4711-a6e4-e6af0d4325de_1400x924.png 848w, https://substackcdn.com/image/fetch/$s_!9kgE!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcf8f5eb2-cbfa-4711-a6e4-e6af0d4325de_1400x924.png 1272w, https://substackcdn.com/image/fetch/$s_!9kgE!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcf8f5eb2-cbfa-4711-a6e4-e6af0d4325de_1400x924.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!9kgE!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcf8f5eb2-cbfa-4711-a6e4-e6af0d4325de_1400x924.png" width="1400" height="924" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/cf8f5eb2-cbfa-4711-a6e4-e6af0d4325de_1400x924.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:924,&quot;width&quot;:1400,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!9kgE!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcf8f5eb2-cbfa-4711-a6e4-e6af0d4325de_1400x924.png 424w, https://substackcdn.com/image/fetch/$s_!9kgE!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcf8f5eb2-cbfa-4711-a6e4-e6af0d4325de_1400x924.png 848w, https://substackcdn.com/image/fetch/$s_!9kgE!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcf8f5eb2-cbfa-4711-a6e4-e6af0d4325de_1400x924.png 1272w, https://substackcdn.com/image/fetch/$s_!9kgE!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcf8f5eb2-cbfa-4711-a6e4-e6af0d4325de_1400x924.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path 
d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">MirrorMaker onboarding workflow  &#8212; Scaling Kafka to Support PayPal&#8217;s Data Growth (2023). <a href="https://medium.com/paypal-tech/scaling-kafka-to-support-paypals-data-growth-a0b4da420fab">Source</a></figcaption></figure></div></li><li><p><strong>Repartition Assignment Enhancements: </strong>By default, Kafka repartitions all partitions, including those on healthy brokers. PayPal modified this to reassign only under-replicated partitions on affected brokers, avoiding long re-partitioning times. 
Previously, re-partitioning could leave clusters unavailable for days.</p></li></ul><h2>PayPal&#8217;s lessons learned.</h2><ul><li><p>Operating Kafka at large scale requires tooling for routine operations.</p></li><li><p>PayPal monitors critical health metrics such as CPU and disk utilization to ensure high availability and business continuity.</p></li><li><p>They introduced ACLs to improve application tracking and security, and they are developing automation tools to simplify ACL management.</p></li><li><p>Benchmarking cluster performance across environments (on-premises and cloud) with different configurations has provided insights into operational efficiency.</p></li></ul><div><hr></div><h1>&#128203; The list</h1><p>&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;</p><p><a href="https://www.uber.com/en-SG/blog/kafka-tiered-storage/">Introduction to Kafka Tiered Storage at Uber</a> &#8212; 9 mins, by Uber Engineering Blog</p><blockquote><p><em>Uber proposed Kafka Tiered Storage (<a href="https://cwiki.apache.org/confluence/display/KAFKA/KIP-405%3A+Kafka+Tiered+Storage?uclick_id=15b6739c-0acd-406e-bdf6-884992beefa0">KIP-405</a>) to avoid tight coupling of storage and processing in a broker. 
It provides two tiers of storage, called local and remote.</em></p></blockquote><p>&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;</p><p><a href="https://databased.pedramnavid.com/p/the-rise-of-the-data-platform-engineer">The Rise of the Data Platform Engineer</a> &#8212; 6 mins, by Pedram Navid</p><blockquote><p><em>The next evolution of the role is more akin to a Data Platform Engineer.</em></p></blockquote><p>&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;</p><p><a href="https://www.startdataengineering.com/post/why-to-use-orchestrators/">Why use Apache Airflow (or any orchestrator)?</a> &#8212; 7 mins, by Start Data Engineering</p><blockquote><p><em>Understanding the needs of complex data pipelines can help you understand the need for a tool like Airflow. This post will cover the three main concepts of running data pipelines: scheduling, orchestration, and observability.</em></p></blockquote><p>&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;</p><p><a href="https://www.stuartellis.name/articles/python-modern-practices/">Modern Good Practices for Python Development</a> &#8212; 13 mins, by Stuart Ellis</p><blockquote><p><em><a href="https://www.python.org/">Python</a> has a long history, and it has evolved over time. This article describes some agreed modern best practices.</em></p></blockquote><p>&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;</p><p><a href="https://www.datadoghq.com/blog/engineering/timeseries-indexing-at-scale/">Datadog - Timeseries Indexing at Scale</a> &#8212; 20 mins, by Artem Krylysov and May Lee</p><blockquote><p><em>This blog post provides an overview of time-series indexing at scale in the Datadog time-series database. 
We&#8217;ll compare the performance and reliability of two generations of indexing services.</em></p></blockquote><p>&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;</p><p><a href="https://sqlpatterns.com/p/no-such-thing-as-dirty-data">No Such Thing As Dirty Data</a> &#8212; 3 mins, by Ergest Xheblati</p><blockquote><p><em>There&#8217;s no such thing as &#8220;dirty data.&#8221; Data is either "fit for purpose" or "unfit for purpose." Data "fit for purpose" requires no changes and can be used as is. Data "unfit for purpose" requires "retrofitting" which will ALWAYS cause problems.</em></p></blockquote><p>&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;</p><p><a href="https://jaehyeon.me/blog/2024-05-30-beam-deploy-1/">Deploy Python Stream Processing App on Kubernetes</a> &#8212; 13 mins, by Jaehyeon Kim</p><blockquote><p><em>The Flink Kubernetes Operator manages the entire deployment lifecycle of Apache Flink applications, simplifying the deployment and management of Python stream processing applications. This series covers deploying a PyFlink application and a Python Apache Beam pipeline on the Flink Runner on Kubernetes.</em></p></blockquote><div><hr></div><h2>&#128521; Previously on Dimension</h2><blockquote><p><em>Dimension is my sub-newsletter, where I note down things I learn from people smarter than me in the data engineering field. Here is the latest article</em></p></blockquote><div class="digest-post-embed" data-attrs="{&quot;nodeId&quot;:&quot;85792dce-6398-4208-92a0-c072ea1172bd&quot;,&quot;caption&quot;:&quot;My name is Vu Trinh, and I am a data engineer. I&#8217;m trying to make my life less dull by spending time learning and researching &#8220;how it works&#8220; in the data engineering field. Here is a place where I share everything I&#8217;ve learned. Not subscribe yet? 
Here you go:&quot;,&quot;cta&quot;:null,&quot;showBylines&quot;:true,&quot;size&quot;:&quot;lg&quot;,&quot;isEditorNode&quot;:true,&quot;title&quot;:&quot;Procella - The query engine at YouTube&quot;,&quot;publishedBylines&quot;:[{&quot;id&quot;:167177248,&quot;name&quot;:&quot;Vu Trinh&quot;,&quot;bio&quot;:&quot;I research and write weekly deep-dive content at vutr.substack.com&quot;,&quot;photo_url&quot;:&quot;https://substackcdn.com/image/fetch/f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4805f673-db97-4f7c-85c4-44b345a8de80_256x256.png&quot;,&quot;is_guest&quot;:false,&quot;bestseller_tier&quot;:null}],&quot;post_date&quot;:&quot;2024-06-29T11:00:21.905Z&quot;,&quot;cover_image&quot;:&quot;https://substackcdn.com/image/fetch/f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc47808dd-158f-4d50-87ae-784c257f38b9_1399x996.png&quot;,&quot;cover_image_alt&quot;:null,&quot;canonical_url&quot;:&quot;https://vutr.substack.com/p/procella-the-query-engine-at-youtube&quot;,&quot;section_name&quot;:&quot;Dimensions.&quot;,&quot;video_upload_id&quot;:null,&quot;id&quot;:145646666,&quot;type&quot;:&quot;newsletter&quot;,&quot;reaction_count&quot;:0,&quot;comment_count&quot;:0,&quot;publication_id&quot;:null,&quot;publication_name&quot;:&quot;VuTrinh.&quot;,&quot;publication_logo_url&quot;:&quot;https://substackcdn.com/image/fetch/f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffb91e59e-4df2-4ac8-b3b4-1f4e830f7b42_256x256.png&quot;,&quot;belowTheFold&quot;:true,&quot;youtube_url&quot;:null,&quot;show_links&quot;:null,&quot;feed_url&quot;:null}"></div><div class="pullquote"><p>Let me hear your voice, for example: </p><p>'Your newsletter is so terrible, I can't handle it anymore.'</p><p class="button-wrapper" 
data-attrs="{&quot;url&quot;:&quot;https://vutr.substack.com/p/groupby-42-paypal-scaling-kafka/comments&quot;,&quot;text&quot;:&quot;Leave a comment&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://vutr.substack.com/p/groupby-42-paypal-scaling-kafka/comments"><span>Leave a comment</span></a></p></div><div class="subscription-widget-wrap-editor" data-attrs="{&quot;url&quot;:&quot;https://vutr.substack.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe&quot;,&quot;language&quot;:&quot;en&quot;}" data-component-name="SubscribeWidgetToDOM"><div class="subscription-widget show-subscribe"><div class="preamble"><p class="cta-caption">Thanks for scrolling this far! There's a convenient subscribe box here if you want me to annoy you every week. &#128516;</p></div><form class="subscription-widget-subscribe"><input type="email" class="email-input" name="email" placeholder="Type your email&#8230;" tabindex="-1"><input type="submit" class="button primary" value="Subscribe"><div class="fake-input-wrapper"><div class="fake-input"></div><div class="fake-button"></div></div></form></div></div><p></p>]]></content:encoded></item><item><title><![CDATA[GroupBy #41: Uber’s Batch Data Infrastructure with Google Cloud Platform]]></title><description><![CDATA[Plus: Debugging Data Pipelines, How to learn data engineering]]></description><link>https://vutr.substack.com/p/groupby-41-ubers-batch-data-infrastructure</link><guid isPermaLink="false">https://vutr.substack.com/p/groupby-41-ubers-batch-data-infrastructure</guid><dc:creator><![CDATA[Vu Trinh]]></dc:creator><pubDate>Tue, 25 Jun 2024 11:01:57 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!VkQf!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff924d7a1-3e20-43d8-9571-fde84fb4557d_1401x1000.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p><em>This is 
<strong>GroupBy</strong>, the weekly compiled resources for data engineers.</em></p><p><em>Not subscribed yet? Here you go:</em></p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://vutr.substack.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe now&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://vutr.substack.com/subscribe?"><span>Subscribe now</span></a></p><blockquote><p><em>&#128075; Hi, my name is Vu Trinh, a data engineer.</em></p><p><em>I share my lesson and excellent resources to read in this newsletter.</em></p><p><em>Hope this issue finds you well.</em></p></blockquote><div><hr></div><h1>Uber&#8217;s Batch Data Infrastructure with Google Cloud Platform</h1><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!VkQf!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff924d7a1-3e20-43d8-9571-fde84fb4557d_1401x1000.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!VkQf!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff924d7a1-3e20-43d8-9571-fde84fb4557d_1401x1000.png 424w, https://substackcdn.com/image/fetch/$s_!VkQf!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff924d7a1-3e20-43d8-9571-fde84fb4557d_1401x1000.png 848w, https://substackcdn.com/image/fetch/$s_!VkQf!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff924d7a1-3e20-43d8-9571-fde84fb4557d_1401x1000.png 1272w, 
https://substackcdn.com/image/fetch/$s_!VkQf!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff924d7a1-3e20-43d8-9571-fde84fb4557d_1401x1000.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!VkQf!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff924d7a1-3e20-43d8-9571-fde84fb4557d_1401x1000.png" width="1401" height="1000" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/f924d7a1-3e20-43d8-9571-fde84fb4557d_1401x1000.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1000,&quot;width&quot;:1401,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:2328502,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!VkQf!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff924d7a1-3e20-43d8-9571-fde84fb4557d_1401x1000.png 424w, https://substackcdn.com/image/fetch/$s_!VkQf!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff924d7a1-3e20-43d8-9571-fde84fb4557d_1401x1000.png 848w, https://substackcdn.com/image/fetch/$s_!VkQf!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff924d7a1-3e20-43d8-9571-fde84fb4557d_1401x1000.png 1272w, 
https://substackcdn.com/image/fetch/$s_!VkQf!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff924d7a1-3e20-43d8-9571-fde84fb4557d_1401x1000.png 1456w" sizes="100vw" fetchpriority="high"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Image created by the author with the help of Canva Image Generator and Draw.io.</figcaption></figure></div><h2>Intro</h2><p>Uber runs one of the largest Hadoop installations in the world. Their Hadoop ecosystem hosts more than 1 exabyte of data across tens of thousands of servers. 
A few weeks ago, Uber published an article announcing that it is working with <a href="https://cloud.google.com/?hl=en">Google Cloud Platform</a> (GCP) to move its batch data analytics and ML training stack to GCP to keep up with Uber&#8217;s growing needs. Let&#8217;s review some key insights from that article.</p><blockquote><p>You can find the original blog post here: <a href="https://www.uber.com/en-SG/blog/modernizing-ubers-data-infrastructure-with-gcp/?uclick_id=15b6739c-0acd-406e-bdf6-884992beefa0">Modernizing Uber&#8217;s Batch Data Infrastructure with Google Cloud Platform</a>.</p></blockquote><h2><strong>Strategy</strong></h2><ul><li><p>Their initial GCP migration strategy is to use cloud object storage for the data lake and migrate the rest of the data stack to cloud IaaS (Infrastructure as a Service). This is meant to keep the migration quick and minimally disruptive.</p></li><li><p>They plan to leverage applicable PaaS (Platform as a Service) offerings, e.g., GCP Dataproc or BigQuery.</p></li></ul><h2>Migration Principles</h2><ul><li><p><strong>Avoid painful migrations for data users: </strong>Uber wants to minimize changes for users (e.g., dashboard owners). They will use a cloud storage connector for HDFS compatibility with Google Cloud Storage, leveraging open standards like <a href="https://parquet.apache.org/">Apache Parquet</a>, <a href="https://hudi.apache.org/">Apache Hudi</a>, <a href="https://spark.apache.org/">Apache Spark</a>, <a href="https://hadoop.apache.org/docs/stable/hadoop-yarn/hadoop-yarn-site/YARN.html">Apache Hadoop YARN</a>, and <a href="https://kubernetes.io/">Kubernetes</a>. This minimizes migration challenges and allows smooth integration of on-prem HDFS services with GCP storage.</p></li><li><p><strong>Enhance data access proxies: </strong>Uber has developed data access proxies for <a href="https://prestodb.io/">Presto</a>, Spark, and Hive to hide the underlying compute clusters. 
Once the migration is complete, all queries and jobs submitted to these proxies will be routed to the cloud-based stack.</p></li><li><p><strong>Leverage Uber&#8217;s container and deployment infrastructure: </strong>The batch data stack sits on top of Uber&#8217;s infrastructure building blocks, which are built to be <a href="https://www.uber.com/blog/crane-ubers-next-gen-infrastructure-stack/">agnostic between cloud and on-prem</a>. These platforms allow Uber to expand the batch data ecosystem onto the cloud seamlessly.</p></li><li><p><strong>Forecast potential data governance issues from cloud services: </strong>Uber will enhance data management services to support only approved data services from the cloud vendor, avoiding future data governance complexities.</p></li></ul><h2><strong>Major Workload</strong></h2><ul><li><p><strong>Bucket mapping and cloud resources layout: </strong>Formulate the mapping algorithm for migrating HDFS files and directories from the source cluster to cloud objects.</p></li><li><p><strong>Security integration: </strong>Ensure all users, groups, and service accounts continue to be <a href="https://www.uber.com/blog/scaling-adoption-of-kerberos-at-uber/">authenticated</a> against the object store data lake and any other cloud PaaS. Also, maintain the same levels of authorized access as on-prem.</p></li><li><p><strong>Data replication: </strong>HiveSync is a permissions-aware, bi-directional data replication service built at Uber. The goal is to extend HiveSync&#8217;s capabilities to replicate the on-prem data lake&#8217;s data into the cloud-based data lake and corresponding Hive Metastore.</p></li><li><p><strong>YARN and Presto clusters: </strong>Uber will provision new YARN and Presto clusters on GCP. 
The existing data access proxies will route traffic to these new cloud-based clusters.</p></li></ul><h2>Challenges and Initiatives</h2><p>Here are the main categories of challenges in this large migration:</p><ul><li><p><strong>Performance</strong>: Object storage and HDFS differ in features and performance characteristics. Uber will leverage the open-source Hadoop connectors and help evolve them to maximize performance.</p></li><li><p><strong>Usage governance</strong>: Cloud usage costs can spiral if not carefully managed. Uber will leverage the cloud&#8217;s elasticity to control costs and partner with the internal capacity engineering team to build more advanced cost tracking.</p></li><li><p><strong>Non-analytics/ML-specific usage of HDFS by applications</strong>: Uber teams have used HDFS as a generic file store over the years. Uber will migrate these use cases to other internal blob stores while providing a transparent migration path to avoid disruptions.</p></li><li><p><strong>Unknown unknowns</strong>: There will be unanticipated challenges. Uber hopes to detect these issues with early end-to-end integrations.</p></li></ul><h2>Outro</h2><p>Uber plans to execute the migration plan over the next several quarters and share its progress through a series of blog posts. 
You can check the <a href="https://www.uber.com/blog/asia/?uclick_id=15b6739c-0acd-406e-bdf6-884992beefa0">Uber blog</a> here for upcoming posts on their migration journey.</p><div><hr></div><h1>&#128203; The list</h1><p>&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;</p><p><a href="https://blog.dataengineer.io/p/how-i-built-a-huge-graph-database">How I built a huge graph database of Netflix's cloud infrastructure</a> &#8212; 5 mins, by <span class="mention-wrap" data-attrs="{&quot;name&quot;:&quot;Zach Wilson&quot;,&quot;id&quot;:10367987,&quot;type&quot;:&quot;user&quot;,&quot;url&quot;:null,&quot;photo_url&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/8a857d08-ec8d-4a0e-9cb5-ad8434fe519e_2333x3500.jpeg&quot;,&quot;uuid&quot;:&quot;7e1e2210-03cc-4665-88f1-ba269dd03654&quot;}" data-component-name="MentionToDOM"></span> </p><blockquote><p><em>In this article, I&#8217;ll be going over:</em></p><ul><li><p><em>How a graph database data model would help Netflix manage their cloud security, and what datasets would we need</em></p></li><li><p><em>How we built a Spring Boot API on top of Postgres to serve most use cases</em></p></li></ul></blockquote><p>&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;</p><p><a href="https://seattledataguy.substack.com/p/when-and-why-to-automate-a-data-engineers">When and Why to Automate: A Data Engineer's Perspective</a> &#8212; 8 mins, by <span class="mention-wrap" data-attrs="{&quot;name&quot;:&quot;SeattleDataGuy&quot;,&quot;id&quot;:4963622,&quot;type&quot;:&quot;user&quot;,&quot;url&quot;:null,&quot;photo_url&quot;:&quot;https://substackcdn.com/image/fetch/f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fbucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com%2Fpublic%2Fimages%2F1ec905aa-9a7b-4f21-b0ff-fec92e8916d1_512x512.jpeg&quot;,&quot;uuid&quot;:&quot;296a6e78-e36c-4297-9714-103ef124f7a3&quot;}" data-component-name="MentionToDOM"></span> </p><blockquote><p><em>The goal of 
this article is to:</em></p><ul><li><p><em>Outline why we automate certain tasks</em></p></li><li><p><em>Call out some of the reasons not to automate a project</em></p></li></ul><p><em>Have you, the reader, pause and ask, should I really automate this? With that, let&#8217;s dive into why you&#8217;d look to automate a task.</em></p></blockquote><p>&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;</p><p><a href="https://www.uber.com/en-SG/blog/single-zone-failure-tolerance/">How Uber ensures Apache Cassandra&#174;&#8217;s tolerance for single-zone failure</a> &#8212; 12 mins, by Uber Engineering Blog</p><blockquote><p><em>This blog shows how we ensured the single-zone failure tolerance for Cassandra and, notably, how we converted the large Cassandra fleet in real-time with zero downtime from non-zone-failure-tolerant to single-zone-failure tolerant.</em></p></blockquote><p>&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;</p><p><a href="https://juhache.substack.com/p/debugging-data-pipelines">Debugging Data Pipelines</a> &#8212; 5 mins, by <span class="mention-wrap" data-attrs="{&quot;name&quot;:&quot;Julien Hurault&quot;,&quot;id&quot;:35734446,&quot;type&quot;:&quot;user&quot;,&quot;url&quot;:null,&quot;photo_url&quot;:&quot;https://substackcdn.com/image/fetch/f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd5c58ce4-f4f1-4eac-854c-a32157cb7b5a_499x579.png&quot;,&quot;uuid&quot;:&quot;228de7ca-e144-414f-bbdc-8a5bdbb7996a&quot;}" data-component-name="MentionToDOM"></span> </p><blockquote><p><em>Julien shares an approach to make your data pipeline bug-fixing process a little less stressful.</em></p></blockquote><p>&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;</p><p><a href="https://medium.com/@meskensjan/scoping-data-projects-why-technology-alone-isnt-enough-2f2dafdd1c1c">Scoping Data Projects: Why Technology Alone Isn&#8217;t Enough</a> &#8212; 9 mins, by janmeskens</p><blockquote><p><em>This is the first article in a series about data 
strategy. In this article, I discuss the challenges in defining data projects. Future articles will delve into the topic of data strategy.</em></p></blockquote><p>&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;</p><p><a href="https://www.blef.fr/learn-data-engineering/">How to learn data engineering</a> &#8212; 6 mins, by Christophe Blefari</p><blockquote><p><em>How to learn data engineering in 2024? This article will help you understand everything related to data engineering.</em></p></blockquote><p></p><div><hr></div><h2>&#128521; Previously on Dimension</h2><blockquote><p><em>Dimension is my sub-newsletter, where I note down things I learn from people smarter than me in the data engineering field. Here is the latest article</em></p></blockquote><div class="digest-post-embed" data-attrs="{&quot;nodeId&quot;:&quot;1c2b44d3-2253-4e57-84bb-145180b63c0d&quot;,&quot;caption&quot;:&quot;My name is Vu Trinh, and I am a data engineer. I&#8217;m trying to make my life less dull by spending time learning and researching &#8220;how it works&#8220; in the data engineering field. Here is a place where I share everything I&#8217;ve learned. Not subscribe yet? 
Here you go:&quot;,&quot;cta&quot;:null,&quot;showBylines&quot;:true,&quot;size&quot;:&quot;lg&quot;,&quot;isEditorNode&quot;:true,&quot;title&quot;:&quot;How does Uber handle petabytes of Spark shuffle data every day?&quot;,&quot;publishedBylines&quot;:[{&quot;id&quot;:167177248,&quot;name&quot;:&quot;Vu Trinh&quot;,&quot;bio&quot;:&quot;I research and write weekly deep-dive content at vutr.substack.com&quot;,&quot;photo_url&quot;:&quot;https://substackcdn.com/image/fetch/f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4805f673-db97-4f7c-85c4-44b345a8de80_256x256.png&quot;,&quot;is_guest&quot;:false,&quot;bestseller_tier&quot;:null}],&quot;post_date&quot;:&quot;2024-06-22T11:01:08.994Z&quot;,&quot;cover_image&quot;:&quot;https://substackcdn.com/image/fetch/f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F42e26302-86d9-450a-92b0-9d6df846cbd8_1400x1000.png&quot;,&quot;cover_image_alt&quot;:null,&quot;canonical_url&quot;:&quot;https://vutr.substack.com/p/how-does-uber-handle-petabytes-of&quot;,&quot;section_name&quot;:&quot;Dimensions.&quot;,&quot;video_upload_id&quot;:null,&quot;id&quot;:145445763,&quot;type&quot;:&quot;newsletter&quot;,&quot;reaction_count&quot;:6,&quot;comment_count&quot;:0,&quot;publication_id&quot;:null,&quot;publication_name&quot;:&quot;VuTrinh.&quot;,&quot;publication_logo_url&quot;:&quot;https://substackcdn.com/image/fetch/f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffb91e59e-4df2-4ac8-b3b4-1f4e830f7b42_256x256.png&quot;,&quot;belowTheFold&quot;:true,&quot;youtube_url&quot;:null,&quot;show_links&quot;:null,&quot;feed_url&quot;:null}"></div><div class="pullquote"><p>Let me hear your voice, for example: </p><p>'Your newsletter is so terrible, I can't handle it anymore.'</p><p class="button-wrapper" 
data-attrs="{&quot;url&quot;:&quot;https://vutr.substack.com/p/groupby-41-ubers-batch-data-infrastructure/comments&quot;,&quot;text&quot;:&quot;Leave a comment&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://vutr.substack.com/p/groupby-41-ubers-batch-data-infrastructure/comments"><span>Leave a comment</span></a></p></div><div class="subscription-widget-wrap-editor" data-attrs="{&quot;url&quot;:&quot;https://vutr.substack.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe&quot;,&quot;language&quot;:&quot;en&quot;}" data-component-name="SubscribeWidgetToDOM"><div class="subscription-widget show-subscribe"><div class="preamble"><p class="cta-caption">Thanks for scrolling this far! There's a convenient subscribe box here if you want me to annoy you every week. &#128516;</p></div><form class="subscription-widget-subscribe"><input type="email" class="email-input" name="email" placeholder="Type your email&#8230;" tabindex="-1"><input type="submit" class="button primary" value="Subscribe"><div class="fake-input-wrapper"><div class="fake-input"></div><div class="fake-button"></div></div></form></div></div><p></p>]]></content:encoded></item><item><title><![CDATA[GroupBy #40: Data Infrastructure at Airbnb]]></title><description><![CDATA[Plus: How trip.com migrated from Elasticsearch and built a 50PB logging solution with ClickHouse]]></description><link>https://vutr.substack.com/p/groupby-40-data-infrastructure-at</link><guid isPermaLink="false">https://vutr.substack.com/p/groupby-40-data-infrastructure-at</guid><dc:creator><![CDATA[Vu Trinh]]></dc:creator><pubDate>Tue, 18 Jun 2024 11:03:03 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!ATxg!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9bcb0b91-4380-4f6d-8556-e390203d9940_1344x960.gif" length="0" 
type="image/jpeg"/><content:encoded><![CDATA[<p><em>This is <strong>GroupBy</strong>, the weekly compiled resources for data engineers.</em></p><p><em>Not subscribed yet? Here you go:</em></p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://vutr.substack.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe now&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://vutr.substack.com/subscribe?"><span>Subscribe now</span></a></p><blockquote><p><em>&#128075; Hi, my name is Vu Trinh, a data engineer.</em></p><p><em>I share my lesson and excellent resources to read in this newsletter.</em></p><p><em>Hope this issue finds you well.</em></p></blockquote><div><hr></div><h1>&#128640; Data Infrastructure at Airbnb</h1><h2>Intro</h2><p>Starting from this issue, I'm introducing a new format to GroupBy. In addition to the usual curated links, I'll be sharing a brief blog/note on my recent learnings and readings in the data/software engineering field. I'll strive to keep it concise, under 7 minutes, to respect your time.</p><p>This week's note covers the article on data infrastructure at Airbnb (2016).</p><blockquote><p><em><strong>Reference</strong>: <a href="https://medium.com/airbnb-engineering/data-infrastructure-at-airbnb-8adfb34f169c">Data Infrastructure at Airbnb</a> (2016)</em></p></blockquote><h2>Airbnb philosophy</h2><p>Airbnb built its data infrastructure on the following philosophies:</p><ul><li><p><strong>Open-source</strong>: Airbnb adopts open-source systems where it can; when it builds something helpful, it contributes it back to the community. 
</p></li><li><p><strong>Prefer standard components and methods: </strong>Having intuition about when to build a unique solution and when to adopt an existing solution is essential.</p></li><li><p><strong>Scalability: </strong>Airbnb had to ensure its infrastructure could scale with the growth of the data.</p></li><li><p><strong>Solve real problems by listening to your colleagues: </strong>Empathizing with internal data users is essential.</p></li></ul><h2><strong>Infrastructure Overview</strong></h2><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!ATxg!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9bcb0b91-4380-4f6d-8556-e390203d9940_1344x960.gif" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!ATxg!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9bcb0b91-4380-4f6d-8556-e390203d9940_1344x960.gif 424w, https://substackcdn.com/image/fetch/$s_!ATxg!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9bcb0b91-4380-4f6d-8556-e390203d9940_1344x960.gif 848w, https://substackcdn.com/image/fetch/$s_!ATxg!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9bcb0b91-4380-4f6d-8556-e390203d9940_1344x960.gif 1272w, https://substackcdn.com/image/fetch/$s_!ATxg!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9bcb0b91-4380-4f6d-8556-e390203d9940_1344x960.gif 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!ATxg!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9bcb0b91-4380-4f6d-8556-e390203d9940_1344x960.gif" width="1344" height="960" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/9bcb0b91-4380-4f6d-8556-e390203d9940_1344x960.gif&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:960,&quot;width&quot;:1344,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:1147280,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/gif&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!ATxg!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9bcb0b91-4380-4f6d-8556-e390203d9940_1344x960.gif 424w, https://substackcdn.com/image/fetch/$s_!ATxg!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9bcb0b91-4380-4f6d-8556-e390203d9940_1344x960.gif 848w, https://substackcdn.com/image/fetch/$s_!ATxg!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9bcb0b91-4380-4f6d-8556-e390203d9940_1344x960.gif 1272w, https://substackcdn.com/image/fetch/$s_!ATxg!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9bcb0b91-4380-4f6d-8556-e390203d9940_1344x960.gif 1456w" sizes="100vw" loading="lazy"></picture></div></a><figcaption class="image-caption">Image created by the author. <a href="https://medium.com/airbnb-engineering/data-infrastructure-at-airbnb-8adfb34f169c">Reference</a></figcaption></figure></div><ul><li><p>Data came from two sources: events from Kafka and MySQL database dumps.</p></li><li><p>This source data contains user activity event data and dimensional snapshots.</p></li><li><p>There are two Hadoop clusters: Gold and Silver. 
Critical jobs ran in the Gold environment, and more &#8220;relaxed&#8221; jobs ran in the Silver one.</p></li><li><p>Data in Gold is treated as a single source of truth; data can ONLY be copied from Gold to Silver.</p></li><li><p>The isolation comes at a cost: running two clusters means replicating data and keeping it in sync.</p></li><li><p>Hive-managed tables serve as the central source and sink for data.</p></li><li><p>Presto handles almost all ad hoc queries on Hive-managed tables.</p></li><li><p>They built a web-based query tool called <a href="https://airbnb.io/airpal/">Airpal</a>, backed by Presto. It is Airbnb's primary interface for users to run SQL.</p></li><li><p>They use Airflow for job scheduling.</p></li><li><p>Engineers and data scientists working on machine learning use Spark.</p></li><li><p>Airbnb also leverages Spark for stream processing.</p></li></ul><h2><strong>Detailed Look at The Hadoop Cluster</strong></h2><p>Airbnb carried out a significant migration of its Hadoop clusters. Two years before the article was written, poorly architected clusters called Pinky and Brain ran on a set of EC2 instances running HDFS with 300 terabytes of data.</p><p>At the time of writing, Airbnb had two separate HDFS clusters (Gold and Silver) with 11 petabytes of data, and it also stored multiple petabytes of data in S3. 
Here are some problems they overcame during the migration:</p><ul><li><p><strong>Running Hadoop on Mesos</strong></p><ul><li><p>Lack of visibility into logs and cluster health</p></li><li><p>Hadoop on Mesos could only run MapReduce version 1</p></li><li><p>Cluster underutilization</p></li><li><p>High operational load and difficulty reasoning about the system</p></li><li><p><strong>Solution</strong>: moving away from Mesos.</p></li></ul></li><li><p><strong>Remote reads and writes</strong></p><ul><li><p>By storing all the HDFS data in mounted <a href="https://aws.amazon.com/ebs/">EBS</a> volumes, Airbnb sent large amounts of data over the public <a href="https://aws.amazon.com/ec2/?gclid=CjwKCAjwmrqzBhAoEiwAXVpgosv6h8mDc8mt2gA7-K5_cf91Ng2u73ymFfVMME8Ix7S-c37LuWSvtRoCH7wQAvD_BwE&amp;trk=04e4a6fd-7779-47a2-87a1-3becd8d90d5b&amp;sc_channel=ps&amp;ef_id=CjwKCAjwmrqzBhAoEiwAXVpgosv6h8mDc8mt2gA7-K5_cf91Ng2u73ymFfVMME8Ix7S-c37LuWSvtRoCH7wQAvD_BwE:G:s&amp;s_kwcid=AL!4422!3!589846461771!e!!g!!amazon%20ec2!16178327434!136912441367">Amazon EC2</a> network for queries. &#8594; This goes against Hadoop's design of local reads and writes on disk.</p></li><li><p>Moreover, they mistakenly split the data storage across three separate availability zones within a single AWS region. Each zone was designated as its own rack, so the three replicas were stored in different zones, causing remote reads and writes. 
&#8594; More remote data transfer &#8594; Slow performance.</p></li><li><p><strong>Solution</strong>: having dedicated instances using local storage and running in a single availability zone.</p></li></ul></li><li><p><strong>Heterogeneous Workload on Homogeneous Machines</strong></p><ul><li><p>The architectural components had distinct resource requirements.</p></li><li><p>Hive/Hadoop/HDFS machines required a lot of storage but didn&#8217;t need much RAM or CPU.</p></li><li><p>Presto and Spark required RAM and CPU but didn&#8217;t need much storage.</p></li><li><p><strong>Solution</strong>: choosing a suitable EC2 instance type for each component to save cost and increase resource utilization.</p></li></ul></li><li><p><strong>System Monitoring</strong></p><ul><li><p>One major issue was creating custom monitoring and alerting for the cluster. Hadoop, Hive, and HDFS are complex systems with many potential failure points. Anticipating all failure states and setting reasonable alert thresholds felt like reinventing the wheel for Airbnb.</p></li><li><p><strong>Solution</strong>: They signed a support contract with <a href="https://www.cloudera.com/">Cloudera</a> to benefit from its expertise in architecting and operating these large systems and to reduce the maintenance burden by using the Cloudera Manager tool.</p></li></ul></li></ul><p>After the migration, they cut costs dramatically while significantly improving performance. Here are a few numbers:</p><blockquote><ul><li><p><em>Disk read/write improved from 70&#8211;150MB/sec to 400+ MB/sec</em></p></li><li><p><em>Read throughput is ~3X better</em></p></li><li><p><em>Write throughput is ~2X better</em></p></li><li><p><em>Cost is reduced by 70%</em></p></li></ul></blockquote><h2>Outro</h2><p>That&#8217;s all for my note this week. 
I decided to write notes like this to share more of what I&#8217;ve learned with you.</p><p>I hope this week's note gave you a closer look at Airbnb's internal infrastructure. Now it&#8217;s time for some cool links I found last week.</p><div><hr></div><h1>&#128203; The list</h1><p>&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;</p><p><strong><a href="https://www.databricks.com/blog/open-sourcing-unity-catalog">Databricks - Open Sourcing Unity Catalog</a></strong> &#8212; 12 mins, by Databricks blog</p><blockquote><p><em>We are excited to announce that we are open-sourcing Unity Catalog, the industry&#8217;s first open source catalog for data and AI governance across clouds, data formats, and data platforms.</em></p></blockquote><p>&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;</p><p><strong><a href="https://www.startdataengineering.com/post/data-engineering-projects-with-free-template/">Build Data Engineering Projects with Free Template</a> &#8212; </strong>6 mins, by Start Data Engineering</p><blockquote><p><em>This post will cover the critical concepts of setting up data infrastructure, development workflow, and a few sample data projects that follow this pattern.</em></p></blockquote><p>&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;</p><p><strong><a href="https://medium.com/walmartglobaltech/reliably-processing-trillions-of-kafka-messages-per-day-23494f553ef9">Walmart - Reliably Processing Trillions of Kafka Messages Per Day</a> &#8212; </strong>8 mins, by Ravinder Matte</p><blockquote><p><em>In this article we highlight how Apache Kafka messages are reliably processed at a scale of trillions of messages per day with low cost and elasticity.</em></p></blockquote><p>&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;</p><p><strong><a href="https://clickhouse.com/blog/how-trip.com-migrated-from-elasticsearch-and-built-a-50pb-logging-solution-with-clickhouse?">How trip.com migrated from Elasticsearch 
and built a 50PB logging solution with ClickHouse</a></strong> &#8212; 20 mins, by Dongyu Lin</p><blockquote><p><em>This blog article will explain the story of our logging platform, why we initially built it, the technology we used, and finally, our plan for its future on top of ClickHouse leveraging some of the features like SharedMergeTree.</em></p></blockquote><p>&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;</p><p><strong><a href="https://github.com/Wittline/uber-expenses-tracking">Building an ETL pipeline with Apache Airflow and Visualizing AWS Redshift data using Microsoft Power BI</a> &#8212; </strong>by Ramses Alexander Coraspe Valdez</p><blockquote><p><em>The goal of this project is to track the expenses of <a href="https://www.uber.com/">Uber Rides</a> and <a href="https://www.ubereats.com/">Uber Eats</a> through data Engineering processes using technologies such as Apache Airflow, AWS Redshift, and <a href="https://powerbi.microsoft.com/es-es/">Power BI</a>.</em></p></blockquote><p>&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;</p><p><strong><a href="https://sqlpatterns.com/p/reducing-data-questions-deluge">Reducing Data Questions Deluge</a> </strong><em>&#8212; </em>5 mins, by Ergest Xheblati </p><blockquote><p><em>How properly done self-service analytics can reduce requests on data teams. </em></p></blockquote><p>&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;</p><p><strong><a href="https://luminousmen.com/post/senior-engineer-fatigue">Senior Engineer Fatigue</a></strong> &#8212; 4 mins, by luminousmen</p><blockquote><p><em>Senior fatigue is, perhaps paradoxically, a sign of maturity in engineering. 
It's an indicator that you&#8217;re transitioning from doing everything to ensuring that everything that needs to be done gets done in the most effective way.</em></p></blockquote><div><hr></div><h2>&#128521; Previously on Dimension</h2><blockquote><p><em>Dimension is my sub-newsletter, where I note down things I learn from people smarter than me in the data engineering field. Here is the latest article</em></p></blockquote><div class="digest-post-embed" data-attrs="{&quot;nodeId&quot;:&quot;c84b6047-379c-4185-b92e-ddc66313ae4a&quot;,&quot;caption&quot;:&quot;My name is Vu Trinh, and I am a data engineer. I&#8217;m trying to make my life less dull by spending time learning and researching &#8220;how it works&#8220; in the data engineering field. Here is a place where I share everything I&#8217;ve learned. Not subscribe yet? Here you go:&quot;,&quot;cta&quot;:null,&quot;showBylines&quot;:true,&quot;size&quot;:&quot;lg&quot;,&quot;isEditorNode&quot;:true,&quot;title&quot;:&quot;The Architecture of Apache Druid&quot;,&quot;publishedBylines&quot;:[{&quot;id&quot;:167177248,&quot;name&quot;:&quot;Vu Trinh&quot;,&quot;bio&quot;:&quot;I research and write weekly deep-dive content at 
vutr.substack.com&quot;,&quot;photo_url&quot;:&quot;https://substackcdn.com/image/fetch/f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4805f673-db97-4f7c-85c4-44b345a8de80_256x256.png&quot;,&quot;is_guest&quot;:false,&quot;bestseller_tier&quot;:null}],&quot;post_date&quot;:&quot;2024-06-15T11:01:34.553Z&quot;,&quot;cover_image&quot;:&quot;https://substackcdn.com/image/fetch/f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb18f60f6-d4e6-4aff-ad19-d6d09b9606c6_1399x999.png&quot;,&quot;cover_image_alt&quot;:null,&quot;canonical_url&quot;:&quot;https://vutr.substack.com/p/the-architecture-of-apache-druid&quot;,&quot;section_name&quot;:&quot;Dimensions.&quot;,&quot;video_upload_id&quot;:null,&quot;id&quot;:145161414,&quot;type&quot;:&quot;newsletter&quot;,&quot;reaction_count&quot;:3,&quot;comment_count&quot;:1,&quot;publication_id&quot;:null,&quot;publication_name&quot;:&quot;VuTrinh.&quot;,&quot;publication_logo_url&quot;:&quot;https://substackcdn.com/image/fetch/f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffb91e59e-4df2-4ac8-b3b4-1f4e830f7b42_256x256.png&quot;,&quot;belowTheFold&quot;:true,&quot;youtube_url&quot;:null,&quot;show_links&quot;:null,&quot;feed_url&quot;:null}"></div><div class="pullquote"><p>Let me hear your voice, for example: </p><p>'Your newsletter is so terrible, I can't handle it anymore.'</p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://vutr.substack.com/p/groupby-40-data-infrastructure-at/comments&quot;,&quot;text&quot;:&quot;Leave a comment&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://vutr.substack.com/p/groupby-40-data-infrastructure-at/comments"><span>Leave a comment</span></a></p></div><div class="subscription-widget-wrap-editor" 
data-attrs="{&quot;url&quot;:&quot;https://vutr.substack.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe&quot;,&quot;language&quot;:&quot;en&quot;}" data-component-name="SubscribeWidgetToDOM"><div class="subscription-widget show-subscribe"><div class="preamble"><p class="cta-caption">Thanks for scrolling this far! There's a convenient subscribe box here if you want me to annoy you every week. &#128516;</p></div><form class="subscription-widget-subscribe"><input type="email" class="email-input" name="email" placeholder="Type your email&#8230;" tabindex="-1"><input type="submit" class="button primary" value="Subscribe"><div class="fake-input-wrapper"><div class="fake-input"></div><div class="fake-button"></div></div></form></div></div><p></p>]]></content:encoded></item><item><title><![CDATA[GroupBy #39: 2000+ DBT models in airflow; Serverless Jupyter Notebooks at Meta]]></title><description><![CDATA[Plus: What 10 Years at Uber, Meta and Startups Taught Me About Data Analytics]]></description><link>https://vutr.substack.com/p/groupby-39-how-stripes-document-databases</link><guid isPermaLink="false">https://vutr.substack.com/p/groupby-39-how-stripes-document-databases</guid><dc:creator><![CDATA[Vu Trinh]]></dc:creator><pubDate>Tue, 11 Jun 2024 11:03:28 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!O7T6!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa22211f4-7f87-41fa-acd4-ccd28fc110c3_1400x1000.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p><em>This is <strong>GroupBy</strong>, the weekly compiled resources for data engineers.</em></p><p><em>Not subscribed yet? 
Here you go:</em></p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://vutr.substack.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe now&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://vutr.substack.com/subscribe?"><span>Subscribe now</span></a></p><blockquote><p><em>&#128075; Hi, my name is Vu Trinh, a data engineer.</em></p><p><em>I share my lesson and excellent resources to read in this newsletter.</em></p><p><em>Hope this issue finds you well.</em></p></blockquote><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!O7T6!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa22211f4-7f87-41fa-acd4-ccd28fc110c3_1400x1000.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!O7T6!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa22211f4-7f87-41fa-acd4-ccd28fc110c3_1400x1000.png 424w, https://substackcdn.com/image/fetch/$s_!O7T6!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa22211f4-7f87-41fa-acd4-ccd28fc110c3_1400x1000.png 848w, https://substackcdn.com/image/fetch/$s_!O7T6!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa22211f4-7f87-41fa-acd4-ccd28fc110c3_1400x1000.png 1272w, https://substackcdn.com/image/fetch/$s_!O7T6!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa22211f4-7f87-41fa-acd4-ccd28fc110c3_1400x1000.png 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!O7T6!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa22211f4-7f87-41fa-acd4-ccd28fc110c3_1400x1000.png" width="1400" height="1000" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/a22211f4-7f87-41fa-acd4-ccd28fc110c3_1400x1000.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1000,&quot;width&quot;:1400,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:1963724,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!O7T6!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa22211f4-7f87-41fa-acd4-ccd28fc110c3_1400x1000.png 424w, https://substackcdn.com/image/fetch/$s_!O7T6!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa22211f4-7f87-41fa-acd4-ccd28fc110c3_1400x1000.png 848w, https://substackcdn.com/image/fetch/$s_!O7T6!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa22211f4-7f87-41fa-acd4-ccd28fc110c3_1400x1000.png 1272w, https://substackcdn.com/image/fetch/$s_!O7T6!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa22211f4-7f87-41fa-acd4-ccd28fc110c3_1400x1000.png 1456w" sizes="100vw" fetchpriority="high"></picture></div></a><figcaption class="image-caption">By Canva Image Generator</figcaption></figure></div><div><hr></div><h2>&#128160;&#9478;<a href="https://blog.dataengineer.io/p/how-to-choose-between-batch-micro">How to choose between batch, micro-batch, and streaming when building a data pipeline</a></h2><ul><li><p><em>3 mins, by Zach Wilson</em></p></li></ul><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!4Xu-!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa33399d2-691e-476c-ab84-ef43e969b164_1074x710.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" 
srcset="https://substackcdn.com/image/fetch/$s_!4Xu-!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa33399d2-691e-476c-ab84-ef43e969b164_1074x710.png 424w, https://substackcdn.com/image/fetch/$s_!4Xu-!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa33399d2-691e-476c-ab84-ef43e969b164_1074x710.png 848w, https://substackcdn.com/image/fetch/$s_!4Xu-!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa33399d2-691e-476c-ab84-ef43e969b164_1074x710.png 1272w, https://substackcdn.com/image/fetch/$s_!4Xu-!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa33399d2-691e-476c-ab84-ef43e969b164_1074x710.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!4Xu-!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa33399d2-691e-476c-ab84-ef43e969b164_1074x710.png" width="1074" height="710" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/a33399d2-691e-476c-ab84-ef43e969b164_1074x710.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:710,&quot;width&quot;:1074,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" 
srcset="https://substackcdn.com/image/fetch/$s_!4Xu-!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa33399d2-691e-476c-ab84-ef43e969b164_1074x710.png 424w, https://substackcdn.com/image/fetch/$s_!4Xu-!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa33399d2-691e-476c-ab84-ef43e969b164_1074x710.png 848w, https://substackcdn.com/image/fetch/$s_!4Xu-!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa33399d2-691e-476c-ab84-ef43e969b164_1074x710.png 1272w, https://substackcdn.com/image/fetch/$s_!4Xu-!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa33399d2-691e-476c-ab84-ef43e969b164_1074x710.png 1456w" sizes="100vw"></picture></div></a><figcaption class="image-caption"><a href="https://blog.dataengineer.io/p/how-to-choose-between-batch-micro?utm_source=post-email-title&amp;publication_id=1644342&amp;post_id=145309895&amp;utm_campaign=email-post-title&amp;isFreemail=true&amp;r=2rj6sg&amp;triedRedirect=true&amp;utm_medium=email">Source</a></figcaption></figure></div><blockquote><p><em>Speed is all the rage in data engineering nowadays. Stakeholders want their data sooner with higher quality and efficiency! All this demand for speed can cause data engineers to make stupid decisions by not asking the proper follow-up questions to decipher the true requirements from the false ones!</em></p></blockquote><h2>&#128160;&#9478;<strong><a href="https://engineering.fb.com/2024/06/10/data-infrastructure/serverless-jupyter-notebooks-bento-meta/">Serverless Jupyter Notebooks at Meta</a></strong></h2><ul><li><p><em>5 mins, by Steve Dini</em></p></li></ul><blockquote><p><em>This blog shows how Meta enabled its internal Jupyter Notebook platform, Bento, to run directly in the web browser using a library called <a href="https://pyodide.org/en/stable/">Pyodide</a> that sits on top of <a href="https://webassembly.org/">WebAssembly</a> (Wasm).</em></p></blockquote><h2>&#128160;&#9478;<a href="https://stripe.com/blog/how-stripes-document-databases-supported-99.999-uptime-with-zero-downtime-data-migrations">How Stripe&#8217;s document databases supported 99.999% uptime with zero-downtime data migrations</a></h2><ul><li><p><em>12 mins, by Jimmy Morzaria, Suraj Narkhede</em></p></li></ul><blockquote><p><em>In this blog post we&#8217;ll share an overview of Stripe&#8217;s database infrastructure, and discuss the 
design and application of the Data Movement Platform.</em></p></blockquote><h2>&#128160;&#9478;<strong><a href="https://medium.com/@jeremysrgt/what-i-learned-after-one-year-of-building-a-data-platform-from-scratch-d7075629cab1">What I learned after one year of building a Data Platform from scratch</a></strong></h2><ul><li><p><em>9 mins, by Jeremy Surget</em></p></li></ul><blockquote><p><em>One year ago, I joined a French start-up called <a href="https://www.allowa.com/">Allowa</a>, which is on a mission to be the marketplace for real estate services. I joined as the first data guy to help structure all their data and ultimately extract value from it. Building a data platform from scratch is an amazing experience and I wanted to share the lessons that I learned along the way.</em></p></blockquote><h2>&#128160;&#9478;<strong><a href="https://blog.plerion.com/things-you-wish-you-didnt-need-to-know-about-s3/?utm_source=newsletter.programmingdigest.net&amp;utm_medium=referral&amp;utm_campaign=queueing">Things you wish you didn't need to know about S3</a></strong></h2><ul><li><p><em>8 mins, by Daniel Grzelak</em></p></li></ul><blockquote><p><em>This blog covers some weird things you might not know about Amazon S3.</em></p></blockquote><h2>&#128160;&#9478;<strong><a href="https://medium.com/apache-airflow/how-we-orchestrate-2000-dbt-models-in-apache-airflow-90901504032d">How we orchestrate 2000+ DBT models in Apache Airflow</a></strong></h2><ul><li><p><em>16 mins, by Alexandre Magno Lima Martins</em></p></li></ul><blockquote><p><em>In this blog, you will find how we used Airflow to orchestrate our DBT Core project, creating an intuitive pipeline that empowered data analysts and even product owners to develop and maintain their own data models.</em></p></blockquote><h2>&#128160;&#9478;<strong><a href="https://www.startdataengineering.com/post/python-for-de/">Python Essentials for Data Engineers</a></strong></h2><ul><li><p><em>10 mins, by Start Data 
Engineering</em></p></li></ul><blockquote><p><em>You know Python is important for data engineers. But what does knowing Python mean for data engineering? Python is a programming language that supports a wide range of functions; how would you know if you know it well enough for data engineering?</em></p></blockquote><h2>&#128160;&#9478;<strong><a href="https://towardsdatascience.com/what-10-years-at-uber-meta-and-startups-taught-me-about-data-analytics-fd948b912556">What 10 Years at Uber, Meta and Startups Taught Me About Data Analytics</a></strong></h2><ul><li><p><em>9 mins, by Torsten Walbaum</em></p></li></ul><blockquote><p><em>Over the last 10 years, I have worked in analytical roles in a number of companies, from a small Fintech startup in Germany to high-growth pre-IPO scale-ups (Rippling) and big tech companies (Uber, Meta). Below, you&#8217;ll find ten of my key learnings over the last decade, many of which I&#8217;ve found to hold true regardless of company stage, product or business model.</em></p></blockquote><h2>&#128160;&#9478;<strong><a href="https://www.theseattledataguy.com/how-to-data-model-real-life-examples-of-how-companies-model-their-data/#page-content">How To Data Model &#8211; Real-Life Examples Of How Companies Model Their Data</a></strong></h2><ul><li><p><em>8 mins, by Seattle Data Guy</em></p></li></ul><blockquote><p><em>How companies model their data varies widely.</em></p></blockquote><div><hr></div><h2>&#128640; Previously on Dimension</h2><blockquote><p><em>Dimension is my sub-newsletter, where I note down things I learn from people smarter than me in the data engineering field. Here is the latest article:</em></p></blockquote><div class="digest-post-embed" data-attrs="{&quot;nodeId&quot;:&quot;ec4717d0-fcf0-4886-a464-15d2eee4ca3e&quot;,&quot;caption&quot;:&quot;My name is Vu Trinh, and I am a data engineer. 
I&#8217;m trying to make my life less dull by spending time learning and researching &#8220;how it works&#8220; in the data engineering field. Here is a place where I share everything I&#8217;ve learned. Not subscribe yet? Here you go:&quot;,&quot;cta&quot;:null,&quot;showBylines&quot;:true,&quot;size&quot;:&quot;lg&quot;,&quot;isEditorNode&quot;:true,&quot;title&quot;:&quot;4 Trillion Events Daily at LinkedIn&quot;,&quot;publishedBylines&quot;:[{&quot;id&quot;:167177248,&quot;name&quot;:&quot;Vu Trinh&quot;,&quot;bio&quot;:&quot;I research and write weekly deep-dive content at vutr.substack.com&quot;,&quot;photo_url&quot;:&quot;https://substackcdn.com/image/fetch/f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4805f673-db97-4f7c-85c4-44b345a8de80_256x256.png&quot;,&quot;is_guest&quot;:false,&quot;bestseller_tier&quot;:null}],&quot;post_date&quot;:&quot;2024-06-08T11:02:12.912Z&quot;,&quot;cover_image&quot;:&quot;https://substackcdn.com/image/fetch/f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0ff4879d-fe91-4f06-a1a1-2c3e829a9d61_1404x1010.png&quot;,&quot;cover_image_alt&quot;:null,&quot;canonical_url&quot;:&quot;https://vutr.substack.com/p/4-trillion-events-daily-at-linkedin&quot;,&quot;section_name&quot;:&quot;Dimensions.&quot;,&quot;video_upload_id&quot;:null,&quot;id&quot;:145014385,&quot;type&quot;:&quot;newsletter&quot;,&quot;reaction_count&quot;:12,&quot;comment_count&quot;:1,&quot;publication_id&quot;:null,&quot;publication_name&quot;:&quot;VuTrinh.&quot;,&quot;publication_logo_url&quot;:&quot;https://substackcdn.com/image/fetch/f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffb91e59e-4df2-4ac8-b3b4-1f4e830f7b42_256x256.png&quot;,&quot;belowTheFold&quot;:true,&quot;youtube_url&quot;:null,&quot;show_links&quot;:null,&quot;feed_url&quot;:null}"></div><div 
class="pullquote"><p>Let me hear your voice, for example: </p><p>'Your newsletter is so terrible, I can't handle it anymore.'</p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://vutr.substack.com/p/groupby-39-how-stripes-document-databases/comments&quot;,&quot;text&quot;:&quot;Leave a comment&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://vutr.substack.com/p/groupby-39-how-stripes-document-databases/comments"><span>Leave a comment</span></a></p></div><div class="subscription-widget-wrap-editor" data-attrs="{&quot;url&quot;:&quot;https://vutr.substack.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe&quot;,&quot;language&quot;:&quot;en&quot;}" data-component-name="SubscribeWidgetToDOM"><div class="subscription-widget show-subscribe"><div class="preamble"><p class="cta-caption">Thanks for scrolling this far! There's a convenient subscribe box here if you want me to annoy you every week. 
&#128516;</p></div><form class="subscription-widget-subscribe"><input type="email" class="email-input" name="email" placeholder="Type your email&#8230;" tabindex="-1"><input type="submit" class="button primary" value="Subscribe"><div class="fake-input-wrapper"><div class="fake-input"></div><div class="fake-button"></div></div></form></div></div><p></p>]]></content:encoded></item><item><title><![CDATA[GroupBy #38: Modernizing Uber’s Batch Data Infrastructure with Google Cloud Platform, Apache Iceberg - What Is It]]></title><description><![CDATA[Plus: The ABCs of Data Products: An Essential Beginner's Introduction,]]></description><link>https://vutr.substack.com/p/groupby-38-modernizing-ubers-batch</link><guid isPermaLink="false">https://vutr.substack.com/p/groupby-38-modernizing-ubers-batch</guid><dc:creator><![CDATA[Vu Trinh]]></dc:creator><pubDate>Tue, 04 Jun 2024 11:02:40 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!2iW2!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdfe58c18-21b2-45d0-abfe-ff3ec8dfe1f8_1400x1000.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p><em>This is <strong>GroupBy</strong>, the weekly compiled resources for data engineers.</em></p><p><em>Not subscribed yet? 
Here you go:</em></p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://vutr.substack.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe now&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://vutr.substack.com/subscribe?"><span>Subscribe now</span></a></p><blockquote><p><em>&#128075; Hi, my name is Vu Trinh, a data engineer.</em></p><p><em>I share my lesson and excellent resources to read in this newsletter.</em></p><p><em>Hope this issue finds you well.</em></p></blockquote><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!2iW2!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdfe58c18-21b2-45d0-abfe-ff3ec8dfe1f8_1400x1000.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!2iW2!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdfe58c18-21b2-45d0-abfe-ff3ec8dfe1f8_1400x1000.png 424w, https://substackcdn.com/image/fetch/$s_!2iW2!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdfe58c18-21b2-45d0-abfe-ff3ec8dfe1f8_1400x1000.png 848w, https://substackcdn.com/image/fetch/$s_!2iW2!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdfe58c18-21b2-45d0-abfe-ff3ec8dfe1f8_1400x1000.png 1272w, https://substackcdn.com/image/fetch/$s_!2iW2!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdfe58c18-21b2-45d0-abfe-ff3ec8dfe1f8_1400x1000.png 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!2iW2!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdfe58c18-21b2-45d0-abfe-ff3ec8dfe1f8_1400x1000.png" width="1400" height="1000" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/dfe58c18-21b2-45d0-abfe-ff3ec8dfe1f8_1400x1000.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1000,&quot;width&quot;:1400,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:1936251,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!2iW2!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdfe58c18-21b2-45d0-abfe-ff3ec8dfe1f8_1400x1000.png 424w, https://substackcdn.com/image/fetch/$s_!2iW2!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdfe58c18-21b2-45d0-abfe-ff3ec8dfe1f8_1400x1000.png 848w, https://substackcdn.com/image/fetch/$s_!2iW2!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdfe58c18-21b2-45d0-abfe-ff3ec8dfe1f8_1400x1000.png 1272w, https://substackcdn.com/image/fetch/$s_!2iW2!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdfe58c18-21b2-45d0-abfe-ff3ec8dfe1f8_1400x1000.png 1456w" sizes="100vw" fetchpriority="high"></picture></div></a><figcaption class="image-caption">Image generated by the Canva Image Generator</figcaption></figure></div><div><hr></div><h2>&#128160;&#9478;<strong><a href="https://www.uber.com/en-SG/blog/modernizing-ubers-data-infrastructure-with-gcp/">Modernizing Uber&#8217;s Batch Data Infrastructure with Google Cloud Platform</a></strong></h2><ul><li><p><em>7 mins, by Uber Engineering Blog</em></p></li><li><p><em>Topics: Data Infrastructure</em></p></li></ul><blockquote><p><em>Uber runs one of the largest Hadoop installations in the world. Our <a href="https://www.uber.com/blog/uber-big-data-platform/">Hadoop ecosystem</a> hosts more than 1 exabyte of data across tens of thousands of servers in each of our two regions. 
The open-source data ecosystem, including the Hadoop ecosystem discussed in previous <a href="https://www.uber.com/blog/engineering/data/">engineering blogs</a>, has been the core of our data platform. Today, we are excited to announce that we are working with Google Cloud Platform (GCP) to move our batch data analytics and ML training stack to GCP.</em></p></blockquote><h2>&#128160;&#9478;Spotify | <a href="https://engineering.atspotify.com/2024/05/data-platform-explained-part-ii/">Data Platform Explained Part II</a></h2><ul><li><p><em>6 mins, by Spotify Engineering Blog</em></p></li><li><p><em>Topics: Data Platform</em></p></li></ul><blockquote><p><em>In the second part, Spotify engineers talk about scalability, the tooling they use and provide, and the value each building block brings to a data platform. Finally, they cover their strategy for navigating the complexity of a data ecosystem by building a strong community around it.</em></p></blockquote><h2>&#128160;&#9478;<a href="https://datagibberish.com/p/data-products-101">The ABCs of Data Products: An Essential Beginner's Introduction</a></h2><ul><li><p><em>6 mins, by </em><span class="mention-wrap" data-attrs="{&quot;name&quot;:&quot;Yordan Ivanov&quot;,&quot;id&quot;:40945395,&quot;type&quot;:&quot;user&quot;,&quot;url&quot;:null,&quot;photo_url&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/76f52904-5428-4d97-82a5-3faa722b8d46_2234x1253.jpeg&quot;,&quot;uuid&quot;:&quot;78255edf-bffd-4bf6-8a7f-fe0b4a5ccb86&quot;}" data-component-name="MentionToDOM"></span> </p></li><li><p><em>Topics: Data Platform</em></p></li></ul><blockquote><p><em>From Concept to Execution: Building Strong Foundations for Data Engineering Success</em></p></blockquote><p>I&#8217;ve just found an awesome newsletter, <a href="https://datagibberish.com/">Data Gibberish</a>, by <span class="mention-wrap" data-attrs="{&quot;name&quot;:&quot;Yordan 
Ivanov&quot;,&quot;id&quot;:40945395,&quot;type&quot;:&quot;user&quot;,&quot;url&quot;:null,&quot;photo_url&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/76f52904-5428-4d97-82a5-3faa722b8d46_2234x1253.jpeg&quot;,&quot;uuid&quot;:&quot;88f1c7e0-ceba-4494-9c4f-0334f34fba0a&quot;}" data-component-name="MentionToDOM"></span>; if you&#8217;re looking for easy-to-read and valuable content on DataOps, you should visit Yordan&#8217;s newsletter.</p><h2>&#128160;&#9478;<a href="https://www.junaideffendi.com/p/incremental-deprecation">Incremental Deprecation</a></h2><ul><li><p><em>2 mins, by </em><span class="mention-wrap" data-attrs="{&quot;name&quot;:&quot;Junaid Effendi&quot;,&quot;id&quot;:21393641,&quot;type&quot;:&quot;user&quot;,&quot;url&quot;:null,&quot;photo_url&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/b06559f3-ee33-46f8-bfa0-50964179f235_1200x1200.png&quot;,&quot;uuid&quot;:&quot;c56bc3a5-236c-4e54-b777-5eee1ccc9217&quot;}" data-component-name="MentionToDOM"></span> </p></li><li><p><em>Topics: Software Engineering</em></p></li></ul><blockquote><p><em>Deprecate incrementally along with incremental releases to save cost, avoid tech debt and more, all in today's article.</em></p></blockquote><h2>&#128160;&#9478;<a href="https://seattledataguy.substack.com/p/apache-iceberg-what-is-it">Apache Iceberg - What Is It</a></h2><ul><li><p><em>13 mins, by </em><span class="mention-wrap" data-attrs="{&quot;name&quot;:&quot;Julien Hurault&quot;,&quot;id&quot;:35734446,&quot;type&quot;:&quot;user&quot;,&quot;url&quot;:null,&quot;photo_url&quot;:&quot;https://substackcdn.com/image/fetch/f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd5c58ce4-f4f1-4eac-854c-a32157cb7b5a_499x579.png&quot;,&quot;uuid&quot;:&quot;930ac392-ce82-492f-8408-fdea5d3e8d73&quot;}" data-component-name="MentionToDOM"></span></p></li><li><p><em>Topics: Storage Formats, 
Lakehouse</em></p></li></ul><blockquote><p><em>This article consolidates everything I've learned about Iceberg over the past month. It&#8217;s a bit long; I hope you will enjoy it!</em></p></blockquote><h2>&#128160;&#9478;<strong><a href="https://www.startdataengineering.com/post/cost-effective-pipelines/">Building Cost Efficient Data Pipelines with Python &amp; DuckDB</a></strong></h2><ul><li><p><em>11 mins, by Start Data Engineering</em></p></li><li><p><em>Topics: Data Processing, HandsOn</em></p></li></ul><blockquote><p><em>In this post, we will discuss an approach that uses the latest advancements in in-memory processing coupled with cheap, powerful hardware to reduce data processing costs significantly!</em></p><p><em>We will use DuckDB with ephemeral VMs to process data. While DuckDB + VMs are typically cheaper, you will need to be mindful of how you use them. In this post, we will see</em></p></blockquote><h2>&#128160;&#9478;<strong><a href="https://www.startdataengineering.com/post/dbt-data-build-tool-tutorial/">dbt (Data Build Tool) Tutorial</a></strong></h2><ul><li><p><em>10 mins, by Start Data Engineering</em></p></li><li><p><em>Topics: Data Processing, HandsOn</em></p></li></ul><blockquote><p><em>An excellent, detailed project to help you get started with dbt.</em></p></blockquote><h2>&#128160;&#9478;<strong><a href="https://towardsdatascience.com/building-durable-data-pipelines-cf3cbf68a7e6">Building Durable Data Pipelines</a></strong></h2><ul><li><p><em>12 mins, by Mike Shakhomirov</em></p></li><li><p><em>Topics: Data Pipeline, HandsOn</em></p></li></ul><blockquote><p><em>Data engineering techniques for robust and sustainable ETL</em></p></blockquote><div><hr></div><h2>&#128640; Previously on Dimension</h2><blockquote><p><em>Dimension is my sub-newsletter, where I note down things I learn from people smarter than me in the data engineering field.</em></p></blockquote><p>This week&#8217;s article is my guest post on Junaid&#8217;s newsletter.</p><div 
class="embedded-post-wrap" data-attrs="{&quot;id&quot;:144939776,&quot;url&quot;:&quot;https://www.junaideffendi.com/p/everything-you-need-to-know-about&quot;,&quot;publication_id&quot;:2256445,&quot;publication_name&quot;:&quot;Junaid Effendi | Sharing knowledge for Engineers&quot;,&quot;publication_logo_url&quot;:&quot;https://substackcdn.com/image/fetch/f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6aa39096-d454-439f-98b5-baea84b501aa_800x800.png&quot;,&quot;title&quot;:&quot;Everything you need to know about MapReduce&quot;,&quot;truncated_body_text&quot;:&quot;Table of contents The Motivation The Model The MapReduce Implementation Support Features&quot;,&quot;date&quot;:&quot;2024-05-29T16:31:11.942Z&quot;,&quot;like_count&quot;:11,&quot;comment_count&quot;:5,&quot;bylines&quot;:[{&quot;id&quot;:21393641,&quot;name&quot;:&quot;Junaid Effendi&quot;,&quot;handle&quot;:&quot;junaideffendi&quot;,&quot;previous_name&quot;:null,&quot;photo_url&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/b06559f3-ee33-46f8-bfa0-50964179f235_1200x1200.png&quot;,&quot;bio&quot;:&quot;I love to write, been writing since 2016.&quot;,&quot;profile_set_up_at&quot;:&quot;2022-05-25T22:34:01.768Z&quot;,&quot;publicationUsers&quot;:[{&quot;id&quot;:2273688,&quot;user_id&quot;:21393641,&quot;publication_id&quot;:2256445,&quot;role&quot;:&quot;admin&quot;,&quot;public&quot;:true,&quot;is_primary&quot;:false,&quot;publication&quot;:{&quot;id&quot;:2256445,&quot;name&quot;:&quot;Junaid Effendi | Sharing knowledge for Engineers&quot;,&quot;subdomain&quot;:&quot;junaideffendi&quot;,&quot;custom_domain&quot;:&quot;www.junaideffendi.com&quot;,&quot;custom_domain_optional&quot;:false,&quot;hero_text&quot;:&quot;Covering tech, career, data, growth experiences from my 
journey.&quot;,&quot;logo_url&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/6aa39096-d454-439f-98b5-baea84b501aa_800x800.png&quot;,&quot;author_id&quot;:21393641,&quot;theme_var_background_pop&quot;:&quot;#8AE1A2&quot;,&quot;created_at&quot;:&quot;2024-01-13T20:16:55.701Z&quot;,&quot;rss_website_url&quot;:null,&quot;email_from_name&quot;:&quot;Junaid Effendi&quot;,&quot;copyright&quot;:&quot;Junaid Effendi&quot;,&quot;founding_plan_name&quot;:null,&quot;community_enabled&quot;:true,&quot;invite_only&quot;:false,&quot;payments_state&quot;:&quot;disabled&quot;,&quot;language&quot;:null,&quot;explicit&quot;:false}}],&quot;is_guest&quot;:false,&quot;bestseller_tier&quot;:null},{&quot;id&quot;:167177248,&quot;name&quot;:&quot;Vu Trinh&quot;,&quot;handle&quot;:&quot;vutr&quot;,&quot;previous_name&quot;:null,&quot;photo_url&quot;:&quot;https://substackcdn.com/image/fetch/f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4805f673-db97-4f7c-85c4-44b345a8de80_256x256.png&quot;,&quot;bio&quot;:&quot;I research and write weekly deep-dive content at vutr.substack.com&quot;,&quot;profile_set_up_at&quot;:&quot;2023-09-06T14:19:32.001Z&quot;,&quot;is_guest&quot;:true,&quot;bestseller_tier&quot;:null,&quot;primaryPublicationId&quot;:1930705,&quot;primaryPublicationName&quot;:&quot;VuTrinh.&quot;,&quot;primaryPublicationUrl&quot;:&quot;https://vutr.substack.com&quot;,&quot;primaryPublicationSubscribeUrl&quot;:&quot;https://vutr.substack.com/subscribe?&quot;}],&quot;utm_campaign&quot;:null,&quot;belowTheFold&quot;:true,&quot;type&quot;:&quot;newsletter&quot;,&quot;language&quot;:&quot;en&quot;}" data-component-name="EmbeddedPostToDOM"><a class="embedded-post" native="true" href="https://www.junaideffendi.com/p/everything-you-need-to-know-about?utm_source=substack&amp;utm_campaign=post_embed&amp;utm_medium=web"><div class="embedded-post-header"><img class="embedded-post-publication-logo" 
src="https://substackcdn.com/image/fetch/$s_!iYad!,w_56,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6aa39096-d454-439f-98b5-baea84b501aa_800x800.png" loading="lazy"><span class="embedded-post-publication-name">Junaid Effendi | Sharing knowledge for Engineers</span></div><div class="embedded-post-title-wrapper"><div class="embedded-post-title">Everything you need to know about MapReduce</div></div><div class="embedded-post-body">Table of contents The Motivation The Model The MapReduce Implementation Support Features&#8230;</div><div class="embedded-post-cta-wrapper"><span class="embedded-post-cta">Read more</span></div><div class="embedded-post-meta">2 years ago &#183; 11 likes &#183; 5 comments &#183; Junaid Effendi and Vu Trinh</div></a></div><div class="pullquote"><p>Let me hear your voice, for example: </p><p>'Your newsletter is so terrible, I can't handle it anymore.'</p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://vutr.substack.com/p/groupby-38-modernizing-ubers-batch/comments&quot;,&quot;text&quot;:&quot;Leave a comment&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://vutr.substack.com/p/groupby-38-modernizing-ubers-batch/comments"><span>Leave a comment</span></a></p></div><div class="subscription-widget-wrap-editor" data-attrs="{&quot;url&quot;:&quot;https://vutr.substack.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe&quot;,&quot;language&quot;:&quot;en&quot;}" data-component-name="SubscribeWidgetToDOM"><div class="subscription-widget show-subscribe"><div class="preamble"><p class="cta-caption">Thanks for scrolling this far! There's a convenient subscribe box here if you want me to annoy you every week. 
&#128516;</p></div><form class="subscription-widget-subscribe"><input type="email" class="email-input" name="email" placeholder="Type your email&#8230;" tabindex="-1"><input type="submit" class="button primary" value="Subscribe"><div class="fake-input-wrapper"><div class="fake-input"></div><div class="fake-button"></div></div></form></div></div><p></p>]]></content:encoded></item><item><title><![CDATA[GroupBy #37: Composable data management at Meta, How Uber Accomplishes Job Counting At Scale]]></title><description><![CDATA[Plus: The author notes on learning new things.]]></description><link>https://vutr.substack.com/p/groupby-37-composable-data-management</link><guid isPermaLink="false">https://vutr.substack.com/p/groupby-37-composable-data-management</guid><dc:creator><![CDATA[Vu Trinh]]></dc:creator><pubDate>Tue, 28 May 2024 11:03:27 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!QeFA!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff86755a5-c880-4237-b5fd-f01d36e305ee_1400x1000.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p><em>This is <strong>GroupBy</strong>, the weekly compiled resources for data engineers.</em></p><p><em>Not subscribed yet? 
Here you go:</em></p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://vutr.substack.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe now&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://vutr.substack.com/subscribe?"><span>Subscribe now</span></a></p><blockquote><p><em>&#128075; Hi, my name is Vu Trinh, a data engineer.</em></p><p><em>I share my lesson and excellent resources to read in this newsletter.</em></p><p><em>Hope this issue finds you well.</em></p></blockquote><div><hr></div><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!QeFA!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff86755a5-c880-4237-b5fd-f01d36e305ee_1400x1000.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!QeFA!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff86755a5-c880-4237-b5fd-f01d36e305ee_1400x1000.png 424w, https://substackcdn.com/image/fetch/$s_!QeFA!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff86755a5-c880-4237-b5fd-f01d36e305ee_1400x1000.png 848w, https://substackcdn.com/image/fetch/$s_!QeFA!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff86755a5-c880-4237-b5fd-f01d36e305ee_1400x1000.png 1272w, https://substackcdn.com/image/fetch/$s_!QeFA!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff86755a5-c880-4237-b5fd-f01d36e305ee_1400x1000.png 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!QeFA!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff86755a5-c880-4237-b5fd-f01d36e305ee_1400x1000.png" width="1400" height="1000" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/f86755a5-c880-4237-b5fd-f01d36e305ee_1400x1000.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1000,&quot;width&quot;:1400,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:2905586,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!QeFA!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff86755a5-c880-4237-b5fd-f01d36e305ee_1400x1000.png 424w, https://substackcdn.com/image/fetch/$s_!QeFA!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff86755a5-c880-4237-b5fd-f01d36e305ee_1400x1000.png 848w, https://substackcdn.com/image/fetch/$s_!QeFA!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff86755a5-c880-4237-b5fd-f01d36e305ee_1400x1000.png 1272w, https://substackcdn.com/image/fetch/$s_!QeFA!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff86755a5-c880-4237-b5fd-f01d36e305ee_1400x1000.png 1456w" sizes="100vw" fetchpriority="high"></picture></div></a><figcaption class="image-caption">Image generated by the Microsoft Image Generator.</figcaption></figure></div><div><hr></div><h2><strong>&#128522; It&#8217;s Vu Trinh here &#8230;</strong></h2><blockquote><p><em>My note on learning new things.</em></p></blockquote><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!PI9D!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa2d315df-4814-4082-bba2-b8d7be5df2c6_989x473.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" 
srcset="https://substackcdn.com/image/fetch/$s_!PI9D!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa2d315df-4814-4082-bba2-b8d7be5df2c6_989x473.png 424w, https://substackcdn.com/image/fetch/$s_!PI9D!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa2d315df-4814-4082-bba2-b8d7be5df2c6_989x473.png 848w, https://substackcdn.com/image/fetch/$s_!PI9D!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa2d315df-4814-4082-bba2-b8d7be5df2c6_989x473.png 1272w, https://substackcdn.com/image/fetch/$s_!PI9D!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa2d315df-4814-4082-bba2-b8d7be5df2c6_989x473.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!PI9D!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa2d315df-4814-4082-bba2-b8d7be5df2c6_989x473.png" width="989" height="473" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/a2d315df-4814-4082-bba2-b8d7be5df2c6_989x473.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:473,&quot;width&quot;:989,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:1521083,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" 
srcset="https://substackcdn.com/image/fetch/$s_!PI9D!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa2d315df-4814-4082-bba2-b8d7be5df2c6_989x473.png 424w, https://substackcdn.com/image/fetch/$s_!PI9D!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa2d315df-4814-4082-bba2-b8d7be5df2c6_989x473.png 848w, https://substackcdn.com/image/fetch/$s_!PI9D!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa2d315df-4814-4082-bba2-b8d7be5df2c6_989x473.png 1272w, https://substackcdn.com/image/fetch/$s_!PI9D!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa2d315df-4814-4082-bba2-b8d7be5df2c6_989x473.png 1456w" sizes="100vw" loading="lazy"></picture></div></a><figcaption class="image-caption">Myself without the learning strategy.</figcaption></figure></div><p>For me, data engineering requires a lot of skills, from hard to soft ones. Thus, the daily DE job allows me to learn many things.</p><p>Besides the daily DE job, as you know, I&#8217;ve spent a decent amount of time on my newsletter. From GroupBy to Dimension, I learn new things whenever I write.</p><p>After being on the journey for a while, here are a few of my notes about learning new things:</p><ul><li><p>Before spending time learning something, I ask myself, &#8220;Why do I need to learn this?&#8221; and set expectations for what I will learn.</p></li><li><p>After settling the goal and expectations, I sit down and start learning; the important thing is that I try to focus on <strong>ONLY</strong> <strong>ONE</strong> thing. For example, if I want to learn SQL window functions, I spend all my learning time on them; I avoid splitting it up like &#8220;one hour for learning this&#8221; or &#8220;two hours for learning that.&#8221;</p></li><li><p>The first step when I start learning is asking, &#8220;Why does this exist? It must solve some problems. So what are the problems?&#8221; If you find it hard to imagine, try to answer this, and you will know what I mean: &#8220;Why does Apache Airflow exist? What problem does it solve?&#8221;</p><ul><li><p>I ask these questions every time, from when I first learned how to write an API to when I started reading papers about OLAP systems.</p></li></ul></li><li><p>Then, I try to break down the things I will learn into their most basic form. 
There is a specific term for this approach: <a href="https://www.forbes.com/sites/forbescommunicationscouncil/2023/09/13/first-principles-thinking-the-blueprint-for-solving-business-problems/?sh=8e5680b33d8c">First Principles Thinking</a>. There is a video on YouTube where <a href="https://www.youtube.com/watch?v=XK6MyKmIew8">Elon Musk talks about this</a>; the video may explain it better than I can.</p></li><li><p>The above steps give me a fundamental understanding of the topic. After that, I try to &#8220;materialize&#8221; what I have learned: writing a small paragraph, writing a piece of code, deploying this and that. The MOST IMPORTANT thing here is to start simple. If I'm learning a new programming language, I will try a small function or class first.</p><ul><li><p>Trying to do complicated things first usually prevents me from completing them. This frustrates and demotivates me, and eventually, I give up. Instead, I start small and improve with every iteration; then, I am ready to take on the complicated things.</p></li></ul></li><li><p>Exposing my work. After achieving the learning goal, I try to expose it: posting on a writing platform, having seniors review my code,&#8230; I try to get as much feedback as possible.</p></li><li><p>I almost forgot this point: It&#8217;s okay if I get stuck a little bit. In the past, when I faced obstacles in the process, I felt terrible about myself: &#8220;Why am I so stupid? Why am I so slow?&#8221; Over time, I realized it&#8217;s okay if I get stuck; that&#8217;s not a big deal, and the terrible feelings will not solve the problem. Instead, I try to relax and enjoy the learning process. Keeping moving is the most critical factor here.</p></li></ul><p>That&#8217;s it. These are only my notes, and I do not even strictly follow them (last week, I tried to jump right into some advanced Rust concepts and felt so lost). 
</p><p>The thing I want to say here is&#8230;</p><p>&#8230; you must have a strategy. </p><p>Learning is a lifelong process, and our time is finite, so we need an efficient way to learn.</p><p>So, what is your learning strategy?</p><div class="pullquote"><p>The curated list</p></div><h2>&#128160;&#9478;<a href="https://engineering.fb.com/2024/05/22/data-infrastructure/composable-data-management-at-meta/">Composable data management at Meta</a></h2><ul><li><p><em>9 mins, by Pedro Pedreira, Amit Purohit</em></p></li><li><p><em>Topics: Data Infrastructure</em></p></li></ul><blockquote><p><em>In recent years, Meta&#8217;s data management systems have evolved into a composable architecture that creates interoperability and promotes reusability. We&#8217;re sharing how we&#8217;ve achieved this, in part, by leveraging <a href="https://engineering.fb.com/2023/03/09/open-source/velox-open-source-execution-engine/">Velox</a>, Meta&#8217;s open source execution engine, as well as work ahead as we continue to rethink our data management systems.</em>&nbsp;</p></blockquote><h2>&#128160;&#9478;<a href="https://www.uber.com/en-SG/blog/job-counting-at-scale/?uclick_id=15b6739c-0acd-406e-bdf6-884992beefa0">How Uber Accomplishes Job Counting At Scale</a></h2><ul><li><p><em>11 mins, by Uber Engineering Blog</em></p></li><li><p><em>Topics: Data Processing</em></p></li></ul><blockquote><p><em>Uber operates on a massive scale, facilitating over 2.2 billion trips every quarter. Deriving even simple insights necessitates a scaled solution. In our case, we needed to count the number of jobs someone had participated in while on the Uber platform for arbitrary time windows. 
This article focuses on the challenges faced and lessons learned as we integrated Apache Pinot&#8482; into our solution.</em></p></blockquote><h2>&#128160;&#9478;<a href="https://atomic.engineering/what-exactly-is-tech-debt-45750ac27039">What exactly is &#8220;tech debt&#8221;?</a></h2><ul><li><p><em>5 mins, by Jacob Bennett</em></p></li><li><p><em>Topics: Software Development</em></p></li></ul><blockquote><p><em>Defining and quantifying technical debt</em></p></blockquote><h2>&#128160;&#9478;<a href="https://dataanalysis.substack.com/p/what-is-the-best-advice-you-have-d7c">What Is the Best Advice You Have Ever Received? - Issue 203</a></h2><ul><li><p><em>15 mins, by Olga Berezovsky</em></p></li><li><p><em>Topics: Career</em></p></li></ul><blockquote><p><em>What changed or transformed your career? Take some advice from data and analytics leaders.</em></p></blockquote><h2>&#128160;&#9478;<a href="https://artem.krylysov.com/blog/2023/04/19/how-rocksdb-works/">How RocksDB works</a></h2><ul><li><p><em>15 mins, by Artem Krylysov</em></p></li><li><p><em>Topics: Key-value store, deep dive</em></p></li></ul><blockquote><p><em>I spent the past 4 years at Datadog building and running services on top of RocksDB in production. In this post, I'll give a high-level overview of how RocksDB works.</em></p></blockquote><h2>&#128160;&#9478;<strong><a href="https://blog.det.life/pydantic-for-experts-reusing-importing-validators-2a4300bdcc81">Pydantic for Experts: Reusing &amp; Importing Validators</a></strong></h2><ul><li><p><em>10 mins, by Yaakov Bressler</em></p></li><li><p><em>Topics: Python, Coding</em></p></li></ul><blockquote><p><em>Advanced techniques for reusing and importing validation across Python models. Pydantic introduces a new way of thinking about schema enforcement &#8212; as policy and as manipulation. 
In other words, assume a schema can be transformed from a variety of inputs, if all of these fail, then the object is invalid.</em></p></blockquote><h2>&#128160;&#9478;<strong><a href="https://medium.com/@yukithejapanese/duckdb-vs-polars-which-one-is-faster-61e73a7680e0">DuckDB vs Polars &#8212; Which One Is Faster?</a></strong></h2><ul><li><p><em>10 mins, by Yuki Kakegawa</em></p></li><li><p><em>Topics: Data Processing</em></p></li></ul><blockquote><p><em>An unofficial benchmark on DuckDB and Polars. Read the article to see how the author carries out the benchmark.</em></p></blockquote><h2>&#128160;&#9478;<a href="https://juhache.substack.com/p/virtual-data-environments">Virtual Data Environments</a></h2><ul><li><p><em>9 mins, by Julien Hurault</em></p></li><li><p><em>Topics: Data Infrastructure, Data Architecture, Data Warehouse</em></p></li></ul><blockquote><p><em>The separation of computation and storage offered by many modern data infrastructure tools enables us to design smarter development processes based on virtual data environments.</em></p></blockquote><div><hr></div><h2>&#128640; Previously on Dimension</h2><blockquote><p><em>Dimension is my sub-newsletter, where I note down things I learn from people smarter than me in the data engineering field. Here is my latest article.</em></p></blockquote><div class="digest-post-embed" data-attrs="{&quot;nodeId&quot;:&quot;b9db688a-bfda-48c4-950b-d32abe16d791&quot;,&quot;caption&quot;:&quot;My name is Vu Trinh, and I am a data engineer. I&#8217;m trying to make my life less dull by spending time learning and researching &#8220;how it works&#8220; in the data engineering field. Here is a place where I share everything I&#8217;ve learned. Not subscribe yet? 
Here you go:&quot;,&quot;cta&quot;:null,&quot;showBylines&quot;:true,&quot;size&quot;:&quot;lg&quot;,&quot;isEditorNode&quot;:true,&quot;title&quot;:&quot;How Twitter processes 4 billion events in real-time daily&quot;,&quot;publishedBylines&quot;:[{&quot;id&quot;:167177248,&quot;name&quot;:&quot;Vu Trinh&quot;,&quot;bio&quot;:&quot;This is me&quot;,&quot;photo_url&quot;:&quot;https://substackcdn.com/image/fetch/f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4805f673-db97-4f7c-85c4-44b345a8de80_256x256.png&quot;,&quot;is_guest&quot;:false,&quot;bestseller_tier&quot;:null}],&quot;post_date&quot;:&quot;2024-05-25T11:02:08.457Z&quot;,&quot;cover_image&quot;:&quot;https://substackcdn.com/image/fetch/f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6098c6f2-813b-42eb-af71-b99badab5dae_1395x996.png&quot;,&quot;cover_image_alt&quot;:null,&quot;canonical_url&quot;:&quot;https://vutr.substack.com/p/how-twitter-processes-4-billion-events&quot;,&quot;section_name&quot;:&quot;Dimensions.&quot;,&quot;video_upload_id&quot;:null,&quot;id&quot;:144766185,&quot;type&quot;:&quot;newsletter&quot;,&quot;reaction_count&quot;:4,&quot;comment_count&quot;:1,&quot;publication_id&quot;:null,&quot;publication_name&quot;:&quot;VuTrinh.&quot;,&quot;publication_logo_url&quot;:&quot;https://substackcdn.com/image/fetch/f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffb91e59e-4df2-4ac8-b3b4-1f4e830f7b42_256x256.png&quot;,&quot;belowTheFold&quot;:true,&quot;youtube_url&quot;:null,&quot;show_links&quot;:null,&quot;feed_url&quot;:null}"></div><div class="pullquote"><p>Let me hear your voice, for example: </p><p>'Your newsletter is so terrible, I can't handle it anymore.'</p><p class="button-wrapper" 
data-attrs="{&quot;url&quot;:&quot;https://vutr.substack.com/p/groupby-37-composable-data-management/comments&quot;,&quot;text&quot;:&quot;Leave a comment&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://vutr.substack.com/p/groupby-37-composable-data-management/comments"><span>Leave a comment</span></a></p></div><div class="subscription-widget-wrap-editor" data-attrs="{&quot;url&quot;:&quot;https://vutr.substack.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe&quot;,&quot;language&quot;:&quot;en&quot;}" data-component-name="SubscribeWidgetToDOM"><div class="subscription-widget show-subscribe"><div class="preamble"><p class="cta-caption">Thanks for scrolling this far! There's a convenient subscribe box here if you want me to annoy you every week. &#128516;</p></div><form class="subscription-widget-subscribe"><input type="email" class="email-input" name="email" placeholder="Type your email&#8230;" tabindex="-1"><input type="submit" class="button primary" value="Subscribe"><div class="fake-input-wrapper"><div class="fake-input"></div><div class="fake-button"></div></div></form></div></div><p></p>]]></content:encoded></item><item><title><![CDATA[GroupBy #36: Agoda- How We Solve Load Balancing Challenges in Apache Kafka, How to reduce your Snowflake cost]]></title><description><![CDATA[Plus: What we learned after running Airflow on Kubernetes for 2 years]]></description><link>https://vutr.substack.com/p/groupby-36-agoda-how-we-solve-load</link><guid isPermaLink="false">https://vutr.substack.com/p/groupby-36-agoda-how-we-solve-load</guid><dc:creator><![CDATA[Vu Trinh]]></dc:creator><pubDate>Tue, 21 May 2024 11:00:57 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!95xK!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0f285a1b-1cc2-40e3-b440-c27d8e12a4dd_1400x1000.png" length="0" 
type="image/jpeg"/><content:encoded><![CDATA[<p><em>This is <strong>GroupBy</strong>, the weekly compiled resources for data engineers.</em></p><p><em>Not subscribed yet? Here you go:</em></p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://vutr.substack.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe now&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://vutr.substack.com/subscribe?"><span>Subscribe now</span></a></p><blockquote><p><em>&#128075; Hi, my name is Vu Trinh, a data engineer.</em></p><p><em>I enjoy reading <strong>good stuff</strong> (related to data and engineering), and this newsletter is my effort on the journey to seek the "good stuff" across the entire Internet. </em></p><p><em>Hope this issue finds you well.</em></p></blockquote><div><hr></div><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!95xK!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0f285a1b-1cc2-40e3-b440-c27d8e12a4dd_1400x1000.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!95xK!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0f285a1b-1cc2-40e3-b440-c27d8e12a4dd_1400x1000.png 424w, https://substackcdn.com/image/fetch/$s_!95xK!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0f285a1b-1cc2-40e3-b440-c27d8e12a4dd_1400x1000.png 848w, https://substackcdn.com/image/fetch/$s_!95xK!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0f285a1b-1cc2-40e3-b440-c27d8e12a4dd_1400x1000.png 1272w, 
https://substackcdn.com/image/fetch/$s_!95xK!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0f285a1b-1cc2-40e3-b440-c27d8e12a4dd_1400x1000.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!95xK!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0f285a1b-1cc2-40e3-b440-c27d8e12a4dd_1400x1000.png" width="1400" height="1000" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/0f285a1b-1cc2-40e3-b440-c27d8e12a4dd_1400x1000.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1000,&quot;width&quot;:1400,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:1380356,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!95xK!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0f285a1b-1cc2-40e3-b440-c27d8e12a4dd_1400x1000.png 424w, https://substackcdn.com/image/fetch/$s_!95xK!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0f285a1b-1cc2-40e3-b440-c27d8e12a4dd_1400x1000.png 848w, https://substackcdn.com/image/fetch/$s_!95xK!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0f285a1b-1cc2-40e3-b440-c27d8e12a4dd_1400x1000.png 1272w, 
https://substackcdn.com/image/fetch/$s_!95xK!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0f285a1b-1cc2-40e3-b440-c27d8e12a4dd_1400x1000.png 1456w" sizes="100vw" fetchpriority="high"></picture></div></a><figcaption class="image-caption">Image generated by the Canva Image Generator</figcaption></figure></div><div><hr></div><h2><strong>&#128522; It&#8217;s Vu Trinh here &#8230;</strong></h2><blockquote><p><em>Sharing</em></p></blockquote><p>Working with pet projects is a good way to learn data engineering.</p><p>There were times I loved finding new pet projects on the 
internet (like a collector).</p><p>I opened the browser and entered something like &#8220;data engineering side project.&#8221;</p><p>I picked the one that excited me the most.</p><p>I spent all weekend following the tutorial and finally got things up and running.</p><p>I felt proud of myself. I felt like I had accomplished something.</p><p>But two weeks later, I remembered nothing. Empty head&#8230;</p><p>&#8230;and you know what? I started looking for different projects to play with.</p><p>The loop went on and on, and I barely learned anything from the projects I had done.</p><p>I know that many data engineers try to learn by doing as many pet projects as possible, but you will learn nothing just by getting things up and running.</p><p>Here are my notes on doing a pet project.</p><ul><li><p>Set expectations for what you will learn.</p></li><li><p>Do things right; don&#8217;t just try to get things done. For example:</p><ul><li><p>Learn the fundamentals and best practices of Docker, Git,&#8230;</p></li><li><p>Write and organize code nicely, write tests,&#8230;</p></li><li><p>Understand the order of SQL execution,&#8230;</p></li><li><p>&#8230;</p></li></ul></li><li><p>Try to do the project with your friends. This will give you a good sense of how to work on a project with more than one person.</p></li><li><p>When playing with any tool (Spark, Airflow, Git, Docker&#8230;), ask, &#8220;What problem does this tool try to solve?&#8221; and &#8220;How does it solve that problem?&#8221;</p></li><li><p>After doing the project, try to put yourself in the role of a user who needs to consume the data from the data warehouse or the dashboard you&#8217;ve just set up, and ask yourself, &#8220;Am I satisfied?&#8221; and &#8220;Do I get the things I need?&#8221; 
As a data engineer, most of the time, you will work to serve internal data consumers.</p></li><li><p>Apply data modeling to your project.</p></li></ul><p>So, what lessons have you learned from your pet projects?</p><div class="pullquote"><p>The curated list</p></div><h2>&#128160;&#9478;<a href="https://medium.com/agoda-engineering/how-we-solve-load-balancing-challenges-in-apache-kafka-8cd88fdad02b">How We Solve Load Balancing Challenges in Apache Kafka</a></h2><ul><li><p><em>15 mins, Agoda Engineering</em></p></li><li><p><em>Topics: Kafka, data infrastructure, software engineering</em></p></li></ul><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!RFtM!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9c3ea2f8-f4f1-4f85-82a0-4b7b7d49d309_723x288.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!RFtM!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9c3ea2f8-f4f1-4f85-82a0-4b7b7d49d309_723x288.png 424w, https://substackcdn.com/image/fetch/$s_!RFtM!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9c3ea2f8-f4f1-4f85-82a0-4b7b7d49d309_723x288.png 848w, https://substackcdn.com/image/fetch/$s_!RFtM!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9c3ea2f8-f4f1-4f85-82a0-4b7b7d49d309_723x288.png 1272w, https://substackcdn.com/image/fetch/$s_!RFtM!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9c3ea2f8-f4f1-4f85-82a0-4b7b7d49d309_723x288.png 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!RFtM!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9c3ea2f8-f4f1-4f85-82a0-4b7b7d49d309_723x288.png" width="723" height="288" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/9c3ea2f8-f4f1-4f85-82a0-4b7b7d49d309_723x288.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:288,&quot;width&quot;:723,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!RFtM!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9c3ea2f8-f4f1-4f85-82a0-4b7b7d49d309_723x288.png 424w, https://substackcdn.com/image/fetch/$s_!RFtM!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9c3ea2f8-f4f1-4f85-82a0-4b7b7d49d309_723x288.png 848w, https://substackcdn.com/image/fetch/$s_!RFtM!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9c3ea2f8-f4f1-4f85-82a0-4b7b7d49d309_723x288.png 1272w, https://substackcdn.com/image/fetch/$s_!RFtM!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9c3ea2f8-f4f1-4f85-82a0-4b7b7d49d309_723x288.png 1456w" sizes="100vw" loading="lazy"></picture></div></a><figcaption class="image-caption">Demonstration of Kafka partitions. <a href="https://medium.com/agoda-engineering/how-we-solve-load-balancing-challenges-in-apache-kafka-8cd88fdad02b">Source</a></figcaption></figure></div><blockquote><p><em>In this blog post, we will introduce the concept of Load Balancing for Kafka-based applications at Agoda. 
We will also explore the issue of imbalance and discuss strategies for effectively addressing these challenges.</em></p></blockquote><h2>&#128160;&#9478;<a href="https://blog.dataengineer.io/p/why-its-hard-for-data-engineers-to">Why it's hard for data engineers to get promoted after senior engineer</a></h2><ul><li><p><em>5 mins, Zach Wilson</em></p></li><li><p><em>Topics: Career, Learning</em></p></li></ul><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!4Nbl!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc97ae205-da7f-40d7-8aec-cb22451fe7fb_1206x1208.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!4Nbl!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc97ae205-da7f-40d7-8aec-cb22451fe7fb_1206x1208.png 424w, https://substackcdn.com/image/fetch/$s_!4Nbl!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc97ae205-da7f-40d7-8aec-cb22451fe7fb_1206x1208.png 848w, https://substackcdn.com/image/fetch/$s_!4Nbl!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc97ae205-da7f-40d7-8aec-cb22451fe7fb_1206x1208.png 1272w, https://substackcdn.com/image/fetch/$s_!4Nbl!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc97ae205-da7f-40d7-8aec-cb22451fe7fb_1206x1208.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!4Nbl!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc97ae205-da7f-40d7-8aec-cb22451fe7fb_1206x1208.png" 
width="1206" height="1208" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/c97ae205-da7f-40d7-8aec-cb22451fe7fb_1206x1208.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1208,&quot;width&quot;:1206,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!4Nbl!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc97ae205-da7f-40d7-8aec-cb22451fe7fb_1206x1208.png 424w, https://substackcdn.com/image/fetch/$s_!4Nbl!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc97ae205-da7f-40d7-8aec-cb22451fe7fb_1206x1208.png 848w, https://substackcdn.com/image/fetch/$s_!4Nbl!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc97ae205-da7f-40d7-8aec-cb22451fe7fb_1206x1208.png 1272w, https://substackcdn.com/image/fetch/$s_!4Nbl!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc97ae205-da7f-40d7-8aec-cb22451fe7fb_1206x1208.png 1456w" sizes="100vw" loading="lazy"></picture></div></a><figcaption class="image-caption"><a href="https://blog.dataengineer.io/p/why-its-hard-for-data-engineers-to">Source</a></figcaption></figure></div><blockquote><p><em>In this article, we&#8217;ll be talking about:</em></p><ul><li><p><em>How data engineering lacks as much visibility as other data roles</em></p></li><li><p><em>Data engineers at some companies are viewed as &#8220;less technical&#8221;</em></p></li><li><p><em>Data engineering teams are usually smaller than software engineering teams</em></p></li><li><p><em>The data engineering individual contributor track &#8220;tops out&#8221; at data architect, and companies don&#8217;t need many of them</em></p></li></ul></blockquote><h2>&#128160;&#9478;<a href="https://materializedview.io/p/nimble-and-lance-parquet-killers">Nimble and Lance: The Parquet Killers</a></h2><ul><li><p><em>10 mins, Chris Riccomini</em></p></li><li><p><em>Topics: Parquet, Storage Format</em></p></li></ul><div class="captioned-image-container"><figure><a 
class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!h4Bm!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F97013aad-9736-4436-8d6e-8a61e9205801_1170x399.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!h4Bm!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F97013aad-9736-4436-8d6e-8a61e9205801_1170x399.png 424w, https://substackcdn.com/image/fetch/$s_!h4Bm!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F97013aad-9736-4436-8d6e-8a61e9205801_1170x399.png 848w, https://substackcdn.com/image/fetch/$s_!h4Bm!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F97013aad-9736-4436-8d6e-8a61e9205801_1170x399.png 1272w, https://substackcdn.com/image/fetch/$s_!h4Bm!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F97013aad-9736-4436-8d6e-8a61e9205801_1170x399.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!h4Bm!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F97013aad-9736-4436-8d6e-8a61e9205801_1170x399.png" width="1170" height="399" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/97013aad-9736-4436-8d6e-8a61e9205801_1170x399.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:399,&quot;width&quot;:1170,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!h4Bm!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F97013aad-9736-4436-8d6e-8a61e9205801_1170x399.png 424w, https://substackcdn.com/image/fetch/$s_!h4Bm!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F97013aad-9736-4436-8d6e-8a61e9205801_1170x399.png 848w, https://substackcdn.com/image/fetch/$s_!h4Bm!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F97013aad-9736-4436-8d6e-8a61e9205801_1170x399.png 1272w, https://substackcdn.com/image/fetch/$s_!h4Bm!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F97013aad-9736-4436-8d6e-8a61e9205801_1170x399.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path 
d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Nimble, A New Columnar File Format - Yoav Helfman, Meta</figcaption></figure></div><blockquote><p><em>In this blog post, the author digs into Nimble, the new storage format from Meta, and Lance, from LanceDB, to see whether these two storage formats can dethrone Parquet.</em></p></blockquote><h2>&#128160;&#9478;<a href="https://medium.com/apache-airflow/what-we-learned-after-running-airflow-on-kubernetes-for-2-years-0537b157acfd">What we learned after running Airflow on Kubernetes for two years</a></h2><ul><li><p><em>15 mins, Alexandre Magno Lima Martins</em></p></li><li><p><em>Topics: Infrastructure, Airflow, Kubernetes</em></p></li></ul><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!xC4f!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F55c69ae7-402b-4f8a-8a32-a36565f9ba71_1400x633.png" data-component-name="Image2ToDOM"><div 
class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!xC4f!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F55c69ae7-402b-4f8a-8a32-a36565f9ba71_1400x633.png 424w, https://substackcdn.com/image/fetch/$s_!xC4f!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F55c69ae7-402b-4f8a-8a32-a36565f9ba71_1400x633.png 848w, https://substackcdn.com/image/fetch/$s_!xC4f!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F55c69ae7-402b-4f8a-8a32-a36565f9ba71_1400x633.png 1272w, https://substackcdn.com/image/fetch/$s_!xC4f!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F55c69ae7-402b-4f8a-8a32-a36565f9ba71_1400x633.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!xC4f!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F55c69ae7-402b-4f8a-8a32-a36565f9ba71_1400x633.png" width="1400" height="633" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/55c69ae7-402b-4f8a-8a32-a36565f9ba71_1400x633.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:633,&quot;width&quot;:1400,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" 
srcset="https://substackcdn.com/image/fetch/$s_!xC4f!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F55c69ae7-402b-4f8a-8a32-a36565f9ba71_1400x633.png 424w, https://substackcdn.com/image/fetch/$s_!xC4f!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F55c69ae7-402b-4f8a-8a32-a36565f9ba71_1400x633.png 848w, https://substackcdn.com/image/fetch/$s_!xC4f!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F55c69ae7-402b-4f8a-8a32-a36565f9ba71_1400x633.png 1272w, https://substackcdn.com/image/fetch/$s_!xC4f!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F55c69ae7-402b-4f8a-8a32-a36565f9ba71_1400x633.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" 
stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Decentralized DAGs team repositories. <a href="https://medium.com/apache-airflow/what-we-learned-after-running-airflow-on-kubernetes-for-2-years-0537b157acfd">Source</a></figcaption></figure></div><blockquote><p><em>In this post, the author shares the important aspects of their deployment that helped them achieve a scalable and reliable environment. This article is highly recommended if you are running your Airflow environment on Kubernetes.</em></p></blockquote><h2>&#128160;&#9478;<a href="https://read.engineerscodex.com/p/4-software-design-principles-i-learned">4 Software Design Principles I Learned the Hard Way</a></h2><ul><li><p><em>5 mins, Engineer Codex</em></p></li><li><p><em>Topics: Software Development</em></p></li></ul><blockquote><p><em>During the design and implementation process, I found that the following list of &#8220;rules&#8221; kept coming back up over and over in various scenarios. 
These rules are common enough that I daresay that at least one of them will be useful for a project that any software engineers reading this are currently working on.</em></p></blockquote><h2>&#128160;&#9478;<strong><a href="https://medium.com/israeli-tech-radar/how-to-save-90-on-bigquery-storage-a1ca99582c5c">How to save 90% of BigQuery&#8217;s storage cost</a></strong></h2><ul><li><p><em>6 mins, Yerachmiel Feltzman</em></p></li><li><p><em>Topics: GCP Cloud, Operation, Google BigQuery</em></p></li></ul><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!Hvbs!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fed7d54af-bfa2-444b-9d6a-4b93e0cd8502_1045x504.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!Hvbs!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fed7d54af-bfa2-444b-9d6a-4b93e0cd8502_1045x504.png 424w, https://substackcdn.com/image/fetch/$s_!Hvbs!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fed7d54af-bfa2-444b-9d6a-4b93e0cd8502_1045x504.png 848w, https://substackcdn.com/image/fetch/$s_!Hvbs!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fed7d54af-bfa2-444b-9d6a-4b93e0cd8502_1045x504.png 1272w, https://substackcdn.com/image/fetch/$s_!Hvbs!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fed7d54af-bfa2-444b-9d6a-4b93e0cd8502_1045x504.png 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!Hvbs!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fed7d54af-bfa2-444b-9d6a-4b93e0cd8502_1045x504.png" width="1045" height="504" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/ed7d54af-bfa2-444b-9d6a-4b93e0cd8502_1045x504.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:504,&quot;width&quot;:1045,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!Hvbs!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fed7d54af-bfa2-444b-9d6a-4b93e0cd8502_1045x504.png 424w, https://substackcdn.com/image/fetch/$s_!Hvbs!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fed7d54af-bfa2-444b-9d6a-4b93e0cd8502_1045x504.png 848w, https://substackcdn.com/image/fetch/$s_!Hvbs!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fed7d54af-bfa2-444b-9d6a-4b93e0cd8502_1045x504.png 1272w, https://substackcdn.com/image/fetch/$s_!Hvbs!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fed7d54af-bfa2-444b-9d6a-4b93e0cd8502_1045x504.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft 
icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">BigQuery Storage cost in a real project the author worked on. <a href="https://medium.com/israeli-tech-radar/how-to-save-90-on-bigquery-storage-a1ca99582c5c">Source</a></figcaption></figure></div><blockquote><p><em>By changing a simple default configuration, I managed to drop the BigQuery storage bill by 90% for one of Tikal&#8217;s clients.</em></p></blockquote><h2>&#128160;&#9478;<a href="https://www.startdataengineering.com/post/optimize-snowflake-cost/">How to reduce your Snowflake cost</a></h2><ul><li><p><em>14 mins, Start Data Engineering</em></p></li><li><p><em>Topics: Snowflake, Operation</em></p></li></ul><blockquote><p><em>In this post we will go over four strategies that you can follow to lower your Snowflake bill. 
We will start with some quick wins, analyzing table structure and making changes, consolidating data pipelines, and finally setting up monitoring and alerting to ensure continued cost reduction.</em></p></blockquote><h2>&#128160;&#9478;<strong><a href="https://towardsdev.com/implementing-change-data-capture-cdc-with-docker-postgresql-mongodb-kafka-and-debezium-a-c49b2b38a88c">Implementing Change Data Capture (CDC) with Docker, PostgreSQL, MongoDB, Kafka, and Debezium: A Comprehensive Guide</a></strong></h2><ul><li><p><em>20 mins, Ashkan Golehpour</em></p></li><li><p><em>Topics: Hands-On, CDC, Docker</em></p></li></ul><blockquote><p><em>A detailed guide on how to set up your own CDC solution with Debezium using docker-compose. Definitely give this a look if you&#8217;re after a side project to get your hands dirty over the weekend.</em></p></blockquote><h2>&#128160;&#9478;<a href="https://blog.det.life/how-i-build-an-etl-pipeline-with-aws-glue-lambda-and-terraform-bbdf0788cc75">How I build an ETL pipeline with AWS Glue, Lambda, and Terraform</a></h2><ul><li><p><em>15 mins, Lorena Gongang</em></p></li><li><p><em>Topics: Hands-On, AWS, Terraform</em></p></li></ul><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!xVZj!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe80cae9d-eded-423b-aa61-f27d032118d9_1400x1112.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!xVZj!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe80cae9d-eded-423b-aa61-f27d032118d9_1400x1112.png 424w, 
https://substackcdn.com/image/fetch/$s_!xVZj!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe80cae9d-eded-423b-aa61-f27d032118d9_1400x1112.png 848w, https://substackcdn.com/image/fetch/$s_!xVZj!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe80cae9d-eded-423b-aa61-f27d032118d9_1400x1112.png 1272w, https://substackcdn.com/image/fetch/$s_!xVZj!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe80cae9d-eded-423b-aa61-f27d032118d9_1400x1112.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!xVZj!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe80cae9d-eded-423b-aa61-f27d032118d9_1400x1112.png" width="1400" height="1112" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/e80cae9d-eded-423b-aa61-f27d032118d9_1400x1112.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1112,&quot;width&quot;:1400,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!xVZj!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe80cae9d-eded-423b-aa61-f27d032118d9_1400x1112.png 424w, 
https://substackcdn.com/image/fetch/$s_!xVZj!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe80cae9d-eded-423b-aa61-f27d032118d9_1400x1112.png 848w, https://substackcdn.com/image/fetch/$s_!xVZj!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe80cae9d-eded-423b-aa61-f27d032118d9_1400x1112.png 1272w, https://substackcdn.com/image/fetch/$s_!xVZj!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe80cae9d-eded-423b-aa61-f27d032118d9_1400x1112.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" 
y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption"><a href="https://blog.det.life/how-i-build-an-etl-pipeline-with-aws-glue-lambda-and-terraform-bbdf0788cc75">Source</a></figcaption></figure></div><blockquote><p><em>Another great hands-on article to keep your weekend busy. If you want to learn more about AWS services and Terraform, this one is for you.</em></p></blockquote><div><hr></div><h2>&#128640; Previously on Dimension</h2><blockquote><p><em>Dimension is my sub-newsletter where I note down things I learn from people smarter than me in the data engineering field. Here is my latest article.</em></p></blockquote><div class="digest-post-embed" data-attrs="{&quot;nodeId&quot;:&quot;7d5b6f4a-9258-4d6d-bd07-0733a2623f27&quot;,&quot;caption&quot;:&quot;Hi, I am Vu Trinh, a data engineer. Welcome to my knowledge hub, a place where I am excited to share the valuable insights and discoveries I've gained from my data engineering journey. Not subscribe yet? 
Here you go:&quot;,&quot;cta&quot;:null,&quot;showBylines&quot;:true,&quot;size&quot;:&quot;lg&quot;,&quot;isEditorNode&quot;:true,&quot;title&quot;:&quot;The Hadoop Distributed File System&quot;,&quot;publishedBylines&quot;:[{&quot;id&quot;:167177248,&quot;name&quot;:&quot;Vu Trinh&quot;,&quot;bio&quot;:&quot;This is me&quot;,&quot;photo_url&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/4805f673-db97-4f7c-85c4-44b345a8de80_256x256.png&quot;,&quot;is_guest&quot;:false,&quot;bestseller_tier&quot;:null}],&quot;post_date&quot;:&quot;2024-05-18T11:00:53.930Z&quot;,&quot;cover_image&quot;:&quot;https://substackcdn.com/image/fetch/f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff2303091-935b-4860-b650-8df7fd9fe019_1398x998.png&quot;,&quot;cover_image_alt&quot;:null,&quot;canonical_url&quot;:&quot;https://vutr.substack.com/p/i-spent-8-hours-reading-the-paper-523&quot;,&quot;section_name&quot;:&quot;Dimensions.&quot;,&quot;video_upload_id&quot;:null,&quot;id&quot;:144263612,&quot;type&quot;:&quot;newsletter&quot;,&quot;reaction_count&quot;:8,&quot;comment_count&quot;:4,&quot;publication_id&quot;:null,&quot;publication_name&quot;:&quot;VuTrinh.&quot;,&quot;publication_logo_url&quot;:&quot;https://substackcdn.com/image/fetch/f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffb91e59e-4df2-4ac8-b3b4-1f4e830f7b42_256x256.png&quot;,&quot;belowTheFold&quot;:true,&quot;youtube_url&quot;:null,&quot;show_links&quot;:null,&quot;feed_url&quot;:null}"></div><div class="pullquote"><p>Let me hear your voice, for example: </p><p>'Your newsletter is so terrible, I can't handle it anymore.'</p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://vutr.substack.com/p/groupby-36-agoda-how-we-solve-load/comments&quot;,&quot;text&quot;:&quot;Leave a comment&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" 
data-component-name="ButtonCreateButton"><a class="button primary" href="https://vutr.substack.com/p/groupby-36-agoda-how-we-solve-load/comments"><span>Leave a comment</span></a></p></div><div class="subscription-widget-wrap-editor" data-attrs="{&quot;url&quot;:&quot;https://vutr.substack.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe&quot;,&quot;language&quot;:&quot;en&quot;}" data-component-name="SubscribeWidgetToDOM"><div class="subscription-widget show-subscribe"><div class="preamble"><p class="cta-caption">Thanks for scrolling this far! There's a convenient subscribe box here if you want me to annoy you every week. &#128516;</p></div><form class="subscription-widget-subscribe"><input type="email" class="email-input" name="email" placeholder="Type your email&#8230;" tabindex="-1"><input type="submit" class="button primary" value="Subscribe"><div class="fake-input-wrapper"><div class="fake-input"></div><div class="fake-button"></div></div></form></div></div><p></p>]]></content:encoded></item><item><title><![CDATA[GroupBy #35: The Netflix Data Engineering Stack, Atlassian - Evolve the data platform with a Deployment Capability]]></title><description><![CDATA[Plus: How We Migrated From dbt Cloud and Scaled Our Data Development]]></description><link>https://vutr.substack.com/p/groupby-35-the-netflix-data-engineering</link><guid isPermaLink="false">https://vutr.substack.com/p/groupby-35-the-netflix-data-engineering</guid><dc:creator><![CDATA[Vu Trinh]]></dc:creator><pubDate>Tue, 14 May 2024 11:00:51 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!NbuF!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9a7fbd0e-5b3f-4dc9-883d-2f5fa61482f7_1400x1000.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p><em>This is <strong>GroupBy</strong>, the weekly compiled resources for data engineers.</em></p><p><em>Not subscribed yet? 
Here you go:</em></p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://vutr.substack.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe now&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://vutr.substack.com/subscribe?"><span>Subscribe now</span></a></p><div class="pullquote"><div class="install-substack-app-embed install-substack-app-embed-web" data-component-name="InstallSubstackAppToDOM"><img class="install-substack-app-embed-img" src="https://substackcdn.com/image/fetch/$s_!D8N-!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fvutr.substack.com%2Fimg%2Fsubstack.png"><div class="install-substack-app-embed-text"><div class="install-substack-app-header">Get more from Vu Trinh in the Substack app</div><div class="install-substack-app-text">Available for iOS and Android</div></div><a href="https://substack.com/app/app-store-redirect?utm_campaign=app-marketing&amp;utm_content=author-post-insert&amp;utm_source=vutr" target="_blank" class="install-substack-app-embed-link"><button class="install-substack-app-embed-btn button primary">Get the app</button></a></div></div><blockquote><p><em>&#128075; Hi, my name is Vu Trinh, a data engineer.</em></p><p><em>I enjoy reading <strong>good stuff</strong>  (related to data and engineering), and this newsletter is my effort on the journey to seek the "good stuff" across the entire Internet. 
</em></p><p><em>Hope this issue finds you well.</em></p></blockquote><div><hr></div><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!NbuF!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9a7fbd0e-5b3f-4dc9-883d-2f5fa61482f7_1400x1000.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!NbuF!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9a7fbd0e-5b3f-4dc9-883d-2f5fa61482f7_1400x1000.png 424w, https://substackcdn.com/image/fetch/$s_!NbuF!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9a7fbd0e-5b3f-4dc9-883d-2f5fa61482f7_1400x1000.png 848w, https://substackcdn.com/image/fetch/$s_!NbuF!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9a7fbd0e-5b3f-4dc9-883d-2f5fa61482f7_1400x1000.png 1272w, https://substackcdn.com/image/fetch/$s_!NbuF!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9a7fbd0e-5b3f-4dc9-883d-2f5fa61482f7_1400x1000.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!NbuF!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9a7fbd0e-5b3f-4dc9-883d-2f5fa61482f7_1400x1000.png" width="1400" height="1000" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/9a7fbd0e-5b3f-4dc9-883d-2f5fa61482f7_1400x1000.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1000,&quot;width&quot;:1400,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:1275233,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!NbuF!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9a7fbd0e-5b3f-4dc9-883d-2f5fa61482f7_1400x1000.png 424w, https://substackcdn.com/image/fetch/$s_!NbuF!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9a7fbd0e-5b3f-4dc9-883d-2f5fa61482f7_1400x1000.png 848w, https://substackcdn.com/image/fetch/$s_!NbuF!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9a7fbd0e-5b3f-4dc9-883d-2f5fa61482f7_1400x1000.png 1272w, https://substackcdn.com/image/fetch/$s_!NbuF!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9a7fbd0e-5b3f-4dc9-883d-2f5fa61482f7_1400x1000.png 1456w" sizes="100vw" fetchpriority="high"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" 
xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Image generated by the Canva Image Generator.</figcaption></figure></div><div><hr></div><h2><strong>&#128522; It&#8217;s Vu Trinh here &#8230;</strong></h2><p>Hi everyone, based on the <a href="https://open.substack.com/pub/vutr/p/groupby-34-hybrid-transactionalanalytical?r=2rj6sg&amp;utm_campaign=post&amp;utm_medium=web">survey</a> from the previous issue, I will make a few adjustments to GroupBy:</p><ul><li><p>There will be more Hands-On articles.</p></li><li><p>The number of articles will be kept reasonable. 
(&lt;=10)</p></li><li><p>There will be more detailed descriptions in each article.</p></li><li><p>I will start sharing more about my data engineering experience in upcoming issues.</p></li></ul><div class="pullquote"><p>Now, it&#8217;s time for the curated list.</p></div><h2>&#128160;&#9478;<strong><a href="https://www.youtube.com/watch?v=QxaOlmv79ls">The Netflix Data Engineering Stack</a></strong></h2><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!qoVl!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc5a4f0a1-aecb-4959-8e03-fe86caa2603e_1576x872.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!qoVl!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc5a4f0a1-aecb-4959-8e03-fe86caa2603e_1576x872.png 424w, https://substackcdn.com/image/fetch/$s_!qoVl!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc5a4f0a1-aecb-4959-8e03-fe86caa2603e_1576x872.png 848w, https://substackcdn.com/image/fetch/$s_!qoVl!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc5a4f0a1-aecb-4959-8e03-fe86caa2603e_1576x872.png 1272w, https://substackcdn.com/image/fetch/$s_!qoVl!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc5a4f0a1-aecb-4959-8e03-fe86caa2603e_1576x872.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!qoVl!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc5a4f0a1-aecb-4959-8e03-fe86caa2603e_1576x872.png" 
width="1456" height="806" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/c5a4f0a1-aecb-4959-8e03-fe86caa2603e_1576x872.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:806,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:382779,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!qoVl!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc5a4f0a1-aecb-4959-8e03-fe86caa2603e_1576x872.png 424w, https://substackcdn.com/image/fetch/$s_!qoVl!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc5a4f0a1-aecb-4959-8e03-fe86caa2603e_1576x872.png 848w, https://substackcdn.com/image/fetch/$s_!qoVl!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc5a4f0a1-aecb-4959-8e03-fe86caa2603e_1576x872.png 1272w, https://substackcdn.com/image/fetch/$s_!qoVl!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc5a4f0a1-aecb-4959-8e03-fe86caa2603e_1576x872.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" 
xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Screenshot: real-time data services at Netflix. Netflix Data Engineering Tech Talks - The Netflix Data Engineering Stack, <a href="https://www.youtube.com/watch?v=QxaOlmv79ls">Source</a></figcaption></figure></div><ul><li><p><em>YouTube video, 32 mins, Netflix Data</em></p></li><li><p><em>Topics: Data Architecture</em></p></li></ul><blockquote><p><em>In this exciting talk, we will go through the building blocks of the <strong>Netflix Data Engineering stack</strong>. 
Learn more about how <strong>batch</strong> and <strong>streaming data pipelines</strong> are built at Netflix.</em></p></blockquote><h2>&#128160;&#9478;Atlassian<em>, </em><strong><a href="https://www.atlassian.com/engineering/evolve-your-data-platform-with-a-deployment-capability">Evolve your data platform with a Deployment Capability</a></strong></h2><ul><li><p><em>11 mins, Atlassian Engineering Blog</em></p></li><li><p><em>Topics: Data Architecture</em></p></li></ul><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!U6s4!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9431ee9f-db42-4140-b108-3f0deed76a78_1024x757.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!U6s4!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9431ee9f-db42-4140-b108-3f0deed76a78_1024x757.png 424w, https://substackcdn.com/image/fetch/$s_!U6s4!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9431ee9f-db42-4140-b108-3f0deed76a78_1024x757.png 848w, https://substackcdn.com/image/fetch/$s_!U6s4!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9431ee9f-db42-4140-b108-3f0deed76a78_1024x757.png 1272w, https://substackcdn.com/image/fetch/$s_!U6s4!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9431ee9f-db42-4140-b108-3f0deed76a78_1024x757.png 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!U6s4!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9431ee9f-db42-4140-b108-3f0deed76a78_1024x757.png" width="1024" height="757" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/9431ee9f-db42-4140-b108-3f0deed76a78_1024x757.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:757,&quot;width&quot;:1024,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!U6s4!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9431ee9f-db42-4140-b108-3f0deed76a78_1024x757.png 424w, https://substackcdn.com/image/fetch/$s_!U6s4!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9431ee9f-db42-4140-b108-3f0deed76a78_1024x757.png 848w, https://substackcdn.com/image/fetch/$s_!U6s4!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9431ee9f-db42-4140-b108-3f0deed76a78_1024x757.png 1272w, https://substackcdn.com/image/fetch/$s_!U6s4!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9431ee9f-db42-4140-b108-3f0deed76a78_1024x757.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft 
icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">The rebuilt data platform at Atlassian<em>, </em><a href="https://www.atlassian.com/engineering/evolve-your-data-platform-with-a-deployment-capability">Source</a></figcaption></figure></div><blockquote><p><em>The data platform team at Atlassian re-built their data platform into an opinionated platform; one that saw the introduction of a novel deployment capability. 
The new capability takes inspiration from <a href="https://kubernetes.io/">Kubernetes</a> and <a href="https://blog.developer.atlassian.com/why-atlassian-uses-an-internal-paas-to-regulate-aws-access/">Micros (Atlassian's internal Platform-as-a-Service)</a> but adjusted to the data domain.</em></p></blockquote><p></p><h2>&#128160;&#9478;<strong><a href="https://medium.com/apache-hudi-blogs/building-analytical-apps-on-the-lakehouse-using-apache-hudi-daft-streamlit-3224766fe58a">Building Analytical Apps on the Lakehouse using Apache Hudi, Daft &amp; Streamlit</a></strong></h2><ul><li><p><em>10 mins, Dipankar Mazumdar</em></p></li><li><p><em>Topics: Lakehouse, Apache Hudi, HandsOn</em></p></li></ul><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!MDwI!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F00ec4400-ec85-4082-8bff-96ba4526e3fe_1400x1080.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!MDwI!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F00ec4400-ec85-4082-8bff-96ba4526e3fe_1400x1080.png 424w, https://substackcdn.com/image/fetch/$s_!MDwI!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F00ec4400-ec85-4082-8bff-96ba4526e3fe_1400x1080.png 848w, https://substackcdn.com/image/fetch/$s_!MDwI!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F00ec4400-ec85-4082-8bff-96ba4526e3fe_1400x1080.png 1272w, 
https://substackcdn.com/image/fetch/$s_!MDwI!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F00ec4400-ec85-4082-8bff-96ba4526e3fe_1400x1080.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!MDwI!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F00ec4400-ec85-4082-8bff-96ba4526e3fe_1400x1080.png" width="1400" height="1080" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/00ec4400-ec85-4082-8bff-96ba4526e3fe_1400x1080.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1080,&quot;width&quot;:1400,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!MDwI!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F00ec4400-ec85-4082-8bff-96ba4526e3fe_1400x1080.png 424w, https://substackcdn.com/image/fetch/$s_!MDwI!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F00ec4400-ec85-4082-8bff-96ba4526e3fe_1400x1080.png 848w, https://substackcdn.com/image/fetch/$s_!MDwI!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F00ec4400-ec85-4082-8bff-96ba4526e3fe_1400x1080.png 1272w, 
https://substackcdn.com/image/fetch/$s_!MDwI!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F00ec4400-ec85-4082-8bff-96ba4526e3fe_1400x1080.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption"> Lakehouse architecture for the exercise, <a href="https://medium.com/apache-hudi-blogs/building-analytical-apps-on-the-lakehouse-using-apache-hudi-daft-streamlit-3224766fe58a">Source</a></figcaption></figure></div><blockquote><p><em>In this blog, our focus would be on building a data app using data directly from an open lakehouse 
platform.</em></p></blockquote><h2>&#128160;&#9478;<strong><a href="https://www.startdataengineering.com/post/design-patterns/">Data Pipeline Design Patterns - #1. Data flow patterns</a></strong></h2><ul><li><p><em>12 mins, By Start Data Engineering</em></p></li><li><p><em>Topics: Data Pipeline, Design Pattern</em></p></li></ul><blockquote><p><em>This post will cover the typical <strong>data pipeline flow design patterns</strong>. We will learn about the pros and cons of each design pattern, when to use them, and, more importantly, when not to use them. By the end of this post, you will have an overview of the typical data flow patterns.</em></p></blockquote><h2>&#128160;&#9478;<strong><a href="https://towardsdatascience.com/python-one-billion-row-challenge-from-10-minutes-to-4-seconds-0718662b303e">Python One Billion Row Challenge &#8212; From 10 Minutes to 4 Seconds</a></strong></h2><ul><li><p><em>12 mins, By Dario Rade&#269;i&#263;</em></p></li><li><p><em>Topics: Python, Data Processing, HandsOn</em></p></li></ul><blockquote><p><em>The question of how fast a programming language can aggregate <a href="https://1brc.dev/">1 billion rows</a> of data has been gaining traction lately. The author first tackled the challenge in <strong>pure Python</strong>. Then he tried external data processing libraries such as <strong>Pandas</strong>, <strong>Polars</strong>, and even <strong>DuckDB</strong> to see how they compare. It is well worth reading.</em></p></blockquote><h2>&#128160;&#9478;<strong><a href="https://amsayed.medium.com/coding-data-pipeline-design-patterns-in-python-44a705f0af9e">Coding Data Pipeline Design Patterns in Python</a></strong></h2><ul><li><p><em>10 mins, By Ahmed Sayed</em></p></li><li><p><em>Topics: Python, Programming, Code Design Pattern, HandsOn</em></p></li></ul><blockquote><p><em><strong>Design patterns are language-independent strategies for solving common design problems</strong>. 
They encapsulate successful design practices and provide a set of proven solutions to design challenges, promoting best practices in software design. They are not mandatory to implement in every project, but using them can make code more flexible, reusable, and maintainable.</em></p></blockquote><h2>&#128160;&#9478;<a href="https://glossgenius.com/blog/how-we-migrated-from-dbt-cloud-and-scaled-our-data-development">How We Migrated From dbt Cloud and Scaled Our Data Development</a></h2><ul><li><p><em>10 mins, By Gloss Genius Blog</em></p></li><li><p><em>Topics: Data Architecture</em></p></li></ul><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!UZg6!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F15314ff4-1f04-441d-bd0b-b1a0b0a5ad5e_1186x1043.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!UZg6!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F15314ff4-1f04-441d-bd0b-b1a0b0a5ad5e_1186x1043.png 424w, https://substackcdn.com/image/fetch/$s_!UZg6!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F15314ff4-1f04-441d-bd0b-b1a0b0a5ad5e_1186x1043.png 848w, https://substackcdn.com/image/fetch/$s_!UZg6!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F15314ff4-1f04-441d-bd0b-b1a0b0a5ad5e_1186x1043.png 1272w, https://substackcdn.com/image/fetch/$s_!UZg6!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F15314ff4-1f04-441d-bd0b-b1a0b0a5ad5e_1186x1043.png 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!UZg6!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F15314ff4-1f04-441d-bd0b-b1a0b0a5ad5e_1186x1043.png" width="1186" height="1043" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/15314ff4-1f04-441d-bd0b-b1a0b0a5ad5e_1186x1043.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1043,&quot;width&quot;:1186,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!UZg6!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F15314ff4-1f04-441d-bd0b-b1a0b0a5ad5e_1186x1043.png 424w, https://substackcdn.com/image/fetch/$s_!UZg6!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F15314ff4-1f04-441d-bd0b-b1a0b0a5ad5e_1186x1043.png 848w, https://substackcdn.com/image/fetch/$s_!UZg6!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F15314ff4-1f04-441d-bd0b-b1a0b0a5ad5e_1186x1043.png 1272w, https://substackcdn.com/image/fetch/$s_!UZg6!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F15314ff4-1f04-441d-bd0b-b1a0b0a5ad5e_1186x1043.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset 
pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">The <em>Gloss Genius&#8217;s </em>alternative solution to dbt Cloud, <a href="https://glossgenius.com/blog/how-we-migrated-from-dbt-cloud-and-scaled-our-data-development">Source</a></figcaption></figure></div><blockquote><p><em>In this blog post, we will share our journey of migrating away from dbt Cloud &#8211; exploring the reasons behind this decision, the alternatives we considered and the lessons we learned along the way.</em></p></blockquote><div><hr></div><h2>&#128640; Previously on Dimension</h2><blockquote><p><em>Dimension is my sub-newsletter where I note down things I learn from people smarter than me in the data engineering field. 
Here is my latest article.</em></p></blockquote><div class="digest-post-embed" data-attrs="{&quot;nodeId&quot;:&quot;f9ccaf5d-7ce8-41f6-bba2-89b75f88485e&quot;,&quot;caption&quot;:&quot;My name is Vu Trinh, and I am a data engineer. I&#8217;m trying to make my life less dull by spending time learning and researching &#8220;how it works&#8220; in the data engineering field. Here is a place where I share everything I&#8217;ve learned. Not subscribe yet? Here you go:&quot;,&quot;cta&quot;:null,&quot;showBylines&quot;:true,&quot;size&quot;:&quot;lg&quot;,&quot;isEditorNode&quot;:true,&quot;title&quot;:&quot;All you need to know about the Google File System&quot;,&quot;publishedBylines&quot;:[],&quot;post_date&quot;:&quot;2024-05-11T11:00:58.856Z&quot;,&quot;cover_image&quot;:&quot;https://substackcdn.com/image/fetch/f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F69f439d1-b704-4900-b386-d4ad7feb2813_1400x1000.png&quot;,&quot;cover_image_alt&quot;:null,&quot;canonical_url&quot;:&quot;https://vutr.substack.com/p/i-spent-8-hours-reading-the-paper&quot;,&quot;section_name&quot;:&quot;Dimensions.&quot;,&quot;video_upload_id&quot;:null,&quot;id&quot;:144042827,&quot;type&quot;:&quot;newsletter&quot;,&quot;reaction_count&quot;:4,&quot;comment_count&quot;:0,&quot;publication_id&quot;:null,&quot;publication_name&quot;:&quot;VuTrinh.&quot;,&quot;publication_logo_url&quot;:&quot;https://substackcdn.com/image/fetch/f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffb91e59e-4df2-4ac8-b3b4-1f4e830f7b42_256x256.png&quot;,&quot;belowTheFold&quot;:true,&quot;youtube_url&quot;:null,&quot;show_links&quot;:null,&quot;feed_url&quot;:null}"></div><div class="pullquote"><p>Let me hear your voice, for example: </p><p>'Your newsletter is so terrible, I can't handle it anymore.'</p><p class="button-wrapper" 
data-attrs="{&quot;url&quot;:&quot;https://vutr.substack.com/p/groupby-35-the-netflix-data-engineering/comments&quot;,&quot;text&quot;:&quot;Leave a comment&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://vutr.substack.com/p/groupby-35-the-netflix-data-engineering/comments"><span>Leave a comment</span></a></p></div><div class="subscription-widget-wrap-editor" data-attrs="{&quot;url&quot;:&quot;https://vutr.substack.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe&quot;,&quot;language&quot;:&quot;en&quot;}" data-component-name="SubscribeWidgetToDOM"><div class="subscription-widget show-subscribe"><div class="preamble"><p class="cta-caption">Thanks for scrolling this far! There's a convenient subscribe box here if you want me to annoy you every week. &#128516;</p></div><form class="subscription-widget-subscribe"><input type="email" class="email-input" name="email" placeholder="Type your email&#8230;" tabindex="-1"><input type="submit" class="button primary" value="Subscribe"><div class="fake-input-wrapper"><div class="fake-input"></div><div class="fake-button"></div></div></form></div></div><p></p>]]></content:encoded></item><item><title><![CDATA[GroupBy #34: Hybrid Transactional/Analytical Storage, From Predictive to Generative – How Michelangelo Accelerates Uber’s AI Journey]]></title><description><![CDATA[Plus: Some words from the GroupBy's author.]]></description><link>https://vutr.substack.com/p/groupby-34-hybrid-transactionalanalytical</link><guid isPermaLink="false">https://vutr.substack.com/p/groupby-34-hybrid-transactionalanalytical</guid><dc:creator><![CDATA[Vu Trinh]]></dc:creator><pubDate>Tue, 07 May 2024 11:01:23 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!kmAh!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd65f370b-c3ba-4eb0-ba04-31d4f35b8418_1400x1000.png" length="0" 
type="image/jpeg"/><content:encoded><![CDATA[<p><em>This is <strong>GroupBy</strong>, the weekly compiled resources for data engineers.</em></p><p><em>Not subscribed yet? Here you go:</em></p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://vutr.substack.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe now&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://vutr.substack.com/subscribe?"><span>Subscribe now</span></a></p><div class="pullquote"><div class="install-substack-app-embed install-substack-app-embed-web" data-component-name="InstallSubstackAppToDOM"><img class="install-substack-app-embed-img" src="https://substackcdn.com/image/fetch/$s_!D8N-!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fvutr.substack.com%2Fimg%2Fsubstack.png"><div class="install-substack-app-embed-text"><div class="install-substack-app-header">Get more from Vu Trinh in the Substack app</div><div class="install-substack-app-text">Available for iOS and Android</div></div><a href="https://substack.com/app/app-store-redirect?utm_campaign=app-marketing&amp;utm_content=author-post-insert&amp;utm_source=vutr" target="_blank" class="install-substack-app-embed-link"><button class="install-substack-app-embed-btn button primary">Get the app</button></a></div></div><blockquote><p><em>&#128075; Hi, my name is Vu Trinh, a data engineer.</em></p><p><em>I enjoy reading <strong>good stuff</strong>  (related to data and engineering), and this newsletter is my effort on the journey to seek the "good stuff" across the entire Internet. 
</em></p><p><em>Hope this issue finds you well.</em></p></blockquote><div><hr></div><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!kmAh!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd65f370b-c3ba-4eb0-ba04-31d4f35b8418_1400x1000.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!kmAh!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd65f370b-c3ba-4eb0-ba04-31d4f35b8418_1400x1000.png 424w, https://substackcdn.com/image/fetch/$s_!kmAh!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd65f370b-c3ba-4eb0-ba04-31d4f35b8418_1400x1000.png 848w, https://substackcdn.com/image/fetch/$s_!kmAh!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd65f370b-c3ba-4eb0-ba04-31d4f35b8418_1400x1000.png 1272w, https://substackcdn.com/image/fetch/$s_!kmAh!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd65f370b-c3ba-4eb0-ba04-31d4f35b8418_1400x1000.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!kmAh!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd65f370b-c3ba-4eb0-ba04-31d4f35b8418_1400x1000.png" width="1400" height="1000" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/d65f370b-c3ba-4eb0-ba04-31d4f35b8418_1400x1000.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1000,&quot;width&quot;:1400,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:1818848,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!kmAh!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd65f370b-c3ba-4eb0-ba04-31d4f35b8418_1400x1000.png 424w, https://substackcdn.com/image/fetch/$s_!kmAh!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd65f370b-c3ba-4eb0-ba04-31d4f35b8418_1400x1000.png 848w, https://substackcdn.com/image/fetch/$s_!kmAh!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd65f370b-c3ba-4eb0-ba04-31d4f35b8418_1400x1000.png 1272w, https://substackcdn.com/image/fetch/$s_!kmAh!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd65f370b-c3ba-4eb0-ba04-31d4f35b8418_1400x1000.png 1456w" sizes="100vw" fetchpriority="high"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" 
xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Image generated by the Canva Image Generator.</figcaption></figure></div><div><hr></div><h2><strong>&#128522; It&#8217;s Vu Trinh here &#8230;</strong></h2><p>&#127881;&#127881; Hi, everyone. 
I just want to let you know that this newsletter currently has over 1,000 subscribers.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!4G6u!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F02d1fab3-bd2f-4db6-8fd9-681b0e352713_800x800.jpeg" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!4G6u!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F02d1fab3-bd2f-4db6-8fd9-681b0e352713_800x800.jpeg 424w, https://substackcdn.com/image/fetch/$s_!4G6u!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F02d1fab3-bd2f-4db6-8fd9-681b0e352713_800x800.jpeg 848w, https://substackcdn.com/image/fetch/$s_!4G6u!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F02d1fab3-bd2f-4db6-8fd9-681b0e352713_800x800.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!4G6u!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F02d1fab3-bd2f-4db6-8fd9-681b0e352713_800x800.jpeg 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!4G6u!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F02d1fab3-bd2f-4db6-8fd9-681b0e352713_800x800.jpeg" width="800" height="800" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/02d1fab3-bd2f-4db6-8fd9-681b0e352713_800x800.jpeg&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:800,&quot;width&quot;:800,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!4G6u!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F02d1fab3-bd2f-4db6-8fd9-681b0e352713_800x800.jpeg 424w, https://substackcdn.com/image/fetch/$s_!4G6u!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F02d1fab3-bd2f-4db6-8fd9-681b0e352713_800x800.jpeg 848w, https://substackcdn.com/image/fetch/$s_!4G6u!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F02d1fab3-bd2f-4db6-8fd9-681b0e352713_800x800.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!4G6u!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F02d1fab3-bd2f-4db6-8fd9-681b0e352713_800x800.jpeg 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path 
d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Thank you, Substack.</figcaption></figure></div><p>Thank you so much for all of your support.</p><p>Reaching a significant milestone like this makes me reflect on my work, especially with the GroupBy newsletter.</p><p>The GroupBy newsletter is dedicated to providing you with a curated collection of valuable resources in the field of data engineering, helping you stay updated and informed.</p><p>I will put more than 20 links into the GroupBy issue weekly and send them to my audience.</p><p>&#129398; But I think GroupBy is quite&#8230; cold.</p><p>I could do better than this.</p><p>&#128293; From this newsletter, I intend to warm the GroupBy up a little.</p><p>So, I need some help.</p><p>To prepare better for upcoming issues, I have some short questions below to get your feedback and ideas:</p><div class="poll-embed" data-attrs="{&quot;id&quot;:172363}" data-component-name="PollToDOM"></div><div class="poll-embed" data-attrs="{&quot;id&quot;:172370}" data-component-name="PollToDOM"></div><div class="poll-embed" 
data-attrs="{&quot;id&quot;:172373}" data-component-name="PollToDOM"></div><div class="poll-embed" data-attrs="{&quot;id&quot;:172365}" data-component-name="PollToDOM"></div><div class="poll-embed" data-attrs="{&quot;id&quot;:172366}" data-component-name="PollToDOM"></div><div class="poll-embed" data-attrs="{&quot;id&quot;:172367}" data-component-name="PollToDOM"></div><div class="poll-embed" data-attrs="{&quot;id&quot;:172368}" data-component-name="PollToDOM"></div><p>If you have any feedback or critiques, please leave a comment or let me know directly via the <a href="https://substack.com/@vutr?utm_source=user-menu">Substack chat</a>, <a href="https://www.linkedin.com/in/vutr27/">LinkedIn</a>, or <a href="http://vutrinh2704@gmail.com">Email</a>.</p><p>Once again, thank you so much for your support.</p><p>Ah! I almost forgot about the resource list; from now on, the list will be more compact and relevant.</p><p>The number of links will be kept more reasonable to help you focus more.</p><p>I will curate the resources more carefully to ensure they are worth your time. 
&#128521;</p><p>Now, it&#8217;s time for this week&#8217;s list.</p><div><hr></div><h2>&#9999;&#65039; My curated list</h2><h4>&#128160;&#9478;<strong><a href="https://jack-vanlightly.com/blog/2024/5/2/hybrid-transactional-analytical-storage">Hybrid Transactional/Analytical Storage</a></strong></h4><ul><li><p><em>10 mins, by Jack Vanlightly</em></p></li><li><p><em>Topics: Table Format || Stream Processing || Object Storage</em></p></li></ul><blockquote><p><em>In this article, the author discusses <a href="https://www.confluent.io/lp/confluent-kafka/?utm_medium=sem&amp;utm_source=google&amp;utm_campaign=ch.sem_br.brand_tp.prs_tgt.confluent-brand_mt.xct_rgn.apac_lng.eng_dv.all_con.confluent-general&amp;utm_term=confluent&amp;creative=&amp;device=c&amp;placement=&amp;gad_source=1&amp;gclid=Cj0KCQjwudexBhDKARIsAI-GWYXuvfmdzgadSITuAqyhrVM4eIBe8Uv6ubKbzouxcDOlszq4SPyDBskaAhfPEALw_wcB">Confluent</a>&#8217;s vision for using object storage (with the two recent announcements of <a href="https://www.confluent.io/blog/introducing-confluent-cloud-freight-clusters/">Freight Clusters</a> and <a href="https://www.confluent.io/es-es/blog/introducing-tableflow/">Tableflow</a>) as an enabler of a new storage paradigm - Hybrid Transactional/Analytical Storage.</em> </p></blockquote><h4>&#128160;&#9478;<a href="https://www.linkedin.com/blog/engineering/data-management/lakechime-a-data-trigger-service-for-modern-data-lakes">LakeChime: A Data Trigger Service for Modern Data Lakes</a></h4><ul><li><p><em>15 mins, by LinkedIn Engineering Blog</em></p></li><li><p><em>Topics: Data Lake || Table Format || Data Pipeline </em></p></li></ul><blockquote><p><em>In this blog post, we introduce LakeChime, a data trigger service that unifies data trigger semantics not only among modern table formats, but also between modern and traditional table formats such as Hive, bridging the impedance mismatch between traditional partition semantics and modern snapshot 
semantics.</em></p></blockquote><h4>&#128160;&#9478;<a href="https://luminousmen.com/post/from-etl-and-elt-to-reverse-etl">From ETL and ELT to Reverse ETL</a></h4><ul><li><p><em>5 mins, by luminousmen</em></p></li><li><p><em>Topics: Data Pipeline</em></p></li></ul><blockquote><p><em>In recent years, we've witnessed a significant transformation in data management, moving from the traditional <strong>ETL</strong> (Extract, Transform, Load) framework to the more agile <strong>ELT</strong> (Extract, Load, Transform) methodology. However, an even newer trend, Reverse ETL, is reshaping our approach to data integration.</em></p></blockquote><h4>&#128160;&#9478;<a href="https://www.uber.com/en-SG/blog/from-predictive-to-generative-ai/">From Predictive to Generative &#8211; How Michelangelo Accelerates Uber&#8217;s AI Journey</a></h4><ul><li><p><em>20 mins, by Kai Wang, Min Cai, Joseph Wang and Eric Chen from Uber</em></p></li><li><p><em>Topics: AI || Machine Learning</em></p></li></ul><blockquote><p><em>As Uber&#8217;s centralized ML platform, <a href="https://www.uber.com/blog/michelangelo-machine-learning-platform/">Michelangelo</a> has been instrumental in driving its ML evolution since its introduction in 2016. In this blog post, we will see how Michelangelo empowers many end-to-end high-quality ML applications at Uber.</em></p></blockquote><h4>&#128160;&#9478;<a href="https://engineering.grab.com/data-observability">Ensuring data reliability and observability in risk systems</a></h4><ul><li><p><em>5 mins, by Yi Ni Ong, Kamesh Chandran and Jia Long Loh from Grab</em></p></li><li><p><em>Topics: Data Management</em></p></li></ul><blockquote><p><em>As Grab&#8217;s business grows, so does the amount of data. Any data discrepancy or missing data will surely impact business operations. 
They need to quickly detect any data anomalies, which is where data observability comes in.</em></p></blockquote><h4>&#128160;&#9478;<a href="https://ntietz.com/blog/the-only-two-log-levels-you-need-are-info-and-error/">The only two log levels you need are INFO and ERROR</a></h4><ul><li><p><em>8 mins, by Nicole Tietz-Sokolskaya</em></p></li><li><p><em>Topics: Programming || Software Development</em></p></li></ul><blockquote><p><em>It's common for us to log in ways that are unhelpful.</em></p></blockquote><h4>&#128160;&#9478;<strong><a href="https://medium.com/@fengruohang/database-in-kubernetes-is-that-a-good-idea-daf5775b5c1f">Database in Kubernetes: Is that a good idea?</a></strong></h4><ul><li><p><em>13 mins, by Vonng</em></p></li><li><p><em>Topics: Database || Infrastructure</em></p></li></ul><blockquote><p><em>Whether databases should be housed in Kubernetes/Docker remains highly controversial. While Kubernetes (k8s) excels in managing stateless applications, it has fundamental drawbacks with stateful services, especially databases like PostgreSQL and MySQL.</em></p></blockquote><h4>&#128160;&#9478;<strong><a href="https://towardsdatascience.com/the-foundation-of-data-validation-0163892e5aa1">The Foundation of Data Validation</a></strong></h4><ul><li><p><em>7 mins, by Chengzhi Zhao</em></p></li><li><p><em>Topics: Data Processing</em></p></li></ul><blockquote><p><em>Discussing the basic principles and methodology of data validation</em></p></blockquote><h4>&#128160;&#9478;<strong><a href="https://medium.com/towards-data-science/slowly-changing-dimensions-6a08dc0386ae">Modeling Slowly Changing Dimensions</a></strong></h4><ul><li><p><em>16 mins, by Giorgos Myrianthous</em></p></li><li><p><em>Topics: Data Modeling</em></p></li></ul><blockquote><p><em>A deep dive into the various SCD types and how they can be implemented in Data Warehouses</em></p></blockquote><h4>&#128160;&#9478;<a 
href="https://medium.com/@maciej.pocwierz/how-an-empty-s3-bucket-can-make-your-aws-bill-explode-934a383cb8b1">How an empty S3 bucket can make your AWS bill explode</a></h4><ul><li><p><em>4 mins, by Maciej Pocwierz</em></p></li><li><p><em>Topics: Cloud || Object Storage</em></p></li></ul><blockquote><p><em>Imagine you create an empty, private AWS S3 bucket in a region of your preference. What will your AWS bill be the next morning? Two days later, I checked my AWS billing page, primarily to make sure that what I was doing was well within the free-tier limits; My bill was over $1,300, with the billing console showing nearly 100,000,000 S3 PUT requests executed within just one day!</em></p></blockquote><div><hr></div><h2>&#128640; Previously on Dimension</h2><blockquote><p><em>Dimension is my sub-newsletter where I note down things I learn from people smarter than me in the data engineering field. Here is my latest article.</em></p></blockquote><div class="digest-post-embed" data-attrs="{&quot;nodeId&quot;:&quot;62e1a52b-87f6-4a1e-9ede-cb26f130d031&quot;,&quot;caption&quot;:&quot;My name is Vu Trinh, and I am a data engineer. I&#8217;m trying to make my life less dull by spending time learning and researching &#8220;how it works&#8220; in the data engineering field. Here is a place where I share everything I&#8217;ve learned. Not subscribe yet? 
Here you go:&quot;,&quot;cta&quot;:null,&quot;showBylines&quot;:true,&quot;size&quot;:&quot;lg&quot;,&quot;isEditorNode&quot;:true,&quot;title&quot;:&quot;I spent 5 hours understanding more about the Delta Lake table format&quot;,&quot;publishedBylines&quot;:[{&quot;id&quot;:167177248,&quot;name&quot;:&quot;Vu Trinh&quot;,&quot;bio&quot;:&quot;This is me&quot;,&quot;photo_url&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/4805f673-db97-4f7c-85c4-44b345a8de80_256x256.png&quot;,&quot;is_guest&quot;:false,&quot;bestseller_tier&quot;:null}],&quot;post_date&quot;:&quot;2024-05-04T11:01:32.164Z&quot;,&quot;cover_image&quot;:&quot;https://substackcdn.com/image/fetch/f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4a996301-764e-4ddf-811b-31eff8e5ba7f_1400x1000.png&quot;,&quot;cover_image_alt&quot;:null,&quot;canonical_url&quot;:&quot;https://vutr.substack.com/p/i-spent-5-hours-to-understand-more&quot;,&quot;section_name&quot;:&quot;Dimensions.&quot;,&quot;video_upload_id&quot;:null,&quot;id&quot;:143733762,&quot;type&quot;:&quot;newsletter&quot;,&quot;reaction_count&quot;:5,&quot;comment_count&quot;:1,&quot;publication_id&quot;:null,&quot;publication_name&quot;:&quot;VuTrinh.&quot;,&quot;publication_logo_url&quot;:&quot;https://substackcdn.com/image/fetch/f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffb91e59e-4df2-4ac8-b3b4-1f4e830f7b42_256x256.png&quot;,&quot;belowTheFold&quot;:true,&quot;youtube_url&quot;:null,&quot;show_links&quot;:null,&quot;feed_url&quot;:null}"></div><div class="pullquote"><p>Let me hear your voice, for example: </p><p>'Your newsletter is so terrible, I can't handle it anymore.'</p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://vutr.substack.com/p/groupby-34-hybrid-transactionalanalytical/comments&quot;,&quot;text&quot;:&quot;Leave a 
comment&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://vutr.substack.com/p/groupby-34-hybrid-transactionalanalytical/comments"><span>Leave a comment</span></a></p></div><div class="subscription-widget-wrap-editor" data-attrs="{&quot;url&quot;:&quot;https://vutr.substack.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe&quot;,&quot;language&quot;:&quot;en&quot;}" data-component-name="SubscribeWidgetToDOM"><div class="subscription-widget show-subscribe"><div class="preamble"><p class="cta-caption">Thanks for scrolling this far! There's a convenient subscribe box here if you want me to annoy you every week. &#128516;</p></div><form class="subscription-widget-subscribe"><input type="email" class="email-input" name="email" placeholder="Type your email&#8230;" tabindex="-1"><input type="submit" class="button primary" value="Subscribe"><div class="fake-input-wrapper"><div class="fake-input"></div><div class="fake-button"></div></div></form></div></div><p></p>]]></content:encoded></item><item><title><![CDATA[GroupBy #33: Data Gateway - A Platform for Growing and Protecting the Data Tier at Netflix, The Cloud Storage Triad: Latency, Cost, Durability]]></title><description><![CDATA[Plus: Solving RevenueCat's data ingestion challenges into Snowflake, From ZooKeeper to KRaft: How the Kafka migration works]]></description><link>https://vutr.substack.com/p/groupby-33-data-gateway-a-platform</link><guid isPermaLink="false">https://vutr.substack.com/p/groupby-33-data-gateway-a-platform</guid><dc:creator><![CDATA[Vu Trinh]]></dc:creator><pubDate>Tue, 30 Apr 2024 11:01:47 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!LnWo!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffeed28de-4e1a-41c4-a79d-c381b5b13a68_1400x1000.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p><em>This is 
<strong>GroupBy</strong>, the weekly compiled resources for data engineers.</em></p><p><em>Not subscribed yet? Here you go:</em></p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://vutr.substack.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe now&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://vutr.substack.com/subscribe?"><span>Subscribe now</span></a></p><div class="pullquote"><div class="install-substack-app-embed install-substack-app-embed-web" data-component-name="InstallSubstackAppToDOM"><img class="install-substack-app-embed-img" src="https://substackcdn.com/image/fetch/$s_!D8N-!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fvutr.substack.com%2Fimg%2Fsubstack.png"><div class="install-substack-app-embed-text"><div class="install-substack-app-header">Get more from Vu Trinh in the Substack app</div><div class="install-substack-app-text">Available for iOS and Android</div></div><a href="https://substack.com/app/app-store-redirect?utm_campaign=app-marketing&amp;utm_content=author-post-insert&amp;utm_source=vutr" target="_blank" class="install-substack-app-embed-link"><button class="install-substack-app-embed-btn button primary">Get the app</button></a></div></div><blockquote><p><em>&#128075; Hi, my name is Vu Trinh, a data engineer.</em></p><p><em>I enjoy reading <strong>good stuff</strong>  (related to data and engineering), and this newsletter is my effort on the journey to seek the "good stuff" across the entire Internet. 
</em></p><p><em>Hope this issue finds you well.</em></p></blockquote><div><hr></div><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!LnWo!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffeed28de-4e1a-41c4-a79d-c381b5b13a68_1400x1000.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!LnWo!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffeed28de-4e1a-41c4-a79d-c381b5b13a68_1400x1000.png 424w, https://substackcdn.com/image/fetch/$s_!LnWo!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffeed28de-4e1a-41c4-a79d-c381b5b13a68_1400x1000.png 848w, https://substackcdn.com/image/fetch/$s_!LnWo!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffeed28de-4e1a-41c4-a79d-c381b5b13a68_1400x1000.png 1272w, https://substackcdn.com/image/fetch/$s_!LnWo!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffeed28de-4e1a-41c4-a79d-c381b5b13a68_1400x1000.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!LnWo!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffeed28de-4e1a-41c4-a79d-c381b5b13a68_1400x1000.png" width="1400" height="1000" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/feed28de-4e1a-41c4-a79d-c381b5b13a68_1400x1000.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1000,&quot;width&quot;:1400,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:1609148,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!LnWo!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffeed28de-4e1a-41c4-a79d-c381b5b13a68_1400x1000.png 424w, https://substackcdn.com/image/fetch/$s_!LnWo!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffeed28de-4e1a-41c4-a79d-c381b5b13a68_1400x1000.png 848w, https://substackcdn.com/image/fetch/$s_!LnWo!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffeed28de-4e1a-41c4-a79d-c381b5b13a68_1400x1000.png 1272w, https://substackcdn.com/image/fetch/$s_!LnWo!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffeed28de-4e1a-41c4-a79d-c381b5b13a68_1400x1000.png 1456w" sizes="100vw" fetchpriority="high"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" 
xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Image created by the Canva Image Generator.</figcaption></figure></div><div><hr></div><h1>&#128025; Learning</h1><blockquote><p><em>I love to learn, and I assume you do too.</em></p></blockquote><h4>&#128214;&#9478;<a href="https://blog.lawrencejones.dev/learn-one-thing/">Learn one thing at a time</a></h4><p>&#9997; <a href="https://twitter.com/lawrjones">Lawrence Jones</a></p><blockquote><p><em>Of the mental models and rules I use in my life, by far the most useful is to learn only one thing at any given time.</em></p></blockquote><div><hr></div><h1>&#128640; Engineering</h1><blockquote><p><em>I have to believe in a world outside my own mind. 
&#8212; Memento (2000)</em></p></blockquote><h4>&#128214;&#9478;<a href="https://netflixtechblog.medium.com/data-gateway-a-platform-for-growing-and-protecting-the-data-tier-f1ed8db8f5c6">Data Gateway - A Platform for Growing and Protecting the Data Tier</a></h4><p>&#9997; <a href="https://netflixtechblog.medium.com/?source=post_page-----f1ed8db8f5c6--------------------------------">Netflix Technology Blog</a></p><blockquote><p><em>The Netflix Online Datastore team has built a platform we call the Data Gateway to enable our datastore engineers to deliver powerful data abstractions which protect Netflix application developers from complex distributed databases and incompatible API changes. In this opening post, we cover the platform as the first part of a series which shows how we use this platform to raise the level of abstraction that application developers use every day to create, access, and maintain their online data.</em></p></blockquote><h4>&#128214;&#9478;<a href="https://www.revenuecat.com/blog/engineering/data-ingestion-snowflake/">Solving RevenueCat's data ingestion challenges into Snowflake</a></h4><p>&#9997; <a href="https://www.linkedin.com/in/jes%C3%BAs-antonio-s%C3%A1nchez-m%C3%A9ndez-64799a25/?originalSubdomain=es">Jes&#250;s S&#225;nchez</a></p><blockquote><p><em>In this blog, I&#8217;ll take you through the intricacies of our data management practices, specifically focusing on the journey of our data from its origins to its final destination in Snowflake. We&#8217;ll explore the challenges we faced, the solutions we devised, and the insights we gained through the process of optimizing our data ingestion pipeline.</em></p></blockquote><h4>&#128214;&#9478;<a href="https://sympathetic.ink/2024/04/29/The-Deconstructed-Database.html">The Deconstructed Database</a></h4><p>&#9997; <a href="https://julien.ledem.net/">Julien Le Dem</a></p><blockquote><p><em>Recently there&#8217;s been more talk around the same idea of building composable data systems. 
Arrow and Iceberg have grown exponentially in popularity and some of the aspirational ideas in my talk in 2018 are now well established. We do live in the future I was hoping for then. In this post, I want to explain in more detail what those components are and, more importantly, the contracts that keep them decoupled and composable.</em></p></blockquote><h4>&#128214;&#9478;<a href="https://materializedview.io/p/cloud-storage-triad-latency-cost-durability">The Cloud Storage Triad: Latency, Cost, Durability</a></h4><p>&#9997; <a href="https://substack.com/profile/69592459-chris-riccomini">Chris Riccomini</a></p><blockquote><p><em>I believe that the future of database persistence is object storage&#8212;S3, Google Cloud Storage, and so on. New systems like Neon, WarpStream, and Turbopuffer persist data in object storage to offer infinite retention, durability, replication, data warehouse integration, and so on.</em></p></blockquote><h4>&#128214;&#9478;<a href="https://duckdb.org/2024/03/29/external-aggregation.html">No Memory? No Problem. External Aggregation in DuckDB</a></h4><p>&#9997; <a href="https://www.linkedin.com/in/lnkuiper/">Laurens Kuiper</a></p><blockquote><p><em>In a nutshell, that&#8217;s what this post is about. Since the 0.9.0 release, DuckDB&#8217;s hash aggregation can process more unique groups than fit in memory by offloading data to storage. In this post, we&#8217;ll explain how this works. 
If you want to know what hash aggregation is, how hash collisions are resolved, or how DuckDB&#8217;s hash table is structured, check out our first blog post on hash aggregation.</em></p></blockquote><h4>&#128214;&#9478;<a href="https://strimzi.io/blog/2024/03/21/kraft-migration/">From ZooKeeper to KRaft: How the Kafka migration works</a></h4><p>&#9997; <a href="https://twitter.com/ppatierno">Paolo Patierno</a></p><blockquote><p><em>Through this blog post we are going to describe the main differences between using ZooKeeper and KRaft for Kafka to store the cluster metadata and how the migration from the former to the latter works.</em></p></blockquote><h4>&#128214;&#9478;<a href="https://blogs.halodoc.io/maximizing-kafka-efficiency-exploring-parallel-consumers/">Scaling Kafka by Parallel Processing</a></h4><p>&#9997; <a href="https://www.linkedin.com/in/tanujkumar13334/">Tanuj Kumar</a></p><blockquote><p><em>In this blog post, we will delve into the world of parallel consumer strategies for Kafka, exploring various approaches and techniques for achieving parallelism in message processing. We will discuss the benefits and trade-offs of each approach, along with best practices for implementation.</em></p></blockquote><h4>&#128214;&#9478;<a href="https://blogit.michelin.io/dkafka-streams/">Designing Kafka Streams Applications</a></h4><p>&#9997; <a href="https://blogit.michelin.io/author/servaire/">Val&#233;rie Servaire</a> + <a href="https://blogit.michelin.io/author/paul-2/">Paul Amar</a> + <a href="https://blogit.michelin.io/author/damien/">Damien Fayet</a> + <a href="https://blogit.michelin.io/author/sebastien/">S&#233;bastien Viale</a></p><blockquote><p><em>For the past 4 years, our journey into the heart of Kafka's capabilities has been shaped by two pivotal concepts: Master Topologies and Micro Topologies. 
These conceptual frameworks have become the backbone of our Kafka Streams application design, offering a comprehensive and granular understanding of our end-to-end communication.</em></p></blockquote><h4>&#128214;&#9478;<a href="https://blog.paradedb.com/pages/introducing_analytics">pg_analytics: Transforming Postgres into a Fast OLAP Database</a></h4><p>&#9997; <a href="https://www.linkedin.com/in/ming-ying/">Ming Ying</a></p><blockquote><p><em>We&#8217;re excited to introduce pg_analytics, an extension that accelerates the native analytical performance of any Postgres database by 94x. With pg_analytics installed, Postgres is 8x faster than Elasticsearch and nearly ties ClickHouse on analytical benchmarks.</em></p></blockquote><h4>&#128214;&#9478;<a href="https://blog.cloudflare.com/data-anywhere-events-pipelines-durable-execution-workflows">Data Anywhere with Pipelines, Event Notifications, and Workflows</a></h4><p>&#9997; <a href="https://blog.cloudflare.com/author/silverlock">Matt Silverlock</a></p><blockquote><p><em>Data is fundamental to any real-world application: the database storing your user data and inventory, the analytics tracking sales events and/or error rates, the object storage with your web assets and/or the Parquet files driving your data science team, and the vector database enabling semantic search or AI-powered recommendations for your users.</em></p></blockquote><h4>&#128214;&#9478;<a href="https://netflixtechblog.medium.com/investigation-of-a-cross-regional-network-performance-issue-422d6218fdf1">Investigation of a Cross-regional Network Performance Issue</a></h4><p>&#9997; <a href="https://netflixtechblog.medium.com/?source=post_page-----422d6218fdf1--------------------------------">Netflix Technology Blog</a></p><blockquote><p><em>This was a very interesting debugging exercise that covered many layers of Netflix&#8217;s stack and infrastructure. 
While it technically wasn&#8217;t the &#8220;network&#8221; to blame, this time it turned out the culprit was the software components that make up the network (i.e. the TCP implementation in the kernel).</em></p></blockquote><h4>&#128214;&#9478;<a href="https://www.startdataengineering.com/post/test-pyspark/">How to test PySpark code with pytest</a></h4><p>&#9997; <a href="https://www.startdataengineering.com/">Start Data Engineering</a></p><blockquote><p><em>Have you worked, or are you working with a code base that &#8220;moved fast&#8221; but had zero to no tests? Every minor feature request makes you start sweating because looking at your codebase the wrong way makes things explode unpredictably.</em></p></blockquote><h4>&#128214;&#9478;<a href="https://www.startdataengineering.com/post/docker-for-de/">Docker Fundamentals for Data Engineers</a></h4><p>&#9997; <a href="https://www.startdataengineering.com/">Start Data Engineering</a></p><blockquote><p><em>Docker can be overwhelming to start with. Most data projects use Docker to set up the data infra locally (and often in production). Setting up data tools locally without Docker is (usually) a nightmare! The official docker documentation, while extremely instructive, does not provide a simple guide covering the basics for setting up data infrastructure.</em></p></blockquote><div><hr></div><h1>&#9999; Data</h1><blockquote><p><em>The one thing that this job has taught me is that truth is stranger than fiction. 
&#8212; Predestination (2014)</em></p></blockquote><h4>&#128214;&#9478;<a href="https://practicaldatamodeling.substack.com/p/practical-data-modeling-chapter-1?utm_source=post-email-title&amp;publication_id=1473069&amp;post_id=144076836&amp;utm_campaign=email-post-title&amp;isFreemail=true&amp;r=2rj6sg&amp;triedRedirect=true&amp;utm_medium=email">Practical Data Modeling - Chapter 1</a></h4><p>&#9997; <a href="https://substack.com/profile/3531217-joe-reis">Joe Reis</a></p><blockquote><p><em>What is data modeling? If you ask a group of people this question, you&#8217;ll get as many answers as the number of people you asked. Let&#8217;s start by defining data and models and then clarifying what data modeling is and is not.</em></p></blockquote><div><hr></div><h1>&#129302; AI&#9478;ML&#9478;Data Science</h1><blockquote><p><em>You know, Burke, I don&#8217;t know which species is worse. &#8212; Ripley, Aliens (1986)</em></p></blockquote><h4>&#128214;&#9478;<a href="https://www.linkedin.com/blog/engineering/generative-ai/musings-on-building-a-generative-ai-product">LinkedIn - Musings on building a Generative AI product</a></h4><p>&#9997; <a href="https://www.linkedin.com/in/juan-pablo-bottaro/">Juan Pablo Bottaro</a></p><blockquote><p><em><strong>Was it easy to build? What went well and what didn&#8217;t?</strong> Building on top of generative AI wasn&#8217;t all smooth sailing, and we hit a wall in many places. We want to pull back the &#8220;engineering&#8221; curtain and share what came easy, where we struggled, and what&#8217;s coming next.</em></p></blockquote><h4>&#128214;&#9478;<a href="https://www.strangeloopcanon.com/p/what-can-llms-never-do">What can LLMs never do?</a></h4><p>&#9997; <a href="https://substack.com/profile/12282408-rohit-krishnan">Rohit Krishnan</a></p><blockquote><p><em>On goal drift and lower reliability. 
Or, why can't LLMs play Conway's Game Of Life?</em></p></blockquote><h4>&#128214;&#9478;<a href="https://www.statsignificant.com/p/when-do-we-stop-finding-new-music">When Do We Stop Finding New Music? A Statistical Analysis</a></h4><p>&#9997; <a href="https://substack.com/profile/112812180-daniel-parris">Daniel Parris</a></p><blockquote><p><em>So today, we'll explore how our relationship to music changes with age and the developmental phenomena driving our forever-shifting cultural tastes.</em></p></blockquote><h4>&#128214;&#9478;<a href="https://www.uber.com/en-SG/blog/generative-ai-for-high-quality-mobile-testing/">DragonCrawl: Generative AI for High-Quality Mobile Testing</a></h4><p>&#9997; <a href="https://www.uber.com/blog/asia/?uclick_id=15b6739c-0acd-406e-bdf6-884992beefa0">Uber Engineering Blog</a></p><blockquote><p><em>This blog will cover a quick introduction to large language models, deep dive into our architecture, challenges, and results. We will close by touching a little on what is in store for DragonCrawl.</em></p></blockquote><h4>&#128214;&#9478;<a href="https://medium.com/airbnb-engineering/airbnb-brandometer-powering-brand-perception-measurement-on-social-media-data-with-ai-c83019408051">Airbnb Brandometer: Powering Brand Perception Measurement on Social Media Data with AI</a></h4><p>&#9997; <a href="https://medium.com/@watera427_75688?source=post_page-----c83019408051--------------------------------">Tiantian Zhang</a></p><blockquote><p><em>At Airbnb, we have developed Brandometer, a state-of-the-art natural language understanding (NLU) technique for understanding brand perception based on social media data.</em></p></blockquote><h4>&#128214;&#9478;<a href="https://stripe.com/blog/shepherd-how-stripe-adapted-chronon-to-scale-ml-feature-development">Shepherd: How Stripe adapted Chronon to scale ML feature development</a></h4><p>&#9997; <a href="https://www.linkedin.com/in/benjamin-mears-81680714/">Benjamin Mears</a></p><blockquote><p><em>In 2022 
we began a partnership with Airbnb to adapt and implement its platform, Chronon, as the foundation for Shepherd&#8212;our next-generation ML feature engineering platform&#8212;with a view to open sourcing it. We&#8217;ve already used it to build a new production model for fraud detection with over 200 features, and so far the Shepherd-enabled model has outperformed our previous model, blocking tens of millions of dollars of additional fraud per year. While our work building Shepherd was specific to Stripe, we are generalizing the approach by contributing optimizations and new functionality to Chronon that anyone can use.</em></p></blockquote><div><hr></div><h1>&#128293; Catch up</h1><blockquote><p><em>&#8230;Next Saturday night, we're sending you back to the future! &#8212; Dr. Emmett Brown, Back to the Future (1985)</em></p></blockquote><p>&#128214;&#9478;<a href="https://cloud.google.com/bigquery/docs/write-sql-gemini#generate_a_sql_query">SQL code generation</a> is now available for all BigQuery projects. This feature is available in <a href="https://cloud.google.com/products#product-launch-stages">preview</a>.</p><p>&#128214;&#9478;<a href="https://cloud.google.com/bigquery/docs/user-defined-aggregates">User-defined aggregate functions (UDAFs)</a> that support SQL expressions are in <a href="https://cloud.google.com/products#product-launch-stages">preview</a>. 
Users can create a UDAF with the <a href="https://cloud.google.com/bigquery/docs/reference/standard-sql/data-definition-language#sql-create-udaf-function">CREATE AGGREGATE FUNCTION</a> statement.</p><div><hr></div><h1>&#128160; Previously on Dimension</h1><blockquote><p><em>Dimension is my sub-newsletter where I note down things I learn from people smarter than me in the data engineering field.</em></p></blockquote><p>Here are the 3 latest articles:</p><h3><em><strong>Published on 2024, April 13:</strong></em></h3><div class="digest-post-embed" data-attrs="{&quot;nodeId&quot;:&quot;eff802da-e602-4ad5-9288-214194ae7146&quot;,&quot;caption&quot;:&quot;My name is Vu Trinh, and I am a data engineer. I&#8217;m trying to make my life less dull by spending time learning and researching &#8220;how it works&#8220; in the data engineering field. Here is a place where I share everything I&#8217;ve learned. Not subscribe yet? Here you go:&quot;,&quot;cta&quot;:null,&quot;showBylines&quot;:true,&quot;size&quot;:&quot;sm&quot;,&quot;isEditorNode&quot;:true,&quot;title&quot;:&quot;A Closer Look Into Databricks's Photon Engine&quot;,&quot;publishedBylines&quot;:[{&quot;id&quot;:167177248,&quot;name&quot;:&quot;Vu Trinh&quot;,&quot;bio&quot;:&quot;This is 
me&quot;,&quot;photo_url&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/4805f673-db97-4f7c-85c4-44b345a8de80_256x256.png&quot;,&quot;is_guest&quot;:false,&quot;bestseller_tier&quot;:null}],&quot;post_date&quot;:&quot;2024-04-13T11:01:08.905Z&quot;,&quot;cover_image&quot;:&quot;https://substackcdn.com/image/fetch/f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5b0c8312-3ac2-4fa7-9d0d-d893be1b3031_1402x1002.png&quot;,&quot;cover_image_alt&quot;:null,&quot;canonical_url&quot;:&quot;https://vutr.substack.com/p/a-closer-look-into-databrickss-photon&quot;,&quot;section_name&quot;:&quot;Dimensions.&quot;,&quot;video_upload_id&quot;:null,&quot;id&quot;:143028483,&quot;type&quot;:&quot;newsletter&quot;,&quot;reaction_count&quot;:0,&quot;comment_count&quot;:0,&quot;publication_id&quot;:null,&quot;publication_name&quot;:&quot;VuTrinh.&quot;,&quot;publication_logo_url&quot;:&quot;https://substackcdn.com/image/fetch/f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffb91e59e-4df2-4ac8-b3b4-1f4e830f7b42_256x256.png&quot;,&quot;belowTheFold&quot;:true,&quot;youtube_url&quot;:null,&quot;show_links&quot;:null,&quot;feed_url&quot;:null}"></div><h3><em><strong>Published on 2024, April 20:</strong></em></h3><div class="digest-post-embed" data-attrs="{&quot;nodeId&quot;:&quot;0d97da78-1d05-4242-a311-d205598e7df1&quot;,&quot;caption&quot;:&quot;My name is Vu Trinh, and I am a data engineer. I&#8217;m trying to make my life less dull by spending time learning and researching &#8220;how it works&#8220; in data engineering. Here is a place where I share everything I&#8217;ve learned. Not subscribe yet? 
Here you go:&quot;,&quot;cta&quot;:null,&quot;showBylines&quot;:true,&quot;size&quot;:&quot;sm&quot;,&quot;isEditorNode&quot;:true,&quot;title&quot;:&quot;Do we need the Lakehouse architecture?&quot;,&quot;publishedBylines&quot;:[],&quot;post_date&quot;:&quot;2024-04-20T11:01:59.339Z&quot;,&quot;cover_image&quot;:&quot;https://substackcdn.com/image/fetch/f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F71fc52a9-786c-4eb1-a115-c7de3917c1be_1397x997.png&quot;,&quot;cover_image_alt&quot;:null,&quot;canonical_url&quot;:&quot;https://vutr.substack.com/p/do-we-need-the-lakehouse-architecture&quot;,&quot;section_name&quot;:&quot;Dimensions.&quot;,&quot;video_upload_id&quot;:null,&quot;id&quot;:143214887,&quot;type&quot;:&quot;newsletter&quot;,&quot;reaction_count&quot;:5,&quot;comment_count&quot;:0,&quot;publication_id&quot;:null,&quot;publication_name&quot;:&quot;VuTrinh.&quot;,&quot;publication_logo_url&quot;:&quot;https://substackcdn.com/image/fetch/f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffb91e59e-4df2-4ac8-b3b4-1f4e830f7b42_256x256.png&quot;,&quot;belowTheFold&quot;:true,&quot;youtube_url&quot;:null,&quot;show_links&quot;:null,&quot;feed_url&quot;:null}"></div><h3><em><strong>Published on 2024, April 27:</strong></em></h3><div class="digest-post-embed" data-attrs="{&quot;nodeId&quot;:&quot;2442b03f-2825-4c87-8a12-0c6a1a7abdb2&quot;,&quot;caption&quot;:&quot;My name is Vu Trinh, and I am a data engineer. I&#8217;m trying to make my life less dull by spending time learning and researching &#8220;how it works&#8220; in the data engineering field. Here is a place where I share everything I&#8217;ve learned. Not subscribe yet? 
Here you go:&quot;,&quot;cta&quot;:null,&quot;showBylines&quot;:true,&quot;size&quot;:&quot;sm&quot;,&quot;isEditorNode&quot;:true,&quot;title&quot;:&quot;The stream processing model behind Google Cloud Dataflow&quot;,&quot;publishedBylines&quot;:[{&quot;id&quot;:167177248,&quot;name&quot;:&quot;Vu Trinh&quot;,&quot;bio&quot;:&quot;This is me&quot;,&quot;photo_url&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/4805f673-db97-4f7c-85c4-44b345a8de80_256x256.png&quot;,&quot;is_guest&quot;:false,&quot;bestseller_tier&quot;:null}],&quot;post_date&quot;:&quot;2024-04-27T11:01:02.622Z&quot;,&quot;cover_image&quot;:&quot;https://substackcdn.com/image/fetch/f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F70c445fb-a188-4171-815a-82f7b4f2b69c_1399x999.png&quot;,&quot;cover_image_alt&quot;:null,&quot;canonical_url&quot;:&quot;https://vutr.substack.com/p/the-stream-processing-model-behind&quot;,&quot;section_name&quot;:&quot;Dimensions.&quot;,&quot;video_upload_id&quot;:null,&quot;id&quot;:143484667,&quot;type&quot;:&quot;newsletter&quot;,&quot;reaction_count&quot;:1,&quot;comment_count&quot;:0,&quot;publication_id&quot;:null,&quot;publication_name&quot;:&quot;VuTrinh.&quot;,&quot;publication_logo_url&quot;:&quot;https://substackcdn.com/image/fetch/f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffb91e59e-4df2-4ac8-b3b4-1f4e830f7b42_256x256.png&quot;,&quot;belowTheFold&quot;:true,&quot;youtube_url&quot;:null,&quot;show_links&quot;:null,&quot;feed_url&quot;:null}"></div><div class="pullquote"><p>Let me hear your voice, for example:</p><p>'Your newsletter is so terrible, I can't handle it anymore.'</p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://vutr.substack.com/p/groupby-33-data-gateway-a-platform/comments&quot;,&quot;text&quot;:&quot;Leave a comment&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" 
data-component-name="ButtonCreateButton"><a class="button primary" href="https://vutr.substack.com/p/groupby-33-data-gateway-a-platform/comments"><span>Leave a comment</span></a></p><div><hr></div><h2>&#8220;Hasta la vista, baby&#8221; -T800, Terminator 2: Judgment Day (1991)</h2></div><div class="subscription-widget-wrap-editor" data-attrs="{&quot;url&quot;:&quot;https://vutr.substack.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe&quot;,&quot;language&quot;:&quot;en&quot;}" data-component-name="SubscribeWidgetToDOM"><div class="subscription-widget show-subscribe"><div class="preamble"><p class="cta-caption">Thanks for scrolling this far! There's a convenient subscribe box here if you want me to annoy you every week. &#128516;</p></div><form class="subscription-widget-subscribe"><input type="email" class="email-input" name="email" placeholder="Type your email&#8230;" tabindex="-1"><input type="submit" class="button primary" value="Subscribe"><div class="fake-input-wrapper"><div class="fake-input"></div><div class="fake-button"></div></div></form></div></div><p></p>]]></content:encoded></item><item><title><![CDATA[GroupBy #32: Canva - Scaling to Count Billions, Ensuring Precision and Integrity: A Deep Dive into Uber’s Accounting Data Testing Strategies]]></title><description><![CDATA[Plus: LLM fine-tuning and evaluation in BigQuery, How We Built Slack AI To Be Secure and Private]]></description><link>https://vutr.substack.com/p/groupby-32-canva-scaling-to-count</link><guid isPermaLink="false">https://vutr.substack.com/p/groupby-32-canva-scaling-to-count</guid><dc:creator><![CDATA[Vu Trinh]]></dc:creator><pubDate>Tue, 23 Apr 2024 11:01:38 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!S6i3!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F65aa6a69-d37c-44bb-ba67-5053990d9c00_1400x1000.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p><em>This is 
<strong>GroupBy</strong>, the weekly compiled resources for data engineers.</em></p><p><em>Not subscribed yet? Here you go:</em></p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://vutr.substack.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe now&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://vutr.substack.com/subscribe?"><span>Subscribe now</span></a></p><div class="pullquote"><div class="install-substack-app-embed install-substack-app-embed-web" data-component-name="InstallSubstackAppToDOM"><img class="install-substack-app-embed-img" src="https://substackcdn.com/image/fetch/$s_!D8N-!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fvutr.substack.com%2Fimg%2Fsubstack.png"><div class="install-substack-app-embed-text"><div class="install-substack-app-header">Get more from Vu Trinh in the Substack app</div><div class="install-substack-app-text">Available for iOS and Android</div></div><a href="https://substack.com/app/app-store-redirect?utm_campaign=app-marketing&amp;utm_content=author-post-insert&amp;utm_source=vutr" target="_blank" class="install-substack-app-embed-link"><button class="install-substack-app-embed-btn button primary">Get the app</button></a></div></div><blockquote><p><em>&#128075; Hi, my name is Vu Trinh, a data engineer.</em></p><p><em>I enjoy reading <strong>good stuff</strong> (related to data and engineering), and this newsletter is my effort on the journey to seek the "good stuff" across the entire Internet. 
</em></p><p><em>Hope this issue finds you well.</em></p></blockquote><div><hr></div><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!S6i3!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F65aa6a69-d37c-44bb-ba67-5053990d9c00_1400x1000.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!S6i3!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F65aa6a69-d37c-44bb-ba67-5053990d9c00_1400x1000.png 424w, https://substackcdn.com/image/fetch/$s_!S6i3!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F65aa6a69-d37c-44bb-ba67-5053990d9c00_1400x1000.png 848w, https://substackcdn.com/image/fetch/$s_!S6i3!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F65aa6a69-d37c-44bb-ba67-5053990d9c00_1400x1000.png 1272w, https://substackcdn.com/image/fetch/$s_!S6i3!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F65aa6a69-d37c-44bb-ba67-5053990d9c00_1400x1000.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!S6i3!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F65aa6a69-d37c-44bb-ba67-5053990d9c00_1400x1000.png" width="1400" height="1000" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/65aa6a69-d37c-44bb-ba67-5053990d9c00_1400x1000.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1000,&quot;width&quot;:1400,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:1393904,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!S6i3!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F65aa6a69-d37c-44bb-ba67-5053990d9c00_1400x1000.png 424w, https://substackcdn.com/image/fetch/$s_!S6i3!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F65aa6a69-d37c-44bb-ba67-5053990d9c00_1400x1000.png 848w, https://substackcdn.com/image/fetch/$s_!S6i3!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F65aa6a69-d37c-44bb-ba67-5053990d9c00_1400x1000.png 1272w, https://substackcdn.com/image/fetch/$s_!S6i3!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F65aa6a69-d37c-44bb-ba67-5053990d9c00_1400x1000.png 1456w" sizes="100vw" fetchpriority="high"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" 
xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Image created by the Canva Image Generator.</figcaption></figure></div><div><hr></div><h1>&#128200; Career</h1><blockquote><p><em>Don't let comfort hold you back.</em></p></blockquote><h4>&#128214;&#9478;<a href="https://blog.alexewerlof.com/p/principal-engineer">Principal Engineer</a></h4><p>&#9997; <a href="https://substack.com/profile/87732486-alex-ewerlof">Alex Ewerl&#246;f</a></p><blockquote><p><em>Just as going from Senior Engineer to Staff Engineer required a new skill (soft skills), the Principal Engineer requires a new skill: business skills.</em></p></blockquote><div><hr></div><h1>&#128640; Engineering</h1><blockquote><p><em>I have to believe in a world outside my own mind. 
&#8212; Memento (2000)</em></p></blockquote><h4>&#128214;&#9478;<a href="https://www.canva.dev/blog/engineering/scaling-to-count-billions/">Scaling to Count Billions</a></h4><p>&#9997; <a href="https://www.linkedin.com/in/yuszy/">Sangzhuoyang Yu</a></p><blockquote><p><em>Since we launched the program 3 years ago, usage of our creator content has doubled every 18 months. Now we pay creators based on billions of content usages each month. This usage data not only includes templates but also images, videos, and so on. Building and maintaining a service to track this data for payment is challenging. This blog post introduces the various architectures we&#8217;ve experimented with and the lessons we learned along the way.</em></p></blockquote><h4>&#128214;&#9478;<a href="https://www.uber.com/en-SG/blog/accounting-data-testing-strategies/">Ensuring Precision and Integrity: A Deep Dive into Uber&#8217;s Accounting Data Testing Strategies</a></h4><p>&#9997; <a href="https://www.uber.com/en-US/blog/engineering/">Uber Engineering Blog</a></p><blockquote><p><em>To maintain these tenets, Financial Accounting Services (FAS) Platform has built robust testing, monitoring, and alerting processes. This encompasses system configuration, business accounting, and external financial report generation.</em></p></blockquote><h4>&#128214;&#9478;<a href="https://medium.com/datareply/lakehouse-data-platform-on-kubernetes-e8dca2abc6f4">Lakehouse Data Platform on Kubernetes</a></h4><p>&#9997; <a href="https://medium.com/@majidazimi?source=post_page-----e8dca2abc6f4--------------------------------">Majid Azimi</a></p><blockquote><p><em>Building such a modern, cloud-native lakehouse platform is not only possible but has been made more accessible through the use of open-source technologies. Throughout this article, we will explore the intricacies of a data lakehouse platform, emphasizing how it simplifies the transition to and utilization of data lakehouses. 
We will also provide a comprehensive guide on constructing an entire data lakehouse ecosystem on Kubernetes, highlighting the steps and strategies involved in leveraging this powerful container-orchestration system to deploy and manage a highly scalable and resilient data platform.</em></p></blockquote><h4>&#128214;&#9478;<a href="https://doordash.engineering/2022/08/02/building-scalable-real-time-event-processing-with-kafka-and-flink/">Building Scalable Real Time Event Processing with Kafka and Flink</a></h4><p>&#9997; <a href="https://www.linkedin.com/in/allen-xiaozhong-wang-97a6925/">Allen Wang</a></p><blockquote><p><em>At DoorDash, real time events are an important data source to gain insight into our business but building a system capable of handling billions of real time events is challenging. Events are generated from our services and user devices and need to be processed and transported to different destinations to help us make data-driven decisions on the platform.</em></p></blockquote><h4>&#128214;&#9478;<a href="https://rmarcus.info/blog/2024/04/12/pg-over-time.html">Ten years of improvements in PostgreSQL's optimizer</a></h4><p>&#9997; <a href="https://discuss.systems/@ryanmarcus">Ryan Marcus</a></p><blockquote><p><em>As a query optimization researcher, I&#8217;ve spent the last 10 years of my life playing with, learning from, and building on top of the most sophisticated open source query optimizer out there, <a href="https://postgresql.org/">PostgreSQL</a>. I recently wondered how much PostgreSQL had improved over the decade since I started working on databases. While changelogs and opinion pieces were plentiful, I couldn&#8217;t find any strong empirical comparisons, so I decided to run the <a href="https://www.vldb.org/pvldb/vol9/p204-leis.pdf">join order benchmark</a> (JOB) on PostgreSQL 8 through 16. 
I recorded the 90th percentile query latency for each database version.</em></p></blockquote><h4>&#128214;&#9478;<a href="https://medium.com/gooddata-developers/duckdb-meets-apache-arrow-169e917a2d8d">DuckDB Meets Apache Arrow</a></h4><p>&#9997; <a href="https://medium.com/@jkadlec?source=post_page-----169e917a2d8d--------------------------------">Jan Kadlec</a></p><blockquote><p><em>You may have heard about DuckDB, Apache Arrow, or both. In this article, I&#8217;ll tell you about how we (GoodData) are the first analytics (BI) platform powered by the combination of these technologies. I believe the motivation is evident &#8212; performance &#127950;&#65039; and developer velocity.</em></p></blockquote><h4>&#128214;&#9478;<a href="https://medium.com/whatnot-engineering/powering-real-time-magic-moments-through-notifications-36cd833f898e">Powering real-time magic moments through notifications</a></h4><p>&#9997; <a href="https://medium.com/@whatnotengineering?source=post_page-----36cd833f898e--------------------------------">Whatnot Engineering</a></p><blockquote><p><em>Whatnot&#8217;s notification system had begun to reach its capacity to expand and meet the organization and our user&#8217;s needs. It started as a small class hierarchy in our main python codebase then moved to background tasks powered by RabbitMQ. As time passed, the logic and complexity in the operations grew along with the volume of notifications being sent. 
In this post, we share our journey building the 3rd iteration of a platform we believe will scale for the long term and share our considerations, hurdles, and lessons learned along the way.</em></p></blockquote><h4>&#128214;&#9478;<a href="https://medium.com/agoda-engineering/how-we-improved-database-development-and-ci-cd-with-storage-snapshot-fa0300d80f0f">How we Improved Database Development and CI/CD with Storage Snapshot</a></h4><p>&#9997; <a href="https://medium.com/@agoda.eng?source=post_page-----fa0300d80f0f--------------------------------">Agoda Engineering</a></p><blockquote><p><em>Database development is an integral part of software development. However, it demands considerable effort and can present challenges throughout the development and testing stages. At Agoda, we&#8217;ve adopted specific methodologies and processes to accelerate and simplify the development workflow. In this blog post, we&#8217;ll outline the strategies, insights, and implementation techniques of this process.</em></p></blockquote><h4>&#128214;&#9478;<a href="https://www.roguelynn.com/talks/everyday-apis/">The Design of Everyday APIs</a></h4><p>&#9997; <a href="https://www.roguelynn.com/about/">Lynn Root</a></p><blockquote><p><em>Implementing an API is an art. It&#8217;s the connection between the user and the library itself. How can we optimize that connection to make the experience more pleasing? What makes a user reach for one library over another? What goes into an ergonomic API?</em></p></blockquote><h4>&#128214;&#9478;<a href="https://memgraph.com/blog/memgraph-storage-modes-explained">Memgraph Storage Modes Explained</a></h4><p>&#9997; <a href="https://www.linkedin.com/in/katarina-supe/">Katarina Supe</a></p><blockquote><p><em>Memgraph is an in-memory graph database that ensures data persistence through ACID compliance by default. While it uses snapshots and write-ahead logs (WAL) for data recovery, in some cases, such additional files and insurance are not necessary. 
Other databases and analytics tools offer snapshots or WAL, but Memgraph offers both with different storage modes.</em></p></blockquote><h4>&#128214;&#9478;<a href="https://blog.det.life/how-to-think-about-internal-data-products-as-a-data-engineer-42cef9081ebf">How to think about Internal Data Products as a Data Engineer</a></h4><p>&#9997; <a href="https://medium.com/@hugolu87">Hugo Lu</a></p><blockquote><p><em>In this article we&#8217;ll dive into how to think about Internal Data Products as a Data Engineer. We&#8217;ll answer some common questions about Data Products. I&#8217;ll show you how I think about building your first Data Product too.</em></p></blockquote><h4>&#128214;&#9478;<a href="https://www.dremio.com/blog/the-origins-of-apache-arrow-its-fit-in-todays-data-landscape/">Origins of Apache Arrow &amp; Its Role Today</a></h4><p>&#9997; <a href="https://www.linkedin.com/in/dipankar-mazumdar/">Dipankar Mazumdar</a></p><blockquote><p><em>This blog details the origins of Apache Arrow and shows how it fits in today&#8217;s constantly changing data landscape. The libraries and tools using Arrow illustrate that while data consumers may not directly consume Arrow, they probably have interacted with it underneath the covers and have taken advantage of numerous tasks.</em></p></blockquote><h4>&#128214;&#9478;<a href="https://blog.datumagic.com/p/apache-hudi-from-zero-to-one-1010">Apache Hudi: From Zero To One (10/10)</a></h4><p>&#9997; <a href="https://substack.com/profile/147107684-shiyan-xu">Shiyan Xu</a></p><blockquote><p><em>Throughout the last nine posts, I have explored Hudi concepts pertinent to release 0.14, ideas that are relevant across most of the 0.x versions. For the blog series finale, I aim to cast a glance into the future and delve into the exciting new features in the upcoming 1.0 release. 
In doing so, this ending post will effectively accomplish the purpose of the series: guiding readers from the foundational beginnings to the groundbreaking future - from zero to one.</em></p></blockquote><h4>&#128214;&#9478;<a href="https://www.alibabacloud.com/blog/building-a-streaming-lakehouse-performance-comparison-between-paimon-and-hudi_601013">Building a Streaming Lakehouse: Performance Comparison Between Paimon and Hudi</a></h4><p>&#9997; <a href="https://www.alibabacloud.com/blog/">Alibaba Cloud Blog</a></p><blockquote><p><em>Apache Paimon and Apache Hudi are widely used data lake storage formats with high write throughput and low-latency query performance. This article compares the performance of Paimon and Hudi on Alibaba Cloud EMR and explores their respective roles in building quasi-real-time data warehouses.</em></p></blockquote><div><hr></div><h1>&#9999; Data</h1><blockquote><p><em>The one thing that this job has taught me is that truth is stranger than fiction. &#8212; Predestination (2014)</em></p></blockquote><h4>&#128214;&#9478;<a href="https://blog.owulveryck.info/2024/04/09/data-as-a-product-and-data-contract-an-evolutionary-approach-to-data-maturity.html">Data-as-a-Product and Data-Contract: An evolutionary approach to data maturity</a></h4><p>&#9997; <a href="https://blog.owulveryck.info/about.html">Olivier Wulveryck</a></p><blockquote><p><em>In recapping, I have always grappled with one question: where does one begin when seeking to implement the data mesh paradigm? Through the journey of exploring this concept, my most recent and profound insight is: the most strategic starting point lies with the data product.</em></p></blockquote><h4>&#128214;&#9478;<a href="https://sqlpatterns.com/p/debugging-your-business-with-data">Debugging Your Business with Data</a></h4><p>&#9997; <a href="https://substack.com/@ergestx">Ergest Xheblati</a></p><blockquote><p><em>How to diagnose the root cause of a 90% drop in pipeline in 5 minutes. 
An interview with Abhi Sivasailam and Ankur Chawla of Levers Labs</em></p></blockquote><div><hr></div><h1>&#129302; AI&#9478;ML&#9478;Data Science</h1><blockquote><p><em>You know, Burke, I don&#8217;t know which species is worse. &#8212; Ripley, Aliens (1986)</em></p></blockquote><h4>&#128214;&#9478;<a href="https://lethain.com/mental-model-for-how-to-use-llms-in-products/">Notes on how to use LLMs in your product</a></h4><p>&#9997; <a href="https://lethain.com/about/">Will Larson</a></p><blockquote><p><em>I&#8217;ve been working fairly directly on meaningful applicability of LLMs to existing products for the last year, and wanted to type up some semi-disorganized notes. These notes are in no particular order, with an intended audience of industry folks building products.</em></p></blockquote><h4>&#128214;&#9478;<a href="https://slack.engineering/how-we-built-slack-ai-to-be-secure-and-private/">How We Built Slack AI To Be Secure and Private</a></h4><p>&#9997; <a href="https://slack.engineering/">Slack Engineering Blog</a></p><blockquote><p><em>Instead, to inform how we built out Slack AI, we started from first principles. We began with our requirements: upholding our existing security and compliance offerings, as well as our privacy principles like &#8220;Customer Data is sacrosanct.&#8221; Then, through the specific lens of generative AI, our team created a new set of Slack AI principles to guide us</em></p></blockquote><h4>&#128214;&#9478;<a href="https://open.nytimes.com/milestones-on-our-journey-to-standardize-experimentation-at-the-new-york-times-2c6d32db0281?gi=f847ef798c07">Milestones on Our Journey to Standardize Experimentation at The New York Times</a></h4><p>&#9997; <a href="https://medium.com/@kathy.a.yang">Kathy Yang</a></p><blockquote><p><em>At The New York Times, most product experiments use our internal experimentation platform, ABRA, which is short for A/B Reporting and Allocation architecture. Here&#8217;s a look at an early version. 
A lot has changed since those days, and we want to share some of the things we&#8217;ve learned on our standardization journey. While we use other types of experimentation, such as contextual multi-armed bandits, this article is focused specifically around the ways we&#8217;ve improved our A/B testing processes.</em></p></blockquote><div><hr></div><h1>&#128293; Catch up</h1><blockquote><p><em>&#8230;Next Saturday night, we're sending you back to the future! &#8212; Dr. Emmett Brown, Back to the Future (1985)</em></p></blockquote><p> &#128214;&#9478;<a href="https://www.databricks.com/blog/announcing-general-availability-ray-databricks">Announcing General Availability of Ray on Databricks</a></p><p>&#128214;&#9478;<a href="https://medium.com/apache-airflow/apache-airflow-2-9-0-dataset-and-ui-improvements-dfed574ed530">Apache Airflow 2.9.0: Dataset and UI Improvements</a></p><p>&#128214;&#9478;BigQuery: The <a href="https://cloud.google.com/bigquery/docs/reference/standard-sql/operators#like_operator_quantified">quantified </a><code>LIKE</code> operator is <a href="https://cloud.google.com/products#product-launch-stages">GA</a>. With this operator, you can check a search value for matches against a list of patterns or an array of patterns, using one of these conditions:</p><ul><li><p><code>LIKE ANY</code>: Checks if at least one pattern matches.</p></li><li><p><code>LIKE SOME</code>: Synonym for <code>LIKE ANY</code>.</p></li><li><p><code>LIKE ALL</code>: Checks if every pattern matches.</p></li></ul><p>&#128214;&#9478;BigQuery now supports <a href="https://cloud.google.com/bigquery/docs/reference/standard-sql/subqueries">subqueries</a> in <a href="https://cloud.google.com/bigquery/docs/managing-row-level-security#create_or_update_a_row-level_access_policy">row level access policies</a>. 
This feature is now in public <a href="https://cloud.google.com/products/#product-launch-stages">preview</a>.</p><p>&#128214;&#9478;<a href="https://cloud.google.com/blog/products/data-analytics/bigquery-can-now-fine-tune-models-hosted-in-vertex-ai/">Introducing LLM fine-tuning and evaluation in BigQuery</a></p><p>&#128214;&#9478;<a href="https://ai.meta.com/blog/meta-llama-3/">Introducing Meta Llama 3: The most capable openly available LLM to date</a></p><div><hr></div><h1>&#128160; Previously on Dimension</h1><blockquote><p><em>Dimension is my sub-newsletter where I note down things I learn from people smarter than me in the data engineering field.</em></p></blockquote><p>Here are the 3 latest articles:</p><h3><em><strong>Published on 2024, April 6:</strong></em></h3><div class="digest-post-embed" data-attrs="{&quot;nodeId&quot;:&quot;8edb4f18-fe31-4b04-8cf2-b2c86ce92723&quot;,&quot;caption&quot;:&quot;My name is Vu Trinh, and I am a data engineer. I&#8217;m trying to make my life less dull by spending time learning and researching &#8220;how it works&#8220; in the data engineering field. Here is a place where I share everything I&#8217;ve learned. Not subscribe yet? 
Here you go:&quot;,&quot;cta&quot;:null,&quot;showBylines&quot;:true,&quot;size&quot;:&quot;sm&quot;,&quot;isEditorNode&quot;:true,&quot;title&quot;:&quot;Why did Databricks build the Photon engine?&quot;,&quot;publishedBylines&quot;:[{&quot;id&quot;:167177248,&quot;name&quot;:&quot;Vu Trinh&quot;,&quot;bio&quot;:&quot;This is me&quot;,&quot;photo_url&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/4805f673-db97-4f7c-85c4-44b345a8de80_256x256.png&quot;,&quot;is_guest&quot;:false,&quot;bestseller_tier&quot;:null}],&quot;post_date&quot;:&quot;2024-04-06T11:01:07.773Z&quot;,&quot;cover_image&quot;:&quot;https://substackcdn.com/image/fetch/f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6a87d420-15c8-4ecf-9954-70a96c15561f_1399x999.png&quot;,&quot;cover_image_alt&quot;:null,&quot;canonical_url&quot;:&quot;https://vutr.substack.com/p/why-did-databricks-build-the-photon&quot;,&quot;section_name&quot;:&quot;Dimensions.&quot;,&quot;video_upload_id&quot;:null,&quot;id&quot;:142788667,&quot;type&quot;:&quot;newsletter&quot;,&quot;reaction_count&quot;:2,&quot;comment_count&quot;:0,&quot;publication_id&quot;:null,&quot;publication_name&quot;:&quot;VuTrinh.&quot;,&quot;publication_logo_url&quot;:&quot;https://substackcdn.com/image/fetch/f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffb91e59e-4df2-4ac8-b3b4-1f4e830f7b42_256x256.png&quot;,&quot;belowTheFold&quot;:true,&quot;youtube_url&quot;:null,&quot;show_links&quot;:null,&quot;feed_url&quot;:null}"></div><h3><em><strong>Published on 2024, April 13:</strong></em></h3><div class="digest-post-embed" data-attrs="{&quot;nodeId&quot;:&quot;eff802da-e602-4ad5-9288-214194ae7146&quot;,&quot;caption&quot;:&quot;My name is Vu Trinh, and I am a data engineer. 
I&#8217;m trying to make my life less dull by spending time learning and researching &#8220;how it works&#8220; in the data engineering field. Here is a place where I share everything I&#8217;ve learned. Not subscribe yet? Here you go:&quot;,&quot;cta&quot;:null,&quot;showBylines&quot;:true,&quot;size&quot;:&quot;sm&quot;,&quot;isEditorNode&quot;:true,&quot;title&quot;:&quot;A Closer Look Into Databricks's Photon Engine&quot;,&quot;publishedBylines&quot;:[{&quot;id&quot;:167177248,&quot;name&quot;:&quot;Vu Trinh&quot;,&quot;bio&quot;:&quot;This is me&quot;,&quot;photo_url&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/4805f673-db97-4f7c-85c4-44b345a8de80_256x256.png&quot;,&quot;is_guest&quot;:false,&quot;bestseller_tier&quot;:null}],&quot;post_date&quot;:&quot;2024-04-13T11:01:08.905Z&quot;,&quot;cover_image&quot;:&quot;https://substackcdn.com/image/fetch/f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5b0c8312-3ac2-4fa7-9d0d-d893be1b3031_1402x1002.png&quot;,&quot;cover_image_alt&quot;:null,&quot;canonical_url&quot;:&quot;https://vutr.substack.com/p/a-closer-look-into-databrickss-photon&quot;,&quot;section_name&quot;:&quot;Dimensions.&quot;,&quot;video_upload_id&quot;:null,&quot;id&quot;:143028483,&quot;type&quot;:&quot;newsletter&quot;,&quot;reaction_count&quot;:0,&quot;comment_count&quot;:0,&quot;publication_id&quot;:null,&quot;publication_name&quot;:&quot;VuTrinh.&quot;,&quot;publication_logo_url&quot;:&quot;https://substackcdn.com/image/fetch/f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffb91e59e-4df2-4ac8-b3b4-1f4e830f7b42_256x256.png&quot;,&quot;belowTheFold&quot;:true,&quot;youtube_url&quot;:null,&quot;show_links&quot;:null,&quot;feed_url&quot;:null}"></div><h3><em><strong>Published on 2024, April 20:</strong></em></h3><div class="digest-post-embed" 
data-attrs="{&quot;nodeId&quot;:&quot;0d97da78-1d05-4242-a311-d205598e7df1&quot;,&quot;caption&quot;:&quot;My name is Vu Trinh, and I am a data engineer. I&#8217;m trying to make my life less dull by spending time learning and researching &#8220;how it works&#8220; in data engineering. Here is a place where I share everything I&#8217;ve learned. Not subscribe yet? Here you go:&quot;,&quot;cta&quot;:null,&quot;showBylines&quot;:true,&quot;size&quot;:&quot;sm&quot;,&quot;isEditorNode&quot;:true,&quot;title&quot;:&quot;Do we need the Lakehouse architecture?&quot;,&quot;publishedBylines&quot;:[],&quot;post_date&quot;:&quot;2024-04-20T11:01:59.339Z&quot;,&quot;cover_image&quot;:&quot;https://substackcdn.com/image/fetch/f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F71fc52a9-786c-4eb1-a115-c7de3917c1be_1397x997.png&quot;,&quot;cover_image_alt&quot;:null,&quot;canonical_url&quot;:&quot;https://vutr.substack.com/p/do-we-need-the-lakehouse-architecture&quot;,&quot;section_name&quot;:&quot;Dimensions.&quot;,&quot;video_upload_id&quot;:null,&quot;id&quot;:143214887,&quot;type&quot;:&quot;newsletter&quot;,&quot;reaction_count&quot;:5,&quot;comment_count&quot;:0,&quot;publication_id&quot;:null,&quot;publication_name&quot;:&quot;VuTrinh.&quot;,&quot;publication_logo_url&quot;:&quot;https://substackcdn.com/image/fetch/f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffb91e59e-4df2-4ac8-b3b4-1f4e830f7b42_256x256.png&quot;,&quot;belowTheFold&quot;:true,&quot;youtube_url&quot;:null,&quot;show_links&quot;:null,&quot;feed_url&quot;:null}"></div><div class="pullquote"><p>Let me hear your voice, for example: </p><p>'Your newsletter is so terrible, I can't handle it anymore.'</p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://vutr.substack.com/p/groupby-32-canva-scaling-to-count/comments&quot;,&quot;text&quot;:&quot;Leave a 
comment&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://vutr.substack.com/p/groupby-32-canva-scaling-to-count/comments"><span>Leave a comment</span></a></p><div><hr></div><h2>&#8220;Hasta la vista, baby&#8221; -T800, Terminator 2: Judgment Day (1991)</h2></div><p></p>]]></content:encoded></item><item><title><![CDATA[GroupBy #31: Migrating a Trillion Entries of Uber’s Ledger Data from DynamoDB to LedgerStore, Grab Experiment Decision Engine]]></title><description><![CDATA[Plus: Airbnb open sourced Chronon - ML Feature Platform, BigQuery data canvas]]></description><link>https://vutr.substack.com/p/groupby-31-migrating-a-trillion-entries</link><guid isPermaLink="false">https://vutr.substack.com/p/groupby-31-migrating-a-trillion-entries</guid><dc:creator><![CDATA[Vu Trinh]]></dc:creator><pubDate>Tue, 16 Apr 2024 11:01:10 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!vqDk!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd616a43b-c869-420b-b5cc-ecdd5aefbf2b_1400x1000.png" length="0" 
type="image/jpeg"/><content:encoded><![CDATA[<p><em>This is <strong>GroupBy</strong>, the weekly compiled resources for data engineers.</em></p><p><em>Not subscribed yet? Here you go:</em></p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://vutr.substack.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe now&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://vutr.substack.com/subscribe?"><span>Subscribe now</span></a></p><blockquote><p><em>&#128075; Hi, my name is Vu Trinh, a data engineer.</em></p><p><em>I enjoy reading <strong>good stuff</strong>  (related to data and engineering), and this newsletter is my effort on the journey to seek the "good stuff" across the entire Internet. 
</em></p><p><em>Hope this issue finds you well.</em></p></blockquote><div><hr></div><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!vqDk!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd616a43b-c869-420b-b5cc-ecdd5aefbf2b_1400x1000.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!vqDk!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd616a43b-c869-420b-b5cc-ecdd5aefbf2b_1400x1000.png 424w, https://substackcdn.com/image/fetch/$s_!vqDk!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd616a43b-c869-420b-b5cc-ecdd5aefbf2b_1400x1000.png 848w, https://substackcdn.com/image/fetch/$s_!vqDk!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd616a43b-c869-420b-b5cc-ecdd5aefbf2b_1400x1000.png 1272w, https://substackcdn.com/image/fetch/$s_!vqDk!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd616a43b-c869-420b-b5cc-ecdd5aefbf2b_1400x1000.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!vqDk!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd616a43b-c869-420b-b5cc-ecdd5aefbf2b_1400x1000.png" width="1400" height="1000" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/d616a43b-c869-420b-b5cc-ecdd5aefbf2b_1400x1000.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1000,&quot;width&quot;:1400,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:1738736,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!vqDk!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd616a43b-c869-420b-b5cc-ecdd5aefbf2b_1400x1000.png 424w, https://substackcdn.com/image/fetch/$s_!vqDk!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd616a43b-c869-420b-b5cc-ecdd5aefbf2b_1400x1000.png 848w, https://substackcdn.com/image/fetch/$s_!vqDk!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd616a43b-c869-420b-b5cc-ecdd5aefbf2b_1400x1000.png 1272w, https://substackcdn.com/image/fetch/$s_!vqDk!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd616a43b-c869-420b-b5cc-ecdd5aefbf2b_1400x1000.png 1456w" sizes="100vw" fetchpriority="high"></picture></div></a><figcaption class="image-caption">Image created by Canva Image Generator.</figcaption></figure></div><div><hr></div><h1>&#128200; Career</h1><blockquote><p><em>Don't let comfort hold you back.</em></p></blockquote><h4>&#128214;&#9478;<a href="https://blog.dataengineer.io/p/the-2024-breaking-into-data-engineering?utm_source=post-email-title&amp;publication_id=1644342&amp;post_id=139996630&amp;utm_campaign=email-post-title&amp;isFreemail=true&amp;r=2rj6sg&amp;triedRedirect=true&amp;utm_medium=email">The 2024 breaking into data engineering roadmap</a></h4><p>&#9997; <a href="https://substack.com/profile/10367987-zach-wilson">Zach Wilson</a></p><blockquote><p><em>Knowing where to start and how to get a handle on this requires some guidance. 
This newsletter is going to unveil all the steps needed to break into data engineering in 2024!</em></p></blockquote><h4>&#128214;&#9478;<a href="https://koopingshung.substack.com/p/certifications-and-what-it-signals">Certifications &amp; What It Signals?</a></h4><p>&#9997; <a href="https://substack.com/profile/7906875-koo-ping-shung">Koo Ping Shung</a></p><blockquote><p><em>I recently had an interesting discussion on certification program, which looking at the current landscape, does feel a strong need for a deeper look into it and where it is needed.</em></p></blockquote><h4>&#128214;&#9478;<a href="https://vadimkravcenko.com/shorts/mental-health-in-software-engineering/">Mental Health in Software Engineering</a></h4><p>&#9997;<a href="https://vadimkravcenko.com/about-me/">Vadim Kravcenko</a></p><blockquote><p><em>I want to talk about something we don't discuss enough in our field: the mental health of software engineers, especially those of us who've taken on the challenge of leadership.</em></p></blockquote><div><hr></div><h1>&#128640; Engineering</h1><blockquote><p><em>I have to believe in a world outside my own mind. &#8212; Memento (2000)</em></p></blockquote><h4>&#128214;&#9478;<a href="https://www.uber.com/blog/migrating-from-dynamodb-to-ledgerstore/">Migrating a Trillion Entries of Uber&#8217;s Ledger Data from DynamoDB to LedgerStore</a></h4><p>&#9997; <a href="https://www.uber.com/blog/asia/">Uber Engineering Blog</a></p><blockquote><p><em>Last week, we explored LedgerStore (LSG) &#8211; Uber&#8217;s append-only, ledger-style database. This week, we&#8217;ll dive into how we migrated Uber&#8217;s business-critical ledger data to LSG. 
We&#8217;ll detail how we moved more than a trillion entries (making up a few petabytes of data) transparently and without causing disruption, and we&#8217;ll discuss what we learned during the migration.</em></p></blockquote><h4>&#128214;&#9478;<a href="https://www.decodable.co/blog/pragmatic-approach-to-data-movement">The Pragmatic Approach to Data Movement</a></h4><p>&#9997; <a href="https://www.decodable.co/blog-author/eric-sammer">Eric Sammer</a></p><blockquote><p><em>Data movement is the most stubborn problem in infrastructure.</em></p></blockquote><h4>&#128214;&#9478;<a href="https://engineering.mixpanel.com/under-the-hood-of-mixpanels-infrastructure-0c7682125e9b">Under the Hood of Mixpanel&#8217;s Infrastructure</a></h4><p>&#9997; <a href="https://medium.com/@vijay.jayaram">Vijay Jayaram</a></p><blockquote><p><em>Mixpanel&#8217;s analysis UI is powered by an in-house database called Arb, which is built for ingesting, storing, and querying trillions of events in real-time. This page covers the core aspects of our design, the pain points it eliminates for users, and how it compares to other systems.</em></p></blockquote><h4>&#128214;&#9478;<a href="https://github.com/getlago/lago/wiki/Using-Clickhouse-to-scale-an-events-engine">Using Clickhouse to scale an events engine</a> + <a href="https://news.ycombinator.com/item?id=40005005">the article&#8217;s comment thread on Hacker News</a></h4><p>&#9997; <a href="https://github.com/getlago/lago">Lago GitHub Repo</a></p><blockquote><p><em>Today, we&#8217;re going to explore that decision for a hybrid database stack, and more specifically, why we decided to go with ClickHouse.</em></p></blockquote><h4>&#128250;&#9478;<a href="https://www.youtube.com/playlist?list=PLJvBe8nQAEsEBSoUY0lRFVZr2_YeHYkUR">Velox Conference 2024 presentations</a></h4><h4>&#128214;&#9478;<a href="https://engineering.grab.com/grabx-decision-engine">Grab Experiment Decision Engine - a Unified Toolkit for Experimentation</a></h4><p>&#9997; <a 
href="https://www.linkedin.com/in/ruike-zhang-13086053?originalSubdomain=sg">Ruike Zhang</a> + <a href="https://www.linkedin.com/in/panos-mavrokonstantis/?originalSubdomain=sg">Panos Mavrokonstantis</a></p><blockquote><p><em>This article introduces the GrabX Decision Engine, an internal open-source package that offers a comprehensive framework for designing and analysing experiments conducted on online experiment platforms. The package encompasses a wide range of functionalities, including a pre-experiment advisor, a post-experiment analysis toolbox, and other advanced tools. In this article, we explore the motivation behind the development of these functionalities, their integration into the unique ecosystem of Grab&#8217;s multi-sided marketplace, and how these solutions strengthen the culture and calibre of experimentation at Grab.</em></p></blockquote><h4>&#128214;&#9478;<a href="https://blog.dagworks.io/p/slack-summary-pipeline-with-dlt-ibis">Slack summary pipeline with dlt, Ibis, and Hamilton</a></h4><p>&#9997; <a href="https://substack.com/profile/153186724-thierry-jean">Thierry Jean</a> + <a href="https://substack.com/@dagworks">DagWorks INC</a></p><blockquote><p><em>A lightweight &amp; modern Python ETL stack</em></p></blockquote><h4>&#128214;&#9478;<a href="https://tobikodata.com/ast_journey.html">How I became an AST convert</a></h4><p>&#9997; <a href="https://tobikodata.com/author/afzal-jasani.html">Afzal Jasani</a></p><blockquote><p><em>AST stands for abstract syntax tree. The team here at Tobiko Data has written a couple different blogs regarding the topic but I want to convey my journey in understanding ASTs and their purpose. 
It&#8217;s fair to say that my past experiences never warranted me to think this deeply about data structures.</em></p></blockquote><h4>&#128214;&#9478;<a href="https://medium.com/airbnb-engineering/introducing-trio-part-iii-033fbfe2171b?source=rss----53c7c27702d5---4">Introducing Trio | Part III</a></h4><p>&#9997; <a href="https://medium.com/@konakid?source=post_page-----033fbfe2171b--------------------------------">Eli Hart</a></p><blockquote><p><em>Part three on how we built a Compose based architecture with Mavericks in the Airbnb Android app</em></p></blockquote><h4>&#128214;&#9478;<a href="https://medium.com/@netflixtechblog/a-tale-of-two-frameworks-the-domain-graph-service-framework-meets-spring-graphql-f8237f09c389">A Tale of Two Frameworks: The Domain Graph Service Framework Meets Spring GraphQL</a></h4><p>&#9997; <a href="https://netflixtechblog.medium.com/?source=post_page-----f8237f09c389--------------------------------">Netflix Technology Blog</a></p><blockquote><p><em>Netflix open-sourced the Domain Graph Service (DGS) Framework in early 2021. Since then, the framework has seen widespread adoption across Netflix and many other companies. 
The DGS Framework provides Java developers with a programming model on top of Spring Boot to create GraphQL services.</em></p></blockquote><h4>&#128214;&#9478;<a href="https://blog.det.life/data-engineering-architectures-strategies-for-handling-sensitive-data-83292b997c17">Data Engineering: Architectures &amp; Strategies for Handling Sensitive Data</a></h4><p>&#9997; <a href="https://husseinjundi.medium.com/">Hussein Jundi</a></p><blockquote><p><em>Strategies and architectures to handle sensitive data efficiently and securely according to an organization&#8217;s data maturity level.</em></p></blockquote><h4>&#128214;&#9478;<a href="https://medium.com/vvus/data-council-2024-the-future-data-stack-is-composable-and-other-hot-takes-b6c5f2429e22">Data Council 2024: The future data stack is composable, and other hot takes</a></h4><p>&#9997; <a href="https://medium.com/@chsrbrts?source=post_page-----b6c5f2429e22--------------------------------">Chase Roberts</a></p><blockquote><p><em>I wrote the synopsis below for the Vertex Ventures investment team as a debrief to last week&#8217;s Data Council conference. The summary wasn&#8217;t originally intended to be published as a blog post, but one of my partners suggested I post it publicly. So, here it is &#8212; entirely unedited.</em></p></blockquote><h4>&#128214;&#9478;<a href="https://medium.com/@digitake/my-fun-journey-of-managing-a-large-table-of-postgresql-b8d09cb19444">My fun journey of managing a large table of PostgreSQL</a></h4><p>&#9997; <a href="https://medium.com/@digitake?source=post_page-----b8d09cb19444--------------------------------">digitake</a></p><blockquote><p><em>The consolidator, at first, runs reasonably fast for its size. It takes about a minute to perform an SQL SELECT/GROUP BY/ORDER BY function. One might be surprised that one minute is relatively fast compared to the number of rows in the table. The reason for that is, I do, use an index to keep track of data. 
I will take you with me through my journey to the data wonderland.</em></p></blockquote><h4>&#128214;&#9478;<a href="https://hivekit.io/blog/how-weve-saved-5000-percent-in-cloud-costs-by-writing-our-own-database/">How we&#8217;ve saved 98% in cloud costs by writing our own database</a></h4><p>&#9997; <a href="https://hivekit.io/blog/">Hivekit Blog</a></p><blockquote><p><em>What is the first rule of programming? Maybe something like &#8220;do not repeat yourself&#8221; or &#8220;if it works, don&#8217;t touch it&#8221;? Or, how about &#8220;do not write your own database!&#8221;&#8230; That&#8217;s a good one.</em></p></blockquote><div><hr></div><h1>&#9999; Data</h1><blockquote><p><em>The one thing that this job has taught me is that truth is stranger than fiction. &#8212; Predestination (2014)</em></p></blockquote><h4>&#128214;&#9478;<a href="https://piethein.medium.com/scalable-data-management-with-microsoft-fabric-and-microsoft-purview-7e54456559d9">Scalable Data Management with Microsoft Fabric and Microsoft Purview</a></h4><p>&#9997; <a href="https://piethein.medium.com/?source=post_page-----7e54456559d9--------------------------------">Piethein Strengholt</a></p><blockquote><p><em>In this blog post, I aim to demonstrate a practical data transformation using Microsoft Fabric and Microsoft Purview. The target audience includes executives, architects, analysts, and compliance and governance personnel who are interested in creating a comprehensive data platform.</em></p></blockquote><h4>&#128214;&#9478;<a href="https://fromanengineersight.substack.com/p/microsoft-excel-in-the-era-of-big">Microsoft Excel in the Era of Big Data</a></h4><p>&#9997; <a href="https://substack.com/profile/23621089-benoit-pimpaud">Benoit Pimpaud</a></p><blockquote><p><em>What if we learn about how to build efficient and consistent spreadsheets? 
More than just getting closer to this engineering stuff, there are a lot of benefits for our daily tasks to strengthen these files with more efficiency, consistency, and reproducibility designs.</em></p></blockquote><h4>&#128214;&#9478;<a href="https://roundup.getdbt.com/p/how-is-the-state-of-analytics-engineering">How is the state of analytics engineering?</a></h4><p>&#9997; <a href="https://substack.com/profile/152560204-dan-poppy">Dan Poppy</a></p><blockquote><p><em>The 2024 State of Analytics Engineering is in. Let's get into it.</em></p></blockquote><div><hr></div><h1>&#129302; AI&#9478;ML&#9478;Data Science</h1><blockquote><p><em>You know, Burke, I don&#8217;t know which species is worse. &#8212; Ripley, Aliens (1986)</em></p></blockquote><h4>&#128214;&#9478;<a href="https://medium.com/airbnb-engineering/chronon-airbnbs-ml-feature-platform-is-now-open-source-d9c4dba859e8">Chronon, Airbnb&#8217;s ML Feature Platform, Is Now Open Source</a></h4><p>&#9997; <a href="https://medium.com/@vzanoyan?source=post_page-----d9c4dba859e8--------------------------------">Varant Zanoyan</a></p><blockquote><p><em>A feature platform that offers observability and management tools, allows ML practitioners to use a variety of data sources, while handling the complexity of data engineering, and provides low latency streaming.</em></p></blockquote><h4>&#128214;&#9478;<a href="https://engineering.backmarket.com/how-machine-learning-boosted-available-cash-flows-for-our-sellers-eddd697269fc">How machine learning boosted available cash flows for our sellers</a></h4><p>&#9997; <a href="https://medium.com/@pierre-pessarossi">Pierre Pessarossi</a></p><blockquote><p><em>In this article, we will explain how we created business value by coupling machine learning with a simple sampling technique that bypassed the need to adapt our technical stack.</em></p></blockquote><h4>&#128214;&#9478;<a href="https://github.blog/2024-04-09-4-ways-github-engineers-use-github-copilot/">4 ways GitHub 
engineers use GitHub Copilot</a></h4><p>&#9997; <a href="https://github.blog/author/hstaudacher/">Holger Staudacher</a></p><blockquote><p><em>GitHub Copilot increases efficiency for our engineers by allowing us to automate repetitive tasks, stay focused, and more.</em></p></blockquote><div><hr></div><h1>&#128293; Catch up</h1><blockquote><p><em>&#8230;Next Saturday night, we're sending you back to the future! &#8212; Dr. Emmett Brown, Back to the Future (1985)</em></p></blockquote><p>&#128214;&#9478;<a href="https://cloud.google.com/blog/products/data-analytics/biglake-now-offers-native-support-for-delta-lake">Announcing Delta Lake support for BigQuery</a></p><p>&#128214;&#9478;<a href="https://cloud.google.com/products/apache-kafka-for-bigquery?hl=en">Apache Kafka for BigQuery</a></p><p>&#128214;&#9478;Gemini in <a href="https://cloud.google.com/blog/products/data-analytics/introducing-gemini-in-bigquery-at-next24/">BigQuery</a> and <a href="https://cloud.google.com/blog/products/data-analytics/introducing-gemini-in-looker-at-next24/">Looker</a></p><p>&#128214;&#9478;<a href="https://cloud.google.com/blog/products/data-analytics/get-to-know-bigquery-data-canvas/">BigQuery data canvas</a></p><p>&#128214;&#9478;<a href="https://cloud.withgoogle.com/next/session-library?session=ANA211&amp;utm_source=copylink&amp;utm_medium=unpaidsoc&amp;utm_campaign=FY24-Q2-global-ENDM33-physicalevent-er-next-2024-mc&amp;utm_content=next-homepage-social-share&amp;utm_term=-">BigQuery's continuous queries feature</a>, which&nbsp;provides users the ability to run continuously processing SQL statements that can analyze and transform data as new events arrive in BigQuery. 
This feature is currently available only in <a href="https://docs.google.com/forms/d/e/1FAIpQLSfeinewVSSFm9pop7O2-Ml6_A7YWbBSUrHI-67Au-SFIFvqEA/viewform">private preview</a>.</p><p>&#128214;&#9478;<a href="https://ai.meta.com/blog/next-generation-meta-training-inference-accelerator-AI-MTIA/">Next-generation Meta Training and Inference Accelerator</a></p><p>&#128214;&#9478;<a href="https://preset.io/blog/apache-superset-4-0-release-notes/">Apache Superset 4.0</a></p><p>&#128214;&#9478;<a href="https://beam.incubator.apache.org/blog/beam-yaml-release/">Introducing Beam YAML: Apache Beam's First No-code SDK</a></p><div><hr></div><h1>&#128160; Previously on Dimension</h1><blockquote><p><em>Dimension is my sub-newsletter where I note down things I learn from people smarter than me in the data engineering field.</em></p></blockquote><p>Here are the 3 latest articles:</p><h3><em><strong>Published on 2024, March 30:</strong></em></h3><div class="digest-post-embed" data-attrs="{&quot;nodeId&quot;:&quot;e911f8b4-c7b7-41f2-be0c-0edbdf608a9c&quot;,&quot;caption&quot;:&quot;My name is Vu Trinh, and I am a data engineer. I&#8217;m trying to make my life less dull by spending time learning and researching &#8220;how it works&#8220; in data engineering. Here is a place where I share everything I&#8217;ve learned. Not subscribe yet? 
Here you go:&quot;,&quot;cta&quot;:null,&quot;showBylines&quot;:true,&quot;size&quot;:&quot;sm&quot;,&quot;isEditorNode&quot;:true,&quot;title&quot;:&quot;A glimpse of Apache Pinot, the real-time OLAP system from LinkedIn&quot;,&quot;publishedBylines&quot;:[],&quot;post_date&quot;:&quot;2024-03-30T11:01:01.688Z&quot;,&quot;cover_image&quot;:&quot;https://substackcdn.com/image/fetch/f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0ea2eaa4-2af6-4f79-9d04-07753dacc209_1396x994.png&quot;,&quot;cover_image_alt&quot;:null,&quot;canonical_url&quot;:&quot;https://vutr.substack.com/p/a-glimpse-of-apache-pinot-the-real&quot;,&quot;section_name&quot;:&quot;Dimensions.&quot;,&quot;video_upload_id&quot;:null,&quot;id&quot;:142533909,&quot;type&quot;:&quot;newsletter&quot;,&quot;reaction_count&quot;:4,&quot;comment_count&quot;:0,&quot;publication_id&quot;:null,&quot;publication_name&quot;:&quot;VuTrinh.&quot;,&quot;publication_logo_url&quot;:&quot;https://substackcdn.com/image/fetch/f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffb91e59e-4df2-4ac8-b3b4-1f4e830f7b42_256x256.png&quot;,&quot;belowTheFold&quot;:true,&quot;youtube_url&quot;:null,&quot;show_links&quot;:null,&quot;feed_url&quot;:null}"></div><h3><em><strong>Published on 2024, April 6:</strong></em></h3><div class="digest-post-embed" data-attrs="{&quot;nodeId&quot;:&quot;8edb4f18-fe31-4b04-8cf2-b2c86ce92723&quot;,&quot;caption&quot;:&quot;My name is Vu Trinh, and I am a data engineer. I&#8217;m trying to make my life less dull by spending time learning and researching &#8220;how it works&#8220; in the data engineering field. Here is a place where I share everything I&#8217;ve learned. Not subscribe yet? 
Here you go:&quot;,&quot;cta&quot;:null,&quot;showBylines&quot;:true,&quot;size&quot;:&quot;sm&quot;,&quot;isEditorNode&quot;:true,&quot;title&quot;:&quot;Why did Databricks build the Photon engine?&quot;,&quot;publishedBylines&quot;:[{&quot;id&quot;:167177248,&quot;name&quot;:&quot;Vu Trinh&quot;,&quot;bio&quot;:&quot;This is me&quot;,&quot;photo_url&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/4805f673-db97-4f7c-85c4-44b345a8de80_256x256.png&quot;,&quot;is_guest&quot;:false,&quot;bestseller_tier&quot;:null}],&quot;post_date&quot;:&quot;2024-04-06T11:01:07.773Z&quot;,&quot;cover_image&quot;:&quot;https://substackcdn.com/image/fetch/f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6a87d420-15c8-4ecf-9954-70a96c15561f_1399x999.png&quot;,&quot;cover_image_alt&quot;:null,&quot;canonical_url&quot;:&quot;https://vutr.substack.com/p/why-did-databricks-build-the-photon&quot;,&quot;section_name&quot;:&quot;Dimensions.&quot;,&quot;video_upload_id&quot;:null,&quot;id&quot;:142788667,&quot;type&quot;:&quot;newsletter&quot;,&quot;reaction_count&quot;:2,&quot;comment_count&quot;:0,&quot;publication_id&quot;:null,&quot;publication_name&quot;:&quot;VuTrinh.&quot;,&quot;publication_logo_url&quot;:&quot;https://substackcdn.com/image/fetch/f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffb91e59e-4df2-4ac8-b3b4-1f4e830f7b42_256x256.png&quot;,&quot;belowTheFold&quot;:true,&quot;youtube_url&quot;:null,&quot;show_links&quot;:null,&quot;feed_url&quot;:null}"></div><h3><em><strong>Published on 2024, April 13:</strong></em></h3><div class="digest-post-embed" data-attrs="{&quot;nodeId&quot;:&quot;eff802da-e602-4ad5-9288-214194ae7146&quot;,&quot;caption&quot;:&quot;My name is Vu Trinh, and I am a data engineer. 
I&#8217;m trying to make my life less dull by spending time learning and researching &#8220;how it works&#8220; in the data engineering field. Here is a place where I share everything I&#8217;ve learned. Not subscribe yet? Here you go:&quot;,&quot;cta&quot;:null,&quot;showBylines&quot;:true,&quot;size&quot;:&quot;sm&quot;,&quot;isEditorNode&quot;:true,&quot;title&quot;:&quot;A Closer Look Into Databricks's Photon Engine&quot;,&quot;publishedBylines&quot;:[{&quot;id&quot;:167177248,&quot;name&quot;:&quot;Vu Trinh&quot;,&quot;bio&quot;:&quot;This is me&quot;,&quot;photo_url&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/4805f673-db97-4f7c-85c4-44b345a8de80_256x256.png&quot;,&quot;is_guest&quot;:false,&quot;bestseller_tier&quot;:null}],&quot;post_date&quot;:&quot;2024-04-13T11:01:08.905Z&quot;,&quot;cover_image&quot;:&quot;https://substackcdn.com/image/fetch/f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5b0c8312-3ac2-4fa7-9d0d-d893be1b3031_1402x1002.png&quot;,&quot;cover_image_alt&quot;:null,&quot;canonical_url&quot;:&quot;https://vutr.substack.com/p/a-closer-look-into-databrickss-photon&quot;,&quot;section_name&quot;:&quot;Dimensions.&quot;,&quot;video_upload_id&quot;:null,&quot;id&quot;:143028483,&quot;type&quot;:&quot;newsletter&quot;,&quot;reaction_count&quot;:0,&quot;comment_count&quot;:0,&quot;publication_id&quot;:null,&quot;publication_name&quot;:&quot;VuTrinh.&quot;,&quot;publication_logo_url&quot;:&quot;https://substackcdn.com/image/fetch/f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffb91e59e-4df2-4ac8-b3b4-1f4e830f7b42_256x256.png&quot;,&quot;belowTheFold&quot;:true,&quot;youtube_url&quot;:null,&quot;show_links&quot;:null,&quot;feed_url&quot;:null}"></div><div class="pullquote"><p>Let me hear your voice, for example: </p><p>'Your newsletter is so terrible, I can't handle it anymore.'</p><p 
class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://vutr.substack.com/p/groupby-31-migrating-a-trillion-entries/comments&quot;,&quot;text&quot;:&quot;Leave a comment&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://vutr.substack.com/p/groupby-31-migrating-a-trillion-entries/comments"><span>Leave a comment</span></a></p><div><hr></div><h2>&#8220;Hasta la vista, baby&#8221; -T800, Terminator 2: Judgment Day (1991)</h2></div><div class="subscription-widget-wrap-editor" data-attrs="{&quot;url&quot;:&quot;https://vutr.substack.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe&quot;,&quot;language&quot;:&quot;en&quot;}" data-component-name="SubscribeWidgetToDOM"><div class="subscription-widget show-subscribe"><div class="preamble"><p class="cta-caption">Thanks for scrolling this far! There's a convenient subscribe box here if you want me to annoy you every week. &#128516;</p></div><form class="subscription-widget-subscribe"><input type="email" class="email-input" name="email" placeholder="Type your email&#8230;" tabindex="-1"><input type="submit" class="button primary" value="Subscribe"><div class="fake-input-wrapper"><div class="fake-input"></div><div class="fake-button"></div></div></form></div></div><p></p>]]></content:encoded></item><item><title><![CDATA[GroupBy #30: Uber- How LedgerStore Supports Trillions of Indexes, Composable Data Systems: Lessons from Apache Calcite Success]]></title><description><![CDATA[Plus: Spotify - Data Platform Explained, Grab - Turning observations into actionable insights for enhanced decision-making.]]></description><link>https://vutr.substack.com/p/groupby-30-uber-how-ledgerstore-supports</link><guid isPermaLink="false">https://vutr.substack.com/p/groupby-30-uber-how-ledgerstore-supports</guid><dc:creator><![CDATA[Vu Trinh]]></dc:creator><pubDate>Tue, 09 Apr 2024 11:01:39 GMT</pubDate><enclosure 
url="https://substackcdn.com/image/fetch/$s_!6DSX!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F468d621e-2905-47d4-8a20-44f90854e82a_1400x1000.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p><em>This is <strong>GroupBy</strong>, the weekly compiled resources for data engineers.</em></p><p><em>Not subscribed yet? Here you go:</em></p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://vutr.substack.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe now&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://vutr.substack.com/subscribe?"><span>Subscribe now</span></a></p><blockquote><p><em>&#128075; Hi, my name is Vu Trinh, a data engineer.</em></p><p><em>I enjoy reading <strong>good stuff</strong>  (related to data and engineering), and this newsletter is my effort on the journey to seek the "good stuff" across the entire Internet. 
</em></p><p><em>Hope this issue finds you well.</em></p></blockquote><div><hr></div><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!6DSX!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F468d621e-2905-47d4-8a20-44f90854e82a_1400x1000.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!6DSX!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F468d621e-2905-47d4-8a20-44f90854e82a_1400x1000.png 424w, https://substackcdn.com/image/fetch/$s_!6DSX!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F468d621e-2905-47d4-8a20-44f90854e82a_1400x1000.png 848w, https://substackcdn.com/image/fetch/$s_!6DSX!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F468d621e-2905-47d4-8a20-44f90854e82a_1400x1000.png 1272w, https://substackcdn.com/image/fetch/$s_!6DSX!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F468d621e-2905-47d4-8a20-44f90854e82a_1400x1000.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!6DSX!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F468d621e-2905-47d4-8a20-44f90854e82a_1400x1000.png" width="1400" height="1000" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/468d621e-2905-47d4-8a20-44f90854e82a_1400x1000.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1000,&quot;width&quot;:1400,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:2062117,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!6DSX!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F468d621e-2905-47d4-8a20-44f90854e82a_1400x1000.png 424w, https://substackcdn.com/image/fetch/$s_!6DSX!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F468d621e-2905-47d4-8a20-44f90854e82a_1400x1000.png 848w, https://substackcdn.com/image/fetch/$s_!6DSX!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F468d621e-2905-47d4-8a20-44f90854e82a_1400x1000.png 1272w, https://substackcdn.com/image/fetch/$s_!6DSX!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F468d621e-2905-47d4-8a20-44f90854e82a_1400x1000.png 1456w" sizes="100vw" fetchpriority="high"></picture></div></a><figcaption class="image-caption">Image created by Canva Image Generator.</figcaption></figure></div><div><hr></div><h1>&#128640; Engineering</h1><blockquote><p><em>I have to believe in a world outside my own mind. 
&#8212; Memento (2000)</em></p></blockquote><h4>&#128214;&#9478;<a href="https://www.uber.com/en-SG/blog/how-ledgerstore-supports-trillions-of-indexes/">How LedgerStore Supports Trillions of Indexes at Uber</a></h4><p>&#9997; <a href="https://www.uber.com/blog/asia/">Uber Engineering Blog</a></p><blockquote><p><em>This blog covers the significance of LedgerStore indexing and its architecture, which powers trillions of indexes, with a petabyte-scale index storage footprint.</em></p></blockquote><h4>&#128214;&#9478;<a href="https://www.querifylabs.com/blog/composable-data-systems-lessons-from-apache-calcite-success">Composable Data Systems: Lessons from Apache Calcite Success</a></h4><p>&#9997; <a href="https://www.linkedin.com/in/devozerov/">Vladimir Ozerov</a></p><blockquote><p><em>In this blog post, I would like to share our experience with Apache Calcite &#8212; a powerful composable toolset for building query optimizers. Apache Calcite achieved tremendous success, powering query optimization in many popular systems, such as Apache Hive and Apache Flink.</em></p></blockquote><h4>&#128214;&#9478;<a href="https://engineering.atspotify.com/2024/04/data-platform-explained/">Data Platform Explained</a></h4><p>&#9997; <a href="https://www.linkedin.com/in/anastasia-khlebnikova-63827b4/">Anastasia Khlebnikova</a> + <a href="https://www.linkedin.com/in/zamithcunha/">Carol Cunha</a></p><blockquote><p><em>As engineers working at Spotify, we frequently find ourselves explaining our robust data platform to fellow professionals who are contemplating embarking on a similar venture within their organizations. Despite the number of articles, blog posts, and talks one can find online, it can be challenging to digest the information about the building blocks of a data platform, how to start building one, and the tradeoffs to consider for what is good for the business. 
In this blog post series, we&#8217;ll delve into what our data platform entails, its pivotal role at Spotify, and the key factors leading organizations to consider building one.</em></p></blockquote><h4>&#128214;&#9478;<strong><a href="https://netflixtechblog.medium.com/netflixs-media-landscape-evolution-from-3-2-1-to-cloud-storage-optimization-77e9a19171ed">Netflix&#8217;s Media Landscape Evolution: From 3&#8211;2&#8211;1 to Cloud Storage Optimization</a></strong></h4><p>&#9997; <a href="https://netflixtechblog.medium.com/?source=post_page-----77e9a19171ed--------------------------------">Netflix Technology Blog</a></p><blockquote><p><em>This blog will explore how harnessing user access patterns helped us optimize storage efficiency and cost-effectiveness smartly. Within this exploration, we delve into a cost analysis of lifecycle policies, explicitly examining the cost-effectiveness of various archival and purge strategies tailored to different AWS storage layers.</em></p></blockquote><h4>&#128214;&#9478;<a href="https://posit-dev.github.io/great-tables/blog/design-philosophy/">The Design Philosophy of Great Tables</a></h4><p>&#9997; <a href="https://github.com/rich-iannone">Rich Iannone</a> + <a href="https://github.com/machow">Michael Chow</a></p><blockquote><p><em>Through the exploration of the qualities that make tables shine, the backstory of tables as a display of data, and the issues faced today, it&#8217;s clear how we can solve the great table dilemma with Great Tables.</em></p></blockquote><h4>&#128214;&#9478;<a href="https://clickhouse.com/blog/building-a-logging-platform-with-clickhouse-and-saving-millions-over-datadog">How we Built a 19 PiB Logging Platform with ClickHouse and Saved Millions</a></h4><p>&#9997; <a href="https://clickhouse.com/blog/">Clickhouse Blog</a></p><blockquote><p><em>In the interest of others benefiting from our journey, we provide the details of our own ClickHouse-powered logging solution that contains over 19 PiB 
uncompressed, or 37 trillion rows, for our AWS regions alone. As a general design philosophy, we aspired to minimize the number of moving parts and ensure the design was as simple and reproducible as possible.</em></p></blockquote><h4>&#128214;&#9478;<a href="https://hubertdulay.substack.com/p/small-files-issue-where-streams-and?r=46sqk&amp;utm_campaign=post&amp;utm_medium=web&amp;triedRedirect=true">Small Files Issue: Where Streams and Tables Meet</a></h4><p>&#9997; <a href="https://substack.com/profile/7035644-hubert-dulay">Hubert Dulay</a></p><blockquote><p><em>The announcement that Confluent will now support seamless materialization of Apache Kafka topics as Iceberg Tables (aka Tableflow) has gotten the streaming world and lakehouse worlds rubbing their chins &#129300; and cleaning their monocles &#129488;.</em></p></blockquote><h4>&#128214;&#9478;<a href="https://medium.com/airbnb-engineering/introducing-trio-part-ii-fe836013a798">Introducing Trio | Part II</a></h4><p>&#9997; <a href="https://medium.com/@konakid?source=post_page-----fe836013a798--------------------------------">Eli Hart</a></p><blockquote><p><em>Part two on how we built a Compose based architecture with Mavericks in the Airbnb Android app</em></p></blockquote><h4>&#128214;&#9478;<strong><a href="https://netflixtechblog.com/reverse-searching-netflixs-federated-graph-222ac5d23576">Reverse Searching Netflix&#8217;s Federated Graph</a></strong></h4><p>&#9997; <a href="https://netflixtechblog.medium.com/">Netflix Technology Blog</a></p><blockquote><p><em>As promised in the previous post, we&#8217;ll share how we partnered with one of our Studio Engineering teams to build reverse search. 
Reverse search inverts the standard querying pattern: rather than finding documents that match a query, it finds queries that match a document.</em></p></blockquote><h4>&#128214;&#9478;<a href="https://engineering.grab.com/iris">Iris - Turning observations into actionable insights for enhanced decision making</a></h4><p>&#9997; <a href="https://engineering.grab.com/authors#huong-vuong">Huong Vuong</a> + <a href="https://engineering.grab.com/authors#hainam-cao">Hai Nam Cao</a></p><blockquote><p><em>Our Iris platform bridges the gap between raw data and meaningful insights, serving the needs of data-driven organisations. Specialising in meticulous monitoring and tracking of Spark and Presto jobs, Iris stands as a transformative tool for peak observability and effective decision-making.</em></p></blockquote><h4>&#128214;&#9478;<a href="https://buttondown.email/jaffray/archive/a-sniff-test-for-some-query-optimizers/">A Sniff Test for Some Query Optimizers</a></h4><p>&#9997; <a href="https://buttondown.email/jaffray">Justin Jaffray</a></p><blockquote><p><em>One important part of query planning is performing transformations over queries. Today I want to see how a couple common databases perform on a completely made-up and unrepresentative benchmark.</em></p></blockquote><h4>&#128214;&#9478;<a href="https://blog.jetbrains.com/dataspell/2023/08/polars-vs-pandas-what-s-the-difference/">Polars vs. pandas: What&#8217;s the Difference?</a></h4><p>&#9997; <a href="https://blog.jetbrains.com/author/jodie-burchell-jetbrains-com">Jodie Burchell</a></p><blockquote><p><em>If you&#8217;ve been keeping up with the advances in Python dataframes in the past year, you couldn&#8217;t help hearing about Polars, the powerful dataframe library designed for working with large datasets.</em></p></blockquote><div><hr></div><h1>&#9999; Data</h1><blockquote><p><em>The one thing that this job has taught me is that truth is stranger than fiction. 
&#8212; Predestination (2014)</em></p></blockquote><h4>&#128214;&#9478;<a href="https://sqlpatterns.com/p/synchronizing-organizational-decisions">Synchronizing Organizational Decisions</a></h4><p>&#9997; <a href="https://substack.com/@ergestx">Ergest Xheblati</a></p><blockquote><p><em>How metrics trees foster harmonized decisions and drive sustained impact.</em></p></blockquote><h4>&#128214;&#9478;<a href="https://www.thdpth.com/p/dissecting-what-makes-a-data-strategy">Dissecting What Makes a Data Strategy Fail: Fluff, Challenges, and Objectives</a></h4><p>&#9997; <a href="https://substack.com/profile/229923-sven-balnojan-phd">Sven Balnojan PhD</a></p><blockquote><p><em>Unfortunately, bad data strategies are everywhere. Sometimes, I find it hard to find examples of good ones! Most data startups start out with a bad strategy. Almost every single internal data initiative I read is bad.</em></p></blockquote><h4>&#128214;&#9478;<a href="https://fromanengineersight.substack.com/p/the-data-analyst-every-ceo-wants">The Data Analyst Every CEO Wants</a></h4><p>&#9997; <a href="https://substack.com/profile/23621089-benoit-pimpaud">Benoit Pimpaud</a></p><blockquote><p><em>The following lines list the top 5 things best to learn and focus on for being the analyst that every CEO should have by his side.</em></p></blockquote><div><hr></div><h1>&#129302; AI&#9478;ML&#9478;Data Science</h1><blockquote><p><em>You know, Burke, I don&#8217;t know which species is worse. &#8212; Ripley, Aliens (1986)</em></p></blockquote><h4>&#128214;&#9478;<a href="https://garymarcus.substack.com/p/sneak-preview-of-gpt-5">Sneak preview of GPT-5!</a></h4><p>&#9997; <a href="https://substack.com/profile/14807526-gary-marcus">Gary Marcus</a></p><blockquote><p><em>Holy shit! 
OpenAI just gave me sneak preview early access to GPT-5 (to do some red-teaming) &#8212; and it&#8217;s incredible!</em></p></blockquote><h4>&#128214;&#9478;<a href="https://garymarcus.substack.com/p/when-will-the-genai-bubble-burst">When Will the GenAI Bubble Burst?</a></h4><p>&#9997; <a href="https://substack.com/profile/14807526-gary-marcus">Gary Marcus</a></p><blockquote><p><em>Why and how it could happen in the next 12 months.</em></p></blockquote><h4>&#128214;&#9478;<a href="https://github.blog/2024-04-04-what-is-retrieval-augmented-generation-and-what-does-it-do-for-generative-ai/">What is retrieval-augmented generation, and what does it do for generative AI?</a></h4><p>&#9997; <a href="https://github.blog/author/nicchoi29/">Nicole Choi</a></p><blockquote><p><em>Here&#8217;s how retrieval-augmented generation, or RAG, uses a variety of data sources to keep AI models fresh with up-to-date information and organizational knowledge.</em></p></blockquote><h4>&#128214;&#9478;<a href="https://medium.com/pinterest-engineering/how-we-built-text-to-sql-at-pinterest-30bad30dabff">How we built Text-to-SQL at Pinterest</a></h4><p>&#9997; <a href="https://medium.com/@Pinterest_Engineering?source=post_page-----30bad30dabff--------------------------------">Pinterest Engineering</a></p><blockquote><p><em>We took the rise in availability of Large Language Models (LLMs) as an opportunity to explore whether we could assist our data users with this task by developing a Text-to-SQL feature which transforms these analytical questions directly into code.</em></p></blockquote><h4>&#128214;&#9478;<a href="https://koopingshung.substack.com/p/algorithm-war-coming-part-2">Algorithm War Coming! (Part 2)</a></h4><p>&#9997; <a href="https://substack.com/profile/7906875-koo-ping-shung">Koo Ping Shung</a></p><blockquote><p><em>The algorithm war will be never-ending. GenAI models will keep on being trained, with each GenAI model successful in certain dimensions. 
Model that are currently doing better will only be replaced by a later model that is doing better on all coinciding dimensions.</em></p></blockquote><div><hr></div><h1>&#128293; Catch up</h1><blockquote><p><em>&#8230;Next Saturday night, we're sending you back to the future! &#8212; Dr. Emmett Brown, Back to the Future (1985)</em></p></blockquote><p>&#128214;&#9478;BigQuery | <a href="https://cloud.google.com/bigquery/docs/differential-privacy">Differential privacy</a> is now <a href="https://cloud.google.com/products/#product-launch-stages">generally available (GA)</a>.</p><p>&#128214;&#9478;BigQuery | Users can now use <a href="https://cloud.google.com/bigquery/docs/create-delta-lake-table">BigLake to access Delta Lake tables</a>. This feature is available in <a href="https://cloud.google.com/products/#product-launch-stages">preview</a>.</p><p>&#128214;&#9478;BigQuery | Users can now perform <a href="https://cloud.google.com/bigquery/docs/model-monitoring-overview">model monitoring</a> in BigQuery ML.</p><p>&#128214;&#9478;<a href="https://cloud.google.com/bigquery/docs/data-clean-rooms">BigQuery data clean rooms</a> with analysis rules and enhanced usage metrics are now <a href="https://cloud.google.com/products/#product-launch-stages">generally available (GA)</a>.</p><div><hr></div><h1>&#128160; Previously on Dimension</h1><blockquote><p><em>Dimension is my sub-newsletter where I note down things I learn from people smarter than me in the data engineering field.</em></p></blockquote><p>Here are the 3 latest articles:</p><h3><em><strong>Published on 2024, March 23:</strong></em></h3><div class="digest-post-embed" data-attrs="{&quot;nodeId&quot;:&quot;5adbac5b-93b1-4cb8-9e62-1e77d00b0805&quot;,&quot;caption&quot;:&quot;My name is Vu Trinh, and I am a data engineer. I&#8217;m trying to make my life less dull by spending time learning and researching &#8220;how it works&#8220; in the data engineering field. Here is a place where I share everything I&#8217;ve learned. 
Not subscribe yet? Here you go:&quot;,&quot;cta&quot;:null,&quot;showBylines&quot;:true,&quot;size&quot;:&quot;sm&quot;,&quot;isEditorNode&quot;:true,&quot;title&quot;:&quot;How does Uber build real-time infrastructure to handle petabytes of data every day?&quot;,&quot;publishedBylines&quot;:[{&quot;id&quot;:167177248,&quot;name&quot;:&quot;Vu Trinh&quot;,&quot;bio&quot;:&quot;This is me&quot;,&quot;photo_url&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/4805f673-db97-4f7c-85c4-44b345a8de80_256x256.png&quot;,&quot;is_guest&quot;:false,&quot;bestseller_tier&quot;:null}],&quot;post_date&quot;:&quot;2024-03-23T11:01:01.215Z&quot;,&quot;cover_image&quot;:&quot;https://substackcdn.com/image/fetch/f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffdbce475-f7ba-44f3-91f8-07b53b1c996d_1398x999.png&quot;,&quot;cover_image_alt&quot;:null,&quot;canonical_url&quot;:&quot;https://vutr.substack.com/p/i-spent-7-hours-understanding-ubers&quot;,&quot;section_name&quot;:&quot;Dimensions.&quot;,&quot;video_upload_id&quot;:null,&quot;id&quot;:142351678,&quot;type&quot;:&quot;newsletter&quot;,&quot;reaction_count&quot;:0,&quot;comment_count&quot;:0,&quot;publication_id&quot;:null,&quot;publication_name&quot;:&quot;VuTrinh.&quot;,&quot;publication_logo_url&quot;:&quot;https://substackcdn.com/image/fetch/f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffb91e59e-4df2-4ac8-b3b4-1f4e830f7b42_256x256.png&quot;,&quot;belowTheFold&quot;:true,&quot;youtube_url&quot;:null,&quot;show_links&quot;:null,&quot;feed_url&quot;:null}"></div><h3><em><strong>Published on 2024, March 30:</strong></em></h3><div class="digest-post-embed" data-attrs="{&quot;nodeId&quot;:&quot;e911f8b4-c7b7-41f2-be0c-0edbdf608a9c&quot;,&quot;caption&quot;:&quot;My name is Vu Trinh, and I am a data engineer. 
I&#8217;m trying to make my life less dull by spending time learning and researching &#8220;how it works&#8220; in data engineering. Here is a place where I share everything I&#8217;ve learned. Not subscribe yet? Here you go:&quot;,&quot;cta&quot;:null,&quot;showBylines&quot;:true,&quot;size&quot;:&quot;sm&quot;,&quot;isEditorNode&quot;:true,&quot;title&quot;:&quot;A glimpse of Apache Pinot, the real-time OLAP system from LinkedIn&quot;,&quot;publishedBylines&quot;:[],&quot;post_date&quot;:&quot;2024-03-30T11:01:01.688Z&quot;,&quot;cover_image&quot;:&quot;https://substackcdn.com/image/fetch/f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0ea2eaa4-2af6-4f79-9d04-07753dacc209_1396x994.png&quot;,&quot;cover_image_alt&quot;:null,&quot;canonical_url&quot;:&quot;https://vutr.substack.com/p/a-glimpse-of-apache-pinot-the-real&quot;,&quot;section_name&quot;:&quot;Dimensions.&quot;,&quot;video_upload_id&quot;:null,&quot;id&quot;:142533909,&quot;type&quot;:&quot;newsletter&quot;,&quot;reaction_count&quot;:4,&quot;comment_count&quot;:0,&quot;publication_id&quot;:null,&quot;publication_name&quot;:&quot;VuTrinh.&quot;,&quot;publication_logo_url&quot;:&quot;https://substackcdn.com/image/fetch/f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffb91e59e-4df2-4ac8-b3b4-1f4e830f7b42_256x256.png&quot;,&quot;belowTheFold&quot;:true,&quot;youtube_url&quot;:null,&quot;show_links&quot;:null,&quot;feed_url&quot;:null}"></div><h3><em><strong>Published on 2024, April 6:</strong></em></h3><div class="digest-post-embed" data-attrs="{&quot;nodeId&quot;:&quot;8edb4f18-fe31-4b04-8cf2-b2c86ce92723&quot;,&quot;caption&quot;:&quot;My name is Vu Trinh, and I am a data engineer. I&#8217;m trying to make my life less dull by spending time learning and researching &#8220;how it works&#8220; in the data engineering field. Here is a place where I share everything I&#8217;ve learned. 
Not subscribe yet? Here you go:&quot;,&quot;cta&quot;:null,&quot;showBylines&quot;:true,&quot;size&quot;:&quot;sm&quot;,&quot;isEditorNode&quot;:true,&quot;title&quot;:&quot;Why did Databricks build the Photon engine?&quot;,&quot;publishedBylines&quot;:[{&quot;id&quot;:167177248,&quot;name&quot;:&quot;Vu Trinh&quot;,&quot;bio&quot;:&quot;This is me&quot;,&quot;photo_url&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/4805f673-db97-4f7c-85c4-44b345a8de80_256x256.png&quot;,&quot;is_guest&quot;:false,&quot;bestseller_tier&quot;:null}],&quot;post_date&quot;:&quot;2024-04-06T11:01:07.773Z&quot;,&quot;cover_image&quot;:&quot;https://substackcdn.com/image/fetch/f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6a87d420-15c8-4ecf-9954-70a96c15561f_1399x999.png&quot;,&quot;cover_image_alt&quot;:null,&quot;canonical_url&quot;:&quot;https://vutr.substack.com/p/why-did-databricks-build-the-photon&quot;,&quot;section_name&quot;:&quot;Dimensions.&quot;,&quot;video_upload_id&quot;:null,&quot;id&quot;:142788667,&quot;type&quot;:&quot;newsletter&quot;,&quot;reaction_count&quot;:2,&quot;comment_count&quot;:0,&quot;publication_id&quot;:null,&quot;publication_name&quot;:&quot;VuTrinh.&quot;,&quot;publication_logo_url&quot;:&quot;https://substackcdn.com/image/fetch/f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffb91e59e-4df2-4ac8-b3b4-1f4e830f7b42_256x256.png&quot;,&quot;belowTheFold&quot;:true,&quot;youtube_url&quot;:null,&quot;show_links&quot;:null,&quot;feed_url&quot;:null}"></div><div class="pullquote"><p>Let me hear your voice, for example: </p><p>'Your newsletter is so terrible, I can't handle it anymore.'</p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://vutr.substack.com/p/groupby-30-uber-how-ledgerstore-supports/comments&quot;,&quot;text&quot;:&quot;Leave a 
comment&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://vutr.substack.com/p/groupby-30-uber-how-ledgerstore-supports/comments"><span>Leave a comment</span></a></p><div><hr></div><h2>&#8220;Hasta la vista, baby&#8221; -T800, Terminator 2: Judgment Day (1991)</h2></div><div class="subscription-widget-wrap-editor" data-attrs="{&quot;url&quot;:&quot;https://vutr.substack.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe&quot;,&quot;language&quot;:&quot;en&quot;}" data-component-name="SubscribeWidgetToDOM"><div class="subscription-widget show-subscribe"><div class="preamble"><p class="cta-caption">Thanks for scrolling this far! There's a convenient subscribe box here if you want me to annoy you every week. &#128516;</p></div><form class="subscription-widget-subscribe"><input type="email" class="email-input" name="email" placeholder="Type your email&#8230;" tabindex="-1"><input type="submit" class="button primary" value="Subscribe"><div class="fake-input-wrapper"><div class="fake-input"></div><div class="fake-button"></div></div></form></div></div><p></p>]]></content:encoded></item><item><title><![CDATA[GroupBy #29: Scaling AI/ML Infrastructure at Uber, The Sisyphean struggle and the new era of data infrastructure]]></title><description><![CDATA[Plus: Netflix- The Imperative of Effective Data Management, The Data Streaming Landscape 2024]]></description><link>https://vutr.substack.com/p/groupby-29-scaling-aiml-infrastructure</link><guid isPermaLink="false">https://vutr.substack.com/p/groupby-29-scaling-aiml-infrastructure</guid><dc:creator><![CDATA[Vu Trinh]]></dc:creator><pubDate>Tue, 02 Apr 2024 11:01:11 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!UdwB!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fff78c1a6-1a98-4cfe-b362-c8bef2d58627_1400x1000.png" length="0" 
type="image/jpeg"/><content:encoded><![CDATA[<p><em>This is <strong>GroupBy</strong>, the weekly compiled resources for data engineers.</em></p><p><em>Not subscribed yet? Here you go:</em></p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://vutr.substack.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe now&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://vutr.substack.com/subscribe?"><span>Subscribe now</span></a></p><div class="pullquote"><div class="install-substack-app-embed install-substack-app-embed-web" data-component-name="InstallSubstackAppToDOM"><img class="install-substack-app-embed-img" src="https://substackcdn.com/image/fetch/$s_!D8N-!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fvutr.substack.com%2Fimg%2Fsubstack.png"><div class="install-substack-app-embed-text"><div class="install-substack-app-header">Get more from Vu Trinh in the Substack app</div><div class="install-substack-app-text">Available for iOS and Android</div></div><a href="https://substack.com/app/app-store-redirect?utm_campaign=app-marketing&amp;utm_content=author-post-insert&amp;utm_source=vutr" target="_blank" class="install-substack-app-embed-link"><button class="install-substack-app-embed-btn button primary">Get the app</button></a></div></div><blockquote><p><em>&#128075; Hi, my name is Vu Trinh, a data engineer.</em></p><p><em>I enjoy reading <strong>good stuff</strong>  (related to data and engineering), and this newsletter is my effort on the journey to seek the "good stuff" across the entire Internet. 
</em></p><p><em>Hope this issue finds you well.</em></p></blockquote><div><hr></div><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!UdwB!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fff78c1a6-1a98-4cfe-b362-c8bef2d58627_1400x1000.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!UdwB!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fff78c1a6-1a98-4cfe-b362-c8bef2d58627_1400x1000.png 424w, https://substackcdn.com/image/fetch/$s_!UdwB!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fff78c1a6-1a98-4cfe-b362-c8bef2d58627_1400x1000.png 848w, https://substackcdn.com/image/fetch/$s_!UdwB!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fff78c1a6-1a98-4cfe-b362-c8bef2d58627_1400x1000.png 1272w, https://substackcdn.com/image/fetch/$s_!UdwB!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fff78c1a6-1a98-4cfe-b362-c8bef2d58627_1400x1000.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!UdwB!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fff78c1a6-1a98-4cfe-b362-c8bef2d58627_1400x1000.png" width="1400" height="1000" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/ff78c1a6-1a98-4cfe-b362-c8bef2d58627_1400x1000.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1000,&quot;width&quot;:1400,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:1749445,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!UdwB!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fff78c1a6-1a98-4cfe-b362-c8bef2d58627_1400x1000.png 424w, https://substackcdn.com/image/fetch/$s_!UdwB!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fff78c1a6-1a98-4cfe-b362-c8bef2d58627_1400x1000.png 848w, https://substackcdn.com/image/fetch/$s_!UdwB!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fff78c1a6-1a98-4cfe-b362-c8bef2d58627_1400x1000.png 1272w, https://substackcdn.com/image/fetch/$s_!UdwB!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fff78c1a6-1a98-4cfe-b362-c8bef2d58627_1400x1000.png 1456w" sizes="100vw" fetchpriority="high"></picture></div></a><figcaption class="image-caption">Image created by the Canva Image Generator.</figcaption></figure></div><div><hr></div><h1>&#128640; Engineering</h1><blockquote><p><em>I have to believe in a world outside my own mind. &#8212; Memento (2000)</em></p></blockquote><h4>&#128214;&#9478;<a href="https://jack-vanlightly.com/blog/2024/3/26/the-sisyphean-struggle-and-the-new-era-of-data-infrastructure">The Sisyphean struggle and the new era of data infrastructure</a></h4><p>&#9997; <a href="https://jack-vanlightly.com/home">Jack Vanlightly</a></p><blockquote><p><em>For a while now, I&#8217;ve been spending a lot of time thinking about technology trends in the data infrastructure space (as a researcher at Confluent). These trends are making some previously difficult things easy and therefore, commodity. I would go as far as to say that we are witnessing a kind of phase change, a regime shift, at least in the cloud. 
Almost inevitably, the quote above brought me back to this topic as it revolves around the subject of commodity and competition.</em></p></blockquote><h4>&#128214;&#9478;<a href="https://materializedview.io/p/ce-nest-pas-un-kafka">Ce n'est pas un Kafka: Kafka is a Protocol</a></h4><p>&#9997; <a href="https://substack.com/profile/69592459-chris-riccomini">Chris Riccomini</a></p><blockquote><p><em>Apache Kafka is an aging open source project. It's time to accept that Kafka's protocol is what matters.</em></p></blockquote><h4>&#128214;&#9478;<a href="https://kai-waehner.medium.com/the-data-streaming-landscape-2024-6e078b1959b5">The Data Streaming Landscape 2024</a></h4><p>&#9997; <a href="https://kai-waehner.medium.com/?source=post_page-----6e078b1959b5--------------------------------">Kai Waehner</a></p><blockquote><p><em>This blog post explores the data streaming landscape of 2024 to summarize existing solutions and market trends. The end of the article gives an outlook to potential new entrants in 2025.</em></p></blockquote><h4>&#128214;&#9478;<a href="https://www.linkedin.com/blog/engineering/analytics/grafli-an-out-of-the-box-azure-monitoring-visualization-platform">GrafLI - An out-of-the-box Azure monitoring and visualization platform</a></h4><p>&#9997; <a href="https://www.linkedin.com/in/prateekkumarsingh">Prateek Singh</a></p><blockquote><p><em>Recognizing these challenges, the Productivity Engineering team at LinkedIn crafted GrafLI, a cloud-native data visualization tool designed to transform the visualization of Azure and on-premises services. 
In this post, we delve into the intricacies of GrafLI and how it enhances the developer experience and increases engineering velocity.</em></p></blockquote><h4>&#128214;&#9478;<a href="https://netflixtechblog.medium.com/navigating-the-netflix-data-deluge-the-imperative-of-effective-data-management-e39af70f81f7">Navigating the Netflix Data Deluge: The Imperative of Effective Data Management</a></h4><p>&#9997; <a href="https://netflixtechblog.medium.com/?source=post_page-----e39af70f81f7--------------------------------">Netflix Technology Blog</a></p><blockquote><p><em>In this article, we, the Media Infrastructure Platform team, outline the development of a Garbage Collector, our solution for effectively managing production data.</em></p></blockquote><h4>&#128214;&#9478;<a href="https://engineering.mixpanel.com/columnar-db-file-reader-v2-a-complete-rewrite-64acdb62a223">Columnar DB File Reader V2: A Complete Rewrite</a></h4><p>&#9997; <a href="https://medium.com/@johnfmikhail">John Mikhail</a></p><blockquote><p><em>One of the main pillars of Mixpanel is our proprietary columnar store database, ARB, which we specifically designed to meet the needs of our customers. In this blog post, we delve into a comprehensive rewrite of the event reader code responsible for parsing the columnar files. The primary objective is to significantly enhance query performance, particularly for those with selective filters.</em></p></blockquote><h4>&#128214;&#9478;<a href="https://lakefs.io/blog/lakefs-repository-git-for-data/">Anatomy of a lakeFS Repository: Practical Example of Git for Data</a></h4><p>&#9997; <a href="https://www.linkedin.com/in/oz-katz-4b3b389">Oz Katz</a></p><blockquote><p><em>To help you understand the value of Git for data, here&#8217;s an overview of the tools that are part of every lakeFS repository. 
They&#8217;re bound to help with your data strategy and ensure that your organization meets its compliance, quality, and safety requirements.</em></p></blockquote><h4>&#128214;&#9478;<a href="https://medium.com/airbnb-engineering/introducing-trio-part-i-7f5017a1a903">Introducing Trio | Part I</a></h4><p>&#9997; <a href="https://medium.com/@konakid?source=post_page-----7f5017a1a903--------------------------------">Eli Hart</a></p><blockquote><p><em>A three-part series on how we built a Compose-based architecture with Mavericks in the Airbnb Android app.</em></p></blockquote><h4>&#128214;&#9478;<a href="https://github.blog/2024-03-25-how-to-use-github-copilot-in-your-ide-tips-tricks-and-best-practices/">Using GitHub Copilot in your IDE: Tips, tricks and best practices</a></h4><p>&#9997; <a href="https://github.blog/author/ladykerr/">Kedasha Kerr</a></p><blockquote><p><em>In this blog post, I&#8217;ll share some of the daily things I do to get the most out of GitHub Copilot. I hope these tips will help you become a more efficient and productive user of the AI assistant.</em></p></blockquote><h4>&#128214;&#9478;<a href="https://medium.com/gooddata-developers/building-a-modern-data-service-layer-with-apache-arrow-33ace768e3f1">Building a Modern Data Service Layer with Apache Arrow</a></h4><p>&#9997; <a href="https://medium.com/@zupabusta?source=post_page-----33ace768e3f1--------------------------------">Jan Soubusta</a></p><blockquote><p><em>Our journey, led by Lubomir (lupko) Slivka, aims to revolutionize GoodData&#8217;s analytics offerings, transforming our traditional BI platform into a robust Analytics Lake. 
This transformation was motivated by the need to modernize our stack, taking full advantage of open-source technologies and modern architectural principles to better integrate with cloud platforms.</em></p></blockquote><h4>&#128214;&#9478;<a href="https://blogs.halodoc.io/data-lake-cost-optimisation-strategies/amp/">Cost Optimization Strategies for scalable Data Lakehouse</a></h4><p>&#9997; <a href="https://www.linkedin.com/in/imsuresh/overlay/about-this-profile/">Suresh Hasundi</a></p><blockquote><p><em>In this blog, we will be discussing about the cost related challenges we had faced when our Data Lakehouse scaled and how we have overcome such costs by optimizing the process.</em></p></blockquote><h4>&#128214;&#9478;<a href="https://codeconfessions.substack.com/p/why-do-python-lists-multiply-oddly">Why Do Python Lists Multiply Oddly? Exploring the CPython Source Code</a></h4><p>&#9997; <a href="https://substack.com/@abhinavupadhyay">Abhinav Upadhyay</a></p><blockquote><p><em>We will start by a high level answer by just doing some inspection in the REPL, then we will go one level deeper and see the details of the list implementation in CPython to see why that happens, and finally we will go another level down to see how CPython invokes this behavior.</em></p></blockquote><h4>&#128214;&#9478;<a href="https://www.databricks.com/blog/pyspark-2023-year-review">PySpark in 2023: A Year in Review</a></h4><p>&#9997; <a href="https://www.databricks.com/blog">Databricks Blog</a></p><blockquote><p><em>With the releases of Apache Spark 3.4 and 3.5 in 2023, we focused heavily on improving PySpark performance, flexibility, and ease of use. 
This blog post walks you through the key improvements.</em></p></blockquote><h4>&#128214;&#9478;<a href="https://doordash.engineering/2024/03/27/setting-up-kafka-multi-tenancy/">Setting Up Kafka Multi-Tenancy</a></h4><p>&#9997; <a href="https://www.linkedin.com/in/yunji-zhong-73810812/">Yunji Zhong</a> + <a href="https://www.linkedin.com/in/amitgud/">Amit Gud</a> + <a href="https://www.linkedin.com/in/carlosh/">Carlos Herrera</a></p><blockquote><p><em>In such a multi-tenant architecture, the isolation is implemented at the infrastructure layer. We will delve here into how we set up multi-tenancy with a messaging queue system based on Kafka.</em></p></blockquote><h4>&#128214;&#9478;<strong><a href="https://medium.com/dcsfamily/data-platforms-good-architect-bad-architect-cb9bdee35c34">Data Platforms : Good Architect &#8212; Bad Architect</a></strong></h4><p>&#9997; <a href="https://medium.com/@shah.nilay02?source=post_page-----cb9bdee35c34--------------------------------">Nilay Shah</a></p><blockquote><p><em>Data Engineering is a dynamic field that requires a deep understanding of both technical skills and overarching principles. This transition is often marked by a shift from focusing on immediate technical challenges to embracing a broader perspective on data architecture. If you are Data Architect and want to learn some basic practices &#8212; This Article is for you!</em></p></blockquote><h4>&#128214;&#9478;<strong><a href="https://medium.com/towards-data-science/many-articles-tell-you-python-tricks-but-few-tell-you-why-d4953d24e80b">Many Articles Tell You Python Tricks, But Few Tell You Why</a></strong></h4><p>&#9997; <a href="https://medium.com/@christophertao">Christopher Tao</a></p><blockquote><p><em>In my opinion, it is more important to understand the reason why these tricks are there, so we can understand when to use and when not to use them. 
In this article, I&#8217;ll pick up three of them and provide a detailed explanation of the mechanisms under the hood.</em></p></blockquote><div><hr></div><h1>&#129302; AI&#9478;ML&#9478;Data Science</h1><blockquote><p><em>You know, Burke, I don&#8217;t know which species is worse. &#8212; Ripley, Aliens (1986)</em></p></blockquote><h4>&#128214;&#9478;<a href="https://www.uber.com/en-SG/blog/scaling-ai-ml-infrastructure-at-uber/">Scaling AI/ML Infrastructure at Uber</a></h4><p>&#9997; <a href="https://www.uber.com/blog/asia/">Uber Engineering Blog</a></p><blockquote><p><em>As the complexity and scale of AI/ML models continue to surge, there&#8217;s a growing demand for highly efficient infrastructure to support these models effectively. Over the past few years, we&#8217;ve strategically implemented a range of infrastructure solutions, both CPU- and GPU-centric, to scale our systems dynamically and cater to the evolving landscape of ML use cases. This evolution has involved tailored hardware SKUs, software library enhancements, integration of diverse distributed training frameworks, and continual refinements to our end-to-end Michaelangelo platform.</em></p></blockquote><h4>&#128214;&#9478;<a href="https://garymarcus.substack.com/p/the-race-between-positive-and-negative">The race between positive and negative applications of Generative AI is on &#8211; and not looking pretty</a></h4><p>&#9997; <a href="https://substack.com/profile/14807526-gary-marcus">Gary Marcus</a></p><blockquote><p><em>OpenAI&#8217;s VP of Global Affairs Anna Makanju is exactly right - the race is on. I have some concerns, partly about the way that race is going, partly about (in)justice in who is likely to pay for the costs.</em></p></blockquote><div><hr></div><h1>&#128293; Catch up</h1><blockquote><p><em>&#8230;Next Saturday night, we're sending you back to the future! &#8212; Dr. 
Emmett Brown, Back to the Future (1985)</em></p></blockquote><p>&#128214;&#9478;<a href="https://www.databricks.com/blog/announcing-state-reader-api-new-statestore-data-source">Databricks: Announcing the State Reader API: The New "Statestore" Data Source</a></p><p>&#128214;&#9478;<a href="https://flink.apache.org/2024/03/21/apache-flink-kubernetes-operator-1.8.0-release-announcement/">Apache Flink Kubernetes Operator 1.8.0 Release Announcement</a></p><p>&#128214;&#9478;BigQuery <a href="https://cloud.google.com/bigquery/docs/search#operator_and_function_optimization">Query optimization using search indexes</a> is now generally available (GA). It applies to comparisons of string literals and indexed data, including the equal (<code>=</code>), <code>IN</code>, and <code>LIKE</code> operators and the <code>STARTS_WITH</code> function.</p><p>&#128214;&#9478;The <a href="https://cloud.google.com/bigquery/docs/write-sql-duet-ai#use_the_help_me_code_tool">Help me code tool</a> lets users use natural language to generate a SQL query that can then be run in BigQuery.</p><div><hr></div><h1>&#128160; Previously on Dimension</h1><blockquote><p><em>Dimension is my sub-newsletter where I note down things I learn from people smarter than me in the data engineering field.</em></p></blockquote><p>Here are the 3 latest articles:</p><h3><em><strong>Published on 2024, March 16:</strong></em></h3><div class="digest-post-embed" data-attrs="{&quot;nodeId&quot;:&quot;599861a8-c4a5-4c4a-8c99-d331a5e075ef&quot;,&quot;caption&quot;:&quot;My name is Vu Trinh, and I am a data engineer. I&#8217;m trying to make my life less dull by spending time learning and researching &#8220;how it works&#8220; in the data engineering field. Here is a place where I share everything I&#8217;ve learned. Not subscribe yet? 
Here you go:&quot;,&quot;cta&quot;:null,&quot;showBylines&quot;:true,&quot;size&quot;:&quot;sm&quot;,&quot;isEditorNode&quot;:true,&quot;title&quot;:&quot;I spent another 8 hours understanding the design of Amazon Redshift. Here's what I found.&quot;,&quot;publishedBylines&quot;:[{&quot;id&quot;:167177248,&quot;name&quot;:&quot;Vu Trinh&quot;,&quot;bio&quot;:&quot;This is me&quot;,&quot;photo_url&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/4805f673-db97-4f7c-85c4-44b345a8de80_256x256.png&quot;,&quot;is_guest&quot;:false,&quot;bestseller_tier&quot;:null}],&quot;post_date&quot;:&quot;2024-03-16T11:00:30.199Z&quot;,&quot;cover_image&quot;:&quot;https://substackcdn.com/image/fetch/f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F809f83a1-a889-470c-a7a9-8a68dd3003a7_1401x999.png&quot;,&quot;cover_image_alt&quot;:null,&quot;canonical_url&quot;:&quot;https://vutr.substack.com/p/i-spent-another-8-hours-understanding&quot;,&quot;section_name&quot;:&quot;Dimensions.&quot;,&quot;video_upload_id&quot;:null,&quot;id&quot;:142050038,&quot;type&quot;:&quot;newsletter&quot;,&quot;reaction_count&quot;:1,&quot;comment_count&quot;:6,&quot;publication_id&quot;:null,&quot;publication_name&quot;:&quot;VuTrinh.&quot;,&quot;publication_logo_url&quot;:&quot;https://substackcdn.com/image/fetch/f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffb91e59e-4df2-4ac8-b3b4-1f4e830f7b42_256x256.png&quot;,&quot;belowTheFold&quot;:true,&quot;youtube_url&quot;:null,&quot;show_links&quot;:null,&quot;feed_url&quot;:null}"></div><h3><em><strong>Published on 2024, March 23:</strong></em></h3><div class="digest-post-embed" data-attrs="{&quot;nodeId&quot;:&quot;5adbac5b-93b1-4cb8-9e62-1e77d00b0805&quot;,&quot;caption&quot;:&quot;My name is Vu Trinh, and I am a data engineer. 
I&#8217;m trying to make my life less dull by spending time learning and researching &#8220;how it works&#8220; in the data engineering field. Here is a place where I share everything I&#8217;ve learned. Not subscribe yet? Here you go:&quot;,&quot;cta&quot;:null,&quot;showBylines&quot;:true,&quot;size&quot;:&quot;sm&quot;,&quot;isEditorNode&quot;:true,&quot;title&quot;:&quot;How does Uber build real-time infrastructure to handle petabytes of data every day?&quot;,&quot;publishedBylines&quot;:[{&quot;id&quot;:167177248,&quot;name&quot;:&quot;Vu Trinh&quot;,&quot;bio&quot;:&quot;This is me&quot;,&quot;photo_url&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/4805f673-db97-4f7c-85c4-44b345a8de80_256x256.png&quot;,&quot;is_guest&quot;:false,&quot;bestseller_tier&quot;:null}],&quot;post_date&quot;:&quot;2024-03-23T11:01:01.215Z&quot;,&quot;cover_image&quot;:&quot;https://substackcdn.com/image/fetch/f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffdbce475-f7ba-44f3-91f8-07b53b1c996d_1398x999.png&quot;,&quot;cover_image_alt&quot;:null,&quot;canonical_url&quot;:&quot;https://vutr.substack.com/p/i-spent-7-hours-understanding-ubers&quot;,&quot;section_name&quot;:&quot;Dimensions.&quot;,&quot;video_upload_id&quot;:null,&quot;id&quot;:142351678,&quot;type&quot;:&quot;newsletter&quot;,&quot;reaction_count&quot;:0,&quot;comment_count&quot;:0,&quot;publication_id&quot;:null,&quot;publication_name&quot;:&quot;VuTrinh.&quot;,&quot;publication_logo_url&quot;:&quot;https://substackcdn.com/image/fetch/f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffb91e59e-4df2-4ac8-b3b4-1f4e830f7b42_256x256.png&quot;,&quot;belowTheFold&quot;:true,&quot;youtube_url&quot;:null,&quot;show_links&quot;:null,&quot;feed_url&quot;:null}"></div><h3><em><strong>Published on 2024, March 30:</strong></em></h3><div class="digest-post-embed" 
data-attrs="{&quot;nodeId&quot;:&quot;e911f8b4-c7b7-41f2-be0c-0edbdf608a9c&quot;,&quot;caption&quot;:&quot;My name is Vu Trinh, and I am a data engineer. I&#8217;m trying to make my life less dull by spending time learning and researching &#8220;how it works&#8220; in data engineering. Here is a place where I share everything I&#8217;ve learned. Not subscribe yet? Here you go:&quot;,&quot;cta&quot;:null,&quot;showBylines&quot;:true,&quot;size&quot;:&quot;sm&quot;,&quot;isEditorNode&quot;:true,&quot;title&quot;:&quot;A glimpse of Apache Pinot, the real-time OLAP system from LinkedIn&quot;,&quot;publishedBylines&quot;:[],&quot;post_date&quot;:&quot;2024-03-30T11:01:01.688Z&quot;,&quot;cover_image&quot;:&quot;https://substackcdn.com/image/fetch/f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0ea2eaa4-2af6-4f79-9d04-07753dacc209_1396x994.png&quot;,&quot;cover_image_alt&quot;:null,&quot;canonical_url&quot;:&quot;https://vutr.substack.com/p/a-glimpse-of-apache-pinot-the-real&quot;,&quot;section_name&quot;:&quot;Dimensions.&quot;,&quot;video_upload_id&quot;:null,&quot;id&quot;:142533909,&quot;type&quot;:&quot;newsletter&quot;,&quot;reaction_count&quot;:4,&quot;comment_count&quot;:0,&quot;publication_id&quot;:null,&quot;publication_name&quot;:&quot;VuTrinh.&quot;,&quot;publication_logo_url&quot;:&quot;https://substackcdn.com/image/fetch/f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffb91e59e-4df2-4ac8-b3b4-1f4e830f7b42_256x256.png&quot;,&quot;belowTheFold&quot;:true,&quot;youtube_url&quot;:null,&quot;show_links&quot;:null,&quot;feed_url&quot;:null}"></div><div class="pullquote"><p>Let me hear your voice, for example: </p><p>'Your newsletter is so terrible, I can't handle it anymore.'</p><p class="button-wrapper" 
data-attrs="{&quot;url&quot;:&quot;https://vutr.substack.com/p/groupby-29-scaling-aiml-infrastructure/comments&quot;,&quot;text&quot;:&quot;Leave a comment&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://vutr.substack.com/p/groupby-29-scaling-aiml-infrastructure/comments"><span>Leave a comment</span></a></p><div><hr></div><h2>&#8220;Hasta la vista, baby&#8221; -T800, Terminator 2: Judgment Day (1991)</h2></div><div class="subscription-widget-wrap-editor" data-attrs="{&quot;url&quot;:&quot;https://vutr.substack.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe&quot;,&quot;language&quot;:&quot;en&quot;}" data-component-name="SubscribeWidgetToDOM"><div class="subscription-widget show-subscribe"><div class="preamble"><p class="cta-caption">Thanks for scrolling this far! There's a convenient subscribe box here if you want me to annoy you every week. &#128516;</p></div><form class="subscription-widget-subscribe"><input type="email" class="email-input" name="email" placeholder="Type your email&#8230;" tabindex="-1"><input type="submit" class="button primary" value="Subscribe"><div class="fake-input-wrapper"><div class="fake-input"></div><div class="fake-button"></div></div></form></div></div><p></p>]]></content:encoded></item><item><title><![CDATA[GroupBy #28: Tableflow - The Stream/Table, Kafka/Iceberg Duality, Kafka tiered storage deep dive]]></title><description><![CDATA[Plus: dbt&#8217;s Model Contracts for Dummies, The Problem with Data Governance]]></description><link>https://vutr.substack.com/p/groupby-28-tableflow-the-streamtable</link><guid isPermaLink="false">https://vutr.substack.com/p/groupby-28-tableflow-the-streamtable</guid><dc:creator><![CDATA[Vu Trinh]]></dc:creator><pubDate>Tue, 26 Mar 2024 11:01:56 GMT</pubDate><enclosure 
url="https://substackcdn.com/image/fetch/$s_!e4RU!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F845b4265-23a9-4ebd-8157-f66831b89c97_1400x1000.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p><em>This is <strong>GroupBy</strong>, the weekly compiled resources for data engineers.</em></p><p><em>Not subscribed yet? Here you go:</em></p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://vutr.substack.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe now&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://vutr.substack.com/subscribe?"><span>Subscribe now</span></a></p><div class="pullquote"><div class="install-substack-app-embed install-substack-app-embed-web" data-component-name="InstallSubstackAppToDOM"><img class="install-substack-app-embed-img" src="https://substackcdn.com/image/fetch/$s_!D8N-!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fvutr.substack.com%2Fimg%2Fsubstack.png"><div class="install-substack-app-embed-text"><div class="install-substack-app-header">Get more from Vu Trinh in the Substack app</div><div class="install-substack-app-text">Available for iOS and Android</div></div><a href="https://substack.com/app/app-store-redirect?utm_campaign=app-marketing&amp;utm_content=author-post-insert&amp;utm_source=vutr" target="_blank" class="install-substack-app-embed-link"><button class="install-substack-app-embed-btn button primary">Get the app</button></a></div></div><blockquote><p><em>&#128075; Hi, my name is Vu Trinh, a data engineer.</em></p><p><em>I enjoy reading <strong>good stuff</strong>  (related to data and engineering), and this newsletter is my effort on the journey to seek the "good stuff" across the entire Internet. 
</em></p><p><em>Hope this issue finds you well.</em></p></blockquote><div><hr></div><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!e4RU!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F845b4265-23a9-4ebd-8157-f66831b89c97_1400x1000.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!e4RU!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F845b4265-23a9-4ebd-8157-f66831b89c97_1400x1000.png 424w, https://substackcdn.com/image/fetch/$s_!e4RU!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F845b4265-23a9-4ebd-8157-f66831b89c97_1400x1000.png 848w, https://substackcdn.com/image/fetch/$s_!e4RU!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F845b4265-23a9-4ebd-8157-f66831b89c97_1400x1000.png 1272w, https://substackcdn.com/image/fetch/$s_!e4RU!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F845b4265-23a9-4ebd-8157-f66831b89c97_1400x1000.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!e4RU!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F845b4265-23a9-4ebd-8157-f66831b89c97_1400x1000.png" width="1400" height="1000" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/845b4265-23a9-4ebd-8157-f66831b89c97_1400x1000.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1000,&quot;width&quot;:1400,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:1862281,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!e4RU!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F845b4265-23a9-4ebd-8157-f66831b89c97_1400x1000.png 424w, https://substackcdn.com/image/fetch/$s_!e4RU!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F845b4265-23a9-4ebd-8157-f66831b89c97_1400x1000.png 848w, https://substackcdn.com/image/fetch/$s_!e4RU!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F845b4265-23a9-4ebd-8157-f66831b89c97_1400x1000.png 1272w, https://substackcdn.com/image/fetch/$s_!e4RU!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F845b4265-23a9-4ebd-8157-f66831b89c97_1400x1000.png 1456w" sizes="100vw" fetchpriority="high"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" 
xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Image created by the Canva Image Generator.</figcaption></figure></div><div><hr></div><h1>&#128200; Career</h1><blockquote><p><em>Don't let comfort hold you back.</em></p></blockquote><h4>&#128214;&#9478;<a href="https://paulgraham.com/google.html">How to Start Google</a></h4><p>&#9997; <a href="https://paulgraham.com/bio.html">Paul Graham</a></p><blockquote><p><em>Starting your own company can mean anything from starting a barber shop to starting Google. I'm here to talk about one extreme end of that continuum. I'm going to tell you how to start Google.</em></p></blockquote><h4>&#128214;&#9478;<a href="https://blog.dataengineer.io/p/how-to-convert-your-yearly-review">How to convert your yearly review to a promotion in big tech</a></h4><p>&#9997; <a href="https://substack.com/@eczachly">Zach Wilson</a></p><blockquote><p><em>The dreaded review cycle usually comes once or twice a year at companies. For most this process is annoying or anxiety-provoking. 
It is also one of the most critical moments for advancing your career with interviewing well being the only one thing that&#8217;s more critical!</em></p></blockquote><h4>&#128214;&#9478;<a href="https://lethain.com/leadership-requires-risk/">Leadership requires taking some risk</a></h4><p>&#9997; <a href="https://lethain.com/about/">Will Larson</a></p><blockquote><p><em>At a recent offsite with Carta&#8217;s Navigators, we landed on an interesting topic: leadership roles sometimes mean that making progress on a professional initiative requires taking some personal risk.</em></p></blockquote><div><hr></div><h1>&#128640; Engineering</h1><blockquote><p><em>I have to believe in a world outside my own mind. &#8212; Memento (2000)</em></p></blockquote><h4>&#128214;&#9478;<a href="https://www.scattered-thoughts.net/writing/unexplanations-sql-is-syntactic-sugar-for-relational-algebra/">Unexplanations: sql is syntactic sugar for relational algebra</a></h4><p>&#9997; <a href="https://www.scattered-thoughts.net/">Jamie Brandon</a></p><blockquote><p><em>This idea is particularly sticky because it was more or less true 50 years ago, and it's a passable mental model to use when learning sql. But it's an inadequate mental model for building new sql frontends, designing new query languages, or writing tools likes ORMs that abstract over sql.</em></p></blockquote><h4>&#128214;&#9478;<a href="https://t.co/r8Op8DakEm">Tableflow: The Stream/Table, Kafka/Iceberg Duality</a></h4><p>&#9997; <a href="https://jack-vanlightly.com/home">Jack Vanlightly</a></p><blockquote><p><em>Confluent just announced Tableflow, the seamless materialization of Apache Kafka topics as Apache Iceberg tables. This announcement has to be the most impactful announcement I&#8217;ve witnessed while at Confluent. This post is about why Iceberg tables aren&#8217;t just another destination to sync data to; they fundamentally change the world of streaming. 
It&#8217;s also about the macro trends that have led us to this point and why Iceberg (and the other table formats) are so important to the future of streaming.</em></p></blockquote><h4>&#128214;&#9478;<a href="https://developers.redhat.com/articles/2024/03/13/kafka-tiered-storage-deep-dive">Kafka tiered storage deep dive</a></h4><p>&#9997; <a href="https://developers.redhat.com/author/federico-valeri">Federico Valeri</a> + <a href="https://developers.redhat.com/author/luke-chen">Luke Chen</a></p><blockquote><p><em>Tiered storage is a new early access feature available as of Apache Kafka 3.6.0 that allows you to scale compute and storage resources independently, provides better client isolation, and allows faster maintenance of your Kafka cluster. Let's dive into this new feature to see the motivations, design, and implementation details. In this post, we will focus on Tiered storage implementation, so it is assumed a good understanding of the Kafka architecture and main components</em></p></blockquote><h4>&#128214;&#9478;<a href="https://medium.com/insiderengineering/consumer-driven-contract-testing-cdct-b6c05c18ba25?source=rss----80f9de3e9a8a---4">Consumer-Driven Contract Testing (CDCT)</a></h4><p>&#9997; <a href="https://medium.com/@mihribankmrci?source=post_page-----b6c05c18ba25--------------------------------">Mihriban Kumarci</a></p><blockquote><p><em>Consumer-Driven Contract (CDC) Testing is gaining prominence in microservices architecture. It offers an efficient way to ensure that services meet their contracts without exhaustive end-to-end tests..</em></p></blockquote><h4>&#128214;&#9478;<a href="https://www.thdpth.com/p/timescaledb-and-the-quest-for-the">TimescaleDB and the Quest for the Ultimate Time Series Database: Growth, Challenges, and Strategic Moves</a></h4><p>&#9997; <a href="https://substack.com/profile/229923-sven-balnojan-phd">Sven Balnojan PhD</a></p><blockquote><p><em>The one key question I want to dive into today is&#8230;. 
Is Timescale able to handle the coming torrent? Will they be swept away or catch the current and grow absurdly quickly?</em></p></blockquote><h4>&#128214;&#9478;<a href="https://medium.com/picnic-engineering/python-picnic-590819d066d8">Python @ Picnic</a></h4><p>&#9997; <a href="https://medium.com/@svena33">Sven Arends</a></p><blockquote><p><em>From Machine Learning to black box testing of Java services, Picnic uses Python throughout its tech stack. Learn why Picnic chose Python and how we leverage its flexibility at scale.</em></p></blockquote><h4>&#128214;&#9478;<strong><a href="https://www.pythonmorsels.com/every-dunder-method/">Every dunder method in Python</a></strong></h4><p>&#9997; <a href="https://www.pythonmorsels.com/">Trey Hunner</a></p><blockquote><p><em>Python includes tons of <a href="https://www.pythonmorsels.com/what-are-dunder-methods/">dunder methods</a> ("double underscore" methods) which allow us to deeply customize how our custom classes interact with Python's many features. What dunder methods could you add to your class to make it friendly for other Python programmers who use it?</em></p></blockquote><h4>&#128214;&#9478;<a href="https://medium.com/tech-at-instacart/real-time-fraud-detection-with-yoda-and-clickhouse-bd08e9dbe3f4">Real-time Fraud Detection with Yoda and ClickHouse</a></h4><p>&#9997; <a href="https://medium.com/@nicholas.shieh">Nick Shieh</a></p><blockquote><p><em>Our Fraud Platform team developed Yoda, a decision platform service, to detect such fraudulent activities quickly and take appropriate measures. To enable fraud decisions in fractions of a second, Yoda uses ClickHouse as its primary real-time datastore. 
ClickHouse is a fast and highly performant analytical database, widely used across Instacart to power other use-cases such as critical retailer and ads dashboards, calculating results for A/B testing, and machine learning signals.</em></p></blockquote><div><hr></div><h1>&#9999; Data</h1><blockquote><p><em>The one thing that this job has taught me is that truth is stranger than fiction. &#8212; Predestination (2014)</em></p></blockquote><h4>&#128214;&#9478;<a href="https://faithfacts.substack.com/p/dbts-model-contracts-for-dummies">dbt&#8217;s Model Contracts for Dummies</a></h4><p>&#9997; <a href="https://substack.com/profile/11412853-faith-lierheimer">Faith Lierheimer</a></p><blockquote><p><em>So yeah. The Data Contracts Content Explosion on the Data Internet was fucking exhausting. Unfortunately, it was on to an important idea.</em></p></blockquote><h4>&#128214;&#9478;<a href="https://eric-sandosham.medium.com/the-problem-with-data-governance-2570f0573f3a">The Problem with Data Governance</a></h4><p>&#9997; <a href="https://eric-sandosham.medium.com/?source=post_page-----2570f0573f3a--------------------------------">Eric Sandosham, Ph.D.</a></p><blockquote><p><em>Over the years, Data Governance has lost much of its lustre having been unable to show much business impact. But what exactly is Data Governance? Has it evolved beyond data quality and data access? How does it square up with the emergence of AI solutioning? And so I dedicate my 29th weekly article to discussing the topic of whether Data Governance as a practice still makes sense.</em></p></blockquote><div><hr></div><h1>&#129302; AI&#9478;ML&#9478;Data Science</h1><blockquote><p><em>You know, Burke, I don&#8217;t know which species is worse. 
&#8212; Ripley, Aliens (1986)</em></p></blockquote><h4>&#128214;&#9478;<a href="https://roundup.getdbt.com/p/lets-talk-about-ai">Let's talk about AI...</a></h4><p>&#9997; <a href="https://substack.com/@jthandy">Tristan Handy</a></p><blockquote><p><em>The integration of structured data and AI will be driven by metadata. And dbt&#8217;s biggest role to play in the AI revolution will be as the source of truth for that metadata.</em></p></blockquote><h4>&#128214;&#9478;<a href="https://engineering.fb.com/2024/03/18/data-infrastructure/logarithm-logging-engine-ai-training-workflows-services-meta/">Logarithm: A logging engine for AI training workflows and services</a></h4><p>&#9997; <a href="https://engineering.fb.com/author/partha-kanuparthy/">Partha Kanuparthy</a></p><blockquote><p><em>Logarithm is a hosted, serverless, multitenant service, used only internally at Meta, that consumes and indexes these logs and provides an interactive query interface to retrieve and view logs. In this post, we present the design behind Logarithm, and show how it powers AI training debugging use cases.</em></p></blockquote><h4>&#128214;&#9478;<a href="https://www.uber.com/blog/enhancing-the-quality-of-machine-learning-systems-at-scale/">Model Excellence Scores: A Framework for Enhancing the Quality of Machine Learning Systems at Scale</a></h4><p>&#9997; <a href="https://www.uber.com/blog/">Uber Engineering Blog</a></p><blockquote><p><em>By integrating the Service Level Agreement (SLA) concept, we aim to establish a standard for measuring and ensuring ML model quality.</em></p></blockquote><h4>&#128214;&#9478;<a href="https://engineering.fb.com/2024/03/20/networking-traffic/optimizing-rtc-bandwidth-estimation-machine-learning/">Optimizing RTC bandwidth estimation with machine learning</a></h4><p>&#9997; <a href="https://engineering.fb.com/author/santhosh-sunderrajan/">Santhosh Sunderrajan</a></p><blockquote><p><em>We&#8217;ve adopted a machine learning (ML)-based approach that allows us 
to solve networking problems holistically across cross-layers such as BWE, network resiliency, and transport. We&#8217;re sharing our experiment results from this approach, some of the challenges we encountered during execution, and learnings for new adopters.</em></p></blockquote><h4>&#128214;&#9478;<a href="https://medium.com/netflix-techblog/sequential-testing-keeps-the-world-streaming-netflix-part-2-counting-processes-da6805341642">Sequential A/B Testing Keeps the World Streaming Netflix Part 2: Counting Processes</a></h4><p>&#9997; <a href="https://netflixtechblog.medium.com/">Netflix Technology Blog</a></p><blockquote><p><em>Netflix monitors a large suite of metrics, many of which can be classified as counts. These include metrics such as the number of logins, errors, successful play starts, and even the number of customer call center contacts. In this second installment, we describe our sequential methodology for testing count metrics, outlined in the NeurIPS paper Anytime Valid Inference for Multinomial Count Data.</em></p></blockquote><h4>&#128214;&#9478;<a href="https://dropbox.tech/machine-learning/bye-bye-bye-evolution-of-repeated-token-attacks-on-chatgpt-models">Bye Bye Bye...: Evolution of repeated token attacks on ChatGPT models</a></h4><p>&#9997; <a href="https://www.linkedin.com/in/markbreitenbach/">Mark Breitenbach</a> + <a href="https://www.linkedin.com/in/adrian-wood-threlfall/">Adrian Wood</a></p><blockquote><p><em>This blog will discuss the steps taken to execute the repeated token attack on ChatGPT models at various points from October 2023 through March 2024&#8212;before, during, and after OpenAI&#8217;s filtering mitigation was deployed.</em></p></blockquote><div><hr></div><h1>&#128293; Catch up</h1><blockquote><p><em>&#8230;Next Saturday night, we're sending you back to the future! &#8212; Dr. 
Emmett Brown, Back to the Future (1985)</em></p></blockquote><p>&#128214;&#9478;<a href="https://github.com/dbt-labs/dbt-core/blob/v1.8.0b1/CHANGELOG.md?ref=blef.fr">dbt  1.8.0</a></p><p>&#128214;&#9478;<a href="https://www.confluent.io/blog/exploring-apache-flink-1-19/">Apache Flink 1.19</a></p><p>&#128214;&#9478;<a href="https://www.confluent.io/blog/introducing-apache-kafka-3-7/">Apache Kafka 3.7</a></p><p>&#128214;&#9478;<a href="https://cloud.google.com/bigquery/docs/materialized-views-create#left-union">BigQuery | Incremental materialized views now support</a> <code>LEFT OUTER JOIN</code> and <code>UNION ALL</code>.</p><p>&#128214;&#9478;<a href="https://www.microsoft.com/en-us/research/blog/introducing-garnet-an-open-source-next-generation-faster-cache-store-for-accelerating-applications-and-services/">Microsoft introduces Garnet, a new remote cache-store</a></p><p>&#128214;&#9478;<a href="https://github.com/xai-org/grok-1">X AI open-sources Grok</a></p><p>&#128214;&#9478;<a href="https://www.confluent.io/blog/introducing-tableflow/">Confluent introduces Tableflow</a></p><div><hr></div><h1>&#128160; Previously on Dimension</h1><blockquote><p><em>Dimension is my sub-newsletter where I note down things I learn from people smarter than me in the data engineering field.</em></p></blockquote><p>Here are the 3 latest articles:</p><h3><em><strong>Published on 2024, March 9:</strong></em></h3><div class="digest-post-embed" data-attrs="{&quot;nodeId&quot;:&quot;4a7cd5fa-0b55-4e98-9075-497af3bf1fbd&quot;,&quot;caption&quot;:&quot;I updated the post because it initially had duplicate sections. My name is Vu Trinh, and I am a data engineer. I&#8217;m trying to make my life less dull by spending time learning and researching &#8220;how it works&#8220; in the data engineering field. 
Here is a place where I share everything I&#8217;ve learned.&quot;,&quot;cta&quot;:null,&quot;showBylines&quot;:true,&quot;size&quot;:&quot;sm&quot;,&quot;isEditorNode&quot;:true,&quot;title&quot;:&quot;If I could travel back to 5 years ago, what would I talk to myself about Docker?&quot;,&quot;publishedBylines&quot;:[{&quot;id&quot;:167177248,&quot;name&quot;:&quot;Vu Trinh&quot;,&quot;bio&quot;:&quot;This is me&quot;,&quot;photo_url&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/4805f673-db97-4f7c-85c4-44b345a8de80_256x256.png&quot;,&quot;is_guest&quot;:false,&quot;bestseller_tier&quot;:null}],&quot;post_date&quot;:&quot;2024-03-09T11:00:23.414Z&quot;,&quot;cover_image&quot;:&quot;https://substackcdn.com/image/fetch/f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F37ba199a-a064-4d0b-97e9-3ce8b77b3a5c_1399x998.png&quot;,&quot;cover_image_alt&quot;:null,&quot;canonical_url&quot;:&quot;https://vutr.substack.com/p/i-spent-3-hours-to-understand-more&quot;,&quot;section_name&quot;:&quot;Dimensions.&quot;,&quot;video_upload_id&quot;:null,&quot;id&quot;:141885092,&quot;type&quot;:&quot;newsletter&quot;,&quot;reaction_count&quot;:4,&quot;comment_count&quot;:2,&quot;publication_id&quot;:null,&quot;publication_name&quot;:&quot;VuTrinh.&quot;,&quot;publication_logo_url&quot;:&quot;https://substackcdn.com/image/fetch/f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffb91e59e-4df2-4ac8-b3b4-1f4e830f7b42_256x256.png&quot;,&quot;belowTheFold&quot;:true,&quot;youtube_url&quot;:null,&quot;show_links&quot;:null,&quot;feed_url&quot;:null}"></div><h3><em><strong>Published on 2024, March 16:</strong></em></h3><div class="digest-post-embed" data-attrs="{&quot;nodeId&quot;:&quot;599861a8-c4a5-4c4a-8c99-d331a5e075ef&quot;,&quot;caption&quot;:&quot;My name is Vu Trinh, and I am a data engineer. 
I&#8217;m trying to make my life less dull by spending time learning and researching &#8220;how it works&#8220; in the data engineering field. Here is a place where I share everything I&#8217;ve learned. Not subscribe yet? Here you go:&quot;,&quot;cta&quot;:null,&quot;showBylines&quot;:true,&quot;size&quot;:&quot;sm&quot;,&quot;isEditorNode&quot;:true,&quot;title&quot;:&quot;I spent another 8 hours understanding the design of Amazon Redshift. Here's what I found.&quot;,&quot;publishedBylines&quot;:[{&quot;id&quot;:167177248,&quot;name&quot;:&quot;Vu Trinh&quot;,&quot;bio&quot;:&quot;This is me&quot;,&quot;photo_url&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/4805f673-db97-4f7c-85c4-44b345a8de80_256x256.png&quot;,&quot;is_guest&quot;:false,&quot;bestseller_tier&quot;:null}],&quot;post_date&quot;:&quot;2024-03-16T11:00:30.199Z&quot;,&quot;cover_image&quot;:&quot;https://substackcdn.com/image/fetch/f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F809f83a1-a889-470c-a7a9-8a68dd3003a7_1401x999.png&quot;,&quot;cover_image_alt&quot;:null,&quot;canonical_url&quot;:&quot;https://vutr.substack.com/p/i-spent-another-8-hours-understanding&quot;,&quot;section_name&quot;:&quot;Dimensions.&quot;,&quot;video_upload_id&quot;:null,&quot;id&quot;:142050038,&quot;type&quot;:&quot;newsletter&quot;,&quot;reaction_count&quot;:1,&quot;comment_count&quot;:6,&quot;publication_id&quot;:null,&quot;publication_name&quot;:&quot;VuTrinh.&quot;,&quot;publication_logo_url&quot;:&quot;https://substackcdn.com/image/fetch/f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffb91e59e-4df2-4ac8-b3b4-1f4e830f7b42_256x256.png&quot;,&quot;belowTheFold&quot;:true,&quot;youtube_url&quot;:null,&quot;show_links&quot;:null,&quot;feed_url&quot;:null}"></div><h3><em><strong>Published on 2024, March 23:</strong></em></h3><div class="digest-post-embed" 
data-attrs="{&quot;nodeId&quot;:&quot;5adbac5b-93b1-4cb8-9e62-1e77d00b0805&quot;,&quot;caption&quot;:&quot;My name is Vu Trinh, and I am a data engineer. I&#8217;m trying to make my life less dull by spending time learning and researching &#8220;how it works&#8220; in the data engineering field. Here is a place where I share everything I&#8217;ve learned. Not subscribe yet? Here you go:&quot;,&quot;cta&quot;:null,&quot;showBylines&quot;:true,&quot;size&quot;:&quot;sm&quot;,&quot;isEditorNode&quot;:true,&quot;title&quot;:&quot;How does Uber build real-time infrastructure to handle petabytes of data every day?&quot;,&quot;publishedBylines&quot;:[{&quot;id&quot;:167177248,&quot;name&quot;:&quot;Vu Trinh&quot;,&quot;bio&quot;:&quot;This is me&quot;,&quot;photo_url&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/4805f673-db97-4f7c-85c4-44b345a8de80_256x256.png&quot;,&quot;is_guest&quot;:false,&quot;bestseller_tier&quot;:null}],&quot;post_date&quot;:&quot;2024-03-23T11:01:01.215Z&quot;,&quot;cover_image&quot;:&quot;https://substackcdn.com/image/fetch/f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffdbce475-f7ba-44f3-91f8-07b53b1c996d_1398x999.png&quot;,&quot;cover_image_alt&quot;:null,&quot;canonical_url&quot;:&quot;https://vutr.substack.com/p/i-spent-7-hours-understanding-ubers&quot;,&quot;section_name&quot;:&quot;Dimensions.&quot;,&quot;video_upload_id&quot;:null,&quot;id&quot;:142351678,&quot;type&quot;:&quot;newsletter&quot;,&quot;reaction_count&quot;:0,&quot;comment_count&quot;:0,&quot;publication_id&quot;:null,&quot;publication_name&quot;:&quot;VuTrinh.&quot;,&quot;publication_logo_url&quot;:&quot;https://substackcdn.com/image/fetch/f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffb91e59e-4df2-4ac8-b3b4-1f4e830f7b42_256x256.png&quot;,&quot;belowTheFold&quot;:true,&quot;youtube_url&quot;:null,&quot;show_links&quot;:null,
&quot;feed_url&quot;:null}"></div><div class="pullquote"><p>Let me hear your voice, for example: </p><p>'Your newsletter is so terrible, I can't handle it anymore.'</p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://vutr.substack.com/p/groupby-28-tableflow-the-streamtable/comments&quot;,&quot;text&quot;:&quot;Leave a comment&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://vutr.substack.com/p/groupby-28-tableflow-the-streamtable/comments"><span>Leave a comment</span></a></p><div><hr></div><h2>&#8220;Hasta la vista, baby&#8221; -T800, Terminator 2: Judgment Day (1991)</h2></div><div class="subscription-widget-wrap-editor" data-attrs="{&quot;url&quot;:&quot;https://vutr.substack.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe&quot;,&quot;language&quot;:&quot;en&quot;}" data-component-name="SubscribeWidgetToDOM"><div class="subscription-widget show-subscribe"><div class="preamble"><p class="cta-caption">Thanks for scrolling this far! There's a convenient subscribe box here if you want me to annoy you every week. 
&#128516;</p></div><form class="subscription-widget-subscribe"><input type="email" class="email-input" name="email" placeholder="Type your email&#8230;" tabindex="-1"><input type="submit" class="button primary" value="Subscribe"><div class="fake-input-wrapper"><div class="fake-input"></div><div class="fake-button"></div></div></form></div></div><p></p>]]></content:encoded></item><item><title><![CDATA[GroupBy #27: Balancing HDFS DataNodes in the Uber DataLake, How Figma’s databases team lived to tell the scale]]></title><description><![CDATA[Plus: Building Meta&#8217;s GenAI Infrastructure, How to save millions by optimizing data pipeline shuffling]]></description><link>https://vutr.substack.com/p/groupby-27-balancing-hdfs-datanodes</link><guid isPermaLink="false">https://vutr.substack.com/p/groupby-27-balancing-hdfs-datanodes</guid><dc:creator><![CDATA[Vu Trinh]]></dc:creator><pubDate>Tue, 19 Mar 2024 11:01:34 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F440e5362-8510-48ef-a235-a99171dfdb4b_1400x1000.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p><em>This is <strong>GroupBy</strong>, where I share the resources I learn from people smarter than me in the data engineering field.</em></p><p><em>Not subscribed yet? 
Here you go:</em></p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://vutr.substack.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe now&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://vutr.substack.com/subscribe?"><span>Subscribe now</span></a></p><div class="pullquote"><div class="install-substack-app-embed install-substack-app-embed-web" data-component-name="InstallSubstackAppToDOM"><img class="install-substack-app-embed-img" src="https://substackcdn.com/image/fetch/$s_!D8N-!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fvutr.substack.com%2Fimg%2Fsubstack.png"><div class="install-substack-app-embed-text"><div class="install-substack-app-header">Get more from Vu Trinh in the Substack app</div><div class="install-substack-app-text">Available for iOS and Android</div></div><a href="https://substack.com/app/app-store-redirect?utm_campaign=app-marketing&amp;utm_content=author-post-insert&amp;utm_source=vutr" target="_blank" class="install-substack-app-embed-link"><button class="install-substack-app-embed-btn button primary">Get the app</button></a></div></div><blockquote><p><em>&#128075; Hi, my name is Vu Trinh, a data engineer.</em></p><p><em>I enjoy reading <strong>good stuff</strong>  (related to data and engineering), and this newsletter is my effort on the journey to seek the "good stuff" across the entire Internet. 
</em></p><p><em>Hope this issue finds you well.</em></p></blockquote><div><hr></div><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!WZ3Z!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F440e5362-8510-48ef-a235-a99171dfdb4b_1400x1000.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!WZ3Z!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F440e5362-8510-48ef-a235-a99171dfdb4b_1400x1000.png 424w, https://substackcdn.com/image/fetch/$s_!WZ3Z!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F440e5362-8510-48ef-a235-a99171dfdb4b_1400x1000.png 848w, https://substackcdn.com/image/fetch/$s_!WZ3Z!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F440e5362-8510-48ef-a235-a99171dfdb4b_1400x1000.png 1272w, https://substackcdn.com/image/fetch/$s_!WZ3Z!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F440e5362-8510-48ef-a235-a99171dfdb4b_1400x1000.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!WZ3Z!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F440e5362-8510-48ef-a235-a99171dfdb4b_1400x1000.png" width="1400" height="1000" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/440e5362-8510-48ef-a235-a99171dfdb4b_1400x1000.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1000,&quot;width&quot;:1400,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:2275958,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!WZ3Z!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F440e5362-8510-48ef-a235-a99171dfdb4b_1400x1000.png 424w, https://substackcdn.com/image/fetch/$s_!WZ3Z!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F440e5362-8510-48ef-a235-a99171dfdb4b_1400x1000.png 848w, https://substackcdn.com/image/fetch/$s_!WZ3Z!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F440e5362-8510-48ef-a235-a99171dfdb4b_1400x1000.png 1272w, https://substackcdn.com/image/fetch/$s_!WZ3Z!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F440e5362-8510-48ef-a235-a99171dfdb4b_1400x1000.png 1456w" sizes="100vw" fetchpriority="high"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" 
xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Image created by the <a href="https://www.canva.com/ai-image-generator/">Canva Image Generator</a>.</figcaption></figure></div><div><hr></div><h1>&#128200; Career</h1><blockquote><p><em>Don't let comfort hold you back.</em></p></blockquote><h4>&#128214;&#9478;<a href="https://muratbuffalo.blogspot.com/2024/03/the-demise-of-coding-is-greatly.html">The demise of coding is greatly exaggerated</a></h4><p>&#9997; <a href="https://cse.buffalo.edu/~demirbas/">Murat Demirbas</a></p><blockquote><p><em>I like to mention that a career in computer science and software technology (practicing coding) gives you vital and generally applicable skills: hacking, debugging, abstract thinking, quick learning/adaptation, and organizational skills.</em></p></blockquote><h4>&#128214;&#9478;<a href="https://liw.fi/40/">40 years of programming</a></h4><p>&#9997; <a href="https://liw.fi/">Lars Wirzenius</a></p><blockquote><p><em>In April, 1984, my father bought a computer for his 
home office, a Luxor ABC-802, with a Z80 CPU, 64 kilobytes of RAM, a yellow-on-black screen with 80 by 25 text mode, or about 160 by 75 pixels in graphics mode, and two floppy drives. It had BASIC in its ROM, and came with absolutely no games. If I wanted to play with it, I had to learn how to program, and write my own games. I learned BASIC, and over the next few years would learn Pascal, C, and more. I had found my passion. I was 14 years old and I knew what I wanted to do when I grew up.</em></p></blockquote><h4>&#128214;&#9478;<a href="https://martinfowler.com/articles/measuring-developer-productivity-humans.html">Measuring Developer Productivity via Humans</a></h4><p>&#9997; <a href="https://www.linkedin.com/in/abinoda/">Abi Noda</a> + <a href="https://www.linkedin.com/in/timcochran/">Tim Cochran</a></p><blockquote><p><em>Measuring developer productivity is a difficult challenge. Conventional metrics focused on development cycle time and throughput are limited, and there aren't obvious answers for where else to turn. Qualitative metrics offer a powerful way to measure and understand developer productivity using data derived from developers themselves.</em></p></blockquote><div><hr></div><h1>&#128640; Engineering</h1><blockquote><p><em>I have to believe in a world outside my own mind. &#8212; Memento (2000)</em></p></blockquote><h4>&#128214;&#9478;<a href="https://www.uber.com/en-SG/blog/balancing-hdfs-datanodes-in-the-uber-datalake/">Balancing HDFS DataNodes in the Uber DataLake</a></h4><p>&#9997; <a href="https://www.uber.com/blog/asia/">Uber Engineering Blog</a></p><blockquote><p><em>Uber has one of the largest HDFS deployments in the world, with exabytes of data across tens of clusters. 
It is important, but also challenging, to keep scaling our data infrastructure with the balance between efficiency, service reliability, and high performance.</em></p></blockquote><h4>&#128214;&#9478;<a href="https://www.arroyo.dev/blog/why-arrow-and-datafusion">We built a new SQL Engine on Arrow and DataFusion</a></h4><p>&#9997; <a href="https://www.linkedin.com/in/wylde/">Micah Wylde</a></p><blockquote><p><em>Arroyo 0.10 has an entirely new SQL engine built with Apache Arrow and DataFusion. It's much faster, smaller, and easier to run. Read on for why and how we're making this change.</em></p></blockquote><h4>&#128214;&#9478;<a href="https://motherduck.com/blog/differential-storage-building-block-for-data-warehouse/">Differential storage: a key building block for a DuckDB-based data warehouse</a></h4><p>&#9997; <a href="https://motherduck.com/authors/joseph-hwang/">Joseph Hwang</a></p><blockquote><p><em>Today we&#8217;d like to talk about Differential Storage, a key infrastructure-level enabler of new capabilities and stronger semantics for MotherDuck users. Thanks to Differential Storage, features like efficient <strong><a href="https://motherduck.com/docs/key-tasks/managing-shared-motherduck-database">data sharing</a></strong> and <strong><a href="https://motherduck.com/docs/motherduck-sql-reference/create-database">zero-copy clone</a></strong> are now available in MotherDuck. 
Moreover, Differential Storage unlocks other features, like snapshots, branching and time travel which we&#8217;ll release in the coming months.</em></p></blockquote><h4>&#128214;&#9478;<a href="https://medium.com/pinterest-engineering/improving-efficiency-of-goku-time-series-database-at-pinterest-part-2-08130f25b874">Improving Efficiency Of Goku Time Series Database at Pinterest (Part 2)</a></h4><p>&#9997; <a href="https://medium.com/@Pinterest_Engineering?source=post_page-----08130f25b874--------------------------------">Pinterest Engineering</a></p><blockquote><p><em>This 2nd blog post focuses on how Goku time series queries were improved. We will provide a brief overview of Goku&#8217;s time series data model, query model, and architecture. We will follow up with the improvement features we added including rollup, pre-aggregation, and pagination.</em></p></blockquote><h4>&#128214;&#9478;<a href="https://jack-vanlightly.com/analyses/2024/3/12/scaling-models-and-multi-tenant-data-systems-asds-chapter-6">Scaling Models And Multi-Tenant Data Systems - ASDS Chapter 6</a></h4><p>&#9997; <a href="https://jack-vanlightly.com/home">Jack Vanlightly</a></p><blockquote><p><em>What is scaling in large-scale multi-tenant data systems, and how does that compare to single-tenant data systems? How does per-tenant scaling relate to system-wide scaling? How do scale-to-zero and cold starts come into play? Answering these questions is chapter 6 of The Architecture of Serverless Data Systems.</em></p></blockquote><h4>&#128214;&#9478;<a href="https://www.figma.com/blog/how-figmas-databases-team-lived-to-tell-the-scale/">How Figma&#8217;s databases team lived to tell the scale</a></h4><p>&#9997; <a href="https://www.linkedin.com/in/samantha-steele-b9a41aa3/">Sammy Steele</a></p><blockquote><p><em>Figma&#8217;s database stack has grown almost 100x since 2020. 
This is a good problem to have because it means our business is expanding, but it also poses some tricky technical challenges</em></p></blockquote><h4>&#128214;&#9478;<a href="https://www.startdataengineering.com/post/de_best_practices_log/">Data Engineering Best Practices - #2. Metadata &amp; Logging</a></h4><p>&#9997; <a href="https://www.linkedin.com/in/josephmachado1991/">Joseph M.</a></p><blockquote><p><em>Dealing with breaking pipelines, debugging why they failed, and putting up a fix are everyday tasks for a data engineer.</em></p></blockquote><h4>&#128214;&#9478;<a href="https://calpaterson.com/s3.html">S3 is files, but not a filesystem</a></h4><p>&#9997; <a href="https://calpaterson.com/about.html">Cal Paterson</a></p><blockquote><p><em>"Deep" modules, mismatched interfaces - and why SAP is so painful</em></p></blockquote><h4>&#128214;&#9478;<a href="https://engineeringblog.yelp.com/2024/03/building-data-abstractions-with-streaming-at-yelp.html">Building data abstractions with streaming at Yelp</a></h4><p>&#9997; <a href="https://www.linkedin.com/in/hakampreet-singh-pandher-88a50484/">Hakampreet Singh Pandher</a></p><blockquote><p><em>This blog post covers how we leverage Yelp&#8217;s extensive streaming infrastructure to build robust data abstractions for our offline and streaming data consumers. 
We will use Yelp&#8217;s Business Properties ecosystem (explained in the upcoming sections) as an example.</em></p></blockquote><h4>&#128214;&#9478;<a href="https://medium.pimpaudben.fr/airflow-kestra-a-simple-benchmark-ffc5a533aa85">Airflow &amp; Kestra: a Simple Benchmark</a></h4><p>&#9997; <a href="https://medium.pimpaudben.fr/?source=post_page-----ffc5a533aa85--------------------------------">Benoit Pimpaud</a></p><blockquote><p><em>This post compares Airflow and Kestra, focusing on installation, configuration, pipeline syntax, and performance.</em></p></blockquote><h4>&#128214;&#9478;<a href="https://eng.lyft.com/postgres-aurora-db-major-version-upgrade-with-minimal-downtime-4e26178f07a0">Postgres Aurora DB major version upgrade with minimal downtime</a></h4><p>&#9997; <a href="https://medium.com/@pjay5334?source=post_page-----4e26178f07a0--------------------------------">Jay Patel</a></p><blockquote><p><em>Our payment platform team had the unique challenge to upgrade our Aurora Postgres DB from v10 to v13. This DB was responsible for storing transactions within Lyft and contains ~400 tables (with partitions) and ~30TB of data. Upgrading the database in-place would have resulted in ~30 mins of downtime. Such significant downtime is untenable &#8212; it would cause cascading failures across multiple downstream services, requiring a large amount of engineering effort to remediate.</em></p></blockquote><h4>&#128214;&#9478;<a href="https://www.theseattledataguy.com/apache-druids-architecture-how-druid-processes-data-in-real-time-at-scale/">Apache Druid&#8217;s Architecture &#8211; How Druid Processes Data In Real Time At Scale</a></h4><p>&#9997; <a href="https://www.theseattledataguy.com/data-science-consultants/#page-content">Ben Rogojan</a></p><blockquote><p><em>Apache Druid has several unique features that allow it to be used as a real-time OLAP. 
Everything from its various nodes and processes that each have unique functionality that let it scale to the fact that the data is indexed to be pulled quickly and efficiently.</em></p></blockquote><h4>&#128214;&#9478;<a href="https://blog.dataengineer.io/p/how-to-save-millions-by-optimizing">How to save millions by optimizing data pipeline shuffling</a></h4><p>&#9997; <a href="https://substack.com/profile/10367987-zach-wilson">Zach Wilson</a></p><blockquote><p><em>In this article we will be going over: - Why does shuffle happen and what SQL keywords trigger shuffle and which do not? - Some techniques you can use to minimize shuffle especially in Apache Spark</em></p></blockquote><h4>&#128214;&#9478;<a href="https://medium.com/illuminations-mirror/a-look-back-at-key-trends-in-data-infrastructure-in-2023-by-four-industry-founders-7d1f7ae0d46f">A Look Back at Key Trends in Data Infrastructure in 2023 by Four Industry Founders</a></h4><p>&#9997; <a href="https://medium.com/@RisingWave_Engineering?source=post_page-----7d1f7ae0d46f--------------------------------">RisingWave Labs</a></p><blockquote><p><em>The discussion with the four founders of data infrastructure startups focused on key trends in the industry for 2023.</em></p></blockquote><h4>&#128214;&#9478;<a href="https://blog.allegro.tech/2024/03/kafka-performance-analysis.html">Unlocking Kafka's Potential: Tackling Tail Latency with eBPF</a></h4><p>&#9997; <a href="https://www.linkedin.com/in/moscickimaciej/">Maciej Mo&#347;cicki</a> + <a href="https://www.linkedin.com/in/piotrrzysko/">Piotr R&#380;ysko</a></p><blockquote><p><em>At Allegro, we use Kafka as a backbone for asynchronous communication between microservices. With up to 300k messages published and 1M messages consumed every second, it is a key part of our infrastructure. 
A few months ago, in our main Kafka cluster, we noticed the following discrepancy: while median response times for produce requests were in single-digit milliseconds, the tail latency was much worse. Namely, the p99 latency was up to 1 second, and the p999 latency was up to 3 seconds. This was unacceptable for a new project that we were about to start, so we decided to look into this issue. In this blog post, we would like to describe our journey &#8212; how we used Kafka protocol sniffing and eBPF to identify and remove the performance bottleneck.</em></p></blockquote><div><hr></div><h1>&#9999; Data</h1><blockquote><p><em>The one thing that this job has taught me is that truth is stranger than fiction. &#8212; Predestination (2014)</em></p></blockquote><h4>&#128214;&#9478;<a href="https://www.linkedin.com/blog/engineering/data-management/scalable-automated-config-driven-data-validation">Scalable Automated Config-Driven Data Validation with ValiData</a></h4><p>&#9997; <a href="https://www.linkedin.com/in/bhar2201">Bharadwaj Jayaraman</a></p><blockquote><p><em>ValiData is a scalable automated config-driven data validation tool extensively used in LinkedIn that compares metric values of test datasets against production or source-of-truth datasets and highlights differences in metric values across dimensions.</em></p></blockquote><div><hr></div><h1>&#129302; AI&#9478;ML&#9478;Data Science</h1><blockquote><p><em>You know, Burke, I don&#8217;t know which species is worse. 
&#8212; Ripley, Aliens (1986)</em></p></blockquote><h4>&#128214;&#9478;<a href="https://medium.com/@AnalyticsAtMeta/how-meta-tests-products-with-strong-network-effects-96003a056c2c">How Meta tests products with strong network effects</a></h4><p>&#9997; <a href="https://medium.com/@AnalyticsAtMeta?source=post_page-----96003a056c2c--------------------------------">Analytics at Meta</a></p><blockquote><p><em>I&#8217;m a member of a team that&#8217;s been applying cluster experimentation to products with strong network effects, such as chat and calling, since 2018. Today, I&#8217;d like to give an overview of the challenges we face in these highly-interactive domains, and how one solution &#8212; cluster experiments &#8212; has become a go-to method for addressing these challenges.</em></p></blockquote><h4>&#128214;&#9478;<a href="https://stackoverflow.blog/2024/02/07/best-practices-for-building-llms/">Best practices for building LLMs</a></h4><p>&#9997; <a href="https://stackoverflow.blog/author/nitzan-gado/">Nitzan Gado</a> + <a href="https://stackoverflow.blog/author/oren-dar/">Oren Dar</a></p><blockquote><p><em>Intuit shares what they've learned building multiple LLMs for their generative AI operating system.</em></p></blockquote><h4>&#128214;&#9478;<a href="https://doordash.engineering/2024/03/12/improving-etas-with-multi-task-models-deep-learning-and-probabilistic-forecasts/">Improving ETAs with Multi-Task Models, Deep Learning, and Probabilistic Forecasts</a></h4><p>&#9997; <a href="https://doordash.engineering/blog/">Doordash Engineering Blog</a></p><blockquote><p><em>The DoorDash ETA team is committed to providing an accurate and reliable estimated time of arrival (ETA) as a cornerstone DoorDash consumer experience. 
We want to ensure that every customer can trust our ETAs, ensuring a high-quality experience in which their food arrives on time every time.</em></p></blockquote><h4>&#128214;&#9478;<a href="https://engineering.fb.com/2024/03/12/data-center-engineering/building-metas-genai-infrastructure/">Building Meta&#8217;s GenAI Infrastructure</a></h4><p>&#9997; <a href="https://engineering.fb.com/author/kevin-lee/">Kevin Lee</a> + <a href="https://engineering.fb.com/author/adi-gangidi/">Adi Gangidi</a> + <a href="https://engineering.fb.com/author/mathew-oldham/">Mathew Oldham</a></p><blockquote><p><em>Marking a major investment in Meta&#8217;s AI future, we are announcing two 24k GPU clusters. We are sharing details on the hardware, network, storage, design, performance, and software that help us extract high throughput and reliability for various AI workloads. We use this cluster design for Llama 3 training.</em></p></blockquote><div><hr></div><h1>&#128293; Catch up</h1><blockquote><p><em>&#8230;Next Saturday night, we're sending you back to the future! &#8212; Dr. 
Emmett Brown, Back to the Future (1985)</em></p></blockquote><h4>&#128214;&#9478;LinkedIn <a href="https://www.linkedin.com/blog/engineering/open-source/open-sourcing-openhouse">Open Sources OpenHouse</a>: A Control Plane for Managing Tables in a Data Lakehouse</h4><h4>&#128214;&#9478;<a href="https://www.linkedin.com/posts/apache-xtable_onetable-is-now-apache-xtable-incubating-activity-7173282583730962432-AGGi?utm_source=share&amp;utm_medium=member_desktop">OneTable is now Apache XTable</a></h4><h4>&#128214;&#9478;<strong><a href="https://arrow.apache.org/blog/2024/03/06/comet-donation/">Announcing Apache Arrow DataFusion Comet</a></strong></h4><div><hr></div><h1>&#128160; Previously on Dimension</h1><blockquote><p><em>Dimension is my sub-newsletter where I note down things I learn from people smarter than me in the data engineering field.</em></p></blockquote><p>Here are the 3 latest articles:</p><h3><em><strong>Published on 2024, March 2:</strong></em></h3><div class="digest-post-embed" data-attrs="{&quot;nodeId&quot;:&quot;51414477-0079-4d8b-95e9-8d59ef535d73&quot;,&quot;caption&quot;:&quot;My name is Vu Trinh, and I am a data engineer. I&#8217;m trying to make my life less dull by spending time learning and researching &#8220;how it works&#8220; in the data engineering field. Here is a place where I share everything I&#8217;ve learned. Not subscribe yet? Here you go:&quot;,&quot;cta&quot;:null,&quot;showBylines&quot;:true,&quot;size&quot;:&quot;sm&quot;,&quot;isEditorNode&quot;:true,&quot;title&quot;:&quot;I spent 7 hours reading another paper to understand more about Snowflake's internal. 
Here's what I found.&quot;,&quot;publishedBylines&quot;:[{&quot;id&quot;:167177248,&quot;name&quot;:&quot;Vu Trinh&quot;,&quot;bio&quot;:&quot;This is me&quot;,&quot;photo_url&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/4805f673-db97-4f7c-85c4-44b345a8de80_256x256.png&quot;,&quot;is_guest&quot;:false,&quot;bestseller_tier&quot;:null}],&quot;post_date&quot;:&quot;2024-03-02T11:01:04.261Z&quot;,&quot;cover_image&quot;:&quot;https://substackcdn.com/image/fetch/f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7814dfda-092b-4017-a41b-03a002fba86e_1393x993.png&quot;,&quot;cover_image_alt&quot;:null,&quot;canonical_url&quot;:&quot;https://vutr.substack.com/p/i-read-another-paper-to-understand&quot;,&quot;section_name&quot;:&quot;Dimensions.&quot;,&quot;video_upload_id&quot;:null,&quot;id&quot;:141591854,&quot;type&quot;:&quot;newsletter&quot;,&quot;reaction_count&quot;:0,&quot;comment_count&quot;:0,&quot;publication_id&quot;:null,&quot;publication_name&quot;:&quot;VuTrinh.&quot;,&quot;publication_logo_url&quot;:&quot;https://substackcdn.com/image/fetch/f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffb91e59e-4df2-4ac8-b3b4-1f4e830f7b42_256x256.png&quot;,&quot;belowTheFold&quot;:true,&quot;youtube_url&quot;:null,&quot;show_links&quot;:null,&quot;feed_url&quot;:null}"></div><h3><em><strong>Published on 2024, March 9:</strong></em></h3><div class="digest-post-embed" data-attrs="{&quot;nodeId&quot;:&quot;4a7cd5fa-0b55-4e98-9075-497af3bf1fbd&quot;,&quot;caption&quot;:&quot;I updated the post because it initially had duplicate sections. My name is Vu Trinh, and I am a data engineer. I&#8217;m trying to make my life less dull by spending time learning and researching &#8220;how it works&#8220; in the data engineering field. 
Here is a place where I share everything I&#8217;ve learned.&quot;,&quot;cta&quot;:null,&quot;showBylines&quot;:true,&quot;size&quot;:&quot;sm&quot;,&quot;isEditorNode&quot;:true,&quot;title&quot;:&quot;If I could travel back to 5 years ago, what would I talk to myself about Docker?&quot;,&quot;publishedBylines&quot;:[{&quot;id&quot;:167177248,&quot;name&quot;:&quot;Vu Trinh&quot;,&quot;bio&quot;:&quot;This is me&quot;,&quot;photo_url&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/4805f673-db97-4f7c-85c4-44b345a8de80_256x256.png&quot;,&quot;is_guest&quot;:false,&quot;bestseller_tier&quot;:null}],&quot;post_date&quot;:&quot;2024-03-09T11:00:23.414Z&quot;,&quot;cover_image&quot;:&quot;https://substackcdn.com/image/fetch/f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F37ba199a-a064-4d0b-97e9-3ce8b77b3a5c_1399x998.png&quot;,&quot;cover_image_alt&quot;:null,&quot;canonical_url&quot;:&quot;https://vutr.substack.com/p/i-spent-3-hours-to-understand-more&quot;,&quot;section_name&quot;:&quot;Dimensions.&quot;,&quot;video_upload_id&quot;:null,&quot;id&quot;:141885092,&quot;type&quot;:&quot;newsletter&quot;,&quot;reaction_count&quot;:4,&quot;comment_count&quot;:2,&quot;publication_id&quot;:null,&quot;publication_name&quot;:&quot;VuTrinh.&quot;,&quot;publication_logo_url&quot;:&quot;https://substackcdn.com/image/fetch/f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffb91e59e-4df2-4ac8-b3b4-1f4e830f7b42_256x256.png&quot;,&quot;belowTheFold&quot;:true,&quot;youtube_url&quot;:null,&quot;show_links&quot;:null,&quot;feed_url&quot;:null}"></div><h3><em><strong>Published on 2024, March 16:</strong></em></h3><div class="digest-post-embed" data-attrs="{&quot;nodeId&quot;:&quot;599861a8-c4a5-4c4a-8c99-d331a5e075ef&quot;,&quot;caption&quot;:&quot;My name is Vu Trinh, and I am a data engineer. 
I&#8217;m trying to make my life less dull by spending time learning and researching &#8220;how it works&#8220; in the data engineering field. Here is a place where I share everything I&#8217;ve learned. Not subscribe yet? Here you go:&quot;,&quot;cta&quot;:null,&quot;showBylines&quot;:true,&quot;size&quot;:&quot;sm&quot;,&quot;isEditorNode&quot;:true,&quot;title&quot;:&quot;I spent another 8 hours understanding the design of Amazon Redshift. Here's what I found.&quot;,&quot;publishedBylines&quot;:[{&quot;id&quot;:167177248,&quot;name&quot;:&quot;Vu Trinh&quot;,&quot;bio&quot;:&quot;This is me&quot;,&quot;photo_url&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/4805f673-db97-4f7c-85c4-44b345a8de80_256x256.png&quot;,&quot;is_guest&quot;:false,&quot;bestseller_tier&quot;:null}],&quot;post_date&quot;:&quot;2024-03-16T11:00:30.199Z&quot;,&quot;cover_image&quot;:&quot;https://substackcdn.com/image/fetch/f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F809f83a1-a889-470c-a7a9-8a68dd3003a7_1401x999.png&quot;,&quot;cover_image_alt&quot;:null,&quot;canonical_url&quot;:&quot;https://vutr.substack.com/p/i-spent-another-8-hours-understanding&quot;,&quot;section_name&quot;:&quot;Dimensions.&quot;,&quot;video_upload_id&quot;:null,&quot;id&quot;:142050038,&quot;type&quot;:&quot;newsletter&quot;,&quot;reaction_count&quot;:1,&quot;comment_count&quot;:6,&quot;publication_id&quot;:null,&quot;publication_name&quot;:&quot;VuTrinh.&quot;,&quot;publication_logo_url&quot;:&quot;https://substackcdn.com/image/fetch/f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffb91e59e-4df2-4ac8-b3b4-1f4e830f7b42_256x256.png&quot;,&quot;belowTheFold&quot;:true,&quot;youtube_url&quot;:null,&quot;show_links&quot;:null,&quot;feed_url&quot;:null}"></div><div class="pullquote"><p>Let me hear your voice, for example: </p><p>'Your newsletter is so terrible, I can't 
handle it anymore.'</p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://vutr.substack.com/p/groupby-27-balancing-hdfs-datanodes/comments&quot;,&quot;text&quot;:&quot;Leave a comment&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://vutr.substack.com/p/groupby-27-balancing-hdfs-datanodes/comments"><span>Leave a comment</span></a></p><div><hr></div><h2>&#8220;Hasta la vista, baby&#8221; -T800, Terminator 2: Judgment Day (1991)</h2></div><div class="subscription-widget-wrap-editor" data-attrs="{&quot;url&quot;:&quot;https://vutr.substack.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe&quot;,&quot;language&quot;:&quot;en&quot;}" data-component-name="SubscribeWidgetToDOM"><div class="subscription-widget show-subscribe"><div class="preamble"><p class="cta-caption">Thanks for scrolling this far! There's a convenient subscribe box here if you want me to annoy you every week. &#128516;</p></div><form class="subscription-widget-subscribe"><input type="email" class="email-input" name="email" placeholder="Type your email&#8230;" tabindex="-1"><input type="submit" class="button primary" value="Subscribe"><div class="fake-input-wrapper"><div class="fake-input"></div><div class="fake-button"></div></div></form></div></div><p></p>]]></content:encoded></item><item><title><![CDATA[GroupBy #26: How GitHub uses merge queue to ship hundreds of changes every day, Data governance in the age of generative AI, "Good Enough" Data Models]]></title><description><![CDATA[Plus: Why the 100x analyst doesn&#8217;t exist, What Is Trustworthy AI?]]></description><link>https://vutr.substack.com/p/groupby-26-how-github-uses-merge</link><guid isPermaLink="false">https://vutr.substack.com/p/groupby-26-how-github-uses-merge</guid><dc:creator><![CDATA[Vu Trinh]]></dc:creator><pubDate>Tue, 12 Mar 2024 11:02:09 GMT</pubDate><enclosure 
url="https://substackcdn.com/image/fetch/f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3ed9e59a-5fd0-440f-b495-f66e7da380de_1400x1000.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p><em>This is <strong>GroupBy</strong>, where I share the resources I learn from people smarter than me in the data engineering field.</em></p><p><em>Not subscribed yet? Here you go:</em></p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://vutr.substack.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe now&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://vutr.substack.com/subscribe?"><span>Subscribe now</span></a></p><div class="pullquote"><div class="install-substack-app-embed install-substack-app-embed-web" data-component-name="InstallSubstackAppToDOM"><img class="install-substack-app-embed-img" src="https://substackcdn.com/image/fetch/$s_!D8N-!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fvutr.substack.com%2Fimg%2Fsubstack.png"><div class="install-substack-app-embed-text"><div class="install-substack-app-header">Get more from Vu Trinh in the Substack app</div><div class="install-substack-app-text">Available for iOS and Android</div></div><a href="https://substack.com/app/app-store-redirect?utm_campaign=app-marketing&amp;utm_content=author-post-insert&amp;utm_source=vutr" target="_blank" class="install-substack-app-embed-link"><button class="install-substack-app-embed-btn button primary">Get the app</button></a></div></div><blockquote><p><em>&#128075; Hi, my name is Vu Trinh, a data engineer.</em></p><p><em>I enjoy reading <strong>good stuff</strong>  (related to data and engineering), and this newsletter is my effort on the journey to seek the "good stuff" across the entire Internet. 
</em></p><p><em>Hope this issue finds you well.</em></p></blockquote><div><hr></div><h2>Referral Program</h2><p>I&#8217;m launching a referral program to grow the community by giving you guys valuable gifts whenever you reach a referral milestone. The condition is simple: you refer friends to subscribe to my newsletter, and you will receive a gift based on the number of friends you refer. Here are the reward milestones:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!lf_-!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa4c72d52-a2c4-4e24-9714-04e72a4dc087_756x361.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!lf_-!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa4c72d52-a2c4-4e24-9714-04e72a4dc087_756x361.png 424w, https://substackcdn.com/image/fetch/$s_!lf_-!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa4c72d52-a2c4-4e24-9714-04e72a4dc087_756x361.png 848w, https://substackcdn.com/image/fetch/$s_!lf_-!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa4c72d52-a2c4-4e24-9714-04e72a4dc087_756x361.png 1272w, https://substackcdn.com/image/fetch/$s_!lf_-!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa4c72d52-a2c4-4e24-9714-04e72a4dc087_756x361.png 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!lf_-!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa4c72d52-a2c4-4e24-9714-04e72a4dc087_756x361.png" width="756" height="361" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/a4c72d52-a2c4-4e24-9714-04e72a4dc087_756x361.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:361,&quot;width&quot;:756,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:112870,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!lf_-!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa4c72d52-a2c4-4e24-9714-04e72a4dc087_756x361.png 424w, https://substackcdn.com/image/fetch/$s_!lf_-!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa4c72d52-a2c4-4e24-9714-04e72a4dc087_756x361.png 848w, https://substackcdn.com/image/fetch/$s_!lf_-!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa4c72d52-a2c4-4e24-9714-04e72a4dc087_756x361.png 1272w, https://substackcdn.com/image/fetch/$s_!lf_-!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa4c72d52-a2c4-4e24-9714-04e72a4dc087_756x361.png 1456w" sizes="100vw" fetchpriority="high"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" 
class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>Now, let&#8217;s refer friends and claim exciting rewards ;)</p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://vutr.substack.com/leaderboard?&amp;utm_source=post&quot;,&quot;text&quot;:&quot;Refer a friend&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://vutr.substack.com/leaderboard?&amp;utm_source=post"><span>Refer a friend</span></a></p><div><hr></div><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" 
href="https://substackcdn.com/image/fetch/$s_!urFD!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3ed9e59a-5fd0-440f-b495-f66e7da380de_1400x1000.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!urFD!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3ed9e59a-5fd0-440f-b495-f66e7da380de_1400x1000.png 424w, https://substackcdn.com/image/fetch/$s_!urFD!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3ed9e59a-5fd0-440f-b495-f66e7da380de_1400x1000.png 848w, https://substackcdn.com/image/fetch/$s_!urFD!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3ed9e59a-5fd0-440f-b495-f66e7da380de_1400x1000.png 1272w, https://substackcdn.com/image/fetch/$s_!urFD!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3ed9e59a-5fd0-440f-b495-f66e7da380de_1400x1000.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!urFD!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3ed9e59a-5fd0-440f-b495-f66e7da380de_1400x1000.png" width="1400" height="1000" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/3ed9e59a-5fd0-440f-b495-f66e7da380de_1400x1000.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1000,&quot;width&quot;:1400,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:1078603,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!urFD!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3ed9e59a-5fd0-440f-b495-f66e7da380de_1400x1000.png 424w, https://substackcdn.com/image/fetch/$s_!urFD!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3ed9e59a-5fd0-440f-b495-f66e7da380de_1400x1000.png 848w, https://substackcdn.com/image/fetch/$s_!urFD!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3ed9e59a-5fd0-440f-b495-f66e7da380de_1400x1000.png 1272w, https://substackcdn.com/image/fetch/$s_!urFD!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3ed9e59a-5fd0-440f-b495-f66e7da380de_1400x1000.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" 
xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><div><hr></div><h1>&#128200; Career</h1><blockquote><p><em>Don't let comfort hold you back.</em></p></blockquote><h4>&#128214;&#9478;<a href="https://github.blog/2024-03-07-hard-and-soft-skills-for-developers-coding-in-the-age-of-ai/">Hard and soft skills for developers coding in the age of AI</a></h4><p>&#9997; <a href="https://github.blog/author/saraverdi/">Sara Verdi</a></p><blockquote><p><em>While AI revolutionizes software development, it still relies on developers to pilot its use. In this blog, we&#8217;ll cover the skills that developers need to have for navigating this new AI-powered coding frontier.</em></p></blockquote><h4>&#128214;&#9478;<a href="https://www.confessionsofadataguy.com/the-best-piece-of-software-engineering-advice/">The Best Piece of Software Engineering Advice - Confessions of a Data Guy</a></h4><p>&#9997; <a href="https://github.com/danielbeach">Daniel Beach</a></p><blockquote><p><em>You don&#8217;t need to be the smartest person in the room. 
In fact, you shouldn&#8217;t be.</em></p></blockquote><h4>&#128214;&#9478;<a href="https://mikkeldengsoe.substack.com/p/the-100x-analyst">Why the 100x analyst doesn&#8217;t exist</a></h4><p>&#9997; <a href="https://substack.com/profile/6080775-mikkel-dengse">Mikkel Dengs&#248;e</a></p><blockquote><p><em>Much has been written about the 10x &#8211; heck, 100x &#8211; engineer. The mystical creature that ships features in hours instead of months and perseveres where others back down. Corny as it may sound, I think it&#8217;s a pretty good representation of how the world works. But does the 10x analyst exist? I believe so. How about the 100x analyst? I&#8217;m not sure.</em></p></blockquote><h4>&#128214;&#9478;<a href="https://medium.com/everestengineering/how-to-build-a-modern-data-team-seven-tips-for-success-a4d97e427d45">How to Build a Modern Data Team? Seven tips for success</a></h4><p>&#9997; <a href="https://medium.com/@martin.chesbrough?source=post_page-----a4d97e427d45--------------------------------">Martin Chesbrough</a></p><blockquote><p><em>I spend my time with clients of Everest Engineering to help them build data platforms, data products and data teams. Learning from them let me share the 7 things that I think are important.</em></p></blockquote><div><hr></div><h1>&#128640; Engineering</h1><blockquote><p><em>I have to believe in a world outside my own mind. &#8212; Memento (2000)</em></p></blockquote><h4>&#128214;&#9478;<a href="https://www.uber.com/en-SG/blog/load-balancing-handling-heterogeneous-hardware/">Load Balancing: Handling Heterogeneous Hardware</a></h4><p>&#9997; <a href="https://www.uber.com/en-SG/blog/engineering/?uclick_id=15b6739c-0acd-406e-bdf6-884992beefa0">Uber Engineering Blog</a></p><blockquote><p><em>This blog post describes Uber&#8217;s journey towards utilizing hardware efficiently via better load balancing. 
The work described here lasted over a year, involved engineers across multiple teams, and delivered significant efficiency savings. The article covers the technical solutions and our discovery process to get to them&#8211;in many ways, the journey was harder than the destination.</em></p></blockquote><h4>&#128214;&#9478;<a href="https://motherduck.com/blog/perf-is-not-enough/">PERF IS NOT ENOUGH</a></h4><p>&#9997; <a href="https://motherduck.com/authors/jordan-tigani/">Jordan Tigani</a></p><blockquote><p><em>Performance in general, and general-purpose benchmarking in particular, is a poor way to choose a database. You&#8217;re better off making decisions based on ease of use, ecosystem, velocity of updates, or how well it integrates with your workflow. At best, performance is a point-in-time view of the time it will take to complete certain tasks; at worst, however, it leads you to optimize for the wrong things.</em></p></blockquote><h4>&#128214;&#9478;<a href="https://martinfowler.com/articles/rotate-pairs-experiment.html">What if we rotate pairs every day?</a></h4><p>&#9997; <a href="https://www.linkedin.com/in/gabriel-robaina/">Gabriel Robaina</a> + Kieran Murphy</p><blockquote><p><em>We developed a lightweight methodology to help teams reflect on the benefits and challenges of pairing and how to solve them. Initial fears were overcome and teams discovered the benefits of frequently rotating pairs. We learned that pair swapping frequently greatly enhances the benefits of pairing. 
Here we share the methodology we developed, our observations, and some common fears and insight shared by the participating team members.</em></p></blockquote><h4>&#128214;&#9478;<a href="https://github.blog/2024-03-04-keeping-repository-maintainer-information-accurate/">Keeping repository maintainer information accurate</a></h4><p>&#9997; <a href="https://github.blog/author/zkoppert/">Zack Koppert</a></p><blockquote><p><em>Discover how keeping repository maintainer information accurate through CODEOWNERS files and automating maintenance with tools like cleanowners fosters efficient collaboration and sustainable software projects.</em></p></blockquote><h4>&#128214;&#9478;<a href="https://www.databricks.com/blog/simplify-pyspark-testing-dataframe-equality-functions">Simplify PySpark testing with DataFrame equality functions</a></h4><p>&#9997; <a href="https://www.databricks.com/blog/author/haejoon-lee">Haejoon Lee</a> + <a href="https://www.databricks.com/blog/author/allison-wang">Allison Wang</a> + <a href="https://www.databricks.com/blog/author/amanda-liu">Amanda Liu</a></p><blockquote><p><em>Introducing PySpark DataFrame equality test functions: a new set of test functions in Apache Spark. 
Discover how easy it is to validate data transformations with the new functions through hands-on examples.</em></p></blockquote><h4>&#128214;&#9478;<a href="https://github.blog/2024-03-06-how-github-uses-merge-queue-to-ship-hundreds-of-changes-every-day/">How GitHub uses merge queue to ship hundreds of changes every day</a></h4><p>&#9997; <a href="https://github.blog/author/willsmythe/">Will Smythe</a> + <a href="https://github.blog/author/lawrencegripper/">Lawrence Gripper</a></p><blockquote><p><em>Here's how merge queue transformed the way GitHub deploys changes to production at scale, so you can do the same for your organization.</em></p></blockquote><h4>&#128214;&#9478;<a href="https://eng.lyft.com/python-upgrade-playbook-1479145d52f4">Python Upgrade Playbook</a></h4><p>&#9997; <a href="https://medium.com/@aneeshusa">Aneesh Agrawal</a></p><blockquote><p><em>In this post, we&#8217;ll cover how Lyft upgrades Python at scale &#8212; 1500+ repos spanning 150+ teams &#8212; and the latest iteration of the tools and strategy we&#8217;ve built to optimize both the overall time to upgrade and the work required from our engineers. We&#8217;ve successfully used (and evolved) this playbook over multiple upgrades, from Python 2 to Python 3.10 and hope you find it useful!</em></p></blockquote><h4>&#128214;&#9478;<a href="https://www.uber.com/en-SG/blog/building-scalable-real-time-chat/?uclick_id=15b6739c-0acd-406e-bdf6-884992beefa0">Building Scalable, Real-Time Chat to Improve Customer Experience</a></h4><p>&#9997; <a href="https://www.uber.com/en-SG/blog/engineering/?uclick_id=15b6739c-0acd-406e-bdf6-884992beefa0">Uber Engineering Blog</a></p><blockquote><p><em>With millions of support interactions (known internally as contacts) being raised by Uber customers every week, our goal is to resolve these contacts within a predefined service level agreement (SLA). 
Contacts created by customers are resolved either via automation or with help from a customer support agent.</em></p></blockquote><h4>&#128214;&#9478;<a href="https://arnon.dk/the-14-pains-of-billing/?utm_source=newsletter.programmingdigest.net&amp;utm_medium=referral&amp;utm_campaign=pains-of-building-your-own-billing-system">The 14 pains of building your own billing system</a></h4><p>&#9997;<a href="https://www.linkedin.com/in/arnon-shimoni/"> Arnon Shimoni</a></p><blockquote><p><em>I&#8217;ve seen them likened to an octopus, and I fully agree. They touch finance, product, experience, customer support, customers, legal, compliance, sales, and sometimes more.</em></p></blockquote><div><hr></div><h1>&#9999; Data</h1><blockquote><p><em>The one thing that this job has taught me is that truth is stranger than fiction. &#8212; Predestination (2014)</em></p></blockquote><h4>&#128214;&#9478;<a href="https://practicaldatamodeling.substack.com/p/good-enough-data-models">"Good Enough" Data Models</a></h4><p>&#9997; <a href="https://substack.com/@joereis">Joe Reis</a></p><blockquote><p><em>Data modeling along the spectrum of perfect vs "good enough"</em></p></blockquote><h4>&#128214;&#9478;<a href="https://sqlpatterns.com/p/the-root-cause-of-all-problems-in">The Root Cause of All Problems in Data - Revisited</a></h4><p>&#9997; <a href="https://substack.com/@ergestx">Ergest Xheblati</a></p><blockquote><p><em>What about a people and process problem? What does that mean? 
Does it have to do with people&#8217;s unwillingness to adopt data driven decisions or do we lack a good methodology for being data driven?</em></p></blockquote><h4>&#128214;&#9478;<a href="https://piethein.medium.com/data-quality-within-lakehouses-0c9417ce0487">Data Quality within Lakehouses</a></h4><p>&#9997; <a href="https://piethein.medium.com/?source=post_page-----0c9417ce0487--------------------------------">Piethein Strengholt</a></p><blockquote><p><em>A deep dive into data quality using bronze, silver, and gold layered architectures.</em></p></blockquote><h4>&#128214;&#9478;<a href="https://substack.timodechau.com/p/how-to-measure-a-data-platform">How to measure a data platform?</a></h4><p>&#9997; <a href="https://substack.com/profile/29441309-timo-dechau">Timo Dechau</a></p><blockquote><p><em>Product analytics for data products</em></p></blockquote><h4>&#128214;&#9478;<a href="https://faithfacts.substack.com/p/dbts-model-groups-and-access-for">dbt&#8217;s Model Groups &amp; Access for Dummies</a></h4><p>&#9997; <a href="https://substack.com/profile/11412853-faith-lierheimer">Faith Lierheimer</a></p><blockquote><p><em>Welcome to another edition of &#8220;Faith takes you on the ride while she learns how to do her job.&#8221; Today, we learn about a handful of dbt&#8217;s model governance features.</em></p></blockquote><h4>&#128214;&#9478;<a href="https://roundup.getdbt.com/p/making-friends-with-the-truth">Making friends with the truth</a></h4><p>&#9997; <a href="https://substack.com/profile/73769889-jason-ganz">Jason Ganz</a></p><blockquote><p><em>So how can we become data driven?</em></p></blockquote><div><hr></div><h1>&#129302; AI&#9478;ML&#9478;Data Science</h1><blockquote><p><em>You know, Burke, I don&#8217;t know which species is worse. 
&#8212; Ripley, Aliens (1986)</em></p></blockquote><h4>&#128214;&#9478;<a href="https://netflixtechblog.com/supporting-diverse-ml-systems-at-netflix-2d2e6b6d205d">Supporting Diverse ML Systems at Netflix</a></h4><p>&#9997; <a href="https://netflixtechblog.medium.com/?source=post_page-----2d2e6b6d205d--------------------------------">Netflix Technology Blog</a></p><blockquote><p><em>In this article, we cover a few key integrations that we provide for various layers of the Metaflow stack at Netflix, as illustrated above. We will also showcase real-life ML projects that rely on them, to give an idea of the breadth of projects we support. Note that all projects leverage multiple integrations, but we highlight them in the context of the integration that they use most prominently. Importantly, all the use cases were engineered by practitioners themselves.</em></p></blockquote><h4>&#128214;&#9478;<a href="https://www.thdpth.com/p/the-inevitability-of-ai-in-every">The Inevitability of AI in Every Facet of Your Business: 11 Deep Thoughts With Lasting Impact</a></h4><p>&#9997; <a href="https://substack.com/profile/229923-sven-balnojan-phd">Sven Balnojan PhD</a></p><blockquote><p><em>I want to share a collection of thoughts, quote from leaders in data &amp; AI, and a few thoughts I have on them, to get you thinking. 
I have no clear answer to any of the questions they raise, and I keep coming back to them, revising my opinion over time.</em></p></blockquote><h4>&#128214;&#9478;<a href="https://engineering.atspotify.com/2024/03/risk-aware-product-decisions-in-a-b-tests-with-multiple-metrics/">Risk-Aware Product Decisions in A/B Tests with Multiple Metrics</a></h4><p>&#9997; <a href="https://www.linkedin.com/in/m%C3%A5rtenschultzberg/">M&#229;rten Schultzberg</a> + <a href="https://www.linkedin.com/in/sebastianankargren/">Sebastian Ankargren</a> + <a href="https://www.linkedin.com/in/mattias-fr%C3%A5nberg-6b631432/">Mattias Fr&#229;nberg</a></p><blockquote><p><em>We summarize the findings in our recent paper, Schultzberg, Ankargren, and Fr&#229;nberg (2024), where we explain how Spotify&#8217;s decision-making engine works and how the results of multiple metrics in an A/B test are combined into a single product decision.</em></p></blockquote><h4>&#128214;&#9478;<a href="https://medium.com/pinterest-engineering/user-action-sequence-modeling-for-pinterest-ads-engagement-modeling-21139cab8f4e">User Action Sequence Modeling for Pinterest Ads Engagement Modeling</a></h4><p>&#9997; <a href="https://medium.com/@Pinterest_Engineering?source=post_page-----21139cab8f4e--------------------------------">Pinterest Engineering</a></p><blockquote><p><em>In this blog post, we will mainly discuss how we adopt the user sequence features and the follow-up optimization: designed the sequence features; leveraged Transformer for sequence modeling; improved the serving efficiency by half precision inference. 
We will also share how to improve the model stability by Resilient Batch Norm.</em></p></blockquote><h4>&#128214;&#9478;<a href="https://koopingshung.substack.com/p/think-vs-compute-part-2">Think Vs Compute (Part 2)</a></h4><p>&#9997; <a href="https://substack.com/profile/7906875-koo-ping-shung">Koo Ping Shung</a></p><blockquote><p><em>So in this issue, this is a refinement further on what is thinking vs compute over here, or at least the differences between humans and machines. Hopefully by understanding the current differences I can work on how to incorporate them together, moving towards the vision of machine+human.</em></p></blockquote><h4>&#128214;&#9478;<a href="https://netflixtechblog.com/evolving-from-rule-based-classifier-machine-learning-powered-auto-remediation-in-netflix-data-039d5efd115b">Evolving from Rule-based Classifier: Machine Learning Powered Auto Remediation in Netflix Data Platform</a></h4><p>&#9997; <a href="https://netflixtechblog.medium.com/?source=post_page-----039d5efd115b--------------------------------">Netflix Technology Blog</a></p><blockquote><p><em>This is the first of the series of our work at Netflix on leveraging data insights and Machine Learning (ML) to improve the operational automation around the performance and cost efficiency of big data jobs.</em></p></blockquote><h4>&#128214;&#9478;<a href="https://blogs.nvidia.com/blog/what-is-trustworthy-ai/">What Is Trustworthy AI?</a></h4><p>&#9997; <a href="https://blogs.nvidia.com/blog/author/nikkipope/">Nikki Pope</a></p><blockquote><p><em>Trustworthy AI is an approach to AI development that prioritizes safety and transparency for the people who interact with it.</em></p></blockquote><h4>&#128214;&#9478;<a href="https://aws.amazon.com/blogs/big-data/data-governance-in-the-age-of-generative-ai/">Data governance in the age of generative AI</a></h4><p>&#9997; <a href="https://www.linkedin.com/in/krishna-rupanagunta/">Krishna Rupanagunta</a> + <a 
href="https://www.linkedin.com/in/rarni/">Raghvender Arni</a> + <a href="https://www.linkedin.com/in/contacttaz/">Imtiaz Sayed</a></p><blockquote><p><em>In this post, we discuss the data governance needs of generative AI application data pipelines, a critical building block to govern data used by LLMs to improve the accuracy and relevance of their responses to user prompts in a safe, secure, and transparent manner. Enterprises are doing this by using proprietary data with approaches like Retrieval Augmented Generation (RAG), fine-tuning, and continued pre-training with foundation models.</em></p></blockquote><div><hr></div><h1>&#128293; Catch up</h1><blockquote><p><em>&#8230;Next Saturday night, we're sending you back to the future! &#8212; Dr. Emmett Brown, Back to the Future (1985)</em></p></blockquote><h4>&#128214;&#9478;BigQuery&#9478;The <a href="https://cloud.google.com/bigquery/docs/information-schema-write-api">INFORMATION_SCHEMA.WRITE_API_TIMELINE*</a> <strong>views, containing per-minute aggregated BigQuery Storage Write API ingestion statistics, are GA.</strong></h4><h4>&#128214;&#9478;BigQuery&#9478;<a href="https://cloud.google.com/bigquery/docs/write-sql-duet-ai#generate_python_code">Duet AI in BigQuery</a> can now assist with Python code generation and code completion.</h4><div><hr></div><h1>&#128160; Previously on Dimension</h1><blockquote><p><em>Dimension is my sub-newsletter where I note down things I learn from people smarter than me in the data engineering field.</em></p></blockquote><p>Here are the 3 latest articles:</p><h3><em><strong>Published on 2024, February 24:</strong></em></h3><div class="digest-post-embed" data-attrs="{&quot;nodeId&quot;:&quot;229fd815-82a1-4b3d-8894-0bfd87234fd4&quot;,&quot;caption&quot;:&quot;My name is Vu Trinh, and I am a data engineer. I&#8217;m trying to make my life less dull by spending time learning and researching &#8220;how it works&#8220; in the data engineering field. 
Here is a place where I share everything I&#8217;ve learned. Not subscribe yet? Here you go:&quot;,&quot;cta&quot;:null,&quot;showBylines&quot;:true,&quot;size&quot;:&quot;sm&quot;,&quot;isEditorNode&quot;:true,&quot;title&quot;:&quot;I spent 4 hours figuring out how BigQuery executes the SQL query internally. Here's what I found.&quot;,&quot;publishedBylines&quot;:[{&quot;id&quot;:167177248,&quot;name&quot;:&quot;Vu Trinh&quot;,&quot;bio&quot;:&quot;This is me&quot;,&quot;photo_url&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/4805f673-db97-4f7c-85c4-44b345a8de80_256x256.png&quot;,&quot;is_guest&quot;:false,&quot;bestseller_tier&quot;:null}],&quot;post_date&quot;:&quot;2024-02-24T11:01:00.145Z&quot;,&quot;cover_image&quot;:&quot;https://substackcdn.com/image/fetch/f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb755068f-f802-4d6a-b355-beb5ce1330d3_1398x998.png&quot;,&quot;cover_image_alt&quot;:null,&quot;canonical_url&quot;:&quot;https://vutr.substack.com/p/i-spent-4-hours-figuring-out-how&quot;,&quot;section_name&quot;:&quot;Dimensions.&quot;,&quot;video_upload_id&quot;:null,&quot;id&quot;:141697852,&quot;type&quot;:&quot;newsletter&quot;,&quot;reaction_count&quot;:1,&quot;comment_count&quot;:0,&quot;publication_id&quot;:null,&quot;publication_name&quot;:&quot;VuTrinh.&quot;,&quot;publication_logo_url&quot;:&quot;https://substackcdn.com/image/fetch/f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffb91e59e-4df2-4ac8-b3b4-1f4e830f7b42_256x256.png&quot;,&quot;belowTheFold&quot;:true,&quot;youtube_url&quot;:null,&quot;show_links&quot;:null,&quot;feed_url&quot;:null}"></div><h3><em><strong>Published on 2024, March 2:</strong></em></h3><div class="digest-post-embed" data-attrs="{&quot;nodeId&quot;:&quot;51414477-0079-4d8b-95e9-8d59ef535d73&quot;,&quot;caption&quot;:&quot;My name is Vu Trinh, and I am a data engineer. 
I&#8217;m trying to make my life less dull by spending time learning and researching &#8220;how it works&#8220; in the data engineering field. Here is a place where I share everything I&#8217;ve learned. Not subscribe yet? Here you go:&quot;,&quot;cta&quot;:null,&quot;showBylines&quot;:true,&quot;size&quot;:&quot;sm&quot;,&quot;isEditorNode&quot;:true,&quot;title&quot;:&quot;I spent 7 hours reading another paper to understand more about Snowflake's internal. Here's what I found.&quot;,&quot;publishedBylines&quot;:[{&quot;id&quot;:167177248,&quot;name&quot;:&quot;Vu Trinh&quot;,&quot;bio&quot;:&quot;This is me&quot;,&quot;photo_url&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/4805f673-db97-4f7c-85c4-44b345a8de80_256x256.png&quot;,&quot;is_guest&quot;:false,&quot;bestseller_tier&quot;:null}],&quot;post_date&quot;:&quot;2024-03-02T11:01:04.261Z&quot;,&quot;cover_image&quot;:&quot;https://substackcdn.com/image/fetch/f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7814dfda-092b-4017-a41b-03a002fba86e_1393x993.png&quot;,&quot;cover_image_alt&quot;:null,&quot;canonical_url&quot;:&quot;https://vutr.substack.com/p/i-read-another-paper-to-understand&quot;,&quot;section_name&quot;:&quot;Dimensions.&quot;,&quot;video_upload_id&quot;:null,&quot;id&quot;:141591854,&quot;type&quot;:&quot;newsletter&quot;,&quot;reaction_count&quot;:0,&quot;comment_count&quot;:0,&quot;publication_id&quot;:null,&quot;publication_name&quot;:&quot;VuTrinh.&quot;,&quot;publication_logo_url&quot;:&quot;https://substackcdn.com/image/fetch/f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffb91e59e-4df2-4ac8-b3b4-1f4e830f7b42_256x256.png&quot;,&quot;belowTheFold&quot;:true,&quot;youtube_url&quot;:null,&quot;show_links&quot;:null,&quot;feed_url&quot;:null}"></div><h3><em><strong>Published on 2024, March 9:</strong></em></h3><div class="digest-post-embed" 
data-attrs="{&quot;nodeId&quot;:&quot;4a7cd5fa-0b55-4e98-9075-497af3bf1fbd&quot;,&quot;caption&quot;:&quot;I updated the post because it initially had duplicate sections. My name is Vu Trinh, and I am a data engineer. I&#8217;m trying to make my life less dull by spending time learning and researching &#8220;how it works&#8220; in the data engineering field. Here is a place where I share everything I&#8217;ve learned.&quot;,&quot;cta&quot;:null,&quot;showBylines&quot;:true,&quot;size&quot;:&quot;sm&quot;,&quot;isEditorNode&quot;:true,&quot;title&quot;:&quot;If I could travel back to 5 years ago, what would I talk to myself about Docker?&quot;,&quot;publishedBylines&quot;:[{&quot;id&quot;:167177248,&quot;name&quot;:&quot;Vu Trinh&quot;,&quot;bio&quot;:&quot;This is me&quot;,&quot;photo_url&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/4805f673-db97-4f7c-85c4-44b345a8de80_256x256.png&quot;,&quot;is_guest&quot;:false,&quot;bestseller_tier&quot;:null}],&quot;post_date&quot;:&quot;2024-03-09T11:00:23.414Z&quot;,&quot;cover_image&quot;:&quot;https://substackcdn.com/image/fetch/f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F37ba199a-a064-4d0b-97e9-3ce8b77b3a5c_1399x998.png&quot;,&quot;cover_image_alt&quot;:null,&quot;canonical_url&quot;:&quot;https://vutr.substack.com/p/i-spent-3-hours-to-understand-more&quot;,&quot;section_name&quot;:&quot;Dimensions.&quot;,&quot;video_upload_id&quot;:null,&quot;id&quot;:141885092,&quot;type&quot;:&quot;newsletter&quot;,&quot;reaction_count&quot;:4,&quot;comment_count&quot;:2,&quot;publication_id&quot;:null,&quot;publication_name&quot;:&quot;VuTrinh.&quot;,&quot;publication_logo_url&quot;:&quot;https://substackcdn.com/image/fetch/f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffb91e59e-4df2-4ac8-b3b4-1f4e830f7b42_256x256.png&quot;,&quot;belowTheFold&quot;:true,&quot;youtube_url&quot;:null,
&quot;show_links&quot;:null,&quot;feed_url&quot;:null}"></div><div class="pullquote"><p>Let me hear your voice, for example:</p><p>'Your newsletter is so terrible, I can't handle it anymore.'</p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://vutr.substack.com/p/groupby-26-how-github-uses-merge/comments&quot;,&quot;text&quot;:&quot;Leave a comment&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://vutr.substack.com/p/groupby-26-how-github-uses-merge/comments"><span>Leave a comment</span></a></p><div><hr></div><h2>&#8220;Hasta la vista, baby&#8221; -T800, Terminator 2: Judgment Day (1991)</h2></div><div class="subscription-widget-wrap-editor" data-attrs="{&quot;url&quot;:&quot;https://vutr.substack.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe&quot;,&quot;language&quot;:&quot;en&quot;}" data-component-name="SubscribeWidgetToDOM"><div class="subscription-widget show-subscribe"><div class="preamble"><p class="cta-caption">Thanks for scrolling this far! There's a convenient subscribe box here if you want me to annoy you every week. 
&#128516;</p></div><form class="subscription-widget-subscribe"><input type="email" class="email-input" name="email" placeholder="Type your email&#8230;" tabindex="-1"><input type="submit" class="button primary" value="Subscribe"><div class="fake-input-wrapper"><div class="fake-input"></div><div class="fake-button"></div></div></form></div></div><p></p>]]></content:encoded></item><item><title><![CDATA[GroupBy #25: From Samza to Flink: A Decade of Stream Processing, DoorDash’s In-House Search Engine, Meta's DotSlash, Designing Metrics Trees]]></title><description><![CDATA[Plus: How to go from senior to staff data engineer in big tech, Apache Kafka 3.7, PyAirbyte]]></description><link>https://vutr.substack.com/p/groupby-25-from-samza-to-flink-a</link><guid isPermaLink="false">https://vutr.substack.com/p/groupby-25-from-samza-to-flink-a</guid><dc:creator><![CDATA[Vu Trinh]]></dc:creator><pubDate>Tue, 05 Mar 2024 11:01:08 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc8f97dde-1b1e-4cfd-a4a2-9816665c1e68_1400x1000.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p><em>This is <strong>GroupBy</strong>, where I share the resources I learn from people smarter than me in the data engineering field.</em></p><p><em>Not subscribed yet? 
Here you go:</em></p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://vutr.substack.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe now&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://vutr.substack.com/subscribe?"><span>Subscribe now</span></a></p><blockquote><p><em>&#128075; Hi, my name is Vu Trinh, a data engineer.</em></p><p><em>I enjoy reading <strong>good stuff</strong> (related to data and engineering), and this newsletter is my effort on the journey to seek the "good stuff" across the entire Internet. </em></p><p><em>Hope this issue finds you well.</em></p></blockquote><div><hr></div><h2>Referral Program</h2><p>I&#8217;m launching a referral program to grow the community by giving you guys valuable gifts whenever you reach a referral milestone. The condition is simple: you refer friends to subscribe to my newsletter, and you will receive a gift based on the number of friends you refer. 
Here are the reward milestones:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!lf_-!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa4c72d52-a2c4-4e24-9714-04e72a4dc087_756x361.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!lf_-!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa4c72d52-a2c4-4e24-9714-04e72a4dc087_756x361.png 424w, https://substackcdn.com/image/fetch/$s_!lf_-!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa4c72d52-a2c4-4e24-9714-04e72a4dc087_756x361.png 848w, https://substackcdn.com/image/fetch/$s_!lf_-!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa4c72d52-a2c4-4e24-9714-04e72a4dc087_756x361.png 1272w, https://substackcdn.com/image/fetch/$s_!lf_-!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa4c72d52-a2c4-4e24-9714-04e72a4dc087_756x361.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!lf_-!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa4c72d52-a2c4-4e24-9714-04e72a4dc087_756x361.png" width="756" height="361" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/a4c72d52-a2c4-4e24-9714-04e72a4dc087_756x361.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:361,&quot;width&quot;:756,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:112870,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!lf_-!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa4c72d52-a2c4-4e24-9714-04e72a4dc087_756x361.png 424w, https://substackcdn.com/image/fetch/$s_!lf_-!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa4c72d52-a2c4-4e24-9714-04e72a4dc087_756x361.png 848w, https://substackcdn.com/image/fetch/$s_!lf_-!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa4c72d52-a2c4-4e24-9714-04e72a4dc087_756x361.png 1272w, https://substackcdn.com/image/fetch/$s_!lf_-!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa4c72d52-a2c4-4e24-9714-04e72a4dc087_756x361.png 1456w" sizes="100vw" fetchpriority="high"></picture></div></a></figure></div><p>Now, let&#8217;s refer friends and claim exciting rewards ;)</p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://vutr.substack.com/leaderboard?&amp;utm_source=post&quot;,&quot;text&quot;:&quot;Refer a friend&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://vutr.substack.com/leaderboard?&amp;utm_source=post"><span>Refer a friend</span></a></p><div><hr></div><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!NYLk!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc8f97dde-1b1e-4cfd-a4a2-9816665c1e68_1400x1000.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" 
srcset="https://substackcdn.com/image/fetch/$s_!NYLk!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc8f97dde-1b1e-4cfd-a4a2-9816665c1e68_1400x1000.png 424w, https://substackcdn.com/image/fetch/$s_!NYLk!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc8f97dde-1b1e-4cfd-a4a2-9816665c1e68_1400x1000.png 848w, https://substackcdn.com/image/fetch/$s_!NYLk!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc8f97dde-1b1e-4cfd-a4a2-9816665c1e68_1400x1000.png 1272w, https://substackcdn.com/image/fetch/$s_!NYLk!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc8f97dde-1b1e-4cfd-a4a2-9816665c1e68_1400x1000.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!NYLk!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc8f97dde-1b1e-4cfd-a4a2-9816665c1e68_1400x1000.png" width="1400" height="1000" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/c8f97dde-1b1e-4cfd-a4a2-9816665c1e68_1400x1000.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1000,&quot;width&quot;:1400,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:1845936,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" 
srcset="https://substackcdn.com/image/fetch/$s_!NYLk!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc8f97dde-1b1e-4cfd-a4a2-9816665c1e68_1400x1000.png 424w, https://substackcdn.com/image/fetch/$s_!NYLk!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc8f97dde-1b1e-4cfd-a4a2-9816665c1e68_1400x1000.png 848w, https://substackcdn.com/image/fetch/$s_!NYLk!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc8f97dde-1b1e-4cfd-a4a2-9816665c1e68_1400x1000.png 1272w, https://substackcdn.com/image/fetch/$s_!NYLk!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc8f97dde-1b1e-4cfd-a4a2-9816665c1e68_1400x1000.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><div><hr></div><h1>&#128200; Career</h1><blockquote><p><em>Don't let comfort hold you back.</em></p></blockquote><h4>&#128214;&#9478;<a href="https://blog.det.life/career-pathways-of-data-engineers-2bc4465483d0">Career Pathways of Data Engineers</a></h4><p>&#9997; <a href="https://medium.com/@grkumar82?source=post_page-----2bc4465483d0--------------------------------">Ravi Ganta</a></p><blockquote><p><em>The career progression for Data Engineers is not a linear journey, unlike what might be expected. A straightforward trajectory, similar to that of Software Engineers ascending from entry-level individual contributors to executive leadership roles, does not necessarily apply to Data Engineers. This discourse delves into the intricate pathways that Data Engineers can undertake as they advance in their careers.</em></p></blockquote><h4>&#128214;&#9478;<a href="https://blog.dataengineer.io/p/how-to-go-from-senior-to-staff-data">How to go from senior to staff data engineer in big tech</a></h4><p>&#9997; <a href="https://substack.com/profile/10367987-zach-wilson">Zach Wilson</a></p><blockquote><p><em>In big tech, less than 3% of engineers make it to the staff level! If you want to beat the odds and actually become an L6+ engineer, this newsletter is for you. In my time in big tech, I went from junior to staff data engineer in four years. I&#8217;ll unveil the learnings I have from that journey.</em></p></blockquote><div><hr></div><h1>&#128640; Engineering</h1><blockquote><p><em>I have to believe in a world outside my own mind. 
&#8212; Memento (2000)</em></p></blockquote><h4>&#128214;&#9478;<a href="https://materializedview.io/p/from-samza-to-flink-a-decade-of-stream">From Samza to Flink: A Decade of Stream Processing</a></h4><p>&#9997; <a href="https://substack.com/profile/69592459-chris-riccomini">Chris Riccomini</a></p><blockquote><p><em>I started Apache Samza twelve years ago during my tenure at LinkedIn. Samza was a stream processing framework built for Apache Kafka. The team grew to include all-stars like Martin Kleppmann, Chinmay Soman, Jakob Homan, Yi Pan, and many other talented engineers. Together, we added support for stateful processing, batch processing, SQL, YARN, standalone deployment, and many other features you see in modern stream processing systems. I learned a lot building Samza. In this post, I want to review Samza&#8217;s history, look at lessons learned, and talk about how these lessons affect my thinking on Apache Flink.</em></p></blockquote><h4>&#128521; Two-part article:</h4><h4>Part 1: &#128214;&#9478;<a href="https://www.databricks.com/blog/performance-improvements-stateful-pipelines-apache-spark-structured-streaming">Performance Improvements for Stateful Pipelines in Apache Spark Structured Streaming</a></h4><p>&#9997; <a href="https://www.databricks.com/blog/author/mojgan-mazouchi">Mojgan Mazouchi</a> + <a href="https://www.databricks.com/blog/author/mrityunjay-kumar">Mrityunjay Kumar</a> + <a href="https://www.databricks.com/blog/author/anish-shrigondekar">Anish Shrigondekar</a> + <a href="https://www.databricks.com/blog/author/karthikeyan-ramasamy">Karthikeyan Ramasamy</a></p><blockquote><p><em>Apache Spark&#8482; <a href="https://spark.apache.org/streaming/">Structured Streaming</a> is a popular open-source stream processing platform that provides scalability and fault tolerance, built on top of the Spark SQL engine. 
Most incremental and <a href="https://www.databricks.com/product/data-streaming">streaming workloads</a> on the Databricks Lakehouse Platform are powered by Structured Streaming, including <a href="https://www.databricks.com/product/delta-live-tables">Delta Live Tables</a> and <a href="https://docs.databricks.com/en/ingestion/auto-loader/index.html">Auto Loader</a>.</em></p></blockquote><h4>Part 2:  &#128214;&#9478;<a href="https://www.databricks.com/blog/deep-dive-latest-performance-improvements-stateful-pipelines-apache-spark-structured-streaming">A Deep Dive into the Latest Performance Improvements of Stateful Pipelines in Apache Spark Structured Streaming</a></h4><p>&#9997; <a href="https://www.databricks.com/blog/author/mojgan-mazouchi">Mojgan Mazouchi</a> + <a href="https://www.databricks.com/blog/author/mrityunjay-kumar">Mrityunjay Kumar</a> + <a href="https://www.databricks.com/blog/author/anish-shrigondekar">Anish Shrigondekar</a> + <a href="https://www.databricks.com/blog/author/karthikeyan-ramasamy">Karthikeyan Ramasamy</a></p><blockquote><p><em>In this section, we will dig deeper into the various issues we observed while analyzing performance and outline specific enhancements we have implemented to address those issues.</em></p></blockquote><h4>&#128214;&#9478;<a href="https://medium.pimpaudben.fr/everything-ops-beyond-the-hype-e19d2f763b40">Everything Ops: Beyond the Hype</a></h4><p>&#9997; <a href="https://medium.pimpaudben.fr/?source=post_page-----e19d2f763b40--------------------------------">Benoit Pimpaud</a></p><blockquote><p><em>Integrating &#8220;Ops&#8221; into our job titles and conceptual terms may come across as trendy, and indeed, it is.</em></p></blockquote><h4>&#128214;&#9478;<a href="https://lakefs.io/blog/where-is-my-data/">lakeFS: Where&#8217;s my data?</a></h4><p>&#9997; <a href="https://www.linkedin.com/in/ariels">Ariel Shaqed (Scolnicov)</a></p><blockquote><p><em>For a data management platform, sometimes it may seem as though lakeFS 
takes pains to hide your data. Indeed, one very common question on our Slack #help channel is a polite variation on &#8220;where&#8217;s my data?&#8221;. lakeFS does indeed keep your data and it does so inside your namespace. This core functionality works and is well-tested. But lakeFS does such a great job of abstracting away all of these details that it&#8217;s hard to find the data!</em></p></blockquote><h4>&#128214;&#9478;<a href="https://doordash.engineering/2024/02/27/introducing-doordashs-in-house-search-engine/">Introducing DoorDash&#8217;s In-House Search Engine</a></h4><p>&#9997; <a href="https://www.linkedin.com/in/kostya-shulgin/">Konstantin Shulgin</a> + <a href="https://www.linkedin.com/in/satish-saley-65527525/">Satish Subhashrao Saley</a> + <a href="https://www.linkedin.com/in/anish-walawalkar-a9221989/">Anish Walawalkar</a></p><blockquote><p><em>We decided the best way to address these challenges was to move away from Elasticsearch to a homegrown search engine. We chose Apache Lucene as the core of the new search engine. The Search Engine uses a segment-replication model and separates indexing and searching traffic. We designed the index to store multiple types of documents with relations between them. Following the migration to DoorDash&#8217;s Search Engine, we saw a 50% p99.9 latency reduction and a 75% hardware cost decrease.</em></p></blockquote><h4>&#128214;&#9478;<a href="https://medium.com/insiderengineering/query-builder-package-from-individual-chaos-to-collective-consistency-669503cb3696">Query Builder Package: From Individual Chaos to Collective Consistency</a></h4><p>&#9997; <a href="https://medium.com/@sinem.haseki?source=post_page-----669503cb3696--------------------------------">Sinem Elif Haseki</a></p><blockquote><p><em>In the dynamic world of software development, aligning multiple products with consistent and efficient service request protocols is a critical challenge. 
This article delves into the role of a new query builder package in revolutionizing this process, emphasizing its necessity, benefits, and drawbacks.</em></p></blockquote><h4>&#128214;&#9478;<a href="https://engineering.fb.com/2024/02/26/developer-tools/dotslash-meta-tech-podcast/">How DotSlash makes executable deployment simpler</a></h4><p>&#9997; <a href="https://engineering.fb.com/author/pascal-hartig/">Pascal Hartig</a></p><blockquote><p><em>Andres Suarez and Michael Bolin, two software engineers at Meta, join Pascal Hartig (@passy) on the Meta Tech Podcast to discuss the ins and outs of DotSlash, a new open source tool from Meta.</em></p></blockquote><h4>&#128214;&#9478;<a href="https://juhache.substack.com/p/we-need-a-data-engineering-specific">We Need a Data Engineering-Specific Language</a></h4><p>&#9997; <a href="https://substack.com/profile/35734446-julien-hurault">Julien Hurault</a></p><blockquote><p><em>In this article, I explore the present state of standardization within the industry.</em></p></blockquote><div><hr></div><h1>&#9999; Data</h1><blockquote><p><em>The one thing that this job has taught me is that truth is stranger than fiction. &#8212; Predestination (2014)</em></p></blockquote><h4>&#128214;&#9478;<a href="https://sqlpatterns.com/p/designing-metrics-trees">Designing Metrics Trees</a></h4><p>&#9997; <a href="https://substack.com/@ergestx">Ergest Xheblati</a></p><blockquote><p><em>A metric tree is a logical representation of a business&#8217; growth model in a graph form. 
It&#8217;s a simplified representation of how inputs flow into outputs.</em></p></blockquote><h4>&#128214;&#9478;<a href="https://open.nytimes.com/how-the-new-york-times-games-data-team-revamped-its-reporting-8af7e7c7bc97">How the New York Times Games Data Team Revamped Its Reporting</a></h4><p>&#9997; <a href="https://medium.com/@cj-robinson">CJ Robinson</a></p><blockquote><p><em>Ultimately, our goal is to make the data generation, ingestion, processing, and reporting pipeline as effortless and evergreen as possible.</em></p></blockquote><h4>&#128214;&#9478;<a href="https://medium.com/agoda-engineering/10-common-data-visualization-mistakes-and-how-to-avoid-them-e3896fe8e104">10 Common Data Visualization Mistakes and How to Avoid Them</a></h4><p>&#9997; <a href="https://medium.com/@agoda.eng?source=post_page-----e3896fe8e104--------------------------------">Agoda Engineering</a></p><blockquote><p><em>When creating data visualizations, it can be easy to make mistakes that lead to wrong interpretations. 
This article will look at bad data visualization and how to avoid it.</em></p></blockquote><h4>&#128214;&#9478;<a href="https://medium.com/@barrmoses/when-a-data-mesh-doesnt-make-sense-for-your-organization-20de8f3f48bd">When a Data Mesh Doesn&#8217;t Make Sense for Your Organization</a></h4><p>&#9997; <a href="https://barrmoses.medium.com/when-a-data-mesh-doesnt-make-sense-for-your-organization-20de8f3f48bd">Barr Moses</a></p><blockquote><p><em>In this article, we&#8217;ll revisit data mesh to discuss what it is, why it makes sense for some teams, and when it doesn&#8217;t make sense for yours!</em></p></blockquote><h4>&#128214;&#9478;<a href="https://faithfacts.substack.com/p/dbts-semantic-layer-for-dummies">dbt&#8217;s Semantic Layer for Dummies</a></h4><p>&#9997; <a href="https://substack.com/profile/11412853-faith-lierheimer">Faith Lierheimer</a></p><blockquote><p><em>What&#8217;s dbt&#8217;s Semantic Layer (powered by MetricFlow baby) supposed to do anyways?</em></p></blockquote><div><hr></div><h1>&#129302; AI&#9478;ML&#9478;Data Science</h1><blockquote><p><em>You know, Burke, I don&#8217;t know which species is worse. &#8212; Ripley, Aliens (1986)</em></p></blockquote><h4>&#128214;&#9478;<a href="https://medium.com/towards-data-science/2023-in-review-recapping-the-post-chatgpt-era-and-what-to-expect-for-2024-bb4357a4e827">2023 in Review: Recapping the Post-ChatGPT Era and What to Expect for 2024</a></h4><p>&#9997; <a href="https://medium.com/@iamleonie">Leonie Monigatti</a></p><blockquote><p><em>How the LLMOps landscape has evolved and why we haven&#8217;t seen many Generative AI applications in the wild yet &#8212; but maybe in 2024.</em></p></blockquote><h4>&#128214;&#9478;<a href="https://explainextended.com/2023/12/31/happy-new-year-15/">Happy New Year: GPT in 500 lines of SQL</a></h4><p>&#9997; <a href="https://explainextended.com/">Quassnoi</a></p><blockquote><p><em>It just proves that if you want something done right, you have to do it yourself. 
Encouraged by this optimistic forecast, today we will implement a large language model in SQL.</em></p></blockquote><h4>&#128214;&#9478;<a href="https://read.engineerscodex.com/p/metas-new-llm-based-test-generator">Meta's new LLM-based test generator is a sneak peek to the future of development</a></h4><p>&#9997; <a href="https://read.engineerscodex.com/">Engineer&#8217;s Codex</a></p><blockquote><p><em>Meta's TestGen-LLM is a sneak peek to the future of developer productivity: specialized, orchestrated, and rigorously filtered.</em></p></blockquote><div><hr></div><h1>&#128293; Catch up</h1><blockquote><p><em>&#8230;Next Saturday night, we're sending you back to the future! &#8212; Dr. Emmett Brown, Back to the Future (1985)</em></p></blockquote><h4>&#128214;&#9478;<strong><a href="https://www.confluent.io/blog/introducing-apache-kafka-3-7/">Introducing Apache Kafka 3.7</a></strong></h4><h4>&#128214;&#9478;<a href="https://learn.microsoft.com/en-us/power-bi/create-reports/copilot-evaluate-data">Copilot in PowerBI</a></h4><h4>&#128214;&#9478;<a href="https://thenewstack.io/pyairbyte-airbytes-new-python-library-for-moving-data/">PyAirbyte: Airbyte&#8217;s New Python Library for Moving Data.</a></h4><h4>&#128214;&#9478;The following BigQuery cross-cloud features are now <a href="https://cloud.google.com/products/#product-launch-stages">generally available</a> (GA):</h4><blockquote><ul><li><p><em>You can take advantage of the benefits of <a href="https://cloud.google.com/bigquery/docs/materialized-views-intro#biglake">materialized views over Amazon S3 metadata cache-enabled BigLake tables</a>.</em></p></li><li><p><em>You can create <a href="https://cloud.google.com/bigquery/docs/materialized-views-intro#materialized_view_replicas">materialized view replicas</a> of materialized views over Amazon S3 metadata cache-enabled Biglake tables. 
Materialized view replicas let you use the materialized view data in queries while avoiding data egress costs and improving query performance.</em></p></li><li><p><em>You can <a href="https://cloud.google.com/bigquery/docs/materialized-view-replicas-manage#get-info">get information about materialized view replicas</a> by using SQL, the bq command-line tool, or the BigQuery API.</em></p></li><li><p><em>You can use <a href="https://cloud.google.com/bigquery/docs/biglake-intro#cross-cloud_joins">cross-cloud joins</a> to run queries that span both Google Cloud and BigQuery Omni regions.</em></p></li></ul></blockquote><h4>&#128214;&#9478;BigQuery Materialized views can now <a href="https://cloud.google.com/bigquery/docs/materialized-views-create#reference_logical_views">reference logical views</a>. This feature is in <a href="https://cloud.google.com/products#product-launch-stages">preview</a>.</h4><h4>&#128214;&#9478;BigQuery <a href="https://cloud.google.com/bigquery/docs/reference/standard-sql/query-syntax#group_by_all">GROUP BY ALL clause, which groups rows by inferring grouping keys from the SELECT items, is now in preview</a></h4><div><hr></div><h1>&#128160; Previously on Dimension</h1><blockquote><p><em>Dimension is my sub-newsletter where I note down things I learn from people smarter than me in the data engineering field.</em></p></blockquote><p>Here are the 3 latest articles:</p><h3><em><strong>Published on 2024, February 17:</strong></em></h3><div class="digest-post-embed" data-attrs="{&quot;nodeId&quot;:&quot;06f043db-deaf-43ff-8c0a-88cce3c76d6e&quot;,&quot;caption&quot;:&quot;My name is Vu Trinh, and I am a data engineer. I&#8217;m trying to make my life less dull by spending time learning and researching &#8220;how it works&#8220; in the data engineering field. Here is a place where I share everything I&#8217;ve learned. Not subscribe yet? 
Here you go:&quot;,&quot;cta&quot;:null,&quot;showBylines&quot;:true,&quot;size&quot;:&quot;sm&quot;,&quot;isEditorNode&quot;:true,&quot;title&quot;:&quot;I spent 3 hours figuring out how BigQuery inserts, deletes and updates data internally. Here's what I found.&quot;,&quot;publishedBylines&quot;:[{&quot;id&quot;:167177248,&quot;name&quot;:&quot;Vu Trinh&quot;,&quot;bio&quot;:&quot;This is me&quot;,&quot;photo_url&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/4805f673-db97-4f7c-85c4-44b345a8de80_256x256.png&quot;,&quot;is_guest&quot;:false,&quot;bestseller_tier&quot;:null}],&quot;post_date&quot;:&quot;2024-02-17T11:00:08.154Z&quot;,&quot;cover_image&quot;:&quot;https://substackcdn.com/image/fetch/f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F288473f7-d1b6-4853-94d2-1c94f2a75241_1397x994.png&quot;,&quot;cover_image_alt&quot;:null,&quot;canonical_url&quot;:&quot;https://vutr.substack.com/p/i-spent-3-hours-trying-to-figure&quot;,&quot;section_name&quot;:&quot;Dimensions.&quot;,&quot;video_upload_id&quot;:null,&quot;id&quot;:141271176,&quot;type&quot;:&quot;newsletter&quot;,&quot;reaction_count&quot;:1,&quot;comment_count&quot;:0,&quot;publication_id&quot;:null,&quot;publication_name&quot;:&quot;VuTrinh.&quot;,&quot;publication_logo_url&quot;:&quot;https://substackcdn.com/image/fetch/f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffb91e59e-4df2-4ac8-b3b4-1f4e830f7b42_256x256.png&quot;,&quot;belowTheFold&quot;:true,&quot;youtube_url&quot;:null,&quot;show_links&quot;:null,&quot;feed_url&quot;:null}"></div><h3><em><strong>Published on 2024, February 24:</strong></em></h3><div class="digest-post-embed" data-attrs="{&quot;nodeId&quot;:&quot;229fd815-82a1-4b3d-8894-0bfd87234fd4&quot;,&quot;caption&quot;:&quot;My name is Vu Trinh, and I am a data engineer. 
I&#8217;m trying to make my life less dull by spending time learning and researching &#8220;how it works&#8220; in the data engineering field. Here is a place where I share everything I&#8217;ve learned. Not subscribe yet? Here you go:&quot;,&quot;cta&quot;:null,&quot;showBylines&quot;:true,&quot;size&quot;:&quot;sm&quot;,&quot;isEditorNode&quot;:true,&quot;title&quot;:&quot;I spent 4 hours figuring out how BigQuery executes the SQL query internally. Here's what I found.&quot;,&quot;publishedBylines&quot;:[{&quot;id&quot;:167177248,&quot;name&quot;:&quot;Vu Trinh&quot;,&quot;bio&quot;:&quot;This is me&quot;,&quot;photo_url&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/4805f673-db97-4f7c-85c4-44b345a8de80_256x256.png&quot;,&quot;is_guest&quot;:false,&quot;bestseller_tier&quot;:null}],&quot;post_date&quot;:&quot;2024-02-24T11:01:00.145Z&quot;,&quot;cover_image&quot;:&quot;https://substackcdn.com/image/fetch/f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb755068f-f802-4d6a-b355-beb5ce1330d3_1398x998.png&quot;,&quot;cover_image_alt&quot;:null,&quot;canonical_url&quot;:&quot;https://vutr.substack.com/p/i-spent-4-hours-figuring-out-how&quot;,&quot;section_name&quot;:&quot;Dimensions.&quot;,&quot;video_upload_id&quot;:null,&quot;id&quot;:141697852,&quot;type&quot;:&quot;newsletter&quot;,&quot;reaction_count&quot;:1,&quot;comment_count&quot;:0,&quot;publication_id&quot;:null,&quot;publication_name&quot;:&quot;VuTrinh.&quot;,&quot;publication_logo_url&quot;:&quot;https://substackcdn.com/image/fetch/f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffb91e59e-4df2-4ac8-b3b4-1f4e830f7b42_256x256.png&quot;,&quot;belowTheFold&quot;:true,&quot;youtube_url&quot;:null,&quot;show_links&quot;:null,&quot;feed_url&quot;:null}"></div><h3><em><strong>Published on 2024, March 2:</strong></em></h3><div class="digest-post-embed" 
data-attrs="{&quot;nodeId&quot;:&quot;51414477-0079-4d8b-95e9-8d59ef535d73&quot;,&quot;caption&quot;:&quot;My name is Vu Trinh, and I am a data engineer. I&#8217;m trying to make my life less dull by spending time learning and researching &#8220;how it works&#8221; in the data engineering field. Here is a place where I share everything I&#8217;ve learned. Not subscribed yet? Here you go:&quot;,&quot;cta&quot;:null,&quot;showBylines&quot;:true,&quot;size&quot;:&quot;sm&quot;,&quot;isEditorNode&quot;:true,&quot;title&quot;:&quot;I spent 7 hours reading another paper to understand more about Snowflake's internal. Here's what I found.&quot;,&quot;publishedBylines&quot;:[{&quot;id&quot;:167177248,&quot;name&quot;:&quot;Vu Trinh&quot;,&quot;bio&quot;:&quot;This is me&quot;,&quot;photo_url&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/4805f673-db97-4f7c-85c4-44b345a8de80_256x256.png&quot;,&quot;is_guest&quot;:false,&quot;bestseller_tier&quot;:null}],&quot;post_date&quot;:&quot;2024-03-02T11:01:04.261Z&quot;,&quot;cover_image&quot;:&quot;https://substackcdn.com/image/fetch/f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7814dfda-092b-4017-a41b-03a002fba86e_1393x993.png&quot;,&quot;cover_image_alt&quot;:null,&quot;canonical_url&quot;:&quot;https://vutr.substack.com/p/i-read-another-paper-to-understand&quot;,&quot;section_name&quot;:&quot;Dimensions.&quot;,&quot;video_upload_id&quot;:null,&quot;id&quot;:141591854,&quot;type&quot;:&quot;newsletter&quot;,&quot;reaction_count&quot;:0,&quot;comment_count&quot;:0,&quot;publication_id&quot;:null,&quot;publication_name&quot;:&quot;VuTrinh.&quot;,&quot;publication_logo_url&quot;:&quot;https://substackcdn.com/image/fetch/f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffb91e59e-4df2-4ac8-b3b4-1f4e830f7b42_256x256.png&quot;,&quot;belowTheFold&quot;:true,&quot;youtube_url&quot;:null,&quot;s
how_links&quot;:null,&quot;feed_url&quot;:null}"></div><div class="pullquote"><p>Let me hear your voice, for example: </p><p>'Your newsletter is so terrible, I can't handle it anymore.'</p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://vutr.substack.com/p/groupby-25-from-samza-to-flink-a/comments&quot;,&quot;text&quot;:&quot;Leave a comment&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://vutr.substack.com/p/groupby-25-from-samza-to-flink-a/comments"><span>Leave a comment</span></a></p><div><hr></div><h2>&#8220;Hasta la vista, baby&#8221; -T800, Terminator 2: Judgment Day (1991)</h2></div><div class="subscription-widget-wrap-editor" data-attrs="{&quot;url&quot;:&quot;https://vutr.substack.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe&quot;,&quot;language&quot;:&quot;en&quot;}" data-component-name="SubscribeWidgetToDOM"><div class="subscription-widget show-subscribe"><div class="preamble"><p class="cta-caption">Thanks for scrolling this far! There's a convenient subscribe box here if you want me to annoy you every week. &#128516;</p></div><form class="subscription-widget-subscribe"><input type="email" class="email-input" name="email" placeholder="Type your email&#8230;" tabindex="-1"><input type="submit" class="button primary" value="Subscribe"><div class="fake-input-wrapper"><div class="fake-input"></div><div class="fake-button"></div></div></form></div></div><p></p>]]></content:encoded></item></channel></rss>