My name is Vu Trinh, and I am a data engineer.
I’m trying to make my life less dull by spending time learning and researching “how it works“ in the data engineering field.
Here is a place where I share everything I’ve learned.
Not subscribe yet? Here you go:
TL, DR:
🔵 How I made it 1+1=0 in DuckDB
🔵 Some exciting things about DuckDB's internal:
👉 Embedded analytics
👉 Execution engine
👉 File Formats
👉 Vector Format
👉 ACID
Intro
This is a 20% “How-To“ and 80% “Introduction“ article.
When setting the title: “I made 1+1=0 in DuckDB”, I didn’t intend to make it like a click-bait.
This is actually how I started playing around with DuckDB.
This idea is inspired by the modifying operator logic in PostgreSQL database blog post I read months ago.
1+1=0, How?
The action needed to make 1+1=0 is quite simple.
The hard part is finding the source code (C++) where the add operator is implemented.
Thanks to Github Copilot and half a day of sitting in front of the laptop, I finally found that piece of code:
Following Copilot:
The active selection is a struct named
AddOperator
that contains a templated static method calledOperation
. This method is designed to operate on two values of any type,TA
andTB
, and return a result of typeTR
.
From my understanding, AddOperator is an abstract method that all the “+“operators must implement.
Knowing precisely the code, I only need to change left + right
to left - right
, then compile the source code to get an executable binary file.
Finally, I ran that binary file, inputting SELECT 1+1, and the result returned was 0, as I expected.
Now that’s all for my weird intro.
The following sections are some cool features of DuckDB (to me) that I discovered when playing around with this database.
Embedded analytics
Client, server, no?
Unlike traditional database management systems with a client-server model, DuckDB aligns itself with the philosophy of simplicity and embedded operation, drawing inspiration from the widely embraced SQLite.
From the user’s point of view, DuckDB is simply an SQL interface run beside other applications on the same computer.
DuckDb’s embedded nature eliminates the need for a separate DBMS server, seamlessly integrating the analytical database within the host process.
This eliminates the complexities of installing, updating, and maintaining separate server software.
Execution engine
Vectorization
Amount of records being processed at a time
Pay attention here; you might think the term “Vectorization“ will draw some connection to a vector database.
But it’s not.
Let me explain here. (From my perspective)
Unlike traditional systems like PostgreSQL, MySQL, or SQLite, which process each row sequentially behind the scenes…
…DuckDB processes data in a vectorized style.
Said DuckDB processes a “batch of values” at once.
This approach isn’t exclusively developed in DuckDB; the vectorized execution engine is inspired by the paper MonetDB/X100: Hyper-Pipelining Query Execution by Peter Boncz, Marcin Zukowski, and Niels Nes.
The paper released in 2005 points out that the “volcano“ processing model (which indicates the model where each parent operator requests a single record at a time) does not leverage the full power of modern CPUs.
The paper's authors suggest an innovative vectorization model where records are batched into a vector and processed at once; this way can enhance the performance significantly.
This paper has a very important impact on the design of many OLAP databases.
You might not notice, but BigQuery, Databricks (Photon Engine), and Snowflake all apply vectorized execution engines.
Pull and Push
The direction of dataflow between internal operators.
In the past, DuckDB’s execution model operates in a pull-based fashion.
An operator will expose a function that allows fetching a result chunk from another operator.
The parent operators will fetch result chunks from its children using this interface until it reaches a source node.
Following the DuckDB author, this approach works fine at the beginning but soon encounters some problems like code duplication or operators not being able to be executed separately from the tree plan.
To solve this, DuckDB changed the model to a push-based fashion where child operators actively push their output data to the parent instead of passively waiting for the parent operators to call to emit the data.
If you'd like to get to know more about this, you can check here.
My intention is just an introductory blog post, so I might release another post that deep dives into the OLAP execution engine, like vectorization or push-pull dataflow, in the near future.
To end this section, I will give you some facts about other OLAP database’s execution engine:
OLAP databases with vectorization engines: BigQuery, Snowflake, Clickhouse, Databricks (Photon),…
OLAP databases that apply:
Push-based: Snowflake,…
Pull-based: Databricks (Photon), BigQuery,…
File Formats
DuckDB has advanced support for Parquet.
It also allows you to query Parquet files directly.
DuckDB suggests working with Parquet files with row groups of 100K-1M rows each to achieve better parallelized performance.
The reason for this is that DuckDB can only parallelize over row groups, so if a Parquet file has a single giant row group, it can only be processed by a single thread.
DuckDB can also parallelize across multiple rows of groups that belong to various Parquet files.
It’s best practice to have at least as many total row groups across all files as there are CPU threads.
For example, with a computer having 5 threads, in both cases when 5 files with 1 row group or 1 file with 5 row groups will achieve complete parallelism.
When querying many files, performance can be improved by using a Hive-format folder structure to partition the data along the columns used in the filter condition.
The database will only need to read the prefix that meets the filter criteria.
For example:
If the folder/bucket is organized like this:
s3://bucket_name/country=us/date=2024-01-01
s3://bucket_name/country=us/date=2024-01-02
…
The query which only needs data in 2024-01-02 will only need to load the necessary prefix.
Execution Format
This section will talk about how DuckDB represents data in memory for processing.
(You can think these are similar to Apache Arrow, a standardized column-oriented memory format).
DuckDB has its own standard to store data in memory during execution called Vector
.
Note: The Vector
format here and the Vectorized execution engine I mentioned above are two different concepts.
Vectors are logically represented arrays that contain data of a single type.
Internally, DuckDB supports different vector formats, which allow the system to store the same logical data with a different physical representation.
Here is the list of supported vector formats from DuckDB documentation:
Flat Vectors
Flat vectors are physically stored as a contiguous array; this is the standard uncompressed vector format. For flat vectors, the logical and physical representations are identical.
Constant Vectors
Constant vectors are physically stored as a single constant value.
Constant vectors are useful when data elements are repeated - for example, when representing the result of a constant expression in a function call, the constant vector allows us to store the value only once.
Dictionary Vectors
Dictionary vectors are physically stored as a child vector and a selection vector that contains indices into the child vector.
Dictionary vectors are emitted by the storage when decompressing from dictionary.
Sequence Vectors
Sequence vectors are useful for efficiently storing incremental sequences. They are generally emitted for row identifiers.
This in-memory format allows for a more compressed representation and potentially allows for compressed execution throughout the system.
ACID
When working with databases, you must have to know ACID.
Atomicity guarantees that transactions are either fully completed or not at all – no in-between states, ensuring data integrity.
Consistency ensures the database moves from one valid state to another, preserving the defined rules and constraints.
Isolation prevents interference between concurrent transactions; ensure each transaction can be executed independently.
Durability ensures that committed transactions persist even in the face of system failures, securing the permanence of our valuable data.
Initially, with OLTP databases like PostgreSQL or MySQL, enforcing ACID is a must-have to ensure data integrity.
It’s the same with OLAP databases, where this system now acts like a critical endpoint for business data analytics and usually serves as a shared environment for multiple users (data analysts, data science, data engineers, business users…)
If ACID is unnecessary in the OLAP world, why should open table formats like Delta Lake or Apache Iceberg be developed to make object storage more… ACID?
DuckDB provides transactional guarantees (ACID properties) through our custom, bulk-optimized Multi-Version Concurrency Control (MVCC).
Outro
There are many other cool things about DuckDB I didn’t mention above because I am afraid it might be too long for this blog.
For example, the secondary index allows DuckDB to have traditional indexes like MySQL or PostgreSQL behind the scenes or DuckDB or DuckDB Wasm (Web Assembly), enabling DuckDB to run right on your browser.
From the user’s point of view, DuckDB is a fascinating tool that can potentially replace libraries like Pandas with very rich SQL with an extensive function library.
In particular, it can be on and running easily without advanced technical knowledge.
From the perspective of those who love to look into the internal world of OLAP databases, DuckDB is a very exciting database that stands on the shoulders of giants, using components from various open-source projects and drawing inspiration from scientific publications.
That’s why I chose to make 1+1=0 in DuckDB.
(I mean research about DuckDB).
Reference: DuckDB documentation and source code.
Before you leave
Leave a comment or contact me via LinkedIn or Email if you:
Are interested in this article and want to discuss it further.
Would you like to correct any mistakes in this article or provide feedback?
This includes writing mistakes like grammar and vocabulary. I happily admit that I'm not so proficient in English :D
It might take 3 minutes to read, but it took me more than three days to prepare, so it will motivate me greatly if you consider subscribing to receive my writing.