"It's time to replace Parquet"
Parquet's limitations, what motivates the need for new file formats in analytics and AI, and whether Parquet will be replaced soon.
I publish a paid article every Tuesday. I write these with one goal in mind: to offer my readers, whether they are just beginning the journey and feeling overwhelmed or seeking a deeper understanding of the field, 15 minutes of practical lessons and insights on nearly everything related to data engineering.
Intro
(About the title: I didn't say it; someone on the internet did.)
Parquet is great. From my previous article, we learned that Parquet is a great fit for analytics workloads: the query engine can read only the columns it wants and skip irrelevant rows thanks to statistics.
However, Parquet was created over a decade ago.
Things have changed since then, especially analytics workload patterns, and AI workloads are becoming popular.
In this article, we first revisit the detailed implementation of a Parquet file. From that, we will try to understand the typical workload the format was initially designed for. Then we will see how that design might not serve today's common analytics workloads well, which has led to new file formats being created to solve those problems.
The goal is to provide a clearer understanding of Parquet, not only from its strengths but also from its limitations.
History
Apache Parquet was created in the early 2010s from a collaboration between engineers at Twitter and Cloudera, who were looking for a more efficient and performant columnar storage format for large-scale data processing within the Apache Hadoop ecosystem.
It was designed as an improvement over Trevni (now part of Apache Avro), a columnar storage format created by Doug Cutting, the creator of Hadoop. Notably, Parquet incorporated concepts from Google's Dremel paper to handle complex, nested data structures. The format's goal was to provide an open-source columnar standard that offered superior data compression, encoding schemes, and query performance by reading only the necessary columns.
The first version, Apache Parquet 1.0, was released in July 2013.
The rest is history.
Architecture
Overview
Parquet is well known as a columnar format, and people often assume that data from a column is stored together. That's only half of the story.
The format organizes data in the Partition Attributes Across (PAX) layout, commonly referred to as a hybrid format. It first groups data into “row groups,” each containing a subset of rows (a horizontal partition).
Within each row group, data is stored column by column: values from the same column are stored together. Each column within a row group is called a column chunk. Each chunk is composed of pages, which are the unit of encoding and compression.
This approach enables query engines to read only the desired columns while bounding the cost of reconstructing a record: all of a row's values are located within a single row group.
Metadata also plays a crucial role in Parquet. The file format contains information needed for the application to consume the file.
Magic number: Used to verify that the file is a valid Parquet file.
FileMetadata: Parquet stores FileMetadata in the footer of the file. This metadata provides information like the number of rows, data schema, and row group metadata.
Each row group metadata contains information about its column chunks (ColumnMetadata), including the encoding and compression scheme, size, page offset, and min/max value of the column chunk. The application can use information in this metadata to prune unnecessary data.
PageHeader: The page header is stored with the page data and includes information such as the encodings used for values, definition levels, and repetition levels. Parquet stores definition and repetition levels to handle nested data. The application uses the header to read and decode the data.
Write process
When writing data, the engine:
Collects information, such as the data schema, nullability, the encoding scheme, and all the column types, which are recorded in FileMetadata.
Writes the magic number at the beginning of the file
Calculates the number of row groups based on the row group’s max size (configurable) and the data’s size
For each row group, iterates through the column list to write each column chunk for the row group. The engine typically buffers the entire row group data before flushing to disk.
Writes each column chunk page by page sequentially
Each page has a header that includes the page's number of values and the encodings used for the data, repetition levels, and definition levels.
After writing all the pages for a column chunk, constructs that chunk's metadata, including the column's min/max values (if available), total_uncompressed_size, total_compressed_size, and the offset of the first data page.
Continues until all columns in the row group are written to disk.
Writes all row groups’ metadata in the FileMetadata after writing all the row groups.
Writes the FileMetadata to the footer.
Writes the magic number at the end of the file.
Read process
When reading data, the engine:
Checks the magic number to see if it’s a valid Parquet file.
Reads the FileMetadata from the footer. It extracts information for later use, such as the whole schema and the row group metadata.
Retrieves the list of row groups to be read. If filters are specified, the engine iterates over each row group's metadata and checks the filters against its statistics; row groups that satisfy the filters are appended to the list of row groups to read.
If there are no filters, the list contains all the row groups.
Defines the column list:
If the engine specifies a subset of columns it wants to read, the list only contains these columns.
Iterates through the row group list and reads each one.
Reads the column chunks for each row group based on the column list. It uses ColumnMetadata to locate the position of the first data page and decode the data.
Continues until all row groups are read.
Strengths
The section above highlights two obvious advantages of the Parquet file: Column Pruning and Predicate Pushdown. With the former, thanks to the column layout in row groups, the engine can read the required columns and skip irrelevant ones.
For the latter, Parquet's statistics help the engine push query filters down to the physical file level. A query filtering for the value 5 can skip every row group and every column chunk page whose min/max range does not include 5. Like Column Pruning, this approach reduces the amount of data read from the file, decreases disk I/O, and enables faster data reading.
In the following sections, I will first list out some Parquet limitations (based on my research), then we will move on to see why these limitations matter in today's analytics and AI workload.
Limitations
Random Access
Parquet is not ideal for random access, where a small set of rows needs to be read.
This is because Parquet stores data by column. To rebuild a single logical row, the engine must perform multiple reads, one per column, from different physical locations within the file.