Newly graduated from the Apache Incubator, the Parquet project allows column-stored data to be handled at high speed.

Apache Parquet, which provides columnar storage in Hadoop, is now a top-level Apache Software Foundation (ASF) project, paving the way for its more advanced use in the Hadoop ecosystem. Already adopted by Netflix and Twitter, Parquet began in 2013 as a co-production between engineers at Twitter and Cloudera to allow complex data to be encoded efficiently in bulk.

Databases traditionally store information in rows and are optimized for working with one record at a time. Columnar storage systems instead serialize and store data by column, so scans and bulk reads across large data sets can be highly optimized. Hadoop was built for managing large sets of data, so a columnar store is a natural complement. Most Hadoop projects can read and write data to and from Parquet; the Hive, Pig, and Drill projects already do this, as does conventional MapReduce.

As another benefit, per-column data compression further accelerates performance in Parquet. A textual column is compressed differently than a column holding only integer data, and being able to compress each column separately provides its own performance boost. Parquet also implements its column encodings modularly, so that the format is “future-proofed to allow adding more encodings as they are invented and implemented.” (Both the columnar layout and per-column compression are sketched in code at the end of this article.)

Early adopters and project leads have used Parquet for some time and built functionality around it. Cloudera, the project’s co-progenitor, uses Parquet as a native data storage format for its Impala analytics database, and MapR has added data self-description functions to Parquet. Netflix — never one to shy away from a forward-looking technology (such as Cassandra) — has 7 petabytes of warehoused data in Parquet format, according to the ASF.

Parquet isn’t the only way to store columnar data in Hadoop, but it’s shaping up as the leader. Hive has its own columnar format, ORC, although it’s mainly intended as an extension to Hive rather than as a general data store for Hadoop. Hortonworks, a Cloudera competitor (in more ways than one), claimed earlier in Parquet’s lifecycle that ORC compresses data more efficiently than Parquet. And IBM ran its own performance comparisons in September 2014, finding that while ORC used the least HDFS storage, Parquet had the best overall query and analysis times, which are the metrics that typically matter most for Hadoop users.
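To make the columnar layout concrete, here is a minimal sketch using the Python pyarrow library, one of several Parquet implementations; the file name and column names are invented for illustration. Because values are stored column by column, a reader that needs only one field can skip the bytes of every other column entirely:

import pyarrow as pa
import pyarrow.parquet as pq

# A small example table; "events.parquet" and these column names
# are hypothetical, chosen only to illustrate the layout.
table = pa.table({
    "user_id": [101, 102, 103, 104],
    "country": ["US", "BR", "JP", "US"],
    "duration_ms": [5400, 1200, 8700, 300],
})

# Write the table out in Parquet's columnar format.
pq.write_table(table, "events.parquet")

# Read back just one column: a row-oriented store would have to scan
# every record, but the columnar file serves only these bytes.
countries = pq.read_table("events.parquet", columns=["country"])
print(countries)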
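The per-column compression described above can be sketched with the same library; the codec choices and the "reviews.parquet" file name are assumptions for illustration, not defaults of the format. pyarrow's write_table accepts a codec per column, mirroring Parquet's notion that repetitive text and plain integers benefit from different treatment:

import pyarrow as pa
import pyarrow.parquet as pq

table = pa.table({
    "comment": ["great", "needs work", "great", "great"],  # repetitive text
    "score": [5, 2, 5, 4],                                 # small integers
})

# Pick a codec per column and dictionary-encode the repetitive strings;
# a text column and an integer column compress best with different settings.
pq.write_table(
    table,
    "reviews.parquet",
    compression={"comment": "gzip", "score": "snappy"},
    use_dictionary=["comment"],
)

Each column chunk records its own codec in the file metadata, so readers can decode every column independently of the others.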