Newly graduated from the Apache Incubator, the Parquet project allows column-stored data to be handled at high speed.

Apache Parquet, which provides columnar storage in Hadoop, is now a top-level Apache Software Foundation (ASF) project, paving the way for its more advanced use in the Hadoop ecosystem. Already adopted by Netflix and Twitter, Parquet began in 2013 as a co-production between engineers at Twitter and Cloudera to allow complex data to be encoded efficiently in bulk.

Databases traditionally store information in rows and are optimized for working with one record at a time. Columnar storage systems instead serialize and store data by column, so scans and bulk reads across large data sets can be highly optimized. Hadoop was built for managing large sets of data, so a columnar store is a natural complement. Most Hadoop projects can read and write data to and from Parquet; the Hive, Pig, and Drill projects already do this, as does conventional MapReduce.

As another benefit, per-column data compression further accelerates performance in Parquet. A textual column is compressed differently than a column holding only integer data, and being able to compress each column separately provides its own performance boost. Parquet also implements its column encodings modularly, so that the format is “future-proofed to allow adding more encodings as they are invented and implemented.” (Both the columnar layout and per-column compression are sketched in code at the end of this article.)

Early adopters and project leads have used Parquet for some time and built functionality around it. Cloudera, the project’s co-progenitor, uses Parquet as a native data storage format for its Impala analytics database, and MapR has added data self-description functions to Parquet. Netflix — never one to shy away from a forward-looking technology (such as Cassandra) — has 7 petabytes of warehoused data in Parquet format, according to the ASF.

Parquet isn’t the only way to store columnar data in Hadoop, but it’s shaping up as the leader. Hive has its own columnar format, ORC, although it’s mainly intended as an extension to Hive rather than as a general data store for Hadoop. Hortonworks, a Cloudera competitor (in more ways than one), claimed earlier in Parquet’s lifecycle that ORC compresses data more efficiently than Parquet. And IBM ran its own performance comparisons in September 2014, finding that while ORC used the least HDFS storage, Parquet had the best overall query and analysis times, which are the metrics that typically matter most for Hadoop users.
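To make the columnar layout concrete, here is a minimal sketch using the Python pyarrow library, one of several Parquet implementations; the file name and column names are invented for illustration. Because values are stored column by column, a reader that needs only one field can skip the bytes of every other column entirely:

import pyarrow as pa
import pyarrow.parquet as pq

# A small example table; "events.parquet" and these column names
# are hypothetical, chosen only to illustrate the layout.
table = pa.table({
    "user_id": [101, 102, 103, 104],
    "country": ["US", "BR", "JP", "US"],
    "duration_ms": [5400, 1200, 8700, 300],
})

# Write the table out in Parquet's columnar format.
pq.write_table(table, "events.parquet")

# Read back just one column: a row-oriented store would have to scan
# every record, but the columnar file serves only these bytes.
countries = pq.read_table("events.parquet", columns=["country"])
print(countries)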
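The per-column compression described above can be sketched with the same library; the codec choices and the "reviews.parquet" file name are assumptions for illustration, not defaults of the format. pyarrow's write_table accepts a codec per column, mirroring Parquet's notion that repetitive text and plain integers benefit from different treatment:

import pyarrow as pa
import pyarrow.parquet as pq

table = pa.table({
    "comment": ["great", "needs work", "great", "great"],  # repetitive text
    "score": [5, 2, 5, 4],                                 # small integers
})

# Pick a codec per column and dictionary-encode the repetitive strings;
# a text column and an integer column compress best with different settings.
pq.write_table(
    table,
    "reviews.parquet",
    compression={"comment": "gzip", "score": "snappy"},
    use_dictionary=["comment"],
)

Each column chunk records its own codec in the file metadata, so readers can decode every column independently of the others.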