Designing a type for the contents of a Parquet file

LaurentRDC · March 7, 2023, 1:16pm

Hi all,

I’m toying with the idea of writing a Haskell library to handle parquet files, which are ubiquitous in data science. This would be exceedingly useful as work.

A quick summary of what a parquet file is: contrary to CSV files, which are row-based, the parquet file format is column based. This difference is important for many use cases; for example, at work, we store time-series in parquet files. Column-based formats like parquet have two advantages:

Column-based storage is susceptible to many optimizations which make parquet files much smaller than equivalent CSVs;
One does not need to read the whole file to extract a single column, enabling streaming column operations (rather than streaming row operations).

There are two restrictions on the data a parquet file can represent:

Each column is composed of values of the same type (usually numeric), but each column can have a different type;
All columns must have the same length.

My question is: how would you write a type that represents the content of a Parquet file?

My first thought was to have a sum type to describe all possible parquet types:

import Data.Map
import Data.Text
import Data.Vector

-- Assuming that parquet files only support three types
data Value = VBool Bool
           | VInt Int
           | VDouble Double

type ParquetTable = Map Text (Vector Value)

but then It would be possible to create a heterogeneous column, for example:

myTable :: ParquetTable
myTable = Data.Map.fromList [ 
            ("col1", Vector.fromList [VInt 0     , VBool False])
          , ("col2", Vector.fromList [VDouble 0.0, VDouble 0.0])
          ]

Ideally, we would want a type-level Map which does something like this:

type ParquetTable a = TypeMap a

myTable :: ParquetTable [ (Text, Int), (Text, Double) ]
myTable = Data.Map.fromList [ 
            ("col1", Vector.fromList [VInt 0     , VInt 0])
          , ("col2", Vector.fromList [VDouble 0.0, VDouble 0.0])
          ]

This representation would be better, but (1) I don’t know if this is possible, and (2) I don’t know how you would specify the type of a parquet file whose content is unknown.

Any ideas?

jaror · March 7, 2023, 1:47pm

You can use a GADT:

data ParquetTable xs where
  PTCons :: x -> Vector y -> ParquetTable xs -> ParquetTable ((x, y) : xs)
  PTNil :: ParquetTable '[]

myTable :: ParquetTable [(Text, Int), (Text, Double)]
myTable =
  PTCons "col1" (Vector.fromList [0, 0])
  PTCons "col2" (Vector.fromList [0.0, 0.0])
  PTNil

This still has some disadvantages:

looking up a column takes linear time
columns can have different lengths
using PTCons and PTNil is not very ergonomic

I think you can solve most of these with more effort, except perhaps the last point.

As for your question (2), you can use existential quantification:

data SomeParquetTable = forall xs. ParquetTable xs

LaurentRDC · March 7, 2023, 2:27pm

That’s a great suggestion, thanks!

Looks like it’s also possible to restrict the column name type to Text and the column values to parquet-supported types, like so:

{-# LANGUAGE DataKinds         #-}
{-# LANGUAGE GADTs             #-}
{-# LANGUAGE OverloadedStrings #-}

import Data.Text
import Data.Vector (Vector)
import qualified Data.Vector as Vector

class ParquetValue v
instance ParquetValue Int
instance ParquetValue Double

data ParquetTable xs where
  PTCons :: ParquetValue y 
         => Text -> Vector y -> ParquetTable xs 
         -> ParquetTable ((Text, y) : xs)
  PTNil  :: ParquetTable '[]

myTable :: ParquetTable [(Text, Int), (Text, Double)]
myTable = PTCons "col1" (Vector.fromList [0, 0])
        $ PTCons "col2" (Vector.fromList [0.0, 0.0]) 
        $ PTNil

Topic		Replies	Views
Columnar storage of datatypes Learn	7	755	January 10, 2024
Implementing type-safe heterogeneous collections Learn	24	1245	January 22, 2024
[Initial feedback request] DataFrame library Show and Tell	24	1336	February 2, 2025
[Design] Dataframes in Haskell Haskell Foundation	25	2913	January 6, 2025
Pre-HFTP: Proposal DataFrame Library for Haskell Haskell Foundation	11	856	December 9, 2024

Designing a type for the contents of a Parquet file

Related topics