Designing a type for the contents of a Parquet file

Hi all,

I’m toying with the idea of writing a Haskell library to handle parquet files, which are ubiquitous in data science. This would be exceedingly useful as work.

A quick summary of what a parquet file is: contrary to CSV files, which are row-based, the parquet file format is column based. This difference is important for many use cases; for example, at work, we store time-series in parquet files. Column-based formats like parquet have two advantages:

  1. Column-based storage is susceptible to many optimizations which make parquet files much smaller than equivalent CSVs;
  2. One does not need to read the whole file to extract a single column, enabling streaming column operations (rather than streaming row operations).

There are two restrictions on the data a parquet file can represent:

  • Each column is composed of values of the same type (usually numeric), but each column can have a different type;
  • All columns must have the same length.

My question is: how would you write a type that represents the content of a Parquet file?

My first thought was to have a sum type to describe all possible parquet types:

import Data.Map
import Data.Text
import Data.Vector

-- Assuming that parquet files only support three types
data Value = VBool Bool
           | VInt Int
           | VDouble Double

type ParquetTable = Map Text (Vector Value)

but then It would be possible to create a heterogeneous column, for example:

myTable :: ParquetTable
myTable = Data.Map.fromList [ 
            ("col1", Vector.fromList [VInt 0     , VBool False])
          , ("col2", Vector.fromList [VDouble 0.0, VDouble 0.0])
          ]

Ideally, we would want a type-level Map which does something like this:

type ParquetTable a = TypeMap a

myTable :: ParquetTable [ (Text, Int), (Text, Double) ]
myTable = Data.Map.fromList [ 
            ("col1", Vector.fromList [VInt 0     , VInt 0])
          , ("col2", Vector.fromList [VDouble 0.0, VDouble 0.0])
          ]

This representation would be better, but (1) I don’t know if this is possible, and (2) I don’t know how you would specify the type of a parquet file whose content is unknown.

Any ideas?

4 Likes

You can use a GADT:

data ParquetTable xs where
  PTCons :: x -> Vector y -> ParquetTable xs -> ParquetTable ((x, y) : xs)
  PTNil :: ParquetTable '[]

myTable :: ParquetTable [(Text, Int), (Text, Double)]
myTable =
  PTCons "col1" (Vector.fromList [0, 0])
  PTCons "col2" (Vector.fromList [0.0, 0.0])
  PTNil

This still has some disadvantages:

  • looking up a column takes linear time
  • columns can have different lengths
  • using PTCons and PTNil is not very ergonomic

I think you can solve most of these with more effort, except perhaps the last point.

As for your question (2), you can use existential quantification:

data SomeParquetTable = forall xs. ParquetTable xs
3 Likes

That’s a great suggestion, thanks!

Looks like it’s also possible to restrict the column name type to Text and the column values to parquet-supported types, like so:

{-# LANGUAGE DataKinds         #-}
{-# LANGUAGE GADTs             #-}
{-# LANGUAGE OverloadedStrings #-}

import Data.Text
import Data.Vector (Vector)
import qualified Data.Vector as Vector

class ParquetValue v
instance ParquetValue Int
instance ParquetValue Double

data ParquetTable xs where
  PTCons :: ParquetValue y 
         => Text -> Vector y -> ParquetTable xs 
         -> ParquetTable ((Text, y) : xs)
  PTNil  :: ParquetTable '[]

myTable :: ParquetTable [(Text, Int), (Text, Double)]
myTable = PTCons "col1" (Vector.fromList [0, 0])
        $ PTCons "col2" (Vector.fromList [0.0, 0.0]) 
        $ PTNil
2 Likes