Hi all,
I’m toying with the idea of writing a Haskell library to handle parquet files, which are ubiquitous in data science. This would be exceedingly useful at work.
A quick summary of what a parquet file is: unlike CSV files, which are row-based, the parquet file format is column-based. This difference matters for many use cases; for example, at work, we store time series in parquet files. Column-based formats like parquet have two advantages:
- Column-based storage is susceptible to many optimizations which make parquet files much smaller than equivalent CSVs;
- One does not need to read the whole file to extract a single column, enabling streaming column operations (rather than streaming row operations); the toy sketch below illustrates the layout difference.
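To make the layout difference concrete, here is a toy sketch in plain Haskell. This is not parquet’s actual on-disk encoding, just the shape of the data:

-- Row-based (CSV-like): each record carries all of its fields.
csvRows :: [(Int, Double, Bool)]
csvRows = [(0, 1.5, True), (1, 2.5, False)]

-- Column-based (parquet-like): all values of one field are stored together,
-- so a single column can be read or streamed without touching the others.
parquetColumns :: ([Int], [Double], [Bool])
parquetColumns = ([0, 1], [1.5, 2.5], [True, False])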
There are two restrictions on the data a parquet file can represent:
- Each column is composed of values of the same type (usually numeric), but each column can have a different type;
- All columns must have the same length.
My question is: how would you write a type that represents the content of a Parquet file?
My first thought was to have a sum type to describe all possible parquet types:
{-# LANGUAGE OverloadedStrings #-}

import           Data.Map    (Map)
import qualified Data.Map    as Map
import           Data.Text   (Text)
import           Data.Vector (Vector)
import qualified Data.Vector as Vector

-- Assuming that parquet files only support three types
data Value
  = VBool Bool
  | VInt Int
  | VDouble Double

type ParquetTable = Map Text (Vector Value)
but then it would be possible to create a heterogeneous column, for example:
myTable :: ParquetTable
myTable = Map.fromList
  [ ("col1", Vector.fromList [VInt 0, VBool False])
  , ("col2", Vector.fromList [VDouble 0.0, VDouble 0.0])
  ]
Ideally, we would want a type-level Map which does something like this:
type ParquetTable a = TypeMap a
myTable :: ParquetTable [ (Text, Int), (Text, Double) ]
myTable = Map.fromList
  [ ("col1", Vector.fromList [VInt 0, VInt 0])
  , ("col2", Vector.fromList [VDouble 0.0, VDouble 0.0])
  ]
This representation would be better, but (1) I don’t know if this is possible, and (2) I don’t know how you would specify the type of a parquet file whose content is unknown.
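For concreteness, the rough direction I have half-sketched for problem (2) is to keep each column typed with a GADT and to hide the element type behind an existential when the schema is only discovered at read time. The names here (Column, SomeColumn, DynamicTable) are just placeholders I made up, and I am not at all sure this is the right approach:

{-# LANGUAGE GADTs #-}

import Data.Map (Map)
import Data.Text (Text)
import Data.Vector (Vector)

-- A column whose element type is tracked statically.
data Column a where
  BoolColumn   :: Vector Bool   -> Column Bool
  IntColumn    :: Vector Int    -> Column Int
  DoubleColumn :: Vector Double -> Column Double

-- When the schema is only discovered while reading the file,
-- the element type can be hidden behind an existential wrapper.
data SomeColumn where
  SomeColumn :: Column a -> SomeColumn

-- A table whose column types are not known at compile time:
-- column names mapped to columns of hidden element type.
type DynamicTable = Map Text SomeColumn

The obvious downside is that every consumer then has to pattern match on SomeColumn to recover the element type, which is part of why I’m asking.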
Any ideas?