In my quest to bring over my data science workflows to Haskell, I have previously implemented one-dimensional labeled arrays (series) in javelin.
These are already very useful to me, but they remain sub-par for some workloads, especially on the performance front. Tabular data analysis libraries like Frames don’t help here, because the data is essentially kept as an array of rows.
The next step is to create a columnar container, often called a dataframe. The idea behind a dataframe is that for a composite datatype (e.g. (a, b, c)), an array of this data is stored column-by-column:
DataFrame (a, b, c) ~ (Array a, Array b, Array c)
For nested datatypes (e.g. (a, b, (c, d))), the data is still mapped to columns:
DataFrame (a, b, (c, d)) ~ (Array a, Array b, Array c, Array d)
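To make the mapping concrete, here is a minimal sketch of the two layouts for the nested example. The element types (Int, Double, Bool, Char), the names Row, Columns, toColumns and fromColumns, and the use of plain lists in place of real arrays are all assumptions made for illustration only.

```haskell
import Data.List (zip4)

-- One row of the nested example type (a, b, (c, d)), with concrete
-- element types chosen purely for illustration.
type Row = (Int, Double, (Bool, Char))

-- Column-oriented layout: one list per leaf field.
-- Lists stand in for arrays here (an assumption, not a real storage choice).
data Columns = Columns
  { colA :: [Int]
  , colB :: [Double]
  , colC :: [Bool]
  , colD :: [Char]
  } deriving (Eq, Show)

-- Split a row-oriented collection into its four leaf columns.
toColumns :: [Row] -> Columns
toColumns rows = Columns
  { colA = [a | (a, _, _) <- rows]
  , colB = [b | (_, b, _) <- rows]
  , colC = [c | (_, _, (c, _)) <- rows]
  , colD = [d | (_, _, (_, d)) <- rows]
  }

-- Reassemble rows, restoring the nesting of the last two fields.
fromColumns :: Columns -> [Row]
fromColumns (Columns aCol bCol cCol dCol) =
  [(a, b, (c, d)) | (a, b, c, d) <- zip4 aCol bCol cCol dCol]

sampleRows :: [Row]
sampleRows = [(1, 2.5, (True, 'x')), (2, 3.5, (False, 'y'))]

main :: IO ()
main = do
  print (toColumns sampleRows)
  -- Round-tripping through the columnar layout should give back the rows.
  print (fromColumns (toColumns sampleRows) == sampleRows)
```

The nesting only exists in the row type; the columnar side is flat, with one column per leaf field.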
These ideas are described in the Dremel paper and serve as the basis of the data storage format Parquet.
My question is: how can I control the data representation such that the underlying data of DataFrame is stored as a tuple of columns? My understanding is that a form of this is used to represent unboxed vectors of tuples as tuples of unboxed vectors in vector, but I am very confused about how it’s done in that package.
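For context, here is a rough sketch of the kind of mechanism I believe is involved. As far as I can tell, Data.Vector.Unboxed declares its Vector type as a data family, so each element type gets its own representation, and the tuple instances hold one vector per component. The sketch below imitates that idea with an associated data family. The class Columnar, its methods fromRows and toRows, all constructor names, and the use of lists as columns are hypothetical placeholders, not the API of vector or javelin.

```haskell
{-# LANGUAGE TypeFamilies #-}

-- Hypothetical class: every row type chooses its own columnar representation.
class Columnar a where
  data DataFrame a
  fromRows :: [a] -> DataFrame a
  toRows   :: DataFrame a -> [a]

-- Leaf types are stored as a single column (a list stands in for an array).
instance Columnar Int where
  newtype DataFrame Int = IntCol [Int]
  fromRows = IntCol
  toRows (IntCol xs) = xs

instance Columnar Double where
  newtype DataFrame Double = DoubleCol [Double]
  fromRows = DoubleCol
  toRows (DoubleCol xs) = xs

instance Columnar Bool where
  newtype DataFrame Bool = BoolCol [Bool]
  fromRows = BoolCol
  toRows (BoolCol xs) = xs

instance Columnar Char where
  newtype DataFrame Char = CharCol [Char]
  fromRows = CharCol
  toRows (CharCol xs) = xs

-- A frame of pairs is a pair of frames, not a column of pairs.
instance (Columnar a, Columnar b) => Columnar (a, b) where
  data DataFrame (a, b) = Frame2 (DataFrame a) (DataFrame b)
  fromRows rows =
    let (xs, ys) = unzip rows
    in Frame2 (fromRows xs) (fromRows ys)
  toRows (Frame2 xs ys) = zip (toRows xs) (toRows ys)

-- Likewise for triples; nested tuples recurse through these instances.
instance (Columnar a, Columnar b, Columnar c) => Columnar (a, b, c) where
  data DataFrame (a, b, c) = Frame3 (DataFrame a) (DataFrame b) (DataFrame c)
  fromRows rows =
    let (xs, ys, zs) = unzip3 rows
    in Frame3 (fromRows xs) (fromRows ys) (fromRows zs)
  toRows (Frame3 xs ys zs) = zip3 (toRows xs) (toRows ys) (toRows zs)

sampleRows :: [(Int, Double, (Bool, Char))]
sampleRows = [(1, 2.5, (True, 'x')), (2, 3.5, (False, 'y'))]

main :: IO ()
main = print (toRows (fromRows sampleRows))
```

With these instances, DataFrame (Int, Double, (Bool, Char)) holds exactly four leaf columns, even though the constructors themselves are nested (a Frame3 whose last field is a Frame2). Whether this is the right design for a real dataframe library is precisely what I am unsure about.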