Columnar storage of datatypes

LaurentRDC · January 9, 2024, 8:34pm

In my quest to bring over my data science workflows to Haskell, I have previously implemented one-dimensional labeled arrays (series) in javelin.

These are already very useful to me, but they remain sub-par for some workloads, especially on the performance front. Tabular data analysis libraries like Frames don’t help here, because the data is essentially kept as an array of rows.

The next step is to create a columnar container, often called a dataframe. The idea behind a dataframe is that for a composite datatype (e.g. (a, b, c)), an array of this data is stored column-by-column:

DataFrame (a, b, c) ~ (Array a, Array b, Array c)

For nested datatypes (e.g. (a, b, (c, d))), the data is still mapped to columns:

DataFrame (a, b, (c, d)) ~ (Array a, Array b, Array c, Array d)

These ideas are described in the Dremel paper and serve as the basis of the data storage format Parquet.
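To make the mapping concrete, here is a minimal sketch of the nested case, transposing rows of type (a, b, (c, d)) into four columns and back. The names Row, Columns, toColumns and fromColumns are made up for illustration, the field types are arbitrary, and plain lists stand in for arrays only to keep the example free of dependencies.

```haskell
-- Lists stand in for arrays to keep the sketch dependency-free.
-- Row, Columns, toColumns and fromColumns are hypothetical names.
type Row = (Int, String, (Double, Bool))

-- One column per leaf field of the row type, including the nested pair.
data Columns = Columns
  { colA :: [Int]
  , colB :: [String]
  , colC :: [Double]
  , colD :: [Bool]
  } deriving (Eq, Show)

-- Transpose rows into columns, flattening the nested pair.
toColumns :: [Row] -> Columns
toColumns rows = Columns c1 c2 c3 c4
  where
    (c1, c2, nested) = unzip3 rows
    (c3, c4)         = unzip nested

-- Rebuild the rows by zipping the columns back together.
fromColumns :: Columns -> [Row]
fromColumns (Columns c1 c2 c3 c4) = zip3 c1 c2 (zip c3 c4)

sampleRows :: [Row]
sampleRows = [(1, "x", (0.5, True)), (2, "y", (1.5, False))]

main :: IO ()
main = do
  let cols = toColumns sampleRows
  print (colC cols)                       -- reads a single column only
  print (fromColumns cols == sampleRows)  -- round trip preserves the rows
```

Reading colC never touches the other three columns, which is the main performance argument for the columnar layout.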

–

My question is: How can I control the data representation such that the underlying data of DataFrame is stored as a tuple of columns? My understanding is that a form of this is used to represent unboxed vectors of tuples as tuples of unboxed vectors in vector, but I am very confused about how it’s done in that package.

olf · January 9, 2024, 10:07pm

First of all, the distinction between column and row storage would not exist if it were not for the types.
I think there is a route when you admit

DataFrame (a, (b, c)) ~ (Array a, (Array b, Array c))

using a GADT such as

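{-# LANGUAGE GADTs, KindSignatures #-}
import Data.Kind (Type)
import Data.Vector (Vector)

-- Array can be any one-parameter array type; Vector is an assumption here.
type Array = Vector

-- A dataframe is a list of columns whose element types are tracked as nested pairs.
data DataFrame :: Type -> Type where
    Nil :: DataFrame ()
    Cons :: Array a -> DataFrame b -> DataFrame (a,b)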

You could even make Array a type parameter and vary the Array type you use, thus permitting unboxed representations.
Perhaps proper type-level lists can take you even further.
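As a rough sketch of both suggestions, the GADT can be indexed by a type-level list of column types and parameterized over the column container. Everything below is illustrative: the operator (:&), the helpers width and firstColumn, and the use of plain lists as the container are assumptions made to keep the example dependency-free.

```haskell
{-# LANGUAGE DataKinds, GADTs, KindSignatures, TypeOperators #-}
import Data.Kind (Type)

-- The column container f is a parameter, so boxed or unboxed arrays could
-- be plugged in; the column types are tracked in a type-level list.
data DataFrame (f :: Type -> Type) (cols :: [Type]) where
  Nil  :: DataFrame f '[]
  (:&) :: f a -> DataFrame f rest -> DataFrame f (a ': rest)
infixr 5 :&

-- Number of columns, computed by walking the spine of the frame.
width :: DataFrame f cols -> Int
width Nil       = 0
width (_ :& df) = 1 + width df

-- First column of a non-empty frame; the index rules out the Nil case.
firstColumn :: DataFrame f (a ': rest) -> f a
firstColumn (c :& _) = c

-- Lists are used as the container here purely for illustration.
example :: DataFrame [] '[Int, String, Double]
example = [1, 2] :& ["x", "y"] :& [0.5, 1.5] :& Nil

main :: IO ()
main = do
  print (width example)
  print (firstColumn example)
```

Note that nothing in this type forces the columns to have equal length; a real library would need a smart constructor or a length index for that.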

rhendric · January 10, 2024, 12:33am

Ultimately it’s a fairly straightforward application of type families (data families, specifically), but maybe that’s not the part you’re confused about? If you were a little more specific about what in vector confuses you I could try to help.
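For reference, here is a minimal sketch of the data-family idea. It is not the code from vector: the class Columnar, the family Frame and all constructor names are invented for illustration, and lists stand in for arrays. The shape is similar in spirit, though, since the unboxed Vector in vector is a data family whose pair instance holds one vector per component.

```haskell
{-# LANGUAGE TypeFamilies #-}

-- Hypothetical class: each row type chooses its own columnar representation
-- through an associated data family.
class Columnar row where
  data Frame row
  fromRows :: [row] -> Frame row
  toRows   :: Frame row -> [row]

-- Leaf types are stored as a single column (a list stands in for an array).
instance Columnar Int where
  newtype Frame Int = IntCol [Int]
  fromRows = IntCol
  toRows (IntCol xs) = xs

instance Columnar Double where
  newtype Frame Double = DoubleCol [Double]
  fromRows = DoubleCol
  toRows (DoubleCol xs) = xs

-- A frame of pairs is a pair of frames, so nested tuples recurse into
-- one column per leaf field.
instance (Columnar a, Columnar b) => Columnar (a, b) where
  data Frame (a, b) = PairFrame (Frame a) (Frame b)
  fromRows rows = PairFrame (fromRows xs) (fromRows ys)
    where (xs, ys) = unzip rows
  toRows (PairFrame fa fb) = zip (toRows fa) (toRows fb)

sampleRows :: [(Int, (Double, Int))]
sampleRows = [(1, (2.5, 3)), (4, (5.5, 6))]

main :: IO ()
main = print (toRows (fromRows sampleRows) == sampleRows)
```

Because the representation is chosen per instance, user code only ever mentions Frame (Int, (Double, Int)), while the stored data is three separate columns.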

adamConnerSax · January 10, 2024, 4:03am

FWIW, the Frame in the Frames package also stores things in columns, specifically in vectors from the vector package, using vectors of unboxed types when possible. If you look at the InCore module you will see how these columns are built.

ocramz · January 10, 2024, 4:56am

(first off: I don’t have a piece of Haskell code that does what you ask for, but I’ve thought about similar stuff in the past with heidi (tidy data in Haskell), even though it’s row-oriented)

I would suggest first exploring functions that operate on the column representation (lookup etc.) and giving them an Applicative interface. (Maybe jumping through hoops with type families and GADTs isn’t necessary.)
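One possible reading of this suggestion, sketched under assumptions: a small Applicative that describes how to look up one logical row from a record of columns. The types Trades and Lookup and the helpers column and notional are all hypothetical, and lists stand in for arrays (so indexing is linear here, which a real array type would avoid).

```haskell
import Data.Maybe (listToMaybe)

-- A frame is an ordinary record of columns (lists stand in for arrays).
data Trades = Trades
  { symbol   :: [String]
  , price    :: [Double]
  , quantity :: [Int]
  }

-- A Lookup describes how to produce a value for row i of a frame.
newtype Lookup frame a = Lookup { runLookup :: frame -> Int -> Maybe a }

instance Functor (Lookup frame) where
  fmap f (Lookup g) = Lookup (\fr i -> fmap f (g fr i))

instance Applicative (Lookup frame) where
  pure x = Lookup (\_ _ -> Just x)
  Lookup f <*> Lookup g = Lookup (\fr i -> f fr i <*> g fr i)

-- Read one column at the requested row; out-of-range rows give Nothing.
column :: (frame -> [a]) -> Lookup frame a
column sel = Lookup go
  where
    go fr i
      | i < 0     = Nothing
      | otherwise = listToMaybe (drop i (sel fr))

-- Combine columns applicatively; only the referenced columns are read.
notional :: Lookup Trades (String, Double)
notional = (\s p q -> (s, p * fromIntegral q))
  <$> column symbol <*> column price <*> column quantity

trades :: Trades
trades = Trades ["AAA", "BBB"] [10.0, 20.5] [3, 2]

main :: IO ()
main = do
  print (runLookup notional trades 1)
  print (runLookup notional trades 5)
```

The appeal of this design is that row-shaped values are only assembled on demand, while the storage stays columnar and needs no type-level machinery.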

LaurentRDC · January 10, 2024, 1:52pm

My confusion stems from the fact that I can’t find the instance (Unbox a, Unbox b) => Vector Vector (a, b).

For example, I get that an unboxed vector of type Vector (Complex a) gets represented the same way as Vector (a, a) because it’s explicitly defined here.

Now that I’m looking at the source, I was confused because tuple instances are defined elsewhere. It’s here.
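A small usage sketch of what those tuple instances give you, assuming the vector package is installed; the names rows, ids and values are just for illustration. An unboxed vector of pairs can be split into its component vectors with unzip from Data.Vector.Unboxed, which is documented as O(1) because the pair is already stored as two separate vectors.

```haskell
import qualified Data.Vector.Unboxed as U

-- An unboxed vector of pairs; internally this is stored as two vectors.
rows :: U.Vector (Int, Double)
rows = U.fromList [(1, 2.5), (2, 3.5), (3, 4.5)]

-- Splitting into columns does not copy the elements.
ids :: U.Vector Int
values :: U.Vector Double
(ids, values) = U.unzip rows

main :: IO ()
main = do
  print ids
  print (U.sum values)            -- aggregates a single column
  print (U.zip ids values == rows) -- zipping the columns restores the rows
```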

silky · January 10, 2024, 3:21pm

A drive-by comment; I don’t know if you spotted this recently as well, but maybe it would be worth taking a glance at polars (pola-rs/polars on GitHub): dataframes powered by a multithreaded, vectorized query engine, written in Rust.

LaurentRDC · January 10, 2024, 3:22pm

I did take a look at this before. Looks like a great project.
