Columnar storage of datatypes

In my quest to bring over my data science workflows to Haskell, I have previously implemented one-dimensional labeled arrays (series) in javelin.

These are already very useful to me, but they remain sub-par for some workloads, especially on the performance front. Tabular data analysis libraries like Frames don’t help here, because the data is essentially kept as an array of rows.

The next step is to create a columnar container, often called a dataframe. The idea behind a dataframe is that for a composite datatype (e.g. (a, b, c)), an array of this data is stored column-by-column:

DataFrame (a, b, c) ~ (Array a, Array b, Array c)

For nested datatypes (e.g. (a, b, (c, d))), the data is still mapped to columns:

DataFrame (a, b, (c, d)) ~ (Array a, Array b, Array c, Array d)

These ideas are described in the Dremel paper and serve as the basis of the Parquet data storage format.

My question is: How can I control the data representation such that the underlying data of DataFrame is stored as a tuple of columns? My understanding is that a form of this is used in vector to represent unboxed vectors of tuples as tuples of unboxed vectors, but I am very confused about how it’s done in that package.

First of all, the distinction between column and row storage would not exist if it were not for the types.
I think there is a route if you accept the nested form

DataFrame (a, (b, c)) ~ (Array a, (Array b, Array c))

using a GADT such as

{-# LANGUAGE GADTs, KindSignatures #-}
import Data.Kind (Type)
data DataFrame :: Type -> Type where
    Nil :: DataFrame ()
    Cons :: Array a -> DataFrame b -> DataFrame (a,b)

You could even make Array a type parameter and vary the Array type you use, thus permitting unboxed representations.
Perhaps proper type-level lists can take you even further.
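
To make the "Array as a type parameter" idea concrete, here is a minimal sketch of the same GADT with the column container abstracted as f. The helper names (width, headColumn, mapColumns, example) are made up for illustration, and plain lists stand in for a real array type. Unboxed containers would additionally need a constraint on the element type, which this sketch leaves out.

```haskell
{-# LANGUAGE GADTs, KindSignatures, RankNTypes #-}

import Data.Kind (Type)

-- 'f' is the column container, e.g. a list, a boxed vector, or an array type.
data DataFrame (f :: Type -> Type) :: Type -> Type where
  Nil  :: DataFrame f ()
  Cons :: f a -> DataFrame f b -> DataFrame f (a, b)

-- Number of columns in the frame.
width :: DataFrame f r -> Int
width Nil           = 0
width (Cons _ rest) = 1 + width rest

-- The first column; the type index rules out calling this on 'Nil'.
headColumn :: DataFrame f (a, b) -> f a
headColumn (Cons col _) = col

-- Swap the container of every column at once (hypothetical helper).
mapColumns :: (forall x. f x -> g x) -> DataFrame f r -> DataFrame g r
mapColumns _  Nil             = Nil
mapColumns nt (Cons col rest) = Cons (nt col) (mapColumns nt rest)

-- Two columns, stored in plain lists for the sake of the example.
example :: DataFrame [] (Int, (String, ()))
example = Cons [1, 2, 3] (Cons ["a", "b", "c"] Nil)
```

Because the row type is only an index, nothing here ever builds a row: headColumn example hands back the Int column without touching the String column.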

Ultimately it’s a fairly straightforward application of type families (data families, specifically), but maybe that’s not the part you’re confused about? If you were a little more specific about what in vector confuses you I could try to help.
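
A rough sketch of what the data-family route could look like, under the assumption that each row type picks its own columnar layout through a class. The class Columnar, its methods, and the constructor names are all hypothetical, and lists stand in for real arrays. The instance for pairs has the same general shape as storing a vector of tuples as a tuple of vectors.

```haskell
{-# LANGUAGE TypeFamilies #-}

import Data.Kind (Type)

-- Hypothetical class: every row type chooses its own columnar layout.
class Columnar r where
  data DataFrame r :: Type
  fromRows :: [r] -> DataFrame r
  toRows   :: DataFrame r -> [r]

-- Leaf types are stored as a single column (a list stands in for an array).
instance Columnar Int where
  newtype DataFrame Int = IntColumn [Int]
  fromRows = IntColumn
  toRows (IntColumn xs) = xs

instance Columnar Double where
  newtype DataFrame Double = DoubleColumn [Double]
  fromRows = DoubleColumn
  toRows (DoubleColumn xs) = xs

-- A frame of pairs is a pair of frames, so nested tuples end up as columns.
instance (Columnar a, Columnar b) => Columnar (a, b) where
  data DataFrame (a, b) = PairFrame (DataFrame a) (DataFrame b)
  fromRows rows = PairFrame (fromRows (map fst rows)) (fromRows (map snd rows))
  toRows (PairFrame xs ys) = zip (toRows xs) (toRows ys)

-- Projecting a column does not touch the other columns.
firstColumn :: DataFrame (a, b) -> DataFrame a
firstColumn (PairFrame xs _) = xs
```

With these instances, DataFrame (Int, (Double, Int)) is represented as an Int column next to a pair of a Double column and an Int column, which is the nested variant of the layout discussed above.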

FWIW, the Frame type in the Frames package also stores things in columns, specifically in vectors from the vector package, using vectors of unboxed types when possible. If you look at the InCore module you will see how these columns are built.

(First off: I don’t have a piece of Haskell code that does what you ask for, but I’ve thought about similar problems in the past with heidi (tidy data in Haskell), even though it’s row-oriented.)

I would suggest first exploring functions that operate on the column representation (lookup etc.) and giving them an Applicative interface. (Maybe jumping through hoops with type families and GADTs isn’t necessary.)
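
One possible reading of that suggestion, as a sketch only: columns live in an untyped map, and a small Applicative combines column lookups row-wise. The names Columns, Lookup, column, and total are invented for this example, and only Data.Map from containers is used.

```haskell
import qualified Data.Map.Strict as M

-- Hypothetical untyped column store: column name -> values.
type Columns = M.Map String [Double]

-- A computation that reads columns and yields one value per row,
-- or fails with Nothing if a column is missing.
newtype Lookup a = Lookup { runLookup :: Columns -> Maybe [a] }

instance Functor Lookup where
  fmap f (Lookup g) = Lookup (fmap (map f) . g)

instance Applicative Lookup where
  -- A constant column; infinite, so it zips against any column length.
  pure x = Lookup (const (Just (repeat x)))
  Lookup f <*> Lookup g = Lookup $ \cols -> zipWith ($) <$> f cols <*> g cols

-- Read one column by name.
column :: String -> Lookup Double
column name = Lookup (M.lookup name)

-- A row-wise expression built from columns, without materialising rows.
total :: Lookup Double
total = (+) <$> column "x" <*> column "y"
```

Running runLookup total against a store that has both "x" and "y" adds the two columns element by element; if either column is absent the whole lookup returns Nothing. Note that a bare pure yields an infinite list, so it is only meant to be combined with real columns.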

My confusion stems from the fact that I can’t find the instance (Unbox a, Unbox b) => Vector Vector (a, b).

For example, I get that an unboxed vector of type Vector (Complex a) gets represented the same way as Vector (a, a) because it’s explicitly defined here.

Now that I’m looking at the source, I see why I was confused: the tuple instances are defined elsewhere. It’s here
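
For readers following along, a small usage sketch of the behaviour being discussed. It assumes the vector package is installed and uses only Data.Vector.Unboxed functions (fromList, unzip, zip, sum); the names points, columns, and sumOfSecond are made up. As far as I understand, the unboxed Vector type is a data family whose tuple instance keeps one unboxed vector per component, which is why zip and unzip are documented as O(1).

```haskell
import qualified Data.Vector.Unboxed as U

-- An unboxed vector of pairs; internally held as two unboxed vectors.
points :: U.Vector (Int, Double)
points = U.fromList [(1, 0.5), (2, 1.5), (3, 2.5)]

-- Splitting into columns hands back the two underlying vectors.
columns :: (U.Vector Int, U.Vector Double)
columns = U.unzip points

-- Column-wise work only reads the column it needs.
sumOfSecond :: Double
sumOfSecond = U.sum (snd columns)
```

Zipping the two columns back together with U.zip gives a vector equal to points, so the row view and the column view are two ways of looking at the same storage.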

A drive-by comment: I don’t know if you’ve come across this already, but it might be worth a glance: pola-rs/polars, dataframes powered by a multithreaded, vectorized query engine, written in Rust.

I did take a look at this before. Looks like a great project.
