[ANN] dataframe 0.1.0.0

I’ve been working on this for some months now and it’s in a mostly usable state.

Currently it only works with CSV, but I'm working on Parquet integration since that's what I mostly use at work. There are small tutorials in the GitHub repo.

Hoping to have it be more feature-rich after ZuriHac.

Thanks,

Michael

15 Likes

Congrats Michael! I know you’ve been thinking about and working on this for a long time. Looking forward to trying it out

1 Like

Nice!! I've been waiting for this for a long time. Congrats Michael, that's nice work!

1 Like

Thanks Laurent! It’s been really useful for ad hoc CSV crunching at work. Have you had time to work on javelin dataframes recently?

Obrigado Vitor! Lemme know what you think and what’s missing.

Unfortunately, no. Ironically, I changed jobs and won’t need to use dataframes for a while :person_shrugging:

Very nice! Looking at your dev notes, do you have opinions already on how the lazy (graph-based) API would work?

I’ve been using pyspark and polars dataframes a lot. The appealing direction for me right now is creating a monadic DSL plus an expression language that builds up a computation, then building a query planner on top of that API. Then I'd fork all the in-memory functions into a new subdirectory where they operate on this monad instead.

Still very vague but something like:

df <- readParquet "./file.parquet"
small <- deferred df >>= apply (col "field" :+: lit 3)
                     >>= filter (col "field" :<: lit 10)
print $ runEval small

It's still all in my head, though, and would probably still require runtime checks.
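To make the idea above concrete, here is a minimal, self-contained sketch of one way a deferred interpreter could look. Everything here is hypothetical: the `Expr` and `Plan` types, the target-column argument to `Apply`, and the list-of-rows stand-in for actual file IO are all illustrative, not dataframe's real API.

```haskell
module Main where

-- A tiny expression language over one numeric value type.
data Expr
  = Col String      -- reference a column by name
  | Lit Double      -- a literal value
  | Expr :+: Expr   -- addition
  | Expr :<: Expr   -- comparison, used as a filter predicate
  deriving Show

infixl 6 :+:
infix  4 :<:

-- A deferred plan: each step records an operation instead of running it,
-- so a query planner could inspect and rewrite it before evaluation.
data Plan
  = Source FilePath       -- e.g. a CSV/Parquet file to read lazily
  | Apply String Expr Plan -- (re)compute the named column from an expression
  | Filter Expr Plan       -- keep rows where the predicate holds
  deriving Show

type Row = [(String, Double)]

-- Evaluate an arithmetic Expr against a single row.
evalExpr :: Row -> Expr -> Double
evalExpr row e = case e of
  Col c   -> maybe (error ("missing column " ++ c)) id (lookup c row)
  Lit x   -> x
  a :+: b -> evalExpr row a + evalExpr row b
  _ :<: _ -> error "comparison is only meaningful as a filter predicate"

-- Evaluate a predicate Expr against a single row.
evalPred :: Row -> Expr -> Bool
evalPred row (a :<: b) = evalExpr row a < evalExpr row b
evalPred _   _         = error "not a predicate"

-- Interpret a Plan over in-memory rows (standing in for file IO).
runEval :: [Row] -> Plan -> [Row]
runEval rows plan = case plan of
  Source _     -> rows
  Apply c e p  -> map (\r -> (c, evalExpr r e) : filter ((/= c) . fst) r)
                      (runEval rows p)
  Filter e p   -> filter (\r -> evalPred r e) (runEval rows p)

main :: IO ()
main = do
  let rows = [[("field", 1)], [("field", 5)], [("field", 9)]]
      -- Same shape as the sketch above: add 3, then keep rows < 10.
      plan = Filter (Col "field" :<: Lit 10)
           $ Apply "field" (Col "field" :+: Lit 3)
           $ Source "./file.parquet"
  print (runEval rows plan)
```

Because the plan is a plain data structure, a later query planner could, for example, push `Filter` steps toward `Source` before any rows are materialized; that is the usual payoff of the graph-based approach.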

I think something like Laurent's javelin (schema on read) might be a little better suited to OLAP-style uses, but implementing this will be a fun learning exercise.

4 Likes