I've been exploring the design space and wanted to try creating a dataframe library meant for more exploratory data analysis, i.e. where you don't know the shape of the data beforehand and want to load it up quickly and answer pretty basic questions.
Please let me know what you think of this direction and maybe clue me in on some existing tools in case I’m duplicating work.
Like polars, I'd like to eventually get parallelism gains from using Haskell. It might be worth wrapping python for charting, though I noticed Haskell-chart already does this.
The Dataframe project is great; I could have used it many times.
It’s also great that Dataframe puts a spotlight on the gulf between routine quick exploratory data analysis (EDA) and typical rigorous haskell engineering, where strong static typing is vitally important. This raises interesting questions about the product concept and roadmap.
Given that Dataframe types are dynamic and column names are strings, it’s philosophically much closer to python-pandas-polars and R-tidyverse than traditional haskell. With that structural foundation, the dataframe package has two obvious paths going forward:
Path 1) Continue in the dynamic direction, expanding functionality to compete directly with python and R.
Path 2) Provide a bridge between a lightweight dynamic style and traditional haskell strong static typing.
Path 1 is a big project, simply because the depth and breadth of other EDA platforms set the baseline, and therefore community expectations. For example, matplotlib and seaborn are flexible and define the cultural norm, and building analogous haskell tools would need substantial engineering resources.
Next, suppose that a full-featured Dataframe is wildly successful. That’s essentially a big EDSL which doesn’t share haskell’s strengths, value propositions, trajectory, and culture. That’s not a bad thing but it’s not haskell.
Path 2 is another thing entirely. It would be straightforward to write adaptors between the dataframe library and traditional haskell packages like aeson, but there are bigger questions around the UX and understanding just what problems the bridge is solving.
I thought about this. There could be two abstractions (see the sketch below):
1) What we have now, which is, as you note, far from traditional Haskell: the goal is loading data into memory, doing some analysis on top of it, and answering questions quickly. An API like this benefits from being somewhat dynamic, since a user wants to fail fast and try things.
2) A lazy API that builds an execution graph on top of a typed representation of a dataframe and then runs it at the end. This one would be closer to an ETL pipeline for files larger than memory. The fail-fast model doesn't work here, because you don't want a pipeline that fails 20 minutes into a run with a runtime error.
This is similar to the distinction that Polars has. I’ve noted this issue in “Future work”.
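To make the contrast concrete, here's a deliberately toy sketch of the two flavours. None of the names (DynFrame, Query, collect, etc.) come from the dataframe package; they're invented purely for illustration. The eager flavour fails at runtime when a column name or type is wrong, while the lazy flavour builds a typed plan that only executes when collected:

```haskell
{-# LANGUAGE GADTs #-}
import qualified Data.Map.Strict as M

-- Flavour 1: eager and dynamic. Columns are stringly keyed and
-- dynamically typed, so a bad column name or type fails at runtime.
data DynColumn = IntCol [Int] | TextCol [String] deriving Show
type DynFrame = M.Map String DynColumn

sumColumn :: String -> DynFrame -> Either String Int
sumColumn name df = case M.lookup name df of
  Just (IntCol xs) -> Right (sum xs)
  Just (TextCol _) -> Left (name ++ " is not numeric")
  Nothing          -> Left ("no such column: " ++ name)

-- Flavour 2: lazy and typed. A query is just a value describing the plan;
-- the plan is typechecked when it is built and nothing runs until 'collect'.
data Query a where
  Lit     :: [a] -> Query a
  MapQ    :: (a -> b) -> Query a -> Query b
  FilterQ :: (a -> Bool) -> Query a -> Query a

collect :: Query a -> [a]
collect (Lit xs)      = xs
collect (MapQ f q)    = map f (collect q)
collect (FilterQ p q) = filter p (collect q)

main :: IO ()
main = do
  let df = M.fromList [("age", IntCol [23, 41, 37])]
  print (sumColumn "age" df)     -- Right 101
  print (sumColumn "salary" df)  -- Left "no such column: salary"
  print (collect (MapQ (* 2) (FilterQ even (Lit [1 .. 10 :: Int]))))
```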
a lazy API that builds an execution graph on top of a typed representation of a dataframe
Cool.
(I’m assuming that typed representation means dynamic type inference inside DataFrame, e.g., during read_csv, rather than traditional haskell compile-time typing. If that’s wrong, ignore everything below.)
Then we’d have two independent type systems and API flavors which users see as unrelated. In the extreme case, a data science user could be completely unaware of traditional haskell typing, because the EDSL is so all-encompassing that there’s no type system leakage from haskell. At the same time, traditional haskell users might cringe because there are now two independent type systems in play, and DataFrame can’t easily interoperate with existing apps and packages.
Imagine a typical python or R use case: Subset rows of a dataframe using a regex on a string column, and for that subset, add a new column based on a lambda function that calls scikit. For that to be successful, there needs to be a user-friendly bridge between these two worlds.
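For concreteness, here's a tiny, purely hypothetical sketch of what that use case could look like on the plain-Haskell side of such a bridge. Row, matching, withScore, and score are all invented for this comment; isInfixOf stands in for a real regex match, and score stands in for the lambda that would call scikit:

```haskell
import Data.List (isInfixOf)

data Row = Row { email :: String, visits :: Int } deriving Show

-- "Subset rows using a regex on a string column..."
matching :: String -> [Row] -> [Row]
matching pat = filter (\r -> pat `isInfixOf` email r)

-- "...and for that subset, add a new column based on a lambda."
withScore :: (Row -> Double) -> [Row] -> [(Row, Double)]
withScore f = map (\r -> (r, f r))

main :: IO ()
main = do
  let rows  = [Row "a@example.com" 3, Row "b@test.org" 7]
      score = fromIntegral . visits  -- stand-in for a model prediction
  mapM_ print (withScore score (matching "example.com" rows))
```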
A new, independent type system is a big architectural stake in the ground, so opinions on DataFrame may differ based on this point. Presumably data scientists, with a job to be done quickly, will love DataFrame for all the reasons they love python and R. At the same time, haskell traditionalists might be less enthusiastic because the world they knew – one big type system at the center of everything – has become two big type systems that don’t interoperate.
I wonder if there's an interesting intersection between your two abstractions (quick dynamic EDA tools, execution graph with type safety). If building and type-checking the graph were very fast and integrated into a platform like jupyter, you'd essentially have a general-purpose type-safe data science platform.
Yes, it would be great, but the reality is that all data sources should be assumed to be garbage, so committing to a particular (and fixed!) type for a column that's been read straight from a CSV is generally a bad idea.
What ‘Frames’ does is sample a few data rows at compile time and construct a type from those, but what if someone put a phone number instead of an email at row 9872342? And so on.
1) Heuristically infer the column datatype, like pandas and other CSV readers do. Typically they sample n rows and make a good guess. (Obviously this issue disappears with typed data like SQL tables.)
2) Read the full CSV, detecting errors when text can't be parsed into the column's datatype. Those errors could produce NaN or Nothing values, throw exceptions, ignore entire rows, call a lambda that produces default values, or do any other useful API behavior.
3) Typecheck the execution graph using the inferred CSV datatypes, as well as anything else relevant, e.g., 1000: DataFrame.Number in the example above.
These phases are essentially independent, so the general-purpose type-safe data science platform really is doable. I get excited just thinking about it!
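As a rough illustration of phases 1 and 2 (heuristic inference from a sample, then tolerant parsing of the full column), here's a toy sketch; ColType, inferType, and readColumn are invented names, not anything from the dataframe package:

```haskell
import Text.Read (readMaybe)
import Data.Maybe (isJust)

data ColType = IntType | DoubleType | TextType deriving (Eq, Show)

-- Phase 1: sample the first n cells and guess the column's type.
inferType :: Int -> [String] -> ColType
inferType n cells
  | all isInt sample    = IntType
  | all isDouble sample = DoubleType
  | otherwise           = TextType
  where
    sample     = take n cells
    isInt s    = isJust (readMaybe s :: Maybe Int)
    isDouble s = isJust (readMaybe s :: Maybe Double)

-- Phase 2: parse the whole column against the inferred type; cells that
-- don't parse become Nothing instead of aborting the load.
readColumn :: ColType -> [String] -> [Maybe Double]
readColumn TextType = map (const Nothing)
readColumn _        = map readMaybe

main :: IO ()
main = do
  let cells = ["1.5", "2.0", "oops", "3.5"]
      ty    = inferType 2 cells   -- guesses DoubleType from the sample
  print ty                        -- DoubleType
  print (readColumn ty cells)     -- [Just 1.5,Just 2.0,Nothing,Just 3.5]
```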