I’ve been exploring the design space and wanted to try creating a dataframe library that’s meant for more exploratory data analysis. That is, where you don’t know the shape of the data beforehand and want to load it up quickly and answer pretty basic questions.
Please let me know what you think of this direction and maybe clue me in on some existing tools in case I’m duplicating work.
Like Polars, I’d like to eventually get parallelism gains from using Haskell. It might be worth wrapping Python for charting, though I noticed Haskell-chart already does this.
The Dataframe project is great, I could have used it many times.
It’s also great that Dataframe puts a spotlight on the gulf between routine quick exploratory data analysis (EDA) and typical rigorous haskell engineering, where strong static typing is vitally important. This raises interesting questions about the product concept and roadmap.
Given that Dataframe types are dynamic and column names are strings, it’s philosophically much closer to python-pandas-polars and R-tidyverse than traditional haskell. With that structural foundation, the dataframe package has two obvious paths going forward:
Path 1) Continue in the dynamic direction, expanding functionality to compete directly with python and R.
Path 2) Provide a bridge between a lightweight dynamic style and traditional haskell strong static typing.
Path 1 is a big project, simply because the depth and breadth of other EDA platforms define the baseline, and therefore define community expectations. For example, matplotlib and seaborn are flexible and define the cultural norm, and building analogous haskell tools will need substantial engineering resources.
Next, suppose that a full-featured Dataframe is wildly successful. That’s essentially a big EDSL which doesn’t share haskell’s strengths, value propositions, trajectory, and culture. That’s not a bad thing but it’s not haskell.
Path 2 is another thing entirely. It would be straightforward to write adaptors between the dataframe library and traditional haskell packages like aeson, but there are bigger questions around the UX and understanding just what problems the bridge is solving.
I thought about this. There could be two abstractions:
1) What we have now which, as you note, is far from typical Haskell: the goal is to load data into memory, do some analysis on top of it, and answer questions quickly. An API like this benefits from being somewhat dynamic, since a user wants to fail fast and try things.
2) A lazy API that builds an execution graph on top of a typed representation of a dataframe and then runs it at the end. This one would be closer to an ETL pipeline for files larger than memory. The fail-fast model doesn’t work here, because you don’t want a pipeline that fails 20 minutes into running with a runtime error. (There’s a rough sketch of both flavors below.)
This is similar to the distinction that Polars has. I’ve noted this issue in “Future work”.
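Purely as a hypothetical sketch, and not the current API (DynFrame, columnMean, Expr and so on are all made up here), the two flavors could look something like this:

```haskell
{-# LANGUAGE GADTs #-}

-- Hypothetical sketch only: none of these names are the library's API.
module TwoFlavors where

import qualified Data.Map.Strict as M

-- Flavor 1: eager and dynamic. Columns are discovered at runtime, so a
-- missing column or a non-numeric cell fails fast with a runtime error.
data DynValue = DInt Int | DDouble Double | DText String
  deriving Show

newtype DynFrame = DynFrame (M.Map String [DynValue])

columnMean :: String -> DynFrame -> Either String Double
columnMean name (DynFrame cols) = do
  vals <- maybe (Left ("no such column: " <> name)) Right (M.lookup name cols)
  nums <- traverse asDouble vals
  pure (sum nums / fromIntegral (length nums))
  where
    asDouble (DInt i)    = Right (fromIntegral i)
    asDouble (DDouble d) = Right d
    asDouble (DText t)   = Left ("not numeric: " <> t)

-- Flavor 2: lazy and typed. The pipeline is only *described* as a typed
-- expression graph; an interpreter (or query planner) runs it at the end,
-- so type errors surface before a 20-minute job starts.
data Expr a where
  Col    :: String -> Expr [Double]                     -- typed column reference
  MapCol :: (Double -> Double) -> Expr [Double] -> Expr [Double]
  Mean   :: Expr [Double] -> Expr Double

plan :: Expr Double
plan = Mean (MapCol (* 2) (Col "price"))
```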
a lazy API that builds an execution graph on top of a typed representation of a dataframe
Cool.
(I’m assuming that typed representation means dynamic type inference inside DataFrame, e.g., during read_csv, rather than traditional haskell compile-time typing. If that’s wrong, ignore everything below.)
Then we’d have two independent type systems and API flavors which users see as unrelated. In the extreme case, a data science user could be completely unaware of traditional haskell typing, because the EDSL is so all-encompassing that there’s no type system leakage from haskell. At the same time, traditional haskell users might cringe because there are now two independent type systems in play, and DataFrame can’t easily interoperate with existing apps and packages.
Imagine a typical python or R use case: Subset rows of a dataframe using a regex on a string column, and for that subset, add a new column based on a lambda function that calls scikit. For that to be successful, there needs to be a user-friendly bridge between these two worlds.
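As a toy, made-up sketch (none of these names come from the dataframe package), the bridged version of that use case could look roughly like this in Haskell, with an ordinary typed function standing in for the scikit call:

```haskell
-- Toy sketch only; not the dataframe package's API.
module BridgeSketch where

import qualified Data.Map.Strict as M
import           Text.Regex.TDFA ((=~))

-- The dynamic side: a row is just a map from column name to raw text.
type Row = M.Map String String

-- The typed side: any ordinary Haskell function (standing in for the
-- scikit-style model call in the Python example).
score :: String -> Double
score = fromIntegral . length

-- Subset rows whose "email" column matches a regex, then add a "score"
-- column computed by the typed function for that subset.
enrich :: [Row] -> [Row]
enrich rows =
  [ M.insert "score" (show (score email)) row
  | row        <- rows
  , Just email <- [M.lookup "email" row]
  , email =~ ("@example\\.com$" :: String)
  ]
```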
A new, independent type system is a big architectural stake in the ground, so opinions on DataFrame may differ based on this point. Presumably data scientists, with a job to be done quickly, will love DataFrame for all the reasons they love python and R. At the same time, haskell traditionalists might be less enthusiastic because the world they knew – one big type system at the center of everything – has become two big type systems that don’t interoperate.
Wonder if there’s an interesting intersection between your two abstractions (quick dynamic EDA tools, execution call graph with type safety). If building and type-checking the graph were very fast and integrated into a platform like jupyter, you’d essentially have a general-purpose type-safe data science platform.
Yes, it would be great, but the reality is that all data sources should be assumed to be garbage; assuming a particular (and fixed!) type for a column that’s been read straight from a CSV is therefore generally a bad idea.
What ‘Frames’ does is sample a few data rows at compile time and construct a type for those. But what if someone put a phone number instead of an email at row 9872342? And so on.
There’s a reasonable way to handle that, in phases:
1) Heuristically infer the column datatype, like pandas and other CSV readers do. Typically they sample n rows and make a good guess. (Obviously this issue disappears with typed data like SQL tables.)
2) Read the full CSV, detecting errors when text can’t be parsed into the column’s datatype. Those errors could produce NaN or Nothing values, throw exceptions, ignore entire rows, call a lambda that produces default values, or do any other useful API behavior.
3) Typecheck the execution graph using the inferred CSV datatypes, as well as anything else relevant, e.g., 1000: DataFrame.Number in the example above.
These phases are essentially independent, so the general-purpose type-safe data science platform really is doable. I get excited just thinking about it!
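For instance, phases 1 and 2 might look roughly like this crude, hypothetical sketch (not the dataframe package’s code):

```haskell
-- Minimal sketch of phases 1 and 2, with deliberately crude inference rules.
module InferSketch where

import Data.Maybe (isJust)
import Text.Read  (readMaybe)

data ColType = CInt | CDouble | CText
  deriving (Eq, Show)

-- Phase 1: sample the first n raw values of a column and guess its type.
inferType :: Int -> [String] -> ColType
inferType n raw
  | all isInt    sample = CInt
  | all isDouble sample = CDouble
  | otherwise           = CText
  where
    sample     = take n raw
    isInt s    = isJust (readMaybe s :: Maybe Int)
    isDouble s = isJust (readMaybe s :: Maybe Double)

-- Phase 2: parse the whole column against the inferred type, mapping
-- unparseable cells (the phone number at row 9872342) to Nothing instead
-- of aborting the load. Other policies (defaults, dropping rows, throwing)
-- would slot in here.
data Cell = CellInt Int | CellDouble Double | CellText String
  deriving Show

parseColumn :: ColType -> [String] -> [Maybe Cell]
parseColumn CInt    = map (fmap CellInt    . readMaybe)
parseColumn CDouble = map (fmap CellDouble . readMaybe)
parseColumn CText   = map (Just . CellText)
```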
“There is no ‘standard library’ of table operations.”
[from the paper linked from the ReadMe]
That is just not true. The yardstick is ‘Relational Completeness’ (Codd, 1970). I’m very surprised that section 2.2 ‘Benchmark origins’ mentions SQL – which is, frankly, an embarrassment to the whole industry – but not Relational Algebra, which is as succinct and elegant as the Lambda calculus (a comparison worth making, since you’re posting on a Haskell forum).
As the paper goes on to say, it may be that some particular language doesn’t provide that exact set out of the box; it must then be able to compose what it does provide to arrive at those operations. (In the literature this is known as being ‘expressively complete’ – see Date & Darwen.)
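To make ‘compose what it does provide’ concrete, here’s a toy sketch (my own names, nothing from the paper): intersection and projection fall out of a handful of primitives.

```haskell
-- Toy illustration of expressive completeness: given a few primitive
-- operators, further operators are definable by composition.
module RelAlgToy where

import qualified Data.Map.Strict as M
import qualified Data.Set        as S

-- Rows are keyed by attribute name; deliberately no positional access.
type Row      = M.Map String String
type Relation = S.Set Row

-- Some of Codd's primitives.
restrict :: (Row -> Bool) -> Relation -> Relation
restrict = S.filter

union, difference :: Relation -> Relation -> Relation
union      = S.union
difference = S.difference

-- Intersection need not be primitive: R ∩ S = R − (R − S).
intersection :: Relation -> Relation -> Relation
intersection r s = r `difference` (r `difference` s)

-- Projection onto a set of attribute names, composed from Map operations.
project :: [String] -> Relation -> Relation
project names = S.map (M.filterWithKey (\k _ -> k `elem` names))
```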
And yes, the choice of user-facing operations is down to human factors – which I imagine are so difficult to compare as to make rating pointless.
[3.1 Basic definitions] A schema is an ordered sequence of column names and associated sorts.
No to the “ordered”: if the tool exposes the implementation detail of left-to-right ordering, I would ipso facto count it as not industrial strength. Worse still if it supports positional access as well as access via column name. (This is the ugly situation Haskell has got itself into, making for very brittle code. Contrast PureScript, for example, which doesn’t allow positional access to fields of datatypes declared with record syntax – quite right too.)
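A small made-up example of that brittleness, using an ordinary record rather than anything from the dataframe package:

```haskell
-- If two fields share a type, reordering the declaration silently changes
-- what positional code means; access by field name is unaffected.
module PositionalBrittleness where

data Person = Person
  { firstName :: String
  , lastName  :: String   -- same type as firstName: swapping them still compiles
  , age       :: Int
  }

-- Positional: tied to the declaration order of the fields.
lastPositional :: Person -> String
lastPositional (Person _ l _) = l

-- By name: keeps working if fields are reordered or new ones are added.
lastByName :: Person -> String
lastByName = lastName
```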
Since ‘Benchmark origins’ does mention SQL, note that positional access – a 1960s-era holdover – is widely regarded as a mis-feature. This century’s iterations of the standard are aiming to extirpate it.
Reading on, I’m going to disagree with rather a lot of section 3.1, to the extent that I’m not seeing the value in the exercise. But first, reflect on this point from section 3.2:
… we ignore input-output … does not stipulate how tables are entered into programs …
Exactly. The left-to-right sequence of how (say) CSV physical/textual columns match to logical named columns is an implementation detail that should be rated (if at all) under usability, not expressivity.