[Initial feedback request] DataFrame library

artem · January 31, 2025, 12:48pm

I think it’s a good read for everyone thinking about this design space. Do I think it’s a holy grail and correct in all cases? Certainly not (and I don’t think such a thing exists).

mchav · January 31, 2025, 4:35pm

Didn’t read the paper yet, but this seems like a good checklist for operational completeness. It is missing transpose, though, which the Modin paper seems to suggest is a foundational dataframe operator.

artem · January 31, 2025, 6:24pm

Transpose makes sense for matrices but not dataframes, which, in general, hold heterogeneous data. Am I missing something?

mchav · January 31, 2025, 7:00pm

See around page 20 of the Modin paper. The argument is that a dataframe’s algebra can be distilled to a set of core operations. Transpose is one of them. In fact, most literature I’ve read suggests that a dataframe is a mix of matrix, relational, and spreadsheet algebra.

This was first discussed in the paper “Is a dataframe just a table?”. I reached out to Hellerstein about these comments and he said his views on the issue have progressed and he largely accepts the definition proposed in Modin (granted he’s a co-author).

mchav · February 2, 2025, 10:19am

To be more precise: Modin’s definition of a dataframe is a 4-tuple (A_mn, S_n, C_n, R_m) where S_n is a collection of schema induction functions (parsing functions), each from String to a Type. The underlying 2D array can be transposed. This would invalidate S_n, so it would have to be regenerated or initialized as a list of identity functions. Of course this is only a theoretical definition that’s almost impossible to implement efficiently. And I admit it does seem very post hoc, but dataframes are weird…
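To make the 4-tuple concrete, here is a rough Python sketch of how I read it. This is not Modin’s implementation or API. The names `Frame`, `column`, and `identity` are made up for illustration, and I am assuming that C_n and R_m are the column and row labels. It stores raw strings, parses a column only on access, and resets the schema to identity functions on transpose.

```python
from dataclasses import dataclass
from typing import Any, Callable, List

# A schema induction function: parses a raw string into some type.
Parser = Callable[[str], Any]


def identity(s: str) -> str:
    """Fallback parser that leaves the raw string untouched."""
    return s


@dataclass
class Frame:
    """Hypothetical stand-in for the 4-tuple (A_mn, S_n, C_n, R_m)."""
    values: List[List[str]]   # A_mn: m rows by n columns of raw strings
    schema: List[Parser]      # S_n: one parsing function per column
    col_labels: List[str]     # C_n: assumed here to be column labels
    row_labels: List[str]     # R_m: assumed here to be row labels

    def column(self, j: int) -> List[Any]:
        """Apply column j's parser to every raw value in that column."""
        return [self.schema[j](row[j]) for row in self.values]

    def transpose(self) -> "Frame":
        """Flip the array and swap the labels.

        The old schema no longer applies, so every new column gets
        the identity parser until a schema is induced again.
        """
        flipped = [list(col) for col in zip(*self.values)]
        return Frame(
            values=flipped,
            schema=[identity] * len(self.row_labels),
            col_labels=self.row_labels,
            row_labels=self.col_labels,
        )


frame = Frame(
    values=[["1", "a"], ["2", "b"], ["3", "c"]],
    schema=[int, str],
    col_labels=["x", "y"],
    row_labels=["r0", "r1", "r2"],
)
t = frame.transpose()
```

Here `frame.column(0)` parses to integers, while `t.column(0)` comes back as the raw strings `"1"` and `"a"`, because the transposed column mixes what used to be two differently typed columns. Transposing twice restores the array but not the original schema, which is the regeneration cost mentioned above.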

From talking to a lot of people who use dataframes as a way to clean data (as opposed to as a query engine), this sort of operation is useful. And I think the example in the Modin paper is illustrative of an intuitive use of transposition.
