I think it’s a good read for everyone thinking about this design space. Do I think it’s a holy grail that is correct in all cases? Certainly not (and I don’t think such a thing exists).
I haven’t read the paper yet, but this seems like a good checklist for operational completeness. It is missing transpose, though, which the Modin paper seems to suggest is a foundational dataframe operator.
Transpose makes sense for matrices but not for dataframes, which, in general, hold heterogeneous data. Am I missing something?
See around page 20 of the Modin paper. The argument is that a dataframe’s algebra can be distilled to a set of core operations, and transpose is one of them. In fact, most literature I’ve read suggests that a dataframe is a mix of matrix, relational, and spreadsheet algebra.
This was first discussed in the paper “Is a dataframe just a table?”. I reached out to Hellerstein about these comments, and he said his views on the issue have progressed and that he largely accepts the definition proposed in Modin (granted, he’s a co-author).
To be more precise: Modin’s definition of a dataframe is a 4-tuple (A_mn, S_n, C_n, R_m), where A_mn is the underlying 2D array and S_n is a collection of schema induction functions (parsing functions) from String to a Type. The underlying 2D array can be transposed. This would invalidate S_n, so it would have to be regenerated or initialized as a list of identity functions. Of course, this is only a theoretical definition that’s almost impossible to implement efficiently. And I admit it does seem very post hoc, but dataframes are weird…
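To make that concrete, here is a minimal Python sketch of the 4-tuple as I described it above. It is a toy, not Modin’s implementation: the Frame class, its field names, and the sample data are all made up for illustration, and I’m assuming C_n and R_m are the column and row labels.

```python
from dataclasses import dataclass
from typing import Callable, List


def identity(cell: str) -> str:
    """Fallback parser: leave the raw string untouched."""
    return cell


@dataclass
class Frame:
    # Hypothetical stand-in for the 4-tuple (A_mn, S_n, C_n, R_m).
    values: List[List[str]]                 # A_mn: m rows by n columns of raw strings
    schema: List[Callable[[str], object]]   # S_n: one parsing function per column
    col_labels: List[str]                   # C_n (assumed to be column labels)
    row_labels: List[str]                   # R_m (assumed to be row labels)

    def typed(self) -> List[List[object]]:
        """Apply each column's parser to the raw strings in that column."""
        return [
            [parse(cell) for parse, cell in zip(self.schema, row)]
            for row in self.values
        ]

    def transpose(self) -> "Frame":
        """Flip the array and swap the labels. The old per-column parsers no
        longer line up with the new columns, so reset the schema to identity
        functions (it could instead be re-inferred from the data)."""
        flipped = [list(col) for col in zip(*self.values)]
        new_schema = [identity] * len(self.row_labels)
        return Frame(flipped, new_schema, self.row_labels, self.col_labels)


df = Frame(
    values=[["alice", "30"], ["bob", "25"]],
    schema=[str, int],
    col_labels=["name", "age"],
    row_labels=["r0", "r1"],
)
t = df.transpose()
```

Here df.typed() parses the age column into integers, while t.typed() hands back plain strings, because each new column mixes a name with an age and the old schema no longer applies.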
From speaking to a lot of people who use dataframes as a way to clean data (as opposed to as a query engine), this sort of operation is useful. And I think the example in the Modin paper is illustrative of an intuitive use of transposition.