DataFrame update dump (0.3.1.1 → 0.4.0.5): faster, safer, and a lot more expressive
I’ve been heads-down shipping a pile of improvements to DataFrame over the last few releases, and I wanted to share a “why you should care” summary (with some highlights + a couple examples).
Highlights

Ecosystem
- You can now read SQL tables via the dataframe-persistent integration (written by @junjihashimoto).
- Convert DataFrames into hasktorch tensors and back with dataframe-hasktorch (rough sketch below).
- Symbolic regression (based on srtree) to discover mathematical relationships between columns in dataframe-symbolic-regression (thanks to Fabricio Olivetti).
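To give a feel for the hasktorch bridge, here is a shape-only sketch. The module name and the toTensor/fromTensor functions are hypothetical stand-ins, not the published dataframe-hasktorch API, so check its docs for the real entry points:

import qualified DataFrame as D
import qualified DataFrame.Hasktorch as DH  -- hypothetical module name
import Torch (Tensor)

-- Hypothetical round trip: numeric columns to a 2-D tensor and back.
roundTrip :: D.DataFrame -> D.DataFrame
roundTrip df =
  let t = DH.toTensor df :: Tensor  -- hypothetical function name
  in DH.fromTensor t                -- hypothetical function name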
Performance wins across the board
- Faster folds/reductions (sum, mean), improved filtering + index selection, and more efficient groupBy paths.
- Better join performance (incl. hashmap-based inner joins) + fixes for right/full outer join edge cases.
Decision trees
- We now have a decision tree implementation that also does enumerative bottom-up search for split predicates to capture complex decision boundaries (toy illustration below); it performs well out of the box on Kaggle classification tasks. This will likely be split into a separate library for symbolic machine learning, as proposed here.
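For intuition, here is a toy, self-contained version of that bottom-up predicate search. This is not the library's code, just the general technique: enumerate atomic threshold tests, combine them into two-atom conjunctions/disjunctions, and keep the candidate with the lowest weighted Gini impurity.

import Data.List (minimumBy, nub, partition)
import Data.Ord (comparing)

type Row  = [Double]
type Pred = Row -> Bool

-- Gini impurity of a set of boolean labels.
gini :: [Bool] -> Double
gini ys
  | null ys   = 0
  | otherwise =
      let n = fromIntegral (length ys)
          p = fromIntegral (length (filter id ys)) / n
      in 1 - p * p - (1 - p) * (1 - p)

-- Weighted impurity after splitting on a predicate (lower is better).
splitScore :: [(Row, Bool)] -> Pred -> Double
splitScore rows p =
  let (l, r)  = partition (p . fst) rows
      n       = fromIntegral (length rows)
      side xs = fromIntegral (length xs) / n * gini (map snd xs)
  in side l + side r

-- Bottom-up enumeration: atomic threshold tests first, then two-atom
-- conjunctions and disjunctions; keep the best-scoring candidate.
-- (Assumes a non-empty training set.)
bestPredicate :: [(Row, Bool)] -> Pred
bestPredicate rows =
  let nFeat = length (fst (head rows))
      atoms = [ \row -> row !! i <= v
              | i <- [0 .. nFeat - 1]
              , v <- nub (map ((!! i) . fst) rows) ]
      candidates = atoms
                ++ [ \row -> p row && q row | p <- atoms, q <- atoms ]
                ++ [ \row -> p row || q row | p <- atoms, q <- atoms ]
  in minimumBy (comparing (splitScore rows)) candidates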
Much nicer data cleaning + missing-data ergonomics
- meanMaybe, medianMaybe, stddevMaybe, plus genericPercentile / percentile.
- "Maybe-aware" functions promoted into DataFrame-level functions.
- Null-handling helpers like filterAllNothing and filterNothing, plus better "NA" handling.
- New recodeWithCondition (change values based on a predicate) + recodeWithDefault.
- APIs to create splits in data (kFolds, randomSplit, sampling dataframes). A short sketch of the cleaning helpers follows this list.
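A minimal sketch of how these compose in the usual |> pipeline. The names come straight from the list above, but the signatures and module layout here are my assumptions; check the haddocks for the real types:

import qualified DataFrame as D
import DataFrame ((|>))

-- All signatures below are assumed.
prep :: D.DataFrame -> D.DataFrame
prep df = df
  |> D.filterAllNothing                 -- assumed: drop rows where every value is missing
  |> D.filterNothing "total_bedrooms"   -- assumed: drop rows where this column is Nothing

-- Likewise assumed: an 80/20 train/test split.
-- (train, test) <- D.randomSplit 0.8 df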
Expressions / schema evolution got smoother
- More APIs now operate on expressions instead of strings.
- A growing monadic interface (a state monad) makes "derive / inspect / impute / filter" pipelines readable and resilient as schemas evolve (see the second example below). This lets you use DataFrames in scripts; previously they were mostly designed for notebooks.
I/O + parsing hardening
- CSV: strip header/title whitespace, automatic bool parsing, better type logic, and fast-CSV improvements.
- Parquet: fixes and improvements.
- JSON Lines: initial support.
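Reading data stays a one-liner; D.readCsv picks up the CSV fixes automatically. The Parquet and JSON Lines reader names in the comments are assumptions on my part:

import qualified DataFrame as D

main :: IO ()
main = do
  df <- D.readCsv "data/housing.csv"  -- benefits from the header/bool/type fixes above
  print df
  -- Assumed names; check the docs:
  -- dfPq <- D.readParquet "data/housing.parquet"
  -- dfJl <- D.readJsonLines "data/events.jsonl"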
Example: expressive aggregation with conditionals (0.4.0.1)

df
  |> D.groupBy [F.name ocean_proximity]
  |> D.aggregate
       [ "rand" .= F.sum (F.ifThenElse (ocean_proximity .== "ISLAND") 1 0) ]
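Since the F.ifThenElse expression evaluates to 1 on ISLAND rows and 0 otherwise, summing it gives a per-group count of matching rows, all expressed inline in the aggregation.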
Example: schema-friendly transformation pipeline (0.4.0.0+)

print $ execFrameM df $ do
  is_expensive  <- deriveM "is_expensive" (median_house_value .>= 500000)
  meanBedrooms  <- inspectM (D.meanMaybe total_bedrooms)
  totalBedrooms <- imputeM total_bedrooms meanBedrooms
  filterWhereM (totalBedrooms .>= 200 .&& is_expensive)
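Every step refers to columns as expressions rather than strings: deriveM adds a boolean column, inspectM reads the Maybe-aware mean without leaving the pipeline, imputeM fills the missing bedroom counts with it, and filterWhereM applies the final predicate.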
If you’re doing ETL-style cleaning, feature engineering, or quick stats, or you just want a Haskell-y dataframe that’s getting faster and more ergonomic every release, this is a really good time to try the latest (0.4.0.5).
I’m hoping to put together a GSoC proposal for either Parquet writers or Arrow support, so if you’d like to co-mentor, please reach out.