DataFrame January 2026 Updates

DataFrame update dump (0.3.1.1 → 0.4.0.5): faster, safer, and a lot more expressive

I’ve been heads-down shipping a pile of improvements to DataFrame over the last few releases, and I wanted to share a “why you should care” summary (with some highlights and a couple of examples).

Highlights

  • Ecosystem

  • Performance wins across the board

    • Faster folds/reductions (sum, mean), improved filtering and index selection, and more efficient groupBy paths.

    • Better join performance (incl. hashmap-based inner joins) + fixes for right/full outer join edge cases.

  • Decision trees

    • We now have a decision-tree implementation that also does enumerative bottom-up search for split predicates to capture complex decision boundaries (it performs well out of the box on Kaggle classification tasks). This will likely be split out into a separate library for symbolic machine learning, proposed here.

  • Much nicer data cleaning + missing data ergonomics

    • meanMaybe, medianMaybe, stddevMaybe, plus genericPercentile / percentile.

    • “Maybe-aware” variants promoted to DataFrame-level functions.

    • Null-handling helpers like filterAllNothing, filterNothing, plus better “NA” handling.

    • New recodeWithCondition (change values based on a predicate) + recodeWithDefault.

    • APIs to create splits in data (kFolds, randomSplit, sample dataframes).

  • Expressions / schema evolution got smoother

    • More APIs now operate on expressions instead of strings.

    • A growing monadic interface (a state monad) to make “derive / inspect / impute / filter” pipelines readable and resilient as schemas evolve. This makes DataFrames usable in scripts (previously they were mostly designed for notebooks).

  • I/O + parsing hardening

    • CSV: strip header/title whitespace, auto bool parsing, better type logic, fast CSV improvements.

    • Parquet: fixes and improvements.

    • JSON Lines: initial support.
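
To make the cleaning and splitting helpers a bit more concrete, here's a rough sketch of how I picture them composing. The names `recodeWithCondition` and `randomSplit` are from the changelog above, but the argument order is illustrative and `age` is a hypothetical column, so check the haddocks before copying:

```haskell
-- Illustrative sketch; argument order may differ from the released API,
-- and "age" is a made-up column for the example.
-- Cap implausible ages at 100, then hold out 20% of rows for validation.
cleanAndSplit df =
  let cleaned = df |> D.recodeWithCondition "age" (age .> 100) 100
  in  D.randomSplit 0.8 cleaned
```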

Example: expressive aggregation with conditionals (0.4.0.1)

df
  |> D.groupBy [F.name ocean_proximity]
  |> D.aggregate
      [ -- counts the rows in each group where ocean_proximity is "ISLAND"
        "rand" .= F.sum (F.ifThenElse (ocean_proximity .== "ISLAND") 1 0)
      ]

Example: schema-friendly transformation pipeline (0.4.0.0+)

print $ execFrameM df $ do
  -- derive a new Boolean column from an expression
  is_expensive  <- deriveM "is_expensive" (median_house_value .>= 500000)
  -- inspect a statistic mid-pipeline (Maybe-aware mean)
  meanBedrooms  <- inspectM (D.meanMaybe total_bedrooms)
  -- impute missing values with that mean
  totalBedrooms <- imputeM total_bedrooms meanBedrooms
  filterWhereM (totalBedrooms .>= 200 .&& is_expensive)

If you’re doing ETL-y cleaning, feature engineering, quick stats, or want a Haskell-y dataframe that’s getting faster and more ergonomic every release: this is a really good time to try the latest release (0.4.0.5).

Hoping to get a GSoC proposal in for either Parquet writers or Arrow support, so if you’d like to co-mentor, please reach out.


Btw, I also have a PR to add dataframe to the Database-like ops benchmark. Once that’s in as a baseline (I think they said they’d release results after launching a new version of DuckDB), I’ll probably need quite a bit of help with performance optimization.


Fantastic work! Sounds like tons of excellent progress and ecosystem support is growing.

> probably need quite a bit of help with performance optimization.

Having a baseline which is easy to measure against will surely nerd snipe someone into trying to make it faster, myself included :wink:
