[ANN] symbolic-regression: symbolic regression in Haskell (GP + e-graphs)

Hackage | GitHub

A library for symbolic regression based on this paper. DataHaskell collaborated with Professor Fabricio Olivetti to create the package. Given a target column and dataset, it evolves mathematical expressions that predict the target and returns a Pareto front of expressions. Symbolic regression, a non-parametric method, is typically used to discover interpretable mathematical relationships in scientific data. We are experimenting with using it on non-scientific domains where explainability/interpretability matters.

Under the hood it combines:

  • genetic programming (selection / crossover / mutation),
  • e-graph optimization (equality saturation) for simplification / equivalences,
  • optimization of numeric constants (nlopt),
  • and cross-validation support via config.

Check out the readme for how to get started.
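A minimal session looks roughly like this (the identifiers match the worked example later in this thread; the qualified import name for the dataframe library and the toy data are just for illustration — see the README for the real quick-start):

ghci> import qualified DataFrame as D
ghci> import Symbolic.Regression
ghci> xs = [1..100] :: [Double]
ghci> df = D.fromNamedColumns [("x", D.fromList xs), ("y", D.fromList (map (*2) xs))]
ghci> :declareColumns df        -- brings x and y into scope as typed column references
ghci> models <- fit defaultRegressionConfig y df   -- y is the target column
ghci> Prelude.map D.prettyPrint models             -- the Pareto front, one expression per complexity level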

8 Likes

Very cool!
Could you please explain the types of the arguments of Symbolic.Regression.fit? The documentation suggests the second argument is the target column, but I see an expression there, which presumably could be just a single Col name. The returned expressions will then use any Col expressions that occur in the given data frame?

So at a high level, given an appropriate target expression f(x), fit attempts to detect whether the data in n-dimensional space actually resides in (or close to) an n-1 dimensional subspace that can be described by f(x) = expr(x), where both f and expr map elements x of the data set to real numbers, right?

How well the returned expressions in the Pareto front fit the data is up to further analysis by the user, or could this be a by-product of the algorithm?

Are there analytical insights about the density of DataFrame.Internal.Expression besides the Stone-Weierstrass theorem? In other words, if I generate a data set from an expression, will fit be able to recover the expression?

1 Like

Yes. It could have been a string, since this currently only works on doubles, but in future it could work on ints too, so the Col @a name ends up being the same as passing a string name and a proxy. So we just re-use the expression system for this. The expression machinery (when used with Template Haskell) makes it easier to guarantee that the column exists and has the right type (I’ll show an example below).

Close - the spirit is right, but there are a few key refinements to make:

  1. In this library, we aren’t usually comparing two functions f(x) and expr(x). Instead, we have a target variable y and a set of features x, and we are searching for a symbolic expression f such that y is approximately equal to f(x). Your notation f(x) = expr(x) implies you are looking for an identity; in practice, we are looking for a predictive mapping.

  2. Geometrically, you are correct that we are looking for lower-dimensional structure. However, a ‘subspace’ is a linear notion (like a flat sheet). Symbolic regression looks for a nonlinear manifold (a curved surface) in (d+1)-dimensional space.

  3. The library uses genetic programming to ‘evolve’ these shapes. It doesn’t just check if the data fits a pre-defined category; it actually builds the functional form from scratch (using operators like +, sin, exp, etc.) to minimize the distance between the data points and that curved surface.

You typically specify the “complexity”/size of the expression you want returned, which bounds the genetic search. For example, the default max expression size is 5.

ghci> import System.Random
ghci> xs = Prelude.take 100 $ randomRs (1 :: Double, 100) (mkStdGen 53)
ghci> ys = Prelude.take 100 $ randomRs (1 :: Double, 100) (mkStdGen 54)
ghci> df = D.fromNamedColumns [("x", D.fromList xs), ("y", D.fromList ys)]
ghci> :declareColumns df -- Create global column references
x :: Expr Double
y :: Expr Double
ghci> df' = D.derive "z" (3 * (F.pow x 2) + 2 * (F.pow y 2) + 45) df
ghci> :declareColumns df'
x :: Expr Double
y :: Expr Double
z :: Expr Double
ghci> import Symbolic.Regression 
ghci> models <- fit defaultRegressionConfig z df'
ghci> Prelude.map D.prettyPrint models
["16906.841117607542","316.1039063108849 * x","(y + x) * 178.67808242696918"]

The first expression is a constant, the next uses one variable, the next uses two, and so on. So to “recover” the generating expression it looks like we need to search for longer expressions.

But if we change maxExpressionSize to something higher (say 15) and run models <- fit (defaultRegressionConfig { maxExpressionSize = 15 }) z df', we get the Pareto front:

  • 16906.841117607535
  • 316.1039063108847 * x
  • (x + y) * 178.67808242696915
  • (x + y) * (x + 114.92915501395817)
  • x ^ 2 * 2.7454499448536085 - y * -162.66923039222613
  • y ^ 2 * 2.0048601355891464 - -3.0051112237214688 * x * x
  • x ^ 2 * 3.0000001047208817 + y ^ 2 * 2.0000000456812024 + 44.999367612810865
  • (45.00004258415047 / x + 3.000000000860505 * x) * x + y ^ 2 * -1.9999999862145403

Notice the last two expressions are basically equal to the expression we used to generate z.

By default, the algorithm minimizes the mean squared error between the result of the expression and the target variable value.
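For concreteness, that objective is just the usual mean squared error; a plain definition (not the library's internal code) is:

-- Mean squared error between predictions and targets,
-- i.e. the quantity the fitness described above is minimizing.
mse :: [Double] -> [Double] -> Double
mse predictions targets =
  sum [(p - t) ^ (2 :: Int) | (p, t) <- zip predictions targets]
    / fromIntegral (length targets)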

Density results (Stone–Weierstrass, etc.) tell you “an expression language can approximate a wide class of functions given enough complexity,” but they don’t imply recoverability of the generating expression from finite data.

Whether fit can recover the expression depends on:

  • What functions you pass in. E.g. you can pass in defaultRegressionConfig { binaryFunctions = [cos, sin, (+), (-)] }. By default we only pass in [(+), (-), (*), (/)]. (See the sketch after this list.)
  • Randomness (since genetic programming is stochastic and the search is a hard problem)
  • Noise in your dataset
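Putting those knobs together (the record fields are the ones used earlier in this post; check the library documentation for the full set of options), a run with a restricted operator set and a bigger size budget might look like:

ghci> config = defaultRegressionConfig { maxExpressionSize = 15, binaryFunctions = [(+), (-), (*)] }
ghci> models <- fit config z df'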
1 Like

Oh, so fit does not (currently) work when one passes an expression other than a Col name?

Indeed I should have used the term manifold as that is what I was implying with subspace. The picture I have in mind is a curved surface and a cloud of points, all of which have low distance to the surface.

That is the same as training a neural net predictor, isn’t it? Only that symbolic-regression does not pre-define the transfer functions between neurons and only vaguely pre-determines the number of “neurons” and “layers”.
At work, we once experimented with a neural network and found it was awful at learning an affine relationship, because the transfer functions were all curved. The vendor resorted to adding neural nets with affine transfer functions.

But at least Stone-Weierstrass tells you that even if you only allow (+), (-), (*) in the fit config, any continuous relationship within a box bounding the given data can be approximated arbitrarily closely. Perhaps another, numerically more favourable polynomial basis should be in the default set.

What would be a good learning resource for this? I am working on a similar problem, namely learning the syntactic structure of a predicate given its semantics, the function x -> Bool.
At one stage in the algorithm, I have to enumerate a set of sub-terms (known to be finite) up to semantic equivalence. Am I right that e-graphs could efficiently keep track of what predicates have been already enumerated?

What would be a good learning resource for this?

The papers linked in hegg: Fast equality saturation in Haskell are a good place to start.

2 Likes