Symbolic regression for feature discovery and model training

Hey everyone.

I work on a fraud team and we were recently trying to automate the rule-creation process. Analysts typically write high-precision rules in response to a fraud attack. The process starts with some examples and involves a combination of analysis and backtesting.

So, given a fraud signature or a handful of fraud examples, the task was to automatically come up with rules that could be ported to SQL. This seemed like a natural fit for program synthesis (small data + interpretable output). Since some of my workflows were already on dataframe, which already has a SQL-like expression DSL, I figured it would be a good exercise to implement this in Haskell.
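To make the target concrete, here is a rough sketch of the kind of hand-written rule the expression DSL can express. The fraud schema (column name, threshold) is invented for illustration; only F.col, F.lift, D.derive and (|>) are taken from the housing script further down.

{-# LANGUAGE OverloadedStrings #-}

import qualified DataFrame as D
import qualified DataFrame.Functions as F
import DataFrame ((|>))

-- Hypothetical hand-written fraud rule. The column name "txns_last_hour"
-- and the threshold 5 are made up for illustration; the combinators are
-- used the same way as in the housing example below.
flagHighVelocity :: D.DataFrame -> D.DataFrame
flagHighVelocity df =
  df |> D.derive "is_suspicious"
          (F.lift (\n -> if n > (5 :: Double) then 1 else (0 :: Double))
                  (F.col "txns_last_hour"))

The goal is for synthesized rules to come out in this same shape so they can be ported to SQL.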

I’ll make a larger blog post about it when the whole project is finished but thought I’d share the general direction.

We ended up finding 2 use cases for this automatic expression generation:

  • Creating composite features from combinations of weaker features
  • Creating rules from fraud examples

We’re aiming to launch a new fraud model with some of these new synthesized features in the coming week.

The rule creation is still pending experimentation and evaluation.

During the ideation phase, most of the testing happened on the California housing dataset. This helped me compare my results to really good Kaggle notebooks. The feature generation process produced some interesting features. For example, the following script produced the output below it:

{-# LANGUAGE OverloadedStrings #-}

import qualified DataFrame as D
import qualified DataFrame.Functions as F
import qualified Data.Text as T
import Data.Maybe
import DataFrame ((|>))

import System.Environment

main :: IO ()
main = do
    -- Search parameters for the synthesizer, read from the command line.
    (d:b:_) <- fmap (map read) getArgs
    -- Some light feature engineering.
    df <- fmap (\raw -> raw |> D.impute "total_bedrooms" (fromMaybe 0 (D.mean "total_bedrooms" raw))
                            |> D.derive "ocean_proximity" (F.lift oceanProximity (F.col "ocean_proximity")))
               (D.readCsv "./data/housing.csv")
    -- Synthesize a feature expression targeting median_house_value.
    let (Right e) = F.synthesizeFeatureExpr "median_house_value" d b df

    print e

    print $ D.take 10 (df |> D.derive "prediction" e)


oceanProximity :: T.Text -> Double
oceanProximity op = case op of
    "ISLAND" -> 0
    "NEAR OCEAN" -> 1
    "NEAR BAY" -> 2
    "<1H OCEAN" -> 3
    "INLAND" -> 4
    _ -> error ("Unknown ocean proximity value: " ++ T.unpack op)

  ( divide
        ( ifThenElse
            ( geq
                (col @Double "ocean_proximity")
                (percentile 75 (col @Double "ocean_proximity"))
            )
            (col @Double "ocean_proximity")
            (col @Double "households")
        )
        (divide (col @Double "population") (col @Double "median_income"))
    )

This feature captures the intuition that locations really close to the ocean are more expensive. If the house is further away, it falls back to the number of households (denser areas are more expensive).
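To make the synthesized expression easier to read, here is the same logic transcribed by hand into a plain Haskell function (my transcription of the printed expression, not library output; p75 stands for the 75th percentile of the encoded ocean_proximity column):

-- Hand transcription of the synthesized expression above.
predictedValue :: Double -> Double -> Double -> Double -> Double -> Double
predictedValue oceanProx p75 households population medianIncome =
  let numerator = if oceanProx >= p75 then oceanProx else households
  in  numerator / (population / medianIncome)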

At search depths greater than 3 the expressions become a little harder to interpret, but LLMs tend to guide analyst intuition well.

I’ve also been working on another function, fitRegression, that first does symbolic regression maximizing mutual information and then does a second pass minimizing mean squared error.
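For a feel of how it slots into the script above, usage would look roughly like the sketch below. The argument order and the Right pattern are assumptions modelled on synthesizeFeatureExpr; only the name fitRegression comes from the text above.

    -- Hypothetical usage, modelled on synthesizeFeatureExpr; the
    -- mutual-information pass and the MSE refinement both happen inside.
    let (Right model) = F.fitRegression "median_house_value" d b df
    print $ D.take 10 (df |> D.derive "prediction" model)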

This gives comparable performance to Hasktorch’s linear regression. It also works reasonably well for non-linear decision boundaries without much feature engineering.

It’s still very slow for large data sets. I think most of the slowdown is related to expression evaluation.

I’ve been trying to cover a lot of breadth in this project, so I haven’t had the chance to go back and think about performance (or API hygiene). Any help would be appreciated.

This was a cool example of Haskell in the production path. It’s also interesting to try to make a tool that both prevents people from shooting themselves in the foot and doesn’t get in the way. I think dataframe is getting close. It’s been a pleasure training people to use it and watching them learn both the library and Haskell.
