Injecting variables into a GHCi session

I was recently looking at Kotlin’s dataframe implementation and it has this neat feature where column names are turned into typed column references.

val dfWithUpdatedColumns = df
    .filter { stars > 50 }
    .convert { topics }.with {
        val inner = it.removeSurrounding("[", "]")
        if (inner.isEmpty()) emptyList() else inner.split(',').map(String::trim)
    }
dfWithUpdatedColumns

I was curious how this happens and from what I understand when you read a dataframe using df = DataFrame.readCsv("https://raw.githubusercontent.com/Kotlin/dataframe/master/data/jetbrains_repositories.csv") it hooks into the Jupyter kernel (effectively into their version of ghci) and creates typed variables for each of the columns. It seems like this runs on every cell. Outside of an interactive environment I think the library does some reflection against an object type to achieve the same behaviour: df = DataFrame.readCsv("https://raw.githubusercontent.com/Kotlin/dataframe/master/data/jetbrains_repositories.csv").convertTo<Repositories>().

The latter behaviour can easily be expressed in some template Haskell logic but the former is a little more difficult. It would require hooking into ghci to inject variables somehow.

What problem is this trying to solve

Even though my current implementation of expressions on dataframes are locally type-safe, the code throws an error if types are misspecified.

E.g.

ghci> df <- D.readCsv "./data/housing.csv"
ghci> df |> D.derive "avg_bedrooms_per_house" (F.col @Double "total_bedrooms" / F.col @Double "households")

In this case the expression type checks but the code will throw an exception that says:

[Error]: Type Mismatch
        While running your code I tried to get a column of type: "Double" but the column in the dataframe was actually of type: "Maybe Double"

My current workaround to this is providing a function that generates some code for the user to paste into their GHCi session.

ghci> D.printSessionSchema df
:{
{-# LANGUAGE TypeApplications #-}
import qualified DataFrame.Functions as F
import Data.Text (Text)
(longitude,latitude,housing_median_age,total_rooms,total_bedrooms,population,households,median_income,median_house_value,ocean_proximity) = (F.col @(Double) "longitude",F.col @(Double) "latitude",F.col @(Double) "housing_median_age",F.col @(Double) "total_rooms",F.col @(Maybe Double) "total_bedrooms",F.col @(Double) "population",F.col @(Double) "households",F.col @(Double) "median_income",F.col @(Double) "median_house_value",F.col @(Text) "ocean_proximity")
:}

After which, the example above looks like:

ghci>  df |> D.derive "avg_bedrooms_per_house" (total_bedrooms / households)

<interactive>:21:60: error: [GHC-83865]
    • Couldn't match type ‘Double’ with ‘Maybe Double’
      Expected: Expr (Maybe Double)
        Actual: Expr Double
    • In the second argument of ‘(/)’, namely ‘households’
      In the second argument of ‘derive’, namely
        ‘(total_bedrooms / households)’
      In the second argument of ‘(|>)’, namely
        ‘derive "avg_bedrooms_per_house" (total_bedrooms / households)’

You also now get column name completion.

A solution that involves generating a module and reloading GHCi wipes the REPL state which isn’t great so this is the best I could think of for now.

I mention the problem in full just in case the “injecting variables into GHCi” solves an x-y problem.

Any insight would be greatly appreciated.

2 Likes

The latter behaviour can easily be expressed in some template Haskell logic but the former is a little more difficult. It would require hooking into ghci to inject variables somehow.

IIRC, Template Haskell can be used in ghci as well. Although it needs a small trick in order to run [Dec] splices.

2 Likes

Thanks so much! You’re always extremely helpful! I appreciate it.

The final result looks like this:

typeFromString :: String -> Q Type
typeFromString "Text" = [t| T.Text |]  -- TODO: We shouldn't special case this. Figure out how to keep imported types under control.
typeFromString s = do
  maybeType <- lookupTypeName s
  case maybeType of
    Just name -> return (ConT name)
    Nothing -> do
      if "Maybe " `L.isPrefixOf` s
        then do
          let innerType = drop 6 s
          inner <- typeFromString innerType
          return (AppT (ConT ''Maybe) inner)
        else if "Either " `L.isPrefixOf` s
          then do
            let (left: right:_) = tail (words s)
            lhs <- typeFromString left
            rhs <- typeFromString right
            return (AppT (AppT (ConT ''Either) lhs) rhs)
          else fail $ "Unsupported type: " ++ s

declareColumns :: DataFrame -> DecsQ
declareColumns df = let
        names = (map fst . L.sortBy (compare `on` snd). M.toList . columnIndices) df
        types = map (columnTypeString . (`unsafeGetColumn` df)) names
        specs = zip names types
    in fmap concat $ forM specs $ \(nm, tyStr) -> do
        ty  <- typeFromString tyStr
        let n  = mkName (T.unpack nm)
        sig <- sigD n [t| Expr $(pure ty) |]
        val <- valD (varP n) (normalB [| col $(TH.lift nm) |]) []
        pure [sig, val]

And then I include the following in my .ghci as a helper function:

:set -XOverloadedStrings
:set -XTemplateHaskell

:def! cols \s -> return ("_ = (); declareColumns " ++ s)

Then in the session I have:

ghci> df <- D.readCsv "./data/housing.csv" 
ghci> import DataFrame.Functions (declareColumns)
ghci> :cols df

This creates all the references. Sadly adds an explicit template haskell dependency.

4 Likes