Reusing Haskell's binding when defining context free grammars

Parsing in Haskell is usually done with parser combinators. However these will get stuck in an infinite loop if you write left recursive parsers, e.g.:

expr = Plus <$> expr <* char '+' <*> expr
   <|> ...

This means we can never define natural left recursive grammars in Haskell if we use this approach to embedding the grammars.

Instead we can use custom combinators and add a manually written string to identify grammar rules (as the gll package does):

expr = "Expr" <::=> Plus <$$> expr <** char '+' <**> expr
               <||> ...

However, now we must ensure manually that we use a different string in every place. The compiler won’t help us catch name conflicts.

After a bit of experimentation I found that there is a way to use generic programming to define grammars without having to use strings as names:

data Nat = Nat
  deriving (GHC.Generic)
  deriving anyclass (Generic, HasDatatypeInfo)

nat :: CFG
nat = grammar Nat \nat -> [
    nat :-> [nat, "0"],
    nat :-> [nat, "1"],
    nat :-> ["1"]
  ]

data Expr = Expr
  deriving (GHC.Generic)
  deriving anyclass (Generic, HasDatatypeInfo)

expr :: CFG
expr = grammar Expr \expr -> [
    expr :-> ["(", expr, ")"],
    expr :-> [expr, "+", expr],
    expr :-> [nat]
  ]

The grammar function uses generic programming to generate a unique name based on the module and data type name. This way the Haskell compiler ensures we have no name conflicts. If there are two nonterminals with the same name we can use qualified imports to disambiguate them.

Full source code:
#!/usr/bin/env cabal
{- cabal:
build-depends: base, generics-sop
-}
{-# LANGUAGE GHC2021 #-}
{-# LANGUAGE OverloadedStrings #-}
{-# LANGUAGE BlockArguments #-}
{-# LANGUAGE DataKinds #-}
{-# LANGUAGE GADTs #-}
{-# LANGUAGE DeriveAnyClass #-}
{-# LANGUAGE DerivingStrategies #-}
{-# OPTIONS_GHC -Wno-name-shadowing #-}

import Data.String ( IsString(..) )
import qualified GHC.Generics as GHC
import Generics.SOP

data CFG = CFG Symbol [Production] deriving Show
data Production = CFG :-> [CFG] deriving Show
data Symbol = SymVar ModuleName DatatypeName | SymLit String deriving Show

instance IsString CFG where
  fromString x = CFG (SymLit x) []

grammar :: forall a. (HasDatatypeInfo a, Generic a, Code a ~ '[ '[]]) => a -> (CFG -> [Production]) -> CFG
grammar _ f = CFG s (f (CFG s []))
  where 
    di = datatypeInfo (Proxy @a)
    s = SymVar (moduleName di) (datatypeName di)

data Nat = Nat
  deriving (GHC.Generic)
  deriving anyclass (Generic, HasDatatypeInfo)

nat :: CFG
nat = grammar Nat \nat -> [
    nat :-> [nat, "0"],
    nat :-> [nat, "1"],
    nat :-> ["1"]
  ]

data Expr = Expr
  deriving (GHC.Generic)
  deriving anyclass (Generic, HasDatatypeInfo)

expr :: CFG
expr = grammar Expr \expr -> [
    expr :-> ["(", expr, ")"],
    expr :-> [expr, "+", expr],
    expr :-> [nat]
  ]

main :: IO ()
main = print expr
1 Like

Neat!

Does this rely on users using each such type in exactly one production?

You can use the same data type multiple times, but then all those uses refer to the same nonterminal.

I mean what happens when I write

expr :: CFG
expr = grammar \(Expr expr) -> [
    expr :-> ["(", expr2, ")"],
    expr :-> [nat]
  ]

expr2 :: CFG
expr2 = grammar \(Expr expr2) -> [
    expr2 :-> [expr, "+", expr],
  ]

Since the data type is only used for its name, why not make it a unit type?

data Expr = Expr
  deriving ...

-- grammar :: HasDatatypeInfo a => a -> (CFG -> [Production]) -> CFG

expr :: CFG
expr = grammar Expr \expr -> ...
2 Likes

I think what should happen is that both the expr and the expr2 CFGs should represent the same CFG that would be defined by the single combined definition:

expr :: CFG
expr = grammar \(Expr expr) -> [
    expr :-> ["(", expr, ")"],
    expr :-> [expr, "+", expr],
    expr :-> [nat]
  ]

Thanks! That does look nicer. I have updated the main post to use that approach.

An easy way to generate unique strings is:

import GHC.Stack

uid :: HasCallStack => String
uid = u
  where
    cs   = head (getCallStack callStack)
    loc  = snd cs
    name = fst cs
    u    = mconcat
      [ srcLocPackage loc
      , ":"
      , srcLocModule loc
      , ":"
      , show (srcLocStartLine loc)
      , ":"
      , show (srcLocStartCol loc)
      ]

main :: IO ()
main = do
  putStrLn uid
  putStrLn uid
  putStrLn uid

Which gives:

❯ ./Test
main:Main:24:12
main:Main:25:12
main:Main:26:12

Couldn’t you use the gll approach with this?

expr = uid <::=> Plus <$$> expr <** char '+' <**> expr
           <||> ...
3 Likes