Reusing Haskell's binding when defining context free grammars

Parsing in Haskell is usually done with parser combinators. However these will get stuck in an infinite loop if you write left recursive parsers, e.g.:

expr = Plus <$> expr <* char '+' <*> expr
   <|> ...

This means we can never define natural left recursive grammars in Haskell if we use this approach to embedding the grammars.

Instead we can use custom combinators and add a manually written string to identify grammar rules (as the gll package does):

expr = "Expr" <::=> Plus <$$> expr <** char '+' <**> expr
               <||> ...

However, now we must ensure manually that we use a different string in every place. The compiler won’t help us catch name conflicts.

After a bit of experimentation I found that there is a way to use generic programming to define grammars without having to use strings as names:

data Nat = Nat
  deriving (GHC.Generic)
  deriving anyclass (Generic, HasDatatypeInfo)

nat :: CFG
nat = grammar Nat \nat -> [
    nat :-> [nat, "0"],
    nat :-> [nat, "1"],
    nat :-> ["1"]
  ]

data Expr = Expr
  deriving (GHC.Generic)
  deriving anyclass (Generic, HasDatatypeInfo)

expr :: CFG
expr = grammar Expr \expr -> [
    expr :-> ["(", expr, ")"],
    expr :-> [expr, "+", expr],
    expr :-> [nat]
  ]

The grammar function uses generic programming to generate a unique name based on the module and data type name. This way the Haskell compiler ensures we have no name conflicts. If there are two nonterminals with the same name we can use qualified imports to disambiguate them.

Full source code:
#!/usr/bin/env cabal
{- cabal:
build-depends: base, generics-sop
-}
{-# LANGUAGE GHC2021 #-}
{-# LANGUAGE OverloadedStrings #-}
{-# LANGUAGE BlockArguments #-}
{-# LANGUAGE DataKinds #-}
{-# LANGUAGE GADTs #-}
{-# LANGUAGE DeriveAnyClass #-}
{-# LANGUAGE DerivingStrategies #-}
{-# OPTIONS_GHC -Wno-name-shadowing #-}

import Data.String ( IsString(..) )
import qualified GHC.Generics as GHC
import Generics.SOP

data CFG = CFG Symbol [Production] deriving Show
data Production = CFG :-> [CFG] deriving Show
data Symbol = SymVar ModuleName DatatypeName | SymLit String deriving Show

instance IsString CFG where
  fromString x = CFG (SymLit x) []

grammar :: forall a. (HasDatatypeInfo a, Generic a, Code a ~ '[ '[]]) => a -> (CFG -> [Production]) -> CFG
grammar _ f = CFG s (f (CFG s []))
  where 
    di = datatypeInfo (Proxy @a)
    s = SymVar (moduleName di) (datatypeName di)

data Nat = Nat
  deriving (GHC.Generic)
  deriving anyclass (Generic, HasDatatypeInfo)

nat :: CFG
nat = grammar Nat \nat -> [
    nat :-> [nat, "0"],
    nat :-> [nat, "1"],
    nat :-> ["1"]
  ]

data Expr = Expr
  deriving (GHC.Generic)
  deriving anyclass (Generic, HasDatatypeInfo)

expr :: CFG
expr = grammar Expr \expr -> [
    expr :-> ["(", expr, ")"],
    expr :-> [expr, "+", expr],
    expr :-> [nat]
  ]

main :: IO ()
main = print expr
5 Likes

Neat!

Does this rely on users using each such type in exactly one production?

1 Like

You can use the same data type multiple times, but then all those uses refer to the same nonterminal.

1 Like

I mean what happens when I write

expr :: CFG
expr = grammar \(Expr expr) -> [
    expr :-> ["(", expr2, ")"],
    expr :-> [nat]
  ]

expr2 :: CFG
expr2 = grammar \(Expr expr2) -> [
    expr2 :-> [expr, "+", expr],
  ]
1 Like

Since the data type is only used for its name, why not make it a unit type?

data Expr = Expr
  deriving ...

-- grammar :: HasDatatypeInfo a => a -> (CFG -> [Production]) -> CFG

expr :: CFG
expr = grammar Expr \expr -> ...
3 Likes

I think what should happen is that both the expr and the expr2 CFGs should represent the same CFG that would be defined by the single combined definition:

expr :: CFG
expr = grammar \(Expr expr) -> [
    expr :-> ["(", expr, ")"],
    expr :-> [expr, "+", expr],
    expr :-> [nat]
  ]
1 Like

Thanks! That does look nicer. I have updated the main post to use that approach.

1 Like

An easy way to generate unique strings is:

import GHC.Stack

uid :: HasCallStack => String
uid = u
  where
    cs   = head (getCallStack callStack)
    loc  = snd cs
    name = fst cs
    u    = mconcat
      [ srcLocPackage loc
      , ":"
      , srcLocModule loc
      , ":"
      , show (srcLocStartLine loc)
      , ":"
      , show (srcLocStartCol loc)
      ]

main :: IO ()
main = do
  putStrLn uid
  putStrLn uid
  putStrLn uid

Which gives:

❯ ./Test
main:Main:24:12
main:Main:25:12
main:Main:26:12

Couldn’t you use the gll approach with this?

expr = uid <::=> Plus <$$> expr <** char '+' <**> expr
           <||> ...
4 Likes

The grammatical-parsers library allows left-recursive grammars using the standard applicative operators. Monadic too, but they may blow up. Disclosure: I’m the author and apparently the only person aware of it.

3 Likes

Ah, I see you use a record based approach. I actually started with something like that (you can see it if you look at the edits on the main post of this thread), but it seems that is not really necessary, at least not for writing left recursive grammars. And I think it is nicer to have each nonterminal be its own thing rather than bundling a bunch of nonterminals together. Also, I’d like to avoid heavy machinery like higher kinded datatypes and Template Haskell.

1 Like

I don’t see any way to avoid some heavy machinery in practice. The trouble with your Production is that it’s unityped. That looks fine with toy examples, but it’s not usable for a real-world grammar where different productions construct values of different types.

And once you allow different productions to have different types, you need to leave the plain lists behind and reach for some language extensions, be they GADTs, TypeFamilies, or RankNTypes. Personally I prefer the last, and higher-kinded records to heterogenous lists, but that’s somewhat subjective.

3 Likes

How about something like this which requires only existential quantification (and not in a user-facing way):

import Control.Applicative
import GHC.Stack

newtype Name a = Name SrcLoc

newtype Parser a = Parser { unParser :: String -> [(Step a, String)] } deriving Functor
data Step a = forall b. Goto (Name b) (b -> Parser a) | forall b. Label (Name b) (Parser b) (b -> Parser a) | Result a
deriving instance Functor Step

instance Applicative Parser where
  pure x = Parser (\s -> [(Result x, s)])
  Parser p <*> q = Parser $ \s -> do
    (step, s') <- p s
    case step of
      Goto l k -> pure (Goto l ((<*> q) . k), s')
      Label l bdy k -> pure (Label l bdy ((<*> q) . k), s')
      Result f -> unParser (f <$> q) s'
instance Alternative Parser where
  empty = Parser (const [])
  Parser p <|> Parser q = Parser (\s -> p s ++ q s)

data Expr = Int Int | Plus Expr Expr

char :: Char -> Parser Char
char c = Parser $ \s ->
  case s of
    c' : s' | c == c' -> [(Result c, s')]
    _ -> []

int :: Parser Int
int = undefined

goto :: Name a -> Parser a
goto l = Parser (\s -> [(Goto l pure, s)])

nonterm :: HasCallStack => (Parser a -> Parser a) -> Parser a
nonterm f = Parser (\s -> [(Label (Name l) (f (goto (Name l))) pure, s)]) where
  (_, l):_ = getCallStack callStack

expr :: Parser Expr
expr = nonterm $ \expr ->
      Int <$> int
  <|> Plus <$> expr <* char '+' <*> expr

Edit: I guess this doesn’t create a full representation the grammar, but this shallow embedding might be a solution if the only problem you want to solve is the left recursion in parser combinators. If you want a deep embedding then you have to use a GADT for the grammar type, but I don’t think that has to affect the user-facing API at all. So the user that is defining grammars won’t have to know GADTs are used behind the scenes.

1 Like

Sure. I didn’t list ExistentialQuantification because it’s a subset of GADTs and generally less useful one.
I think you should rename Step to Production, that’s what it represents. I admit I never thought of representing the list of productions as (G)ADT alternatives. I have some doubts about the performance of this approach, but it’s certainly original and relatively minimalistic.

2 Likes

One problem with using HasCallStack is that you cannot make derived combinators. E.g.:

derived1 :: String
derived1 = uid

derived2 :: HasCallStack => String
derived2 = uid

main :: IO ()
main = do
  putStrLn uid
  putStrLn uid
  putStrLn uid
  putStrLn derived1
  putStrLn derived1
  putStrLn derived1
  putStrLn derived2
  putStrLn derived2
  putStrLn derived2

Which gives:

$ runhaskell D.hs
main:Main:27:12
main:Main:28:12
main:Main:29:12
main:Main:20:12
main:Main:20:12
main:Main:20:12
main:Main:23:12
main:Main:23:12
main:Main:23:12

So derived1 and derived2 don’t generate unique ids.

I think derived2 could be made to work by selecting the last element of the call stack instead of the head, but then the user needs to be very careful and very aware of how the HasCallStack mechanism works to be able to correctly use this approach.

Edit: I think there is actually a way around this by using withFrozenCallStack. This does work:

derived3 :: HasCallStack => String
derived3 = withFrozenCallStack uid
1 Like

I think the best approach is to use this applicative approach to observable sharing since we already wrap everything in the Parser applicative:

{-# LANGUAGE ExplicitNamespaces #-}
{-# LANGUAGE GADTs #-}
import Control.Applicative
import GHC.Stack
import Data.Type.Equality ( type (:~:)(..) )
import Unsafe.Coerce ( unsafeEqualityProof, UnsafeEquality(UnsafeRefl) )

newtype Name a = Name ID

type ID = Int

data Binding f = forall a. Binding (Name a) (f a)

eqName :: forall a b. Name a -> Name b -> Maybe (a :~: b)
eqName (Name x) (Name y) = case unsafeEqualityProof @a @b of UnsafeRefl | x == y -> Just Refl; _ -> Nothing

data Step a = forall b. Goto (Name b) (b -> P a) | Result a
deriving instance Functor Step

newtype P a = P { unP :: String -> [(Step a, String)] } deriving Functor
instance Applicative P where
  pure x = P (\s -> [(Result x, s)])
  P p <*> q = P $ \s -> do
    (step, s') <- p s
    case step of
      Goto l k -> pure (Goto l ((<*> q) . k), s')
      Result f -> unP (f <$> q) s'
instance Alternative P where
  empty = P (const [])
  P p <|> P q = P (\s -> p s ++ q s)

newtype Parser a = Parser { unParser :: ID -> [Binding P] -> (P a, ID, [Binding P]) } deriving Functor
instance Applicative Parser where
  pure x = Parser (\i bs -> (pure x, i, bs))
  Parser p <*> Parser q = Parser $ \i bs ->
    let (p', i', bs') = p i bs
        (q', i'', bs'') = q i' bs'
    in (p' <*> q', i'', bs'')
instance Alternative Parser where
  empty = Parser (\i bs -> (empty, i, bs))
  Parser p <|> Parser q = Parser $ \i bs ->
    let
      (p', i', bs') = p i bs
      (q', i'', bs'') = q i' bs'
    in
      (p' <|> q', i'', bs'')

nt :: Parser a -> Parser a
nt (Parser p) = Parser $ \i bs ->
  let
    (p', i', bs') = p i bs
    n = Name i'
    i'' = i' + 1
    bs'' = Binding n p' : bs'
    p'' = P $ \s -> [(Goto n pure, s)]
  in
    (p'', i'', bs'')

lift :: P a -> Parser a
lift x = Parser $ \i bs -> (x, i, bs)

char :: Char -> Parser Char
char c = lift $ P $ \s ->
  case s of
    c' : s' | c == c' -> [(Result c, s')]
    _ -> []

-- Example usage:

data Expr = Int Int | Plus Expr Expr

int :: Parser Int
int = undefined

expr :: Parser Expr
expr = 
  nt $ Int <$> int
   <|> Plus <$> expr <* char '+' <*> expr

Edit: this reuse of the Parser applicative doesn’t work, e.g.:

a = b *> b where
  b = nt $ ...

That will create two nonterminals b where we only want one!

1 Like

After discovering all the flaws in my previous attempts, I have now decided to bite the bullet and use data-reify. Now I finally have a proof of concept for general parser combinators in Haskell:

This is still exponential time, I believe, but I’m planning to implement a proper O(n^3) GLL parsing algorithm. Edit: I’m actually unsure about the running time. This might already be O((n log n)^3) (the log n is due to my use of Data.Map instead of arrays).

1 Like

After my rec-def talk at ICFP, where I claimed that this should be possible, I wanted to actually demonstrate it, and I looked at various parser combinators that I could “package up” in this “shallow recursion embedding” style. But if you say

then you may have already done it! But I don’t see uses of data-reify on the repo. Why not?

(It might be that observing sharing in the way I do in rec-def is easier and more reliable than how data-reify is using it – I only need unsafePerformIO, but no stable names or other heap-level hackery.)

I decided against that approach because it is too easy to accidentally create functions and thereby destroying the sharing.

I’m currently slowly settling on a GADT-based approach which does not need any observable sharing shenanigans:

It is a bit more verbose, but it is similar to how algebraic effect libraries work, so I don’t think that’s a dealbreaker.

If you want to see the data-reify version you can look back in the git history:

That’s fair, that is a problem with the rec-def approach that is hard to fix.

I gave it a shot as well:
I copied Agda.Utils.Parser.MemoisedCPS, which expects explicit calls to memoise into Parser.hs. I changed it so that one can memoise at any type (using unsafeCoerce internally, but the necessary invariant should be ensured by the wrapper).

Then I created a lazy-enough wrapper in RParser.hs. A first test seems to work:

s,a,b,c :: Parser Char String
s = s >>> a >>> c <|> c
a = b <|> char 'a' >>> c >>> char 'a'
b = id <$> b
c = char 'b' <|> c >>> a

-- ghci> parse s "babababa"
-- ["babababa"]

(the example is taken from https://okmij.org/ftp/Haskell/LeftRecursion.hs)

These grammars work sufficiently different from fixed-point operations that I didn’t reuse any of the rec-def library infrastructure.

A deep embedding of the combinator tree like you have there would also be the approach I’d choose.
In return you’d get to do grammar analysis, to eliminate useless non-terminals and left recursion and to potentially optimise your parser.

You could even combine it with staging Ă  la Parsley if you are willing to give up on a monadic interface.

1 Like