Lexing & Parsing in different stages: antipattern? Reading recommendations please!

Hi!
Some time ago I remember reading that lexing and parsing in separate stages is an antipattern (especially regarding parser combinators for compilers), and I wanted to learn more about that perspective and its opposite, as someone who has not read theory about lexing/parsing (nor worked much with parsers/lexers, beyond a basic understanding of the differences between types of grammars).

So far I’ve found these papers that seem interesting (the last two seem to be mostly focused on performance):

Any reading/watching/etc. recommendations are very much welcomed! Thanks!

3 Likes

I don’t know if there’s anything Haskell-specific here; you may want to search online for material about parsing in general.

At least in the context of programming languages, I think parsing in one pass is not a good idea. I wrote about this a few weeks ago, in three parts: part 1, part 2, part 3.

especially regarding parser combinators for compilers

The libraries/tools used to implement the stages are an implementation detail. I can provide the same API with parser combinators, with lexer/parser generators, or with hand-rolled recursive descent parsing. What matters is (1) supporting the use cases that I have and (2) doing so with good performance. You may find my posts linked above helpful, as I dive into both of these topics there.
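To illustrate (the names here are hypothetical, not from my posts), the exposed API can stay the same no matter which technique sits behind it:

newtype ParseError = ParseError String deriving Show
newtype Module     = Module [String]   deriving Show

-- Could be parser combinators, a generated parser, or hand-rolled
-- recursive descent; callers only ever see this signature.
parseModule :: String -> Either ParseError Module
parseModule src = Right (Module (words src))  -- placeholder body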

7 Likes

I wouldn’t call a separate lexing stage an antipattern, strictly speaking; there are times when it’s useful to have one. For example, if you’re aiming for your language to be expressible as an LL(1) or LR(1) grammar, you will probably need a separate lexer. The reason is that the “1” in these formalisms refers to the number of tokens that the parser needs to look at to determine the relevant grammar production. If your tokens are single characters, the implication is that every keyword, and possibly every user identifier (depending on the language), must start with a different letter.
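To make the lookahead point concrete, a tiny sketch (the Token type is my own illustration, not from any particular compiler):

-- With whole-word tokens, one token of lookahead lets an LL(1)
-- parser pick the right production:
data Token = TIf | TImport | TIdent String
  deriving (Eq, Show)

-- With single characters as tokens, both `if` and `import` begin
-- with 'i', so one "token" (character) of lookahead can no longer
-- tell the corresponding constructs apart.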

If LL/LR(1) is not a requirement, and you’re willing to give up on their performance guarantees, the case for a lexing stage is much weaker IMO. If you’re settled on using parser combinators, lexing is mostly a nuisance. It requires more boilerplate code in the implementation, and it also makes it harder to evolve the language. Regarding the latter point: you can’t, for example, add a new keyword (a.k.a. reserved word) to the language without breaking its backward compatibility.
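As a minimal sketch of that last point, here is the usual reserved-word check written with megaparsec (the identifier parser and the reserved list are my own example, not from any real compiler):

import Control.Monad (when)
import Data.Void (Void)
import Text.Megaparsec
import Text.Megaparsec.Char

type Parser = Parsec Void String

reservedWords :: [String]
reservedWords = ["if", "then", "else"]

-- Identifiers are rejected when they collide with a reserved word,
-- so adding a word to reservedWords breaks every existing program
-- that used it as an identifier.
identifier :: Parser String
identifier = try $ do
  name <- (:) <$> letterChar <*> many alphaNumChar
  when (name `elem` reservedWords) $
    fail (show name ++ " is a reserved word")
  pure name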

5 Likes

I think that overstates things somewhat; you can add new contextual keywords and maintain backward compatibility by altering the grammar to accept both identifier tokens and the keyword anywhere that the new keyword is not intended to be used, and translating the keyword into an identifier during the parse in those grammar positions. It’s not elegant, but it’s certainly possible, and I’ve seen it done many times.
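A toy token-level sketch of that trick (the Token type and the choice of async as the new keyword are mine):

data Token = TIdent String | TAsync | TEquals
  deriving (Eq, Show)

-- Anywhere the grammar expects an identifier, the new keyword token
-- is demoted back to a plain identifier, so old programs that used
-- `async` as a name keep parsing.
identName :: Token -> Maybe String
identName (TIdent s) = Just s
identName TAsync     = Just "async"  -- contextual: still a valid name here
identName _          = Nothing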

2 Likes

Ugh. I agree with your characterization, but in the context of OP’s question I think it’s not too much of an overstatement. I just can’t see any self-respecting compiler designer actually plan for such a hack from the beginning.

1 Like

For contextual lexing, you can pass the expected tokens to the lexer’s “next token” function when calling it from the parser. tree-sitter uses this approach, and it seems to work fine.
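A rough sketch of that interaction (my own types, not tree-sitter’s actual API):

import Data.Char (isAlphaNum)

data Kind = KIdent | KAsync
  deriving (Eq, Show)

-- The parser passes the token kinds valid in its current state;
-- "async" lexes as a keyword only when the parser can use one, and
-- as an ordinary identifier otherwise. Returns the kind, the
-- lexeme, and the remaining input.
nextToken :: [Kind] -> String -> Maybe (Kind, String, String)
nextToken expected input =
  case span isAlphaNum (dropWhile (== ' ') input) of
    ("", _) -> Nothing
    (word, rest)
      | word == "async" && KAsync `elem` expected -> Just (KAsync, word, rest)
      | otherwise                                 -> Just (KIdent, word, rest)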

1 Like

(Thank you for the ‘Exploring Parsing’ notes.) I wonder if there’s something Haskell-specific about the ‘accursed dot’?

(“accursed” because I was a COBOL programmer in a previous life. The barely-visible . is needed to terminate statements – similar to ; in Haskell and other Algol-family languages. Indentation or newline isn’t significant in COBOL, but programmers nest/indent IF statements anyway, then either forget the . so the following code gets included in the ELSE branch despite being outdented; or forget to remove a . when refactoring, and find the following code is a fresh statement despite being indented. If I could have a dollar for every misplaced . I’ve debugged …)

Haskell 98 has:

  • . as part of the token in Floats – 3.14;
  • . as part of the token in Module-qualified names – Data.Set.Internal.insert is a function, Data.Set.Internal.Tip is a constructor;
  • non-Module-qualifier . is the ‘compose’ operator, binding less tightly than the ‘invisible operator’ of function application – foo bar.baz parses as (foo bar) . baz; contrast map Data.Set.Internal.size, which parses as map (Data.Set.Internal.size) – this compose is the . that can be space-surrounded;
  • that . operator can appear module-qualified, like anything else – (+ 1) GHC.Base..double $ 7 yields 15 (assuming double x = x * 2);
  • . can be part of an operator identifier – in My.+.. 7 the token My.+.. is the operator (+..) from module My.

In recent years GHC has added OverloadedRecordDot such that

  • foo bar.baz parses tighter-binding than function application, as foo ((.baz) bar) – but only if that extension is enabled; otherwise it’s compose as above (see the sketch after this list);
  • In there, (.baz) is a special field-selection using the name baz, which isn’t necessarily a declared identifier anywhere. Whereas foo $ .baz bar doesn’t achieve that, IIUC.
  • At least I hope I’ve got those descriptions and syntax accurate, and sorry-notverysorry if I didn’t.
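A tiny example of the two readings (Person and greet are my own; the extension needs GHC 9.2 or later):

{-# LANGUAGE OverloadedRecordDot #-}

data Person = Person { name :: String }

-- With the extension on, p.name selects the field: (.name) p.
greet :: Person -> String
greet p = "hi " ++ p.name

-- Without the extension, p.name would instead lex as p . name,
-- i.e. an attempt to compose p with a function called name.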

Are your lexers/parsers still happy? Because my head exploded when that lot arrived.

1 Like

From what I understand, separating parsing into two stages usually has no advantages when using parser combinators. If you want to work on tokens instead of on plain text, you can write a token parser:

myToken :: Parser Token

and use it as follows:

myParser :: Parser x
myParser = do
  token <- myToken
  -- ... do something with token ...
  pure undefined  -- placeholder result

With parser combinators, lexing and parsing can be trivially combined.
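For example, here is a concrete sketch of such a token parser with megaparsec (the Tok type is my own invention); each alternative consumes characters and produces a token in a single step, so no separate lexing pass is needed:

import Data.Void (Void)
import Text.Megaparsec
import Text.Megaparsec.Char
import qualified Text.Megaparsec.Char.Lexer as L

type Parser = Parsec Void String

data Tok = TInt Integer | TPlus
  deriving (Eq, Show)

-- Each alternative lexes the characters of one token (plus trailing
-- whitespace) and yields it directly to the surrounding parser.
myToken :: Parser Tok
myToken = choice
  [ TInt <$> L.lexeme space L.decimal
  , TPlus <$ L.lexeme space (char '+')
  ]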

I like this video, which shows how this can be done (at around 6:30):

2 Likes

Thanks to all for your responses! Gotta go read & learn now!