Survey and Opinions: Backtracking in Parsers

Introduction

I am currently building an introspectable parser combinator library.
This has been a goal of mine for a long time, but only recently did the idea make sense in my head, after I read Exploring Arrows for sequencing effects, so it is arrow-based now.

You can see parts of it working in the source or in the docs.

Example

-- >>> take 5 $ exampleWords binaryNumber
-- [fromList [One],fromList [One,Zero],fromList [One,One],fromList [One,Zero,Zero],fromList [One,Zero,One]]

binaryNumber :: Parser BinaryToken () ()
binaryNumber = exact One *> many digit $> ()

Although these are applicative operators, the parser uses arrows under the hood; I just find the applicative style easier to write because I’m used to it.

Moreover, I plan to run static analysis on the parser whenever it fails. This would allow the library to recover from a parsing failure and suggest repair sequences (some terminology borrowed from Laurence Tratt). Producing example words (inputs the parser would accept) is only a first step.

Input and Opinions

Finally, my question for your feedback and input:
In your favorite parsing library:

  • what is the flavor of backtracking used
    • e.g.: full backtracking, backtracking unless input was consumed, explicit backtracking?
  • what do you like about it?
  • what do you dislike about it?
  • plus more if you care to share
6 Likes

Looks interesting, I’ll definitely take a look.

As for backtracking, I’m not sure it’s the central issue I’m concerned about. My main focus at the moment is writing resilient parsers and figuring out how to prove resilience properties in the type system.

Parsec-style backtracking unless input was consumed is definitely my least favourite. It’s confusing and misleading for newcomers, and occasionally bites even experienced users.

PEG-style backtracking parsers are performant and predictable, though they require some extra primitives for full power.

Full backtracking, which should be indistinguishable from full parallel parsing, is easy to understand but in its bare formulation it has trouble with syntax error reporting.

To control syntax error reporting and help with performance, I prefer adding something like the commit/admit combinators rather than the consumed-input backtracking.
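For readers unfamiliar with the idea, here is a minimal sketch of a commit combinator on a toy list-of-successes parser. All names and details here are illustrative, invented for this example, and not taken from any particular library:

```haskell
-- Toy list-of-successes parser: full backtracking by default.
-- Everything here is illustrative, not from a real library.
newtype P a = P { runP :: String -> [(a, String)] }

-- Full backtracking choice: keep every way either branch can succeed.
orElse :: P a -> P a -> P a
orElse (P p) (P q) = P $ \s -> p s ++ q s

-- commit: discard all but the first way of parsing p, so a later
-- failure can no longer backtrack into p's remaining alternatives.
commit :: P a -> P a
commit (P p) = P $ \s -> take 1 (p s)

char :: Char -> P Char
char c = P $ \s -> case s of
  (x:xs) | x == c -> [(x, xs)]
  _               -> []
```

The default semantics stays fully backtracking, and `commit` is the opt-in cut, which is the opposite default from the consumed-input rule.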

7 Likes

Thank you for your thoughts!

I’ve never heard of this approach before, thanks for sharing it.

I know this is way late, but here are my thoughts on backtracking:

I most often use attoparsec, and I agree with its behavior of always backtracking a <|> b such that if a fails then b is tried, starting at the initial position. The behavior of parsec–to not backtrack in such case unless an explicit try is in place–makes no sense to me: attempting alternative b at some arbitrary spot (where a gave up) could never be right. Backtracking seems inherent in the concept of “or”. (I guess, the parsec behavior is sensible if you know that you alternative will never have any “overlap”.)

On the other hand, I was surprised when I found out that parser combinator libraries like attoparsec don’t backtrack to the extent that regexes (at least, such as implemented by Perl) do. Consider this pseudocode, with capital letters representing some possibly more complicated pattern: (A|B)(C|D). In words this means, “A or B followed by C or D”. With parser combinators, this will attempt A and if that fails it will attempt B and if either of those succeeds it will commit to that choice and then attempt to match C-or-D at the new offset position. With regexes, if A succeeds but the subsequent C-or-D fails, it will backtrack and then try B (followed by C-or-D), which could succeed. The regex behavior could be simulated with parsing combinators using (A(C|D))|(B(C|D)), and the duplication can be mitigated by putting the repeated part in a variable for reuse. I do think the regex behavior is better, but I can’t say that I’ve ever run into a situation with parser combinators where I ran into a problem due to the different behavior.
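To make the (A|B)(C|D) difference concrete, here is a toy list-of-successes parser (all names invented for this sketch, not a real library) where a committed choice fails on an input that the fully backtracking choice accepts:

```haskell
-- Illustrative list-of-successes parser contrasting committed vs.
-- full backtracking choice; none of these names are from a real library.
newtype P a = P { runP :: String -> [(a, String)] }

-- Sequencing: thread the remaining input through both parsers.
andThen :: P a -> P b -> P (a, b)
andThen (P p) (P q) = P $ \s ->
  [ ((a, b), s'') | (a, s') <- p s, (b, s'') <- q s' ]

-- Full backtracking "or": keep every way either branch can succeed.
orBoth :: P a -> P a -> P a
orBoth (P p) (P q) = P $ \s -> p s ++ q s

-- Committed "or" (attoparsec-like): once the left branch succeeds,
-- the right branch is never tried.
orFirst :: P a -> P a -> P a
orFirst (P p) (P q) = P $ \s -> case p s of
  [] -> q s
  rs -> take 1 rs

str :: String -> P String
str w = P $ \s -> case splitAt (length w) s of
  (pre, rest) | pre == w -> [(w, rest)]
  _                      -> []
```

On input "abc", the committed version of (str "ab" | str "a") followed by str "bc" commits to "ab" and then fails to match "bc" against the remaining "c", while the fully backtracking version recovers via the "a" branch, mirroring the regex behaviour described above.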

I completely don’t understand the concept of “it will backtrack unless it has consumed input”, since the classification of what has consumed input it not apparent: for any parser combinator in a library, you need to look at its implementation to see how it’s implemented in order to discover if it “consumes input”. For instance, if you are matching string "hello" against “help”, will it have consumed “hel”? It seems the answer depends on whether string is a primitive of some sort or whether it’s built on top of char. Having the semantics be fundamentally dictated by implementation details is not good. And most parsers need to look ahead at least one character—but I guess that doesn’t count as “consuming”? It’s just a very unclear concept.
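For what it’s worth, the consumed-input rule can be modelled explicitly. The following toy parser (a deliberate simplification for illustration, not Parsec’s actual implementation) threads a “consumed” flag and shows why a char-by-char string "hello" fails having consumed “hel” on input “help”:

```haskell
-- Toy model of "backtrack only if no input was consumed" semantics.
-- A simplification for illustration, not Parsec's real code.
data Reply a = Ok a String Bool   -- value, remaining input, consumed?
             | Err Bool           -- consumed before failing?
  deriving (Eq, Show)

newtype Parser a = Parser { runParser :: String -> Reply a }

instance Functor Parser where
  fmap f (Parser p) = Parser $ \s -> case p s of
    Ok a s' c -> Ok (f a) s' c
    Err c     -> Err c

pchar :: Char -> Parser Char
pchar c = Parser $ \s -> case s of
  (x:xs) | x == c -> Ok x xs True
  _               -> Err False

-- Sequencing: the pair has consumed if either half consumed.
pthen :: Parser a -> Parser b -> Parser (a, b)
pthen (Parser p) (Parser q) = Parser $ \s -> case p s of
  Err c      -> Err c
  Ok a s' c1 -> case q s' of
    Err c2      -> Err (c1 || c2)
    Ok b s'' c2 -> Ok (a, b) s'' (c1 || c2)

-- Choice: the right branch is tried only if the left branch failed
-- WITHOUT consuming any input.
palt :: Parser a -> Parser a -> Parser a
palt (Parser p) (Parser q) = Parser $ \s -> case p s of
  Err False -> q s
  other     -> other

-- A string parser built char by char, like one layered on `pchar`.
pstring :: String -> Parser String
pstring = foldr (\c rest -> fmap (uncurry (:)) (pchar c `pthen` rest))
                (Parser $ \s -> Ok "" s False)
```

Here `pstring "hello"` followed by the `palt` alternative `pstring "help"` fails on “help”: the left branch consumes “hel” before failing, so the right branch is never tried, which is exactly the surprising behaviour the consumed-input rule produces.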

Edit: There’s an informative discussion of backtracking on Hacker News, which I think is where I learned of the combinator-vs-regex difference.

4 Likes

Yeah, backtracking in parser combinator libraries is a mess in theory (and, in my opinion, hard to reason about in practice too). Another big problem is that the backtracking, if it is done properly at all, is done naively, so it can easily cause exponential blowup of the running time.

It is possible to have O(n) running times (for unambiguous grammars) with proper backtracking, by using memoization. The biggest problem with that is that the library needs to know where the recursion is happening, but you can’t really see that from inside Haskell. I floated a bunch of ideas in another topic here: Reusing Haskell's binding when defining context free grammars.

I didn’t pursue those much further because I haven’t yet fully understood the state of the art algorithms, but I did manage to squeeze a paper out about formalizing the semantics in Agda: A Type Theoretic Treatment of Context-Free Languages Without Mutual Recursion. I would like to expand that line of work when I have time again.

3 Likes

Personally, I’m pretty happy just learning how a particular parser works and then working with it rather than being naive about it. Parsec gives you a lot of control over errors if you lean into its dumb model. flatparse has hard errors and soft errors, so it knows when to backtrack and when to stop (as suggested above). attoparsec and binary aren’t known for good error messages, but that’s mostly not important: you don’t have to think very hard about your parser structure, you can just throw another <|> in it.

3 Likes

Moreover, I plan to run static analysis on the parser whenever it fails. This would allow the library to recover from a parsing failure and suggest repair sequences (some terminology borrowed from Laurence Tratt). Producing example words (inputs the parser would accept) is only a first step.

Interesting. I believe that a “wired applicative” ( Applicative-wired monad pattern ) would be feasible for parsers that you want to graph or do static analysis on, while retaining the convenience of do-notation. In my case I was thinking in the context of something like optparse-applicative, which would be able to say “if you specify --output then you can also specify --stream”. This kind of dependency between command-line arguments is not usually seen, but would be self-documenting and statically checked in this scenario.

1 Like

Thank you for your thoughts! I agree with you: the non-consuming-backtracking semantics are highly confusing. I am a little afraid of full backtracking as you describe it, because it may remember all possible branches ever taken. I think I’ll use it nevertheless and combine it with the commit combinator mentioned in the post above.

Unfortunately, I left the comfort zone of do-notation and went for arrow notation, which clashed with my wanna-be implementation of Invertible Syntax Descriptions.
I’m working on this only sporadically, but I think I may have to use Overloaded.Categories, which is a GHC plugin that enables further introspection into desugared arrows.

I think you’re already aware of this, but this is emphatically not true of parser combinators in general. To belong in the general class of parser combinators in any language, all a library needs to do is:

  • provide a set of parsing primitives as first-class values, and
  • provide a set of combinators that apply to and return those values, so that, applied repeatedly, they can build up a working parser.

There’s no further requirement to satisfy any particular property. There are parsing combinator libraries with exactly the Regular Expression semantics you admire above, there are others with the Parsing Expression Grammar semantics you don’t, others with the Context Free Grammar semantics, and many with implementation-defined semantics dictated by performance concerns. Both Parsec and Attoparsec are in the last category, for better or worse, though the former is documented in a conference paper so maybe that makes it legit.

3 Likes

Yes indeed, I didn’t mean to imply that all parser combinator libraries had to behave that way, I was just using that terminology as a way to group the ones I was mentioning.

That said, there is a group of parser combinator libraries that have something in common. For instance, the parsers and parser-combinators libraries provide combinators that work across different parser combinator libraries, which raises the question: what sorts of libraries do they work with? I think the answer is ones that use Applicative (and possibly Monad) for sequencing and Alternative for “or”. (I don’t think this category has a name, but it would be useful if it did. I don’t think “monadic parser combinators” is right, because Applicative-only cases can also fall into this category, I believe.)
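As a concrete illustration of that category: a combinator written only against Applicative and Alternative works with any such library, whatever its backtracking flavour. A sketch follows; `sepBy1` is the standard separated-list combinator, while the toy `P` type and its primitives are invented here purely to demonstrate it:

```haskell
import Control.Applicative

-- Generic over ANY Alternative parser, regardless of backtracking
-- flavour: one or more occurrences of p, separated by sep.
sepBy1 :: Alternative f => f a -> f sep -> f [a]
sepBy1 p sep = (:) <$> p <*> many (sep *> p)

-- A throwaway list-of-successes parser to demonstrate with
-- (illustrative only, not a real library).
newtype P a = P { runP :: String -> [(a, String)] }

instance Functor P where
  fmap f (P p) = P $ \s -> [ (f a, r) | (a, r) <- p s ]

instance Applicative P where
  pure a = P $ \s -> [(a, s)]
  P pf <*> P pa = P $ \s ->
    [ (f a, s'') | (f, s') <- pf s, (a, s'') <- pa s' ]

instance Alternative P where
  empty = P $ const []
  P p <|> P q = P $ \s -> p s ++ q s

digitP :: P Char
digitP = P $ \s -> case s of
  (c:cs) | c >= '0' && c <= '9' -> [(c, cs)]
  _                             -> []

commaP :: P Char
commaP = P $ \s -> case s of
  (',':cs) -> [(',', cs)]
  _        -> []
```

Because `sepBy1` only mentions the Alternative interface, the same definition would typecheck against parsec, attoparsec, or any other instance, even though each gives `<|>` different backtracking behaviour.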

Starting from that, consider this construction, sequencing x and y and z:

f <$> x <*> y <*> z

Can it perform “full backtracking” as I discussed above? If x is a parser such as string "hello", then there’s no backtracking to consider. But if x is something like string "foo" <|> string "bar", then for full backtracking, if x succeeds and then y fails, x needs to get a chance to “try again”, but, crucially, with some state preserved, since the behavior on the second try needs to be different to reflect that the first alternative has already been tried. (Also, it’s possible for x and y to succeed, and then for failure of z to trigger backtracking all the way back to x.)

I think this means that parsers you can backtrack to need to be a different type (and sequence with a different combinator) than those that can’t be backtracked to, because a successful parse from such needs to return not only the match result but also some additional state which the sequencing needs to plumb through.

So if my analysis is correct, then I think libraries such as Attoparsec (and others in this unnamed category) couldn’t do full backtracking without having a different sort of API. (And I guess having two different types among its parsers would violate the “parser combinator” definition, at least in spirit.) It could perhaps be done by having all parsers “support” backtracking but return a state indicating “no more backtracking” when one doesn’t actually allow being backtracked to; that would certainly be a non-localized implementation change, though, with implications beyond just performance.

Anyway, just some additional thoughts, since the design space is interesting to explore.

I do wonder what’s best. On the one hand, I think that full backtracking as the base behavior, to be limited with a commit combinator, is conceptually best, since that’s the semantics which allows the most input cases to match successfully. This reminds me of strict vs non-strict evaluation semantics for languages, in that non-strict semantics (possibly with opt-in localized strictness) results in the most cases of successfully terminating programs.

Similarly though, this “better” semantics comes with a cost, and in the parser case that means suffering backtracking in cases where it’s destined to never succeed. If these cases are too common, then conceptually better semantics is practically worse. (Having to add commit frequently in order to get good performance is the analog of the inconvenience of having to add try frequently in order to get good behavior). Probably only real-world usage will reveal which scenarios are common in practice. Although I like the idea of full backtracking, I’d have to say that I haven’t really suffered from its absence when using Attoparsec.

I’m not entirely sure what your claim is here. It is certainly possible to use the parsers and parser-combinators libraries with parser implementations that perform either style of backtracking.

Operationally, the only thing that changes is the state of the parser.

The “full backtracking” parsers usually keep track of a list of successful parses, or alternatively they carry a single successful parse state and call all possible continuations to try them out. You can see the two implementation styles in Text.Grampa.ContextFree.Parallel and in Text.Grampa.ContextFree.Continued, respectively.

The parsers with limited backtracking need remember only a single successful parse, so their state is simpler. If they use continuations (like Text.Grampa.PEG.Backtrack for example), their continuation types are simpler because they don’t need to know the failure continuation.

The point is, this is the internal state of the parser and higher-level combinator libraries like parsers don’t need to know it.

1 Like

My claim was that in order to implement a sequencing combinator such as (<*>), the implementation would have to take into account whether or not the left operand could be backtracked to (and therefore it couldn’t handle both cases with a single combinator). I see now that with a continuation-based approach this isn’t necessary, because in any case the left operand is responsible for running “the rest of the parse”, so it can maintain the additional state it needs just in its local scope. (I hadn’t thought of the continuation-based approach before, so my thinking was that any state that needed to be maintained would have to be returned to the caller to be passed back in on backtracked invocations, and so its existence wouldn’t be hidden.)
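The continuation-based idea can be sketched in a few lines (illustrative types invented here, not Attoparsec’s or grampa’s actual ones): the success continuation is “the rest of the parse”, so (<*>) never inspects whether its left operand can be backtracked into, and a fully backtracking choice just feeds the same continuation to both branches:

```haskell
{-# LANGUAGE RankNTypes #-}

-- CPS parser sketch: the continuation (a -> String -> [r]) is "the
-- rest of the parse"; backtracking state lives only in closures.
newtype P a = P { runP :: forall r. String -> (a -> String -> [r]) -> [r] }

instance Functor P where
  fmap f (P p) = P $ \s k -> p s (k . f)

instance Applicative P where
  pure a = P $ \s k -> k a s
  -- Sequencing needs no knowledge of either operand's backtracking.
  P pf <*> P pa = P $ \s k ->
    pf s (\f s' -> pa s' (\a s'' -> k (f a) s''))

-- Full backtracking choice: run the same continuation from both
-- branches and concatenate whatever results come back.
orBoth :: P a -> P a -> P a
orBoth (P p) (P q) = P $ \s k -> p s k ++ q s k

strC :: String -> P String
strC w = P $ \s k -> case splitAt (length w) s of
  (pre, rest) | pre == w -> k w rest
  _                      -> []

-- Collect every parse that consumes the whole input.
parseAll :: P a -> String -> [a]
parseAll (P p) s = p s (\a rest -> [a | null rest])
```

Running the (A|B)(C|D)-style example from earlier in the thread, (strC "ab" \`orBoth\` strC "a") sequenced with strC "bc" succeeds on "abc" via the "a" branch, with the failed "ab" branch abandoned entirely inside the closures.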

Thank you for the examples. I also glanced at Attoparsec’s implementation, which also seems to be continuation-based.

1 Like