Apologies in advance for what may be a very obvious question!
I’m writing a PNG parser using megaparsec, and am having some difficulty with the many combinator. My understanding in the context of parsing is that it applies a given parser until that parser fails, and then returns a list of the successfully parsed results. Because of this, it seems reasonable that many should never fail; it can just return an empty list.
In my code, I have the following:
pPNGBytestream :: Parser PNGImage
pPNGBytestream = do
pngSignature
header <- pImageHeader
_ <- many pUnsupported -- problem here
palette <- optional $ pPalette header
pure $ PNGImage{..}
where
pngSignature = string (B.pack [137, 80, 78, 71, 13, 10, 26, 10]) <?> "png signature"
My problem is that when there’s a parsing failure in pUnsupported, many also fails and that cascades up to pPNGBytestream. Can anyone explain why that behaviour might occur?
The full code is available here. Running the main executable should induce the behaviour I’m describing. If you feel like making comments on any of the other code, please feel free. I’m sure I’m not writing the most idiomatic code.
This probably has to do with the fact that megaparsec doesn’t backtrack by default. So if you make a sequence of parsers and the first few succeed, but then there is a failure somewhere in the middle, then the parser will not roll back to the start of the sequence.
You can probably solve it by using many (try pUnsupported) which uses try to manually insert a backtracking point. But using try to much can make your parsers take exponential time.
Note that this is a particular trait of the parsec family of parsers. Other monadic parsers like Text.ParserCombinators.ReadP and uu-parsinglib don’t have this problem.
It’s worse. The parsing will proceed, but it will just move on where it left off in the case of failure after consuming input, e.g.:
Intuitively you might expect p <|> q to work like the regular expression p|q, and for many p to work like p* (Kleene star). But it might be helpful to note that the default for [mega]parsec is more like the “atomic” group (?>p|q) a.k.a. (*atomic:p|q) and derived “possessive” quantifier p*+ = (?>p*) in Perl/PCRE.
The plain p* allows backtracking into it, so it represents a search: try taking the maximum n items, and if that doesn’t work, back off to n − 1, and so on. That’s what many (try p) emulates. Whereas p*+ takes all it can and commits to it, so if the next parser fails, we won’t backtrack into the group. We might still backtrack over it, as in try (many p) <|> q, but the many p is all-or-nothing.
What try says is that the parent choice point is backtrackable. In logic-programming terms, the default is like putting a cut in every branch, and try removes it.