When does Control.Applicative.many fail?

Hi all,

Apologies in advance for what may be a very obvious question!

I’m writing a PNG parser using megaparsec, and am having some difficulty with the many combinator. My understanding in the context of parsing is that it applies a given parser until that parser fails, and then returns a list of the successfully parsed results. Because of this, it seems reasonable that many should never fail; it can just return an empty list.

In my code, I have the following:

pPNGBytestream :: Parser PNGImage
pPNGBytestream = do
    pngSignature
    header <- pImageHeader
    _ <- many pUnsupported -- problem here
    palette <- optional $ pPalette header
    pure $ PNGImage{..}
  where
    pngSignature = string (B.pack [137, 80, 78, 71, 13, 10, 26, 10]) <?> "png signature"

My problem is that when there’s a parsing failure in pUnsupported, many also fails and that cascades up to pPNGBytestream. Can anyone explain why that behaviour might occur?

The full code is available here. Running the main executable should induce the behaviour I’m describing. If you feel like making comments on any of the other code, please feel free. I’m sure I’m not writing the most idiomatic code.

Thanks very much for any help in advance!

1 Like

This probably has to do with the fact that megaparsec doesn’t backtrack by default. So if you make a sequence of parsers and the first few succeed, but then there is a failure somewhere in the middle, then the parser will not roll back to the start of the sequence.

You can probably solve it by using many (try pUnsupported) which uses try to manually insert a backtracking point. But using try to much can make your parsers take exponential time.

5 Likes

Ah, I had tried inserting try, but in the wrong place! Thanks so much, that worked :slight_smile:

I always find failure in monadic/applicative parsers quite hard to understand, so I’ll try and work out what’s going on here.

There is a distinction between “failing after consuming input” and “failing without having consumed input”. I think the key difference is that

failAfterConsumingInput <|> x == failAfterConsumingInput

whereas

failWithoutHavingConsumedInput <|> x == x

and try failAfterConsumingInput == failWithoutHavingConsumedInput. In this case, many is

many p = go id
  where
    go f = do
      r <- C.optional p
      case r of
        Nothing -> return (f [])
        Just x -> go (f . (x :))

and optional is

optional v = Just <$> v <|> pure Nothing

so the first thing that many pUnsupported does is

pUnsupported <|> pure Nothing

so if pUnsupported fails having consumed input the whole many fails. That’s why using try pUnsupported instead is the correct fix.

4 Likes

Thank you, that does make it much clearer.

Note that this is a particular trait of the parsec family of parsers. Other monadic parsers like Text.ParserCombinators.ReadP and uu-parsinglib don’t have this problem.

It’s worse. The parsing will proceed, but it will just move on where it left off in the case of failure after consuming input, e.g.:

parse ((char 'a' *> char 'b') <|> char 'c') "ac" = Just 'c'

That can’t possibly be right!

Indeed, I just tried it and it doesn’t seem like what I’m saying is correct. Never mind my comment then.

Indeed

% cabal repl -b megaparsec -w ghc-9.6
...
ghci> import Text.Megaparsec; import Text.Megaparsec.Char; import Data.Either
ghci> isLeft (parse ((char 'a' *> char 'b') <|> char 'c') "<no filename>" "ac")
True
1 Like

Intuitively you might expect p <|> q to work like the regular expression p|q, and for many p to work like p* (Kleene star). But it might be helpful to note that the default for [mega]parsec is more like the “atomic” group (?>p|q) a.k.a. (*atomic:p|q) and derived “possessive” quantifier p*+ = (?>p*) in Perl/PCRE.

The plain p* allows backtracking into it, so it represents a search: try taking the maximum n items, and if that doesn’t work, back off to n − 1, and so on. That’s what many (try p) emulates. Whereas p*+ takes all it can and commits to it, so if the next parser fails, we won’t backtrack into the group. We might still backtrack over it, as in try (many p) <|> q, but the many p is all-or-nothing.

What try says is that the parent choice point is backtrackable. In logic-programming terms, the default is like putting a cut in every branch, and try removes it.