Parsec, alternative parsers with overlaping strings

I’m writting a HTTP parser using Parsec, and this is my current definition for parsing the HTTP status line

statusParser :: Parsec ByteString a HTTPStatus
statusParser = do
  string "HTTP/"
  version <- choice [string "1.0", string "1.1", string "2.0"]
    <&> convert
    >>= onNothing (fail "Unknown HTTP version number, expected 1.0 1.1 2.0")

  space
  code <- sequence [digit, digit, digit]
    <&> convert
    >>= onNothing (fail "Invalid status code, expected 3 digits")


  space 
  message <- if version == 2.0 
    then pure Nothing
    else manyTill (choice [upper, space]) (Text.Parsec.Prim.try crlf)
           <&> Just

  pure $ HTTPStatus { .. }

My problem is that with choice or using directly <|> version only works for string "1.0" and string "2.0", but not "1.1". I assume it’s because of their overlapping prefix, as the first parser that starts matching something is the one that is allowed to “run all the way”.

I can think of workarounds around the issue, such as moving that parser out, expecting 3 characters (digit, char '.', digit) and then fail if it isn’t one of the choices. However, I find the current format more straighforward to write and parse mentally. Is there a minimally intrusive way to make this type of pattern work?


convert is just a custom function that allows me to unpack Text/ByteString to String and call readMaybe on that result. onNothing to raise a failure and unwrap the value from Just on success.

You might be interested in try.

edit: which you are using already in another function in your example, sorry. So maybe you you do not fancy to fmap try parsers or are worried about efficiency?

Haven’t really used Parsec before thus I glossed over try. I only had it within the code as I’ve used the manyTill docs example, without reading the try doc beforehand.

Performance is definitely not something I’m thinking about at this point.

On a somewhat related note, is there an easy way to run the parser and also receive back the unconsumed input? I know attoparsec can do this, and “build your own parser combinator” tutorials write such an interface, but Parsec doesn’t have an easy example… Maybe runParsecT? However without a simple example not sure what to make of the type

runParsecT :: Monad m => ParsecT s u m a -> State s u -> m (Consumed (m (Reply s u a))) 

I’m avoiding attoparsec just because I’m trying to limit myself strictly to packages shipped with GHC (or at least what I see with ghc-pkg list when GHC is installed via ghcup)

Good, then mapping try over that list of parsers should do!

I am not sure there is a more elegant way, but when I test stuff with parsec in ghci I just do

parseTest (myparser >> many anyChar) "Chiare fresche e dolci acque"

to get the uncomsumed stream.

I have not used parsec for parsing (I started with megaparsec and it just worked), so excuse me if this isn’t useful info, but maybe it would work to parse the individual characters if the whole string isn’t working?

I have done:

parseBranchDescription :: Parser BranchDescription
parseBranchDescription = do
  desc <- pack <$> some
    (alphaNumChar <|> char '.' <|> char '-' <|> char '_' <|> char '/') :: Parser BranchDescription
  return desc

I’m sorry, but isn’t

do
  desc <- pack <$> ...
  return desc

equivalent to
pack <$> ...

This is a common issue with parsec-like parser combinators unfortunately.

One possible solution is to “left factor” your code - first parse the common prefix, and then choose between alternatives that do not have a common prefix. For example in your case:

statusParser = do
  string "HTTP/"
  version <- choice
    [ string "1." *>
      choice
        [ string "0" $> "1.0"
        , string "1" $> "1.1"
        ]
    , string "2.0"
    ]
    <&> convert ...

Another solution (which is better for richer languages imo) is to separate the lexing and parsing stages so that the lexing stage is ugly like the code above (but simpler because it only needs to care about tokens and not the structure), but the parsing stage is much nicer because it only needs to be concerned about the structure.

Another relevant link for the discussion is Parsec: “try a <|> b” considered harmful : ezyang’s blog

5 Likes

Great point, thanks!