Parsec, alternative parsers with overlaping strings

mhitza · May 13, 2022, 5:05pm

I’m writting a HTTP parser using Parsec, and this is my current definition for parsing the HTTP status line

statusParser :: Parsec ByteString a HTTPStatus
statusParser = do
  string "HTTP/"
  version <- choice [string "1.0", string "1.1", string "2.0"]
    <&> convert
    >>= onNothing (fail "Unknown HTTP version number, expected 1.0 1.1 2.0")

  space
  code <- sequence [digit, digit, digit]
    <&> convert
    >>= onNothing (fail "Invalid status code, expected 3 digits")


  space 
  message <- if version == 2.0 
    then pure Nothing
    else manyTill (choice [upper, space]) (Text.Parsec.Prim.try crlf)
           <&> Just

  pure $ HTTPStatus { .. }

My problem is that with choice or using directly <|> version only works for string "1.0" and string "2.0", but not "1.1". I assume it’s because of their overlapping prefix, as the first parser that starts matching something is the one that is allowed to “run all the way”.

I can think of workarounds around the issue, such as moving that parser out, expecting 3 characters (digit, char '.', digit) and then fail if it isn’t one of the choices. However, I find the current format more straighforward to write and parse mentally. Is there a minimally intrusive way to make this type of pattern work?

convert is just a custom function that allows me to unpack Text/ByteString to String and call readMaybe on that result. onNothing to raise a failure and unwrap the value from Just on success.

f-a · May 13, 2022, 5:22pm

You might be interested in try.

edit: which you are using already in another function in your example, sorry. So maybe you you do not fancy to fmap try parsers or are worried about efficiency?

mhitza · May 13, 2022, 5:50pm

Haven’t really used Parsec before thus I glossed over try. I only had it within the code as I’ve used the manyTill docs example, without reading the try doc beforehand.

Performance is definitely not something I’m thinking about at this point.

On a somewhat related note, is there an easy way to run the parser and also receive back the unconsumed input? I know attoparsec can do this, and “build your own parser combinator” tutorials write such an interface, but Parsec doesn’t have an easy example… Maybe runParsecT? However without a simple example not sure what to make of the type

runParsecT :: Monad m => ParsecT s u m a -> State s u -> m (Consumed (m (Reply s u a)))

I’m avoiding attoparsec just because I’m trying to limit myself strictly to packages shipped with GHC (or at least what I see with ghc-pkg list when GHC is installed via ghcup)

f-a · May 13, 2022, 6:17pm

Good, then mapping try over that list of parsers should do!

I am not sure there is a more elegant way, but when I test stuff with parsec in ghci I just do

parseTest (myparser >> many anyChar) "Chiare fresche e dolci acque"

to get the uncomsumed stream.

ketzacoatl · May 13, 2022, 11:53pm

I have not used parsec for parsing (I started with megaparsec and it just worked), so excuse me if this isn’t useful info, but maybe it would work to parse the individual characters if the whole string isn’t working?

I have done:

parseBranchDescription :: Parser BranchDescription
parseBranchDescription = do
  desc <- pack <$> some
    (alphaNumChar <|> char '.' <|> char '-' <|> char '_' <|> char '/') :: Parser BranchDescription
  return desc

Abab9579 · May 14, 2022, 4:53am

I’m sorry, but isn’t

do
  desc <- pack <$> ...
  return desc

equivalent to
pack <$> ...

gilmi · May 14, 2022, 7:15am

This is a common issue with parsec-like parser combinators unfortunately.

One possible solution is to “left factor” your code - first parse the common prefix, and then choose between alternatives that do not have a common prefix. For example in your case:

statusParser = do
  string "HTTP/"
  version <- choice
    [ string "1." *>
      choice
        [ string "0" $> "1.0"
        , string "1" $> "1.1"
        ]
    , string "2.0"
    ]
    <&> convert ...

Another solution (which is better for richer languages imo) is to separate the lexing and parsing stages so that the lexing stage is ugly like the code above (but simpler because it only needs to care about tokens and not the structure), but the parsing stage is much nicer because it only needs to be concerned about the structure.

Another relevant link for the discussion is Parsec: “try a <|> b” considered harmful : ezyang’s blog

ketzacoatl · May 16, 2022, 12:39pm

Great point, thanks!

neil.mayhew · May 27, 2022, 9:13pm

BTW, Parsec supports using <> with parsers, so the example above can be written like this:

statusParser = do
  _ <- string "HTTP/"
  version <- choice
    [ string "1." <>
      choice
        [ string "0"
        , string "1"
        ]
    , string "2.0"
    ]
    <&> convert

gilmi · May 28, 2022, 5:48am

I didn’t know that. Thanks!

Topic		Replies	Views
Solved. Parsec combinator 'many' is applied to a parser that accepts an empty string Learn	2	783	March 30, 2024
Advance a simple calculator parser to read floating point numbers Learn	4	2180	January 8, 2020
Parsing TeX output with Parsec Learn	2	390	June 26, 2023
When does Control.Applicative.many fail? Learn	9	760	November 7, 2023
Can't parse function application with megaparsec Learn	3	678	July 13, 2024

Parsec, alternative parsers with overlaping strings

Related topics