How to parse streams and variable length syntax?

Gem7733 · February 4, 2025, 9:41pm

Normally, one makes parser combinators, like

newtype Parser a = Parser { runParser :: String -> Maybe (a, String) }

And then instances Monad, MonadFail, and Alternative in such newtype. And gets a Parser that can work as long you already have the entire thing to parse.

This has two inconveniences:

If the Syntax can change in length depending on the contents (specially in binary protocols), then is no clear how much data you may need to give it before it gets an valid result, likewise if you give it too much data, you may truncate information and get a failing program.
Similar problem with an Streaming input to the parser.

In any case you could try to solve giving it at input of k length (starting with k = 1) until you get a valid output or consume the entire input, but such thing relive is a memorization to no be O(n²) and therefore very suicidal.

There is some suggestions in how to solve this problem?

tobz619 · February 4, 2025, 11:02pm

Not my specialty but my gut instinct is that you’d probably have some kind of standardised input format and one of the things that your input will tell you is the length of the message in some kind of Header.

You could also build some kind of Parser for any kind of synchronisation input, that you know must be a certain length and format which can always be listening and, on being triggered, brings the rest of your parsers into play.

data Header :: { protocol :: Protocol, length :: Word8 }

parseProtocol :: Parser Protocol
parseProtocol :: Parser $ \xs ->
  let (protoBits, rest) = splitAt numberOfProtobits xs
       protocol = matchProtoBits protobits
  in either Nothing (\pro -> Just (pro, rest)) protocol

parseLength :: Parser Word8
parseLength = Parser $ \xs -> let
  (lengthBits, rest) = splitAt 8 xs
  in \word -> (word, rest) <$> ( readMaybe lengthBits :: Maybe Word8)

parseHeader :: Parser Header
parseHeader =  do
  prot <- parseProtocol
  len <- parseLength
  pure $ Header prot len

parseMessage :: Word8 -> Parser String
parseMessage len = Parser $ \xs -> let
  message = take (fromIntegral len) xs
  pure (xs, "") -- if you want to end the stream here, or just do the regular splitAt stuff and return rest.

handleMessage :: String -> a
handleMessage ...

parseSync :: Parser ()
parseSync= Parser $ \xs -> if take lengthOfSyncMessage xs == syncMessage then pure ((), drop lengthOfSyncMessage xs) else Nothing


parseMessage :: Parser a
parseMessage = do
     _ <- parseSync -- discard the sync message as it's not used, no need to validate since if this doesn't pass, we can't do the rest of the block any way.
     Header _ msgLen <- parseHeader
     msg <- parseMessage msgLen
     pure $ handleMessage msg

Types are probably all over the place, and there’s probably a better way to do it not sure if I was committing to using bytes or strings or stuff.

The key thing is realising that you build up smaller pieces, and stick them together, and by discarding parts you don’t need, you can use them as triggers to prepare you for the things you do need.

EDIT: My bad, missed the entire thing where you need to have the whole string before you can do anything lol. Yeah, the industrial strength implementations work with streams and are happy waiting / consuming more input when triggered.

rhendric · February 4, 2025, 11:17pm

This is why industrial-strength parser combinators actually work with monadic input streams (example: Text.Parsec.Stream) instead of a pure String. The input represents all the data, even if it hasn’t all been received yet, without the caller having to worry about how much input the parser wants. The parser consumes as much of it as it needs, executing monadic actions via the stream to fetch more if necessary, and (in the case of parsec, this requires using a lower-level function like runParsecT) returns a stream containing any unused input suitable for use in another parser.

Gem7733 · February 4, 2025, 11:45pm

What if the specification of the protocol does not have a global length?

Gem7733 · February 4, 2025, 11:48pm

That does not give insight in how to make my own parser.

rhendric · February 5, 2025, 12:11am

You want to write your own parser combinator library for funsies, or for serious business?

If the former, doing streaming right is an extra level of challenge, but if you want that consult parsec’s implementation and see what you can learn.

If the latter, what is unsatisfactory about existing libraries like parsec?

tobz619 · February 5, 2025, 12:20am

Umm I guess if there are headers where a length is not specified, then I’m assuming that there must be some kind of encoder for the end of the message or indication in the header that the message is infinitely long. Otherwise, I don’t see how it can be parsed and considered a message.

I know there are combinators that allow you to keep parsing until you parse a specific thing and then it returns everything before that thing.

Something like

messageParsers <* parseEnd

tomjaguarpaw · February 5, 2025, 7:18am

Gem7733:

Normally, one makes parser combinators, like
newtype Parser a = Parser { runParser :: String -> Maybe (a, String) }
…
And gets a Parser that can work as long you already have the entire thing to parse.
…
If the Syntax can change in length depending on the contents (specially in binary protocols), then is no clear how much data you may need to give it before it gets an valid result, likewise if you give it too much data, you may truncate information and get a failing program.

The second part isn’t a problem. It is incorporated into thi design. If you give it too much data then the left over data is returned as the String in the second component of the Maybe. If you want it to be able to handle not enough data then you can tweak the definition a bit, for example:

newtype Parser a = Parser { runParser :: String -> Maybe (Either (Parser a) (a, String)) }

Then you use runParser like

case runParser p of
    Nothing -> -- parse failure
    Left needMore -> -- feed more String into needMore
    Right (a, leftovers) -> -- result and unparsed leftovers