Memory Usage for Backtracking in Infinite Stream Parsing

Hello, I’m attempting to learn some pure functional programming, to make myself less stupid. :slight_smile:

The topic of composable parsers is a particular interest of mine. So I did an investigation into making a fake arbitrarily-big stream with a generating function, and used that stream with Attoparsec:

https://ae1020.github.io/incremental-streaming-haskell-parse/

TL;DR: incremental parsing seems to be missing from what are otherwise quite powerful parsers (Megaparsec). The much-discussed one that does have incrementality (Attoparsec) holds on to all input for each top-level call, regardless of how little the rules actually need to backtrack.

I have lots of questions! But to avoid asking too many at once, I guess I’ll just ask if there’s any feedback on the article…and any insights regarding what’s asked at the end:

I know too little about Monads to reason out whether what I want is possible. Can a parser that is framed monadically “see into” the composition to know what the earliest point of backtracking could be? It may be that there’s no way to do it without describing the parse with some custom declarative structure, which might make one ask what writing it in Haskell actually buys.


One downside of monadic parsing is that monads cannot be analyzed statically. A solution to this problem is to use applicative parsers. Applicative parsers are less powerful, but they can be analyzed much better, and they are still expressive enough to define context-free grammars. One library that produces output lazily and only retains as much of the input as is necessary is uu-parsinglib. It also has a monadic interface, which is kept separate internally to preserve as much static information as possible.
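
To see why, contrast two Attoparsec-flavoured fragments (only the shapes matter here; the grammar itself is made up):

import Control.Applicative ((<|>))
import qualified Data.Attoparsec.ByteString.Char8 as A

-- Applicative: both branches are known before any input is seen,
-- so a library could in principle inspect the whole grammar up front.
staticChoice :: A.Parser (Either Int Char)
staticChoice = Left <$> A.decimal <|> Right <$> A.letter_ascii

-- Monadic: the parser that runs next is computed from a parsed value,
-- so no analysis can see past the bind until input has been consumed.
dynamicLength :: A.Parser [Int]
dynamicLength = do
  n <- A.decimal <* A.char ':'          -- parse a length prefix...
  A.count n (A.decimal <* A.char ',')   -- ...then exactly that many numbers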

A big conceptual problem is that many parsers have to keep track of large parts of the input. Take for example a string containing a large list delimited by square brackets: [1,2,3,...,10000]. A correct parser can only decide that this parse succeeds after it has reached the final closing bracket. If there are still other alternatives that need to be checked afterwards by backtracking, then the parser needs to retain the entire input. A way around this problem is to use a breadth-first algorithm that doesn’t backtrack.
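
Concretely, here is an Attoparsec-flavoured sketch of such a grammar (the names are mine, for illustration):

import Control.Applicative ((<|>))
import qualified Data.Attoparsec.ByteString.Char8 as A

-- Both alternatives open with '[' and only diverge at the elements, so a
-- depth-first backtracking parser must keep the whole input alive until
-- the first alternative either succeeds or fails at the closing ']'.
pNumberList :: A.Parser (Either [Int] [Double])
pNumberList =
      Left  <$> brackets (A.decimal `A.sepBy` A.char ',')
  <|> Right <$> brackets (A.double  `A.sepBy` A.char ',')
  where
    brackets p = A.char '[' *> p <* A.char ']'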

I would recommend Combinator Parsing: A Short Tutorial, by the author of the uu-parsinglib library, for more explanation. The most interesting part starts in Section 4.

P.S. Be warned: I think not all of the code in that tutorial is correct, it is probably outdated, and the uu-parsinglib library doesn’t have much documentation, so it might be pretty difficult to understand.


Thank you for the reference! The opening of Section 4 of the linked paper definitely seems sympathetic to what I’m asking about (though the paper as a whole is not something I can grok with any sort of speed). Unfortunately the website and svn seem to have vanished.

Following the example in the .tar file, I was able to make an IP address parser as before. I just changed parseIP over to uu-parsinglib’s definitions (decimal -> pNatural, char -> pSym).

import Text.ParserCombinators.UU
import Text.ParserCombinators.UU.Utils hiding (Parser)
import Text.ParserCombinators.UU.BasicInstances
import Data.Word

data IP = IP Word8 Word8 Word8 Word8 deriving Show

-- the same alias Text.ParserCombinators.UU.Utils already provides
type Parser a = P (Str Char String LineColPos) a

parseIP :: Parser IP
parseIP = do
  d1 <- pNatural
  pSym '.'
  d2 <- pNatural
  pSym '.'
  d3 <- pNatural
  pSym '.'
  d4 <- pNatural
  pSym '\n'
  return $ IP d1 d2 d3 d4

main :: IO ()
main =
  let
    text = "10.20.3.4\n"
    uu_str = createStr (LineColPos 0 0 0) text
    ip_uu_rule = (,) <$> parseIP <*> pEnd
    (ip, errors) = parse ip_uu_rule uu_str
  in
    print ip

This invokes the parser the same way run did in the examples. There are two entry points, parse and parse_h, and parse is the one I want…since parse_h is the “history parser” that can’t produce results until the end.

Here’s parse:

-- | The function @`parse`@ shows the prototypical way of running a parser on
-- some specific input.
-- By default we use the future parser, since this gives us access to partial
-- result; future parsers are expected to run in less space.
parse :: (Eof t) => P t a -> t -> a
parse (P (T _  pf _) _ _ _) state =
    fst . eval $ pf (\rest ->
        if eof rest then Done () else error "pEnd missing?"
      ) state

Unfortunately, there’s no parseFromStream or parseWith :-/ so I’m not really sure how to get an equivalent situation to the test I depict of using Attoparsec.

Is adapting the call to use a stream–or just a function that adds input–something that is “close” or “far away”?

(Note: I’m also kind of curious if the Haskellsphere has any kind of emerging consensus for what the foundations of streaming/piping should be, e.g. Conduit or Pipes? I’d like to be using the “right” one for my infinite stream tests, e.g. the one that has the most notable and correct composability. I think I incorrectly assumed that System in System.IO.Streams meant “standardized”–like Prelude. But it sounds like System is just a filing cabinet for things that interoperate with OS-specific functionality…and nothing has been “blessed” in the way something like C++'s std:: streams have.)

Here’s a working example:

import Text.ParserCombinators.UU
import Text.ParserCombinators.UU.Idioms
import Text.ParserCombinators.UU.Utils
import Text.ParserCombinators.UU.BasicInstances
import Data.Foldable
import Data.Word

data IP = IP Word8 Word8 Word8 Word8 deriving Show

ips :: [String]
ips = ["192.168.1.0","8.8.8.8","255.255.255.0"]

ipString :: Int -> String
ipString n = unlines (take n (cycle ips))

pIP :: Parser IP
pIP = iI IP pNatural '.' pNatural '.' pNatural '.' pNatural '\n' Ii

main :: IO ()
main = traverse_ print
  (parse (pList pIP <* pEnd) (createStr (LineColPos 0 0 0) (ipString 5)))

This doesn’t (explicitly) stop and resume the parser for each IP address. I know of one parsing library based on the same ideas as uu-parsinglib, but it is even harder to understand and has even less documentation: Yi’s incremental parser. I have been thinking about adding such incremental functionality to uu-parsinglib, but I haven’t had enough time for it yet (and it is pretty hard).

There is a way to cheat an IO (Maybe String) chunk source into an effectively infinite String using unsafeInterleaveIO.

import System.IO.Unsafe (unsafeInterleaveIO)

-- The String handed to the callback is built lazily, so it can be
-- effectively infinite: each chunk is only fetched from the IO action
-- when the consumer actually demands it.
withStringStream :: IO (Maybe String) -> (String -> IO a) -> IO a
withStringStream m f = f =<< go
  where
    go = do
      x <- m
      case x of
        Just s  -> (s ++) <$> unsafeInterleaveIO go  -- defer fetching the rest
        Nothing -> return ""                         -- end of stream

Here’s a little gist with example usage.
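
In the same spirit, a minimal usage sketch (assuming the withStringStream above; the IORef-backed chunk source is illustrative, not the gist’s code):

import Data.IORef

-- Produce three chunks on demand, then signal end-of-stream with Nothing.
main :: IO ()
main = do
  ref <- newIORef (0 :: Int)
  let nextChunk = do
        n <- readIORef ref
        if n >= 3
          then return Nothing
          else do
            writeIORef ref (n + 1)
            return (Just ("chunk " ++ show n ++ "\n"))
  withStringStream nextChunk putStr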


Another thing that should be noted is that this uses Haskell’s String type, which is not always very efficient. Additionally, I haven’t done any benchmarks, but I wouldn’t be surprised if uu-parsinglib is quite a bit slower than your hand-made decomposed attoparsec parser that works with bytestrings.


Also, this uses idiom brackets, the iI and Ii. Syntactically it is a bit of a hack. They can be converted into do notation. These definitions are all interchangeable*:

pIP = iI IP pNatural '.' pNatural '.' pNatural '.' pNatural '\n' Ii
pIP = IP <$> pNatural <* pSym '.' 
         <*> pNatural <* pSym '.'
         <*> pNatural <* pSym '.'
         <*> pNatural <* pSym '\n'
pIP = do
  d1 <- pNatural
  pSym '.'
  d2 <- pNatural
  pSym '.'
  d3 <- pNatural
  pSym '.'
  d4 <- pNatural
  pSym '\n'
  return $ IP d1 d2 d3 d4

*except that the first two use only Applicative operations, while the last one requires Monad operations (unless you use the ApplicativeDo language extension). And uu-parsinglib is able to do more analysis on parsers that only use Applicative operations, so one of the first two is preferred. Officially, the Monad laws state that the monadic operations should be interchangeable with the Applicative operations, so the uu-parsinglib parser is not officially a proper monad. But the only difference should be performance, so it is not a big deal.
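
For reference, the law in question is that (<*>) must agree with its monadic definition:

-- Applicative/Monad consistency law: for any parsers pf and px,
--
--   pf <*> px  ==  do { f <- pf; x <- px; return (f x) }   -- i.e. (<*>) = ap
--
-- uu-parsinglib gives (<*>) an optimized, analyzable implementation that
-- the monadic right-hand side doesn't get, so the two sides can differ in
-- performance (though they should agree in the results they produce).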


EDIT: I have done some benchmarks, and the result is that the working example at the top of this comment unfortunately still uses a linear amount of memory.

Thanks for the code-bearing answer!

Yes, I saw the same when I tried it. :frowning: (Note: it asked me to add the FlexibleContexts extension).

But I wonder if that is because of the way the input is being generated? One of the reasons I went with IORef and figuring out how to generate values out of thin air was so I could study the distinction more confidently… I don’t quite know when to trust things like the cycle-type solutions yet. (I also wanted to know what was on the inside of stream implementations…)

Here’s a little gist with example usage.

I’m afraid I don’t quite follow the example (e.g. it works the same if I take the unsafeInterleaveIO off…)?

uu-parsinglib is able to do more analysis on parsers that only use Applicative operations, so one of the first two is preferred. (…) But, the only difference should be performance, so it is not a big deal.

What I’ve been looking into is seeing if the infinite memory use can be avoided, implicitly from the rule. I’m not as concerned about performance differences, unless it’s an order of magnitude…

…though you bring up a good point that people who aren’t being sensitive to the file format and delimiting would run into the memory problems anyway–as in the case of the brackets on the very long list requiring backtracking. In some ways, getting them actively involved in establishing the granularity vs. hoping the parser figures it out may be better in practice?

(Sidenote: Since there is a Conduit-to-Attoparsec adapter, I followed this example of using conduitParser on a conduit “Source” I made that yield-ed data infinitely. It had the same fixed-memory behavior as parseFromStream did, with mapM_C able to spit out one address at a time. I updated the article to include my test code for that…)
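
For reference, a minimal sketch of that kind of setup, assuming the conduit and conduit-extra packages (the endless source and the parser here are illustrative, not the article’s code):

{-# LANGUAGE OverloadedStrings #-}
import Conduit (ConduitT, runConduit, yield, mapM_C, (.|))
import Data.Conduit.Attoparsec (conduitParser)
import qualified Data.Attoparsec.ByteString.Char8 as A
import Data.ByteString (ByteString)

-- an endless source of IP-address lines
ipSource :: Monad m => ConduitT i ByteString m ()
ipSource = let go = yield "10.20.3.4\n" >> go in go

-- one dotted quad followed by a newline
pQuad :: A.Parser (Int, Int, Int, Int)
pQuad =
  (,,,) <$> A.decimal <* A.char '.'
        <*> A.decimal <* A.char '.'
        <*> A.decimal <* A.char '.'
        <*> A.decimal <* A.char '\n'

-- runs forever, printing one address at a time in constant memory
main :: IO ()
main = runConduit $ ipSource .| conduitParser pQuad .| mapM_C (print . snd)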

I think the issue is that uu-parsinglib also tries to correct errors. If something goes wrong at the end of the input, it might need to go back to the beginning and choose an alternative, and even if the parsing never fails it has to keep track of this information. This is technically not backtracking: instead of retaining the input explicitly, it retains it inside the parsing state. Practically, this has the same effect of linear space usage.


I have updated the gist with more explanation.


It might be… My wild guess would be somewhere between 2x and 10x slower than a proper manually decomposed attoparsec parser that works with bytestrings. By the way, your usage of bytestrings is not so great for performance: constructing a separate singleton bytestring for each letter is very inefficient.
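
To illustrate with attoparsec’s real incremental interface (Data.Attoparsec.ByteString; the chunked driver and nextChunk source here are just a sketch):

import qualified Data.ByteString as BS
import Data.Attoparsec.ByteString (IResult (..), Parser, parse)

-- Drive an attoparsec parser by feeding whole chunks instead of one
-- singleton ByteString per character; 'nextChunk' is a hypothetical
-- source that returns BS.empty once the input is exhausted (an empty
-- feed signals end-of-input to attoparsec).
runChunked :: Parser a -> IO BS.ByteString -> IO (Either String a)
runChunked p nextChunk = go . parse p =<< nextChunk
  where
    go (Partial k)    = go . k =<< nextChunk  -- supply the next whole chunk
    go (Done _ r)     = return (Right r)
    go (Fail _ _ err) = return (Left err)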

I think the absolute fastest solution would allocate a single buffer of a few KiB and repeatedly fill it with new input and run a parallel/SIMD parser (using rank-select bit-strings) over it. Or perhaps something more like this video.

Hmm… it seemed the paper’s author was trying to solve the space problem, explicitly:

4.1.4 Space Consumption
(…) In order to be able to continue with the backtracking process (i.e. go
back to a previous choice point) the implementation keeps a reference in the
input which was passed to the composite parser. Unfortunately this is also the
case for the root symbol, and thus the complete input is kept in memory at least
until the first complete parse has been found (…)

4.1.5 Conclusions
(…) In solving the problems mentioned we will start with the space consumption
problem, and next we change the implementation to produce online results. (…)

Is it a case of intending for it to work, but not testing it? Or perhaps the goal changed somewhere along the line?

Yep, I’m aware. When I was first trying to pull values “out of thin air” and reading about stream types, it said the stream should be typed to match the individual elements. I assumed that would have to be characters to be general (e.g. if it were a list of records, then you would presume the stream was a list-of-list-of-records). By the time I found out the interface was done as ByteString I’d already written the character code, so I figured it would still make a good pathological test… (and I’d have a way to precisely know how much data the parser had asked for, down to the character)

You’re right, it is strange. I think it is more likely that it is a bug or a misunderstanding on my part.

Sadly, it seems that asking is not an option, the author of uu-parsinglib passed away in March of this year. :frowning:

Yes, I’m aware :frowning: . I’m going to start an internship at Utrecht University next week, so I might get an opportunity to ask some people who know more.

I have also done some inspection with ghc-heap-view, and that indicates another suspicion I had might be right: thunks are built up by the Str type because the current input position is not evaluated strictly after parsing each input token.

Good luck!

If you now feel personally invested in seeing memory usage be proportional to the needs of the specific parser passed in, I’ll wait for your findings vs. try to dig into it myself right now. (I’m still at the phase of “understanding how to use a parser library” vs. “second-guessing the internals of the designs of parser libraries.”)

If you strictly evaluate the position before parsing each list element, then the space consumption stays constant (if GC kicks in).

-- 'seq <$> pPos <*> pIP' yields 'seq pos ip', so the current position is
-- forced each time a parsed IP is demanded, preventing thunk build-up
main = traverse_ print
  (parse (pList (seq <$> pPos <*> pIP) <* pEnd)
         (createStr (LineColPos 0 0 0) (ipString 5)))

Unfortunately, the garbage collector does not always kick in automatically (I’ve tested this in GHCi by manually calling System.Mem.performGC). I don’t know why that is.

Yeah, that is a good plan. Attoparsec is more mature and streaming libraries are more predictable.

EDIT: I have done more testing. The garbage collector is not the problem. The problem was the way I was testing it. The lazy LineColPos update was really the only cause of the linear space use. I think it would be pretty easy to fix in uu-parsinglib.
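
For what it’s worth, a sketch of the kind of fix I mean, assuming a position-advancing function roughly like the one in uu-parsinglib’s BasicInstances (the library’s actual definition differs; this is illustrative only):

{-# LANGUAGE BangPatterns #-}
import Text.ParserCombinators.UU.BasicInstances (LineColPos (..))

-- Illustrative: force each component of the new position eagerly, so no
-- chain of (+1) thunks can build up while the input is being consumed.
advance' :: LineColPos -> Char -> LineColPos
advance' (LineColPos line col pos) c =
  let !pos' = pos + 1
  in case c of
       '\n' -> let !line' = line + 1 in LineColPos line' 0 pos'
       _    -> let !col'  = col + 1  in LineColPos line col' pos'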

EDIT2: I think I have fixed it. Nope, not completely.

EDIT3: I think this fix is sufficient. I have also noticed another leak related to the error correction, but your IP parsing example can be done in constant space now:

import           Text.ParserCombinators.UU
import           Text.ParserCombinators.UU.Utils
import           Text.ParserCombinators.UU.BasicInstances
import           Data.Word
import           Data.Foldable
import           System.Environment

data IP = IP !Word8 !Word8 !Word8 !Word8
  deriving Show

ips :: [String]
ips = ["192.168.1.0", "8.8.8.8", "255.255.255.0"]

ipString :: String
ipString = unlines $ cycle ips

pIP :: P (Str Char String LineColPos) IP
pIP =
  IP
    <$> pNaturalRaw
    <*  pSym '.'
    <*> pNaturalRaw
    <*  pSym '.'
    <*> pNaturalRaw
    <*  pSym '.'
    <*> pNaturalRaw
    <*  pSym '\n'

main :: IO ()
main = do
  (x : _) <- getArgs
  traverse_ print $ take (read x) $ parse
    (pList pIP <* pEnd)
    (createStr (LineColPos 0 0 0) ipString)

Note that I’m using pNaturalRaw because pNatural also consumes arbitrary spacing after the number, which conflicts with the pSym '\n' and is generally not wanted in IP addresses.

This can parse about 250,000 IPs/s on my machine. That is not terrible, but certainly not great.


I applied your patch and also see no memory growth for the code as given. Thanks for looking into all of this! Though I’ll have to learn more before I understand Monad vs. Applicative, or understand any of it well enough to modify it (none of my attempts at modifying it give what I expect…).

Note: I also found this article, which would have been nice to find earlier on:

https://www.tweag.io/blog/2017-10-05-streaming2/