I have several very large files (many tens of GB) that I'm working with, where I essentially need to read a line, parse it, and so on. The files are already sorted, and my goal is a kind of merge sort between them, outputting the results. All the code I have to do this "in the small" works just fine, and now I'm just scaling it up.
The primary issue I have is (I think) limited to a small place in the code. Early in the app, when I load up each of the files, I read it with a recursive function that returns a list of the parsed lines (pseudo-code):
readLines :: Handle -> IO [ParsedLine]
readLines file = do
  eof <- hIsEOF file
  if eof
    then return []
    else do
      line <- BS.hGetLine file
      (parse line :) <$> readLines file
But the above doesn’t appear to be lazy. For example, if I do:
readLines h >>= mapM_ print
It builds the entire list of lines in memory before printing them.
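That matches how IO sequencing works: the do-block has to finish every recursive hGetLine before the outer cons cell is returned. The classic escape hatch is unsafeInterleaveIO from System.IO.Unsafe, which defers the recursive read until its result is actually demanded. A sketch, with hypothetical ParsedLine/parse stand-ins (the question's real types would go here):

```haskell
import System.IO
import System.IO.Unsafe (unsafeInterleaveIO)
import qualified Data.ByteString.Char8 as BS

-- Hypothetical stand-ins for the question's parse/ParsedLine.
type ParsedLine = BS.ByteString

parse :: BS.ByteString -> ParsedLine
parse = id

-- Each recursive read is deferred until the tail of the list is demanded,
-- so lines are read (and parsed) incrementally as the list is consumed.
readLinesInterleaved :: Handle -> IO [ParsedLine]
readLinesInterleaved h = do
  eof <- hIsEOF h
  if eof
    then return []
    else do
      line <- BS.hGetLine h
      rest <- unsafeInterleaveIO (readLinesInterleaved h)
      return (parse line : rest)
```

As with all lazy IO, the handle must stay open until the list has been fully consumed, and exceptions can surface at whatever point the list is forced.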
Before I head down the road of io-streams (which may be the best solution anyway), I was wondering whether there is a method of doing the above that I'm just missing?
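One method in that direction is lazy ByteStrings: BL.hGetContents reads the file on demand, so splitting it into lines and mapping a parser over them yields an incrementally produced list without any explicit recursion. A sketch, again with hypothetical ParsedLine/parse stand-ins:

```haskell
import System.IO (Handle)
import qualified Data.ByteString.Lazy.Char8 as BL

-- Hypothetical stand-ins for the question's parse/ParsedLine.
type ParsedLine = BL.ByteString

parse :: BL.ByteString -> ParsedLine
parse = id

-- hGetContents returns a lazily read ByteString; BL.lines and map
-- are both lazy, so parsed lines are produced as they are consumed.
readLinesLazy :: Handle -> IO [ParsedLine]
readLinesLazy h = map parse . BL.lines <$> BL.hGetContents h
```

Note that hGetContents puts the handle into a semi-closed state: the file is closed only once the string has been fully read, so the same lazy-IO caveats about handle lifetime apply.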
At the end of the day, my hope was that all the code would coalesce down into a simple:
merge <$> mapM readLines fileHandles
And produce a lazy list of the results.
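Given one lazy list per file, the merge itself can be an ordinary pure function and stays lazy for free. A minimal sketch of a k-way merge over already-sorted lists (this `merge` is my illustration, not a library function; it assumes the element type has an Ord instance):

```haskell
-- Lazily merge two sorted lists, preserving order.
merge2 :: Ord a => [a] -> [a] -> [a]
merge2 [] ys = ys
merge2 xs [] = xs
merge2 xxs@(x:xs) yys@(y:ys)
  | x <= y    = x : merge2 xs yys
  | otherwise = y : merge2 xxs ys

-- Fold pairwise merging over any number of sorted input lists.
-- Laziness means elements are produced as soon as they can be compared.
merge :: Ord a => [[a]] -> [a]
merge = foldr merge2 []
```

For a large number of files, a heap-based k-way merge would compare fewer elements per output item, but the pairwise fold keeps the sketch short.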