I have several very large files (many 10s of GB) I’m working with where I essentially need to read a line, parse, etc. The files are already sorted in order and my goal is a kind of merge sort between them and outputting the results. All the code i have to do this “in the small” works just fine and now I’m just scaling it up.
The primary issue that I have is (I think) limited to small place in the code. Early in the app when I load up each of the files I perform a recursive do that returns a list of the parsed lines (pseudo-code):
But the above doesn’t appear to be lazy. For example, if I do:
readLines h >>= mapM_ print
It builds the entire list of lines in memory before printing them.
Before I head down the road of io-streams (which may be the best solution anyway), I was wondering if there was a method of doing the above that I’m just missing?
At the end of the day, my hope was that all the code would coalesce down into a simple:
I might be misunderstanding the problem (in which case, disregard), but streaming libraries are designed specifically for problems like this. See e.g. the example from streamly: Introduction.html - Haskell Streamly Documentation. Also check out pipes, which is a less performance-tuned but very user friendly streaming library.
Btw, what happens if you replace your mdo with do?
A good first step would be to have a version of withFile which opened several files at the same time and let you work with the list of Handles in a callback (while still offering the cleanup assurances of withFile).
This function that filters the list of Handles and keeps only those that haven’t reached eof could also be useful:
filterEOF :: [Handle] -> IO [Handle]
filterEOF = filterM $ \h -> do
eof <- hIsEOF h
-- hClose will get called again at the end of withManyFiles,
-- but that isn't a problem
when eof $ hClose h
return (not eof)
-- lazy.hs
import System.IO (stdin, Handle)
import qualified Data.ByteString.Lazy as BSL
import qualified Data.ByteString.Lazy.Char8 as BSL
type ParsedLine = BSL.ByteString
parse = id
readLines :: Handle -> IO [ParsedLine]
readLines file = do
cont <- BSL.hGetContents file
pure (map parse (BSL.lines cont))
main = readLines stdin >>= mapM_ print
I can’t claim to understand what’s happening in your example. It has something to do with the fact that the fmap inside readLines is sequencing each and every hGetLine in the IO monad, perhaps combined with the fact that ByteString is stricter than you might expect in some cases.
For instance, my example works because I’m using Data.ByteString.Lazy.hGetContents, which specifically states that it reads the contents lazily. Contrast to Data.ByteString.hGetContents which states it reads the contents strictly.
I’d like to thank everyone for their responses. Shortly after posting I realized why it wouldn’t work as I expected (due to the use of the returned, recursive value to build the list).
I started looking into other packages like conduit, io-streams, and streamly (thanks, @reuben) and all seem to be able to do what I want. Although - from first glance and trying things out - I tend to like conduit more than the others.