Lazy file reading

I have several very large files (many 10s of GB) I’m working with where I essentially need to read a line, parse, etc. The files are already sorted in order and my goal is a kind of merge sort between them and outputting the results. All the code i have to do this “in the small” works just fine and now I’m just scaling it up.

The primary issue that I have is (I think) limited to small place in the code. Early in the app when I load up each of the files I perform a recursive do that returns a list of the parsed lines (pseudo-code):

readLines :: Handle -> IO [ParsedLine]
readLines file = mdo
  line <- BS.hGetLine file
  ((parse line) :) <$> readLines file

But the above doesn’t appear to be lazy. For example, if I do:

readLines h >>= mapM_ print

It builds the entire list of lines in memory before printing them.

Before I head down the road of io-streams (which may be the best solution anyway), I was wondering if there was a method of doing the above that I’m just missing?

At the end of the day, my hope was that all the code would coalesce down into a simple:

merge (sequence $ mapM readLines fileHandles)

And produce a lazy list of the results.

3 Likes

I might be misunderstanding the problem (in which case, disregard), but streaming libraries are designed specifically for problems like this. See e.g. the example from streamly: Introduction.html - Haskell Streamly Documentation. Also check out pipes, which is a less performance-tuned but very user friendly streaming library.

Btw, what happens if you replace your mdo with do?

mdo
  line <- BS.hGetLine file
  ((parse line) :) <$> readLines file

Is equivalent to

do
  line <- BS.hGetLine file
  rest <- readLines file
  pure $ (parse line) : rest

which is

do
  line <- BS.hGetLine file
  line' <- BS.hGetLine file
  line'' <- BS.hGetLine file
  ...
  pure $ (parse line) : (parse line') : (parse line'') : ...

So it should always read the whole file and throw an EOF error since it didn’t handle that.

A good first step would be to have a version of withFile which opened several files at the same time and let you work with the list of Handles in a callback (while still offering the cleanup assurances of withFile).

We can build such a function using ContT:

import Control.Monad.Trans.Cont
import System.IO

withManyFiles :: [FilePath] -> IOMode -> ([Handle] -> IO r) -> IO r
withManyFiles files mode =
  runContT $ traverse (`withFileCont` mode) files
  where
    withFileCont :: FilePath -> IOMode -> ContT r IO Handle
    withFileCont filepath mode = ContT (withFile filepath mode)

This function that filters the list of Handles and keeps only those that haven’t reached eof could also be useful:

filterEOF :: [Handle] -> IO [Handle]
filterEOF = filterM $ \h -> do
    eof <- hIsEOF h
    -- hClose will get called again at the end of withManyFiles,
    -- but that isn't a problem
    when eof $ hClose h
    return (not eof)

Not that I think it’s a good idea, but I think to get a lazy readLines you want to use unsafeInterleaveIO, along the lines of

readLines file = unsafeInterleaveIO $ mdo ...

@risrr451 , this seems to work:

yes | runghc lazy.hs where

-- lazy.hs
import System.IO (stdin, Handle)
import qualified Data.ByteString.Lazy as BSL
import qualified Data.ByteString.Lazy.Char8 as BSL

type ParsedLine = BSL.ByteString
parse = id

readLines :: Handle -> IO [ParsedLine]
readLines file = do
  cont <- BSL.hGetContents file
  pure (map parse (BSL.lines cont))

main = readLines stdin >>= mapM_ print

I can’t claim to understand what’s happening in your example. It has something to do with the fact that the fmap inside readLines is sequencing each and every hGetLine in the IO monad, perhaps combined with the fact that ByteString is stricter than you might expect in some cases.

For instance, my example works because I’m using Data.ByteString.Lazy.hGetContents, which specifically states that it reads the contents lazily. Contrast to Data.ByteString.hGetContents which states it reads the contents strictly.

I’d like to thank everyone for their responses. Shortly after posting I realized why it wouldn’t work as I expected (due to the use of the returned, recursive value to build the list).

I started looking into other packages like conduit, io-streams, and streamly (thanks, @reuben) and all seem to be able to do what I want. Although - from first glance and trying things out - I tend to like conduit more than the others.

@chreekat, thanks for your example!

@danidiaz, I haven’t used ContT before, so I’ll have to take a look at that one.

1 Like