File Reading & Decompression Performance Tips


#1

I am trying to open thousands of files with

Data.ByteString.Lazy.readFile from which I am seeing openFile: invalid argument (Invalid argument) for many files

and decompress them with

Codec.Compression.GZip.decompress from which I end up with *** Exception: Codec.Compression.Zlib: premature end of compressed data stream.

I suspect both are caused by performance issues. Can someone give me some performance tips to hopefully help me overcome these issues? Or perhaps there are relevant blog posts out there already?

Thanks!

**FYI, I never actually see resource usage increase on my laptop - which is odd…


#2

It looks like Conduit is the way to go… (I see Pipes and Streamly are also popular)

Nevertheless, these errors are strange to me…


#3

Now that I am more awake, this of course just looks like an iffy file is causing a problem… Sorry for the confusion!


#4

My guess is, I am opening too many files (3000) and running up against some OS (Win10) limit… – because limiting to 1500 files is OK for example.


#5

The problem here is lazy IO: Data.ByteString.Lazy.readFile opens the file for reading but does not actually read it yet. This means that you hold an open file descriptor for this file. Only when you consume the ByteString will the file actually be read, and only once you have consumed it completely will the file be closed and the file descriptor released.

Most operating systems have a limit on the number of file descriptors that a process or user can have open at the same time. If you open a large number of files this way, you will run into this limit.

The reason why you don’t see resource usage increase on your laptop is that open file descriptors have a relatively low overhead: almost nothing has been read into memory yet.

The solution here would be to restructure your program a bit, so that you e.g. read a file, decompress it, close it, and only then start opening the next file. Some libraries like conduit and pipes can help you with this, but they may be a bit overkill – it really depends on what you’re trying to do.
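To make the "one file at a time" idea concrete, here is a minimal sketch. It assumes each file fits comfortably in memory, and the name processOneAtATime is made up for illustration. The strict Data.ByteString.readFile reads the whole file and closes its handle before returning, so at most one file is open at any moment.

```haskell
import Control.Exception (evaluate)
import qualified Data.ByteString as BS
import qualified Data.ByteString.Lazy as LBS

-- | Hypothetical helper: apply a per-file function to each file in turn.
-- Because the read is strict, each handle is closed before the next
-- file is opened.
processOneAtATime :: (LBS.ByteString -> a) -> [FilePath] -> IO [a]
processOneAtATime f = mapM step
  where
    step path = do
      bytes <- BS.readFile path            -- strict: read everything, close the handle
      evaluate (f (LBS.fromStrict bytes))  -- start the work now (to WHNF only)
```

With the code from this thread, something like `processOneAtATime decompress fs` should slot in where `mapM (fmap decompress . LBS.readFile)` was. Note that a corrupt file would still raise its exception later, when the decompressed bytes are actually consumed.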


#6

Thanks.

I was thinking about this and listened to a podcast which mentioned the problem here towards the end:

https://www.haskellcast.com/episode/006-gabriel-gonzalez-and-michael-snoyman-on-pipes-and-conduit

My conclusion is… for this task - a script parsing thousands of files - I kind of like the default behaviour (apart from the file-descriptor limit [only an assumption that it is actually the problem] which seems somehow arbitrary… annoying…)


#7

You may be able to get around the problem by just changing a single line in your code, but it’s hard to help without seeing the code.


#8

V true, here is the code:

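import Codec.Compression.GZip (decompress)
import qualified Data.ByteString.Lazy as LBS (readFile)
import System.FilePath.Find (find, always, fileName, (~~?))
import Utils ((|>))
import Models.Msgs (getMsgs)

-- search is not shown in the post; presumably a thin wrapper over find, e.g.:
search pat = find always (fileName ~~? pat)

parse path query = do
  fs <- search "*.bin.gz" path
  -- lazy readFile: every handle stays open until its contents are consumed
  ds <- fs |> mapM (fmap decompress . LBS.readFile)
  ds |> map getMsgs |> query |> return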

perhaps v naive, but I suppose naivety has its place : ) // except of course I need to work around this openFile issue.


#9

For future reference, withFile seems to have done the trick!
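The post does not show how withFile was used, so the following is only a guess at the shape of the fix, with readAndProcess as a made-up name. The key point is that the result is fully forced inside the withFile block, since the handle is closed as soon as the block returns.

```haskell
import Control.Exception (evaluate)
import qualified Data.ByteString.Lazy as LBS
import System.IO (IOMode (ReadMode), withFile)

-- | Hypothetical helper: read a file lazily, transform it, and force the
-- whole result before withFile closes the handle.
readAndProcess :: (LBS.ByteString -> LBS.ByteString) -> FilePath -> IO LBS.ByteString
readAndProcess f path =
  withFile path ReadMode $ \h -> do
    raw <- LBS.hGetContents h
    let out = f raw
    _ <- evaluate (LBS.length out)  -- walks every chunk, so all needed input is read here
    return out
```

Used as `mapM (readAndProcess decompress) fs`, this keeps a single handle open at a time. Without the evaluate line, the lazy contents would be read after the handle is already closed, which typically shows up as truncated input.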


#10

For future reference, the following is one way to handle (avoid) exceptions while decompressing:

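import Codec.Compression.Zlib.Internal
  (decompressST, defaultDecompressParams, foldDecompressStreamWithInput, gzipFormat)
import qualified Data.ByteString as BS
import Utils ((|>))

-- Returns Nothing instead of throwing if the stream is corrupt or truncated.
decompress gz = gz
  |> foldDecompressStreamWithInput (\c a -> fmap (BS.append c) a) -- prepend each output chunk
                                   (\_ -> Just BS.empty)          -- end of stream
                                   (\_ -> Nothing)                -- on error
                                   (decompressST gzipFormat defaultDecompressParams)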

Following the details here: http://hackage.haskell.org/package/zlib-0.6.2/docs/Codec-Compression-Zlib-Internal.html
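A different way to get the same Maybe-style result, for anyone who prefers to stay with the plain Codec.Compression.GZip API: force the lazy output inside try, so the exception surfaces at a known point. This is only a sketch; tryDecompress and roundTrips are made-up names, and catching SomeException is deliberately broad because the exact exception type depends on the zlib version.

```haskell
import Codec.Compression.GZip (compress, decompress)
import Control.Exception (SomeException, evaluate, try)
import Data.Int (Int64)
import qualified Data.ByteString.Lazy as LBS

-- | Hypothetical helper: Nothing if the gzip stream is corrupt or truncated.
tryDecompress :: LBS.ByteString -> IO (Maybe LBS.ByteString)
tryDecompress gz = do
  let out = decompress gz
  -- Forcing the length walks the whole output, so any decompression
  -- error is raised (and caught) here rather than later.
  r <- try (evaluate (LBS.length out)) :: IO (Either SomeException Int64)
  return (either (const Nothing) (const (Just out)) r)

-- | Small check: compressing and then decompressing gives back the input.
roundTrips :: LBS.ByteString -> IO Bool
roundTrips bs = (== Just bs) <$> tryDecompress (compress bs)
```

The trade-off against the fold-based version is that this one holds the whole decompressed output in memory once it has been forced, which seems fine for a script but may matter for very large files.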