File Reading & Decompression Performance Tips

mmport80 · July 30, 2019, 6:44am

I am trying to open thousands of files with

Data.ByteString.Lazy.readFile from which I am seeing openFile: invalid argument (Invalid argument) for many files

and decompress them with

Codec.Compression.GZip.decompress from which I end up with *** Exception: Codec.Compression.Zlib: premature end of compressed data stream.

I suspect both are caused by performance issues. Can someone give me some performance tips to help me overcome them? Or perhaps there are relevant blog posts out there already?

Thanks!

FYI, I never actually see resource usage increase on my laptop, which is odd…

mmport80 · July 30, 2019, 8:48am

It looks like Conduit is the way to go… (I see Pipes and Streamly are also popular)

Nevertheless, these errors are strange to me…

mmport80 · July 30, 2019, 9:30am

Now that I am more awake, this of course just looks like an iffy file is causing a problem… Sorry for the confusion!

mmport80 · July 30, 2019, 9:40am

My guess is, I am opening too many files (3000) and running up against some OS (Win10) limit… – because limiting to 1500 files is OK for example.

jaspervdj · July 30, 2019, 12:00pm

The problem here is lazy IO: using Data.ByteString.Lazy.readFile will open the file for reading, but not actually read the file. This means that you have an open file descriptor for this file. Only when you consume the ByteString, the file will actually be read, and if you consume it completely, the file will be closed and the file descriptor released.

Most operating systems have a limit on the number of file descriptors that a user can have open at the same time. If you open a large number of files this way, you will run into this limit.

The reason why you don’t see resource usage increase on your laptop is that file descriptors have a relatively low overhead.

The solution here would be to restructure your program a bit, so that you e.g. read a file, decompress it, close it, and only then start opening the next file. Some libraries like conduit and pipes can help you with this, but they may be a bit overkill – it really depends on what you’re trying to do.
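As a rough sketch of that restructuring (not code from this thread), each file can be read strictly so that its handle is already closed before the next one is opened. The name `processEach` and the `process` parameter are hypothetical; `process` stands in for whatever is done per file, such as decompressing and parsing.

```haskell
import qualified Data.ByteString as BS
import qualified Data.ByteString.Lazy as LBS

-- | Read each file strictly and apply `process` to its contents.
-- `processEach` and `process` are illustrative names, not from the thread.
processEach :: (LBS.ByteString -> a) -> [FilePath] -> IO [a]
processEach process = mapM readOne
  where
    readOne path = do
      -- Strict readFile reads the whole file and closes the handle
      -- before returning, so only one file is open at a time.
      bytes <- BS.readFile path
      -- ($!) forces the result to weak head normal form only.
      pure $! process (LBS.fromStrict bytes)
```

The trade-off is that each file is held in memory in full while it is processed, which is usually acceptable for many small files but not for very large ones.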

mmport80 · July 30, 2019, 3:07pm

Thanks.

I was thinking about this and listened to a podcast which mentioned the problem here towards the end:

https://www.haskellcast.com/episode/006-gabriel-gonzalez-and-michael-snoyman-on-pipes-and-conduit

My conclusion is… for this task - a script parsing thousands of files - I kind of like the default behaviour (apart from the file-descriptor limit [only an assumption that it is actually the problem] which seems somehow arbitrary… annoying…)

jaspervdj · July 30, 2019, 5:06pm

You may be able to get around the problem by just changing a single line in your code, but it’s hard to help without seeing the code.

mmport80 · July 30, 2019, 8:54pm

V true, here is the code:

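import Codec.Compression.GZip (decompress)
import qualified Data.ByteString.Lazy as LBS (readFile)
import System.FilePath.Find (find, always, fileName, (~~?))
import Utils ((|>))
import Models.Msgs (getMsgs)

-- assumed definition: `search` was not shown in the post
search pat = find always (fileName ~~? pat)

parse path query = do
  fs <- search "*.bin.gz" path
  -- LBS.readFile is lazy: each handle stays open until its contents are consumed
  ds <- fs |> mapM (fmap decompress . LBS.readFile)
  ds |> map getMsgs |> query |> return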

perhaps v naive, but I suppose naivety has its place : ) // except of course I need to work around this openFile issue.

mmport80 · August 1, 2019, 6:20am

For future reference, withFile seems to have done the trick!
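The post does not show how withFile was used, so the following is only a sketch of one common pattern. The names `withFileContents` and `fileSize` are hypothetical. The point to watch is that the contents are still read lazily, so the action has to finish consuming them before withFile closes the handle; returning an unevaluated result from the action would defer the read until after the handle is closed.

```haskell
import Control.Exception (evaluate)
import qualified Data.ByteString.Lazy as LBS
import Data.Int (Int64)
import System.IO (IOMode (ReadMode), withFile)

-- | Run an action on a file's lazily read contents while the handle is
-- still open. The action must fully consume what it needs before returning.
-- (`withFileContents` is an illustrative helper, not a library function.)
withFileContents :: FilePath -> (LBS.ByteString -> IO a) -> IO a
withFileContents path action =
  withFile path ReadMode $ \h -> LBS.hGetContents h >>= action

-- | Example action: count the bytes. `evaluate` forces the length, and
-- therefore the whole read, inside the withFile scope.
fileSize :: FilePath -> IO Int64
fileSize path = withFileContents path (evaluate . LBS.length)
```

Used with something like `mapM fileSize paths`, each handle is opened and closed in turn, so the number of open descriptors stays at one regardless of how many files are processed.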

mmport80 · August 1, 2019, 6:22am

For future reference, the following is one way to handle (avoid) exceptions while decompressing:

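import qualified Data.ByteString as BS
import qualified Data.ByteString.Lazy as LBS
import Codec.Compression.Zlib.Internal (foldDecompressStreamWithInput, decompressST, gzipFormat, defaultDecompressParams)
import Utils ((|>))

-- Nothing on a corrupt or truncated stream, instead of an exception
decompress :: LBS.ByteString -> Maybe BS.ByteString
decompress gz = gz
  |> foldDecompressStreamWithInput (\ c a -> fmap (BS.append c) a)
                                   (\_ -> Just BS.empty)
                                   -- on error
                                   (\_ -> Nothing)
                                   (decompressST gzipFormat defaultDecompressParams)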

Following the details here: http://hackage.haskell.org/package/zlib-0.6.2/docs/Codec-Compression-Zlib-Internal.html
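A possible variation on the same idea, sketched here under the assumption that the zlib package is available: collecting the output chunks in a list and assembling a lazy ByteString at the end avoids copying the accumulated output on every `BS.append`. The names `safeDecompress` and `roundTrips` are hypothetical.

```haskell
import qualified Codec.Compression.GZip as GZip
import Codec.Compression.Zlib.Internal
  ( decompressST, defaultDecompressParams
  , foldDecompressStreamWithInput, gzipFormat )
import qualified Data.ByteString.Lazy as LBS

-- | Decompress a gzip stream, returning Nothing on any decompression error
-- instead of throwing. (`safeDecompress` is an illustrative name.)
safeDecompress :: LBS.ByteString -> Maybe LBS.ByteString
safeDecompress input =
  LBS.fromChunks <$>
    foldDecompressStreamWithInput
      (\chunk rest -> (chunk :) <$> rest)  -- prepend each output chunk
      (\_leftover -> Just [])              -- end of stream: ignore trailing input
      (\_err -> Nothing)                   -- any error discards the result
      (decompressST gzipFormat defaultDecompressParams)
      input

-- | Check that compressing and then decompressing returns the payload.
roundTrips :: LBS.ByteString -> Bool
roundTrips payload = safeDecompress (GZip.compress payload) == Just payload
```

Because the result is wrapped in Maybe, the whole stream has to be decompressed before the caller learns whether it succeeded, so this gives up the incremental behaviour of the lazy `decompress`.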
