Memory performance when reading large files

I documented a performance challenge in the format of a Stack Overflow question:

I am reading two big text files of 5’700’000 and 1’900’000 lines respectively. One contains natural-language word frequency information, the other is a word list; I use the first to order the second. The memory use of my algorithm seems excessive, but I can’t seem to make friends with the garbage collector.

I calculated the expected size of my data structures to be 717 MB in total, yet the heap profile shows memory use well above 800 MB (which isn’t a large discrepancy per se). Worse than that, running top shows memory use of around 1.8 GB, with a peak of 2.6 GB.

The code is provided here, in case you want to try it yourself.

Thanks for any help on this matter. I am stuck; all kinds of advice, be it general or specific, is appreciated. :slight_smile:


Success of some kind: here you see the heap by type (+RTS -hT) for three versions of the program run sequentially.

Nr. 1 uses strict IO. Nr. 2 arguably has a better profile: using lazy IO, the blue part “ARR_WORDS” grows incrementally, without any obvious downside.

In Nr. 3 I managed to get rid of the violet “THUNK” part entirely; in versions Nr. 1 and Nr. 2 that thunk only gets garbage collected eventually.

The key was changing from

ls <- Text.lines <$> Text.readFile fileData -- Version 1

to

ls <- fmap Lazy.toStrict . Lazy.lines <$> Lazy.readFile fileData -- Version 2

to

ls <- (Lazy.toStrict <$!>) . Lazy.lines <$> Lazy.readFile fileData -- Version 3

This is the second time that trying to add strictness (Lazy.toStrict .) only really worked after, well, doing it strictly (Lazy.toStrict <$!>). This last line might be a useful pattern in general when loading big files.
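
For reference, here is that last version as a self-contained sketch (the function name is mine; the qualified imports are the ones the snippets above assume):

import Control.Monad ((<$!>))
import qualified Data.Text as Text
import qualified Data.Text.Lazy as Lazy
import qualified Data.Text.Lazy.IO as Lazy

-- Read the file lazily, but force every line to a strict Text as soon as its
-- list cell is produced, so the lines don't linger on the heap as toStrict thunks.
readLinesStrictly :: FilePath -> IO [Text.Text]
readLinesStrictly file = (Lazy.toStrict <$!>) . Lazy.lines <$> Lazy.readFile file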

However, all of this is still not satisfying to me. Neither of my changes affects the peak memory usage.

I believe this is because GHC uses a copying garbage collector which means that you need at least 2x the live memory. And there was some other factor which bumps that up to about 3x, so your numbers seem quite reasonable to me. I remember that there was some explanation in this video from ZuriHac, I think around 1:14:30.
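
As a rough sanity check with your own numbers: 3 × 0.8 GB of live heap ≈ 2.4 GB, which sits right in the 1.8 GB to 2.6 GB range you observed in top.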

Thanks!

https://downloads.haskell.org/~ghc/7.0.3/docs/html/users_guide/prof-heap.html

I found this passage right at the bottom, confirming your point:

There is an overhead of profiling itself, which is subtracted from the residency figures by the profiler. This overhead goes away when compiling without profiling support, of course. The space overhead is currently 2 extra words per heap object, which probably results in about a 30% overhead.

Garbage collection requires more memory than the actual residency. The factor depends on the kind of garbage collection algorithm in use: a major GC in the standard generation copying collector will usually require 3L bytes of memory, where L is the amount of live data. This is because by default (see the +RTS -F option) we allow the old generation to grow to twice its size (2L) before collecting it, and we require additionally L bytes to copy the live data into. When using compacting collection (see the +RTS -c option), this is reduced to 2L, and can further be reduced by tweaking the -F option. Also add the size of the allocation area (currently a fixed 512Kb).

The stack isn’t counted in the heap profile by default. See the +RTS -xt option.

The program text itself, the C stack, any non-heap data (eg. data allocated by foreign libraries, and data allocated by the RTS), and mmap()'d memory are not counted in the heap profile.

I will check top with profiling disabled and I will play around with +RTS -A -G to see if I can optimize that.
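
(For example, something like ./program +RTS -s -A64m -RTS to enlarge the allocation area; the binary name is just a placeholder here, and the executable needs to be built with -rtsopts for those RTS flags to be accepted.)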

Here is a modification of your program which is twice as fast and half as hungry for memory:

{-# LANGUAGE OverloadedStrings #-}

module Main where

import qualified Data.ByteString as BS
import qualified Data.ByteString.Unsafe as BS
import qualified Data.ByteString.Builder as BSB
import qualified Data.ByteString.Char8 as BS (lines, readInt)
import Data.List (sortOn)
import qualified Data.Map.Strict as Map

main :: IO ()
main = do
    mapFrequencies <- Map.fromList . parseFrequencies <$> BS.readFile fileFrequencies
    ls <- BS.lines <$> BS.readFile fileData
    let sorted = sortOn (\k -> Map.findWithDefault 0 k mapFrequencies) ls
    BSB.writeFile fileSorted $ foldMap ((<> "\n") . BSB.byteString) sorted

fileFrequencies :: FilePath
fileFrequencies = "deu_news_2020_freq.txt"

fileData :: FilePath
fileData = "german.utf8.dic"

fileSorted :: FilePath
fileSorted = "german.utf8.sorted.dic"

parseFrequencies :: BS.ByteString -> [(BS.ByteString, Int)]
parseFrequencies bs = case BS.uncons bs of
    Nothing -> []
    -- this is admittedly brittle, just to demonstrate single-pass parsing with readInt
    Just (35, _) -> parseFrequencies (BS.unsafeTail (BS.dropWhile (/= 10) bs)) -- 35 is '#', 10 is '\n'
    _ -> let (w, f) = BS.break (== 9) bs in -- 9 is '\t'
         case BS.readInt (BS.unsafeTail f) of
                Just (i, bs') -> (w, i) : parseFrequencies (BS.unsafeTail bs')
                Nothing -> []

Key insights:

  1. Your data is UTF-8, but all parsing-sensitive characters (which are #, \t and \n) are ASCII. This suggests using ByteString instead of Text, immediately halving the memory requirements.
  2. Both files must be loaded in full before sorting, so there is no point in using a lazy ByteString; in fact it will degrade performance by increasing heap fragmentation.
  3. Since the keys are short strings, there is unlikely to be much benefit from HashMap; we can use Map and not waste space on hashes.
  4. No need to involve Vector just for the purpose of sorting; plain old Data.List.sortOn is good enough.
  5. Use findWithDefault instead of fromMaybe / lookup, because it allocates less.
  6. Use readInt to parse integers from a ByteString instead of resorting to read . unpack (see the small example after this list).
  7. Use Builder for the output, to avoid constructing it in memory in full.
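
To illustrate point 6: readInt parses a leading integer and hands back the rest of the input, which is what lets parseFrequencies above walk the file in a single pass. In GHCi (with OverloadedStrings):

BS.readInt "123\tfoo"   -- Just (123, "\tfoo")
BS.readInt "german"     -- Nothing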

EDIT: @Bodigrim’s answer just appeared; it seems to perform much better than anything I did.

I posted my current state of optimization on Stack Overflow.

In conclusion, ls <- (Lazy.toStrict <$!>) . Lazy.lines <$> Lazy.readFile file seems a good choice for reading files line by line.

I guess Data.List.sort is fine. I thought that at this size, sorting in place would be worthwhile. (And Data.Vector.Vector has a slightly smaller memory footprint than a plain list.)

Thanks a lot, I will try your code and do some comparisons!

I’d like to add

pound :: Word8
pound = 35 -- '#'

linefeed :: Word8
linefeed = 10 -- '\n'

tab :: Word8
tab = 9 -- '\t'
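
With those constants the magic numbers in parseFrequencies read much better. A sketch of how it could look (same logic as @Bodigrim’s version above, untested):

parseFrequencies :: BS.ByteString -> [(BS.ByteString, Int)]
parseFrequencies bs = case BS.uncons bs of
    Nothing -> []
    -- the literal pattern Just (35, _) becomes a guard, since pound is an ordinary binding
    Just (c, _) | c == pound -> parseFrequencies (BS.unsafeTail (BS.dropWhile (/= linefeed) bs))
    _ -> let (w, f) = BS.break (== tab) bs in
         case BS.readInt (BS.unsafeTail f) of
            Just (i, bs') -> (w, i) : parseFrequencies (BS.unsafeTail bs')
            Nothing -> []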

The results are in :slight_smile:

First, you see my version, after my attempts at optimizing using

  • Vector
  • sorting algo from Data.Vector.Algorithms
  • Text
  • a mix of lazy and eager functions to avoid thunks

Second, you see @Bodigrim’s code. I am amazed.

Two main take-aways for me:

  • custom ByteString parsing beats Text.lines performance-wise, as promised by @Bodigrim
  • Data.List.sort seems to allow writing out the beginning of the sorted list (and garbage-collecting it) before the sort has finished. At least this is how I explain the memory profile.

The sorting algorithms from Data.Vector.Algorithms are either poorly optimized, or I underestimated the efficiency of list sorting compared to in-place sorting. (The combination of Vector and Map, instead of HashMap, did have lower peak memory use, but that would not justify using Vector over a list in this instance.)
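
For reference, the in-place variant with vector-algorithms looks roughly like this (a generic sketch, not the exact code from my repository):

import qualified Data.Vector as V
import qualified Data.Vector.Algorithms.Intro as Intro
import Data.Ord (comparing)

-- V.modify copies the vector, runs the in-place introsort on the mutable copy
-- and freezes the result again.
sortVecOn :: Ord b => (a -> b) -> V.Vector a -> V.Vector a
sortVecOn key = V.modify (Intro.sortBy (comparing key))

Unlike sortOn, this recomputes the key on every comparison, so in practice one would sort a vector of (frequency, word) pairs instead.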

The code can be found here.


Two more things.

The third graph shows the difference between the following two lines:

ls <- Char8.lines <$> BS.readFile fileData -- middle graph

and

ls <- (LBS.toStrict <$!>) . LChar8.lines <$> LBS.readFile fileData -- rightmost graph

The impact isn’t big, which leads to the second thing:

Why is there a spike on top of the peak? Peak memory usage does seem higher than necessary (remember that we got rid of HashMap in favor of Map), and I will try to reduce it further. You can see from the graph that the spike consists of (,) and (:) cells. Could it be that the list is zipped with the Ints by sortOn?
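
(Looking at how sortOn is defined in Data.List, that seems plausible; it decorates every element with its key before sorting, roughly

sortOn :: Ord b => (a -> b) -> [a] -> [a]
sortOn f = map snd . sortBy (comparing fst) . map (\x -> let y = f x in y `seq` (y, x))

which would account for the (,) and (:) cells at the very top of the peak.)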


Below are the runtime statistics (+RTS -s), which are not affected by heap profiling.

Algo 1: vector, hashmap

  45,188,255,784 bytes allocated in the heap
   2,943,650,440 bytes copied during GC
     836,215,912 bytes maximum residency (14 sample(s))
     197,200,792 bytes maximum slop
            1889 MiB total memory in use (0 MB lost due to fragmentation)

                                     Tot time (elapsed)  Avg pause  Max pause
  Gen  0     43127 colls,     0 par    3.727s   3.734s     0.0001s    0.0116s
  Gen  1        14 colls,     0 par    1.993s   1.993s     0.1424s    1.0522s

  INIT    time    0.000s  (  0.000s elapsed)
  MUT     time   16.800s  ( 16.946s elapsed)
  GC      time    5.720s  (  5.727s elapsed)
  EXIT    time    0.000s  (  0.000s elapsed)
  Total   time   22.520s  ( 22.674s elapsed)

  %GC     time       0.0%  (0.0% elapsed)

  Alloc rate    2,689,791,570 bytes per MUT second

  Productivity  74.6% of total user, 74.7% of total elapsed

1937512

Algo 2: custom bytestring parsing, map, sort

   11,283,802,312 bytes allocated in the heap
   4,285,010,392 bytes copied during GC
     506,851,512 bytes maximum residency (11 sample(s))
       2,793,288 bytes maximum slop
            1161 MiB total memory in use (0 MB lost due to fragmentation)

                                     Tot time (elapsed)  Avg pause  Max pause
  Gen  0     10712 colls,     0 par    1.878s   1.880s     0.0002s    0.0008s
  Gen  1        11 colls,     0 par    1.013s   1.013s     0.0921s    0.2681s

  INIT    time    0.000s  (  0.000s elapsed)
  MUT     time    4.816s  (  4.908s elapsed)
  GC      time    2.891s  (  2.894s elapsed)
  EXIT    time    0.000s  (  0.000s elapsed)
  Total   time    7.707s  (  7.802s elapsed)

  %GC     time       0.0%  (0.0% elapsed)

  Alloc rate    2,343,009,063 bytes per MUT second

  Productivity  62.5% of total user, 62.9% of total elapsed

1191852

Algo 3: like 2, with lazy bytestring

  11,348,410,528 bytes allocated in the heap
   4,588,097,368 bytes copied during GC
     546,505,472 bytes maximum residency (14 sample(s))
       4,579,624 bytes maximum slop
            1236 MiB total memory in use (0 MB lost due to fragmentation)

                                     Tot time (elapsed)  Avg pause  Max pause
  Gen  0     10771 colls,     0 par    1.882s   1.884s     0.0002s    0.0009s
  Gen  1        14 colls,     0 par    1.169s   1.169s     0.0835s    0.3114s

  INIT    time    0.000s  (  0.000s elapsed)
  MUT     time    4.738s  (  4.814s elapsed)
  GC      time    3.051s  (  3.053s elapsed)
  EXIT    time    0.000s  (  0.000s elapsed)
  Total   time    7.789s  (  7.867s elapsed)

  %GC     time       0.0%  (0.0% elapsed)

  Alloc rate    2,394,964,805 bytes per MUT second

  Productivity  60.8% of total user, 61.2% of total elapsed

1268920

The last line of each block is the output of /usr/bin/env time -f '%M', showing the peak resident set size in kilobytes. My little optimization didn’t do any good, all in all.

Each ByteString in bytestring-0.10 costs 32 bytes on top of the actual payload (which is very short in your use case): there is a constructor, a pointer, an offset and a length, each 8 bytes long. Further, Map takes a lot of space to store a huge binary tree: each internal node carries 48 bytes of overhead. So for each input word, which is ~8 bytes on average, you need 8 + 32 + 48 = 88 bytes, 11 times more. Your input files are ~120 MB, so a residency of around 120 × 11 ≈ 1300 MB is to be expected. I would say that things actually behave better than one could expect.

If you restrict to bytestring >= 0.11, memory residency will be a bit better, but not much.

You can also pass ghc-options: --nonmoving-gc so that the peak memory according to top matches maximum residency as measured by GHC internally.
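
For example, something along these lines in the cabal file (one way to bake the RTS flag into the binary; the non-moving collector also needs the threaded RTS):

ghc-options: -threaded -with-rtsopts=--nonmoving-gc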

These are my testing results:

ghc      gc         peak mem   runtime
8.10.7   copying    1.13 GB    8.9s
8.10.7   nonmoving  1.91 GB    9.4s
9.0.1    copying    1.12 GB    9.0s
9.0.1    nonmoving  1.47 GB    9.4s
9.2.1    copying    1.23 GB    8.9s
9.2.1    nonmoving  1.57 GB    9.5s

Updated code on GitHub.

The nonmoving garbage collector didn’t help improve performance. Note that the nonmoving GC requires compilation with -threaded. Switching to bytestring > 0.11.1 did save me about 200 MB.

ghc-8.10.7, --copying-gc

  10,854,192,904 bytes allocated in the heap
   4,168,751,200 bytes copied during GC
     483,539,712 bytes maximum residency (12 sample(s))
       2,217,216 bytes maximum slop
            1099 MiB total memory in use (0 MB lost due to fragmentation)

                                     Tot time (elapsed)  Avg pause  Max pause
  Gen  0     10286 colls,     0 par    1.849s   1.851s     0.0002s    0.0007s
  Gen  1        12 colls,     0 par    1.014s   1.014s     0.0845s    0.2666s

  TASKS: 4 (1 bound, 3 peak workers (3 total), using -N1)

  SPARKS: 0 (0 converted, 0 overflowed, 0 dud, 0 GC'd, 0 fizzled)

  INIT    time    0.000s  (  0.000s elapsed)
  MUT     time    4.700s  (  4.746s elapsed)
  GC      time    2.863s  (  2.866s elapsed)
  EXIT    time    0.000s  (  0.008s elapsed)
  Total   time    7.563s  (  7.620s elapsed)

  Alloc rate    2,309,560,431 bytes per MUT second

  Productivity  62.1% of total user, 62.3% of total elapsed

ghc-8.10.7, --nonmoving-gc

  10,854,191,096 bytes allocated in the heap
   3,014,050,232 bytes copied during GC
   1,359,623,712 bytes maximum residency (35 sample(s))
18,446,744,073,709,367,944 bytes maximum slop
            2253 MiB total memory in use (436 MB lost due to fragmentation)

                                     Tot time (elapsed)  Avg pause  Max pause
  Gen  0     10263 colls,     0 par    2.946s   2.949s     0.0003s    0.0011s
  Gen  1        35 colls,     0 par    0.041s   0.041s     0.0012s    0.0148s
  Gen  1        35 syncs,                       0.040s     0.0012s    0.0261s
  Gen  1      concurrent,              3.057s   6.547s     0.1871s    2.2230s

  TASKS: 38 (35 bound, 3 peak workers (3 total), using -N1)

  SPARKS: 0 (0 converted, 0 overflowed, 0 dud, 0 GC'd, 0 fizzled)

  INIT    time    0.000s  (  0.000s elapsed)
  MUT     time    5.292s  (  5.338s elapsed)
  GC      time    2.987s  (  2.990s elapsed)
  CONC GC time    3.057s  (  6.547s elapsed)
  EXIT    time    0.000s  (  0.002s elapsed)
  Total   time   11.335s  (  8.330s elapsed)

  Alloc rate    2,051,224,999 bytes per MUT second

  Productivity  73.6% of total user, 64.1% of total elapsed

ghc-9.0.1, --copying-gc

  11,142,965,792 bytes allocated in the heap
   4,160,725,696 bytes copied during GC
     481,143,792 bytes maximum residency (12 sample(s))
       2,229,264 bytes maximum slop
            1095 MiB total memory in use (0 MB lost due to fragmentation)

                                     Tot time (elapsed)  Avg pause  Max pause
  Gen  0     10566 colls,     0 par    1.917s   1.920s     0.0002s    0.0008s
  Gen  1        12 colls,     0 par    1.017s   1.017s     0.0847s    0.2659s

  TASKS: 4 (1 bound, 3 peak workers (3 total), using -N1)

  SPARKS: 0 (0 converted, 0 overflowed, 0 dud, 0 GC'd, 0 fizzled)

  INIT    time    0.000s  (  0.000s elapsed)
  MUT     time    4.750s  (  4.797s elapsed)
  GC      time    2.934s  (  2.937s elapsed)
  EXIT    time    0.001s  (  0.006s elapsed)
  Total   time    7.685s  (  7.740s elapsed)

  Alloc rate    2,345,705,836 bytes per MUT second

  Productivity  61.8% of total user, 62.0% of total elapsed

ghc-9.0.1, --nonmoving-gc

  11,142,964,104 bytes allocated in the heap
   3,006,568,272 bytes copied during GC
     994,390,544 bytes maximum residency (38 sample(s))
18,446,744,073,708,941,944 bytes maximum slop
            1680 MiB total memory in use (4 MB lost due to fragmentation)

                                     Tot time (elapsed)  Avg pause  Max pause
  Gen  0     10540 colls,     0 par    2.944s   2.948s     0.0003s    0.0015s
  Gen  1        38 colls,     0 par    0.026s   0.026s     0.0007s    0.0019s
  Gen  1        38 syncs,                       0.022s     0.0006s    0.0083s
  Gen  1      concurrent,              3.136s   6.654s     0.1751s    1.2637s

  TASKS: 41 (38 bound, 3 peak workers (3 total), using -N1)

  SPARKS: 0 (0 converted, 0 overflowed, 0 dud, 0 GC'd, 0 fizzled)

  INIT    time    0.001s  (  0.000s elapsed)
  MUT     time    5.362s  (  5.409s elapsed)
  GC      time    2.971s  (  2.974s elapsed)
  CONC GC time    3.136s  (  6.654s elapsed)
  EXIT    time    0.000s  (  0.007s elapsed)
  Total   time   11.470s  (  8.390s elapsed)

  Alloc rate    2,077,983,578 bytes per MUT second

  Productivity  74.1% of total user, 64.5% of total elapsed

ghc-9.2.1, --copying-gc

  11,142,998,064 bytes allocated in the heap
   4,018,312,024 bytes copied during GC
     539,641,000 bytes maximum residency (11 sample(s))
       2,390,872 bytes maximum slop
            1197 MiB total memory in use (0 MB lost due to fragmentation)

                                     Tot time (elapsed)  Avg pause  Max pause
  Gen  0      2635 colls,     0 par    1.877s   1.878s     0.0007s    0.0032s
  Gen  1        11 colls,     0 par    0.988s   0.988s     0.0898s    0.3087s

  TASKS: 4 (1 bound, 3 peak workers (3 total), using -N1)

  SPARKS: 0 (0 converted, 0 overflowed, 0 dud, 0 GC'd, 0 fizzled)

  INIT    time    0.000s  (  0.000s elapsed)
  MUT     time    4.729s  (  4.779s elapsed)
  GC      time    2.864s  (  2.866s elapsed)
  EXIT    time    0.000s  (  0.005s elapsed)
  Total   time    7.594s  (  7.650s elapsed)

  Alloc rate    2,356,120,004 bytes per MUT second

  Productivity  62.3% of total user, 62.5% of total elapsed

ghc-9.2.1, --nonmoving-gc

  11,142,996,392 bytes allocated in the heap
   2,935,283,184 bytes copied during GC
   1,059,977,008 bytes maximum residency (29 sample(s))
18,446,744,073,708,492,376 bytes maximum slop
            1802 MiB total memory in use (342 MB lost due to fragmentation)

                                     Tot time (elapsed)  Avg pause  Max pause
  Gen  0      2617 colls,     0 par    3.008s   3.010s     0.0011s    0.0056s
  Gen  1        29 colls,     0 par    0.072s   0.072s     0.0025s    0.0097s
  Gen  1        29 syncs,                       0.099s     0.0034s    0.0536s
  Gen  1      concurrent,              2.818s   6.257s     0.2158s    1.1607s

  TASKS: 33 (30 bound, 3 peak workers (3 total), using -N1)

  SPARKS: 0 (0 converted, 0 overflowed, 0 dud, 0 GC'd, 0 fizzled)

  INIT    time    0.000s  (  0.000s elapsed)
  MUT     time    5.300s  (  5.350s elapsed)
  GC      time    3.081s  (  3.082s elapsed)
  CONC GC time    2.818s  (  6.257s elapsed)
  EXIT    time    0.001s  (  0.008s elapsed)
  Total   time   11.201s  (  8.440s elapsed)

  Alloc rate    2,102,289,662 bytes per MUT second

  Productivity  72.5% of total user, 63.4% of total elapsed