Garbage Collection

When a program starts, the mutator allocates memory for new small objects in the nursery. The nursery serves as a buffer that holds recently created objects. If these objects remain alive through a garbage collection cycle, they may be aged and promoted — to Generation 0 for small objects or Generation 1 for large objects. So the nursery, Gen 0, and Gen 1 form a pipeline and interact accordingly. I want to explain my understanding as follows:

  1. Nursery:

    • Newly allocated objects start in the nursery, designed for short-lived objects.
  2. Garbage Collection Cycle:

    • The garbage collector examines nursery objects during a cycle.
    • Surviving objects in the nursery may age within the same generation (e.g., Gen 0).
  3. Generational Promotion:

    • Objects surviving multiple cycles may be promoted to the next older generation (e.g., from Gen 0 to Gen 1).
  4. Interaction and Pipeline:

    • Objects move through a pipeline-like structure: nursery to Gen 0, and potentially to Gen 1.
    • The nursery buffers short-lived objects, Gen 0 represents the initial stage for object promotion, and Gen 1 (and further generations) accommodates objects with longer lifetimes.
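To make the pipeline concrete, here is a toy model of what I mean by aging and promotion. This is only an illustration of my mental model, nothing like GHC's real block-structured heap, and `Gen`, `promote`, and `gcCycle` are made-up names:

```haskell
data Gen = Nursery | Gen0 | Gen1
  deriving (Show, Eq, Ord, Enum, Bounded)

-- Promotion moves one step towards the oldest generation, which is final.
promote :: Gen -> Gen
promote Gen1 = Gen1
promote g    = succ g

-- Each object is its current generation plus its remaining lifetime,
-- measured in GC cycles.
type Object = (Gen, Int)

-- One collection cycle: dead objects vanish, survivors age and move up.
gcCycle :: [Object] -> [Object]
gcCycle objs = [ (promote g, life - 1) | (g, life) <- objs, life > 0 ]
```

Running two cycles on an object born in the nursery with lifetime 3 takes it through Gen 0 into Gen 1, while an object with lifetime 0 disappears at the first cycle.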

Is my understanding correct?

I would appreciate some helpful explanations.

This sounds broadly right to me, though there are lots of details. Did you have any specific questions? There’s a bit of commentary on the GHC wiki if you’d like to read more:

One small point, you wrote:

You mention small and large objects here, but I don’t think the size is relevant in the way you described. The GC does have a concept of “large objects” which get special treatment by the copying collector (basically it gives each one a block to itself and keeps track of the generation to which that block belongs, rather than copying the object). But AFAIK large objects can exist in generation 0 and eventually age into older generations, just like small objects.

By “small object” or “large object” I actually meant the duration of their survival time, not their size in memory.

"The question I want to explore is related to list fusion, which occurs when we compose functions into a pipeline and intermediate lists are generated during list processing. For example, consider the expression double (triple [1,2,3]). The double function will promptly consume the result of the triple function, as demonstrated by the expression 6 : double (triple [2,3]), correct?

The specific inquiry is where these intermediate lists are located in memory and whether GHC’s garbage collector (GC) immediately performs the collection action on these intermediate lists."
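For concreteness, assuming `double` and `triple` are elementwise maps (the question doesn't define them, so these definitions are my guess), the example can be written out and its first result forced:

```haskell
-- Assumed definitions for the double/triple in the question:
-- elementwise doubling and tripling via map.
double, triple :: [Int] -> [Int]
double = map (*2)
triple = map (*3)

-- The head of the composed result is 6 (1*3*2).  Only the first cons
-- cell of triple's intermediate output needs to exist to produce it.
firstResult :: Int
firstResult = head (double (triple [1,2,3]))
```

Evaluating the whole expression gives [6,12,18]; forcing just the head builds only the first cell of the intermediate list.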

Let’s consider a scenario where we have a list with 1,000,000 elements and we apply the double and triple functions to it. Initially, each list-processing task (mutator thread) will allocate blocks of memory in the nursery. When the garbage collector runs, the intermediate lists will remain in the nursery until the entire list processing is completed, provided the collector doesn’t examine the nursery during that ongoing process. Otherwise, the live parts of the lists are aged, or promoted (copied by the collector), to Generation 0. If those objects are still alive after subsequent collection cycles in Generation 0, they may then be promoted to Generation 1.

Do you mean a list like [1,2 .. 1000000] ? Thanks to laziness and list fusion, almost never will GC be trying to manage a million allocated values – especially if you’re only applying dumb functions like doubling or tripling.

If you have a million data items you’re wanting to update randomly, almost certainly a List is the wrong data structure. (I’d argue Haskell tutorials cover Lists far too much and more realistic structures not enough.)
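For instance, with Data.IntMap from the containers package (which ships with GHC), a random update touches only a logarithmic number of nodes, instead of the O(i) walk needed to reach index i in a linked list. `bump` and `table` are illustrative names of my own:

```haskell
import qualified Data.IntMap.Strict as IM

-- Increment one entry; only the path to that key is rebuilt.
bump :: IM.Key -> IM.IntMap Int -> IM.IntMap Int
bump k = IM.adjust (+ 1) k

-- A million-entry table; built once, then updated randomly.
table :: IM.IntMap Int
table = IM.fromList [ (k, 0) | k <- [1 .. 1000000] ]
```

Data.Vector (with mutable updates in ST/IO) is another common choice when the keys are dense indices.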

The specific inquiry is where these intermediate lists are located …

List fusion (the context you’re asking about) means the intermediate lists are never allocated at all, so it makes no sense to ask where they live.
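The shape GHC’s foldr/build fusion rules look for can be sketched directly: a “good producer” builds its output with `build` and a “good consumer” takes its input apart with `foldr`, so when they meet, the list between them is rewritten away. `mapVia` below is an illustrative name, not GHC’s actual definition of map:

```haskell
import GHC.Exts (build)

-- A map written as a good producer (build) over a good consumer (foldr).
-- When two of these are composed, GHC's foldr/build rewrite rule can
-- eliminate the intermediate list entirely.
mapVia :: (a -> b) -> [a] -> [b]
mapVia f xs = build (\cons nil -> foldr (cons . f) nil xs)
```

Without fusion this computes the same results as ordinary map; the rewrite rules only change what gets allocated, not the answer.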


What about for map (*2) (map (* 3) [1..1000000]) ?

If you compile with optimizations, GHC will turn that into a loop like this:

x = go 1 where
  go n | n <= 1000000 = (n * 3) * 2 : go (n + 1)
       | otherwise = []
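As a sanity check (not how GHC verifies its rewrite rules), the fused loop can be compared directly against the original two-map expression — they produce the same list:

```haskell
-- The fused loop, written out as a named definition.
fused :: [Int]
fused = go 1
  where
    go n | n <= 1000000 = (n * 3) * 2 : go (n + 1)
         | otherwise    = []

-- The original expression, with the intermediate list left in.
unfused :: [Int]
unfused = map (*2) (map (*3) [1 .. 1000000])
```

The difference is only in allocation: `fused` never builds the list between the two maps.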

Lists are not singular objects; each cons cell is a separate heap object.

map only needs the first element of the input list to return the first element of its output. Once you force the head of the resulting list you will get 6:_ and the rest of the function is effectively map (*2) (map (*3) [2..1000000]). The garbage collector is then free to recycle any intermediate garbage produced while creating that result, as well as the 1:_ of the original list, and once you no longer need the 6 the head of the resulting list is free to be recycled as well.
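This incremental behaviour is easy to observe: because only a prefix of the output is demanded, only the corresponding prefix of the input is ever forced — even an infinite input is fine, and everything behind the consumed prefix is garbage the collector may reclaim:

```haskell
-- Demanding three output elements forces only three input cells,
-- so the infinite enumeration [1 ..] poses no problem.
prefix :: [Int]
prefix = take 3 (map (*2) (map (*3) [1 ..]))
```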

List fusion is merely a way to remove some of the intermediate garbage, in this case the list between the two maps, but it does not change the garbage collection reasoning.


Well it does mean the garbage collector will run less often and maybe not at all.

Everything allocated goes into the nursery. If it sticks around long enough, it will be collected, but “long enough” has to do with the total allocations, total runtime and overall memory size of your program. In general you should never expect GC to run or not at any particular time unless you make an explicit IO call requesting a GC.
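That explicit call is `performGC` from System.Mem in base (`performMajorGC` and `performMinorGC` also exist). A minimal sketch, with `forceThenGC` as an illustrative name of my own:

```haskell
import System.Mem (performGC)

-- Force a structure, then explicitly request a collection right now.
forceThenGC :: [Int] -> IO Int
forceThenGC xs = do
  let s = sum xs        -- forces the whole list
  s `seq` performGC     -- the one way to *know* a GC happens here
  return s
```

Outside of benchmarks and a few latency-sensitive tricks, you should rarely need this; it exists precisely because GC timing is otherwise unobservable from the program.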

The “right” way to program is to imagine that everything is allocated when “forced” (i.e. only when necessary) and that it “may as well be deleted” the moment there is no longer a further reference to it, even though an actual GC may only occur much later – the purpose of the GC system is so you do not have to worry about these things.

Only in very high memory usage / high performance situations have I found reasoning about GC necessary (beyond diagnosing leaks introduced that are simply bugs). And in such circumstances I have found that experimenting with GC settings is much more fruitful than trying to reason from first principles.

In particular, outside of long-running server processes, most programs will likely only have one or two GC cycles at most over their entire runtime.
