I have a use-case where I run memory expensive jobs in their own thread and at the end I would like the GC to deallocate the whole heap that was allocated by the thread in one go, so that other processes are able to use that memory.
I have metric measuring with normal green threads and GHC. GHC’s GC is doing it’s job, after some tuning, and eventually it does deallocate e.g. 400MB in pieces over X time period (>1 hour) until reaching 5MB total memory use.
An easy but less efficient way to get immediate space reclaimed is to launch the same program in a separate process. Linux’s process launching is quite cheap. I don’t have an aversion to using simple processes (many people do have an aversion to it); the OS does a good job of it and it makes other things like nicing and stability a bit easier (think: Chrome).
But I’m curious whether anyone has taken a different route. It would be great if I could launch a thread with memory isolation, like in Erlang, with its own GC roots that were not global. But I haven’t found much about that after searching.
I know that you probably already know this, but isn’t this (effectively) spawning a new process? Perhaps it’d be a bit more ergonomic because you wouldn’t to be as explicit about exactly what references the “subthread” hangs on to from its parent, but it seems that being explicit about that might be a good thing just in terms of PoLA, etc.?
The reason this works so simply in Erlang is that, in Erlang, it is just spawning a new process with no shared state (semantics-wise), so why wouldn’t that model for Haskell? (I guess there might be some optimization around ergonomics and passing large objects to subprocesses without copying, but last time I looked at Erlang – a looong time ago – it actually just did copying.)
EDIT: Just to add. Obviously spawning a process isn’t going to be great for performance if your “work units” are small (e.g. painting a frame in a game).
I tested out compact regions on a simple AST I had and the program just hanged with 100% indefinitely, so I consider that feature alpha.
I’m using GHC.Stats’s GC statistics for my Prometheus monitoring, so the stats are about memory use by the GC, regardless of page management by Linux. If I look in top, the RSS is always more but I don’t pay attention to that due to the reasons mentioned in your link.
Good reminder, though; that might be nice to enable later for a different kind of monitoring.
Perhaps you got bitten by the fact that the main compact function can’t handle sharing and cycles? If your AST contains cycles then it will indeed loop indefinitely. In those cases you can try compactWithSharing which is slower, but does preserve sharing and cycles. If you can still reproduce it I would recommend opening an issue on GHC’s issue tracker.
Now, I’m a bit confused. GHC’s GC is precise so after a garbage collection (e.g. forced manually with performGC) every dead object should be deallocated. Are you saying that the runtime still holds on to dead memory for a while after garbage collection for some reason? What data are you using exactly, gcdetails_live_bytes?
My AST doesn’t contain cycles, I know this because I wrote it, but also because I’m serializing the same AST with Aeson trivially. However, the docs states that sharing is simply copied and is only non-terminating in the presence of a cycle.
If the structure contains any internal sharing, the shared data will be duplicated during the compaction process. This will not terminate if the structure contains cycles (use compactWithSharing instead).
But the module is new and experimental. Maybe the docs are wrong. Anyway, that’s fine, I don’t need the feature so I’m not motivated to reduce it down to a test case, which would take a long time.
Yes, I think that it does. I had the same question, but after inspecting the flags more, I believe it’s the -Fd factor flag which causes this behavior:
The inverse rate at which unused memory is returned to the OS when it is no longer needed. After a large amount of allocation the RTS will start by retaining a lot of allocated blocks in case it will need them again shortly but then it will gradually release them based on the [-Fd ⟨factor⟩](5.7. Running a compiled program — Glasgow Haskell Compiler 9.2.1 User's Guide %E2%9F%A8factor%E2%9F%A9). On each subsequent major collection which is not caused by a heap overflow a little more memory will attempt to be returned until the amount retained is similar to the amount of live bytes.
Increasing this factor will make the rate memory is returned slower, decreasing it will make memory be returned more eagerly. Setting it to 0 will disable the memory return (which will emulate the behaviour in releases prior to 9.2).
The live bytes are indeed much smaller, but the GC won’t let go (which is where freeing of pages would then come in, yet another layer of retention) of the allocated blocks for a while. This is actually a good heuristic to apply generally, but for one specific task it’s less helpful.
A book about GHC’s garage collector implementation would be helpful, so I wouldn’t have to make so inferences based on the limited docs, or read the source code, or make experiments as if it’s some kind of natural phenomenon.
See also here from Well Typed’s article about memory fragmentation:
Finally, the garbage collector may ask the block allocator to free some megablocks. This is ultimately how the RTS returns memory to the OS. To add to the confusion, regardless of the --disable-delayed-os-memory-return flag, the garbage collector estimates how many megablocks are needed for the next garbage collection and may avoid freeing megablocks even if more could be freed. This is done to avoid the cost of returning/reallocating memory from the OS.