As it happens, binary sizes “slowly but steadily going out of hand” was discussed by GHC maintainers a few days ago:
I have also been concerned about the disk space used by the Haskell toolchain. Particularly when used in places like CI where docs are 100% dead weight. Every second spent downloading them and copying them around during installation is just wasted over and over again.
I personally never use local docs and only do static builds – thus the disk use would actually be 400 MB, if this could be configured upfront.
Totally agree with this complaint! I run into this with nix, where I think I’d like to write a program in Haskell, so I create a new flake for that project, which ends up pinning a different version of nixpkgs than I’ve used on that particular machine. This then means that getting started with the project involves something in the neighborhood of 10 GiB of disk space for all the tooling. One can avoid paying this for every project on a particular computer by relying on a copy of the dependencies shared with the wider system or other projects, but I still see it from time to time and wish it weren’t so.
Your comparison with openjdk is apt, and it might be worth investigating a bit to see where the main differences are (e.g. while GHC’s approach to inlining is certainly a factor, are there other differences? Fewer packages? No profiling libraries by default? Docs?).
It is hard to motivate working on this, as bandwidth and disk space are more easily available than developer time, but I think you’re right that there are surely would-be Haskell users who balk at the tooling resource requirements and end up choosing a different language for that reason. I know this because I will avoid writing a small program in Haskell as part of a larger project if it would mean adding Haskell tooling to that project’s dev environment, in large part because it has such a large effect on environment provisioning and CI.
Well, I just deleted nine ~/.ghcup/ghc/*/share directories and it freed up 5.8 GB. Fingers crossed that doesn’t break anything!
One problem is that we don’t know exactly what is taking up space, both in aggregate (documentation/profiling/standard builds) and at finer granularity (per package, per object file). It’s basically impossible to optimize anything if you can’t measure it and find the bottlenecks.
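As a sketch of the kind of measurement that would help, here is a rough, unofficial script that totals file sizes per library variant under a GHC lib directory (the path below is an assumption; adjust it for your install):

module SizeByExt where

import Control.Monad (forM)
import Data.List (isSuffixOf, sortOn)
import qualified Data.Map.Strict as M
import System.Directory (doesDirectoryExist, getFileSize, listDirectory)
import System.FilePath ((</>), takeExtension)

-- Recursively list all files under a directory.
walk :: FilePath -> IO [FilePath]
walk dir = do
  entries <- listDirectory dir
  fmap concat . forM entries $ \e -> do
    let p = dir </> e
    isDir <- doesDirectoryExist p
    if isDir then walk p else pure [p]

-- Classify a file by the suffixes GHC uses for library variants.
classify :: FilePath -> String
classify f
  | "_p.a" `isSuffixOf` f  = "_p.a (static + profiling)"
  | ".p_hi" `isSuffixOf` f = ".p_hi (profiling interface)"
  | otherwise              = takeExtension f

-- Print total bytes per variant, largest first.
main :: IO ()
main = do
  files <- walk "/usr/lib/ghc-9.12.1/lib"  -- assumed path, adjust as needed
  sized <- mapM (\f -> (,) (classify f) <$> getFileSize f) files
  mapM_ print (sortOn (negate . snd) (M.toList (M.fromListWith (+) sized)))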
The second problem: yes, Haskell uses inlining, and it is necessary, but I suspect it’s used too much. We like to write general code, but in order to get runtime performance it must get specialized, and the only easy and reliable way is to slap INLINE on a function. It’s not possible to ask GHC to generate specializations without inlining.
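To illustrate the pattern being described (a hedged sketch; the function here is made up, not taken from any particular library):

module Mean where

-- Generic code: without specialization, this goes through the
-- Fractional dictionary at runtime.
mean :: Fractional a => [a] -> a
mean xs = sum xs / fromIntegral (length xs)

-- The easy, reliable fix described above: mark it INLINE so every call
-- site gets its own specialized copy, which is also part of why
-- optimized libraries carry so much extra code.
{-# INLINE mean #-}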
I agree with @acowley, I would use Haskell more often if GHC were lighter. For comparison, here are a few Fedora package sizes:
$ podman run -it --rm fedora:42
# dnf install rust
After this operation, 523 MiB extra will be used
# dnf install nodejs
After this operation, 205 MiB extra will be used
# dnf install go
After this operation, 523 MiB extra will be used
# dnf install ghc
After this operation, 986 MiB extra will be used
If gcc is already installed, then GHC only needs 810 MiB (83 MiB to download). This is without docs or profiling libraries (the Fedora default), so even without those, adding GHC to a new project is still relatively costly.
316 MiB of that is the ghc-ghc-devel package and, at the risk of making things up, is that only useful if you’re using GHC as a library? I imagine that’s a pretty niche use case (unless, uh, does Template Haskell require this?), and if all that speculation is true, making that package an optional dependency could be a relatively easy win, bringing GHC in line with the numbers from Rust and Go.
Edit: In support of the above speculation, I force-removed ghc-ghc-devel from a Fedora container and successfully compiled a small Haskell program, including TH. GHCi works too.
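For reference, a minimal Template Haskell smoke test of the kind used here might look something like this (an illustrative sketch, not the exact program that was compiled):

{-# LANGUAGE TemplateHaskell #-}
module Main where

import Language.Haskell.TH  -- ships with GHC as the template-haskell package

-- A splice evaluated at compile time; if TH pulled in lib:GHC, building
-- this after removing ghc-ghc-devel would fail.
answer :: Int
answer = $(litE (integerL 42))

main :: IO ()
main = print answer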
Good catch! I haven’t looked too much into it; there is also Cabal-*-devel, which looks quite heavy. I guess we could have a ghc-minimal package, but the optional libs probably need to be defined upstream, e.g. on that page: Haskell Hierarchical Libraries
Except for ghc-base-devel (and I seem to recall seeing things in the CLC repo that make me think some folks are working towards even that), I think every ghc-*-devel package is technically optional, so I don’t know what good making that list would do.
This might be derailing into a conversation that should be had on a Fedora platform instead, but instead of a ghc-minimal, would it make sense to change all of the not-base ghc-*-devel packages to weak dependencies? That wouldn’t change the default user experience (and thus sidesteps any debates about whether array is more critical to provide to users by default than xhtml), but would let size-sensitive users exclude them, right?
This is something that I find confusing.
I have -O3 optimisation turned on, and it produces a lot of assembly code. I remove the optimisation entirely, and the assembly becomes a lot shorter.
Am I missing something here?
Debian 12
Total: 3.45 GB for 15,275 files
breakdown
- bin: 96.4 MB for 26 files
  Note: I’m going to bundle files like ghc with ghc-9.12.1 for simplicity.
  - ghc, the compiler: 2 MB
  - ghc-iserv, the (external) i(nterpreter) serv(ice): 29 MB
  - ghc-iserv-prof: 62 MB
  - ghc-pkg, the package cache manager: 0.5 MB
  - ghc-toolchain-bin, the backend picker: 0.2 MB
  - haddock, the html doc generator: 0.02 MB
  - hp2ps, heap profile to PostScript: 0.04 MB
  - hpc, code coverage tool: 0.7 MB
  - hsc2hs, hsc pre-processor: 0.7 MB
  - runghc, runs Haskell scripts via ghc: 0.06 MB
  - runhaskell, alias for runghc (interprets e.g. Setup.hs): 0 MB
  - unlit, literate haskell transformer (lhs to hs): 0.01 MB
- completion, a thing: 0.03 MB
- doc: 800 MB for 6,048 files
- docs-utils, a thing?: 0.02 MB
- include, C header files, presumably no longer needed: 0.5 MB for 89 files
- lib, packages needed for bin (that ghc also lets you import): 2.57 GB for 9,015 files
  - html, doc stuff: 0.5 MB for 26 files
  - latex, doc stuff: 0.02 MB for 1 file
  Note: here I see only .conf files.
  - package.conf.d: 0.4 MB for 48 files
  - x86_64-linux-ghc-9.12.1-043d: 2.57 GB
    - .hi files, for static linking
    - .dyn_hi files, for dynamic linking
    - .p_hi files, for static linking + profiling
    - .p_dyn_hi files, for dynamic linking + profiling
    - .a libs, for static linking
    - _p.a libs, for static linking + profiling
    - .so libs, for dynamic linking
    - _p.so libs, for dynamic linking + profiling
  - *some files: 0.05 MB
- manpage: 0.1 MB
- mk, hadrian leftovers?: 0.08 MB
- share, haddock main page and… package licenses?: 0.2 MB
- wrappers: 0 MB
*some scripts: 0.5 MB
Notes: bin looks dynamically linked, which explains why bin is so small.
Same total size as Windows, but doesn’t come with a backend?
Docs are 39.5% of files. Docs are 23% of size.
Packages are 59% of files. Packages are 74% of size.
About package sizes:
(Bundling hi, dyn_hi, p_hi, p_dyn_hi as .*hi)
ghc.so: 116 MB
ghc_p.so: 229 MB
ghc.*hi: 128 MB
ghc.a: 294 MB
ghc_p.a: 588 MB
Total: 1355 MB (1.3 GB)
Note: this is half of lib size.
Cabal.so: 15.7 MB
Cabal_p.so: 31 MB
Cabal.*hi: 20 MB
Cabal.a: 40.2 MB
Cabal_p.a: 81.7 MB
Total: 188 MB
bytestring.so: 1.5 MB
bytestring_p.so: 3 MB
bytestring.*hi: 4 MB
bytestring.a: 3.9 MB
bytestring_p.a: 8 MB
Total: 20.4 MB
Note: biggest blobs (1.9 GB) are
ghc (1.3 GB), Cabal-syntax (211 MB), ghc-internal (195 MB), Cabal (188 MB)
This is 74% of package size (4 packages out of 44).
Notes: each of the *hi files has the same size (so for ghc, only hi files are (128 MB / 4) = 32 MB)
Profiling roughly doubles the size.
Windows
Total: 3.43 GB for 17,469 files
breakdown
- bin: 582 MB for 28 files
  - ghc: 139 MB
  - ghci: 9 MB
  - ghc-iserv: 32 MB
  - ghc-iserv-prof: 64.6 MB
  - ghc-pkg: 41.5 MB
  - ghc-toolchain-bin: 31.4 MB
  - haddock: 142 MB
  - hp2ps: 9.3 MB
  - hpc: 31.4 MB
  - hsc2hs: 30.5 MB
  - runghc: 28.3 MB
  - runhaskell: 18.5 MB
  - unlit: 0.1 MB
- doc: 816 MB for 6,144 files
- lib, packages needed for bin (that ghc also lets you import): 1.27 GB for 4,807 files
  - doc: 0.6 MB
  - html: 0.5 MB
  - latex: 0.02 MB
  Note: here there are .conf and .copy files
  - package.conf.d: 0.5 MB for 91 files
  Note: just static files
  - x86_64-windows-ghc-9.12.1-9da5: 1.27 GB
    - .hi files, for static linking
    - .p_hi files, for static linking + profiling
    - .a libs, for static linking
    - _p.a libs, for static linking + profiling
  - *some files: 0.05 MB
- man: 0.1 MB
- mingw, the backend: 812 MB for 6,489 files
Notes:
Bin is 16% of size.
Mingw is 23% of size.
Docs are 35% of files. Docs are 23% of size.
Packages are 28% of files. Packages are 37% of size.
ghc: 703 MB
Cabal-syntax: 111 MB
ghc-internal: 105 MB
Cabal: 98.7 MB
Note:
windows → ghc.hi = 0.3 MB
linux → ghc.hi = 0.3 MB
windows → ghc.a = 196 MB
linux → ghc.a = 294 MB
windows → ghc_p.a = 442 MB
linux → ghc_p.a = 588 MB
So not only are there more Linux libs, the Linux libs are also fatter.
Note: if you trim the fat you can fit a backend in there.
For comparison:
The Zig Compiler
Windows
Total: 304 MB for 15,660 files
How do they do it?
- No fancy docs like Haddock, just one simple html file. It’s the lang (std) reference; it doesn’t have the docs of something like xhtml.
- Clang is stripped of extras before being statically merged with the Zig binary.
- The std is source (.zig). You compile and cache what you will use. This means the first hello world takes some (2-3) minutes as Zig sets things up. It also means debug artifacts (symbols or a .pdb file) are only generated if you want them, so there is no need for *p files. The same goes if you want .lib, .a, .so files, or everything bundled.
- Clang needs mingw, glibc, musl for the target? The .h and .c source files are included, so they are compiled and cached.
Note: the compiling and caching approach generated a cache folder with 85 MB and 2,056 files for a “zig c++ helloworld.cpp” that yielded a 0.8 MB binary and 5.5 MB of debug symbols.
So yeah, docs and packages.
Indeed, TH doesn’t need lib:GHC, because it’s implemented through some clever dependency inversion: the compiler depends on the TH library and implements the interfaces defined in it.
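As a rough sketch of that inversion (a toy example loosely modeled on template-haskell’s Quote/Quasi classes, not GHC’s actual code): the library defines the interface, splices are written against it, and the compiler supplies the implementation.

module Inversion where

-- Defined in the library: an abstract interface for "things a splice
-- may ask the compiler to do".
class Monad m => Quote m where
  freshName :: String -> m String

-- User code (and the TH library itself) only ever depends on the class.
makeName :: Quote m => m String
makeName = freshName "x"

-- The compiler provides the instance for its own monad; here IO stands
-- in for GHC's internal monad.
instance Quote IO where
  freshName base = pure (base ++ "_0")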
In the unoptimized version it just calls library functions like (+)
with boxed arguments, while in the optimized version it adds additional code to unbox the integers. Your function is so small that the unboxing takes about the same number of instructions as the actual work, so the increase in code size is significant.
If you use unboxed ints directly you will see almost no difference between the optimized and unoptimized code:
{-# LANGUAGE MagicHash #-}
module Example where
import GHC.Exts
data IntList = Nil | Cons Int# IntList
sumOverArray :: IntList -> Int#
sumOverArray Nil = 0#
sumOverArray (Cons x xs) = x +# sumOverArray xs
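For contrast, the boxed version under discussion presumably looks something like this (reconstructed for illustration; not the original poster’s code):

module ExampleBoxed where

-- With -O0 this just calls (+) on boxed Ints; with -O2 GHC adds
-- unboxing/reboxing code around the recursion, which is where the
-- extra assembly comes from.
sumOverList :: [Int] -> Int
sumOverList []       = 0
sumOverList (x : xs) = x + sumOverList xs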
(Also, note that -O2 is GHC’s highest optimization level, -O3 does nothing more)
Interesting. So package distributions could slim this down a lot by shipping the four sorts of libraries separately (static vs dynamic, profiling vs non-profiling).
Learn something new every day.
I didn’t even realize that the Ints were boxed!
But that makes sense. Thanks.
The Docker image for Haskell (even with haskell:slim instead of haskell:latest) is above 3 GiB in size too, which to me is a little too large as a base image to use while developing applications (e.g. with VS Code Dev Containers), trying different versions of the toolchain, or using in CI.
Docker images are stored and transferred compressed.
The compressed size of haskell:slim is ~600 MB, that of haskell:latest around ~650 MB.
Here is a command to check:
docker manifest inspect -v haskell:slim | jq '.[] | select(.Descriptor.platform.architecture == "amd64") | .OCIManifest.layers | map(.size) | map (./1000000) | add'
Wow, thanks for this!! I’m learning new stuff about tools I use frequently.
This thread is mixing up different things. One is the download size, the other is the installed size.
The download size of GHC 9.12.1 is ~290 MB on Linux. In CI, that takes mere seconds to fetch.
Installing files makes some difference:
make install: 124s
make install_bin install_lib install_extra update_package_db: 74s
(this is from an extracted bindist)
I think the download size is mostly not an issue. Installation time is a real issue on Windows (in ghcup, that is), but not that bad on Unix. The main issue is the installed size on end users’ systems, IMO.
There are 3 approaches to alleviate that:
- garbage collection (already exists in ghcup)
- more options during installation (partly already exists as demonstrated… but is not exposed through ghcup)
- splitting the bindists (similar to what hvr’s PPA did)
I’m reluctant regarding 2, because I still have low confidence that GHC would maintain a stable interface to the Makefile and it will very likely lead to downstream patching.
Point 3 would be massive work for GHCup and complicate the interface.
I think making 1 more granular and powerful is an ok approach.
References: