Look at the Debian package for ghc: https://packages.debian.org/trixie/ghc, at the bottom of that page:
Package size: 75MB
Installed size: 700MB
(what kind of binaries compresses so well…?)
155MB of ghc sources compiles into 700MB of binaries… what??
For comparison, look at Java. If I want to install openjdk-24-jdk, it’ll download 137MB of packages and the total installed size will be 320MB.
Installing ghc using ghcup requires 7.5GB of free space (it makes 3 copies during installation…) and ends up with an installed size of 2.5GB. (This also includes the profiling libraries; on Debian you have to install them separately, package ghc-prof: 600MB installed.)
Installing HLS using ghcup: 2.9GB installed size…
cabal is using an 850MB index file for Hackage…
I’m a Haskell beginner and a former Java developer for many years. I like Haskell very much but I don’t know why its tools are so bloated. That’s the only negative aspect I have against Haskell…
Well, nothing seems to be optimized size-wise.
Haskell is a very good general-purpose programming language. And it seems very fit for quickly implementing smallish programs… prototyping, testing ideas, stuff like this. But when you see that giant compiler and the other bloated tools… you think… maybe I should use something else, like Python…? (Well, not Python for me, as I don’t get it; it’s just weird to me.)
I don’t understand the link between the code-size of the tooling, and the capability of the resulting programs.
Ever installed Visual Studio? Not Visual Studio Code, the real deal. Adding the C++ compiler toolchain adds ~3-4GB to the installation size. That doesn’t sound too different from your observed GHC + HLS install size.
Ultimately, compared to other high-level languages like Java, GHC (and HLS and other tools) are doing a LOT of work. GHC compile times are relatively long because there’s a lot of work under the hood to produce optimal executables. It makes sense to me that these tools would also be large.
I’ve worked on small- and medium-sized Haskell codebases, and produced some insanely efficient executables taking < 5MB of memory at runtime. I didn’t even notice that the compiler and tools that allowed me to create this executable were large.
Unfortunately these cabal options are not available to downstream users of pre-packaged tools. However we ought to send distributors recommendations on the best options to produce quality packages, including how to reduce the size on the end-user’s machine.
And how much of that file am I actually using? 40%? 10%? No, it’s more like 0.1%! But I have to store the other 99.9% on my disk. 1GB of data if you include the related files. Useless.
Because we write Haskell idiomatically, at a higher level and more reusably (e.g., you write filter p . map f and whileM cond body rather than rolling your own recursion like in C), the compiler must have the option of trying inlining to discover more removable abstraction overhead.
(I presume that the alternative would be “Haskell tools could use some speedup: Why does cabal take an hour to resolve just 5 dependencies?!”)
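To make the inlining point above concrete, here is a small illustrative sketch (the function names are mine, not from any library): the idiomatic pipeline only performs well if GHC is free to inline filter and map and fuse them into something close to the loop you would otherwise write by hand.

```haskell
-- Illustrative only: the idiomatic, reusable version…
keepGood :: (a -> Bool) -> (b -> a) -> [b] -> [a]
keepGood p f = filter p . map f

-- …and roughly the single loop GHC can reduce it to after inlining and
-- fusion, i.e. what you might have written yourself in a C-like style:
keepGoodLoop :: (a -> Bool) -> (b -> a) -> [b] -> [a]
keepGoodLoop p f = go
  where
    go []     = []
    go (x:xs) = let y = f x
                in if p y then y : go xs else go xs
```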
A built library binary, call it L0, must then include a representation of its source code (basically, *.hi files are to Haskell what pre-compiled headers are to C++ (oh, you know about C++ header files containing inlinable actual code too, right? right?)), so that, for example, when the compiler is building another library L1 that depends on L0, it can try inlining L0 code at call sites in L1 to discover optimizations.
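As a hedged illustration of what ends up in those interface files (the module and function names below are invented), a library author can ask GHC to keep a function’s unfolding in the *.hi file so that a downstream library can inline it:

```haskell
-- File L0.hs: the "L0" library (illustrative names only).
module L0 (clamp01) where

-- INLINABLE asks GHC to record this function's unfolding in L0's *.hi file.
{-# INLINABLE clamp01 #-}
clamp01 :: Double -> Double
clamp01 x = max 0 (min 1 x)

-- File L1.hs: the "L1" library that depends on L0.
module L1 (normalize) where

import L0 (clamp01)

-- With optimizations on, GHC may inline clamp01 here using the unfolding
-- from L0's interface file, so a copy of L0 code ends up inside the built L1.
normalize :: [Double] -> [Double]
normalize = map clamp01
```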
It gets better/worse: Therefore some L0 code (both asm and represented source) bleeds into the L1 built binary. If one more library L2 then depends on L1, the built L2 binary can contain both some L1 code and some L0 code. This can go on transitively and quadratically.
As a corollary, the built L2 binary is tied to a particular build of L1 and of L0, not just a particular version in the normal sense. If you even rebuild L0 with slightly different flags, the existing builds of not just L1, but also L2 can become incompatible. This is why cabal computes and records build hashes to track build genealogy.
Part II
How an executable binary gets so big: because static linking (of the Haskell libraries from above) is preferred.
You can go with dynamic linking (sometimes I do), but then for every library used by the exe, you need to keep installed the very same build of that library (again, not just the same version) that the exe was built against. You can do that to yourself if you like, but wide distribution would be a nightmare.
Haskell tools are written in Haskell, and so are subject to that same consequence.
Epilogue
The above takes for granted two even deeper presumptions:
1. GHC relies on build-time optimizations rather than runtime JIT optimizations. Java binaries can avoid the quadratic accumulation and the static linking because most of the Java world goes with a JIT. A new Haskell compiler that went the JIT route could avoid the problem. Any takers?
2. Haskell tools are subject to the above consideration because they are written in Haskell. If you wrote Haskell tools in C, or even JavaScript, you could also avoid the problem. Any takers?
@treblacy Why indeed. Is this a behaviour that you can reproduce in any way? This is not supposed to happen and should definitely be reported on the ticket tracker of the Cabal project.
I understand the frustration - I’m big into degrowth and reducing my material footprint where I can, so I use a pretty old machine that gets easily overwhelmed space- and memory-wise (I can’t even run IntelliJ without it crashing, but that’s a me problem). That said, GHC and friends being large hasn’t bothered me or come up yet. I’m curious what made you look in the first place.
Saying “useless” seems harsh though; it might not make for as productive a discussion as you’d want.
To your original concern: Is the worry that you’ll run out of disk space or is it that you’re worried that this doesn’t bode well for the sorts of programs that the compiler produces?
Either way I think you’ll enjoy getting more into Haskell! Don’t let this dissuade you.
Very clear explanation, thank you for taking the time to write it. The problem is worse than I thought… the difficulty of sharing code between applications… Imagine having most of the applications on your computer written in Haskell…
A JIT compiler, yes, how nice that would be: “build once, run anywhere”… and really share the libraries.
Suppose the inlining optimization were reduced and we got a 200MB ghc compiler instead of an optimized 700MB one. I wonder how much worse the 200MB one would perform…
I have different Linux gadgets… A nice “toy” is the PinePhone with the Pine64 keyboard attached. It’s basically a mini-laptop with a SIM card. Of course I would like to compile simple programs on it. Well, it has 2 GB of RAM and 16 GB of storage… I’ve installed the Debian package of ghc on it…
Another one is the PineNote, an e-ink tablet. Very nice with a bluetooth keyboard. Has better specs: 4GB/128GB.
I have Raspberry Pis too…
Haskell has such a concise syntax, and you can easily write a few lines that do quite a lot… But if you cannot compile them…
Heavy tools hurt adoption of the language…
My ghcup-installed ghc is 1.8gb, there’s 600mb of docs, and then 3 copies of every library: shared, static, and static+profiling. Much of it is down to that 3x overhead. It’s big, but I also have a 60gb debug build of llvm here. You could make a shared-only ghc distribution with no docs, or just take a normal one and delete the *.a / *.p_a files and the docs. Or if you want to run a small Haskell-like language just to play on small devices, dig up hugs. ghc has gone for heavy but powerful.
For binary output sizes, they’re not bad if you dynamically link, e.g. a fairly complicated program at work is 1.3mb dynamic and 6.7mb static. My personal project is 32mb dynamic, but with some static C libraries. This is pretty meaningless though, because it’s more about project structure. Say we could reduce code size by 2x: 7mb becomes 3.5mb, and that’s still pretty irrelevant. This would have been a big deal in the 90s, but I noticed a 1tb drive for $30 last week.
Disks are cheap and people solve problems they have. If you do have the problem, then you are the one with the motivation to fix it! The upside is there may be easy-to-fix things in there. You could look into ghcup’s 3x usage, that could be an oversight. It looks like cabal has a 00-index.tar.gz and a 00-index.tar, and maybe it doesn’t really need the .tar variant, so you could investigate that. I don’t know anything about HLS, but Haskell is much more usable in plain vim or emacs than Java, due to ghci.
There’s some prior discussion of that idea in this issue.
I had a notion of maybe implementing a patch for cabal-install that would assemble the index incrementally as needed instead of as one big download, but I couldn’t figure out an easy way to reconcile that with Hackage’s security design, in which the index is signed to prevent MITM attacks on the supply chain. Probably a scheme like that would need Hackage to start signing individual package metadata (as exposed over the package-info-json API), and thus wouldn’t be something I could just try out on my own with a local patch, which mostly killed my interest in this idea. (Unless someone involved in operating Hackage reads this and thinks they’d be willing to help out?)
And I suppose it’s not just for an exe, but for any other library that is also dynamically linked to, let’s say, library A. And inlining optimization effectively copies some code from library A into library B or the exe, correct? Then if you make a small change in A with no change to its API at all, you’ll still have to rebuild everything too – B and the exe – so inlining can again update the old copied code. This doesn’t scale well when you are using lots of Haskell applications. Also, inlining seems to copy quite a lot of code. And you’ll also end up with a lot of duplicate code even when you are dynamically linking… And when you have many applications… sharing is not true sharing after all.
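A tiny sketch of that scenario, with invented names (this is not anyone’s real code): library A exposes a function whose body gets inlined into a dependent library B.

```haskell
-- Hypothetical library A:
module A (scale) where

{-# INLINE scale #-}
scale :: Double -> Double
scale x = x * 2      -- later changed to, say, x * 2 + 1: same API, new body

-- Hypothetical library B, built against A (even if dynamically linked):
module B (rescaleAll) where

import A (scale)

-- If GHC inlined scale here, B's compiled code carries a copy of the old
-- body (x * 2); the new body only shows up once B itself is rebuilt, even
-- though A's API never changed.
rescaleAll :: [Double] -> [Double]
rescaleAll = map scale
```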
Thanks for these details. I think it would be helpful if someone could post a breakdown of what a distribution of GHC installs and the installed size of each component.