I have just posted an article on the GHC blog describing the current state of play regarding GHC’s support for Apple M1 hardware and ARM more generally.
Feel free to discuss here.
Thanks for the update. I learned a couple new things - surprised that LLVM is that slow, I was under the impression the future is all LLVM (in general, not for GHC specifically).
This is a rather thorny issue, so let’s try to unpack it without putting too much blame on anything in particular.
GHC’s LLVM pipeline looks something like this:
Frontend
-> Cmm
-> LlvmCodeGen
-> LLVM Textual Intermediate Representation [.ll]
-> LLVM's Optimiser (opt) [.ll -> .ll]
-> LLVM's Compiler (llc) [.ll -> .s]
-> GHC's LLVM Mangler [.s -> .s]
-> Assembler [.s -> .o]
Compared to the NCG:
Frontend
-> Cmm
-> Native Code Gen [.s]
-> Assembler [.s -> .o]
As can be seen from this, the NCG is a much shorter pipeline and effectively shells out to only one program, the assembler, whereas the LLVM pipeline shells out to the optimiser, the compiler, the mangler and the assembler. (After that there may be a linker phase if we actually want to link the program.)
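To make the comparison concrete, here is a minimal Python sketch that models both pipelines as lists of stages, using the stage names from the diagrams above. The `Stage` type, the format labels and the helper names are made up for illustration; this is not how GHC's driver is implemented.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Stage:
    name: str
    consumes: str  # format the stage reads
    produces: str  # format the stage writes


# Stages after the shared Frontend -> Cmm part, as listed in the post.
LLVM_PIPELINE = [
    Stage("LlvmCodeGen", "cmm", "llvm-ir"),
    Stage("opt", "llvm-ir", "llvm-ir"),
    Stage("llc", "llvm-ir", "asm"),
    Stage("mangler", "asm", "asm"),
    Stage("assembler", "asm", "obj"),
]

NCG_PIPELINE = [
    Stage("Native Code Gen", "cmm", "asm"),
    Stage("assembler", "asm", "obj"),
]


def is_well_formed(pipeline):
    """Check that each stage consumes what the previous one produced."""
    return all(a.produces == b.consumes for a, b in zip(pipeline, pipeline[1:]))


def handoffs(pipeline):
    """Number of intermediate results passed from one stage to the next."""
    return len(pipeline) - 1


for label, pipeline in [("LLVM", LLVM_PIPELINE), ("NCG", NCG_PIPELINE)]:
    assert is_well_formed(pipeline)
    print(f"{label}: {len(pipeline)} stages, {handoffs(pipeline)} handoffs")
```

Every handoff in the LLVM pipeline is an intermediate result that has to be written out and read back in by the next stage, which is where much of the extra cost comes from.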
This also means the NCG produces much more straightforward assembly than the LLVM pipeline can, since a lot more effort has gone into LLVM’s optimiser. In practice, however, you’ll find that the produced code is often pretty similar. My hypothesis here is that (a) we start at Cmm, which is already pretty close to what we expect the final result to be, and (b) LLVM is primarily a compiler backend for imperative or non-lazy languages.
We could improve LLVM performance a bit by collapsing the last three phases (compiler + mangler + assembler) into one step. The LLVM compiler can produce object code directly; however, as it stands right now, we can’t make it produce the right object code (which is why we need the mangler to fix up the assembly LLVM generates).
Finally, I believe we could improve the whole LLVM pipeline quite substantially (though this is not backed up by data, just a hunch) by producing better LLVM IR. What we produce in the LLVM backend is functional, but also very verbose. My pet theory here is that if we produced better LLVM IR, we might see faster compile and optimisation times as well. If you look at what GHC produces for LLVM (compile with -keep-llvm-files and look at the produced .ll file), it is far from easy to read and contains copious amounts of bitcasts and whatnot. LLVM’s optimiser is probably able to clean this all up, yet I believe this also costs a non-negligible amount of time and resources. If we cut the .ll file in half, that will certainly save parsing time, however small that might be. The parsed AST would be smaller and have fewer nodes, and the optimiser might need a few fewer inlining passes.
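One rough way to put a number on that verbosity is to count how many value-producing instructions in a kept IR file are bitcasts. The Python sketch below does this for a string of textual LLVM IR. The sample IR is hand-written for illustration in the older typed-pointer syntax (it is not real GHC output), and the function name `bitcast_share` is hypothetical.

```python
import re

# Matches an instruction of the form `%name = bitcast ...`.
BITCAST = re.compile(r"^\s*%[\w.]+\s*=\s*bitcast\b")
# Matches any instruction that assigns a result: `%name = <opcode> ...`.
ASSIGNMENT = re.compile(r"^\s*%[\w.]+\s*=\s*\w+")


def bitcast_share(ir_text):
    """Return (bitcasts, assignments), counted line by line in textual LLVM IR."""
    bitcasts = assignments = 0
    for line in ir_text.splitlines():
        if ASSIGNMENT.match(line):
            assignments += 1
            if BITCAST.match(line):
                bitcasts += 1
    return bitcasts, assignments


# Hand-written sample, NOT actual GHC output.
SAMPLE_IR = """\
define void @f(i64* %sp) {
  %p = bitcast i64* %sp to i8*
  %q = getelementptr i8, i8* %p, i64 8
  %r = bitcast i8* %q to i64*
  %v = load i64, i64* %r
  ret void
}
"""

casts, total = bitcast_share(SAMPLE_IR)
print(f"{casts} of {total} value-producing instructions are bitcasts")
```

To try this on real output, read the contents of a file kept by -keep-llvm-files and pass them to `bitcast_share`. A line-based count like this is only a crude proxy: it says nothing about how much time the optimiser actually spends removing the casts.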
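So in summary, we could do two things: collapse the compiler, mangler and assembler phases into a single step, and produce leaner LLVM IR.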
Hope this helps a bit.
Thanks, that is indeed helping.
To elaborate a bit further, we are not the only ones who have noticed that LLVM is slow. mesa (particularly amdgpu’s shader compiler) and at least one JavaScript implementation have removed LLVM from their compilation pipelines due to LLVM’s compilation performance. Moreover, rustc has been integrating an alternate backend for non-release builds, also largely due to the compilation performance of LLVM. GCC continues to be faster than Clang in Linux kernel compilation.
Ultimately, LLVM is very good at producing efficient code, but not terrific at doing so efficiently. I suspect that the reason for this is much the same reason why GHC compilation performance is not great: it’s far easier to write papers about better code generation than about faster compilation.
Of course, LLVM does have commercial users who no doubt care about compiler performance; their efforts are in part why GHC’s LLVM backend is as competitive as it is, despite needing to serialise/deserialise a rather verbose textual intermediate representation. Nevertheless, writing a fast, memory-efficient, modular compiler capable of sophisticated optimisation is not easy.
It also does not help that LLVM’s IR does not match GHC’s execution model particularly well. As a result we need to give up some efficiency (namely by splitting up proc-points into distinct procedures) in order to shoe-horn GHC’s C-- representation into LLVM IR. Kavon Farvardin previously tried working with LLVM upstream to extend the IR to allow a more direct mapping but there was some resistance from upstream (since all optimisations would need to account for the new construct).
Furthermore, no one has looked into which LLVM optimisations are truly pulling their weight (#11295). It would be really great to make progress on this issue in particular as I suspect there are some easy wins here.