GSoC 2025: Documenting and Improving Cmm

Hi all,

I’ve submitted a custom GSoC proposal titled “Documenting and Improving Cmm”, and I wanted to share it here with the community — both for feedback and to spark discussion on some possible refinements.

https://drive.google.com/drive/folders/1Nw2h4cKNJBnU0yBEVXHNq0J4vw_IW27Y

The project begins with a documentation phase focused on Cmm, which is central to GHC’s backend but remains under-documented and fragmented. After that, the work could proceed in one of two directions:


:jigsaw: Route A: LLVM Backend Improvements

  • Replace GHC’s internal LLVM AST with a maintained library like llvm-codegen or llvm-pretty, which now supports parsing and round-tripping textual IR.
  • This would simplify GHC’s LLVM backend and make it easier to support new LLVM versions.
  • It could also introduce optional Cabal/Stack dependency support for the LLVM backend. Since the native codegen would remain available, GHC itself wouldn’t strictly depend on Hackage — only the LLVM backend would.

:warning: However, I understand that some GHC developers may be reluctant to introduce Hackage dependencies inside GHC, even if they’re optional. As an alternative, I’ve considered documenting and polishing GHC’s existing internal LLVM AST, and offering it as a standalone library. This could serve both GHC and external tools, while staying closer to the status quo.
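
To make the standalone-library idea a bit more concrete, here is a made-up miniature of the kind of surface such a library could expose: a pure AST plus a textual renderer. The types and names below are illustrative only; they are not GHC's internal LLVM AST and not the llvm-pretty API.

-- Illustrative miniature of a standalone "LLVM AST + textual IR" surface:
-- a pure AST rendered to a String, nothing more. Not real GHC or
-- llvm-pretty types.
module MiniLlvmAst where

data LlvmType
  = I32
  | I64
  | Ptr LlvmType
  deriving (Eq, Show)

data LlvmLit = LitInt LlvmType Integer
  deriving (Eq, Show)

-- Render a type in (pre-opaque-pointer) textual IR syntax.
renderType :: LlvmType -> String
renderType I32     = "i32"
renderType I64     = "i64"
renderType (Ptr t) = renderType t ++ "*"

renderLit :: LlvmLit -> String
renderLit (LitInt t n) = renderType t ++ " " ++ show n

The point is just that the printer is a small pure function over an AST, which is what would make it easy to document, test, and publish independently of GHC.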


:gear: Route B: SSA-Based Register Allocation (or ANF-Style Alternatives)

Benjamin Maurer previously attempted to improve register allocation in the native Cmm backend by introducing live range splitting. However, because variables in Cmm are mutable and can be redefined, lifetimes were hard to track precisely.

To address this, he introduced SSA-like annotations, giving variables single definitions and explicit value flow. Unfortunately, his implementation also introduced a novel SSA-aware graph allocator, different from traditional Briggs-style allocators. It turned out to be slower at both compile-time and runtime than the well-optimized linear allocator, and was never merged.

My proposed refinement is simpler:
:heavy_check_mark: Use Benjamin’s SSA representation, but pair it with the existing Cmm graph allocator — extended only with live range splitting. This avoids new allocator complexity while retaining the clarity of SSA.

Alternatively, I’ve been experimenting with ANF-style transformations that eliminate redefinition and mutation, offering SSA-like benefits without requiring explicit SSA conversion.

Example in pseudocode (not Cmm syntax, but translatable):

int sum_upto(int limit) {
  int sum = 0;
  for (int i = 1; i <= limit; ++i) {
    sum += i;
  }
  return sum;
}

→ Converted to an ANF-like recursive form:


int sum_helper(int i, int limit, int acc) {
  bool cond = i <= limit;
  if (!cond) {
    return acc;
  } else {
    int acc1 = acc + i;
    int i1 = i + 1;
    return sum_helper(i1, limit, acc1);
  }
}

int sum_upto(int limit) {
  int start = 1;
  int acc0 = 0;
  return sum_helper(start, limit, acc0);
}

Because there’s no mutation, lifetimes are precise and allocator-friendly. The only caveat is that recursive calls may exceed Cmm’s argument register limits (see GHC#24019). However, I believe I can solve this problem in the LLVM backend without introducing a performance penalty.
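
For intuition, the same shape expressed in Haskell (purely illustrative, not GHC code): every intermediate is bound exactly once, so its live range runs from its binding to its last use.

-- Illustrative Haskell mirror of the ANF form above: acc1 and i1 are
-- each bound exactly once and never redefined.
sumUpto :: Int -> Int
sumUpto limit = go 1 0
  where
    go i acc
      | i > limit = acc
      | otherwise =
          let acc1 = acc + i
              i1   = i + 1
          in  go i1 acc1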


:rocket: Additional Thoughts

Unrelated to the proposal, I’ve also been exploring the idea of compiling Cmm to portable and fast C, as a lightweight backend. I’ll post more about that separately — it may be too much of a detour for GSoC.


I’d love feedback on any of this, especially:

  • Whether reusing and publishing GHC’s LLVM AST is preferable to switching libraries.
  • Whether SSA+existing allocator is a viable simplification of Maurer’s work.
  • Whether ANF-style rewriting has potential for targeted optimization passes.

Thanks!
Diego Antonio Rosario Palomino
UTEC, Lima — GSoC 2025 applicant

13 Likes

Hi Diego,

Awesome to see someone excited about the codegen backend!

GHC’s llvm version restriction is mostly an artifact of early llvm versions, where the LLVM textual IR used to change significantly between versions. For quite a while now this hasn’t been a real issue anymore. Hence I don’t think switching the generating library would provide that much of an improvement in that regard. It might perhaps improve the generated llvm IR. I went down this route for a very long time when I experimented with providing GHC with a bitcode llvm IR backend (as that format is versioned and stable across multiple llvm versions; Apple at that time was also hyping it as an intermediate target for iOS deployment… that seems to have fizzled out).

IMHO the biggest issue with the llvm backend is that it takes off from Cmm. Cmm is often too low-level for the llvm translation, and we end up trying to recover information that was lost during the STG to Cmm transformation, mostly around offsets. Another issue is the whole $def symbols story, and llvm needing unique symbol prototypes.

Thus you can imagine I’m in favour of Route B. I am also not sure how much the proposed changes would impact the WASM backend in particular.

On your final remarks about a C backend: we used to have -fvia-C. I think it was dropped not that long ago, or it might still be there and I misremember.

3 Likes

Hello fellow user whose username ends in man.

Thanks for the feedback. I gather it is common for users to experience LLVM incompatibilities; if not, there is something else going on that is being attributed to that problem. This may stem from the fact that GHC recommends a rather old LLVM version and warns that it doesn’t guarantee working with a newer one.

About inefficiencies in the LLVM backend: I knew Cmm functions get generated as LLVM functions with a uniform type signature across all functions, so, for example, a Cmm function that takes 3 parameters would have an LLVM type signature with 8 parameters but only use 3 of them. I think this could be fixed with one of LLVM’s features so that unused parameters don’t slow down the function call.
I wasn’t aware of the other LLVM inefficiencies. But I also know the LLVM backend employs some tricks not related to performance; for example, it has a weird way of ensuring tables-next-to-code that is no longer necessary in modern LLVM.

Maybe if I take the LLVM route I could address all of this.

About the changes related to Route B: nothing should change in relation to the WASM backend. Register allocators are specific to the native code generators, so adding a new one should not affect the LLVM or WASM backends. Benjamin explained to me that SSA is just a late-stage form that should only concern the native code generators as well. And if I ANF-convert the code instead of using SSA annotations, this shouldn’t even matter.

Edit: I forgot to mention that it would be possible to compile only performance-critical code with the improved graph allocator, transformed to either SSA or ANF form. The rest of the Cmm functions could be compiled normally with the linear allocator to keep compilation fast.
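
For reference, module-level opt-in to the existing graph-colouring allocator is already possible with -fregs-graph, so the per-function idea above would be a finer-grained variant of the same mechanism. A minimal sketch, assuming the flag behaves as documented (module and function names are placeholders):

{-# OPTIONS_GHC -fregs-graph #-}
-- Only this module is compiled with the graph-colouring register
-- allocator; the rest of the program keeps the default linear allocator.
module HotLoops (dotProduct) where

dotProduct :: [Double] -> [Double] -> Double
dotProduct xs ys = sum (zipWith (*) xs ys)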

Hey @angerman :wave:

IIRC the C backend is still around, but only included for platforms that neither support LLVM nor have native code generators. So, basically no relevant ones… :sweat_smile: (Whether it’s still working might be another question.)

1 Like

Hey @GunpowderGuy :wave:

IMHO it would be great to focus on the documentation (and keep the additional routes small).

Good documentation of Cmm would lower the entry barrier to GHC immensely. But documenting is usually not as much fun as coding… However, I think it would be very much appreciated.

1 Like

For what it’s worth, an external library is not necessarily a deal-breaker assuming that:

  • the library has only boot library dependencies, and
  • the refactoring is actually solving a problem

However, I am a bit skeptical that the second condition will be easily satisfied. In my experience, GHC’s LLVM IR AST is rarely the barrier to supporting new LLVM versions. The barrier has rather been adapting to changes in LLVM behavior.

I think introducing an optional dependency would be a strict regression relative to the status quo. Currently any user who builds GHC will have access to the LLVM backend: the user story is clear, packaging and distribution are trivial. This is why we insist that all build dependencies (save the bootstrap compiler and alex) are either submodules or in the tree (the sole exception here is the dependencies of Hadrian itself, although in this case we have introduced infrastructure specifically to package these libraries for downstream consumption).

Refactoring GHC’s IR into a separate library (shipped with GHC and maintained in the GHC source tree) is another viable option. However, I think we should have a clear goal when doing so. We should not forget that library maintenance comes at a cost, especially when interface stability is a goal.

In light of the above, I am yet to be convinced that the LLVM AST projects are really the most important projects we could undertake. @angerman does raise an interesting point regarding the potential for another backend design targeting LLVM from STG. It is interesting to consider what could be gained by such an approach, although I fear that it may be too research-y for a GSoC project.

Your Route B is an interesting potential project, although it does seem a bit open-ended for a GSoC project.

GHC indeed has a C backend (see GHC.CmmToC). This is what is used when one configures GHC with the --enable-unregisterised flag. Performance is quite poor, however, and language support is a bit spotty. We generally only recommend that it be used for porting GHC to new platforms.

5 Likes

That’s an interesting direction, but I would first like to understand exactly which inefficiencies arise from compiling STG → Cmm → LLVM. If those inefficiencies can be fixed, I believe it’s better for GHC to continue targeting LLVM from Cmm, rather than bypassing it.

@supersven I agree — documentation is by far the most important part of this project. Csaba Hruska (Haskell and compilers expert, and the mentor who has been guiding me) recently mentioned a critical issue with Cmm that is not currently documented:

Cmm has both an internal representation that can be pretty-printed, and a textual syntax that can be parsed. However, the pretty-printed version and the parseable version are not the same. Even worse, the parser cannot handle all possible Cmm programs — it only supports a restricted subset suitable for handwritten Cmm (e.g., used in GHC’s RTS).
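
One way to make that gap measurable is a round-trip law: whatever the pretty-printer emits, the parser must accept and map back to an equivalent value. Stated on a toy language below (everything here is a made-up miniature, not GHC’s Cmm types or entry points), this is the property that currently fails for Cmm and that the tooling work would aim to establish.

-- Toy round-trip check; the real goal is the same law for Cmm's
-- pretty-printer and parser, which does not hold today.
import Text.Read (readMaybe)

data Expr = Lit Int | Add Expr Expr
  deriving (Eq, Show, Read)

pprExpr :: Expr -> String           -- stand-in pretty-printer
pprExpr = show

parseExpr :: String -> Maybe Expr   -- stand-in parser
parseExpr = readMaybe

prop_roundTrip :: Expr -> Bool
prop_roundTrip e = parseExpr (pprExpr e) == Just e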

So based on this, we think the project should be structured in three phases:

1. Document Cmm: its pipeline, semantics, and known quirks like the pretty-printing/parsing mismatch.

2. Fix Cmm tooling issues, such as unifying the pretty-printed and parseable formats and improving the parser to support full Cmm.

3. If time permits, tackle one of the two directions:

  • Improve the LLVM backend to address some of the inefficiencies discussed in this thread.
  • Implement improvements to the native backend (SSA or ANF plus allocator enhancements).

It is probably relevant to mention one of the reasons I think LLVM should be targeted from Cmm rather than directly from STG: Csaba is working on a custom STG compilation scheme that will also compile to Cmm, but provide a whole array of optimizations.

It will take GHC’s STG → custom optimizations → textual Cmm, which will then be compiled by GHC’s backend.

Thank you again to everyone who has or will provide feedback!

3 Likes

Btw, does anyone know of a Haskell project in development that needs a code generator, or that will get its code generator rewritten? Once I document how to output Cmm programmatically, I hope it becomes an all-around good code generation backend. The native backends compile a lot faster than something like LLVM, while LLVM remains an option for release builds.

That’s an interesting direction, but I would first like to understand exactly which inefficiencies arise from compiling STG → Cmm → LLVM. If those inefficiencies can be fixed, I believe it’s better for GHC to continue targeting LLVM from Cmm, rather than bypassing it.

Indeed I agree; it’s an interesting question but the answer is not at all obvious. One attribute of the proposal that I left unsaid in my previous message is that it would likely be necessary to break ABI compatibility with the NCG to get more juice out of LLVM. For this reason it would be a rather risky project.

For what reasons do you think it would be necessary to break backwards compatibility?