What’s needed to bootstrap GHC with hugs?

I was recently reminded that we still don’t have a bootstrappable GHC build (in the sense of https://bootstrappable.org/), and that it would would be nice to have.

Somehow this caused me to try to combine all of GHC’s source code into one huge file, which I didn’t fully succeed, nor would it really directly help with bootstrapping GHC. But it made me wonder again.

A previous attempt to bootstrap GHC didn’t use hugs because it doesn’t support recursive modules. Presumably, adding that wouldn’t be too hard.

What else would be missing before we can compile a (oldish) version of GHC with Hugs? Could this plan of attack work:

  1. Identify an old version of GHC that

    • One can use to bootstrap subsequent versions until today.
    • Is old enough to use as few features not supported by hugs as possible.
    • Is still new enough so that one can obtain a compatible toolchain.
  2. Wrangle the build system to tell you which files to compile, with which preprocessor flags etc.

  3. Boostrap all pre-processing tools used by GHC (cpphs or use plan cpp, happy, alex).

  4. For every language feature not supported by Hugs, either

    • Implement it in Hugs,
    • Manually edit the source code to avoid compiling the problematic code, if it is optional (e.g. in an optimization pass)
    • Rewrite the problematic code
    • Write a pre-processing tool (like the one above) that compiles the feature away
  5. Similarly, since Hugs probably ships a base that is different than what GHC, or the libraries used by GHC expects, either adjust Hugs’ base, or modify the GHC code that uses it.

@AntC2 seems to be the expert on Hugs here; what is your assessment?

7 Likes

Thanks @nomeata for thrusting me into the limelight. Ooh er, if I’m any sort of “expert”, abandon hope now.

The last Hugs release (2006) supports to H98 standard and well beyond. The main H2010 language feature it doesn’t support is guards: the comma-list; the three forms of guard. Those went into GHC ~2000; it was SPJ keen on introducing them, because a lot of compiler code could be improved thereby. If GHC immediately started using those guards itself, that spells trouble.

IIRC from your previous attempt, you particularly wanted to build the whole compiler/Prelude/libraries from source – including all the low-level interfaces to the operating system. Then we still have the same difficulties: the operating systems have evolved such that those low-level interfaces are no longer compatible. Furthermore O.S. versions are out of support: even if you can find an install, you might not be able to get it licensed.

Does your saying “Wrangle the build system” mean you’re not now trying to get an authentic-from-source compile?

I’m not trying to compile Hugs from source ground up, so I just copied the binaries from the distro for that low-level stuff. (Also I’m not trying to call low-level routines.) As we discussed before, theoretically those binaries might be smuggling in malware/trojans to plant in compiled code.

Hugs certainly doesn’t know about the dastardly FTP changes. But then if you’re bootstrapping a GHC of the same vintage as Hugs, neither did GHC in ~2006.

I presume that early in GHC’s history, it stopped compiling via Hugs (which is an interpreter), and switched to compiling itself(?) – that is, to object code. Maybe that was even before 2000(?) In which case maybe Hugs 2006 was never used to run GHC. GHC version 3.02 (1997) has a ‘Building from Source’ guide that doesn’t mention Hugs.

For compiling GHC, I can see plenty of ugly low-level stuff inside Hugs 2006 sources with machine dependencies. But maybe that was just carried forward unchanged.

Presumably if you can find an old enough version of GHC, it wouldn’t be trying to use recursive modules.

3 Likes

Ah, sorry I wrote that before looking in detail at your links “mutually recursive module dependencies, which is a feature that even the earliest versions of GHC rely on.” So how early did GHC cease using Hugs to get compiled? Would SPJ remember?

There seem to be some hacks mentioned via Google/Stackoverflow to get mutually recursive modules to work. Concoct or edit the cpp; or a boot file? That seems more manageable than squashing the whole compiler into one source file.

It must have been before 0.29 since platform.h.in mentions it, but as far as I know, no available compilers (Hugs, NHC) can compile 0.29 (i.e. it was probably strictly self hosting). But now that you mention it, it would be good to systematically try all of these. HBC we’d need a Lazy ML compiler for. (maybe it’s easier to write one than modifying Hugs?)

#define HC_UNSPECIFIED	1
#define HC_GLASGOW_GHC	2
#define HC_USE_HC_FILES 3
#define HC_CHALMERS_HBC	4
#define HC_ROJEMO_NHC	5
#define HC_YALE_YHC	6
#define HC_HUGS		7

(I did email Lennart Augustsson a few months back, but I never heard back)

1 Like

I briefly tried to reproduce the steps taken in https://elephly.net/posts/2017-01-09-bootstrapping-haskell-part-1.html. Noting that in the end he did manage to produce .hc files by interpreting nhc98 using Hugs, but then the resulting programs segfaulted, it seems prudent to first make sure that one can compile and run nhc98 from the generated C sources files included in the distribution.

Trying to do that with a current GHC causes the segmentation faults he describes. So maybe using an old toolchain would be helpful.

What is the best way to get a GCC from around 2005 in nix? Is there maybe a repository of such “old” compilers somewhere?

Or maybe there are flags that one can pass to gcc to make it behave a bit older?

(I’m way out of my depth here, just dabbling around.)


The last revision of nixpkgs to have gcc-3.3 was 1dfd467c1476d57cf218bad20148d194099c253, but importing that doesn’t quite work (at least some outdated mirror paths need to be updated.)

1 Like

There seems to be some confusion about the origins of GHC, which is somewhat unusual considering the relevant information has been widely available since 2007:

As I dimly recall, the only time GHC and Hugs were ever really “connected” was in the STGHugs project, which gave rise to GHCi as we now know it (because it was then simpler to implement a new interpreter in Glasgow Haskell instead of trying to keep GHC and Hugs “in harmony”).


…but why (re)try bootstrapping with STGHugs (as previously noted here) when you now have (preliminary) support for WebAssembly in GHC. It might be easier to use wasm as a “pseudo-UNCOL, with trusted interpreters then running the wasm-version of GHC to then bootstrap GHC from source.

I can confirm Hugs 2006 continues to not support recursive modules. There’s one point in the code that makes that check/rejects Recursive import dependency.

Looks like that code is trying to sort the imports into dependency order: load first the modules with no dependencies; then the modules that depend only on them; then another round; etc.

Given GHC supported recursive imports from early days, and it’s a requirement in the standard, but the non-compliance is still a ‘known infelicity’ in Hugs 2006 – and indeed Hugs uses a non-compliant Hugs.Prelude module to avoid it – I guess “wouldn’t be too hard” to fix is being wildly optimistic.

Maybe I’m not seeing it, but how do you get the wasm-version of GHC?

(Note that we are not talking about bootstrapping in the sense of supporting a new architecture, but in the sense of not using any untrusted binaries or other generated files.)

2 Likes

I assume there is a parser for a single module, which could be changed to parse all modules of a SCC in one go, leaving enough traces for the name resolution to keep definitions from different models apart?

(I assume it wasn’t done because nobody needed it - recursive modules are very rarely used outside GHC, snd maybe none of the Hugs users ever missed it.)

At least to me that sounds less scary than debugging segmentation faults in nhc98 when using modern C compilers. I wish I’d know how much of the way it takes us, or if, say, the path via nhc98 is more likely to succeed.

By first compiling GHC to WebAssembly using GHC itself.


…which would seem to eliminate Hugs as a means of bootstrapping, as it’s also a compiled, untrusted binary (unless you’re intending to use a certified compiler to build Hugs).

But then you need GHC first :smiley:

…which would seem to eliminate Hugs as a means of bootstrapping, as it’s also a compiled, untrusted binary (unless you’re intending to use a certified compiler to build Hugs).

Hugs is written in C, so (assuming you have a C compiler), you can compile it yourself!

1 Like

From the project’s website:

…therefore:

…isn’t satisfactory unless your chosen compiler for that ol’ warhorse is itself trustworthy. Or should one use a C interpreter to run the C compiler from its source…oh wait, how was that C interpreter built?

There are two approaches to bootstrapping a programming language:

  1. someone writes the implementation in some assembly language;

  2. someone writes the implementation using an existing programming language and its implementation e.g. a compiler.

In either case, you will be confronted with an “opaque binary” provided by someone else during the bootstrapping process. I’ll leave you to decide which is more trustworthy: assembly written by hand, or assembly generated by another program…


…only once. The wasm sources can then be kept for later reuse, much like the ABC-version is also provided for each major release of the Clean compiler.

No. (Or I don’t understand what you’re asking.) Hugs is an interpreter, not a compiler. It doesn’t produce object code/there’s no separate compilation. If you tell it to compile module A, it’ll chase everything A imports, and everything the imports import and so on. Then sort the modules into reverse dependency order, as I described. Then in effect compile it as one enormous source file with module ‘fences’. The reason for the reverse dependency order is so that it gets first clean declarations of all the names in the non-dependent modules; then for later decls, it only needs to refer to decls already processed.

I think Hugs internals doesn’t have a concept of ‘traces for name resolution’. It only has names compiled into the dictionary as syntactically valid decls. (The dictionary is quite a subtle/fragile [**] structure: for example, instances are stored in overlap sequence chained from the class decl. Yes it has several passes over the dictionary after syntax validation, for semantic validation, type improvement. But I see no notion of ‘forward procedure’-- to borrow an ALGOL term.)

You can’t squish several modules into (in effect) one source file/you have to observe module boundaries, because there might be the same names declared as local only in two different modules.

(Or I don’t understand what you’re asking.)

[**] As I discovered when I tried to make overlapping instances work smarter – I broke a lot of stuff and got the interpreter segfaulting until I figured out the instances chain.

Idk, but bootstrapping from nothing into c isn’t so uncommon. Here is guix doing this in less than 20 steps into guile. C doesn’t seem like a bad idea for a fairly ubiquitously abailabe compiler.

Maybe the other route is through lisp/scheme/racket/hacket?

1 Like

@atravers you seem to be firing off a load of arrows. I don’t see any of them are hitting relevant targets.

Still not good enough: that source gets ‘assembled’ by an ‘assembler’: is that “itself trustworthy”? The only reasonable approach is to enter the compiler binary code direct into memory using the machine’s front panel switches. And even then, we’re relying on the microcode interpreting that binary correctly, and all the Operating System’s low-level routines that we call. (And can’t avoid calling, because the OS allows those operations only in protected memory.) Hand-keying a few million bytes of binary, what could be more trustworthy?

But note: Hugs is an interpreter, it doesn’t produce binaries. Rather, we’re using it to host source ghc to compile ghc to produce binaries.

So we don’t need to trust gcc much, but as @angerman points out, C++ compilers are a known evil validated far more widely than LML/Lisp/even rust that your link seems to favour. The critical question is where do the pre-packaged binaries (routines) come from that ghc compiled from ghc interpreted under Hugs merely copies into its output?

The Hugs compiled from C++ might include faults that will cause it to segfault/dangling pointers/buffer overflow given sufficiently evil Haskell input. But providing that doesn’t happen when interpreting ghc to compile ghc, we then throw away the crutch and have an untainted ghc compiler (object code) – which of course we must immediately use to compile ghc ‘properly’.

Certainly! I would go one step further and suggest that we would have to make our computers using parts we build ourselves (e.g. relays from copper wire, iron rods and springs, etc) to know for sure that we’re not using “opaque binaries” anywhere…


Alright, so let’s look at how Hugs would deal with all the FFI-C calls GHC relies on: does it call out to a C interpreter to perform the call before continuing on interpreting Haskell? No - it (sensibly) calls a C compiler to build extra opaque binaries modules which Hugs loads at runtime as needed by the Haskell program being interpreted.

So using Hugs to bootstrap it will not give you that “untainted GHC compiler”: all the foreign calls will be implemented with assembly generated by another “opaque binary”. Now if angerman, yourself and everyone else here is content with that fact, then you are all quite free to use Hugs in this way.

But the real test here is whether the bootstrappable project will be also be content with this approach. I will defer to their judgement on the matter (if someone else wants to ask them).

It is! Just yesterday someone announced that guix project is now bootstrapped from a 357-byte binary!

Sure, but a interpreter that can deal with mutual recursion within a module can be taught to deal with mutual recursion across modules. You are absolutely right the things it has to keep track of (which names are in which module etc.), but that’s compiler construction bread and butter work.

I’ll wager a bet: If someone demonstrates that we can bootstrap some version of GHC using Hugs if only Hugs supported recursive modules, I’ll implement support for that.

2 Likes

But look why they wanted to do it:

One reason is to properly address supply chain security concerns. The Bitcoin community was one of the first to recognize its importance well enough to put the idea into practice. At the [Breaking Bitcoin conference 2020], ...

If you want to break Bitcoin, give your money to Sam Bankman-Fried (apparently – the latest in a long line of scams). No amount of fancy software assurance will protect from straight-up crooks (or maybe idiot-savants – it’s a bit hard to tell with SB-F).

Simon Tournier just told me that guix has bootstrapped 9.4.4 from 7.8.4, then there is a gap that needs to be filled, and also 6.10.4 from 4.08.2. So finding a way to bootstrap 4.08.2 would nicely connect to their work.

Looking at the code, I see boolean guards (f x | p x = y), but not multiple guards or deconstructing guards. But it seems that Hugs doesn’t even have boolean guards, right?

That sounded too good for me to believe and indeed it seems to be:

while the package graph is rooted in a 357-byte program, the set of binaries from which packages are built includes a 25 MiB statically-linked Guile

What does it mean to say that a package graph is rooted in a binary? It doesn’t seem to mean that all packages can be built using only that binary…

1 Like