Hook up Hackage to Software Heritage

I wonder if “jobs” is the right category for a call for volunteers, but it kinda fits, doesn’t it?

I was approached by Stefano Zacchiroli from the Software Heritage project, which archives all the source code (plus VCS history) it can find, for posterity. They archive Haskell code, but currently only “incidentally”, in the sense that they encounter a lot of it via general-purpose forges such as GitHub and GitLab, as well as via distributions such as Debian.

To ensure that much more of the relevant Haskell code is archived, they would like to also systematically archive all Hackage packages. Besides increasing Haskell code coverage, this would also make it possible to archive package metadata properly. They could do it themselves, but they prefer to delegate the development of the needed (open source) adapters for Software Heritage to domain experts, and they have some funding for it.

He expects the needed development work to be fairly limited, as they already have adapters for other, similar package repositories, so a lot of code can be reused. So, dunno, maybe a couple of weeks to one month of work, including the time needed to get familiar with their software stack. Preferably the work should be done in Python, because after the initial deployment they will need to maintain it themselves, and they are not big enough to have developers fluent in all the languages out there 🙂
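To give a rough sense of the enumeration side of such an adapter: Hackage’s tarball URL scheme is stable, and (assuming its JSON listing endpoint behaves as I remember) the package list is one HTTP request away. A minimal sketch in Python; the function names are illustrative, not part of any Software Heritage API:

```python
# Hypothetical sketch of the enumeration step a Hackage lister needs.
# Function names are illustrative, not part of any SWH framework.
import json
import urllib.request

HACKAGE = "https://hackage.haskell.org"

def tarball_url(package: str, version: str) -> str:
    """Canonical URL of a released package tarball on Hackage."""
    return f"{HACKAGE}/package/{package}-{version}/{package}-{version}.tar.gz"

def list_packages():
    """Fetch all package names from Hackage's JSON listing endpoint."""
    req = urllib.request.Request(
        f"{HACKAGE}/packages/", headers={"Accept": "application/json"}
    )
    with urllib.request.urlopen(req) as resp:
        return [entry["packageName"] for entry in json.load(resp)]
```

A real lister would additionally walk each package’s version list and hand every tarball URL to a loader; the SWH stack presumably supplies the scheduling and storage around that.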

This seems like a worthwhile goal, and I hope I can find someone here whose fingers itch to get Hackage included in that archive. Any takers?

See https://sympa.inria.fr/sympa/arc/swh-devel/2021-04/msg00035.html for information about the small grant, if funding plays a role. In either case, Zack will be able to answer questions at zack@upsilon.cc.


Not related to this directly, but I’d love it if we could ensure they archived all the earlier functional language compilers (hbc, nhc, etc.) and proof assistants (Cayenne, Epigram, etc.) that we can get to them.


hbc is on the Internet Archive: https://archive.org/details/haskell-b-compiler. Is there any project that tries to keep the artifacts runnable?


@nomeata I was thinking about this too. In particular, I think the Git data structures SWH uses are a better normal form than the nested tarballs we use today (which admit all sorts of garbage edge cases). I would love to come up with a plan such that Hackage audits against xattrs etc. and (alongside the tar, for backwards compatibility) makes an authoritative Git data structure for Hackage.

SWH would then have a very easy time ingesting that. But that is just a side benefit; they can scrape things well enough either way. I am proposing it because I think those Git data structures / SWHIDs would be very good for us too.
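For context on why those identifiers are such a convenient normal form: a content SWHID is just Git’s blob hash underneath, i.e. the SHA-1 of a `blob <size>\0` header followed by the file’s bytes. A minimal sketch:

```python
# Compute a Software Heritage content identifier (swh:1:cnt:...).
# This is exactly Git's blob hashing: sha1("blob <len>\0" + bytes).
import hashlib

def content_swhid(data: bytes) -> str:
    header = b"blob %d\x00" % len(data)
    digest = hashlib.sha1(header + data).hexdigest()
    return f"swh:1:cnt:{digest}"
```

Because any Git client computes the same hashes, a Hackage that published these identifiers would interoperate with SWH (and with plain `git hash-object`) for free.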

You mean, instead of exporting to SWH, just let hackage use the same data structure, and there wouldn’t have to be an exporter?

You mean, instead of exporting to SWH, just let hackage use the same data structure,

Yeah

and there wouldn’t have to be an exporter?

Perhaps Hackage would still “push” to SWH (or push a notification for SWH to pull) rather than SWH periodically polling Hackage. But the actual data would be relayed very simply and safely, since it’s the same data structures and references, as you say.

Sounds interesting!

But maybe it is easier to first write an exporter (also so that Software Heritage can start archiving our stuff), and to expand the scope of the project later? (If a volunteer can be found.)


That git data-structure idea sounds really similar to the stackage architecture…

What’s that? I just know about https://github.com/commercialhaskell/all-cabal-hashes which I believe just does tarball hashing.

It’s worth noting that @emilypi and @davean et al. have a very nice set of ideas in flight that may help make this pretty easy and efficient to do.

For HBC, a more recent version is also available (also archived, but unfortunately with no VCS metadata):

The sources and VCS metadata for Hugs98 are still available at https://darcs.haskell.org/hugs98.

Some other early Haskell implementations:

There also used to be a more experimental system for small computing devices, e.g. the Palm Pilot: Pocket Haskell (no type checking). It was at http://not.meko.dk/Hacks/Haskell/PocketHaskell-0.1.1a.tar.gz (not archived).


From the writer of HBC:

During the spring of 1990 I was eagerly awaiting the first Haskell compiler, it was supposed to come from Glasgow and be based on the LML compiler. And I waited and waited. After talking to Glasgow people at the LISP & Functional Programming conference in Nice in late June of 1990 Staffan Truvé and I decided that instead of waiting even longer we would write our own Haskell compiler based on the LML compiler.

For various reasons Truvé couldn’t help in the coding of the compiler, so I ended up spending most of July and August coding […]

The first release, 0.99, was on August 21, 1990.

…and here it is: lml-0.99.src.tar.Z (SHA256: 51d0d5954a10b7b2a9df241e21887f60448081c9af19417dd84ada8437403e8f)
(…curious: that site doesn’t appear to have been archived.)

With regards to software preservation: it would require heroic effort to make it work on modern hardware, so it will probably end up joining the old Yale Haskell system as another static exhibit…

More concretely than what I said before: https://nlnet.nl/project/SoftwareHeritage-P2P/ is now a thing we’ve planned out.

I think it would be very cool to build upon that, and use SWHIDs in Hackage, and then sync between end users, Hackage, and Software Heritage with IPFS.


Nice!

Although that project “just” means that SWH data is available in IPFS, right? In order to get hackage on there, an exporter would still be needed (unless hackage completely changes its internals, and suddenly cabal uploads to SWH directly and hackage uses that as the backend, but that sounds like a big undertaking).

Oops I wish I saw this earlier.

Although that project “just” means that SWH data is available in IPFS, right?

Yes.

In order to get hackage on there, an exporter would still be needed

Yes.

unless hackage completely changes its internals

Yes, I think we should do this anyway. The tar-of-tar situation is quite bespoke, far from a normal form (I’ve been told some Hackage tarballs won’t even extract), and thus a bit cumbersome to ingest downstream.

suddenly cabal uploads to SWH directly and hackage uses that as the backend

I don’t think SWH wants to be the archive of first resort. I just want everyone to agree on what the data structures are, and SWH choosing essentially Git should be a very uncontroversial choice.

Once the data structures agree, it’s very easy to try all sorts of variations on what information flows where. But while they don’t agree, we are stuck with sad duct tape and ad-hoc syncing logic.

There is no tar-of-tar format in hackage. There is a tarball of cabal files in the index.tgz. Individual cabal packages are also tarballs. That’s it. I can’t tell from the links in this thread what the existing SWH system or setup is, but hooking up some “glue” should be straightforward, and naively I don’t expect “making structures agree” would make much of a difference here.
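To illustrate how simple that layout is: entries in the index tarball are paths of the form `<package>/<version>/<package>.cabal`, which can be walked with nothing but the standard library. A sketch, building a toy index in memory rather than fetching the real one:

```python
# Sketch: the Hackage index is a tarball whose entries are named
# "<package>/<version>/<package>.cabal". We build a toy index in
# memory and walk it the way a downstream ingester would.
import io
import tarfile

def make_toy_index(entries):
    """Build an in-memory index tarball from {(pkg, ver): cabal_text}."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w") as tar:
        for (pkg, ver), cabal in entries.items():
            data = cabal.encode()
            info = tarfile.TarInfo(name=f"{pkg}/{ver}/{pkg}.cabal")
            info.size = len(data)
            tar.addfile(info, io.BytesIO(data))
    buf.seek(0)
    return buf

def list_versions(index_file):
    """Yield (package, version) for every .cabal entry in the index."""
    with tarfile.open(fileobj=index_file, mode="r") as tar:
        for member in tar.getmembers():
            parts = member.name.split("/")
            if len(parts) == 3 and parts[2].endswith(".cabal"):
                yield parts[0], parts[1]
```

(The real index also carries extra entries such as hackage-security metadata, which an ingester would simply skip.)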


Don’t get me wrong, we certainly can mirror the data to SWH without this change. And not changing Hackage’s internals is less work.

But there are still a bunch of small paper cuts from code being stored in different formats, whether it’s integrating Hackage packages with other tools, or simply remembering how much of Hackage was already sent to SWH (with content-addressed data, ingesting only the new hashes requires no extra state).

I suppose this is something that interests the build-systems and taylorism-for-FOSS side of me more than the Haskell side of me per se, but I hope I can at least convince you all that if the PRs to Hackage and Cabal existed, they would be worth merging.

It would be possible (and straightforward) to also provide the index in another format if desired, and if anyone wanted to code it up. Given the way the index format and hashes are coupled to hackage-security and TUF, I can’t imagine altering that makes much sense. And as far as code formats, the cabal tarball is a standard that is both straightforward and extremely time-tested. I can’t see any percentage in moving away from it.

Perhaps there’s just aspects of the current TUF implementation you’re less familiar with, and which are nicer than you may be imagining?


Yes, I wouldn’t say get rid of the current thing either – no need to break old Cabals out in the wild from day one.

I am indeed not aware of the details of the current implementation. But the spec makes a point of not specifying the exact file format, and I wouldn’t want to abandon TUF.

It sounded like you meant that replacing the format while keeping the current data would effectively invalidate existing signatures? Yeah, that isn’t good either. Hopefully TUF is amenable to some sort of “hash agility” scheme where some authority can vouch for the new SWHID, but initially I suppose the two can live in parallel.

Interestingly, this gets back at what sort of advantages this stuff might bring Software Heritage in the first place. If the “archive all the things” battle is won, the next challenge is better attesting what is what, and all the TUF signatures could well help with that.

Whether or not the details pan out in just that way, I would hope there is a chance here to nail down just how language-specific package managers like Cabal ought to work with archives like Software Heritage and tools like Nix alike. If we would end up pioneering a path others could follow, I think that would be especially useful, and it would still benefit Haskell too in the long run.