Hook up Hackage to Software Heritage

I wonder if “jobs” is the right category for a call for volunteers, but it kinda fits, doesn’t it?

I was approached by Stefano Zacchiroli from the Software Heritage project, which archives all the source code (+ VCS history) that it can find for posterity. They archive Haskell code, but it’s currently “incidental”, in the sense that they encounter a lot of it via general-purpose forges such as GitHub and GitLab, as well as distributions such as Debian.

To make sure much more relevant Haskell code is archived, they would like to also systematically archive all Hackage packages. In addition to increasing Haskell code coverage, that will also enable them to archive package metadata properly. They could do it themselves, but they prefer to delegate development of the needed (open source) adapters for Software Heritage to domain experts, and they have some funding for it.

He expects the needed development work to be fairly limited, as they already have adapters for other similar package repositories, so there will be a lot of code to reuse. So, dunno, maybe a couple of weeks to 1 month of work, including the time needed to get familiar with their software stack. Preferably the work should be done in Python, because after initial deployment they will need to maintain it themselves, and they are not big enough to have developers fluent in all languages out there :slight_smile:

This seems like a worthwhile goal, and I hope I can find someone here whose fingers itch to get Hackage included in that archive. Is there anyone?

See https://sympa.inria.fr/sympa/arc/swh-devel/2021-04/msg00035.html for information about the small grant, if funding plays a role. In either case, Zack will be able to answer questions at zack@upsilon.cc.


Not related to this directly, but I’d love it if we could ensure they archived all the earlier functional language compilers (hbc, nhc, etc.) and proof assistants (Cayenne, Epigram, etc.) that we can get to them.


hbc is on the internet archive: https://archive.org/details/haskell-b-compiler. Is there any project that tries to keep the artifacts runnable?


@nomeata I was thinking about this too. In particular I think the Git data structures SWH uses are a better normal form than the nested tarballs we use today (which admit all sorts of garbage edge cases). I would love to come up with a plan such that Hackage audits against xattrs etc. and (along with the tar for back compat) make an authoritative Git data structure for Hackage.

SWH will then have a very easy time ingesting that. But that is just a side benefit; they can scrape things well enough either way. I am proposing it because I think those Git data structures / SWHIDs will be very good for us too.
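To make "the same data structures and references" a bit more concrete: as far as I understand it, a SWHID for a single file's content is just the Git blob hash of that file behind a `swh:1:cnt:` prefix. Here is a minimal sketch in Python, assuming that scheme; the function name `content_swhid` is made up for illustration and is not part of any SWH or Hackage tooling.

```python
import hashlib


def content_swhid(data: bytes) -> str:
    """Return a content SWHID (hypothetical helper, for illustration only).

    Assumes the identifier is the Git blob hash: SHA-1 over the header
    "blob <length>\\0" followed by the raw file bytes.
    """
    header = b"blob %d\x00" % len(data)
    digest = hashlib.sha1(header + data).hexdigest()
    return f"swh:1:cnt:{digest}"


if __name__ == "__main__":
    # The empty file hashes to Git's well-known empty-blob ID.
    print(content_swhid(b""))
    # -> swh:1:cnt:e69de29bb2d1d6434b8b29ae775ad8c2e48c5391
```

The point is that the identifier depends only on the file bytes, so Hackage and SWH would compute the same ID for the same file without coordinating. Directories and whole package revisions would need the corresponding Git tree and commit objects, which this sketch does not cover.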

You mean, instead of exporting to SWH, just let hackage use the same data structure, and there wouldn’t have to be an exporter?

> You mean, instead of exporting to SWH, just let hackage use the same data structure,

Yeah

> and there wouldn’t have to be an exporter?

Perhaps Hackage would still “push” to SWH (or push a notification for SWH to pull) rather than SWH periodically polling Hackage. But the actual data would be relayed very simply and safely, since it’s the same data structures and references as you say.

Sounds interesting!

But maybe it’s easier to first write an exporter (also so that Software Heritage can start archiving our stuff), and later expand the scope of the project? (If a volunteer can be found)

That Git data-structure idea sounds really similar to the Stackage architecture…