Indexing Hackage: Glean vs. hiedb

I thought it might be fun to use Glean to index as much of Hackage as I could, do some rough comparisons against hiedb, and play around to see what interesting queries we could run against a database of all the code in Hackage.

I wonder if an index like this could help us answer questions about breaking changes in base. For example, there is currently a proposal to remove the mappend method of the Monoid class. To assess the impact of this change, we need to know how many instances use a custom (or even non-canonical) definition of mappend. I wonder how expressive an index needs to be to answer such questions.
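
To make the distinction concrete, here is a minimal sketch of the three shapes a Monoid instance can take with respect to mappend. The type names (MaxInt, Append, Total) are made up for illustration and do not come from any Hackage package.

```haskell
-- Shape 1: mappend is not mentioned, so the class default (mappend = (<>)) applies.
newtype MaxInt = MaxInt Int deriving (Eq, Show)

instance Semigroup MaxInt where
  MaxInt a <> MaxInt b = MaxInt (max a b)

instance Monoid MaxInt where
  mempty = MaxInt minBound

-- Shape 2 (canonical): mappend is written out, but only delegates to (<>).
newtype Append = Append [Int] deriving (Eq, Show)

instance Semigroup Append where
  Append xs <> Append ys = Append (xs ++ ys)

instance Monoid Append where
  mempty = Append []
  mappend = (<>)

-- Shape 3 (non-canonical): mappend carries its own implementation
-- instead of delegating to (<>).
newtype Total = Total Int deriving (Eq, Show)

instance Semigroup Total where
  Total a <> Total b = Total (a + b)

instance Monoid Total where
  mempty = Total 0
  mappend (Total a) (Total b) = Total (a + b)
```

Telling shape 1 apart from shapes 2 and 3 only needs the list of methods each instance defines. Telling shape 2 apart from shape 3 needs the right-hand side of the mappend binding. Recent GHC versions can flag the third shape with -Wnoncanonical-monoid-instances.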

I had this thought as well while reading.

I think it’s even more compelling for small, simple changes like adding a new function to base. The current impact assessment process is quite tricky: either you end up grepping Hackage for similarly named things, or you try to use something like haskell/clc-stackage (a meta-package to facilitate impact assessment for CLC proposals), but that requires building GHC and then a lot of packages.

If we just hosted an index for one of these tools somewhere, we could let proposers query them (maybe they download the index) and that seems like a much nicer process than error-prone grepping or the labour-intensive task of building all of stackage yourself.

Being able to do a quick impact-analysis for changing an API would be super useful. However, the level of detail we would need to be able to identify non-canonical definitions of mappend is basically the full AST - that’s not impossible, but it’s beyond what we have done for any other language in Glean, and it’s not clear whether the cost/benefit tradeoff is a good one.

Identifying all the instances that override mappend at all is certainly possible though. It falls within “index the details of all the declarations” which is high up the todo list.
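
As a rough sketch of what such a query could look like once declaration details are indexed, assume each instance is recorded with the names of the methods it defines. The InstanceDecl record and the sample data below are hypothetical and are not Glean's schema or hiedb's table layout.

```haskell
import Data.List (nub)

-- Hypothetical, simplified record of one indexed instance declaration.
-- This is an assumption for illustration, not Glean's or hiedb's data model.
data InstanceDecl = InstanceDecl
  { instPackage :: String    -- package the instance lives in
  , instClass   :: String    -- class being instantiated
  , instType    :: String    -- head type of the instance
  , instMethods :: [String]  -- methods explicitly defined in the instance body
  } deriving (Eq, Show)

-- True when a Monoid instance explicitly defines mappend, in any form.
overridesMappend :: InstanceDecl -> Bool
overridesMappend d = instClass d == "Monoid" && "mappend" `elem` instMethods d

-- Packages containing at least one such instance, without duplicates.
affectedPackages :: [InstanceDecl] -> [String]
affectedPackages = nub . map instPackage . filter overridesMappend

-- Made-up sample index.
sampleIndex :: [InstanceDecl]
sampleIndex =
  [ InstanceDecl "pkg-a" "Monoid" "Append" ["mempty", "mappend"]
  , InstanceDecl "pkg-a" "Monoid" "Total"  ["mempty", "mappend"]
  , InstanceDecl "pkg-b" "Monoid" "MaxInt" ["mempty"]
  , InstanceDecl "pkg-c" "Show"   "Foo"    ["show"]
  ]
```

With this sample data, affectedPackages sampleIndex evaluates to ["pkg-a"]. Note that this only finds instances that mention mappend at all; it cannot tell a canonical definition from a non-canonical one.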

If we just hosted an index for one of these tools somewhere, we could let proposers query them (maybe they download the index) and that seems like a much nicer process than error-prone grepping or the labour-intensive task of building all of stackage yourself.

Yes - putting up the index somewhere for download is much easier than hosting a query service, but both are possible. I’ll think about it.

This is way out of the ecosystem, but I recently heard on a Developer Voices podcast about a modular query engine that is used to power the Rust semantic version checking. The key point is that you can have different data sources/query endpoints for the different versions, which then convert into a common internal representation that can be checked. But I gather the details are pretty bloody, as expected. Anyway, it is a reference implementation; a similar approach may work on top of Glean, providing a version-independent view of package APIs across different releases of Haskell and of the package.
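
A toy sketch of that idea, under heavy assumptions: each data source is converted into one common representation (here just exported names paired with their types as text), and two versions are then compared in that representation. All names below (Api, ApiSource, SignatureLines, diffApi) are hypothetical; this is not the engine from the podcast and not anything Glean provides.

```haskell
import Data.Maybe (isNothing, mapMaybe)

-- Hypothetical common representation: exported name paired with its type as text.
type Api = [(String, String)]

-- Any data source that can be converted into the common representation.
class ApiSource src where
  toApi :: src -> Api

-- One made-up source: plain lines of the form "name :: type".
newtype SignatureLines = SignatureLines [String]

instance ApiSource SignatureLines where
  toApi (SignatureLines ls) = mapMaybe parseLine ls
    where
      parseLine l = case words l of
        (name : "::" : ty) -> Just (name, unwords ty)
        _                  -> Nothing  -- skip lines that are not signatures

data Change
  = Removed String
  | Added String
  | TypeChanged String String String  -- name, old type, new type
  deriving (Eq, Show)

-- Compare two versions of an API in the common representation.
diffApi :: Api -> Api -> [Change]
diffApi old new = mapMaybe classify old ++ added
  where
    classify (name, ty) = case lookup name new of
      Nothing -> Just (Removed name)
      Just ty'
        | ty' /= ty -> Just (TypeChanged name ty ty')
        | otherwise -> Nothing
    added = [Added name | (name, _) <- new, isNothing (lookup name old)]
```

A second source (say, one backed by an index) would only need its own ApiSource instance; diffApi stays the same. Comparing types as text is a deliberate simplification here.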

Thanks @simonmar for the interesting post. It’s very useful to see a comparison between the two projects, and it’s good to see that .hie files are useful for tools other than HLS.

.hie files are good for extracting all the identifiers, references and types, but as I’m discovering, it’s hard to reconstruct the AST for declarations from the .hie file. For example, I want to extract all the methods of a class, all the constructors of a datatype, and so on. We already have the exports in the .hie, but not the declarations. Ideally I want something like the Haddock Interface type (but for all the declarations, not just the exported ones) or the internal AST.

I think that, at least for declarations (including constructors), .hie files give some information; hiedb and stan seem to make use of it (if I understood the issue).

See: “Stan 0206 should not fire for type data”, Issue #544 in kowainik/stan on GitHub.

Thanks for the pointer. Matching on the AST node is what I imagined we would have to do, good to have an example to look at.

But doing it this way is at best fragile. The original AST has been split into pieces during HIE construction, which then rebuilds its generic AST using the SrcSpans. So the constructors will usually end up as children of the data declaration if the spans are correct, but we shouldn’t have to rely on this property, and indeed it might not even hold: CPP and #include can break the span inclusion (admittedly that would be weird and rare). Still, doing it this way feels very wrong when we had the canonical representation originally and just threw it away. Sigh.
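
To illustrate why span containment is a shaky basis for recovering structure, here is a small sketch with made-up types. Span and Node are simplified stand-ins, not the real HIE or GHC types, and the positions are invented.

```haskell
-- Hypothetical, simplified source span: (line, column) start and end in one file.
-- The real HIE spans also carry a file name; this is only an illustration.
data Span = Span
  { spanStart :: (Int, Int)
  , spanEnd   :: (Int, Int)
  } deriving (Eq, Show)

-- True when the inner span lies entirely within the outer one.
contains :: Span -> Span -> Bool
contains outer inner =
  spanStart outer <= spanStart inner && spanEnd inner <= spanEnd outer

-- Hypothetical AST node: a label plus the span it was found at.
data Node = Node
  { nodeLabel :: String
  , nodeSpan  :: Span
  } deriving (Eq, Show)

-- Recover the "children" of a declaration purely by span containment.
childrenBySpan :: Node -> [Node] -> [Node]
childrenBySpan parent =
  filter (\n -> n /= parent && nodeSpan parent `contains` nodeSpan n)

-- A data declaration on lines 10-12, one constructor inside it, and one
-- constructor whose span landed elsewhere (as could happen after CPP or
-- #include has shifted positions).
dataDecl, conA, conB :: Node
dataDecl = Node "data T" (Span (10, 1) (12, 20))
conA     = Node "A"      (Span (11, 3) (11, 10))
conB     = Node "B"      (Span (40, 3) (40, 10))
```

Here childrenBySpan dataDecl [conA, conB] returns only conA: the constructor with the displaced span is silently dropped, even though it belonged to the same declaration in the original AST.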