Cabal upload revisions? and some philosophical points

I wish I could upload revisions via the command line.

pragmatics

Firstly, this would simply be more convenient, especially in that it would remind me that I should always update the source repo, add a tag, etc. etc.

philosophy

Second, I think this would make people like revisions more, for philosophical reasons. Right now, because revisions are done so differently, it is easy to mentally define revisions this way:

versions are normal releases, revisions are something created in this special out of band way

But that is icky and imperative. The right way to think about versions is always relational:

(version-x, source-x) Upholds-Invariants (version-y, source-y)

is valid if the “distance” between the sources does not exceed a threshold defined by the “distance” between the versions. This formalism is plainly more beautiful, and it also leads one to useful connections with many branches of mathematics (metrics and orders, at least).
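To make the relational view concrete, here is a toy Haskell sketch. The types, the particular “distances”, and the threshold are all illustrative assumptions of mine, not anything from Cabal or Hackage:

```haskell
type Version = [Int]      -- e.g. [1,2,3]
type Source  = [String]   -- file contents, one entry per file

-- Toy version distance: 0 when equal; larger the more significant the
-- first differing component is (a major bump is "further" than a patch).
versionDistance :: Version -> Version -> Int
versionDistance v w =
  length (dropWhile id (zipWith (==) v w)) + abs (length v - length w)

-- Toy source distance: how many files differ.
sourceDistance :: Source -> Source -> Int
sourceDistance s t =
  length (filter not (zipWith (==) s t)) + abs (length s - length t)

-- (vx, sx) Upholds-Invariants (vy, sy): the source delta must stay within
-- a threshold determined by the version delta (here, simply equal to it).
upholdsInvariants :: (Version, Source) -> (Version, Source) -> Bool
upholdsInvariants (vx, sx) (vy, sy) =
  sourceDistance sx sy <= versionDistance vx vy
```

Under this toy model, two equal versions must have identical sources (distance 0), while a major bump licenses a larger source delta than a patch bump, which is exactly the covariance the relational reading is after.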

With normal versions, we have the PVP defining the relation, which is sadly not enforceable with the tooling we have. With revisions, we indeed have the tooling, as implemented in the Hackage server!

This to me shows we have an opportunity to do better. Not only can we do revision uploads from cabal, we can factor out the validation code and put it in Cabal so users can validate revisions locally!

This would whet the appetite for making cabal-diff (from phadej/cabal-extras, a tool suite to aid Haskell development using `cabal-install`) first class (with the support from GHC, etc. that it needs), so we implement the PVP both formally and conveniently.

Change logs in revisions?

Currently I have to document what I did in the revision in the change log after the fact. This is a bit annoying, especially as my colleagues and I are always working hard to remind each other not to leave the change log for later, so as not to burden whoever cuts the release.

If we allow sdists for revisions, might we relax the rules so that the change log file is also allowed to change? I am a bit wary of my desire here — some evil project could use the change log as part of the source proper! — but perhaps there is a way we can make this safe, e.g. by somehow ensuring the change log file is not part of the source present during builds.


There’s a tool for this: hackage-cli (hackage-trustees/hackage-cli), a CLI tool for Hackage.

A modest PR to cabal proper adding some nicely worked version of this functionality would probably be welcomed.


@sclv Thanks. Is the validation logic also in that tool, or does the server just error if it is invalid?

I should say I care less about whether the CLI command is cabal or something else; what I wouldn’t like is for our two notions of versions to have such divergent implementations.

Right now, the fact that regular versions and revisions are implemented at two different times in two different ways really leaks, and obscures the fact that it should be possible to think of both in the same manner. This shows up in stuff like:

  • Why is the “main version” in the cabal file, but the revision not?
  • Why can’t I upload a whole tarball for revision 1, with the server complaining if any of the other files differ?

That we happen to store revisions via just the cabal file is, to me, just a delta encoding — an implementation detail. The invariant in the spec is:

if (v1, s1) is coherent with (v2, s2), and v1 and v2 differ only in revision, then s1 and s2 differ only in the Cabal file, and only in certain ways.

Sure, that delta encoding is enabled by the invariant, and hell, as a second line of defense after the Hackage server validates the Cabal file, it helps maintain the invariant too. But I would like to implement the spec in full generality too, not tailored to our specific flows (upload tarball vs upload cabal file vs submit cabal-file-modifying form), and then reuse that code for existing flows and new ones.
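As a sketch of what that standalone validation could look like, here is a hedged Haskell fragment. Modeling a source distribution as a map from file path to contents is my assumption, and the further check that the cabal file itself changed “only in certain ways” is deliberately elided:

```haskell
import qualified Data.Map as M

-- A source distribution modeled (simplistically) as path -> contents.
type Files = M.Map FilePath String

-- s2 is a plausible revision of s1 if both contain exactly the same
-- files and nothing outside the named cabal file has changed. Checking
-- that the cabal file changed only in permitted ways is a separate
-- validation, not shown here.
validRevisionDelta :: FilePath -> Files -> Files -> Bool
validRevisionDelta cabalFile s1 s2 =
  M.keysSet s1 == M.keysSet s2 &&
  and [ M.lookup p s1 == M.lookup p s2
      | p <- M.keys s1, p /= cabalFile ]
```

The point of writing it as a free-standing predicate is that it could live in a library where both the Hackage server and a local cabal check reuse it.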

In short, the philosophical point is: implement this stuff more generally, less attached to specific flows and to the historical accident of normal versions predating revisions, and then users will in turn get a better mental model and be less suspicious of revisions.

Not sure I am making sense yet, but hopefully that is a little clearer?

Well it depends what you mean by validation. If you mean “does this only allow valid build plans” then no, there’s no validation in the tool or the server or the client. If you mean “is this revision allowed” then the server checks.

The trustee tool can be used to check which constraints produce valid build plans: Oleg's gists - New things in Haskell package QA

Storing revisions in the cabal file is not just “delta encoding” and an implementation detail. It is precisely that cabal files are package metadata which happens to be included in tarballs, and is not part of the package proper. That’s the mental model we want people to have. There’s already too much confusion because people think that “revisions” mean packages can be revised, and not just their metadata.


Well it depends what you mean by validation. If you mean “does this only allow valid build plans” then no, there’s no validation in the tool or the server or the client.

Yeah, we don’t need to worry about build plans until we are way further along doing validations. I’m just thinking about simpler comparisons of two versions of a package for now — local or uncontextual validations, in other words. I think this is a good separation of concerns, too, rather than merely trying to grab the lower-hanging fruit first.

If you mean “is this revision allowed” then the server checks.

Gotcha

Storing revisions in the cabal file is not just “delta encoding” and an implementation detail. It is precisely that cabal files are package metadata which happens to be included in tarballs, and is not part of the package proper. That’s the mental model we want people to have. There’s already too much confusion because people think that “revisions” mean packages can be revised, and not just their metadata.

Hmm. “happen to be included in tarballs” is telling. So in your mind it’s the other way around: including the cabal file with the other files is the kludge. Right?

I think that stuff leaks in nasty ways, e.g. where a custom Setup.hs needs to read the cabal file, but I wish we could get rid of Setup.hs, so I am sympathetic.

Your mental model is closer to a relation on (revision × cabal-file × source) triples for coherence, then?

Ultimately I think we as a community should agree on something, independent of whatever historical choices we’ve made, and document it. First we should decide on what the canonical form of a package is, e.g. pulling the cabal file away from the rest of the data in the tarball if we really do want it to be separate. Then we define the various relations.

Well, think of it like a book. When someone writes a book, they tend to mean the content of the book: the chapters, the introduction, the conclusion, etc. If a missing entry were added to the index, or if the ToC were fixed because the chapter numbering was listed wrongly in it, there’s a sense in which we’d understand that “the book itself is the same”. On the other hand, it makes sense to bind books with a ToC and index, because those are useful tools in the reading of the book itself. Hence, if we’re referring to the physical book, then the ToC and index clearly are part of it, in the physical sense. But they are not necessarily part of it in the other sense – i.e. often the person who produces the index is not the one who writes the book; the person who writes the book is listed as the author, and the person the publisher hires to assemble the index is not.

Tarballs would not work as independent units of distribution of packages if they did not, in some fashion, include cabal files. But also, clearly, the cabal file is special with regard to its meaning, compared to everything else in the package. In a sense the cabal file is already “away from” the rest of the data with regard to package repositories: while an initial upload includes a cabal file, revisions precisely provide new cabal files without anything else, and further, the cabal tool knows, when fetching a package, to use the revised file and not the original. I don’t see how forcing them to be initially provided separately would do anything but cause more work.

I don’t know what you mean by “coherence relation”, btw. I would not say different revisions change the package. But also, even with the exact same (cabal-at-revision × source) pair, many different things can still affect the final build product (i.e., all compiler flags, all package flags, and all choices of such, plus versions for all transitive dependencies, etc.).

Sorry for getting back to this slowly. I was hoping not to wade deeper and deeper into the philosophical depths, but I now think that holding back isn’t helping either of us (or anyone else reading this), so I will dive right in.

Well, think of it like a book. When someone writes a book, they tend to mean the content of the book: the chapters, the introduction, the conclusion, etc. If a missing entry were added to the index, or if the ToC were fixed because the chapter numbering was listed wrongly in it, there’s a sense in which we’d understand that “the book itself is the same”. On the other hand, it makes sense to bind books with a ToC and index, because those are useful tools in the reading of the book itself. Hence, if we’re referring to the physical book, then the ToC and index clearly are part of it, in the physical sense. But they are not necessarily part of it in the other sense – i.e. often the person who produces the index is not the one who writes the book; the person who writes the book is listed as the author, and the person the publisher hires to assemble the index is not.

I get your point about bundling the metadata. Before returning to that, I think it’s good to ask, “is the cabal file actually metadata?” I think the answer is that some of it is, and some of it isn’t.

I stopped to try to find a good definition of metadata, and I decided it was some sort of purely optional context. For example, one can:

  • Read a book without reading the ToC, to take your example
  • View an image with the geotagging info stripped
  • Disable access times on most filesystems without screwing up the running machine

Under that definition, fields like the summary, description, source-repo, etc. absolutely are metadata. “Read me” or “change log” files would also be metadata. All this stuff should be safe to delete (with the exception of read-mes that are literate Haskell and wired into the cabal file as an example, and probably other less justifiable exceptions too).

But what about the rest of the cabal file? Fields like build-depends can be converted into other forms that also suffice: for example, I can do cabal build --only-dependencies, then cabal repl, scrape the CLI flags, delete the cabal file, and invoke GHCi by hand without cabal.

If build-depends isn’t actually necessary to do the thing, does that mean build-depends is metadata? I don’t think so. In particular, while I can replace that context with another, I can’t drop it entirely without the package losing much of its meaning. My view is that packages are functions of some odd sort, with build-depends (and friends) collectively specifying the type of the domain; but this is all in a Church-style, not Curry-style, system, so that stuff is absolutely normative: recovering it is not decidable.

Finally, there are fields like name and version, which today are normative because stuff like the -package-id flag depends on them. However, I wish they were metadata per my definition. In particular, I wish packages didn’t “know their own name”, and that we instead gave them content-addressed package ids, with a separate index mapping human-meaningful metadata to anonymous packages in the package database. This would imply a sort of homomorphic scheme where the data and metadata are compiled separately into the package database, such that renaming or reversioning the package wouldn’t necessitate a rebuild under simple-stupid Nix-style caching. (This last bit is tangential to the rest, but I hope a useful example.)

I guess the tl;dr is: (data, metadata) is either a regular product, or a dependent sum where the metadata depends on the data, but never a dependent sum where the data depends on the metadata, for then the metadata would be normative context. Cabal files fall into that third category unless we chop them up and triage the information inside, in which case we can do a bit better.


They do, but per what I wrote above, a cabal file rather murkily straddles the data/metadata divide. Furthermore, revisions are mainly for changing the bounds of build-depends, but build-depends is more data than metadata!

I do think revisions have structure, but since revisions change the data parts of the cabal file, I don’t think data vs metadata is the right way to elucidate that structure.


I don’t know what you mean by “coherence relation”, btw. I would not say different revisions change the package. But also, even with the exact same (cabal-at-revision × source) pair, many different things can still affect the final build product (i.e., all compiler flags, all package flags, and all choices of such, plus versions for all transitive dependencies, etc.).

OK, I should define what this notion is, and then explain why I made it that way.

The basic purpose of this relation is to formalize “is this new release I want to make correct?”, where a release could be a whole new version to upload, or a mere revision.

Per the part above, I find trying to divide a package into parts very fraught, so I would want the initial analysis to be rather hands-off, treating things whole and rather opaquely. I think I will double down on that: rather than saying (version, source), per my wish that versions were really metadata, I should instead just project the version from a unitary source.

My basic idea is rules of the form:

version(source_a) SomeNotionOfApartness version(source_b)
source_a SomeNotionOfSimilarity source_b
---------------------------------------------------------
source_a CoherentWith source_b

the idea being that the further apart one can demonstrate the versions to be, the less similar the sources need be. That is basic covariance, but I state it in this funny way to show that one ought to find the biggest gap between the versions they can, so that the hard second premise becomes easier.
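The rule template can be written down directly, with the version projected from a unitary source as just suggested. Everything here (the record, the particular apartness and similarity measures) is an assumption for illustration, not a proposal for actual cabal code:

```haskell
-- A unitary source from which the version is merely projected.
data Pkg = Pkg { version :: [Int], files :: [String] }

-- SomeNotionOfApartness on the projected versions: more significant
-- differing components count as "further apart".
apartness :: Pkg -> Pkg -> Int
apartness a b =
  length (dropWhile id (zipWith (==) (version a) (version b)))
  + abs (length (version a) - length (version b))

-- SomeNotionOfSimilarity, phrased as a distance: differing file count.
dissimilarity :: Pkg -> Pkg -> Int
dissimilarity a b =
  length (filter not (zipWith (==) (files a) (files b)))
  + abs (length (files a) - length (files b))

-- The conclusion of the rule: further-apart versions license less
-- similar sources.
coherentWith :: Pkg -> Pkg -> Bool
coherentWith a b = dissimilarity a b <= apartness a b
```

Note that the rule really is context-free here: coherence is decided from the two packages alone, with no build plans in sight.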

One thing you may notice is that there are no contexts in the “rule template” above. The revision rules actually do work without context. The PVP ones are very tricky, because naively the domain of build plans (and thus contexts) in which the two sources’ interpretations as Haskell packages are both admissible (and thus must be compatible) is open-world and “grows” as new versions are released! One must therefore be very careful not to end up with nonsense impredicativity!

I am going to look at your arXiv papers on formalizing packages now to get a refresher on where you stand.

I came into this thread with a specific idea, but let me take a step back. I think our general strength with FP is the unity of theory and practice.

Practitioners understand that keeping up with theory helps generalize one’s code, making better libraries that make us more productive the next time we pursue some end goal. Theoreticians understand that we can only build intuitions for so many “abstract nonsense” definitions by playing around with them in programs and theorem provers — that there are in fact hands-on skills to develop in tandem with the core academic work of reading, writing, and thinking.

With package management, however, we are at our weakest. There is little established theory for how to think about this stuff (I would say the existing literature is mainly a few scattered attempts by people venturing in from various home fields). A lot of it is quite basic, and those authors (to my knowledge) are not engaging with richer work like @sclv’s own. Meanwhile, the practitioners are also not really engaged with much of any of this stuff, and the codebases, having passed through many hands without much unified theory, are a cacophony of different accumulated ideas only sometimes in harmony.

I really want to close that gap. Now that the “Cabal vs Stack” wars have cooled down, and the immediate problems are dealt with, I think this is the long next step.

Again, I came in with a specific idea that I thought would be a good, uncontroversial starting point. I see now it too is just the tip of an iceberg of unreconciled theory and practice, and thus it is far from clear it is a good idea at all, let alone a good first step.

I wrote a bunch of words here, but if it’s not too much to read, I think it would be great to keep on discussing the philosophy of this stuff (perhaps more productively than after Haskell meetups in loud places in real time :)).


Just wanted to say that I appreciate this thoughtful analysis, thanks @Ericson2314.


For those curious, here are the arxiv papers referred to: [2107.01542] The Semantics of Package Management via Event Structures and [2004.05688] The Topological and Logical Structure of Concurrency and Dependency via Distributive Lattices

And here is a recent talk at PackagingCon that gives a very high-level overview of some of this: An Invitation to Order-Theoretic Models of Package Dependencies - Gershom Bazerman | PackagingCon - YouTube


It might help to consider the situation before Cabal existed. From the compiler’s point of view, all of the .cabal file is metadata. Nothing stops you from using -package options on the command line to specify the exact dependencies, assuming you solved for the exact versions you need and installed them yourself.

You can even go further and completely erase the notion of packages: just unpack the source code of every dependency into the same directory as the package you intend to build. In fact, it might be fun to have a cabal get --with-dependencies command to prepare the ground automatically.