[Well-Typed Blog] Anonymous or large records with OverloadedRecordDot and OverloadedRecordUpdate

https://well-typed.com/blog/2023/03/overloaded-anon-large-records/

8 Likes

Thanks Adam.

I was alarmed to see the talk of “the quadratic compilation time of standard records” (and quadratic code size). I must have missed Edsko’s blog post at the time.

So “standard records” there means H98 data decls using record syntax, irrespective of allowing duplicate field names(?)

... prevent GHC from generating field accessors (one of the many sources of quadratic code), but we can still construct and pattern match on these records in a normal way.

Does option -XNoFieldSelectors avoid that blowup? That is, using pattern matching via field names to select fields? Like:

case someR of
    MkR{f00, f01, f02, f03, ...} -> ...

Does it help to name on the LHS only the labels you’re going to use on the RHS?

1 Like

So “standard records” there means H98 data decls using record syntax, irrespective of allowing duplicate field names(?)

Yes. The problem arises simply with

data T = MkT { f1 :: T1, ..., fn :: Tn }

because that generates n record selector functions each of which is of size n (in Core). The general problem is pretty much independent of name resolution.
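
For a concrete (hypothetical) four-field example, the generated selectors are morally equivalent to writing these by hand. In Core each selector’s case alternative binds all n fields, so n selectors give O(n²) code in total:

-- Hypothetical four-field record; GHC effectively generates one
-- selector per field, each matching every field of the record.
data T = MkT Int Bool Char String

f1 :: T -> Int
f1 (MkT x _ _ _) = x

f2 :: T -> Bool
f2 (MkT _ x _ _) = x

f3 :: T -> Char
f3 (MkT _ _ x _) = x

f4 :: T -> String
f4 (MkT _ _ _ x) = x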

Does option -XNoFieldSelectors avoid that blowup?

Unfortunately not, because the selector function is still generated internally (hence the trick mentioned in the post of introducing an existential type to prevent it from being generated).
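
A minimal sketch of the underlying mechanism (not the exact code from the post): GHC will not generate a selector for a field whose type mentions an existentially quantified variable, so tagging every field type with an existential phantom suppresses the selectors entirely:

{-# LANGUAGE ExistentialQuantification #-}

-- 'x' is a phantom parameter; wrapping each field type in 'Wrap x'
-- makes the field types mention the existential, so GHC omits the
-- selectors for f1 and f2.
newtype Wrap x a = Wrap a

data R = forall x. MkR
  { f1 :: Wrap x Int
  , f2 :: Wrap x Bool
  }

-- Construction and pattern matching still work as normal:
mkR :: Int -> Bool -> R
mkR a b = MkR (Wrap a) (Wrap b)

getF1 :: R -> Int
getF1 (MkR (Wrap a) _) = a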

That is, using pattern matching via field names to select fields? Like:

case someR of
    MkR{f00, f01, f02, f03, ...} -> ...

Does it help to name on the LHS only the labels you’re going to use on the RHS?

The Core representation will be linear here anyway, I think. I doubt it makes much difference whether you name all the fields or not.

1 Like

Ooh err. By coincidence, I just posted a question about SML’s ‘anonymous records’, which are free-standing, i.e. not declared via a data declaration.

Do you know whether they generate field selector functions for every record type appearing in the program that mentions a given field? Do they suffer the same bad blowup?

By contrast, I can see Hugs/TRex doesn’t: for a field selector you must use an explicit # prefix (a kind of macro) to get a function spelled #f99. And because those records are extensible, that generates

\(f99 = f99 | _) -> f99

in which the | _ denotes the ‘tail’ of the record, i.e. all the other fields.
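
For concreteness, a small TRex example as I recall it working under Hugs (from memory, so treat the details with caution):

type Point = Rec (x :: Double, y :: Double)

norm :: Point -> Double
norm p = sqrt (#x p * #x p + #y p * #y p)

origin :: Point
origin = (x = 0.0, y = 0.0)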

[Of course Hugs is an interpreter, so it only generates code for the functions your code actually calls.]

1 Like

Just to point out, the selector functions are only one of many, many reasons why traditional records lead to quadratic code and compilation times, and probably one of the more minor reasons. Avoiding them is by no means the end of the story.

1 Like

Thanks Edsko. (And should we be discussing your quadratic post somewhere else, rather than distracting from Adam’s?)

Well I’ve never liked/never wanted/don’t use selector functions. Indeed I’ve always looked on the record form of H98 data decls as a stopgap design, waiting for something better – there are heaps of ideas.

  • Can we gain some optimisation if we know the datatype has only a single constructor? (I guess that would usually be the case for records with 20 or more fields.)
  • Specifically, in your post you switch to talking about ‘field accessors’, with this example (pseudo-Haskell) for constraint dictionaries:
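
(From memory of the post, roughly; the d01 … d99 names are purely illustrative:)

\(MkDict d01 d02 ... d99) -> d42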

That’s like what gets generated for a field selector function; but it’s internal to the object code. You seem to be saying it’s as bad as a selector function(?)

Why does that bother to give a name to all the unused positions? Could it just use _ instead of d01 etc. and avoid the overhead?

1 Like

Both large-anon and large-records only deal with records, i.e. single-constructor types. The general case of “large ADTs” I’ve not even begun to think about :-)

And yes, the situation with records and type classes is identical at the Core level. Unfortunately, simply leaving out the unused variables is not an option in Core; that would require a change to the Core language itself. It would be easy enough to do for record field access, but much harder for record field update. And even if there were a solution, you’d then have to convince the GHC team that it’s worth it; changes to Core affect the entire compiler and are kept to an absolute minimum.
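
To make the access/update asymmetry concrete (a hypothetical four-field record; the Core fragments in the comments are sketches, not GHC’s actual output):

data T = MkT Int Bool Char String

-- Access: surface Haskell can use wildcards, but the Core case
-- alternative still names and types all four binders, roughly:
--   case r of MkT (x1 :: Int) (x2 :: Bool) (x3 :: Char) (x4 :: String) -> x3
getF3 :: T -> Char
getF3 (MkT _ _ x _) = x

-- Update: every field is mentioned twice, once as a case binder and
-- once as a constructor argument, because the whole record must be
-- rebuilt. A hypothetical projection primitive could shrink access,
-- but an update has to rebuild the constructor regardless.
setF3 :: Char -> T -> T
setF3 v (MkT x1 x2 _ x4) = MkT x1 x2 v x4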

1 Like

Thank you again. No, it’s waaay above my pay grade to be looking for a ‘solution’. I’m merely trying to understand the challenges.

But isn’t field update just as bad whether you go via named fields or purely positional? You have to pattern match on all the positions, and build a new vector in memory, copying most positions.

1 Like

I was talking mostly about the compilation-time cost (all of those variables must be mentioned and typed), but yes, you are right of course that there is also a runtime cost to such an update. Standard records in GHC, as well as the records implemented by large-records, have this behaviour: the entire record is copied on every update. In large-anon, however, this is not the case, because its internal representation is hybrid: it uses a vector of fields along with a HashMap of updates to that vector (with indices computed at compile time), which gets applied every so often. See large-anon: Practical scalable anonymous records for Haskell for details.
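
In outline, the idea is something like this; a sketch of my own of the data structure just described, not large-anon’s actual implementation (FieldValue stands in for its untyped field slots, and the flush heuristic is illustrative):

import Data.HashMap.Strict (HashMap)
import qualified Data.HashMap.Strict as HashMap
import Data.Vector (Vector)
import qualified Data.Vector as V

-- One untyped field slot; in reality something like Any.
type FieldValue = Int

-- A record is a base vector of fields plus a small map of pending
-- updates, keyed by field index.
data Record = Record
  { recFields  :: !(Vector FieldValue)
  , recPending :: !(HashMap Int FieldValue)
  }

-- Reads consult the pending updates first, then the base vector.
getField :: Int -> Record -> FieldValue
getField i (Record fs p) = HashMap.lookupDefault (fs V.! i) i p

-- Writes are O(1) map inserts; once enough accumulate, they are
-- flushed into a fresh vector in a single pass.
setField :: Int -> FieldValue -> Record -> Record
setField i v (Record fs p)
  | HashMap.size p' * 2 >= V.length fs =
      Record (fs V.// HashMap.toList p') HashMap.empty
  | otherwise = Record fs p'
  where
    p' = HashMap.insert i v p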

1 Like