When is a module too big? When is a module too small?

TLDR just answer the questions in the title

I am going to try and answer myself but you will see my success is modest.

why modules

Haskell has modules. Why? I know for sure of two reasons:

  • (parallelization) Modules that do not depend on one another will be compiled by GHC in parallel, speeding compilation up on multi-core machines.
  • (caching) If a module’s transitive and reflexive dependencies have not changed, it will not be recompiled by GHC, saving compilation time.

module boundaries and optimization

I recall hearing about optimizations working differently within and across modules. I understand that, for inlining to happen across modules, the source code of the function to be inlined has to be made available in the interface file. But I am not sure how important this is: if GHC thinks a definition should be inlined, there is nothing to stop it from putting the source code of that definition into the interface file.
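
For what it is worth, the explicit way to get that behaviour is an INLINABLE (or INLINE) pragma, which asks GHC to keep a definition’s unfolding in the interface file. A minimal sketch, with an invented module and function:

module Numeric.Util (square) where

-- The pragma records the unfolding of square in Numeric/Util.hi,
-- so call sites in other modules can inline and specialise it.
{-# INLINABLE square #-}
square :: Num a => a -> a
square x = x * x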

Also, there is the question of complexity — perhaps on bigger modules some optimizations that happen to have super-linear complexity will run visibly slowly. But I do not know any specifics — I only have the general idea of the possibility.

module boundaries and build product size

I am not sure how this works, actually. One idea I have is that GHC makes one object code file per module, and then asks ld (or another linker) to put them together and bolt the runtime system on top. Plausibly a linker can skip those modules that are not actually needed. Does it do that?

I recall the project fragnix. As I understand, what it does is put every definition into its own module, so that each definition is compiled into its own object code file. They claim «lower binary size and compilation time» as expected advantages.

considering extremes

We can approach this question theoretically. Say we have a code base of arbitrary size. There are two extremes in which we may represent it as a hierarchy of modules:

  • (the coarsest partition) Everything in one big module.
  • (the finest partition) Each definition in its own module.

What trade-offs shall we observe?

I have some big modules at hand (thousands of lines, hundreds of definitions) and they compile visibly slowly. But I cannot say whether this is because of super-linear complexity or simply because every single definition, with derived and generic instances on top, is hard to compile by itself, and the costs add up linearly.

I also have small modules and they do compile fast and only when transitive and reflexive dependencies have changed. I get the most from both parallelization and caching. And I do not see any downsides — aside from the inconvenience of opening and closing files and the inefficiency of the underlying file system layout, which I neglect as not Haskell-specific. Are there any Haskell-specific downsides I should be aware of?

For example, who will suffer if we take Data.Text (about two thousand lines of code) and split it into small modules holding one function each? Will inlining suffer much? As a middle ground, what if we keep together those functions that should inline into one another, but otherwise split as finely as possible?

coupling and cohesion

As the object-oriented school of thought teaches us, some definitions belong together. Since we do not have objects with state, we cannot group definitions by the state they share. Instead, applied to Haskell, we may study the dependency graph between definitions and try to partition it optimally.

  • Surely mutually dependent definitions belong in one module (see the sketch after this list). This is a requirement of GHC that can be side-stepped with hs-boot files, but I have never had a case where side-stepping it was needed.
  • Instances belong either with the class (if it is a custom class) or with the type (if it is a standard class). This protects the code base from orphan instances.
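
A minimal sketch of the first point, with invented names: these two definitions call each other, so GHC will not accept them split across two modules (short of hs-boot files), and they naturally land in the same one.

module Parity (isEven, isOdd) where

-- isEven and isOdd call each other, so they form a cycle in the
-- dependency graph and have to be compiled together.
isEven :: Int -> Bool
isEven 0 = True
isEven n = isOdd (n - 1)

isOdd :: Int -> Bool
isOdd 0 = False
isOdd n = isEven (n - 1)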

There are no other hard requirements. Can we find other precise reasons to put definitions together?

Maybe we can start like so: take the transitive reduction of the dependency graph (its «skeleton»). For example, if A depends on B and C, and B also depends on C, then the direct edge from A to C is dropped as redundant. What we get is a comfortable and unambiguous presentation of our dependency graph. What can we do with it?

We can «fold leaves». A definition without dependencies that only one other definition depends on (a leaf) can be put into the same module as its dependent, maybe even into a where block. But should this always be done? Applied iteratively, this rule will gather all the outer trees of our dependency graph into the nodes on which at least two other definitions depend. This can make big modules, which defeats the purpose of the module system.
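
A tiny made-up sketch of folding a leaf:

module Fold (total) where

-- step has no dependencies and total is its only dependent, so instead of
-- being a top-level definition (or its own module) it can live in a where block.
total :: [Int] -> Int
total = foldr step 0
  where
    step x acc = x + acc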

However, we can also make the opposite argument: if two definitions are not both dependencies of some other definition, they certainly should not be in the same module. This puts formal ground under «Do you have a giant `Types` module in your projects?».

More generally, we can «fold chains». If X depends only on Y, we can merge them. They will have to be compiled one after the other anyway. We lose caching, but we do not lose parallelization because we did not have any to begin with. But once again, should this always be done?

The answer is that we need another criterion for when to fold things and when to keep them laid out. For example, if Y takes a long time to compile, we should not fold it into X, so that we retain caching.

programmer experience

There are these two simple observations:

  • It is easy to review definitions jointly when they are on the same screen. The easiest way to have stuff on the same screen is to put it into the same file.
  • Scrolling is painful. So, modules should be small enough to fit on the screen.

These considerations together suggest that we should fold stuff into a module while it fits on the screen and then stop folding. This sounds like a silly criterion but for me both files that take more than two screens and files that are 10 lines long are painful to work with. Maybe I need to learn some editor tricks?

other languages

  • In Java and C# the standard is an exact one-to-one correspondence between classes and files: each class gets a file of its own. I am not sure whether every source file gets its own object code file. Overall, they go the way of the finest partition.
  • In Vue they put markup, style and code for the same «thing» in the same file, and it seems to be clearly defined what a «thing» is, though I do not quite grasp it. So, a kind of «vertical integration». I think they do it for better programmer experience.

My experience with languages other than Haskell is tiny so I cannot say much here.

so

So, I have no idea what to do with my code! Please help me out!

12 Likes

Lately I’ve been working on a hobby project and I reached the following conclusion: split your modules in an LSP-friendly way. This only applies to the Data layer of your application.

Maybe not the best advice, since many don’t use HLS, but I think it is a good rule of thumb. Roughly speaking, LSP-friendliness could be defined by:

  • Your module must have one record (let’s name it the principal type), and the module name should be the same as the record’s.
  • Your module may define more types (let’s name them secondary types) if and only if:
    • They are enumerations used in one field of your principal type
    • They are type synonyms (type keyword) used in one field of your principal type
    • They are newtypes (newtype keyword) used in one field of your principal type
  • Your principal type can have fields of types (ideally principal types) from other modules (let’s name them external types).
  • Functions in your module must:
    • take secondary types or external types as arguments and return the principal type (constructor functions)
    • take the principal type as an argument and return a secondary type or external type (getter functions)
    • use the principal type and/or any secondary type at any position, combined with basic types like Int, [SecondaryType], Bool, etc… (utils functions; things like length :: Principal -> Int or isABC :: Secondary -> Bool, etc…)
    • not take principal and external types together as arguments; that would belong to a different module in the application layer (using hexagonal terminology)

The reason to do that is that you can import the module qualified and get an OOP feel: write the name of the module, press <tab>, and the LSP offers completions as if the type were a class. Maybe an example explains it better.

-- file src/Data/Weapon.hs
-- This is an example of how to apply the principles above
module Data.Weapon (Weapon, ...) where

import Data.Text (Text)

-- Secondary newtypes are allowed, as long as they are used in the Principal type
newtype Attack = Attack Int deriving ...
newtype Defense = Defense Int deriving ...
type WeaponName = Text

-- Smart constructor for Attack. Takes a basic type, Int,
-- and produces a Secondary type, Attack
mkAttack :: Int -> Attack
mkAttack x = Attack (if x < 0 then 0 else x)

-- same as before
mkDefense :: Int -> Defense
mkDefense = ...

-- Principal type.
data Weapon = Weapon {attack :: Attack, defense :: Defense, name :: WeaponName}

-- A constructor for the Principal type.
-- Takes a Secondary type and produces the Principal type
basicWeapon :: WeaponName -> Weapon
basicWeapon n = Weapon (mkAttack 1) (mkDefense 0) n

-- displayName takes the Principal type and produces a WeaponName.
displayName :: Weapon -> WeaponName
displayName (Weapon (Attack a) (Defense d) n)
  | a - d > 10 = brightRedText n  -- suppose redText and brightRedText exist in some library
  | a - d > 5  = redText n
  | ...
-- ^ Notice this function is a candidate for refactoring: if the project grows,
-- it is very likely you want to distinguish between WeaponName and DisplayName.
-- Hence, sooner or later displayName will be of type Weapon -> DisplayName.
-- But right now DisplayName isn't a field type of the Principal type Weapon, hence
-- the module would become too big.
-- file: src/Data/Player.hs
-- This is an example of how the use of small modules leads to a better experience.
{-# LANGUAGE ImportQualifiedPost, OverloadedRecordDot, OverloadedStrings #-}
module Data.Player where

import Data.Text (Text)

-- Importing the external type qualified leads to better readability and better UX when using an LSP
import Data.Weapon (Weapon)
import Data.Weapon qualified as Weapon

-- Secondary types
type Lifes = Int
type PlayerName = Text

-- Principal type. The field name `name` also exists in Data.Weapon, but the
-- collision doesn't matter here because Data.Weapon is imported qualified.
data Player = Player {weapon :: Weapon, lifes :: Lifes, name :: PlayerName}

playerCombat :: Player -> Player -> (Player, Player)
playerCombat p1 p2 =
  let
    -- Notice the use of the Weapon module in an OOP-class style.
    -- (This assumes Weapon.defense and Weapon.attack yield something that can be
    -- combined with Int, e.g. by unwrapping the newtypes or deriving Num.)
    p1' = p1{ lifes = p1.lifes + Weapon.defense (p1.weapon) - Weapon.attack (p2.weapon) }
   in ...

displayName :: Player -> PlayerName -- Again notice the OOP style
displayName player = name player <> " / " <> Weapon.displayName (weapon player)
-- ^ Notice that if I have something of type Weapon (or a typed hole of such a type),
-- it is very easy to remember that I can type Weapon.<tab> and get LSP recommendations
-- for the type: constructors, getters and utils.

So… a module is too big if it breaks the above rules, and too small if you have different modules which can be merged under the above rules.

Of course, this is just advice, but I think it is pretty good (given my limited experience).

8 Likes

I’ve found there’s a rough taxonomy of modules. So I don’t think you can make global rules, but you could probably have rules of thumb based on what category the module seems to fit into.

For a module oriented around a single type, like a Data.Whatever, it has to be defined at the level at which it can usefully hide things. So it has to have at least enough stuff to define a useful API without exporting internal constructors. In my code that’s not all that common, or at least it probably shouldn’t be: 100 functions on 10 types rather than 10 functions on 100 types, right? If it winds up being too big, you can work around that with Internal modules and a facade that re-exports from them (sketched below). But it’s just convention; Haskell has only one level of visibility. Well, Cabal has another level with private modules, but it’s clunky. Java has package-private but people don’t seem to like it. Monorepo-oriented build systems also tend to have another visibility layer, but also with spotty adoption. Maybe one level is enough.
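
A minimal sketch of that Internal-plus-facade pattern, with invented module names: the Internal module exports everything, and the facade re-exports only the curated API.

-- Data/Thing/Internal.hs: no export list, so everything is visible,
-- including the real constructor.
module Data.Thing.Internal where

newtype Thing = Thing Int

mkThing :: Int -> Thing
mkThing = Thing . abs

-- Data/Thing.hs: the facade; the Thing constructor stays hidden
-- from ordinary importers.
module Data.Thing (Thing, mkThing) where

import Data.Thing.Internal (Thing, mkThing)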

Otherwise, a module can have circular references internally, but you can’t have circular imports, so a module can only get as small as the smallest loop, without playing with hs-boot (sketched below). That means you wind up with modules defining collections of types which refer to each other. But putting all the functions inside as well probably makes it too big, so you lose the export list and have to export everything. This is an awkward trade-off, hence the hs-boot hack.
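
For reference, a minimal made-up sketch of the hs-boot mechanism: an A.hs-boot file repeats just enough of A’s declarations for B to compile against, which is what breaks the import cycle.

-- A.hs-boot: a stripped-down declaration of A.
module A where

data T
describe :: T -> String

-- B.hs: imports the boot file instead of A itself.
module B where

import {-# SOURCE #-} A (T, describe)

report :: T -> String
report t = "B says: " ++ describe t

-- A.hs: the real module, which may now import B freely.
module A where

import B (report)

data T = T String

describe :: T -> String
describe (T s) = s

summary :: T -> String
summary = report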

If a module is fulfilling an interface, like say exporting functions :: [Function], then it’s easy to define module boundaries: the boundary is whatever lets you export just that (a sketch follows). It’s good to find and establish regular structures like that.
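
A hypothetical sketch of such an interface-shaped module (the Function type and the module name are invented; in a real project the type would live in a shared module):

module Plugin.Strings (functions) where

import Data.Char (toUpper)

-- Defined inline only to keep the sketch self-contained.
data Function = Function { fnName :: String, fnRun :: String -> String }

-- The module boundary is simply whatever is needed to export this list.
functions :: [Function]
functions =
  [ Function "upper"   (map toUpper)
  , Function "reverse" reverse
  ]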

Beyond that are “tools” modules which are many small functions which are all exported, and the main consideration is that they shouldn’t unduly couple dependencies, so they wind up being grouped by the set of modules they transitively import.

How to group modules is tied up with architecture and design, which is to say complicated and subtle and probably has no satisfying general answers, only many ad-hoc rules of thumb.

In practice, I wind up with 200-400 lines, and if they get much larger it’s because of circular dependencies.

5 Likes

Your “two reasons” are pretty peripheral to Haskell’s design. At best they’re convenient consequences of the reasons why Haskell has Modules. Most of your reasoning looks like premature optimisation: worry more about your software as an artefact others will need to read and understand; worry less about the compiler.

Modules are namespaces. The most likely artefacts in your code that want namespace protection are (data) types and the methods/classes that operate on them, as other answers have pointed out (and as you give under ‘coupling and cohesion’). Those names are global, so they might clash with names from other Modules somebody is importing alongside your Module.
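
A small illustration using two real modules from the containers package: both export a function named insert, and qualified imports are what keep those clashing global names apart.

module NamespaceDemo where

import qualified Data.Map as Map
import qualified Data.Set as Set

-- Both modules export insert (and empty, member, ...); the module
-- namespace disambiguates them.
demo :: (Map.Map String Int, Set.Set Int)
demo = (Map.insert "answer" 42 Map.empty, Set.insert 42 Set.empty)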

Then your Module should declare all names that are tightly referentially coupled in the application domain. Where ‘declare’ entails also giving their definition/code. A Module might be small because it’s merely giving a name wrapper round a bunch of FFI calls. That’s not “too small”. Your Module might be big to avoid arbitrarily chopping it up and needing many cross-Module linkages. That’s not “too big”.

5 Likes

Can someone explain what’s bad about hs-boot files? (Serious question.) I’ve only used them a few times, very lightly, so I don’t have a lot of experience. But within that experience they’ve always done what I want with a minimum of fuss. I like to organize my modules by topic to make the structure of a program easy to understand, which occasionally results in circular module references. Trying to avoid the circular references would just mean creating an awkward module organization just to satisfy the computer, which seems worse.

2 Likes

For me, modules are also mental aids to separate ideas/processes/datatypes/… There is not much worse than an “I can do it all” module, is there? :smiley:

2 Likes

From “refreshing the memory banks” here:

…there are two major complaints about bootstrap-modules:

  • they go against the DRY principle;
  • determining which module in a cycle should have one isn’t simple.

Another crucial factor is the size of the cycle: as you mentioned, they’re not really much of a problem for small cycles; it’s the larger ones which can be challenging, particularly if there are cycles of cycles…

(…as mentioned here.)

1 Like

This old blog post is relevant: Keeping Compilation Fast

Notably it reports pretty significant improvements from greater parallelism due to module splitting. YMMV.

It’s also worth noting that the next Cabal release will support the new semaphore parallelism control in GHC 9.8, which will hopefully make quite a big difference to cabal users. (In the mean time I recommend adding ghc-options: -j4 to your cabal.project… it has a big impact!)

2 Likes

For me, the fact that modules are compilation units is not the main point. I don’t even think the fact that they group related definitions together is the main point. In my opinion, the point of a module is encapsulation. It gives you a way to build a proper abstraction, not merely a collection of related functionality.

3 Likes

For me, the main reason is that circular references are hard to understand. One of my first ways to understand a program is to find out the big chunks of who lives above whom, and a loop disrupts that. So from a design point of view I’d be satisfied if the structure were simple or clear due to other reasons, but the default is that loops are hard to understand.

From a build point of view, they seem equivalent to copy-pasting everything into one module. Except they make it easier to do that kind of mass copy-paste: you can break the loop with a minimal number of declarations, but you were not forced to actually minimize the amount of code caught up in the loop, in the way you might be motivated to if you actually had to physically copy-paste it.

So I think my main objection is to the incentives, and as much as I dislike messing with stuff to avoid loops, the pain is the pain of being forced to modularize, which will be paid either now or later with interest.

I did try hs-boot for a while to allow a big awkward loop, and ghci load times got long and confusing. It might have simply been the “everything copy-pasted into one module” thing, so not really any worse than one big module, but it was confusing because it messed up my intuitions about what should reload when I changed something.

That said, I’m not exactly satisfied with being forced to juggle things around to keep them directed and acyclic, so I’m not categorically against hs-boot, just wary. Perhaps something intrinsically fine-grained like Unison would be better, or maybe it would just make it easier to make even worse spaghetti. I gather Rust slices things up a bit differently, but I haven’t had enough experience with it to have an opinion. Java allows loops via dynamic importing; Python sort of does too, as long as you enjoy weird initialization bugs. I feel like both of those are traps in the long run. C++ allows hs-boot everywhere via forward declarations; I’m not so sure about that either. But I don’t have enough experience.

2 Likes

This reminds me that, when revamping how GHC emits errors, hs-boot files were considered, but ultimately discarded in favor of a solution without circular dependencies. I think it was the right call.

3 Likes

Thanks for the reply. It kinda sounds like hs-boot files (and circular module dependencies) are sort of a smell that may indicate bad modularization in the first place. Not necessarily bad on their own, but may be reached for in dubious circumstances.

Sort of like what was said in the proposal @danidiaz linked:

but what you suggest is IMO a step in the right direction, to make GHC more modular.

Yeah I think it’s less that cycles are bad, than that modularity is good.

Yes, that’s a good way to put it. I suppose the limitations around modules come from separate compilation, which came from efficiency… otherwise every compiler would be whole-program (though there is a movement in the C/C++ world toward mashing everything into one giant file). But it turns out that humans like some of the same things that compilers like, such as the ability to forget about A while understanding B.

You could also understand modules in general that way: they’re the intersection between assistance for the compiler (unit of compilation), assistance for the human (logical grouping), and design (API boundaries, visibility).

1 Like