TL;DR: just answer the questions in the title.
I am going to try to answer it myself, but as you will see, my success is modest.
why modules
Haskell has modules. Why? I know for sure of two reasons:
- (parallelization) Modules that do not depend on one another can be compiled by GHC in parallel (with `ghc --make -j`, as far as I know), speeding compilation up on multi-core machines.
- (caching) If a module and the modules it transitively depends on have not changed, it will not be recompiled by GHC, saving compilation time. (A small sketch of both follows this list.)
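To make these two points concrete, here is a minimal sketch (the module names are made up): two modules that do not import one another, plus a Main that uses both. As far as I know, `ghc --make -j` can then compile A and B at the same time, and touching only B.hs leaves the build products of A alone on the next rebuild.

```haskell
-- file A.hs
module A (greeting) where

greeting :: String
greeting = "hello"

-- file B.hs
module B (answer) where

answer :: Int
answer = 42

-- file Main.hs
module Main (main) where

import A (greeting)
import B (answer)

-- A and B do not depend on each other, so they can be compiled in parallel;
-- each is also recompiled only when it or its own dependencies change.
main :: IO ()
main = putStrLn (greeting ++ " " ++ show answer)
```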
module boundaries and optimization
I recall hearing that optimizations work differently within and across modules. I understand that, for inlining to happen across modules, the definition (the «unfolding») of the function to be inlined has to be made available in the interface file. But I am not sure whether this matters in practice — if GHC thinks a definition should be inlined, there is nothing to stop it from putting that definition into the interface file.
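For what it is worth, here is what the pragmas look like, with invented names. As far as I know, GHC already records the unfoldings of small functions in the interface file on its own (and `-fexpose-all-unfoldings` records all of them); the pragmas below just make the choice explicit.

```haskell
module Sieve (smallPrimes, isTiny) where

-- INLINE: the unfolding goes into Sieve.hi and GHC inlines the function
-- aggressively at call sites, including call sites in other modules.
{-# INLINE isTiny #-}
isTiny :: Int -> Bool
isTiny n = n < 10

-- INLINABLE: the unfolding is kept in the interface file so that
-- cross-module inlining and specialisation remain possible, but GHC still
-- decides case by case whether to actually inline.
{-# INLINABLE smallPrimes #-}
smallPrimes :: Int -> [Int]
smallPrimes limit = [ p | p <- [2 .. limit], isPrime p ]
  where
    isPrime p = all (\d -> p `mod` d /= 0) [2 .. p - 1]
```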
Also, there is the question of complexity — perhaps on bigger modules some optimizations that happen to have super-linear complexity will run noticeably slowly. But I do not know any specifics — I only have a general sense that this is possible.
module boundaries and build product size
I am not sure how it works, actually. One idea I have is that GHC makes one object code file for each module and then asks ld (or another linker) to put them together and bolt the runtime system on top. Plausibly a linker can skip those modules that are not actually needed. Does it do that?
I recall the project fragnix. As I understand it, what it does is put every definition into its own module, so that each definition is compiled into its own object code file. They claim «lower binary size and compilation time» as expected advantages.
considering extremes
We can approach this question theoretically. Say we have a code base of arbitrary size. There are two extremes in which we may represent it as a hierarchy of modules:
- (the coarsest partition) Everything in one big module.
- (the finest partition) Each definition in its own module.
What trade-offs shall we observe?
I have some big modules at hand (thousands of lines, hundreds of definitions) and they compile visibly slowly. But I cannot say whether this is because of super-linear complexity or simply because every single definition, with derived and generic instances on top, is hard to compile by itself, and the costs add up linearly.
I also have small modules, and they do compile fast, and only when something they transitively depend on has changed. I get the most from both parallelization and caching. And I do not see any downsides — aside from the inconvenience of opening and closing files and the inefficiency of the underlying file system layout, which I neglect as not Haskell-specific. Are there any Haskell-specific downsides I should be aware of?
For example, who will suffer if we take Data.Text (about two thousand lines of code) and split it into small modules holding one function each? Will inlining suffer much? As a middle ground, what if we keep together the functions that should inline into one another, but otherwise split as finely as possible?
coupling and cohesion
As the object-oriented school of thought teaches us, some definitions belong together. Since we do not have objects with state, we cannot group definitions by the state they share. Instead, applied to Haskell, we may study the dependency graph between definitions and try to partition it optimally.
- Surely mutually dependent definitions belong in one module. This is a requirement of GHC that can be side-stepped (with .hs-boot files), but I have never had a case where side-stepping it was needed. (A small example follows this list.)
- Instances belong either with the class (if it is a custom class) or with the type (if it is a standard class). This protects the code base from orphan instances.
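A tiny illustration of the first point, with invented names: these two definitions refer to each other, so they have to live in the same module; splitting them apart would require an .hs-boot file to break the resulting import cycle.

```haskell
module Parity (isEven, isOdd) where

-- Mutually recursive definitions (for non-negative inputs): neither can be
-- moved to another module without an .hs-boot file to break the import cycle.
isEven :: Int -> Bool
isEven 0 = True
isEven n = isOdd (n - 1)

isOdd :: Int -> Bool
isOdd 0 = False
isOdd n = isEven (n - 1)
```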
There are no other hard requirements. Can we find other precise reasons to put definitions together?
Maybe we can start like so: transitive reduction ∘ skeletalization, that is, collapse mutually dependent definitions into single nodes and then drop the edges implied by longer paths. What we get is a compact and unambiguous presentation of our dependency graph. What can we do with it?
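Here is a rough sketch of what I mean, under some assumptions: the dependency graph is given as a map from each definition to the set of definitions it mentions, «skeletalization» is taken to mean collapsing mutually dependent definitions with Data.Graph.stronglyConnComp, and the transitive reduction is the naive one that only works on acyclic graphs. The names and the representation are made up.

```haskell
module DependencySketch where

import Data.Graph (SCC (..), stronglyConnComp)
import qualified Data.Map.Strict as Map
import qualified Data.Set as Set

type Name = String

-- Each definition, mapped to the set of definitions it refers to.
type Dependencies = Map.Map Name (Set.Set Name)

-- «Skeletalization»: collapse mutually dependent definitions into one group.
-- Each group has to end up in a single module anyway.
skeletalize :: Dependencies -> [[Name]]
skeletalize deps =
  [ case scc of
      AcyclicSCC name -> [name]
      CyclicSCC names -> names
  | scc <- stronglyConnComp
      [ (name, name, Set.toList ds) | (name, ds) <- Map.toList deps ]
  ]

-- Naive transitive reduction of an acyclic graph: an edge x → y is kept only
-- if y is not reachable from some other direct dependency of x.
transitiveReduction :: Dependencies -> Dependencies
transitiveReduction deps = Map.mapWithKey prune deps
  where
    prune _ ys =
      Set.filter (\y -> not (any (`reaches` y) (Set.toList (Set.delete y ys)))) ys
    reaches from target
      | from == target = True
      | otherwise =
          any (`reaches` target)
              (Set.toList (Map.findWithDefault Set.empty from deps))
```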
We can «fold leaves». A definition without dependencies that only one other definition depends on — a leaf — can be put into the same module as its dependent, maybe even into a `where` block. But should this always be done? Applied iteratively, this rule will gather the whole outer trees of our dependency graph into the nodes on which at least two other definitions depend. This can make big modules, which defeats the purpose of the module system.
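For instance, with made-up names: if normalise is used by render alone, the fold would look like this.

```haskell
-- Before the fold: `normalise` is a leaf, used only by `render`.
render :: Double -> String
render x = show (normalise x)

normalise :: Double -> Double
normalise x = max 0 (min 1 x)

-- After the fold: the leaf lives in a `where` block of its only dependent.
-- (Named render' here only so that both versions fit in one snippet.)
render' :: Double -> String
render' x = show (normalise' x)
  where
    normalise' :: Double -> Double
    normalise' y = max 0 (min 1 y)
```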
However, we can also make the opposite argument: if two definitions are not both dependencies of some other definition, they certainly should not be in the same module. This puts formal ground under «Do you have a giant `Types` module in your projects?».
More generally, we can «fold chains». If X depends only on Y, we can merge them. They will have to be compiled one after the other anyway. We lose caching, but we do not lose parallelization, because we did not have any to begin with. But once again, should this always be done?
The answer is that we need another criterion for when to fold things and when to keep them laid out. For example, if Y takes a long time to compile, we should not fold it into X, so that we retain caching.
programmer experience
There are these two simple observations:
- It is easy to review definitions jointly when they are on the same screen. The easiest way to have stuff on the same screen is to put it into the same file.
- Scrolling is painful. So, modules should be small enough to fit on the screen.
These considerations together suggest that we should fold stuff into a module while it fits on the screen and then stop folding. This sounds like a silly criterion, but for me both files that take more than two screens and files that are 10 lines long are painful to work with. Maybe I need to learn some editor tricks?
other languages
- In Java and C#, the convention is an exact one-to-one correspondence between classes and files: each class gets a file of its own. I am not sure whether every source file gets its own object code file. Overall, they go the path of the finest partition.
- In Vue, they put the markup, style and code for the same «thing» in the same file, and it seems to be clearly defined what a «thing» is, though I do not quite grasp it. So, a kind of «vertical integration». I think they do it for better programmer experience.
My experience with languages other than Haskell is tiny so I cannot say much here.
so
So, I have no idea what to do with my code! Please help me out!