How to split up modules? (and is one function per file reasonable?)

From “Gentle Introduction to Haskell” (which I probably first read over 20 years ago now), the modules section says the following:

A Haskell program consists of a collection of modules. A module in Haskell serves the dual purpose of controlling name-spaces and creating abstract data types.

When I have a datatype, particularly an abstract datatype, the boundaries of a module are more clear, and feel quite similar to the “one (public) class per file” pattern often used in OO.

Some of what goes in the same file as a datatype is effectively required by the language/GHC, namely:

  1. Instances should be defined in the same file as the datatype (assuming they aren’t in the same file as the class), otherwise they become orphan instances.
  2. One can only access non-exposed constructors inside the file, so basically any accessors/creation functions that require access to the datatype’s internals need to go in the same file as the datatype.

Point 2 here in particular is similar to OO, as only methods can access private members, not free-standing functions.
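For example, here is a minimal sketch of both points, using a made-up NonEmptyText type: the instance sits next to the datatype, and the smart constructor has to live in the same module because the constructor itself is not exported.

module NonEmptyText
  ( NonEmptyText     -- type exported, constructor kept hidden
  , mkNonEmptyText
  ) where

import qualified Data.Text as T

newtype NonEmptyText = NonEmptyText T.Text

-- Point 1: the instance lives next to the datatype (Semigroup is defined
-- elsewhere, so this is its natural, non-orphan home).
instance Semigroup NonEmptyText where
  NonEmptyText a <> NonEmptyText b = NonEmptyText (a <> b)

-- Point 2: only code in this module can use the NonEmptyText constructor,
-- so any creation function that needs the internals must live here too.
mkNonEmptyText :: T.Text -> Maybe NonEmptyText
mkNonEmptyText t
  | T.null t  = Nothing
  | otherwise = Just (NonEmptyText t)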

Of course, there’s a similar coupling with files containing a typeclass.

But outside of those cases, the guidelines are unclear. Now I guess this question could apply to a number of languages, but in OO-style languages so much more of the code is tied to a particular object (hence “object orientated”) that what is associated with what is clear.

The great thing about the Haskell type system, unlike object orientated type systems, is that Haskell excels not just in describing objects themselves (which many OO languages can do also) but in describing the interaction between objects.

As such, many functions take as input a variety of (sometimes abstract) types, and may do things other than just manipulating or introspecting a single object and performing an action.

This happens in other languages as well: code that is not clearly connected to any one object makes up the majority.

What I usually do, perhaps out of laziness, is just add a bunch of vaguely related functions all to the same file. As an example, just because it’s current in my head, I’m working on a streaming CSV library (using streamly) that ideally will be able to parse and stream CSVs in constant space (that is, if someone puts 10 billion ‘A’s in a cell, it doesn’t blow your heap).

You have a bunch of transformations that, for example (sketched with placeholder types after this list):

  1. Converts a stream of ByteStrings into a stream of CSVTokens
  2. Takes a stream of CSVTokens and adds line/column information to those tokens
  3. Combines “partial” tokens (where a cell spans a boundary of the incoming ByteString) into complete tokens by concatenating the ByteStrings, with a limit on cell size that throws an error otherwise (to maintain bounded space)
  4. Removes (or keeps) particular numbered columns from the CSV
  5. Outputs a stream of CSVTokens as a stream of ByteStrings
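A rough sketch of how those stages might be typed is below. None of these names are from the actual library; Stream is just a list synonym here so the sketch stands alone, whereas the real code would use streamly’s stream type.

module CSV.Sketch where

import Data.ByteString (ByteString)

-- Placeholder standing in for streamly's stream type in this sketch.
type Stream a = [a]

data CSVToken = Cell ByteString | PartialCell ByteString | EndOfRow

data Positioned a = Positioned { line :: Int, column :: Int, payload :: a }

tokenise      :: Stream ByteString -> Stream CSVToken              -- stage 1
addPositions  :: Stream CSVToken -> Stream (Positioned CSVToken)   -- stage 2
mergePartials :: Int -> Stream CSVToken -> Stream CSVToken         -- stage 3, given a max cell size
selectColumns :: [Int] -> Stream CSVToken -> Stream CSVToken       -- stage 4
render        :: Stream CSVToken -> Stream ByteString              -- stage 5

-- Bodies elided: this is a shape sketch, not an implementation.
tokenise      = error "sketch only"
addPositions  = error "sketch only"
mergePartials = error "sketch only"
selectColumns = error "sketch only"
render        = error "sketch only"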

Originally I was jamming these all in one file. Then I split them up into two files: parts 1 and 2 (the “tokenising”) in one, and the “manipulation” in the other.

But as I add more and more manipulation functions, that will eventually get quite large.

Usually what I’ve done in the past is something like the above process: do some vague split based on some category, then over time add more and more things as more features/hacks are required, until I have a file which depends on almost everything else in the library/project and is a hideous mess to pull apart (particularly when pulling it apart results in surprising circular dependencies).

So for this CSV library, for example, I’m trying a different approach, which is basically:

One (exposed) function per file

This isn’t a hard and fast rule: if there is a chain of functions that call each other, i.e. h calls g which calls f, I might put h, g, and f in the same file and perhaps expose all of them, as it’s sometimes worth exposing “helper” functions since they are often more generic.
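For instance, with throwaway numeric functions just to show the shape: all three live in one module and all three are exported.

module Chain (h, g, f) where

-- h calls g, which calls f, so the whole chain stays in one file.
f :: Int -> Int
f = (+ 1)

g :: Int -> Int
g = f . (* 2)

h :: Int -> Int
h = g . subtract 3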

But other than that, if there are, say, two “manipulation” functions that do different things, I’m thinking of something like:

module CSV.Manipulation.SelectColumns where
selectColumns :: [Int] -> Stream CSVToken -> ...

module CSV.Manipulation.MapRows where
mapRows :: ...

instead of:

module CSV.Manipulation where

selectColumns :: [Int] -> Stream CSVToken -> ...
mapRows :: ...
...

Basically, most of my files won’t be over twenty lines.

If I end up bumping into a circular dependency by doing this, then I can resolve it then and there as appropriate, instead of such circular dependencies staying hidden.

I can organise my code based on module hierarchy, whilst still only having one function per file, and if I want to reorganise, I feel like moving files is easier than actually moving functions between files, as a simple find/replace on the fully qualified module name is the only refactoring required.

One issue may be performance, in that module boundaries can prevent inlining, but like I said, I often put helper functions in the same file. The trickier case is a helper function that a lot of the “manipulation” style functions use: by having all the manipulation functions in one file, the helper can live there too and they can all inline it. But even with the manipulation functions in separate files, one could just put the helper in its own separate file with an INLINE or INLINABLE pragma.
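As a sketch of that last idea (the helper and module names here are invented), the shared helper gets its own tiny module and a pragma, so its unfolding is still available to the separate manipulation modules:

module CSV.Internal.Util (clampCell) where

import Data.ByteString (ByteString)
import qualified Data.ByteString as BS

-- Truncate a cell's bytes to the configured maximum size.
clampCell :: Int -> ByteString -> ByteString
clampCell maxLen = BS.take maxLen
{-# INLINABLE clampCell #-}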

I guess the bad thing may be that haddocks are harder to search.

And people need to write more imports, but IDEs handle that pretty well nowadays.
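One possible mitigation, a common Haskell pattern rather than anything specific to this library, is an umbrella module that simply re-exports the small ones (using the hypothetical module names from above), so downstream code can still write a single import:

module CSV.Manipulation
  ( module CSV.Manipulation.SelectColumns
  , module CSV.Manipulation.MapRows
  ) where

import CSV.Manipulation.SelectColumns
import CSV.Manipulation.MapRows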

But overall this so far seems like a better approach. I don’t have to think as much about what functions are related to others, and if I get the groupings wrong, renaming files is easier than moving functions between files.

So I ask this question because even though I currently think having each file contain a 5-10 line function and nothing more is the way to go, that doesn’t seem to be how the Haskell ecosystem does things, particularly when I look up things on Hackage. Which suggests that this approach is wrong in some way.

The only other thing I can think of is IDE concerns, but I don’t think this is a problem, HLS allows me to jump around quite easily, and functions that are very closely related I’m still keeping together.

So I was looking for reasons why NOT to just have lots of 10-20 line files in a Haskell library/project?

I think it will be slower to compile a larger number of smaller files than a few larger files.

Because it has lots of overhead:

  • creating a new file involves
    • finding a name
    • deciding what to put in it
    • writing the file header, i.e. the imports (keeping in mind that the more modules/files there are, the more import statements there are to write, and the more work it is to find which one to import)
  • sooner or later this will create a dependency cycle, which will need to be broken by merging modules
    • unless you carefully plan your modules up front to avoid cycles (that requires thinking time too)
  • being able to see everything at a glance is better when possible. I can display 70 lines on my screen, so I prefer one 70-line file to three 20-line files

Having one big file has its drawbacks too, mainly recompilation time, but in my case I am more worried about recompilation triggering, i.e. not having to recompile everything when I touch something.

Most of the cons of small and big files can be overcome by an IDE, so it’s just a matter of finding a trade-off between the two. You’ve tried big files and have an idea of when too big is too big. Try small files and you’ll (soon) see when too small is too small.

I like thinking of files as modules, and grouping related things in one module.
However, I generally do keep related types in a separate module; it is the best way to avoid cycles and full recompilation.
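A sketch of that layout, echoing the CSV example from earlier (names made up): the token type lives in a leaf module that imports nothing else from the project, so every other module can depend on it without creating cycles, and because the types change rarely, editing the other modules doesn’t trigger a full recompile.

module CSV.Types (CSVToken (..)) where

import Data.ByteString (ByteString)

-- The core token type, kept in its own dependency-free "leaf" module.
data CSVToken
  = Cell ByteString
  | PartialCell ByteString
  | EndOfRow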


“On the Criteria to Be Used in Decomposing Systems into Modules” by D. L. Parnas is a classic paper on this topic.

A good rule of thumb is to group things that change at the same time in the same module.

I think in practice you would want more than one function per file, especially because Haskell functions tend to be shorter than imperative ones.

10-20 lines sounds a bit short to me. I think ~200 lines is a nice size for modules. But size can really vary, and both very short and very big modules can make sense.

I wouldn’t worry about inlining/performance too much until benchmarks show that it is a concern.
