Call for ideas: a NixOS deployment tool written in Haskell

Hi! I’m calling on Haskell/Nix(OS) users to suggest feature ideas for a CLI tool I’m writing to build and deploy distributed fleets of NixOS machines. The provisional name is fleet, but I’ve just noticed that there are existing tools with the same name, so I’m open to suggestions.

In my home I currently have six machines running NixOS (workstation, laptop, router, and three Raspberry Pis running a distributed sound system). I love the idea of being able to configure them all in one Nix repository, such that they can reference each other, share code, access common variables, etc. I tried various existing deployment tools (NixOps, Colmena, Bento, and Comin, to name a few) but found them all sub-optimal in one way or another, at least for my needs.

For example:

  • NixOps is in “low-maintenance mode”;
  • Comin works on a pull-based model, which I didn’t find very “nix-y”;
  • Bento also works on a pull-based model, and furthermore is written in Bash, which I find slightly unnerving;
  • Colmena doesn’t support health checks.

Also, none of the above are written in Haskell.

I ended up settling on Colmena, which I believe is the best overall, but I still dreamed of better. So, let’s do it. The idea is to combine all the best features of all the best deployment tools, together with beautiful code written in the best language of them all.

From the Nix side, the design will be very similar to the other tools, effectively “fleet-wide configuration and metadata”, “normal NixOS configuration for each node”, and “misc fun deployment stuff”. The last of these is the part for which I’m asking for suggestions.

I have a list of features in mind, including but not limited to (a rough sketch in code follows this list):

  • configurable health checks, implemented by a daemon running on the nodes (written in Haskell, of course);
  • deployment hooks, i.e. arbitrary commands or scripts run on the nodes before/after deployment;
  • secrets, i.e. sensitive files copied to the nodes without ending up in the Nix store on the node or the deployment client;
  • minimising the number of times the user is prompted for a sudo password, by creating a system user on the node with password-less sudo for nix commands;
  • automatic configuration of SSH access between nodes, including generation of key pairs;
  • configurable delegation of NixOS builds to specific nodes in or out of the fleet.
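
To make that a bit more concrete, here’s a rough sketch of what the per-node deployment options might look like in the tool’s own code. Every name here is provisional and purely illustrative, not a committed API:

module Fleet.Config where

import Data.Map (Map)

-- Provisional, illustrative types; nothing here is a committed API.
data HealthCheck = HealthCheck
    { checkName     :: String
    , checkCommand  :: FilePath  -- run on the node by the fleet daemon
    , checkInterval :: Int       -- seconds between runs
    , checkRetries  :: Int       -- failures tolerated before the node counts as unhealthy
    }

-- Shell snippets run on the node around a deployment.
data Hook = PreDeploy String | PostDeploy String

data NodeDeployment = NodeDeployment
    { nodeHostName     :: String
    , nodeHealthChecks :: [HealthCheck]
    , nodeHooks        :: [Hook]
    , nodeSecrets      :: Map FilePath FilePath  -- local source -> destination on the node
    , nodeBuildOn      :: Maybe String           -- delegate the NixOS build to this host
    }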

I currently have a minimum viable product implementing a few of these features, and I’m working on the rest. I’m aware of the pitfalls of trying to cram too many ideas into one tool, but I feel there’s still some wiggle-room for a few other killer features.

So, if you manage a fleet of NixOS machines and have always wished for a way to do a particular thing, or if you have a general idea for the design of this deployment tool, please let me know below! Also, if there are any existing tools with great features not on the list above, that’d be useful information too.

If anyone’s interested, I’ll elaborate further on the current design, or post status updates (or health checks!).

Thanks!

Speaking of unmaintained packages, there is also hail, which is so unmaintained that even its GitHub project is gone. I think its scope is significantly narrower than yours, but you could poke around the sdist on Hackage to see if it does anything useful for you.

Thanks for the tip, I’ll take a look!

Cachix Deploy is a fantastic design and implementation, but sadly closed source. In particular, it completely separates building from deploying. Once a deployment is initiated on a node, it shouldn’t run much more than nix-store -r <toplevel> && <toplevel>/bin/switch-to-configuration switch. The UI is also quite nice for monitoring node deployment status; using SSH for 20+ nodes is pretty cumbersome.
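
In Haskell terms, that activation step fits in a few lines. A minimal sketch using System.Process (to be clear, this is not Cachix Deploy’s actual code):

import System.Process (callProcess)

-- Realise the toplevel store path on the node, then switch to it.
activate :: FilePath -> IO ()
activate toplevel = do
    callProcess "nix-store" ["-r", toplevel]
    callProcess (toplevel ++ "/bin/switch-to-configuration") ["switch"]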

Hey @fpringle,

We use Haskell for deploying our machines at Composewell.

Some of the things we do:

  • Spin up machines.
  • Create appropriate users (simple machines have only “admin”).
  • Set up dependencies.
  • Set up communication between machines by creating the appropriate SSH config.
  • Use bigger machines as build servers and scp build artefacts to the
    smaller machines.
  • Take regular backups of our database.
  • etc. etc. etc.

All of this is written in Haskell in a very expressive manner.

Rather than talk about what the deployment tool should contain, I’ll talk about how we
designed our tool.

Our initial requirements:

  1. We want to write Haskell.
  2. We want to run some functions on remote machines.
  3. We want to run some functionality we’ve written via the CLI.
  4. The overhead for the developer should be as small as possible.

What we essentially wanted was an RPC framework that can be called by humans
and by Haskell code, with a good user experience. The good user experience is what
required most of the work.

We have a private library called simple-rpc that addresses these requirements.
We use a combination of template-haskell along with preprocessing the Haskell
files via a custom build-type.

I’ll be making the library public soon with an accompanying blog post.
The usage currently looks like the following:

Define a function:

-- RPCIFY_MODULE

module Less where

-- Return the first numLines lines of the given file.
less :: Int -> String -> IO [String]
less numLines filePath = do
    contents <- readFile filePath
    pure $ take numLines $ lines contents

Configure the endpoint:

-- Register `less` as a remotely callable endpoint.
main :: IO ()
main =
    mainWith
        serverVersion
        [ evaluator less ]

Run the endpoint via Haskell remotely:

outerFunctionInAnotherPackage :: IO ()
outerFunctionInAnotherPackage = do
    ...
    let sshConf = SSHConfig "user@a.b.c.d" 22
        rpcConf = exec "/path/to/executable/on/remote" & onSSH sshConf
    top20Lines <- call less rpcConf 20 "/path/to/file"
    ...

Run the endpoint from the executable locally:

$ echo '[20, "/path/to/file"]' | /path/to/executable Less.less
["The contents of the file.", "This is line 2", ...]

We could make a general-purpose, modular deployment tool with the ideas above. If
you’re interested, I can give you access to the private repo. You can use it to
build your deployment tool and give feedback regarding the library.
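
For instance, a deployment step could plausibly be composed like this. This is only a sketch on top of the API shown above; activate and healthCheck are made-up remote functions, registered with evaluator in the same way as less:

-- Hypothetical composition of deployment steps on top of simple-rpc.
-- `activate` and `healthCheck` are invented remote functions, registered
-- with `evaluator` just like `less` above.
deployNode :: String -> FilePath -> IO ()
deployNode host toplevel = do
    let sshConf = SSHConfig host 22
        rpcConf = exec "/path/to/executable/on/remote" & onSSH sshConf
    call activate rpcConf toplevel
    healthy <- call healthCheck rpcConf
    putStrLn (host ++ if healthy then ": deployed" else ": health check failed")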

Hi @adithyaov! Thanks for the comment. Sounds like you’ve got some fun stuff going on over there!
I’d definitely be interested to learn more about how you use RPC to manage deployments. As I started implementing, I realised how fiddly calling processes locally could get, let alone over SSH. Looks like you’re using variadic arguments too, very cool. Your user-facing API looks very clean.
Please do invite me to the repo, I’d be happy to take a look and steal some ideas!

NixOps is in “low-maintenance mode”

Very true. I’m working on a new project under the name NixOps4.
I haven’t announced it broadly yet because it’s still very much in development and not at all usable.

It will implement a more Nix-like architecture, in the sense that the tool itself only manages the scheduling and data flow between derivations and resources, and a lot of the extra behavior that isn’t CRUD on resources can be defined with Nix expressions and modules. A resource can be something in the cloud, like a VM, VPC, or DNS record, or anything else that makes sense to manage declaratively, such as a NixOS profile.

The tool is in Rust, but most of the interesting development will be done in the Nix language and/or the module system. It will be possible to write resources in any language, as long as it can, for example, talk JSON over pipes or run an HTTP server. So while no Haskell is planned for now, it could certainly be integrated for resources, thanks to the emphasis on IPC for extension. Maybe Haskell could instead be another frontend, replacing the CLI, or, much more ambitiously, it could replace or supplement the Nix language for resource declarations.
The latter two are very speculative though, and for now we’re only focusing on making the basics work.
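
To give a taste of what a Haskell resource could look like: since a provider only has to speak some IPC contract, a minimal one might be roughly the following. This is a sketch using aeson, and the request/response shape is invented for illustration, since the real NixOps4 contract isn’t published yet:

{-# LANGUAGE DeriveGeneric, DeriveAnyClass #-}

module Main where

import Data.Aeson (FromJSON, ToJSON, eitherDecode, encode)
import qualified Data.ByteString.Lazy.Char8 as BL
import Data.Map (Map)
import qualified Data.Map as Map
import GHC.Generics (Generic)

-- An invented wire format: one CRUD request per line on stdin.
data Request = Request
    { operation :: String            -- "create", "read", "update" or "delete"
    , inputs    :: Map String String -- resource arguments from the Nix side
    } deriving (Generic, FromJSON)

newtype Response = Response
    { outputs :: Map String String   -- attributes reported back to the tool
    } deriving (Generic, ToJSON)

-- A real provider would call a cloud API or run commands over SSH here.
handle :: Request -> Response
handle req = case operation req of
    "create" -> Response (Map.singleton "id" "example-123")
    _        -> Response Map.empty

-- Read newline-delimited JSON requests on stdin, answer on stdout.
main :: IO ()
main = BL.getContents >>= mapM_ respond . BL.lines
  where
    respond line = either fail (BL.putStrLn . encode . handle) (eitherDecode line)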

Hey @fpringle

The library is now public at GitHub - composewell/simple-rpc
You can look at the issues for some of the design decisions and for what needs to be changed/improved.

Thanks for reaching out for ideas!

For me as a user, there are already too many NixOS deployment tools to test them all out.
I wonder why so many projects were developed to solve the last mile of Nix deployment.
In any case, I’d love to see consolidation in that space and, where applicable, a common standard to make it simpler for users to switch between tools.

I believe the modifications made to the deployed system should be as small and as transparent as possible, especially for security-relevant things like privileged users and SSH access.

Out of curiosity: can you explain why you see a pull-based model as not very “nix-y”?

Hi @roberth, thanks for the comment. Could you explain what the term “resources” means in NixOps(4), other than just NixOS derivations? And what does it mean for a resource to be “written” in a language?

Once NixOps4 is more stable, I’d definitely be interested in talking more about the possibility of integrating Haskell.

Hi @flandweber,

Actually, while I was trying out the various deployment tools currently available, I noticed that converting my Nix config files between formats was pretty easy. As I said in my original post, they generally follow a standard pattern:

effectively “fleet-wide configuration and metadata”, “normal NixOS configuration for each node”, and “misc fun deployment stuff”


About the pull-based model: maybe I was too hasty in slandering the idea as “not nix-y”. I’m happy to hear arguments to the contrary. From my (admittedly somewhat immature) view of Nix, the main things we love about the language (reproducibility, declarative programming, etc.) are really all facets of the principle that we want as much control as possible over our systems and derivations, and for that control to be as explicit as possible. So to my mind, a push-based system (“build this and deploy it now”) is more aligned with that philosophy than a pull-based system (“build this and wait for the nodes to pick it up”), especially when it comes to things like error reporting.

Then again, there are of course arguments based on the complexity of maintaining larger distributed systems, “eventual consistency” etc. Like I said, I’m happy to be persuaded otherwise.

Oh, one more argument for push-based (and this is something that’s also guided some other design choices in my WIP tool): I’d like the very first deployment to a node to work as similarly as possible to all the following deployments. This means assuming, wherever possible, that the node doesn’t already have anything running on it from previous deployments (with the same tool). This assumption pretty much precludes a pull-based model without some kind of complicated bootstrapping process, and that bootstrapping would violate the principle I just mentioned, that the first deployment should be as similar as possible to the others.

Btw: I agree it seems silly that the “last mile” of NixOS deployment has to be delegated to third-party tools. Before I started researching the different tools, I had heard of NixOps and had assumed it would be the best choice, since it lives under the NixOS GitHub organization and therefore seemed like it would be the most “native”. I’m hopeful about NixOps4, especially since it seems to have funding (@roberth, can you confirm?).

@fpringle The resource concept is similar to that in Terraform/OpenTofu: basically anything that has CR(U)D operations and makes sense to declare in code. Examples are cloud resources created through APIs, such as VM instances, DNS records, etc., but something that is managed over SSH could also be modeled as a resource.

Their implementations are defined by (fairly small) programs that are built with Nix, and then communicate with NixOps4 according to some generic contract.
So for a resource to be “written” in a specific language means that this executable (invoked by NixOps4) is written in that language.
It will be possible to program a resource using an API library, or to take, for example, a generic SSH resource implementation and write a Nix function that uses it to perform certain actions, such as managing a NixOS installation.

Absolutely, and I’m very glad to have an opportunity to improve this.

Correct!

I also use Colmena as my current deployment tool, but if you were to write the equivalent in Haskell, I’d jump ship simply because it is in Haskell. Since Colmena is written in Rust, it’s not very accessible to me when I need to understand exactly what it’s doing in an edge case, or when I want to add a feature.

configurable health checks, implemented by a daemon running on the nodes

I think this is a nice idea, and definitely something I’ve had to implement a few times myself. I find it strange that the “systemd ecosystem” doesn’t have any real solution here beyond the very rudimentary checks systemd itself does.

That said, unless I’m missing something, these health checks implemented as daemons would be independent of the deployment tool, wouldn’t they? I.e. won’t they just be new NixOS modules?

automatically configure SSH access between nodes

Again, sounds very useful, but also sounds like a new NixOS module that’s independent of the deployment tool.

I’d like to suggest a feature that I’ve been missing in Colmena: making sure that the module system for your tool is “embeddable”. To expand a little bit: we are using Terranix + Colmena to manage our infrastructure, and both of these tools rely on a module system that’s designed to be an island. For example, I can’t emit a colmena-machines attribute from my Terranix config, because Terranix doesn’t expose the compiled config of its module system. And Colmena insists on receiving the “uncooked” module definitions, i.e. it doesn’t provide any hooks so that you could assemble your Colmena machines in a larger module hierarchy and call Colmena on the result.

To put the previous paragraph differently, Terranix isn’t flexible in its output and Colmena isn’t flexible in its input. Having said that, they’re both inflexible in the other direction too :) This is in contrast to NixOS’s own module setup, which is flexible in both input and output. Here’s an example of how NixOS embeds itself into itself using this two-way flexibility. What’s happening there is deeply magical, but I think this is a very underused potential in the Nix ecosystem: I want to describe my project-specific module setup where Terranix, Colmena, NixOS, flake-parts, and others (Kubernix?) all live within a single module hierarchy, so that I can implement infra-wide functionality as top-level modules.