ANN: pGenie – a SQL-first code generator for PostgreSQL: no DSLs, no ORMs, no hand-rolled codecs

Hello colleagues,

I’m Nikita Volkov, the author of “hasql”. After “hasql-th” brought compile-time SQL syntax checking a few years back, one significant gap remained: the validation of query compatibility with the actual database schema. Another concern coming from users has long been the need to hand-roll encoders and decoders for every query.

Today I’m releasing pGenie - a tool that completely rethinks the way we integrate with Postgres from Haskell and it addresses these pain points and so much more!

What it does

You give pGenie two things:

  1. Your SQL migration files (plain CREATE TABLE …, ALTER …, etc.)

  2. Your SQL query files (plain SELECT …, INSERT …, etc.)

It executes them against real PostgreSQL via a Docker container, validates queries against your schema, and generates a fully type-safe Haskell SDK on top of “hasql”. There is no DSL to learn, no Template Haskell, no ORM layer. Postgres itself is the single source of truth.

Besides Haskell it also generates Rust and actually has a decentralized codegen ecosystem that allows you to write your own codegens in beautiful Dhall and distribute them however you like.

As the cherry on top, pGenie also automates index management to some extent.

Links

pGenie is free and open source. I’d be grateful for feedback, bug reports, and a GitHub star if you find it useful!

23 Likes

This looks great! Maybe I’m being pedantic, but one thing that often irks me about database library docs is that they try and pretend like we can infer domain types from the database schema.

Consider

data TrackInfo = TrackInfo
  { -- | Maps to @title@.
    title :: Maybe (Text),
    -- | Maps to @duration_seconds@.
    durationSeconds :: Maybe (Int32),
    -- | Maps to @tags@.
    tags :: Maybe (Vector (Maybe Text))
  }
  deriving stock (Show, Eq, Ord)

Personally, I’d much rather see this called TrackInfoRow (or even TrackInfoPgRow) and be explicit that this is a representation of a database row. It may seem like a tiny change, but it makes it clear that the intention is to be as precise as you can be in representing your row, and not in modeling the domain.

Otherwise tags :: Maybe (Vector (Maybe Text)) just seems hideous – in the real world you’ll probably want to have something like tags :: Set Text which is easier to read, more ergonomic and also more correct, because you’re unlikely to want to have duplicate tags.

Also, adding the Row suffix makes it clear that you have to think and actually model your domain type based on the problem you’re solving rather than the database details.

Thanks for your feedback and attention to detail!

There’s a couple of misunderstandings here and an actual problem that is scheduled to be solved in future releases. I’ll address them in order.

they try and pretend like we can infer domain types from the database schema

I absolutely agree that this is a wrong approach and actually a very common mistake in Haskell. A data type should model only one concept. However in many codebases they use a single type to model a database row and a domain model that also happens to have a JSON instance that surfaces in the public API, which results in coupling of everything, leading to changes caused by one tiny aspect escalating to the entire codebase or worse, silently breaking APIs. I want to stress that this is not what pGenie does or ever will do.

I’d much rather see this called TrackInfoRow (or even TrackInfoPgRow) and be explicit that this is a representation of a database row

Not in this case. The generated type you mention is located under the *.Types.* namespace, which stands for user-defined types as per the database. This one resembles not a row, but this user-defined Postgres type.

In case of types resembling rows of a result set, what you suggest is actually what we do. E.g., SelectAlbumWithTracksResultRow is a generated type that represents a row of the result set of the SelectAlbumWithTracks statement.

Otherwise tags :: Maybe (Vector (Maybe Text)) just seems hideous – in the real world you’ll probably want to have something like tags :: Set Text which is easier to read, more ergonomic and also more correct, because you’re unlikely to want to have duplicate tags.

The core problem here is that in Postgres all fields of composite types are nullable and there’s no control over that. Elements of all arrays are also nullable and there’s no control over that either. So this merely represents the truth. Yes, it’s not pretty, but silently pretending that this problem doesn’t exist will only lead to bugs in the user’s code. In this case we’re talking about a user-defined composite type.

In case of result sets you get complete control of issues like this via the signature files. They let you narrow down such types by stating that you expect only non-null values for both the array and its elements, as is done here, which leads to generation of a Vector Text without maybes. Doing such tweaks will lead to runtime decoding errors in case the database will present you with nulls, but that’s a tradeoff that you can choose to make in order to have a more ergonomic API. The default is safe.

Adding similar signature files for user-defined types is on the roadmap and thus it will give you an option of controlling this.

Now regarding Set. Postgres doesn’t have a set type. So the starting argument is again about the truth. Interpreting Postgres arrays as sets is imposing domain logic over the database integration. The safest way to do this is to control such conversions explicitly outside of the integration layer. However I hear your point about ergonomics and I think that such features can be added via the signature files in the future as well. Would be great to have a discussion about this and get an input from other users before implementing this.

That’s great to hear. Then again I’ve seen this much more in C#/Java than in Haskell where people generally care more about types.

Exactly. I agree with the way it’s represented but only in the context that this is actually a row from in a Postgres database, and not a “real” track.

Not in this case. The generated type you mention is located under the *.Types.*

Well, yeah, but that means you need qualified imports to do something like fromRow :: Row.TrackInfo -> TrackInfo which I guess is a matter of preference.

In case of result sets you get complete control of issues like this via the signature files. They let you narrow down such types by stating that you expect only non-null values for both the array and its elements, as is done here, which leads to generation of a Vector Text without maybes. Doing such tweaks will lead to runtime decoding errors in case the database will present you with nulls, but that’s a tradeoff that you can choose to make in order to have a more ergonomic API. The default is safe.

I get that, but I don’t see how it’s better than just defining your domain types separately and having a conversion function. In that case any queries you do will return [*Row] and then it’s the responsibility of the caller to figure out how to handle each row individually, whether it’s defaulting the Maybe (Vector x) to mempty or something else.

Now regarding Set . Postgres doesn’t have a set type. So the starting argument is again about the truth.

I don’t argue with the truthfulness of what pGenie generated, in fact I agree completely with the Maybe (Vector (Maybe Text)) representation – I just think the naming itself makes it seem like we’re inferring a real domain model from a Postgres schema.


To summarize: I think what you’ve done is really cool, the only changes I would do is be more explicit about the naming (even if that means changing the .Types namespace to .Tables or something more obviously database-related).

I hear you. The Types namespace may ring different meanings to users (like domain-types). Tables however would definitely be wrong, because this namespace holds user-defined types, not tables. Actually during development that namespace was initially called CustomTypes, but then I reduced it for brevity.

Any way, I have several good news on the matter:

  1. The codegen is maintained separately from the main codebase as a Dhall program. You can fork it, do whatever tweaks you need and start using it immediately by doing nothing more than referring to your fork in the pGenie project file, as explained in the docs. Also such changes are trivial to handle for LLMs, so you don’t even need to delve into the codebase at all.

  2. Although the Haskell codegen doesn’t yet have any configurable parameters, configuration of codegens is supported, so similar opinion-related issues can simply be turned into codegen-specific config options.

2 Likes

https://pgenie.io/#llms

Glad to see there’s still people out there that understand that probabilistic tools aren’t the answer to everything and we still need deterministic tools.

5 Likes

Is this similar to sqlc ?

Yes.

Following is a comparison which I’ve originally posted on Reddit and Lobsters.

The main differences are:

  1. Analysis engine. pGenie is designed from ground up to rely on the Postgres server. That’s why it sees everything the way Postgres itself does. sqlc made an initial mistake of going with a custom emulator, which as I understand led to an endless stream of bugs caused by the emulator being out of sync with the interpreter of Postgres. So pGenie supports any Postgres query of any complexity, sqlc likely does not. Lately they’ve been attempting to change this strategy and as I understand they have a mixture of both strategies now.

  2. pGenie is focused on Postgres only. This lets us represent all features of Postgres in the internal models. That’s why we support composite types and multiranges for instance, which sqlc does not. Targeting multiple databases inevitably requires you to limit the internal models to the least common denominator which leads to the reduction of supported features of the DB. sqlc targets multiple DBs.

  3. Signature files. pGenie generates signature files which are then used as the source of truth about what types your queries expect. This is essential for preventing schema drift, when you create a migration that changes a column’s type thus silently breaking your queries. They are also used for fine-tuning the derived types (e.g., enforcing non-null on parameters, where Postgres allows nulls). sqlc does not have such a concept.

  4. Index management. pGenie performs analysis on the indexes used by the queries and generates recommendations on how to optimize them. E.g., delete the unused ones or add ones on columns that cause seq-scans. It can actually generate such migrations for you. sqlc does not touch this area.

However it must be noted that sqlc is a large project now and there is a large community around it and that inevitably has impact in areas like the amount of available resources, tutorials, and community support.

There’s also a historical point. This project was originally developed when sqlc wasn’t on the radar. I was experimenting with a different strategy for the distribution of pGenie (SaaS), which did not succeed. Now I’ve pivoted on the strategy to Open Source + Consulting. E.g., here’s the initial announcement on Reddit dated 4 years ago.

3 Likes

Thank you for explanation