Property checks and PostgreSQL?

So, I have a three layer stack: PostgreSQL data base, Haskell back end, PureScript front end. And there is a fault somewhere in this stack: sometimes items present in the data base do not make it to the front end. A small fortune in programmer hours has been spent searching for the fault, but there is no progress. No one can figure out even roughly where the fault is. Our last hope is to property check the whole thing, from end to end, verifying that anything put into the data base can be retrieved by the front end.

We have in place a fixture that creates a new PostgreSQL cluster, runs the back end on it and makes some scripted HTTP requests. We use it to run unit tests, such as «if I make this PUT request and then that GET request to the same end point, I should get back the same thing as I put in». In principle it would not be a problem to turn this property into a property check. In practice, though, tearing down the cluster, building a pristine new one and initializing the data base anew takes seconds, so even a trivial property check would take minutes to run.

Theoretically, we can carefully reset the data base back to the initial state after each run of the property check. Given that only a small amount of data is modified, it would take imperceptibly little time. But it is hard to know what exactly should be reset, and it is easy to get it wrong if done by hand!
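For comparison, one blunt but fully automatic reset (not mentioned above, and only viable if no seed rows need to survive) is to truncate every user table and rewind its sequences in one statement per table. A sketch:

```sql
-- Truncate all tables in the public schema; RESTART IDENTITY rewinds
-- their owned sequences, CASCADE follows foreign-key dependencies.
DO $$
DECLARE
  t text;
BEGIN
  FOR t IN SELECT tablename FROM pg_tables WHERE schemaname = 'public'
  LOOP
    EXECUTE format('TRUNCATE TABLE %I RESTART IDENTITY CASCADE', t);
  END LOOP;
END
$$;
```

This avoids having to know which tables were touched, at the cost of also wiping any reference data the initialization step inserted.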

One trick we do use is SQL transactions. We take a single connexion, start a transaction, perform a check and then roll the transaction back. Unfortunately, even this is not completely safe: it looks as though sequences used to draw primary keys from are not reset to their previous values! (Or maybe we are doing something wrong — I am not really a PostgreSQL guru so I am not too sure of myself.) But even ignoring this problem (it is irrelevant to most checks), there is another, more severe problem: transactions guarantee that uncommitted data is only visible inside the same connexion. So, there is no way to request it from the HTTP API. This trick is suitable for fuzzing the data base – back end layer, but nothing beyond that.
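The sequence behaviour described above can be demonstrated in isolation; here is a minimal sketch (the table name is made up):

```sql
CREATE TABLE item (id serial PRIMARY KEY, payload text);

BEGIN;
INSERT INTO item (payload) VALUES ('check input');  -- draws 1 from item_id_seq
ROLLBACK;

SELECT nextval('item_id_seq');  -- returns 2: the rollback removed the row
                                -- but did not rewind the sequence
```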

Another thing I heard being done is to selectively drop and re-create exactly the tables affected by a run of a property check. This seems challenging to automate though. How can I know in a general way if a table has been touched? And then, how can I re-create exactly the affected tables given that the data base initialization code is an opaque blob?
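For what it is worth, PostgreSQL's statistics views can answer the «has this table been touched» question, with the caveat that the statistics collector reports asynchronously, so a short delay may be needed before reading them. A sketch:

```sql
-- Zero the per-table counters before the check runs:
SELECT pg_stat_reset();

-- ... run the property check ...

-- Then list every table that saw an insert, update or delete:
SELECT relname
FROM pg_stat_user_tables
WHERE n_tup_ins + n_tup_upd + n_tup_del > 0;
```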

I wonder if this problem is already solved. In any case, any advice is welcome!

Can you reproduce it reliably? Probably not, but if you can find a simple recipe, then turning on logging of all SQL statements in PostgreSQL and looking at what happens would be what I would try. It might be something as simple as forgetting to commit a transaction.
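For reference, statement logging can be switched on without editing postgresql.conf by hand; a sketch (superuser required):

```sql
-- Log every statement the server executes:
ALTER SYSTEM SET log_statement = 'all';
SELECT pg_reload_conf();  -- apply without restarting the server
```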

It looks as though sequences used to draw primary keys from are not reset to their previous values!

Yes, that’s how sequences work - they are increased regardless of whether the transaction rolls back or not, and are not isolated.

Details in the PostgreSQL Sequence Manipulation Functions documentation.
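Concretely, a check that records a sequence's value up front can rewind it by hand with setval, though this is not transactional either and is racy under concurrency. A sketch (the sequence name is illustrative):

```sql
SELECT setval('item_id_seq', 42, true);   -- next nextval() returns 43
SELECT setval('item_id_seq', 42, false);  -- next nextval() returns 42
```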

This seems challenging to automate though.

One way to handle this is to initialize the database once at the beginning of the test run and do a dump of the pristine structure. For each test you then start by recreating the database using the dump. That way you do not have to keep track of which parts of the database have been affected by each individual test, and you only have to create the initial structure once.
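A sketch of that workflow, with made-up database and file names:

```sql
-- Once, right after initialization (run from a shell):
--   pg_dump --create app_db > pristine.sql
-- Then before each test, connected to a maintenance database:
DROP DATABASE IF EXISTS app_db;
-- ... and replay the dump, e.g. with: psql -f pristine.sql
```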

Can you reproduce it reliably? Probably not …

Your guess is right, we have no small example that shows this failure yet.

Yes, that’s how sequences work - they are increased regardless of whether the transaction rolls back or not, and are not isolated.

Thanks! Good to be sure!

One way to handle this is to initialize the database once at the beginning of the test run and do a dump of the pristine structure. For each test you then start by recreating the database using the dump.

The problem here is that re-creating the whole schema from a *.sql file would take a second or two.

  • Perhaps we can do something like copying the schema internally, within the same cluster — I am not sure if this is possible but if it is it should be faster.
  • Also, there are very many tables that are most likely not related to the issue, so we can re-create only the tables that are needed and leave the other tables absent. I am going to go this way if no better idea turns up.

Are your tests executed sequentially? If your bug is related to concurrent access (misconfigured transaction levels, or read-modify-write cycles) sequential tests won’t be able to reproduce it.


The problem here is that re-creating the whole schema from a *.sql file would take a second or two.

How many tests do you have?

Perhaps we can do something like copying the schema internally, within the same cluster — I am not sure if this is possible but if it is it should be faster.

Yes, you can also make a copy of the pristine database and delete the “used” database after each test and copy the copy back in (some of my colleagues did that when they weren’t happy with the dump-based solution).
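For the record, PostgreSQL supports exactly this kind of in-cluster copy via template databases; a sketch, assuming a database named pristine_db holds the initialized schema:

```sql
-- CREATE DATABASE ... TEMPLATE does a file-level copy inside the cluster,
-- which is much faster than replaying the schema SQL. The template must
-- have no open connections while the copy runs.
DROP DATABASE IF EXISTS test_db;
CREATE DATABASE test_db TEMPLATE pristine_db;
```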

Given the problem as a black box to solve, I’d start here as well.

I would also review the testing code / strategy. If you’ve already sunk as much into this as you suggest, I would consider paying a consultant for their analysis and opinions.

How many tests do you have?

We only have a handful of end to end checks, but they already take minutes. Of course there are also fast property checks for the back end, and we are writing some property checks for the back end – data base interface as well — the transaction trick makes them fast enough. But these do not detect the issue in question.

Yes, you can also make a copy of the pristine database and delete the “used” database after each test and copy the copy back in …

The problem here is that I will have to restart the application so that it reconnects to the new data base. It also takes time.

You said you’ve already spent a small fortune in programmer hours trying to find the problem - perhaps sacrificing a little bit of test speed could be worth the gain?

It’s always a trade off, of course.

Have you put any of those programmer hours into logging? If so, I don’t understand how you can not know at least the approximate location of the problem. I mean, simply examining the log of all the communication between the back-end and front-end should be able to blame one of the two.
