Is unsafeIOToSTM ever safe?

cdsmith · July 22, 2022, 6:16am

Context: I’ve been experimenting with adding persistence on top of STM in Haskell. There’s an existing library called TCache to do this, but I’m not happy with some of the design choices there: there’s a lot of global mutable state, playing fast and loose with notions of safety, etc., on top of it not building with modern GHC versions. So I set out to implement an alternative, which I’ve published as persistent-stm, which features a more limited set of possible states and a more explicit API, along with a more detailed argument for its correctness.

However, I’m not happy with the argument for correctness. In the end, the restriction on unsafeIOToSTM that’s most difficult to justify is the one that says the thread may be aborted at any time without throwing an exception or otherwise allowing for any cleanup. This means that code run by unsafeIOToSTM must have the property that if the thread vanishes at any moment, the behavior will still be correct.

That seems to me a very difficult demand to satisfy. In particular, pretty much any effectful code might internally use an implementation that needs some kind of cleanup, and this isn’t typically documented at all. In combing through implementations, I’ve found that seemingly innocuous actions like generating a random number often involve some kind of locking that could stall the whole process if a thread holding the lock were to abort in the middle. There is a trick to work around this, by creating an empty MVar, forking a new thread that writes its result into the MVar, and then taking the result from the MVar… but I also cannot satisfy myself that this isn’t itself unsafe. Can you really fork a new thread without there being even a single moment where you hold a lock on something like the list of threads in the Haskell runtime? Even if you can today, will this remain true in the future?

I guess all this is building up to asking: is it actually just impossible to use unsafeIOToSTM and be confident that the resulting code is guaranteed to be correct? If not, where does the guarantee of correctness come from?

In my case, my use of unsafeIOToSTM is limited to calls to newEmptyMVar, forkIO, takeMVar, mkWeak, and deRefWeak. There’s no exception handling in my own code, but in order to be safe, I must additionally be sure that no exception handling is needed in the implementations of these functions, either now or in the future. I cannot see how it’s possible to be sure of that for anything.

jaror · July 27, 2022, 12:46pm

So in your case I’d expect that takeMVar and perhaps something inside the forkIO are the only things for which it is not possible to roll it back (automatically). Unless perhaps the takeMVar is only called on MVars that are created during a single atomic STM operation?

cdsmith · July 27, 2022, 1:01pm

Yes, it is the case that takeMVar is called on an MVar that is created within the same transaction. Furthermore, the only visible effect of the thread spawned by forkIO is to write to that MVar. These should be safe. My concern, though, is not about the operations I’m aware of, but rather about the unspecified implementation of these functions.

One could imagine, for example, that adding a finalizer with mkWeak may be implemented by adding the finalizer to some kind of global finalizer list. If the thread creating the weak pointer is interrupted in the middle of updating that list, it may hold locks or something that it could continue to hold if it is interrupted without exception handling. Similarly, does forkIO update a global thread list, and if so, does interrupting that at the wrong moment pose similar risks? Even if that’s not true today, it could potentially become true tomorrow, because there’s nothing specified about mkWeak or forkIO that forbids them taking a lock.

I am pretty convinced that today, the code I’ve written works. I’m even convinced that it’s not likely that such low-level operations as I’ve used will suddenly start taking global locks in GHC. Nevertheless, it’s a bit unnerving for the correctness of my code to depend on a bunch of assumptions and “eh, it’ll probably never happen” sort of reasoning. I guess I was hoping that perhaps I’m missing some reason why it is at least sometimes safe to use unsafeIOToSTM.

jaror · July 27, 2022, 1:08pm

I believe it should not be possible for that to happen. A proper implementation would mask exceptions and only allow interruptions outside its critical sections. Otherwise you would be able to wreak similar havoc with standard asynchronous exceptions.

However, now that you mention mkWeak has finalizers, I think even if mkWeak is implemented atomically, then the added finalizer by itself is also not something you can roll back automatically.

cdsmith · July 27, 2022, 1:15pm

Ah, so even although the IO action used with unsafeIOToSTM doesn’t actually see an exception when a transaction is aborted, you believe that it will continue to run until exceptions are unmasked before aborting? Would that be documented somewhere?

jaror · July 27, 2022, 1:16pm

That’s what I would assume, but I don’t know the details either.

Edit: Actually, I think STM doesn’t interrupt anything at all. It just lets the operations run completely and only afterwards decide whether to keep the results or roll back.

sclv · July 28, 2022, 12:54am

That would be an optimistic model of stm. but I think STM isn’t actually optimistic in haskell.

Indeed the docs say “The STM implementation will abort transactions that are known to be invalid and need to be restarted. This may happen in the middle of unsafeIOToSTM, so make sure you don’t acquire any resources that need releasing (exception handlers are ignored when aborting the transaction). That includes doing any IO using Handles, for example. Getting this wrong will probably lead to random deadlocks.”.

Unlike what is described in the initial STM papers, STM stops threads dead in their tracks if they are inconsistent or rolled-back on every “validation” which occurs on every GC, whether or not they finish. This is necessary because otherwise transactions might get stuck in a loop due to an inconsistent state.

So I think in some sense the “safe” operations to unsafeIOToSTM would be those which are guaranteed to not be only “partway done” when a GC kicks off. But I think GCs can occur at just about any point in time, so, I suppose they’re also those which if they are stopped at any point at all, its fine.