I just had a read through this PR, and one thing that immediately jumps out to me is how many bounds checks this must be doing. Every single unconsW8 checks that there's at least one byte, so unconsW64 does eight bounds checks. If this were written to check explicitly that there are at least eight bytes and then use unsafeIndex and unsafeDrop, I think performance could be improved.
Better yet, if the main loop of decompressBinary always checked that there are at least as many bytes as a single iteration can read*, it could safely use the non-checking functions throughout, and switch to the more conservative version for the last few chunks.
\* I tried to figure this out, but something seemed very strange with this code: does it really read one byte, then two, then four, then eight, and ignore the previous values? That feels like a really strange encoding!
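A minimal sketch of the single-check idea, assuming a strict ByteString and little-endian data (the name unconsW64 follows the PR; the endianness and byte assembly here are my guesses, not taken from the code):

```haskell
import qualified Data.ByteString as B
import qualified Data.ByteString.Unsafe as BU
import Data.Bits (shiftL, (.|.))
import Data.Word (Word64)

-- One length check up front, then eight unchecked reads and one
-- unchecked drop, instead of eight separate bounds checks.
unconsW64 :: B.ByteString -> Maybe (Word64, B.ByteString)
unconsW64 bs
  | B.length bs < 8 = Nothing
  | otherwise =
      let byte i = fromIntegral (BU.unsafeIndex bs i) :: Word64
          -- little-endian assembly; flip the shifts for big-endian data
          w = byte 0
              .|. byte 1 `shiftL` 8
              .|. byte 2 `shiftL` 16
              .|. byte 3 `shiftL` 24
              .|. byte 4 `shiftL` 32
              .|. byte 5 `shiftL` 40
              .|. byte 6 `shiftL` 48
              .|. byte 7 `shiftL` 56
      in Just (w, BU.unsafeDrop 8 bs)
```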
Good question. One reason you already gave (though it's not a pressing one). The other is that I couldn't figure out how to parse into a vector instead of a linked list; for efficiency reasons, I'd prefer the former.
I have written an attoparsec version of the code using [Int64], and curiously, it’s exactly as fast as the Binary implementation.
Fantastic work! Thank you for making the effort, and I’m glad memory consumption went down. The implementation also doesn’t rely on explicit mention of the lazy chunks, which is nice to see.
One thing to note here is that the “higher number” cases are not executed in the example .cbf file given, so the code could be completely bogus in that regard. One would need a better test image.
I wrote a flatparse parser that builds a list, with pure $ V.fromList xs at the end, and it got fused, so that the final Core is simply a loop that parses straight into a mutable vector without any intermediate list. Worth keeping an eye out for that, or the lack of it.
Nearly every time I've gone to use attoparsec, I've expected it to have primitives for parsing integers, and then remembered it's not really aimed at binary parsing. I can't see a good reason why it couldn't be, though, and I'd love to see primitives for parsing (Double|Float|(Word|Int)(8|16|32|64)(LE|BE)).
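For what it's worth, such a primitive seems easy to build on top of attoparsec's take, which does a single bounds check for all eight bytes. A hedged sketch (the name anyWord64le is mine; attoparsec itself doesn't export it):

```haskell
import qualified Data.Attoparsec.ByteString as A
import qualified Data.ByteString as B
import Data.Bits (shiftL, (.|.))
import Data.Word (Word64)

-- Hypothetical little-endian Word64 primitive built from attoparsec's
-- existing combinators; not part of the attoparsec API.
anyWord64le :: A.Parser Word64
anyWord64le = do
  bs <- A.take 8  -- fails if fewer than eight bytes remain
  -- little-endian assembly: the first byte is the least significant
  pure $! B.foldr' (\b acc -> acc `shiftL` 8 .|. fromIntegral b) 0 bs
```

Running it with A.parseOnly anyWord64le on an 8-byte input yields the assembled word.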
Unless you’re explicitly using build to construct the list with flatparse, I don’t see how it could fuse automatically. I would love to see what you did and how it works.
Edit: I just tested V.fromList [1..10] and that doesn’t even fuse, so I don’t think V.fromList can fuse with the input list at all.
See the go3_a2l4 helper function of the $wfoo function in the Core of this code:
I managed to dig up this from 2021: Rust vs Haskell: parse vector of ints · GitHub (which has some accompanying chat with @AndrasKovacs). I'd have to find some spare time with my laptop to reproduce it. Maybe I missed something while reading the Core at the time; shame I didn't paste that with my gist! You're welcome to try to reproduce it, as I don't have my laptop to hand at the moment (holiday season). It's possible there was a list, but it was consumed lazily and cheaply enough that I didn't pay proper attention. The numbers versus Rust and Andras's hand-rolled loop were very compelling.
I’ve applied the performance improvements by @jaror. I also tried writing to a mutable vector (in the ST monad) instead of returning an [Int64], but curiously, the code is exactly as fast as before.
indexMaybe over a lazy bytestring is linear in the index, so every readInt8 pays that cost.
Can you try not passing in an explicit pos, and instead indexing/dropping the lazy bytestring? If you're going to keep an explicit position, it makes more sense to use a strict bytestring.
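To spell out the asymptotics, here is a sketch (readInt8Lazy and readInt8Strict are my stand-ins, not the PR's functions): indexing a lazy bytestring walks the chunk list on every call, while strict indexing is O(1), so converting once with BL.toStrict pays for itself over a pos-driven loop.

```haskell
import qualified Data.ByteString as B
import qualified Data.ByteString.Lazy as BL
import Data.Int (Int64, Int8)

-- O(pos) per call: indexMaybe has to walk the lazy chunk list,
-- so a loop of these over increasing positions is quadratic overall.
readInt8Lazy :: BL.ByteString -> Int64 -> Maybe Int8
readInt8Lazy bs pos = fromIntegral <$> BL.indexMaybe bs pos

-- O(1) per call on a strict bytestring (convert once with BL.toStrict).
readInt8Strict :: B.ByteString -> Int -> Maybe Int8
readInt8Strict bs pos = fromIntegral <$> B.indexMaybe bs pos
```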