Botan bindings devlog

Update: botan 0.0.1 package candidate imminent

Whew! It’s been an intense couple of days, now that botan-bindings and botan-low are published to hackage. Additionally, the round 2 funding proposal has been updated to provide more details on the intended trajectory for the next several months.

Of course, now that the excitement has settled down, we are back to focusing on our next goal: getting botan to package candidate status, and soon after to release. Here lies the issue that I’ve been gnawing on for the last few days:

Our initial ADT-based approach to managing algorithms has turned out to be insufficient. Cryptographic typeclasses were something that was in the original proposal, but they were removed to reduce scope; however, they’ve turned out to be necessary in the long term.

It is going to take time to develop the proper solution of typeclasses and data families in order to build per-algorithm gold-standard modules, and I’d like to get botan live before then. Now, there’s still:

  1. the problem of the insufficiency of the ADT interface, which must be dealt with before we publish a 0.0.1, and

  2. the question of whether to prune the typeclasses / data families / per-algorithm gold-standard modules from said initial release because they are still being developed, and add quite a bit of inertia - I’d rather present something comparable to z-botan first before* going after higher-level targets.

* I’m trying to keep my prioritization straight

I do have a solution to #1, which I feel may suffice. Our issue is thus:

-- We have some set of algorithms for a common operation
data Hash
    = CryptoHash CryptoHash
    | Checksum Checksum

-- We have nested ADTs because some functions require a specific subset of algorithms
data CryptoHash
    = SHA3 SHA3
    | ...

data Checksum
    = Adler32
    | ...

-- Algorithms may themselves variants, and so the nesting gets deeper
data SHA3
    = SHA3_512
    | ...

-- An algorithm for one operation may be a component in another operation, but this is dangerous:
data MAC
    = HMAC Hash -- Wrong!

-- The component algorithm may require a specific subset:
data MAC
    = HMAC CryptoHash  -- Right!

-- This all turns out to be unwieldy in practice, and leads to ridiculous stacks of wrappers:
hmac_sha3_512 = HMAC $ CryptoHash $ SHA3 $ SHA3_512

Typeclasses are the obvious long-term solution, but will take time. What’s the interim solution? Let’s ask ourselves, “What did z-botan do?”

Well, they flattened the ADT to an enumeration, and threw an exception if an incorrect algorithm was used. I think we can do slightly better…

-- Flatten the ADTs, have newtype wrappers for specific subsets!
data Hash
    = SHA3_512
    ...
    | Adler32

-- Then, it's up to each subset to define their wrapper and smart constructors
newtype CryptoHash = MkCryptoHash { unCryptoHash :: Hash }

-- We'll use smart constructors instead of just throwing exceptions like z-botan
cryptoHash :: Hash -> Maybe CryptoHash
cryptoHash SHA3_512 = Just $ MkCryptoHash SHA3_512
...
cryptoHash _ = Nothing

unsafeCryptoHash :: Hash -> CryptoHash
unsafeCryptoHash h = fromJust $ CryptoHash h

newtype Checksum = MkChecksum { unChecksum :: Hash }

checksum :: Hash -> Maybe Checksum
checksum Adler32 = Just $ MkChecksum Adler32
...
checksum _ = Nothing

unsafeChecksum :: Hash -> Checksum
unsafeChecksum h = fromJust $ Checksum h

-- Algorithm families can get the same treatment if necessary
newtype SHA3 = MkSHA3 { unSHA3 :: Hash }

-- This means that we have at most 1 wrapper, from enumeration to specific subset
hmac_sha3_512 = HMAC (unsafeCryptoHash SHA3_512)

It’s not perfect, but it gives us the type safety we need, without impairing the ergonomics too much while we work on the proper typeclasses. A happy medium!

There’s no update to the repo quite yet as I’m still doing some cleanup while applying this, but once that is done, the botan 0.0.1.0 package candidate will be going up.

So then, this leaves the question of #2 - to prune or not to prune? Do we elide the in-progess abstractions to keep focus on the core modules?

2 Likes

Please consider this bit carefully and try to avoid these as much as possible, even for the final interface. It looks like there’s potential to overengineer here and make it hard for clients of the library to figure out how to use the API as these abstractions can significantly obscure it.

2 Likes

There is a long-term method to my madness; I am trying to keep the ergonomics of use in mind, but botan is primarily a cryptographic kitchen sink, and so it is my duty to expose the full complexity of every algorithm and primitive. botan is far closer to crypton than something like saltine.

However, I am planning for a user-friendly saltine -like interface as well, and I think your concerns align more closely there. The end goal is to expose a pure, idiomatic interface, but all of the mutable IO and machinery underneath is still necessary, and this is part of that important machinery, and not necessarily the final interface.

A few things to help address your concerns:

  1. I am being sure to use data families and not type families; this significantly reduces strain on the compiler, as data families do not inhibit type inference at usage sites like type families do, and instead places the onus of effort on me to fulfill the boilerplate.

  2. This was always going to be complex, no matter how I chose to express it, and having tried several ‘simpler’ routes, it turns out that while data families do have more initial overhead for development, their complexity for use grows much more slowly. In comparison, the ADT approach is easy to develop, and terrible to use.

  3. I am also writing a series of high-level tutorials focusing on explaining the correct usage, which in most cases consists of using a single type application or explicit type annotation. This interface affords a great deal of flexibility and specificity, as I have been quite thorough.

The following lines are all valid and equivalent:

sha3_512 "Fee fi fo fum!"
sha3 @512 "Fee fi fo fum!"
sha3 "Fee fi fo fum!" :: SHA3_512Digest
sha3 "Fee fi fo fum!" :: SHA3Digest 512
sha3 "Fee fi fo fum!" :: Digest (SHA3 512)
sha3 "Fee fi fo fum!" :: Digest SHA3_512
hash @(SHA3 512) "Fee fi fo fum!"
hash @SHA3_512 "Fee fi fo fum!"
hash "Fee fi fo fum!" :: SHA3_512Digest
hash "Fee fi fo fum!" :: SHA3Digest 512
hash "Fee fi fo fum!" :: Digest (SHA3 512)
hash "Fee fi fo fum!" :: Digest SHA3_512
-- 03a240a26674994a...

There is a great deal of freedom of choice, but from this we may expose a final, curated interface, eg by picking a particular idiom and emphasizing it in the tutorials.

  1. The cryptographic typeclasses are being designed to expose a consistent, implementation-agnostic cryptographic interface and will eventually be lifted out into their own package with botan continuing to conform to it. This will eventually allow alternate backends to fulfill the required primitives, while allowing more complex implementation-agnostic concepts like merkle trees to be implemented without locking yourself to one backend library. This will also allow the final curated interfaces to be reused across backends, and is part of the long-term goal of a unified haskell cryptography ecosystem.
3 Likes

Honestly, I still think the best abstraction here is backpack. At least at a conceptual level, it simply feels like the right things.

I’d be happy to have a chat about it, even if just to put it on your map.

In practice, however, it may be that backpack is not a viable option due to Stack’s lack of support and HLS — on the other hand, it could be a strong motivation to implement support for it

2 Likes

I do not at all oppose investigating the use of backpack to make swapping of implementation easier, but the ability to pull it out as an implementation-agnostic interface is merely a convenient side effect of being extremely stringent in acknowledging the unique properties of all of the various algorithms :grin:


I feel as though I need to clarify something for the general audience, however, as there appears to be some wariness or misunderstanding of why the typeclasses are being developed:

The typeclass abstractions are still necessary within botan as a single package, in order to expose multiple algorithms for the same primitive in a type-safe manner without throwing exceptions, or pretending that algorithm variants don’t exist.

Many algorithms have unique knobs and levers, and it is only at the very end, when we process the plaintext, that they act similarly. Digest MD5 is not the same type as Digest (SHA3 512), and should not be considered equal, even if they were to contain the same bytes.

Some algorithm families are parameterized by a unique salt / identifier / tweak that is not necessarily bounded in size, and thus there may be an infinite number of instances to write manually. Skein512 512 "foo" takes two additional parameters compared to MD5 - how are we to provide the additional arguments while providing a uniform interface?

Typeclasses were designed for this; the alternative is passing an algorithm witness around exactly like you might a typeclass implementation dictionary, while having to handle errors when people pass in a parameter that isn’t valid for a given algorithm / operation, which would be normally be handled by the type system via typeclass constraints.

I’d be writing the newtypes anyway for per-algorithm type-safety - take a look! The data families are effectively free and the typeclasses just codify everything that I’m doing anyway.

We have our data family and typeclass - not scary at all :slight_smile: :

data family Digest hash

class (Eq (Digest hash), Ord (Digest hash)) => Hash hash where
    hash :: ByteString -> Digest hash

We could implement simple concrete types for our algorithm this way, without the typeclass and data families:

newtype MD5Digest = MkMD5Digest { getMD5ByteString :: ByteString }
    deriving newtype (Eq, Ord)

md5 :: ByteString -> MD5Digest
md5 = MkMD5Digest . Botan.hash Botan.md5

However, conformance to the typeclasses is effectively free

-- Free
data MD5

-- Free by replacing "MD5Digest" with "instance Digest MD5"
newtype instance Digest MD5 = MkMD5Digest { getMD5ByteString :: ByteString }
    deriving newtype (Eq, Ord)

-- Unchanged
md5 :: ByteString -> MD5Digest
md5 = MkMD5Digest . Botan.hash Botan.md5

-- Free
type MD5Digest = Digest MD5

-- Free
instance Hash MD5 where
    hash :: ByteString -> Digest MD5
    hash = md5

Something with a few type parameters isn’t much harder.

data SHA3 (n :: Nat)

-- Pardon this atrocity
type SHA3Size (n :: Nat) = (KnownNat n, (n == 224 || n == 256 || n == 384 || n == 512) ~ True)

newtype instance Digest (SHA3 n) = MkSHA3Digest { getSHA3ByteString :: ByteString }
    deriving newtype (Eq, Ord)

instance (SHA3Size n) => Hash (SHA3 n) where
    hash :: ByteString -> Digest (SHA3 n)
    hash = sha3

-- The only actual work involves type literals
sha3 :: (SHA3Size n) => ByteString -> Digest (SHA3 n)
sha3 = MkSHA3Digest . Botan.hash h where
        n = fromInteger $ natVal $ Proxy @n
        h = fromJust $ Botan.sha3 

-- We still export user-friendly functions at the end
sha3_512 :: ByteString -> Digest (SHA3 512)
sha3_512 = sha3 @512

I want to stress that these typeclasses aren’t replacing anything, they’re only augmenting. I hope this gives a better idea as to the reasoning behind my approach.

2 Likes

Yes yes yes! That does clarify things, thank you.

That example is quite good. Looking forward to seeing more of this work

1 Like

I think instead of the atrocity, the following might be more helpful:

type family IsSHA3Size n where
    IsSHA3Size 224 = ()
    IsSHA3Size 256 = ()
    IsSHA3Size 384 = ()
    IsSHA3Size 512 = ()
    IsSHA3Size _ = TypeError "Make this a custom type error that indicates what the allowed numbers are"

type SHA3Size (n :: Nat) = (KnownNat n, IsSHA3Size n)

This way the error message will be a lot more helpful.

2 Likes

Update: botan 0.0.1.0 package candidate uploaded!

As promised, it’s here:

There’s still a good bit of work to be done before release, but I’ll be giving it the same level of love that botan-bindings and botan-low got while they were a package candidate - which was a lot.

First up, I’ve flattened the algorithm types for most operations (see issue #1 from earlier) with an overall positive result.

It is really visible in something like Botan.Cipher where we both have specific subsets that we need to deal with (Cipher vs AEAD), while also taking specific subsets of other operations as arguments (BlockCipher vs BlockCipher128). This new method is proving to be significantly better to use, so I’ll be finishing up and applying to the remaining operation / algorithm data types as necessary.

Secondly, I’ve decided to not elide the cryptography typeclasses from the 0.0.1 release (see issue #2 from earlier), but will still be postponing their continued development until after the 0.0.1 release, as to stay focused on the immediate needs - priortization! :upside_down_face:

The repo has been updated

4 Likes

Back in Action after getting some Rest

After getting botan-bindings and botan-low to the initial 0.0.1 release and botan to package candidate status, I found myself in need of a bit of a break - both because I needed to pull back and get a bird’s eye view of it, and because I’ve been working on this project almost non-stop for 8 months now and everyone deserves a bit of vacation now and then. :sleeping:

Now that’s over, and it is time to get back in action! As this project transitions to the next phase of its life cycle, I’ve been writing an official Haskell Cryptography Group milestone blog post and call-for-users, taking into account feedback on the project and the round 2 funding proposal (which I am also still working on :slight_smile:). I’ll be discussing the history of this project and how it all started, what we’ve accomplished so far, what feedback we’ve heard / users want, and what flag we’ve planted on the horizon as our goal, as well as concrete details on the specific things we wish to accomplish next.

You can expect regular updates to resume Monday next week!


Minor note: The project has now officially transferred ownership to the Haskell Cryptography Group, and as planned, the repository been moved:

7 Likes

A more sizeable update and proper First Milestone blog is coming soon :slight_smile: In the meantime, please enjoy this sneak peek at something I had fun implementing while on the break!

Just updating some old code to use botan-low

2 Likes

Returning from Absence.

I underestimated how much rest I needed, but in the words of Granny Weatherwax “I ate’nt dead”.

This last month has been half taking-a-break, half writing out the milestone blog and plans for the future, and half trying to figure out how to enable that future. For the moment, things are out of my hands, so instead, I’m going to do what I do best, which is the code, and update the devlog.

Things that have changed in the last few weeks:

  • Cryptographic typeclasses have been temporarily moved to its own branch to focus on releasing botan
  • Minor improvements to utility functions and error handling
  • Botan.RNG significantly reworked and cleaned up
  • Botan.KDF, PwdHash, HOTP, TOTP are now release-ready and have tutorials and explanations
  • Stable implementation errors in botan_pwdhash_timed are now handled properly - see […] confusing param mapping between pwdhash and pwdhash_timed
  • A minor update to botan-bindings and botan-low may be necessary to improve constants and organization of named elliptic curves, logarithm groups, and raw algorithm modes.

This isn’t pushed yet - I’ll let you known when it is.

8 Likes

Hey @ApothecaLabs, is there a good place to chat with you/the botan devs about development? I’m currently working through seeing how feasible it would be to swap crypton for botan in amazonka, and I’m running into a lot of issues with the design of the hashing API. I was wondering if there’s a good IRC channel or somewhere to chat - here feels like the wrong place but I can do that if needed.

Quick summary of the issues:

  • There’s no pure, incremental hashing API (you probably say my GitHub issue), which makes writing things like pure streaming sinks difficult - Monad m => Conduit ByteString () m (Digest a) can’t currently be written, it needs MonadIO instead (and with the current assumption that all hash state will be mutable updated, there could be bugs in how something that would get used, depending on m).
    As an aside here, linear types would make this API both pure and quite nice to work with - it probably shouldn’t be the only interface, but it would avoid the required copying of the context that a pure API would necessitate
  • The Digest data family is difficult to work with generically, it’s nearly impossible to write something that works on the wrapped ByteStrings in functions with types like Hash a => Digest a -> ByteString. I’d love to see a function digestBytes :: Digest a -> ByteString added to the Hash class, at the very least.
  • I’ve also grown quite fond of memory's ByteArrayAccess class and its ability to abstract types like digests soo they can easily be hashed too. I think that cryptonite got a lot write with its use of memory, forcing everything to be a ByteString limits things a lot for production code.

I’m sure I’ll run into more, but this is where I am at the moment. I’m aware that botan is unreleased, but wanted to at least investigate whether it’ll be usable in Amazonka. At the moment there isn’t a clear path that wouldn’t affect performance.

I should also mention that I’m happy to contribute the development of the changes I’d like to see - I should make a fork start playing around with ideas.

4 Likes

Since we’re on the topic of Amazonka, I should also say that I’d love to see botan become the cryptography library that it rests on, but there are a few things beyond @Axman6’s exploration that I’d also like to see happen:

  1. The changes to support FFI really should make it into botan upstream, so that we’re depending on the same vetted code as everyone else.
  2. The Haskell packages and the native library should have a good build/download story for all major platforms. I don’t want to accidentally cut off users on e.g., Windows if there’s no easy way to get them the native deps they need. (Also, AFAICT, Botan 3 is only in Debian experimental.)
  3. botan-bindings should not have to vendor any source code. I’m leery of packages which vendor source code, especially in the cryptography and security spaces.

These sound like pretty big asks to me, so I want to get them on the table now as a starting point for discussion. Maybe I’m being too dogmatic, or maybe they’re easier than I think, but let’s talk about them at least.

5 Likes

As already explained elsewhere: [Haskell Cryptography Group] Botan: The First Milestone - #14 by hasufell

3.3.0 is in debian experimental and oddly enough, windows is likely the least problematic platform.

I tried to create a botan-clib package, but the build system is a disaster: which source files to compile depends on platform and processor features that are detected during “configure”. You’ll have to map those things manually in the cabal file and create flags for all the cpu features, because the c source files used need to be encoded in the cabal file. And then review the build system changes on every version bump.

These are things that e.g. a gentoo or nixpkgs developer would be comfortable to do, but it isn’t at all trivial. And you lose the automagic detection of features.

Another way might be to abuse custom setup really hard, but that’s not gonna be very popular or easy either.

2 Likes

@Axman6 I am down for a video chat to discuss your needs in depth - just send me a private message and we’ll pick a time.

I have a lot to say in response to all this but saying it all would take time that I need to spend getting botan updated / released. So you get fragmented notes in the name of expediency :slight_smile: I hope they are not too insensible.

  • It is brave of you to be testing out the high-level interfaces when they are unstable

    • Super glad you are interested
    • I am trying to keep this in mind, but when I push the release (cross your fingers that its within a day or two) things may change (for the better I hope!)
    • The focus has been on getting everything working in at least one way - then, we can deal with different interfaces
  • Data families vs witness / proxy

    • I don’t mind providing more than one interface, I just happen to like data families
      • Am planning on interfaces that also take a alg value witness or Proxy alg as an alternative, like crypton/ite
    • Most other cryptography operations have multiple types that must work together so data families really works for them
    • Witnesses are simple when you only have a single operation to apply them to
      • Witnesses get hard when you have to apply them to every function in a multi-step operation
      • Or when you need multiple or compound algorithms
    • Your specific use case is the classic weak spot for data family type inference, eg when:
      • Inferring from the return type as the only mention of the actual type as you cited
      • Inferring a type that is not visible to the outer scope (eg, digestBytes $ hash "foo")
    • Hashing only has one related type, which is also produced as an output
      • This means hashing in particular doesn’t work as well with data families, but it does work well with witnesses
      • Being consistent overall is worth the tradeoff though
      • We can still expose a witness / proxy -based interface too so this pain is temporary
  • There is a difference between incremental / online and mutable

    • I’m not surprised you are having issues because I’ve been wanting to improve this are in particular - its confusing and so you should expect a lot of change here in the coming update(s)
    • Incremental is really about online ie chunked processing, I want to reduce it to Strict vs Lazy ByteStrings so a typeclass may then be unnecessary
    • Mutable is really about stateful contexts
      • A crypton/ite -style mutable interface is planned
  • Pure interfaces - only Hash has copyState, so specifically for hashes it is possible to make a stepwise pure interface out of a mutable interface, but for other operations it is not

    • This is actually sensible if you stop to think about it
      • Most cryptographic operations, copying state is highly undesirable or an outright information leak, so they don’t
      • hashing is a rare exception as you may be hashing things sharing the same prefix, so there is a major efficiency gain
      • The same is not necessarily true for other operations eg encryption and signing, especially anything using nonces
    • The botan hash state is mutable, so I am highly averse to pretending that it is pure
      • Need to guarantee that the IO actions are only run once each, and that they are run in the right order
    • Can make a copy of the hash state for every ‘pure’ function, but this is inefficient - but it is possible
    • Can pretend that its pure if we pretend that it is linear (ie, the result is only used once as an argument for the next action)
    • Trying to avoid a class-splosion
    • A similar issue regarding pure vs mutable interfaces can be seen in random with RandomGen vs StatefulGen
  • Regarding memory, I too liked the ByteArrayAccess class and friends, however there are issues that make me hesitant to include it as a dependency

    • The source repository has been archived by the author and is read-only
    • It does not appear to be maintained (and so may warrant initiating a takeover process)
    • botan-low has an absolutely minimal dependency footprint (aside from the botan library itself)
    • However, I have already been considering that it may be worth reviving those classes due to their utility

I hope that is readable and relevant. It will make more sense when I publish the code.

1 Like

[memory] does not appear to be maintained (and so may warrant initiating a takeover process)

It is confirmed that none of Vincent Hanquez’ packages will be maintained, nor has he given an ok for taking it over. He’s ok with anyone forking it and using a new name, so that’s what’s been happening with crypton(ite) and (crypton-)x509 etc.

Xeno’s Paradox Progress

I’ll keep this short; progress toward an update / release is ongoing; I feel as though every update we get about half the remaining work done, but discover more to do. However, an update is good, and I’m trying to not stress about it :slight_smile: It has gotten a bit slow because now I might spend the entire day chasing down one specific operation on one specific algorithm in order to verify that it is well-behaved. The closer you look, the more detail there is to be seen.

Regardless, progress:

  • All the botan modules are now 100% functional-complete
  • Significant reorganization of modules to reduce their number
  • Lots of improvements to PubKey, including post-quantum Kyber / Dilithium algorithm parameters
  • There are proper newtypes for everything, and so type safety is significantly improved
  • Basic unit tests for botan are in progress, and are the major blocker for release now
  • Some massaging / improvements are still needed per catching exceptions and turning them into Either / Maybe when appropriate
    • This bears pondering, as there are some failures where an exception is far more appropriate
    • An example: encrypt is a general function for multiple algorithms; some algorithms have specific handling requirements for plaintext length and chunking, and failure to validate may throw an exception if the function is provided inappropriate arguments - this failure is different from failing to decrypt because of using the wrong key or auth and should return Nothing rather than throwing an exception.
    • This is important because otherwise we have to make a function like encrypt return a Maybe Ciphertext even though encryption really shouldn’t fail the same way decrypt can. This would result in an un-ergonomic interface.
    • I am strongly leaning towards providing validation functions for input, throwing exceptions in the standard functions (and maybe an Either variant that doesn’t), and saving Maybe for functions like decrypt that are expected to fail in specific ways.
  • All of the typeclasses / type family / algorithm-specific modules are still separated out for the moment, will need some minor merging back in, and may not make the initial botan release (as to better focus on the immediate needs)

I’m trying to get this out sooner rather than later, but it will require a minor update to the lower libraries, and I’ll need to make sure I don’t goof that up - I’m thinking I’ll bump botan-bindings and botan-low from 0.0.1 to 0.1.0 because I think there a few minor breaking changes, and release the first version of botan as 0.1.0.

4 Likes

What to do about memory re: crypton/ite?

I’ve been absent for some time due to the need to focus on personal health. It can be a struggle to reach out at times like these, but this last month I’ve not been idle. Herein lie the results of my investigation in the meantime.

In order for botan to provide a palatable alternative to crypton/ite, it is important to understand how people are using it, and what dependencies it implies. One of those dependencies is memory, which supplies a variety of typeclasses and data types regarding low-level pointers and memory. This package, like cryptonite and by the same author, has been archived, and will not be updated in the future.

This is problematic, because even after forking crypton/ite, many libraries still have a transitive dependency on memory. The reason for this is that one of the prominent design choices was to use constraints instead of concrete data types in many of the typeclasses and functions.

For example:

class Cipher cipher where
    ...
    cipherInit :: ByteArray key => key -> CryptoFailable cipher 
    ...
class Cipher cipher => BlockCipher cipher where
    ...
    ecbEncrypt :: ByteArray ba => cipher -> ba -> ba
    ecbDecrypt :: ByteArray ba => cipher -> ba -> ba 
    ...

Thankfully, most algorithm-specific outputs are wrapped in a concrete newtype.

hash :: (ByteArrayAccess ba, HashAlgorithm a) => ba -> Digest a

That does seem to be the general rule, that unstructured blobs of binary data be represented a ByteArray, mostly plaintext / ciphertext / key / salt / password -type data, but it can be slightly onerous when parameters are allowed to be heterogenous

bcrypt
    :: (ByteArray salt, ByteArray password, ByteArray output)	 
    => Int -> salt -> password -> output

As a result, unless one is dealing exclusively with ByteString, and / or the data types supplied by crypton/ite itself, use of memory is almost compulsory. Furthermore, if we wish to provide a similar interface using botan as a backend, this means either accepting memory as a dependency, or doing something about it.

As such, I have spent a lot of time familiarizing myself with its guts and how it is used. I really wanted to know the intent behind many of the decisions the author made (such as how the different allocation / create functions need to be used re: inlining and unsafePerformIO) because it doesn’t do to simply copy and paste. Here’s what I have learned:

The following libraries use both crypton/ite and memory, but there are many more.

  • amazonka-core
  • stack
  • tls
  • password
  • x509

One can look at the reverse dependencies for crypton/ite and memory to see the general cross-section. Thankfully most dependents of crypton/ite found ByteString and the supplied data types sufficient, and so they did not promulgate the issue by depending on memory directly, only indirectly.

So, what is memory?

It is a small, innocuous package, mostly exposing the following classes that allow ByteString and primitive:Data.Primitive.ByteArray-like access to pointers to contiguous memory:

-- Data.ByteArray

class ByteArrayAccess ba where
    length :: ba -> Int
    withByteArray :: ba -> (Ptr p -> IO a) -> IO a
    copyByteArrayToPtr :: ba -> Ptr p -> IO ()

class (Eq ba, Ord ba, Monoid ba, ByteArrayAccess ba) => ByteArray ba where

    allocRet :: Int	-> (Ptr p -> IO a) -> IO (a, ba)

data Bytes

data ScrubbedBytes

-- Data.ByteArray.Sized

class (ByteArrayAccess c, KnownNat n) => ByteArrayN (n :: Nat) c | c -> n where
    allocRet :: forall p a. Proxy n -> (Ptr p -> IO a) -> IO (a, c)

data SizedByteArray (n :: Nat) ba

These classes allow for fast and efficient manipulation of memory, and basically wrap up and classify a lot of low-level functions that GHC uses internally - incredibly powerful, and about as far as you can go without dropping down to using primitive and MagicHash directly.

Notably, the package does not distinguish between pinned and unpinned memory, though the unspoken assumption is that the memory pointed to in withByteArray :: ba -> (Ptr p -> IO a) -> IO a will not be moved for the duration of that operation.

Aside from a few minor quibbles of hierarchy and nomenclature for purposes of consistency with Data.Primitive.ByteArray and Foreign.* and for better separation of bytewise functions vs elementwise functions, it is a useful set of classes. I see no reason to make large changes to it.

So, how is memory being used

I do wish to understand how memory is being used, and so I have gone over many of the direct dependents in order to collate their use cases.

A few particular use cases dominate:

  • ByteArrayAccess and ByteArray classes
  • convert and constEq functions
  • convertToBase functions

Aside from that, there are several errata in the package that we must also consider. Notable:

  • ScrubbedBytes
  • SizedByteArray and ByteArrayN

ByteArrayAccess and ByteArray

By and far, the most-oft used parts of this package are the classes themselves. The major user of these classes is the crypton/ite ecology, and are the core of the package. Enough said - high priority.

convert and constEq

If they aren’t using ByteArray as a general ByteString-like class, directly, chances are they using memory in order to convert between ByteArray-conforming types. This is actually quite sensible, as many cryptography-related packages have some modules / classes / functions for converting various primitives / newtypes to and from bytestrings.

The other common use is to pull in constEq for constant-time equality checks for sensitive information.

Quite often, the memory package is pulled in for just for one or both of these. Between that and the classes themselves, we’ve covered the majority of memory use cases

Base and convertToBase

Seeing occasional use, memory also exposes functionality for converting between various bases in the Data.ByteArray.Encoding module. This is important, but not critical, because there are already packages for converting bytestrings to various bases.

SizedByteArray and ByteArrayN

Seeing infrequent use (eg, NaCl). Essentially a Nat-parameterized ByteArray, it might have been called FiniteByteArray if being consistent with Bits and FiniteBits.

ScrubbedBytes

Seeing infrequent use (eg, NaCl).

memory supplies a ScrubbedBytes class for sensitive memory that protects its Show instance and is automatically zeroed before release. This is a really important but also niche use, and I only observed one or two packages using this functionality.

Here is how the zeroing of memory works for ScrubbedBytes:

getScrubber :: Addr# -> State# RealWorld -> State# RealWorld
getScrubber addr s =
    let IO scrubBytes = memSet (Ptr addr) 0 (I# sz)
     in case scrubBytes s of
            (# s', _ #) -> s'

Essentially, it is using extremely low-level RealWorld magic to dynamically vend a memSet function call in the finalizer of ScrubbedBytes, in an effort to avoid the compiler eliding the zeroing due to the following free (that is, many assemblers will ignore memSet if it is immediately followed by free, because such a memSet is normally considered to be a no-op).

Even if a compiler emits the instructions to zero memory in the compiled binary, that is no guarantee that it will be run, because the CPU may still optimize it out during runtime. As a result, most operating systems provide a specific OS-dependent function for this in order to ensure that the memory is zeroed, and the volatile keyword and memset_explicit function have only recently been properly standardized in C.

I cannot attest to the efficacy of this functionality regarding GHC specifically, but for reference, here is how Botan::secure_scrub_memory works in C++

void secure_scrub_memory(void* ptr, size_t n) {
#if defined(BOTAN_TARGET_OS_HAS_RTLSECUREZEROMEMORY)
    ::RtlSecureZeroMemory(ptr, n);

#elif defined(BOTAN_TARGET_OS_HAS_EXPLICIT_BZERO)
    ::explicit_bzero(ptr, n);

#elif defined(BOTAN_TARGET_OS_HAS_EXPLICIT_MEMSET)
    (void)::explicit_memset(ptr, 0, n);

#elif defined(BOTAN_USE_VOLATILE_MEMSET_FOR_ZERO) && (BOTAN_USE_VOLATILE_MEMSET_FOR_ZERO == 1)
    static void* (*const volatile memset_ptr)(void*, int, size_t) = std::memset;
    (memset_ptr)(ptr, 0, n);
#else

    volatile uint8_t* p = reinterpret_cast<volatile uint8_t*>(ptr);

    for(size_t i = 0; i != n; ++i)
        p[i] = 0;
#endif
}

Replacement of this functionality is considered important, but as a low-level priority compared to the higher-priority functionalities.

Other errara

Outside of that, there are portions of memory that see little to no actual use:

  • Parser and Packer modules
  • Endianness
  • Word128
  • The SipHash and FNVHash modules

I do not think we need to worry about these for the time being.

So, what do we do about it?

I think it would be wise to provide ByteArray-like functionality in botan for better compatibility with crypton/ite and easier migration, and this assessment helps illustrate what capability needs to be preserved, and what can be ignored.

Thus I am considering creating a slimmed down bytearray-classes package, that plucks out the necessary classes and functions in order to allow us to depend on it instead of memory.

We definitely want the bulk of these modules, maybe ignoring the concrete data types and focusing only on the classes:

  • Data.ByteArray
  • Data.ByteArray.Encoding
  • Data.ByteArray.Sized
  • We also probably do want to make some data types a la Bytes and ScrubbedBytes

Using bytearray-classes

Migration would ideally be simple, as I don’t intend on changing the interfaces much, but this actually leads to a rather thorny problem and important decision - where in botan does this library get used? The answer to that is slightly frustating, and one that gives me no small amount of anxiety:

For improving compatibility / ease of migration, the best place to use bytearray-classes and ByteArray constraints like crypton/ite is in botan-low.

The reasoning here is that the concrete ByteString functions can be obtained from the ByteArray functions simply by restricting the types, and we would be able to make all of the internal logic and function generators for allocating ByteStrings much more consistent.

This of course depends on whether people actually prefer a ByteArray constraint over a concrete ByteString parameter at all, as you may find use of TypeApplications suddenly necessary.

5 Likes

Regarding amazonka-core: we only seem to directly use memory for base{16,64} encoding and it might be easy to remove. Issue

Ah, it is not, but I suspect this is a major cause of direct dependencies on memory:

This is not straightforward: the various Digest as that we get from crypton can only be interacted with using the ByteArrayAccess interface from memory, which is why amazonka-core has polymorphic functions for base{16,64} encoding. And if we move to botan we’ll be able to get encoders from there anyway, so there’s little point in doing this refactoring twice.

The password library also doesn’t really need (or at least want) to use memory.

The convert to Bytes we have is only in there because benchmarks showed it is slightly more performant and if we’re depending on memory already (because of crypton(ite)), then why not.
The one we DO want/need is the constEq for security reasons. Though I feel that is something that could also just be added to the bytestring library. Then we’d use that one.

1 Like