Botan bindings devlog

I’m quite illiterate in this area, so I might very well be talking nonsense. But why do we care about building Botan from its sources? Why couldn’t botan-bindings just package botan_all.cpp?

“Compressing” a multi-unit C++ project into a single file is a non-trivial task. Also, considering that botan takes a while to build, even if you managed to produce a single-file bundle, it’s not clear how much memory GCC/clang would need to compile it (my guess is a lot).

The Amalgamation Build

As an ex source-based Linux distro maintainer, I’ve seen cases where big enough C++ files caused such memory spikes that compilation regularly failed on end users’ systems.

On a more general note, I want to reiterate that bundling any crypto source code is considered very bad practice amongst Linux distro maintainers.

What you’re essentially doing is becoming responsible for tracking security vulnerabilities upstream (which is already done and automated in large Linux distros, including tools that check for affected packages etc.) and figuring out how to ship updates to end users as soon as possible.

cabal has no support for this. And we just got a new security response team that is still working on their processes and workflow.

If you bundle crypto libraries, please demonstrate how you’re tracking upstream vulnerabilities and how you will inform end users.

5 Likes

Update: botan 0.0.1 package candidate imminent

Whew! It’s been an intense couple of days, now that botan-bindings and botan-low are published to hackage. Additionally, the round 2 funding proposal has been updated to provide more details on the intended trajectory for the next several months.

Of course, now that the excitement has settled down, we are back to focusing on our next goal: getting botan to package candidate status, and soon after to release. Here lies the issue that I’ve been gnawing on for the last few days:

Our initial ADT-based approach to managing algorithms has turned out to be insufficient. Cryptographic typeclasses were part of the original proposal but were removed to reduce scope; however, they’ve turned out to be necessary in the long term.

It is going to take time to develop the proper typeclass-and-data-family solution needed to build per-algorithm gold-standard modules, and I’d like to get botan live before then. Now, there’s still:

  1. the problem of the insufficiency of the ADT interface, which must be dealt with before we publish a 0.0.1, and

  2. the question of whether to prune the typeclasses / data families / per-algorithm gold-standard modules from said initial release because they are still being developed, and add quite a bit of inertia - I’d rather present something comparable to z-botan first before* going after higher-level targets.

* I’m trying to keep my prioritization straight

I do have a solution to #1, which I feel may suffice. Our issue is thus:

-- We have some set of algorithms for a common operation
data Hash
    = CryptoHash CryptoHash
    | Checksum Checksum

-- We have nested ADTs because some functions require a specific subset of algorithms
data CryptoHash
    = SHA3 SHA3
    | ...

data Checksum
    = Adler32
    | ...

-- Algorithms may themselves have variants, and so the nesting gets deeper
data SHA3
    = SHA3_512
    | ...

-- An algorithm for one operation may be a component in another operation, but this is dangerous:
data MAC
    = HMAC Hash -- Wrong!

-- The component algorithm may require a specific subset:
data MAC
    = HMAC CryptoHash  -- Right!

-- This all turns out to be unwieldy in practice, and leads to ridiculous stacks of wrappers:
hmac_sha3_512 = HMAC $ CryptoHash $ SHA3 $ SHA3_512

Typeclasses are the obvious long-term solution, but will take time. What’s the interim solution? Let’s ask ourselves, “What did z-botan do?”

Well, they flattened the ADT to an enumeration, and threw an exception if an incorrect algorithm was used. I think we can do slightly better…

-- Flatten the ADTs, have newtype wrappers for specific subsets!
data Hash
    = SHA3_512
    ...
    | Adler32

-- Then, it's up to each subset to define their wrapper and smart constructors
newtype CryptoHash = MkCryptoHash { unCryptoHash :: Hash }

-- We'll use smart constructors instead of just throwing exceptions like z-botan
cryptoHash :: Hash -> Maybe CryptoHash
cryptoHash SHA3_512 = Just $ MkCryptoHash SHA3_512
...
cryptoHash _ = Nothing

unsafeCryptoHash :: Hash -> CryptoHash
unsafeCryptoHash h = fromJust $ cryptoHash h

newtype Checksum = MkChecksum { unChecksum :: Hash }

checksum :: Hash -> Maybe Checksum
checksum Adler32 = Just $ MkChecksum Adler32
...
checksum _ = Nothing

unsafeChecksum :: Hash -> Checksum
unsafeChecksum h = fromJust $ checksum h

-- Algorithm families can get the same treatment if necessary
newtype SHA3 = MkSHA3 { unSHA3 :: Hash }

-- This means that we have at most 1 wrapper, from enumeration to specific subset
hmac_sha3_512 = HMAC (unsafeCryptoHash SHA3_512)

It’s not perfect, but it gives us the type safety we need, without impairing the ergonomics too much while we work on the proper typeclasses. A happy medium!
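
For instance, the safe path looks like this (a minimal sketch, reusing the MAC type from earlier in this post):

-- Callers that start from an arbitrary Hash value go through the smart
-- constructor and handle the Nothing case, so the subset is enforced
-- without exceptions
hmacFor :: Hash -> Maybe MAC
hmacFor h = HMAC <$> cryptoHash h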

There’s no update to the repo quite yet as I’m still doing some cleanup while applying this, but once that is done, the botan 0.0.1.0 package candidate will be going up.

So then, this leaves the question of #2 - to prune or not to prune? Do we elide the in-progress abstractions to keep focus on the core modules?

2 Likes

Please consider this bit carefully and try to avoid these as much as possible, even for the final interface. It looks like there’s potential to overengineer here and make it hard for clients of the library to figure out how to use the API, as these abstractions can significantly obscure it.

2 Likes

There is a long-term method to my madness; I am trying to keep the ergonomics of use in mind, but botan is primarily a cryptographic kitchen sink, and so it is my duty to expose the full complexity of every algorithm and primitive. botan is far closer to crypton than something like saltine.

However, I am planning for a user-friendly saltine-like interface as well, and I think your concerns align more closely there. The end goal is to expose a pure, idiomatic interface, but all of the mutable IO machinery underneath is still necessary; this is part of that important machinery, and not necessarily the final interface.

A few things to help address your concerns:

  1. I am being sure to use data families and not type families; this significantly reduces strain on the compiler, as data families do not inhibit type inference at usage sites the way type families do, and instead place the onus on me to write the boilerplate.

  2. This was always going to be complex no matter how I chose to express it, and having tried several ‘simpler’ routes, it turns out that while data families do have more initial overhead for development, their complexity of use grows much more slowly. In comparison, the ADT approach is easy to develop but terrible to use.

  3. I am also writing a series of high-level tutorials focusing on explaining the correct usage, which in most cases consists of using a single type application or explicit type annotation. This interface affords a great deal of flexibility and specificity, as I have been quite thorough.

The following lines are all valid and equivalent:

sha3_512 "Fee fi fo fum!"
sha3 @512 "Fee fi fo fum!"
sha3 "Fee fi fo fum!" :: SHA3_512Digest
sha3 "Fee fi fo fum!" :: SHA3Digest 512
sha3 "Fee fi fo fum!" :: Digest (SHA3 512)
sha3 "Fee fi fo fum!" :: Digest SHA3_512
hash @(SHA3 512) "Fee fi fo fum!"
hash @SHA3_512 "Fee fi fo fum!"
hash "Fee fi fo fum!" :: SHA3_512Digest
hash "Fee fi fo fum!" :: SHA3Digest 512
hash "Fee fi fo fum!" :: Digest (SHA3 512)
hash "Fee fi fo fum!" :: Digest SHA3_512
-- 03a240a26674994a...

There is a great deal of freedom of choice, but from this we may expose a final, curated interface, eg by picking a particular idiom and emphasizing it in the tutorials.

  4. The cryptographic typeclasses are being designed to expose a consistent, implementation-agnostic cryptographic interface, and will eventually be lifted out into their own package, with botan continuing to conform to it. This will allow alternate backends to fulfill the required primitives, and let more complex implementation-agnostic concepts like Merkle trees be implemented without locking users to one backend library. It will also allow the final curated interfaces to be reused across backends, and is part of the long-term goal of a unified Haskell cryptography ecosystem.
3 Likes

Honestly, I still think the best abstraction here is backpack. At least at a conceptual level, it simply feels like the right thing.

I’d be happy to have a chat about it, even if just to put it on your map.

In practice, however, it may be that backpack is not a viable option due to the lack of support in Stack and HLS; on the other hand, that could be a strong motivation to implement such support.

2 Likes

I do not at all oppose investigating the use of backpack to make swapping of implementations easier, but the ability to pull it out as an implementation-agnostic interface is merely a convenient side effect of being extremely stringent in acknowledging the unique properties of all of the various algorithms :grin:


I feel as though I need to clarify something for the general audience, however, as there appears to be some wariness or misunderstanding of why the typeclasses are being developed:

The typeclass abstractions are still necessary within botan as a single package, in order to expose multiple algorithms for the same primitive in a type-safe manner without throwing exceptions, or pretending that algorithm variants don’t exist.

Many algorithms have unique knobs and levers, and it is only at the very end, when we process the plaintext, that they act similarly. Digest MD5 is not the same type as Digest (SHA3 512), and should not be considered equal, even if they were to contain the same bytes.

Some algorithm families are parameterized by a unique salt / identifier / tweak that is not necessarily bounded in size, and thus there may be an infinite number of instances to write manually. Skein512 512 "foo" takes two additional parameters compared to MD5 - how are we to provide the additional arguments while providing a uniform interface?

Typeclasses were designed for this; the alternative is passing an algorithm witness around exactly like you might a typeclass implementation dictionary, while having to handle errors when people pass in a parameter that isn’t valid for a given algorithm / operation - errors which would normally be handled by the type system via typeclass constraints.
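
For contrast, here is a rough sketch of that witness-passing style (HashAlg and newHMAC are illustrative names, not an existing API):

data HashAlg = MD5 | SHA3_512 | Adler32

newtype MACAlg = HMAC HashAlg

-- Every construction site has to validate the witness at runtime; a typeclass
-- constraint would rule out Adler32 at compile time instead
newHMAC :: HashAlg -> Either String MACAlg
newHMAC Adler32 = Left "HMAC requires a cryptographic hash, not a checksum"
newHMAC alg     = Right (HMAC alg)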

I’d be writing the newtypes anyway for per-algorithm type-safety - take a look! The data families are effectively free and the typeclasses just codify everything that I’m doing anyway.

We have our data family and typeclass - not scary at all :slight_smile: :

data family Digest hash

class (Eq (Digest hash), Ord (Digest hash)) => Hash hash where
    hash :: ByteString -> Digest hash
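
Anything written against the class alone is then algorithm-agnostic. For example (hashFile is just an illustrative helper, not part of the package):

import qualified Data.ByteString as BS

-- Nothing here names a concrete algorithm; the caller picks one with a type
-- application or annotation, just like the earlier hash examples
hashFile :: Hash h => FilePath -> IO (Digest h)
hashFile path = hash <$> BS.readFile path

-- e.g.  hashFile @MD5 "message.txt"
--       hashFile "message.txt" :: IO (Digest (SHA3 512))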

We could implement simple concrete types for our algorithm this way, without the typeclass and data families:

newtype MD5Digest = MkMD5Digest { getMD5ByteString :: ByteString }
    deriving newtype (Eq, Ord)

md5 :: ByteString -> MD5Digest
md5 = MkMD5Digest . Botan.hash Botan.md5

However, conformance to the typeclasses is effectively free:

-- Free
data MD5

-- Free by replacing "MD5Digest" with "instance Digest MD5"
newtype instance Digest MD5 = MkMD5Digest { getMD5ByteString :: ByteString }
    deriving newtype (Eq, Ord)

-- Unchanged
md5 :: ByteString -> MD5Digest
md5 = MkMD5Digest . Botan.hash Botan.md5

-- Free
type MD5Digest = Digest MD5

-- Free
instance Hash MD5 where
    hash :: ByteString -> Digest MD5
    hash = md5

Something with a few type parameters isn’t much harder.

data SHA3 (n :: Nat)

-- Pardon this atrocity
type SHA3Size (n :: Nat) = (KnownNat n, (n == 224 || n == 256 || n == 384 || n == 512) ~ True)

newtype instance Digest (SHA3 n) = MkSHA3Digest { getSHA3ByteString :: ByteString }
    deriving newtype (Eq, Ord)

instance (SHA3Size n) => Hash (SHA3 n) where
    hash :: ByteString -> Digest (SHA3 n)
    hash = sha3

-- The only actual work involves type literals
sha3 :: forall n . (SHA3Size n) => ByteString -> Digest (SHA3 n)
sha3 = MkSHA3Digest . Botan.hash h where
        n = fromInteger $ natVal $ Proxy @n
        h = fromJust $ Botan.sha3 n -- assuming a low-level sha3 that takes the digest size

-- We still export user-friendly functions at the end
sha3_512 :: ByteString -> Digest (SHA3 512)
sha3_512 = sha3 @512

I want to stress that these typeclasses aren’t replacing anything, they’re only augmenting. I hope this gives a better idea as to the reasoning behind my approach.

2 Likes

Yes yes yes! That does clarify things, thank you.

That example is quite good. Looking forward to seeing more of this work

1 Like

I think instead of the atrocity, the following might be more helpful:

-- Needs Data.Kind (Constraint) and GHC.TypeLits (Nat, KnownNat, TypeError, ErrorMessage (..))
type family IsSHA3Size (n :: Nat) :: Constraint where
    IsSHA3Size 224 = ()
    IsSHA3Size 256 = ()
    IsSHA3Size 384 = ()
    IsSHA3Size 512 = ()
    IsSHA3Size n   = TypeError ('Text "SHA-3 digest size must be one of 224, 256, 384, or 512")

type SHA3Size (n :: Nat) = (KnownNat n, IsSHA3Size n)

This way the error message will be a lot more helpful.

2 Likes

Update: botan 0.0.1.0 package candidate uploaded!

As promised, it’s here:

There’s still a good bit of work to be done before release, but I’ll be giving it the same level of love that botan-bindings and botan-low got while they were package candidates - which was a lot.

First up, I’ve flattened the algorithm types for most operations (see issue #1 from earlier) with an overall positive result.

It is really visible in something like Botan.Cipher, where we both have specific subsets to deal with (Cipher vs AEAD) and take specific subsets of other operations as arguments (BlockCipher vs BlockCipher128). The new method is proving significantly better to use, so I’ll be finishing up and applying it to the remaining operation / algorithm data types as necessary.
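
To make the shape concrete, the pattern there looks roughly like this (constructor names are illustrative, not the exact Botan.Cipher definitions):

data BlockCipher = AES256 | Camellia256

-- Flattened enumeration of cipher modes
data Cipher
    = ChaCha20Poly1305
    | GCM BlockCipher
    | CBC BlockCipher

-- Operations that only accept AEADs take the newtype subset, built via a
-- smart constructor rather than by throwing on misuse
newtype AEAD = MkAEAD { unAEAD :: Cipher }

aead :: Cipher -> Maybe AEAD
aead c@ChaCha20Poly1305 = Just (MkAEAD c)
aead c@(GCM _)          = Just (MkAEAD c)
aead _                  = Nothing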

Secondly, I’ve decided not to elide the cryptography typeclasses from the 0.0.1 release (see issue #2 from earlier), but will still be postponing their continued development until after the 0.0.1 release, so as to stay focused on the immediate needs - prioritization! :upside_down_face:

The repo has been updated

4 Likes

Back in Action after getting some Rest

After getting botan-bindings and botan-low to the initial 0.0.1 release and botan to package candidate status, I found myself in need of a bit of a break - both because I needed to pull back and get a bird’s eye view of it, and because I’ve been working on this project almost non-stop for 8 months now and everyone deserves a bit of vacation now and then. :sleeping:

Now that’s over, and it is time to get back in action! As this project transitions to the next phase of its life cycle, I’ve been writing an official Haskell Cryptography Group milestone blog post and call-for-users, taking into account feedback on the project and the round 2 funding proposal (which I am also still working on :slight_smile:). I’ll be discussing the history of this project and how it all started, what we’ve accomplished so far, what feedback we’ve heard and what users want, and what flag we’ve planted on the horizon as our goal, as well as concrete details on the specific things we wish to accomplish next.

You can expect regular updates to resume Monday next week!


Minor note: The project has now officially transferred ownership to the Haskell Cryptography Group, and as planned, the repository has been moved:

7 Likes

A more sizeable update and proper First Milestone blog is coming soon :slight_smile: In the meantime, please enjoy this sneak peek at something I had fun implementing while on the break!

Just updating some old code to use botan-low

2 Likes

Returning from Absence.

I underestimated how much rest I needed, but in the words of Granny Weatherwax, “I aten’t dead”.

This last month has been half taking a break, half writing out the milestone blog and plans for the future, and half trying to figure out how to enable that future. For the moment, things are out of my hands, so instead I’m going to do what I do best, which is write code, and update the devlog.

Things that have changed in the last few weeks:

  • Cryptographic typeclasses have been temporarily moved to their own branch to focus on releasing botan
  • Minor improvements to utility functions and error handling
  • Botan.RNG significantly reworked and cleaned up
  • Botan.KDF, PwdHash, HOTP, TOTP are now release-ready and have tutorials and explanations
  • Stable implementation errors in botan_pwdhash_timed are now handled properly - see […] confusing param mapping between pwdhash and pwdhash_timed
  • A minor update to botan-bindings and botan-low may be necessary to improve constants and organization of named elliptic curves, logarithm groups, and raw algorithm modes.

This isn’t pushed yet - I’ll let you know when it is.

8 Likes

Hey @ApothecaLabs, is there a good place to chat with you/the botan devs about development? I’m currently working through how feasible it would be to swap crypton for botan in amazonka, and I’m running into a lot of issues with the design of the hashing API. I was wondering if there’s a good IRC channel or somewhere to chat - here feels like the wrong place, but I can do that if needed.

Quick summary of the issues:

  • There’s no pure, incremental hashing API (you probably saw my GitHub issue), which makes writing things like pure streaming sinks difficult - Monad m => Conduit ByteString () m (Digest a) can’t currently be written, it needs MonadIO instead (and with the current assumption that all hash state will be mutably updated, there could be bugs in how something like that would get used, depending on m). See the sketch after this list.
    As an aside here, linear types would make this API both pure and quite nice to work with - it probably shouldn’t be the only interface, but it would avoid the required copying of the context that a pure API would necessitate
  • The Digest data family is difficult to work with generically; it’s nearly impossible to write something that works on the wrapped ByteStrings, i.e. functions with types like Hash a => Digest a -> ByteString. I’d love to see a function digestBytes :: Digest a -> ByteString added to the Hash class, at the very least.
  • I’ve also grown quite fond of memory's ByteArrayAccess class and its ability to abstract types like digests so they can easily be hashed too. I think that cryptonite got a lot right with its use of memory; forcing everything to be a ByteString limits things a lot for production code.
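
To make the first point concrete, the kind of pure incremental interface being asked for might look something like this (a sketch only - IncrementalHash, HashState, and these functions are hypothetical, not part of the current botan API; it assumes the Hash class and Digest data family from earlier in the thread):

{-# LANGUAGE TypeFamilies #-}
import Data.ByteString (ByteString)
import Data.Conduit (ConduitT)
import qualified Data.Conduit.Combinators as CC

class Hash a => IncrementalHash a where
    data HashState a
    hashInit   :: HashState a
    hashUpdate :: HashState a -> ByteString -> HashState a
    hashFinal  :: HashState a -> Digest a

-- With such an API, the pure streaming sink becomes an ordinary fold
sinkHash :: (Monad m, IncrementalHash a) => ConduitT ByteString o m (Digest a)
sinkHash = hashFinal <$> CC.foldl hashUpdate hashInit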

I’m sure I’ll run into more, but this is where I am at the moment. I’m aware that botan is unreleased, but wanted to at least investigate whether it’ll be usable in Amazonka. At the moment there isn’t a clear path that wouldn’t affect performance.

I should also mention that I’m happy to contribute to the development of the changes I’d like to see - I should make a fork and start playing around with ideas.

5 Likes

Since we’re on the topic of Amazonka, I should also say that I’d love to see botan become the cryptography library that it rests on, but there are a few things beyond @Axman6’s exploration that I’d also like to see happen:

  1. The changes to support FFI really should make it into botan upstream, so that we’re depending on the same vetted code as everyone else.
  2. The Haskell packages and the native library should have a good build/download story for all major platforms. I don’t want to accidentally cut off users on e.g., Windows if there’s no easy way to get them the native deps they need. (Also, AFAICT, Botan 3 is only in Debian experimental.)
  3. botan-bindings should not have to vendor any source code. I’m leery of packages which vendor source code, especially in the cryptography and security spaces.

These sound like pretty big asks to me, so I want to get them on the table now as a starting point for discussion. Maybe I’m being too dogmatic, or maybe they’re easier than I think, but let’s talk about them at least.

5 Likes

As already explained elsewhere: [Haskell Cryptography Group] Botan: The First Milestone - #14 by hasufell

3.3.0 is in Debian experimental, and oddly enough, Windows is likely the least problematic platform.

I tried to create a botan-clib package, but the build system is a disaster: which source files to compile depends on platform and processor features that are detected during “configure”. You’ll have to map those things manually in the cabal file and create flags for all the CPU features, because the C source files used need to be encoded in the cabal file. And then review the build system changes on every version bump.

These are things that e.g. a Gentoo or nixpkgs developer would be comfortable doing, but it isn’t at all trivial. And you lose the automagic detection of features.

Another way might be to abuse custom setup really hard, but that’s not gonna be very popular or easy either.

2 Likes

@Axman6 I am down for a video chat to discuss your needs in depth - just send me a private message and we’ll pick a time.

I have a lot to say in response to all this, but saying it all would take time that I need to spend getting botan updated / released. So you get fragmented notes in the name of expediency :slight_smile: I hope they are not too incoherent.

  • It is brave of you to be testing out the high-level interfaces when they are unstable

    • Super glad you are interested
    • I am trying to keep this in mind, but when I push the release (cross your fingers that it’s within a day or two) things may change (for the better I hope!)
    • The focus has been on getting everything working in at least one way - then, we can deal with different interfaces
  • Data families vs witness / proxy

    • I don’t mind providing more than one interface, I just happen to like data families
      • Am planning on interfaces that also take an alg value witness or Proxy alg as an alternative, like crypton/ite
    • Most other cryptography operations have multiple types that must work together, so data families really work for them
    • Witnesses are simple when you only have a single operation to apply them to
      • Witnesses get hard when you have to apply them to every function in a multi-step operation
      • Or when you need multiple or compound algorithms
    • Your specific use case is the classic weak spot for data family type inference, eg when:
      • Inferring from the return type when it is the only mention of the actual type, as you cited
      • Inferring a type that is not visible to the outer scope (eg, digestBytes $ hash "foo")
    • Hashing only has one related type, which is also produced as an output
      • This means hashing in particular doesn’t work as well with data families, but it does work well with witnesses
      • Being consistent overall is worth the tradeoff though
      • We can still expose a witness / proxy -based interface too so this pain is temporary
  • There is a difference between incremental / online and mutable

    • I’m not surprised you are having issues, because I’ve been wanting to improve this area in particular - it’s confusing, and so you should expect a lot of change here in the coming update(s)
    • Incremental is really about online, i.e. chunked, processing; I want to reduce it to Strict vs Lazy ByteStrings, so a typeclass may then be unnecessary
    • Mutable is really about stateful contexts
      • A crypton/ite -style mutable interface is planned
  • Pure interfaces - only Hash has copyState, so specifically for hashes it is possible to make a stepwise pure interface out of a mutable interface, but for other operations it is not

    • This is actually sensible if you stop to think about it
      • For most cryptographic operations, copying state is highly undesirable or an outright information leak, so they don’t support it
      • Hashing is a rare exception: you may be hashing many things sharing the same prefix, so there is a major efficiency gain
      • The same is not necessarily true for other operations eg encryption and signing, especially anything using nonces
    • The botan hash state is mutable, so I am highly averse to pretending that it is pure
      • Need to guarantee that the IO actions are only run once each, and that they are run in the right order
    • Can make a copy of the hash state for every ‘pure’ function - inefficient, but possible (see the sketch after these notes)
    • Can pretend that it’s pure if we pretend that it is linear (ie, the result is only used once, as an argument for the next action)
    • Trying to avoid a class-splosion
    • A similar issue regarding pure vs mutable interfaces can be seen in random with RandomGen vs StatefulGen
  • Regarding memory, I too liked the ByteArrayAccess class and friends; however, there are issues that make me hesitant to include it as a dependency

    • The source repository has been archived by the author and is read-only
    • It does not appear to be maintained (and so may warrant initiating a takeover process)
    • botan-low has an absolutely minimal dependency footprint (aside from the botan library itself)
    • However, I have already been considering that it may be worth reviving those classes due to their utility
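
Regarding the “copy on every pure call” option above, a rough sketch of what that would look like (MutableHashCtx, copyHashState, and updateHashState are placeholders, not botan-low's actual names):

import Data.ByteString (ByteString)
import System.IO.Unsafe (unsafePerformIO)

-- Each 'pure' update copies the mutable context before touching it, so the
-- original is never observed in a mutated state; the extra copy is the
-- inefficiency mentioned above, and the unsafePerformIO is why this approach
-- is treated with caution
pureUpdate :: MutableHashCtx -> ByteString -> MutableHashCtx
pureUpdate ctx chunk = unsafePerformIO $ do
    ctx' <- copyHashState ctx      -- placeholder for the real copy operation
    updateHashState ctx' chunk     -- placeholder for the real mutable update
    pure ctx'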

I hope that is readable and relevant. It will make more sense when I publish the code.

1 Like