Help with designing an API for text encodings

taylorfausak · September 29, 2022, 2:04am

I maintain Witch, a library for converting values from one type into another. Recently I’ve been struggling with how to handle conversions between text and encodings like UTF-8. I’d like to get some feedback on a change I’m thinking about before I commit to it.

The Basics

I don’t think you need to be familiar with the Witch library in order to weigh in here. However it will be useful to be familiar with the basics so that you can see how things are used in context. Witch revolves around two type classes: From for conversions that always succeed, and TryFrom for ones that can sometimes fail. They look roughly like this:

{-# language MultiParamTypeClasses #-}

class From source target where
  from :: source -> target

class TryFrom source target where
  tryFrom :: source -> Maybe target

I like using examples to show how things work. For these type classes, the sized integral types make for great example. For example you can always convert from a Word8 into a Word16, but you can’t always go the other way around because you might overflow. This pair of type classes allows you to express that:

import Data.Word (Word8, Word16)
import qualified Data.Bits as Bits

instance From Word8 Word16 where
  -- from :: Word8 -> Word16
  from = fromIntegral

instance TryFrom Word16 Word8 where
  -- tryFrom :: Word16 -> Maybe Word8
  tryFrom = Bits.toIntegralSized

And to use those instances, you simply call from or tryFrom with the appropriate types:

from (0 :: Word8) :: Word16
-- 0

tryFrom (0 :: Word16) :: Maybe Word8
-- Just 0

tryFrom (256 :: Word16) :: Maybe Word8
-- Nothing

In practice, things will often be inferred from context. Or you’ll use type applications. The explicit type signatures are relatively uncommon. But ergonomics are besides the point here

Encoding Problem

With an interface like that, how do you support converting between Text and ByteString? The simplest thing that works is to pick an encoding. That’s exactly what Witch does today! You can get any encoding you want, as long as it’s UTF-8.

import Data.Text (Text)
import Data.ByteString (ByteString)
import qualified Data.Text.Encoding as Encoding

instance From Text ByteString where
  -- from :: Text -> ByteString
  from = Encoding.encodeUtf8

instance TryFrom ByteString Text where
  -- tryFrom :: ByteString -> Maybe Text
  tryFrom = hush . Encoding.decodeUtf8'

hush :: Either x a -> Maybe a
hush = either (const Nothing) Just

This is not great for two reasons: The encoding is implicit, and you can’t change it. Nothing in the type signature suggests that it will use UTF-8. Maybe you can guess that since it’s a popular encoding. Or maybe you can read the documentation to find out. But it would be nice if it was in the types! Not only would it be self documenting, but then we could choose another encoding if we wanted to.

Current Solution

So let’s do just that. I’m going to introduce a bunch of types here, but in essence they’re similar to the Tagged data type with a Symbol as the tag. I’ll get back to that later. Let’s start with a type for wrapping up some value with a phantom encoding:

newtype WithEncoding encoding value
  = WithEncoding value
  deriving (Eq, Show)

withEncoding :: encoding -> value -> WithEncoding encoding value
withEncoding _ = WithEncoding

withoutEncoding :: WithEncoding encoding value -> value
withoutEncoding (WithEncoding x) = x

And then let’s add a type for that encoding:

data Utf8
  = Utf8
  deriving (Eq, Show)

Armed with those, we can write some better instances that more clearly express which encoding we’re using:

{-# language FlexibleInstances #-}

instance From Text (WithEncoding Utf8 ByteString) where
  -- from :: Text -> WithEncoding Utf8 ByteString
  from = withEncoding Utf8 . Encoding.encodeUtf8

instance TryFrom (WithEncoding Utf8 ByteString) Text where
  -- tryFrom :: WithEncoding Utf8 ByteString -> Maybe Text
  tryFrom = hush . Encoding.decodeUtf8' . withoutEncoding

That feels way better. We’re using a phantom type to tag the byte string with its encoding. That makes it clearer that we’re using UTF-8, and it makes it possible to provide conversions for other encodings.

Better Solution

So what’s the problem? Why am I here asking about this if I already have a solution?

Well the problem is that I’m a little confused about what WithEncoding Utf8 ByteString means. In the From instance, it means the ByteString is definitely valid UTF-8. We just converted it from Text, so it had better be valid! But in the TryFrom instance, it means the ByteString should be valid UTF-8. We don’t know if it is in fact valid yet or not. Once we decode it, we’ll know, but by then we will have converted it into Text anyway.

This ambiguity makes me uncomfortable.

Perhaps it could be addressed by introducing even more types, like DefinitelyUtf8 and ProbablyUtf8. But that doesn’t feel satisfying.

I’ve been thinking about other ways to approach this problem. My current favorite is to tag the Text value instead of the ByteString. Like this:

instance From (WithEncoding Utf8 Text) ByteString where
  -- from :: WithEncoding Utf8 Text -> ByteString
  from = Encoding.encodeUtf8 . withoutEncoding

instance TryFrom ByteString (WithEncoding Utf8 Text) where
  -- tryFrom :: ByteString -> Maybe (WithEncoding Utf8 Text)
  tryFrom = fmap (WithEncoding Utf8) . hush . Encoding.decodeUtf8'

This feels backwards. But it’s got some things going for it! Most importantly, it’s clearer to me what WithEncoding Utf8 Text means: It’s Text, and if you decode into it or encode from it, it’ll use UTF-8.

Questions

Thanks for taking the time to read my long-winded post! Hopefully I did a good job outlining the problem and a couple of solutions. I’m seeking feedback before I commit to a solution so that I can avoid changing things again later on.

Should I tag the ByteString or the Text? Or do something else entirely?
Should I define my own type like WithEncoding or use the one from Data.Tagged?
Should I define my own type like Utf8 or use symbols like "UTF-8"?

If you’d prefer to discuss this on GitHub, I have a pull request with these changes here: https://github.com/tfausak/witch/pull/58. The related issue also includes lots of links to other libraries on Hackage that solve this same problem.

hasufell · September 29, 2022, 3:51am

Even if a bytestring can be parsed by a UTF-8 parser, that doesn’t mean it’s actually semantically UTF-8.

You could be skewing the meaning and introduce incorrect behavior down the line, including security bugs.

The only correct way is to require explicit passing of the encoding. You can do that with a newtype and remove the bytestring instances.

danidiaz · September 29, 2022, 1:28pm

In the From instance, it means the ByteString is definitely valid UTF-8. We just converted it from Text , so it had better be valid! But in the TryFrom instance, it means the ByteString should be valid UTF-8. We don’t know if it is in fact valid yet or not.

Perhaps change the instance From (WithEncoding Utf8 Text) ByteString instance to

instance IsEncoding encoding => From Text (encoding -> ByteString)

And now WithEncoding can unequivocally mean “attempt to decode using this encoding”. Or perhaps simply use a tuple instead of WithEncoding:

instance  IsEncoding encoding => TryFrom (encoding, ByteString) Text where

WithEncoding exists to support a form of “uncurrying”, after all. (Although, come to think of it, using a tuple might cause overlapping instances if the types aren’t concrete.)

ChShersh · September 29, 2022, 3:34pm

Let’s take a step back and think about the final goal here.

In text-2.0, Text is already stored internally as a UTF-8 byte array. So, in some sense, Text is already a version of UTF-8 encoded ByteString. In fact, encodeUtf8 from Text to ByteString doesn’t do any actual encoding, it just copies bytes.

Why copy bytes instead of simply returning an underlying ByteString? Well, the difference between Text bytes and ByteString bytes is that ByteString is pinned and Text is unpinned. So conversion requires copying to a different memory region.

I’m not sure there exists a GHC primitive for marking an existing memory region as pinned.

In this context, the benefit of newtype Utf8 = Utf8 ByteString over Text is that conversion from Utf8 to ByteString is a no-op. But why would you prefer this Utf8 newtype over Text? This is the important question here.

I would think about possible use cases for Utf8 and evaluate how often this data type should be preferred over Text. It may happen, you don’t even need this newtype after all.

But if you still need it, I agree with @hasufell that a similar newtype with a smart constructor is the way to go here. I don’t think there are many benefits of the generic WithEncoding data type because there’re not so many data types you may want to encode differently (probably just ByteString?) and there’re not so many common encodings (probably just UTF-8?) (unless you want to create a generic library for converting between ASCII, UTF-8, UTF-16, UTF-32, KOI-8 and so on).

jaror · September 29, 2022, 3:45pm

It is a bit more nuanced. Sometimes Text is also pinned:

A byte array [and by extension Text] can be pinned as a result of three possible causes:

It was allocated by newPinnedByteArray#. [Text is never constructed using this primitive]

It is large. Currently, GHC defines large object to be one that is at least as large as 80% of a 4KB block (i.e. at least 3277 bytes).

It has been copied into a compact region. The documentation for ghc-compact and compact describes this process.

I actually don’t know why encodeUtf8 doesn’t check if the Text is already pinned. Maybe that is some low-hanging optimization fruit?

taylorfausak · September 30, 2022, 2:30am

Thanks for the quick and thorough feedback!

For the purposes of this discussion, I’m not interested in performance. I don’t care that Text values happen to already be UTF-8 behind the scenes.

I’m more interested in the interface that lets you convert between Text and ByteString for various encodings. Although I’m currently focused on UTF-8, I suspect I’ll add UTF-16 and UTF-32 as well.

And while I’m focused on ByteStrings, I think the same interface could work just as well with Vector Word8 or any other array type. I’d like to keep things flexible enough to allow for that.

It sounds like y’all are in favor of tagging the ByteString, which I think is a good approach. Especially @hasufell’s note about semantics helps me get over my trepidation.

I hadn’t thought about using a function or a tuple as the “output” type, as @danidiaz suggested. That’s clever, but I feel like WithEncoding (or Tagged) is a little more straightforward, at the expense of introducing another type.

I’m leaning toward going with the first solution, like this:

instance From Text (WithEncoding Utf8 ByteString) where ...
instance TryFrom (WithEncoding Utf8 ByteString) Text where ...

I don’t know if all the extra types (WithEncoding and Utf8) are worth the overhead. Should I do this instead?

instance From Text (Tagged "UTF-8" ByteString) where ...
instance TryFrom (Tagged "UTF-8" ByteString) Text where ...

taylorfausak · October 3, 2022, 10:28pm

I published these changes in Witch version 1.1.0.0. Thanks again for the feedback!

reuben · October 4, 2022, 3:00pm

Btw, I hadn’t heard of Witch before, but it worked great, and it feels like something I’ll want to use a lot!