I maintain Witch, a library for converting values from one type into another. Recently I’ve been struggling with how to handle conversions between text and encodings like UTF-8. I’d like to get some feedback on a change I’m thinking about before I commit to it.
The Basics
I don’t think you need to be familiar with the Witch library in order to weigh in here. However it will be useful to be familiar with the basics so that you can see how things are used in context. Witch revolves around two type classes: From
for conversions that always succeed, and TryFrom
for ones that can sometimes fail. They look roughly like this:
{-# language MultiParamTypeClasses #-}
class From source target where
from :: source -> target
class TryFrom source target where
tryFrom :: source -> Maybe target
I like using examples to show how things work. For these type classes, the sized integral types make for great example. For example you can always convert from a Word8
into a Word16
, but you can’t always go the other way around because you might overflow. This pair of type classes allows you to express that:
import Data.Word (Word8, Word16)
import qualified Data.Bits as Bits
instance From Word8 Word16 where
-- from :: Word8 -> Word16
from = fromIntegral
instance TryFrom Word16 Word8 where
-- tryFrom :: Word16 -> Maybe Word8
tryFrom = Bits.toIntegralSized
And to use those instances, you simply call from
or tryFrom
with the appropriate types:
from (0 :: Word8) :: Word16
-- 0
tryFrom (0 :: Word16) :: Maybe Word8
-- Just 0
tryFrom (256 :: Word16) :: Maybe Word8
-- Nothing
In practice, things will often be inferred from context. Or you’ll use type applications. The explicit type signatures are relatively uncommon. But ergonomics are besides the point here
Encoding Problem
With an interface like that, how do you support converting between Text
and ByteString
? The simplest thing that works is to pick an encoding. That’s exactly what Witch does today! You can get any encoding you want, as long as it’s UTF-8.
import Data.Text (Text)
import Data.ByteString (ByteString)
import qualified Data.Text.Encoding as Encoding
instance From Text ByteString where
-- from :: Text -> ByteString
from = Encoding.encodeUtf8
instance TryFrom ByteString Text where
-- tryFrom :: ByteString -> Maybe Text
tryFrom = hush . Encoding.decodeUtf8'
hush :: Either x a -> Maybe a
hush = either (const Nothing) Just
This is not great for two reasons: The encoding is implicit, and you can’t change it. Nothing in the type signature suggests that it will use UTF-8. Maybe you can guess that since it’s a popular encoding. Or maybe you can read the documentation to find out. But it would be nice if it was in the types! Not only would it be self documenting, but then we could choose another encoding if we wanted to.
Current Solution
So let’s do just that. I’m going to introduce a bunch of types here, but in essence they’re similar to the Tagged
data type with a Symbol
as the tag. I’ll get back to that later. Let’s start with a type for wrapping up some value with a phantom encoding:
newtype WithEncoding encoding value
= WithEncoding value
deriving (Eq, Show)
withEncoding :: encoding -> value -> WithEncoding encoding value
withEncoding _ = WithEncoding
withoutEncoding :: WithEncoding encoding value -> value
withoutEncoding (WithEncoding x) = x
And then let’s add a type for that encoding:
data Utf8
= Utf8
deriving (Eq, Show)
Armed with those, we can write some better instances that more clearly express which encoding we’re using:
{-# language FlexibleInstances #-}
instance From Text (WithEncoding Utf8 ByteString) where
-- from :: Text -> WithEncoding Utf8 ByteString
from = withEncoding Utf8 . Encoding.encodeUtf8
instance TryFrom (WithEncoding Utf8 ByteString) Text where
-- tryFrom :: WithEncoding Utf8 ByteString -> Maybe Text
tryFrom = hush . Encoding.decodeUtf8' . withoutEncoding
That feels way better. We’re using a phantom type to tag the byte string with its encoding. That makes it clearer that we’re using UTF-8, and it makes it possible to provide conversions for other encodings.
Better Solution
So what’s the problem? Why am I here asking about this if I already have a solution?
Well the problem is that I’m a little confused about what WithEncoding Utf8 ByteString
means. In the From
instance, it means the ByteString
is definitely valid UTF-8. We just converted it from Text
, so it had better be valid! But in the TryFrom
instance, it means the ByteString
should be valid UTF-8. We don’t know if it is in fact valid yet or not. Once we decode it, we’ll know, but by then we will have converted it into Text
anyway.
This ambiguity makes me uncomfortable.
Perhaps it could be addressed by introducing even more types, like DefinitelyUtf8
and ProbablyUtf8
. But that doesn’t feel satisfying.
I’ve been thinking about other ways to approach this problem. My current favorite is to tag the Text
value instead of the ByteString
. Like this:
instance From (WithEncoding Utf8 Text) ByteString where
-- from :: WithEncoding Utf8 Text -> ByteString
from = Encoding.encodeUtf8 . withoutEncoding
instance TryFrom ByteString (WithEncoding Utf8 Text) where
-- tryFrom :: ByteString -> Maybe (WithEncoding Utf8 Text)
tryFrom = fmap (WithEncoding Utf8) . hush . Encoding.decodeUtf8'
This feels backwards. But it’s got some things going for it! Most importantly, it’s clearer to me what WithEncoding Utf8 Text
means: It’s Text
, and if you decode into it or encode from it, it’ll use UTF-8.
Questions
Thanks for taking the time to read my long-winded post! Hopefully I did a good job outlining the problem and a couple of solutions. I’m seeking feedback before I commit to a solution so that I can avoid changing things again later on.
- Should I tag the
ByteString
or theText
? Or do something else entirely? - Should I define my own type like
WithEncoding
or use the one fromData.Tagged
? - Should I define my own type like
Utf8
or use symbols like"UTF-8"
?
If you’d prefer to discuss this on GitHub, I have a pull request with these changes here: https://github.com/tfausak/witch/pull/58. The related issue also includes lots of links to other libraries on Hackage that solve this same problem.