The ultimate guide to Haskell Strings

https://hasufell.github.io/posts/2024-05-07-ultimate-string-guide.html

I’ve learned a lot from this guide!

Tiny comment about the OsPath, PosixPath and WindowsPath section: perhaps it should be emphasized that these types live in the filepath package? Also, perhaps that section could directly follow the OsString, PosixString and WindowsString section.

Typo in “filo-io”:

and the platform agnostic directory and filo-io packages

Also perhaps we could explicitly link to the respective Haddocks.

Speaking of file-io, the difference between OsPath and PlatformPath is not clear to me. When should each one be used?

Section “To JSON”:

JSON demands UTF-8.

Looking through ECMA-404 I can’t find this guarantee. They do specify that it is Unicode:

A conforming JSON text is a sequence of Unicode code points

But I can’t find any mention of UTF-8.

It is in Section 8.1 of RFC8259 though:

8.1. Character Encoding

JSON text exchanged between systems that are not part of a closed ecosystem MUST be encoded using UTF-8 [RFC3629].

Previous specifications of JSON have not required the use of UTF-8 when transmitting JSON text. However, the vast majority of JSON-based software implementations have chosen to use the UTF-8 encoding, to the extent that it is the only encoding that achieves interoperability.

Implementations MUST NOT add a byte order mark (U+FEFF) to the beginning of a networked-transmitted JSON text. In the interests of interoperability, implementations that parse JSON texts MAY ignore the presence of a byte order mark rather than treating it as an error.
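
To make the byte order mark clause concrete, here is a minimal sketch of a parser-side helper that ignores a leading UTF-8 BOM before the bytes are handed to a JSON decoder. The names utf8Bom and stripUtf8Bom are hypothetical and not part of any JSON library; only the bytestring package is assumed.

```haskell
import qualified Data.ByteString as BS
import Data.Maybe (fromMaybe)

-- | The UTF-8 encoding of U+FEFF (the byte order mark).
utf8Bom :: BS.ByteString
utf8Bom = BS.pack [0xEF, 0xBB, 0xBF]

-- | Drop a leading UTF-8 BOM if present; otherwise return the input unchanged.
-- Hypothetical helper for illustration only.
stripUtf8Bom :: BS.ByteString -> BS.ByteString
stripUtf8Bom bytes = fromMaybe bytes (BS.stripPrefix utf8Bom bytes)

main :: IO ()
main = do
  let withBom = utf8Bom <> BS.pack [0x7B, 0x7D]  -- BOM followed by "{}"
  print (BS.length withBom)                 -- 5
  print (BS.length (stripUtf8Bom withBom))  -- 2
```

A producer, by contrast, should simply never emit the BOM in the first place.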

My interpretation of this has been, so far, that the encoding is simply outside the scope of the JSON specification - JSON source is just a valid Unicode string, and the encoding/decoding is defined in terms of Unicode code points, not raw bytes.

Negotiating the correct encoding is of course necessary when exchanging JSON data, but that is a separate concern. RFC8259 addresses that concern, but the original JSON spec did not (and that was fine at the time, since JSON was intended to be used in a context where the encoding is already negotiated as part of the underlying HTTP protocol, and the programming languages used on both sides were expected to have Unicode string types built into them).

In that context, I would even argue that encode :: Value -> ByteString (as in, for example, Aeson) is technically wrong: it should be encode :: Value -> Text. But of course that would have some undesirable performance implications…
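
As a rough sketch of that separation, aeson's Data.Aeson.Text module exposes encodeToLazyText, which produces Text rather than bytes. The wrappers encodeText and encodeBytes below are hypothetical names for illustration; the sketch assumes the aeson, text and bytestring packages.

```haskell
{-# LANGUAGE OverloadedStrings #-}
import Data.Aeson (Value, object, (.=))
import Data.Aeson.Text (encodeToLazyText)
import qualified Data.ByteString.Lazy as BL
import qualified Data.Text.Lazy as TL
import qualified Data.Text.Lazy.Encoding as TLE
import qualified Data.Text.Lazy.IO as TLIO

-- | JSON as a sequence of Unicode code points: no byte encoding chosen yet.
encodeText :: Value -> TL.Text
encodeText = encodeToLazyText

-- | Choosing UTF-8 is a separate, explicit step on top of the Text form.
encodeBytes :: Value -> BL.ByteString
encodeBytes = TLE.encodeUtf8 . encodeText

main :: IO ()
main = do
  let v = object ["greeting" .= ("hello" :: String)]
  TLIO.putStrLn (encodeText v)
  print (BL.length (encodeBytes v))
```

Going through Text first costs an extra pass compared with encoding straight to bytes, which is the performance concern mentioned above.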

PlatformString is more of an internal type that can aid writing cross-platform code.

You should almost always use OsString.

The definition is as follows:

#if defined(mingw32_HOST_OS) || defined(__MINGW32__)
type PlatformString = WindowsString
#else
type PlatformString = PosixString
#endif

newtype OsString = OsString { getOsString :: PlatformString }

As you can see, PlatformString is just a type synonym. That means you can’t use the OsPath API with it; you would have to ifdef your imports as well, which is unergonomic. This ifdef game is exactly what the OsString/OsPath newtype and modules do for you.

See System.OsString.Internal.

E.g.:

#if defined(mingw32_HOST_OS) || defined(__MINGW32__)
import GHC.IO.Encoding.UTF16 ( mkUTF16le )
import qualified System.OsString.Windows as PF
#else
import GHC.IO.Encoding.UTF8 ( mkUTF8 )
import qualified System.OsString.Posix as PF
#endif

...

-- | Encode an 'OsString' given the platform specific encodings.
encodeWith :: TextEncoding  -- ^ unix text encoding
           -> TextEncoding  -- ^ windows text encoding
           -> String
           -> Either EncodingException OsString
#if defined(mingw32_HOST_OS) || defined(__MINGW32__)
encodeWith _ winEnc str = OsString <$> PF.encodeWith winEnc str
#else
encodeWith unixEnc _ str = OsString <$> PF.encodeWith unixEnc str
#endif
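
For contrast, here is a minimal sketch of user code written against OsPath, where no CPP is needed because the newtype hides the platform choice. It assumes a filepath version that ships System.OsPath; buildPath is a hypothetical helper.

```haskell
import System.OsPath (OsPath, decodeUtf, encodeUtf, (</>))

-- | Build a path from two components without any platform ifdefs.
-- Hypothetical helper for illustration only.
buildPath :: IO OsPath
buildPath = do
  dir  <- encodeUtf "data"        -- throws if the String cannot be encoded
  file <- encodeUtf "report.txt"
  pure (dir </> file)

main :: IO ()
main = do
  p <- buildPath
  s <- decodeUtf p                -- back to String, for display only
  putStrLn s
```

The same source compiles on Windows and POSIX; only the separator inserted by (</>) and the underlying byte representation differ.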

That’s such a great guide, thank you very much @hasufell!

I often try to use the new OsPath, but the lack of support in the rest of the ecosystem is discouraging. Is there a plan to provide OsPath alternatives to the functions in the GHC libraries that expect a FilePath? What are your thoughts on this? What would such a change look like for, say, Data.Binary.encodeFile?

That seems to be just

import Data.Binary (Binary, encode)
import System.OsPath.Types (OsPath)
import qualified System.File.OsPath as OS

-- | Like 'Data.Binary.encodeFile', but takes an 'OsPath'.
encodeFile :: Binary a => OsPath -> a -> IO ()
encodeFile f v = OS.writeFile f (encode v)

Utilizing file-io.
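
A reading counterpart can be sketched the same way. The name decodeFileOsPath is hypothetical (it is not part of binary or file-io), and the sketch assumes the binary, bytestring, filepath and file-io packages; it reads strictly with readFile' to avoid lazy-IO surprises.

```haskell
import Data.Binary (Binary, decodeOrFail, encode)
import qualified Data.ByteString.Lazy as BL
import qualified System.File.OsPath as OS
import System.OsPath (OsPath, encodeUtf)

-- | Same definition as above, repeated so the example is self-contained.
encodeFile :: Binary a => OsPath -> a -> IO ()
encodeFile f v = OS.writeFile f (encode v)

-- | Read a file strictly and decode it, reporting failures as a message.
-- Hypothetical helper for illustration only.
decodeFileOsPath :: Binary a => OsPath -> IO (Either String a)
decodeFileOsPath f = do
  bytes <- OS.readFile' f
  pure $ case decodeOrFail (BL.fromStrict bytes) of
    Left (_, _, err) -> Left err
    Right (_, _, v)  -> Right v

main :: IO ()
main = do
  path <- encodeUtf "example.bin"
  encodeFile path [1, 2, 3 :: Int]
  result <- decodeFileOsPath path :: IO (Either String [Int])
  print result
```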

Also see “interaction of OsPath with getEnv, GetOpt, optparse-applicative” (Issue #221 in haskell/filepath on GitHub).

You mean base or ghc package? Unlikely for base. I don’t know about ghc.

I meant the libraries included with GHC, e.g. the Haskell Hierarchical Libraries. I guess we can’t depend on file-io unless it is added to that list?

I’d be happy to help, but I’m not sure what the expected strategy is. For example, to avoid conflicts, should the new functions be named like Data.Binary.encodeOsFile, Data.Aeson.encodeOsFile, Data.Text.IO.readOsFile, or perhaps live in a different module like Data.Binary.OS?

Thanks again!

That won’t be very hard, I think. There’s only the problem of circular imports sometimes when dealing with boot libraries.

But that’s the decision of the respective core library maintainer.

One of them has to say “I need file-io”.

In the directory package, it is new modules with the same API, so that you would just change your imports: System.Directory.OsPath

There are generally two approaches code-wise:

  1. use CPP to make your internal module work with both FilePath and OsPath (I use that strategy). In file-io it was enough to use hs-source-dirs with a platform if-else in the cabal file for the PosixPath/WindowsPath unification: Don't use `CPP` by sol · Pull Request #9 · hasufell/file-io · GitHub. Maybe that approach works here as well.
  2. write your internal API to use OsPath exclusively and then write a “shim” that converts to and from filepaths using encodeFS and decodeFS. This was done in directory: directory/System/Directory.hs at master · haskell/directory · GitHub
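
The second approach can be sketched roughly as follows. This is not the directory package's actual code; cacheFileFor and cacheFileForFP are hypothetical names, and only the filepath package (System.OsPath) is assumed.

```haskell
import System.OsPath (OsPath, decodeFS, encodeFS, normalise, (</>))

-- | Internal API: works on OsPath only.
-- Hypothetical function for illustration.
cacheFileFor :: OsPath -> OsPath -> OsPath
cacheFileFor root name = normalise (root </> name)

-- | Shim: the FilePath-facing wrapper converts at the boundary with
-- encodeFS / decodeFS and delegates to the OsPath implementation.
cacheFileForFP :: FilePath -> FilePath -> IO FilePath
cacheFileForFP root name = do
  root' <- encodeFS root
  name' <- encodeFS name
  decodeFS (cacheFileFor root' name')

main :: IO ()
main = cacheFileForFP "cache" "index.bin" >>= putStrLn
```

Every call through the shim pays for an encode and a decode, which is one plausible source of overhead with this design.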

I’m not a fan of the second approach. It also caused performance regressions, as far as I recall. I can’t find the GHC ticket at the moment.

This is quite astonishing! It has opened my eyes to the possibilities and complexities of text processing. This kind of work, with its depth and comprehensiveness on a specific practical topic, is something I haven’t seen much of in the ecosystem. Thanks for making this guide!

Thanks for making this - it’s very well-organized and informative.

Here are a couple more string types that could be added:

  • FastString in the GHC source
  • Interned strings from the intern library - for when you need quick equality tests

Yeah, there’s actually more stuff that we can talk about. But at this point it feels like starting a “Haskell Book” (like the Rust one).

That seems like a potentially interesting idea, but would have to be a coordinated collaborative effort.

Wrt FastString… it seemed like a really specific thing (e.g. O(1) Eq check), and the operations and ecosystem support are rather limited. Have you used it in the wild?

What are people using for texts indexed by their min/max length in the type? There are packages for non-empty texts. A type level indexed text type would be a generalization of that.

I suppose this is slightly distinct from length indexed vectors as known in dependently typed languages (e.g. data Vect : (len : Nat) -> (elem : Type) -> Type in Data.Vect in Idris), since it wouldn’t necessarily be a list of characters, but more like Data.Text.Text.
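
To illustrate the idea, here is a minimal sketch of a Text wrapper whose minimum and maximum length live in the type and are checked by a smart constructor. BoundedText and mkBoundedText are hypothetical names, not an existing package; only base and text are assumed.

```haskell
{-# LANGUAGE DataKinds #-}
{-# LANGUAGE KindSignatures #-}
{-# LANGUAGE ScopedTypeVariables #-}
import Data.Proxy (Proxy (..))
import qualified Data.Text as T
import GHC.TypeLits (KnownNat, Nat, natVal)

-- | Text whose length is promised to lie within [minLen, maxLen].
-- In a real module the constructor would not be exported.
newtype BoundedText (minLen :: Nat) (maxLen :: Nat) = BoundedText T.Text
  deriving (Eq, Show)

-- | Smart constructor: checks the bounds at runtime against the type-level numbers.
mkBoundedText
  :: forall minLen maxLen. (KnownNat minLen, KnownNat maxLen)
  => T.Text -> Maybe (BoundedText minLen maxLen)
mkBoundedText t
  | len >= lower && len <= upper = Just (BoundedText t)
  | otherwise                    = Nothing
  where
    len   = toInteger (T.length t)
    lower = natVal (Proxy :: Proxy minLen)
    upper = natVal (Proxy :: Proxy maxLen)

main :: IO ()
main = do
  print (mkBoundedText (T.pack "hello") :: Maybe (BoundedText 1 8))  -- Just (BoundedText "hello")
  print (mkBoundedText (T.pack "")      :: Maybe (BoundedText 1 8))  -- Nothing
```

A non-empty text is then just the special case with a minimum of 1 and a suitably large maximum.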

Currently, package refined. It’s decent for simple constraints, but if you try to say too much you spend too long leading GHC by the hand through elementary logic. I haven’t had the time to look into package rerefined to determine whether its objections are worth switching for.

I also haven’t had time to look into using Liquid Haskell for this case, but it could be pretty nice too.