From-text: type class to convert from Text

I released a new package which provides

class IsText a where
  fromText :: Text -> a

aiming to simplify conversion from Text to other textual data types, including ByteArray, ByteString and OsPath. It uses UTF-8 when converting to binary types without an associated encoding.
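For illustration, here is a minimal sketch of how such a class can be used. The class is re-declared locally so that the snippet compiles on its own, and the `ByteString` instance is my assumption of what "uses UTF-8" looks like in practice; the actual instances and their implementation in from-text may differ.

```haskell
module Main where

import Data.ByteString (ByteString)
import qualified Data.ByteString as BS
import Data.Text (Text)
import qualified Data.Text as T
import Data.Text.Encoding (encodeUtf8)

-- Local re-declaration for illustration only; the real class lives in from-text.
class IsText a where
  fromText :: Text -> a

-- Assumed instance: a binary type with no associated encoding gets UTF-8.
instance IsText ByteString where
  fromText = encodeUtf8

main :: IO ()
main = do
  let greeting = T.pack "h\233llo" -- "héllo"
  -- The non-ASCII character becomes a two-byte UTF-8 sequence.
  print (BS.unpack (fromText greeting :: ByteString))
```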

There is an overwhelming number of alternative packages for text conversions. To name a few in no particular order:

At the moment none of them provides a conversion from Text to OsPath. I could not decide which one to contribute such a function to, so I decided to create a new package.


Minor nitpick: when converting to WindowsString (and hence to OsString on Windows), from-text uses UTF-16 LE rather than UTF-8.

If you need other encodings, you can always use System.OsString.encodeWith.
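A rough sketch of what that could look like, assuming a recent filepath/os-string where `System.OsString.encodeWith` takes the POSIX encoding first and the Windows encoding second. The helper name `encodeLatin1` is made up for this example.

```haskell
module Main where

import System.IO (latin1, utf16le)
import System.OsString (OsString, encodeWith)

-- Hypothetical helper: Latin-1 on POSIX, UTF-16 LE on Windows.
-- Encoding failures are collapsed into Nothing to keep the sketch short.
encodeLatin1 :: String -> Maybe OsString
encodeLatin1 = either (const Nothing) Just . encodeWith latin1 utf16le

main :: IO ()
main = print (encodeLatin1 "caf\233")
```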

Oh great, it’s that discussion from two years ago:

  1. POSIX file paths can contain any sequence of bytes as long as only a small set of expectations are met (per POSIX spec chapter 6.1).

  2. OsPath is OsString (per the doc), so encoding an OsPath under POSIX only makes sense if we assume some locale;

  3. Inferring user’s locale is a messy process that could well fail (per Fixing 'FilePath' in Haskell · Hasufell's blog);

  4. Using IsString to convert to UTF-8 is not total, because a String can contain surrogate code points and UTF-8 does not allow them.

Therefore:

  • filepath recommends and is structured around blindly assuming UTF-8 under POSIX, addressing (2) and (3);

  • There is no IsString OsPath instance, citing (4).

And since unlawful IsString instances are tolerated within the ecosystem (in e.g. bytestring and text), IsString OsString makes total sense.


Can we just do the “OsPath is WTF-8/UCS2LE” thing I proposed two years ago instead of whatever this is?

The key thing is that Text is more restricted than String and does not contain surrogate code points. So instance IsText OsPath is total, while instance IsString OsPath is not.
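A small sketch of that difference, using only base and text. `T.pack` replaces surrogate code points with U+FFFD, so whatever ends up in a `Text` can always be encoded as UTF-8. The helper names are made up for the example.

```haskell
module Main where

import Data.Char (GeneralCategory (Surrogate), chr, generalCategory)
import qualified Data.Text as T
import Data.Text.Encoding (decodeUtf8', encodeUtf8)

-- A String is a list of arbitrary code points, so it may hold a lone surrogate.
rawString :: String
rawString = map chr [0x61, 0xD800, 0x7A]

hasSurrogate :: String -> Bool
hasSurrogate = any ((== Surrogate) . generalCategory)

-- Encoding a Text as UTF-8 never fails, and decoding gives the same Text back.
roundTrips :: T.Text -> Bool
roundTrips t = either (const False) (== t) (decodeUtf8' (encodeUtf8 t))

main :: IO ()
main = do
  print (hasSurrogate rawString)                    -- the String holds a surrogate
  print (hasSurrogate (T.unpack (T.pack rawString))) -- the Text does not
  print (roundTrips (T.pack rawString))
```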


also encode-string: Safe string conversion and encoding

I’m not sure where you have this from, but it’s not correct.

filepath recommends not interpreting filepaths returned by the OS API unless absolutely necessary. And even when you need to, you can often get away with assuming specific code points (e.g. the filepath separator / is encoding-agnostic, and although the file extension separator . is not in the POSIX standard, I would recommend assuming the ASCII code point here too).

If you are however constructing filepaths out of thin air (e.g. from user input), then you can decide on the shape or form.

Not every application is a file manager or shell language though. There are lots of use cases where filepaths are mostly internal and just need to roundtrip correctly. OsPath/OsString helps with this too, since there’s no automatic conversion ever.

There are also more interesting questions, like: what does it mean to construct a filepath on unix and have a windows system interpret it? Or… what is a reasonable JSON instance? OsPath forces you to think about this and I believe there is more than one answer.

If you only want to support UTF-8 enabled systems in your application, that’s fine too. It should be documented though and you could also do sanity checks on application startup (e.g. checking locale).
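One possible shape for such a startup check, as a sketch only. It relies on `show` of a `TextEncoding` giving the encoding name and assumes that name starts with "UTF-8" (or "UTF8") on UTF-8 systems; on Windows the name may be a code page instead, so treat the matching rule as a placeholder.

```haskell
module Main where

import Data.Char (toUpper)
import Data.List (isPrefixOf)
import System.IO (hPutStrLn, localeEncoding, stderr)

-- Hypothetical helper: does an encoding name look like UTF-8?
isUtf8Name :: String -> Bool
isUtf8Name name = any (`isPrefixOf` map toUpper name) ["UTF-8", "UTF8"]

main :: IO ()
main
  | isUtf8Name (show localeEncoding) = putStrLn "locale encoding looks like UTF-8"
  | otherwise =
      hPutStrLn stderr ("warning: unsupported locale encoding " ++ show localeEncoding)
```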

WTF-8 is used to have a single internal representation on both unix and windows. But it also comes with a fair amount of trade-offs, see The WTF-8 encoding

WTF-8 is a hack intended to be used internally in self-contained systems with components that need to support potentially ill-formed UTF-16 for legacy reasons.

It is not user facing and has no relevance for public API.

I also personally think it’s a rather poor decision by the rust ecosystem. You can see part of the fallout here: 2295-os-str-pattern - The Rust RFC Book

My goal is to write general-purpose applications, and to me “absolutely necessary” means “don’t bother unless you know something extra about the target system”. I therefore assume that the suggested approach is to read/write UTF-8 with no regard for the target’s locale.

This directly contradicts how base handles file paths, which I wouldn’t care much about if the tools given were good enough to replicate what base does (with proper error handling, of course). Instead there’s decodeFS, which manages to compress everything I’m trying to escape into a single definition.

There’s no single “you” here.

Library users want to talk about file paths as if they are text, which is fine as long as the conversion is not their direct responsibility.

Library writers know that file paths are not text, and that any introspection more complex than “<NUL>, <period>, <slash>, <newline>, and <carriage-return>” requires a choice of encoding. And this choice is unavoidable when, say, parsing command-line options (think e.g. about bar in --foo=bar).

This divide cannot be bridged without an additional type that’s platform-independent (unlike OsString), but is still possibly erroneous (unlike String/Text).

Both of these are up to library users to decide, squarely outside of the topic of correct file path handling.

Used by… Rust? I have no relevant experience in Rust. When I’m talking about WTF-8 I solely mean how the bytes are laid out in memory, same with UCS2LE.

I hope you’re not getting confused by the fact that Rust has a type called OsString and that one seems to be in WTF-8. Their raw file path types are raw arrays, I guess?
The documentation is lacking.

This RFC looks thoroughly confusing to me. I do think of WTF-8 as “sliceable”, but only within the very narrow limits of the Portable Character Set, which guarantees that the characters are single-byte. This is enough to break down --foo=bar without scrutinizing bar, but that’s the furthest extent of what I found necessary.
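To make the --foo=bar case concrete, here is a sketch over raw bytes, assuming a POSIX-style argument held in a strict `ByteString`. It only looks at the single-byte characters `-` and `=`; the value is passed through untouched, whether or not it is valid UTF-8. `splitLongOption` is a made-up name.

```haskell
module Main where

import Data.ByteString (ByteString)
import qualified Data.ByteString as BS
import Data.Word (Word8)

dash, equals :: Word8
dash = 0x2D   -- '-'
equals = 0x3D -- '='

-- Split "--name=value" into (name, value) without decoding the value.
-- Returns Nothing if the "--" prefix or the '=' is missing.
splitLongOption :: ByteString -> Maybe (ByteString, ByteString)
splitLongOption arg = do
  rest <- BS.stripPrefix (BS.pack [dash, dash]) arg
  let (name, afterName) = BS.break (== equals) rest
  (_, value) <- BS.uncons afterName -- drop the '=' itself
  pure (name, value)

main :: IO ()
main =
  -- "--foo=" followed by two bytes that are not valid UTF-8
  print (splitLongOption (BS.pack [0x2D, 0x2D, 0x66, 0x6F, 0x6F, 0x3D, 0xFF, 0xFE]))
```

This works for UTF-8 and WTF-8 because bytes below 0x80 never occur inside a multi-byte sequence; a UTF-16 representation would need the same logic over 16-bit units.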

Might well be an issue specific to Rust, seems like OsStr is an interface.