If OsPath is intended to replace FilePath, why isn’t FilePath deprecated?

That’s pretty much what I said, except I wouldn’t refer to it as “sending an OsPath over the wire”. You’re sending a sequence of bytes underlying an OsPath on one computer and constructing another OsPath on the receiving end; both of the conversions are unsafe.

Okay, I found the point of confusion I had.

Canonically Windows decoding/encoding is pure: you check for surrogates and otherwise assume UTF-16LE. Unix decoding/encoding is impure, as you need to look up the locale to be able to convert to roundtripping UTF-8 (if necessary), after which you check for surrogates and otherwise assume UTF-8.

Extrapolating from that:

  • The surrogate checking part is indeed available directly for both interfaces (these are the encodeWith/decodeWith functions), but I don’t see the Unix roundtripping UTF-8 conversion. utf8b :: TextEncoding, i.e. utf8 with RoundtripFailure, would probably do the job here.

  • A PosixString is not the same as roundtripping UTF-8 by definition, so OsPath /= OsString. Rectifying this would also allow for IsString OsPath.
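A sketch of that utf8b encoding, built from pieces base already exports (mkUTF8 and RoundtripFailure are real bindings; the name utf8b and the roundtrip demonstration are mine):

```haskell
import qualified Data.ByteString as BS
import Data.ByteString.Unsafe (unsafePackMallocCStringLen)
import GHC.IO.Encoding (TextEncoding)
import GHC.IO.Encoding.Failure (CodingFailureMode (RoundtripFailure))
import GHC.IO.Encoding.UTF8 (mkUTF8)
import qualified GHC.Foreign as GHC

-- UTF-8 that escapes undecodable bytes into lone surrogates (PEP-383 style)
utf8b :: TextEncoding
utf8b = mkUTF8 RoundtripFailure

main :: IO ()
main = do
  let bytes = BS.pack [0x61, 0xff, 0x62]  -- "a", an invalid byte, "b"
  -- decode to String: 0xff becomes the surrogate U+DCFF instead of failing
  str <- BS.useAsCStringLen bytes (GHC.peekCStringLen utf8b)
  -- encode back: the surrogate turns into the original byte again
  out <- unsafePackMallocCStringLen =<< GHC.newCStringLen utf8b str
  print (out == bytes)  -- True: the invalid byte roundtrips
```

This is only a sketch of the decode/encode cycle through GHC.Foreign, not an os-string API.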

Windows is technically UCS-2.

What you’re referring to is not UTF-8. It’s PEP 383 encoding; it’s invalid UTF-8 and discouraged by the Unicode Consortium.

I can’t follow this reasoning.

Let me repeat: the reason for OsString/OsPath is exactly because the roundtripping you describe is broken crap.

  • PEP 383 is garbage: PEP 383 Security and Interoperability
  • assuming existing filepaths follow the currently selected locale is nonsense (in our json case, you might be converting incorrectly between two different non-unicode encodings if two machines have different locales set)

Yes, by “roundtripping UTF-16LE” I mean UCS-2 and by “roundtripping UTF-8” I mean UTF-8b.

Filepaths that you get from a known set of system calls do. To be able to serialize that, the user has to either convert that data into legal Unicode or do unsafe things.

It is, but so is hoping the system locale is UTF-8-compatible on Unix and failing otherwise.

The design I’m proposing is:

newtype OsString = UCS2 | ByteArray -- Raw system string (pseudocode: Windows representation | Unix representation)

newtype OsPath = UCS2 | UTF8b -- Lossless conversion with a well-defined format

type FilePath = UTF8 -- Lossy conversion

From that, under Windows it will be known that

OsPath ~ OsString

decodeUtf16LE :: UCS2 -> Either Error UTF16LE -- No-copy

encodeUtf16LE :: UTF16LE -> UCS2 -- No-copy

And under Unix you get

decodeUtf8b :: TextEncoding -> ByteArray -> IO UTF8b

encodeUtf8b :: TextEncoding -> UTF8b -> IO ByteArray

decodeUtf8 :: UTF8b -> Either Error UTF8 -- No-copy

encodeUtf8 :: UTF8 -> UTF8b -- No-copy

“Good guess” variations under Unix are simply assuming that the current locale is UTF-8b.

“No-copy” means no extra allocations are necessary for the operation. Converting from UTF-16LE to UTF-8 on Windows and from UTF-8 to String if FilePath = String is still necessary.

All operations over OsPath will now be well-defined, notably Show OsPath can consistently use UTF-8b on both platforms and IsString OsPath is unambiguous.
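To make the “No-copy” claim concrete, the proposed decodeUtf8 :: UTF8b -> Either Error UTF8 would amount to a pure surrogate scan. A sketch over String (UTF8b, UTF8 and Error are the hypothetical types from the design above, so a Bool check stands in for the real signature):

```haskell
-- A UTF8b string is plain UTF-8 exactly when it contains no escape
-- surrogates; checking that is a scan, no re-allocation needed.
isPlainUtf8 :: String -> Bool
isPlainUtf8 = all (\c -> c < '\xD800' || c > '\xDFFF')

main :: IO ()
main = do
  print (isPlainUtf8 "/tmp/a")       -- True: ordinary UTF-8
  print (isPlainUtf8 "/tmp/\xDCFF")  -- False: contains an escape surrogate
```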


Sorry, but again I’m at a loss what you mean.

What do system calls have to do with this?

You can have filepaths created with arbitrary encodings on your filesystem. An application can write whatever byte string it wants as a filepath on Unix.

A more common example may be a network or USB drive that you plug into your computer. The filepaths were not created on your system.

Hence, using the current locale is just a “good” guess.

I suggest going through the documentation of the new OsPath modules, which make very precise suggestions on how to deal with filepaths: System.OsPath

TL;DR: You often don’t actually have to know the encoding. A good example is the tar package.

It is untrue that PEP383 is lossless: you lose the information of the original bytes. It only roundtrips under the assumption that you use the same encoding for both decoding and encoding. You could pick some at random… or just stick to latin1. I don’t see what we gained here.

PEP383 is not a locale.

And might print garbage, because the actual system encoding is CP932.

I’m sorry, but this is the opposite of what OsPath tries to achieve.


Just because you have to pick an encoding and inevitably some paths are undecodable doesn’t mean all encodings are born equal. Unicode with roundtripping on failure has by definition the widest breadth of characters it can represent: all of Unicode, plus trash surrogates.

If I am to write an application where I want to show the user some filepath, regardless of whether I could decode it to UTF-8 or not, I personally would prefer UTF-8 with a bunch of replacement characters in place of surrogates instead of the wet trash that is a raw bytearray in an unspecified encoding. This is principally my problem with the current API, I’m not given this choice.
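As a sketch of the choice I’d like to have (sanitize is a hypothetical helper, not something os-string exports):

```haskell
-- Replace escape surrogates (left over from a roundtripping decode)
-- with U+FFFD, so the result is always valid Unicode and displayable.
sanitize :: String -> String
sanitize = map (\c -> if c >= '\xD800' && c <= '\xDFFF' then '\xFFFD' else c)

main :: IO ()
main = print (sanitize "caf\xDCE9.txt" == "caf\xFFFD.txt")  -- True
```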

For the purposes of this discussion the internal implementation of the roundtripping mechanism does not matter. What matters is that UTF-8b has a consistent storage format to it, so we can talk about code points instead of individual bytes.

Looking at the filepath source code it looks to me like you’re treating PosixPaths as plain byte arrays, and I highly doubt this works correctly with multi-byte encodings.

But it is kinda meaningless (or even outright wrong) to do so, when you don’t know the original encoding.

You now have code points for the sake of having code points, not because you have a meaningful interpretation of the ByteString.

It does.

If you look at the POSIX standard, the path separator is defined as a single byte in the ASCII range, /. If you have some funny multi-byte encoding that contains such a character, the system calls will not care and will treat it as a path separator.

Quote from the POSIX Pathname documentation:

Additionally, since the single-byte encoding of the character is required to be the same across all locales and to not occur within a multi-byte character, references to a character within a pathname are well-defined even when the pathname is not a character string.

So you could argue that splitting on the Unicode code point / is actually wrong if there are multiple byte sequences that can translate to this code point (I didn’t check whether there are encodings that would satisfy that).

The filepath extension character . is not defined by POSIX, so whether it is in the ASCII range or encoding-sensitive is an arbitrary decision.

The fact that the POSIX standard implicitly advises using the portable character set for filepaths, and that the path separator is in that single-byte character set, also drives the decision that OsChar is a Word8. If you want to split at Unicode code point boundaries, you should first understand what the actual encoding is, and not shove it into some random internal representation and then pretend that will produce sensible output.
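That byte-level guarantee is why splitting needs no decoding at all; a minimal illustration over plain ByteStrings (splitOnSlash is just a sketch, not the actual filepath implementation):

```haskell
import qualified Data.ByteString.Char8 as B8

-- POSIX guarantees '/' is the same single byte in every locale and never
-- occurs inside a multi-byte character, so splitting on the raw byte is
-- well-defined even for filepaths whose encoding is unknown.
splitOnSlash :: B8.ByteString -> [B8.ByteString]
splitOnSlash = B8.split '/'

main :: IO ()
main = print (splitOnSlash (B8.pack "/tmp/a.txt"))  -- ["","tmp","a.txt"]
```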

FilePath maintainers know what they’re doing :slightly_smiling_face:

Could there be a different internal representation than raw byte arrays? Sure. Rust uses WTF-8 (not UTF-8b). It has other trade-offs. And there are even successors/variants of it.

My guess is that you were kinda trying to reinvent that format, so you might find the specification interesting.

The original Abstract FilePath proposal doesn’t mention any of that and I agree that leaving the damn bytes untouched seems like a compelling property to have for the internals.

It would be meaningless if I were basing my judgement solely on the definition of the Unix filepath, because it can quite literally be any sequence of bytes. I however believe that when my application is passed a locale, that encoding will generally refer to the most used encoding on that system, i.e. the best choice.

Yes, the storage format I described before happens to exactly match Section 6.4 of WTF-8. As I said before I don’t care about how roundtripping is implemented, just that it roundtrips and that when a string has no surrogates it’s proper UTF-8. This doesn’t need an entire new standard in my mind, just a convenient name so that I don’t have to refer to it as “UTF-8 with surrogates”.

The “actual encoding” would be the locale one, and GHC’s roundtripping Unicode encoder converts to and from UTF-8 with surrogates, which is not a “random” internal representation.

Again, I don’t “pretend” that it produces sensible output, in my mind this is the most sensible output I can get for my buck. Blindly trying to decode UTF-8 is strictly less powerful: I still get the correctness check, but I don’t get to use the partially decoded string if the conversion fails.

You’re free to implement your own library.

OsPath/OsString is quite useless in my opinion if we do the locale-dependent interpretation and roundtripping anyway. Then you can just stick to String.

Rust doesn’t do that locale-dependent stuff either. The reasons they picked WTF-8 are different ones. They very clearly say it’s just a storage format and has nothing to do with interpretation, which is what you propose with your UTF-8b variant.


It would be the exact opposite of useless. Also OsString would remain the exact same, I’m just questioning whether it would make sense to retain all operations it’s currently exposed to, knowing that some of them could mangle multi-byte code points.


As an example, here’s the problem I want to solve with this on my side:

A set of command-line options for a given program can be efficiently described as a radix tree of functions, where each key is a supported option name. I’m precompiling the tree, so under Windows all keys in this tree will be in UTF-16LE and under Unix in UTF-8. Then:

  • Under Windows I can directly use the raw path to do the tree lookup, because WindowsString ~ WindowsPath. If a lookup fails I should still be able to display the raw path to the user with invalid surrogates replaced by �.

  • Under Unix I need a way to convert the PosixString into a UTF-8-compatible encoding. Two possible approaches here:

    1. I don’t bother with the locale and try to decode plain UTF-8. This necessarily fails for any paths that are not directly UTF-8-compatible and could misrepresent code points otherwise. If the conversion succeeds, the string is proper UTF-8 and I can proceed with the lookup; if it doesn’t, I’m stuck with a bunch of raw bytes that I have no coherent way to talk about.

    2. I use the locale encoding with the roundtripping Unicode decoder to get potentially-malformed-UTF-8. If the lookup on the tree succeeds, the string is proper UTF-8; if the lookup fails, I replace invalid surrogates with � and present that to the user. This is effectively symmetrical to the Windows side.
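Approach 1 can be sketched with the text package’s strict decoder (decodeUtf8' is a real function in Data.Text.Encoding; the tryUtf8 wrapper is hypothetical):

```haskell
import qualified Data.ByteString as BS
import qualified Data.Text as T
import qualified Data.Text.Encoding as TE

-- Strict UTF-8 decoding: no locale involved; it fails for any byte
-- sequence that is not valid UTF-8, leaving no partial result to use.
tryUtf8 :: BS.ByteString -> Maybe T.Text
tryUtf8 = either (const Nothing) Just . TE.decodeUtf8'

main :: IO ()
main = do
  print (tryUtf8 (BS.pack [0x2f, 0x74, 0x6d, 0x70]))  -- Just "/tmp"
  print (tryUtf8 (BS.pack [0x2f, 0xff]))              -- Nothing
```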

None of these three approaches involve Strings because they do not need to.

My proposal is merely pointing to the fact that both Unix approaches convert to a UTF-8-compatible encoding, so by putting a newtype over that you could satisfy both use cases with much greater type safety. On top of that, as the library stands currently, it lacks functions for all three described cases:

  • There are no sanitization functions for potentially-malformed-Unicode byte arrays for either platform type;

  • There is no cheap PosixString -> UTF8 conversion function; I’m expected to jump through a String for this;

  • The roundtripping Unicode decoder function is not provided.


With this in mind, I don’t believe “just implement your own library” is a good argument here. My goal matches that of the library—I want to perform the minimal necessary set of transformations over raw system data—it’s just that my use case is slightly more complex than the default user one and I view this as a deficiency of the current design.

And, to unite this with my original question about IsString OsPath, knowing all I do today, here is my concern reiterated:


When decoding a PosixString I am not allowed to speculate about its encoding. Since a PosixString is a set of raw bytes of unknown origin, I agree with this.

When encoding a PosixPath I am expected to encode into UTF-8 because most people use it. I agree with this.

If PosixPath ~ PosixString, then my expectation is that we should talk about raw bytes, not encodings, so the PosixPath encoding rule should no longer apply. PosixPath should be a black box: you can encode a string into it, but you can no longer talk about what’s inside.

However, since you choose to maintain that encoding rule and you still need to operate over PosixPaths, you are forced to assume that a PosixPath is always in a single-byte encoding. For the purposes of the filepath package this is compatible with UTF-8, since UTF-8 is self-synchronizing and filepath only cares about character lookup.

Downsides of this approach:

  • Most functions from System.OsString.Posix are unsafe to use with PosixPaths, but the types do not reflect that:

    > let path = [System.OsPath.Posix.pstr|/tmp/𝗔𝗕𝗖.def|] :: PosixPath
    > System.OsString.Posix.dropEnd 8 path :: PosixPath
    "/tmp/\240\157\151\148\240\157\151\149"
    
  • Any function that encodes a PosixPath with a variable-width encoding is only safe in terms of the filepath API when no successive bytes in said encoding can be 0x2E and 0x2F (ASCII '.' and '/' respectively).

  • IsString PosixPath morally could exist and would be in UTF-8, but cannot be a part of the API because that would result in IsString PosixString.

  • All of this is confusing to library users and should be extensively documented.

Upsides of this approach:

  • PosixPath can be created in any encoding (as long as the aforementioned successive bytes rule is not broken).
    Edit: Actually, is this even an upside? You would never want to mix encodings anyway.

  • Still good enough for the overwhelming majority of real-world uses solely due to a confluence of historical reasons.


The proposed change would alter only two definitions in System.OsPath, that is OsPath would no longer be the same as OsString and the user would have to choose their conversion method between the two. The big API changes would be for System.OsPath.Windows and System.OsPath.Posix, because the two are not symmetrical and pretending otherwise makes the API worse.

I’m starting to feel you’re moving the goalposts, so the discussion is getting a bit exhausting, although I don’t mind confrontational arguments in general. But it’s not really clear what you’re arguing for or against.

This doesn’t have much to do with filepath, OsString or OsPath. This is a specific use case with different trade-offs. Both os-string and filepath provide primitives. If you just want to support UTF-8, you can just do so.

Wrt command line parsing, there’s a discussion here: Support for ospath/osstring · Issue #491 · pcapriotti/optparse-applicative · GitHub

Why would there be? It’s not a Unicode library. This can be provided by other libraries that want to deal with Unicode.

I’m not sure what this means. Do you mean there is no function reEncode :: PosixString -> TextEncoding -> PosixString?

That could certainly be implemented, but again: whether it should live in os-string or somewhere else can be debated. I’d probably be willing to add it to an .Internal module.

This is not true:

You get the roundtripping TextEncoding by calling getFileSystemEncoding.
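For illustration, a sketch of decoding raw bytes with that encoding through GHC.Foreign (getFileSystemEncoding and peekCStringLen are real base functions; decodeViaFS is my name for the wrapper, and the surrogate-range check assumes a UTF-8 or ASCII locale):

```haskell
import qualified Data.ByteString as BS
import GHC.IO.Encoding (getFileSystemEncoding)
import qualified GHC.Foreign as GHC

-- Decode arbitrary bytes with base's roundtripping filesystem encoding:
-- undecodable bytes come back as lone surrogates instead of failing.
decodeViaFS :: BS.ByteString -> IO String
decodeViaFS bs = do
  enc <- getFileSystemEncoding
  BS.useAsCStringLen bs (GHC.peekCStringLen enc)

main :: IO ()
main = do
  s <- decodeViaFS (BS.pack [0x61, 0xff])
  -- under a UTF-8 (or ASCII) locale the 0xff is escaped into U+DCFF
  print (any (\c -> c >= '\xDC80' && c <= '\xDCFF') s)
```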

I appreciate the interest in the new API, but you haven’t really pointed out any deficiency in the design so far. Most of the conversation was very chaotic: you made proposals about internal representations, which have nothing to do with the external API, suggested a different Show instance, etc. The only specific points are the three above… one is incorrect, and two are additional API.

You can implement your ideas just fine based on the current primitives.

I suggest we continue this discussion on the os-string/filepath issue tracker; I’m not sure it is useful for anyone else in this thread, as it is very specific.

I have no idea how you came to these conclusions.

The decision of whether to interpret PosixPath or not is up to the API user. Please take a moment to read my blog post: Fixing 'FilePath' in Haskell · Hasufell's blog

For simplicity, here’s the table:

| API function | from | to | posix encoding | windows encoding | remarks |
|--------------|------|----|----------------|------------------|---------|
| encodeUtf | FilePath | OsPath | UTF-8 (strict) | UTF-16 (strict) | not total |
| encodeWith | FilePath | OsPath | user specified | user specified | depends on input |
| encodeFS | FilePath | OsPath | depends on getFileSystemEncoding | UTF-16 (escapes coding errors) | requires IO, used by base for roundtripping |
| decodeUtf | OsPath | FilePath | UTF-8 (strict) | UTF-16 (strict) | not total |
| decodeWith | OsPath | FilePath | user specified | user specified | depends on input |
| decodeFS | OsPath | FilePath | depends on getFileSystemEncoding | UTF-16 (escapes coding errors) | requires IO, used by base for roundtripping |

It is perfectly safe, because PosixChar is a Word8 and not Unicode-aware. If you want a Unicode-aware drop/take API, convert to String or Text. This property is explained in many places… we can make it clearer in that specific module too. Feel free to raise a ticket.

The IsString instance has been rejected and will not be added, see:

But feel free to continue the discussion there.


I have no idea why you think PosixPath and PosixString are inherently different or why they should be.

If you want higher level API, you can implement it on top of os-string. Again: it is not a high level library.

An example of a package building a type on top is here:

Rephrased and restructured my proposal into filepath#237 (warning: huge).