ByteString.empty is actually not the same as ""

Today, I discovered something interesting that could potentially cause a bug while working with FFI.

GHCi, version 9.4.8: https://www.haskell.org/ghc/  :? for help
ghci> :set -package bytestring
package flags have changed, resetting and loading new packages...
ghci> import qualified Data.ByteString as BS
ghci> import qualified Data.ByteString.Internal as BSI
ghci> import Foreign.ForeignPtr
ghci> :set -XOverloadedStrings
ghci> BS.empty == ""
True
ghci> withForeignPtr (fst (BSI.toForeignPtr0 "")) print
0x0000004200431e70
ghci> withForeignPtr (fst (BSI.toForeignPtr0 BS.empty)) print
0x0000000000000000
ghci> 

:open_mouth:

1 Like
#  ghci
GHCi, version 9.8.2: https://www.haskell.org/ghc/  :? for help
ghci> import Foreign.Marshal.Utils
ghci> (1 + 1 :: Int) == 2
True
ghci> with (1 + 1 :: Int) print
0x000000420042a428
ghci> with (2 :: Int) print
0x000000420042c248
ghci> 

No surprise here - while expressions 1 + 1 :: Int and 2 :: Int reduce to the same normal form, Haskell has non-strict semantics. So those two expressions are represented internally by two different structured values: a different thunk for each expression. That’s why the results of with and withForeignPtr are I/O actions.

1 Like

It’s nothing about IO or laziness here. It’s all about “nullptr” vs “pointer to a null”. You can view the implementation here:

To be short, The underlying of empty is a ForeignPtr with nullAddr# and packChars "" is a ForeignPtr with memory allocate by newPinnedByteArray#

5 Likes

And I find a related closed issue: unsafeUseAsCString(Len) may return a NULL pointer. · Issue #674 · haskell/bytestring · GitHub

1 Like

Since you were able to diagnose where it is, can you can provide some patches to the bytestring maintainers to resolve this discrepancy?

The byte array in question is zero-sized, there’s no good reason to ever follow that pointer.

Furthermore sharing empty values is a whole separate maintainer headache, see e.g. text#493.


Personally I’m in favor of data ByteString = BS1 ByteString1 | Empty, since most functions already do emptiness checks and ByteString1 has its own uses, same with Text. You’d need a lot of pushing to get something like that through, however.

3 Likes

What is the expected behaviour?

  1. Empty BS is represented by null pointer, “” is represented by null pointer
  2. Empty BS is represented by pointer to nothing, “” is represented by pointer to nothing, the address of the pointer to nothing is shared
  3. Empty BS is represented by pointer to nothing, “” is represented by pointer to nothing, the addresses of the pointers are different

Also, is the nothing, another null pointer, nothing (empty array of char), or a null terminated CString (array of char with 1 element but that strlen says that it is length 0)?

1 Like

Agreed, I don’t see any issues here. This is just what happens when you get down to the pointer level. Some pointers look different than others!

3 Likes

It doesn’t seem easy. I’d rather consider this a design challenge than a flaw.

Yes, and, this may also introduce a little performance degradation(I’m not sure).

Well, the current behaviour is acceptable since it’s a defined behaviour. However, it might be a good idea to put a warning remark in the documentation.

Yes, generally, this is fine. However, it requires users to be more careful. (Although dealing with pointers is always dangerous)

Consider the following example:

// c code
void f(char* p) {
  if (p) 
    do something...
  else
    do another...
}

-- haskell
foreign import ccall unsafe "f" c_f :: Ptr Word8 -> IO ()

f :: ByteString -> IO ()
f bs = withForeignPtr (fst (BSI.toForeignPtr0 bs)) c_f

Normally, you may expect f BS.empty to be totally the same as f "", but actually, it will enter two different branches defined in the C code.

Consider using functions from Data.ByteString.Unsafe instead: they are in the stable part of the API and they are well-documented. In this case unsafeWithCStringLen does the same thing.

And, once again, you may not discard length when using ByteStrings this way because internally they are not null-terminated.

See also this issue about sharing 0 length ByteArray# objects. Ben makes the good point that users might depend on creating distinct 0-length objects for instance because they might use them with weak pointers.

2 Likes

users should not expect two different bytestrings containing the string “hello” to have the same pointer, and so by induction there’s no reason they should expect that two containing the string “” would either.

in general, pointer equality is never presumed to be synonymous with semantic equality.

10 Likes