Symbolize 1.0.1.0: Efficient string interning / global Symbol table, with Garbage Collection

Hi all!

With great pleasure I want to introduce a new little library I’ve been working on:

Symbolize

(direct link to documentation)

This is a Haskell implementation of Symbols / a global symbol table, with garbage collection.

Symbols, also known as Atoms or Interned Strings, are a common technique
to reduce memory usage and improve performance when using many small strings.

When is this useful?

Symbols are very common in dynamically-typed languages.
In Haskell, some of the usage of symbols is subsumed by parsing directly into
datatypes with statically-known field / variant names, or sometimes singleton-based ‘known-at-compile-time’ string constants that also happen to be called symbols.

Also, laziness / persistent datastructures / call-by-need mean
that when the same text-value is used in many places in your app, these might
be all pointers to the same underlying value.
But they won’t be when you happen to parse the same string multiple times from user input.

Thus, also in Haskell there are cases in which a symbol table is very useful:
For example, when you have something like:

data UserDefinedCollection = UserDefinedCollection
{ schema :: HashMap Field Type -- ^ Structure of the items
, items :: Vector (HashMap Field Value)  -- ^ Elements are enforced at runtime to have the same fields as `schema` 
}
  deriving (Aeson.FromJSON, Aeson.ToJSON, ...)

or in general, any kind of AST or DSL, where the names of variables are user-defined, which you are deserializing from some external format and then operating on. (Or in general any case where you expect the same set of strings to be seen over and over again.)

In these cases, it is highly likely that the same small set of ‘field names’ or ‘variable names’
is encountered over and over again.
However, by default, you will end up with many separate copies of these strings (be they Text,ByteString, etc.) in memory.

Furthermore, when we’re talking about a ‘variable name’, we often use it as one atomic whole,
and only rarely call string-manipulation functions on them.
This means that a type like (strict) Text or ByteString which not only keeps track of a byte-buffer but also an offset and length into it to support subslicing, is rather wasteful.

And if you want to use those Text or ByteStrings as keys to a hashmap,
you might not want to re-calculate their hashes over and over again.
This might lead you to something like:

newtype Field = Field (Hashed Text)
  deriving (Eq, Ord, Show, Hashable, ...)

This seems like a good idea, but not only does Hashed introduce yet another Int# for every
field, it also is defined as a polymorphic wrapper (data Hashed a = Hashed a {-# UNPACK #-} !Int),
which means the inner Text cannot be unpacked into it, resulting in yet another layer of indirection
and another separate allocation for the GC to track.

In these cases

newtype Field = Field Symbol
  deriving (Eq, Ord, Show, Hashable, ...)

is a much better choice:

  • All usage of the same field name is deduplicated to a single allocation
  • Equality checks are a constant-time pointer-equality check.
  • No more double indirections or extra fields as would be introduced by Hashed
  • Hashing still is constant-time!

How does it work?

Symbolize is built on top of StableName and Weak pointers (+finalizers).
By using these GHC-specific but very powerful features,
we can ensure :

  • O(1) equality checks, using pointer equality (safe usage of reallyUnsafePtrEquality#, yay!)
  • O(1) guaranteed collision-free hashing for symbols, using StableName
  • O(1) uninterning to ShortText, ShortByteString or ByteArray, since lookup of the internal string data is just a pointer-dereference: Symbols directly wrap ByteArray#s. (And GHC should worker-wrapper that outer box away nearly everywhere)
  • Support for Symbol# :: UnliftedType wherever you cannot/don’t want to depend on GHC’s automatic unboxing.
  • Automatic reclamation of no-longer-used Symbols. Also, the global symbol table is HashDoS-resistant (using SipHash). Therefore, it is safe to create Symbols from user input.

I had a lot of fun writing this.

Your feedback is greatly appreciated!

~Marten / Qqwy

33 Likes