Symbolize 1.0.3.0: Efficient string interning / global Symbol table, with Garbage Collection

Hi all!

With great pleasure I want to introduce a new little library I’ve been working on:

Symbolize

(direct link to documentation)

This is a Haskell implementation of Symbols / a global symbol table, with garbage collection.

Symbols, also known as Atoms or Interned Strings, are a common technique
to reduce memory usage and improve performance when using many small strings.

When is this useful?

Symbols are very common in dynamically-typed languages.
In Haskell, some of the usage of symbols is subsumed by parsing directly into
datatypes with statically-known field / variant names, or sometimes singleton-based ‘known-at-compile-time’ string constants that also happen to be called symbols.

Also, laziness / persistent datastructures / call-by-need mean
that when the same text-value is used in many places in your app, these might
be all pointers to the same underlying value.
But they won’t be when you happen to parse the same string multiple times from user input.

Thus, also in Haskell there are cases in which a symbol table is very useful:
For example, when you have something like:

data UserDefinedCollection = UserDefinedCollection
{ schema :: HashMap Field Type -- ^ Structure of the items
, items :: Vector (HashMap Field Value)  -- ^ Elements are enforced at runtime to have the same fields as `schema` 
}
  deriving (Aeson.FromJSON, Aeson.ToJSON, ...)

or in general, any kind of AST or DSL, where the names of variables are user-defined, which you are deserializing from some external format and then operating on. (Or in general any case where you expect the same set of strings to be seen over and over again.)

In these cases, it is highly likely that the same small set of ‘field names’ or ‘variable names’
is encountered over and over again.
However, by default, you will end up with many separate copies of these strings (be they Text,ByteString, etc.) in memory.

Furthermore, when we’re talking about a ‘variable name’, we often use it as one atomic whole,
and only rarely call string-manipulation functions on them.
This means that a type like (strict) Text or ByteString which not only keeps track of a byte-buffer but also an offset and length into it to support subslicing, is rather wasteful.

And if you want to use those Text or ByteStrings as keys to a hashmap,
you might not want to re-calculate their hashes over and over again.
This might lead you to something like:

newtype Field = Field (Hashed Text)
  deriving (Eq, Ord, Show, Hashable, ...)

This seems like a good idea, but not only does Hashed introduce yet another Int# for every
field, it also is defined as a polymorphic wrapper (data Hashed a = Hashed a {-# UNPACK #-} !Int),
which means the inner Text cannot be unpacked into it, resulting in yet another layer of indirection
and another separate allocation for the GC to track.

In these cases

newtype Field = Field Symbol
  deriving (Eq, Ord, Show, Hashable, ...)

is a much better choice:

  • All usage of the same field name is deduplicated to a single allocation
  • Equality checks are a constant-time pointer-equality check.
  • No more double indirections or extra fields as would be introduced by Hashed
  • Hashing still is constant-time!

How does it work?

Symbolize is built on top of StableName and Weak pointers (+finalizers).
By using these GHC-specific but very powerful features,
we can ensure :

  • O(1) equality checks, using pointer equality (safe usage of reallyUnsafePtrEquality#, yay!)
  • O(1) guaranteed collision-free hashing for symbols, using StableName
  • O(1) uninterning to ShortText, ShortByteString or ByteArray, since lookup of the internal string data is just a pointer-dereference: Symbols directly wrap ByteArray#s. (And GHC should worker-wrapper that outer box away nearly everywhere)
  • Support for Symbol# :: UnliftedType wherever you cannot/don’t want to depend on GHC’s automatic unboxing.
  • Automatic reclamation of no-longer-used Symbols. Also, the global symbol table is HashDoS-resistant (using SipHash). Therefore, it is safe to create Symbols from user input.

I had a lot of fun writing this.

Your feedback is greatly appreciated!

~Marten / Qqwy

36 Likes

I like the implemenation/interface ratio here. Have you considered adding support for namespaces?

1 Like

Thank you, glad you like it :blush:!

What do you mean exactly by ‘namespaces’? Do you mean separate symbol tables besides the ‘global’ symbol table? Or something else?

Sorry for the terse question. What I meant is that most non-trivial languages have need of more than a single global namespace: Haskell for example has a global namespace of modules, a global namespace inside every module, plus local scopes. Did you have any plans for supporting these kinds of scenarios?

I’m still not 100% I’m understanding you entirely :sweat_smile: . Maybe you’re looking for interning of identifiers as part of a compiler pipeline, where there are different namespaces of identifiers between modules? That is indeed also known as a symbol table, but it is a bit different from what Symbolize provides:

In those cases, the identifiers probably have more data than just their name; in all likelyhood each of them is at least carrying a Concrete Syntax Tree with them.

Symbolize is not intended to solve that use-case by and of itself; and there are no plans to implement that, because it’s a different goal than that of the library. The goal of Symbolize is to do interning of strings/texts only. (And to excel at that single focused goal).

However, you can certainly use Symbolize as part of implementing that. And the core pattern that is used by Symbolize to remove entries from a global hashmap when the last reference goes out of scope can of course be replicated (with or without using Symbolize itself): The trick is to have such a map be an IntMap (or a HashMap or HashTable with Int keys), and upon insertion add a finalizer to the Symbol# so when the last instance of the symbol is GC’d, that particular Int key with its associated value is removed from the map again. (If you use Symbol itself directly as key, it won’t ever be GC’d unless some other code removes it from the hashmap again)

1 Like

Symbolize v1.0.3.0 has been released!

Besides some small bugfixes and improvements to the documentation, there are two important larger changes:

  1. A problem has been fixed where weak pointers were allocated too eagerly. This would result in rather high memory pressure if doing a lot of insertions in close succession. The whole point of the library is of course to reduce memory pressure! Fixing this has significantly improved performance and reduced memory usage.
  2. The global symbol table used to be a Data.IntMap. This has been replaced by a Dictionary Int from the vector-hashtables library. Whereas Data.IntMap would become a little slower for each extra insertion based on the size of the table (akin to a linearithmic curve, individual insertions taking log(n) if already n live symbols exist), the new implementation grows essentially linearly, (i.e. individual insertions take constant-time).

Detailed benchmark results.

:rocket:

A big thanks to @Bodigrim who gave me some pointers for small performance and code-quality fixes, as well as telling me about the existence of vector-hashtables. :heart_hands:


Enjoy!

5 Likes