Aztecs v0.6: A modular game engine and ECS for Haskell

Hey! I’m a long time lurker, first time poster, really excited to announce my open-source Haskell game engine and ECS: Aztecs.

import Aztecs
import qualified Aztecs.ECS.Access as A
import qualified Aztecs.ECS.Query as Q
import qualified Aztecs.ECS.System as S
import qualified Aztecs.SDL as SDL
import Control.Arrow ((>>>))
import Control.Monad (when)

setup :: System () ()
setup = S.queue . const . A.spawn_ $ bundle Window {windowTitle = "Aztecs"}

update :: Schedule IO () ()
update =
  reader (S.single (Q.fetch @_ @KeyboardInput))
    >>> task
      ( \kb -> do
          when (wasKeyPressed KeyW kb) $ print "Onwards!"
          when (wasKeyPressed KeyS kb) $ print "Retreat..."
          when (wasKeyReleased KeyW kb || wasKeyReleased KeyS kb) $ print "Halt!"
      )

main :: IO ()
main =
  runSchedule_ $
    SDL.setup
      >>> system setup
      >>> forever_ (SDL.update >>> update >>> SDL.draw)
16 Likes

Wohoo!!! Welcome to the team of Haskell game engine developers!

I took a quick look at the paper and your library. Very cool! A few comments:

  • The library’s api looks good and the choice of using arrow is curious! Without having looked at the implementation in depth: why did you decide to use Arrows?

  • I would have guessed the game engine and ECS library would be separate, so other Haskell game engines can leverage your ECS system without depending on your engine whole. Is the way e.g. the game loop and ECS components are designed that intertwines them enough for separation not to be possible?

  • The benchmark looks promising! It says it only compares simple mutating queries, so it would be cool to see a more complete comparison e.g. by translating some of bevy’s more complex benchmarks into Aztecs.

Anyway those are just some thoughts off the top of my head. Great to see this project and even more people in the space :slight_smile:

5 Likes

Thanks so much!! I really appreciate you taking a look

  • Arrows are awesome! I found them to be really powerful at expressing computations like queries and systems, where it’s much more efficient to have the actual operations known ahead of time. For example, an arrow-based query (rather than the earlier Monad interface) knows which ComponentIDs to search for ahead of time, and can therefore jump directly to groups of components containing them. I’m really excited to see how arrows can combine together to abstract away ECS concepts (such as this monadic interface for loading assets as a generic ArrowSystem )

  • Since Aztecs now uses pure queries and systems, my goal is to have game logic side-effect free so you can write your game as just input -> state -> state (heavily inspired by Yampa). So the provided components like Window and Camera only store parts of the game state, requiring systems in place to actually produce side-effects. Then something like aztecs-sdl can be used to actually render your game state (and hopefully the components are generic enough to support other backends like OpenGL).

  • I’m really excited about performance so far, but I’m not 100% convinced of the results just yet (so really sorry if I’m spreading fake news lol). What I’m hoping for long-term is that Aztecs won’t need expensive state management tools like observers and change detection (like Bevy and Flecs use), and games can instead rely on Haskell language features to keep things composable but fast.

Thanks for reading my ramblings :smiley: I can’t wait to dive deeper into Haskell game dev

4 Likes

Please add apecs to benchmarks. I tried comparing (in 0.4) and aztecs ended up way slower. Perhaps I was doing something wrong.

5 Likes

Thanks for giving it a go :slight_smile: I definitely want to add more comparisons, especially to another Haskell ECS like Apecs, when this package gets a little more refined.

Ugh unfortunately I might have been advertising some false positives - I think without the latest use of NFData, the previous benchmarks were just accumulating thunks :thinking:

The latest results are 2ms for 10,000 entities, which really isn’t ideal. I feel like this architecture in general should be faster for functional languages because you can directly map over groups of components, where as sparse-set approaches (like Apecs) requiring more filtering. I’d expect at least queries on archetypes to be faster than sparse sets, but with slower inserts/removals. I’m still hopeful that queries can be optimized here.

If anyone would be interested in helping track down performance hits, I’m seeing the biggest time sink with profiling is View.unview (which is pretty surprising to me TBH)

1 Like

Could you share your profile? Ideally you have compiled all dependencies with profiling enabled as well and ran with -fprof-late and then the executable with +RTS -pj to get a format viewable in speedscope.app

In cabal you can do this by adding to the cabal.project with:

package *
  profiling: True
  profiling-detail: late

and running the resulting executable with +RTS -pj

1 Like

If you want to reach competitive performance with other state of the art ECS libraries, I think you will need to rethink your whole storage architecture. Here are the things I would keep in mind when working on an ECS.

Data Locality
The speed of ECS is based on data locality. This means that data which is accessed in a query should be next to each other. Then, L1/L2 caches can be optimally used and memory access is very fast.
In Haskell terms, this means you may not store datatypes in a regular fashion, since they are allocated on the heap in a non-linear manner. Tree-based storage containers like IntMap accelerate this problem. I think it is likely that your current storage spreads the stored items over the whole heap.

The simplest way to fix this is to allocate a big memory buffer and store elements via a Storable instance next to each other. Of course, there are caveats: You need to handle low-level code, not everything can be made Storable, arrays do not have infinite size, …

Optimize the query loops

You need to make sure to optimize the looping over all elements in a query since they are the hot path. You must not allocate anything within the loop and all storage functions should be inlined.

Let’s see this in an example:

for [1..elementsAmount] $ \i -> do
  element <- getElement i
  element' <- doStuff element
  putElement i element'

In the above code, getElement and putElement are ideally inlined so that accessing the data is fast. Additionally, for [1..elementsAmount] will most likely result in a simple for loop without allocating any indices, though here some benchmarking might be needed to verify GHC code generation works as expected.

In contrast, your storage DynamicStorage is based on a record of functions. While this is very nice for composability, those functions are probably not inlined and will be slower as inlined versions as a result. Keep in mind, each call to a non-inlined function is an indirection. Additionally, entitiesDyn' might just allocate all indices, tanking the performance.

Performance and Composability are often opposite goals and you will have to find a path in-between.

8 Likes

I doubt that. IntMaps used by apecs are already pretty hard to beat and that’s without further caching/indexing.

3 Likes

For sure here’s the output I got:

From the flamegraph it looks like reading and writing is taking roughly the same time :thinking:
https://gist.githubusercontent.com/matthunz/14044fc8ffef6abb9ce0dbd1e22fe122/raw/899ef4c1c033711d28f05522c4b2042d0353c7c8/flamegraph.svg

Oh wow a low-level approach like that sounds awesome, and really ideal in a lot of cases :eyes: I’d be very interested in a module like Aztecs.ECS.Storable with higher-performance, but lower-level components.

Just thinking about it now I guess most components could be Storable :thinking: but I do think a higher-level API also makes sense when they can’t be. I tried to make query interfaces generic enough to support other storage structures (albeit there might be some re-arranging to be done to support Storable).

In general the current component storages definitely need an update. The current IntMap should really just be a sorted list, where the EntityIDs are stored in the actual Archetype (since they’re shared across all the components in this case). I still think a dynamic approach with archetypes is good to keep in, but I’d really like to export a lower-level Storable approach (and maybe a higher-level typed approach)

1 Like

Just wanted to share the benchmark results are in a much better state after some inlining and simpler component storage, but there’s definitely more to optimize

benchmarking iter
time                 288.8 μs   (287.9 μs .. 289.4 μs)
                     1.000 R²   (1.000 R² .. 1.000 R²)
mean                 286.6 μs   (285.9 μs .. 287.3 μs)
std dev              2.383 μs   (2.076 μs .. 2.828 μs)

benchmarking iter system
time                 289.6 μs   (289.0 μs .. 290.3 μs)
                     1.000 R²   (1.000 R² .. 1.000 R²)
mean                 289.9 μs   (289.4 μs .. 290.7 μs)
std dev              1.972 μs   (1.558 μs .. 2.647 μs)

Thanks a ton for your insight @simon I’m really hoping to dig deeper into the lower-level ideas you brought up too

3 Likes

You can load the profile into speedscope.app for a nicer browsing experience.

This is what we use when profiling issues in GHC.

6 Likes

:eyes: ooo that looks super helpful thanks!

I’ve been messing around it and I definitely think I’ve tracked at least some things down, just still having some trouble with inlining. It looks like reads and writes take roughly similar times, which I guess makes sense since they’re both just iterating over a list of components :thinking: I’m really curious if a stream processing library like Streamly might help here.

benchmarking iter
time                 81.70 μs   (81.67 μs .. 81.73 μs)
                     1.000 R²   (1.000 R² .. 1.000 R²)
mean                 81.56 μs   (81.51 μs .. 81.60 μs)
std dev              150.9 ns   (117.5 ns .. 223.3 ns)

Also I’m psyched to share share v0.11 :smiley: GitHub - aztecs-hs/aztecs: A modular game engine and ECS for Haskell

Queries now use an applicative DSL and systems are now monadic, enabling a much simpler API. This release replaces the Arrow and class-based interface with simple transformers that can be interpreted in a variety of ways. For example, a System can now be run in a single thread with Access.system, or concurrently with Access.concurrently.

import Aztecs.ECS
import Control.DeepSeq
import Control.Monad
import Control.Monad.IO.Class
import Data.Function
import GHC.Generics

newtype Position = Position Int deriving (Show, Generic, NFData)

instance Component Position

newtype Velocity = Velocity Int deriving (Show, Generic, NFData)

instance Component Velocity

move :: (Monad m) => QueryT m Position
move = fetch & zipFetchMap (\(Velocity v) (Position p) -> Position $ p + v)

run :: SystemT IO ()
run = do
  positions <- query move
  liftIO $ print positions

app :: AccessT IO ()
app = do
  _ <- spawn $ bundle (Position 0) <> bundle (Velocity 1)
  forever $ system run

main :: IO ()
main = runAccessT_ app
7 Likes

Also wanted to share a new benchmark I just did against Apecs @wiz

Still way slower than Bevy, but I’m feeling a little better about the archetype approach vs sparse sets. One huge benefit I realized is copying components to concurrent tasks should be a lot faster with archetypes, since smaller groups of components have to be moved.

Aztecs:

q :: Query Position
q = fetch & zipFetchMap (\(Velocity v) (Position p) -> Position $ p + v)

Apecs:

q :: System PosVel ()
q = cmap $ \(ECSVel v, ECSPos p) -> ECSPos (p+v)
4 Likes

Neat. Is there a source for the setup somewhere?

2 Likes

Hmm actually looking over the Apecs bench again I’m not sure I’m using this correctly :thinking: The current bench may include inserting components as well, but I can’t figure out how to return the World from a system yet

For Aztecs it’s just the standard bench of querying two components, and updating one

You don’t. But you don’t have to, because you pass the world to the runner for the systems to operate on - you already have it.

  1. Init world
  2. Run setup system (Insert stuff) using that world handle
  3. Pass the populated world to the benchmark runner.

If you get the same timings, then the setup system were too lazy… But it should be working.

Anyway, I’m curious about the numbers from the Aztecs that actually include inserts too. Sometimes inserts can be bottlenecks too.

1 Like