Nice data-structure for grouping?

Suppose I have some data point like:

data Atom =
  { x :: X
  , y :: Y
  , z :: Z
  , m :: M
  }

Then, suppose I have a bunch of them, atoms :: [Atom].

Now, I want to group them. Perhaps first by X, then Y, or by X, Z, Y, or …

Naively, I need to write a bunch of terrible functions like byXYZ :: Dict X (Dict Y (Dict Z Atom))), etc, per way I want to group; but there must be a better way?

3 Likes

There’s a “better” way, but it involves some type-level machinery:

{-# LANGUAGE GADTs, DataKinds, TypeFamilies, UndecidableInstances #-}
import Data.Map (Map)
import Data.Map qualified as Map
import Data.Kind

-- e.g. 'NestedDict [X,Y,Z] Atom' is equivalent to 
--      'Map X (Map Y (Map Z [Atom]))'
data NestedDict keys a where
   DNil :: [a] -> NestedDict '[] a
   DCons :: Map key (NestedDict keys a) -> NestedDict (key : keys) a

instance AllC Ord keys => Semigroup (NestedDict keys a) where
  DNil xs <> DNil ys = DNil (xs ++ ys)
  DCons xs <> DCons ys = DCons (Map.unionWith (<>) xs ys)

-- Map a type level function over a type level list to get
-- a heterogeneous list
data MapF f xs where
  MNil :: MapF f '[]
  MCons :: f x -> MapF f xs -> MapF f (x : xs)

-- Require a constraint of all elements of a type level list
type AllC :: (k -> Constraint) -> [k] -> Constraint
type family AllC c xs where
   AllC _ '[] = ()
   AllC c (x : xs) = (c x, AllC c xs)

-- Given a list of functions that project a key out of the input type 'a', 
-- this produces a nested dict of a grouped on those keys.
toNestedDict :: 
  AllC Ord keys => 
  MapF ((->) a) keys -> 
  [a] -> 
  NestedDict keys a
toNestedDict MNil xs = DNil xs
toNestedDict (MCons f ms) xs = 
  DCons (Map.fromListWith (<>) [ (f x, toNestedDict ms [x]) | x <- xs ])
2 Likes

I’m not sure what the context for this is, so this suggestion might be way off, but maybe you want to use an SQL database or something that approximates it in-memory like ixset

4 Likes

If you want a “simple Haskell” solution, you could use generic functions that allow you to specify the grouping criteria.

#!/usr/bin/env cabal
{- cabal:
build-depends:
    base ^>=4.18
  , containers ^>= 0.6
-}
{- project:
with-compiler: ghc-9.6.3
-}

module Main (main) where

import Data.Map.Lazy (Map)
import qualified Data.Map.Lazy as Map
import Data.Word (Word16, Word32, Word8)

------------------------------------------------------------------------------

groupToMapBy
  :: Ord a
  => (x -> a)
  -> [x]
  -> Map a [x]
groupToMapBy f = Map.fromListWith (++) . map (liftA2 (,) f pure)

groupToMapBy2
  :: (Ord a, Ord b)
  => (x -> a)
  -> (x -> b)
  -> [x]
  -> Map a (Map b [x])
groupToMapBy2 f g = fmap (groupToMapBy g) . groupToMapBy f

groupToMapBy3
  :: (Ord a, Ord b, Ord c)
  => (x -> a)
  -> (x -> b)
  -> (x -> c)
  -> [x]
  -> Map a (Map b (Map c [x]))
groupToMapBy3 f g h = fmap (groupToMapBy2 g h) . groupToMapBy f

------------------------------------------------------------------------------

data Atom
  = Atom
    { x :: Word8
    , y :: Word16
    , z :: Word32
    , m :: Int
    }
  deriving Show

demoAtoms :: [Atom]
demoAtoms =
    [ Atom 1 1 1 1, Atom 1 1 2 2, Atom 1 2 1 3, Atom 1 2 2 4, Atom 2 1 1 5
    , Atom 2 2 1 6, Atom 3 4 5 7, Atom 3 4 5 8
    ]

main :: IO ()
main = print $ groupToMapBy3 x y z demoAtoms
3 Likes

I use foldMap and a Map Union Semigroup wrapper. The function for foldMap is simple - just a bunch of Map.singletons

Untested, so maybe not exactly this but this general idea:

Write a groupBy :: (a -> key) -> [Atom] -> Dict key [Atom].

Assuming Dict is a Functor: fmap (fmap (fmap single . groupBy x) . groupBy y) . groupBy z where single [x] = x; single xs = ??. If ?? is a Left, then use traverses instead of fmaps to get it up to top level.

I designed deep-map for this! It’s powered both by single-level manipulation functions as well as indexed-traversable for accessing all the way down in one fell swoop.

6 Likes

Thanks everyone; much appreciated! :slight_smile:

1 Like

Keep in mind that Map.fromListWith f calls f with new element to the left, so Map.fromListWith (++) store elements in the reverse order which is usually not what the user wants.
Most of the time you’ll need Map.fromListWith (flip (++)) (or equivalent) instead.

1 Like

It really depends if you want the type to keep track of the actuall structure (in that case a combination of Map X (Map Y ...) is fine) or if you want to hide it behind one type (GroupOfAtoms`).

To hide the key, you can have define a key datatype

 data Key = KeyX X | KeyY Y | KeyZ Z | KeyM M deriving (Eq, Ord)

And then, if you need to hide the level of nesting, define

 data GroupOfAtoms  =
     Group (Map Key GroupOfAtoms)
     Atoms [Atom]

No one has mentioned monoidal-containers yet, which is just a newtype-wrapped Map or HashMap but frees you from burdens such as mentioned by @maxigit. So for reasonably efficient grouping you could use a MonoidalHashMap k (Seq Atom) where k could be a type of tuples (X,Z,Y) of the fields you are interested in. As @TeofilC suggested, you might want to think about how databases do it: Sometimes there are composite primary keys consisting of several fields, sometimes there are completely separate index structures. Depends on your use-case. The latter could be emulated in Haskell by several Maps without nesting, where the values in each Map point into the same [Atom] list.

2 Likes

Unfortunately fromList and similar functions in monoidal-containers are footguns (they have exactly the same biased behavior as their counterparts from the containers lib):

1 Like

Yes, it is a footgun, but the type signature of fromList clearly says there is in general no monoidal container being built behind each key, so you can not expect semigroup operation being used. For what its worth, once I wrote a module with the following class to solve the problem of grouping.

class Alternative f => Grouping map f key where
   groupWith :: Foldable t => (a -> key) -> (a -> value) -> t a -> map key (f value)

with instances where map is instantiated by the types of monoidal-containers and f can be [] or Seq or Vector. Since the type signature would be ambiguous at the call site, one must explicitly specify which map and f to use. The class instances make sure to use foldMap instead of fromList.

Indeed the discussion of the PR above contains the argument that fromList in its current form should not even be in containers since fromListWith const is almost certainly not what you want.

Summing up, monoidal-containers has served me well as an abstraction in database applications where one typically either folds over the returned rows or aggregates the rows in a monoidal container to process them in another way.