Nice data-structure for grouping?

silky · November 11, 2023, 9:47am

Suppose I have some data point like:

data Atom =
  { x :: X
  , y :: Y
  , z :: Z
  , m :: M
  }

Then, suppose I have a bunch of them, atoms :: [Atom].

Now, I want to group them. Perhaps first by X, then Y, or by X, Z, Y, or …

Naively, I need to write a bunch of terrible functions like byXYZ :: Dict X (Dict Y (Dict Z Atom))), etc, per way I want to group; but there must be a better way?

jaror · November 11, 2023, 10:23am

There’s a “better” way, but it involves some type-level machinery:

{-# LANGUAGE GADTs, DataKinds, TypeFamilies, UndecidableInstances #-}
import Data.Map (Map)
import Data.Map qualified as Map
import Data.Kind

-- e.g. 'NestedDict [X,Y,Z] Atom' is equivalent to 
--      'Map X (Map Y (Map Z [Atom]))'
data NestedDict keys a where
   DNil :: [a] -> NestedDict '[] a
   DCons :: Map key (NestedDict keys a) -> NestedDict (key : keys) a

instance AllC Ord keys => Semigroup (NestedDict keys a) where
  DNil xs <> DNil ys = DNil (xs ++ ys)
  DCons xs <> DCons ys = DCons (Map.unionWith (<>) xs ys)

-- Map a type level function over a type level list to get
-- a heterogeneous list
data MapF f xs where
  MNil :: MapF f '[]
  MCons :: f x -> MapF f xs -> MapF f (x : xs)

-- Require a constraint of all elements of a type level list
type AllC :: (k -> Constraint) -> [k] -> Constraint
type family AllC c xs where
   AllC _ '[] = ()
   AllC c (x : xs) = (c x, AllC c xs)

-- Given a list of functions that project a key out of the input type 'a', 
-- this produces a nested dict of a grouped on those keys.
toNestedDict :: 
  AllC Ord keys => 
  MapF ((->) a) keys -> 
  [a] -> 
  NestedDict keys a
toNestedDict MNil xs = DNil xs
toNestedDict (MCons f ms) xs = 
  DCons (Map.fromListWith (<>) [ (f x, toNestedDict ms [x]) | x <- xs ])

TeofilC · November 11, 2023, 10:58am

I’m not sure what the context for this is, so this suggestion might be way off, but maybe you want to use an SQL database or something that approximates it in-memory like ixset

tcard · November 11, 2023, 11:06am

If you want a “simple Haskell” solution, you could use generic functions that allow you to specify the grouping criteria.

#!/usr/bin/env cabal
{- cabal:
build-depends:
    base ^>=4.18
  , containers ^>= 0.6
-}
{- project:
with-compiler: ghc-9.6.3
-}

module Main (main) where

import Data.Map.Lazy (Map)
import qualified Data.Map.Lazy as Map
import Data.Word (Word16, Word32, Word8)

------------------------------------------------------------------------------

groupToMapBy
  :: Ord a
  => (x -> a)
  -> [x]
  -> Map a [x]
groupToMapBy f = Map.fromListWith (++) . map (liftA2 (,) f pure)

groupToMapBy2
  :: (Ord a, Ord b)
  => (x -> a)
  -> (x -> b)
  -> [x]
  -> Map a (Map b [x])
groupToMapBy2 f g = fmap (groupToMapBy g) . groupToMapBy f

groupToMapBy3
  :: (Ord a, Ord b, Ord c)
  => (x -> a)
  -> (x -> b)
  -> (x -> c)
  -> [x]
  -> Map a (Map b (Map c [x]))
groupToMapBy3 f g h = fmap (groupToMapBy2 g h) . groupToMapBy f

------------------------------------------------------------------------------

data Atom
  = Atom
    { x :: Word8
    , y :: Word16
    , z :: Word32
    , m :: Int
    }
  deriving Show

demoAtoms :: [Atom]
demoAtoms =
    [ Atom 1 1 1 1, Atom 1 1 2 2, Atom 1 2 1 3, Atom 1 2 2 4, Atom 2 1 1 5
    , Atom 2 2 1 6, Atom 3 4 5 7, Atom 3 4 5 8
    ]

main :: IO ()
main = print $ groupToMapBy3 x y z demoAtoms

Ambrose · November 11, 2023, 2:09pm

I use foldMap and a Map Union Semigroup wrapper. The function for foldMap is simple - just a bunch of Map.singletons

elaforge · November 12, 2023, 12:47am

Untested, so maybe not exactly this but this general idea:

Write a groupBy :: (a -> key) -> [Atom] -> Dict key [Atom].

Assuming Dict is a Functor: fmap (fmap (fmap single . groupBy x) . groupBy y) . groupBy z where single [x] = x; single xs = ??. If ?? is a Left, then use traverses instead of fmaps to get it up to top level.

mixphix · November 12, 2023, 12:48am

I designed deep-map for this! It’s powered both by single-level manipulation functions as well as indexed-traversable for accessing all the way down in one fell swoop.

silky · November 12, 2023, 7:55am

Thanks everyone; much appreciated!

maxigit · November 12, 2023, 10:53am

Keep in mind that Map.fromListWith f calls f with new element to the left, so Map.fromListWith (++) store elements in the reverse order which is usually not what the user wants.
Most of the time you’ll need Map.fromListWith (flip (++)) (or equivalent) instead.

maxigit · November 12, 2023, 11:01am

It really depends if you want the type to keep track of the actuall structure (in that case a combination of Map X (Map Y ...) is fine) or if you want to hide it behind one type (GroupOfAtoms`).

To hide the key, you can have define a key datatype

 data Key = KeyX X | KeyY Y | KeyZ Z | KeyM M deriving (Eq, Ord)

And then, if you need to hide the level of nesting, define

 data GroupOfAtoms  =
     Group (Map Key GroupOfAtoms)
     Atoms [Atom]

olf · November 12, 2023, 9:12pm

No one has mentioned monoidal-containers yet, which is just a newtype-wrapped Map or HashMap but frees you from burdens such as mentioned by @maxigit. So for reasonably efficient grouping you could use a MonoidalHashMap k (Seq Atom) where k could be a type of tuples (X,Z,Y) of the fields you are interested in. As @TeofilC suggested, you might want to think about how databases do it: Sometimes there are composite primary keys consisting of several fields, sometimes there are completely separate index structures. Depends on your use-case. The latter could be emulated in Haskell by several Maps without nesting, where the values in each Map point into the same [Atom] list.

jaror · November 12, 2023, 10:29pm

Unfortunately fromList and similar functions in monoidal-containers are footguns (they have exactly the same biased behavior as their counterparts from the containers lib):

github.com/bgamari/monoidal-containers

Make functions correspond more closely to Data.Map behavior

bgamari:master ← bgamari:aa-0.6-revert-semigroup-operations

opened 12:43PM - 22 Aug 19 UTC

ali-abrar

+65 -51

This is a breaking change. The more "monoidal" behaviors that this reverts coul…d eventually be re-added as a separate module that more thoroughly and systematically applies the monoid/semigroup operations. The goal of Data.Map.Monoidal is to be as similar to Data.Map as possible except for the Monoid/Semigroup instance that it replaces. Thanks to @cgibbard and @bgamari for helping clarify this goal.

olf · November 13, 2023, 7:03am

Yes, it is a footgun, but the type signature of fromList clearly says there is in general no monoidal container being built behind each key, so you can not expect semigroup operation being used. For what its worth, once I wrote a module with the following class to solve the problem of grouping.

class Alternative f => Grouping map f key where
   groupWith :: Foldable t => (a -> key) -> (a -> value) -> t a -> map key (f value)

with instances where map is instantiated by the types of monoidal-containers and f can be [] or Seq or Vector. Since the type signature would be ambiguous at the call site, one must explicitly specify which map and f to use. The class instances make sure to use foldMap instead of fromList.

Indeed the discussion of the PR above contains the argument that fromList in its current form should not even be in containers since fromListWith const is almost certainly not what you want.

Summing up, monoidal-containers has served me well as an abstraction in database applications where one typically either folds over the returned rows or aggregates the rows in a monoidal container to process them in another way.