Challenge: Finding the longest repeated substring

tristanC · July 4, 2024, 12:34pm

Hello folks, I had the following challenge when working on the last ICFP contest (see my notes on String Packer). This looks like an interesting problem and I was surprised to not be able to find a readily available implementation.

We are looking for a function to find the longest repeated and non overlapping substring, like this:

-- >>> longestRepeatedSubstring "UDLLLDURRRRRURR"
-- "URR"
longestRepeatedSubstring :: String -> String

It looks like we should use the suffixtree package, as suggested in the wikipedia page. A followup challenge would be to return all the substrings with their occurrence count, like this:

-- >>> longestRepeatedSubstring "UDLLLDURRRRRURR"
-- [("URR", 1), ("RR", 3), ("UR", 2), ...]
longestRepeatedSubstrings :: String -> [(String, Natural)]

So does anyone want to have a go at it?

-Tristan

meooow · July 4, 2024, 1:39pm

What does “non-overlapping” mean here?

Without this requirement, clearly you have a good solution with suffix trees.

Leary · July 4, 2024, 1:46pm

module Substrings (
  longestRepeated,
  nonoverlapping,
  substrings,
) where

-- base
import Data.Ord (Down(..), comparing)
import Data.List (find, sortOn, maximumBy)

-- containers
import Data.Set (Set)
import Data.Set qualified as S
import Data.Map (Map)
import Data.Map qualified as M


longestRepeated :: Ord a => [a] -> Maybe [a]
longestRepeated
  = fmap fst
  . find ((>1) . length . snd)
  . sortOn (Down . length . fst)
  . M.toList
  . nonoverlapping

nonoverlapping, substrings :: Ord a => [a] -> Map [a] [Int]
nonoverlapping = M.mapWithKey (deoverlap . length) . substrings
substrings     = uncurry (append 0) . substrings_ 0

deoverlap :: Int -> [Int] -> [Int]
deoverlap n = maximumBy (comparing length) . foldr f [[]]
 where
  f p pss = do
    ps <- pss
    case ps of
      p':ps' | p + n > p' -> [p:ps', p':ps']
      _                   -> pure (p:ps)

substrings_ :: Ord a => Int -> [a] -> (Set [a], Map [a] [Int])
substrings_ _ [    ] = (S.empty, M.empty)
substrings_ n (x:xs) =
  ( S.insert [x] (S.mapMonotonic (x:) i)
  , append (n+1) i r
  ) where (i, r) = substrings_ (n+1) xs

append :: Ord k => Int -> Set k -> Map k [Int] -> Map k [Int]
append n = M.unionWith (++) . M.fromSet (const [n])

tristanC · July 4, 2024, 2:08pm

I meant, the substring occurrences should not overlap. For example RRRR is a valid repeated substring, it appears in URRRRR and URRRRRR, but the occurrences overlap.

tristanC · July 4, 2024, 2:14pm

Well done @Leary ! Though I forgot to mention, we are looking for an implementation that works with arbitrary input, let’s say for length equals one million. Unfortunately, your solution seems to hang on Substrings.longestRepeated $ take 128 $ repeat 'c'.

meooow · July 4, 2024, 2:49pm

What if RRRR occurs elsewhere in the string where it does not overlap with itself? Would it be considered or rejected?
What if some substring overlaps with itself repeatedly such that some of the occurences don’t overlap? For example, is ABA considered in “ABABABA”? The first occurence overlaps with the second and the second overlaps with the third, but the first does not overlap with the third.

It would be great if you could define your requirements precisely.

tristanC · July 4, 2024, 2:57pm

Thank you for your questions!

As long as there are two repeated occurrences that don’t overlap then that’s ok.
ABA occurs twice in ABABABA and ABABABA so that is accepted.

In my use case, this function would be used to compress the text by replacing the repetitions with variable concatenation, e.g. by replacing “ABABABA” with (\s -> s <> "B" <> s) "ABA". For this to work we would only consider the substrings that are bigger than the equivalent concatenation code.

jaror · July 4, 2024, 3:47pm

Generalizing your example gives a good formal specification:

For a given string s, find the longest string x such that:

s = a <> x <> b <> x <> c

For some strings a, b, and c.

meooow · July 4, 2024, 7:09pm

Ah I see, the motivation makes things a lot clearer!

There are a few hits if you search for “longest non-overlapping substring”.

One way is to use a suffix structure and track the first and last occurences of substrings to minimize overlap.
Here’s an implementation (not optimized!) of this idea:

{- cabal:
build-depends: base, suffixtree
-}
{-# LANGUAGE BangPatterns #-}

import Data.Foldable1
import qualified Data.List.NonEmpty as NE
import Data.Semigroup
import Data.SuffixTree

longestRepeatedNonOverlapping :: String -> String
longestRepeatedNonOverlapping s =
  case go 0 [] (construct (s ++ "\0")) of (,,) (Max (Arg _ s')) _ _ -> s'
  where
    go !len _ Leaf = (,,) (Max (Arg 0 "")) (Min len) (Max len)
    go len pieces (Node es) = case foldMap1' f (NE.fromList es) of
      (,,) best min_@(Min mnx) max_@(Max mxx) ->
        let len' = min len (mxx - mnx) -- no overlapping
            substr = take len' (concat (reverse pieces))
        in (,,) (best <> Max (Arg len' substr)) min_ max_
      where
        f (p,n) = go (len + length (prefix p)) (prefix p : pieces) n

This is O(n) once the suffix tree is constructed.
Side-note: The suffixtree library is awkward to work with and certainly not as efficient as it claims to be. Maybe we could have a better library in this space…

Another approach is to use binary search and rolling hashes. We can binary search on the length of the longest repeating substring in range [0…n], and use rolling hashes to find repeating substrings of that length. Overlaps can be handled the same way by tracking the first and last occurences. This would be O(n log n).

Given that the original motivation is to perform compression, I don’t know if this way of replacing substrings scales well . It is close to dictionary compression, so it might be worth taking inspiration from that area of algorithms.

olf · July 4, 2024, 8:09pm

Indeed, a performant full-text index in Haskell would be very cool. Unfortunately, most of the literature’s algorithms on this topic are very array-centric and procedural. The suffix tree seems to stand out as a structure that carries over well to functional programming.

As to this particular challenge, maybe a variant of Lempel-Ziv 78 can be helpful here: The reason is that the LZ78 blocks are non-overlapping. Hence the longest block contains as a prefix a long non-overlapping repeated substring. That might not be the globally longest repeat, though.

tristanC · July 4, 2024, 8:13pm

Ha great, I’ll have to unpack the implementation, but that seems correct, thanks! Though it’s not very fast, 50k char input took 40 seconds to run on my laptop. Perhaps a more efficient suffixtree library could work faster on bigger inputs?

Well you are right that’s not the best strategy for such a compression and using a dictionary made of smaller substrings would work better. But that’s what we tried during the contest By the way, I regenerated the solutions with your implementation, and it yields better results for all solutions expect one (solution 8) where the previous substring occured more often, see this commit.

meooow · July 5, 2024, 4:59pm

Perhaps a more efficient suffixtree library could work faster on bigger inputs?

Yes definitely.

By the way, I regenerated the solutions with your implementation, and it yields better results for all solutions expect one…

Neat!

Unfortunately, most of the literature’s algorithms on this topic are very array-centric and procedural.

I don’t see this as a problem, we have arrays in Haskell (:

The suffix tree seems to stand out as a structure that carries over well to functional programming.

I would disagree, there isn’t a lot that is functional about it. Except that it’s a tree of course, and we love trees.