Ah I see, the motivation makes things a lot clearer!
There are a few hits if you search for “longest non-overlapping substring”.
One way is to use a suffix structure and track the first and last occurences of substrings to minimize overlap.
Here’s an implementation (not optimized!) of this idea:
{- cabal:
build-depends: base, suffixtree
-}
{-# LANGUAGE BangPatterns #-}
import Data.Foldable1
import qualified Data.List.NonEmpty as NE
import Data.Semigroup
import Data.SuffixTree
longestRepeatedNonOverlapping :: String -> String
longestRepeatedNonOverlapping s =
case go 0 [] (construct (s ++ "\0")) of (,,) (Max (Arg _ s')) _ _ -> s'
where
go !len _ Leaf = (,,) (Max (Arg 0 "")) (Min len) (Max len)
go len pieces (Node es) = case foldMap1' f (NE.fromList es) of
(,,) best min_@(Min mnx) max_@(Max mxx) ->
let len' = min len (mxx - mnx) -- no overlapping
substr = take len' (concat (reverse pieces))
in (,,) (best <> Max (Arg len' substr)) min_ max_
where
f (p,n) = go (len + length (prefix p)) (prefix p : pieces) n
This is O(n) once the suffix tree is constructed.
Side-note: The suffixtree
library is awkward to work with and certainly not as efficient as it claims to be. Maybe we could have a better library in this space…
Another approach is to use binary search and rolling hashes. We can binary search on the length of the longest repeating substring in range [0…n], and use rolling hashes to find repeating substrings of that length. Overlaps can be handled the same way by tracking the first and last occurences. This would be O(n log n).
Given that the original motivation is to perform compression, I don’t know if this way of replacing substrings scales well . It is close to dictionary compression, so it might be worth taking inspiration from that area of algorithms.