Merging bigrams using FP


Me and my lab partner just finished a partial task in an assignment where this was the requirement:

Given the input
corpus_l = ['D', 'e', ' ', 's', 'e', 'n', 'a', 's', 't', ...]
merge_bigrams(corpus_l, ('e', 'n')) should return where all the seuquences of ‘e’ and ‘n’ have been merged:
['D', 'e', ' ', 's', 'en', 'a', 's', 't', ...]
And reapplying merge_bigrams(corpus_l, ('s', 'en')) to this corpus should return
['D', 'e', ' ', 'sen', 'a', 's', 't', ...]
You will apply a greedy algorithm. Given the pair (‘a’, ‘a’) and the list [‘a’, ‘a’, ‘a’], the result will be: [‘aa’, ‘a’]

In the spirit of declarative programming, we first tried concatenating the input, and then split on the requested sequence without removing it. However, this made it impossible to compose the function with itself since concatenating the string at the beginning loses the information from the previous call.

Our imperative solution:

def merge_bigrams(corpus_l, pair):
    token_r, token_l = pair
    token_new = token_r + token_l
    new_corpus = []

    i = 0
    while i < len(corpus_l):
        if i + 1 < len(corpus_l) and corpus_l[i] == token_r dand corpus_l[i + 1] == token_l:
            i += 2  
            i += 1

    return new_corpus

How would you solve this in Haskell?


Starting with zipWith f corpus_l (tail corpus_l) seems promising. Definitely not.


@f-a the problem with that is that you need to throw away the pair that comes after the one that you’re merging.

I think the best approach is to use an unfold:

mergeBigrams pair = unfoldr step where
  step (x:y:zs) | (x,y) == pair = Just (x ++ y, zs)
  step xs = uncons xs
ghci> mergeBigrams ("s","en") . mergeBigrams ("e","n") $ ["D", "e", " ", "s", "e", "n", "a", "s", "t"]
["D","e"," ","sen","a","s","t"]

@jaror unfoldr, nice! What made you think of it?

Folds and unfolds are the cornerstones of algorithms if one does not work then try the other:

(and folds won’t work well because they can only see one element of the input list at a time.)