 # Merging bigrams using FP

Hey!

Me and my lab partner just finished a partial task in an assignment where this was the requirement:

Given the input
`corpus_l = ['D', 'e', ' ', 's', 'e', 'n', 'a', 's', 't', ...]`
`merge_bigrams(corpus_l, ('e', 'n'))` should return where all the seuquences of ‘e’ and ‘n’ have been merged:
`['D', 'e', ' ', 's', 'en', 'a', 's', 't', ...]`
And reapplying `merge_bigrams(corpus_l, ('s', 'en'))` to this corpus should return
`['D', 'e', ' ', 'sen', 'a', 's', 't', ...]`
You will apply a greedy algorithm. Given the pair (‘a’, ‘a’) and the list [‘a’, ‘a’, ‘a’], the result will be: [‘aa’, ‘a’]

In the spirit of declarative programming, we first tried concatenating the input, and then split on the requested sequence without removing it. However, this made it impossible to compose the function with itself since concatenating the string at the beginning loses the information from the previous call.

Our imperative solution:

``````def merge_bigrams(corpus_l, pair):
token_r, token_l = pair
token_new = token_r + token_l
new_corpus = []

i = 0
while i < len(corpus_l):
if i + 1 < len(corpus_l) and corpus_l[i] == token_r dand corpus_l[i + 1] == token_l:
new_corpus.append(token_new)
i += 2
else:
new_corpus.append(corpus_l[i])
i += 1

return new_corpus
``````

How would you solve this in Haskell?

2 Likes

Starting with `zipWith f corpus_l (tail corpus_l)` seems promising. Definitely not.

2 Likes

@f-a the problem with that is that you need to throw away the pair that comes after the one that you’re merging.

I think the best approach is to use an unfold:

``````mergeBigrams pair = unfoldr step where
step (x:y:zs) | (x,y) == pair = Just (x ++ y, zs)
step xs = uncons xs
``````
``````ghci> mergeBigrams ("s","en") . mergeBigrams ("e","n") \$ ["D", "e", " ", "s", "e", "n", "a", "s", "t"]
["D","e"," ","sen","a","s","t"]
``````
7 Likes

@jaror `unfoldr`, nice! What made you think of it?

Folds and unfolds are the cornerstones of algorithms if one does not work then try the other:

(and folds won’t work well because they can only see one element of the input list at a time.)

4 Likes