Merging bigrams using FP

olaberglund · September 16, 2023, 1:51pm

Hey!

My lab partner and I just finished a partial task in an assignment where this was the requirement:

Given the input
corpus_l = ['D', 'e', ' ', 's', 'e', 'n', 'a', 's', 't', ...]
merge_bigrams(corpus_l, ('e', 'n')) should return the corpus with all the sequences of ‘e’ followed by ‘n’ merged:
['D', 'e', ' ', 's', 'en', 'a', 's', 't', ...]
And reapplying merge_bigrams(corpus_l, ('s', 'en')) to this corpus should return
['D', 'e', ' ', 'sen', 'a', 's', 't', ...]
You will apply a greedy algorithm. Given the pair (‘a’, ‘a’) and the list [‘a’, ‘a’, ‘a’], the result will be: [‘aa’, ‘a’]

In the spirit of declarative programming, we first tried concatenating the input into one string and then splitting on the requested sequence without removing it. However, this made it impossible to compose the function with itself, since concatenating at the beginning throws away the token boundaries produced by the previous call.
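To make the problem concrete, here is a rough sketch of what such a string-based attempt might look like. This is a hypothetical reconstruction, not our actual code: the name merge_bigrams_via_string and the use of re.split with a capturing group are assumptions made for illustration.

```python
import re

def merge_bigrams_via_string(corpus_l, pair):
    # Hypothetical reconstruction of the string-based attempt.
    merged = pair[0] + pair[1]
    text = "".join(corpus_l)  # token boundaries are lost here
    # A capturing group makes re.split keep the separator in the result.
    pieces = re.split(f"({re.escape(merged)})", text)
    out = []
    for piece in pieces:
        if piece == merged:
            out.append(piece)   # keep the merged token whole
        else:
            out.extend(piece)   # everything else falls back to single characters
    return out

corpus = ["D", "e", " ", "s", "e", "n", "a", "s", "t"]
first = merge_bigrams_via_string(corpus, ("e", "n"))
# The 'en' token from the first call is split back into 'e', 'n' here:
second = merge_bigrams_via_string(first, ("a", "s"))
```

The first call looks fine, but the second call rebuilds the string from scratch, so any token merged earlier that is not part of the current pair is broken back into characters.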

Our imperative solution:

def merge_bigrams(corpus_l, pair):
    # token_l is the left element of the pair, token_r the right one
    token_l, token_r = pair
    token_new = token_l + token_r
    new_corpus = []

    i = 0
    while i < len(corpus_l):
        # Greedy: merge the first match found and skip both tokens
        if i + 1 < len(corpus_l) and corpus_l[i] == token_l and corpus_l[i + 1] == token_r:
            new_corpus.append(token_new)
            i += 2
        else:
            new_corpus.append(corpus_l[i])
            i += 1

    return new_corpus

How would you solve this in Haskell?

f-a · September 16, 2023, 2:05pm

~~Starting with zipWith f corpus_l (tail corpus_l) seems promising.~~ Definitely not.

jaror · September 16, 2023, 2:14pm

@f-a the problem with that is that you need to throw away the pair that comes after the one that you’re merging.
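A small illustration of that overlap, written in Python to match the original post. The variable names are only for this sketch; zip(corpus, corpus[1:]) plays the role of zipping the list with its tail.

```python
corpus = ["a", "a", "a"]
pair = ("a", "a")

# Pair every token with its right neighbour, like zip xs (tail xs).
adjacent = list(zip(corpus, corpus[1:]))

# Positions where the adjacent pair equals the pair to merge.
matches = [i for i, p in enumerate(adjacent) if p == pair]
```

Both positions 0 and 1 match, but they share the middle 'a'. The greedy rule wants only the first merge, so the second pair has to be discarded, which a plain element-wise zip cannot express on its own.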

I think the best approach is to use an unfold:

import Data.List (unfoldr, uncons)

mergeBigrams :: (String, String) -> [String] -> [String]
mergeBigrams pair = unfoldr step where
  step (x:y:zs) | (x,y) == pair = Just (x ++ y, zs)
  step xs = uncons xs

ghci> mergeBigrams ("s","en") . mergeBigrams ("e","n") $ ["D", "e", " ", "s", "e", "n", "a", "s", "t"]
["D","e"," ","sen","a","s","t"]
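For comparison with the Python version in the original post, the same idea can be sketched in Python. Python has no built-in unfoldr, so the unfold helper below is a hand-written stand-in, and the names unfold and merge_bigrams_unfold are made up for this example.

```python
def unfold(step, seed):
    # Repeatedly apply step to the seed; step returns (value, next_seed)
    # or None when there is nothing left to produce.
    while True:
        result = step(seed)
        if result is None:
            return
        value, seed = result
        yield value

def merge_bigrams_unfold(corpus_l, pair):
    def step(tokens):
        # Look at the first two tokens; merge them if they match the pair.
        if tokens[:2] == list(pair):
            return pair[0] + pair[1], tokens[2:]
        # Otherwise emit one token unchanged (the uncons case).
        if tokens:
            return tokens[0], tokens[1:]
        return None
    return list(unfold(step, list(corpus_l)))

corpus = ["D", "e", " ", "s", "e", "n", "a", "s", "t"]
result = merge_bigrams_unfold(merge_bigrams_unfold(corpus, ("e", "n")), ("s", "en"))
```

The list slicing makes this quadratic in Python, so it is meant to show the shape of the unfold rather than to replace the while loop.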

olaberglund · September 16, 2023, 2:45pm

@jaror unfoldr, nice! What made you think of it?

jaror · September 16, 2023, 2:48pm

Folds and unfolds are the cornerstones of algorithms: if one does not work, try the other.

(And a fold won’t work well here, because it can only see one element of the input list at a time.)
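To show what that limitation costs, here is a sketch of a fold-based version in Python using functools.reduce. Because the step function sees one token at a time, it has to carry extra state: the output so far plus a flag saying whether the last output token may still be merged. The name merge_bigrams_fold and the flag are assumptions of this sketch, not something from the assignment.

```python
from functools import reduce

def merge_bigrams_fold(corpus_l, pair):
    first, second = pair

    def step(acc, token):
        # out: tokens emitted so far
        # pending: True if out[-1] is an unmerged token that may still pair up
        out, pending = acc
        if pending and out[-1] == first and token == second:
            # Replace the last token with the merged one; it cannot merge again.
            return out[:-1] + [first + second], False
        return out + [token], True

    out, _ = reduce(step, corpus_l, ([], False))
    return out

merged = merge_bigrams_fold(["a", "a", "a"], ("a", "a"))
```

It gives the greedy result on the examples in this thread, but the bookkeeping is exactly what the unfold avoids by looking at two tokens at once.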
