Guidance on how to parse English sentences

Hello, just for personal learning, I would like to build a tool that changes words from American to British spelling. I’m quite the beginner at Haskell, and it’s been over a decade since my computational linguistics module at university.

The main feature, or main challenge, of my tool is that it should recognise the American spelling of words that are made up. For example, if we pretend that the verb “tweet” doesn’t exist, then one might label the act of posting on what was formerly Twitter as “twitterising”. I’d like my tool to be able to recognise “twitterize” and convert it to “twitterise”.

I reckon that I might first need to parse a sentence (or text) into its various parts of speech, and then select the (relevant) nouns and verbs for further processing.

Would this be the right place to start? Apart from parsing and natural language grammar, are there other concepts that I ought to consider in this project?

Cool! Some advice:

Parsing natural language, especially without statistical methods, is hard, but for your project, it sounds like you only need to break the sentence up into words, and then try to identify relevant word endings.
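
A minimal sketch of that first step, using only base: it splits the text into runs of letters and runs of everything else, so the pieces can be glued back together unchanged, and then picks out the words ending in “ize”. The names chunks and candidates are made up for illustration, and the suffix test is deliberately naive (it would also flag real words such as “size”).

```haskell
import Data.Char (isAlpha)
import Data.Function (on)
import Data.List (groupBy, isSuffixOf)

-- | Split text into maximal runs of letters and maximal runs of
-- everything else (spaces, punctuation, digits). Concatenating the
-- result gives back the original text.
chunks :: String -> [String]
chunks = groupBy ((==) `on` isAlpha)

-- | Keep only the letter runs that end in "ize".
-- Hypothetical helper: a real tool would also need inflected forms
-- ("-izes", "-ized", ...) and a list of exceptions.
candidates :: String -> [String]
candidates = filter isCandidate . chunks
  where
    isCandidate w = all isAlpha w && "ize" `isSuffixOf` w
```

For instance, chunks "I twitterize daily." gives ["I"," ","twitterize"," ","daily","."], and candidates on the same input gives ["twitterize"]. Note that apostrophes count as non-letters here, so a contraction like “doesn’t” is split into three chunks.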

You could think about writing a parser with the megaparsec library to do this. I think that is where I would start. Basically, the parser would convert the sentence into a data structure that contains words, each broken into its parts (morphemes), and then you’d pretty-print this into the British (or US) form.
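
To make the data structure idea concrete, here is a rough sketch under a few assumptions. It uses plain functions from base rather than megaparsec, so it runs without extra dependencies, and the types Morph and Ending and the functions analyse and renderBritish are hypothetical names. It knows nothing about exceptions, so a real word like “size” would wrongly be split into “s” plus an ending.

```haskell
import Data.List (stripPrefix)
import Data.Maybe (mapMaybe)

-- | The American endings this sketch recognises (an illustrative subset).
data Ending = Ize | Izes | Ized | Izing | Ization
  deriving (Eq, Show, Enum, Bounded)

-- | A word broken into a stem and, if one was found, a known ending.
data Morph = Morph { stem :: String, ending :: Maybe Ending }
  deriving (Eq, Show)

american, british :: Ending -> String
american Ize     = "ize"
american Izes    = "izes"
american Ized    = "ized"
american Izing   = "izing"
american Ization = "ization"
british Ize      = "ise"
british Izes     = "ises"
british Ized     = "ised"
british Izing    = "ising"
british Ization  = "isation"

-- | Drop a suffix from a word, if the word ends with it.
stripSuffix :: String -> String -> Maybe String
stripSuffix suffix w = reverse <$> stripPrefix (reverse suffix) (reverse w)

-- | Break a word into stem and ending; unknown words keep no ending.
analyse :: String -> Morph
analyse w = case mapMaybe split [minBound .. maxBound] of
  (m : _) -> m
  []      -> Morph w Nothing
  where
    split e = (\s -> Morph s (Just e)) <$> stripSuffix (american e) w

-- | Pretty-print a word using the British form of its ending.
renderBritish :: Morph -> String
renderBritish (Morph s e) = s ++ maybe "" british e
```

With this, renderBritish (analyse "twitterize") gives "twitterise", and a word without a recognised ending comes back unchanged.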

I don’t think megaparsec will be of any use if you have already (and rightly, imo) determined that POS tagging is necessary to do this transformation properly.

A quick Hackage search turns up chatter, which seems to include a decent POS tagging algorithm.

Edit: But chatter seems quite old (last upload in 2017), so you’ll probably have to change it a bit to make it work on the latest GHC versions.

Edit: I just tried to build it and it actually works. Take that people who think Haskell is unstable!

Edit: Or at least the version from GitHub works; it had a commit 3 years ago that fixed the version bounds. The Hackage release does not work :(

Hey! Thank you so much for looking into the package, and for trying to get it set up. Shame that the Hackage release isn’t working. At the risk of asking a silly question, will I probably have to write my own parser to do the part-of-speech tagging?

You can clone the GitHub repo and use that. Or, if you’re using cabal, you can create a cabal.project file and add:

packages: .

source-repository-package
    type: git
    location: https://github.com/noughtmare/chatter.git
    tag: fd029b19cce47d8862c6a2dcd36d4b8dcc5a5e2e

(That’s my fork, where I fixed a tiny issue with a module being listed twice in the cabal file.)

Then you should be able to add the chatter dependency to the build-depends in your .cabal file and everything should just work.

Edit: My PR has now been merged so you can use the main repo:

source-repository-package
    type: git
    location: https://github.com/creswick/chatter.git
    tag: 1ec371491e6ff039ba4f33bd80e5d0dd5f29a74b

Thanks once again @jaror! I think this should be enough for me to get started. :)

Thanks for the advice @reuben! :) I’ll go with the chatter approach to start with, but in later iterations, and if I’m up for it, I wouldn’t mind trying to write a parser that does part-of-speech tagging.

You’re welcome! Just to clarify, I meant that the parser might be useful not so much for the POS tagging, but for the morphology, i.e. to identify and change word endings like “ise” on verbs. But depending on what chatter does, that might not be necessary.
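
As a rough illustration of how tagging and morphology could fit together, here is a sketch that rewrites the ending only on tokens tagged as verbs. The POS type and the (word, tag) pairs are a hypothetical stand-in for whatever a tagger returns; they are not chatter’s actual types. The exception list is a small, incomplete sample of words that keep “-ize” in British spelling.

```haskell
import Data.List (isSuffixOf)

-- | A coarse, hypothetical tag set; a real tagger has many more tags.
data POS = Noun | Verb | Other
  deriving (Eq, Show)

-- | A token paired with its tag (assumed shape of the tagger output).
type Tagged = (String, POS)

-- | Words spelled with "-ize" on both sides of the Atlantic.
exceptions :: [String]
exceptions = ["size", "prize", "seize", "capsize"]

-- | Rewrite a trailing "ize" to "ise" unless the word is an exception.
briticise :: String -> String
briticise w
  | w `elem` exceptions  = w
  | "ize" `isSuffixOf` w = take (length w - 3) w ++ "ise"
  | otherwise            = w

-- | Apply the rewrite to verbs only and leave every other token alone.
convert :: [Tagged] -> [String]
convert = map step
  where
    step (w, Verb) = briticise w
    step (w, _)    = w
```

For example, convert [("I", Other), ("twitterize", Verb), ("the", Other), ("prize", Noun)] gives ["I","twitterise","the","prize"]. Restricting the rewrite to verbs is a simplification; nouns such as “twitterization” would need the same treatment with their own endings.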

Hello again, pardon the delayed response. I added the above into my Cabal file like you suggested, but am seeing this error every time I try to build. (I simply added chatter to my build-depends, as was mentioned in this Reddit thread.)

It appears that the issue is the following:

    • Couldn't match type ‘Data.Text.Internal.Lazy.Text’ with ‘Text’
      Expected: Message -> Text
        Actual: Message -> Data.Text.Internal.Lazy.Text
      NB: ‘Text’ is defined in ‘Data.Text.Internal’
          ‘Data.Text.Internal.Lazy.Text’
            is defined in ‘Data.Text.Internal.Lazy’

Was this something that you observed when you test-built the library during your debugging?

What @jaror suggested was to add the source-repository-package lines to the cabal.project file, not to the unamericanize.cabal file. I’m surprised you didn’t get a Cabal error that way.

Grammatical Framework, with its Haskell runtime, seems interesting; maybe this is what you need:
Website: https://www.grammaticalframework.org/
Haskell runtime documentation: gf: Grammatical Framework

@blamario gosh what a silly mistake I made lol. Should’ve read the above comment more carefully. Yes that works now, thank you!

Thank you for the suggestion!