Guidance on how to parse English sentences

broadcastjoin · January 13, 2024, 11:29am

Hello, just for personal learning, I would like to build a tool that changes words from American to British spelling. I’m quite the beginner at Haskell, and it’s been over a decade since my computational linguistics module at university.

The main feature, or main challenge, of my tool is that it recognises the American spelling of words that are made up. For example, if we pretend that the verb tweet doesn’t exist, then one might label the act of posting on what was formerly Twitter as twitterising. I’d like my tool to be able to recognise twitterize and convert it to twitterise.

I reckon that I might first need to parse a sentence (or text) into its various parts of speech, and then select the (relevant) nouns and verbs for further processing.

Would this be the right place to start? Apart from parsing and natural language grammar, are there other concepts that I ought to consider in this project?

reuben · January 13, 2024, 12:25pm

Cool! Some advice:

Parsing natural language, especially without statistical methods, is hard, but for your project, it sounds like you only need to break the sentence up into words, and then try to identify relevant word endings.

You could think about writing a parser with the megaparsec library in order to do this. I think that would be where I would start. Basically, the parser would convert the sentence into a data structure that contains words, each of which was broken into their parts (morphemes) and then you’d pretty print this into the British (or US) form.

jaror · January 13, 2024, 12:40pm

I don’t think megaparsec will be of any use if you already (and rightly imo) determined that POS tagging is necessary to properly do this transformation.

A quick hackage search yields chatter which seems to include a decent POS tagging algorithm.

Edit: But chatter seems quite old (last upload in 2017), you’ll probably have to change it a bit to make it work on the latest GHC versions.

Edit: I just tried to build it and it actually works. Take that people who think Haskell is unstable!

Edit: Or at least the version from github works which had a commit 3 years ago that fixed version bounds. The hackage release does not work

broadcastjoin · January 13, 2024, 9:41pm

Hey! Thank you so much for looking into the package, and for trying to get it set up. Shame that the Hackage release isn’t working. At the risk of asking a silly question, will I probably have to write my own parser to do the part of speech tagging?

jaror · January 13, 2024, 9:45pm

You can clone the github repo and use that. Or if you’re using cabal you can create a cabal.project file and add:

packages: .

source-repository-package
    type: git
    location: https://github.com/noughtmare/chatter.git
    tag: fd029b19cce47d8862c6a2dcd36d4b8dcc5a5e2e

(That’s my fork where I fixed a tiny issue with a module being listed twice in the cabal file)

And then you should just be able to add the chatter dependency in your .cabal file and everything should just work.

Edit: My PR has now been merged so you can use the main repo:

source-repository-package
    type: git
    location: https://github.com/creswick/chatter.git
    tag: 1ec371491e6ff039ba4f33bd80e5d0dd5f29a74b

broadcastjoin · January 13, 2024, 9:52pm

Thanks once again @jaror! I think this should be enough for me to get started.

broadcastjoin · January 13, 2024, 9:54pm

Thanks for the advice @reuben! I’ll go with the chatter approach to start with, but in later iterations, and if I’m up for it, I wouldn’t mind trying to write a parser that does part of speech tagging.

reuben · January 14, 2024, 9:15am

You’re welcome! Just to clarify, I meant that the parser might be useful not so much for the POS tagging, but for the morphology, i.e. to identify and change word endings like “ise” on verbs. But depending on what chatter does, that might not be necessary.

broadcastjoin · February 7, 2024, 8:02am

Hello again, pardon the delayed response. I added the above into my Cabal file like you suggested, but am seeing this error every time I try to build. (I simply added chatter to my build-depends, as was mentioned in this Reddit thread.)

It appears that the issue is the following:

    • Couldn't match type ‘Data.Text.Internal.Lazy.Text’ with ‘Text’
      Expected: Message -> Text
        Actual: Message -> Data.Text.Internal.Lazy.Text
      NB: ‘Text’ is defined in ‘Data.Text.Internal’
          ‘Data.Text.Internal.Lazy.Text’
            is defined in ‘Data.Text.Internal.Lazy’

Was this something that you observed when you test-built the library during your debugging?

blamario · February 9, 2024, 2:13pm

What @jaror suggested was to add the source-repository-package lines to the cabal.project file, not to the unamericanize.cabal file. I’m surprised you didn’t get a Cabal error that way.

scholablade · February 9, 2024, 2:15pm

Grammatical framework with it’s Haskell runtime seems interesting, maybe this is what you need:
Website: https://www.grammaticalframework.org/
Haskell runtime documentation: gf: Grammatical Framework

broadcastjoin · February 10, 2024, 5:11am

@blamario gosh what a silly mistake I made lol. Should’ve read the above comment more carefully. Yes that works now, thank you!

broadcastjoin · February 10, 2024, 5:12am

Thank you for the suggestion!

Topic		Replies	Views
Semantic: an open-source Haskell library from GitHub for parsing, diffing, and analyzing syntax trees Links	0	497	June 1, 2019
Counting Words, but can we go faster? Show and Tell	30	3574	August 10, 2022
Monadic Parsing in Haskell Learn	28	1907	January 23, 2024
Deciding whether to create a new package or take over an abandoned one	25	2058	May 22, 2023
Usage sample or tutorial for language-javascript? Learn	2	300	October 11, 2022

Guidance on how to parse English sentences

Related topics