Hello, just for personal learning, I would like to build a tool that changes words from American to British spelling. I’m quite the beginner at Haskell, and it’s been over a decade since my computational linguistics module at university.
The main feature (and main challenge) of my tool is that it recognises the American spellings of made-up words. For example, if we pretend that the verb tweet doesn’t exist, then one might label the act of posting on what was formerly Twitter as twitterising. I’d like my tool to be able to recognise twitterize and convert it to twitterise.
I reckon that I might first need to parse a sentence (or text) into its various parts of speech, and then select the (relevant) nouns and verbs for further processing.
Would this be the right place to start? Apart from parsing and natural language grammar, are there other concepts that I ought to consider in this project?
Parsing natural language, especially without statistical methods, is hard, but for your project it sounds like you only need to break the sentence up into words and then identify the relevant word endings.
You could think about writing a parser with the megaparsec library to do this; that’s where I would start. Basically, the parser would convert the sentence into a data structure containing words, each broken into its parts (morphemes), and then you’d pretty-print that structure into the British (or US) form.
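Before reaching for megaparsec, here’s a rough base-only sketch of that idea (the suffix table and function names are just my guesses at the common cases, not a real morphology):

```haskell
import Data.List (isSuffixOf)

-- Rewrite an American "-ize"-family ending to its British form,
-- e.g. "twitterize" -> "twitterise". The suffix table here only
-- covers a few common cases.
briticise :: String -> String
briticise w
  | "ization" `isSuffixOf` w = take (length w - 7) w ++ "isation"
  | "izing"   `isSuffixOf` w = take (length w - 5) w ++ "ising"
  | "ized"    `isSuffixOf` w = take (length w - 4) w ++ "ised"
  | "ize"     `isSuffixOf` w = take (length w - 3) w ++ "ise"
  | otherwise                = w

-- Naive whole-sentence version: split on whitespace, rewrite each
-- word, glue it back together. Punctuation handling and false
-- positives ("size", "prize", "seize") are exactly where the POS
-- and morphology work would come in.
briticiseText :: String -> String
briticiseText = unwords . map briticise . words
```

For example, `briticiseText "please twitterize this"` gives `"please twitterise this"` — but note it would also mangle "size", which is why you’d want real word analysis on top.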
I don’t think megaparsec will be of much use if, as you’ve already (and rightly, imo) determined, POS tagging is necessary to do this transformation properly.
A quick hackage search yields chatter which seems to include a decent POS tagging algorithm.
Edit: But chatter seems quite old (its last upload was in 2017), so you’ll probably have to change it a bit to make it work on recent GHC versions.
Edit: I just tried to build it and it actually works. Take that, people who think Haskell is unstable!
Edit: Or at least the version from GitHub works; it had a commit 3 years ago that fixed the version bounds. The Hackage release does not work.
Hey! Thank you so much for looking into the package, and for trying to get it set up. Shame that the Hackage release isn’t working. At the risk of asking a silly question, will I probably have to write my own parser to do the part-of-speech tagging?
Thanks for the advice @reuben! I’ll go with the chatter approach to start with, but in later iterations, and if I’m up for it, I wouldn’t mind trying to write a parser that does part-of-speech tagging.
You’re welcome! Just to clarify: I meant that the parser might be useful not so much for the POS tagging as for the morphology, i.e. for identifying and changing word endings like “ise” on verbs. But depending on what chatter does, that might not be necessary.
Hello again, pardon the delayed response. I added the above into my Cabal file like you suggested, but am seeing this error every time I try to build. (I simply added chatter to my build-depends, as was mentioned in this Reddit thread.)
It appears that the issue is the following:
• Couldn't match type ‘Data.Text.Internal.Lazy.Text’ with ‘Text’
  Expected: Message -> Text
    Actual: Message -> Data.Text.Internal.Lazy.Text
  NB: ‘Text’ is defined in ‘Data.Text.Internal’
      ‘Data.Text.Internal.Lazy.Text’ is defined in ‘Data.Text.Internal.Lazy’
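If I’m reading this right, the library is handing back lazy Text while my signature expects strict Text, so I’m guessing a conversion at the boundary would fix it? Something like this (the wrapper name is made up):

```haskell
import qualified Data.Text as T        -- strict Text
import qualified Data.Text.Lazy as TL  -- lazy Text

-- Hypothetical wrapper: if a library function returns lazy Text,
-- convert it at the boundary so the rest of the code stays strict.
renderStrict :: TL.Text -> T.Text
renderStrict = TL.toStrict
-- (TL.fromStrict goes the other way.)
```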
Was this something that you observed when you test-built the library during your debugging?
What @jaror suggested was to add the source-repository-package lines to the cabal.project file, not to the unamericanize.cabal file. I’m surprised you didn’t get a Cabal error that way.
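i.e. a cabal.project sitting next to the .cabal file, along these lines (the repository URL should be the one from jaror’s comment; I’ve left it as a placeholder):

```cabal
-- cabal.project, next to unamericanize.cabal
packages: .

source-repository-package
  type: git
  location: <git URL of the chatter fork from jaror's comment>
```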