Word Count Script Feedback


#1

I am planning on using this script for a word cloud.

It parses a text file, and spits out individual word counts.

Anybody have any tips / improvements?

Suspect there are a few bits and pieces which could be nicer.

module Lib where

import Data.Map.Strict
import Text.ParserCombinators.Parsec
import Data.List
import Data.Char

mn = do  
   content <- readFile "big.txt"
   case (parseTextFile content) of
     Right rs -> print $ sort' $ countWords $ Data.List.map (\s -> Data.List.map toLower s) $ concat rs
     e -> print e

countWords :: [String] -> Map String Int
countWords = Prelude.foldl (\b a -> insertWith (+) a 1 b) empty

sort' :: Map String Int -> [(String, Int)]
sort' ts = sortBy (\ (_, a) (_, b) -> compare a b) (toList ts)

textFile = endBy line eol
line = sepBy word (many1 space')
word = many $ noneOf ( eolString ++ ignoreString )
eol = many1 $ oneOf eolString
space' = oneOf ignoreString

eolString = "\n\r"
ignoreString = " \t.,\"'-_#:;()?*!|"

parseTextFile :: String -> Either ParseError [[String]]
parseTextFile input = parse textFile "(unknown)" input

#2

First, you should almost certainly use Data.Text's Text rather than String. You also may be interested in fromListWith from Data.Map.

You could write something like

countWords :: [Text] -> Map Text Int
countWords = fromListWith (+) . flip zip (repeat 1)
-- countWords ts = fromListWith (+) (zip ts (repeat 1))

As a side benefit, using this construction the result will be ordered by count, so you can skip sort'. Your sort comparison, by the way, can also be written comparing snd, rather than with a lambda.

Also, I’d highly recommend megaparsec over parsec, but that’s more preference than anything I can rigorously defend.


#3

String is my lazy default - I should switch default to Text…

Mind blown. Once I have time, i need to examine this closer!

Yeh, again Parsec is a lazy default. I had Megaparsec and Earley in mind - to experiment with. I Think I will do it in the next iteration.

Thanks for the tips!!


#4

Avoid foldl. It causes a space leak in your program: your Map is not actually constructed until you go through the entire list. Replacing it with foldl’ should fix the problem.

I don’t know if I’d reach for Parsec or any other combinator library here. If you insist, Attoparsec would be much faster for this job.

Speaking of speed, you already heard about Text, but if your input is UTF-8, and since your logic ignores only ASCII characters, ByteString would also work. Don’t do this if you intend to handle other encodings or to extend your algorithm to cover Unicode spacing and punctuation.

Finally, you can eliminate the intermediate list. Since Parsec is not a streaming parser, the list will be fully constructed in memory. You can avoid that by constructing the map directly while parsing.


#5

Thanks! I forgot that.

I agree, a quick words and filter out punctuation might do the job. But I also want easy ways to improve my Parser combinator knowledge.

Thanks for all the advice - it’s funny though, how Haskell is a honey trap for people like me looking for declarativity - but instead there’s a bunch of performance road bumps which need to be on one’s mind.

I suppose laziness giveth and taketh away - and perhaps over time, you don’t need to think too much about the performance details… Just some sort of cognitive dissonance going on…

Thanks again!


#6

I don’t see this - am I missing something simple?

I just removed sort’ changed the countWords function as above, and I end up with an unsorted list…


#7

Oh, you’re right. It sorts the list on the keys, not by count. My test data just happened to line up that way.