New blog post: Parsing Diff Output in Haskell -- Feedback appreciated

Hi everyone,

I just published a new article on my blog: Parsing Diff Output in Haskell. It describes a parser for diff output in normal, unified, and git formats using Attoparsec. You can find it at: Parsing Diff Output in Haskell | Pedro R. Borges

You may also check out my first article: A Gentle Introduction to optparse-applicative, at A Gentle Introduction to optparse-applicative | Pedro R. Borges

I would appreciate any feedback.

Regards,

Pedro R. Borges


Cool!

I cannot say anything about attoparsec… Last time I needed to parse something (I think it was PDF metadata) I wielded Earley — you may want to check it out. Since you have already written the grammar for the format you wish to parse (in your other blog post), it should be trivial to run Earley with this grammar. Please let me know if you try!

As for optparse-applicative, I have some notes. I often need to write a small program that accepts a handful of arguments and does something boring for me (like maybe modifying the metadata of PDF files). Since I do it often, there has grown in me some kind of a style guide. I am now going to apply that guide to your example.

These notes notwithstanding, I shall recommend your guide whenever I need to explain optparse-applicative to someone. Maybe you could send it to the maintainers of optparse-applicative and see if they want to link to it from Hackage?

I think the <$> operator is harmful.

What you want to say is that there is a list. Bullet notation is good and appropriate for this task. But there is nothing privileged about the first element of this list — it does not by itself deserve a special bullet. You write <$> because you think you have to — but you do not have to. You can write pure instead.

So, for example, this code:

optionsParser =
    AppOptions
        <$> wAnswersParser
        <*> mDigitsParser
        <*> sepParser
        <*> inGroupParser
        <*> outputParser
        <*> inputParser

becomes:

optionsParser =
    pure AppOptions
        <*> wAnswersParser
        <*> mDigitsParser
        <*> sepParser
        <*> inGroupParser
        <*> outputParser
        <*> inputParser
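The two spellings are interchangeable because, for any lawful Applicative, `f <$> x` equals `pure f <*> x`. Here is a minimal sketch of that equivalence using only base and `Maybe` in place of the option parser. The `Point` type and the names `withFmap` and `withPure` are made up for illustration and do not come from the article.

```haskell
-- Hypothetical two-field record standing in for AppOptions.
data Point = Point Int Int
  deriving (Eq, Show)

-- First argument introduced with <$>.
withFmap :: Maybe Point
withFmap = Point <$> Just 1 <*> Just 2

-- Same computation, every argument introduced with <*>.
withPure :: Maybe Point
withPure = pure Point <*> Just 1 <*> Just 2
```

Both definitions evaluate to the same value, so the choice between them is purely one of layout.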

I think do notation is more appropriate, because it allows you to reorder your parsers.

Why would you want to reorder your parsers? Because the order in which options appear in the help message that optparse-applicative derives from a parser depends on the order in which the parsers are composed. With the «bullet list» notation, you are forced to keep the parsers in the same order as the fields appear in the definition of the data type being parsed. So if you later wish to change the order in which options appear in the help message, you have to edit two things: the definition of the parser and the definition of the data type. Do notation avoids this coupling.

Our example becomes:

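-- Needs {-# LANGUAGE ApplicativeDo, RecordWildCards #-}:
-- Parser is not a Monad, and {..} fills the record from the bound names.
optionsParser = do
    withAnswers <- wAnswersParser
    maxDigits <- mDigitsParser
    separation <- sepParser
    probsInGroup <- inGroupParser
    output <- outputParser
    inputFile <- inputParser
    pure AppOptions {..}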

This is also helpful when you wish to define your parsers inline, because you can put any operator in there without fearing that their precedence will be weaker than or equal to that of <*>, you can freely write nested do blocks, and so on. I find that this helps when writing a bigger parser.
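To make the reordering point concrete, here is a small base-only sketch. It assumes a toy applicative, `Described`, that merely records the order in which its pieces are combined, much as the help text follows the order of the parsers. `Described`, `field`, `Options`, and `optionsReordered` are hypothetical names, not part of optparse-applicative.

```haskell
{-# LANGUAGE ApplicativeDo #-}
{-# LANGUAGE RecordWildCards #-}

-- Hypothetical record; withAnswers is declared first.
data Options = Options
  { withAnswers :: Bool
  , maxDigits :: Int
  }
  deriving (Eq, Show)

-- Toy applicative: a value plus the names of the pieces that built it,
-- in the order they were combined. Deliberately not a Monad.
data Described a = Described [String] a

instance Functor Described where
  fmap f (Described names a) = Described names (f a)

instance Applicative Described where
  pure = Described []
  Described ns f <*> Described ms a = Described (ns ++ ms) (f a)

-- A single named piece with a fixed value.
field :: String -> a -> Described a
field name value = Described [name] value

-- maxDigits is listed first here even though the record declares it second.
optionsReordered :: Described Options
optionsReordered = do
  maxDigits <- field "max-digits" 4
  withAnswers <- field "with-answers" False
  pure Options {..}
```

The recorded names come out in the order written in the do block, while the record is still filled in by field name, so the data type declaration does not need to change.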

This last block of text revealed that you have a mismatch between the names of the fields and the names of the parsers.

Say I want to parse withAnswers. How shall I know that the parser I need is called wAnswersParser and not, say, withAParser? How shall I know that mDigitsParser parses maxDigits and not minDigits?

You will help me immensely if you free my memory from this needless burden by following a simple rule: the parser for field x is called xParser.

Our example becomes:

optionsParser = do
    withAnswers <- withAnswersParser
    maxDigits <- maxDigitsParser
    separation <- separationParser
    probsInGroup <- probsInGroupParser
    output <- outputParser
    inputFile <- inputParser
    pure AppOptions {..}

Even better, inline the parsers for the fields into optionsParser.

If something is only needed once, there is no harm in inlining it.

optionsParser = do
    withAnswers <- switch $
        short 'a'
            <> long "with-answers"
            <> help "Generate problem answers."
    maxDigits <- option (boundedInt 1 15) $
        short 'd'
            <> long "max-digits"
            <> metavar "Int"
            <> help "Maximum number of digits allowed in a term. Must be between 1 and 15."
            <> value 4
    -- …and so on…
    pure AppOptions {..}

This also forces a correspondence of ordering between what you write and what the person reading the help message will see. If you see withAnswers in your code first, that person will also see the help for withAnswers first, and so on.
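The `boundedInt` reader used above comes from the article and is not shown in this thread. As a rough idea of the validation such a reader might perform, here is a base-only sketch. The function name `parseBoundedInt` and its error messages are assumptions, not the article's code. In optparse-applicative, a function of this shape could be turned into a reader with `eitherReader`.

```haskell
import Text.Read (readMaybe)

-- Hypothetical helper: accept a string only if it reads as an Int
-- within the inclusive range [lo, hi].
parseBoundedInt :: Int -> Int -> String -> Either String Int
parseBoundedInt lo hi s =
  case readMaybe s of
    Just n
      | n >= lo && n <= hi -> Right n
      | otherwise ->
          Left ("must be between " ++ show lo ++ " and " ++ show hi)
    Nothing -> Left ("not an integer: " ++ show s)
```

Keeping the check as a plain function makes it easy to test without running the command-line parser.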


Thanks for taking the time to read the article and give me feedback.

I didn’t know about Earley. I’ll have a look at it.

These notes notwithstanding, I shall recommend your guide whenever I need to explain optparse-applicative to someone.

Great!

Maybe you could send it to the maintainers of optparse-applicative and see if they want to link to it from Hackage?

I’ll do that.

I think the <$> operator is harmful.

You made me think here. Considering <$> harmful seems too strong to me.

However, I am leaning towards preferring pure and <*> instead of <$> as you propose. This keeps the code more pure-ly applicative, and with one less operator to know.

I wanted to keep changes to my article short and/or easy, so I just included a note presenting the pure-<*> alternative. BTW, did you know that HLS (on VS Code) suggests using <$> instead of pure-<*>?

I think do notation is more appropriate, because it allows you to reorder your parsers.

I agree, at least for more than 2 or 3 options. This would be a relatively simple change in the article but, in addition, it would require describing and justifying the ApplicativeDo extension.

You have a mismatch between the names of the fields and the names of the parsers.

You are absolutely right here. I think I was just lazy typing, but I have no excuse. I search-and-replaced these names in the article. I also added an update note at the end of the article.

Even better, inline the parsers for the fields into optionsParser.

I had them inline when I first wrote the app, but I changed them while writing the article to favor the presentation.

I’d probably leave them inline in production code. There is another reason to have them named, although a very personal one: as a blind user, I rely heavily on the editor and screen reader facilities to navigate and find my way in the code, and this is easy for top-level functions.

Thanks again for your suggestions!

Yes, as I understand it, Haskell Language Server has HLint wired in, and HLint has strong opinions that I do not always agree with. You can tweak it by writing your wishes into a file named .hlint.yaml. For example, I often write:

- warn: {name: Redundant bracket due to operator fixities}
- ignore: {name: "Avoid lambda"}
- ignore: {name: "Avoid lambda using `infix`"}
- ignore: {name: "Redundant lambda"}
- ignore: {name: "Replace case with maybe"}
- ignore: {name: "Use <$"}
- ignore: {name: "Use <$>"}
- ignore: {name: "Use $>"}