Lexical syntax of comments and layout

natefaubion · June 12, 2025, 6:58pm

In keeping with Wadler’s Law, an interesting subject came up in the PureScript Discord regarding delimited/block comment syntax and layout.

GHC sees both of these definitions as valid:

example1 =
  let
      foo = id{-
-}42
      bar = foo
  in bar
  
example2 =
  let
      foo = id 42
{-  -}bar = foo
  in bar

This is interesting because in example1, the block comment causes the parser to ignore the column of 42, parsing id 42 as the RHS of foo. In example2, there is no actual indentation, only the block comment. The parser only consults the visual column of bar.

In comparison, given this PureScript

example1 =
  let
      foo = identity {-
-}42
      bar = foo
  in bar
  
example2 =
  let
      foo = identity 42
{-  -}bar = foo
  in bar

Purs will fail on example1 because it consults the column of every token for the purposes of layout (layout is a transformation on the token stream only). It will accept example2 for the same reason as GHC.

Both parsers are problematic though if you think comments are insignificant. If by “insignificant” we mean that we should be able to strip comments from the source and retain an identical parse, then both parsers should fail to parse example2. If we remove the comments from example2, then it will fail to parse due to incorrect indentation, making comments significant in both compilers!

I would be interested in the consistency among alternative Haskell parsers that aren’t derived from GHC. This obviously doesn’t really affect anyone in practice, but it’s an interesting artifact of naive columnar indentation nonetheless. Maybe Python was right to only have line comments!

rhendric · June 12, 2025, 7:15pm

If comments are insignificant in the sense that you can remove them and get an identical parse, then id{- -}42 should be interpreted as id42, which neither compiler does. If comments are instead whitespace, then PureScript’s column-based analysis is correct, because that’s what you’d get by replacing every ‘visible’ comment character with spaces.
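To make the two readings concrete, here is a rough sketch that handles block comments only. The names deleteComments and blankComments are made up for illustration, and this is not how GHC or purs is implemented. It also assumes the input has no line comments, pragmas, or string literals containing comment delimiters.

```haskell
-- Two readings of "comments are insignificant", for {- -} comments only.
-- Hypothetical helpers for illustration; no compiler is claimed to work this way.

-- | Reading 1: delete block comments outright (nesting respected).
deleteComments :: String -> String
deleteComments = go 0
  where
    go :: Int -> String -> String
    go n ('{' : '-' : rest) = go (n + 1) rest
    go n ('-' : '}' : rest) | n > 0 = go (n - 1) rest
    go 0 (c : rest) = c : go 0 rest      -- outside any comment: keep
    go n (_ : rest) = go n rest          -- inside a comment: drop
    go _ [] = []

-- | Reading 2: comments are whitespace. Every comment character,
--   delimiters included, becomes a space; line breaks are kept.
blankComments :: String -> String
blankComments = go 0
  where
    go :: Int -> String -> String
    go n ('{' : '-' : rest) = "  " ++ go (n + 1) rest
    go n ('-' : '}' : rest) | n > 0 = "  " ++ go (n - 1) rest
    go 0 (c : rest) = c : go 0 rest      -- outside any comment: keep
    go n (c : rest)
      | c `elem` "\n\r\f" = c : go n rest -- keep line structure intact
      | otherwise         = ' ' : go n rest
    go _ [] = []
```

Under these assumptions, deleteComments turns id{- -}42 into id42, while blankComments turns it into id followed by five spaces and then 42, so every later character keeps its original column.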

Also, the Haskell 2010 report says that comments are whitespace:

whitespace → whitestuff {whitestuff}
whitestuff → whitechar | comment | ncomment

I think this is just a GHC bug, though possibly one that’d be too painful to fix.

natefaubion · June 12, 2025, 7:21pm

Yep, that’s a great point!

In truth, we should know that SPJ was right all along. Semicolons and braces it is!

ReleaseCandidate · June 12, 2025, 9:29pm

whitespace → whitestuff {whitestuff}
whitestuff → whitechar | comment | ncomment
whitechar → newline | vertab | space | tab | uniWhite

and 2.3 says that “Comments are valid whitespace.” and that “The comment itself is not lexically analysed.” I’d guess that means “any newlines in »nested comments« are ignored”.
As I’ve not found any further definition, neither “a comment is 1 whitespace” nor “a comment is its length in whitespace” is specified. So this is a case of undefined behaviour in Haskell.

What does

     foo = id 42 {-  
   -}bar = foo

do?

You wouldn’t need them for a “Haskell-like” C-style syntax (nor, I’d guess, a special status for newline characters).
Actually no, you’d need either newlines or semicolons because of currying and “function application is juxtaposition”.

natefaubion · June 12, 2025, 9:42pm

This fails to parse since it sees it as foo = id 42 bar = foo (unexpected =).

ReleaseCandidate · June 12, 2025, 9:45pm

That’s what I expected.

rhendric · June 12, 2025, 10:03pm

I would take that to mean that comments can’t contain lexemes, not that the plain meaning of phrases like ‘on the same line’ is to be suspended when comments are involved in the layout algorithm, which is specified to be downstream of, not part of, lexical analysis (it takes as input ‘a stream of lexemes’).

Section 10.3 details that, ‘Where the start of a lexeme is preceded only by white space on the same line, this lexeme is preceded by < n > where n is the indentation of the lexeme […].’ Comments are white space, so -}42 at the start of the line should mean the lexeme 42 preceded by < 3 > (columns count from 1, and -} fills the first two), provided that ‘on the same line’ retains its natural meaning.

Might be a little underspecified but I think there’s a stronger case that GHC is incorrect than correct.

ReleaseCandidate · June 12, 2025, 10:17pm

Let’s take a look at 10.3:

The application L tokens delivers a layout-insensitive translation of tokens, where tokens is the result of lexically analysing a module and adding column-number indicators to it as described above.

A “nested comment” is a lexeme, and there is no indication of adding “line-number indicators” or newline characters.

The rule is the same as for multi-line strings:

(NB: a string literal may span multiple lines – Section 2.6. So in the fragment

f = ("Hello \
        \Bill", "Jake")

There is no < n > inserted before the \Bill, because it is not the beginning of a complete lexeme; nor before the , because it is not preceded only by white space.)

rhendric · June 12, 2025, 10:30pm

Well, no. ‘Token’ here means either a lexeme or an indicator, and comments are neither. Lexemes and whitespace are mutually exclusive, per the lexical syntax.

We know that ‘token’, as an input to L, means one of those two things because the section tells us, ‘The input to L is: A stream of lexemes as specified by the lexical syntax in the Haskell report, with the following additional tokens [the indicators]’.

The column-number indicators are the {n} and < n > tokens added earlier in the section. A < n > token is added when a lexeme is preceded by white space on a line (and there isn’t already a {n} token preceding it). Newlines aren’t lexemes either. So how does this indicator token insertion work if it isn’t allowed to know some things about lines that lexical analysis skipped over?
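A simplified sketch of that insertion step, to show what information it needs. Lexeme, Token, and addIndicators are hypothetical names, the {n} rule is left out entirely, and the positions are assumed to come from whatever lexer ran earlier.

```haskell
-- Sketch of the <n> insertion from section 10.3, with the {n} rule left out.
-- Lexeme and Token are hypothetical types, not taken from any compiler.

data Lexeme = Lexeme
  { lexLine    :: Int     -- line where the lexeme starts (1-based)
  , lexCol     :: Int     -- column where the lexeme starts (1-based)
  , lexEndLine :: Int     -- line where it ends (differs for multi-line strings)
  , lexText    :: String
  } deriving (Eq, Show)

data Token
  = Indent Int            -- the <n> indicator
  | Lex String            -- an ordinary lexeme
  deriving (Eq, Show)

-- | Emit <n> before every lexeme that has no other lexeme before it
--   on the line where it starts.
addIndicators :: [Lexeme] -> [Token]
addIndicators = go 0
  where
    go :: Int -> [Lexeme] -> [Token]
    go _ [] = []
    go prevEnd (l : ls)
      | lexLine l > prevEnd = Indent (lexCol l) : rest
      | otherwise           = rest
      where
        rest = Lex (lexText l) : go (lexEndLine l) ls
```

Comments and newlines never appear in the input list. They only influence the result through the line and column recorded on the lexemes around them, which is the information the layout step cannot do without.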

ReleaseCandidate · June 12, 2025, 11:01pm

Yes, I phrased that badly. What I meant to say was that the lexing of newlines is part of the “lexical analysis”, as both lexemes and whitespace have to be lexed (they are part of the “lexical structure”). But if there is no lexical analysis inside nested comments, these newlines are to be ignored.

But on the other hand ncomment is defined as possibly containing newlines.

ncomment → opencom ANYseq {ncomment ANYseq} closecom
ANYseq → {ANY}⟨{ANY} ( opencom | closecom ) {ANY}⟩
ANY → graphic | whitechar

That’s the point: it can’t. And it actually doesn’t “correctly” compute the column.

This

The characters newline, return, linefeed, and formfeed, all start a new line.

is badly formulated, btw. It should say something like

The characters of “newline” (return, linefeed, and formfeed), all start a new line.

see

newline → return linefeed | return | linefeed | formfeed
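Read as code, that production could look something like the sketch below. The name splitReportLines is made up, and it only illustrates the grammar rule, not any compiler’s lexer.

```haskell
-- | Split text into lines using the report's newline production:
--   newline → return linefeed | return | linefeed | formfeed
--   A return followed by a linefeed counts as a single break.
splitReportLines :: String -> [String]
splitReportLines = go
  where
    go :: String -> [String]
    go s = case break isBreak s of
      (line, [])                 -> [line]
      (line, '\r' : '\n' : rest) -> line : go rest
      (line, _ : rest)           -> line : go rest

    isBreak :: Char -> Bool
    isBreak c = c `elem` "\r\n\f"
```

With this reading, a lone return, a lone linefeed, and a formfeed each start a new line, and a return/linefeed pair starts exactly one.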

evincar · June 13, 2025, 2:33pm

Agreed, I believe PureScript’s is the right way to handle this.

You follow the Unicode spec for keeping track of the line and visual column as you parse the source text, assign positions to each token, then handle layout based on those positions.

This should be exactly equivalent to iterating over the comment text, including the delimiters, and replacing each character with a whitespace character of the same advance width, or leaving it alone if it’s a line break — or a tab, if you want to handle tabs smartlier than GHC’s fixed tab stops.

Only the number of line breaks and the advance width of the text following the last break should be relevant to layout, I think? Unless we add rectangular comments lol
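A minimal sketch of that bookkeeping, assuming '\n' is the only line break and every other character is one column wide (so no tabs and no wide characters). Pos, advance, spanShape, and advanceByShape are hypothetical names.

```haskell
-- | A source position: 1-based line and column.
data Pos = Pos { posLine :: Int, posCol :: Int }
  deriving (Eq, Show)

-- | Advance a position over a span of text, for example a whole comment.
--   Assumes '\n' is the only break and all other characters have width 1.
advance :: Pos -> String -> Pos
advance = foldl step
  where
    step (Pos l _) '\n' = Pos (l + 1) 1
    step (Pos l c) _    = Pos l (c + 1)

-- | The two facts about a span that matter for layout:
--   (number of line breaks, width of the text after the last break).
spanShape :: String -> (Int, Int)
spanShape s =
  ( length (filter (== '\n') s)
  , length (takeWhile (/= '\n') (reverse s))
  )

-- | Same result as 'advance', computed from the shape alone.
advanceByShape :: Pos -> (Int, Int) -> Pos
advanceByShape (Pos l c) (breaks, width)
  | breaks == 0 = Pos l (c + width)
  | otherwise   = Pos (l + breaks) (1 + width)
```

For the {-  -} in example2, starting at column 1, both functions land on column 7, which is the column of foo on the line above.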
