In keeping with Wadler’s Law, an interesting subject came up in the PureScript Discord regarding delimited/block comment syntax and layout.
GHC sees both of these definitions as valid:
example1 =
  let
    foo = id{-
-}42
    bar = foo
  in bar

example2 =
  let
     foo = id 42
{- -}bar = foo
  in bar
This is interesting because in example1, the block comment causes the parser to ignore the column of 42, parsing id 42 as the RHS of foo. In example2, there is no actual indentation, only the block comment. The parser only consults the visual column of bar.
In comparison, given this PureScript:
example1 =
  let
    foo = identity {-
-}42
    bar = foo
  in bar

example2 =
  let
     foo = identity 42
{- -}bar = foo
  in bar
Purs will fail on example1 because it consults the column of every token for the purposes of layout (layout is a transformation on the token stream only). It will accept example2 for the same reason as GHC.
Both parsers are problematic though if you think comments are insignificant. If by “insignificant” we mean that we should be able to strip comments from the source and retain an identical parse, then both parsers should fail to parse example2. If we remove the comments from example2, then it will fail to parse due to incorrect indentation, making comments significant in both compilers!
I would be interested in the consistency among alternative Haskell parsers that aren’t derived from GHC. This obviously doesn’t really affect anyone in practice, but it is an interesting artifact of naive columnar indentation nonetheless. Maybe Python was right to only have line comments!
If comments are insignificant in the sense that you can remove them and get an identical parse, then id{- -}42 should be interpreted as id42, which neither compiler does. If comments are instead whitespace, then PureScript’s column-based analysis is correct, because that’s what you’d get by replacing every ‘visible’ comment character with a space.
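To make the two readings concrete, here is a minimal sketch of both. The names stripComments and blankComments are hypothetical, and the code only understands nested {- -} comments: it assumes the input has no string literals, no -- line comments, and no tabs.

```haskell
-- Hypothetical helpers, not taken from either compiler.
-- Assumption: only nested {- -} comments occur in the input.

-- | "Insignificant" reading: delete comment text entirely.
stripComments :: String -> String
stripComments = go 0
  where
    go :: Int -> String -> String
    go n ('{':'-':rest) = go (n + 1) rest
    go n ('-':'}':rest) | n > 0 = go (n - 1) rest
    go n (c:rest)
      | n > 0     = go n rest          -- inside a comment: drop the character
      | otherwise = c : go n rest      -- outside: keep it
    go _ [] = []

-- | "Whitespace" reading: replace every comment character, including the
-- delimiters, with one space, but keep line breaks where they are.
blankComments :: String -> String
blankComments = go 0
  where
    go :: Int -> String -> String
    go n ('{':'-':rest) = "  " ++ go (n + 1) rest
    go n ('-':'}':rest) | n > 0 = "  " ++ go (n - 1) rest
    go n (c:rest)
      | n > 0 && c /= '\n' = ' ' : go n rest
      | otherwise          = c : go n rest
    go _ [] = []
```

Under these assumptions, stripComments turns id{- -}42 into id42, while blankComments keeps id and 42 apart and leaves every later character in its original column.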
Also, the Haskell 2010 report says that comments are whitespace: section 2.3 says that “Comments are valid whitespace.” and that “The comment itself is not lexically analysed.” I’d guess that means “any newlines in »nested comments« are ignored”.
As I’ve not found any further definition, neither “a comment is 1 whitespace” nor “a comment is its length in whitespace” is specified. So this is a case of undefined behaviour in Haskell.
What does
foo = id 42 {-
-}bar = foo
do?
You wouldn’t need them for a “Haskell-like” C-style syntax (nor, I’d guess, a special status for newline characters).
Actually no, you’d need either newlines or semicolons because of currying and “function application is juxtaposition”.
I would take that to mean that comments can’t contain lexemes, not that the plain meaning of phrases like ‘on the same line’ is to be suspended when comments are involved in the layout algorithm, which is specified to be downstream of, not part of, lexical analysis (it takes as input ‘a stream of lexemes’).
Section 10.3 details that, ‘Where the start of a lexeme is preceded only by white space on the same line, this lexeme is preceded by < n > where n is the indentation of the lexeme […].’ Comments are white space, so -}42 at the start of the line should mean the lexeme 42 preceded by < 3 > (the report numbers columns from 1, and 42 starts in the third column), provided that ‘on the same line’ retains its natural meaning.
Might be a little underspecified but I think there’s a stronger case that GHC is incorrect than correct.
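As a rough illustration of the two readings, the sketch below computes where a < n > indicator would go. Everything in it is an assumption made for the example: the Policy type and the indicators function are hypothetical, columns start at 1, only nested comments and plain spaces are treated as white space, and the {n} indicators are ignored. CommentBreakIgnored merely models the behaviour this thread attributes to GHC; it is not GHC’s lexer.

```haskell
-- | How a line break inside a nested comment is treated (hypothetical names).
data Policy
  = CommentBreakStartsLine  -- a break inside a comment still starts a new line
  | CommentBreakIgnored     -- only breaks outside comments start a new line
  deriving (Eq, Show)

-- | (line, column) of each lexeme start that is preceded only by white space
-- (spaces or comment text) on its line, under the chosen policy.
indicators :: Policy -> String -> [(Int, Int)]
indicators policy = go 1 1 0 False
  where
    -- arguments: line, column, comment depth, "lexeme already seen on this line"
    go :: Int -> Int -> Int -> Bool -> String -> [(Int, Int)]
    go _ _ _ _ [] = []
    go l _ d s ('\n':rest) = go (l + 1) 1 d s' rest
      where s' = d > 0 && policy == CommentBreakIgnored && s
    go l c d s ('{':'-':rest) = go l (c + 2) (d + 1) s rest
    go l c d s ('-':'}':rest) | d > 0 = go l (c + 2) (d - 1) s rest
    go l c d s (ch:rest)
      | d > 0 || ch == ' ' || s = go l (c + 1) d s rest
      | otherwise               = (l, c) : go l (c + 1) d True rest
```

For the three binding lines of example1, with foo and bar in column 5, CommentBreakStartsLine reports an indicator for 42 in column 3, which is to the left of the enclosing block, whereas CommentBreakIgnored reports none for that line.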
The application L tokens delivers a layout-insensitive translation of tokens, where tokens is the result of lexically analysing a module and adding column-number indicators to it as described above.
A “nested comment” is a lexeme, and there is no indication of adding “line-number indicators” or newline characters.
The rule is the same as for multi-line strings:
(NB: a string literal may span multiple lines – Section 2.6. So in the fragment
  f = ("Hello \
          \Bill", "Jake")
There is no < n > inserted before the \Bill, because it is not the beginning of a complete lexeme; nor before the , because it is not preceded only by white space.)
Well, no. ‘Token’ here means either a lexeme or an indicator, and comments are neither. Lexemes and whitespace are mutually exclusive, per the lexical syntax.
We know that ‘token’, as an input to L, means one of those two things because the section tells us, ‘The input to L is: A stream of lexemes as specified by the lexical syntax in the Haskell report, with the following additional tokens [the indicators]’.
The column-number indicators are the {n} and < n > tokens added earlier in the section. A < n > token is added when a lexeme is preceded by white space on a line (and there isn’t already a {n} token preceding it). Newlines aren’t lexemes either. So how does this indicator token insertion work if it isn’t allowed to know some things about lines that lexical analysis skipped over?
Yes, I phrased that badly. What I meant to say was that the lexing of newlines is part of the “lexical analysis”, as both lexemes and whitespace have to be lexed (they are part of the “lexical structure”). But if there is no lexical analysis inside nested comments, these newlines are to be ignored.
But on the other hand ncomment is defined as possibly containing newlines.
ncomment → opencom ANYseq {ncomment ANYseq} closecom
ANYseq → {ANY}⟨{ANY} ( opencom | closecom ) {ANY}⟩
ANY → graphic | whitechar
That’s the point: it can’t. And it actually doesn’t “correctly” compute the column.
This
The characters newline, return, linefeed, and formfeed, all start a new line.
is badly formulated, btw. It should say something like
The characters of “newline” (return, linefeed, and formfeed), all start a new line.
Agreed, I believe PureScript’s is the right way to handle this.
You follow the Unicode spec for keeping track of the line and visual column as you parse the source text, assign positions to each token, then handle layout based on those positions.
This should be exactly equivalent to iterating over the comment text, including the delimiters, and replacing each character with a whitespace character of the same advance width, or leaving it alone if it’s a line break (or a tab, if you want to handle tabs more cleverly than GHC’s fixed tab stops).
Only the number of line breaks and the advance width of the text following the last break should be relevant to layout, I think? Unless we add rectangular comments lol
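A small sketch of that last claim, under loud assumptions: every character has an advance width of 1, there are no tabs, '\n' is the only line break, and columns start at 1. The names advance and advanceSummary are made up for the example.

```haskell
-- | Position just past a piece of text that starts at (line, column),
-- found by walking over every character.
advance :: (Int, Int) -> String -> (Int, Int)
advance = foldl step
  where
    step (l, _) '\n' = (l + 1, 1)
    step (l, c) _    = (l, c + 1)

-- | The same position computed only from the number of line breaks and the
-- width of the text after the last break.
advanceSummary :: (Int, Int) -> String -> (Int, Int)
advanceSummary (l, c) txt
  | breaks == 0 = (l, c + length txt)
  | otherwise   = (l + breaks, 1 + length lastLine)
  where
    breaks   = length (filter (== '\n') txt)
    lastLine = reverse (takeWhile (/= '\n') (reverse txt))
```

With these assumptions the two functions agree on any comment text, so a lexer that tracks positions this way never needs to look at what the comment contains, only at how far it moves the cursor.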