Survey for users of the tree-sitter Haskell grammar

I have just finished rewriting large parts of the tree-sitter grammar to deal with some bugs and want to use this opportunity to get some user feedback about the structure of the produced AST.

Since I’m not a direct consumer of the grammar, i.e. I’m not maintaining any tools that use the tree-sitter API, but only use it rather superficially for benefits like specially highlighting constructors and class names in neovim, I don’t have the necessary insights into the requirements for a good node structure, so I’m essentially guessing what would be sensible for many constructs.

If there’s anyone here who has experience and strong opinions about this, I would be very grateful for some advice.

To make the conversation a bit simpler, here’s a list of questions that I’ve noted down along the way, though it’s far from complete:

  • Is there a point to having dedicated nodes for parens, like type_parens?
    Should parens be part of operator for prefix notation, or should the have a separate node, or nothing?

  • Is it useful to have nodes wrapping variables and constructors, like exp_name / pat_name / type_name, that distinguish the namespaces from each other?

  • Is (qualified_operator (module) (constructor_operator)) better than:

    • (qualified_constructor_operator (module) (constructor_operator))
    • (qualified (module) (constructor_operator))
  • Should qualifying modules look like:

    • (qualified_variable (module (module_segment) (module_segment)) (variable))
    • (qualified_variable (module) (module) (variable))
  • Should there be nodes for empty layouts, like (alts) for an empty case?

  • Should a comment right after where be inside of the declarations node?

  • I took great care to get these different binding variants to parse correctly:

    • fun = exp(function (variable) (exp_...))

    • fun a b = exp(function (variable) (patterns (pat_name (variable)) (pat_name (variable))) (exp_...))

    • A a = exp(bind (pat_apply (pat_name (constructor)) (pat_name variable))) (exp_...)

    • a :: A = exp(bind (pat_annotated (pat_name) (type))) (exp)

    Is there much value to this?
    In particular, if the first form were parsed like the third one, the grammar would be a bit less complex, but we’d get a pattern for the function name: (bind (pat_name (variable)) (exp_...))
    Similarly, signature could be merged with pat_annotated, but it would also introduce a pat_name in signature.

  • Infix function declarations like a <> b = exp are represented as (function (infix ...)), because they can have
    additional parameters, like (a <> b) c d = exp, and this structure makes parsing easier.
    Is it a problem that infix is not specific?
    Should it be (function (function_infix ...)) or something?

  • where doesn’t have a node wrapping the binds, because it’s always in the same place, unlike a let.
    But a let only contains binds as well, so it might as well skip that node.
    But then you couldn’t select all local bindings.

  • Should linear arrows (especially the operator versions like ->.) share the node name type_fun with regular functions?
    Is it enough to match on the arrow or should it be called type_linear_fun, and should the modifier version also be in there?

  • Should it be (data_instance (newtype)) or (newtype_instance)?

  • Is it useful not to use type_name for the head tycon of a prefix constructor declaration?

  • Is it ok to have type_variable without nested variable?
    For expressions and patterns, the structure is ({exp,pat}_name (variable)), while types use (type_name (type_variable)).

  • For type families, I’ve tried an approach with very general terms that would require more context nodes in queries, but would be less verbose.
    Which is better?

  • Should all layouts have a container node, like declarations or binds?
    E.g. type family equations have no container.
    Should binds be disambiguated from type variable binds?
    Should that term be binders or bindings?

13 Likes

Thanks for working on this! I don’t have specific answers your questions, I just wanted to point out that the usecase I’m most excited about is structured editing, i.e. things like combobulate, which hopefully could be as useful as structured-haskell-mode once was.

2 Likes

Hi again Torsten, it sounds like this has a chance of solving #115, which would be great.

I don’t really understand what constructor_operator is, and I understand even less what qualified_constructor_operator is. What is a “constructor” in tree-sitter terminology? I’ve tried searching but not much is coming up.

It’s confusing when some words pertain to tree-sitter, and others pertain to Haskell. But if we could clarify the terminology, it suspect it’d help in answering these questions.

1 Like

hi @janus , your issue should indeed be solved once this is merged!

The node names should all be Haskell terminology, so constructor_operator just denotes a symbolic constructor, like data Foo a b = a :% b.

I don’t know about all the treesitter internals so I can hardly give any feedback on that but still I want to thank you for giving the Haskell grammar a bit of care, it desperately needs it; the coverage is really bad especially with things like \case , data family and some infix constructors involved, I’ll just be happy if they’re gonna be fixed at some point or the corruption that they impose on the AST becomes minimal (i.e. I don’t have to move broken definitions to the bottom of the file such that highlighing won’t be broken for the rest of the file)

I’m fairly confident that this update will improve coverage substantially, but if you have any concrete cases I can test that might not be present in my corpus, I’d be grateful!

Maybe I should set up something to parse the entirety of Hackage. Using property test generators and comparing with GHC ASTs would be optimal, but that would require another big chunk of time.

1 Like

Beware it would also break with each version of GHC, since the AST is not stable across versions! (Yet)

yeah I can supply some things that dont work well later.

First of all, thanks for all the great work you are putting into this!

I maintain Haskell queries for a few Neovim plugins (nvim-treesitter[-textobjects],
neotest-haskell, iswap.nvim, vim-matchup, …), so here’s may take based on
my experience with those :smiley:
(I don’t have an opinion on all of the questions, so I left out some answers).

Is there a point to having dedicated nodes for parens, like type_parens?

I haven’t written any queries that needed to differentiate between different types
of parens. But I could imagine the ability to differentiate between pat_parens and exp_parens
being useful in some cases.

Should parens be part of operator for prefix notation, or should the have a separate node, or nothing?

I vaguely remember having an issue with parens being part of an infix operator declaration,
something like:

(|\?|) :: a -> a -> Maybe a

I can’t remember exactly what it was, but looking at the current AST, it feels a bit
strange - as if the parens are “hidden”, if you know what I mean.

The current AST:

(signature
  name: (operator)
  type: (fun ; ...

Is it useful to have nodes wrapping variables and constructors

I had previously told you that I don’t think they are useful.
But it turns out I have quite a few queries that use them in neotest-haskell.
So maybe I was wrong? But I’m still kind of convinced it wouldn’t be too difficult
to rewrite those queries without them. And they do add a bit of boilerplate.

Is (qualified_operator (module) (constructor_operator)) better…

I’m not sure I fully understand this. Do you have a Haskell example for it?

Should qualifying modules look like…

I like the idea of being able to capture module_segments.

Should there be nodes for empty layouts

I haven’t used them in any queries yet.

Should a comment right after where be inside of the declarations node?

I think this would be beneficial.
A comment being outside the declarations node could be a foot gun
when using anchors, e.g. ((where) . (decls)).

I took great care to get these different binding variants to parse correctly […] Is there much value to this?

Yes! I think these can be very useful for refined highlighting.

Is it a problem that infix is not specific

I don’t think so? We can use the operator to determine whether
it’s a function or not.
This proposal sounds like it could lead to false positives (e.g.
(x + y) is not a function, and there might be some custom operator
like |+|, for which it would be impossible for the parser to know
if (x |+| y) is a function or not.

But a let only contains binds as well, so it might as well skip that node.
But then you couldn’t select all local bindings.

Wouldn’t this be possible using anchors?

Is it useful not to use type_name for the head tycon of a prefix constructor declaration?

I think it just adds boilerplate, so yes.

Is it ok to have type_variable without nested variable?

I haven’t had any issues with this.

1 Like
  • Should a comment right after where be inside of the declarations node?

Only if it matches the indentation of the where block. If that’s too difficult, then in my experience “no” is probably a better default, since comments tend to go before their corresponding definitions.

1 Like

This is probably as difficult to do as part of parsing as it is to do as part of querying, but it would be wildly useful if declarations for the same function were grouped.

I think that unfortunately this will be fairly important in terms of Cursorless usage.

Thanks for taking the time to respond, Marc!

Ah. I assume that the motivation generalizes to any pat_* and exp_* node?

Though I don’t have any insight into how such a case arises in practice.
Would that be some less specific query that collects a bunch of things and then you process them differently based on whether they’re pat_parens or exp_parens?
Or is it rather a query like “give me all opening parens in expressions”?

The former case seems to me like it would boil down to substring matching on a more flexible representation in your own code, i.e. that you could pick all nodes that start with exp_ and then look for parens within them.
Would be nice if tree-sitter allowed grammars to attach more metadata to nodes than just field names, like a “category” that would be “expression”, “pattern” or “type”, preferably an arbitrary set.
This is particularly annoying for Haskell because expressions and patterns are almost entirely identical (GHC’s parser doesn’t even distinguish between them until post-processing).

Anyway, my motivation for this question was more related to the second part – should there be *_parens nodes at all?
Right now the AST for f (x) is

exp_apply
  exp_name
    variable "f"
  exp_parens
    "("
    exp_name
      variable "x"
    ")"

and I was wondering whether it should be

exp_apply
  exp_name
    variable "f"
  "("
  exp_name
    variable "x"
  ")"

or

exp_apply
  exp_name
    variable "f"
  exp_name
    "("
    variable "x"
    ")"

Not sure how difficult it would be to implement the second variant, but I assume that at least the first variant would create problems for querying.

This ties in with your response to the second part of the question:

I would interpret “hidden” as “unclear whether they are inside the operator or on the same level”.
Here it helps to print the tree like above, using the CLI in the repo (nix run .#show -- '(+) :: a' or
cargo r --bin=show -- 'a = f (x)').
So there are three possibilities:

The current one, where the prefix operator is an anonymous parens node:

signature
  "("
  operator "+"
  ")"
  "::"
  type_name
    type_variable "a"

A variant like the second one in the previous example, though without an extra named node:

operator
  "("
  "+"
  ")"

Bit awkward, though possible to conditionally query, I guess?
Adding another node like exp_parens makes it more uniform in some sense:

prefix_operator
  "("
  operator "+"
  ")"

However, if your goal is to query the name of the declared function, the current variant would allow (signature [(operator) (variable)]), right?
Does that have any value over the alternatives?

I imagine that it is an inconvenience that for operators, the first child is the paren while for varids it is the variable – you can’t use the first-child anchor easily then?

Overall I’m leaning towards prefix_operator, and applying this principle in other rules.

But I don’t know what a good name would be for this type of monstrosity:

((f a) b) c = x

Right now, it parses as:

function
  "("
  "("
  variable "f"
  patterns
    pat_name
      variable "a"
  ")"
  patterns
    pat_name
      variable "b"
  ")"
  patterns
    pat_name
      variable "c"
  "="
  exp_name
    variable "x"

What do you think?

I’m going to stop here and continue later, so this won’t result in a 10 page answer after a few days :sweat_smile:

I definitely want to try using comments’ indent to influence their placement, but I’m quite certain that it’s going to be a substantial effort, so I’ll probably attempt it after the PR has landed (it’s not really a breaking change anyway).
But it’s good to know that you would benefit from it, thanks for your input!

It seems very unrealistic at first glance, but would be an interesting challenge!
I could give hints to the scanner before and after function names to store them in the state, then compare them at the next decl and emit a different symbol that marks a decl group.
It wouldn’t work for infix function decls though, and would conflict with pattern bindings in a way that might be hard to resolve.
But worth thinking about!

Overall it seems to me that cursorless would really need to use the GHC parser, or at least directly consume the tree-sitter API instead of using the query language.
Maybe it’s only that difficult for languages that are as extremely ambiguous as Haskell.

I do prefer the function node for simple bindings just for aesthetic reasons, but when you’re rearranging those grammar rules for days to minimize conflicts and still get everything to parse right, you start to wonder whether it’s pointless work because it doesn’t matter to anyone and you could just have a conflict-free rule that generalizes a bit too much :sweat_smile:

But out of curiosity – wouldn’t this be fairly simple to detect in queries?
Compare:

(function
  (variable)
  (exp_name
    (variable)))

and

(bind
  (pat_name
    (variable))
  (exp_name
    (variable)))))

(for a = b). There would still need to be the function variant for f a b = c, but nullary functions really are more closely related to pattern bindings. GHC also makes an explicit “decision” to treat them as functions.

As for the signature suggestion, it would result in this tree for a :: Int:

(signature
  (pat_name
    (variable))
  (type_name
    (type)))

so I’d assume that would also be rather trivial to adapt to.

Since I’ve ultimately reduced the number of runtime conflicts to a convincingly minimal set, I don’t feel the urge to eliminate more of them by sacrificing expressivity, so I only wanted to have at least adressed the issue.

But this particular optimization is partially even motivated by one of your bug reports, incidentally! :sweat_smile:
The runtime conflicts that are needed to distinguish these constructs have the potential to greedily include an equation on the next line in a signature to create a pattern binding with annotation, and produce some top_splices as a byproduct.

Now I’ve taken measures to forcibly end a layout element based on indent from the scanner even if it results in a parse error, so this problem should be contained, but it’s a significant fact that runtime conflicts are fundamentally unpredictable, and the more you have of them the worse it gets, as far as I can tell.

Anyway, I’m not even entirely certain that merging the different function/bind decls wouldn’t result in similar conflicts on the next level in the tree.

I apologize for rambling on like this, but I’ve had these dilemmas running circles in my head for months now, and needed to purge them :see_no_evil:

Looks like you’re using exp_name to match both qualified and unqualified variables.
You’re also requiring them to be the head of an exp_apply, so that eliminates the need to check that the variable is an expression (except for the first query, where I assume you could replace _ with exp_apply as well).

Not sure this is sufficient to discard the concept, though.

Small tangent: Wouldn’t it be nicer to write:

(_
  (exp_name) [
    (variable) @func_name
    (qualified_variable (variable) @func_name)
  ]
  (#eq? @func_name "testGroup")
) @namespace.definition

:slight_smile:

Oh! I just noticed that this is exactly what supertypes are.
Doesn’t seem like they are usable in queries though.

But wow, adding _infixexp to supertypes reduced the size of node_types.json from 300kB to 200kB :sweat_smile:

edit: OMG it works!!! Check out this query test I added!
This might be sufficient justification for removing exp_name et al, @MrcJkb

edit2: Ah, but the supertype replaces the exp_name node, so if that’s removed we can’t match on (_infixexp (variable)) to capture only the variable.

Oh, if it fixes that bug please do it, that bug fundamentally breaks the ability to test the tree sitter interaction, because after a couple dozen tests it starts parsing every function into a something prefixed with a top-level splice. I’m happy to take a little bit of an increase in query complexity for that.

module A where

data D = Int :% Int

(%) :: Int -> Int -> Int

So both :% and % are symbolic ops, but their grammar names are constructor_operator and operator. If they’re used qualified:

f = 1 A.:% 2
g = 1 A.% 2

they both get a (qualified_operator (module) (X)) tree, where X is the respective name I mentioned.
Similarly, there are qualified_variable and qualified_constructor.
It would be consistent to use qualified_constructor_operator, but it seemed too verbose when I made that decision, I assume.
On the other hand, it is a clear redundancy, since qualified_constructor_operator can only contain constructor_operator, so maybe all six variants (including type and module) should just be (qualified (module) (X))?

Let me make sure we’re not misunderstanding each other: Until now the tree of a qualified name with three module segments has looked like:

(qualified_variable
  (module)
  (module)
  (module)
  (variable))

so the segments could always be captured individually. Is that what you meant?

My proposal would be to additionally wrap the loose segments with another node:

(qualified_variable
  (module_wrapper_???
    (module_???)
    (module_???)
    (module_???))
  (variable))

I don’t know if that is of any importance in the real world, but intuitively it seems impractical to have a variable number of segments as preceding siblings of (variable), and to be unable to capture the entire module name with a node.

On top of that, the individual segments aren’t really “modules”, so that name feels a bit misleading…

The Haskell Report uses conid for the segments and modid for the entire thing, which doesn’t really fit with the more semantic naming scheme for tree-sitter either.
I’m kinda leaning towards something like (module (module_id) (module_id))

Sorry for taking so long to respond again :slight_smile:

The former case seems to me like it would boil down to substring matching on a more flexible representation in your own code

You’re right. I don’t use substring matching very often, so I forget about that.
With that option, I think there’s technically no need for _parens nodes.
But I do think it could slightly be more convenient for capturing whatever
is in the _parens node.

((exp_parens (_) @inner))

vs.

("(" . (_) @inner . ")")

The second feels slightly more verbose and needs anchors. It’s not a big difference though.
On the other hand, other queries that don’t care about the parentheses might be simpler.

However, if your goal is to query the name of the declared function, the current variant would allow (signature [(operator) (variable)]), right?
Does that have any value over the alternatives?

I haven’t yet encountered a need to query the name of an infix function/operator.

((f a) b) c = x
[…] What do you think?

Is there a language extension for that? It doesn’t look familiar to me.
I guess new variant gets rid of some nesting?

Not sure this is sufficient to discard the concept, though.

Yep, me neither. I don’t think the queries should be hard to adjust.

Small tangent: Wouldn’t it be nicer to write…

Yes, I didn’t know about alternations when I wrote those queries :slight_smile:

Ah, but the supertype replaces the exp_name node, so if that’s removed we can’t match on
(_infixexp (variable)) to capture only the variable.

I don’t think I understand this. So something like

(exp_apply
  (_infixexp (variable)) @exp)

would not be possible?
The query test looks promising…

so maybe all six variants (including type and module) should just be (qualified (module) (X))?

That does sound like it would simplify things.

so the segments could always be captured individually. Is that what you meant?

Sorry, I meant I like the idea of being able to differentiate between modules/modids
and segments/conids. So I agree with your proposal.

I could imagine it simplifying the (import ...) highlight queries in nvim-treesitter,
for example.