Survey for users of the tree-sitter Haskell grammar

yeah I can supply some things that dont work well later.

First of all, thanks for all the great work you are putting into this!

I maintain Haskell queries for a few Neovim plugins (nvim-treesitter[-textobjects],
neotest-haskell, iswap.nvim, vim-matchup, …), so here’s may take based on
my experience with those :smiley:
(I don’t have an opinion on all of the questions, so I left out some answers).

Is there a point to having dedicated nodes for parens, like type_parens?

I haven’t written any queries that needed to differentiate between different types
of parens. But I could imagine the ability to differentiate between pat_parens and exp_parens
being useful in some cases.

Should parens be part of operator for prefix notation, or should the have a separate node, or nothing?

I vaguely remember having an issue with parens being part of an infix operator declaration,
something like:

(|\?|) :: a -> a -> Maybe a

I can’t remember exactly what it was, but looking at the current AST, it feels a bit
strange - as if the parens are “hidden”, if you know what I mean.

The current AST:

(signature
  name: (operator)
  type: (fun ; ...

Is it useful to have nodes wrapping variables and constructors

I had previously told you that I don’t think they are useful.
But it turns out I have quite a few queries that use them in neotest-haskell.
So maybe I was wrong? But I’m still kind of convinced it wouldn’t be too difficult
to rewrite those queries without them. And they do add a bit of boilerplate.

Is (qualified_operator (module) (constructor_operator)) better…

I’m not sure I fully understand this. Do you have a Haskell example for it?

Should qualifying modules look like…

I like the idea of being able to capture module_segments.

Should there be nodes for empty layouts

I haven’t used them in any queries yet.

Should a comment right after where be inside of the declarations node?

I think this would be beneficial.
A comment being outside the declarations node could be a foot gun
when using anchors, e.g. ((where) . (decls)).

I took great care to get these different binding variants to parse correctly […] Is there much value to this?

Yes! I think these can be very useful for refined highlighting.

Is it a problem that infix is not specific

I don’t think so? We can use the operator to determine whether
it’s a function or not.
This proposal sounds like it could lead to false positives (e.g.
(x + y) is not a function, and there might be some custom operator
like |+|, for which it would be impossible for the parser to know
if (x |+| y) is a function or not.

But a let only contains binds as well, so it might as well skip that node.
But then you couldn’t select all local bindings.

Wouldn’t this be possible using anchors?

Is it useful not to use type_name for the head tycon of a prefix constructor declaration?

I think it just adds boilerplate, so yes.

Is it ok to have type_variable without nested variable?

I haven’t had any issues with this.

1 Like
  • Should a comment right after where be inside of the declarations node?

Only if it matches the indentation of the where block. If that’s too difficult, then in my experience “no” is probably a better default, since comments tend to go before their corresponding definitions.

1 Like

This is probably as difficult to do as part of parsing as it is to do as part of querying, but it would be wildly useful if declarations for the same function were grouped.

I think that unfortunately this will be fairly important in terms of Cursorless usage.

Thanks for taking the time to respond, Marc!

Ah. I assume that the motivation generalizes to any pat_* and exp_* node?

Though I don’t have any insight into how such a case arises in practice.
Would that be some less specific query that collects a bunch of things and then you process them differently based on whether they’re pat_parens or exp_parens?
Or is it rather a query like “give me all opening parens in expressions”?

The former case seems to me like it would boil down to substring matching on a more flexible representation in your own code, i.e. that you could pick all nodes that start with exp_ and then look for parens within them.
Would be nice if tree-sitter allowed grammars to attach more metadata to nodes than just field names, like a “category” that would be “expression”, “pattern” or “type”, preferably an arbitrary set.
This is particularly annoying for Haskell because expressions and patterns are almost entirely identical (GHC’s parser doesn’t even distinguish between them until post-processing).

Anyway, my motivation for this question was more related to the second part – should there be *_parens nodes at all?
Right now the AST for f (x) is

exp_apply
  exp_name
    variable "f"
  exp_parens
    "("
    exp_name
      variable "x"
    ")"

and I was wondering whether it should be

exp_apply
  exp_name
    variable "f"
  "("
  exp_name
    variable "x"
  ")"

or

exp_apply
  exp_name
    variable "f"
  exp_name
    "("
    variable "x"
    ")"

Not sure how difficult it would be to implement the second variant, but I assume that at least the first variant would create problems for querying.

This ties in with your response to the second part of the question:

I would interpret “hidden” as “unclear whether they are inside the operator or on the same level”.
Here it helps to print the tree like above, using the CLI in the repo (nix run .#show -- '(+) :: a' or
cargo r --bin=show -- 'a = f (x)').
So there are three possibilities:

The current one, where the prefix operator is an anonymous parens node:

signature
  "("
  operator "+"
  ")"
  "::"
  type_name
    type_variable "a"

A variant like the second one in the previous example, though without an extra named node:

operator
  "("
  "+"
  ")"

Bit awkward, though possible to conditionally query, I guess?
Adding another node like exp_parens makes it more uniform in some sense:

prefix_operator
  "("
  operator "+"
  ")"

However, if your goal is to query the name of the declared function, the current variant would allow (signature [(operator) (variable)]), right?
Does that have any value over the alternatives?

I imagine that it is an inconvenience that for operators, the first child is the paren while for varids it is the variable – you can’t use the first-child anchor easily then?

Overall I’m leaning towards prefix_operator, and applying this principle in other rules.

But I don’t know what a good name would be for this type of monstrosity:

((f a) b) c = x

Right now, it parses as:

function
  "("
  "("
  variable "f"
  patterns
    pat_name
      variable "a"
  ")"
  patterns
    pat_name
      variable "b"
  ")"
  patterns
    pat_name
      variable "c"
  "="
  exp_name
    variable "x"

What do you think?

I’m going to stop here and continue later, so this won’t result in a 10 page answer after a few days :sweat_smile:

I definitely want to try using comments’ indent to influence their placement, but I’m quite certain that it’s going to be a substantial effort, so I’ll probably attempt it after the PR has landed (it’s not really a breaking change anyway).
But it’s good to know that you would benefit from it, thanks for your input!

It seems very unrealistic at first glance, but would be an interesting challenge!
I could give hints to the scanner before and after function names to store them in the state, then compare them at the next decl and emit a different symbol that marks a decl group.
It wouldn’t work for infix function decls though, and would conflict with pattern bindings in a way that might be hard to resolve.
But worth thinking about!

Overall it seems to me that cursorless would really need to use the GHC parser, or at least directly consume the tree-sitter API instead of using the query language.
Maybe it’s only that difficult for languages that are as extremely ambiguous as Haskell.

I do prefer the function node for simple bindings just for aesthetic reasons, but when you’re rearranging those grammar rules for days to minimize conflicts and still get everything to parse right, you start to wonder whether it’s pointless work because it doesn’t matter to anyone and you could just have a conflict-free rule that generalizes a bit too much :sweat_smile:

But out of curiosity – wouldn’t this be fairly simple to detect in queries?
Compare:

(function
  (variable)
  (exp_name
    (variable)))

and

(bind
  (pat_name
    (variable))
  (exp_name
    (variable)))))

(for a = b). There would still need to be the function variant for f a b = c, but nullary functions really are more closely related to pattern bindings. GHC also makes an explicit “decision” to treat them as functions.

As for the signature suggestion, it would result in this tree for a :: Int:

(signature
  (pat_name
    (variable))
  (type_name
    (type)))

so I’d assume that would also be rather trivial to adapt to.

Since I’ve ultimately reduced the number of runtime conflicts to a convincingly minimal set, I don’t feel the urge to eliminate more of them by sacrificing expressivity, so I only wanted to have at least adressed the issue.

But this particular optimization is partially even motivated by one of your bug reports, incidentally! :sweat_smile:
The runtime conflicts that are needed to distinguish these constructs have the potential to greedily include an equation on the next line in a signature to create a pattern binding with annotation, and produce some top_splices as a byproduct.

Now I’ve taken measures to forcibly end a layout element based on indent from the scanner even if it results in a parse error, so this problem should be contained, but it’s a significant fact that runtime conflicts are fundamentally unpredictable, and the more you have of them the worse it gets, as far as I can tell.

Anyway, I’m not even entirely certain that merging the different function/bind decls wouldn’t result in similar conflicts on the next level in the tree.

I apologize for rambling on like this, but I’ve had these dilemmas running circles in my head for months now, and needed to purge them :see_no_evil:

Looks like you’re using exp_name to match both qualified and unqualified variables.
You’re also requiring them to be the head of an exp_apply, so that eliminates the need to check that the variable is an expression (except for the first query, where I assume you could replace _ with exp_apply as well).

Not sure this is sufficient to discard the concept, though.

Small tangent: Wouldn’t it be nicer to write:

(_
  (exp_name) [
    (variable) @func_name
    (qualified_variable (variable) @func_name)
  ]
  (#eq? @func_name "testGroup")
) @namespace.definition

:slight_smile:

Oh! I just noticed that this is exactly what supertypes are.
Doesn’t seem like they are usable in queries though.

But wow, adding _infixexp to supertypes reduced the size of node_types.json from 300kB to 200kB :sweat_smile:

edit: OMG it works!!! Check out this query test I added!
This might be sufficient justification for removing exp_name et al, @MrcJkb

edit2: Ah, but the supertype replaces the exp_name node, so if that’s removed we can’t match on (_infixexp (variable)) to capture only the variable.

Oh, if it fixes that bug please do it, that bug fundamentally breaks the ability to test the tree sitter interaction, because after a couple dozen tests it starts parsing every function into a something prefixed with a top-level splice. I’m happy to take a little bit of an increase in query complexity for that.

module A where

data D = Int :% Int

(%) :: Int -> Int -> Int

So both :% and % are symbolic ops, but their grammar names are constructor_operator and operator. If they’re used qualified:

f = 1 A.:% 2
g = 1 A.% 2

they both get a (qualified_operator (module) (X)) tree, where X is the respective name I mentioned.
Similarly, there are qualified_variable and qualified_constructor.
It would be consistent to use qualified_constructor_operator, but it seemed too verbose when I made that decision, I assume.
On the other hand, it is a clear redundancy, since qualified_constructor_operator can only contain constructor_operator, so maybe all six variants (including type and module) should just be (qualified (module) (X))?

Let me make sure we’re not misunderstanding each other: Until now the tree of a qualified name with three module segments has looked like:

(qualified_variable
  (module)
  (module)
  (module)
  (variable))

so the segments could always be captured individually. Is that what you meant?

My proposal would be to additionally wrap the loose segments with another node:

(qualified_variable
  (module_wrapper_???
    (module_???)
    (module_???)
    (module_???))
  (variable))

I don’t know if that is of any importance in the real world, but intuitively it seems impractical to have a variable number of segments as preceding siblings of (variable), and to be unable to capture the entire module name with a node.

On top of that, the individual segments aren’t really “modules”, so that name feels a bit misleading…

The Haskell Report uses conid for the segments and modid for the entire thing, which doesn’t really fit with the more semantic naming scheme for tree-sitter either.
I’m kinda leaning towards something like (module (module_id) (module_id))

Sorry for taking so long to respond again :slight_smile:

The former case seems to me like it would boil down to substring matching on a more flexible representation in your own code

You’re right. I don’t use substring matching very often, so I forget about that.
With that option, I think there’s technically no need for _parens nodes.
But I do think it could slightly be more convenient for capturing whatever
is in the _parens node.

((exp_parens (_) @inner))

vs.

("(" . (_) @inner . ")")

The second feels slightly more verbose and needs anchors. It’s not a big difference though.
On the other hand, other queries that don’t care about the parentheses might be simpler.

However, if your goal is to query the name of the declared function, the current variant would allow (signature [(operator) (variable)]), right?
Does that have any value over the alternatives?

I haven’t yet encountered a need to query the name of an infix function/operator.

((f a) b) c = x
[…] What do you think?

Is there a language extension for that? It doesn’t look familiar to me.
I guess new variant gets rid of some nesting?

Not sure this is sufficient to discard the concept, though.

Yep, me neither. I don’t think the queries should be hard to adjust.

Small tangent: Wouldn’t it be nicer to write…

Yes, I didn’t know about alternations when I wrote those queries :slight_smile:

Ah, but the supertype replaces the exp_name node, so if that’s removed we can’t match on
(_infixexp (variable)) to capture only the variable.

I don’t think I understand this. So something like

(exp_apply
  (_infixexp (variable)) @exp)

would not be possible?
The query test looks promising…

so maybe all six variants (including type and module) should just be (qualified (module) (X))?

That does sound like it would simplify things.

so the segments could always be captured individually. Is that what you meant?

Sorry, I meant I like the idea of being able to differentiate between modules/modids
and segments/conids. So I agree with your proposal.

I could imagine it simplifying the (import ...) highlight queries in nvim-treesitter,
for example.

hey @wenkokke, from your experience with writing queries, is there a clear favorite between flat and nested repeating structures? In particular, these two:

fun :: ∀ a . C1 => a -> C2 => ∀ b . b -> a
fun = f a b c d

where the first one has these nested and flat trees:

(forall (binders)
  (context (class)
    (fun (variable)
      (context (class)
        (forall (binders)
          (fun (variable) (variable)))))))

vs

(sigtype
  (forall (binders))
  (context (class))
  (fun (variable))
  (context (class))
  (forall (binders))
  (fun (variable))
  (variable))

and the second one:

(apply
  (apply
    (apply
      (apply
        (variable) -- f
        (variable))
      (variable))
    (variable))
  (variable))
(apply
  (variable) -- f
  (variable)
  (variable)
  (variable)
  (variable))

My intuition is that the flat structure is easier, e.g. I saw that plugins that do argument swapping rely on it; but since the nested structure is “correct” in terms of reduction semantics and associativity there might be other aspects that I have no insight into.

Is this relevant/significant for your cursorless efforts?

Flattening only the apply node into a list of a function and its arguments is workable and probably easier. But once you start extending this to infix operators and I can no longer reliably grab the function node by taking the first child, we’re in trouble.

The flattening of the type signature looks much scarier and looks like it loses a lot of useful information. If you think it retains the same structural information, I might be misunderstanding exactly what you’re flattening there.

Ah no, that wouldn’t work anyway. Infix operators have no arbitrary repetition like application.
The nested apply is much easier to implement, but if the flat structure was more desirable I would choose that instead.

Note however that it has always been flat in this grammar, see for example the tests!
It’s pretty likely that this wasn’t deterministic though.
In the rewrite, I have implemented the nested structure, mainly because I couldn’t get it to work reliably otherwise, but now I’m pretty confident I could go either way, so I need opinions!


I’m pretty certain they’re isomorphic, but I think it depends on the tooling how that translates into practice. forall, context and arrow are always right-associative, so the nested tree would be a chain anyway at the top level.
If you can come up with some queries or other examples that would be different between the two approaches, we can probably figure it out.
I’m not particularly happy about either of the alternatives, so any justification to choose one would be very welcome!


Anyway, it sounds like you’re not very certain about either case; maybe you can think about it for a bit and get back to me :relaxed: I can also push a branch to try out if that helps, or elaborate some more on whatever’s unclear.

nvim-treesitter highlights

Here’s an example where the nested structure requires multiple queries that the flat variant would achieve with only one

Is there any plan to add grammars for the other languages that we regularly use, in particular Cabal, but I suspect Core might be interesting as well.

I looked around for a treesitter grammar for Cabal about a year ago. I found nothing and ended up hacking something together… PA: Treesitter grammar for Cabal

No concrete plans, but I was thinking about adding Alex/Happy grammars. I’d be happy to collaborate though.

1 Like

Hi, we are using tree-sitter-haskell in static-ls at Mercury. This is in an effort to improve language intelligence tooling for extremely large projects (~ 10,000 modules). I made Haskell bindings for tree-sitter that generates a full typed tree with traversals in tree-sitter-simple.

4 Likes