How to parse specific syntax elements and discard the rest?

Hello,

I’m trying to write a tool to analyze nix files, notably analyzing and rewriting the path literals. I would like to parse a nix file into a list of path literals together with the locations where they can be found.
With this data I can then check whether the targets of the path literals are valid given the location of a file, simplify the paths and rewrite using the source location, or generate a directed graph for fun.

I’m having trouble parsing the path literals in nix files. Specifically, I want to only parse the path literals and discard the rest. I thought about using regex, but since I am more familiar with parser combinator libraries I went with Megaparsec.


The difficulty is that I want to fish out all the path literals in a nix file, disregarding all other syntactic elements. Megaparsec provides the getOffset primitive. However, when the parser fails, getOffset gives me the position of the start of the failure, so I can’t jump forward using this information.

ghci> parseTest (liftA2 (,) (optional ("foo" :: Parser Text)) getOffset) "foo"
(Just "foo",3)
it :: ()
(0.02 secs, 80,576 bytes)
ghci> parseTest (liftA2 (,) (optional ("foo" :: Parser Text)) getOffset) "bar"
(Nothing,0)
it :: ()
(0.01 secs, 78,480 bytes)

I also have tried to use observing, but it also only reports the position at the start of the failure.

ghci> parseTest (observing ("foo" :: Parser Text)) "bar"
Left (TrivialError 0 (Just (Tokens ('b' :| "ar"))) (fromList [Tokens ('f' :| "oo")]))
it :: ()

What can I do to parse only the path literals efficiently and correctly while discarding the rest? Thanks a lot =D

What you’re describing sounds very similar to the find-all-matches mode of regular expression drivers.

You can, of course, recreate this with parser combinators.

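-- Assumes a hypothetical runParser :: Parser a -> Text -> Either e (a, Text)
-- that returns the unconsumed rest of the input on success.
findAll :: Parser a -> Text -> [(a, Int)]
findAll parser = go 0
  where
    go offset input
      | Text.null input = []  -- stop at the end of the input
      | otherwise = case runParser parser input of
          Left _ -> go (offset + 1) (Text.drop 1 input)
          Right (result, rest) ->
            (result, offset) : go (offset + Text.length input - Text.length rest) rest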

This assumes that your token cannot be mistaken for another token of the target language when context is missing. For example, it would be impossible to find exactly the variable names in Rust code this way, because they overlap with type identifiers and more.

Edit:
Efficiency is always questionable when using parser combinators.
You are probably a lot better off hand-writing some pattern-matching/parsing code over the input string if performance starts mattering (or does already).
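
As a rough illustration of what such hand-written code could look like, here is a minimal sketch over Data.Text. The function name scanPaths and the notion of a "path literal" used here (the marker "./" followed by everything up to the next whitespace) are simplifying assumptions for the example, not the actual Nix grammar.

```haskell
import Data.Char (isSpace)
import Data.Text (Text)
import qualified Data.Text as T

-- | Hypothetical hand-written scanner: returns every chunk that starts
-- with "./" and runs up to the next whitespace, paired with its offset.
scanPaths :: Text -> [(Int, Text)]
scanPaths = go 0
  where
    marker = T.pack "./"
    go off input
      | T.null input = []
      | marker `T.isPrefixOf` input =
          -- lit is never empty here because it starts with the marker,
          -- so the scan always makes progress.
          let (lit, rest) = T.break isSpace input
          in (off, lit) : go (off + T.length lit) rest
      | otherwise = go (off + 1) (T.drop 1 input)
```

For the input "a ./b c ./d/e" this yields [(2,"./b"),(8,"./d/e")]. Like any context-free scan, it also fires inside strings and comments, and it reports the "./b" tail of "../b".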

Haskell has a Nix parser, so you can use syb to discover the paths in the AST.

You can take the GitHub project yaitskov/literal-flake-input (a web service and CLI tool for converting a URL path into a Nix attrset) as an example.

You can also just do something like:

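findAll :: Parser a -> Parser [a]
findAll p = ((:) <$> p <*> findAll p)  -- keep a match, then continue
        <|> (anyChar *> findAll p)     -- otherwise skip one character
        <|> pure []                    -- end of input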

(Note that this assumes that your parser combinator library backtracks. Some popular ones, like parsec and megaparsec, require a manual try annotation around the first p.)
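
For Megaparsec specifically, a minimal sketch of this idea with the try annotation in place might look like the following. It also records where each match starts via getOffset. The names findAllWithOffsets and relPath are made up for the example, and relPath is a deliberately simplified stand-in ("./" followed by non-whitespace characters), not the real Nix path grammar.

```haskell
{-# LANGUAGE OverloadedStrings #-}
import Data.Char (isSpace)
import Data.Text (Text)
import Data.Void (Void)
import Text.Megaparsec
import Text.Megaparsec.Char (string)

type Parser = Parsec Void Text

-- | Collect every match of p together with the offset where it starts.
-- Assumes p consumes at least one token whenever it succeeds.
findAllWithOffsets :: Parser a -> Parser [(Int, a)]
findAllWithOffsets p = go
  where
    go = ([] <$ eof)                                     -- end of input
     <|> ((:) <$> try ((,) <$> getOffset <*> p) <*> go)  -- keep a match
     <|> (anySingle *> go)                               -- skip one token

-- | Simplified stand-in for a path literal (an assumption for this sketch).
relPath :: Parser Text
relPath = (<>) <$> string "./"
               <*> takeWhile1P (Just "path character") (not . isSpace)
```

With these definitions, parseMaybe (findAllWithOffsets relPath) "a ./b c ./d/e" gives Just [(2,"./b"),(8,"./d/e")]. The try is what lets the parser fall through to the skipping branch after p has consumed part of the input and then failed.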

Or use flatparse; it is generally as fast as handwritten loops, unless you start doing fancy things with it.

Nice! I didn’t know that. Thanks.

What is Syb? I grepped in the literal-flake-inputs project and found nothing.

Cool! I didn’t think about making a findAll function. Thanks for the idea.

That is true. I realized this a while after posting. When the first character is trimmed away, the remainder of a non-path syntax element might become a valid path, and this is not what I want. Furthermore, I can’t distinguish whether the path I found is inside a string literal or not, which is also not what I want. I think I’ll go with the Haskell nix parser route.

syb is a generic programming library that lets you traverse structures that have Data instances, i.e. a function is applied to every Data field and the value can be downcast safely. Though I just realized that the types returned by the nix parser could hardly have Data instances.
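
To make the syb idea concrete, here is a small sketch on a toy AST. The Expr type below is invented for the illustration and is not the type any real Nix parser returns; listify from Data.Generics (the syb package) collects every node of a given type that satisfies a predicate.

```haskell
{-# LANGUAGE DeriveDataTypeable #-}
import Data.Data (Data)
import Data.Generics (listify)

-- | Toy expression type (hypothetical, for illustration only).
data Expr
  = Path FilePath
  | Str String
  | List [Expr]
  | App Expr Expr
  deriving (Show, Eq, Data)

isPath :: Expr -> Bool
isPath (Path _) = True
isPath _        = False

-- | All Path nodes anywhere inside an expression, found generically
-- without writing the traversal by hand.
pathNodes :: Expr -> [Expr]
pathNodes = listify isPath
```

For example, pathNodes (App (Path "./a") (List [Str "./not-a-path", Path "../b"])) returns [Path "./a",Path "../b"]. The string that merely looks like a path is not reported, because the traversal works on the AST rather than on the raw text.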

Sorry for the late reply.
I see, I’ll look into it. Thanks a lot!