I’m trying to write a tool to analyze nix files, notably rewriting/analyzing the path literals. I would like to parse a nix file into a list of path literals together with the locations where they can be found.
With this data I can then check whether the targets of the path literals are valid given the location of a file, simplify the paths and rewrite using the source location, or generate a directed graph for fun.
I’m having trouble parsing the path literals in nix files. Specifically, I want to only parse the path literals and discard the rest. I thought about using regex, but since I am more familiar with parser combinator libraries I went with Megaparsec.
The difficulty is that I want to fish out all the path literals in a nix file, disregarding all other syntactic elements. Megaparsec provides the getOffset primitive. However, getOffset gives me the position of the start of the failure, so I can’t jump forward using this information.
What you’re describing sounds very similar to the find-all-matches mode of regular expression drivers.
You can, of course, recreate this with parser combinators.
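findAll :: Parser a -> Text -> [(a, Int)]
findAll parser = go 0
  where
    go offset input
      | Text.null input = []  -- stop at end of input instead of looping forever
      | otherwise = case runParser parser input of
          -- No match here: skip one character and retry.
          Left _ -> go (succ offset) (Text.drop 1 input)
          -- Match: record its start offset, then resume after the consumed text.
          Right (result, rest) ->
            (result, offset) : go (offset + Text.length input - Text.length rest) rest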
This assumes that your token cannot be mistaken for another token of the target language when context is missing. For example, it would be impossible to find exclusively the variable names in Rust code this way, because they overlap with type identifiers and more.
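To make the idea concrete, here is a minimal self-contained sketch of such a findAll. The Parser newtype and the relPath parser are hypothetical stand-ins invented for illustration (the thread does not define them), and relPath only recognizes a toy subset of paths starting with "./", not the real Nix path grammar. Offsets are counted in characters.

```haskell
{-# LANGUAGE OverloadedStrings #-}
import           Data.Char (isAlphaNum)
import           Data.Text (Text)
import qualified Data.Text as Text

-- Hypothetical minimal parser type: consumes a prefix of the input and
-- returns the result together with the unconsumed rest.
newtype Parser a = Parser { runParser :: Text -> Either String (a, Text) }

-- Toy stand-in for a path literal: "./" followed by at least one path
-- character. This is an assumption, not the real Nix grammar.
relPath :: Parser Text
relPath = Parser $ \input ->
  case Text.stripPrefix "./" input of
    Nothing -> Left "expected ./"
    Just afterDot ->
      let (body, rest) = Text.span isPathChar afterDot
      in if Text.null body
           then Left "empty path"
           else Right ("./" <> body, rest)
  where
    isPathChar c = isAlphaNum c || c `elem` ("._-+/" :: String)

-- Try the parser at every position; on failure (or a match that consumed
-- nothing) skip one character, on success record the start offset and
-- continue after the match.
findAll :: Parser a -> Text -> [(a, Int)]
findAll parser = go 0
  where
    go offset input
      | Text.null input = []
      | otherwise = case runParser parser input of
          Left _ -> go (offset + 1) (Text.drop 1 input)
          Right (result, rest)
            | consumed == 0 -> go (offset + 1) (Text.drop 1 input)
            | otherwise     -> (result, offset) : go (offset + consumed) rest
            where consumed = Text.length input - Text.length rest
```

With these definitions, findAll relPath "import ./foo.nix { x = ./bar/baz; }" should yield [("./foo.nix", 7), ("./bar/baz", 23)]. Note that it would just as happily report a "./..." that sits inside a string literal or a comment, which is exactly the missing-context caveat.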
Edit:
Efficiency is always questionable when using parser combinators.
You are probably a lot better off hand-writing some pattern-matching/parsing code over the input string if performance starts mattering (or does already).
findAll :: Parser a -> Parser [a]
findAll p = ((:) <$> p <*> findAll p) <|> (anyChar *> findAll p) <|> pure []
(Note that this assumes that your parser combinator library backtracks. Some popular ones, like parsec and megaparsec, require a manual try annotation around the first p.)
Or use flatparse; it is generally as fast as handwritten loops, unless you start doing fancy things with it.
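For Megaparsec specifically, a sketch along these lines might look as follows. It wraps the match in try, and it calls getOffset before attempting the match so that the offset is the start of a successful match rather than of a failure. The relPath parser is again a hypothetical toy stand-in for a real path-literal parser, and the sketch assumes the parser passed to findAll always consumes input when it succeeds (otherwise it would loop).

```haskell
{-# LANGUAGE OverloadedStrings #-}
import           Data.Char (isAlphaNum)
import           Data.Text (Text)
import           Data.Void (Void)
import           Text.Megaparsec
import           Text.Megaparsec.Char (string)

type Parser = Parsec Void Text

-- Toy stand-in for a Nix path literal (assumption): "./" followed by at
-- least one path character.
relPath :: Parser Text
relPath = (<>) <$> string "./" <*> takeWhile1P (Just "path character") isPathChar
  where
    isPathChar c = isAlphaNum c || c `elem` ("._-+/" :: String)

-- Collect every match of p together with the offset where it starts.
findAll :: Parser a -> Parser [(a, Int)]
findAll p = go
  where
    go = hit <|> skip <|> ([] <$ eof)
    -- Record the offset first, then attempt p; try makes a failed attempt
    -- backtrack so that skip can take over at the same position.
    hit = do
      offset <- getOffset
      result <- try p
      ((result, offset) :) <$> go
    -- No match at this position: drop one token and continue.
    skip = anySingle *> go
```

Running parseMaybe (findAll relPath) "import ./foo.nix { x = ./bar/baz; }" should give Just [("./foo.nix", 7), ("./bar/baz", 23)].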
Cool! I didn’t think about making a findAll function. Thanks for the idea.
That is true. I realized this a while after posting. When the first character is trimmed away, a non-path syntax element might become a valid path literal, and this is not what I want. Furthermore, I can’t distinguish whether the path I found is inside a string literal or not, which is also not what I want. I think I’ll go with the Haskell nix parser route.
syb is a generic programming library that lets you traverse structures with Data instances, i.e. a function is applied to every Data field and the value can be downcast safely. Though I just realized that the types returned by the nix parser could hardly have Data instances.
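As an illustration of the syb approach, here is a small sketch on a made-up AST. The Expr and PathLit types are hypothetical and are not the types of any real Nix parser; the sketch only shows what the traversal looks like when Data instances are available. It uses listify from the syb package.

```haskell
{-# LANGUAGE DeriveDataTypeable #-}
import Data.Data (Data)
import Data.Generics (listify)

-- Hypothetical toy AST, invented for this example.
newtype PathLit = PathLit FilePath
  deriving (Show, Eq, Data)

data Expr
  = Path PathLit
  | Str String
  | App Expr Expr
  | List [Expr]
  deriving (Show, Data)

-- Collect every PathLit anywhere in the tree. listify visits all
-- sub-terms generically and keeps those of the requested type.
pathLiterals :: Expr -> [PathLit]
pathLiterals = listify (const True :: PathLit -> Bool)
```

For example, pathLiterals (App (Path (PathLit "./a")) (List [Str "./not-a-path", Path (PathLit "../b")])) should return [PathLit "./a", PathLit "../b"]. The string that merely looks like a path is skipped because it has a different type, which is the context a character-level scan cannot see.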