Solved. Parsec combinator 'many' is applied to a parser that accepts an empty string

mhitza · March 21, 2024, 10:14pm

I’m trying to write a parser that given the following type of input string

lorem ipsum "dolor sit" amed"lorem "

Splits it out into the following parts

lorem
ipsum
"dolor sit"
amed"lorem "

In summary, it considers parts as characters separated by spaces, but when it comes to quoted text, inner spaces are no longer considered separators until the quote is closed. At some point it should handle escaped quotes (\") within the quoted text, but that’s outside of the scope of this question.

Where I’m at, right now, and could use some help:

import Text.Parsec qualified as Parsec
import Text.Parsec.Char qualified as Parsec

stringPartsParser :: Parsec.Parsec String () String
stringPartsParser = 
  do wordpart <- Parsec.anyChar
                 `Parsec.manyTill`
                   ((Parsec.oneOf " \"" >> pure [])
                       Parsec.<|> (Parsec.eof >> pure []))
     (Parsec.lookAhead (Parsec.char '"') >> wordTillQuoteEnd wordpart)
        Parsec.<|> return wordpart
  where
    wordTillQuoteEnd partial = do
      Parsec.char '"'
      inner <- Parsec.anyChar `Parsec.manyTill` (Parsec.char '"')
      Parsec.char '"'
      pure $ partial <> ['"'] <> inner <> ['"']

And in GHCi

> Parsec.parseTest (Parsec.many stringPartsParser) "lorem\" ipsum"
*** Exception: Text.ParserCombinators.Parsec.Prim.many: combinator 'many' is applied to a parser that accepts an empty string.
CallStack (from HasCallStack):

Not much experience with Parsec, mostly banging rocks trying to succeed

edit

I thought the error was from the function definition, but it was in my GHCi call instead. Replaced the call with Parsec.parseTest (stringPartsParser Parsec.manyTill Parsec.eof) "lorem\" ipsum" and able to progress further.

Ideas still welcome for what I’m trying to do.

jeukshi · March 22, 2024, 12:09am

Always, a good idea is to try to break down the problem into smaller ones. Before we try to parse many things, let’s just write a parser for a single thing.

word :: Parsec.Parsec String () String
word = do
    start <- Parsec.many1 Parsec.letter
    end <-
        Parsec.optionMaybe (Parsec.try $ Parsec.char '"') >>= \case
            Nothing -> pure []
            Just firstQuote -> do
                rest <- Parsec.many1 (Parsec.letter Parsec.<|> Parsec.space)
                _ <- Parsec.char '"'
                return $ firstQuote : rest <> "\""
    return $ start <> end

quotedWord :: Parsec.Parsec String () String
quotedWord = do
    _ <- Parsec.char '"'
    w <- Parsec.many1 (Parsec.letter Parsec.<|> Parsec.space)
    _ <- Parsec.char '"'
    return $ "\"" <> w <> "\""

With parser for a single thing, we can now parse many things.

allWords :: Parsec.Parsec String () [String]
allWords =
    (word Parsec.<|> quotedWord)
        `Parsec.sepBy1` Parsec.space

Now, this works for lorem ipsum "dolor sit" amed"lorem ", but not for "lorem\" ipsum", because it assumes quotations come in pairs.

jamesbrock · March 30, 2024, 3:48pm

Here’s a function for splitting strings with Megaparsec (not Parsec) parsers: splitCap

import Replace.Megaparsec
import Text.Megaparsec
import Data.Either
import Data.Char

rights <$> flip splitCap "lorem ipsum \"dolor sit\" amed\"lorem \"" $ do
  do 
    try $ fmap fst $ match $ do
      char '"'
      anyTill (char '"')
  <|>
  do
    try $ fmap fst $ match $ do
      takeWhile1P Nothing (not . isSpace)

["lorem","ipsum","\"dolor sit\"","amed\"lorem","\""]

That last " is a bit weird I’m not sure how you want to handle that.