How can I produce a PDF from a XML String using Pandoc Haskell library?

Hello, community.

Starting from the following XML formatted string:

Input:

"<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n<html>\n  <head>\n    <meta charset=\"utf-8\"/>\n  </head>\n  <body>\n    <ul id=\"a8_boYDC\">\n      <li id=\"rp\">\n        <p>\"Alpha\"</p>\n        <ul>\n          <li id=\"cT\">\n            <p>\"Beta\"</p>\n            <ul>\n              <li id=\"wy\">\n                <p>\"Gamma\"</p>\n              </li>\n              <li id=\"Be\">\n                <p>\"Delta\"</p>\n              </li>\n              <li id=\"Ep\">\n                <p>\"Epsilon\"</p>\n              </li>\n            </ul>\n          </li>\n          <li id=\"Ko\">\n            <p>\"Zeta\"</p>\n            <ul>\n              <li id=\"AI\">\n                <p>\"Eta\"</p>\n              </li>\n            </ul>\n          </li>\n          <li id=\"kw\">\n            <p>\"Theta\"</p>\n            <ul>\n              <li id=\"sx\">\n                <p>\"Iota\"</p>\n              </li>\n              <li id=\"82\">\n                <p>\"Kappa\"</p>\n              </li>\n              <li id=\"o_\">\n                <p>\"Lambda\"</p>\n              </li>\n            </ul>\n          </li>\n        </ul>\n      </li>\n    </ul>\n  </body>\n</html>\n"

Output:

A PDF file.

I am using the functions

makePDF :: (PandocMonad m, MonadIO m, MonadMask m) => String -> [String] -> (WriterOptions -> Pandoc -> m Text) -> WriterOptions -> Pandoc -> m (Either ByteString ByteString)

from Text.Pandoc.PDF, and

readHtml :: (PandocMonad m, ToSources a) => ReaderOptions -> a -> m Pandoc

from Text.Pandoc.Readers.

{-# LANGUAGE FlexibleContexts #-}
{-# LANGUAGE OverloadedStrings #-}

module Main where

import Control.Monad
import qualified Data.ByteString as B
import qualified Data.ByteString.Lazy as LB
import Data.List
import qualified Data.Text as T
import Data.Text.Encoding
import qualified Data.Text.IO as TIO
import Data.Tree
import Text.Pandoc
import Text.Pandoc.PDF
import Text.Pandoc.Shared
import qualified Data.ByteString.Lazy as BL
import Control.Monad
import Control.Monad.IO.Class
import Control.Monad.Catch

xmlString :: T.Text
xmlString = "<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n<html>\n  <head>\n    <meta charset=\"utf-8\"/>\n  </head>\n  <body>\n    <ul id=\"a8_boYDC\">\n      <li id=\"rp\">\n        <p>\"Alpha\"</p>\n        <ul>\n          <li id=\"cT\">\n            <p>\"Beta\"</p>\n            <ul>\n              <li id=\"wy\">\n                <p>\"Gamma\"</p>\n              </li>\n              <li id=\"Be\">\n                <p>\"Delta\"</p>\n              </li>\n              <li id=\"Ep\">\n                <p>\"Epsilon\"</p>\n              </li>\n            </ul>\n          </li>\n          <li id=\"Ko\">\n            <p>\"Zeta\"</p>\n            <ul>\n              <li id=\"AI\">\n                <p>\"Eta\"</p>\n              </li>\n            </ul>\n          </li>\n          <li id=\"kw\">\n            <p>\"Theta\"</p>\n            <ul>\n              <li id=\"sx\">\n                <p>\"Iota\"</p>\n              </li>\n              <li id=\"82\">\n                <p>\"Kappa\"</p>\n              </li>\n              <li id=\"o_\">\n                <p>\"Lambda\"</p>\n              </li>\n            </ul>\n          </li>\n        </ul>\n      </li>\n    </ul>\n  </body>\n</html>\n"

export :: (PandocMonad m, MonadIO m, MonadMask m) => Pandoc -> m (Either BL.ByteString BL.ByteString)
export = makePDF "pdflatex" [] writeMarkdown def

f :: T.Text -> PandocIO Pandoc
f = liftIO . (handleError <=< runIO) . g

main :: IO ()
main = do
  mybytes <- runIO $ do
    doc <- g xmlString
    export doc
  lr <- handleError mybytes 
  case lr of (Right b) -> BL.writeFile "output.pdf" b
             (Left  x) -> BL.putStr x

The issue is:

! LaTeX Error: Missing \begin{document}.
See the LaTeX manual or LaTeX Companion for explanation.

Type H <return> for immediate help.

l.1 -

Does anyone know how can I solve this issue ?

System specs:
Macbook Air M1
macOS Ventura 13.1
Installed pdflatex as the pdf engine.

3 Likes

Have you seen this post? Apparently, a latex template is needed (that can be created using pandoc).

1 Like

Thank you, @rubenmoor. I’ve already seen that post.

However, that code seems to be outdated.

export :: (MonadIO m) => String ->  Pandoc -> m (Either BL.ByteString BL.ByteString)
export tmpl pdoc = liftIO $ makePDF "xelatex" writeLaTeX (def { writerStandalone = True, writerTemplate = tmpl}) pdoc

From that type signature, one can infer the type of writerTemplate is String; however, hoogle says that the type is:

writerTemplate :: Maybe (Template Text)

One thing that could advance my project is: how can I construct a Template Text type ?

It seems the default latex template has a missing \begin{document}, hence the error.

2 Likes

Based on Text.Pandoc.Templates and the directory of default templates (and assuming OverloadedStrings), it looks like you can use:

liftIO $ do
  tmpl <- compileDefaultTemplate "latex"
  makePDF -- ...
3 Likes

Thank you so much, @gcox !

I have modified export function with an additional argument of type Template T.Text:

export :: (PandocMonad m, MonadIO m, MonadMask m) => Template T.Text -> Pandoc -> m (Either BL.ByteString BL.ByteString)
export tmpl = makePDF "pdflatex" [] writeMarkdown def {writerTemplate = Just tmpl}

and main function:

main :: IO ()
main = do
  mybytes <- runIO $ do
    doc <- g xmlString
    tmpl <- compileDefaultTemplate "latex"
    export tmpl doc
  lr <- handleError mybytes 
  case lr of (Right b) -> BL.writeFile "output.pdf" b
             (Left  x) -> BL.putStr x

Full code:

Expand disclosure triangle to see "haskell" source
{-# LANGUAGE FlexibleContexts #-}
{-# LANGUAGE OverloadedStrings #-}

module Main where

import Control.Monad
import qualified Data.ByteString as B
import qualified Data.ByteString.Lazy as LB
import Data.List
import qualified Data.Text as T
import Data.Text.Encoding
import qualified Data.Text.IO as TIO
import Data.Tree
import Text.Pandoc
import Text.Pandoc.PDF
import Text.Pandoc.Shared
import qualified Data.ByteString.Lazy as BL
import Control.Monad
import Control.Monad.IO.Class
import Control.Monad.Catch

interact'' :: (T.Text -> IO T.Text) -> IO ()
interact'' f =
  TIO.getContents
    >>= TIO.readFile . T.unpack
      >>= f
        >>= TIO.putStrLn

xmlString :: T.Text
xmlString = "<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n<html>\n  <head>\n    <meta charset=\"utf-8\"/>\n  </head>\n  <body>\n    <ul id=\"a8_boYDC\">\n      <li id=\"rp\">\n        <p>\"Alpha\"</p>\n        <ul>\n          <li id=\"cT\">\n            <p>\"Beta\"</p>\n            <ul>\n              <li id=\"wy\">\n                <p>\"Gamma\"</p>\n              </li>\n              <li id=\"Be\">\n                <p>\"Delta\"</p>\n              </li>\n              <li id=\"Ep\">\n                <p>\"Epsilon\"</p>\n              </li>\n            </ul>\n          </li>\n          <li id=\"Ko\">\n            <p>\"Zeta\"</p>\n            <ul>\n              <li id=\"AI\">\n                <p>\"Eta\"</p>\n              </li>\n            </ul>\n          </li>\n          <li id=\"kw\">\n            <p>\"Theta\"</p>\n            <ul>\n              <li id=\"sx\">\n                <p>\"Iota\"</p>\n              </li>\n              <li id=\"82\">\n                <p>\"Kappa\"</p>\n              </li>\n              <li id=\"o_\">\n                <p>\"Lambda\"</p>\n              </li>\n            </ul>\n          </li>\n        </ul>\n      </li>\n    </ul>\n  </body>\n</html>\n"

xmlToMD :: T.Text -> IO T.Text
xmlToMD =
  handleError
    <=< runIO
      . ( pure
            . T.unlines
            . map (T.drop 1 . T.filter (/= '-'))
            . filter (not . T.null)
            . T.lines
            <=< writePlain def {writerTabStop = 2}
            <=< readHtml def
        )


export :: (PandocMonad m, MonadIO m, MonadMask m) => Template T.Text -> Pandoc -> m (Either BL.ByteString BL.ByteString)
export tmpl = makePDF "pdflatex" [] writeMarkdown def {writerTemplate = Just tmpl}

g :: T.Text -> PandocIO Pandoc
g = readHtml def

-- f :: MonadIO m => T.Text -> m (Either BL.ByteString BL.ByteString)

f :: T.Text -> PandocIO Pandoc
f = liftIO . (handleError <=< runIO) . g

main :: IO ()
main = do
  mybytes <- runIO $ do
    doc <- g xmlString
    tmpl <- compileDefaultTemplate "latex"
    export tmpl doc
  lr <- handleError mybytes 
  case lr of (Right b) -> BL.writeFile "output.pdf" b
             (Left  x) -> BL.putStr x

However, I get an error with the default template:

! LaTeX hooks Error: Missing (empty) default label on line 7.
(hooks) This command was ignored.

Type to continue.

l.7 Sepackage

1 Like

I tried printing the output of writeLaTeX directly. Line 7 was:

\usepackage{amsmath,amssymb}

The Pandoc manual has a long list of packages that need to be installed to output to PDF, and among them is amsmath. So maybe it’s a missing dependency?

1 Like

Thank you again, @gcox !

How did you print it ?

1 Like

I edited main to use writeLaTeX directly, like this:

printLaTeX :: IO ()
printLaTeX = do
  mybytes <- runIO $ do
    doc <- g xmlString
    tmpl <- compileDefaultTemplate "latex"
    writeLaTeX def {writerTemplate = Just tmpl} doc
  t <- handleError mybytes
  TIO.putStrLn t

I also found that copying the output into a file and converting it to PDF from the command line seems to work. I don’t know why.

1 Like

Thank you !

Yes ! I found out the same !

It seems the default template used for Pandoc Haskell library is different from the template used in pandoc terminal command.

Do you have an idea about how can I load a custom template ?

With this command:

pandoc -D latex > default.latex

I obtained the default template:

In Pandoc Haskell library, appears this function:

compileTemplate :: (TemplateMonad m, TemplateTarget a) => FilePath -> Text -> m (Either String (Template a))

But I am not sure how can I use it.

1 Like

To use compileTemplate, it looks like you just write your own template (or edit the default one), save it and then do the same as with compileDefaultTemplate, but with the filename as the first argument:

do
  t <- compileTemplate "myTemplate.tex" "latex"
  case t of Right tmpl -> -- file read succeeded, so use the template
1 Like

@gcox, I sincerely appreciate all your help. I’ve managed to solve it. I am posting the full source for future readers.

{-# LANGUAGE FlexibleContexts #-}
{-# LANGUAGE OverloadedStrings #-}

module Main where

import Control.Monad
import Control.Monad.Catch
import Control.Monad.IO.Class
import qualified Data.ByteString as B
import qualified Data.ByteString.Lazy as BL
import qualified Data.ByteString.Lazy as LB
import Data.List
import qualified Data.Text as T
import Data.Text.Encoding
import qualified Data.Text.IO as TIO
import Text.DocTemplates
import Text.Pandoc
import Text.Pandoc.PDF
import Text.Pandoc.Shared

xmlString :: T.Text
xmlString = "<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n<html>\n  <head>\n    <meta charset=\"utf-8\"/>\n  </head>\n  <body>\n    <ul id=\"9TG7k1iE\">\n      <li id=\"rp\">\n        <p>\"Alpha\"</p>\n        <ul>\n          <li id=\"cT\">\n            <p>\"Beta\"</p>\n            <ul>\n              <li id=\"wy\">\n                <p>\"Gamma\"</p>\n              </li>\n              <li id=\"Be\">\n                <p><strong>\"Delta\"</strong></p>\n              </li>\n              <li id=\"Ep\">\n                <p>\"Epsilon\"</p>\n              </li>\n            </ul>\n          </li>\n          <li id=\"Ko\">\n            <p>\"Zeta\"</p>\n            <ul>\n              <li id=\"AI\">\n                <p>\"Eta\"</p>\n              </li>\n            </ul>\n          </li>\n          <li id=\"kw\">\n            <p>\"Theta\"</p>\n            <ul>\n              <li id=\"sx\">\n                <p>\"Iota\"</p>\n              </li>\n              <li id=\"82\">\n                <p>\"Kappa\"</p>\n              </li>\n              <li id=\"o_\">\n                <p>\"Lambda\"</p>\n              </li>\n            </ul>\n          </li>\n        </ul>\n      </li>\n    </ul>\n  </body>\n</html>\n"

export :: (PandocMonad m, MonadIO m, MonadMask m) => String -> Template T.Text -> Pandoc -> m (Either BL.ByteString BL.ByteString)
export pdfengine tmpl = makePDF pdfengine [] writeLaTeX def {writerTemplate = Just tmpl}

htmlToPDF :: FilePath -> FilePath -> T.Text -> IO ()
htmlToPDF pdfengine path =
  ( either BL.putStr (BL.writeFile path)
      <=< handleError
      <=< runIO
  )
    . ( readHtml def
          >=> \doc ->
            compileDefaultTemplate "latex"
              >>= flip (export pdfengine) doc
      )

main :: IO ()
main = htmlToPDF "/Library/TeX/texbin/pdflatex" "/Users/detach/Downloads/output.pdf" xmlString
1 Like

Two issues:

export :: (PandocMonad m, MonadIO m, MonadMask m) => String -> Template T.Text -> Pandoc -> m (Either BL.ByteString BL.ByteString)
export pdfengine tmpl = makePDF pdfengine [] writeLaTeX def {writerTemplate = Just tmpl}

The writer parameter of makePDF should be writeLaTeX

And, we are forced to update the parameter

def {writerTemplate = Just tmpl}

where tmpl has type Template Text.

Otherwise, pandoc would produce a fragment (instead a standalone document). And that’s why it can’t convert that LaTeX document to PDF.

One of the parameters of the type WriterOptions used to be writeStandalone :: Bool, but that was deprecated. When a Template is used, the writer always generates a standalone document.

3 Likes