When you said “reverse-engineer”, does that mean some sort of checking is baked into the PDFs (maybe as JavaScript)?
The e-invoice is defined in the EN 16931 standard, and one of the compliant implementations, the French Factur-X, has an XSD with associated Schematron rules. Even the Basic variant has about 400 Schematron rules!
Nikita has written a new parser for XPath expressions because the existing implementation does not cover the current standard. That is mostly what is needed for parsing Schematron files. Next up is defining an algebra of assertions, which will refine the original XSD model. The crucial part of the spec is the equational assertions, though, which is why I’ve been asking all sorts of subtype-related questions on this Discourse lately. The math is beautiful, which makes it a fun project to work on.
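To make "algebra of assertions" a bit more concrete, here is a minimal sketch of the general idea. It is not the actual design: the flattened `Doc` representation, the `Term` and `Assertion` types, and the field names `netTotal`, `taxTotal` and `grossTotal` are all hypothetical placeholders chosen for illustration. Real Schematron rules operate on XPath expressions over the XML tree.

```haskell
module Main where

import qualified Data.Map.Strict as Map

-- Hypothetical: a document flattened to path/value pairs.
-- Rational avoids floating-point rounding for monetary amounts.
type Path = String
type Doc  = Map.Map Path Rational

-- Hypothetical term language for the two sides of an equation.
data Term
  = Lit Rational
  | Field Path
  | Add Term Term

-- Hypothetical assertion algebra: equations plus a few connectives.
data Assertion
  = Equals Term Term
  | Exists Path
  | And Assertion Assertion
  | Not Assertion

-- A term has no value if any field it mentions is missing.
evalTerm :: Doc -> Term -> Maybe Rational
evalTerm _ (Lit x)   = Just x
evalTerm d (Field p) = Map.lookup p d
evalTerm d (Add a b) = (+) <$> evalTerm d a <*> evalTerm d b

-- An equation holds only when both sides are defined and equal.
holds :: Doc -> Assertion -> Bool
holds d (Equals a b) = case (evalTerm d a, evalTerm d b) of
  (Just x, Just y) -> x == y
  _                -> False
holds d (Exists p) = Map.member p d
holds d (And a b)  = holds d a && holds d b
holds d (Not a)    = not (holds d a)

-- Illustrative equational rule: gross total = net total + tax total.
totalRule :: Assertion
totalRule =
  And (Exists "grossTotal")
      (Equals (Field "grossTotal")
              (Add (Field "netTotal") (Field "taxTotal")))

invoice :: Doc
invoice = Map.fromList
  [("netTotal", 100), ("taxTotal", 20), ("grossTotal", 120)]

main :: IO ()
main = print (holds invoice totalRule)
```

This sketch only checks assertions at run time. The more interesting question, and the reason for the subtype questions, is how much of such a rule can be pushed into the types generated from the XSD.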
Yet all this is a lot of software development for a small company like mine. I’d be grateful if someone else with a use for Haskell XML models could step in and share the cost, perhaps by monetizing the code generation as SaaS.