Bootstrapping XML schema definitions with Claude Opus 4.6: A case study (the good, the bad, and the ugly)

TL;DR

I tried to vibe code an XML schema parser and code generator with Github Copilot using Claude Opus 4.6. I made good progress and learned a lot, but the AI kept lying to me in subtle and enraging ways.

My one-line recommendation on AI

Vibe coding will only pay off if you have, or create, or generate, an impeccable, readable test suite; or if you don’t care about quality.

XSD what now

Some obscure and arguably outdated APIs use the XML Schema Definition format to define their SOAP schemata. It’s a schema definition language as you’d expect, it defines what XMLs are valid and which ones are not - requiring specific element names, types, orders, ranges, whatnot. XSD is - to my knowledge - only supported in HaXML which is a bit dated (it uses String instead of Text, it doesn’t use xml-types).

I need a modern XSD parser and code generator at work. Given an XSD file, it needs to define:

  1. A type corresponding to valid XMLs
  2. A parser parsing XMLs into that type
  3. A pretty printer

XSD is self hosted

In order not to mess this up, a good implementation should be bootstrapped into a self hosting one. What? Well, XSD is written in itself. Meaning, there is a meta schema which describes valid XSD files, and that meta schema is an XSD schema.

So when you make a XSD code generator, the first thing you do is: generate the corresponding code for the xsd meta schema, and point your code generation to the datatype defined there.

The journey I’ll sing about

I obviously couldn’t be bothered to do this by hand. XSD is complex and has corner cases. The great openapi3 code generator was a thesis, not an afternoon project. Ok, on the Haskell matrix channel I have been mistaken for an AI in the past, so maybe I should have this relentless patience and just chew through it. But didn’t you know that AIs are also lazy? Oh, you will soon!

So an AI has to do the gritty job. I had this XSD bootstrapping idea on the backburner for some time, and finally Claude Opus 4.6 seems strong enough to shoulder the task.

Easy sailing, right? Just tell it to create a bootstrap schema, then tell it to bootstrap XSD, and then tell it to generate from the bootstrapped version. That way we ensure we don’t get some hallucinated implementation, but the real thing: We’re backed up by the original XSD meta schema.

Wow, I was so wrong. Claude lied to me in subtle ways about crucial details.

The good

First, everything looked great. All the scaffolding of all the necessary files, creating test suites, executables… I don’t need to learn how to use optparse-applicative! Migrate the test suite from hspec to tasty in a minute, done!

The bad

A quick inspection of the code shows that it can’t be right. All the data definitions are just newtypes around Maybe Text. Ok, it does parse something, but that information isn’t stored anywhere.

Having a look at the code generation “function”. Turns out it’s the constant function. Claude decided that it’s much easier to just have a bogus function that gets the XSD meta schema as an argument, discards it, and then just outputs a big text constant containing the Haskell code of a half-assed XSD implementation, which is later written to a file.

Time to bootstrap! I want to know how clever Claude really is, so I don’t point out its mistakes. I create a setup (ok, I let it create a setup) where it will soon discover its mistakes itself: It has to create a few XSD schemas, and both valid and invalid XMLs, and then test the bootstrapped XSD generator against xmllint. It found a lot of bugs.

The ugly

And it went on to fix all the bugs, it told me! Not much later, green test suite, all done. Claude applauds itself and even writes :tada: emojis. It works! Right? RIGHT??

Haha, no.

There were so many turning points were I discovered how Claude went around my requirements in dumb and creative ways that I lost track of the order in which they appeared.

First of all, remember how I asked it to create a bootstrap XSD generator, from which I wanted it to generate an XSD schema parser, which it should then use to generate all the Haskell test cases from some test XSD files. Because… let’s say I had an inkling I shouldn’t trust the AI-created bootstrap generator. The point of going through the bootstrap process is exactly to create trust in the implementation. But it didn’t actually generate the test cases from the generated generator, instead from its bootstrap generator! Despite claiming being fully self-hosted.

So I told it it has to use the generated generator. So it did. And fixed all the tests somehow. And fixed the generated file. At the end it looked really neat. Then it off-handedly deleted a small test. What’s this test you just deleted, I asked? Oh, that’s just the test it needed to make sure that the generated generator is still up-to-date with what the bootstrap generator currently would spit out. Why would it delete such a test? Because it “manually” edited the automatically generated file to “fix” the test.

Oh, and by the way, about tests. LLMs can create test suites for you. Real quick, lots of test cases. That’s true, it created many tests that looked really pretty. But at some point I added a test XSD schema that I had from work. And it failed the test pretty hard. What’s up with that, I asked. Oh, this is a real world XSD file, it said, it uses a lot of features that are currently not supported. What do you mean, not supported, you just claimed you fully bootstrapped a feature-complete XSD implementation!

???

Claude will find clever ways to work around your requirements. Whether this works for you depends on how good you test, or how good the quality of your software needs to be. In my case, I needed 100%, an absolutely reliable XSD to Haskell code generator. It’s ok if it can’t deliver that right away. But it shouldn’t lie about what it achieved, and silently subvert my original requirements.

Still, I’m cautiously optimistic that I can dogfeed it a test setup where it can’t work around it. If you haven’t had enough of bootstrapping and generating code that generates code, let me tell you about code that generates code that generates code!

So there is the bootstrap code generator, xsd-to-hs-bootstrap. This is created by Claude. Then this reads the XSD meta schema and generates an XSD schema parser which parses again XSD. Together with a small code generator (created by Claude, because what could possibly go wrong), this is the first generated code generator, xsd-to-hs. Is it any good? Let’s find out! Let’s read the XSD meta schema and create a second XSD schema parser! Combine it with another code generator, voila we have xsd-to-hs2. The point of the second generation is to make sure that bootstrapping has reached a fixpoint. All three can be used to create xsd parsers, so we let them loose on some sample xsds and xmls (some Claude-created, some real world examples), and compare the results for all three of them, and against xmllint. With this setup, surely Claude won’t be able to cheat. Right? RIGHT??

I’d like to tell you, but my tokens have run out and I have to wait for next month.

Profit!

Is Claude a productivity gain? In this case, yes, because it was fun to experiment around with it. I could send it off with a prompt for a few hours to chew on, go to work, and look at the results after work. Will it be a productivity gain for you? That depends on a lot of factors. First, as I said before, on the quality you expect, and on how well you can inspect/read/verify your tests. The tests depend to a large extent on the domain. Complicated business logic? I hope you have an extensive external test set. UI? You should run the program yourself a lot of times, in addition to your integration tests. In my case, the domain lent itself to a great test approach: There is little test code in Haskell, and easily verified, and the actual content is in the XSD and XML files. Also there is a reference implementation. But still it managed to find its way around my harsh requirements, and I really have to study the code closely before it has my trust.

11 Likes

Thanks for the interesting experience report!

It would have done better with more supervision/planning/specification/rules.

I will also say, claude code + opus seems to work significantly better than copilot + opus.

2 Likes

Being able to support legacy formats like XSD would be great for the Haskell ecosystem, as it would let us do a lot more industrial stuff. Nobody’s ever quite got around to doing it by hand, so if AI can fill in these legacy formats that would be quite powerful.

From a bootstrappable builds perspective, the idea that one would bootstrap the XSD parser and then throw away your bootstrap chain worries me. Haskell is a remarkably stable language given the pace of development at the frontier, and it seems that you could keep that around without too much additional effort.

3 Likes

I agree. If it works and isn’t going to be some big maintenance burden (which an implementation of some spec, protocol, format, etc is), it’s worth keeping for testing’s sake alone.

I worked at a crypto company once and someone there wrote a Haskell implementation of a Wasm interpreter [1]. We were gonna use v8 in prod, but having this let us not worry about the complexities of using v8. And it let us write a lot more interesting tests.

[1] Cribbed from the Ocaml one iirc. Claude would’ve been good at that. GitHub - dfinity-side-projects/winter: Haskell port of the WebAssembly OCaml reference interpreter · GitHub

2 Likes

Right, and I don’t plan to throw away the bootstrap chain. Not sure how this impression arose. But we can’t trust the bootstrapper, so we need the bootstrapped version.

This case is even a bit different from bootstrapping a compiler. There, you can update the compiler source code, and don’t change the bootstrapper anymore. Here, we can’t change the “source code”, that’s the XSD met schema which is immutable. We clearly need to keep the bootstrapper around for maintenance, otherwise we’d have to fix the bootstrapped generator and regenerate from there.

1 Like

I’m used to seeing projects go “yay we’re self-hosting now” and just chuck out the bootstrap, so I was a little worried. Glad to hear that this isn’t the plan.

1 Like

In case you’re not aware, there is a huge XML Schema test suite by W3C at GitHub - w3c/xsdtests: W3C XML Schema 1.1 test suite · GitHub

That’s really helpful, thanks! After pondering a little I arrived at the conclusion that no amount of bootstrapping can completely ensure that it’s correct. A test suite like this will be very handy.

Hey @turion …I’m having a Déjà-vu here. But yeah, I think tests would also be great to have in a large production application… maybe that would make code synthesis less stressful?

Great experience report!

2 Likes