Yet another opinion on LLMs · Hasufell's blog

35 Likes

I think you are largely underestimating the power of these tools.

The problem is IMHO the aggressive marketing that sells them as magic wands that solve all the problems instantly. You give it a prompt, it does stuff.

The truth is, if you want to get useful results you need to sweat. These are tools that require training and experience, just like any other tool. Whether using them is a good idea is another discussion. But in my experience you won’t get satisfactory results until you are good at working with LLMs, which can take a long time.

These tools (especially the most advanced models) can be incredibly helpful, not just for mechanical tasks. They can also help with reasoning, find subtle bugs, analyze data to spot patterns and correlations, brainstorm… Your post dismisses all these capabilities a little too hastily.

11 Likes

Yesterday I was investigating a bug in the GHC bindist’s configure script and ran across this m4 macro:

# FPTOOLS_OVERRIDE_PLATFORM_FROM_BOOTSTRAP(platform,Platform)
# ----------------------------------
# Per the comment in FPTOOLS_OVERRIDE_PLATFORM_FROM_BOOTSTRAP's body, we
# need to sometimes replace inferred platforms with the bootstrap
# compiler's target platform.
AC_DEFUN([FPTOOLS_OVERRIDE_PLATFORM_FROM_BOOTSTRAP],
[
    if test "$bootstrap_target" != ""
    then
        $1=$bootstrap_target
        echo "$1 platform inferred as: [$]$1"
    else
        echo "Can't work out $1 platform"
        exit 1
    fi

    $2[Arch]=`echo "[$]$1" | sed 's/-.*//'`
    $2[Vendor]=`echo "[$]$1" | sed -e 's/.*-\(.*\)-.*/\1/'`
    $2[OS]=`echo "[$]$1" | sed 's/.*-//'`
])

And the call site:

    if test "$host_alias" = ""
    then
        FPTOOLS_OVERRIDE_PLATFORM_FROM_BOOTSTRAP([host], [Host])
    else
        GHC_CONVERT_PLATFORM_PARTS([host], [Host])
    fi

So if you don’t pass --host to configure, it infers the host platform from ghc +RTS --info. But then it doesn’t synchronize the result with the autoconf variables host_os, host_cpu and host_vendor, leading to subtle issues.
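Concretely (my own illustration, not taken from the GHC sources), given a bootstrap target such as x86_64-unknown-linux, the three sed calls in the macro split it into:

    HostArch   = x86_64
    HostVendor = unknown
    HostOS     = linux

while the autoconf-level host_cpu, host_vendor and host_os are left unsynchronized, which is exactly the issue described above.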

My first version of a patch had a bug:

--- a/m4/fptools_set_platform_vars.m4
+++ b/m4/fptools_set_platform_vars.m4
@@ -89,6 +89,13 @@ AC_DEFUN([FPTOOLS_OVERRIDE_PLATFORM_FROM_BOOTSTRAP],
     $2[Arch]=`echo "[$]$1" | sed 's/-.*//'`
     $2[Vendor]=`echo "[$]$1" | sed -e 's/.*-\(.*\)-.*/\1/'`
     $2[OS]=`echo "[$]$1" | sed 's/.*-//'`
+
+       # synchronize with the autotools variables, e.g.
+       # target_cpu = TargetPlatform
+       # etc.
+       $1[_cpu]=$2[Arch]
+       $1[_vendor]=$2[Vendor]
+       $1[_os]=$2[OS]
 ])

 # FPTOOLS_SET_PLATFORM_VARS(platform,Platform)

Kudos if you can spot it right away. I couldn’t, so I went ahead and asked Gemini what the problem with this patch was.

It kept telling me that $1[_cpu] doesn’t work, but $2[Arch] is fine on the left-hand side of these assignments, suggesting it’s actually the underscore that trips up the parser or whatever.

Here’s an excerpt from the prompt: gemini.md · GitHub

This was utter nonsense. It also kept suggesting obscure solutions that involve eval and other things. It clearly doesn’t know basic m4 syntax and has no way of actually reasoning about it.

The problem with the patch was simply a missing escaped dollar sign: instead of $1[_cpu]=$2[Arch] I needed $1[_cpu]=[$]$2[Arch]. My brain was on autopilot, because I hate m4 and didn’t want to think about it. Then I wasted 15 minutes of my time arguing with Gemini and demanding an explanation of why it thought the underscore was the issue. It was extremely confident about it.
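Spelled out, the corrected assignments would look like this (the _cpu line is the fix described above; the _vendor and _os lines follow the same escaping):

    $1[_cpu]=[$]$2[Arch]
    $1[_vendor]=[$]$2[Vendor]
    $1[_os]=[$]$2[OS]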

Yes, people keep telling me that I simply don’t know how to use LLMs correctly.

My guess is that most users would just have copy-pasted the obscure code from Gemini and moved on, instead of questioning why it came up with said code. And maybe it would even have worked.

No, I don’t think there’s any depth to interacting with LLMs.

I don’t think so, because I provide criteria for what is “good use of LLMs”:

  1. when you can trivially verify the output, or
  2. when you have superior domain knowledge, or
  3. when you don’t need precise answers

In my experience though, this is not how people use them, because this indeed takes effort. And then the gains are not as big anymore.

15 Likes

that’s a very niche example with non-local effects (code generation), IME precisely the type of stuff LLMs are worst at.

2 Likes

heh it’s just programming with extra steps

I think we can agree on these two points.

So I’m coming from a position of extreme scepticism towards LLMs. This has been my stance for months now, after using older models last year. However, last week I started testing Claude Opus 4.6 for my daily work (Haskell) and my views have changed drastically.

I was tasked with writing a prototype for an internal server-like service. We allocated a few weeks for this task. I defined the domain types and wrote the business logic by hand, leaving all the glue code and other boilerplate to Claude. I wrote one handler by hand and told it to mimic that style in the others. It’s such a mind-numbing task that it would probably have taken my GenZ brain hours to force through, and it took just a couple of minutes. I told it to add chunking and streaming. It did that. I spotted quadratic complexity in a chunking algorithm. It fixed it. I suggested another data structure. It did that. I told it to write a CLI client, then to use this client to write HSpec tests. I reviewed those tests, found some nonsensical examples it struggled to make pass, and fixed those by hand.

After a few days I had a working prototype with client and tests for something we had allocated weeks to. And honestly, for me it brought joy back to programming. I can concentrate on specifying exactly what I want to do and how I want it done, bring my domain knowledge to the places that really need it (business logic and test cases), and leave the boring parts to the LLM, so I can read some papers or participate in meetings while it does its thing.
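To give a flavour of the quadratic-complexity kind of bug (a purely illustrative sketch, not the actual code from that prototype): appending each new chunk to the end of an accumulator retraverses everything built so far, while simply consing the chunks is linear.

    -- Purely illustrative, not the prototype's code. The accumulator version
    -- retraverses every chunk built so far on each append, so it is quadratic
    -- in the number of chunks. Both versions assume n > 0.
    chunksQuadratic :: Int -> [a] -> [[a]]
    chunksQuadratic n = go []
      where
        go acc [] = acc
        go acc xs = let (c, rest) = splitAt n xs
                    in go (acc ++ [c]) rest

    -- The linear alternative: cons each chunk and recurse.
    chunksLinear :: Int -> [a] -> [[a]]
    chunksLinear _ [] = []
    chunksLinear n xs = let (c, rest) = splitAt n xs
                        in c : chunksLinear n rest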

Another example: when writing e2e tests I found a serialization issue in a C++ library that we own but that I know nothing about. It bothered me enough that I decided to debug it. After an hour or so I thought, well, why not ask my advanced matrix multiplier to look into it? After a few back-and-forths, once I gave it a reproducer, it found two fields deep in the structure that were missed during serialization. It took 30 minutes and saved me hours of work, again.

If you haven’t yet, you should try the newest models. It’s been a huge leap since last year. That said, I don’t think engineering is dead, at all. I’m actually worried about a new wave of “AI-first” new grads who maybe have never written a line of code themselves. If you have never written your own production code, how do you know what good code should look like? Your “creation” will inadvertently turn into massive spaghetti slop. You need to be a harsh reviewer, and then the LLM will do well. Also, domain knowledge is as important as ever. Maybe there will be less work for “pure coders”, but domain experts will still be very valuable.

11 Likes

Interesting experience report.

I wonder… it sounded like your task involved a lot of repetition. In my opinion the usual Haskell way to solve this would be to find an abstraction, and I don’t think that current models are good at that. Have you ever used one to find or build an abstraction?

4 Likes

I developed some abstractions for the glue code I mentioned, but it still needed a fair bit of boilerplate. The main takeaway is probably the feasibility of writing tests. I usually find it a huge pain, it’s an incredible amount of boilerplate that makes my brain melt, and one can largely offload it to a supervised LLM. It all depends on the project, I guess. There’s one project of mine I would not really let an LLM touch, apart maybe from small refactors. But things like tests or servers are a perfect fit.

5 Likes

I’ve been using Claude intensively for a few weeks to (among other things) develop my LLM library. Where it has been tremendously successful is, for example, adding new models according to a similar pattern. This isn’t something (I think) that’s cleanly abstractable, since while models mostly follow the same pattern, when you get into the details each model and provider can have special cases, differently supported features, and quirks that need to be worked around. Claude has been quite successful at copying existing patterns, running the existing tests, and (mostly) correctly interpreting the results and adjusting the model’s handler and features.

That said, it’s working quite well for general Haskell development too, but all it really does is save you from typing out the actual code or writing the most obvious implementations; you still have to come up with the right abstractions, types, and structure for your program yourself.

6 Likes

This is a good article and kind of hits on my opinions of LLMs. Of course I’m a bit of an AI hater lol, so YMMV.

3 Likes

@hasufell which models are you using?

1 Like

Mostly:

  • Claude Sonnet 4.6 via claude.ai
  • GPT-4.1 sometimes via GitHub
  • Gemini 3 via Google

1 Like

OK. I’ve had good results with ChatGPT-5.4 Thinking and Claude Opus 4.6. GPT-4.1 in particular is almost a year old…

I have found them excellent for understanding new technologies (especially how to use package X with library Y in situation Z), general planning, design, and architecture, debugging and analysis (e.g. “why is this SQL query so slow and what are some alternative approaches”), code review, and a certain amount of code writing on existing projects. Honestly, completely indispensable: like a junior-to-mid developer who nevertheless has deep experience with every technology and inexhaustible patience. That said, I haven’t yet tried “vibe coding”, more complicated code creation, or the multi-agent approach that I’ve heard some people use.

My experiences just don’t resemble yours at all. They’re perfectly capable of adapting to any situation. They are excellent at understanding existing work, both code and documentation. They do make occasional mistakes, but they are typically the kind of mistakes that humans make. I don’t know what to tell you…

2 Likes

Interesting.

I’m using those tools quite regularly, in fact. Yesterday I used Claude to update my nginx config to block Claude bots (lol). Today I used it to set up an OpenBSD VM under bhyve on my FreeBSD machine.

In both cases it would end up guessing things randomly and I constantly had to intervene and point out that:

  • the config syntax is incorrect
  • the command doesn’t exist
  • the suggestion makes no sense

To me it seems they struggle with basic things like going through a manpage and a blog post and then assembling commands that work. Sure, you can wiggle it until it works. In those cases, again, I have the opportunity to test the output. I’m not entirely sure it saved me any time. I guess it’s hard to set up a proper experiment.

Now, as you say… you’re using it for learning new things and this is where we probably disagree fundamentally. As soon as I ask it to explain complex or abstract things that I don’t understand myself… I have no easy means to verify how accurate the information is.

I’ve not seen any suggestion so far on how to deal with this unknown. And given my experience, I don’t think these models produce only occasional mistakes or inaccuracies.

4 Likes

hahaha this HN comment about the Google AI summary sums it up:

For most queries the AI summary works. If it doesn’t, oh well.

Yesterday I looked up how much a certain car weighs. The AI summary gave me an answer. It may be the wrong one, but I looked it up out of curiosity, so it doesn’t ultimately matter.

1 Like

I honestly think the difference is the models we’re using. The new things I learn from them are “how to do things” – I then do the things and they work, so I am directly verifying how accurate the information is. They do occasionally make mistakes, but it’s usually pretty simple things.

On the other hand, I haven’t tried getting them to e.g. come up with new abstractions in a domain, or do larger-scale creation – that might be beyond their capabilities.

1 Like

idk I have unlimited usage at work so I just use Opus and it immediately hallucinates [1] if I have it generate more than like a function. Which isn’t useless - it’s a husk I can fill with value. Saves me some typing of the general boilerplate and imports. Not a bad use of like 50c.

Also if I do keep it small, the functions it generates are pretty bad? So much case splitting - I would ask a junior to improve it before merge. Instead with Claude, I improve it and have Claude tell me how and why my Haskell is better than its suggestion. Opus is good at that!

inb4 the general follow-ups of all the stuff I now have to do to get this thing to maybe be as useful as my bare hands and brain in emacs and ghci lolol

1 Like

Ditto on hallucinations: it is so easy to lead it into spewing garbage. I can tell that it is garbage because I have a lot of experience in, e.g., that particular domain, but ye gods, the ease with which it generates something that to a layman sounds valid and even ‘has results’ in a cursory search, when all it’s really done is blend two fields because it’s confused one homonym in one field for another. It gives me pause.

It isn’t so much that it can’t generate useful and correct output (it can, especially for tasks that have been done before); it’s that it first and foremost generates plausible output. So I see the difficulty of verifying the correctness of the output as the main inhibition to using AI, distinct from the moral issues of how these models were trained.

So, they are best used like a smart autocomplete for rote syntactic transformations, organization and search, and nomenclature suggestions, rather than for deep pontificating. I swear it spends half its energy generating empty platitudes about how amazing my approach is and how deep my questions are, because at their core those corporate AIs are not designed to answer your questions; they are designed to keep you using them, and making you feel like they answered your question is how they go about doing that, regardless of what the output actually was.

But for those limited tasks, where they cannot escape into some imaginary hallucinated domain, they are very helpful.

5 Likes