A recent article in Computerworld argued that the output from generative AI systems, like GPT and Gemini, isn’t as good as it used to be. It isn’t the first time I’ve heard this complaint, though I don’t know how widely held that opinion is. But I wonder: is it correct? And why?
I think a few things are happening in the AI world. First, developers of AI systems are trying to improve the output of their systems. They’re (I would guess) looking more at satisfying enterprise customers who can execute big contracts than at individuals paying $20 per month. If I were doing that, I would tune my model towards producing more formal business prose. (That’s not good prose, but it is what it is.) We can say “don’t just paste AI output into your report” as often as we want, but that doesn’t mean people won’t do it—and it does mean that AI developers will try to give them what they want.
AI developers are certainly trying to create models that are more accurate. The error rate has gone down noticeably, though it’s far from zero. But tuning a model for a low error rate probably means limiting its ability to come up with out-of-the-ordinary answers that we think are brilliant, insightful, or surprising. That ability is useful. When you reduce the standard deviation, you cut off the tails. The price you pay to minimize hallucinations and other errors is minimizing the correct, “good” outliers as well. I won’t argue that developers shouldn’t minimize hallucination, but you do have to pay the price.
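The point about standard deviation can be made concrete with a little arithmetic. The sketch below is only an illustration: it assumes (hypothetically) that answer "quality" follows a zero-mean normal distribution and that anything beyond a fixed threshold counts as an outlier, good or bad. The names `tail_probability` and `THRESHOLD` are made up for this example and say nothing about how any real model is tuned.

```python
import math


def tail_probability(threshold: float, sigma: float) -> float:
    """Two-sided probability that a zero-mean normal variable
    with standard deviation sigma lands beyond +/- threshold."""
    return math.erfc(threshold / (sigma * math.sqrt(2)))


# Hypothetical cutoff: answers further than this from the mean are "outliers".
THRESHOLD = 2.0

# Shrinking sigma while holding the cutoff fixed thins out both tails at once.
for sigma in (1.0, 0.75, 0.5):
    share = tail_probability(THRESHOLD, sigma)
    print(f"sigma={sigma:.2f}  share of outliers={share:.5f}")
```

Because the distribution is symmetric, the same shrinkage that removes the bad outliers removes the good ones in equal measure, which is the trade-off described above.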
The “AI Blues” has also been attributed to model collapse. I think model collapse will be a real phenomenon (I’ve even done my own very non-scientific experiment), but it’s far too early to see it in the large language models we’re using. They’re not retrained frequently enough, and the amount of AI-generated content in their training data is still relatively small, especially if their developers are engaged in copyright violation at scale.
However, there’s another possibility that is very human and has nothing to do with the language models themselves. ChatGPT has been around for almost two years. When it came out, we were all amazed at how good it was. One or two people pointed to Samuel Johnson’s prophetic statement from the 18th century: “Sir, ChatGPT’s output is like a dog’s walking on his hind legs. It is not done well; but you are surprised to find it done at all.” [1] Well, we were all amazed, errors, hallucinations, and all. We were astonished to find that a computer could actually engage in a conversation, and reasonably fluently; that includes those of us who had tried GPT-2.
But now, it’s almost two years later. We’ve gotten used to ChatGPT and its fellows: Gemini, Claude, Llama, Mistral, and a horde more. We’re starting to use them for real work, and the amazement has worn off. We’re less tolerant of their obsessive wordiness (which may have increased); we don’t find them insightful and original (but we don’t really know if they ever were). While it is possible that the quality of language model output has gotten worse over the past two years, I think the reality is that we have become less forgiving.
What’s the reality? I’m sure that there are many who have tested this far more rigorously than I have, but I have run two tests on most language models since the early days:
- Writing a Petrarchan sonnet. (A Petrarchan sonnet has a different rhyme scheme than a Shakespearian sonnet.)
- Implementing a well-known but non-trivial algorithm correctly in Python. (I usually use the Miller-Rabin test for prime numbers.)
The results for both tests are surprisingly similar. Until a few months ago, the major LLMs could not write a Petrarchan sonnet; they could describe a Petrarchan sonnet correctly, but if you asked them to write one, they would botch the rhyme scheme, usually giving you a Shakespearian sonnet instead. They failed even if you included the Petrarchan rhyme scheme in the prompt. They failed even if you tried it in Italian (an experiment one of my colleagues performed). Suddenly, around the time of Claude 3, models learned how to do Petrarch correctly. It gets better: just the other day, I thought I’d try two more difficult poetic forms: the sestina and the villanelle. (Villanelles involve repeating two of the lines in clever ways, in addition to following a rhyme scheme. A sestina requires reusing the same rhyme words.) They could do it! They’re no match for a Provençal troubadour, but they did it!
I got the same results asking the models to produce a program that would implement the Miller-Rabin algorithm to test whether large numbers were prime. When GPT-3 first came out, this was an utter failure: it would generate code that ran without errors, but it would tell me that numbers like 21 were prime. Gemini was the same—though after several tries, it ungraciously blamed the problem on Python’s libraries for computation with large numbers. (I gather it doesn’t like users who say “Sorry, that’s wrong again. What are you doing that’s incorrect?”) Now they implement the algorithm correctly—at least the last time I tried. (Your mileage may vary.)
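For readers who haven’t met the algorithm, here is a minimal sketch of what a correct answer to this test might look like. It is not the code any of the models produced, and the function name `is_probable_prime`, the `rounds` parameter, and the choice of small primes to check first are my own assumptions for illustration.

```python
import random


def is_probable_prime(n: int, rounds: int = 20) -> bool:
    """Miller-Rabin probabilistic primality test (illustrative sketch)."""
    if n < 2:
        return False
    # Settle small primes and their multiples by trial division.
    for p in (2, 3, 5, 7, 11, 13):
        if n == p:
            return True
        if n % p == 0:
            return False
    # Write n - 1 as d * 2**s with d odd.
    d, s = n - 1, 0
    while d % 2 == 0:
        d //= 2
        s += 1
    for _ in range(rounds):
        a = random.randrange(2, n - 1)  # random base in [2, n - 2]
        x = pow(a, d, n)                # modular exponentiation
        if x in (1, n - 1):
            continue                    # this base does not expose n
        for _ in range(s - 1):
            x = pow(x, 2, n)
            if x == n - 1:
                break
        else:
            return False                # a is a witness: n is composite
    return True                         # no witness found: probably prime
```

With this sketch, `is_probable_prime(21)` returns `False`, since 21 is caught by the trial division step. A composite number that gets past that step is wrongly reported as prime with probability at most 4 to the power of minus `rounds`.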
My success doesn’t mean that there’s no room for frustration. I’ve asked ChatGPT how to improve programs that worked correctly, but that had known problems. In some cases, I knew the problem and the solution; in some cases, I understood the problem but not how to fix it. The first time you try that, you’ll probably be impressed: while “put more of the program into functions and use more descriptive variable names” may not be what you’re looking for, it’s never bad advice. By the second or third time, though, you’ll realize that you’re always getting similar advice and, while few people would disagree, that advice isn’t really insightful. “Surprised to find it done at all” decayed quickly to “it is not done well.”
This experience probably reflects a fundamental limitation of language models. After all, they aren’t “intelligent” as such. Until we know otherwise, they’re just predicting what should come next based on analysis of the training data. How much of the code in GitHub or on StackOverflow really demonstrates good coding practices? How much of it is rather pedestrian, like my own code? I’d bet the latter group dominates—and that’s what’s reflected in an LLM’s output. Thinking back to Johnson’s dog, I am indeed surprised to find it done at all, though perhaps not for the reason most people would expect. Clearly, there is a lot on the internet that is not wrong. But there’s a lot that isn’t as good as it could be, and that should surprise no one. What’s unfortunate is that the volume of “pretty good, but not as good as it could be” content tends to dominate a language model’s output.
That’s the big issue facing language model developers. How do we get answers that are insightful, delightful, and better than the average of what’s out there on the internet? The initial surprise is gone and AI is being judged on its merits. Will AI continue to deliver on its promise or will we just say “that’s dull, boring AI,” even as its output creeps into every aspect of our lives? There may be some truth to the idea that we’re trading off delightful answers in favor of reliable answers, and that’s not a bad thing. But we need delight and insight too. How will AI deliver that?
Footnotes
[1] From Boswell’s Life of Johnson (1791); possibly slightly modified.