Last spring, CNN published an article on teachers using generative AI to grade student writing. On social media, a few of my colleagues at other institutions instantly complained (apparently before reading far enough to see that at least one person quoted in the article made the same point) that if students are using AI to write all their papers and teachers are using it to do all the grading, then we might as well just give up on our formal education system entirely.
They’re not wrong. Fortunately, most students aren’t using only AI, and most professors aren’t asking AI to do all their grading. But there’s more to this issue than the potential for an AI circle jerk, and it illustrates a core problem with how we’ve conceptualized writing and grading in higher education, one that we must grapple with as the new academic year begins.
The article describes several professors who are using AI for grading and giving feedback, all of whom seem to be interested in figuring out how to do so ethically and in ways that support their educational mission. I had many of the same questions and have been engaging in many of the same conversations. Last year, I was a fellow at the University of Southern California’s Center for Generative AI and Society, focusing on the impact AI is having on education and writing instruction. My colleague Mark Marino, inspired by Jeremy Douglass’s “perfect tutor” exercise, worked with his students to write several bots, including CoachTutor and ReviewerNumber2, to teach about rubrics and to show how different prompts could result in different kinds of feedback. His initial impression was that CoachTutor’s feedback was very similar to his own, and he offered the bots to the rest of us to try.
I used those bots as well as my own prompts in ClaudeAI and ChatGPT4 to explore the uses and limits of AI-generated feedback on student papers. What I found led me to a very different conclusion from that of the professors cited in the CNN article: While they saw AI as reducing the time it takes to grade effectively by allowing faculty members to focus on higher-level issues with content and ideas, I found that using it creates more problems and takes longer if I want my students to get meaningful feedback rather than just an arbitrary number or letter grade.
Those cited in the article suggested that AI could take over grading certain elements of writing. For instance, a professor of business ethics suggested teachers can leave “structure, language use and grammar” to AI to score while teachers look for “novelty, creativity and depth of insight.”
That separation reflects a very common view of writing in which thought and structure, ideas and language, are distinct from each other. Professors use rubrics to separate those categories, assign points to each one and then add them up—but such a separation is largely arbitrary. The kind of surface-level structures and grammar issues that the AI can assess are also the ones the AI can edit in a student’s writing. But structure and grammar can intertwine with elements like creativity, depth and nuance. Many of my students develop the most interesting, creative ideas by thinking carefully and critically about the language that structures our thought on any given topic. My students can spend half an hour in class working over a single sentence with Richard Lanham’s paramedic method, not because excessive prepositional phrases and passive voice are that important or difficult to reduce, but because focusing on them often reveals deeper problems with the thinking that structured the sentence to begin with.
That is not a problem just with AI, of course. It’s a problem with our grading traditions. Analytic grading with points gives a sense of objectivity and consistency even when writing is far more complex. But if we can’t trust AI to assess novelty or depth of insight because it can’t actually think, we shouldn’t trust the AI to offer nuanced feedback on structure and grammar, either.
Generic in a Specific Way
The problems with assuming a divide between what AI can evaluate and what it can’t are reflected in the results I got when generating feedback on student work. I started by commenting on student papers without AI assistance so that I would not be biased by the results. (Indeed, one of my initial concerns about using AI for grading was that if faculty members are under a time crunch, they will be primed to see only what the AI notices and not what they might have focused on without it.) With student permission, I then ran the papers through several programs to ask for feedback.
When using Mark’s bots, I explained the prompt and my goal for the essay and asked for feedback using the built-in criteria. When using ClaudeAI or ChatGPT, I gave the AI the original prompt for the essay and some context about the paper’s aim, assigned it one of several different roles (a writing professor, a writing center tutor and so on), and asked specifically for feedback that would help a student revise and improve their writing. The AI produced some pretty standard responses: It would ask for more examples and analysis, note the need for stronger transitions, and the like.
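For readers curious what such a request looks like spelled out, here is a minimal sketch of how that setup could be assembled programmatically. It is not how I actually worked (I used the chat interfaces and Mark’s bots), and the model name, role wording and placeholder variables are assumptions for illustration only.

```python
# A minimal sketch of the kind of feedback request described above, assuming
# the OpenAI Python SDK (`pip install openai`) and an OPENAI_API_KEY in the
# environment. The model name, role wording and placeholder variables are
# illustrative; the actual experiments ran through chat interfaces and bots.
from openai import OpenAI

client = OpenAI()

assignment_prompt = "..."   # the original essay prompt given to students
assignment_context = "..."  # what the paper was meant to accomplish
student_essay = "..."       # the student's draft, shared with permission

role = "a writing professor giving formative feedback"

messages = [
    {
        "role": "system",
        "content": f"You are {role}. Offer feedback that helps the student "
                   "revise and improve the writing. Do not assign a grade "
                   "and do not rewrite the paper for the student.",
    },
    {
        "role": "user",
        "content": f"Assignment prompt:\n{assignment_prompt}\n\n"
                   f"Aim of the paper:\n{assignment_context}\n\n"
                   f"Student draft:\n{student_essay}\n\n"
                   "What feedback would help this student revise?",
    },
]

response = client.chat.completions.create(model="gpt-4o", messages=messages)
print(response.choices[0].message.content)
```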
Unfortunately, those responses were generic in a very specific way. It became clear over the course of the experiment that the AI was giving variations on the same feedback regardless of the quality of the paper. It asked for more examples or statistics in papers that didn’t need them. It continually encouraged the five-paragraph essay structure, even though I (like so many other writing professors at the college level) want students to develop arguments that move past that form. When focusing on language and grammar issues, it flattened style and student voice.
Even when I rewrote the prompts to reflect my different expectations, the feedback didn’t change much. AI offered stronger writers conservative feedback rather than encouraging them to take risks with their language and ideas. It could not distinguish, as I have generally learned to do, between a student who was not thinking about structure at all and one who was trying but failing to create a different kind of structure to support a more interesting argument. The AI feedback was the same either way.
Ultimately, the AI responses were so formulaic and conservative that they reminded me of a clip from The Hunt for Red October, where Seaman Jones tells his captain that the computer has misidentified the Red October submarine because when it gets confused, it “runs home” to its initial training data on seismic events. Like the submarine computer, when the AI was presented with something out of the ordinary, it simply found the ordinary within it based on past data, with little ability to discern what might be both new and valuable. Perhaps the AIs were trained on too many five-paragraph essays.
That said, AI is not completely incapable of giving feedback on more complex issues. I could get some reasonable feedback if I prompted it to attend to a specific problem, like “This paper struggles with identifying the specific contribution it is making to the conversation, as well as distinguishing between the author’s ideas and the ideas of the sources the paper uses. How would a writing professor give feedback on these issues?”
Yet asking an AI to respond to an element of a text without alerting it to the fact that there was a problem was often insufficient. In one instance, I ran a student’s essay through multiple AI applications, first asking for feedback on the thesis and structure without saying that there was a problem: The body of the paper and the thesis didn’t line up very well, and while many of the paragraphs had key terms that were related to the thesis in a general way, none of them actually addressed what was needed to support the central claim. The AI didn’t pick up any of that. It wasn’t until I specifically said, “There is a problem with the way the structure and content of the paper’s points support the thesis,” and asked, “What is that problem and how could it be fixed?” that the AI started to produce useful feedback, though it still needed a lot of guidance.
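To make the contrast concrete, here is a rough sketch of the two kinds of requests. The second prompt follows the wording quoted above; the first prompt’s phrasing and the variable names are illustrative.

```python
# The contrast that mattered in practice, sketched as two prompt variants.
# Only the second names the problem up front; its wording follows the prompt
# quoted in the article, while the first variant's phrasing and the variable
# names are illustrative.
student_essay = "..."  # the draft whose body and thesis didn't line up

# Variant 1: unprimed. In these tests this produced generic responses that
# missed the mismatch between the thesis and the body paragraphs.
unprimed_prompt = (
    "Give feedback on the thesis and structure of the following essay.\n\n"
    f"{student_essay}"
)

# Variant 2: the problem is named. Only with this framing did the AI begin
# to produce useful, if still heavily guided, feedback.
primed_prompt = (
    "There is a problem with the way the structure and content of the "
    "paper's points support the thesis. What is that problem and how "
    "could it be fixed?\n\n"
    f"{student_essay}"
)
```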
Upon hearing about this failure across the bots and chat programs, Mark Marino wrote a new bot (MrThesis) focusing specifically on thesis and support. It didn’t do much better than the initial bots until I again named the specific problem. In other words, an AI might be used to help fix problems in an individual piece of student writing, but it is far less effective at identifying any but the most banal problems in the first place.
Skeptical Readers, Skeptical Questions
Over the course of this project, I was forced to spend more time trying to get the AI to produce meaningful feedback tailored to the actual paper than I did just writing that feedback myself on my initial pass. AI isn’t a time saver for professors if we are actually trying to give meaningful reactions to student papers that have complex issues. And its feedback on things like structure can actually do more harm than good if not carefully curated—curation that easily takes as much time as writing the feedback ourselves.
I do believe there are ways to use AI in the classroom for feedback, but they all require a pre-existing awareness of what the problem is. If professors are so crunched for time they need AI to make grading go faster, that reflects bigger issues with our employment and teaching, not the actual skill or accuracy of AI.
Last year, my students struggled with identifying counterarguments to their ideas. Students often lack the facility to think about new topics from other perspectives, because they haven’t fully developed subject matter expertise. So now I teach students to use AI to ask questions from other perspectives. For example, I have them choose paragraphs from their paper and ask, “What would a skeptical reader ask about the following paragraph?” or “What questions would an expert on X have about this paragraph?” After a semester of using such questions with AI, I heard my students echo them in their final peer-review sessions, taking on the role of a skeptical reader and asking their own skeptical questions—and that’s the kind of learning that I want!
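The exercise itself is simple enough to sketch. The question templates below follow the ones I give my students; everything else (the placeholder topic, the little loop) is illustrative rather than any prescribed tool.

```python
# A sketch of the classroom exercise: students choose a paragraph from their
# own draft and ask an AI for questions from another perspective. The two
# question templates follow the ones described above; the loop and the
# placeholder topic are illustrative.
paragraph = "..."  # a paragraph the student selects from their own paper
topic = "..."      # the paper's subject area, filled in by the student

question_prompts = [
    f"What would a skeptical reader ask about the following paragraph?\n\n{paragraph}",
    f"What questions would an expert on {topic} have about this paragraph?\n\n{paragraph}",
]

for prompt in question_prompts:
    # Paste each prompt into whichever chat tool the class is using; the
    # goal is the questions that come back, not a grade or a rewrite.
    print(prompt)
```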
But this is entirely different than the kind of evaluative feedback that comes in the form of a grade. Over the last two years of AI availability, it’s become clear that AI tools reflect back at users the biases of their data sets, programmers and users themselves. Even when we put “rules” in place to protect against known biases, those rules can easily backfire when applied just slightly outside their assumed context—as when Google’s Gemini produced a “diverse” group of four 1943 German soldiers, including one Black man and one Asian woman.
Using AI for grading papers will reflect back not only a lack of genuine critical thinking about student work but also years of biases about writing and writing instruction that have resulted in mechanized writing—biases that professors like me have spent a great deal of time and energy trying to dismantle. Those biases, or the problems with new rules meant to prevent biased results, just won’t be as visible as an AI-generated image staring us in the face.