Tuesday, November 12, 2024

We Must Be Cautious With Hallmarks of AI in Student Writing

To the Editor:

In a recent column (“Anatomy of an AI Essay,” Inside Higher Ed, July 2, 2024), Elizabeth Steere described an analysis of AI-generated responses to essay prompts from her courses. While that analysis is valuable, its framing could give false confidence to instructors trying to determine whether a student’s work was AI-generated.

To Dr. Steere’s credit, the column itself does not explicitly suggest that readers use the report to decide whether a specific student assignment was AI-authored. Moreover, in another recent column (“The Trouble with AI Writing Detection,” Inside Higher Ed, October 18, 2023), Dr. Steere discusses the perils of false plagiarism or AI-use allegations and notes that her role is not to “play plagiarism police.” While the new and earlier columns do not directly contradict one another, readers may come away from the newer work with the misguided idea that, armed with a catalog of red flags, they can catch cheating students who present AI-authored work as their own. I want to emphasize that the critique that follows is not about the information Dr. Steere presents; rather, it seeks to discourage hypothetical future misuse of that work.

So, why might readers misuse this catalog of AI red flags? I think there are several intertwined issues. 

First, Dr. Steere writes: “I took note of the characteristics of AI essays that differentiated them from what I have come to expect from their human-composed counterparts.” It sounds as though she enumerated AI hallmarks and then compared their frequency in the AI essays to her recollection of how her human students write in response to similar prompts. This kind of comparison risks confirmation bias, because mistaken beliefs about how often humans use these hallmarks can distort memory. A stronger approach would entail a direct quantitative comparison of AI and human writing. Ideally, such an analysis would yield a clear decision rule for categorizing writing as AI- or human-authored, and that rule would then be tested on novel writing samples, as sketched below.
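
To make that suggestion concrete, here is a minimal sketch in Python of fitting a decision rule on one set of labeled essays and evaluating it on held-out samples. Everything in it is an illustrative assumption rather than a measurement: the hallmark rates (8 versus 5 per 1,000 words), the sample sizes, and the midpoint threshold are all hypothetical.

```python
# Minimal sketch of a quantitative AI-vs.-human comparison with a
# held-out test. All numbers are simulated and illustrative, not
# measurements from any real corpus of essays.
import numpy as np

rng = np.random.default_rng(0)

# Simulate per-essay counts of "red flag" hallmarks per 1,000 words.
# The assumed rates (8 for AI, 5 for human) are hypothetical.
ai_train = rng.poisson(lam=8.0, size=200)     # AI-written training essays
human_train = rng.poisson(lam=5.0, size=200)  # human-written training essays

# Fit a simple decision rule on the training data: classify an essay
# as "AI" when its hallmark count exceeds the midpoint of the means.
threshold = (ai_train.mean() + human_train.mean()) / 2

# Evaluate the rule on novel (held-out) essays, as suggested above.
ai_test = rng.poisson(lam=8.0, size=200)
human_test = rng.poisson(lam=5.0, size=200)

tpr = (ai_test > threshold).mean()     # AI essays correctly flagged
fpr = (human_test > threshold).mean()  # human essays wrongly flagged

print(f"threshold = {threshold:.1f} hallmarks per 1,000 words")
print(f"true positive rate on held-out AI essays:     {tpr:.2f}")
print(f"false positive rate on held-out human essays: {fpr:.2f}")
```

Even in this idealized setup, the overlap between the two distributions guarantees some false positives on the held-out human essays; that error rate is exactly the number an instructor would need to know before trusting any red-flag checklist.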

Second, even if the cataloged red flags can indicate whether essays were written by AI or by Dr. Steere’s human students, it is not clear that those inferences generalize to other groups of students, kinds of writing assignments, or scholarly disciplines. Students with different training and experiences often write in very different ways. One reason automated AI detectors have largely fallen by the wayside is that they are disproportionately likely to flag students writing in a second language as cheating. Arguably, much of academic training consists of socializing students into discipline-specific modes of scholarly communication.

The generalization concern is not trivial, especially if the readers of Inside Higher Ed (faculty from across academic disciplines) try to use Dr. Steere’s analysis in evaluating students. To illustrate, consider what might happen if I used the red flags to identify cheaters in my psychology research methods course.

My students are asked to follow the conventions of APA style, which can lead to awkward constructions and tortured phrases, including the avoidance of first person and the use of passive voice in many contexts. As in many journal articles, sections of their papers are list-like, often repetitive, and include formulaic beginnings and endings to paragraphs. While it is not what I ask of them, in an effort to sound “more scientific,” many students use “big words” they don’t need. As students struggle to read and interpret the primary scientific literature, they often appear to be confidently wrong and rely on analogies and metaphors to understand and communicate what they’ve read. Once they do grasp a new concept, they often speak hyperbolically, in absolute terms, or as if their newfound knowledge sweeps across all contexts instead of being narrowly applicable.

All of these characteristics are red flags identified in Dr. Steere’s analysis. I would speculate that the corpora on which frequently used AI models were trained include a great deal of scientific writing, which would mean that the very hallmarks of cheating with AI can also be the hallmarks of successfully learning a discipline-specific writing style. We need to be cautious in generalizing heuristics for distinguishing AI and human work across contexts.

Finally, reliable group differences may not be informative about individual outcomes (one of many everyday statistical concerns at play here). For example, I know that men are taller than women, on average. But if I am told that someone is 5’8”, I cannot say with any confidence whether that person is a man or a woman. While summary measures of men’s and women’s heights differ, there is substantial overlap in the distributions around those summary measures. Given 100 people standing 5’8”, it is likely that more are men than women, but I would not want to reason from this information about the sex or gender of any individual. Similarly, the AI red flags described by Dr. Steere might turn out to be sufficient to support a statement like “many students in my class of 100 must have used AI,” but that does not mean we have actionable evidence about any one student’s work.
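
The height example can even be made quantitative. The sketch below, again in Python, computes the probability that a 5’8” person is a man under two normal distributions; the means and standard deviations (69.1 and 2.9 inches for men, 63.7 and 2.7 inches for women) are rough figures I am assuming for illustration, as is the 50/50 prior.

```python
# Sketch of the height example: how confidently can we classify one
# 5'8" (68 in.) person? The parameters are illustrative assumptions,
# roughly adult U.S. figures, not exact population values.
from scipy.stats import norm

men = norm(loc=69.1, scale=2.9)    # assumed mean/SD of men's heights (in.)
women = norm(loc=63.7, scale=2.7)  # assumed mean/SD of women's heights (in.)

height = 68.0  # 5'8"

# With a 50/50 prior, Bayes' rule reduces to a ratio of densities.
p_man = men.pdf(height) / (men.pdf(height) + women.pdf(height))
print(f"P(man | height = 5'8\") = {p_man:.2f}")  # roughly 0.75
```

A roughly 75 percent chance is a useful group-level fact, but acting on it for an individual means being wrong about one person in four, odds no one should accept as grounds for an academic integrity charge.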

Dr. Steere’s columns have sought to help us through an academic crisis. I think her work is valuable. As we all struggle to deal with AI in the classroom, many of us have grasped for any possible lifeline. I am concerned that this desperation could lead some to misuse Dr. Steere’s analysis. OpenAI shut down its own AI detection tool because it could not reliably identify AI-generated text. Without strong evidence, we must not delude ourselves into thinking that our own heuristics are any better.

–Benjamin J. Tamber-Rosenau

Assistant professor of psychology, University of Houston
