LLMs Show Weak Agreement With Human Essay Graders, Favor Short Work Over Longer Pieces With Minor Errors
A preprint study (arXiv:2603.23714) finds that LLMs from the GPT and Llama families show weak agreement with human essay graders, over-scoring short essays and under-scoring longer ones with minor errors, though internal consistency between the models' scores and their feedback remains intact.
Researchers evaluating GPT and Llama family models in out-of-the-box settings — without task-specific training — found that LLM-generated scores show relatively weak agreement with human grades, with variation tied to essay characteristics, according to a preprint published on arXiv (arXiv:2603.23714v1). Specifically, compared to human raters, LLMs tend to assign higher scores to short or underdeveloped essays while penalizing longer essays that contain minor grammatical or spelling errors. The study does not identify which specific GPT or Llama model versions were tested beyond family-level designation.
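The summary does not state which agreement metric the preprint uses. As a hedged illustration only, the sketch below computes quadratic weighted kappa (QWK), a metric commonly used to compare automated and human essay scores; the rubric scale and all score values are hypothetical.

```python
# Illustrative sketch only: the preprint's exact agreement metric is not
# stated here. Quadratic weighted kappa (QWK) is a common choice for
# essay-scoring agreement, so it is assumed for demonstration purposes.
from sklearn.metrics import cohen_kappa_score

# Hypothetical integer scores on the same rubric scale (e.g., 1-6).
human_scores = [4, 3, 5, 2, 4, 5, 3, 2]
llm_scores   = [5, 4, 4, 3, 3, 5, 4, 3]  # pattern: LLM over-scores some essays

qwk = cohen_kappa_score(human_scores, llm_scores, weights="quadratic")
print(f"Quadratic weighted kappa: {qwk:.2f}")  # values near 0 indicate weak agreement
```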
Despite the misalignment with human graders, the researchers found LLM scoring behavior to be internally consistent: models that issued more praise in their feedback assigned higher scores, while those that generated more criticism assigned lower scores, a coherent pattern even where it diverges from human standards, per the arXiv preprint.
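How the preprint quantifies praise and criticism is not described in this summary. A minimal sketch of such a consistency check, assuming simple keyword counts over the feedback text and hypothetical model-assigned scores, might look like this:

```python
# Minimal sketch of an internal-consistency check, assuming praise and
# criticism are measured by keyword counts; the preprint's actual feature
# extraction is not described in this summary.
import numpy as np

feedback = [
    "Clear thesis and good examples, though the conclusion is brief.",
    "Weak organization and poor support; many unclear sentences.",
    "Strong argument, good flow, well chosen evidence.",
]
scores = np.array([4, 2, 5])  # hypothetical scores the model assigned

PRAISE = {"clear", "good", "strong", "well"}
CRITICISM = {"weak", "poor", "unclear", "brief"}

def count_terms(text, vocab):
    # Count how many words from `vocab` appear in the feedback text.
    words = text.lower().replace(",", " ").replace(";", " ").replace(".", " ").split()
    return sum(w in vocab for w in words)

praise_counts = np.array([count_terms(t, PRAISE) for t in feedback])
crit_counts = np.array([count_terms(t, CRITICISM) for t in feedback])

# A positive correlation with praise and a negative one with criticism would
# reflect the coherent score/feedback pattern reported in the study.
print("praise vs score:", np.corrcoef(praise_counts, scores)[0, 1])
print("criticism vs score:", np.corrcoef(crit_counts, scores)[0, 1])
```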
The authors conclude that LLM-generated scores and feedback follow patterns that are coherent but distinct from those of human raters, and that while full alignment with human grading remains limited, LLMs can still be used reliably in a supporting role for essay scoring. The findings raise questions about deploying such systems in educational settings without task-specific calibration.
AXIOM: This means students could get unfairly dinged by AI graders for writing more but slipping up a bit, while short and simple work sails through, making it harder for regular kids to show real effort in school. We're still a ways from trusting AI to judge quality the way humans do.
Sources (1)
- [1] LLMs Do Not Grade Essays Like Humans (https://arxiv.org/abs/2603.23714)