Eventually, the Steam Drill Always Wins: “Law Professors Prefer AI Over Peer Answers”

Prof. Bradypus Tridactylus. Credit: Marshall, *Annales du Muséum national d’histoire naturelle*, via Wikipedia.

From a draft by Stanford law professor Julian Nyarko and others:

We conducted a blinded evaluation of short-answer tutoring in contracts courses with sixteen U.S. law professors. Participants created 40 representative questions, wrote answers, and judged 2,918 anonymized comparisons between human and LLM responses. Professors rated LLMs far higher than their peers (average win rate = 75.33%), with models performing similarly to the best instructor. LLM responses were also rarely flagged as harmful (3.53%, vs 12.06% for professors). Preferences for LLM answers were consistent across evaluators and reflected shared professional standards….

Sixteen contracts professors from fourteen U.S. law schools—who all use the same casebook to teach the material—authored questions representative of those asked during office hours. From this pool we curated 40 representative questions spanning four instructional categories (Recall: Case or Code, Recall: Doctrine, Hypotheticals, Policy).

Recall questions—whether relating to a case, code or doctrine—tend to be amenable to answers which can be evaluated against a ground truth, and where argumentative strength is of little importance. In contrast, hypotheticals present a short set of facts and ask how the law should be applied. Together with policy questions, which often center on legal or policy design under heterogeneous preferences, providing a strong answer in this category often relies on displaying careful reasoning, weighing competing arguments and other latent, professional standards of quality—even if the relevant doctrine is now settled.

In a second step, each professor wrote short answers to a subset of the 40 questions. … In a third step, we conducted blinded, forced-choice comparisons in which professors judged anonymized pairs of answers written either by their colleagues or by two LLMs. Among the different model families, we opted for Google’s models because, at the time, Google made explicit efforts to optimize its models for the educational context. Consequently, we included a stock version of Gemini 2.5 Pro and a retrieval-augmented NotebookLM with access to the casebook. Preference rankings have been shown to be a particularly effective method for ranking unstructured, open-text responses, thus yielding advantages over more common, rubric-based evaluations, especially where quality is a more elusive concept…
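To make the "win rate" figures quoted in this excerpt concrete, here is a minimal Python sketch of how forced-choice judgments could be tallied. It is not the authors' code. The record layout, the author labels (`gemini`, `notebooklm`, `prof_01`, `prof_02`), and the sample judgments are all hypothetical, and the choice to average per-instructor rates rather than pool every comparison is an assumption, since the excerpt does not say which the draft uses.

```python
from collections import defaultdict

# Hypothetical judgments: (author of answer A, author of answer B, preferred author).
# These rows are illustrative only and are not data from the study.
judgments = [
    ("gemini", "prof_01", "gemini"),
    ("gemini", "prof_02", "prof_02"),
    ("notebooklm", "prof_01", "notebooklm"),
    ("gemini", "prof_01", "gemini"),
    ("notebooklm", "prof_02", "notebooklm"),
]

LLMS = {"gemini", "notebooklm"}  # assumed labels for the two models


def llm_win_rates(records):
    """Return {llm: {instructor: win_rate}} over LLM-vs-human pairs."""
    wins = defaultdict(lambda: defaultdict(int))
    totals = defaultdict(lambda: defaultdict(int))
    for a, b, winner in records:
        # Skip human-vs-human and LLM-vs-LLM pairs.
        if (a in LLMS) == (b in LLMS):
            continue
        llm, human = (a, b) if a in LLMS else (b, a)
        totals[llm][human] += 1
        if winner == llm:
            wins[llm][human] += 1
    return {
        llm: {human: wins[llm][human] / n for human, n in opponents.items()}
        for llm, opponents in totals.items()
    }


def average_win_rate(per_instructor):
    """Unweighted mean of per-instructor win rates (an assumed convention)."""
    return sum(per_instructor.values()) / len(per_instructor)


rates = llm_win_rates(judgments)
for llm, per_instructor in rates.items():
    print(llm, per_instructor, average_win_rate(per_instructor))
```

Averaging per instructor keeps a heavily sampled instructor from dominating the headline number. Pooling all comparisons would weight each judgment equally instead, and the two conventions can give different results when comparison counts are uneven.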

To probe whether any LLM advantage might be driven by surface-level writing style rather than substantive content, we additionally engineered a set of lexico-syntactic features—answer length, structural organization, reasoning nuance, legal anchors, confidence tone, clarity, and pedagogical support—and tested how much of the preference pattern they could explain. Each professor completed approximately 150–200 pairwise evaluations, selected the better answer, and could flag any answer as pedagogically “harmful” {[i.e.,] likely to mislead or hinder learning}.

We present four main findings. First, LLMs meet—and often exceed—the professional standard as defined by expert preference. Gemini 2.5 Pro outperformed all but one instructor in head-to-head comparisons (average win rate against all instructors = 75.92%), though the difference between Gemini and the better-ranked instructor was not statistically significant. NotebookLM, by contrast, outperformed every human instructor, with one tie (average win rate = 74.75%).

Second, the LLM advantage was similar across all question categories.

Third, harmfulness rates for LLMs were low (Gemini 3.41%, NotebookLM 3.64%), compared to the wider dispersion among professors (1.00–39.75%), underscoring that the risk of pedagogically problematic responses is comparable to that of the best human instructors. When evaluating peer-written answers, each professor on average preferred LLM responses over responses generated by human instructors, suggesting that model outputs were not merely appealing to a particular subset of evaluators.

Fourth, the engineered textual features explain only part of the LLM advantage: in calibration analyses, observed LLM win rates systematically exceed the win rates predicted from lexico-syntactic differences alone, indicating that the preference for LLM answers is not reducible to length, clarity, or other stylistic markers.

The post Eventually, the Steam Drill Always Wins: "Law Professors Prefer AI Over Peer Answers" appeared first on Reason.com.

