Something strange happened in a Stanford Law study this week. Law professors were handed stacks of anonymized answers to student questions. They didn't know which answers came from an AI and which came from their own colleagues. Three times out of four, they picked the machine.
The study, led by Stanford's Julian Nyarko and published Monday, put 16 law professors from across the country through a blind evaluation of nearly 3,000 head-to-head comparisons. The subject: contract law, deliberately chosen because it has no single "right answer." Contract questions demand synthesis of competing arguments, navigation of ambiguity, a defensible conclusion. The kind of thing legal education is supposed to teach.
The result: AI responses won 75% of matchups. More striking: professors flagged AI answers as "potentially misleading or harmful" just 3.5% of the time. For peer-written answers, that number was 12%. Human responses were more than three times as likely to be judged damaging to a student's understanding.
How the Study Worked
The design matters here. This wasn't a multiple-choice test or a bar exam simulation. The 40 questions used in the study were the kind a student might ask after class or during office hours: messy, contextual, requiring judgment. Things like "does this clause create an enforceable obligation given these specific facts?"
Sixteen professors from Stanford, Yale, NYU, University of Chicago, and other institutions each wrote answers to the questions. Then LLMs (including commercial tutoring tools and Google's NotebookLM) generated their own responses, calibrated to match the length and structure of the human versions. All answers were anonymized, randomized, and evaluated blindly by the same group of professors.
The researchers applied multiple scoring methods and tested a range of AI systems. Performance varied from system to system, but even when a model was hampered by limited context, evaluators still frequently preferred its answers to the human-written ones.
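For readers who think in code, the blind head-to-head setup described above can be sketched roughly as follows. This is an illustration only, not the study's actual pipeline: the `Answer` class, `blind_pair`, and `win_rate` are hypothetical names, and the real protocol (length calibration, multiple scoring methods) is more involved.

```python
import random
from dataclasses import dataclass


@dataclass
class Answer:
    source: str  # "ai" or "professor"; never shown to the evaluator
    text: str


def blind_pair(ai: Answer, prof: Answer, rng: random.Random):
    """Strip the labels and randomize the order of one AI/professor pair.

    Returns what the evaluator sees ("A"/"B" -> text) and a separate
    key ("A"/"B" -> source) used only afterwards to unblind the choice.
    """
    pair = [ai, prof]
    rng.shuffle(pair)  # random position, so order carries no signal
    shown = {"A": pair[0].text, "B": pair[1].text}
    key = {"A": pair[0].source, "B": pair[1].source}
    return shown, key


def win_rate(picks: list[str], source: str = "ai") -> float:
    """Share of comparisons in which the preferred answer came from `source`."""
    if not picks:
        raise ValueError("no comparisons recorded")
    return sum(1 for p in picks if p == source) / len(picks)


# Hypothetical usage: one comparison, evaluator picks "A" without seeing the key.
rng = random.Random(0)
shown, key = blind_pair(Answer("ai", "AI answer..."), Answer("professor", "Prof answer..."), rng)
picks = [key["A"]]
```

The point of keeping `key` separate from `shown` is that the evaluator's choice is recorded against a neutral label and only mapped back to its source when the results are tallied.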
The Numbers Worth Paying Attention To
| Metric | AI Answers | Professor Answers |
|---|---|---|
| Win rate in blind comparisons | 75% | 25% |
| Flagged as harmful | 3.5% | 12% |
| Blind head-to-head comparisons (total) | 2,918 | 2,918 |
| Professors participating | — | 16 |
| Institutions represented | — | 6+ |
The harm-flagging gap is the number that sticks with me. It's one thing to say AI can produce competent legal writing. But human professors being more than three times as likely to produce something their peers flag as potentially misleading or harmful? That's not an AI story anymore. That's a story about how inconsistent human teaching actually is.
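The arithmetic behind that gap is simple enough to check by hand. A quick sketch using only the two flag rates reported above (the variable names are mine):

```python
# Harm-flag rates as reported in the study summary above.
ai_flag_rate = 0.035    # 3.5% of AI answers flagged
prof_flag_rate = 0.12   # 12% of professor answers flagged

# Relative gap: how many times more often professor answers were flagged.
ratio = prof_flag_rate / ai_flag_rate

# Absolute gap, in percentage points.
gap_points = (prof_flag_rate - ai_flag_rate) * 100

print(f"{ratio:.1f}x as likely, a gap of {gap_points:.1f} percentage points")
# prints "3.4x as likely, a gap of 8.5 percentage points"
```

So "three times" slightly undersells it: the ratio is closer to 3.4.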
Community Reaction
On Hacker News, the reaction was split. One commenter called it "anti-intellectual nonsense," arguing the study was designed by an "AI professor for law" whose work inevitably confirms his own existence. Others pointed to the paper's careful methodology (2,918 blind comparisons, multiple scoring methods, answers calibrated for length) as evidence the results deserve attention.
The researchers themselves are measured. Nyarko said the group is "not advocating for wholesale adoption of AI tutors," but that "our data suggests that blanket skepticism may be equally unwarranted." Co-author Sarath Sanga of Yale Law School put it differently: "What we wanted to know is whether AI can meet the latent professional standard that lawyers use to evaluate each other's arguments. In this case, the answer was yes."
The timing is interesting. UC Berkeley Law announced a complete ban on AI use starting summer 2026, just as this study dropped. Legal education is clearly in the middle of an unresolved argument about what role these tools should play.
What This Actually Means
The authors are careful to separate two questions, quality and deployment, and they've only answered the first. AI can produce legal answers that experts prefer. Whether and how to use that capability in the classroom is a completely different conversation.
Here's what I think matters most: this wasn't tested on uncontested factual recall or multiple-choice trivia. Contract law was chosen because it resists answer keys. And the AI still won. That suggests something real is happening with how these models handle reasoning in domains without clear right answers.
The skeptical take is also fair: preference is not the same as learning. Professors preferring an AI-written answer doesn't mean students learn more from it. And there's a real risk that polished AI answers make students less likely to wrestle with ambiguity themselves, which is the whole point of legal education.
But the data is what it is. If you're teaching law and dismissing AI as a toy that can't handle nuance, this study suggests you should update that view. And if you're building AI tutoring tools for judgment-heavy fields, this is a signal that the market is real.
The question is no longer whether AI can meet expert standards in law. It can. The question is what we do with that fact. Building the pedagogical layer (the part that doesn't just answer questions but actually teaches) is still mostly unsolved. And that's where the real work begins.