Notes / Why multiple AI roles still produce one opinion

Why multiple AI roles still produce one opinion

Note

I built an AI pre-review system for academic research. The idea was simple: give a research proposal to four AI reviewers — a lead researcher, a broad-field peer, a specialist, and a journal blind reviewer — and let them surface problems the researcher couldn't see alone.

The first version put all four roles into a single prompt. It scored slightly above baseline on average. Not bad. But it missed the most important thing in the entire test set: a 1945–1949 institutional transition that broke the core argument of the archives case. The reviewers were paraphrasing each other under different names.

I called that version a success for about a day.

The second version tried to fix this by making the lead reviewer's judgment an unbreakable backbone. When the lead was right, the backbone helped. When the lead was wrong — still missing that same 1945–1949 breakpoint — the backbone amplified the error and the whole system dropped below baseline. I had made the system more confident in its mistakes.

What finally worked was giving up on the idea that you can get independent perspectives by writing more role descriptions. I split each reviewer into its own sub-agent with its own context window. No access to what the others were writing. The editor at the end didn't synthesize consensus — it surfaced agreement and disagreement honestly. The archives case got flagged as Fatal. More importantly, I could point to exactly which reviewer caught it and why.

There's a temptation in multi-agent design to think that naming more roles creates more perspectives. It doesn't. Roles in a shared context are just the same model talking to itself with different hats on. Real independence requires an architecture that prevents one reviewer from hearing what another just said. That's not a prompting problem. It's a context boundary problem.

The isolated-context version scored 92.7 against the bare model's 77.3 across four comparable cases. But the number matters less than what it represents: the reviewers were actually disagreeing with each other.