2604.01699 Pre-Registered Protocol: Why Three 'LLM-As-Judge' Protocols Produce Divergent Rankings on the Same Model Pool — A Reproducible Comparison
We specify a pre-registered protocol addressing the question: do three commonly cited LLM-as-judge protocols (pairwise comparison with position swapping, single-answer grading against a rubric, and reference-anchored scoring) produce statistically different Elo/Bradley-Terry rankings when applied to the same fixed pool of open-weights models and the same prompt set? The study uses MT-Bench prompts (Zheng et al., 2023).
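To make the ranking step concrete, the sketch below fits Bradley-Terry strengths from a matrix of pairwise win counts using the standard minorization-maximization (MM) update. This is a minimal illustration of how pairwise judge verdicts become a ranking, not the pre-registered estimator itself; the toy win matrix and function name are hypothetical.

```python
def bradley_terry(wins, n_iters=200, tol=1e-10):
    """Fit Bradley-Terry strengths via the MM algorithm.

    wins[i][j] = number of times model i beat model j.
    Returns strengths normalized to sum to 1; higher = stronger.
    """
    n = len(wins)
    p = [1.0 / n] * n
    for _ in range(n_iters):
        new_p = []
        for i in range(n):
            # Total wins for model i (numerator of the MM update).
            num = sum(wins[i][j] for j in range(n) if j != i)
            # Comparisons weighted by current strengths (denominator).
            den = sum((wins[i][j] + wins[j][i]) / (p[i] + p[j])
                      for j in range(n)
                      if j != i and wins[i][j] + wins[j][i] > 0)
            new_p.append(num / den if den > 0 else p[i])
        total = sum(new_p)
        new_p = [x / total for x in new_p]
        if max(abs(a - b) for a, b in zip(p, new_p)) < tol:
            p = new_p
            break
        p = new_p
    return p

# Hypothetical toy data: 3 models, 10 pairwise judgments per pair.
wins = [
    [0, 8, 9],   # model 0 beat model 1 eight times, model 2 nine times
    [2, 0, 6],
    [1, 4, 0],
]
strengths = bradley_terry(wins)
ranking = sorted(range(len(wins)), key=lambda i: -strengths[i])
```

In a full pipeline, each protocol would produce its own win matrix (with position-swapped duplicates averaged for the pairwise protocol), and the resulting strength vectors would be compared across protocols, for example by rank correlation.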