What is LLM-as-a-judge?

LLM-as-a-judge means using a strong language model to score, rank, or compare other models' answers on open-ended tasks where there is no single correct string to match. The judge is given the question, the candidate answer or answers, and a rubric, and returns a rating or a winner. Zheng et al. named and systematically validated this framing in 2023.

Is an LLM judge as accurate as a human evaluator?

In the original study's settings, roughly. Zheng et al. found that a strong judge (GPT-4) agreed with human preferences over 80% of the time, about the rate at which the human evaluators agreed with each other. Reliability can vary with task and answer length, so treat that figure as setting-specific, not a universal score. The judge also tends to favor the first answer shown and longer answers, so it approximates human preference rather than replacing it.

Why does LLM-as-a-judge matter for GEO?

Because many open-ended evaluations are judge-scored. When a study reports that one content tactic gets cited more often or wins a comparison, an LLM judge often made that call. Knowing that the judge has biases, and that its configuration often goes undisclosed, tells you to check who judged and with what rubric before acting on a reported effect size.

LLM-as-a-judge

LLM-as-a-judge is the practice of using a strong large language model to score, rank, or compare the outputs of other models on open-ended tasks where exact-match metrics do not work¹. Instead of checking an answer against a fixed gold string, a judge model is given the question, one or more candidate answers, and a rubric, and asked to rate each answer or pick a winner. The two common formats are pairwise comparison (which of these two answers is better) and single-answer grading (score this answer from 1 to 10 against the rubric).
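The two formats differ mainly in what the judge is asked to return. The sketch below builds one prompt of each kind from the same inputs (question, answers, rubric). The `JudgeTask` class, both function names, the prompt wording, and the "tie" option are illustrative assumptions, not the prompts used by Zheng et al. or by any particular benchmark.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class JudgeTask:
    """What a judge sees: the question, candidate answers, and a rubric."""
    question: str
    answers: tuple[str, ...]
    rubric: str


def pairwise_prompt(task: JudgeTask) -> str:
    """Pairwise comparison: ask which of exactly two answers is better."""
    if len(task.answers) != 2:
        raise ValueError("pairwise comparison needs exactly two answers")
    first, second = task.answers
    return (
        f"Rubric: {task.rubric}\n"
        f"Question: {task.question}\n"
        f"Answer A: {first}\n"
        f"Answer B: {second}\n"
        "Which answer is better? Reply with A, B, or tie."  # hypothetical wording
    )


def single_grade_prompt(task: JudgeTask) -> str:
    """Single-answer grading: ask for a 1 to 10 score against the rubric."""
    if len(task.answers) != 1:
        raise ValueError("single-answer grading needs exactly one answer")
    return (
        f"Rubric: {task.rubric}\n"
        f"Question: {task.question}\n"
        f"Answer: {task.answers[0]}\n"
        "Score this answer from 1 to 10. Reply with the number only."  # hypothetical wording
    )
```

Keeping the rubric as an explicit field makes it easy to record alongside each verdict, which is what lets a reader later check what the judge was actually asked.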

The framing was named and systematically validated by Zheng et al. in 2023, who showed that, in their evaluation settings, a strong judge (GPT-4 in their study) matched human preference judgments at over 80% agreement, about the level at which the human evaluators agreed with each other¹. Scoring outputs with an LLM predates that naming, but the systematic validation is what drove adoption: human evaluation of open-ended answers is slow and expensive, while a judge model approximates it at a fraction of the cost and can produce a natural-language rationale alongside the score rather than just a label. For AI search and GEO, the method matters because it is how many open-ended results are scored. When a study reports that one content tactic beats another on an open-ended answer, an LLM judge often made that call.
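An agreement figure like the one above is, at its simplest, the share of comparisons on which the judge and a human picked the same winner. The sketch below computes that share. The function name and the vote lists are made up for illustration, and the paper's own agreement protocol (for example, how ties are handled) may differ from this simple match rate.

```python
def agreement_rate(judge_votes: list[str], human_votes: list[str]) -> float:
    """Fraction of comparisons where judge and human chose the same winner."""
    if len(judge_votes) != len(human_votes):
        raise ValueError("vote lists must be the same length")
    if not judge_votes:
        raise ValueError("need at least one comparison")
    matches = sum(j == h for j, h in zip(judge_votes, human_votes))
    return matches / len(judge_votes)


# Illustrative data only: five pairwise comparisons, one disagreement.
judge = ["A", "B", "A", "A", "B"]
human = ["A", "B", "B", "A", "B"]
print(agreement_rate(judge, human))  # 0.8
```

The same function applied to two human annotators gives the human-human baseline that the judge-human rate is compared against.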

Status in 2026

LLM-as-a-judge is mainstream in model evaluation. It scores MT-Bench, the multi-turn benchmark Zheng et al. introduced alongside Chatbot Arena¹. The two are distinct: Chatbot Arena ranks models on crowdsourced human pairwise votes, and the study used that human data to validate the judge, not to run it. Judge-scored evaluation is common across open-ended AI-search and GEO studies, though not universal. Some studies use deterministic text metrics instead, such as PAWC, a position-adjusted word count with no model-judge step and therefore no judge bias, though it has limits of its own².

The reason to read judge-scored numbers carefully is that the judge is itself an evaluation condition, not a neutral oracle. Zheng et al. document several biases in the same paper that validated the method: position bias (the judge favors whichever answer is shown first), verbosity bias (longer answers tend to score higher even when they are not clearer or more accurate), and self-enhancement bias (a judge can favor its own outputs, though the paper found this effect less conclusive than the others), alongside limited reasoning on hard technical questions¹. Published studies vary in how fully they disclose the judge: not just the model and rubric, but the prompt, answer-order handling, decoding settings, and number of repeat runs, any of which can move the result; commercial dashboards typically disclose even less. So two studies reporting different effect sizes for the same tactic may have judged differently rather than measured a real difference.

A concrete GEO instance shows why this matters for the field's own numbers. The original GEO benchmark scored part of its results with a judge: alongside the deterministic PAWC metric, it measured a subjective-impression score using G-Eval, a GPT-based judge³. Several content tactics that benchmark found effective (adding quotations, adding statistics) also lengthen the answer, so verbosity bias is a candidate explanation for part of their measured lift: a real effect and a judge artifact can point the same way. Even one benchmark can be part deterministic and part judge-scored, so which half a headline number came from is worth asking.
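One rough way to probe whether verbosity could explain part of a judged lift is to check whether the score gain tracks the length gain across items. The sketch below computes a Pearson correlation between the two. The row layout, function names, and numbers are assumptions for illustration and are not taken from the GEO benchmark. A high correlation does not prove a judge artifact, since a real effect and verbosity bias can point the same way; it only flags a result worth re-checking with length held roughly constant.

```python
from math import sqrt


def pearson(xs: list[float], ys: list[float]) -> float:
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / sqrt(var_x * var_y)


def length_lift_correlation(rows: list[tuple[int, int, float, float]]) -> float:
    """rows: (baseline_words, treated_words, baseline_score, treated_score)."""
    length_deltas = [treated - base for base, treated, _, _ in rows]
    score_deltas = [treated - base for _, _, base, treated in rows]
    return pearson(length_deltas, score_deltas)


# Illustrative data only: every added word count comes with a matching score gain.
rows = [(100, 120, 5.0, 6.0), (100, 140, 5.0, 7.0), (100, 160, 5.0, 8.0)]
r = length_lift_correlation(rows)  # close to 1.0 for this perfectly linear data
```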

How to apply

When a GEO or AI-search study hands you a number, treat the judge as part of the result, not a window onto truth. Three moves:

Check who judged and how before trusting a benchmark number. Look for the judge model and version, the prompt and rubric, the answer-order handling, and how many runs were averaged. If a study does not name the model that scored its answers, treat the effect size as provisional, because a result judged by one model may not hold under another (Zheng et al. document this disagreement)¹.
Discount for the documented biases in any score you cite. Verbosity bias means a tactic that mainly makes answers longer can look like it wins; self-enhancement bias means a benchmark where a model grades its own outputs is suspect. Read a reported lift against which bias could have produced it.
If you run your own evaluation, control the judge. Judge each pair in both answer orders and aggregate the verdicts to detect and reduce position bias, prefer a judge separate from the generators (while knowing cross-model and shared-style preferences can remain), and fix the rubric before you start so it does not drift between runs. For consequential calls, repeat the judgment across runs or judges and report the disagreement rather than trusting one verdict. Pair the judge with a deterministic metric such as PAWC where one applies, so the judge is not your only signal.
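The both-orders control in the third move can be sketched as follows. The judge here is any callable that returns "first", "second", or "tie" for the two answers in the order shown; that interface, the function names, and the stub judges are assumptions for illustration, not a real library API. The aggregation rule is one conservative choice: an answer wins only if it is preferred in both orders, and anything else counts as a tie.

```python
from collections import Counter
from typing import Callable

# Hypothetical judge interface: (question, shown_first, shown_second) -> verdict
Judge = Callable[[str, str, str], str]


def judge_both_orders(judge: Judge, question: str, answer_a: str, answer_b: str) -> str:
    """Return "A" or "B" only if that answer wins in both orders, else "tie"."""
    forward = judge(question, answer_a, answer_b)   # A shown first
    backward = judge(question, answer_b, answer_a)  # B shown first
    # Translate positional verdicts back to answer labels.
    forward_label = {"first": "A", "second": "B"}.get(forward, "tie")
    backward_label = {"first": "B", "second": "A"}.get(backward, "tie")
    return forward_label if forward_label == backward_label else "tie"


def verdict_distribution(judge: Judge, question: str, answer_a: str,
                         answer_b: str, runs: int = 3) -> dict[str, int]:
    """Repeat the both-orders judgment and count verdicts, to expose disagreement."""
    verdicts = (judge_both_orders(judge, question, answer_a, answer_b)
                for _ in range(runs))
    return dict(Counter(verdicts))


# Stub judges for illustration only; a real judge would call a model.
def always_first(question: str, first: str, second: str) -> str:
    return "first"  # a purely position-biased judge


def prefers_longer(question: str, first: str, second: str) -> str:
    return "first" if len(first) > len(second) else "second"


print(judge_both_orders(always_first, "q", "short", "longer answer"))    # tie
print(judge_both_orders(prefers_longer, "q", "short", "longer answer"))  # B
```

The position-biased stub collapses to a tie, which is the point of the control. The length-preferring stub still produces a consistent winner, so swapping orders does nothing about verbosity bias; that needs a separate check.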

What to skip: do not compare scores across studies that used different judges as if they were one scale, and do not treat an LLM-judge score as ground truth. It is an approximation of human preference with known, directional biases.

How it relates to other concepts

AI search evaluation names the judge as one of the evaluation conditions that move reported numbers; this entry is the spoke that condition points to.
Citation precision is itself often an LLM-judge task (does this citation support its sentence); unlike PAWC it stays a semantic judgment, but it becomes auditable when the claim, source, rubric, and verdict are retained for review.
PAWC is the contrast case: a deterministic visibility metric with no judge, so it avoids judge bias but cannot assess open-ended answer quality.
C-SEO Bench is a GEO benchmark whose effect sizes should be read against how answers were scored; when a benchmark uses judge-scored answer quality, the judge configuration is part of the result.
Hallucination grounding is often assessed with an LLM judge (does the answer stay faithful to the retrieved source), so its measured rates inherit the judge's biases.

¹ Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, et al. "Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena." arXiv:2306.05685, NeurIPS 2023 Datasets and Benchmarks Track. Names and systematically validates the LLM-as-a-judge framing, using the MT-Bench question set (LLM-judge scored) and the Chatbot Arena platform (crowdsourced human pairwise votes) as the human-preference benchmarks; reports that strong judges (GPT-4) match human preferences at over 80% agreement, the level at which humans agree with each other, in those settings. The same paper documents position, verbosity, and self-enhancement biases (the last one it found less conclusive) plus limited reasoning ability, and proposes mitigations. Verified 2026-06-30.
² Pranjal Aggarwal, Vishvak Murahari, et al. "GEO: Generative Engine Optimization." arXiv:2311.09735, 2023. PAWC (position-adjusted word count) is a deterministic metric computed from the positions and lengths of citation sentences, with no model-judge step. Cited here as the deterministic contrast to LLM-as-a-judge scoring: it is reproducible and judge-bias-free, but it measures source visibility, not open-ended answer quality.
³ Yang Liu, Dan Iter, Yichong Xu, Shuohang Wang, Ruochen Xu, Chenguang Zhu. "G-Eval: NLG Evaluation using GPT-4 with Better Human Alignment." arXiv:2303.16634, 2023. An LLM-as-a-judge method (an LLM with chain-of-thought scores generation quality). The GEO benchmark (Aggarwal et al.) applied G-Eval, with GPT-3.5, to compute its subjective-impression metric across seven dimensions, in contrast to its deterministic position-adjusted word count. Verified 2026-06-30 against the GEO paper: "We use G-Eval (Liu et al., 2023a) ... to measure each of these sub-metrics."
