How to Make LLM Output Consistent — Lessons from Building a Scoring System
If you’ve worked with LLMs long enough, you’ve hit this problem: you run the same prompt twice and get different results. For a chatbot, that’s fine. For a scoring system where you need reliable, repeatable judgments? It’s a real problem.
I’ve worked on a project that uses an LLM as a judge: a scoring system. Here’s everything I’ve learned about making LLM output consistent.
- Temperature Is Not Enough
- Audit Your Prompt for Conflicts
- Detailed Rubrics Per Score Level
- Ensemble: Multiple Calls, Aggregate
- Chain-of-Thought Before Scoring
- Known Biases in LLM Scoring
- Putting It All Together
Temperature Is Not Enough
The first thing most people reach for is temperature. Set it to 0, problem solved, right? Not quite.
Temperature=0 means greedy decoding: the model always picks the highest-probability token. It’s the most deterministic setting available, but it’s not truly deterministic. Floating-point addition is not associative, and parallel reductions on a GPU can sum the same values in a different order from run to run. The resulting rounding differences are tiny, but they can flip the result when two tokens have near-identical probabilities.
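A minimal illustration of the rounding effect, using plain Python floats rather than a GPU. The token names and the rival logit are made up for the demonstration; the only point is that summing the same numbers in a different order can change which token wins a near-tie.

```python
# Floating-point addition is not associative: grouping changes the rounding.
sum_a = (0.1 + 0.2) + 0.3
sum_b = 0.1 + (0.2 + 0.3)
print(sum_a == sum_b)  # False
print(sum_a, sum_b)

# Hypothetical rival token whose logit sits right at the near-tie.
rival = 0.6


def greedy_pick(logits):
    # max() returns the first maximal key, so an exact tie goes to the earlier entry.
    return max(logits, key=logits.get)


# Same inputs, different summation order, different winning token.
print(greedy_pick({"8": rival, "7": sum_a}))
print(greedy_pick({"8": rival, "7": sum_b}))
```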
OpenAI introduced a seed parameter [8] in late 2023. When you set seed + temperature=0, they aim for deterministic outputs and return a system_fingerprint. But their docs explicitly say it’s “best effort.” Backend changes, model updates, load balancing across different hardware — all can break reproducibility. In practice, users report 85-95% reproducibility, not 100%.
Anthropic doesn’t expose a seed parameter at all. Temperature=0 with greedy decoding is the best you get.
| Parameter | What it does | Deterministic? |
|---|---|---|
| temperature=0 | Greedy decoding, always picks top token | Nearly, but GPU non-determinism remains |
| temperature=0 + seed (OpenAI) | Best-effort determinism with fingerprint tracking | ~85-95% reproducible |
| top_p=1 + temperature=0 | top_p has no effect at temp 0 | Same as temperature=0 |
| Low temperature (0.1-0.3) | Reduces randomness while keeping some diversity | No, but useful for ensembles |
Bottom line: temperature helps, but alone it’s not enough for a reliable scoring system.
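Before adding more machinery, it helps to measure how inconsistent your judge actually is. The sketch below assumes you can wrap your LLM call in a function that takes text and returns a score. Both `reproducibility_rate` and the `make_flaky_scorer` stand-in are hypothetical helpers, not part of any SDK.

```python
from collections import Counter
import random


def reproducibility_rate(score_fn, text, n_runs=20):
    """Fraction of runs that return the most common score for the same input."""
    scores = [score_fn(text) for _ in range(n_runs)]
    most_common_count = Counter(scores).most_common(1)[0][1]
    return most_common_count / n_runs


def make_flaky_scorer(seed=0, flip_prob=0.1):
    """Stand-in for a real LLM call: usually returns 7, occasionally 8."""
    rng = random.Random(seed)

    def score(text):
        return 8 if rng.random() < flip_prob else 7

    return score


rate = reproducibility_rate(make_flaky_scorer(), "some response to judge")
print(f"Most common score returned in {rate:.0%} of runs")
```

Running this against your real judge at temperature=0 gives a baseline to compare against after each change to the prompt or rubric.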
Audit Your Prompt for Conflicts
The second and most overlooked thing is prompt quality. If your instructions have contradictions or ambiguity, the model will be inconsistent — not because it’s random, but because it’s interpreting unclear guidance differently each time.
| | Ambiguous Prompt | Clear Prompt |
|---|---|---|
| Criteria | “Score the quality” | “Score 1-5 based on accuracy, completeness, clarity” |
| Examples | None | 2-3 anchor examples with scores and explanations |
| Score range | “Rate 0-10” | Explicit description per level (see below) |
| Result | Model interprets differently each call | Model follows consistent criteria |
| Consistency | Low | High |
Check for conflicts between your system prompt and tool descriptions. If the system prompt says “be strict” and a tool description says “be lenient,” the model is stuck. Also check between your rubric criteria — if criterion A rewards brevity and criterion B rewards thoroughness, the model will oscillate.
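A crude keyword check can surface candidates for this kind of audit. This is only a heuristic sketch: the `CONFLICT_PAIRS` list and the `find_conflicts` helper are invented for illustration, and a hit means "have a human look at this", not "this is definitely a conflict".

```python
import re

# Hypothetical pairs of instructions that pull in opposite directions.
CONFLICT_PAIRS = [
    ("strict", "lenient"),
    ("brief", "thorough"),
    ("concise", "detailed"),
]


def find_conflicts(prompt_parts):
    """Return (term_a, source_a, term_b, source_b) for each opposing pair found.

    prompt_parts maps a label (system prompt, tool description, rubric) to its text.
    """

    def sources_with(term):
        pattern = re.compile(rf"\b{re.escape(term)}\b", re.IGNORECASE)
        return [name for name, text in prompt_parts.items() if pattern.search(text)]

    conflicts = []
    for term_a, term_b in CONFLICT_PAIRS:
        for source_a in sources_with(term_a):
            for source_b in sources_with(term_b):
                conflicts.append((term_a, source_a, term_b, source_b))
    return conflicts


parts = {
    "system_prompt": "Be strict when grading factual accuracy.",
    "tool_description": "Be lenient with formatting mistakes.",
}
for conflict in find_conflicts(parts):
    print(conflict)
```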
Detailed Rubrics Per Score Level
The third technique is what made the biggest difference: detailed rubrics with per-score-level descriptions.
If you tell the model “score from 0 to 10,” you’ll get inconsistent results. The model’s idea of a 6 versus a 7 is fuzzy. But if you define exactly what each score range means, consistency improves dramatically.
The Prometheus paper (Kim et al., ICLR 2024) [4] showed this rigorously — providing explicit score-level descriptions significantly outperformed generic “rate from 1-5” prompts.
| Technique | Impact on consistency |
|---|---|
| Detailed per-level rubric | High — the single most effective technique |
| 2-3 anchor examples with explanations | High — few-shot calibration teaches the scale |
| Narrower scale (1-5 vs 1-10) | Medium — less ambiguity between adjacent scores |
| Independent sub-criteria scored separately | Medium — reduces conflation of different quality aspects |
| Boundary examples (“this is a 3, this is a 4 because…”) | High — resolves edge cases |
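One way to keep a per-level rubric maintainable is to store it as data and render it into the judge prompt. The level descriptions and the boundary example below are placeholders I wrote for illustration, and `build_rubric_prompt` is a hypothetical helper; substitute wording that fits your own domain.

```python
# Illustrative level descriptions for a single criterion; write your own.
ACCURACY_LEVELS = {
    1: "Mostly incorrect; the central claims are wrong.",
    2: "Several factual errors that change the meaning.",
    3: "Core claims are correct; minor errors or unsupported statements.",
    4: "Accurate throughout; at most one trivial imprecision.",
    5: "Fully accurate; every claim is correct and verifiable.",
}

# Hypothetical boundary example that separates two adjacent levels.
BOUNDARY_EXAMPLES = [
    "A response with one wrong date but a correct argument is a 3, not a 4, "
    "because the error is factual rather than stylistic.",
]


def build_rubric_prompt(criterion, levels, boundary_examples=()):
    """Render a per-level rubric as plain text for the judge's system prompt."""
    low, high = min(levels), max(levels)
    lines = [f"Score the response for {criterion} on a {low}-{high} scale."]
    for score in sorted(levels):
        lines.append(f"{score}: {levels[score]}")
    if boundary_examples:
        lines.append("Boundary examples:")
        lines.extend(f"- {example}" for example in boundary_examples)
    return "\n".join(lines)


RUBRIC = build_rubric_prompt("accuracy", ACCURACY_LEVELS, BOUNDARY_EXAMPLES)
print(RUBRIC)
```

Scoring each sub-criterion with its own rubric like this, then combining the results in code, keeps the model from blending separate quality aspects into one fuzzy number.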
Ensemble: Multiple Calls, Aggregate
The fourth technique is ensemble — instead of trusting a single call, run multiple calls and aggregate.
```mermaid
graph LR
    subgraph "Single Call"
        S1["One LLM call"] --> S2["Score: 7"]
    end
    subgraph "Ensemble (3 calls)"
        E1["Call 1: Score 7"] --> AGG["Aggregate"]
        E2["Call 2: Score 8"] --> AGG
        E3["Call 3: Score 7"] --> AGG
        AGG --> E4["Final: 7 (median)"]
    end
    style S2 fill:#e9c46a,stroke:#e9c46a,color:#000
    style E4 fill:#2a9d8f,stroke:#2a9d8f,color:#fff
```
| Aggregation method | Best for | Notes |
|---|---|---|
| Mean | Continuous scores | Simple but outlier-sensitive |
| Median | Continuous scores | Robust to outliers, preferred |
| Majority vote | Categorical (pass/fail, A/B/C) | Best for discrete judgments |
| Trimmed mean | Continuous, high stakes | Drop highest and lowest, average the rest |
Three calls capture most of the variance reduction. When ensembling, use a small positive temperature (0.2-0.3): at temperature 0 you’d get nearly the same answer N times, which leaves little to aggregate. Multi-model panels (GPT-4 + Claude + Gemini) reduce shared biases.
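The aggregation methods in the table can be sketched with the standard library. The sample scores are made up, and `trimmed_mean` and `majority_vote` are small hypothetical helpers rather than library functions.

```python
import statistics
from collections import Counter


def trimmed_mean(scores):
    """Drop the single highest and lowest score, then average the rest."""
    if len(scores) < 3:
        raise ValueError("trimmed mean needs at least 3 scores")
    return statistics.mean(sorted(scores)[1:-1])


def majority_vote(labels):
    """Return the most frequent label (ties go to the one seen first)."""
    return Counter(labels).most_common(1)[0][0]


# Five hypothetical calls, one of which is an outlier.
scores = [7, 8, 7, 2, 7]
print("mean:", statistics.mean(scores))      # pulled down by the outlier
print("median:", statistics.median(scores))  # unaffected by the outlier
print("trimmed mean:", trimmed_mean(scores))
print("majority:", majority_vote(["pass", "fail", "pass"]))
```

With one stray 2 among the calls, the mean drops while the median and trimmed mean stay at 7, which is why the median is the safer default.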
Chain-of-Thought Before Scoring
Chain-of-thought before scoring improves consistency significantly. The G-Eval paper [3] showed reasoning before scoring improved correlation with human judgments — Spearman from ~0.38 to ~0.51. The key: reasoning must come before the score, not after. Otherwise it’s post-hoc rationalization.
The optimal pattern: chain-of-thought reasoning + structured output for the final score.
Here’s what that looks like with Instructor:
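```python
from pydantic import BaseModel, Field
import instructor
from openai import OpenAI

# Placeholders: substitute your rubric prompt and the text being judged.
RUBRIC = "..."
response_text = "..."


class EvaluationStep(BaseModel):
    """Reasoning and sub-score for a single criterion."""
    criterion: str
    observation: str
    score: int = Field(ge=0, le=5)


class Evaluation(BaseModel):
    # Field order matters: the reasoning is generated before the final score.
    chain_of_thought: list[EvaluationStep] = Field(
        description="Evaluate each criterion BEFORE assigning final score"
    )
    final_score: int = Field(ge=0, le=10)
    summary: str


# Instructor wraps the client so the reply is parsed into `Evaluation`.
client = instructor.from_openai(OpenAI())
result = client.chat.completions.create(
    model="gpt-4o",
    response_model=Evaluation,
    temperature=0,
    messages=[
        {"role": "system", "content": RUBRIC},
        {"role": "user", "content": f"Evaluate: {response_text}"},
    ],
)
```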
And for the ensemble:
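```python
import statistics


def score_with_ensemble(text, n_calls=3, temperature=0.2):
    """Score `text` n_calls times and return the median final score."""
    # Reuses `client`, `Evaluation`, and `RUBRIC` from the previous snippet.
    scores = []
    for _ in range(n_calls):
        result = client.chat.completions.create(
            model="gpt-4o",
            response_model=Evaluation,
            temperature=temperature,
            messages=[
                {"role": "system", "content": RUBRIC},
                {"role": "user", "content": f"Evaluate: {text}"},
            ],
        )
        scores.append(result.final_score)
    # The median is robust to a single outlier call.
    return statistics.median(scores)
```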
Known Biases in LLM Scoring
Be aware of known biases in LLM scoring:
| Bias | What happens | Mitigation |
|---|---|---|
| Position bias | Prefers the first response in pairwise comparison | Swap order, average both results |
| Verbosity bias | Rates longer responses higher, even if redundant | Instruct judge to ignore length |
| Self-preference bias | Rates its own model’s output ~10% higher | Use a different model as judge |
| Format/style bias | Prefers markdown, bullet points over plain text | Normalize formatting before judging |
| Anchoring bias | Hints about expected quality skew the score | Remove metadata, anonymize outputs |
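The position-bias mitigation from the table (swap the order, average both results) can be sketched as follows. It assumes a `judge(first, second)` callable that returns the probability that the response shown first is better. The toy judge and its `QUALITY` table are stand-ins for a real LLM call.

```python
def debiased_preference(judge, response_a, response_b):
    """Estimate the probability that response_a is better, averaged over both orders."""
    a_shown_first = judge(response_a, response_b)
    b_shown_first = judge(response_b, response_a)
    # Convert the second call to "A is better" before averaging.
    return (a_shown_first + (1.0 - b_shown_first)) / 2


# Toy judge: both answers are equally good, but the first position gets a bonus.
QUALITY = {"answer A": 0.6, "answer B": 0.6}


def position_biased_judge(first, second, bonus=0.2):
    raw = 0.5 + (QUALITY[first] - QUALITY[second]) + bonus
    return min(1.0, max(0.0, raw))


single_call = position_biased_judge("answer A", "answer B")
both_orders = debiased_preference(position_biased_judge, "answer A", "answer B")
print("single call favors A:", single_call)
print("averaged over both orders:", both_orders)
```

Because the bonus goes to whichever answer is shown first, it cancels out once both orders are averaged, and the two equally good answers come out even.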
Putting It All Together
```mermaid
timeline
    title Building a Consistent LLM Scoring System
    section Foundation
        Step 1 : Set temperature=0 or low (0.1-0.3 for ensemble)
               : Remove randomness as much as possible
    section Prompt Quality
        Step 2 : Audit prompt for conflicts and ambiguity
               : Ensure system prompt, tools, rubric are aligned
        Step 3 : Write detailed per-score-level rubric
               : Add 2-3 anchor examples with explanations
               : Use narrow scales (1-5) or decomposed sub-criteria
    section Reliability
        Step 4 : Chain-of-thought before scoring
               : Reasoning influences the score, not post-hoc
        Step 5 : Structured output for final score
               : JSON schema with score + reasoning fields
    section Robustness
        Step 6 : Ensemble 3-5 calls, aggregate by median
               : Consider multi-model panel for high stakes
        Step 7 : Monitor score distributions for drift over time
               : Model updates can shift calibration
```
Each layer adds consistency. You don’t need all of them for every use case — but for a production scoring system, I’d use at least steps 1-5.
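For step 7, one simple approach is to re-score a fixed set of reference items on a schedule and compare the results with a stored baseline. The sketch below only checks the shift in the mean; the 0.5-point threshold and the sample scores are arbitrary assumptions, and `mean_shift` and `has_drifted` are hypothetical helpers.

```python
import statistics


def mean_shift(baseline_scores, recent_scores):
    """Difference between the recent mean score and the baseline mean score."""
    return statistics.mean(recent_scores) - statistics.mean(baseline_scores)


def has_drifted(baseline_scores, recent_scores, max_shift=0.5):
    """Flag drift when the mean moves by more than max_shift points."""
    return abs(mean_shift(baseline_scores, recent_scores)) > max_shift


# Scores for the same reference items, before and after a hypothetical model update.
baseline = [3, 4, 3, 4, 3]
after_update = [4, 5, 4, 5, 4]

print("shift:", round(mean_shift(baseline, after_update), 2))
print("drifted:", has_drifted(baseline, after_update))
```

A mean check like this misses changes in spread, so comparing the full score histogram is a reasonable next step if the stakes justify it.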
What techniques are you using for LLM consistency? Have you run into the same issues?
References:
[1] Zheng et al. “Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena.” NeurIPS 2023.
[2] Wang et al. “Self-Consistency Improves Chain of Thought Reasoning in Language Models.” ICLR 2023.
[3] Liu et al. “G-Eval: NLG Evaluation using GPT-4 with Chain-of-Thought and a Form-Filling Paradigm.” 2023.
[4] Kim et al. “Prometheus: Inducing Fine-Grained Evaluation Capability in Language Models.” ICLR 2024.
[5] Wang et al. “Large Language Models are not Fair Evaluators.” ACL 2024.
[6] Chan et al. “ChatEval: Towards Better LLM-based Evaluators through Multi-Agent Debate.” 2023.
[7] Wallace et al. “The Instruction Hierarchy: Training LLMs to Prioritize Privileged Instructions.” OpenAI, 2024.
[8] “Text Generation — Seed Parameter.” OpenAI.