Understanding Results

This guide explains what each part of a Tempered evaluation result means and how to act on it.

The Verdict

Every evaluation produces a verdict — the consensus recommendation across all AI perspectives.

| Verdict | Meaning | Action |
|---|---|---|
| Proceed | The change appears safe; no significant risks identified. | Go ahead. Review any minor notes in the summary. |
| Proceed with Mitigations | The change is acceptable but has identified risks that should be addressed. | Implement the listed conditions before or alongside the change. |
| Review Required | Significant risks identified; human review recommended before proceeding. | Do not proceed without a human decision-maker reviewing the analysis. |
| Quorum Failed | Not enough AI perspectives responded successfully. | Retry the evaluation or check vendor availability. |
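The verdict-to-action mapping above lends itself to automation, for example as a CI gate. The sketch below is illustrative only: the field names (`verdict`, `conditions`) are assumptions about the result shape, not Tempered's documented schema.

```python
# Hypothetical sketch: "verdict" and "conditions" are assumed field names,
# not Tempered's actual result schema.
import sys


def act_on_verdict(result: dict) -> int:
    """Map a verdict to a CI-style exit code per the table above."""
    verdict = result["verdict"]
    if verdict == "Proceed":
        return 0  # safe: continue the pipeline
    if verdict == "Proceed with Mitigations":
        for condition in result.get("conditions", []):
            print(f"mitigation required: {condition}")
        return 0  # continue, but surface the conditions to the team
    if verdict == "Review Required":
        print("human review required before proceeding", file=sys.stderr)
        return 1  # block until a human decision-maker signs off
    if verdict == "Quorum Failed":
        print("quorum failed: retry or check vendor availability", file=sys.stderr)
        return 2  # transient failure: retry the evaluation
    raise ValueError(f"unknown verdict: {verdict!r}")
```

Treating Quorum Failed as a distinct, retryable exit code keeps transient vendor outages from being confused with genuine risk findings.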

Risk Dimensions

Each AI perspective scores the change across five dimensions:

| Dimension | What It Assesses |
|---|---|
| Security | Data exposure, access control, authentication, encryption, attack surface |
| Reliability | Availability impact, failure modes, recovery time, degraded modes |
| Compliance | Regulatory obligations, audit trail, data handling, certification impact |
| Operational | Deployment complexity, rollback feasibility, monitoring, team capacity |
| Business | Cost impact, timeline risk, stakeholder communication, revenue impact |

Each dimension is scored on a four-level scale:

| Level | Meaning |
|---|---|
| Low | Minimal risk in this dimension |
| Medium | Notable risk that should be monitored |
| High | Significant risk requiring active mitigation |
| Critical | Severe risk that may warrant blocking the change |
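Because the four levels are ordered, a change's overall severity can be summarised as the worst level across the five dimensions. This is a sketch under assumptions: the dimension and level names come from the tables above, but the dictionary shape (and whether Tempered itself rolls scores up this way) is illustrative.

```python
# Illustrative only: the scores dict shape is an assumption, not
# Tempered's real result schema. Levels are ordered least to most severe.
LEVELS = ["Low", "Medium", "High", "Critical"]


def overall_severity(scores: dict) -> str:
    """Return the most severe level found across all scored dimensions."""
    return max(scores.values(), key=LEVELS.index)
```

For example, a change scored Low everywhere except High on Reliability would roll up to High.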

Conditions

When the verdict is Proceed with Mitigations, the conditions list specifies exactly what must be done before or alongside the change.

Conditions are merged from all AI perspectives, so they represent the collective judgment of the panel rather than any single vendor's view.

Confidence Score

The confidence score (0.0–1.0) indicates how certain the analysis is:

| Range | Interpretation |
|---|---|
| 0.8–1.0 | High confidence: clear-cut decision |
| 0.6–0.8 | Moderate confidence: some ambiguity in the scenario |
| 0.4–0.6 | Low confidence: the scenario is complex or underspecified |
| Below 0.4 | Very low confidence: consider providing more context |
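The bands above are easy to encode if you want to surface a human-readable label alongside the raw score. The function below is a minimal sketch of that bucketing; the band names are taken from the table, and the boundary handling (each band closed on its lower edge) is an assumption where the published ranges overlap.

```python
# Bucket a 0.0-1.0 confidence score into the bands from the table above.
# Boundary values (0.4, 0.6, 0.8) are assigned to the higher band; that
# tie-breaking choice is an assumption, since the ranges overlap as printed.
def confidence_band(score: float) -> str:
    if not 0.0 <= score <= 1.0:
        raise ValueError("confidence must be between 0.0 and 1.0")
    if score >= 0.8:
        return "high"
    if score >= 0.6:
        return "moderate"
    if score >= 0.4:
        return "low"
    return "very low"
```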

Low confidence doesn't mean the verdict is wrong — it means the AI perspectives found the scenario ambiguous. This is often a signal that your description needs more detail.

Minority Reports

When AI perspectives disagree with the consensus, you see a minority report. This is one of Tempered's most valuable features.

A minority report shows:

Why minority reports matter: Consensus can be wrong. The dissenting perspective might have spotted something the majority missed. Always read minority reports, especially for high-stakes decisions.

Quality Analysis

The quality composite score (0.0–1.0) measures the technical quality of the vendor responses, not the decision itself:

| Analyser | What It Checks |
|---|---|
| Schema compliance | Did the response contain all required fields? |
| Dimension coverage | Were all risk dimensions assessed? |
| Confidence calibration | Is the confidence score well-calibrated? |
| Reasoning depth | Is the reasoning specific or generic? |
| Mitigation specificity | Are mitigations actionable or vague? |
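One way to gate on response quality is to combine the five analyser scores and compare against a reliability floor. The sketch below assumes an unweighted mean and hypothetical field names; how Tempered actually weights its analysers into the composite is not documented here.

```python
# Sketch with assumed analyser keys and equal weighting; Tempered's real
# composite formula may differ.
ANALYSERS = [
    "schema_compliance",
    "dimension_coverage",
    "confidence_calibration",
    "reasoning_depth",
    "mitigation_specificity",
]


def composite(scores: dict) -> float:
    """Unweighted mean of the five analyser scores (each 0.0-1.0)."""
    return sum(scores[a] for a in ANALYSERS) / len(ANALYSERS)


def reliable(scores: dict, floor: float = 0.6) -> bool:
    """Flag a result whose composite falls below the reliability floor."""
    return composite(scores) >= floor
```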

A quality score below 0.6 suggests the AI responses may be unreliable for this particular scenario. Consider:

Debate Results

If you used the Debate profile, you'll see two rounds of analysis:

Revisions show which vendors changed their recommendation between rounds. A vendor that revises shows intellectual flexibility. A vendor that holds firm despite challenges shows conviction. Both are informative.

Cost and Performance

Each evaluation shows:

| Metric | Meaning |
|---|---|
| Duration | Wall-clock time from submission to verdict |
| Tokens | Total input + output tokens across all vendors |
| Cost | Estimated cost based on vendor pricing |
| Per-vendor latency | How long each vendor took to respond |
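The cost estimate follows the usual token-based pricing model: input and output tokens billed at separate per-1K rates. The helper below is illustrative arithmetic only; real per-token prices vary by vendor and change over time, so the parameters are placeholders rather than actual vendor pricing.

```python
# Illustrative token-pricing arithmetic; the unit prices are placeholders,
# not real vendor rates.
def estimated_cost(input_tokens: int, output_tokens: int,
                   in_price_per_1k: float, out_price_per_1k: float) -> float:
    """Estimate spend for one vendor call from token counts and unit prices."""
    return (input_tokens / 1000) * in_price_per_1k \
        + (output_tokens / 1000) * out_price_per_1k
```

Summing this across all vendors in the panel gives the evaluation's total Cost metric.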

Audit Trail

Every evaluation captures a complete audit trail:

This audit trail ensures reproducibility and supports compliance evidence requirements.

Acting on Results

Proceed

Proceed with Mitigations

Review Required

Quorum Failed