This guide explains what each part of a Tempered evaluation result means and how to act on it.
Every evaluation produces a verdict — the consensus recommendation across all AI perspectives.
| Verdict | Meaning | Action |
|---|---|---|
| Proceed | The change appears safe. No significant risks identified. | Go ahead. Review any minor notes in the summary. |
| Proceed with Mitigations | The change is acceptable but has identified risks that should be addressed. | Implement the listed conditions before or alongside the change. |
| Review Required | Significant risks identified. Human review recommended before proceeding. | Do not proceed without a human decision-maker reviewing the analysis. |
| Quorum Failed | Not enough AI perspectives responded successfully. | Retry the evaluation or check vendor availability. |
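The verdict table above can be turned into a simple decision helper. This is a minimal sketch assuming a hypothetical result dict; the field names (`verdict`, `conditions`) are illustrative, not Tempered's actual schema.

```python
def next_step(result: dict) -> str:
    """Map a Tempered-style verdict to the recommended action."""
    verdict = result["verdict"]
    if verdict == "Proceed":
        return "go ahead; review any minor notes in the summary"
    if verdict == "Proceed with Mitigations":
        # Address every listed condition before or alongside the change.
        return f"implement {len(result.get('conditions', []))} condition(s) first"
    if verdict == "Review Required":
        return "escalate to a human decision-maker before proceeding"
    if verdict == "Quorum Failed":
        return "retry the evaluation or check vendor availability"
    raise ValueError(f"unknown verdict: {verdict}")
```

Raising on an unknown verdict is deliberate: silently treating an unrecognized value as safe would defeat the purpose of the gate.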
Each AI perspective scores the change across five dimensions:
| Dimension | What It Assesses |
|---|---|
| Security | Data exposure, access control, authentication, encryption, attack surface |
| Reliability | Availability impact, failure modes, recovery time, degraded modes |
| Compliance | Regulatory obligations, audit trail, data handling, certification impact |
| Operational | Deployment complexity, rollback feasibility, monitoring, team capacity |
| Business | Cost impact, timeline risk, stakeholder communication, revenue impact |
Each dimension is scored on a four-level scale:
| Level | Meaning |
|---|---|
| Low | Minimal risk in this dimension |
| Medium | Notable risk that should be monitored |
| High | Significant risk requiring active mitigation |
| Critical | Severe risk that may warrant blocking the change |
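Because the four levels are ordered, you can rank a perspective's dimension scores and surface the ones that need active mitigation. The dimension names and level strings mirror the tables above; the dict shape is an assumption for illustration.

```python
# Ordinal ranking of the four-level risk scale.
LEVEL_ORDER = {"Low": 0, "Medium": 1, "High": 2, "Critical": 3}

def dimensions_needing_mitigation(scores: dict) -> list:
    """Return dimensions scored High or worse, most severe first."""
    flagged = [d for d, lvl in scores.items()
               if LEVEL_ORDER[lvl] >= LEVEL_ORDER["High"]]
    return sorted(flagged, key=lambda d: -LEVEL_ORDER[scores[d]])

scores = {"Security": "Critical", "Reliability": "Medium",
          "Compliance": "Low", "Operational": "High", "Business": "Low"}
print(dimensions_needing_mitigation(scores))  # → ['Security', 'Operational']
```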
When the verdict is Proceed with Mitigations, the conditions list tells you exactly what to do. Conditions are merged from all AI perspectives; they represent the collective wisdom of the panel.
The confidence score (0.0–1.0) indicates how certain the analysis is:
| Range | Interpretation |
|---|---|
| 0.8–1.0 | High confidence — clear-cut decision |
| 0.6–0.8 | Moderate confidence — some ambiguity in the scenario |
| 0.4–0.6 | Low confidence — the scenario is complex or underspecified |
| Below 0.4 | Very low confidence — consider providing more context |
Low confidence doesn't mean the verdict is wrong — it means the AI perspectives found the scenario ambiguous. This is often a signal that your description needs more detail.
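A small helper can translate the numeric score into the bands above. Note that the table's range endpoints overlap (0.8 appears in two rows); this sketch resolves the ambiguity by assigning each boundary to the higher band, which is a choice on our part rather than documented behaviour.

```python
def confidence_band(score: float) -> str:
    """Map a 0.0-1.0 confidence score to its interpretation band."""
    if not 0.0 <= score <= 1.0:
        raise ValueError("confidence must be in [0.0, 1.0]")
    if score >= 0.8:
        return "high confidence: clear-cut decision"
    if score >= 0.6:
        return "moderate confidence: some ambiguity in the scenario"
    if score >= 0.4:
        return "low confidence: scenario is complex or underspecified"
    return "very low confidence: consider providing more context"
```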
When AI perspectives disagree with the consensus, you see a minority report. This is one of Tempered's most valuable features.
A minority report records the dissenting perspective's verdict and the reasoning behind its disagreement with the consensus.
Why minority reports matter: Consensus can be wrong. The dissenting perspective might have spotted something the majority missed. Always read minority reports, especially for high-stakes decisions.
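Conceptually, a minority report is just the set of perspectives whose verdict differs from the consensus. The vendor names and per-perspective structure here are hypothetical, used only to illustrate the idea.

```python
# Illustrative: collect dissenting perspectives relative to the consensus.
perspectives = {"vendor_a": "Proceed",
                "vendor_b": "Proceed",
                "vendor_c": "Review Required"}
consensus = "Proceed"

minority = {vendor: verdict for vendor, verdict in perspectives.items()
            if verdict != consensus}
print(minority)  # → {'vendor_c': 'Review Required'}
```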
The quality composite score (0.0–1.0) measures the technical quality of the vendor responses, not the decision itself:
| Analyser | What It Checks |
|---|---|
| Schema compliance | Did the response contain all required fields? |
| Dimension coverage | Were all risk dimensions assessed? |
| Confidence calibration | Does the stated confidence match the strength of the evidence and reasoning? |
| Reasoning depth | Is the reasoning specific or generic? |
| Mitigation specificity | Are mitigations actionable or vague? |
A quality score below 0.6 suggests the AI responses may be unreliable for this particular scenario. Consider retrying the evaluation or adding more context to your description.
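As a sketch of how such a gate might work, the snippet below averages the five analyser scores into a composite and compares it to the 0.6 threshold. The equal-weight average and the analyser key names are assumptions; the exact weighting Tempered uses is not documented here.

```python
def quality_gate(analyser_scores: dict, threshold: float = 0.6) -> bool:
    """True if the equal-weight composite meets the threshold."""
    composite = sum(analyser_scores.values()) / len(analyser_scores)
    return composite >= threshold

scores = {"schema_compliance": 0.9, "dimension_coverage": 0.8,
          "confidence_calibration": 0.5, "reasoning_depth": 0.4,
          "mitigation_specificity": 0.3}
print(quality_gate(scores))  # → False (composite is 0.58)
```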
If you used the Debate profile, you'll see two rounds of analysis: each vendor's initial position and its position after being challenged by the others.
Revisions show which vendors changed their recommendation between rounds. A vendor that revises shows intellectual flexibility. A vendor that holds firm despite challenges shows conviction. Both are informative.
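Spotting revisions reduces to comparing each vendor's recommendation across the two rounds. The round dicts below are an assumed shape for illustration.

```python
# Hypothetical Debate-profile output: verdict per vendor, per round.
round1 = {"vendor_a": "Proceed", "vendor_b": "Review Required"}
round2 = {"vendor_a": "Proceed with Mitigations", "vendor_b": "Review Required"}

# Vendors whose recommendation changed between rounds.
revised = [vendor for vendor in round1 if round1[vendor] != round2[vendor]]
print(revised)  # → ['vendor_a']
```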
Each evaluation shows:
| Metric | Meaning |
|---|---|
| Duration | Wall-clock time from submission to verdict |
| Tokens | Total input + output tokens across all vendors |
| Cost | Estimated cost based on vendor pricing |
| Per-vendor latency | How long each vendor took to respond |
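The summary metrics above are aggregates of per-vendor figures. This sketch shows one plausible aggregation; the vendor names and field names are illustrative.

```python
# Assumed per-vendor metrics; fields mirror the table above.
vendors = [
    {"name": "vendor_a", "latency_s": 4.2, "tokens": 1800, "cost_usd": 0.012},
    {"name": "vendor_b", "latency_s": 6.9, "tokens": 2400, "cost_usd": 0.031},
]

total_tokens = sum(v["tokens"] for v in vendors)    # input + output combined
total_cost = sum(v["cost_usd"] for v in vendors)
slowest = max(vendors, key=lambda v: v["latency_s"])["name"]
print(total_tokens, round(total_cost, 3), slowest)  # → 4200 0.043 vendor_b
```

Overall duration is typically dominated by the slowest vendor, since perspectives run concurrently; that is worth checking before blaming the whole panel for latency.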
Every evaluation captures a complete audit trail.
This audit trail ensures reproducibility and supports compliance evidence requirements.