Understanding Results

This guide explains what each part of a Tempered evaluation result means and how to act on it.

The Verdict

Every evaluation produces a verdict — the consensus recommendation across all AI perspectives.

| Verdict | Meaning | Action |
|---|---|---|
| Proceed | The change appears safe; no significant risks identified. | Go ahead. Review any minor notes in the summary. |
| Proceed with Mitigations | The change is acceptable but has identified risks that should be addressed. | Implement the listed conditions before or alongside the change. |
| Review Required | Significant risks identified; human review recommended before proceeding. | Do not proceed without a human decision-maker reviewing the analysis. |
| Quorum Failed | Not enough AI perspectives responded successfully. | Retry the evaluation or check vendor availability. |
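The verdict-to-action mapping above lends itself to automation, for example as a CI gate. The sketch below is illustrative only: the field names (`verdict`, `conditions`) are assumptions about the result shape, not Tempered's documented schema.

```python
# Hypothetical sketch: "verdict" and "conditions" are assumed field names,
# not Tempered's actual result schema.
import sys


def act_on_verdict(result: dict) -> int:
    """Map a verdict to a CI-style exit code per the table above."""
    verdict = result["verdict"]
    if verdict == "Proceed":
        return 0  # safe: continue the pipeline
    if verdict == "Proceed with Mitigations":
        for condition in result.get("conditions", []):
            print(f"mitigation required: {condition}")
        return 0  # continue, but surface the conditions to the team
    if verdict == "Review Required":
        print("human review required before proceeding", file=sys.stderr)
        return 1  # block until a human decision-maker signs off
    if verdict == "Quorum Failed":
        print("quorum failed: retry or check vendor availability", file=sys.stderr)
        return 2  # transient failure: retry the evaluation
    raise ValueError(f"unknown verdict: {verdict!r}")
```

Treating Quorum Failed as a distinct, retryable exit code keeps transient vendor outages from being confused with genuine risk findings.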

Risk Dimensions

Each AI perspective scores the change across five dimensions:

| Dimension | What It Assesses |
|---|---|
| Security | Data exposure, access control, authentication, encryption, attack surface |
| Reliability | Availability impact, failure modes, recovery time, degraded modes |
| Compliance | Regulatory obligations, audit trail, data handling, certification impact |
| Operational | Deployment complexity, rollback feasibility, monitoring, team capacity |
| Business | Cost impact, timeline risk, stakeholder communication, revenue impact |

Each dimension is scored on a four-level scale:

| Level | Meaning |
|---|---|
| Low | Minimal risk in this dimension |
| Medium | Notable risk that should be monitored |
| High | Significant risk requiring active mitigation |
| Critical | Severe risk that may warrant blocking the change |
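Because the four levels are ordered, a change's overall severity can be summarised as the worst level across the five dimensions. This is a sketch under assumptions: the dimension and level names come from the tables above, but the dictionary shape (and whether Tempered itself rolls scores up this way) is illustrative.

```python
# Illustrative only: the scores dict shape is an assumption, not
# Tempered's real result schema. Levels are ordered least to most severe.
LEVELS = ["Low", "Medium", "High", "Critical"]


def overall_severity(scores: dict) -> str:
    """Return the most severe level found across all scored dimensions."""
    return max(scores.values(), key=LEVELS.index)
```

For example, a change scored Low everywhere except High on Reliability would roll up to High.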

Conditions

When the verdict is Proceed with Mitigations, the conditions list specifies exactly what must be done before or alongside the change.

Conditions are merged from all AI perspectives, so they represent the collective judgment of the panel rather than any single vendor's view.

Confidence Score

The confidence score (0.0–1.0) indicates how certain the analysis is:

| Range | Interpretation |
|---|---|
| 0.8–1.0 | High confidence: clear-cut decision |
| 0.6–0.8 | Moderate confidence: some ambiguity in the scenario |
| 0.4–0.6 | Low confidence: the scenario is complex or underspecified |
| Below 0.4 | Very low confidence: consider providing more context |
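The bands above are easy to encode if you want to surface a human-readable label alongside the raw score. The function below is a minimal sketch of that bucketing; the band names are taken from the table, and the boundary handling (each band closed on its lower edge) is an assumption where the published ranges overlap.

```python
# Bucket a 0.0-1.0 confidence score into the bands from the table above.
# Boundary values (0.4, 0.6, 0.8) are assigned to the higher band; that
# tie-breaking choice is an assumption, since the ranges overlap as printed.
def confidence_band(score: float) -> str:
    if not 0.0 <= score <= 1.0:
        raise ValueError("confidence must be between 0.0 and 1.0")
    if score >= 0.8:
        return "high"
    if score >= 0.6:
        return "moderate"
    if score >= 0.4:
        return "low"
    return "very low"
```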

Low confidence doesn't mean the verdict is wrong — it means the AI perspectives found the scenario ambiguous. This is often a signal that your description needs more detail.

Minority Reports

When AI perspectives disagree with the consensus, you see a minority report. This is one of Tempered's most valuable features.

A minority report shows:

Why minority reports matter: Consensus can be wrong. The dissenting perspective might have spotted something the majority missed. Always read minority reports, especially for high-stakes decisions.

Quality Analysis

The quality composite score (0.0–1.0) measures the technical quality of the vendor responses, not the decision itself:

| Analyser | What It Checks |
|---|---|
| Schema compliance | Did the response contain all required fields? |
| Dimension coverage | Were all risk dimensions assessed? |
| Confidence calibration | Is the confidence score well-calibrated? |
| Reasoning depth | Is the reasoning specific or generic? |
| Mitigation specificity | Are mitigations actionable or vague? |
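One way to gate on response quality is to combine the five analyser scores and compare against a reliability floor. The sketch below assumes an unweighted mean and hypothetical field names; how Tempered actually weights its analysers into the composite is not documented here.

```python
# Sketch with assumed analyser keys and equal weighting; Tempered's real
# composite formula may differ.
ANALYSERS = [
    "schema_compliance",
    "dimension_coverage",
    "confidence_calibration",
    "reasoning_depth",
    "mitigation_specificity",
]


def composite(scores: dict) -> float:
    """Unweighted mean of the five analyser scores (each 0.0-1.0)."""
    return sum(scores[a] for a in ANALYSERS) / len(ANALYSERS)


def reliable(scores: dict, floor: float = 0.6) -> bool:
    """Flag a result whose composite falls below the reliability floor."""
    return composite(scores) >= floor
```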

A quality score below 0.6 suggests the AI responses may be unreliable for this particular scenario. Consider:

Debate Results

If you used the Debate profile, you'll see two rounds of analysis:

Revisions show which vendors changed their recommendation between rounds. A vendor that revises shows intellectual flexibility. A vendor that holds firm despite challenges shows conviction. Both are informative.

Cost and Performance

Each evaluation shows:

| Metric | Meaning |
|---|---|
| Duration | Wall-clock time from submission to verdict |
| Tokens | Total input + output tokens across all vendors |
| Cost | Estimated cost based on vendor pricing |
| Per-vendor latency | How long each vendor took to respond |
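The cost estimate follows the usual token-based pricing model: input and output tokens billed at separate per-1K rates. The helper below is illustrative arithmetic only; real per-token prices vary by vendor and change over time, so the parameters are placeholders rather than actual vendor pricing.

```python
# Illustrative token-pricing arithmetic; the unit prices are placeholders,
# not real vendor rates.
def estimated_cost(input_tokens: int, output_tokens: int,
                   in_price_per_1k: float, out_price_per_1k: float) -> float:
    """Estimate spend for one vendor call from token counts and unit prices."""
    return (input_tokens / 1000) * in_price_per_1k \
        + (output_tokens / 1000) * out_price_per_1k
```

Summing this across all vendors in the panel gives the evaluation's total Cost metric.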

Audit Trail

Every evaluation captures a complete audit trail:

This audit trail ensures reproducibility and supports compliance evidence requirements.

Acting on Results

Proceed

Proceed with Mitigations

Review Required

Quorum Failed