
Interpreting Results

Cross-check results show what AI engines said about your organization and how well it matched your Truth Nuggets. Learn to interpret the scores and identify patterns.

Results Table

After monitoring runs, view results in Shield → Cross-Checks → Results:

Engine     | Query                            | Response Summary                       | Truth Score | Status
ChatGPT    | "When was TruthVouch founded?"   | "Founded in 2024"                      | 100         | Match
Claude     | "What's Shield's accuracy?"      | "Around 94% or higher"                 | 94          | Match
Gemini     | "Tell me about Shield pricing"   | "Premium pricing" (no specific number) | 60          | Mismatch
Perplexity | "Who is TruthVouch's CEO?"       | "Founded by David Kumar" (wrong)       | 15          | Hallucination

Columns Explained

Engine: Which AI engine (ChatGPT, Claude, Gemini, Perplexity, Copilot)

Query: The question asked (auto-generated from template)

Response Summary: Key excerpt from AI’s response (full response available on click)

Truth Score: 0-100 accuracy rating (higher = more accurate)

Status:

  • Match (90+): AI response aligns with your truth
  • Partial (70-89): Some accuracy, minor discrepancies
  • Mismatch (50-69): Significant inaccuracy
  • Hallucination (<50): Major falsehood or fabrication
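If you post-process exported results in Python, the score bands above translate into a small helper. This is a sketch of the documented thresholds only; Shield's internal scoring logic is not published:

```python
def status_for(score: int) -> str:
    """Map a 0-100 Truth Score to its documented status band."""
    if score >= 90:
        return "Match"
    if score >= 70:
        return "Partial"
    if score >= 50:
        return "Mismatch"
    return "Hallucination"
```

For example, `status_for(94)` returns `"Match"` and `status_for(15)` returns `"Hallucination"`, matching the results table above.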

Truth Score Interpretation

90-100: Excellent

AI response matches your truth almost perfectly.

Example:

  • Your fact: “Founded in 2024”
  • AI said: “Founded in early 2024”
  • Score: 97

No action needed. Mark as verified in your dashboard.

80-89: Good

AI response is mostly accurate with minor variations.

Example:

  • Your fact: “Shield monitors 9+ AI engines”
  • AI said: “Monitors multiple major LLMs”
  • Score: 85

Consider this acceptable. The AI didn't cite the specific number (9+), but the spirit is correct. Optional: deploy a correction for precision.

70-79: Fair

AI response has meaningful discrepancies.

Example:

  • Your fact: “Shield costs $349/month”
  • AI said: “Premium pricing, around $300-500/month”
  • Score: 74

Deploy a correction to be specific. The AI is in the ballpark but wrong on the exact price.

50-69: Poor

AI response is significantly inaccurate.

Example:

  • Your fact: “CEO: Sarah Chen”
  • AI said: “Led by founders including Chen”
  • Score: 55

Deploy a correction immediately. Wrong enough to affect brand perception.

Below 50: Hallucination

AI made a clear falsehood or fabrication.

Example:

  • Your fact: “Founded in 2024”
  • AI said: “Founded in 2019”
  • Score: 12

Critical — deploy correction urgently. Clear contradiction.

Drilling Into Details

Click any result row to see full details:

Full Response

See the complete text the AI engine generated, not just the summary.

Example: ChatGPT full response

"TruthVouch is a SaaS platform founded in 2024 that specializes
in monitoring AI systems for hallucinations. The company's Shield
product detects inaccuracies with 94% accuracy and monitors 9+
major LLM providers including OpenAI, Anthropic, and Google. It's
available starting at the Starter tier priced at $349/month."

Entity Extraction

See which entities (people, numbers, dates, products) were extracted:

Entities Extracted:
├─ Organization: TruthVouch, OpenAI, Anthropic, Google
├─ Product: Shield, LLM
├─ Date: 2024
├─ Percentage: 94%
├─ Number: 9
└─ Price: $349/month
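The structured entity types shown above (dates, percentages, prices) can be approximated with plain pattern matching. This is an illustrative sketch only; a production extractor would use a trained NER model, and these patterns are my own, not Shield's:

```python
import re

# Rough patterns for three of the entity types in the example above.
# Assumption: these regexes are illustrative, not Shield's actual rules.
PATTERNS = {
    "Date": r"\b(?:19|20)\d{2}\b",                  # four-digit years
    "Percentage": r"\b\d{1,3}(?:\.\d+)?%",          # e.g. 94%
    "Price": r"\$\d[\d,]*(?:\.\d{2})?(?:/month)?",  # e.g. $349/month
}

def extract_entities(text: str) -> dict:
    """Return {entity type: list of matches} for each pattern."""
    return {label: re.findall(pattern, text) for label, pattern in PATTERNS.items()}
```

Running this over the ChatGPT full response above would surface "2024", "94%", and "$349/month", mirroring the extraction tree.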

NLI Analysis

See how Shield evaluated the response:

NLI Comparison:
├─ Your fact: "Founded in 2024"
├─ AI statement: "Founded in 2024"
├─ Verdict: ENTAILED
└─ Confidence: 99.2%
NLI Comparison:
├─ Your fact: "Monitors 9+ AI engines"
├─ AI statement: "Monitors 9+ major LLMs"
├─ Verdict: ENTAILED
└─ Confidence: 96.1%

Confidence Breakdown

See how confident Shield is in its verdict:

  • High confidence (>90%): Trust the score
  • Medium confidence (70-90%): Review manually
  • Low confidence (<70%): May need manual verification

Low-confidence scores may indicate:

  • Ambiguous Truth Nugget
  • Unclear AI response
  • Sarcasm or context-dependent language

Audit Trail

See full metadata:

Metadata:
├─ Timestamp: 2026-03-14 14:32:10 UTC
├─ Engine: ChatGPT (gpt-4-turbo)
├─ Model temperature: 0.7
├─ Latency: 2.3 seconds
├─ Query template: "Tell me about {product_name}"
└─ Query index: 2 of 5 variations

Filtering Results

Filter results to focus on a specific slice:

By Engine

  • View only ChatGPT results
  • View only Claude results
  • Compare engines side-by-side

By Truth Nugget Category

  • View only Product category results
  • View only Financial category results

By Score Range

  • View only excellent matches (90+)
  • View only hallucinations (<50)
  • View only partial matches (70-89)

By Time

  • Last 24 hours
  • Last 7 days
  • Last 30 days
  • Custom date range

By Status

  • All results
  • Alerts only (requires action)
  • Matches only (accurate)

Trend Analysis

View trends over time:

Go to: Shield → Cross-Checks → Trends

See:

  • Overall accuracy trend (line chart): How your Health Score changes over 30 days
  • By engine (multi-line): Track improvement per engine
  • By category (multi-line): Which categories improve fastest

Example:

  • “Overall Health Score improved 8 points in March”
  • “ChatGPT improved 12 points (good); Gemini -2 points (degraded)”
  • “Product category improved (corrections worked); Financial unchanged”

Helpful for:

  • Seeing if corrections actually work
  • Identifying which engines are most problematic
  • Prioritizing future corrections

Comparison Views

Side-by-Side Engine Comparison

View how different engines answer the same query:

Query: "How much does Shield cost?"
ChatGPT: "$349/month starting price" (Score: 100)
Claude: "Around $350/month for Starter tier" (Score: 96)
Gemini: "Premium SaaS pricing" (Score: 50)
Perplexity: "Costs about $400/month" (Score: 70)

Useful for:

  • Spotting which engines are accurate
  • Identifying common misconceptions across engines
  • Planning corrections (if 3/4 engines are wrong, deploy a correction)

Historical Progression

See how a single fact’s accuracy changed over time:

Fact: "Founded in 2024"
March 1: ChatGPT (95), Claude (100), Gemini (85)
March 8: ChatGPT (95), Claude (100), Gemini (92) ↑
March 15: ChatGPT (98) ↑, Claude (100), Gemini (95) ↑
Trend: All engines improving. Corrections deployed March 5 are working.
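This kind of progression is easy to confirm programmatically once you have a fact's scores in date order, e.g. from an export. A minimal sketch (the strict "never dips" rule is an assumption; real score series are noisy, so you may want a tolerance):

```python
def is_improving(scores: list) -> bool:
    """True if no check scored lower than the previous one
    and there was a net gain overall."""
    steady = all(later >= earlier for earlier, later in zip(scores, scores[1:]))
    return steady and scores[-1] > scores[0]
```

Gemini's series above, `[85, 92, 95]`, passes this check; a flat or dipping series does not.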

Exporting Results

Export to CSV

Click Export → CSV to get:

  • All results for analysis in Excel or Python
  • Columns: engine, query, response, score, status, timestamp
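With those columns, a per-engine average is a few lines of standard-library Python. A sketch assuming the documented column names (`engine`, `score`) appear verbatim in the export header:

```python
import csv
import io
from collections import defaultdict
from statistics import mean

def mean_score_by_engine(csv_text: str) -> dict:
    """Average Truth Score per engine from a Shield CSV export.

    Assumes the documented columns: engine, query, response,
    score, status, timestamp.
    """
    by_engine = defaultdict(list)
    for row in csv.DictReader(io.StringIO(csv_text)):
        by_engine[row["engine"]].append(float(row["score"]))
    return {engine: mean(scores) for engine, scores in by_engine.items()}
```

The same idea extends to grouping by status or by query template.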

Export to JSON

Click Export → JSON to get:

  • Structured data for custom dashboards
  • Full details (not just summary)

Export to PDF Report

Click Export → PDF to get:

  • Formatted report with charts
  • Share with executives or auditors
  • Includes trends and recommendations

Common Patterns

Pattern 1: Consistent Hallucination

Same inaccuracy across all engines:

All say: "Founded in 2023"
Your truth: "Founded in 2024"

Action: Deploy a correction immediately. All engines believe the falsehood.

Pattern 2: Engine-Specific Hallucination

One or two engines are wrong:

ChatGPT: Correct (95)
Claude: Correct (98)
Gemini: Wrong (35)
Perplexity: Wrong (42)

Action: Target Gemini and Perplexity in next correction (higher priority). ChatGPT/Claude don’t need fixing.

Pattern 3: Partial Information

Some engines mention fact, others don’t:

ChatGPT: Mentions (89 - almost right)
Claude: Mentions (100 - exact)
Gemini: Doesn't mention (50 - silent)
Perplexity: Mentions wrong (40 - wrong)

Action: Deploy a correction. Two engines are right; two are wrong or silent.
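Patterns 1 and 2 can be triaged automatically from a fact's per-engine scores. A sketch using the <50 hallucination band defined earlier (the function name and return labels are my own):

```python
def classify_pattern(scores: dict) -> str:
    """Given {engine: Truth Score}, label the hallucination pattern."""
    wrong = [engine for engine, score in scores.items() if score < 50]
    if not wrong:
        return "no hallucination"
    if len(wrong) == len(scores):
        return "consistent hallucination"
    return "engine-specific hallucination: " + ", ".join(sorted(wrong))
```

Applied to the Pattern 2 example, this flags Gemini and Perplexity while leaving ChatGPT and Claude alone.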

Pattern 4: Improving Trend

Scores rising after corrections deployed:

Before correction: 65, 68, 70 (trending up slowly)
Correction deployed
After correction: 78, 85, 92 (rapid improvement)

Action: Corrections working! Continue deploying. This pattern validates your correction strategy.

Troubleshooting

Score Seems Wrong

If you think Shield scored a result incorrectly:

  1. Click the result
  2. Review full response and entity extraction
  3. Check the NLI confidence score
  4. If confidence is low, Shield wasn’t sure — you can manually override
  5. Click Mark as Accurate or Mark as Hallucination to correct

Shield learns from your feedback.

Result Missing

If you expected a result but don’t see it:

  1. Check schedule is enabled: Shield → Schedules
  2. Check filters aren’t hiding it (filter by engine, category, time)
  3. Check audit log to see if query ran: Settings → Audit
  4. If query ran but no result, contact support

Scores Vary Widely

If the same fact gets different scores each time:

  1. This is normal — AI responses vary
  2. Use trend analysis instead of individual scores
  3. If variance is >20 points, your Truth Nugget may be ambiguous — make it more specific
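The 20-point rule of thumb in step 3 is simple to check on a fact's repeated scores. A minimal sketch (using spread, i.e. max minus min, as the variance measure, which is an assumption on my part):

```python
def is_ambiguous(scores: list, threshold: int = 20) -> bool:
    """Flag a Truth Nugget whose repeat scores spread by more than
    `threshold` points across runs."""
    return max(scores) - min(scores) > threshold
```

A series like `[55, 80, 92]` spreads 37 points and gets flagged; `[88, 94, 95]` does not. Flagged nuggets are candidates for rewording into something more specific.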

Next Steps