Interpreting Results
Cross-check results show what AI engines said about your organization and how well it matched your Truth Nuggets. Learn to interpret the scores and identify patterns.
Results Table
After monitoring runs, view results in Shield → Cross-Checks → Results:
| Engine | Query | Response Summary | Truth Score | Status |
|---|---|---|---|---|
| ChatGPT | "When was TruthVouch founded?" | "Founded in 2024" | 100 | Match |
| Claude | "What's Shield's accuracy?" | "Around 94% or higher" | 94 | Match |
| Gemini | "Tell me about Shield pricing" | "Premium pricing" (no specific number) | 60 | Mismatch |
| Perplexity | "Who is TruthVouch's CEO?" | "Founded by David Kumar" (wrong) | 15 | Hallucination |
Columns Explained
Engine: Which AI engine (ChatGPT, Claude, Gemini, Perplexity, Copilot)
Query: The question asked (auto-generated from template)
Response Summary: Key excerpt from AI’s response (full response available on click)
Truth Score: 0-100 accuracy rating (higher = more accurate)
Status:
- Match (90+): AI response aligns with your truth
- Partial (70-89): Some accuracy, minor discrepancies
- Mismatch (50-69): Significant inaccuracy
- Hallucination (<50): Major falsehood or fabrication
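These bands are easy to apply when post-processing exported results. A minimal sketch (illustrative only; Shield computes status server-side):

```python
def status_for(score: int) -> str:
    """Map a 0-100 Truth Score to the status band used in the Results table."""
    if score >= 90:
        return "Match"
    if score >= 70:
        return "Partial"
    if score >= 50:
        return "Mismatch"
    return "Hallucination"
```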
Truth Score Interpretation
90-100: Excellent
AI response matches your truth almost perfectly.
Example:
- Your fact: “Founded in 2024”
- AI said: “Founded in early 2024”
- Score: 97
No action needed. Mark as verified in your dashboard.
80-89: Good
AI response is mostly accurate with minor variations.
Example:
- Your fact: “Shield monitors 9+ AI engines”
- AI said: “Monitors multiple major LLMs”
- Score: 85
Treat as acceptable. The AI didn't cite the specific figure (9+), but the spirit is correct. Optional: deploy a correction for precision.
70-79: Fair
AI response has meaningful discrepancies.
Example:
- Your fact: “Shield costs $349/month”
- AI said: “Premium pricing, around $300-500/month”
- Score: 74
Deploy a correction to be specific. The AI is in the ballpark but wrong on the exact price.
50-69: Poor
AI response is significantly inaccurate.
Example:
- Your fact: “CEO: Sarah Chen”
- AI said: “Led by founders including Chen”
- Score: 55
Deploy a correction immediately. Wrong enough to affect brand perception.
Below 50: Hallucination
AI made a clear falsehood or fabrication.
Example:
- Your fact: “Founded in 2024”
- AI said: “Founded in 2019”
- Score: 12
Critical — deploy correction urgently. Clear contradiction.
Drilling Into Details
Click any result row to see full details:
Full Response
See the complete text the AI engine generated, not just the summary.
Example: ChatGPT full response
"TruthVouch is a SaaS platform founded in 2024 that specializesin monitoring AI systems for hallucinations. The company's Shieldproduct detects inaccuracies with 94% accuracy and monitors 9+major LLM providers including OpenAI, Anthropic, and Google. It'savailable starting at the Starter tier priced at $349/month."Entity Extraction
See which entities (people, numbers, dates, products) were extracted:
```
Entities Extracted:
├─ Organization: TruthVouch, OpenAI, Anthropic, Google
├─ Product: Shield, LLM
├─ Date: 2024
├─ Percentage: 94%
├─ Number: 9
└─ Price: $349/month
```
NLI Analysis
See how Shield evaluated the response:
```
NLI Comparison:
├─ Your fact: "Founded in 2024"
├─ AI statement: "Founded in 2024"
├─ Verdict: ENTAILED
└─ Confidence: 99.2%
```
```
NLI Comparison:
├─ Your fact: "Monitors 9+ AI engines"
├─ AI statement: "Monitors 9+ major LLMs"
├─ Verdict: ENTAILED
└─ Confidence: 96.1%
```
Confidence Breakdown
See how confident Shield is in its verdict:
- High confidence (>90%): Trust the score
- Medium confidence (70-90%): Review manually
- Low confidence (<70%): May need manual verification
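When scripting against exported results, the same triage can be applied with a small helper (a sketch; the returned action labels are our own, not a Shield API):

```python
def triage(confidence: float) -> str:
    """Suggest an action based on NLI confidence (a percentage, 0-100)."""
    if confidence > 90:
        return "trust-score"
    if confidence >= 70:
        return "review-manually"
    return "verify-manually"
```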
Low-confidence scores may indicate:
- Ambiguous Truth Nugget
- Unclear AI response
- Sarcasm or context-dependent language
Audit Trail
See full metadata:
```
Metadata:
├─ Timestamp: 2026-03-14 14:32:10 UTC
├─ Engine: ChatGPT (gpt-4-turbo)
├─ Model temperature: 0.7
├─ Latency: 2.3 seconds
├─ Query template: "Tell me about {product_name}"
└─ Query index: 2 of 5 variations
```
Filtering Results
Filter results to focus on specific slices:
By Engine
- View only ChatGPT results
- View only Claude results
- Compare engines side-by-side
By Truth Nugget Category
- View only Product category results
- View only Financial category results
By Score Range
- View only excellent matches (90+)
- View only hallucinations (<50)
- View only partial matches (70-89)
By Time
- Last 24 hours
- Last 7 days
- Last 30 days
- Custom date range
By Status
- All results
- Alerts only (requires action)
- Matches only (accurate)
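The same filters are easy to reproduce over exported rows. A minimal sketch, assuming each row is a dict keyed by the CSV export's column names (`engine`, `score`, `status`):

```python
def filter_results(rows, engine=None, min_score=None, max_score=None, status=None):
    """Filter exported result rows the way the UI filters do.
    Field names mirror the CSV export columns; adjust to your export."""
    out = []
    for r in rows:
        if engine is not None and r["engine"] != engine:
            continue
        if min_score is not None and r["score"] < min_score:
            continue
        if max_score is not None and r["score"] > max_score:
            continue
        if status is not None and r["status"] != status:
            continue
        out.append(r)
    return out
```

For example, `filter_results(rows, max_score=49)` returns only hallucinations, and `filter_results(rows, engine="ChatGPT", min_score=90)` returns ChatGPT's excellent matches.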
Trend Analysis
View trends over time:
Go to: Shield → Cross-Checks → Trends
See:
- Overall accuracy trend (line chart): How your Health Score changes over 30 days
- By engine (multi-line): Track improvement per engine
- By category (multi-line): Which categories improve fastest
Example:
- “Overall Health Score improved 8 points in March”
- “ChatGPT improved 12 points (good); Gemini -2 points (degraded)”
- “Product category improved (corrections worked); Financial unchanged”
Helpful for:
- Seeing if corrections actually work
- Identifying which engines are most problematic
- Prioritizing future corrections
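Per-engine movement like "ChatGPT improved 12 points; Gemini -2" can be computed directly from exported history. A sketch, assuming scores are kept as an oldest-to-newest list per engine:

```python
def score_delta(history):
    """Net score change per engine, given {engine: [scores, oldest -> newest]}.
    Positive = improving, negative = degrading."""
    return {engine: scores[-1] - scores[0]
            for engine, scores in history.items() if scores}
```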
Comparison Views
Side-by-Side Engine Comparison
View how different engines answer the same query:
Query: "How much does Shield cost?"
ChatGPT: "$349/month starting price" (Score: 100)Claude: "Around $350/month for Starter tier" (Score: 96)Gemini: "Premium SaaS pricing" (Score: 50)Perplexity: "Costs about $400/month" (Score: 70)Useful for:
- Spotting which engines are accurate
- Identifying common misconceptions across engines
- Planning corrections (if 3/4 engines are wrong, deploy a correction)
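The "if 3/4 engines are wrong, deploy a correction" heuristic can be sketched as follows (the threshold and majority values are illustrative defaults, not Shield settings):

```python
def needs_correction(scores_by_engine, threshold=70, majority=0.5):
    """Flag a fact for correction when more than `majority` of engines
    score below `threshold` (e.g. 3 of 4 engines wrong)."""
    wrong = sum(1 for s in scores_by_engine.values() if s < threshold)
    return wrong / len(scores_by_engine) > majority
```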
Historical Progression
See how a single fact’s accuracy changed over time:
Fact: "Founded in 2024"
```
March 1:  ChatGPT (95),   Claude (100), Gemini (85)
March 8:  ChatGPT (95),   Claude (100), Gemini (92) ↑
March 15: ChatGPT (98) ↑, Claude (100), Gemini (95) ↑
```
Trend: All engines improving. Corrections deployed March 5 are working.
Exporting Results
Export to CSV
Click Export → CSV to get:
- All results for analysis in Excel or Python
- Columns: engine, query, response, score, status, timestamp
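A quick way to summarize an exported CSV, assuming the column names listed above (a sketch using only the standard library):

```python
import csv
import io
from collections import defaultdict

def avg_score_by_engine(csv_text):
    """Mean Truth Score per engine from a CSV export with columns:
    engine, query, response, score, status, timestamp."""
    scores = defaultdict(list)
    for row in csv.DictReader(io.StringIO(csv_text)):
        scores[row["engine"]].append(float(row["score"]))
    return {engine: sum(vals) / len(vals) for engine, vals in scores.items()}
```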
Export to JSON
Click Export → JSON to get:
- Structured data for custom dashboards
- Full details (not just summary)
Export to PDF Report
Click Export → PDF to get:
- Formatted report with charts
- Share with executives or auditors
- Includes trends and recommendations
Common Patterns
Pattern 1: Consistent Hallucination
Same inaccuracy across all engines:
```
All say:    "Founded in 2023"
Your truth: "Founded in 2024"
```
Action: Deploy correction immediately. All engines believe the falsehood.
Pattern 2: Engine-Specific Hallucination
One or two engines are wrong:
```
ChatGPT:    Correct (95)
Claude:     Correct (98)
Gemini:     Wrong (35)
Perplexity: Wrong (42)
```
Action: Target Gemini and Perplexity in the next correction (higher priority). ChatGPT and Claude don't need fixing.
Pattern 3: Partial Information
Some engines mention fact, others don’t:
```
ChatGPT:    Mentions (89 - almost right)
Claude:     Mentions (100 - exact)
Gemini:     Doesn't mention (50 - silent)
Perplexity: Mentions wrong (40 - wrong)
```
Action: Deploy correction. Two engines are right; two are wrong or silent.
Pattern 4: Improving Trend
Scores rising after corrections deployed:
```
Before correction: 65, 68, 70 (trending up slowly)
Correction deployed
After correction:  78, 85, 92 (rapid improvement)
```
Action: Corrections are working. Continue deploying; this pattern validates your correction strategy.
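Pattern 4 can be quantified by comparing mean scores before and after the deployment date. A sketch, assuming you have an ordered score series and know which index follows the correction:

```python
def correction_effect(scores, deploy_index):
    """Mean score improvement after a correction.
    `scores` is ordered oldest -> newest; `deploy_index` is the position
    of the first run after the correction was deployed."""
    before = scores[:deploy_index]
    after = scores[deploy_index:]
    return sum(after) / len(after) - sum(before) / len(before)
```

Applied to the series above, `correction_effect([65, 68, 70, 78, 85, 92], 3)` shows roughly a 17-point jump.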
Troubleshooting
Score Seems Wrong
If you think Shield scored a result incorrectly:
- Click the result
- Review full response and entity extraction
- Check the NLI confidence score
- If confidence is low, Shield wasn’t sure — you can manually override
- Click Mark as Accurate or Mark as Hallucination to correct
Shield learns from your feedback.
Result Missing
If you expected a result but don’t see it:
- Check schedule is enabled: Shield → Schedules
- Check filters aren’t hiding it (filter by engine, category, time)
- Check audit log to see if query ran: Settings → Audit
- If query ran but no result, contact support
Scores Wildly Vary
If the same fact gets different scores each time:
- This is normal — AI responses vary
- Use trend analysis instead of individual scores
- If variance is >20 points, your Truth Nugget may be ambiguous — make it more specific
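The 20-point spread check can be sketched as a one-liner over a fact's recent scores:

```python
def is_ambiguous(scores, max_spread=20):
    """Flag a Truth Nugget whose scores for the same fact vary by more
    than `max_spread` points across runs -- often a sign the fact
    needs to be made more specific."""
    return max(scores) - min(scores) > max_spread
```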
Next Steps
- How Cross-Checks Work — Technical deep dive
- Managing Alerts — Responding to results
- Dashboard Overview — Aggregate view