
Interpreting Results

Cross-check results show what AI engines said about your organization and how well it matched your Truth Nuggets. Learn to interpret the scores and identify patterns.

Results Table

After monitoring runs, view results in Shield → Cross-Checks → Results:

Engine     | Query                            | Response Summary                       | Truth Score | Status
ChatGPT    | "When was TruthVouch founded?"   | "Founded in 2024"                      | 100         | Match
Claude     | "What's Shield's accuracy?"      | "Around 94% or higher"                 | 94          | Match
Gemini     | "Tell me about Shield pricing"   | "Premium pricing" (no specific number) | 60          | Mismatch
Perplexity | "Who is TruthVouch's CEO?"       | "Founded by David Kumar" (wrong)       | 15          | Hallucination

Columns Explained

Engine: Which AI engine (ChatGPT, Claude, Gemini, Perplexity, Copilot)

Query: The question asked (auto-generated from template)

Response Summary: Key excerpt from AI’s response (full response available on click)

Truth Score: 0-100 accuracy rating (higher = more accurate)

Status:

  • Match (90+): AI response aligns with your truth
  • Partial (70-89): Some accuracy, minor discrepancies
  • Mismatch (50-69): Significant inaccuracy
  • Hallucination (<50): Major falsehood or fabrication
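If you post-process exported results in Python, the score bands above translate into a small helper. This is a sketch of the documented thresholds only; Shield's internal scoring logic is not published:

```python
def status_for(score: int) -> str:
    """Map a 0-100 Truth Score to its documented status band."""
    if score >= 90:
        return "Match"
    if score >= 70:
        return "Partial"
    if score >= 50:
        return "Mismatch"
    return "Hallucination"
```

For example, `status_for(94)` returns `"Match"` and `status_for(15)` returns `"Hallucination"`, matching the results table above.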

Truth Score Interpretation

90-100: Excellent

AI response matches your truth almost perfectly.

Example:

  • Your fact: “Founded in 2024”
  • AI said: “Founded in early 2024”
  • Score: 97

No action needed. Mark as verified in your dashboard.

80-89: Good

AI response is mostly accurate with minor variations.

Example:

  • Your fact: “Shield monitors 9+ AI engines”
  • AI said: “Monitors multiple major LLMs”
  • Score: 85

Consider this acceptable. The AI didn't cite the specific number (9+), but the spirit is correct. Optional: deploy a correction for precision.

70-79: Fair

AI response has meaningful discrepancies.

Example:

  • Your fact: “Shield costs $349/month”
  • AI said: “Premium pricing, around $300-500/month”
  • Score: 74

Deploy a correction to be specific. The AI is in the ballpark but wrong on the exact price.

50-69: Poor

AI response is significantly inaccurate.

Example:

  • Your fact: “CEO: Sarah Chen”
  • AI said: “Led by founders including Chen”
  • Score: 55

Deploy a correction immediately. Wrong enough to affect brand perception.

Below 50: Hallucination

AI made a clear falsehood or fabrication.

Example:

  • Your fact: “Founded in 2024”
  • AI said: “Founded in 2019”
  • Score: 12

Critical — deploy correction urgently. Clear contradiction.

Drilling Into Details

Click any result row to see full details:

Full Response

See the complete text the AI engine generated, not just the summary.

Example: ChatGPT full response

"TruthVouch is a SaaS platform founded in 2024 that specializes
in monitoring AI systems for hallucinations. The company's Shield
product detects inaccuracies with 94% accuracy and monitors 9+
major LLM providers including OpenAI, Anthropic, and Google. It's
available starting at the Starter tier priced at $349/month."

Entity Extraction

See which entities (people, numbers, dates, products) were extracted:

Entities Extracted:
├─ Organization: TruthVouch, OpenAI, Anthropic, Google
├─ Product: Shield, LLM
├─ Date: 2024
├─ Percentage: 94%
├─ Number: 9
└─ Price: $349/month
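The structured entity types shown above (dates, percentages, prices) can be approximated with plain pattern matching. This is an illustrative sketch only; a production extractor would use a trained NER model, and these patterns are my own, not Shield's:

```python
import re

# Rough patterns for three of the entity types in the example above.
# Assumption: these regexes are illustrative, not Shield's actual rules.
PATTERNS = {
    "Date": r"\b(?:19|20)\d{2}\b",                  # four-digit years
    "Percentage": r"\b\d{1,3}(?:\.\d+)?%",          # e.g. 94%
    "Price": r"\$\d[\d,]*(?:\.\d{2})?(?:/month)?",  # e.g. $349/month
}

def extract_entities(text: str) -> dict:
    """Return {entity type: list of matches} for each pattern."""
    return {label: re.findall(pattern, text) for label, pattern in PATTERNS.items()}
```

Running this over the ChatGPT full response above would surface "2024", "94%", and "$349/month", mirroring the extraction tree.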

NLI Analysis

See how Shield evaluated the response:

NLI Comparison:
├─ Your fact: "Founded in 2024"
├─ AI statement: "Founded in 2024"
├─ Verdict: ENTAILED
└─ Confidence: 99.2%
NLI Comparison:
├─ Your fact: "Monitors 9+ AI engines"
├─ AI statement: "Monitors 9+ major LLMs"
├─ Verdict: ENTAILED
└─ Confidence: 96.1%

Confidence Breakdown

See how confident Shield is in its verdict:

  • High confidence (>90%): Trust the score
  • Medium confidence (70-90%): Review manually
  • Low confidence (<70%): May need manual verification

Low-confidence scores may indicate:

  • Ambiguous Truth Nugget
  • Unclear AI response
  • Sarcasm or context-dependent language

Audit Trail

See full metadata:

Metadata:
├─ Timestamp: 2026-03-14 14:32:10 UTC
├─ Engine: ChatGPT (gpt-4-turbo)
├─ Model temperature: 0.7
├─ Latency: 2.3 seconds
├─ Query template: "Tell me about {product_name}"
└─ Query index: 2 of 5 variations

Filtering Results

Filter results to focus on a specific slice:

By Engine

  • View only ChatGPT results
  • View only Claude results
  • Compare engines side-by-side

By Truth Nugget Category

  • View only Product category results
  • View only Financial category results

By Score Range

  • View only excellent matches (90+)
  • View only hallucinations (<50)
  • View only partial matches (70-89)

By Time

  • Last 24 hours
  • Last 7 days
  • Last 30 days
  • Custom date range

By Status

  • All results
  • Alerts only (requires action)
  • Matches only (accurate)

Trend Analysis

View trends over time:

Go to: Shield → Cross-Checks → Trends

See:

  • Overall accuracy trend (line chart): How your Health Score changes over 30 days
  • By engine (multi-line): Track improvement per engine
  • By category (multi-line): Which categories improve fastest

Example:

  • “Overall Health Score improved 8 points in March”
  • “ChatGPT improved 12 points (good); Gemini -2 points (degraded)”
  • “Product category improved (corrections worked); Financial unchanged”

Helpful for:

  • Seeing if corrections actually work
  • Identifying which engines are most problematic
  • Prioritizing future corrections

Comparison Views

Side-by-Side Engine Comparison

View how different engines answer the same query:

Query: "How much does Shield cost?"
ChatGPT: "$349/month starting price" (Score: 100)
Claude: "Around $350/month for Starter tier" (Score: 96)
Gemini: "Premium SaaS pricing" (Score: 50)
Perplexity: "Costs about $400/month" (Score: 70)

Useful for:

  • Spotting which engines are accurate
  • Identifying common misconceptions across engines
  • Planning corrections (if 3/4 engines are wrong, deploy a correction)

Historical Progression

See how a single fact’s accuracy changed over time:

Fact: "Founded in 2024"
March 1: ChatGPT (95), Claude (100), Gemini (85)
March 8: ChatGPT (95), Claude (100), Gemini (92) ↑
March 15: ChatGPT (98) ↑, Claude (100), Gemini (95) ↑
Trend: All engines improving. Corrections deployed March 5 are working.
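This kind of progression is easy to confirm programmatically once you have a fact's scores in date order, e.g. from an export. A minimal sketch (the strict "never dips" rule is an assumption; real score series are noisy, so you may want a tolerance):

```python
def is_improving(scores: list) -> bool:
    """True if no check scored lower than the previous one
    and there was a net gain overall."""
    steady = all(later >= earlier for earlier, later in zip(scores, scores[1:]))
    return steady and scores[-1] > scores[0]
```

Gemini's series above, `[85, 92, 95]`, passes this check; a flat or dipping series does not.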

Exporting Results

Export to CSV

Click Export → CSV to get:

  • All results for analysis in Excel or Python
  • Columns: engine, query, response, score, status, timestamp
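With those columns, a per-engine average is a few lines of standard-library Python. A sketch assuming the documented column names (`engine`, `score`) appear verbatim in the export header:

```python
import csv
import io
from collections import defaultdict
from statistics import mean

def mean_score_by_engine(csv_text: str) -> dict:
    """Average Truth Score per engine from a Shield CSV export.

    Assumes the documented columns: engine, query, response,
    score, status, timestamp.
    """
    by_engine = defaultdict(list)
    for row in csv.DictReader(io.StringIO(csv_text)):
        by_engine[row["engine"]].append(float(row["score"]))
    return {engine: mean(scores) for engine, scores in by_engine.items()}
```

The same idea extends to grouping by status or by query template.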

Export to JSON

Click Export → JSON to get:

  • Structured data for custom dashboards
  • Full details (not just summary)

Export to PDF Report

Click Export → PDF to get:

  • Formatted report with charts
  • Share with executives or auditors
  • Includes trends and recommendations

Common Patterns

Pattern 1: Consistent Hallucination

Same inaccuracy across all engines:

All say: "Founded in 2023"
Your truth: "Founded in 2024"

Action: Deploy a correction immediately. All engines believe the falsehood.

Pattern 2: Engine-Specific Hallucination

One or two engines are wrong:

ChatGPT: Correct (95)
Claude: Correct (98)
Gemini: Wrong (35)
Perplexity: Wrong (42)

Action: Target Gemini and Perplexity in next correction (higher priority). ChatGPT/Claude don’t need fixing.

Pattern 3: Partial Information

Some engines mention fact, others don’t:

ChatGPT: Mentions (89 - almost right)
Claude: Mentions (100 - exact)
Gemini: Doesn't mention (50 - silent)
Perplexity: Mentions wrong (40 - wrong)

Action: Deploy a correction. Two engines are right; two are wrong or silent.
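Patterns 1 and 2 can be triaged automatically from a fact's per-engine scores. A sketch using the <50 hallucination band defined earlier (the function name and return labels are my own):

```python
def classify_pattern(scores: dict) -> str:
    """Given {engine: Truth Score}, label the hallucination pattern."""
    wrong = [engine for engine, score in scores.items() if score < 50]
    if not wrong:
        return "no hallucination"
    if len(wrong) == len(scores):
        return "consistent hallucination"
    return "engine-specific hallucination: " + ", ".join(sorted(wrong))
```

Applied to the Pattern 2 example, this flags Gemini and Perplexity while leaving ChatGPT and Claude alone.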

Pattern 4: Improving Trend

Scores rising after corrections deployed:

Before correction: 65, 68, 70 (trending up slowly)
Correction deployed
After correction: 78, 85, 92 (rapid improvement)

Action: Corrections working! Continue deploying. This pattern validates your correction strategy.

Troubleshooting

Score Seems Wrong

If you think Shield scored a result incorrectly:

  1. Click the result
  2. Review full response and entity extraction
  3. Check the NLI confidence score
  4. If confidence is low, Shield wasn’t sure — you can manually override
  5. Click Mark as Accurate or Mark as Hallucination to correct

Shield learns from your feedback.

Result Missing

If you expected a result but don’t see it:

  1. Check schedule is enabled: Shield → Schedules
  2. Check filters aren’t hiding it (filter by engine, category, time)
  3. Check audit log to see if query ran: Settings → Audit
  4. If query ran but no result, contact support

Scores Vary Widely

If the same fact gets different scores each time:

  1. This is normal — AI responses vary
  2. Use trend analysis instead of individual scores
  3. If variance is >20 points, your Truth Nugget may be ambiguous — make it more specific
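The 20-point rule of thumb in step 3 is simple to check on a fact's repeated scores. A minimal sketch (using spread, i.e. max minus min, as the variance measure, which is an assumption on my part):

```python
def is_ambiguous(scores: list, threshold: int = 20) -> bool:
    """Flag a Truth Nugget whose repeat scores spread by more than
    `threshold` points across runs."""
    return max(scores) - min(scores) > threshold
```

A series like `[55, 80, 92]` spreads 37 points and gets flagged; `[88, 94, 95]` does not. Flagged nuggets are candidates for rewording into something more specific.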

Next Steps