Evaluation Framework

The Evaluation Framework provides comprehensive assessment of AI outputs across multiple evaluation dimensions. Instead of relying on a single metric, it runs multiple independent evaluations and uses statistical consensus to reduce variance and improve reliability.

Available Evaluators

TruthVouch includes eight built-in evaluators covering diverse quality dimensions, plus support for custom evaluators:

  • Factual Accuracy — Checks if claims are factually correct against your knowledge base
  • Semantic Similarity — Measures whether the output captures the intended meaning
  • Response Completeness — Ensures the response fully addresses the user’s request
  • Hallucination Score — Detects fabricated information not grounded in source material
  • Citation Verification — Confirms citations accurately reference sources
  • Prompt Injection Detection — Identifies adversarial prompts attempting to override system behavior
  • Toxicity & Safety — Flags harmful, offensive, or unsafe content
  • Bias Detection — Identifies stereotyping, unfair representation, or discriminatory language
  • Custom Evaluators — Define your own evaluation criteria and scoring rubrics

ChainPoll Consensus

ChainPoll runs multiple LLM-based evaluations on the same output and combines results using statistical consensus:

Input → [Evaluator A] → Score: 0.92
      → [Evaluator B] → Score: 0.88
      → [Evaluator C] → Score: 0.90

Consensus:  0.90 (mean)
Confidence: 0.85 (low variance)

This approach reduces individual evaluator variance and improves reliability. The consensus score represents the mean across all runs, while confidence reflects the variance:

  • High variance (low confidence) — Evaluators disagree; result is uncertain
  • Low variance (high confidence) — Evaluators agree; result is reliable
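The consensus and confidence values above can be sketched with standard statistics. Note that the mapping from variance to a 0-1 confidence value shown here is an illustrative assumption, not TruthVouch's exact formula:

```python
from statistics import mean, pvariance

def consensus(scores):
    """Combine independent evaluator scores into consensus and confidence.

    The confidence mapping (1 minus variance scaled by the maximum
    possible variance of values in [0, 1]) is illustrative only.
    """
    c = mean(scores)                       # consensus: mean across runs
    v = pvariance(scores)                  # statistical spread across runs
    confidence = max(0.0, 1.0 - v / 0.25)  # 0.25 = max variance on [0, 1]
    return round(c, 2), round(v, 4), round(confidence, 2)

score, variance, confidence = consensus([0.92, 0.88, 0.90])
```

With the three scores from the diagram, the consensus is 0.90; the low spread between evaluators yields a high confidence value.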

Configuration

When running an evaluation, specify:

  • Number of samples — How many times to run each evaluator (default: 3)
  • Consensus threshold — Minimum agreement level required to pass (default: 0.75)
  • Evaluator selection — Which evaluators to run (run all, or select specific ones)
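A configuration covering the three options above might look like the following sketch; the key names are assumptions for illustration, not the exact TruthVouch API fields:

```python
# Illustrative evaluation configuration; key names are assumptions,
# not a documented TruthVouch schema.
eval_config = {
    "num_samples": 3,             # runs per evaluator (default: 3)
    "consensus_threshold": 0.75,  # minimum agreement to pass (default: 0.75)
    "evaluators": [               # omit this key to run all evaluators
        "Factual Accuracy",
        "Hallucination Score",
    ],
}
```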

Custom Evaluator Builder

Create custom evaluators tailored to your domain:

  1. Go to Governance Hub → Evaluation Framework
  2. Click Create Custom Evaluator
  3. Define:
    • Name — Descriptive evaluator name (e.g., “Medical Terminology Accuracy”)
    • Description — What this evaluator measures
    • Rubric — Scoring criteria (1-5 point scale, with descriptors for each level)
    • Examples — Sample inputs and expected scores for calibration
  4. Test the evaluator against sample outputs
  5. Deploy to make available for all evaluations

Custom evaluators run through the same LLM-based pipeline as built-in evaluators; the examples you provide calibrate their scoring.
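A custom evaluator definition mirrors the builder fields above (name, description, rubric, examples). The structure below is an illustrative sketch, not a documented schema:

```python
# Sketch of a custom evaluator definition; keys and structure are
# illustrative, mirroring the fields in the builder UI.
custom_evaluator = {
    "name": "Medical Terminology Accuracy",
    "description": "Checks that clinical terms are used correctly.",
    "rubric": {  # 1-5 scale with a descriptor per level
        1: "Terminology consistently misused",
        2: "Frequent terminology errors",
        3: "Mostly correct with notable slips",
        4: "Correct with minor imprecision",
        5: "Precise, contextually correct usage",
    },
    "examples": [  # calibration pairs: sample input and expected score
        {"text": "The patient presented with myocardial infarction.", "score": 5},
        {"text": "The patient had a heart attack in their leg.", "score": 1},
    ],
}
```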

Agentic Evaluation Metrics

For agentic AI systems (agents that take actions, call tools, etc.), specialized metrics assess:

  • Tool Selection Accuracy — Did the agent choose the right tool for the task?
  • Action Completion Rate — What percentage of tool calls succeeded?
  • Tool Error Recovery — When a tool fails, does the agent retry or recover gracefully?
  • Plan Coherence — Does the agent’s action sequence make logical sense?
  • Resource Efficiency — Did the agent accomplish the task with minimal tool calls?

These metrics are automatically available when evaluating agent outputs.
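Several of these metrics reduce to simple ratios over an agent's tool-call trace. The sketch below computes Action Completion Rate; the trace format is an assumption for illustration, not TruthVouch's internal schema:

```python
def action_completion_rate(tool_calls):
    """Fraction of tool calls that succeeded.

    `tool_calls` is a list of dicts with a boolean "success" key;
    this trace format is illustrative only.
    """
    if not tool_calls:
        return 0.0
    return sum(1 for call in tool_calls if call["success"]) / len(tool_calls)

trace = [
    {"tool": "search", "success": True},
    {"tool": "fetch_page", "success": False},  # failed call
    {"tool": "fetch_page", "success": True},   # retry succeeded
    {"tool": "summarize", "success": True},
]
rate = action_completion_rate(trace)  # 3 of 4 calls succeeded -> 0.75
```

The same trace also feeds Tool Error Recovery: the failed `fetch_page` call followed by a successful retry is the pattern that metric rewards.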

Uncertainty Scoring

Evaluation results include uncertainty estimates computed via multi-sample variance analysis:

  • Confidence Score (0-1) — How confident the evaluation result is
  • Variance — Statistical spread across samples
  • Sample Count — Number of evaluations run

Example:

{
"evaluator": "Factual Accuracy",
"score": 0.87,
"confidence": 0.92,
"variance": 0.015,
"samples": 5,
"interpretation": "High confidence: evaluators consistently agree"
}

Use confidence scores to:

  • Flag uncertain results for manual review (confidence < 0.7)
  • Require additional sampling on borderline scores (0.4-0.6)
  • Automate decisions on high-confidence results (> 0.85)
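The three routing rules above can be expressed as a small decision function. The thresholds match the bullet list; the action labels are illustrative:

```python
def route_result(confidence, score):
    """Route an evaluation result by confidence.

    Thresholds follow the guidance above; the returned action
    labels are illustrative, not part of any TruthVouch API.
    """
    if confidence > 0.85:
        return "automate"       # high confidence: act on the result
    if confidence < 0.7:
        return "manual_review"  # uncertain: flag for a human
    if 0.4 <= score <= 0.6:
        return "resample"       # borderline score: run more samples
    return "accept"

decision = route_result(confidence=0.92, score=0.87)  # -> "automate"
```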

Evaluation Framework UI

The Evaluation Framework lives in the Governance Hub and provides:

Configuration Screen

  • Select evaluators to run
  • Set consensus thresholds
  • Configure sample counts
  • Manage custom evaluators
  • View evaluator performance history

Run Evaluations

Input Batch → Configure Evaluators → Run → Monitor Progress → Results Dashboard

Results Dashboard

View evaluation results across dimensions:

  • Evaluator Scores — Table showing each evaluator’s score and confidence
  • Consensus Score — Overall evaluation result
  • Trend Analysis — How scores change over time
  • Failed Evaluations — Which specific checks failed and why
  • Distribution — Histogram of scores across your evaluation history

Custom Evaluator Management

  • List all custom evaluators with performance metrics
  • Edit rubrics and examples
  • Test evaluator against sample data
  • Archive or delete unused evaluators
  • Review evaluator training examples

Integration Examples

Batch Evaluation

Evaluate multiple outputs at once:

from truthvouch import TruthVouchClient

client = TruthVouchClient(api_key="your-api-key")

outputs = [
    "The capital of France is Paris.",
    "Machine learning is a type of artificial intelligence.",
    "The moon orbits the earth in 28 days.",
]

results = []
for output in outputs:
    result = client.evaluate_output(
        text=output,
        model="gpt-4",
    )
    results.append({
        "output": output,
        "blocked": result.blocked,
        "flagged": result.flagged,
    })

Streaming Evaluation

For streaming responses, evaluate in chunks or at completion:

# Evaluate complete response after streaming
full_response = ""
async for chunk in llm.stream_response(prompt):
    full_response += chunk

# Final evaluation
result = await client.evaluate_output(
    text=full_response,
    model="gpt-4",
)
if result.blocked:
    # Response violates policy
    return {"error": result.block_reasons}

Best Practices

Choosing Evaluators

  • General content — Use Hallucination, Toxicity & Safety, Bias Detection
  • Domain-specific — Add custom evaluators for your industry
  • Agent outputs — Use agentic metrics for agent evaluation
  • Strict compliance — Run all evaluators; set high consensus thresholds

Setting Thresholds

  • Critical content (medical, legal, financial) — 0.9+ consensus, all evaluators
  • General use — 0.75 consensus, standard evaluators
  • Uncertain cases — Flag for human review if confidence < 0.7

Sampling Strategy

  • Fast evaluation — 1-2 samples, trade off reliability for speed
  • Balanced — 3-5 samples (default), good reliability with acceptable latency
  • High assurance — 10+ samples, maximum reliability for critical decisions

Next Steps