Monitoring & Overrides

The TruthOps monitoring dashboard automatically shows real-time health of all agents. TruthVouch continuously tracks performance metrics, generates alerts, and lets you manually override agent decisions when needed.

SLA & Alerts

Dashboard Overview

Top-Level Metrics

Agent Health Summary (at a glance):

Total Agents: 12
  Healthy (Green):    10
  Warning (Yellow):    1  (Customer Service accuracy drifting)
  Critical (Red):      1  (Resume Screener API failing)

Overall System Health: 96.2%
  Target: ≥99%
  Trend: ↓ (was 97.1% yesterday)

Key Metrics by Agent

View all agents in a table:

Agent Name	Risk	Status	Accuracy	Latency	Uptime	Cost	Alerts
Customer Support	HIGH	⚠ Warning	84% ↓	1.8s ✓	99.9% ✓	$4.8K	1
Resume Screener	CRITICAL	🔴 Critical	91% ✓	—	0%	$2.1K	3
Content Moderation	HIGH	✓ Healthy	77% ✓	0.3s ✓	99.95% ✓	$1.2K	0
Data Classifier	MEDIUM	✓ Healthy	96% ✓	0.5s ✓	99.8% ✓	$0.9K	0

Click any agent to see detailed metrics

Drill-Down: Individual Agent

Click “Customer Support” for detailed view:

Health Status

Status: ⚠ Warning (accuracy drifting below target)
Last update: 15 seconds ago
Incident history: No recent incidents
Estimated time to Critical: 24 hours (if trend continues)

Performance Metrics

Accuracy

Current: 84%
Target: ≥85%
Trend: ↓ 2% (was 86% yesterday)
Status: Below target ⚠
Root cause: Possibly outdated product knowledge (new features released 3 days ago)
Recommendation: Update knowledge base

Latency (p95)

Current: 1.8 seconds
Target: <2 seconds
Trend: → Stable
Status: Healthy ✓

Availability

Current: 99.9%
Target: ≥99%
Uptime: 24h: 99.9%, 7d: 99.85%, 30d: 99.8%
Last incident: 2 days ago (5 min outage)
Status: Healthy ✓

Cost

Daily average: $160
Monthly forecast: $4,800
Budget: $5,000
Status: On track ✓

Detailed Logs

Recent decisions (sampling):

Decision 1: 2024-03-15 14:35
  Query: "What's your refund policy?"
  AI Response: "We offer 30-day refunds"
  Truth Nugget: "We offer 14-day refunds"
  Hallucination: YES
  Confidence: 87%
  Override needed: YES

Decision 2: 2024-03-15 14:32
  Query: "Do you support HIPAA?"
  AI Response: "No, we don't support HIPAA"
  Truth Nugget: "We're HIPAA-ready (BAA available)"
  Hallucination: YES
  Confidence: 92%
  Override needed: YES

Manual Overrides

When an agent makes a wrong decision, manually override it:

Override an Agent Decision

Locate decision in logs (see above examples)
Click decision → “Override”
Provide correct answer (what should agent have said?)
Reason (why override? e.g., “Wrong refund policy”, “Outdated knowledge”)
Submit (logged in audit trail)

Impact:

Immediate correction sent to user/system
Feedback logged (helps improve agent)
Knowledge base can be updated (prevent future errors)

Batch Overrides

If multiple decisions wrong (e.g., same misconception):

Select multiple decisions (checkbox)
Bulk Actions → Override & Correct
Provide correction (once, applies to all selected)
Update knowledge base (optional; prevents future errors)

Example:

5 customer service decisions incorrectly said "30-day refund"
Select all 5 decisions
Bulk override: "Correct answer is 14-day refund"
Auto-update knowledge base with correct refund policy
Future queries won't have this error

Incident Management

When alerts trigger:

Critical Alert Example

🔴 CRITICAL: Resume Screener - API Failure

Alert triggered: 2024-03-15 14:20
Alert type: Availability (0% uptime for 5 min)
Status: Ongoing
Affected agent: Resume Screener (CRITICAL risk level)

What's happening:
  - OpenAI API rate limit exceeded
  - Agent unable to evaluate candidates
  - 47 pending evaluations stuck

Recommendations:
  1. Check OpenAI API quota
  2. Reduce concurrent evaluations (rate limiting)
  3. Switch to backup model (Claude) if available
  4. Contact OpenAI support

Actions taken:
  [Disable Agent] [Switch Model] [Page On-Call] [Acknowledge Alert]

Disable Agent

If agent broken and needs recovery:

Click Disable Agent
Agent stops processing new requests
Pending requests queued or escalated to human
Status updates to “Paused - Critical Issue”
Escalation notified

Re-enable when fixed:

Fix underlying issue (API quota restored, etc.)
Click Re-enable Agent
Agent resumes processing
Queued requests processed in order

Trend Analysis

Performance Over Time

View trends for any metric:

Accuracy Trend (Last 30 Days)
─────────────────────────────
Day 1-7:   87% (stable)
Day 8-14:  85% (dropped 2%)
Day 15-21: 83% (dropped 2% more)
Day 22-30: 82% (still declining)

Trend: ↓ Sharp decline starting day 8
Root cause: Product knowledge outdated after launch of 3 new features

Recommendation: Update knowledge base with new feature descriptions

Alert Trend

Alerts Triggered (Last 30 Days)
──────────────────────────────
Week 1: 0 alerts (healthy)
Week 2: 2 alerts (accuracy warnings)
Week 3: 5 alerts (accuracy critical + latency warnings)
Week 4: 8 alerts (cascading issues)

Pattern: Alerts increasing, suggesting systemic issue
Most common: Accuracy below threshold
Recommendation: Root cause analysis needed

Performance Baselines

Set Baseline

First time agent deployed, establish baseline metrics:

Set Baseline for Customer Service Agent:
  Week 1: Monitor only (no alerting)
  Establish normal ranges:
    Accuracy: 85-89%
    Latency: 1.5-2.2s
    Cost: $140-$180/day

After Week 1:
  Set targets based on baseline
  Enable alerting at ±10% of baseline

Seasonal Adjustment

Some metrics vary seasonally:

Cost baseline varies by day of week:
  Mon-Fri: $160/day (business hours heavy)
  Sat-Sun: $80/day (weekend light)

Accuracy baseline varies by season:
  Q4 (holiday): More queries; similar accuracy
  Q2-Q3: Stable; use this as "normal"

Configure alerts to account for seasonality

Real-Time Monitoring

Live Updates

Dashboard updates in real-time (or configurable intervals):

Real-time (1 sec): Throughput, error alerts
1-minute: Latency, accuracy sampling
5-minute: Uptime, cost tracking
1-hour: Trend analysis, forecasts

Streaming Alerts

Important alerts stream to Slack/email/PagerDuty immediately:

Critical alerts: Instant (page on-call)
High alerts: Within 1 min (notify owner)
Medium alerts: Batch hourly (daily digest)
Low alerts: Daily digest (weekly summary)

Reports & Exports

Generate Reports

Create compliance/audit reports:

Export Monthly Report (March 2024):
  Agent performance summary
  Uptime/SLA metrics
  Accuracy metrics + trends
  Cost analysis
  Incidents & resolutions
  Override/manual correction frequency
  Recommendations for improvements

Format: PDF (printable), Excel (for analysis)

Audit Trail

Complete audit trail of all actions:

2024-03-15 14:35 — Manual override: John Smith
  Decision ID: dec_12345
  Correction: "Refund policy is 14 days (was 30)"
  Reason: "Product knowledge outdated"

2024-03-15 14:20 — Alert: Resume Screener critical
  Type: API failure
  Duration: 5 minutes
  Impact: 47 pending evaluations
  Resolution: Switched to backup model

2024-03-15 14:00 — Agent performance: Accuracy dropped
  From: 86% → 84%
  Threshold: 85%
  Status: Warning alert issued

Agent Inventory — Register agents you’re monitoring
Configuration — Set monitoring thresholds and policies
Autonomy Levels — How autonomy affects monitoring needs

Next Steps

Review agent health — Are all agents performing within targets?
Investigate warnings — Address any metric drifts
Manual override — Correct any recent agent errors
Trend analysis — Look for patterns in performance
Adjust baselines — Update targets if reality changed