Skip to content

Monitoring & Overrides

The TruthOps monitoring dashboard automatically shows real-time health of all agents. TruthVouch continuously tracks performance metrics, generates alerts, and lets you manually override agent decisions when needed.

SLA & Alerts

Dashboard Overview

Top-Level Metrics

Agent Health Summary (at a glance):

Total Agents: 12
Healthy (Green): 10
Warning (Yellow): 1 (Customer Service accuracy drifting)
Critical (Red): 1 (Resume Screener API failing)
Overall System Health: 96.2%
Target: ≥99%
Trend: ↓ (was 97.1% yesterday)

Key Metrics by Agent

View all agents in a table:

Agent NameRiskStatusAccuracyLatencyUptimeCostAlerts
Customer SupportHIGH⚠ Warning84% ↓1.8s ✓99.9% ✓$4.8K1
Resume ScreenerCRITICAL🔴 Critical91% ✓0%$2.1K3
Content ModerationHIGH✓ Healthy77% ✓0.3s ✓99.95% ✓$1.2K0
Data ClassifierMEDIUM✓ Healthy96% ✓0.5s ✓99.8% ✓$0.9K0

Click any agent to see detailed metrics

Drill-Down: Individual Agent

Click “Customer Support” for detailed view:

Health Status

  • Status: ⚠ Warning (accuracy drifting below target)
  • Last update: 15 seconds ago
  • Incident history: No recent incidents
  • Estimated time to Critical: 24 hours (if trend continues)

Performance Metrics

Accuracy

Current: 84%
Target: ≥85%
Trend: ↓ 2% (was 86% yesterday)
Status: Below target ⚠
Root cause: Possibly outdated product knowledge (new features released 3 days ago)
Recommendation: Update knowledge base

Latency (p95)

Current: 1.8 seconds
Target: <2 seconds
Trend: → Stable
Status: Healthy ✓

Availability

Current: 99.9%
Target: ≥99%
Uptime: 24h: 99.9%, 7d: 99.85%, 30d: 99.8%
Last incident: 2 days ago (5 min outage)
Status: Healthy ✓

Cost

Daily average: $160
Monthly forecast: $4,800
Budget: $5,000
Status: On track ✓

Detailed Logs

Recent decisions (sampling):

Decision 1: 2024-03-15 14:35
Query: "What's your refund policy?"
AI Response: "We offer 30-day refunds"
Truth Nugget: "We offer 14-day refunds"
Hallucination: YES
Confidence: 87%
Override needed: YES
Decision 2: 2024-03-15 14:32
Query: "Do you support HIPAA?"
AI Response: "No, we don't support HIPAA"
Truth Nugget: "We're HIPAA-ready (BAA available)"
Hallucination: YES
Confidence: 92%
Override needed: YES

Manual Overrides

When an agent makes a wrong decision, manually override it:

Override an Agent Decision

  1. Locate decision in logs (see above examples)
  2. Click decision → “Override”
  3. Provide correct answer (what should agent have said?)
  4. Reason (why override? e.g., “Wrong refund policy”, “Outdated knowledge”)
  5. Submit (logged in audit trail)

Impact:

  • Immediate correction sent to user/system
  • Feedback logged (helps improve agent)
  • Knowledge base can be updated (prevent future errors)

Batch Overrides

If multiple decisions wrong (e.g., same misconception):

  1. Select multiple decisions (checkbox)
  2. Bulk ActionsOverride & Correct
  3. Provide correction (once, applies to all selected)
  4. Update knowledge base (optional; prevents future errors)

Example:

5 customer service decisions incorrectly said "30-day refund"
Select all 5 decisions
Bulk override: "Correct answer is 14-day refund"
Auto-update knowledge base with correct refund policy
Future queries won't have this error

Incident Management

When alerts trigger:

Critical Alert Example

🔴 CRITICAL: Resume Screener - API Failure
Alert triggered: 2024-03-15 14:20
Alert type: Availability (0% uptime for 5 min)
Status: Ongoing
Affected agent: Resume Screener (CRITICAL risk level)
What's happening:
- OpenAI API rate limit exceeded
- Agent unable to evaluate candidates
- 47 pending evaluations stuck
Recommendations:
1. Check OpenAI API quota
2. Reduce concurrent evaluations (rate limiting)
3. Switch to backup model (Claude) if available
4. Contact OpenAI support
Actions taken:
[Disable Agent] [Switch Model] [Page On-Call] [Acknowledge Alert]

Disable Agent

If agent broken and needs recovery:

  1. Click Disable Agent
  2. Agent stops processing new requests
  3. Pending requests queued or escalated to human
  4. Status updates to “Paused - Critical Issue”
  5. Escalation notified

Re-enable when fixed:

  1. Fix underlying issue (API quota restored, etc.)
  2. Click Re-enable Agent
  3. Agent resumes processing
  4. Queued requests processed in order

Trend Analysis

Performance Over Time

View trends for any metric:

Accuracy Trend (Last 30 Days)
─────────────────────────────
Day 1-7: 87% (stable)
Day 8-14: 85% (dropped 2%)
Day 15-21: 83% (dropped 2% more)
Day 22-30: 82% (still declining)
Trend: ↓ Sharp decline starting day 8
Root cause: Product knowledge outdated after launch of 3 new features
Recommendation: Update knowledge base with new feature descriptions

Alert Trend

Alerts Triggered (Last 30 Days)
──────────────────────────────
Week 1: 0 alerts (healthy)
Week 2: 2 alerts (accuracy warnings)
Week 3: 5 alerts (accuracy critical + latency warnings)
Week 4: 8 alerts (cascading issues)
Pattern: Alerts increasing, suggesting systemic issue
Most common: Accuracy below threshold
Recommendation: Root cause analysis needed

Performance Baselines

Set Baseline

First time agent deployed, establish baseline metrics:

Set Baseline for Customer Service Agent:
Week 1: Monitor only (no alerting)
Establish normal ranges:
Accuracy: 85-89%
Latency: 1.5-2.2s
Cost: $140-$180/day
After Week 1:
Set targets based on baseline
Enable alerting at ±10% of baseline

Seasonal Adjustment

Some metrics vary seasonally:

Cost baseline varies by day of week:
Mon-Fri: $160/day (business hours heavy)
Sat-Sun: $80/day (weekend light)
Accuracy baseline varies by season:
Q4 (holiday): More queries; similar accuracy
Q2-Q3: Stable; use this as "normal"
Configure alerts to account for seasonality

Real-Time Monitoring

Live Updates

Dashboard updates in real-time (or configurable intervals):

  • Real-time (1 sec): Throughput, error alerts
  • 1-minute: Latency, accuracy sampling
  • 5-minute: Uptime, cost tracking
  • 1-hour: Trend analysis, forecasts

Streaming Alerts

Important alerts stream to Slack/email/PagerDuty immediately:

  • Critical alerts: Instant (page on-call)
  • High alerts: Within 1 min (notify owner)
  • Medium alerts: Batch hourly (daily digest)
  • Low alerts: Daily digest (weekly summary)

Reports & Exports

Generate Reports

Create compliance/audit reports:

Export Monthly Report (March 2024):
Agent performance summary
Uptime/SLA metrics
Accuracy metrics + trends
Cost analysis
Incidents & resolutions
Override/manual correction frequency
Recommendations for improvements
Format: PDF (printable), Excel (for analysis)

Audit Trail

Complete audit trail of all actions:

2024-03-15 14:35 — Manual override: John Smith
Decision ID: dec_12345
Correction: "Refund policy is 14 days (was 30)"
Reason: "Product knowledge outdated"
2024-03-15 14:20 — Alert: Resume Screener critical
Type: API failure
Duration: 5 minutes
Impact: 47 pending evaluations
Resolution: Switched to backup model
2024-03-15 14:00 — Agent performance: Accuracy dropped
From: 86% → 84%
Threshold: 85%
Status: Warning alert issued

Next Steps

  1. Review agent health — Are all agents performing within targets?
  2. Investigate warnings — Address any metric drifts
  3. Manual override — Correct any recent agent errors
  4. Trend analysis — Look for patterns in performance
  5. Adjust baselines — Update targets if reality changed