Monitoring & Overrides
The TruthOps monitoring dashboard shows the real-time health of every agent. TruthVouch continuously tracks performance metrics, generates alerts, and lets you manually override agent decisions when needed.

Dashboard Overview
Top-Level Metrics
Agent Health Summary (at a glance):
- Total Agents: 12
- Healthy (Green): 10
- Warning (Yellow): 1 (Customer Support accuracy drifting)
- Critical (Red): 1 (Resume Screener API failing)
- Overall System Health: 96.2%
- Target: ≥99%
- Trend: ↓ (was 97.1% yesterday)
Key Metrics by Agent
View all agents in a table:
| Agent Name | Risk | Status | Accuracy | Latency | Uptime | Cost | Alerts |
|---|---|---|---|---|---|---|---|
| Customer Support | HIGH | ⚠ Warning | 84% ↓ | 1.8s ✓ | 99.9% ✓ | $4.8K | 1 |
| Resume Screener | CRITICAL | 🔴 Critical | 91% ✓ | — | 0% | $2.1K | 3 |
| Content Moderation | HIGH | ✓ Healthy | 77% ✓ | 0.3s ✓ | 99.95% ✓ | $1.2K | 0 |
| Data Classifier | MEDIUM | ✓ Healthy | 96% ✓ | 0.5s ✓ | 99.8% ✓ | $0.9K | 0 |
Click any agent to see its detailed metrics.
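The summary counts at the top of the dashboard are a roll-up of per-agent status. A minimal sketch of that roll-up, assuming each agent is a simple record with a status string; the `Agent` type and `summarize` helper are hypothetical names for illustration, not TruthVouch APIs, and only the four rows shown in the table are included:

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class Agent:
    """Hypothetical record mirroring a subset of the table columns."""
    name: str
    risk: str
    status: str  # "Healthy", "Warning", or "Critical"


def summarize(agents):
    """Count agents per status, always reporting all three buckets."""
    counts = Counter(agent.status for agent in agents)
    return {s: counts.get(s, 0) for s in ("Healthy", "Warning", "Critical")}


agents = [
    Agent("Customer Support", "HIGH", "Warning"),
    Agent("Resume Screener", "CRITICAL", "Critical"),
    Agent("Content Moderation", "HIGH", "Healthy"),
    Agent("Data Classifier", "MEDIUM", "Healthy"),
]
summary = summarize(agents)
print(summary)
```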
Drill-Down: Individual Agent
Click “Customer Support” for a detailed view:
Health Status
- Status: ⚠ Warning (accuracy drifting below target)
- Last update: 15 seconds ago
- Incident history: No recent incidents
- Estimated time to Critical: 24 hours (if trend continues)
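An estimate like "24 hours (if trend continues)" can be produced by linear extrapolation. The sketch below assumes the metric keeps falling at its most recent daily rate and that Critical begins at 82% accuracy; that threshold is an assumption chosen for illustration, since the page does not state it, and `hours_until_threshold` is a hypothetical helper.

```python
def hours_until_threshold(current, threshold, drop_per_day):
    """Linearly extrapolate how long until `current` falls to `threshold`.

    Returns 0.0 if the metric is already at or below the threshold, and
    None if the metric is flat or improving (no crossing is projected).
    """
    if current <= threshold:
        return 0.0
    if drop_per_day <= 0:
        return None
    return (current - threshold) / drop_per_day * 24.0


# Accuracy is 84% and fell 2 points since yesterday.
# The 82% Critical line is an assumed value, not taken from the product.
eta = hours_until_threshold(current=84.0, threshold=82.0, drop_per_day=2.0)
print(eta)
```

A straight-line projection is the simplest possible model; a real trend may flatten or accelerate, so the estimate is best read as a rough warning horizon.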
Performance Metrics
Accuracy
- Current: 84%
- Target: ≥85%
- Trend: ↓ 2% (was 86% yesterday)
- Status: Below target ⚠
- Root cause: Possibly outdated product knowledge (new features released 3 days ago)
- Recommendation: Update knowledge base
Latency (p95)
- Current: 1.8 seconds
- Target: <2 seconds
- Trend: → Stable
- Status: Healthy ✓
Availability
- Current: 99.9%
- Target: ≥99%
- Uptime: 24h: 99.9%, 7d: 99.85%, 30d: 99.8%
- Last incident: 2 days ago (5 min outage)
- Status: Healthy ✓
Cost
- Daily average: $160
- Monthly forecast: $4,800
- Budget: $5,000
- Status: On track ✓
Detailed Logs
Recent decisions (sampling):
Decision 1: 2024-03-15 14:35
- Query: "What's your refund policy?"
- AI Response: "We offer 30-day refunds"
- Truth Nugget: "We offer 14-day refunds"
- Hallucination: YES
- Confidence: 87%
- Override needed: YES
Decision 2: 2024-03-15 14:32
- Query: "Do you support HIPAA?"
- AI Response: "No, we don't support HIPAA"
- Truth Nugget: "We're HIPAA-ready (BAA available)"
- Hallucination: YES
- Confidence: 92%
- Override needed: YES
Manual Overrides
When an agent makes a wrong decision, manually override it:
Override an Agent Decision
- Locate the decision in the logs (see the examples above)
- Click the decision → “Override”
- Provide the correct answer (what should the agent have said?)
- Give a reason (why override? e.g., “Wrong refund policy”, “Outdated knowledge”)
- Submit (the override is logged in the audit trail)
Impact:
- Immediate correction sent to user/system
- Feedback logged (helps improve agent)
- Knowledge base can be updated (prevent future errors)
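The override steps map naturally onto a small record that is validated and appended to an audit trail. This is a sketch under assumptions: the `Override` fields and the `submit_override` function are hypothetical names chosen to mirror the form fields above, not the product's actual schema or API.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass
class Override:
    """Hypothetical audit record for one manual override."""
    decision_id: str
    corrected_answer: str
    reason: str
    submitted_by: str
    submitted_at: datetime = field(
        default_factory=lambda: datetime.now(timezone.utc)
    )


def submit_override(audit_trail, decision_id, corrected_answer, reason, submitted_by):
    """Validate the required fields, then append the override to the audit trail."""
    if not corrected_answer.strip():
        raise ValueError("a corrected answer is required")
    if not reason.strip():
        raise ValueError("a reason is required")
    record = Override(decision_id, corrected_answer, reason, submitted_by)
    audit_trail.append(record)
    return record


audit_trail = []
record = submit_override(
    audit_trail,
    decision_id="dec_12345",
    corrected_answer="We offer 14-day refunds",
    reason="Wrong refund policy",
    submitted_by="John Smith",
)
print(record.decision_id, record.reason)
```

Requiring both a correction and a reason keeps every audit entry self-explanatory when it is reviewed later.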
Batch Overrides
If multiple decisions are wrong (e.g., they share the same misconception):
- Select multiple decisions (checkbox)
- Bulk Actions → Override & Correct
- Provide correction (once, applies to all selected)
- Update knowledge base (optional; prevents future errors)
Example:
- 5 customer service decisions incorrectly said "30-day refund"
- Select all 5 decisions
- Bulk override: "Correct answer is 14-day refund"
- Auto-update knowledge base with correct refund policy
- Future queries won't have this error
Incident Management
When alerts trigger:
Critical Alert Example
🔴 CRITICAL: Resume Screener - API Failure
- Alert triggered: 2024-03-15 14:20
- Alert type: Availability (0% uptime for 5 min)
- Status: Ongoing
- Affected agent: Resume Screener (CRITICAL risk level)
What's happening:
- OpenAI API rate limit exceeded
- Agent unable to evaluate candidates
- 47 pending evaluations stuck
Recommendations:
1. Check OpenAI API quota
2. Reduce concurrent evaluations (rate limiting)
3. Switch to backup model (Claude) if available
4. Contact OpenAI support
Actions taken: [Disable Agent] [Switch Model] [Page On-Call] [Acknowledge Alert]
Disable Agent
If an agent is broken and needs recovery:
- Click Disable Agent
- Agent stops processing new requests
- Pending requests queued or escalated to human
- Status updates to “Paused - Critical Issue”
- Escalation notified
Re-enable when fixed:
- Fix underlying issue (API quota restored, etc.)
- Click Re-enable Agent
- Agent resumes processing
- Queued requests processed in order
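The disable and re-enable behavior described above amounts to a gate with a first-in, first-out queue. The sketch below is illustrative only: `AgentGate` is a hypothetical class, it covers the queuing path but not escalation to a human, and the status strings are borrowed from this page rather than from any real API.

```python
from collections import deque


class AgentGate:
    """Hypothetical wrapper that queues requests while an agent is disabled."""

    def __init__(self, handler):
        self.handler = handler  # callable that processes one request
        self.status = "Active"
        self.queue = deque()    # FIFO: oldest pending request first

    def disable(self):
        """Stop processing; new requests are held instead of handled."""
        self.status = "Paused - Critical Issue"

    def submit(self, request):
        """Process immediately when active; otherwise hold the request."""
        if self.status != "Active":
            self.queue.append(request)
            return None
        return self.handler(request)

    def re_enable(self):
        """Resume processing and drain the queue in arrival order."""
        self.status = "Active"
        results = []
        while self.queue:
            results.append(self.handler(self.queue.popleft()))
        return results


gate = AgentGate(handler=lambda candidate: f"evaluated {candidate}")
gate.disable()
gate.submit("candidate-1")
gate.submit("candidate-2")
drained = gate.re_enable()
print(drained)
```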
Trend Analysis
Performance Over Time
View trends for any metric:
Accuracy Trend (Last 30 Days)
- Day 1-7: 87% (stable)
- Day 8-14: 85% (dropped 2%)
- Day 15-21: 83% (dropped 2% more)
- Day 22-30: 82% (still declining)
- Trend: ↓ Sharp decline starting day 8
- Root cause: Product knowledge outdated after launch of 3 new features
- Recommendation: Update knowledge base with new feature descriptions
Alert Trend
Alerts Triggered (Last 30 Days)
- Week 1: 0 alerts (healthy)
- Week 2: 2 alerts (accuracy warnings)
- Week 3: 5 alerts (accuracy critical + latency warnings)
- Week 4: 8 alerts (cascading issues)
- Pattern: Alerts increasing, suggesting systemic issue
- Most common: Accuracy below threshold
- Recommendation: Root cause analysis needed
Performance Baselines
Set Baseline
The first time an agent is deployed, establish baseline metrics:
Set Baseline for Customer Support Agent:
- Week 1: Monitor only (no alerting)
- Establish normal ranges:
  - Accuracy: 85-89%
  - Latency: 1.5-2.2s
  - Cost: $140-$180/day
- After Week 1: Set targets based on baseline
- Enable alerting at ±10% of baseline
Seasonal Adjustment
Some metrics vary seasonally:
Cost baseline varies by day of week:
- Mon-Fri: $160/day (business hours heavy)
- Sat-Sun: $80/day (weekend light)
Accuracy baseline varies by season:
- Q4 (holiday): More queries; similar accuracy
- Q2-Q3: Stable; use this as "normal"
Configure alerts to account for seasonality.
Real-Time Monitoring
Live Updates
The dashboard updates in real time or at configurable intervals:
- Real-time (1 sec): Throughput, error alerts
- 1-minute: Latency, accuracy sampling
- 5-minute: Uptime, cost tracking
- 1-hour: Trend analysis, forecasts
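One way to picture these tiers is as a table of refresh intervals consulted on every tick. This is a sketch, not the dashboard's actual scheduler: the metric keys and the `due_metrics` helper are hypothetical, and only the interval values come from the list above.

```python
# Refresh cadence per metric group, in seconds (values from the list above;
# the key names are illustrative).
REFRESH_SECONDS = {
    "throughput": 1,
    "error_alerts": 1,
    "latency": 60,
    "accuracy_sampling": 60,
    "uptime": 300,
    "cost_tracking": 300,
    "trend_analysis": 3600,
    "forecasts": 3600,
}


def due_metrics(elapsed_seconds):
    """Return the metric groups whose interval divides the elapsed time."""
    return sorted(
        name
        for name, interval in REFRESH_SECONDS.items()
        if elapsed_seconds % interval == 0
    )


print(due_metrics(60))
```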
Streaming Alerts
Alerts stream to Slack, email, or PagerDuty, with delivery speed set by severity:
- Critical alerts: Instant (page on-call)
- High alerts: Within 1 min (notify owner)
- Medium alerts: Batch hourly (daily digest)
- Low alerts: Daily digest (weekly summary)
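The severity tiers can be expressed as a lookup from severity to delivery rule. The sketch below assumes a maximum delivery delay per tier derived from the list above; `Route`, `ROUTES`, and `route_alert` are hypothetical names, and no real Slack, email, or PagerDuty call is made.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Route:
    max_delay_seconds: int  # longest an alert should wait before delivery
    action: str


# Delays follow the list above; the action labels are illustrative.
ROUTES = {
    "critical": Route(0, "page on-call"),
    "high": Route(60, "notify owner"),
    "medium": Route(3600, "hourly batch"),
    "low": Route(86400, "daily digest"),
}


def route_alert(severity):
    """Look up the delivery rule for a severity, case-insensitively."""
    try:
        return ROUTES[severity.lower()]
    except KeyError:
        raise ValueError(f"unknown severity: {severity!r}") from None


print(route_alert("Critical"))
```

Rejecting unknown severities outright, rather than defaulting to the lowest tier, avoids silently delaying an alert whose severity label was mistyped.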
Reports & Exports
Generate Reports
Create compliance/audit reports:
Export Monthly Report (March 2024):
- Agent performance summary
- Uptime/SLA metrics
- Accuracy metrics + trends
- Cost analysis
- Incidents & resolutions
- Override/manual correction frequency
- Recommendations for improvements
Format: PDF (printable), Excel (for analysis)
Audit Trail
Complete audit trail of all actions:
2024-03-15 14:35 — Manual override: John Smith
- Decision ID: dec_12345
- Correction: "Refund policy is 14 days (was 30)"
- Reason: "Product knowledge outdated"
2024-03-15 14:20 — Alert: Resume Screener critical
- Type: API failure
- Duration: 5 minutes
- Impact: 47 pending evaluations
- Resolution: Switched to backup model
2024-03-15 14:00 — Agent performance: Accuracy dropped
- From: 86% → 84%
- Threshold: 85%
- Status: Warning alert issued
Related Topics
- Agent Inventory — Register agents you’re monitoring
- Configuration — Set monitoring thresholds and policies
- Autonomy Levels — How autonomy affects monitoring needs
Next Steps
- Review agent health — Are all agents performing within targets?
- Investigate warnings — Address any metric drifts
- Manual override — Correct any recent agent errors
- Trend analysis — Look for patterns in performance
- Adjust baselines — Update targets if reality changed
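When adjusting baselines, it can help to see what the ±10% alerting rule from Performance Baselines implies in numbers. The sketch below assumes the baseline is a single value per period and reuses the weekday and weekend cost figures from the Seasonal Adjustment example; `alert_band` and `out_of_band` are hypothetical helpers, not product functions.

```python
def alert_band(baseline, tolerance=0.10):
    """Return the (low, high) range within `tolerance` of the baseline."""
    return baseline * (1 - tolerance), baseline * (1 + tolerance)


def out_of_band(value, baseline, tolerance=0.10):
    """True when `value` falls outside the alerting band around the baseline."""
    low, high = alert_band(baseline, tolerance)
    return value < low or value > high


# Daily cost baselines from the Seasonal Adjustment example.
COST_BASELINE = {"weekday": 160.0, "weekend": 80.0}

# The same $150 day is normal on a weekday but anomalous on a weekend.
print(out_of_band(150.0, COST_BASELINE["weekday"]))
print(out_of_band(150.0, COST_BASELINE["weekend"]))
```

Keeping separate baselines per period is what prevents a normal weekday from triggering a cost alert calibrated against weekend traffic, and the reverse.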