PagerDuty Integration
Route critical TruthVouch alerts to PagerDuty for incident management. Automatically create and resolve incidents, escalate to on-call engineers, and track response metrics.
Setup
1. Create PagerDuty Integration
In PagerDuty:
- Go to Services → Select service
- Integrations tab → New Integration
- Choose “Events API v2”
- Copy Integration Key
2. Configure TruthVouch
truthvouch config pagerduty \
  --integration-key xxxxx \
  --service-id pxxxxx
3. Test Connection
truthvouch config pagerduty --test
Alert Routing
Critical to Incidents
from truthvouch.integrations.pagerduty import PagerDutyHandler
handler = PagerDutyHandler(
    integration_key="your-integration-key"
)
# Route critical alerts to incidents
handler.configure_rule(
    name="Critical Hallucinations → Incident",
    trigger="confidence < 0.5",
    action={
        "type": "create_incident",
        "title": "CRITICAL: Hallucination detected",
        "urgency": "high",
        "service_id": "pxxxxx",
        "escalation_policy_id": "pxxxxx",
    },
)
Severity Mapping
# Map TruthVouch severity to PagerDuty urgency
severity_map = {
    "critical": "high",  # Page on-call
    "high": "low",       # Add to queue
    "medium": "low",     # Log only
    "low": "low",        # Log only
}
# Configure mapping
handler.configure_severity_map(severity_map)
Incident Creation
Create Incident
incident = handler.create_incident(
    title="Hallucination in production",
    description="Confidence: 5% - Earth is flat claim",
    service_id="pxxxxx",
    urgency="high",
    body={
        "type": "incident_body",
        "details": {
            "query": "Is the Earth flat?",
            "response": "Yes, the Earth is flat",
            "confidence": "5%",
            "category": "Science",
            "dashboard_link": "https://dash.truthvouch.com/alert/123",
        },
    },
)
print(f"Created incident: {incident['incident']['incident_number']}")
With Custom Fields
incident = handler.create_incident(
    title="Hallucination - Policy Violation",
    description="Auto-generated from TruthVouch",
    service_id="pxxxxx",
    urgency="high",
    client="api.integration.client:TruthVouch",
    details={
        "custom_field_1": "hallucination",
        "custom_field_2": "5%",
        "alert_id": "alert_123",
    },
)
Incident Lifecycle
Acknowledge and Resolve
# Get incident details
incident_id = "Q02JTUPZWHSN7Q"
# Acknowledge
handler.acknowledge_incident(
    incident_id=incident_id,
    user_id="user123",
)
# Resolve when issue is fixed
handler.resolve_incident(
    incident_id=incident_id,
    resolution_note="Issue fixed in production",
)
Escalation
# Escalate if not acknowledged in 30 minutes
handler.escalate_if_unacknowledged(
    incident_id=incident_id,
    timeout_minutes=30,
    escalation_policy_id="pxxxxx",
)
Bidirectional Sync
Incident Update Webhook
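Because a webhook endpoint is reachable from the public internet, it is worth checking that each request really came from PagerDuty before acting on it. The sketch below assumes PagerDuty's v3 webhook signing scheme, in which the `X-PagerDuty-Signature` header carries one or more comma-separated `v1=<hex HMAC-SHA256>` values computed over the raw request body with the subscription's signing secret; confirm the details against the current PagerDuty documentation. `is_valid_signature` is a hypothetical helper name and is not part of the TruthVouch SDK.

```python
import hashlib
import hmac


def is_valid_signature(secret: str, raw_body: bytes, signature_header: str) -> bool:
    """Return True if any v1 signature in the header matches the raw body.

    Assumption: the header looks like "v1=<hex>,v1=<hex>" and each value is an
    HMAC-SHA256 of the unmodified request body keyed with the signing secret.
    """
    digest = hmac.new(secret.encode("utf-8"), raw_body, hashlib.sha256).hexdigest()
    expected = "v1=" + digest
    candidates = [part.strip() for part in signature_header.split(",") if part.strip()]
    # compare_digest avoids leaking information through comparison timing.
    return any(hmac.compare_digest(expected, candidate) for candidate in candidates)
```

In a Flask handler such as the one that follows, this could be called with `request.get_data()` and `request.headers.get("X-PagerDuty-Signature", "")`, returning a 401 response when the check fails.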
from flask import Flask, request
from truthvouch.client import TruthVouchClient

app = Flask(__name__)
tv_client = TruthVouchClient(api_key="your-api-key")

@app.route("/pagerduty/webhook", methods=["POST"])
def handle_pd_event():
    """Handle PagerDuty incident updates."""
    event = request.get_json()
    # When incident is acknowledged
    if event["type"] == "incident.acknowledged":
        incident = event["data"]["incident"]
        alert_id = incident["body"]["details"]["alert_id"]
        # Update TruthVouch alert
        tv_client.alerts.acknowledge(alert_id=alert_id)
    # When incident is resolved
    elif event["type"] == "incident.resolved":
        incident = event["data"]["incident"]
        alert_id = incident["body"]["details"]["alert_id"]
        # Mark alert as resolved
        tv_client.alerts.resolve(alert_id=alert_id)
    return {"status": "ok"}, 200
On-Call Escalation
Auto-Escalate
# Escalate critical issues to on-call rotation
handler.configure_escalation(
    escalation_policy_id="pxxxxx",  # Your escalation policy
    levels=[
        {
            "level": 1,
            "timeout_minutes": 15,
            "description": "Wait 15 minutes, then escalate",
        },
        {
            "level": 2,
            "timeout_minutes": 30,
            "description": "Wait 30 minutes, then escalate manager",
        },
    ],
)
Notify Responder
# When responder is assigned
handler.notify_responder(
    incident_id="Q02JTUPZWHSN7Q",
    message="Critical hallucination detected. Review in TruthVouch dashboard.",
    link="https://dash.truthvouch.com/alerts/123",
)
Metrics and Reports
Get Incident Stats
# Get metrics for service
stats = handler.get_incident_stats(
    service_id="pxxxxx",
    start_date="2024-03-01",
    end_date="2024-03-15",
)
print(f"Total incidents: {stats['total']}")
print(f"Avg resolution time: {stats['avg_resolution_minutes']}m")
print(f"Incidents escalated: {stats['escalated']}")
Incident Report
# Generate monthly report
report = handler.generate_report(
    service_id="pxxxxx",
    period="monthly",
)
for incident in report["incidents"]:
    print(f"- {incident['incident_number']}: {incident['title']}")
    print(f"  Status: {incident['status']}")
    print(f"  Duration: {incident['duration_minutes']}m")
Best Practices
Incident Management
- Create incidents only for high/critical alerts
- Include dashboard link in incident details
- Use consistent severity mapping
- Define SLA targets for acknowledgment and resolution times
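The severity-related points above can be sketched as a single decision function that sits in front of incident creation. This is a minimal illustration in plain Python, not the TruthVouch SDK: `plan_action`, `URGENCY_BY_SEVERITY`, and `INCIDENT_SEVERITIES` are hypothetical names, and the mapping values simply reuse the severity map shown earlier in this guide.

```python
# Hypothetical constants: values mirror the severity map used in this guide.
URGENCY_BY_SEVERITY = {
    "critical": "high",
    "high": "low",
    "medium": "low",
    "low": "low",
}
# Only these severities should open a PagerDuty incident.
INCIDENT_SEVERITIES = {"critical", "high"}


def plan_action(severity: str, dashboard_link: str) -> dict:
    """Decide whether an alert should become an incident and with what urgency."""
    sev = severity.strip().lower()
    if sev not in URGENCY_BY_SEVERITY:
        raise ValueError(f"Unknown severity: {severity!r}")
    if sev not in INCIDENT_SEVERITIES:
        # Lower severities are recorded but do not create an incident.
        return {"create_incident": False, "severity": sev}
    return {
        "create_incident": True,
        "severity": sev,
        "urgency": URGENCY_BY_SEVERITY[sev],
        # Always carry the dashboard link so responders can jump to the alert.
        "details": {"dashboard_link": dashboard_link},
    }
```

Keeping the mapping in one place means every code path that creates incidents applies the same urgency rules.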
Escalation
- Set realistic escalation timeouts
- Test rotation before production
- Monitor escalation effectiveness
- Review and adjust policy quarterly
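When choosing timeouts, it can help to work out how long an unacknowledged incident takes to reach each level. The sketch below assumes each level's `timeout_minutes` is counted from the moment that level is notified (so the waits add up); if your policy measures timeouts differently, the arithmetic changes. The `levels` structure mirrors the one shown under Auto-Escalate, and `active_level` and `total_escalation_minutes` are hypothetical helper names.

```python
def active_level(levels, minutes_unacknowledged):
    """Return the level currently responsible, or None once all are exhausted.

    Assumption: timeouts are per level and cumulative, i.e. level 2 is reached
    only after level 1's full timeout has elapsed.
    """
    elapsed = 0
    for lvl in sorted(levels, key=lambda item: item["level"]):
        elapsed += lvl["timeout_minutes"]
        if minutes_unacknowledged < elapsed:
            return lvl["level"]
    return None


def total_escalation_minutes(levels):
    """Total time before every level in the policy has timed out."""
    return sum(lvl["timeout_minutes"] for lvl in levels)


levels = [
    {"level": 1, "timeout_minutes": 15},
    {"level": 2, "timeout_minutes": 30},
]
print(active_level(levels, 20))          # level handling the incident at 20 minutes
print(total_escalation_minutes(levels))  # minutes until the policy is exhausted
```

Comparing the total against your acknowledgment target is a quick way to spot a policy that escalates too slowly.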
Metrics
- Track MTTR (mean time to resolution)
- Monitor alert vs incident ratio
- Review false alert rates
- Plan capacity based on incident volume
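As a rough illustration of the first two metrics, the sketch below computes MTTR and the alert-to-incident ratio from raw records. The field names `created_at` and `resolved_at` and the ISO 8601 timestamp format are assumptions made for this example; adapt them to whatever your incident export actually contains.

```python
from datetime import datetime


def mttr_minutes(incidents):
    """Mean time to resolution in minutes, ignoring unresolved incidents."""
    durations = []
    for incident in incidents:
        if not incident.get("resolved_at"):
            continue  # still open, so it has no resolution time yet
        created = datetime.fromisoformat(incident["created_at"])
        resolved = datetime.fromisoformat(incident["resolved_at"])
        durations.append((resolved - created).total_seconds() / 60)
    if not durations:
        return None
    return sum(durations) / len(durations)


def alert_to_incident_ratio(alert_count, incident_count):
    """Number of alerts raised per incident created, or None with no incidents."""
    if incident_count == 0:
        return None
    return alert_count / incident_count
```

A ratio that drifts toward 1 suggests nearly every alert is paging someone, which is usually a sign that routing rules are too broad.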
Troubleshooting
Q: Incidents are not being created
- Verify integration key is correct
- Check service_id exists
- Test with curl first
- Check request format
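To rule out the TruthVouch side, the checks above can be reproduced by sending a test event straight to PagerDuty. The following is a Python equivalent of the curl test, a minimal sketch assuming the Events API v2 endpoint `https://events.pagerduty.com/v2/enqueue` and its `routing_key`, `event_action`, and `payload` fields. The function names are hypothetical, and the integration key placeholder must be replaced with your own.

```python
import json
import urllib.request

EVENTS_API_URL = "https://events.pagerduty.com/v2/enqueue"
ALLOWED_SEVERITIES = {"critical", "error", "warning", "info"}


def build_trigger_event(routing_key, summary, source, severity="critical"):
    """Build a trigger event and check the fields the Events API requires."""
    if not routing_key:
        raise ValueError("routing_key (the integration key) is required")
    if severity not in ALLOWED_SEVERITIES:
        raise ValueError(f"severity must be one of {sorted(ALLOWED_SEVERITIES)}")
    return {
        "routing_key": routing_key,
        "event_action": "trigger",
        "payload": {"summary": summary, "source": source, "severity": severity},
    }


def build_request(event):
    """Wrap the event in a JSON POST request without sending it."""
    return urllib.request.Request(
        EVENTS_API_URL,
        data=json.dumps(event).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send_event(event, timeout=10.0):
    """Send the event and return (HTTP status, response body)."""
    with urllib.request.urlopen(build_request(event), timeout=timeout) as resp:
        return resp.status, resp.read().decode("utf-8")


if __name__ == "__main__":
    test_event = build_trigger_event(
        "your-integration-key", "TruthVouch test event", "truthvouch-test"
    )
    print(send_event(test_event))
```

An accepted event should come back with HTTP 202. A 4xx response (raised here as `urllib.error.HTTPError`) usually points to a wrong integration key or a malformed payload.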
Q: Escalations not triggering
- Verify escalation policy is active
- Check user is in rotation
- Test with manual escalation first
- Review timeout settings
Q: Webhook delivery failing
- Verify webhook URL is correct
- Check that PagerDuty can reach your webhook endpoint (firewall, DNS, TLS)
- Implement retry logic
- Add comprehensive logging
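The last two points can be combined in a small retry wrapper with exponential backoff and logging. This is a generic sketch rather than anything provided by TruthVouch or PagerDuty: `retry_with_backoff` is a hypothetical helper, and the default attempt count and delays are arbitrary starting values.

```python
import logging
import time

logger = logging.getLogger("pagerduty_sync")


def retry_with_backoff(operation, max_attempts=4, base_delay=0.5, sleep=time.sleep):
    """Call operation(), retrying on any exception with doubling delays.

    The final failure is re-raised so the caller can decide how to handle it.
    """
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except Exception as exc:
            if attempt == max_attempts:
                logger.error("Giving up after %d attempts: %s", attempt, exc)
                raise
            delay = base_delay * 2 ** (attempt - 1)
            logger.warning(
                "Attempt %d failed (%s); retrying in %.1fs", attempt, exc, delay
            )
            sleep(delay)
```

A call that updates an alert from a webhook handler could then be wrapped as `retry_with_backoff(lambda: update_alert(alert_id))`, where `update_alert` stands in for whichever client call you use. In production you would likely narrow the caught exception types to transient network errors.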
Next Steps
- Review Jira Integration
- Explore Incident Management
- Check Alert Channels