
PagerDuty Integration

Route critical TruthVouch alerts to PagerDuty for incident management. Automatically create and resolve incidents, escalate to on-call engineers, and track response metrics.

Setup

1. Create PagerDuty Integration

In PagerDuty:

  1. Go to Services → Select service
  2. Integrations tab → New Integration
  3. Choose “Events API v2”
  4. Copy Integration Key

2. Configure TruthVouch

truthvouch config pagerduty \
  --integration-key xxxxx \
  --service-id pxxxxx

3. Test Connection

truthvouch config pagerduty --test

Alert Routing

Critical to Incidents

from truthvouch.integrations.pagerduty import PagerDutyHandler

handler = PagerDutyHandler(
    integration_key="your-integration-key"
)

# Route critical alerts to incidents
handler.configure_rule(
    name="Critical Hallucinations → Incident",
    trigger="confidence < 0.5",
    action={
        "type": "create_incident",
        "title": "CRITICAL: Hallucination detected",
        "urgency": "high",
        "service_id": "pxxxxx",
        "escalation_policy_id": "pxxxxx"
    }
)

Severity Mapping

# Map TruthVouch severity to PagerDuty urgency
severity_map = {
    "critical": "high",  # Page on-call
    "high": "low",       # Add to queue
    "medium": "low",     # Log only
    "low": "low"         # Log only
}

# Configure mapping
handler.configure_severity_map(severity_map)
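
The lookup behind a mapping like the one above can be sketched in a few lines. This is a minimal illustration, not the TruthVouch implementation: `urgency_for` is a hypothetical helper, and falling back to low urgency for unknown labels is an assumption.

```python
# Hypothetical helper for illustration; not part of the TruthVouch SDK.
SEVERITY_MAP = {
    "critical": "high",
    "high": "low",
    "medium": "low",
    "low": "low",
}


def urgency_for(severity: str, default: str = "low") -> str:
    """Return the PagerDuty urgency for a TruthVouch severity label.

    Labels are normalized (trimmed, lowercased); unknown labels fall back
    to `default` so an unexpected value never pages on-call by accident.
    """
    return SEVERITY_MAP.get(severity.strip().lower(), default)


print(urgency_for("critical"))  # high
print(urgency_for(" Medium "))  # low
print(urgency_for("unknown"))   # low (fallback)
```

Defaulting to low keeps a typo in a severity label from paging anyone; teams that prefer to fail loudly could raise an error instead.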

Incident Creation

Create Incident

incident = handler.create_incident(
    title="Hallucination in production",
    description="Confidence: 5% - Earth is flat claim",
    service_id="pxxxxx",
    urgency="high",
    body={
        "type": "incident_body",
        "details": {
            "query": "Is the Earth flat?",
            "response": "Yes, the Earth is flat",
            "confidence": "5%",
            "category": "Science",
            "dashboard_link": "https://dash.truthvouch.com/alert/123"
        }
    }
)
print(f"Created incident: {incident['incident']['incident_number']}")

With Custom Fields

incident = handler.create_incident(
    title="Hallucination - Policy Violation",
    description="Auto-generated from TruthVouch",
    service_id="pxxxxx",
    urgency="high",
    client="api.integration.client:TruthVouch",
    details={
        "custom_field_1": "hallucination",
        "custom_field_2": "5%",
        "alert_id": "alert_123"
    }
)

Incident Lifecycle

Acknowledge and Resolve

# ID of the PagerDuty incident to update
incident_id = "Q02JTUPZWHSN7Q"

# Acknowledge
handler.acknowledge_incident(
    incident_id=incident_id,
    user_id="user123"
)

# Resolve when issue is fixed
handler.resolve_incident(
    incident_id=incident_id,
    resolution_note="Issue fixed in production"
)

Escalation

# Escalate if not acknowledged in 30 minutes
handler.escalate_if_unacknowledged(
    incident_id=incident_id,
    timeout_minutes=30,
    escalation_policy_id="pxxxxx"
)
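
The timeout check itself is simple to reason about in isolation. The sketch below shows one way the "unacknowledged for N minutes" decision could be expressed; `should_escalate` is a hypothetical function, and how TruthVouch actually evaluates the timeout is not documented here.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional


def should_escalate(
    triggered_at: datetime,
    acknowledged_at: Optional[datetime],
    now: datetime,
    timeout_minutes: int = 30,
) -> bool:
    """Return True when an incident is still unacknowledged after the timeout."""
    if acknowledged_at is not None:
        # Someone has already picked it up, so no escalation is needed.
        return False
    return now - triggered_at >= timedelta(minutes=timeout_minutes)


# Illustrative timestamps only.
triggered = datetime(2024, 3, 1, 12, 0, tzinfo=timezone.utc)
print(should_escalate(triggered, None, triggered + timedelta(minutes=10)))  # False
print(should_escalate(triggered, None, triggered + timedelta(minutes=30)))  # True
```

Passing `now` explicitly rather than calling `datetime.now()` inside the function keeps the check easy to unit test.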

Bidirectional Sync

Incident Update Webhook

from flask import Flask, request
from truthvouch.client import TruthVouchClient

app = Flask(__name__)
tv_client = TruthVouchClient(api_key="your-api-key")


@app.route("/pagerduty/webhook", methods=["POST"])
def handle_pd_event():
    """Handle PagerDuty incident updates."""
    event = request.get_json()

    # When incident is acknowledged
    if event["type"] == "incident.acknowledged":
        incident = event["data"]["incident"]
        alert_id = incident["body"]["details"]["alert_id"]
        # Update TruthVouch alert
        tv_client.alerts.acknowledge(alert_id=alert_id)

    # When incident is resolved
    elif event["type"] == "incident.resolved":
        incident = event["data"]["incident"]
        alert_id = incident["body"]["details"]["alert_id"]
        # Mark alert as resolved
        tv_client.alerts.resolve(alert_id=alert_id)

    return {"status": "ok"}, 200
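
A public webhook endpoint should confirm that each request really comes from PagerDuty before acting on it. The sketch below assumes PagerDuty's v3 webhook signing scheme: an HMAC-SHA256 of the raw request body, keyed with the subscription's signing secret and sent in the `X-PagerDuty-Signature` header as one or more comma-separated `v1=<hex digest>` values. Confirm the header name and format against PagerDuty's webhook documentation for your subscription; `is_valid_signature` is a hypothetical helper.

```python
import hashlib
import hmac


def is_valid_signature(secret: str, raw_body: bytes, signature_header: str) -> bool:
    """Check a webhook body against the signatures in the request header.

    Assumes the header holds comma-separated values of the form
    "v1=<hex HMAC-SHA256 of the raw body>".
    """
    digest = hmac.new(secret.encode("utf-8"), raw_body, hashlib.sha256).hexdigest()
    expected = f"v1={digest}".encode("utf-8")
    candidates = [part.strip().encode("utf-8") for part in signature_header.split(",")]
    # compare_digest is a constant-time comparison, which avoids timing leaks.
    return any(hmac.compare_digest(expected, candidate) for candidate in candidates)


# Self-check with a made-up secret and body.
secret = "example-signing-secret"
body = b'{"event": {"event_type": "incident.acknowledged"}}'
good = "v1=" + hmac.new(secret.encode("utf-8"), body, hashlib.sha256).hexdigest()
print(is_valid_signature(secret, body, good))           # True
print(is_valid_signature(secret, body, "v1=deadbeef"))  # False
```

In the Flask handler, `request.get_data()` returns the raw body and `request.headers.get("X-PagerDuty-Signature", "")` the header value; returning a 401 when the check fails keeps forged events from touching TruthVouch alerts.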

On-Call Escalation

Auto-Escalate

# Escalate critical issues to on-call rotation
handler.configure_escalation(
    escalation_policy_id="pxxxxx",  # Your escalation policy
    levels=[
        {
            "level": 1,
            "timeout_minutes": 15,
            "description": "Wait 15 minutes, then escalate"
        },
        {
            "level": 2,
            "timeout_minutes": 30,
            "description": "Wait 30 minutes, then escalate to manager"
        }
    ]
)

Notify Responder

# When responder is assigned
handler.notify_responder(
    incident_id="Q02JTUPZWHSN7Q",
    message="Critical hallucination detected. Review in TruthVouch dashboard.",
    link="https://dash.truthvouch.com/alerts/123"
)

Metrics and Reports

Get Incident Stats

# Get metrics for service
stats = handler.get_incident_stats(
    service_id="pxxxxx",
    start_date="2024-03-01",
    end_date="2024-03-15"
)
print(f"Total incidents: {stats['total']}")
print(f"Avg resolution time: {stats['avg_resolution_minutes']}m")
print(f"Incidents escalated: {stats['escalated']}")

Incident Report

# Generate monthly report
report = handler.generate_report(
    service_id="pxxxxx",
    period="monthly"
)
for incident in report["incidents"]:
    print(f"- {incident['incident_number']}: {incident['title']}")
    print(f"  Status: {incident['status']}")
    print(f"  Duration: {incident['duration_minutes']}m")

Best Practices

Incident Management

  • Create incidents only for high/critical alerts
  • Include dashboard link in incident details
  • Use consistent severity mapping
  • Implement SLA targets

Escalation

  • Set realistic escalation timeouts
  • Test rotation before production
  • Monitor escalation effectiveness
  • Review and adjust policy quarterly

Metrics

  • Track MTTR (mean time to resolution)
  • Monitor alert vs incident ratio
  • Review false alert rates
  • Plan capacity based on incident volume
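
To make the first two metrics concrete, here is a minimal sketch of how MTTR and the alert-to-incident ratio could be computed from exported incident records. The field names `created_at` and `resolved_at` and both function names are assumptions for illustration, and the sample values are made up.

```python
from datetime import datetime, timedelta
from typing import Optional


def mttr_minutes(incidents: list) -> Optional[float]:
    """Mean time to resolution in minutes, ignoring unresolved incidents."""
    durations = [
        (item["resolved_at"] - item["created_at"]).total_seconds() / 60
        for item in incidents
        if item.get("resolved_at") is not None
    ]
    return sum(durations) / len(durations) if durations else None


def alert_to_incident_ratio(alert_count: int, incident_count: int) -> Optional[float]:
    """Alerts raised per incident created; None when there are no incidents."""
    return alert_count / incident_count if incident_count else None


# Made-up sample records: two resolved incidents and one still open.
start = datetime(2024, 3, 1, 9, 0)
sample = [
    {"created_at": start, "resolved_at": start + timedelta(minutes=30)},
    {"created_at": start, "resolved_at": start + timedelta(minutes=60)},
    {"created_at": start, "resolved_at": None},
]
print(mttr_minutes(sample))              # 45.0
print(alert_to_incident_ratio(200, 8))   # 25.0
```

A rising ratio with flat MTTR usually means more alerts are being filtered before paging; a falling ratio suggests routing rules are creating incidents too eagerly.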

Troubleshooting

Q: Incidents are not being created

  • Verify integration key is correct
  • Check service_id exists
  • Test with curl first
  • Check request format
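
The direct test and the request format check can also be done from Python instead of curl. The sketch below builds a trigger event for PagerDuty's Events API v2 (`https://events.pagerduty.com/v2/enqueue`) using only the standard library. `build_trigger_event` and `send_event` are hypothetical helpers, and the `source` and `dedup_key` values are placeholders.

```python
import json
import urllib.request

EVENTS_API_URL = "https://events.pagerduty.com/v2/enqueue"
VALID_SEVERITIES = {"critical", "error", "warning", "info"}


def build_trigger_event(routing_key, summary, source, severity="critical",
                        dedup_key=None, custom_details=None):
    """Build an Events API v2 trigger payload (routing_key is the integration key)."""
    if severity not in VALID_SEVERITIES:
        raise ValueError(f"severity must be one of {sorted(VALID_SEVERITIES)}")
    event = {
        "routing_key": routing_key,
        "event_action": "trigger",
        "payload": {"summary": summary, "source": source, "severity": severity},
    }
    if dedup_key:
        # Repeated events with the same dedup_key are grouped into one alert.
        event["dedup_key"] = dedup_key
    if custom_details:
        event["payload"]["custom_details"] = custom_details
    return event


def send_event(event, timeout=10.0):
    """POST the event to PagerDuty and return (HTTP status, response body)."""
    request = urllib.request.Request(
        EVENTS_API_URL,
        data=json.dumps(event).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.status, response.read().decode("utf-8")


event = build_trigger_event(
    routing_key="your-integration-key",
    summary="TruthVouch test event",
    source="truthvouch-test",      # placeholder source name
    dedup_key="alert_123",         # placeholder: reuse the TruthVouch alert ID
    custom_details={"confidence": "5%"},
)
print(json.dumps(event, indent=2))
```

Calling `send_event(event)` with a real integration key should return HTTP 202 when PagerDuty accepts the event. An HTTP error at this step points at the key or the payload rather than at the TruthVouch configuration.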

Q: Escalations not triggering

  • Verify escalation policy is active
  • Check user is in rotation
  • Test with manual escalation first
  • Review timeout settings

Q: Webhook delivery failing

  • Verify webhook URL is correct
  • Check that PagerDuty can reach your webhook endpoint, and that TruthVouch can reach PagerDuty
  • Implement retry logic
  • Add comprehensive logging
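
Retry logic and logging fit naturally in one small wrapper. The sketch below retries a failing call with exponential backoff and logs each attempt; `retry_with_backoff` is a hypothetical helper, and the attempt count and delays are illustrative defaults rather than recommended values.

```python
import logging
import time

logger = logging.getLogger("pagerduty.retry")


def retry_with_backoff(func, max_attempts=4, base_delay=0.5, sleep=time.sleep):
    """Call func(), retrying on any exception with exponentially growing delays.

    Delays are base_delay, 2*base_delay, 4*base_delay, ... between attempts.
    The last failure is re-raised so callers still see the error.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return func()
        except Exception as exc:
            if attempt == max_attempts:
                logger.error("Giving up after %d attempts: %s", attempt, exc)
                raise
            delay = base_delay * 2 ** (attempt - 1)
            logger.warning("Attempt %d failed (%s); retrying in %.1fs", attempt, exc, delay)
            sleep(delay)


# Demo: a call that fails twice, then succeeds. Delays are recorded, not slept.
calls = {"count": 0}
recorded_delays = []


def flaky_delivery():
    calls["count"] += 1
    if calls["count"] < 3:
        raise ConnectionError("temporary network failure")
    return "delivered"


result = retry_with_backoff(flaky_delivery, sleep=recorded_delays.append)
print(result, recorded_delays)  # delivered [0.5, 1.0]
```

In production, catching only transient errors (timeouts, connection resets, HTTP 5xx) is safer than retrying every exception, since a rejected payload will fail the same way on every attempt.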

Next Steps