PagerDuty Integration

Route critical TruthVouch alerts to PagerDuty for incident management. Automatically create and resolve incidents, escalate to on-call engineers, and track response metrics.

Setup

1. Create PagerDuty Integration

In PagerDuty:

Go to Services and select the service that should receive TruthVouch alerts.
Open the Integrations tab and choose New Integration.
Choose “Events API v2”.
Copy the Integration Key.

2. Configure TruthVouch

truthvouch config pagerduty \
  --integration-key xxxxx \
  --service-id pxxxxx

3. Test Connection

truthvouch config pagerduty --test

Alert Routing

Critical to Incidents

from truthvouch.integrations.pagerduty import PagerDutyHandler

handler = PagerDutyHandler(
    integration_key="your-integration-key"
)

# Route critical alerts to incidents
handler.configure_rule(
    name="Critical Hallucinations → Incident",
    trigger="confidence < 0.5",
    action={
        "type": "create_incident",
        "title": "CRITICAL: Hallucination detected",
        "urgency": "high",
        "service_id": "pxxxxx",
        "escalation_policy_id": "pxxxxx"
    }
)
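
To make the `trigger` expression more concrete, here is a minimal sketch of how a string such as `confidence < 0.5` could be evaluated against an alert. The `matches_trigger` helper and the `<field> <operator> <number>` grammar are assumptions made for illustration, not TruthVouch's actual rule engine.

```python
import operator

# Comparison operators supported by this sketch
_OPS = {
    "<": operator.lt,
    "<=": operator.le,
    ">": operator.gt,
    ">=": operator.ge,
}


def matches_trigger(trigger: str, alert: dict) -> bool:
    """Return True if the alert satisfies a '<field> <op> <number>' trigger."""
    parts = trigger.split()
    if len(parts) != 3:
        raise ValueError(f"expected '<field> <op> <number>', got: {trigger!r}")
    field, symbol, threshold = parts
    if symbol not in _OPS:
        raise ValueError(f"unsupported operator: {symbol}")
    # An alert without the field cannot match the rule
    if field not in alert:
        return False
    return _OPS[symbol](float(alert[field]), float(threshold))


# Hypothetical alert payloads
print(matches_trigger("confidence < 0.5", {"confidence": 0.05}))
print(matches_trigger("confidence < 0.5", {"confidence": 0.92}))
```

Splitting on whitespace keeps the sketch short; a real rule engine would likely need a proper parser and support for compound conditions.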

Severity Mapping

# Map TruthVouch severity to PagerDuty urgency
severity_map = {
    "critical": "high",      # Page on-call
    "high": "low",           # Add to queue
    "medium": "low",         # Log only
    "low": "low"             # Log only
}

# Configure mapping
handler.configure_severity_map(severity_map)
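
The mapping above can also be applied defensively in your own code. The sketch below is independent of the TruthVouch SDK; `urgency_for` and `should_page` are hypothetical helper names, and falling back to "low" for unknown severities is an assumption you may want to change.

```python
# Same mapping as in this guide: only "critical" pages the on-call engineer
SEVERITY_TO_URGENCY = {
    "critical": "high",
    "high": "low",
    "medium": "low",
    "low": "low",
}


def urgency_for(severity: str, default: str = "low") -> str:
    """Look up the PagerDuty urgency for a TruthVouch severity label."""
    # Normalize case and whitespace so "Critical " and "critical" agree
    return SEVERITY_TO_URGENCY.get(severity.strip().lower(), default)


def should_page(severity: str) -> bool:
    """Return True when the severity maps to a high-urgency incident."""
    return urgency_for(severity) == "high"


print(urgency_for("Critical"), should_page("Critical"))
print(urgency_for("medium"), should_page("medium"))
```

Keeping the lookup in one function makes it easier to apply the same mapping everywhere alerts are routed.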

Incident Creation

Create Incident

incident = handler.create_incident(
    title="Hallucination in production",
    description="Confidence: 5% - Earth is flat claim",
    service_id="pxxxxx",
    urgency="high",
    body={
        "type": "incident_body",
        "details": {
            "query": "Is the Earth flat?",
            "response": "Yes, the Earth is flat",
            "confidence": "5%",
            "category": "Science",
            "dashboard_link": "https://dash.truthvouch.com/alert/123"
        }
    }
)

print(f"Created incident: {incident['incident']['incident_number']}")

With Custom Fields

incident = handler.create_incident(
    title="Hallucination - Policy Violation",
    description="Auto-generated from TruthVouch",
    service_id="pxxxxx",
    urgency="high",
    client="api.integration.client:TruthVouch",
    details={
        "custom_field_1": "hallucination",
        "custom_field_2": "5%",
        "alert_id": "alert_123"
    }
)

Incident Lifecycle

Acknowledge and Resolve

# ID of the PagerDuty incident to update
incident_id = "Q02JTUPZWHSN7Q"

# Acknowledge
handler.acknowledge_incident(
    incident_id=incident_id,
    user_id="user123"
)

# Resolve when issue is fixed
handler.resolve_incident(
    incident_id=incident_id,
    resolution_note="Issue fixed in production"
)

Escalation

# Escalate if not acknowledged in 30 minutes
handler.escalate_if_unacknowledged(
    incident_id=incident_id,
    timeout_minutes=30,
    escalation_policy_id="pxxxxx"
)
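
The decision behind an "escalate if unacknowledged" check comes down to a timestamp comparison. The sketch below illustrates that logic only; `needs_escalation` is a hypothetical helper, not part of the TruthVouch SDK, and it assumes timezone-aware timestamps.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional


def needs_escalation(
    created_at: datetime,
    acknowledged_at: Optional[datetime],
    now: datetime,
    timeout_minutes: int = 30,
) -> bool:
    """Return True if the incident is still unacknowledged after the timeout."""
    # An acknowledged incident never needs timeout-based escalation
    if acknowledged_at is not None:
        return False
    return now - created_at >= timedelta(minutes=timeout_minutes)


created = datetime(2024, 3, 1, 10, 0, tzinfo=timezone.utc)
print(needs_escalation(created, None, created + timedelta(minutes=45)))
print(needs_escalation(created, None, created + timedelta(minutes=10)))
```

Passing `now` in explicitly, rather than calling `datetime.now()` inside the function, keeps the check easy to test.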

Bidirectional Sync

Incident Update Webhook

from flask import Flask, request
from truthvouch.client import TruthVouchClient

app = Flask(__name__)
tv_client = TruthVouchClient(api_key="your-api-key")

@app.route("/pagerduty/webhook", methods=["POST"])
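def handle_pd_event():
    """Handle PagerDuty incident updates."""

    # silent=True returns None instead of raising on a missing or malformed body
    event = request.get_json(silent=True) or {}
    event_type = event.get("type")

    # Ignore event types this handler does not sync
    if event_type not in ("incident.acknowledged", "incident.resolved"):
        return {"status": "ignored"}, 200

    # The payload shape follows this guide; verify it against the webhook
    # version configured in PagerDuty
    incident = event.get("data", {}).get("incident", {})
    alert_id = incident.get("body", {}).get("details", {}).get("alert_id")
    if alert_id is None:
        return {"status": "error", "reason": "missing alert_id"}, 400

    if event_type == "incident.acknowledged":
        # Update TruthVouch alert
        tv_client.alerts.acknowledge(alert_id=alert_id)
    else:
        # Mark alert as resolved
        tv_client.alerts.resolve(alert_id=alert_id)

    return {"status": "ok"}, 200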
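
A webhook endpoint like this is publicly reachable, so it is worth verifying that requests really come from PagerDuty before acting on them. The sketch below assumes PagerDuty's v3 webhook signing scheme: an `X-PagerDuty-Signature` header carrying one or more comma-separated `v1=<hex HMAC-SHA256>` values computed over the raw request body. Confirm the header name and format against the PagerDuty documentation for your webhook version. `is_valid_signature` is a hypothetical helper name.

```python
import hashlib
import hmac


def is_valid_signature(secret: str, raw_body: bytes, header_value: str) -> bool:
    """Check a comma-separated list of 'v1=<hex>' signatures against the body."""
    digest = hmac.new(secret.encode("utf-8"), raw_body, hashlib.sha256).hexdigest()
    expected = ("v1=" + digest).encode("utf-8")
    candidates = [part.strip().encode("utf-8") for part in header_value.split(",")]
    # compare_digest runs in constant time, which avoids timing side channels
    return any(hmac.compare_digest(expected, candidate) for candidate in candidates)


# Example with a made-up secret and body
secret = "example-webhook-secret"
body = b'{"type": "incident.acknowledged"}'
signature = "v1=" + hmac.new(secret.encode(), body, hashlib.sha256).hexdigest()
print(is_valid_signature(secret, body, signature))
print(is_valid_signature(secret, b"tampered", signature))
```

In a Flask handler, the raw body would come from `request.get_data()` and the header from `request.headers`; rejecting invalid signatures with a 401 before parsing the JSON is one reasonable design.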

On-Call Escalation

Auto-Escalate

# Escalate critical issues to on-call rotation
handler.configure_escalation(
    escalation_policy_id="pxxxxx",  # Your escalation policy
    levels=[
        {
            "level": 1,
            "timeout_minutes": 15,
            "description": "Wait 15 minutes, then escalate"
        },
        {
            "level": 2,
            "timeout_minutes": 30,
            "description": "Wait 30 minutes, then escalate to manager"
        }
    ]
)
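
To reason about when each level is reached, it can help to compute the active level from the elapsed time. The sketch below assumes the timeouts are cumulative (level 2 starts once level 1's timeout expires) and that the final level stays active once reached. Both are assumptions about the semantics, and `active_level` is a hypothetical helper.

```python
def active_level(levels: list, elapsed_minutes: float) -> int:
    """Return the escalation level that is active after elapsed_minutes."""
    if not levels:
        raise ValueError("at least one escalation level is required")
    ordered = sorted(levels, key=lambda entry: entry["level"])
    deadline = 0
    for entry in ordered:
        # Each level's timeout extends the deadline set by the levels before it
        deadline += entry["timeout_minutes"]
        if elapsed_minutes < deadline:
            return entry["level"]
    # All timeouts have expired: stay on the final level
    return ordered[-1]["level"]


levels = [
    {"level": 1, "timeout_minutes": 15},
    {"level": 2, "timeout_minutes": 30},
]
for minutes in (10, 20, 60):
    print(minutes, active_level(levels, minutes))
```

With the two levels from this guide, the first responder is paged for the first 15 minutes and level 2 is active from minute 15 onward.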

Notify Responder

# When responder is assigned
handler.notify_responder(
    incident_id="Q02JTUPZWHSN7Q",
    message="Critical hallucination detected. Review in TruthVouch dashboard.",
    link="https://dash.truthvouch.com/alerts/123"
)

Metrics and Reports

Get Incident Stats

# Get metrics for service
stats = handler.get_incident_stats(
    service_id="pxxxxx",
    start_date="2024-03-01",
    end_date="2024-03-15"
)

print(f"Total incidents: {stats['total']}")
print(f"Avg resolution time: {stats['avg_resolution_minutes']}m")
print(f"Incidents escalated: {stats['escalated']}")

Incident Report

# Generate monthly report
report = handler.generate_report(
    service_id="pxxxxx",
    period="monthly"
)

for incident in report["incidents"]:
    print(f"- {incident['incident_number']}: {incident['title']}")
    print(f"  Status: {incident['status']}")
    print(f"  Duration: {incident['duration_minutes']}m")

Best Practices

Incident Management

Create incidents only for high/critical alerts
Include dashboard link in incident details
Use consistent severity mapping
Implement SLA targets

Escalation

Set realistic escalation timeouts
Test rotation before production
Monitor escalation effectiveness
Review and adjust policy quarterly

Metrics

Track MTTR (mean time to resolution)
Monitor alert vs incident ratio
Review false alert rates
Plan capacity based on incident volume
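
As a rough illustration of the first two metrics, the sketch below computes MTTR and the alert-to-incident ratio from plain Python data. The record fields (`created_at`, `resolved_at`) and the helper names are assumptions for the example, not the shape returned by `get_incident_stats`.

```python
from datetime import datetime
from typing import Optional


def mttr_minutes(incidents: list) -> Optional[float]:
    """Mean time to resolution in minutes, ignoring unresolved incidents."""
    durations = [
        (item["resolved_at"] - item["created_at"]).total_seconds() / 60
        for item in incidents
        if item.get("resolved_at") is not None
    ]
    return sum(durations) / len(durations) if durations else None


def alert_to_incident_ratio(alert_count: int, incident_count: int) -> Optional[float]:
    """Number of alerts raised per incident created; None if no incidents."""
    return alert_count / incident_count if incident_count else None


# Made-up sample data
incidents = [
    {"created_at": datetime(2024, 3, 1, 10, 0), "resolved_at": datetime(2024, 3, 1, 10, 30)},
    {"created_at": datetime(2024, 3, 2, 11, 0), "resolved_at": datetime(2024, 3, 2, 12, 0)},
    {"created_at": datetime(2024, 3, 3, 9, 0), "resolved_at": None},
]
print(mttr_minutes(incidents))
print(alert_to_incident_ratio(200, 8))
```

Excluding unresolved incidents keeps the average meaningful, but it also hides long-running incidents, so open incidents are worth tracking separately.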

Troubleshooting

Q: Incidents are not being created

Verify integration key is correct
Check service_id exists
Test with curl first
Check request format
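
When checking the request format, it helps to know what a trigger event looks like on the wire. The sketch below builds a PagerDuty Events API v2 payload with the standard library only. The field names (`routing_key`, `event_action`, `payload.summary`, `payload.source`, `payload.severity`, `custom_details`, `dedup_key`) and the enqueue URL follow the Events API v2; the helper names and sample values are hypothetical, and this is not how the TruthVouch handler is necessarily implemented.

```python
import json
import urllib.request

EVENTS_API_URL = "https://events.pagerduty.com/v2/enqueue"
# Severity values accepted by the Events API v2
VALID_SEVERITIES = {"critical", "error", "warning", "info"}


def build_trigger_event(routing_key, summary, source, severity,
                        custom_details=None, dedup_key=None):
    """Build the JSON body for an Events API v2 'trigger' event."""
    if severity not in VALID_SEVERITIES:
        raise ValueError(f"severity must be one of {sorted(VALID_SEVERITIES)}")
    event = {
        "routing_key": routing_key,  # the Integration Key copied during setup
        "event_action": "trigger",
        "payload": {"summary": summary, "source": source, "severity": severity},
    }
    if custom_details:
        event["payload"]["custom_details"] = custom_details
    if dedup_key:
        # Events sharing a dedup_key are grouped into the same alert
        event["dedup_key"] = dedup_key
    return event


def send_event(event, timeout=10):
    """POST the event to PagerDuty and return the decoded JSON response."""
    request = urllib.request.Request(
        EVENTS_API_URL,
        data=json.dumps(event).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


event = build_trigger_event(
    routing_key="your-integration-key",
    summary="Hallucination in production",
    source="truthvouch",
    severity="critical",
    custom_details={"alert_id": "alert_123"},
    dedup_key="alert_123",
)
print(json.dumps(event, indent=2))
```

Printing the payload before calling `send_event` makes it easy to compare against what your integration actually sends. Note that Events API severities (`critical`, `error`, `warning`, `info`) are a different vocabulary from incident urgency (`high`, `low`).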

Q: Escalations not triggering

Verify escalation policy is active
Check user is in rotation
Test with manual escalation first
Review timeout settings

Q: Webhook delivery failing

Verify webhook URL is correct
Check that PagerDuty can reach your webhook endpoint
Implement retry logic
Add comprehensive logging
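
One common way to implement retry logic is exponential backoff. The sketch below is a generic helper, not part of the TruthVouch SDK; `retry_with_backoff` and its parameters are hypothetical, and the attempt count and delays are placeholder values to tune for your environment.

```python
import logging
import time

logger = logging.getLogger("pagerduty_sync")


def retry_with_backoff(operation, max_attempts=4, base_delay=1.0,
                       retry_on=(Exception,), sleep=time.sleep):
    """Call operation(), retrying failures with exponentially growing delays."""
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except retry_on as error:
            # Out of attempts: surface the last error to the caller
            if attempt == max_attempts:
                raise
            delay = base_delay * 2 ** (attempt - 1)  # 1s, 2s, 4s, ...
            logger.warning("attempt %d failed (%s); retrying in %.1fs",
                           attempt, error, delay)
            sleep(delay)


# Example: an operation that fails twice before succeeding
calls = {"count": 0}

def flaky_delivery():
    calls["count"] += 1
    if calls["count"] < 3:
        raise ConnectionError("temporary failure")
    return "delivered"

print(retry_with_backoff(flaky_delivery, sleep=lambda seconds: None))
```

Narrowing `retry_on` to transient errors (for example connection failures) avoids retrying requests that will never succeed, such as those rejected for a bad integration key.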
