# Performance & Latency

## Overview
The Firewall adds latency to every request. Typical overhead is 50-200ms depending on deployment model, configuration, and traffic volume. This guide shows how to measure, optimize, and tune for your workload.
## Understanding Firewall Latency
The total response time has three components:
```
Total Latency = Network + Scanning + AI Provider
```

**Network (20-60ms)**
- Time for request to reach Firewall
- Firewall to AI provider
- Response back to client
**Scanning (10-100ms)**
- Each of the 15 stages takes time
- Depends on configuration and payload size
- Parallelization can reduce this
**AI Provider (500ms - 30s)**
- The actual model inference
- This is usually the dominant factor
- Firewall impact is small relative to this
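To see why the provider usually dominates, the decomposition can be modeled with rough numbers (these values are illustrative, not measurements from a real deployment):

```python
# Rough model of the three latency components (illustrative ms values).
network_ms = 40      # client -> Firewall -> provider -> client, round trip
scanning_ms = 90     # sum of enabled stage budgets
provider_ms = 2000   # model inference, usually the dominant term

total_ms = network_ms + scanning_ms + provider_ms
firewall_share = scanning_ms / total_ms
print(f"total: {total_ms}ms, firewall share: {firewall_share:.1%}")
```

Even with all stages enabled, the Firewall's share stays in the low single digits of end-to-end time for typical inference calls.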
## Latency Budgets Per Stage

### Typical Stage Timing (milliseconds)
| Stage | Min | Avg | Max | Notes |
|---|---|---|---|---|
| Rate Limiter | <1 | 1 | 5 | Local check |
| Input PII Scanner | 2 | 8 | 25 | Regex + model |
| Injection Detector | 5 | 15 | 40 | Multiple algorithms |
| Content Safety | 8 | 20 | 50 | Classification model |
| Business Logic | 1 | 5 | 15 | Policy evaluation |
| Truth Scanner | 10 | 30 | 100 | DB lookup + similarity |
| Output PII Scanner | 2 | 8 | 25 | Regex + model |
| Policy Enforcement | 1 | 3 | 10 | Policy check |
| Total | 30 | 90 | 270 | Typical range |
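As a sanity check, the per-stage averages in the table sum to the quoted 90ms typical total:

```python
# Average per-stage budgets from the table above (milliseconds).
stage_avg_ms = {
    "rate_limiter": 1,
    "input_pii_scanner": 8,
    "injection_detector": 15,
    "content_safety": 20,
    "business_logic": 5,
    "truth_scanner": 30,
    "output_pii_scanner": 8,
    "policy_enforcement": 3,
}

total_avg_ms = sum(stage_avg_ms.values())
print(total_avg_ms)  # 90, matching the table's average total
```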
## Measuring Performance

### Via Dashboard
Governance → Reports → Performance:
- P50, P95, P99 latency percentiles
- Throughput (requests/second)
- Error rates
- Per-stage breakdown
### Via API
```shell
curl -X GET http://localhost:5000/api/v1/governance/metrics/latency \
  -H "Authorization: Bearer $TOKEN" \
  -H "Accept: application/json"
```

Response:
```json
{
  "period": "last_24_hours",
  "overall_latency": {
    "p50_ms": 85,
    "p95_ms": 180,
    "p99_ms": 250,
    "mean_ms": 95
  },
  "by_stage": {
    "rate_limiter": {"p50": 1, "p95": 2},
    "injection_detector": {"p50": 12, "p95": 35},
    "truth_scanner": {"p50": 28, "p95": 90},
    "output_pii_scanner": {"p50": 6, "p95": 18}
  },
  "throughput": {
    "requests_per_second": 450,
    "peak_requests_per_second": 650
  }
}
```

### Local Testing
```shell
# Test single request latency
time curl -X POST http://localhost:5003/openai/v1/chat/completions \
  -H "Authorization: Bearer $KEY" \
  -d '{"model": "gpt-4", "messages": [...]}'
```
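A single `time` sample says little about tail latency. If you script repeated measurements yourself, a nearest-rank percentile over the collected samples gives the same p50/p95/p99 view the dashboard reports (the sample values below are illustrative):

```python
import math

def percentile(samples: list[float], pct: float) -> float:
    """Nearest-rank percentile: the value at rank ceil(pct/100 * N)."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]

# Latencies (ms) collected from repeated requests.
samples = [82, 85, 88, 90, 95, 110, 150, 180, 210, 260]
for pct in (50, 95, 99):
    print(f"p{pct}: {percentile(samples, pct)}ms")
```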
```shell
# Load test (ab needs -T to send the correct content type with -p)
ab -n 1000 -c 10 -p request.json -T "application/json" \
  -H "Authorization: Bearer $KEY" \
  http://localhost:5003/openai/v1/chat/completions
```

## Optimization Strategies
### 1. Disable Unused Stages
If you don’t need a feature, disable it:
```yaml
firewall:
  stages:
    - name: "rate-limiter"
      enabled: true

    - name: "injection-detector"
      enabled: false  # Not needed for internal systems

    - name: "truth-scanner"
      enabled: true

    - name: "content-safety"
      enabled: false  # Not needed for internal API
```

Savings: 30-50ms per disabled stage
### 2. Increase Thresholds (Reduce False Positives)

Lower sensitivity trades detection accuracy for speed:
```yaml
firewall:
  stages:
    - name: "injection-detector"
      config:
        sensitivity: "low"  # Instead of "high"
```

Savings: 10-20ms

### 3. Enable Caching
Cache embedding vectors and classifications:
```yaml
firewall:
  cache:
    enabled: true
    backend: "redis"  # or "in-memory"
    ttl_seconds: 3600
    max_size_mb: 1000

  stages:
    - name: "truth-scanner"
      config:
        cache_embeddings: true
        cache_ttl: 1800
```

Savings: 50-80% on repeated queries
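Whether caching pays off depends on the hit rate. A quick expected-value estimate, using the truth-scanner average from the stage table (the hit rate and cache-hit cost are assumptions for illustration):

```python
uncached_ms = 30   # truth-scanner average scan time (from the stage table)
cached_ms = 3      # cost of serving a cache hit (assumed)
hit_rate = 0.7     # fraction of queries that repeat (assumed)

expected_ms = hit_rate * cached_ms + (1 - hit_rate) * uncached_ms
saving = 1 - expected_ms / uncached_ms
print(f"expected: {expected_ms:.1f}ms, saving: {saving:.0%}")
```

At a 70% hit rate, the average truth-scanner cost drops from 30ms to roughly 11ms, consistent with the 50-80% range quoted above.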
### 4. Async Scanning for Non-Critical Paths

Run expensive scans in the background for logging rather than blocking the response:
```yaml
firewall:
  async:
    enabled: true
    timeout_ms: 5000

  stages:
    - name: "truth-scanner"
      config:
        async: true  # Doesn't block response

    - name: "content-safety"
      config:
        async: false  # Critical, must complete
```

Savings: 20-50ms by not waiting for non-blocking scans
### 5. Batch Processing
Group multiple requests if possible:
```python
# Instead of:
for prompt in prompts:
    response = await client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}]
    )
# Firewall latency: 90ms per request

# Do:
responses = await client.batch(
    [
        {"model": "gpt-4", "messages": [{"role": "user", "content": prompt}]}
        for prompt in prompts
    ]
)
# Firewall latency: 90ms for the entire batch
```

Savings: 80-95% when batching 10+ requests
### 6. Adjust Timeout Values
Balance between correctness and speed:
```yaml
firewall:
  stages:
    - name: "truth-scanner"
      config:
        timeout_ms: 2000  # Don't wait forever

    - name: "injection-detector"
      config:
        timeout_ms: 1000  # Quick, or fail open
```

Default: stages time out after 5 seconds and fall through
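The fail-open behavior is straightforward to reason about: if a scan does not finish within its budget, the request proceeds (or is blocked, if the stage is configured to fail closed). A minimal sketch of the idea (hypothetical, not the Firewall's actual implementation):

```python
import concurrent.futures

def run_stage(scan, payload, timeout_ms, fail_open=True):
    """Run a scan with a time budget; on timeout, fail open (allow)
    or fail closed (block)."""
    with concurrent.futures.ThreadPoolExecutor(max_workers=1) as pool:
        future = pool.submit(scan, payload)
        try:
            return future.result(timeout=timeout_ms / 1000)
        except concurrent.futures.TimeoutError:
            return "allow" if fail_open else "block"
```

A fast scan returns its own verdict; a scan that exceeds `timeout_ms` falls through to the configured default, which is why a tight timeout on a fail-open stage trades safety for latency.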
### 7. Use Regional Endpoints

Connect to the geographically nearest endpoint:
```python
# Before: Default US endpoint
client = TruthVouchClient(api_key="tvk_...", region="us")

# After: Use EU endpoint if in EU
client = TruthVouchClient(api_key="tvk_...", region="eu")

# Savings: 30-80ms from reduced network latency
```

## Configuration for Speed
### High-Throughput Configuration

For e-commerce, support chatbots, and other high-volume systems:
```yaml
firewall:
  version: "2.0"
  pipeline:
    enabled: true

  # Only essential stages
  stages:
    - name: "rate-limiter"
      enabled: true
      config:
        requests_per_minute: 10000
        burst_allowance: 1000

    - name: "input-pii-scanner"
      enabled: true
      config:
        confidence_threshold: 0.95  # Only very sure
        entity_types: ["credit_card"]  # Only critical

    - name: "injection-detector"
      enabled: false  # Skip for trusted internal API

    - name: "output-pii-scanner"
      enabled: true
      config:
        async: true  # Log only, don't block

    - name: "content-safety"
      enabled: false  # Skip if not critical

  cache:
    enabled: true
    backend: "redis"
    ttl_seconds: 3600

  async:
    enabled: true
    timeout_ms: 3000
```

Expected latency: 20-50ms added overhead
### High-Security Configuration

For regulated industries where data protection is critical:
```yaml
firewall:
  version: "2.0"
  pipeline:
    enabled: true

  # All stages enabled, strict thresholds
  stages:
    - name: "rate-limiter"
      enabled: true
      config:
        requests_per_minute: 100
        burst_allowance: 0

    - name: "input-pii-scanner"
      enabled: true
      config:
        confidence_threshold: 0.70  # Catch all
        entity_types: ["all"]

    - name: "injection-detector"
      enabled: true
      config:
        sensitivity: "high"
        timeout_ms: 5000

    - name: "output-pii-scanner"
      enabled: true
      config:
        async: false  # Block on PII

    - name: "truth-scanner"
      enabled: true
      config:
        similarity_threshold: 0.85

    - name: "content-safety"
      enabled: true
      config:
        toxicity_threshold: 0.5

  cache:
    enabled: true
    backend: "redis"

  async:
    enabled: false  # All scans must complete
```

Expected latency: 150-250ms added overhead
## Monitoring Over Time

### Set Up Alerts
Alert if latency exceeds SLA:
```yaml
alerts:
  - name: "firewall_latency_high"
    condition: "p95_latency > 200ms"
    severity: "warning"
    action: "notify_slack"

  - name: "firewall_latency_critical"
    condition: "p99_latency > 500ms"
    severity: "critical"
    action: "page_oncall"
```

### Weekly Review
- Check Reports → Performance
- Compare to previous week
- Identify which stages are slowest
- Adjust configuration if needed
## Capacity Planning
Track these metrics:
- Peak requests/second
- P99 latency
- Error rate
If hitting limits, scale by:
- Adding more Firewall instances (horizontal)
- Increasing resources per instance (vertical)
- Enabling more aggressive caching
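A back-of-envelope sizing check using the throughput numbers from the metrics example earlier (the 30% headroom factor is an assumption for illustration, not a product recommendation):

```python
import math

peak_rps = 650            # peak_requests_per_second from the metrics API
per_instance_rps = 450    # sustained requests_per_second of one instance
headroom = 1.3            # 30% buffer for traffic spikes (assumed)

instances_needed = math.ceil(peak_rps * headroom / per_instance_rps)
print(instances_needed)
```

Here two instances cover the observed peak with buffer; rerun the check whenever peak traffic or per-instance throughput changes.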
## Real-World Latency Examples

### Example 1: SaaS Proxy (Production)
```
Request → Your App (0ms)
  ↓ Network to TruthVouch (30ms)
  ↓ Firewall Scanning:
      - Rate Limiter: 1ms
      - PII Scanner: 8ms
      - Injection Detector: 15ms
      - Content Safety: 0ms (disabled)
      - Truth Scanner: 25ms
      Total: 49ms
  ↓ Network to OpenAI (40ms)
  ↓ OpenAI Response: 2000ms
  ↓ Network back to App (30ms)
  ↓
Total End-to-End: 2149ms
```

Firewall overhead: 49ms (2.3% of total)

### Example 2: Self-Hosted Sidecar
```
Request → Your App (0ms)
  ↓ Local network to Firewall (1ms)
  ↓ Firewall Scanning: 90ms (all stages enabled)
  ↓ Local network to AI Provider (1ms)
  ↓ AI Provider Response: 1500ms
  ↓ Network back to App (1ms)
  ↓
Total End-to-End: 1593ms
```

Firewall overhead: 92ms (5.8% of total, but all local)

### Example 3: Heavily Cached
```
Request → Your App (0ms)
  ↓ Network to TruthVouch (30ms)
  ↓ Firewall Scanning (with cache hits):
      - Rate Limiter: 1ms
      - PII Scanner: 2ms (cached)
      - Injection Detector: 5ms (cached classification)
      - Truth Scanner: 3ms (cached embedding hit!)
      Total: 11ms
  ↓ Network to OpenAI (40ms)
  ↓ OpenAI Response: 1800ms
  ↓ Network back: 30ms
  ↓
Total End-to-End: 1911ms
```

Firewall overhead: 11ms (0.6% of total). The cache was very effective.
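The overhead percentages in the three examples can be verified directly from the numbers above:

```python
# (firewall_overhead_ms, end_to_end_ms) taken from the three examples.
examples = {
    "saas_proxy": (49, 2149),
    "self_hosted_sidecar": (92, 1593),
    "heavily_cached": (11, 1911),
}

for name, (overhead_ms, total_ms) in examples.items():
    print(f"{name}: {overhead_ms / total_ms:.1%} of end-to-end time")
```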