
Performance & Latency

Overview

The Firewall adds latency to every request. Typical overhead is 50-200ms depending on deployment model, configuration, and traffic volume. This guide shows how to measure, optimize, and tune for your workload.

Understanding Firewall Latency

The total response time has three components:

Total Latency = Network + Scanning + AI Provider

Network (20-60ms)

  • Time for request to reach Firewall
  • Firewall to AI provider
  • Response back to client

Scanning (10-100ms)

  • Each of the 15 stages takes time
  • Depends on configuration and payload size
  • Parallelization can reduce this

AI Provider (500ms - 30s)

  • The actual model inference
  • This is usually the dominant factor
  • Firewall impact is small relative to this
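The three-component model above is simple arithmetic; a quick sketch makes the proportions concrete (all numbers here are illustrative, not measurements, and the helper names are hypothetical):

```python
# Sketch of the three-component latency model described above.

def total_latency_ms(network_ms: float, scanning_ms: float, provider_ms: float) -> float:
    """Total Latency = Network + Scanning + AI Provider."""
    return network_ms + scanning_ms + provider_ms

def firewall_overhead_pct(network_ms: float, scanning_ms: float, provider_ms: float) -> float:
    """Share of end-to-end time attributable to scanning alone."""
    total = total_latency_ms(network_ms, scanning_ms, provider_ms)
    return 100.0 * scanning_ms / total

# With a 2 s model call, even 90 ms of scanning is under 5% of the total.
print(round(firewall_overhead_pct(40, 90, 2000), 1))  # → 4.2
```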

Latency Budgets Per Stage

Typical Stage Timing (milliseconds)

Stage                 Min    Avg    Max    Notes
Rate Limiter          <1     1      5      Local check
Input PII Scanner     2      8      25     Regex + model
Injection Detector    5      15     40     Multiple algorithms
Content Safety        8      20     50     Classification model
Business Logic        1      5      15     Policy evaluation
Truth Scanner         10     30     100    DB lookup + similarity
Output PII Scanner    2      8      25     Regex + model
Policy Enforcement    1      3      10     Policy check
Total                 30     90     270    Typical range

Measuring Performance

Via Dashboard

Governance → Reports → Performance:

  • P50, P95, P99 latency percentiles
  • Throughput (requests/second)
  • Error rates
  • Per-stage breakdown
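If you export raw latency samples instead of reading the dashboard, the same P50/P95/P99 summary can be computed with the standard library. A sketch (`latency_percentiles` is a hypothetical helper, not part of the product):

```python
import statistics

def latency_percentiles(samples_ms):
    """Compute the P50/P95/P99/mean summary the dashboard shows."""
    cuts = statistics.quantiles(samples_ms, n=100, method="inclusive")
    return {
        "p50_ms": cuts[49],   # 50th percentile is the 50th of 99 cut points
        "p95_ms": cuts[94],
        "p99_ms": cuts[98],
        "mean_ms": statistics.fmean(samples_ms),
    }

samples = list(range(1, 101))  # pretend: 100 measured latencies of 1..100 ms
print(latency_percentiles(samples))
```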

Via API

curl -X GET http://localhost:5000/api/v1/governance/metrics/latency \
-H "Authorization: Bearer $TOKEN" \
-H "Accept: application/json"

Response:

{
  "period": "last_24_hours",
  "overall_latency": {
    "p50_ms": 85,
    "p95_ms": 180,
    "p99_ms": 250,
    "mean_ms": 95
  },
  "by_stage": {
    "rate_limiter": {"p50": 1, "p95": 2},
    "injection_detector": {"p50": 12, "p95": 35},
    "truth_scanner": {"p50": 28, "p95": 90},
    "output_pii_scanner": {"p50": 6, "p95": 18}
  },
  "throughput": {
    "requests_per_second": 450,
    "peak_requests_per_second": 650
  }
}
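A short sketch of consuming that response: rank stages by P95 to see where tuning effort pays off first. The JSON shape is copied from the example above; no live API call is made:

```python
import json

# Same "by_stage" shape as the metrics response above.
metrics = json.loads("""{
  "by_stage": {
    "rate_limiter": {"p50": 1, "p95": 2},
    "injection_detector": {"p50": 12, "p95": 35},
    "truth_scanner": {"p50": 28, "p95": 90},
    "output_pii_scanner": {"p50": 6, "p95": 18}
  }
}""")

# Sort stages slowest-first by tail latency.
slowest = sorted(metrics["by_stage"].items(),
                 key=lambda kv: kv[1]["p95"], reverse=True)
for name, t in slowest:
    print(f"{name}: p95={t['p95']}ms")
# truth_scanner tops the list, matching the per-stage budget table above.
```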

Local Testing

# Test single-request latency
time curl -X POST http://localhost:5003/openai/v1/chat/completions \
  -H "Authorization: Bearer $KEY" \
  -H "Content-Type: application/json" \
  -d '{"model": "gpt-4", "messages": [...]}'

# Load test (ab posts as text/plain by default, so set the content type)
ab -n 1000 -c 10 -p request.json -T "application/json" \
  -H "Authorization: Bearer $KEY" \
  http://localhost:5003/openai/v1/chat/completions

Optimization Strategies

1. Disable Unused Stages

If you don’t need a feature, disable it:

firewall:
  stages:
    - name: "rate-limiter"
      enabled: true
    - name: "injection-detector"
      enabled: false  # Not needed for internal systems
    - name: "truth-scanner"
      enabled: true
    - name: "content-safety"
      enabled: false  # Not needed for internal API

Savings: 30-50ms per disabled stage

2. Increase Thresholds (Reduce False Positives)

Lower sensitivity means faster scanning, at the cost of missing borderline detections:

firewall:
  stages:
    - name: "injection-detector"
      config:
        sensitivity: "low"  # Instead of "high"
        # Savings: 10-20ms

3. Enable Caching

Cache embedding vectors and classifications:

firewall:
  cache:
    enabled: true
    backend: "redis"  # or "in-memory"
    ttl_seconds: 3600
    max_size_mb: 1000
  stages:
    - name: "truth-scanner"
      config:
        cache_embeddings: true
        cache_ttl: 1800

Savings: 50-80% on repeated queries
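The caching pattern behind `cache_embeddings` can be sketched as a minimal in-memory TTL cache. This is a simplified illustration of the technique, not the Firewall's actual implementation:

```python
import time

class TTLCache:
    """Minimal in-memory cache with per-entry expiry."""
    def __init__(self, ttl_seconds: float):
        self.ttl = ttl_seconds
        self._store = {}  # key -> (expires_at, value)

    def get(self, key):
        entry = self._store.get(key)
        if entry is None or entry[0] < time.monotonic():
            self._store.pop(key, None)  # expired or missing
            return None
        return entry[1]

    def put(self, key, value):
        self._store[key] = (time.monotonic() + self.ttl, value)

cache = TTLCache(ttl_seconds=1800)

def embed(text: str):
    cached = cache.get(text)
    if cached is not None:
        return cached            # cache hit: skips the expensive model call
    vector = [0.1, 0.2, 0.3]     # stand-in for a real embedding call
    cache.put(text, vector)
    return vector
```

Repeated queries within the TTL never reach the embedding model, which is where the 50-80% savings on repeated queries comes from.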

4. Async Scanning for Non-Critical Paths

Run expensive scans in the background for audit logging instead of blocking the response:

firewall:
  async:
    enabled: true
    timeout_ms: 5000
  stages:
    - name: "truth-scanner"
      config:
        async: true  # Doesn't block response
    - name: "content-safety"
      config:
        async: false  # Critical, must complete

Savings: 20-50ms by not waiting for non-blocking scans
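The blocking/non-blocking split can be sketched with `asyncio`. Stage names and timings are illustrative; the real pipeline's internals may differ:

```python
import asyncio

async def scan(name: str, ms: int) -> str:
    await asyncio.sleep(ms / 1000)       # stand-in for real scanning work
    return f"{name} ok"

async def handle_request(prompt: str) -> str:
    # Blocking scan: the response cannot be produced until this completes.
    await scan("content-safety", 20)
    # Async scan: fire it off for audit logging; the response doesn't wait.
    background = asyncio.create_task(scan("truth-scanner", 30))
    response = f"answer for {prompt!r}"  # stand-in for the AI provider call
    # A real server would let `background` finish on its own; here we
    # drain it before the event loop closes so the demo exits cleanly.
    await background
    return response

print(asyncio.run(handle_request("hi")))  # → answer for 'hi'
```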

5. Batch Processing

Group multiple requests if possible:

# Instead of:
for prompt in prompts:
    response = await client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}]
    )
# Firewall latency: 90ms per request

# Do:
responses = await client.batch(
    [
        {"model": "gpt-4", "messages": [{"role": "user", "content": prompt}]}
        for prompt in prompts
    ]
)
# Firewall latency: 90ms for entire batch

Savings: 80-95% when batching 10+ requests

6. Adjust Timeout Values

Balance between correctness and speed:

firewall:
  stages:
    - name: "truth-scanner"
      config:
        timeout_ms: 2000  # Don't wait forever
    - name: "injection-detector"
      config:
        timeout_ms: 1000  # Quick, or fail open

Default: stages time out after 5 seconds and fail open, letting the request continue.
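The fail-open timeout pattern can be sketched with `asyncio.wait_for`. This illustrates the behavior, not the product's code:

```python
import asyncio

async def run_stage(coro, timeout_ms: int, stage: str):
    """Run one scan with a deadline; on timeout, fail open."""
    try:
        return await asyncio.wait_for(coro, timeout=timeout_ms / 1000)
    except asyncio.TimeoutError:
        # The scan didn't finish in time: let the request through,
        # but record that this verdict was a timeout, not a real pass.
        return {"stage": stage, "verdict": "pass", "timed_out": True}

async def slow_scan():
    await asyncio.sleep(5)  # pretend this scan is stuck
    return {"stage": "truth-scanner", "verdict": "pass", "timed_out": False}

result = asyncio.run(run_stage(slow_scan(), timeout_ms=100, stage="truth-scanner"))
print(result)  # → {'stage': 'truth-scanner', 'verdict': 'pass', 'timed_out': True}
```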

7. Use Regional Endpoints

Connect to the endpoint geographically closest to your traffic:

# Before: default US endpoint
client = TruthVouchClient(api_key="tvk_...", region="us")

# After: use the EU endpoint if your traffic originates in the EU
client = TruthVouchClient(api_key="tvk_...", region="eu")
# Savings: 30-80ms from shorter network round trips

Configuration for Speed

High-Throughput Configuration

For e-commerce, support chatbots, and other high-volume systems:

firewall:
  version: "2.0"
  pipeline:
    enabled: true
  # Only essential stages
  stages:
    - name: "rate-limiter"
      enabled: true
      config:
        requests_per_minute: 10000
        burst_allowance: 1000
    - name: "input-pii-scanner"
      enabled: true
      config:
        confidence_threshold: 0.95  # Only very sure
        entity_types: ["credit_card"]  # Only critical
    - name: "injection-detector"
      enabled: false  # Skip for trusted internal API
    - name: "output-pii-scanner"
      enabled: true
      config:
        async: true  # Log only, don't block
    - name: "content-safety"
      enabled: false  # Skip if not critical
  cache:
    enabled: true
    backend: "redis"
    ttl_seconds: 3600
  async:
    enabled: true
    timeout_ms: 3000

Expected latency: 20-50ms added overhead

High-Security Configuration

For regulated industries where data protection is critical:

firewall:
  version: "2.0"
  pipeline:
    enabled: true
  # All stages enabled, strict thresholds
  stages:
    - name: "rate-limiter"
      enabled: true
      config:
        requests_per_minute: 100
        burst_allowance: 0
    - name: "input-pii-scanner"
      enabled: true
      config:
        confidence_threshold: 0.70  # Catch all
        entity_types: ["all"]
    - name: "injection-detector"
      enabled: true
      config:
        sensitivity: "high"
        timeout_ms: 5000
    - name: "output-pii-scanner"
      enabled: true
      config:
        async: false  # Block on PII
    - name: "truth-scanner"
      enabled: true
      config:
        similarity_threshold: 0.85
    - name: "content-safety"
      enabled: true
      config:
        toxicity_threshold: 0.5
  cache:
    enabled: true
    backend: "redis"
  async:
    enabled: false  # All scans must complete

Expected latency: 150-250ms added overhead

Monitoring Over Time

Set Up Alerts

Alert if latency exceeds SLA:

alerts:
  - name: "firewall_latency_high"
    condition: "p95_latency > 200ms"
    severity: "warning"
    action: "notify_slack"
  - name: "firewall_latency_critical"
    condition: "p99_latency > 500ms"
    severity: "critical"
    action: "page_oncall"

Weekly Review

  1. Check Reports → Performance
  2. Compare to previous week
  3. Identify which stages are slowest
  4. Adjust configuration if needed

Capacity Planning

Track these metrics:

  • Peak requests/second
  • P99 latency
  • Error rate

If hitting limits, scale by:

  • Adding more Firewall instances (horizontal)
  • Increasing resources per instance (vertical)
  • Enabling more aggressive caching
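A back-of-the-envelope sizing sketch for the horizontal option, using the throughput numbers from the metrics example earlier (the 70% headroom factor is an assumption, not a product recommendation):

```python
import math

def instances_needed(peak_rps: float, per_instance_rps: float,
                     headroom: float = 0.7) -> int:
    """Size the fleet so each instance runs at ~70% of its measured ceiling."""
    return math.ceil(peak_rps / (per_instance_rps * headroom))

# 650 peak RPS observed, each instance benchmarked at 450 RPS sustained.
print(instances_needed(peak_rps=650, per_instance_rps=450))  # → 3
```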

Real-World Latency Examples

Example 1: SaaS Proxy (Production)

Request → Your App (0ms)
Network to TruthVouch (30ms)
Firewall Scanning:
- Rate Limiter: 1ms
- PII Scanner: 8ms
- Injection Detector: 15ms
- Content Safety: 0ms (disabled)
- Truth Scanner: 25ms
Total: 49ms
Network to OpenAI (40ms)
OpenAI Response: 2000ms
Network back to App (30ms)
Total End-to-End: 2149ms
Firewall overhead: 49ms (2.3% of total)

Example 2: Self-Hosted Sidecar

Request → Your App (0ms)
Local network to Firewall (1ms)
Firewall Scanning: 90ms (all stages enabled)
Local network to AI Provider (1ms)
AI Provider Response: 1500ms
Network back to App (1ms)
Total End-to-End: 1593ms
Firewall overhead: 92ms (5.8% of total, but all local)

Example 3: Heavily Cached

Request → Your App (0ms)
Network to TruthVouch (30ms)
Firewall Scanning (with cache hits):
- Rate Limiter: 1ms
- PII Scanner: 2ms (cached)
- Injection Detector: 5ms (cached classification)
- Truth Scanner: 3ms (cached embedding hit!)
Total: 11ms
Network to OpenAI (40ms)
OpenAI Response: 1800ms
Network back: 30ms
Total End-to-End: 1911ms
Firewall overhead: 11ms (0.6% of total!) — Cache was very effective