# Performance & Latency

## Overview
The Firewall adds latency to every request. Typical overhead is 50-200ms depending on deployment model, configuration, and traffic volume. This guide shows how to measure, optimize, and tune for your workload.
## Understanding Firewall Latency
The total response time has three components:
```
Total Latency = Network + Scanning + AI Provider
```

**Network (20-60ms)**
- Time for request to reach Firewall
- Firewall to AI provider
- Response back to client
**Scanning (10-100ms)**
- Each of the 15 stages takes time
- Depends on configuration and payload size
- Parallelization can reduce this
**AI Provider (500ms - 30s)**
- The actual model inference
- This is usually the dominant factor
- Firewall impact is small relative to this
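To see why the provider usually dominates, the decomposition can be modeled with rough numbers (these values are illustrative, not measurements from a real deployment):

```python
# Rough model of the three latency components (illustrative ms values).
network_ms = 40      # client -> Firewall -> provider -> client, round trip
scanning_ms = 90     # sum of enabled stage budgets
provider_ms = 2000   # model inference, usually the dominant term

total_ms = network_ms + scanning_ms + provider_ms
firewall_share = scanning_ms / total_ms
print(f"total: {total_ms}ms, firewall share: {firewall_share:.1%}")
```

Even with all stages enabled, the Firewall's share stays in the low single digits of end-to-end time for typical inference calls.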
## Latency Budgets Per Stage

### Typical Stage Timing (milliseconds)
| Stage | Min | Avg | Max | Notes |
|---|---|---|---|---|
| Rate Limiter | <1 | 1 | 5 | Local check |
| Input PII Scanner | 2 | 8 | 25 | Regex + model |
| Injection Detector | 5 | 15 | 40 | Multiple algorithms |
| Content Safety | 8 | 20 | 50 | Classification model |
| Business Logic | 1 | 5 | 15 | Policy evaluation |
| Truth Scanner | 10 | 30 | 100 | DB lookup + similarity |
| Output PII Scanner | 2 | 8 | 25 | Regex + model |
| Policy Enforcement | 1 | 3 | 10 | Policy check |
| Total | 30 | 90 | 270 | Typical range |
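As a sanity check, the per-stage averages in the table sum to the quoted 90ms typical total:

```python
# Average per-stage budgets from the table above (milliseconds).
stage_avg_ms = {
    "rate_limiter": 1,
    "input_pii_scanner": 8,
    "injection_detector": 15,
    "content_safety": 20,
    "business_logic": 5,
    "truth_scanner": 30,
    "output_pii_scanner": 8,
    "policy_enforcement": 3,
}

total_avg_ms = sum(stage_avg_ms.values())
print(total_avg_ms)  # 90, matching the table's average total
```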
## Measuring Performance

### Via Dashboard
Governance → Reports → Performance:
- P50, P95, P99 latency percentiles
- Throughput (requests/second)
- Error rates
- Per-stage breakdown
### Via API
```shell
curl -X GET http://localhost:5000/api/v1/governance/metrics/latency \
  -H "Authorization: Bearer $TOKEN" \
  -H "Accept: application/json"
```

Response:
```json
{
  "period": "last_24_hours",
  "overall_latency": {
    "p50_ms": 85,
    "p95_ms": 180,
    "p99_ms": 250,
    "mean_ms": 95
  },
  "by_stage": {
    "rate_limiter": {"p50": 1, "p95": 2},
    "injection_detector": {"p50": 12, "p95": 35},
    "truth_scanner": {"p50": 28, "p95": 90},
    "output_pii_scanner": {"p50": 6, "p95": 18}
  },
  "throughput": {
    "requests_per_second": 450,
    "peak_requests_per_second": 650
  }
}
```

### Local Testing
```shell
# Test single request latency
time curl -X POST http://localhost:5003/openai/v1/chat/completions \
  -H "Authorization: Bearer $KEY" \
  -d '{"model": "gpt-4", "messages": [...]}'
```
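A single `time` sample says little about tail latency. If you script repeated measurements yourself, a nearest-rank percentile over the collected samples gives the same p50/p95/p99 view the dashboard reports (the sample values below are illustrative):

```python
import math

def percentile(samples: list[float], pct: float) -> float:
    """Nearest-rank percentile: the value at rank ceil(pct/100 * N)."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]

# Latencies (ms) collected from repeated requests.
samples = [82, 85, 88, 90, 95, 110, 150, 180, 210, 260]
for pct in (50, 95, 99):
    print(f"p{pct}: {percentile(samples, pct)}ms")
```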
```shell
# Load test (ab needs -T to send the correct content type with -p)
ab -n 1000 -c 10 -p request.json -T "application/json" \
  -H "Authorization: Bearer $KEY" \
  http://localhost:5003/openai/v1/chat/completions
```

## Optimization Strategies
### 1. Disable Unused Stages
If you don’t need a feature, disable it:
```yaml
firewall:
  stages:
    - name: "rate-limiter"
      enabled: true

    - name: "injection-detector"
      enabled: false  # Not needed for internal systems

    - name: "truth-scanner"
      enabled: true

    - name: "content-safety"
      enabled: false  # Not needed for internal API
```

Savings: 30-50ms per disabled stage
### 2. Increase Thresholds (Reduce False Positives)

Lower sensitivity trades detection accuracy for speed:
```yaml
firewall:
  stages:
    - name: "injection-detector"
      config:
        sensitivity: "low"  # Instead of "high"
```

Savings: 10-20ms

### 3. Enable Caching
Cache embedding vectors and classifications:
```yaml
firewall:
  cache:
    enabled: true
    backend: "redis"  # or "in-memory"
    ttl_seconds: 3600
    max_size_mb: 1000

  stages:
    - name: "truth-scanner"
      config:
        cache_embeddings: true
        cache_ttl: 1800
```

Savings: 50-80% on repeated queries
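Whether caching pays off depends on the hit rate. A quick expected-value estimate, using the truth-scanner average from the stage table (the hit rate and cache-hit cost are assumptions for illustration):

```python
uncached_ms = 30   # truth-scanner average scan time (from the stage table)
cached_ms = 3      # cost of serving a cache hit (assumed)
hit_rate = 0.7     # fraction of queries that repeat (assumed)

expected_ms = hit_rate * cached_ms + (1 - hit_rate) * uncached_ms
saving = 1 - expected_ms / uncached_ms
print(f"expected: {expected_ms:.1f}ms, saving: {saving:.0%}")
```

At a 70% hit rate, the average truth-scanner cost drops from 30ms to roughly 11ms, consistent with the 50-80% range quoted above.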
### 4. Async Scanning for Non-Critical Paths

Run expensive scans in the background for logging rather than blocking the response:
```yaml
firewall:
  async:
    enabled: true
    timeout_ms: 5000

  stages:
    - name: "truth-scanner"
      config:
        async: true  # Doesn't block response

    - name: "content-safety"
      config:
        async: false  # Critical, must complete
```

Savings: 20-50ms by not waiting for non-blocking scans
### 5. Batch Processing
Group multiple requests if possible:
```python
# Instead of:
for prompt in prompts:
    response = await client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}]
    )
# Firewall latency: 90ms per request

# Do:
responses = await client.batch(
    [
        {"model": "gpt-4", "messages": [{"role": "user", "content": prompt}]}
        for prompt in prompts
    ]
)
# Firewall latency: 90ms for the entire batch
```

Savings: 80-95% when batching 10+ requests
### 6. Adjust Timeout Values
Balance between correctness and speed:
```yaml
firewall:
  stages:
    - name: "truth-scanner"
      config:
        timeout_ms: 2000  # Don't wait forever

    - name: "injection-detector"
      config:
        timeout_ms: 1000  # Quick, or fail open
```

Default: stages time out after 5 seconds and fall through
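The fail-open behavior is straightforward to reason about: if a scan does not finish within its budget, the request proceeds (or is blocked, if the stage is configured to fail closed). A minimal sketch of the idea (hypothetical, not the Firewall's actual implementation):

```python
import concurrent.futures

def run_stage(scan, payload, timeout_ms, fail_open=True):
    """Run a scan with a time budget; on timeout, fail open (allow)
    or fail closed (block)."""
    with concurrent.futures.ThreadPoolExecutor(max_workers=1) as pool:
        future = pool.submit(scan, payload)
        try:
            return future.result(timeout=timeout_ms / 1000)
        except concurrent.futures.TimeoutError:
            return "allow" if fail_open else "block"
```

A fast scan returns its own verdict; a scan that exceeds `timeout_ms` falls through to the configured default, which is why a tight timeout on a fail-open stage trades safety for latency.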
### 7. Use Regional Endpoints

Connect to the geographically nearest endpoint:
```python
# Before: Default US endpoint
client = TruthVouchClient(api_key="tvk_...", region="us")

# After: Use EU endpoint if in EU
client = TruthVouchClient(api_key="tvk_...", region="eu")

# Savings: 30-80ms from reduced network latency
```

## Configuration for Speed
### High-Throughput Configuration

For e-commerce, support chatbots, and other high-volume systems:
```yaml
firewall:
  version: "2.0"
  pipeline:
    enabled: true

  # Only essential stages
  stages:
    - name: "rate-limiter"
      enabled: true
      config:
        requests_per_minute: 10000
        burst_allowance: 1000

    - name: "input-pii-scanner"
      enabled: true
      config:
        confidence_threshold: 0.95  # Only very sure
        entity_types: ["credit_card"]  # Only critical

    - name: "injection-detector"
      enabled: false  # Skip for trusted internal API

    - name: "output-pii-scanner"
      enabled: true
      config:
        async: true  # Log only, don't block

    - name: "content-safety"
      enabled: false  # Skip if not critical

  cache:
    enabled: true
    backend: "redis"
    ttl_seconds: 3600

  async:
    enabled: true
    timeout_ms: 3000
```

Expected latency: 20-50ms added overhead
### High-Security Configuration

For regulated industries where data protection is critical:
```yaml
firewall:
  version: "2.0"
  pipeline:
    enabled: true

  # All stages enabled, strict thresholds
  stages:
    - name: "rate-limiter"
      enabled: true
      config:
        requests_per_minute: 100
        burst_allowance: 0

    - name: "input-pii-scanner"
      enabled: true
      config:
        confidence_threshold: 0.70  # Catch all
        entity_types: ["all"]

    - name: "injection-detector"
      enabled: true
      config:
        sensitivity: "high"
        timeout_ms: 5000

    - name: "output-pii-scanner"
      enabled: true
      config:
        async: false  # Block on PII

    - name: "truth-scanner"
      enabled: true
      config:
        similarity_threshold: 0.85

    - name: "content-safety"
      enabled: true
      config:
        toxicity_threshold: 0.5

  cache:
    enabled: true
    backend: "redis"

  async:
    enabled: false  # All scans must complete
```

Expected latency: 150-250ms added overhead
## Monitoring Over Time

### Set Up Alerts
Alert if latency exceeds SLA:
```yaml
alerts:
  - name: "firewall_latency_high"
    condition: "p95_latency > 200ms"
    severity: "warning"
    action: "notify_slack"

  - name: "firewall_latency_critical"
    condition: "p99_latency > 500ms"
    severity: "critical"
    action: "page_oncall"
```

### Weekly Review
- Check Reports → Performance
- Compare to previous week
- Identify which stages are slowest
- Adjust configuration if needed
## Capacity Planning
Track these metrics:
- Peak requests/second
- P99 latency
- Error rate
If hitting limits, scale by:
- Adding more Firewall instances (horizontal)
- Increasing resources per instance (vertical)
- Enabling more aggressive caching
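A back-of-envelope sizing check using the throughput numbers from the metrics example earlier (the 30% headroom factor is an assumption for illustration, not a product recommendation):

```python
import math

peak_rps = 650            # peak_requests_per_second from the metrics API
per_instance_rps = 450    # sustained requests_per_second of one instance
headroom = 1.3            # 30% buffer for traffic spikes (assumed)

instances_needed = math.ceil(peak_rps * headroom / per_instance_rps)
print(instances_needed)
```

Here two instances cover the observed peak with buffer; rerun the check whenever peak traffic or per-instance throughput changes.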
## Real-World Latency Examples

### Example 1: SaaS Proxy (Production)
```
Request → Your App (0ms)
  ↓ Network to TruthVouch (30ms)
  ↓ Firewall Scanning:
      - Rate Limiter: 1ms
      - PII Scanner: 8ms
      - Injection Detector: 15ms
      - Content Safety: 0ms (disabled)
      - Truth Scanner: 25ms
      Total: 49ms
  ↓ Network to OpenAI (40ms)
  ↓ OpenAI Response: 2000ms
  ↓ Network back to App (30ms)
  ↓
Total End-to-End: 2149ms
```

Firewall overhead: 49ms (2.3% of total)

### Example 2: Self-Hosted Sidecar
```
Request → Your App (0ms)
  ↓ Local network to Firewall (1ms)
  ↓ Firewall Scanning: 90ms (all stages enabled)
  ↓ Local network to AI Provider (1ms)
  ↓ AI Provider Response: 1500ms
  ↓ Network back to App (1ms)
  ↓
Total End-to-End: 1593ms
```

Firewall overhead: 92ms (5.8% of total, but all local)

### Example 3: Heavily Cached
```
Request → Your App (0ms)
  ↓ Network to TruthVouch (30ms)
  ↓ Firewall Scanning (with cache hits):
      - Rate Limiter: 1ms
      - PII Scanner: 2ms (cached)
      - Injection Detector: 5ms (cached classification)
      - Truth Scanner: 3ms (cached embedding hit!)
      Total: 11ms
  ↓ Network to OpenAI (40ms)
  ↓ OpenAI Response: 1800ms
  ↓ Network back: 30ms
  ↓
Total End-to-End: 1911ms
```

Firewall overhead: 11ms (0.6% of total). The cache was very effective.
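The overhead percentages in the three examples can be verified directly from the numbers above:

```python
# (firewall_overhead_ms, end_to_end_ms) taken from the three examples.
examples = {
    "saas_proxy": (49, 2149),
    "self_hosted_sidecar": (92, 1593),
    "heavily_cached": (11, 1911),
}

for name, (overhead_ms, total_ms) in examples.items():
    print(f"{name}: {overhead_ms / total_ms:.1%} of end-to-end time")
```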