Neural Cache Architecture
Neural Cache is a two-tier caching system that accelerates fact verification from 100ms+ (LLM calls) to <10ms (cache hits). This guide explains the architecture and how it works.
Two-Tier Architecture
L1 Cache: Redis (Exact Match)
- Purpose: Ultra-fast exact-match lookups
- Latency: <1ms
- Hit Rate: 40-60% (same queries repeat frequently)
```
Query: "When was TruthVouch founded?"
  ↓
Redis Lookup
  ├─ Match Found → Return cached result (0.8ms)
  └─ No Match → Continue to L2
```

Cached Data:
- Exact query → cached response
- Query hash → truth nugget ID
- Verification timestamp
- Confidence score
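The client manages this tier for you; purely as an illustration, the L1 behavior can be sketched with an in-memory dict standing in for Redis (the class and method names here are hypothetical, not the real client API):

```python
import hashlib
import time

class ExactMatchCache:
    """Toy stand-in for the L1 tier: exact query text -> cached result."""

    def __init__(self, ttl_seconds=86400):
        self.ttl = ttl_seconds
        self.store = {}  # query hash -> (result, timestamp)

    def _key(self, query):
        # L1 keys on the exact query text, so hash it verbatim
        return hashlib.sha256(query.encode("utf-8")).hexdigest()

    def get(self, query):
        entry = self.store.get(self._key(query))
        if entry is None:
            return None  # L1 miss: fall through to L2
        result, stored_at = entry
        if time.time() - stored_at > self.ttl:
            return None  # expired entry
        return result

    def put(self, query, result):
        self.store[self._key(query)] = (result, time.time())

cache = ExactMatchCache()
cache.put("When was TruthVouch founded?", "2023")
print(cache.get("When was TruthVouch founded?"))  # exact repeat -> hit
print(cache.get("TruthVouch founding year?"))     # different text -> None
```

Note that any change in wording produces a different hash, which is exactly why the paraphrased query above misses L1 and needs the semantic L2 tier.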
L2 Cache: pgvector (Semantic Search)
- Purpose: Semantic similarity matching
- Latency: 5-10ms
- Hit Rate: 30-50% (paraphrased queries, variations)
```
Query: "TruthVouch founding year?"
  ↓ (L1 miss)
Convert to embedding vector
  ↓
PostgreSQL pgvector search
  ├─ Semantically similar result found → Return (8ms)
  └─ No semantic match → Call LLM (100ms+)
```

Cached Data:
- Query embeddings
- Truth nugget embeddings
- Similarity scores (cosine distance)
- Verification metadata
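Combining the two tiers, the lookup order described above can be sketched in plain Python. This is a toy model, not the TruthVouch client: the dict and list stand in for Redis and pgvector, and `toy_embed` stands in for a real embedding model:

```python
import math

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def lookup(query, l1_store, l2_store, embed, call_llm, threshold=0.85):
    # L1: exact-match lookup (<1ms in Redis)
    if query in l1_store:
        return l1_store[query], "l1"
    # L2: semantic search over cached embeddings (5-10ms in pgvector)
    query_vec = embed(query)
    best_score, best_result = 0.0, None
    for cached_vec, cached_result in l2_store:
        score = cosine_similarity(query_vec, cached_vec)
        if score > best_score:
            best_score, best_result = score, cached_result
    if best_score >= threshold:
        return best_result, "l2"
    # Miss on both tiers: fall back to the LLM (100ms+), then cache the answer
    result = call_llm(query)
    l1_store[query] = result
    l2_store.append((query_vec, result))
    return result, "llm"

# Toy embedding: not a real model, just enough to demo the flow
def toy_embed(text):
    t = text.lower()
    return [1.0 if ("founded" in t or "founding" in t) else 0.0,
            1.0 if "truthvouch" in t else 0.0,
            1.0 if "cost" in t else 0.0]

l1, l2 = {}, []
result, tier = lookup("When was TruthVouch founded?", l1, l2,
                      toy_embed, lambda q: "2023")   # first call -> "llm"
result2, tier2 = lookup("TruthVouch founding year?", l1, l2,
                        toy_embed, lambda q: "2023")  # paraphrase -> "l2"
```

The paraphrased query misses L1 (different text) but its embedding matches the cached one, so the second lookup returns from L2 without touching the LLM.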
Cache Hit Scenarios
Scenario 1: Exact Repeat (L1 Hit)
```
Time 0:00: Query "When was TruthVouch founded?"
  ↓ Cache Miss
  ↓ Call LLM, verify, cache result
  ↓ Response: "2023"
```
```
Time 0:15: Same query repeated
  ↓ L1 Redis Match (0.8ms)
  ↓ Return cached: "2023"
```
Speedup: ~100x faster

Scenario 2: Paraphrased Query (L2 Hit)
Truth: "Founded in 2023"
Query 1: "When was TruthVouch founded?" → Verified, cached with embedding
Query 2: "What is TruthVouch's founding year?" (different wording)
→ L1 miss (exact text differs)
→ L2 hit (semantic embedding matches)
→ Return cached result in 8ms
Speedup: ~10x faster

Scenario 3: No Cache (LLM Call)
```
Query: "What makes TruthVouch unique compared to competitors?"
  ↓ L1 Miss (new query)
  ↓ L2 Miss (no similar embeddings)
  ↓ Call LLM
  ↓ LLM response verified against truth nuggets
  ↓ Results cached in L1 and L2
  ↓ Response returned (120ms)
```

Embedding Generation
Query Embedding
Questions/claims converted to semantic vectors:
```
Query: "When was TruthVouch founded?"
  ↓
Tokenization: ["When", "was", "TruthVouch", "founded", "?"]
  ↓
Embedding Model (all-MiniLM-L6-v2)
  ↓
Vector: [0.234, -0.891, 0.123, ..., 0.456] (384 dimensions)
```

Truth Nugget Embedding
Truth nuggets similarly converted:
```
Truth: "Founded in 2023"
  ↓
Vector: [0.221, -0.876, 0.098, ..., 0.442] (384 dimensions)
```

Similarity Computation
Cosine similarity between vectors:
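Concretely, the score is the standard cosine of the angle between the query vector Q and the truth vector T, which ranges from -1 to 1 (1 = identical direction):

cosine_similarity(Q, T) = (Q · T) / (‖Q‖ ‖T‖)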
```
Query Vector: [0.234, -0.891, 0.123, ..., 0.456]
Truth Vector: [0.221, -0.876, 0.098, ..., 0.442]
  ↓
Cosine Similarity: 0.97
  ↓
Interpretation: 97% similar → Match!
```

Cache Performance
Real-World Numbers
| Scenario | L1 Hit | L2 Hit | LLM Call |
|---|---|---|---|
| Latency | 0.8ms | 8ms | 120ms |
| Speedup | 150x | 15x | 1x |
| Hit Rate | 45% | 35% | 20% |

Expected average latency: 0.45 × 0.8ms + 0.35 × 8ms + 0.20 × 120ms ≈ 27ms
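As a quick check (pure arithmetic, no client needed), the blended latency follows directly from the hit rates and per-tier latencies in the table:

```python
# Weighted average latency across the three outcomes
rates = {"l1": 0.45, "l2": 0.35, "llm": 0.20}
latency_ms = {"l1": 0.8, "l2": 8.0, "llm": 120.0}

avg_ms = sum(rates[tier] * latency_ms[tier] for tier in rates)
print(f"Average response: {avg_ms:.1f}ms")  # roughly 27ms
```

The LLM term dominates even at a 20% call rate, which is why pushing the hit rates up matters more than shaving cache latency.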
Cost Savings
With neural cache:
- Reduced LLM calls by 80%
- Cost per verification: ~95% reduction
- Throughput: 10x more queries per dollar
Cache Invalidation
Caches are automatically invalidated when truth nuggets change:
L1 Invalidation (Exact Match)
```
Truth Nugget Updated
  ↓
Find all queries in L1 Redis matching this nugget
  ↓
Invalidate affected entries
  ↓
Next query: Cache miss, verify against updated truth
```

L2 Invalidation (Semantic)
```
Truth Nugget Embedding Changed (e.g., pricing updated)
  ↓
Recompute embedding vector
  ↓
Update pgvector entry
  ↓
Cosine similarity scores recalculated for future queries
```

Manual Cache Clear
```python
# Clear specific query
client.cache.invalidate_query("When was TruthVouch founded?")

# Clear by truth nugget
client.cache.invalidate_nugget("founding_year")

# Clear all cache
client.cache.clear()

# Clear L1 only (keep L2 for semantic searches)
client.cache.clear_l1()
```

Cache Configuration
TTL (Time-To-Live)
```python
client.cache.configure(
    l1_ttl_minutes=1440,    # L1 cache expires in 24h
    l2_ttl_minutes=10080,   # L2 cache expires in 7 days
    invalidate_on_nugget_change=True,  # Auto-clear on updates
)
```

Size Limits
```python
client.cache.configure(
    l1_max_size_mb=500,     # Redis 500MB max
    l2_max_size_gb=50,      # pgvector 50GB max
    eviction_policy="LRU",  # Least recently used
)
```

Similarity Threshold
```python
client.cache.configure(
    l2_similarity_threshold=0.85,  # Min 85% similarity for L2 hit
    allow_semantic_fallback=True,  # Use semantic if no exact match
)
```

Cache Monitoring
Metrics
```python
stats = client.cache.get_statistics()

print(f"L1 Hit Rate: {stats.l1_hit_rate}%")
print(f"L2 Hit Rate: {stats.l2_hit_rate}%")
print(f"LLM Call Rate: {stats.llm_call_rate}%")
print(f"Avg Latency: {stats.avg_latency_ms}ms")
print(f"Cost Savings: {stats.cost_savings_percent}%")
```

Dashboard
- Navigate to Settings → Cache Performance
- View:
  - Hit rate trends (hourly/daily)
  - Top cached queries
  - Most-searched truth nuggets
  - Cost savings over time
Best Practices
1. Optimize Truth Nuggets
Specific, consistent nuggets improve cache performance:
```python
# Bad: varied wording
nugget1 = "TruthVouch costs 349 dollars per month"
nugget2 = "TruthVouch is priced at $349/month"
nugget3 = "Monthly subscription is $349"

# Good: consistent format
nugget = "Starter Plan: $349/month"
```

2. Monitor Cache Health
Regular review of cache statistics:
```python
# Weekly cache report
report = client.cache.get_weekly_report()
if report.l1_hit_rate < 0.3:
    print("Low L1 hit rate — consider reviewing query patterns")
```

3. Warm the Cache
Pre-populate cache with frequent queries:
```python
# Load common questions on startup
client.cache.warm_with_common_queries([
    "What is TruthVouch?",
    "How much does TruthVouch cost?",
    "What AI models does TruthVouch monitor?",
])
```

4. Cache Logs for Compliance
Maintain cache logs for audit trails:
```python
# Ensure cache lookups are logged
client.cache.configure(
    log_all_lookups=True,
    log_retention_days=2555,  # 7 years for compliance
)
```

Advanced: Custom Embeddings
For specialized use cases, use custom embedding models:
```python
client.cache.configure_embeddings(
    model="custom-domain-model",  # Your fine-tuned model
    dimensions=768,
    update_existing=True,
)
```

Next Steps
- Performance: Monitor cache hit rates and latency
- Configuration: Tune TTL and similarity thresholds
- Monitoring: Set up alerts for cache degradation
- Scaling: Learn about cache scaling for enterprise