Neural Cache Architecture
Neural Cache is a two-tier caching system that accelerates fact verification from 100ms+ (LLM calls) to <10ms (cache hits). This guide explains the architecture and how it works.
Two-Tier Architecture
L1 Cache: Redis (Exact Match)
- Purpose: Ultra-fast exact-match lookups
- Latency: <1ms
- Hit Rate: 40-60% (same queries repeat frequently)
```
Query: "When was TruthVouch founded?"
  ↓
Redis Lookup
  ├─ Match Found → Return cached result (0.8ms)
  └─ No Match → Continue to L2
```

Cached Data:
- Exact query → cached response
- Query hash → truth nugget ID
- Verification timestamp
- Confidence score
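The client manages this tier for you; purely as an illustration, the L1 behavior can be sketched with an in-memory dict standing in for Redis (the class and method names here are hypothetical, not the real client API):

```python
import hashlib
import time

class ExactMatchCache:
    """Toy stand-in for the L1 tier: exact query text -> cached result."""

    def __init__(self, ttl_seconds=86400):
        self.ttl = ttl_seconds
        self.store = {}  # query hash -> (result, timestamp)

    def _key(self, query):
        # L1 keys on the exact query text, so hash it verbatim
        return hashlib.sha256(query.encode("utf-8")).hexdigest()

    def get(self, query):
        entry = self.store.get(self._key(query))
        if entry is None:
            return None  # L1 miss: fall through to L2
        result, stored_at = entry
        if time.time() - stored_at > self.ttl:
            return None  # expired entry
        return result

    def put(self, query, result):
        self.store[self._key(query)] = (result, time.time())

cache = ExactMatchCache()
cache.put("When was TruthVouch founded?", "2023")
print(cache.get("When was TruthVouch founded?"))  # exact repeat -> hit
print(cache.get("TruthVouch founding year?"))     # different text -> None
```

Note that any change in wording produces a different hash, which is exactly why the paraphrased query above misses L1 and needs the semantic L2 tier.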
L2 Cache: pgvector (Semantic Search)
- Purpose: Semantic similarity matching
- Latency: 5-10ms
- Hit Rate: 30-50% (paraphrased queries, variations)
```
Query: "TruthVouch founding year?"
  ↓ (L1 miss)
Convert to embedding vector
  ↓
PostgreSQL pgvector search
  ├─ Semantically similar result found → Return (8ms)
  └─ No semantic match → Call LLM (100ms+)
```

Cached Data:
- Query embeddings
- Truth nugget embeddings
- Similarity scores (cosine distance)
- Verification metadata
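Combining the two tiers, the lookup order described above can be sketched in plain Python. This is a toy model, not the TruthVouch client: the dict and list stand in for Redis and pgvector, and `toy_embed` stands in for a real embedding model:

```python
import math

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def lookup(query, l1_store, l2_store, embed, call_llm, threshold=0.85):
    # L1: exact-match lookup (<1ms in Redis)
    if query in l1_store:
        return l1_store[query], "l1"
    # L2: semantic search over cached embeddings (5-10ms in pgvector)
    query_vec = embed(query)
    best_score, best_result = 0.0, None
    for cached_vec, cached_result in l2_store:
        score = cosine_similarity(query_vec, cached_vec)
        if score > best_score:
            best_score, best_result = score, cached_result
    if best_score >= threshold:
        return best_result, "l2"
    # Miss on both tiers: fall back to the LLM (100ms+), then cache the answer
    result = call_llm(query)
    l1_store[query] = result
    l2_store.append((query_vec, result))
    return result, "llm"

# Toy embedding: not a real model, just enough to demo the flow
def toy_embed(text):
    t = text.lower()
    return [1.0 if ("founded" in t or "founding" in t) else 0.0,
            1.0 if "truthvouch" in t else 0.0,
            1.0 if "cost" in t else 0.0]

l1, l2 = {}, []
result, tier = lookup("When was TruthVouch founded?", l1, l2,
                      toy_embed, lambda q: "2023")   # first call -> "llm"
result2, tier2 = lookup("TruthVouch founding year?", l1, l2,
                        toy_embed, lambda q: "2023")  # paraphrase -> "l2"
```

The paraphrased query misses L1 (different text) but its embedding matches the cached one, so the second lookup returns from L2 without touching the LLM.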
Cache Hit Scenarios
Scenario 1: Exact Repeat (L1 Hit)
```
Time 0:00: Query "When was TruthVouch founded?"
  ↓ Cache Miss
  ↓ Call LLM, verify, cache result
  ↓ Response: "2023"
```
```
Time 0:15: Same query repeated
  ↓ L1 Redis Match (0.8ms)
  ↓ Return cached: "2023"
```
Speedup: ~100x faster

Scenario 2: Paraphrased Query (L2 Hit)
Truth: "Founded in 2023"
Query 1: "When was TruthVouch founded?" → Verified, cached with embedding
Query 2: "What is TruthVouch's founding year?" (different wording)
→ L1 miss (exact text differs)
→ L2 hit (semantic embedding matches)
→ Return cached result in 8ms
Speedup: ~10x faster

Scenario 3: No Cache (LLM Call)
```
Query: "What makes TruthVouch unique compared to competitors?"
  ↓ L1 Miss (new query)
  ↓ L2 Miss (no similar embeddings)
  ↓ Call LLM
  ↓ LLM response verified against truth nuggets
  ↓ Results cached in L1 and L2
  ↓ Response returned (120ms)
```

Embedding Generation
Query Embedding
Questions/claims converted to semantic vectors:
```
Query: "When was TruthVouch founded?"
  ↓
Tokenization: ["When", "was", "TruthVouch", "founded", "?"]
  ↓
Embedding Model (all-MiniLM-L6-v2)
  ↓
Vector: [0.234, -0.891, 0.123, ..., 0.456] (384 dimensions)
```

Truth Nugget Embedding
Truth nuggets similarly converted:
```
Truth: "Founded in 2023"
  ↓
Vector: [0.221, -0.876, 0.098, ..., 0.442] (384 dimensions)
```

Similarity Computation
Cosine similarity between vectors:
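Concretely, the score is the standard cosine of the angle between the query vector Q and the truth vector T, which ranges from -1 to 1 (1 = identical direction):

cosine_similarity(Q, T) = (Q · T) / (‖Q‖ ‖T‖)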
```
Query Vector: [0.234, -0.891, 0.123, ..., 0.456]
Truth Vector: [0.221, -0.876, 0.098, ..., 0.442]
  ↓
Cosine Similarity: 0.97
  ↓
Interpretation: 97% similar → Match!
```

Cache Performance
Real-World Numbers
| Scenario | L1 Hit | L2 Hit | LLM Call |
|---|---|---|---|
| Latency | 0.8ms | 8ms | 120ms |
| Speedup | 150x | 15x | 1x |
| Hit Rate | 45% | 35% | 20% |

Expected average latency: 0.45 × 0.8ms + 0.35 × 8ms + 0.20 × 120ms ≈ 27ms
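As a quick check (pure arithmetic, no client needed), the blended latency follows directly from the hit rates and per-tier latencies in the table:

```python
# Weighted average latency across the three outcomes
rates = {"l1": 0.45, "l2": 0.35, "llm": 0.20}
latency_ms = {"l1": 0.8, "l2": 8.0, "llm": 120.0}

avg_ms = sum(rates[tier] * latency_ms[tier] for tier in rates)
print(f"Average response: {avg_ms:.1f}ms")  # roughly 27ms
```

The LLM term dominates even at a 20% call rate, which is why pushing the hit rates up matters more than shaving cache latency.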
Cost Savings
With neural cache:
- Reduced LLM calls by 80%
- Cost per verification: ~95% reduction
- Throughput: 10x more queries per dollar
Cache Invalidation
Caches are automatically invalidated when truth nuggets change:
L1 Invalidation (Exact Match)
```
Truth Nugget Updated
  ↓
Find all queries in L1 Redis matching this nugget
  ↓
Invalidate affected entries
  ↓
Next query: Cache miss, verify against updated truth
```

L2 Invalidation (Semantic)
```
Truth Nugget Embedding Changed (e.g., pricing updated)
  ↓
Recompute embedding vector
  ↓
Update pgvector entry
  ↓
Cosine similarity scores recalculated for future queries
```

Manual Cache Clear
```python
# Clear specific query
client.cache.invalidate_query("When was TruthVouch founded?")

# Clear by truth nugget
client.cache.invalidate_nugget("founding_year")

# Clear all cache
client.cache.clear()

# Clear L1 only (keep L2 for semantic searches)
client.cache.clear_l1()
```

Cache Configuration
TTL (Time-To-Live)
```python
client.cache.configure(
    l1_ttl_minutes=1440,    # L1 cache expires in 24h
    l2_ttl_minutes=10080,   # L2 cache expires in 7 days
    invalidate_on_nugget_change=True,  # Auto-clear on updates
)
```

Size Limits
```python
client.cache.configure(
    l1_max_size_mb=500,     # Redis 500MB max
    l2_max_size_gb=50,      # pgvector 50GB max
    eviction_policy="LRU",  # Least recently used
)
```

Similarity Threshold
```python
client.cache.configure(
    l2_similarity_threshold=0.85,  # Min 85% similarity for L2 hit
    allow_semantic_fallback=True,  # Use semantic if no exact match
)
```

Cache Monitoring
Metrics
```python
stats = client.cache.get_statistics()

print(f"L1 Hit Rate: {stats.l1_hit_rate}%")
print(f"L2 Hit Rate: {stats.l2_hit_rate}%")
print(f"LLM Call Rate: {stats.llm_call_rate}%")
print(f"Avg Latency: {stats.avg_latency_ms}ms")
print(f"Cost Savings: {stats.cost_savings_percent}%")
```

Dashboard
- Navigate to Settings → Cache Performance
- View:
  - Hit rate trends (hourly/daily)
  - Top cached queries
  - Most-searched truth nuggets
  - Cost savings over time
Best Practices
1. Optimize Truth Nuggets
Specific, consistent nuggets improve cache performance:
```python
# Bad: varied wording
nugget1 = "TruthVouch costs 349 dollars per month"
nugget2 = "TruthVouch is priced at $349/month"
nugget3 = "Monthly subscription is $349"

# Good: consistent format
nugget = "Starter Plan: $349/month"
```

2. Monitor Cache Health
Regular review of cache statistics:
```python
# Weekly cache report
report = client.cache.get_weekly_report()
if report.l1_hit_rate < 0.3:
    print("Low L1 hit rate — consider reviewing query patterns")
```

3. Warm the Cache
Pre-populate cache with frequent queries:
```python
# Load common questions on startup
client.cache.warm_with_common_queries([
    "What is TruthVouch?",
    "How much does TruthVouch cost?",
    "What AI models does TruthVouch monitor?",
])
```

4. Cache Logs for Compliance
Maintain cache logs for audit trails:
```python
# Ensure cache lookups are logged
client.cache.configure(
    log_all_lookups=True,
    log_retention_days=2555,  # 7 years for compliance
)
```

Advanced: Custom Embeddings
For specialized use cases, use custom embedding models:
```python
client.cache.configure_embeddings(
    model="custom-domain-model",  # Your fine-tuned model
    dimensions=768,
    update_existing=True,
)
```

Next Steps
- Performance: Monitor cache hit rates and latency
- Configuration: Tune TTL and similarity thresholds
- Monitoring: Set up alerts for cache degradation
- Scaling: Learn about cache scaling for enterprise