Neural Cache Architecture

Neural Cache is a two-tier caching system that accelerates fact verification from 100ms+ (LLM calls) to <10ms (cache hits). This guide explains the architecture and how it works.

Two-Tier Architecture

L1 Cache: Redis (Exact Match)

Purpose: Ultra-fast exact-match lookups
Latency: <1ms
Hit Rate: 40-60% (the same queries repeat frequently)

Query: "When was TruthVouch founded?"
Redis Lookup
├─ Match Found → Return cached result (0.8ms)
└─ No Match → Continue to L2

Cached Data:

  • Exact query → cached response
  • Query hash → truth nugget ID
  • Verification timestamp
  • Confidence score

L2 Cache: PostgreSQL pgvector (Semantic Match)

Purpose: Semantic similarity matching
Latency: 5-10ms
Hit Rate: 30-50% (paraphrased queries and variations)

Query: "TruthVouch founding year?"
↓ (L1 miss)
Convert to embedding vector
PostgreSQL pgvector search
├─ Semantically similar result found → Return (8ms)
└─ No semantic match → Call LLM (100ms+)

Cached Data:

  • Query embeddings
  • Truth nugget embeddings
  • Similarity scores (cosine distance)
  • Verification metadata
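A brute-force sketch of the L2 lookup, with a plain Python list standing in for the pgvector index (pgvector would do this with an index scan in SQL; the helper names are illustrative, and the 0.85 threshold mirrors the configuration shown later):

```python
import numpy as np

# (nugget_id, embedding) pairs standing in for the pgvector table.
l2_index: list[tuple[str, np.ndarray]] = []

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    # Cosine similarity: dot product over the product of the norms.
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def l2_search(query_vec: np.ndarray, threshold: float = 0.85):
    # Return the best match at or above the similarity threshold, if any.
    best_id, best_score = None, threshold
    for nugget_id, vec in l2_index:
        score = cosine(query_vec, vec)
        if score >= best_score:
            best_id, best_score = nugget_id, score
    return best_id, best_score

# Toy 3-dimensional vectors; real entries have 384 components.
l2_index.append(("founding_year", np.array([0.9, 0.1, 0.4])))
hit, score = l2_search(np.array([0.88, 0.12, 0.38]))
print(hit)  # founding_year
```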

Cache Hit Scenarios

Scenario 1: Exact Repeat (L1 Hit)

Time 0:00: Query "When was TruthVouch founded?"
↓ Cache Miss
↓ Call LLM, verify, cache result
↓ Response: "2023"
Time 0:15: Same query repeated
↓ L1 Redis Match (0.8ms)
↓ Return cached: "2023"
Speedup: 100x faster

Scenario 2: Paraphrased Query (L2 Hit)

Truth: "Founded in 2023"
Query 1: "When was TruthVouch founded?"
→ Verified, cached with embedding
Query 2: "What is TruthVouch's founding year?" (different wording)
→ L1 miss (exact text differs)
→ L2 Hit (semantic embedding matches)
→ Return cached result in 8ms
Speedup: 10x faster

Scenario 3: No Cache (LLM Call)

Query: "What makes TruthVouch unique compared to competitors?"
↓ L1 Miss (new query)
↓ L2 Miss (no similar embeddings)
↓ Call LLM
↓ LLM response verified against truth nuggets
↓ Results cached in L1 and L2
↓ Response returned (120ms)
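The three scenarios reduce to a single lookup path: try L1, then L2, then fall back to the LLM and populate both tiers. A minimal sketch, where the dict, the token-overlap score, and the `verify_with_llm` stub are hypothetical stand-ins for the real tiers:

```python
# Stand-ins for the real tiers: an exact-match dict (L1) and a
# similarity scan over cached entries (L2).
l1 = {}
l2 = []  # (query_text, answer) pairs; the real L2 compares embeddings

def similar(a: str, b: str) -> float:
    # Crude token-overlap score standing in for cosine similarity.
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / max(len(ta | tb), 1)

def verify_with_llm(query: str) -> str:
    # Placeholder for the slow (100ms+) LLM verification call.
    return "llm-answer"

def lookup(query: str, threshold: float = 0.6) -> tuple[str, str]:
    if query in l1:                        # Scenario 1: exact repeat
        return "L1", l1[query]
    for cached_q, answer in l2:            # Scenario 2: paraphrase
        if similar(query, cached_q) >= threshold:
            return "L2", answer
    answer = verify_with_llm(query)        # Scenario 3: full miss
    l1[query] = answer                     # populate both tiers
    l2.append((query, answer))
    return "LLM", answer

print(lookup("when was truthvouch founded?"))  # first call takes the LLM path
print(lookup("when was truthvouch founded?"))  # exact repeat hits L1
```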

Embedding Generation

Query Embedding

Questions/claims converted to semantic vectors:

Query: "When was TruthVouch founded?"
Tokenization: ["When", "was", "TruthVouch", "founded", "?"]
Embedding Model (all-MiniLM-L6-v2)
Vector: [0.234, -0.891, 0.123, ..., 0.456] (384 dimensions)

Truth Nugget Embedding

Truth nuggets similarly converted:

Truth: "Founded in 2023"
Vector: [0.221, -0.876, 0.098, ..., 0.442] (384 dimensions)
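A toy stand-in for the embedding step, assuming only what the section states: vectors are 384-dimensional, and (as with all-MiniLM-L6-v2 output) unit-normalized. The hash-seeded projection below is purely illustrative; a real system would call the model here.

```python
import hashlib
import numpy as np

def embed(text: str, dim: int = 384) -> np.ndarray:
    # Deterministic pseudo-random vector seeded by the text; a real
    # system would run the all-MiniLM-L6-v2 model here instead.
    seed = int(hashlib.sha256(text.encode("utf-8")).hexdigest()[:8], 16)
    v = np.random.default_rng(seed).standard_normal(dim)
    return v / np.linalg.norm(v)  # unit-normalize, like the model output

q = embed("When was TruthVouch founded?")
print(q.shape)  # (384,)
```

Determinism matters: the same text must always produce the same vector, or cached embeddings would never match.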

Similarity Computation

Cosine similarity between vectors:

Query Vector: [0.234, -0.891, 0.123, ..., 0.456]
Truth Vector: [0.221, -0.876, 0.098, ..., 0.442]
Cosine Similarity: 0.97
Interpretation: 97% similar → Match!
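The similarity computation itself is a one-liner; the short vectors below are the truncated examples from above (real vectors have 384 components, so these toy values won't reproduce the exact 0.97 score):

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    # Dot product of the vectors divided by the product of their norms.
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

query = np.array([0.234, -0.891, 0.123])
truth = np.array([0.221, -0.876, 0.098])
print(cosine_similarity(query, truth))
```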

Cache Performance

Real-World Numbers

Scenario     L1 Hit   L2 Hit   LLM Call
Latency      0.8ms    8ms      120ms
Speedup      150x     15x      1x
Hit Rate     45%      35%      20%

Avg Response: 0.45 × 0.8ms + 0.35 × 8ms + 0.2 × 120ms ≈ 27ms
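The weighted average can be checked directly; with these hit rates it works out to roughly 27ms:

```python
# Hit rates and per-tier latencies from the table above.
rates = {"l1": 0.45, "l2": 0.35, "llm": 0.20}
latency_ms = {"l1": 0.8, "l2": 8.0, "llm": 120.0}

# Expected latency = sum over tiers of (hit rate x tier latency).
expected = sum(rates[k] * latency_ms[k] for k in rates)
print(round(expected, 2))  # 27.16
```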

Cost Savings

With neural cache:

  • Reduced LLM calls by 80%
  • Cost per verification: ~95% reduction
  • Throughput: 10x more queries per dollar

Cache Invalidation

Caches are automatically invalidated when truth nuggets change:

L1 Invalidation (Exact Match)

Truth Nugget Updated
Find all queries in L1 Redis matching this nugget
Invalidate affected entries
Next query: Cache miss, verify against updated truth
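The "find all queries matching this nugget" step implies a reverse index from nugget ID back to cached query keys. A dict-based sketch (the names here are illustrative, not the actual implementation):

```python
# Forward cache plus a reverse index from nugget ID to the query keys
# that were answered from it, so updates can target exact entries.
l1_cache: dict[str, str] = {}
nugget_to_keys: dict[str, set[str]] = {}

def cache_put(query: str, nugget_id: str, answer: str) -> None:
    l1_cache[query] = answer
    nugget_to_keys.setdefault(nugget_id, set()).add(query)

def invalidate_nugget(nugget_id: str) -> int:
    # Drop every L1 entry that was derived from this nugget.
    keys = nugget_to_keys.pop(nugget_id, set())
    for key in keys:
        l1_cache.pop(key, None)
    return len(keys)

cache_put("When was TruthVouch founded?", "founding_year", "2023")
cache_put("TruthVouch founding year?", "founding_year", "2023")
print(invalidate_nugget("founding_year"))  # 2
print(len(l1_cache))                       # 0
```

The next query for either wording then misses L1 and is re-verified against the updated truth.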

L2 Invalidation (Semantic)

Truth Nugget Embedding Changed (e.g., pricing updated)
Recompute embedding vector
Update pgvector entry
Cosine similarity scores recalculated for future queries

Manual Cache Clear

# Clear specific query
client.cache.invalidate_query("When was TruthVouch founded?")
# Clear by truth nugget
client.cache.invalidate_nugget("founding_year")
# Clear all cache
client.cache.clear()
# Clear L1 only (keep L2 for semantic searches)
client.cache.clear_l1()

Cache Configuration

TTL (Time-To-Live)

client.cache.configure(
    l1_ttl_minutes=1440,              # L1 cache expires in 24h
    l2_ttl_minutes=10080,             # L2 cache expires in 7 days
    invalidate_on_nugget_change=True  # Auto-clear on updates
)

Size Limits

client.cache.configure(
    l1_max_size_mb=500,    # Redis 500MB max
    l2_max_size_gb=50,     # pgvector 50GB max
    eviction_policy="LRU"  # Least recently used
)

Similarity Threshold

client.cache.configure(
    l2_similarity_threshold=0.85,  # Min 85% similarity for L2 hit
    allow_semantic_fallback=True   # Use semantic if no exact match
)

Cache Monitoring

Metrics

stats = client.cache.get_statistics()
print(f"L1 Hit Rate: {stats.l1_hit_rate}%")
print(f"L2 Hit Rate: {stats.l2_hit_rate}%")
print(f"LLM Call Rate: {stats.llm_call_rate}%")
print(f"Avg Latency: {stats.avg_latency_ms}ms")
print(f"Cost Savings: {stats.cost_savings_percent}%")

Dashboard

  1. Navigate to Settings → Cache Performance
  2. View:
    • Hit rate trends (hourly/daily)
    • Top cached queries
    • Most-searched truth nuggets
    • Cost savings over time

Best Practices

1. Optimize Truth Nuggets

Specific, consistent nuggets improve cache performance:

# Bad: varied wording
nugget1 = "TruthVouch costs 349 dollars per month"
nugget2 = "TruthVouch is priced at $349/month"
nugget3 = "Monthly subscription is $349"
# Good: consistent format
nugget = "Starter Plan: $349/month"

2. Monitor Cache Health

Regular review of cache statistics:

# Weekly cache report
report = client.cache.get_weekly_report()
if report.l1_hit_rate < 0.3:
    print("Low L1 hit rate, consider reviewing query patterns")

3. Warm the Cache

Pre-populate cache with frequent queries:

# Load common questions on startup
client.cache.warm_with_common_queries([
    "What is TruthVouch?",
    "How much does TruthVouch cost?",
    "What AI models does TruthVouch monitor?",
])

4. Cache Logs for Compliance

Maintain cache logs for audit trails:

# Ensure cache lookups are logged
client.cache.configure(
    log_all_lookups=True,
    log_retention_days=2555  # 7 years for compliance
)

Advanced: Custom Embeddings

For specialized use cases, use custom embedding models:

client.cache.configure_embeddings(
    model="custom-domain-model",  # Your fine-tuned model
    dimensions=768,
    update_existing=True
)

Next Steps

  • Performance: Monitor cache hit rates and latency
  • Configuration: Tune TTL and similarity thresholds
  • Monitoring: Set up alerts for cache degradation
  • Scaling: Learn about cache scaling for enterprise