
Embeddings & Vector Search

Text embeddings convert words and sentences into high-dimensional vectors that capture semantic meaning. TruthVouch uses embeddings for neural caching, semantic search, and fact matching.

What Are Embeddings?

Text embeddings represent meaning as vectors (arrays of numbers):

Text: "When was TruthVouch founded?"
Embedding: [0.234, -0.891, 0.123, ..., 0.456] (384 dimensions)
Text: "TruthVouch founding year?"
Embedding: [0.227, -0.878, 0.119, ..., 0.448] (384 dimensions)
Similarity: 0.97 (different wording, same meaning)

How TruthVouch Uses Embeddings

1. Neural Cache (L2)

Semantic search via pgvector:

  • Query converted to embedding
  • Compared against cached truth nugget embeddings
  • Sub-10ms retrieval of semantically similar results
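
The lookup above can be sketched in plain Python. This is a toy in-memory cache for illustration only — TruthVouch's L2 cache uses pgvector, and the class and method names here (`NeuralCache`, `put`, `get`) are assumptions, not the actual API:

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

class NeuralCache:
    """Toy semantic cache: stores (embedding, result) pairs and returns a
    cached result when a query embedding is semantically close enough."""

    def __init__(self, threshold=0.95):
        self.threshold = threshold
        self.entries = []  # list of (embedding, result) pairs

    def put(self, embedding, result):
        self.entries.append((embedding, result))

    def get(self, query_embedding):
        # Linear scan for the closest entry; pgvector replaces this
        # with an indexed similarity search.
        best_score, best_result = 0.0, None
        for embedding, result in self.entries:
            score = cosine_similarity(query_embedding, embedding)
            if score > best_score:
                best_score, best_result = score, result
        return best_result if best_score >= self.threshold else None

cache = NeuralCache()
cache.put([0.9, 0.1, 0.0], "cached founding answer")
hit = cache.get([0.88, 0.12, 0.01])  # near-identical meaning: cache hit
miss = cache.get([0.0, 0.1, 0.99])   # unrelated query: cache miss
```

A rephrased query lands on the same cached entry because its embedding stays close to the original, which is what makes the cache "neural" rather than an exact-string cache.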

2. Hallucination Detection

NLI scoring uses embeddings:

  • Premise converted to embedding
  • Hypothesis converted to embedding
  • Relationship (entailment/neutral/contradiction) determined
  • 94%+ accuracy hallucination detection
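
The final classification step can be sketched as follows. The NLI model itself (which produces the class probabilities) is out of scope here; the function names and the 0.5 contradiction threshold are illustrative assumptions, not TruthVouch internals:

```python
def nli_label(scores):
    """Pick the most probable NLI label from a dict of class probabilities
    (entailment / neutral / contradiction), as an NLI model would emit."""
    return max(scores, key=scores.get)

def is_hallucination(scores, contradiction_threshold=0.5):
    """Flag a claim as a likely hallucination when the model assigns
    high probability to contradiction (threshold is illustrative)."""
    return scores["contradiction"] >= contradiction_threshold

scores = {"entailment": 0.04, "neutral": 0.11, "contradiction": 0.85}
label = nli_label(scores)           # "contradiction"
flagged = is_hallucination(scores)  # True
```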

3. Fact Matching

Finding relevant truth nuggets:

  • User query converted to embedding
  • Matched against truth nugget embeddings
  • Returns top K most relevant facts
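
The top-K step above amounts to ranking nuggets by similarity. A minimal sketch (embeddings are shortened to 2 dimensions for readability; real ones have 384):

```python
import math

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def top_k_matches(query_embedding, nuggets, k=3):
    """Rank truth nuggets by cosine similarity to the query and return
    the k most relevant. `nuggets` is a list of (text, embedding) pairs."""
    scored = [(cosine_similarity(query_embedding, emb), text) for text, emb in nuggets]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [text for _, text in scored[:k]]

nuggets = [
    ("Fact about founding", [1.0, 0.0]),
    ("Fact about pricing", [0.0, 1.0]),
    ("Fact about the founding team", [0.9, 0.2]),
]
results = top_k_matches([1.0, 0.1], nuggets, k=2)
# Both founding-related facts rank above the pricing fact
```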

4. Query Recommendations

Suggest related queries:

  • Current query embedding computed
  • Similar past queries found via vector search
  • Top related queries recommended

Embedding Model

TruthVouch uses all-MiniLM-L6-v2:

  • Dimensions: 384
  • Speed: <1ms per embedding
  • Training data: 215M sentence pairs
  • Size: 22MB (efficient for pgvector)
  • Languages: optimized for English (multilingual MiniLM variants are available for broader coverage)

Vector Search (pgvector)

PostgreSQL pgvector extension enables fast similarity search:

-- Create embeddings table
CREATE TABLE truth_nuggets (
  id uuid PRIMARY KEY,
  text text,
  embedding vector(384),
  client_id uuid
);

-- Create index for fast cosine search
CREATE INDEX ON truth_nuggets USING ivfflat (embedding vector_cosine_ops) WITH (lists = 100);

-- Find similar nuggets (semantic search); $2 is the query embedding.
-- <=> is cosine distance, which matches the vector_cosine_ops index
-- (the <-> operator is L2 distance and would bypass this index).
SELECT text, (embedding <=> $2) AS distance
FROM truth_nuggets
WHERE client_id = $1
ORDER BY embedding <=> $2
LIMIT 10;

Performance: 10,000 embeddings searched in <5ms.

Cosine Similarity

Measure semantic closeness:

Similarity Score = (v1 · v2) / (|v1| × |v2|)
Range: -1.0 (opposite) to 1.0 (identical); sentence embeddings in practice score from ~0.0 (unrelated) to 1.0
Threshold Configuration:
- Similarity >= 0.95: Strong match
- Similarity 0.85-0.94: Moderate match
- Similarity < 0.85: Weak/no match
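
The formula and thresholds above translate directly into code. A minimal sketch (pure Python; the bucket names are illustrative):

```python
import math

def cosine_similarity(v1, v2):
    """Similarity = (v1 . v2) / (|v1| * |v2|)."""
    dot = sum(a * b for a, b in zip(v1, v2))
    norm1 = math.sqrt(sum(a * a for a in v1))
    norm2 = math.sqrt(sum(b * b for b in v2))
    return dot / (norm1 * norm2)

def match_strength(similarity):
    """Bucket a similarity score using the thresholds above."""
    if similarity >= 0.95:
        return "strong"
    if similarity >= 0.85:
        return "moderate"
    return "weak"

cosine_similarity([1.0, 0.0], [1.0, 0.0])  # 1.0: identical direction
match_strength(0.97)                       # "strong"
match_strength(0.62)                       # "weak"
```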

Customization

Fine-tune embeddings for your domain:

# Use custom embedding model
client.embeddings.configure(
    model="custom-domain-embedding",
    dimensions=768,
    retrain_on=["domain_specific_terms"],
)

# Re-embed all existing nuggets
client.embeddings.reembed_all()

Next Steps

  • Neural Cache: Learn caching with embeddings
  • Hallucination Detection: NLI uses embeddings
  • Configuration: Customize embedding model