
Vendor Scoring

Vendor Scoring automatically evaluates LLM providers against standard criteria: security & compliance, features & capability, cost efficiency, support & reliability, and governance & privacy. Use pre-built scorecards to compare vendors, or customize the weights for your unique requirements.


Scoring Dimensions

TruthVouch automatically evaluates vendors across five core dimensions:

1. Security & Compliance (30%)

What’s evaluated:

  • Data encryption (at rest and in transit)
  • Access controls and authentication
  • Audit logging and data residency options
  • Security certifications (SOC 2, ISO 27001)
  • Compliance frameworks (HIPAA, GDPR, SOX)
  • Incident response and breach notification
  • Third-party penetration testing

Scoring:

  • 90+ = Comprehensive controls; multiple certifications; regular audits
  • 70-89 = Strong baseline; major certifications; annual audits
  • 50-69 = Adequate controls; basic certifications; some gaps
  • <50 = Weak controls; no certifications; significant gaps

Examples:

  • OpenAI: 82 (SOC 2, encryption, good practices, but US-only data residency)
  • Anthropic: 88 (SOC 2, HIPAA-ready, stronger privacy controls)
  • Self-hosted: Varies (depends on your implementation)

2. Features & Capability (25%)

What’s evaluated:

  • Model quality (accuracy on benchmarks like MMLU, HumanEval)
  • Feature richness (vision, embeddings, function calling, etc.)
  • Model availability (latest models, support for deprecated models)
  • Customization options (fine-tuning, RAG-ready, plugins)
  • API completeness (chat, completions, embeddings, vision)
  • Latency and throughput (p50, p95, p99 latencies)

Scoring:

  • 90+ = Best-in-class performance; all major features; low latency
  • 70-89 = Strong performance; most features; acceptable latency
  • 50-69 = Adequate performance; basic features; higher latency
  • <50 = Limited capability; missing key features; slow

Examples:

  • GPT-4: 92 (excellent accuracy, comprehensive features, 2-5s latency)
  • Claude Opus: 90 (excellent reasoning, good features, 2-8s latency)
  • Gemini: 85 (good accuracy, growing features, 1-3s latency)
  • Llama 2 (self-hosted): 75 (good for open-source, slower, requires tuning)

3. Cost Efficiency (20%)

What’s evaluated:

  • Base API cost per 1M tokens
  • Volume discounts available
  • Infrastructure costs (if self-hosted)
  • Total cost of ownership (training, support, tooling)
  • Pricing transparency
  • Cost predictability

Scoring:

  • 90+ = <$1 per 1M tokens (self-hosted or very efficient)
  • 70-89 = $1-$5 per 1M tokens (good value)
  • 50-69 = $5-$15 per 1M tokens (premium pricing)
  • <50 = >$15 per 1M tokens (expensive)

Examples:

  • Self-hosted Llama 2: 95 (infrastructure cost only, ~$0.10/1M)
  • Gemini Pro: 88 ($0.075 input / $0.30 output per 1M)
  • Claude Opus: 70 ($15 per 1M tokens, premium)
  • GPT-4: 65 ($30 per 1M tokens, very expensive)
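
The base-price thresholds above can be expressed as a simple lookup. The sketch below is illustrative only (Python, with a hypothetical helper name); the published cost score also weighs volume discounts, total cost of ownership, pricing transparency, and predictability, so a vendor's actual score may land outside the band its list price alone suggests.

```python
# Illustrative only: the published cost score also weighs volume discounts,
# TCO, pricing transparency, and predictability, not just the list price.
def cost_band(price_per_1m_tokens: float) -> str:
    """Map a blended price per 1M tokens onto the base-price bands above."""
    if price_per_1m_tokens < 1:
        return "90+ (self-hosted or very efficient)"
    if price_per_1m_tokens <= 5:
        return "70-89 (good value)"
    if price_per_1m_tokens <= 15:
        return "50-69 (premium pricing)"
    return "<50 (expensive)"

print(cost_band(0.10))  # 90+ (self-hosted or very efficient)
print(cost_band(3.00))  # 70-89 (good value)
```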

4. Support & Reliability (15%)

What’s evaluated:

  • Response time (SLA for support requests)
  • Technical support quality (dedicated vs. community)
  • Uptime guarantees (SLA %)
  • Documentation quality
  • Community size and maturity
  • Vendor stability (financial health, customer base)

Scoring:

  • 90+ = <1hr response, 99.95% SLA, dedicated support, excellent docs
  • 70-89 = <4hr response, 99.9% SLA, good support, good docs
  • 50-69 = <24hr response, 99% SLA, basic support, adequate docs
  • <50 = No SLA, community support only, limited docs

Examples:

  • OpenAI: 78 (good docs, <1hr enterprise support, strong community)
  • Anthropic: 82 (excellent docs, growing support, responsive team)
  • Google Cloud AI: 80 (enterprise SLA available, good support)
  • Open-source: 50 (community support only, no SLA)

5. Governance & Privacy (10%)

What’s evaluated:

  • Data usage and retention policies
  • Privacy commitments (e.g., no training on your data)
  • Terms of service clarity
  • Contract terms and flexibility
  • Transparency on algorithms
  • Data residency options

Scoring:

  • 90+ = Transparent; no data training; flexible contracts; multi-region
  • 70-89 = Good terms; default no training; reasonable contracts
  • 50-69 = Adequate terms; clarification needed; restrictive contracts
  • <50 = Opaque terms; unclear data usage; inflexible

Examples:

  • OpenAI: 75 (improved privacy terms, but US data residency default)
  • Anthropic: 85 (strong privacy commitments, clearer terms)
  • Google: 80 (regional options, GDPR-compliant)

Overall Score Calculation

Weighted Total = (Security × 0.30) + (Features × 0.25) + (Cost × 0.20) + (Support × 0.15) + (Governance × 0.10)
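
As a rough illustration of the formula, the Python sketch below (identifiers are illustrative, not part of a TruthVouch API) reproduces the weighted totals used in the comparison table that follows; the table shows them to one decimal place.

```python
# Default dimension weights, expressed as integer percentages to avoid
# floating-point drift when summing.
DEFAULT_WEIGHTS_PCT = {
    "security": 30,
    "features": 25,
    "cost": 20,
    "support": 15,
    "governance": 10,
}

def weighted_total(scores: dict, weights_pct: dict = DEFAULT_WEIGHTS_PCT) -> float:
    """Weighted Total = sum of (dimension score x dimension weight)."""
    return sum(scores[dim] * pct for dim, pct in weights_pct.items()) / 100

# Sub-scores taken from the dimension examples on this page
vendors = {
    "OpenAI GPT-4":       {"security": 82, "features": 92, "cost": 65, "support": 78, "governance": 75},
    "Anthropic Claude 3": {"security": 88, "features": 90, "cost": 70, "support": 82, "governance": 85},
    "Google Gemini Pro":  {"security": 80, "features": 85, "cost": 88, "support": 80, "governance": 80},
}

for name, scores in vendors.items():
    print(f"{name}: {weighted_total(scores):.2f}")
# OpenAI GPT-4: 79.80
# Anthropic Claude 3: 83.70
# Google Gemini Pro: 82.85
```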

Example: Comparing Top 3 Vendors

Vendor             | Security | Features | Cost | Support | Governance | Weighted Total
OpenAI GPT-4       | 82       | 92       | 65   | 78      | 75         | 79.8
Anthropic Claude 3 | 88       | 90       | 70   | 82      | 85         | 83.7
Google Gemini Pro  | 80       | 85       | 88   | 80      | 80         | 82.8

Interpretation:

  • Claude 3 is strongest overall (83.7), with balanced scores across all dimensions
  • OpenAI strong on features (92) but weak on cost (65)
  • Google best on cost (88) but weaker on security

Vendor Scorecards

OpenAI

  • Overall Score: 79.8
  • Strengths: Best-in-class model quality (GPT-4), comprehensive API
  • Weaknesses: Expensive, US data residency, less transparent on privacy
  • Best for: Organizations wanting state-of-the-art performance
  • Not ideal for: Cost-sensitive or privacy-first orgs

Anthropic Claude

  • Overall Score: 83.7
  • Strengths: Strong safety focus, good reasoning, transparent privacy
  • Weaknesses: Slightly slower than GPT-4, premium pricing
  • Best for: Orgs prioritizing safety and privacy
  • Not ideal for: Cost-optimization focus

Google Gemini

  • Overall Score: 82.8
  • Strengths: Cost-efficient, good multimodal (vision), regional options
  • Weaknesses: Younger ecosystem, smaller community, less proven accuracy
  • Best for: Organizations on Google Cloud, cost-conscious
  • Not ideal for: Mission-critical tasks requiring maximum accuracy

Self-Hosted (Llama 2 or Mistral)

  • Overall Score: 72 (highly variable; depends on your implementation)
  • Strengths: Lowest cost, full control, no data sharing
  • Weaknesses: Requires infrastructure expertise, lower quality, limited support
  • Best for: High-volume, cost-sensitive, compliance-critical workloads
  • Not ideal for: Organizations without DevOps expertise

Using Scores for Decisions

Tier 1: Best Overall (Score 85+)

Use for production systems where quality matters. Higher cost acceptable.

Tier 2: Strong (Score 80-84)

Good balance of cost, capability, and support. Suitable for most workloads.

Tier 3: Adequate (Score 75-79)

Good for non-critical, cost-sensitive workloads. Trade-offs on quality or support.

Tier 4: Limited (Score <75)

Consider only for specific use cases where they excel (e.g., cost optimization).
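
Applied programmatically, the tier bands above reduce to a simple threshold check; the sketch below assumes the same cut-offs (the helper name is hypothetical).

```python
def tier(overall_score: float) -> str:
    """Map an overall weighted score to the decision tiers above."""
    if overall_score >= 85:
        return "Tier 1: Best Overall"
    if overall_score >= 80:
        return "Tier 2: Strong"
    if overall_score >= 75:
        return "Tier 3: Adequate"
    return "Tier 4: Limited"

print(tier(83.7))  # Tier 2: Strong (e.g. Anthropic Claude 3 above)
print(tier(72.0))  # Tier 4: Limited (e.g. a typical self-hosted setup)
```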

Customizing Vendor Scores

Default weights (30% security, 25% features, 20% cost, 15% support, 10% governance) suit most orgs. Customize if your priorities differ:

For Finance/Compliance-heavy orgs:

  • Security: 40%, Governance: 20%, Cost: 15%, Features: 20%, Support: 5%
  • Prioritizes compliance and security over cost

For Cost-sensitive startups:

  • Cost: 40%, Features: 30%, Security: 20%, Support: 5%, Governance: 5%
  • Prioritizes performance and cost; accepts lower support tier

For Safety-critical (healthcare, aviation):

  • Security: 35%, Governance: 25%, Features: 25%, Support: 10%, Cost: 5%
  • Prioritizes safety and compliance; cost is secondary
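
As a sketch of how a custom profile changes the result, the snippet below recomputes Claude 3's overall score (sub-scores from the comparison table above) under the finance/compliance-heavy weighting; identifiers are illustrative only.

```python
# Finance/compliance-heavy profile (integer percentages, must sum to 100)
FINANCE_WEIGHTS_PCT = {"security": 40, "governance": 20, "cost": 15, "features": 20, "support": 5}
assert sum(FINANCE_WEIGHTS_PCT.values()) == 100

# Anthropic Claude 3 sub-scores from the comparison table above
claude_3 = {"security": 88, "features": 90, "cost": 70, "support": 82, "governance": 85}

total = sum(claude_3[dim] * pct for dim, pct in FINANCE_WEIGHTS_PCT.items()) / 100
print(f"{total:.2f}")  # 84.80 (vs. 83.70 under the default weights)
```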

See Custom Criteria to create your own weighting.

Updating Scores

Vendor scores are updated quarterly as:

  • New models/features release
  • Pricing changes
  • Security/compliance certifications change
  • Performance benchmarks evolve
  • Customer reviews and feedback accumulate

Each score includes a last-updated date, and historical versions are available.

Next Steps

  1. Review vendor scores for your use case
  2. Identify top 2-3 candidates (closest match to your priorities)
  3. Request trial access from finalist vendors
  4. Customize scoring if your needs differ from defaults
  5. Run cost analysis for finalists (usage projections; see the sketch below)
  6. Present to stakeholders with recommendation
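
For step 5, a back-of-the-envelope projection like the sketch below is usually enough to compare finalists; all figures are placeholders, so substitute your own volumes and each finalist's current list prices.

```python
# Placeholder volumes: substitute your own projections
monthly_requests = 500_000
avg_input_tokens = 1_200
avg_output_tokens = 400

# Placeholder list prices per 1M tokens: use each finalist's current pricing page
price_in_per_1m = 15.00
price_out_per_1m = 75.00

monthly_cost = monthly_requests * (
    avg_input_tokens * price_in_per_1m + avg_output_tokens * price_out_per_1m
) / 1_000_000
print(f"${monthly_cost:,.0f}/month")  # $24,000/month
```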