Content Safety
Overview
The Content Safety stage in the Firewall detects and prevents harmful, toxic, biased, or inappropriate content in both AI requests and responses. Using advanced NLP classifiers and configurable thresholds, it helps ensure your AI systems produce safe, responsible outputs.
Detection Categories
Toxicity
Identifies offensive, insulting, or hostile language that violates community standards.
- Examples: Profanity, hate speech, threats, derogatory comments
- Threshold Range: 0.0 to 1.0 (default: 0.7)
- Action Options: block, redact, warn, or log
Bias
Detects language that stereotypes, discriminates, or disadvantages specific groups based on protected characteristics.
- Protected Classes: Race, ethnicity, gender, sexual orientation, religion, disability, age, nationality
- Examples: “Women are less technical than men”, assumptions about immigrant groups, stereotypes
- Threshold Range: 0.0 to 1.0 (default: 0.6)
- Action Options: warn, redact, or block
Harmful Content
Flags content that promotes violence, self-harm, illegal activities, or unsafe behavior.
- Examples: Instructions for weapons, drug synthesis, self-harm methods, fraud tutorials, exploitation
- Threshold Range: 0.0 to 1.0 (default: 0.8)
- Action Options: block (always)
Sexually Explicit Content
Detects adult-oriented content that may violate company policies or user expectations.
- Threshold Range: 0.0 to 1.0 (default: 0.75)
- Action Options: block, redact, or warn
Configuration
Via YAML
```yaml
firewall:
  stages:
    - name: "content-safety"
      enabled: true
      config:
        check_input: true
        check_output: true
        toxicity:
          enabled: true
          threshold: 0.7
          action: "warn"  # "block", "redact", "warn", "log"
        bias:
          enabled: true
          threshold: 0.6
          action: "warn"
        harmful_content:
          enabled: true
          threshold: 0.8
          action: "block"  # Always blocks
        sexual_content:
          enabled: true
          threshold: 0.75
          action: "warn"
        # Allowlist specific phrases/topics
        allowlist:
          - "legitimate medical discussion about sexual health"
          - "academic study of discrimination"
          - "historical context of slurs"
```
Via UI
- Go to Governance → Firewall → Content Safety
- Adjust thresholds for each category
- Set actions (what to do when a violation is detected)
- Add domain-specific allowlist entries
- Click Save & Deploy
Actions Explained
Block
Prevents the response from reaching the user and returns a generic error message. The full response is logged in the audit trail for review.
{ "error": "Response filtered for safety reasons", "code": "SAFETY_VIOLATION", "category": "harmful_content"}Redact
Modifies the response to remove or mask the problematic content while preserving the rest.
Example:
- Original: “You should punch him in the face for that”
- Redacted: “You should [REDACTED] for that”
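As a rough illustration of the idea, here is a minimal redaction sketch. It assumes a hypothetical classifier that returns character-offset spans with scores, which is not guaranteed to match the Firewall's internal implementation:

```python
# Minimal sketch of span-based redaction. Assumes a hypothetical classifier
# that returns (start, end, score) character spans for flagged content;
# the real content-safety stage may work differently.
def redact(text: str, spans: list[tuple[int, int, float]], threshold: float = 0.7) -> str:
    result = []
    cursor = 0
    for start, end, score in sorted(spans):
        if score < threshold:
            continue  # below threshold: leave the span untouched
        result.append(text[cursor:start])  # keep text before the flagged span
        result.append("[REDACTED]")        # mask the flagged span
        cursor = end
    result.append(text[cursor:])           # keep the remainder
    return "".join(result)

# The example from above:
print(redact("You should punch him in the face for that", [(11, 32, 0.82)]))
# -> "You should [REDACTED] for that"
```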
Warn
Allows the response through but adds a metadata flag indicating the safety concern.
{ "response": "...", "safety_warnings": [ { "category": "toxicity", "score": 0.72, "severity": "medium", "message": "Response contains potentially offensive language" } ]}Log
Records the violation in audit logs without modifying the response. Used for monitoring and analytics.
Exceptions & Allowlists
Topic-Based Allowlist
Some topics legitimately contain sensitive language:
- Medical discussions about sexual health
- Academic analysis of racism, discrimination
- Historical documentation of slurs or atrocities
- Fiction and creative writing with mature themes
Add context to your allowlist:
```yaml
firewall:
  content-safety:
    allowlist:
      - pattern: "sexual health"
        category: "sexual_content"
        reason: "Medical context"
      - pattern: "historic.*slur"
        category: "toxicity"
        reason: "Educational context"
```
User/Role-Based Exceptions
Certain roles may need lower thresholds (e.g., moderation team, researchers):
- Go to Governance → Content Safety → Exceptions
- Select the user or role
- Choose which categories to relax
- Set custom thresholds for this user
- Set expiration date for the exception
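Conceptually, the effective threshold for a request is resolved from any active exception before falling back to the category default. Below is a minimal sketch of that resolution logic; the class name, fields, and defaults are assumptions for illustration, not the product's actual data model:

```python
from dataclasses import dataclass
from datetime import date

# Hypothetical representation of a per-role exception, for illustration only.
@dataclass
class RoleException:
    role: str
    category: str      # e.g. "toxicity"
    threshold: float   # relaxed threshold for this role
    expires: date      # exception is ignored after this date

DEFAULTS = {"toxicity": 0.7, "bias": 0.6, "harmful_content": 0.8, "sexual_content": 0.75}

def effective_threshold(category: str, role: str,
                        exceptions: list[RoleException], today: date) -> float:
    # Use the role's exception if it matches the category and has not expired.
    for exc in exceptions:
        if exc.role == role and exc.category == category and today <= exc.expires:
            return exc.threshold
    return DEFAULTS[category]

# Example: a moderation team reviewing flagged content with a relaxed threshold.
exceptions = [RoleException("moderation", "toxicity", 0.95, date(2025, 12, 31))]
print(effective_threshold("toxicity", "moderation", exceptions, date(2025, 6, 1)))  # 0.95
print(effective_threshold("toxicity", "support", exceptions, date(2025, 6, 1)))     # 0.7
```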
Understanding Scores
Each detection returns a confidence score (0.0 to 1.0):
- 0.0-0.3: Not detected or very weak signal
- 0.3-0.6: Possible detection, may need human review
- 0.6-0.8: Strong detection, likely applies
- 0.8-1.0: Definitive detection
Thresholds determine at what score an action is triggered. A threshold of 0.7 means scores >= 0.7 trigger the action.
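In other words, each category's score is compared against its threshold, and the configured action fires only when the score meets or exceeds it. A minimal sketch of that comparison follows; the thresholds and actions mirror the defaults in the YAML configuration above, but the function itself is illustrative, not the Firewall's actual API:

```python
# Illustrative threshold check: returns the actions triggered by a set of scores.
POLICY = {
    "toxicity":        {"threshold": 0.7,  "action": "warn"},
    "bias":            {"threshold": 0.6,  "action": "warn"},
    "harmful_content": {"threshold": 0.8,  "action": "block"},
    "sexual_content":  {"threshold": 0.75, "action": "warn"},
}

def triggered_actions(scores: dict[str, float]) -> dict[str, str]:
    # A score >= threshold triggers the category's configured action.
    return {
        category: POLICY[category]["action"]
        for category, score in scores.items()
        if score >= POLICY[category]["threshold"]
    }

print(triggered_actions({"toxicity": 0.72, "bias": 0.31, "harmful_content": 0.05}))
# -> {'toxicity': 'warn'}
```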
Real-World Examples
Example 1: Toxicity in Support Response
Request: “Can you write a response to a customer who left a bad review?”
AI Response (unchecked): “This customer is an idiot who doesn’t understand our product.”
Safety Analysis:
- Toxicity Score: 0.82 (exceeds 0.7 threshold)
- Action: Redact
- Result: “This customer [REDACTED] doesn’t understand our product.”
Example 2: Bias in Hiring Assistant
Request: “Draft a job description for an engineer role.”
AI Response (unchecked): “We’re looking for a young, energetic programmer with fresh ideas. Stay-at-home moms need not apply.”
Safety Analysis:
- Bias Score: 0.91 (exceeds 0.6 threshold)
- Action: Block
- Result: Response rejected, admin notified
- Audit: Full response logged for legal review
Example 3: Harmful Content in an Educational Context
Request: “Explain how locks work from an engineering perspective.”
AI Response: “Traditional pin-tumbler locks can be bypassed using…”
Safety Analysis:
- Harmful Content Score: 0.65 (below 0.8 threshold)
- Result: Response passes through
- Audit: Low-score detection logged for monitoring
Testing Content Safety
Via Web UI
- Go to Governance → Test Firewall → Content Safety Tab
- Paste sample text
- Click Scan
- See scores for each category
Via API
```bash
curl -X POST http://localhost:5000/api/v1/governance/content-safety/test \
  -H "Authorization: Bearer $TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "text": "Sample text to analyze",
    "categories": ["toxicity", "bias", "harmful_content"]
  }'
```
Response:
{ "text": "Sample text to analyze", "scores": { "toxicity": 0.12, "bias": 0.05, "harmful_content": 0.02, "sexual_content": 0.01 }, "violations": [], "passed": true}Monitoring & Analytics
Go to Governance → Reports → Content Safety to see:
- Volume by Category: Which types of violations are most common
- Trend Over Time: Are violations increasing or decreasing
- False Positive Rate: Actions taken vs. user appeals
- Allowlist Hit Rate: How often allowlist exceptions prevent false blocks
Fine-Tuning Thresholds
Start Conservative
Begin with high thresholds (0.8+), which trigger only on high-confidence detections, and lower them gradually as you learn your content patterns.
Monitor False Positives
If legitimate content is being blocked, add to allowlist before lowering thresholds.
Domain-Specific Adjustment
Tech forums may need different thresholds than medical advice platforms.
A/B Test
Enable shadow mode: log violations without blocking, measure the impact, then roll out enforcement.
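One way to think about shadow mode: evaluate the same policy as production, but record the would-be action instead of enforcing it, and compare violation counts before switching enforcement on. The sketch below is a conceptual illustration of that pattern, not a documented Firewall setting or API:

```python
import logging

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("content-safety.shadow")

def apply_policy(scores: dict[str, float], thresholds: dict[str, float],
                 shadow: bool = True) -> bool:
    """Return True if the response should be blocked.

    In shadow mode, violations are only logged so their impact can be
    measured before enforcement is rolled out.
    """
    violations = {c: s for c, s in scores.items() if s >= thresholds.get(c, 1.0)}
    if not violations:
        return False
    if shadow:
        # Shadow mode: record what would have been blocked, but let it through.
        logger.info("shadow violation (not enforced): %s", violations)
        return False
    logger.warning("violation enforced: %s", violations)
    return True

# During the trial, run with shadow=True and count how often this fires;
# once the false-positive rate looks acceptable, switch shadow=False to enforce.
blocked = apply_policy({"toxicity": 0.81}, {"toxicity": 0.7}, shadow=True)
```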