Content Safety

Overview

The Content Safety stage in the Firewall detects and prevents harmful, toxic, biased, or inappropriate content in both AI requests and responses. Using advanced NLP classifiers and configurable thresholds, it helps ensure your AI systems produce safe, responsible outputs.

Detection Categories

Toxicity

Identifies offensive, insulting, or hostile language that violates community standards.

  • Examples: Profanity, hate speech, threats, derogatory comments
  • Threshold Range: 0.0 to 1.0 (default: 0.7)
  • Action Options: block, redact, warn, or log

Bias

Detects language that stereotypes, discriminates, or disadvantages specific groups based on protected characteristics.

  • Protected Classes: Race, ethnicity, gender, sexual orientation, religion, disability, age, nationality
  • Examples: “Women are less technical than men”, assumptions about immigrant groups, stereotypes
  • Threshold Range: 0.0 to 1.0 (default: 0.6)
  • Action Options: warn, redact, or block

Harmful Content

Flags content that promotes violence, self-harm, illegal activities, or unsafe behavior.

  • Examples: Instructions for weapons, drug synthesis, self-harm methods, fraud tutorials, exploitation
  • Threshold Range: 0.0 to 1.0 (default: 0.8)
  • Action Options: block (always)

Sexually Explicit Content

Detects adult-oriented content that may violate company policies or user expectations.

  • Threshold Range: 0.0 to 1.0 (default: 0.75)
  • Action Options: block, redact, or warn

Configuration

Via YAML

firewall:
  stages:
    - name: "content-safety"
      enabled: true
      config:
        check_input: true
        check_output: true
        toxicity:
          enabled: true
          threshold: 0.7
          action: "warn" # "block", "redact", "warn", or "log"
        bias:
          enabled: true
          threshold: 0.6
          action: "warn"
        harmful_content:
          enabled: true
          threshold: 0.8
          action: "block" # Always blocks
        sexual_content:
          enabled: true
          threshold: 0.75
          action: "warn"
        # Allowlist specific phrases/topics
        allowlist:
          - "legitimate medical discussion about sexual health"
          - "academic study of discrimination"
          - "historical context of slurs"

Via UI

  1. Go to Governance → Firewall → Content Safety
  2. Adjust thresholds for each category
  3. Set actions (what to do when a violation is detected)
  4. Add domain-specific allowlist entries
  5. Click Save & Deploy

Actions Explained

Block

Prevents the response from reaching the user and returns a generic error message instead. The full response is logged in the audit trail for review.

{
  "error": "Response filtered for safety reasons",
  "code": "SAFETY_VIOLATION",
  "category": "harmful_content"
}

Redact

Modifies the response to remove or mask the problematic content while preserving the rest.

Example:

  • Original: “You should punch him in the face for that”
  • Redacted: “You should [REDACTED] for that”
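A minimal sketch of how span-based redaction can work, assuming the classifier reports character offsets for the flagged text (the offsets and function name below are illustrative, not the product API):

```python
def redact(text: str, spans: list[tuple[int, int]]) -> str:
    """Replace each flagged (start, end) span with [REDACTED].

    Spans are applied right to left so earlier offsets stay valid
    after each substitution changes the string length.
    """
    for start, end in sorted(spans, reverse=True):
        text = text[:start] + "[REDACTED]" + text[end:]
    return text

original = "You should punch him in the face for that"
# Suppose the classifier flags the span "punch him in the face" (chars 11-32).
print(redact(original, [(11, 32)]))
# -> You should [REDACTED] for that
```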

Warn

Allows the response through but adds a metadata flag indicating the safety concern.

{
  "response": "...",
  "safety_warnings": [
    {
      "category": "toxicity",
      "score": 0.72,
      "severity": "medium",
      "message": "Response contains potentially offensive language"
    }
  ]
}

Log

Records the violation in audit logs without modifying the response. Used for monitoring and analytics.

Exceptions & Allowlists

Topic-Based Allowlist

Some topics legitimately contain sensitive language:

  • Medical discussions about sexual health
  • Academic analysis of racism, discrimination
  • Historical documentation of slurs or atrocities
  • Fiction and creative writing with mature themes

Add context to your allowlist:

firewall:
  content-safety:
    allowlist:
      - pattern: "sexual health"
        category: "sexual_content"
        reason: "Medical context"
      - pattern: "historic.*slur"
        category: "toxicity"
        reason: "Educational context"
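Allowlist matching of this kind can be sketched as follows, treating each `pattern` as a case-insensitive regular expression scoped to one category (the helper and data names are illustrative, not the product API):

```python
import re

# Hypothetical allowlist entries mirroring the YAML above.
ALLOWLIST = [
    {"pattern": "sexual health", "category": "sexual_content", "reason": "Medical context"},
    {"pattern": "historic.*slur", "category": "toxicity", "reason": "Educational context"},
]

def allowlisted(text: str, category: str) -> bool:
    """Return True if any allowlist pattern for this category matches the text."""
    return any(
        entry["category"] == category
        and re.search(entry["pattern"], text, re.IGNORECASE)
        for entry in ALLOWLIST
    )

print(allowlisted("A historical analysis of slurs in literature", "toxicity"))  # True
print(allowlisted("Some unrelated hostile rant", "toxicity"))                   # False
```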

User/Role-Based Exceptions

Certain roles may need lower thresholds (e.g., moderation team, researchers):

  1. Go to Governance → Content Safety → Exceptions
  2. Select the user or role
  3. Choose which categories to relax
  4. Set custom thresholds for this user
  5. Set expiration date for the exception

Understanding Scores

Each detection returns a confidence score (0.0 to 1.0):

  • 0.0-0.3: Not detected or very weak signal
  • 0.3-0.6: Possible detection, may need human review
  • 0.6-0.8: Strong detection, likely applies
  • 0.8-1.0: Definitive detection

Thresholds determine at what score an action is triggered. A threshold of 0.7 means scores >= 0.7 trigger the action.
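That gating rule can be sketched in a few lines, using the default thresholds and actions from the configuration above (the function and dictionary names are illustrative):

```python
# Hypothetical per-category config mirroring the documented defaults.
THRESHOLDS = {"toxicity": 0.7, "bias": 0.6, "harmful_content": 0.8, "sexual_content": 0.75}
ACTIONS = {"toxicity": "warn", "bias": "warn", "harmful_content": "block", "sexual_content": "warn"}

def triggered_actions(scores: dict[str, float]) -> dict[str, str]:
    """Return the configured action for each category whose score meets its threshold (>=)."""
    return {
        category: ACTIONS[category]
        for category, score in scores.items()
        if score >= THRESHOLDS[category]
    }

print(triggered_actions({"toxicity": 0.72, "bias": 0.41, "harmful_content": 0.1}))
# -> {'toxicity': 'warn'}
```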

Real-World Examples

Example 1: Toxicity in Support Response

Request: “Can you write a response to a customer who left a bad review?”

AI Response (unchecked): “This customer is an idiot who doesn’t understand our product.”

Safety Analysis:

  • Toxicity Score: 0.82 (exceeds 0.7 threshold)
  • Action: Redact
  • Result: “This customer [REDACTED] doesn’t understand our product.”

Example 2: Bias in Hiring Assistant

Request: “Draft a job description for an engineer role.”

AI Response (unchecked): “We’re looking for a young, energetic programmer with fresh ideas. Stay-at-home moms need not apply.”

Safety Analysis:

  • Bias Score: 0.91 (exceeds 0.6 threshold)
  • Action: Block
  • Result: Response rejected, admin notified
  • Audit: Full response logged for legal review

Example 3: Harmful Content in an Educational Context

Request: “Explain how locks work from an engineering perspective.”

AI Response: “Traditional pin-tumbler locks can be bypassed using…”

Safety Analysis:

  • Harmful Content Score: 0.65 (below 0.8 threshold)
  • Result: Response passes through
  • Audit: Low-score detection logged for monitoring

Testing Content Safety

Via Web UI

  1. Go to Governance → Test Firewall → Content Safety tab
  2. Paste sample text
  3. Click Scan
  4. See scores for each category

Via API

curl -X POST http://localhost:5000/api/v1/governance/content-safety/test \
  -H "Authorization: Bearer $TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "text": "Sample text to analyze",
    "categories": ["toxicity", "bias", "harmful_content"]
  }'

Response:

{
  "text": "Sample text to analyze",
  "scores": {
    "toxicity": 0.12,
    "bias": 0.05,
    "harmful_content": 0.02,
    "sexual_content": 0.01
  },
  "violations": [],
  "passed": true
}
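The same endpoint can be called from Python with only the standard library; the path and payload mirror the curl example above, while the helper function name is illustrative:

```python
import json
import urllib.request

def build_safety_test_request(base_url: str, token: str, text: str,
                              categories: list[str]) -> urllib.request.Request:
    """Build the POST request for the content-safety test endpoint."""
    payload = json.dumps({"text": text, "categories": categories}).encode()
    return urllib.request.Request(
        f"{base_url}/api/v1/governance/content-safety/test",
        data=payload,
        headers={"Authorization": f"Bearer {token}", "Content-Type": "application/json"},
        method="POST",
    )

req = build_safety_test_request(
    "http://localhost:5000", "my-token",
    "Sample text to analyze", ["toxicity", "bias", "harmful_content"],
)
# result = json.load(urllib.request.urlopen(req))  # requires a running firewall
# A clean scan returns "violations": [] and "passed": true, as shown above.
```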

Monitoring & Analytics

Go to Governance → Reports → Content Safety to see:

  • Volume by Category: Which types of violations are most common
  • Trend Over Time: Are violations increasing or decreasing
  • False Positive Rate: Actions taken vs. user appeals
  • Allowlist Hit Rate: How often allowlist exceptions prevent false blocks

Fine-Tuning Thresholds

Start Conservative

Begin with high thresholds (0.8+), so actions trigger only on high-confidence detections, and lower them gradually as you learn your content patterns.

Monitor False Positives

If legitimate content is being blocked, add to allowlist before lowering thresholds.

Domain-Specific Adjustment

Tech forums may need different thresholds than medical advice platforms.

A/B Test

Enable shadow mode: log violations without blocking, measure the impact, then roll out enforcement.
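A minimal sketch of shadow mode, assuming per-category scores and thresholds as described above (all names here are illustrative, not the product API):

```python
import logging

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("content-safety.shadow")

def enforce(response: str, scores: dict[str, float],
            thresholds: dict[str, float], shadow: bool = True) -> str:
    """Block on the first over-threshold category, unless in shadow mode.

    In shadow mode, would-be blocks are only logged; the response is
    always passed through unmodified so impact can be measured first.
    """
    for category, score in scores.items():
        if score >= thresholds[category]:
            log.info("would block: category=%s score=%.2f", category, score)
            if not shadow:
                return "[blocked for safety reasons]"
    return response

out = enforce("Draft reply...", {"toxicity": 0.9}, {"toxicity": 0.7}, shadow=True)
# Shadow mode logs the violation but returns the response unchanged.
```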