Content Safety

Overview

The Content Safety stage in the Firewall detects and prevents harmful, toxic, biased, or inappropriate content in both AI requests and responses. Using advanced NLP classifiers and configurable thresholds, it helps ensure your AI systems produce safe, responsible outputs.

Detection Categories

Toxicity

Identifies offensive, insulting, or hostile language that violates community standards.

  • Examples: Profanity, hate speech, threats, derogatory comments
  • Threshold Range: 0.0 to 1.0 (default: 0.7)
  • Action Options: block, redact, warn, or log

Bias

Detects language that stereotypes, discriminates, or disadvantages specific groups based on protected characteristics.

  • Protected Classes: Race, ethnicity, gender, sexual orientation, religion, disability, age, nationality
  • Examples: “Women are less technical than men”, assumptions about immigrant groups, stereotypes
  • Threshold Range: 0.0 to 1.0 (default: 0.6)
  • Action Options: warn, redact, or block

Harmful Content

Flags content that promotes violence, self-harm, illegal activities, or unsafe behavior.

  • Examples: Instructions for weapons, drug synthesis, self-harm methods, fraud tutorials, exploitation
  • Threshold Range: 0.0 to 1.0 (default: 0.8)
  • Action Options: block (always)

Sexually Explicit Content

Detects adult-oriented content that may violate company policies or user expectations.

  • Threshold Range: 0.0 to 1.0 (default: 0.75)
  • Action Options: block, redact, or warn

Configuration

Via YAML

firewall:
  stages:
    - name: "content-safety"
      enabled: true
      config:
        check_input: true
        check_output: true
        toxicity:
          enabled: true
          threshold: 0.7
          action: "warn" # "block", "redact", "warn", or "log"
        bias:
          enabled: true
          threshold: 0.6
          action: "warn"
        harmful_content:
          enabled: true
          threshold: 0.8
          action: "block" # Always blocks
        sexual_content:
          enabled: true
          threshold: 0.75
          action: "warn"
        # Allowlist specific phrases/topics
        allowlist:
          - "legitimate medical discussion about sexual health"
          - "academic study of discrimination"
          - "historical context of slurs"

Via UI

  1. Go to Governance → Firewall → Content Safety
  2. Adjust thresholds for each category
  3. Set actions (what to do when a violation is detected)
  4. Add domain-specific allowlist entries
  5. Click Save & Deploy

Actions Explained

Block

Prevents the response from reaching the user and returns a generic error message instead. The full response is logged in the audit trail for review.

{
  "error": "Response filtered for safety reasons",
  "code": "SAFETY_VIOLATION",
  "category": "harmful_content"
}

Redact

Modifies the response to remove or mask the problematic content while preserving the rest.

Example:

  • Original: “You should punch him in the face for that”
  • Redacted: “You should [REDACTED] for that”
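A minimal sketch of how span-based redaction can work, assuming the classifier reports character offsets for the flagged text (the offsets and function name below are illustrative, not the product API):

```python
def redact(text: str, spans: list[tuple[int, int]]) -> str:
    """Replace each flagged (start, end) span with [REDACTED].

    Spans are applied right to left so earlier offsets stay valid
    after each substitution changes the string length.
    """
    for start, end in sorted(spans, reverse=True):
        text = text[:start] + "[REDACTED]" + text[end:]
    return text

original = "You should punch him in the face for that"
# Suppose the classifier flags the span "punch him in the face" (chars 11-32).
print(redact(original, [(11, 32)]))
# -> You should [REDACTED] for that
```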

Warn

Allows the response through but adds a metadata flag indicating the safety concern.

{
  "response": "...",
  "safety_warnings": [
    {
      "category": "toxicity",
      "score": 0.72,
      "severity": "medium",
      "message": "Response contains potentially offensive language"
    }
  ]
}

Log

Records the violation in audit logs without modifying the response. Used for monitoring and analytics.

Exceptions & Allowlists

Topic-Based Allowlist

Some topics legitimately contain sensitive language:

  • Medical discussions about sexual health
  • Academic analysis of racism, discrimination
  • Historical documentation of slurs or atrocities
  • Fiction and creative writing with mature themes

Add context to your allowlist:

firewall:
  content-safety:
    allowlist:
      - pattern: "sexual health"
        category: "sexual_content"
        reason: "Medical context"
      - pattern: "historic.*slur"
        category: "toxicity"
        reason: "Educational context"
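Allowlist matching of this kind can be sketched as follows, treating each `pattern` as a case-insensitive regular expression scoped to one category (the helper and data names are illustrative, not the product API):

```python
import re

# Hypothetical allowlist entries mirroring the YAML above.
ALLOWLIST = [
    {"pattern": "sexual health", "category": "sexual_content", "reason": "Medical context"},
    {"pattern": "historic.*slur", "category": "toxicity", "reason": "Educational context"},
]

def allowlisted(text: str, category: str) -> bool:
    """Return True if any allowlist pattern for this category matches the text."""
    return any(
        entry["category"] == category
        and re.search(entry["pattern"], text, re.IGNORECASE)
        for entry in ALLOWLIST
    )

print(allowlisted("A historical analysis of slurs in literature", "toxicity"))  # True
print(allowlisted("Some unrelated hostile rant", "toxicity"))                   # False
```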

User/Role-Based Exceptions

Certain roles may need lower thresholds (e.g., moderation team, researchers):

  1. Go to Governance → Content Safety → Exceptions
  2. Select the user or role
  3. Choose which categories to relax
  4. Set custom thresholds for this user
  5. Set expiration date for the exception

Understanding Scores

Each detection returns a confidence score (0.0 to 1.0):

  • 0.0-0.3: Not detected or very weak signal
  • 0.3-0.6: Possible detection, may need human review
  • 0.6-0.8: Strong detection, likely applies
  • 0.8-1.0: Definitive detection

Thresholds determine at what score an action is triggered. A threshold of 0.7 means scores >= 0.7 trigger the action.
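That gating rule can be sketched in a few lines, using the default thresholds and actions from the configuration above (the function and dictionary names are illustrative):

```python
# Hypothetical per-category config mirroring the documented defaults.
THRESHOLDS = {"toxicity": 0.7, "bias": 0.6, "harmful_content": 0.8, "sexual_content": 0.75}
ACTIONS = {"toxicity": "warn", "bias": "warn", "harmful_content": "block", "sexual_content": "warn"}

def triggered_actions(scores: dict[str, float]) -> dict[str, str]:
    """Return the configured action for each category whose score meets its threshold (>=)."""
    return {
        category: ACTIONS[category]
        for category, score in scores.items()
        if score >= THRESHOLDS[category]
    }

print(triggered_actions({"toxicity": 0.72, "bias": 0.41, "harmful_content": 0.1}))
# -> {'toxicity': 'warn'}
```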

Real-World Examples

Example 1: Toxicity in Support Response

Request: “Can you write a response to a customer who left a bad review?”

AI Response (unchecked): “This customer is an idiot who doesn’t understand our product.”

Safety Analysis:

  • Toxicity Score: 0.82 (exceeds 0.7 threshold)
  • Action: Redact
  • Result: “This customer [REDACTED] doesn’t understand our product.”

Example 2: Bias in Hiring Assistant

Request: “Draft a job description for an engineer role.”

AI Response (unchecked): “We’re looking for a young, energetic programmer with fresh ideas. Stay-at-home moms need not apply.”

Safety Analysis:

  • Bias Score: 0.91 (exceeds 0.6 threshold)
  • Action: Block
  • Result: Response rejected, admin notified
  • Audit: Full response logged for legal review

Example 3: Harmful Content in an Educational Context

Request: “Explain how locks work from an engineering perspective.”

AI Response: “Traditional pin-tumbler locks can be bypassed using…”

Safety Analysis:

  • Harmful Content Score: 0.65 (below 0.8 threshold)
  • Result: Response passes through
  • Audit: Low-score detection logged for monitoring

Testing Content Safety

Via Web UI

  1. Go to Governance → Test Firewall → Content Safety tab
  2. Paste sample text
  3. Click Scan
  4. See scores for each category

Via API

curl -X POST http://localhost:5000/api/v1/governance/content-safety/test \
  -H "Authorization: Bearer $TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "text": "Sample text to analyze",
    "categories": ["toxicity", "bias", "harmful_content"]
  }'

Response:

{
  "text": "Sample text to analyze",
  "scores": {
    "toxicity": 0.12,
    "bias": 0.05,
    "harmful_content": 0.02,
    "sexual_content": 0.01
  },
  "violations": [],
  "passed": true
}
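The same endpoint can be called from Python with only the standard library; the path and payload mirror the curl example above, while the helper function name is illustrative:

```python
import json
import urllib.request

def build_safety_test_request(base_url: str, token: str, text: str,
                              categories: list[str]) -> urllib.request.Request:
    """Build the POST request for the content-safety test endpoint."""
    payload = json.dumps({"text": text, "categories": categories}).encode()
    return urllib.request.Request(
        f"{base_url}/api/v1/governance/content-safety/test",
        data=payload,
        headers={"Authorization": f"Bearer {token}", "Content-Type": "application/json"},
        method="POST",
    )

req = build_safety_test_request(
    "http://localhost:5000", "my-token",
    "Sample text to analyze", ["toxicity", "bias", "harmful_content"],
)
# result = json.load(urllib.request.urlopen(req))  # requires a running firewall
# A clean scan returns "violations": [] and "passed": true, as shown above.
```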

Monitoring & Analytics

Go to Governance → Reports → Content Safety to see:

  • Volume by Category: Which types of violations are most common
  • Trend Over Time: Are violations increasing or decreasing
  • False Positive Rate: Actions taken vs. user appeals
  • Allowlist Hit Rate: How often allowlist exceptions prevent false blocks

Fine-Tuning Thresholds

Start Conservative

Begin with high thresholds (0.8+), so actions trigger only on high-confidence detections, and lower them gradually as you learn your content patterns.

Monitor False Positives

If legitimate content is being blocked, add to allowlist before lowering thresholds.

Domain-Specific Adjustment

Tech forums may need different thresholds than medical advice platforms.

A/B Test

Enable shadow mode: log violations without blocking, measure the impact, then roll out enforcement.
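A minimal sketch of shadow mode, assuming per-category scores and thresholds as described above (all names here are illustrative, not the product API):

```python
import logging

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("content-safety.shadow")

def enforce(response: str, scores: dict[str, float],
            thresholds: dict[str, float], shadow: bool = True) -> str:
    """Block on the first over-threshold category, unless in shadow mode.

    In shadow mode, would-be blocks are only logged; the response is
    always passed through unmodified so impact can be measured first.
    """
    for category, score in scores.items():
        if score >= thresholds[category]:
            log.info("would block: category=%s score=%.2f", category, score)
            if not shadow:
                return "[blocked for safety reasons]"
    return response

out = enforce("Draft reply...", {"toxicity": 0.9}, {"toxicity": 0.7}, shadow=True)
# Shadow mode logs the violation but returns the response unchanged.
```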