ai-moderating-content
Auto-Moderate What Users Post
Guide the user through building AI content moderation — classify user-generated content, score severity, and route decisions (auto-approve, human-review, auto-reject). The pattern: classify, score, route.
When you need content moderation
- User-generated content (comments, posts, reviews, messages)
- Community platforms and forums
- Marketplace listings (product descriptions, seller profiles)
- Chat and messaging features
- Any surface where users create content that others see
Step 1: Define your moderation policy
Ask the user:
- What content do you need to catch? (hate speech, spam, NSFW, harassment, self-harm, illegal activity, PII)
- What are the severity levels? (warning, remove, ban)
- What's the tolerance for false positives? (over-moderating frustrates users)
- Is human review in the loop? (auto-only vs. auto + human escalation)
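The answers to these questions can be captured as a small policy table that the routing code reads, so a policy change doesn't require touching the moderator itself. A minimal sketch; the category names, actions, and review flags below are illustrative, not a recommended policy:

```python
# Illustrative policy table: maps each violation category to a default
# action and whether a human must confirm before the action takes effect.
POLICY = {
    "spam":        {"action": "remove", "human_review": False},
    "hate_speech": {"action": "remove", "human_review": True},
    "harassment":  {"action": "remove", "human_review": True},
    "pii":         {"action": "redact", "human_review": False},
    "self_harm":   {"action": "human_review", "human_review": True},
}

def route(violation: str) -> str:
    """Look up the configured action; unknown categories escalate to a human."""
    entry = POLICY.get(violation)
    return entry["action"] if entry else "human_review"
```

Keeping the policy as data also gives you one obvious place to audit when tolerance for false positives changes.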
Step 2: Build the moderator
Classification + severity scoring + routing decision:

```python
import dspy
from typing import Literal

VIOLATIONS = Literal[
    "safe", "spam", "hate_speech", "harassment",
    "violence", "nsfw", "self_harm", "illegal",
]

class ModerateContent(dspy.Signature):
    """Assess user-generated content against platform policies."""

    content: str = dspy.InputField(desc="user-generated content to moderate")
    platform_context: str = dspy.InputField(desc="where this content appears, e.g. 'product review'")
    violation_type: VIOLATIONS = dspy.OutputField()
    severity: Literal["none", "low", "medium", "high"] = dspy.OutputField()
    explanation: str = dspy.OutputField(desc="brief reason for the decision")

class ContentModerator(dspy.Module):
    def __init__(self):
        super().__init__()
        self.assess = dspy.ChainOfThought(ModerateContent)

    def forward(self, content, platform_context="social media post"):
        result = self.assess(content=content, platform_context=platform_context)

        # Route based on severity
        if result.severity == "high":
            decision = "remove"
        elif result.severity == "medium":
            decision = "human_review"
        elif result.severity == "low":
            decision = "warn"
        else:
            decision = "approve"

        return dspy.Prediction(
            violation_type=result.violation_type,
            severity=result.severity,
            decision=decision,
            explanation=result.explanation,
        )
```
Usage

```python
moderator = ContentModerator()

result = moderator(content="Great product, works exactly as described!")
print(result.decision)  # "approve"

result = moderator(content="This seller is a scammer, I'll find where they live")
print(result.decision)        # "remove"
print(result.violation_type)  # "harassment"
```
Step 3: Multi-label moderation
Content can violate multiple policies at once (e.g., spam and contains PII):
```python
VIOLATION_TYPES = ["safe", "spam", "hate_speech", "harassment", "violence", "nsfw", "self_harm", "illegal"]

class MultiLabelModerate(dspy.Signature):
    """Flag all policy violations in user content. Content may have multiple violations."""

    content: str = dspy.InputField()
    platform_context: str = dspy.InputField()
    violations: list[str] = dspy.OutputField(desc=f"all that apply from: {VIOLATION_TYPES}")
    severity: Literal["none", "low", "medium", "high"] = dspy.OutputField(
        desc="overall severity based on the worst violation"
    )
    explanation: str = dspy.OutputField()

class MultiLabelModerator(dspy.Module):
    def __init__(self):
        super().__init__()
        self.assess = dspy.ChainOfThought(MultiLabelModerate)

    def forward(self, content, platform_context=""):
        result = self.assess(content=content, platform_context=platform_context)

        # Validate that returned violations are from the allowed set
        dspy.Assert(
            all(v in VIOLATION_TYPES for v in result.violations),
            f"Violations must be from: {VIOLATION_TYPES}",
        )
        return result
```
Step 4: Hard blocks with assertions
For zero-tolerance patterns, don't even ask the LM — block instantly with pattern matching:
```python
import re

class StrictModerator(dspy.Module):
    def __init__(self):
        super().__init__()
        self.assess = dspy.ChainOfThought(ModerateContent)

    def forward(self, content, platform_context=""):
        # Pattern-based hard blocks (instant, no LM needed)
        dspy.Assert(
            not re.search(r"\b\d{3}-\d{2}-\d{4}\b", content),
            "Content contains SSN pattern — auto-reject",
        )
        dspy.Assert(
            not re.search(
                r"\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b",
                content,
            ),
            "Content contains an email address — redact before posting",
        )
        dspy.Assert(
            not re.search(r"\b\d{16}\b", content),
            "Content contains a potential credit card number — auto-reject",
        )

        # LM-based assessment for everything else
        return self.assess(content=content, platform_context=platform_context)
```

Pattern-based blocks are faster, cheaper, and more reliable than LM-based detection for well-defined patterns (SSNs, credit cards, emails). Use regex for structure, LMs for semantics.
Step 5: Confidence-based routing
Route uncertain decisions to human reviewers instead of making bad calls:
```python
class ConfidentModerate(dspy.Signature):
    """Moderate content and rate your confidence in the assessment."""

    content: str = dspy.InputField()
    platform_context: str = dspy.InputField()
    violation_type: VIOLATIONS = dspy.OutputField()
    severity: Literal["none", "low", "medium", "high"] = dspy.OutputField()
    confidence: float = dspy.OutputField(desc="0.0 to 1.0 — how sure are you about this assessment?")
    explanation: str = dspy.OutputField()

class ConfidentModerator(dspy.Module):
    def __init__(self, confidence_threshold=0.7):
        super().__init__()
        self.assess = dspy.ChainOfThought(ConfidentModerate)
        self.confidence_threshold = confidence_threshold

    def forward(self, content, platform_context=""):
        result = self.assess(content=content, platform_context=platform_context)

        # Validate confidence range
        dspy.Assert(
            0.0 <= result.confidence <= 1.0,
            "Confidence must be between 0.0 and 1.0",
        )

        # Route based on confidence + severity
        if result.confidence < self.confidence_threshold:
            decision = "human_review"  # uncertain → always escalate
        elif result.severity == "high":
            decision = "remove"
        elif result.severity == "medium":
            decision = "human_review"
        elif result.severity == "low":
            decision = "warn"
        else:
            decision = "approve"

        return dspy.Prediction(
            violation_type=result.violation_type,
            severity=result.severity,
            confidence=result.confidence,
            decision=decision,
            explanation=result.explanation,
        )
```
Step 6: Metrics and optimization
Define moderation metrics
```python
def moderation_metric(example, prediction, trace=None):
    """Weighted score: type matters more than severity."""
    type_correct = float(prediction.violation_type == example.violation_type)
    severity_correct = float(prediction.severity == example.severity)
    return 0.7 * type_correct + 0.3 * severity_correct
```
Per-category metrics (more useful than overall accuracy)
```python
def make_category_metric(category):
    """Create a per-category metric for a specific violation category."""
    def metric(example, prediction, trace=None):
        # Did we correctly identify this category?
        if example.violation_type == category:
            return float(prediction.violation_type == category)  # recall
        else:
            return float(prediction.violation_type != category)  # precision
    return metric
```
Track each category separately
```python
hate_speech_metric = make_category_metric("hate_speech")
spam_metric = make_category_metric("spam")
```
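To sanity-check how the per-category metric scores each side, you can run it against hand-built examples. The helper is repeated below so the snippet runs standalone, and the `SimpleNamespace` objects just mimic the `.violation_type` attribute that `dspy.Example` and `dspy.Prediction` expose:

```python
from types import SimpleNamespace

def make_category_metric(category):
    """Same per-category helper as above, repeated for a standalone demo."""
    def metric(example, prediction, trace=None):
        if example.violation_type == category:
            return float(prediction.violation_type == category)  # recall side
        else:
            return float(prediction.violation_type != category)  # precision side
    return metric

spam_metric = make_category_metric("spam")

# True spam correctly caught: scores 1.0 (recall credit)
hit = spam_metric(SimpleNamespace(violation_type="spam"),
                  SimpleNamespace(violation_type="spam"))
# Safe content wrongly flagged as spam: scores 0.0 (precision penalty)
miss = spam_metric(SimpleNamespace(violation_type="safe"),
                   SimpleNamespace(violation_type="spam"))
print(hit, miss)  # 1.0 0.0
```

Averaged over a labeled set, the metric blends precision and recall for that one category, which is exactly what you want to watch when tuning a high-harm class like hate speech.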
Optimize the moderator
```python
# Prepare labeled training data
trainset = [
    dspy.Example(
        content="Buy cheap watches at spam-site.com!!!",
        platform_context="product review",
        violation_type="spam",
        severity="medium",
    ).with_inputs("content", "platform_context"),
    dspy.Example(
        content="This product changed my life, highly recommend!",
        platform_context="product review",
        violation_type="safe",
        severity="none",
    ).with_inputs("content", "platform_context"),
    # 50-200 labeled examples for good optimization
]

optimizer = dspy.MIPROv2(metric=moderation_metric, auto="medium")
optimized = optimizer.compile(moderator, trainset=trainset)
```
Step 7: Handle tricky cases
Brief notes on content that's hard to moderate correctly:
- Sarcasm and satire — "Oh sure, what a great product" isn't hate speech. Context matters; the `platform_context` field helps here.
- Quoting to criticize — "The seller said 'you're an idiot'" is reporting harassment, not committing it. Include instructions in your signature to distinguish the two.
- Code snippets — Variable names or test strings might contain offensive words. If your platform hosts code, add a code-detection step before moderation.
- Non-English content — LMs handle major languages well but may miss nuance in less-common languages. Consider language-specific test sets.
- Adversarial evasion — Users will try to bypass moderation (leetspeak, Unicode tricks, word splitting). Test your moderator with /ai-testing-safety.
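One inexpensive first defense against the evasion tricks above is to normalize content into a canonical match string before running blocklist or pattern checks. A sketch; the leetspeak map is a tiny illustrative sample, and cross-script homoglyphs (e.g. Cyrillic lookalikes) need a dedicated confusables table such as Unicode TR39:

```python
import unicodedata

# Tiny illustrative leetspeak map; a real deployment needs a fuller table.
LEET = str.maketrans({"0": "o", "1": "i", "3": "e", "4": "a",
                      "5": "s", "7": "t", "@": "a", "$": "s"})

def normalize_for_matching(text: str) -> str:
    """Produce a canonical string for blocklist/pattern checks.
    Use only for detection, never as a replacement for the content itself."""
    text = unicodedata.normalize("NFKD", text)      # fold fullwidth/compatibility forms
    text = text.encode("ascii", "ignore").decode()  # strip accents and combining marks
    text = text.lower().translate(LEET)             # undo common leetspeak substitutions
    return "".join(text.split())                    # defeat word splitting ("h a t e")
```

For example, `"hate" in normalize_for_matching("h4t3")` catches a substitution the raw string would miss, before the LM even runs.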
Tips
- False positives hurt more than false negatives. Over-moderation kills user engagement. Tune your confidence threshold to minimize false positives, especially for borderline content.
- Build separate metrics per category. You care more about catching hate speech (high harm) than catching mild spam (low harm).
- Use a stronger model for uncertain cases. Route low-confidence decisions from GPT-4o-mini to GPT-4o for a second opinion.
- Log every decision. Moderator decisions should be reviewable and auditable. See /ai-monitoring for production logging patterns.
- Test your moderator adversarially. Users will try to evade moderation. Run /ai-testing-safety against your moderator.
- Start permissive, tighten later. It's easier to add restrictions than to regain user trust after over-moderating.
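The "stronger model for uncertain cases" tip can be wrapped as a small escalation helper. The two callables below stand in for whatever cheap and strong moderators you run (the stub lambdas and their confidence values are purely illustrative):

```python
def escalate(content, cheap_moderate, strong_moderate, threshold=0.7):
    """Run the cheap moderator first; re-run with the strong moderator
    only when the cheap model's confidence falls below the threshold."""
    result = cheap_moderate(content)
    if result["confidence"] < threshold:
        result = strong_moderate(content)
        result["escalated"] = True
    else:
        result["escalated"] = False
    return result

# Stub moderators for illustration; real ones would call e.g. GPT-4o-mini / GPT-4o.
cheap = lambda c: {"decision": "approve", "confidence": 0.4}
strong = lambda c: {"decision": "human_review", "confidence": 0.9}
print(escalate("borderline post", cheap, strong))
# -> {'decision': 'human_review', 'confidence': 0.9, 'escalated': True}
```

Since most content is clearly safe, the strong model only sees the small low-confidence slice, which keeps cost close to cheap-model-only.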
Additional resources
- Use /ai-sorting for general classification patterns (moderation is classification + routing)
- Use /ai-checking-outputs for output guardrails on your own AI's responses
- Use /ai-testing-safety to adversarially test your moderator
- Use /ai-monitoring to track moderation quality in production
- See examples.md for complete worked examples