ai-moderating-content


Auto-Moderate What Users Post

Guide the user through building AI content moderation — classify user-generated content, score severity, and route decisions (auto-approve, human-review, auto-reject). The pattern: classify, score, route.

When you need content moderation

  • User-generated content (comments, posts, reviews, messages)
  • Community platforms and forums
  • Marketplace listings (product descriptions, seller profiles)
  • Chat and messaging features
  • Any surface where users create content that others see

Step 1: Define your moderation policy

Ask the user:
  1. What content do you need to catch? (hate speech, spam, NSFW, harassment, self-harm, illegal activity, PII)
  2. What are the severity levels? (warning, remove, ban)
  3. What's the tolerance for false positives? (over-moderating frustrates users)
  4. Is human review in the loop? (auto-only vs. auto + human escalation)
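The answers to these questions can be captured in one small policy object that the later steps consult for routing. A minimal sketch, with illustrative field names and defaults (nothing here comes from a library):

```python
from dataclasses import dataclass

@dataclass
class ModerationPolicy:
    """Illustrative policy object distilled from the four questions above."""
    # Q1: what to catch
    violation_types: tuple = ("spam", "hate_speech", "harassment", "nsfw", "self_harm", "illegal")
    # Q2: severity levels, in escalating order
    severity_levels: tuple = ("none", "low", "medium", "high")
    # Q3: higher threshold means more escalation to humans, fewer false auto-actions
    confidence_threshold: float = 0.7
    # Q4: is a human reviewer in the loop?
    human_review_enabled: bool = True

    def decision_for(self, severity: str) -> str:
        """Map a severity level to a routing decision."""
        if severity == "high":
            return "remove"
        if severity == "medium":
            return "human_review" if self.human_review_enabled else "remove"
        if severity == "low":
            return "warn"
        return "approve"

policy = ModerationPolicy()
print(policy.decision_for("medium"))  # "human_review"
```

Keeping the policy in one place makes Step 2's routing logic a lookup rather than scattered if/elif chains, and lets auto-only platforms flip `human_review_enabled` off in one spot.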

Step 2: Build the moderator

Classification + severity scoring + routing decision:
```python
import dspy
from typing import Literal

VIOLATIONS = Literal[
    "safe", "spam", "hate_speech", "harassment",
    "violence", "nsfw", "self_harm", "illegal",
]

class ModerateContent(dspy.Signature):
    """Assess user-generated content against platform policies."""
    content: str = dspy.InputField(desc="user-generated content to moderate")
    platform_context: str = dspy.InputField(desc="where this content appears, e.g. 'product review'")
    violation_type: VIOLATIONS = dspy.OutputField()
    severity: Literal["none", "low", "medium", "high"] = dspy.OutputField()
    explanation: str = dspy.OutputField(desc="brief reason for the decision")

class ContentModerator(dspy.Module):
    def __init__(self):
        super().__init__()
        self.assess = dspy.ChainOfThought(ModerateContent)

    def forward(self, content, platform_context="social media post"):
        result = self.assess(content=content, platform_context=platform_context)

        # Route based on severity
        if result.severity == "high":
            decision = "remove"
        elif result.severity == "medium":
            decision = "human_review"
        elif result.severity == "low":
            decision = "warn"
        else:
            decision = "approve"

        return dspy.Prediction(
            violation_type=result.violation_type,
            severity=result.severity,
            decision=decision,
            explanation=result.explanation,
        )
```

Usage

```python
moderator = ContentModerator()

result = moderator(content="Great product, works exactly as described!")
print(result.decision)  # "approve"

result = moderator(content="This seller is a scammer, I'll find where they live")
print(result.decision)        # "remove"
print(result.violation_type)  # "harassment"
```

Step 3: Multi-label moderation

Content can violate multiple policies at once (e.g., spam and contains PII):
```python
VIOLATION_TYPES = ["safe", "spam", "hate_speech", "harassment", "violence", "nsfw", "self_harm", "illegal"]

class MultiLabelModerate(dspy.Signature):
    """Flag all policy violations in user content. Content may have multiple violations."""
    content: str = dspy.InputField()
    platform_context: str = dspy.InputField()
    violations: list[str] = dspy.OutputField(desc=f"all that apply from: {VIOLATION_TYPES}")
    severity: Literal["none", "low", "medium", "high"] = dspy.OutputField(
        desc="overall severity based on the worst violation"
    )
    explanation: str = dspy.OutputField()

class MultiLabelModerator(dspy.Module):
    def __init__(self):
        super().__init__()
        self.assess = dspy.ChainOfThought(MultiLabelModerate)

    def forward(self, content, platform_context=""):
        result = self.assess(content=content, platform_context=platform_context)

        # Validate that returned violations are from the allowed set
        dspy.Assert(
            all(v in VIOLATION_TYPES for v in result.violations),
            f"Violations must be from: {VIOLATION_TYPES}",
        )

        return result
```

Step 4: Hard blocks with assertions

For zero-tolerance patterns, don't even ask the LM — block instantly with pattern matching:
```python
import re

class StrictModerator(dspy.Module):
    def __init__(self):
        super().__init__()
        self.assess = dspy.ChainOfThought(ModerateContent)

    def forward(self, content, platform_context=""):
        # Pattern-based hard blocks (instant, no LM needed)
        dspy.Assert(
            not re.search(r"\b\d{3}-\d{2}-\d{4}\b", content),
            "Content contains SSN pattern — auto-reject",
        )
        dspy.Assert(
            not re.search(
                r"\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b",
                content,
            ),
            "Content contains email addresses — redact before posting",
        )
        dspy.Assert(
            not re.search(r"\b\d{16}\b", content),
            "Content contains potential credit card number — auto-reject",
        )

        # LM-based assessment for everything else
        return self.assess(content=content, platform_context=platform_context)
```
Pattern-based blocks are faster, cheaper, and more reliable than LM-based detection for well-defined patterns (SSNs, credit cards, emails). Use regex for structure, LMs for semantics.
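The same structural checks also work as a plain pre-filter in front of any moderator, DSPy or not. A sketch (the function name is hypothetical, and these patterns are illustrative; real PII detection needs broader coverage than three regexes):

```python
import re

# Illustrative structural patterns — real deployments need broader coverage
PII_PATTERNS = {
    "ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "email": re.compile(r"\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b"),
    "credit_card": re.compile(r"\b\d{16}\b"),
}

def prefilter(content: str) -> list[str]:
    """Return the names of all structural patterns found in the content."""
    return [name for name, pattern in PII_PATTERNS.items() if pattern.search(content)]

print(prefilter("Contact me at alice@example.com"))  # ['email']
print(prefilter("Great product!"))                   # []
```

Running this before the LM call means clean content skips nothing, while flagged content can be rejected or redacted without spending a single token.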

Step 5: Confidence-based routing

Route uncertain decisions to human reviewers instead of making bad calls:
```python
class ConfidentModerate(dspy.Signature):
    """Moderate content and rate your confidence in the assessment."""
    content: str = dspy.InputField()
    platform_context: str = dspy.InputField()
    violation_type: VIOLATIONS = dspy.OutputField()
    severity: Literal["none", "low", "medium", "high"] = dspy.OutputField()
    confidence: float = dspy.OutputField(desc="0.0 to 1.0 — how sure are you about this assessment?")
    explanation: str = dspy.OutputField()

class ConfidentModerator(dspy.Module):
    def __init__(self, confidence_threshold=0.7):
        super().__init__()
        self.assess = dspy.ChainOfThought(ConfidentModerate)
        self.confidence_threshold = confidence_threshold

    def forward(self, content, platform_context=""):
        result = self.assess(content=content, platform_context=platform_context)

        # Validate confidence range
        dspy.Assert(
            0.0 <= result.confidence <= 1.0,
            "Confidence must be between 0.0 and 1.0",
        )

        # Route based on confidence + severity
        if result.confidence < self.confidence_threshold:
            decision = "human_review"  # uncertain → always escalate
        elif result.severity == "high":
            decision = "remove"
        elif result.severity == "medium":
            decision = "human_review"
        elif result.severity == "low":
            decision = "warn"
        else:
            decision = "approve"

        return dspy.Prediction(
            violation_type=result.violation_type,
            severity=result.severity,
            confidence=result.confidence,
            decision=decision,
            explanation=result.explanation,
        )
```

Step 6: Metrics and optimization

Define moderation metrics

```python
def moderation_metric(example, prediction, trace=None):
    """Weighted score: type matters more than severity."""
    type_correct = float(prediction.violation_type == example.violation_type)
    severity_correct = float(prediction.severity == example.severity)
    return 0.7 * type_correct + 0.3 * severity_correct
```

Per-category metrics (more useful than overall accuracy)

```python
def make_category_metric(category):
    """Create a metric that scores predictions for a single violation category."""
    def metric(example, prediction, trace=None):
        # Did we correctly identify this category?
        if example.violation_type == category:
            return float(prediction.violation_type == category)  # recall side
        else:
            return float(prediction.violation_type != category)  # precision side
    return metric
```

Track each category separately

```python
hate_speech_metric = make_category_metric("hate_speech")
spam_metric = make_category_metric("spam")
```
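Aggregated over a labeled evaluation set, these per-category metrics show which violation types the moderator actually misses. A sketch using hypothetical stub predictions (`make_category_metric` is repeated so the snippet runs standalone):

```python
from types import SimpleNamespace

def make_category_metric(category):
    """Per-category metric, repeated from above so this snippet runs standalone."""
    def metric(example, prediction, trace=None):
        if example.violation_type == category:
            return float(prediction.violation_type == category)
        return float(prediction.violation_type != category)
    return metric

# Hypothetical labeled examples paired with stub model predictions
pairs = [
    (SimpleNamespace(violation_type="spam"), SimpleNamespace(violation_type="spam")),
    (SimpleNamespace(violation_type="spam"), SimpleNamespace(violation_type="safe")),
    (SimpleNamespace(violation_type="hate_speech"), SimpleNamespace(violation_type="hate_speech")),
]

spam_metric = make_category_metric("spam")
spam_score = sum(spam_metric(ex, pred) for ex, pred in pairs) / len(pairs)
print(f"spam score: {spam_score:.2f}")  # spam score: 0.67
```

In production the pairs would come from running the moderator over a held-out labeled set; a low score on one category tells you where to add training examples.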

Optimize the moderator

Prepare labeled training data

```python
trainset = [
    dspy.Example(
        content="Buy cheap watches at spam-site.com!!!",
        platform_context="product review",
        violation_type="spam",
        severity="medium",
    ).with_inputs("content", "platform_context"),
    dspy.Example(
        content="This product changed my life, highly recommend!",
        platform_context="product review",
        violation_type="safe",
        severity="none",
    ).with_inputs("content", "platform_context"),
    # 50-200 labeled examples for good optimization
]

optimizer = dspy.MIPROv2(metric=moderation_metric, auto="medium")
optimized = optimizer.compile(moderator, trainset=trainset)
```

Step 7: Handle tricky cases

Brief notes on content that's hard to moderate correctly:
  • Sarcasm and satire — "Oh sure, what a great product" isn't hate speech. Context matters; the platform_context field helps here.
  • Quoting to criticize — "The seller said 'you're an idiot'" is reporting harassment, not committing it. Include instructions in your signature to distinguish.
  • Code snippets — Variable names or test strings might contain offensive words. If your platform has code, add a code-detection step before moderation.
  • Non-English content — LMs handle major languages well but may miss nuance in less-common languages. Consider language-specific test sets.
  • Adversarial evasion — Users will try to bypass moderation (leetspeak, Unicode tricks, word splitting). Test your moderator with /ai-testing-safety.
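A cheap first line of defense against evasion is normalizing text before it reaches the moderator: fold Unicode lookalikes with NFKC, strip zero-width characters, and undo common leetspeak. A sketch with a deliberately tiny substitution table (real evasion catalogs are much larger):

```python
import unicodedata

# Illustrative leetspeak map — real evasion catalogs are much larger
LEET = str.maketrans({"0": "o", "1": "i", "3": "e", "4": "a", "5": "s", "7": "t", "@": "a", "$": "s"})

def normalize(text: str) -> str:
    """Fold Unicode confusables (NFKC), strip zero-width chars, undo basic leetspeak."""
    text = unicodedata.normalize("NFKC", text)
    text = text.replace("\u200b", "").replace("\u200c", "").replace("\u200d", "")
    return text.lower().translate(LEET)

print(normalize("fr3e c4sh"))  # "free cash"
```

Run the moderator on the normalized text but act on (and log) the original, so reviewers see exactly what the user posted.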

Tips

  • False positives hurt more than false negatives. Over-moderation kills user engagement. Tune your confidence threshold to minimize false positives, especially for borderline content.
  • Build separate metrics per category. You care more about catching hate speech (high harm) than catching mild spam (low harm).
  • Use a stronger model for uncertain cases. Route low-confidence decisions from GPT-4o-mini to GPT-4o for a second opinion.
  • Log every decision. Moderator decisions should be reviewable and auditable. See /ai-monitoring for production logging patterns.
  • Test your moderator adversarially. Users will try to evade moderation. Run /ai-testing-safety against your moderator.
  • Start permissive, tighten later. It's easier to add restrictions than to regain user trust after over-moderating.
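The stronger-model tip reduces to a small escalation wrapper: accept the first-pass result when it is confident, otherwise defer to a second, stronger assessor. A sketch with stub dictionaries standing in for moderator outputs (`escalate_fn` is a hypothetical hook, e.g. the same module re-run under a stronger LM):

```python
def second_opinion(first: dict, escalate_fn, threshold: float = 0.7) -> dict:
    """Return the first-pass assessment if confident, else defer to a stronger model."""
    if first["confidence"] >= threshold:
        return first
    return escalate_fn()  # e.g. re-run the moderator with a stronger LM configured

# Stub escalation standing in for a stronger-model call
result = second_opinion(
    {"decision": "warn", "confidence": 0.4},
    escalate_fn=lambda: {"decision": "remove", "confidence": 0.9},
)
print(result["decision"])  # "remove"
```

Because most content is clearly safe or clearly bad, the expensive model only runs on the uncertain slice, which keeps the cost of a two-tier setup close to single-tier.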

Additional resources

  • Use /ai-sorting for general classification patterns (moderation is classification + routing)
  • Use /ai-checking-outputs for output guardrails on your own AI's responses
  • Use /ai-testing-safety to adversarially test your moderator
  • Use /ai-monitoring to track moderation quality in production
  • See examples.md for complete worked examples