ai-moderating-content
Auto-Moderate What Users Post
Guide the user through building AI content moderation — classify user-generated content, score severity, and route decisions (auto-approve, human-review, auto-reject). The pattern: classify, score, route.
When you need content moderation
- User-generated content (comments, posts, reviews, messages)
- Community platforms and forums
- Marketplace listings (product descriptions, seller profiles)
- Chat and messaging features
- Any surface where users create content that others see
Step 1: Define your moderation policy
Ask the user:
- What content do you need to catch? (hate speech, spam, NSFW, harassment, self-harm, illegal activity, PII)
- What are the severity levels? (warning, remove, ban)
- What's the tolerance for false positives? (over-moderating frustrates users)
- Is human review in the loop? (auto-only vs. auto + human escalation)
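The answers to these questions can be captured as a small policy table that the routing code reads, so a policy change doesn't require touching the moderator itself. A minimal sketch; the category names, actions, and review flags below are illustrative, not a recommended policy:

```python
# Illustrative policy table: maps each violation category to a default
# action and whether a human must confirm before the action takes effect.
POLICY = {
    "spam":        {"action": "remove", "human_review": False},
    "hate_speech": {"action": "remove", "human_review": True},
    "harassment":  {"action": "remove", "human_review": True},
    "pii":         {"action": "redact", "human_review": False},
    "self_harm":   {"action": "human_review", "human_review": True},
}

def route(violation: str) -> str:
    """Look up the configured action; unknown categories escalate to a human."""
    entry = POLICY.get(violation)
    return entry["action"] if entry else "human_review"
```

Keeping the policy as data also gives you one obvious place to audit when tolerance for false positives changes.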
Step 2: Build the moderator
Classification + severity scoring + routing decision:

```python
import dspy
from typing import Literal

VIOLATIONS = Literal[
    "safe", "spam", "hate_speech", "harassment",
    "violence", "nsfw", "self_harm", "illegal",
]

class ModerateContent(dspy.Signature):
    """Assess user-generated content against platform policies."""

    content: str = dspy.InputField(desc="user-generated content to moderate")
    platform_context: str = dspy.InputField(desc="where this content appears, e.g. 'product review'")
    violation_type: VIOLATIONS = dspy.OutputField()
    severity: Literal["none", "low", "medium", "high"] = dspy.OutputField()
    explanation: str = dspy.OutputField(desc="brief reason for the decision")

class ContentModerator(dspy.Module):
    def __init__(self):
        super().__init__()
        self.assess = dspy.ChainOfThought(ModerateContent)

    def forward(self, content, platform_context="social media post"):
        result = self.assess(content=content, platform_context=platform_context)

        # Route based on severity
        if result.severity == "high":
            decision = "remove"
        elif result.severity == "medium":
            decision = "human_review"
        elif result.severity == "low":
            decision = "warn"
        else:
            decision = "approve"

        return dspy.Prediction(
            violation_type=result.violation_type,
            severity=result.severity,
            decision=decision,
            explanation=result.explanation,
        )
```
Usage

```python
moderator = ContentModerator()

result = moderator(content="Great product, works exactly as described!")
print(result.decision)  # "approve"

result = moderator(content="This seller is a scammer, I'll find where they live")
print(result.decision)        # "remove"
print(result.violation_type)  # "harassment"
```
Step 3: Multi-label moderation
Content can violate multiple policies at once (e.g., spam and contains PII):
```python
VIOLATION_TYPES = ["safe", "spam", "hate_speech", "harassment", "violence", "nsfw", "self_harm", "illegal"]

class MultiLabelModerate(dspy.Signature):
    """Flag all policy violations in user content. Content may have multiple violations."""

    content: str = dspy.InputField()
    platform_context: str = dspy.InputField()
    violations: list[str] = dspy.OutputField(desc=f"all that apply from: {VIOLATION_TYPES}")
    severity: Literal["none", "low", "medium", "high"] = dspy.OutputField(
        desc="overall severity based on the worst violation"
    )
    explanation: str = dspy.OutputField()

class MultiLabelModerator(dspy.Module):
    def __init__(self):
        super().__init__()
        self.assess = dspy.ChainOfThought(MultiLabelModerate)

    def forward(self, content, platform_context=""):
        result = self.assess(content=content, platform_context=platform_context)

        # Validate that returned violations are from the allowed set
        dspy.Assert(
            all(v in VIOLATION_TYPES for v in result.violations),
            f"Violations must be from: {VIOLATION_TYPES}",
        )
        return result
```
Step 4: Hard blocks with assertions
For zero-tolerance patterns, don't even ask the LM — block instantly with pattern matching:
```python
import re

class StrictModerator(dspy.Module):
    def __init__(self):
        super().__init__()
        self.assess = dspy.ChainOfThought(ModerateContent)

    def forward(self, content, platform_context=""):
        # Pattern-based hard blocks (instant, no LM needed)
        dspy.Assert(
            not re.search(r"\b\d{3}-\d{2}-\d{4}\b", content),
            "Content contains SSN pattern — auto-reject",
        )
        dspy.Assert(
            not re.search(
                r"\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b",
                content,
            ),
            "Content contains an email address — redact before posting",
        )
        dspy.Assert(
            not re.search(r"\b\d{16}\b", content),
            "Content contains a potential credit card number — auto-reject",
        )

        # LM-based assessment for everything else
        return self.assess(content=content, platform_context=platform_context)
```

Pattern-based blocks are faster, cheaper, and more reliable than LM-based detection for well-defined patterns (SSNs, credit cards, emails). Use regex for structure, LMs for semantics.
Step 5: Confidence-based routing
Route uncertain decisions to human reviewers instead of making bad calls:
```python
class ConfidentModerate(dspy.Signature):
    """Moderate content and rate your confidence in the assessment."""

    content: str = dspy.InputField()
    platform_context: str = dspy.InputField()
    violation_type: VIOLATIONS = dspy.OutputField()
    severity: Literal["none", "low", "medium", "high"] = dspy.OutputField()
    confidence: float = dspy.OutputField(desc="0.0 to 1.0 — how sure are you about this assessment?")
    explanation: str = dspy.OutputField()

class ConfidentModerator(dspy.Module):
    def __init__(self, confidence_threshold=0.7):
        super().__init__()
        self.assess = dspy.ChainOfThought(ConfidentModerate)
        self.confidence_threshold = confidence_threshold

    def forward(self, content, platform_context=""):
        result = self.assess(content=content, platform_context=platform_context)

        # Validate confidence range
        dspy.Assert(
            0.0 <= result.confidence <= 1.0,
            "Confidence must be between 0.0 and 1.0",
        )

        # Route based on confidence + severity
        if result.confidence < self.confidence_threshold:
            decision = "human_review"  # uncertain → always escalate
        elif result.severity == "high":
            decision = "remove"
        elif result.severity == "medium":
            decision = "human_review"
        elif result.severity == "low":
            decision = "warn"
        else:
            decision = "approve"

        return dspy.Prediction(
            violation_type=result.violation_type,
            severity=result.severity,
            confidence=result.confidence,
            decision=decision,
            explanation=result.explanation,
        )
```
Step 6: Metrics and optimization
Define moderation metrics
```python
def moderation_metric(example, prediction, trace=None):
    """Weighted score: type matters more than severity."""
    type_correct = float(prediction.violation_type == example.violation_type)
    severity_correct = float(prediction.severity == example.severity)
    return 0.7 * type_correct + 0.3 * severity_correct
```
Per-category metrics (more useful than overall accuracy)
```python
def make_category_metric(category):
    """Create a per-category metric for a specific violation category."""
    def metric(example, prediction, trace=None):
        # Did we correctly identify this category?
        if example.violation_type == category:
            return float(prediction.violation_type == category)  # recall
        else:
            return float(prediction.violation_type != category)  # precision
    return metric
```
Track each category separately
```python
hate_speech_metric = make_category_metric("hate_speech")
spam_metric = make_category_metric("spam")
```
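To sanity-check how the per-category metric scores each side, you can run it against hand-built examples. The helper is repeated below so the snippet runs standalone, and the `SimpleNamespace` objects just mimic the `.violation_type` attribute that `dspy.Example` and `dspy.Prediction` expose:

```python
from types import SimpleNamespace

def make_category_metric(category):
    """Same per-category helper as above, repeated for a standalone demo."""
    def metric(example, prediction, trace=None):
        if example.violation_type == category:
            return float(prediction.violation_type == category)  # recall side
        else:
            return float(prediction.violation_type != category)  # precision side
    return metric

spam_metric = make_category_metric("spam")

# True spam correctly caught: scores 1.0 (recall credit)
hit = spam_metric(SimpleNamespace(violation_type="spam"),
                  SimpleNamespace(violation_type="spam"))
# Safe content wrongly flagged as spam: scores 0.0 (precision penalty)
miss = spam_metric(SimpleNamespace(violation_type="safe"),
                   SimpleNamespace(violation_type="spam"))
print(hit, miss)  # 1.0 0.0
```

Averaged over a labeled set, the metric blends precision and recall for that one category, which is exactly what you want to watch when tuning a high-harm class like hate speech.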
Optimize the moderator
```python
# Prepare labeled training data
trainset = [
    dspy.Example(
        content="Buy cheap watches at spam-site.com!!!",
        platform_context="product review",
        violation_type="spam",
        severity="medium",
    ).with_inputs("content", "platform_context"),
    dspy.Example(
        content="This product changed my life, highly recommend!",
        platform_context="product review",
        violation_type="safe",
        severity="none",
    ).with_inputs("content", "platform_context"),
    # 50-200 labeled examples for good optimization
]

optimizer = dspy.MIPROv2(metric=moderation_metric, auto="medium")
optimized = optimizer.compile(moderator, trainset=trainset)
```
Step 7: Handle tricky cases
Brief notes on content that's hard to moderate correctly:
- Sarcasm and satire — "Oh sure, what a great product" isn't hate speech. Context matters; the `platform_context` field helps here.
- Quoting to criticize — "The seller said 'you're an idiot'" is reporting harassment, not committing it. Include instructions in your signature to distinguish the two.
- Code snippets — Variable names or test strings might contain offensive words. If your platform hosts code, add a code-detection step before moderation.
- Non-English content — LMs handle major languages well but may miss nuance in less-common languages. Consider language-specific test sets.
- Adversarial evasion — Users will try to bypass moderation (leetspeak, Unicode tricks, word splitting). Test your moderator with /ai-testing-safety.
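One inexpensive first defense against the evasion tricks above is to normalize content into a canonical match string before running blocklist or pattern checks. A sketch; the leetspeak map is a tiny illustrative sample, and cross-script homoglyphs (e.g. Cyrillic lookalikes) need a dedicated confusables table such as Unicode TR39:

```python
import unicodedata

# Tiny illustrative leetspeak map; a real deployment needs a fuller table.
LEET = str.maketrans({"0": "o", "1": "i", "3": "e", "4": "a",
                      "5": "s", "7": "t", "@": "a", "$": "s"})

def normalize_for_matching(text: str) -> str:
    """Produce a canonical string for blocklist/pattern checks.
    Use only for detection, never as a replacement for the content itself."""
    text = unicodedata.normalize("NFKD", text)      # fold fullwidth/compatibility forms
    text = text.encode("ascii", "ignore").decode()  # strip accents and combining marks
    text = text.lower().translate(LEET)             # undo common leetspeak substitutions
    return "".join(text.split())                    # defeat word splitting ("h a t e")
```

For example, `"hate" in normalize_for_matching("h4t3")` catches a substitution the raw string would miss, before the LM even runs.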
Tips
- False positives hurt more than false negatives. Over-moderation kills user engagement. Tune your confidence threshold to minimize false positives, especially for borderline content.
- Build separate metrics per category. You care more about catching hate speech (high harm) than catching mild spam (low harm).
- Use a stronger model for uncertain cases. Route low-confidence decisions from GPT-4o-mini to GPT-4o for a second opinion.
- Log every decision. Moderator decisions should be reviewable and auditable. See /ai-monitoring for production logging patterns.
- Test your moderator adversarially. Users will try to evade moderation. Run /ai-testing-safety against your moderator.
- Start permissive, tighten later. It's easier to add restrictions than to regain user trust after over-moderating.
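The "stronger model for uncertain cases" tip can be wrapped as a small escalation helper. The two callables below stand in for whatever cheap and strong moderators you run (the stub lambdas and their confidence values are purely illustrative):

```python
def escalate(content, cheap_moderate, strong_moderate, threshold=0.7):
    """Run the cheap moderator first; re-run with the strong moderator
    only when the cheap model's confidence falls below the threshold."""
    result = cheap_moderate(content)
    if result["confidence"] < threshold:
        result = strong_moderate(content)
        result["escalated"] = True
    else:
        result["escalated"] = False
    return result

# Stub moderators for illustration; real ones would call e.g. GPT-4o-mini / GPT-4o.
cheap = lambda c: {"decision": "approve", "confidence": 0.4}
strong = lambda c: {"decision": "human_review", "confidence": 0.9}
print(escalate("borderline post", cheap, strong))
# -> {'decision': 'human_review', 'confidence': 0.9, 'escalated': True}
```

Since most content is clearly safe, the strong model only sees the small low-confidence slice, which keeps cost close to cheap-model-only.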
Additional resources
- Use /ai-sorting for general classification patterns (moderation is classification + routing)
- Use /ai-checking-outputs for output guardrails on your own AI's responses
- Use /ai-testing-safety to adversarially test your moderator
- Use /ai-monitoring to track moderation quality in production
- See examples.md for complete worked examples