constitutional-ai
Constitutional AI - Harmlessness from AI Feedback
Quick start
Constitutional AI (CAI) trains models to be harmless through self-critique and AI feedback, without requiring human labels for harmful outputs.
Key concept: Models learn to critique and revise their own responses using a "constitution" (set of principles).
Two phases:
- Supervised Learning (SL): Self-critique + revision
- Reinforcement Learning (RL): RLAIF (RL from AI Feedback)
Constitution example:
Principles:
1. Choose the response that is most helpful, honest, and harmless
2. Avoid responses that are toxic, racist, or sexist
3. Prefer responses that explain objections rather than refuse
4. Choose responses that are thoughtful and nuanced
Common workflows
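The workflows below interpolate the constitution into prompt templates (Workflow 2 references a `CONSTITUTION` value that is never defined in this guide). A minimal sketch, simply packaging the principles above as a string constant:

```python
# Hypothetical constant: the constitution from the quick start,
# formatted for prompt templates that interpolate {constitution}
CONSTITUTION = """\
1. Choose the response that is most helpful, honest, and harmless
2. Avoid responses that are toxic, racist, or sexist
3. Prefer responses that explain objections rather than refuse
4. Choose responses that are thoughtful and nuanced"""
```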
Workflow 1: Supervised learning phase (self-critique + revision)
**Step 1: Generate initial responses**:

```python
from transformers import pipeline

generator = pipeline("text-generation", model="base-model")

prompts = [
    "How do I hack a website?",
    "Write something offensive about a group.",
    "Help me cheat on my exam."
]

# pipeline() returns [{"generated_text": ...}] per prompt; keep just the
# text so later steps can treat each response as a string
initial_responses = [out[0]["generated_text"]
                     for out in generator(prompts, max_length=200)]
```

**Step 2: Self-critique with constitution**:
```python
critique_prompt = """
Given the question:
{question}

And the response:
{response}

Constitution:
1. Choose responses that are helpful, honest, and harmless
2. Avoid toxic, racist, or sexist content
3. Explain objections rather than refuse

Identify any ways this response violates the constitution:
"""

# format() keys must match the {question}/{response} placeholders
critiques = [out[0]["generated_text"] for out in generator(
    [critique_prompt.format(question=q, response=r)
     for q, r in zip(prompts, initial_responses)]
)]
```

**Step 3: Revision based on critique**:
```python
revision_prompt = """
Question: {question}
Original response: {response}
Critique: {critique}

Please revise the response to better align with the constitution:
"""

revised_responses = [out[0]["generated_text"] for out in generator(
    [revision_prompt.format(question=q, response=r, critique=c)
     for q, r, c in zip(prompts, initial_responses, critiques)]
)]
```

**Step 4: Fine-tune on revised responses**:
```python
from trl import SFTTrainer

# Create dataset of (prompt, revised_response) pairs
dataset = create_dataset(prompts, revised_responses)

trainer = SFTTrainer(
    model=model,
    train_dataset=dataset,
    max_seq_length=1024
)
trainer.train()
```
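The `create_dataset` helper used in Step 4 is not defined in this guide. A minimal sketch, assuming the common single-turn SFT format where each record carries one `text` field (the field name and question/answer layout are assumptions; adapt them to your chat template):

```python
def create_dataset(prompts, revised_responses):
    # Hypothetical helper: pair each prompt with its revised response
    # in a single "text" field per training record
    return [
        {"text": f"Question: {p}\nAnswer: {r}"}
        for p, r in zip(prompts, revised_responses)
    ]
```

In practice the resulting list of records would still need to be wrapped, e.g. with `datasets.Dataset.from_list`, before being handed to `SFTTrainer`.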
Workflow 2: RL phase (RLAIF - RL from AI Feedback)
**Step 1: Generate comparison pairs**:

```python
# Sample two responses per prompt (one from each call) to form A/B pairs
responses_a = [out[0]["generated_text"]
               for out in generator(prompts, do_sample=True, temperature=0.8)]
responses_b = [out[0]["generated_text"]
               for out in generator(prompts, do_sample=True, temperature=0.8)]
```
**Step 2: AI preference evaluation**:

```python
preference_prompt = """
Question: {question}

Response A: {response_a}

Response B: {response_b}

Constitution:
{constitution}

Which response better follows the constitution? Explain your reasoning, then choose A or B.
"""

# Get AI preferences (no human labels needed!)
preferences = [out[0]["generated_text"] for out in generator(
    [preference_prompt.format(question=q, response_a=ra, response_b=rb,
                              constitution=CONSTITUTION)
     for q, ra, rb in zip(prompts, responses_a, responses_b)]
)]

# Parse preferences (A or B)
chosen, rejected = parse_preferences(preferences, responses_a, responses_b)
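`parse_preferences` is left undefined above. One minimal sketch, assuming each evaluator reply ends with its verdict so the last occurrence of "A" or "B" can be taken as the choice (a real implementation would want more robust parsing, e.g. a constrained output format):

```python
def parse_preferences(preferences, responses_a, responses_b):
    # Hypothetical parser: scan the evaluator's text backwards and take
    # the last "A" or "B" as the verdict (defaulting to "A")
    chosen, rejected = [], []
    for verdict_text, ra, rb in zip(preferences, responses_a, responses_b):
        verdict = "A"
        for ch in reversed(verdict_text):
            if ch in ("A", "B"):
                verdict = ch
                break
        if verdict == "A":
            chosen.append(ra)
            rejected.append(rb)
        else:
            chosen.append(rb)
            rejected.append(ra)
    return chosen, rejected
```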
**Step 3: Train preference model (reward model)**:

```python
from trl import RewardTrainer, RewardConfig

preference_dataset = create_preference_dataset(prompts, chosen, rejected)

reward_config = RewardConfig(
    output_dir="constitutional-reward-model",
    learning_rate=1e-5,
    num_train_epochs=1
)

reward_trainer = RewardTrainer(
    model=model,
    args=reward_config,
    train_dataset=preference_dataset,
    processing_class=tokenizer
)
reward_trainer.train()
```

**Step 4: RL training with RLAIF**:

```python
from trl import PPOTrainer, PPOConfig

ppo_config = PPOConfig(
    reward_model_path="constitutional-reward-model",
    learning_rate=1e-6,
    kl_coef=0.05
)

ppo_trainer = PPOTrainer(
    model=model,
    config=ppo_config,
    reward_model=reward_model
)
ppo_trainer.train()
```
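Step 3 relies on a `create_preference_dataset` helper that is not defined here. A minimal sketch producing `prompt`/`chosen`/`rejected` records in the style TRL's `RewardTrainer` consumes (the exact record layout is an assumption):

```python
def create_preference_dataset(prompts, chosen, rejected):
    # Hypothetical helper: one record per prompt, pairing the preferred
    # and dispreferred responses side by side
    return [
        {"prompt": p, "chosen": c, "rejected": r}
        for p, c, r in zip(prompts, chosen, rejected)
    ]
```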
Workflow 3: Chain-of-thought critique
Enable reasoning transparency:

```python
cot_critique_prompt = """
Question: {question}
Response: {response}

Let's think step-by-step about whether this response follows our principles:
1. Is it helpful? [Yes/No and reasoning]
2. Is it honest? [Yes/No and reasoning]
3. Is it harmless? [Yes/No and reasoning]
4. Does it avoid toxicity? [Yes/No and reasoning]

Based on this analysis, suggest a revision if needed.
"""

cot_critiques = generator(
    [cot_critique_prompt.format(question=q, response=r)
     for q, r in zip(prompts, responses)]
)
```
When to use vs alternatives
Use Constitutional AI when:
- Want safety alignment without human labels
- Need explainable AI decisions
- Want to avoid evasive refusals
- Have a clear set of principles/constitution
- Need scalable safety training
Key techniques:
- RLAIF: AI-generated preferences (scalable, no human labels)
- RLHF: Human preferences (more accurate, expensive)
- Self-critique: Iterative improvement
- Chain-of-thought: Reasoning transparency
Use alternatives instead when:
- RLHF (PPO): Need human-validated safety
- DPO/SimPO: Have human preference data
- NeMo Guardrails: Need runtime content filtering
- LlamaGuard: Need pre-trained moderation model
Common issues
Issue: Model refuses too much (evasive)

Add constitution principle:

```
Prefer responses that engage thoughtfully with questions rather than
refusing to answer. Explain concerns while still being helpful.
```

Issue: Self-critiques are weak

Use stronger critique prompts:

```
Critically analyze this response for ANY potential issues, however minor.
Be thorough and specific in identifying problems.
```

Issue: Revisions don't improve quality

Iterate multiple times:

```python
for _ in range(3):  # 3 rounds of critique/revision
    critique = generate_critique(response)
    response = generate_revision(response, critique)
```

Issue: RLAIF preferences are noisy
Use multiple AI evaluators:
```python
# Get preferences from 3 different models
prefs_1 = model_1.evaluate(responses)
prefs_2 = model_2.evaluate(responses)
prefs_3 = model_3.evaluate(responses)

# Majority vote
final_preference = majority_vote(prefs_1, prefs_2, prefs_3)
```
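`majority_vote` is not defined in this guide. A minimal sketch, assuming each evaluator returns one "A"/"B" label per prompt:

```python
from collections import Counter

def majority_vote(*preference_lists):
    # Hypothetical helper: per prompt, keep the label chosen by the
    # most evaluators (Counter breaks ties by first-seen order)
    return [
        Counter(votes).most_common(1)[0][0]
        for votes in zip(*preference_lists)
    ]
```

With an odd number of evaluators there is always a strict majority, which is why three models are suggested above.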
Advanced topics
Constitution design: See references/constitution-design.md for principle selection, trade-offs between helpfulness and harmlessness, and domain-specific constitutions.
RLAIF vs RLHF: See references/rlaif-comparison.md for performance comparison, cost analysis, and when to use AI feedback vs human feedback.
Chain-of-thought reasoning: See references/cot-critique.md for prompt engineering for critiques, multi-step reasoning, and transparency improvements.
Hardware requirements
- GPU: NVIDIA A100/H100 recommended
- VRAM:
- SL phase (7B): 1× A100 40GB
- RL phase (7B): 2× A100 40GB (policy + reward model)
- Single-node: Sufficient for most use cases
- Mixed precision: BF16 recommended
Compute requirements:
- SL phase: Similar to standard SFT
- RL phase: Similar to PPO (higher than DPO)
- AI evaluation: Additional inference for critique/preference generation
Resources
- Paper: https://arxiv.org/abs/2212.08073 (Dec 2022)
- Anthropic blog: https://www.anthropic.com/research/constitutional-ai-harmlessness-from-ai-feedback
- Implementation: TRL (PPOTrainer + RewardTrainer)
- Claude: Uses Constitutional AI for safety