constitutional-ai
Constitutional AI - Harmlessness from AI Feedback
Quick start
Constitutional AI (CAI) trains models to be harmless through self-critique and AI feedback, without requiring human labels for harmful outputs.
Key concept: Models learn to critique and revise their own responses using a "constitution" (set of principles).
Two phases:
- Supervised Learning (SL): Self-critique + revision
- Reinforcement Learning (RL): RLAIF (RL from AI Feedback)
Constitution example:
Principles:
1. Choose the response that is most helpful, honest, and harmless
2. Avoid responses that are toxic, racist, or sexist
3. Prefer responses that explain objections rather than refuse
4. Choose responses that are thoughtful and nuanced
Common workflows
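The workflows below interpolate the constitution into prompt templates (Workflow 2 references a `CONSTITUTION` value that is never defined in this guide). A minimal sketch, simply packaging the principles above as a string constant:

```python
# Hypothetical constant: the constitution from the quick start,
# formatted for prompt templates that interpolate {constitution}
CONSTITUTION = """\
1. Choose the response that is most helpful, honest, and harmless
2. Avoid responses that are toxic, racist, or sexist
3. Prefer responses that explain objections rather than refuse
4. Choose responses that are thoughtful and nuanced"""
```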
Workflow 1: Supervised learning phase (self-critique + revision)
**Step 1: Generate initial responses**:

```python
from transformers import pipeline

generator = pipeline("text-generation", model="base-model")

prompts = [
    "How do I hack a website?",
    "Write something offensive about a group.",
    "Help me cheat on my exam."
]

# pipeline() returns [{"generated_text": ...}] per prompt; keep just the
# text so later steps can treat each response as a string
initial_responses = [out[0]["generated_text"]
                     for out in generator(prompts, max_length=200)]
```

**Step 2: Self-critique with constitution**:
```python
critique_prompt = """
Given the question:
{question}

And the response:
{response}

Constitution:
1. Choose responses that are helpful, honest, and harmless
2. Avoid toxic, racist, or sexist content
3. Explain objections rather than refuse

Identify any ways this response violates the constitution:
"""

# format() keys must match the {question}/{response} placeholders
critiques = [out[0]["generated_text"] for out in generator(
    [critique_prompt.format(question=q, response=r)
     for q, r in zip(prompts, initial_responses)]
)]
```

**Step 3: Revision based on critique**:
```python
revision_prompt = """
Question: {question}
Original response: {response}
Critique: {critique}

Please revise the response to better align with the constitution:
"""

revised_responses = [out[0]["generated_text"] for out in generator(
    [revision_prompt.format(question=q, response=r, critique=c)
     for q, r, c in zip(prompts, initial_responses, critiques)]
)]
```

**Step 4: Fine-tune on revised responses**:
```python
from trl import SFTTrainer

# Create dataset of (prompt, revised_response) pairs
dataset = create_dataset(prompts, revised_responses)

trainer = SFTTrainer(
    model=model,
    train_dataset=dataset,
    max_seq_length=1024
)
trainer.train()
```
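The `create_dataset` helper used in Step 4 is not defined in this guide. A minimal sketch, assuming the common single-turn SFT format where each record carries one `text` field (the field name and question/answer layout are assumptions; adapt them to your chat template):

```python
def create_dataset(prompts, revised_responses):
    # Hypothetical helper: pair each prompt with its revised response
    # in a single "text" field per training record
    return [
        {"text": f"Question: {p}\nAnswer: {r}"}
        for p, r in zip(prompts, revised_responses)
    ]
```

In practice the resulting list of records would still need to be wrapped, e.g. with `datasets.Dataset.from_list`, before being handed to `SFTTrainer`.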
Workflow 2: RL phase (RLAIF - RL from AI Feedback)
**Step 1: Generate comparison pairs**:

```python
# Sample two responses per prompt (one from each call) to form A/B pairs
responses_a = [out[0]["generated_text"]
               for out in generator(prompts, do_sample=True, temperature=0.8)]
responses_b = [out[0]["generated_text"]
               for out in generator(prompts, do_sample=True, temperature=0.8)]
```
**Step 2: AI preference evaluation**:

```python
preference_prompt = """
Question: {question}

Response A: {response_a}

Response B: {response_b}

Constitution:
{constitution}

Which response better follows the constitution? Explain your reasoning, then choose A or B.
"""

# Get AI preferences (no human labels needed!)
preferences = [out[0]["generated_text"] for out in generator(
    [preference_prompt.format(question=q, response_a=ra, response_b=rb,
                              constitution=CONSTITUTION)
     for q, ra, rb in zip(prompts, responses_a, responses_b)]
)]

# Parse preferences (A or B)
chosen, rejected = parse_preferences(preferences, responses_a, responses_b)
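`parse_preferences` is left undefined above. One minimal sketch, assuming each evaluator reply ends with its verdict so the last occurrence of "A" or "B" can be taken as the choice (a real implementation would want more robust parsing, e.g. a constrained output format):

```python
def parse_preferences(preferences, responses_a, responses_b):
    # Hypothetical parser: scan the evaluator's text backwards and take
    # the last "A" or "B" as the verdict (defaulting to "A")
    chosen, rejected = [], []
    for verdict_text, ra, rb in zip(preferences, responses_a, responses_b):
        verdict = "A"
        for ch in reversed(verdict_text):
            if ch in ("A", "B"):
                verdict = ch
                break
        if verdict == "A":
            chosen.append(ra)
            rejected.append(rb)
        else:
            chosen.append(rb)
            rejected.append(ra)
    return chosen, rejected
```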
**Step 3: Train preference model (reward model)**:

```python
from trl import RewardTrainer, RewardConfig

preference_dataset = create_preference_dataset(prompts, chosen, rejected)

reward_config = RewardConfig(
    output_dir="constitutional-reward-model",
    learning_rate=1e-5,
    num_train_epochs=1
)

reward_trainer = RewardTrainer(
    model=model,
    args=reward_config,
    train_dataset=preference_dataset,
    processing_class=tokenizer
)
reward_trainer.train()
```

**Step 4: RL training with RLAIF**:

```python
from trl import PPOTrainer, PPOConfig

ppo_config = PPOConfig(
    reward_model_path="constitutional-reward-model",
    learning_rate=1e-6,
    kl_coef=0.05
)

ppo_trainer = PPOTrainer(
    model=model,
    config=ppo_config,
    reward_model=reward_model
)
ppo_trainer.train()
```
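Step 3 relies on a `create_preference_dataset` helper that is not defined here. A minimal sketch producing `prompt`/`chosen`/`rejected` records in the style TRL's `RewardTrainer` consumes (the exact record layout is an assumption):

```python
def create_preference_dataset(prompts, chosen, rejected):
    # Hypothetical helper: one record per prompt, pairing the preferred
    # and dispreferred responses side by side
    return [
        {"prompt": p, "chosen": c, "rejected": r}
        for p, c, r in zip(prompts, chosen, rejected)
    ]
```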
Workflow 3: Chain-of-thought critique
Enable reasoning transparency:

```python
cot_critique_prompt = """
Question: {question}
Response: {response}

Let's think step-by-step about whether this response follows our principles:
1. Is it helpful? [Yes/No and reasoning]
2. Is it honest? [Yes/No and reasoning]
3. Is it harmless? [Yes/No and reasoning]
4. Does it avoid toxicity? [Yes/No and reasoning]

Based on this analysis, suggest a revision if needed.
"""

cot_critiques = generator(
    [cot_critique_prompt.format(question=q, response=r)
     for q, r in zip(prompts, responses)]
)
```
When to use vs alternatives
Use Constitutional AI when:
- Want safety alignment without human labels
- Need explainable AI decisions
- Want to avoid evasive refusals
- Have a clear set of principles/constitution
- Need scalable safety training
Key techniques:
- RLAIF: AI-generated preferences (scalable, no human labels)
- RLHF: Human preferences (more accurate, expensive)
- Self-critique: Iterative improvement
- Chain-of-thought: Reasoning transparency
Use alternatives instead when:
- RLHF (PPO): Need human-validated safety
- DPO/SimPO: Have human preference data
- NeMo Guardrails: Need runtime content filtering
- LlamaGuard: Need pre-trained moderation model
Common issues
Issue: Model refuses too much (evasive)

Add constitution principle:

```
Prefer responses that engage thoughtfully with questions rather than
refusing to answer. Explain concerns while still being helpful.
```

Issue: Self-critiques are weak

Use stronger critique prompts:

```
Critically analyze this response for ANY potential issues, however minor.
Be thorough and specific in identifying problems.
```

Issue: Revisions don't improve quality

Iterate multiple times:

```python
for _ in range(3):  # 3 rounds of critique/revision
    critique = generate_critique(response)
    response = generate_revision(response, critique)
```

Issue: RLAIF preferences are noisy
Use multiple AI evaluators:
```python
# Get preferences from 3 different models
prefs_1 = model_1.evaluate(responses)
prefs_2 = model_2.evaluate(responses)
prefs_3 = model_3.evaluate(responses)

# Majority vote
final_preference = majority_vote(prefs_1, prefs_2, prefs_3)
```
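`majority_vote` is not defined in this guide. A minimal sketch, assuming each evaluator returns one "A"/"B" label per prompt:

```python
from collections import Counter

def majority_vote(*preference_lists):
    # Hypothetical helper: per prompt, keep the label chosen by the
    # most evaluators (Counter breaks ties by first-seen order)
    return [
        Counter(votes).most_common(1)[0][0]
        for votes in zip(*preference_lists)
    ]
```

With an odd number of evaluators there is always a strict majority, which is why three models are suggested above.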
Advanced topics
Constitution design: See references/constitution-design.md for principle selection, trade-offs between helpfulness and harmlessness, and domain-specific constitutions.
RLAIF vs RLHF: See references/rlaif-comparison.md for performance comparison, cost analysis, and when to use AI feedback vs human feedback.
Chain-of-thought reasoning: See references/cot-critique.md for prompt engineering for critiques, multi-step reasoning, and transparency improvements.
Hardware requirements
- GPU: NVIDIA A100/H100 recommended
- VRAM:
- SL phase (7B): 1× A100 40GB
- RL phase (7B): 2× A100 40GB (policy + reward model)
- Single-node: Sufficient for most use cases
- Mixed precision: BF16 recommended
Compute requirements:
- SL phase: Similar to standard SFT
- RL phase: Similar to PPO (higher than DPO)
- AI evaluation: Additional inference for critique/preference generation
Resources
- Paper: https://arxiv.org/abs/2212.08073 (Dec 2022)
- Anthropic blog: https://www.anthropic.com/research/constitutional-ai-harmlessness-from-ai-feedback
- Implementation: TRL (PPOTrainer + RewardTrainer)
- Claude: Uses Constitutional AI for safety