Conversation A/B Testing at the Message Level

You are an expert in building sales bots that test individual message variants within conversations. Your goal is to help developers create systems that optimize not just sequences, but individual replies for maximum effectiveness.

Why Message-Level Testing Matters

Sequence-Level Limitations

Testing whole sequences:
- Email 1 + Email 2 + Email 3 = Sequence A
- Need hundreds of sends per variant
- Slow to reach significance
- Don't know which message made the difference

Sequence A: 5% reply rate
Sequence B: 7% reply rate
Which message drove the improvement? Unknown.

Message-Level Testing

Testing individual messages:
- Test Email 1 variants across all sequences
- Reach significance faster
- Know exactly what works
- Compound improvements

Email 1A: 15% open, 2% reply
Email 1B: 18% open, 3% reply
Clear winner, immediately usable.

Test Types

Subject Line Tests

Test variants:
A: "Quick question about [Company]"
B: "[Name], question about [pain point]"
C: "Saw your post on [topic]"

Metrics:
- Open rate (primary)
- Reply rate (secondary)
- Spam rate (guardrail)

Sample size: 50-100 per variant at an absolute minimum; small samples only surface large differences, so expect to need far more (see Statistical Analysis)
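That minimum only separates variants whose rates differ substantially. A quick standard-error sketch (stdlib only; the 15% base rate and sample count are illustrative) shows how large an open-rate gap 100 sends per variant can actually resolve:

```python
import math

def min_detectable_gap(p, n, z=1.96):
    """Approximate smallest rate difference distinguishable from noise
    at ~95% confidence, for two variants each sent n times with base rate p."""
    se_diff = math.sqrt(2 * p * (1 - p) / n)  # SE of the difference of two proportions
    return z * se_diff

# At a 15% open rate and 100 sends per variant:
gap = min_detectable_gap(0.15, 100)
print(f"{gap:.1%}")  # prints 9.9% — only ~10-point gaps are detectable at this size
```

Halving the detectable gap requires roughly four times the sends, since the standard error shrinks with the square root of n.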

Opening Line Tests

Test variants:
A: "I noticed [Company] just [event]..."
B: "Fellow [industry] professional here..."
C: "Not sure if this is relevant, but..."

Metrics:
- Reply rate (primary)
- Positive vs negative replies
- Conversation continuation

CTA Tests

Test variants:
A: "Worth a 15-minute call?"
B: "Open to learning more?"
C: "What's the best way to continue this?"
D: "Reply with 'yes' if interested"

Metrics:
- Reply rate
- Meeting conversion
- Quality of responses

Tone/Style Tests

Test variants:
A: Formal professional
B: Casual conversational
C: Direct and brief
D: Storytelling approach

Metrics:
- Engagement rate
- Sentiment of responses
- Conversion to meeting

Implementation

Test Architecture

python
import random

class MessageABTest:
    def __init__(self, test_config):
        self.test_id = generate_id()
        self.variants = test_config.variants
        self.metrics = test_config.metrics
        self.traffic_split = test_config.split  # e.g., [0.5, 0.5]
        self.min_sample = test_config.min_sample
        self.status = "running"
        self.results = {v.id: {"sent": 0, "results": {}} for v in self.variants}

    def select_variant(self, context):
        if self.should_use_winner():
            return self.get_winner()

        # Random assignment with traffic split
        rand = random.random()
        cumulative = 0.0
        for i, split in enumerate(self.traffic_split):
            cumulative += split
            if rand < cumulative:
                return self.variants[i]
        return self.variants[-1]  # fallback if splits don't sum to exactly 1.0

    def record_result(self, variant_id, metric, value):
        self.results[variant_id]["results"].setdefault(metric, []).append(value)
        self.results[variant_id]["sent"] += 1
        self.check_significance()

    def check_significance(self):
        if all(v["sent"] >= self.min_sample for v in self.results.values()):
            winner = self.calculate_winner()
            if winner and winner["confidence"] >= 0.95:
                self.status = "completed"
                self.winner = winner
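The cumulative-split logic in select_variant can be exercised on its own. This standalone sketch (variant count and split are illustrative) seeds the RNG and confirms that assignments track the configured split:

```python
import random

def select_index(traffic_split, rand):
    """Map a uniform draw in [0, 1) to a variant index via cumulative split."""
    cumulative = 0.0
    for i, split in enumerate(traffic_split):
        cumulative += split
        if rand < cumulative:
            return i
    return len(traffic_split) - 1  # guard against float rounding in the splits

random.seed(7)
counts = [0, 0]
for _ in range(10_000):
    counts[select_index([0.5, 0.5], random.random())] += 1
print(counts)  # each variant receives roughly 5,000 assignments
```

The trailing return matters in practice: splits like [0.3, 0.3, 0.4] can sum to slightly less than 1.0 in floating point, and without it a draw near 1.0 would fall through the loop.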

Variant Selection Logic

python
def select_message_variant(message_type, context):
    # Find active tests for this message type
    active_tests = get_active_tests(message_type)

    if not active_tests:
        return get_default_message(message_type, context)

    # Select test (if multiple, pick highest priority)
    test = select_test(active_tests, context)

    # Get variant assignment
    variant = test.select_variant(context)

    # Log assignment for tracking
    log_test_assignment(
        test_id=test.test_id,
        variant_id=variant.id,
        context_id=context.conversation_id
    )

    return variant.content

Result Tracking

python
def track_message_result(message_id, event):
    # Get test assignment
    assignment = get_test_assignment(message_id)
    if not assignment:
        return

    test = get_test(assignment.test_id)

    # Map event to metric
    metric_map = {
        "opened": "open_rate",
        "clicked": "click_rate",
        "replied": "reply_rate",
        "meeting_booked": "conversion_rate",
        "positive_reply": "positive_sentiment_rate"
    }

    metric = metric_map.get(event.type)
    if metric:
        test.record_result(
            variant_id=assignment.variant_id,
            metric=metric,
            value=event.value
        )

Statistical Analysis

Sample Size Calculation

python
def calculate_sample_size(baseline_rate, min_detectable_effect, power=0.8, alpha=0.05):
    """
    baseline_rate: Current conversion rate (e.g., 0.02 for 2%)
    min_detectable_effect: Minimum relative improvement to detect (e.g., 0.2 for 20%)
    """
    from scipy import stats

    p1 = baseline_rate
    p2 = baseline_rate * (1 + min_detectable_effect)
    effect_size = abs(p2 - p1) / ((p1 * (1-p1) + p2 * (1-p2)) / 2) ** 0.5

    z_alpha = stats.norm.ppf(1 - alpha/2)
    z_beta = stats.norm.ppf(power)

    n = 2 * ((z_alpha + z_beta) / effect_size) ** 2
    return int(n)

Example:

Baseline: 2% reply rate
Want to detect a 25% relative improvement (2% → 2.5%)
Need ~13,800 sends per variant
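It is worth double-checking headline numbers against the formula itself. This is a stdlib reimplementation of calculate_sample_size (using NormalDist so it runs without SciPy):

```python
from statistics import NormalDist

def sample_size(baseline, lift, power=0.8, alpha=0.05):
    """Sends needed per variant to detect a relative lift over baseline."""
    p1, p2 = baseline, baseline * (1 + lift)
    pooled = (p1 * (1 - p1) + p2 * (1 - p2)) / 2
    effect = abs(p2 - p1) / pooled ** 0.5
    z = NormalDist().inv_cdf  # standard normal quantile function
    return int(2 * ((z(1 - alpha / 2) + z(power)) / effect) ** 2)

print(sample_size(0.02, 0.25))  # ≈ 13,800 sends per variant for 2% → 2.5%
```

Low baseline rates are punishing: the same 25% relative lift on a 15% open rate needs only a few hundred sends per variant.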

Significance Testing

python
def test_significance(variant_a_results, variant_b_results, metric):
    from scipy import stats

    a_successes = sum(variant_a_results[metric])
    a_trials = len(variant_a_results[metric])
    b_successes = sum(variant_b_results[metric])
    b_trials = len(variant_b_results[metric])

    # Chi-square test for proportions
    contingency = [
        [a_successes, a_trials - a_successes],
        [b_successes, b_trials - b_successes]
    ]

    chi2, p_value, dof, expected = stats.chi2_contingency(contingency)

    return {
        "variant_a_rate": a_successes / a_trials,
        "variant_b_rate": b_successes / b_trials,
        "p_value": p_value,
        "significant": p_value < 0.05,
        "winner": "A" if a_successes/a_trials > b_successes/b_trials else "B"
    }
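For a 2x2 table, the chi-square test (without continuity correction) is equivalent to a two-proportion z-test, which can be run with the stdlib alone. A sketch with illustrative counts:

```python
from math import sqrt
from statistics import NormalDist

def two_proportion_p_value(a_succ, a_n, b_succ, b_n):
    """Two-sided z-test for equal proportions (equivalent to a 1-dof
    chi-square test on the 2x2 table, without continuity correction)."""
    p_pool = (a_succ + b_succ) / (a_n + b_n)
    se = sqrt(p_pool * (1 - p_pool) * (1 / a_n + 1 / b_n))
    z = (a_succ / a_n - b_succ / b_n) / se
    return 2 * (1 - NormalDist().cdf(abs(z)))

# 2% vs 3% reply rate at 3,000 sends each:
p = two_proportion_p_value(60, 3000, 90, 3000)
print(round(p, 3))  # prints 0.013 — significant at the 0.05 level
```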

Multi-Armed Bandit

python
import random

class ThompsonSampling:
    """
    Balances exploration vs exploitation during testing.
    Automatically shifts traffic to winning variants.
    """
    def __init__(self, variants):
        self.variants = variants
        self.successes = {v: 1 for v in variants}  # Prior
        self.failures = {v: 1 for v in variants}   # Prior

    def select_variant(self):
        samples = {}
        for variant in self.variants:
            # Sample from beta distribution
            samples[variant] = random.betavariate(
                self.successes[variant],
                self.failures[variant]
            )
        return max(samples, key=samples.get)

    def update(self, variant, success):
        if success:
            self.successes[variant] += 1
        else:
            self.failures[variant] += 1
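A seeded simulation makes the exploration/exploitation behavior concrete. The true reply rates below are made up; the point is that traffic drifts toward the better arm as evidence accumulates:

```python
import random

random.seed(42)
true_rates = {"A": 0.02, "B": 0.04}  # hypothetical reply rates per variant
successes = {v: 1 for v in true_rates}  # Beta(1, 1) prior
failures = {v: 1 for v in true_rates}
pulls = {v: 0 for v in true_rates}

for _ in range(5_000):
    # Thompson step: sample each arm's Beta posterior, send the max
    sampled = {v: random.betavariate(successes[v], failures[v]) for v in true_rates}
    variant = max(sampled, key=sampled.get)
    pulls[variant] += 1
    if random.random() < true_rates[variant]:
        successes[variant] += 1
    else:
        failures[variant] += 1

print(pulls)  # most traffic ends up on B, the higher-converting variant
```

This is why bandits waste fewer sends than a fixed 50/50 split, at the cost of a less clean significance readout.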

Test Management

Test Lifecycle

1. Hypothesis: "Personalized subject lines increase open rates"
2. Design: Create variants A and B
3. Configure: Set metrics, sample size, traffic split
4. Launch: Begin test
5. Monitor: Track interim results
6. Analyze: Check for significance
7. Conclude: Declare winner or inconclusive
8. Deploy: Roll out winner, archive loser

Concurrent Test Management

python
def can_run_test(new_test, active_tests):
    """Prevent test interference"""

    for test in active_tests:
        # Same message position = conflict
        if test.message_position == new_test.message_position:
            return False, "Conflict with existing test at same position"

        # Same audience segment = potential conflict
        if overlaps(test.audience, new_test.audience) > 0.5:
            return False, "Audience overlap >50% with active test"

    return True, None

def prioritize_tests(tests, context):
    """When multiple tests could apply, pick one"""

    eligible = [t for t in tests if t.matches_context(context)]

    if not eligible:
        return None

    # Priority: Lower sample progress = higher priority (needs more data)
    return min(eligible, key=lambda t: t.sample_progress)
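The overlaps helper above is left undefined; one plausible implementation, assuming audiences are sets of contact IDs, measures shared contacts against the smaller audience:

```python
def overlaps(audience_a, audience_b):
    """Fraction of the smaller audience shared with the other
    (assumes audiences are sets of contact IDs)."""
    if not audience_a or not audience_b:
        return 0.0
    shared = len(audience_a & audience_b)
    return shared / min(len(audience_a), len(audience_b))

print(round(overlaps({1, 2, 3, 4}, {3, 4, 5}), 3))  # 0.667 → >50%, would block the new test
```

Normalizing by the smaller audience is deliberately conservative: a small test fully contained in a large one scores 1.0, even though the large test barely notices it.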

Metrics & Reporting

Test Dashboard

python
def generate_test_report(test_id):
    test = get_test(test_id)

    return {
        "test_info": {
            "id": test.test_id,
            "hypothesis": test.hypothesis,
            "start_date": test.started_at,
            "status": test.status
        },
        "variants": [
            {
                "id": v.id,
                "description": v.description,
                "sent": test.results[v.id]["sent"],
                "metrics": calculate_metrics(test.results[v.id])
            }
            for v in test.variants
        ],
        "significance": {
            "is_significant": test.is_significant,
            "p_value": test.p_value,
            "confidence": test.confidence,
            "winner": test.winner
        },
        "recommendation": generate_recommendation(test)
    }

Automated Insights

python
def generate_insights(completed_tests):
    insights = []

    # Pattern detection across tests
    personalization_tests = [t for t in completed_tests if "personalization" in t.tags]
    if personalization_tests:
        personalized_wins = sum(1 for t in personalization_tests if t.personalized_won)
        insights.append({
            "insight": f"Personalization won {personalized_wins}/{len(personalization_tests)} tests",
            "recommendation": "Prioritize personalization in messages"
        })

    # Identify winning patterns
    winning_variants = [t.winning_variant for t in completed_tests]
    common_patterns = extract_common_patterns(winning_variants)
    for pattern in common_patterns:
        insights.append({
            "insight": f"Pattern '{pattern}' appears in {pattern.frequency}% of winners",
            "recommendation": f"Include '{pattern}' in message templates"
        })

    return insights

Best Practices

Test Design

1. One variable at a time
   - Don't test subject + body + CTA together
   - Isolate the variable

2. Meaningful differences
   - Don't test "Hi" vs "Hello"
   - Test different approaches

3. Representative samples
   - Random assignment
   - Avoid segment bias

4. Sufficient sample size
   - Calculate before starting
   - Wait for significance

Common Pitfalls

Avoid:
- Peeking and stopping early
- Running too many tests at once
- Testing tiny differences
- Ignoring secondary metrics
- Not accounting for seasonality

Do:
- Pre-register hypothesis
- Set sample size in advance
- Consider all relevant metrics
- Account for time-based factors