Scripts: `scripts/hypothesis_tester.py`, `scripts/sample_size_calculator.py`, `scripts/confidence_interval.py`

| Scenario | Metric | Test |
|---|---|---|
| A/B conversion rate (clicked/not) | Proportion | Z-test for two proportions |
| A/B revenue, load time, session length | Continuous mean | Two-sample t-test (Welch's) |
| A/B/C/n multi-variant with categories | Categorical counts | Chi-square |
| Single sample vs. known value | Mean vs. constant | One-sample t-test |
| Non-normal data, small n | Rank-based | Mann-Whitney U (flag for human review) |
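
As an illustration of the first row, here is a minimal sketch of a two-sided z-test for two proportions using only the standard library. The function name `two_proportion_z_test` and the example counts are hypothetical and are not taken from `scripts/hypothesis_tester.py`; the pooled standard error is the textbook choice under the null hypothesis of equal rates.

```python
from math import sqrt
from statistics import NormalDist


def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """Two-sided z-test for a difference between two conversion rates."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    # Pooled rate under the null hypothesis that both variants convert equally.
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    # Two-sided p-value from the standard normal distribution.
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value


# Hypothetical counts: 200/2000 conversions in control, 250/2000 in treatment.
z, p = two_proportion_z_test(200, 2000, 250, 2000)
print(f"z = {z:.3f}, p = {p:.4f}")
```

The normal approximation behind this test is usually considered reasonable only when each variant has a fair number of conversions and non-conversions; with very small counts an exact test may be the safer choice.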
| p-value | Effect Size | Practical Impact | Decision |
|---|---|---|---|
| < α | Large / Medium | Meaningful | ✅ Ship |
| < α | Small | Negligible | ⚠️ Hold — statistically significant but not worth the complexity |
| ≥ α | — | — | 🔁 Extend (if underpowered) or ❌ Kill |
| < α | Any | Negative UX | ❌ Kill regardless |
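
The decision table can be read as a small rule set. The sketch below is one possible encoding, not the skill's actual logic: the function `decide` and its parameters are hypothetical, and it assumes the effect-size label stands in for practical impact (Medium or Large is treated as meaningful, anything smaller as negligible).

```python
def decide(p_value, alpha, effect_label, harms_ux=False, underpowered=False):
    """Map test results to a verdict following the decision table.

    effect_label is one of "Negligible", "Small", "Medium", "Large".
    harms_ux and underpowered are judgments supplied by the analyst.
    """
    if p_value < alpha:
        # A significant result that hurts UX is killed regardless of effect size.
        if harms_ux:
            return "Kill"
        if effect_label in ("Medium", "Large"):
            return "Ship"
        # Significant but too small to justify the added complexity.
        return "Hold"
    # Not significant: extend only if the test lacked power.
    return "Extend" if underpowered else "Kill"


print(decide(0.01, 0.05, "Medium"))                   # Ship
print(decide(0.01, 0.05, "Small"))                    # Hold
print(decide(0.20, 0.05, "Small", underpowered=True))  # Extend
```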

| Cohen's d | Interpretation |
|---|---|
| < 0.2 | Negligible |
| 0.2–0.5 | Small |
| 0.5–0.8 | Medium |
| > 0.8 | Large |

| Cohen's h | Interpretation |
|---|---|
| < 0.2 | Negligible |
| 0.2–0.5 | Small |
| 0.5–0.8 | Medium |
| > 0.8 | Large |

| Cramér's V | Interpretation |
|---|---|
| < 0.1 | Negligible |
| 0.1–0.3 | Small |
| 0.3–0.5 | Medium |
| > 0.5 | Large |
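
To make the thresholds concrete, the following sketch computes the three effect sizes and labels them. It assumes d, h, and V refer to Cohen's d (pooled standard deviation), Cohen's h (arcsine-transformed proportions), and Cramér's V (from a chi-square statistic). All function names are hypothetical, and values that fall exactly on a boundary are assigned to the higher bucket, which the tables leave unspecified.

```python
from math import asin, sqrt

D_H_CUTOFFS = (0.2, 0.5, 0.8)  # thresholds for d and h
V_CUTOFFS = (0.1, 0.3, 0.5)    # thresholds for V


def cohens_d(mean_a, mean_b, sd_a, sd_b, n_a, n_b):
    """Standardized mean difference using the pooled standard deviation."""
    pooled_var = ((n_a - 1) * sd_a**2 + (n_b - 1) * sd_b**2) / (n_a + n_b - 2)
    return (mean_b - mean_a) / sqrt(pooled_var)


def cohens_h(p_a, p_b):
    """Difference between two proportions on the arcsine scale."""
    return 2 * asin(sqrt(p_b)) - 2 * asin(sqrt(p_a))


def cramers_v(chi2, n, n_rows, n_cols):
    """Association strength from a chi-square statistic on an r x c table."""
    return sqrt(chi2 / (n * (min(n_rows, n_cols) - 1)))


def interpret(value, cutoffs):
    """Label an effect size; boundary values go to the higher bucket."""
    labels = ("Negligible", "Small", "Medium", "Large")
    magnitude = abs(value)
    for cutoff, label in zip(cutoffs, labels):
        if magnitude < cutoff:
            return label
    return labels[-1]


# Hypothetical inputs for illustration only.
h = cohens_h(0.10, 0.125)
d = cohens_d(10, 12, 4, 4, 50, 50)
v = cramers_v(chi2=40, n=1000, n_rows=2, n_cols=3)
print(interpret(h, D_H_CUTOFFS), interpret(d, D_H_CUTOFFS), interpret(v, V_CUTOFFS))
```

Note that a lift from 10% to 12.5% gives a negligible h even though a z-test on a few thousand users per arm can call it significant, which is the situation the "Hold" row of the decision table addresses.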
| Request | Deliverable |
|---|---|
| "Did our test win?" | Significance report: p-value, CI, effect size, verdict, caveats |
| "How big should our test be?" | Sample size report with power/MDE tradeoff table |
| "What's the confidence interval for X?" | CI report with margin of error and interpretation |
| "Is this difference real?" | Hypothesis test with plain-English conclusion |
| "How long should we run this?" | Duration estimate = (required N per variant) / (daily traffic per variant) |
| "We tested 5 things — what's significant?" | Multiple comparison analysis with Bonferroni-adjusted thresholds |
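
Three of these deliverables reduce to short calculations. The sketch below is illustrative and not taken from `scripts/sample_size_calculator.py`: it assumes a two-sided two-proportion z-test with the MDE given as an absolute lift, uses the common normal-approximation sample size formula, applies the duration formula from the table, and derives a Bonferroni threshold as α divided by the number of tests. All names and inputs are hypothetical.

```python
from math import ceil
from statistics import NormalDist


def sample_size_per_variant(baseline, mde, alpha=0.05, power=0.80):
    """Approximate n per variant to detect an absolute lift of `mde`."""
    p1, p2 = baseline, baseline + mde
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided critical value
    z_beta = NormalDist().inv_cdf(power)
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return ceil((z_alpha + z_beta) ** 2 * variance / mde**2)


def duration_days(n_per_variant, daily_traffic_per_variant):
    """Duration = required N per variant / daily traffic per variant."""
    return ceil(n_per_variant / daily_traffic_per_variant)


def bonferroni(p_values, alpha=0.05):
    """Return the adjusted threshold and a significance flag per test."""
    threshold = alpha / len(p_values)
    return threshold, [p < threshold for p in p_values]


# Hypothetical scenario: 10% baseline, 2-point absolute MDE, 500 users/day/variant.
n = sample_size_per_variant(baseline=0.10, mde=0.02)
days = duration_days(n, daily_traffic_per_variant=500)
threshold, flags = bonferroni([0.003, 0.04, 0.2, 0.011, 0.009])
print(n, days, threshold, flags)
```

Bonferroni is conservative when many comparisons are made; it is used here only because the table names it.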

| Skill | Use When |
|---|---|
|  | Designing the experiment before it runs: randomization, instrumentation, holdout |
|  | Verifying input data integrity before running any statistical test |
|  | Structuring the hypothesis, success metrics, and guardrail metrics |
|  | Analyzing product funnel and retention metrics |
|  | Interpreting SaaS KPIs that may feed into experiments (ARR, churn, LTV) |
|  | Statistical analysis of marketing campaign performance |

Related skills: `marketing-skill/ab-test-setup`, `product-team/experiment-designer`, `engineering/data-quality-auditor`

Reference: `references/statistical-testing-concepts.md`