Measure — Define and Track Success
Overview
If you can't define success, you can't design for it. And if you measure the wrong thing, you'll optimize for the wrong outcome.
UX measurement connects design decisions to observable evidence — did the thing we built actually help? This skill defines what to measure, how to measure it, and how to make decisions from what you learn. It bridges the gap between "we shipped it" and "it worked."
But measurement is not neutral. Every metric you choose shapes what gets optimized. Measure time-on-site and you'll get infinite scroll. Measure clicks and you'll get clickbait. Measure conversion and you'll get dark patterns — unless you also measure what those metrics cost the user. This skill guards against measurement becoming manipulation, ensuring that metrics incentivize genuine value, not engineered engagement.
When to activate this skill: Defining success criteria for a new feature, designing experiments, building measurement frameworks, analyzing funnel performance, reviewing whether existing metrics are measuring the right things, or anytime "the numbers look good" but the experience feels wrong.
Skill family
Measure works alongside the full Intent skill system:
- /strategize: Their hypotheses need measurable success criteria. Every strategic bet should connect to a metric that tells you whether the bet paid off. /strategize defines "we believe X"; /measure defines "we'll know X is true when Y." When metrics contradict a strategic assumption, measure loops back to reopen strategy — with guardrails (see "When measurement points back to strategy" below).
- /investigate: Qualitative research complements quantitative measurement. When the numbers say users drop off at step 3, /investigate tells you why. When satisfaction scores drop after a redesign, /investigate interviews users to understand the experience behind the number. Never make major design decisions from metrics alone.
- /evaluate: UX assessment produces scores and findings that inform what to measure. Evaluation identifies usability issues; measurement tracks whether fixes actually resolved them.
- /specify: Test plans and success metrics go into handoff specs. Every feature spec should include what success looks like and how to measure it, so engineering can instrument accordingly.
- /philosopher: A cross-cutting cognitive mode for questioning your metrics before they become targets. Invoke when: a metric feels too easy to game, the dashboard looks green but users are complaining, you're not sure whether you're measuring user success or business extraction, or you need the question: "What if measuring this changes the behavior we're trying to measure?"
Core capabilities
1. Metric selection: HEART framework
Google's HEART framework provides a structured approach to selecting UX metrics. Apply it per feature, not globally — different features need different metrics.
Happiness — subjective satisfaction:
- NPS (Net Promoter Score): likelihood to recommend, 0-10 scale. Blunt but useful for trending.
- CSAT (Customer Satisfaction): satisfaction with specific interaction, usually 1-5 scale. More actionable than NPS for feature-level decisions.
- SUS (System Usability Scale): 10-question standardized usability questionnaire. Good for benchmarking across releases.
- Custom surveys: specific questions tied to specific features. "How easy was it to find what you were looking for?" is more useful than "How satisfied are you?"
Engagement — behavioral depth:
- Frequency: how often users return (daily, weekly, monthly active users)
- Intensity: depth of usage per session (features used, content consumed, actions taken)
- Breadth: how many features a user touches (adoption breadth, not just depth)
- Recency: when was the last interaction (early warning for churn)
Adoption — new usage:
- New user activation: percentage completing key onboarding milestones
- Feature adoption: percentage of eligible users who try a new feature
- Onboarding completion: funnel through first-use experience
- Time-to-value: how quickly new users reach their first meaningful outcome
Retention — continued usage:
- Return rate: D1, D7, D30 retention (percentage returning after 1, 7, 30 days)
- Churn rate: percentage of users who stop using the product in a period
- Reactivation: users who left and came back (what brought them back?)
- Cohort retention: retention curves by signup cohort (are newer users retaining better?)
Task success — effectiveness:
- Completion rate: percentage of users who finish the task they started
- Error rate: percentage of attempts that result in errors
- Time-on-task: how long the task takes (shorter is usually better, but not always)
- Efficiency: task completion relative to optimal path length
Not every feature needs all five. Select the 2-3 dimensions that matter most for the feature's intent. A checkout flow cares most about task success and happiness. A content feed cares most about engagement and retention. A new feature launch cares most about adoption.
Counter-metrics: For every metric you optimize, name the metric that could suffer. If engagement goes up but satisfaction goes down, that's a red flag. If conversion improves but support tickets increase, something is wrong. Counter-metrics are your canary in the coal mine.
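The per-feature selection and its counter-metric pairing can be captured as a small, reviewable plan. This is a sketch only; the feature and metric names are hypothetical:

```python
# Sketch: per-feature HEART plan. Every primary metric is paired with a
# counter-metric that would reveal the cost of over-optimizing it.
# All feature and metric names below are hypothetical examples.
heart_plan = {
    "checkout_flow": {
        "dimensions": ["task_success", "happiness"],
        "primary": "checkout_completion_rate",
        "counter": "support_ticket_volume",
    },
    "content_feed": {
        "dimensions": ["engagement", "retention"],
        "primary": "d7_return_rate",
        "counter": "session_satisfaction_score",
    },
    "new_export_feature": {
        "dimensions": ["adoption"],
        "primary": "eligible_user_trial_rate",
        "counter": "task_abandonment_rate",
    },
}

def validate_plan(plan):
    """Enforce the rules above: a few HEART dimensions, never all five by
    default, and a named counter-metric for every primary metric."""
    for feature, spec in plan.items():
        assert 1 <= len(spec["dimensions"]) <= 3, f"{feature}: pick 2-3 dimensions"
        assert spec["counter"], f"{feature}: every metric needs a counter-metric"
    return True
```

Writing the plan down this way makes a missing counter-metric a review failure rather than an oversight.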
2. Goal-Signal-Metric mapping
The GSM framework prevents you from jumping straight to metrics without understanding what you're actually trying to learn.
Goal: What user or business outcome are you trying to achieve? Be specific. "Improve the user experience" is not a goal. "Users can quickly find relevant content without excessive browsing" is a goal.
Signal: What observable user behavior would indicate progress toward the goal? This is the bridge between intent and data. "Users navigate directly to relevant content" is a signal. "Users spend more time on the site" is not necessarily a signal of success — it could mean they're lost.
Metric: How do you quantify that signal? Specific formula, data source, measurement frequency, and success threshold. "Clicks-to-content under 3 for at least 80% of sessions, measured weekly via analytics" is a metric.
Example GSM chain:
- Goal: Users can complete checkout without friction
- Signal: Users proceed through checkout steps without abandoning or going back
- Metric: Checkout completion rate > 75% for users who add items to cart; median checkout time under 90 seconds; back-navigation rate during checkout < 10%
Build GSM chains for every major feature before launch. If you can't articulate the goal, you don't know what success looks like. If you can't identify the signal, you're guessing what to measure. If you can't define the metric, you can't learn from what you ship.
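A GSM chain can be written down as a simple record so that no metric ever exists without its goal and signal. A minimal sketch mirroring the checkout example above (the structure is illustrative, not a prescribed schema):

```python
from dataclasses import dataclass, field

@dataclass
class GSMChain:
    goal: str            # the user or business outcome, stated specifically
    signal: str          # the observable behavior indicating progress
    metrics: list[str] = field(default_factory=list)  # quantified, with thresholds

checkout = GSMChain(
    goal="Users can complete checkout without friction",
    signal="Users proceed through checkout steps without abandoning or going back",
    metrics=[
        "Checkout completion rate > 75% for users who add items to cart",
        "Median checkout time under 90 seconds",
        "Back-navigation rate during checkout < 10%",
    ],
)

# A chain missing any layer is incomplete: you are guessing what to measure.
assert checkout.goal and checkout.signal and checkout.metrics
```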
3. A/B test design
Experimentation is how you learn whether a design change actually helps. But poorly designed experiments produce false confidence.
Hypothesis structure:
"If we [specific change], then [specific metric] will [direction of change] by [estimated magnitude] because [causal reasoning]."
Example: "If we move the search bar from the header to the hero section, then search usage will increase by 15% because users will encounter it earlier in their scanning pattern, reducing the friction of scrolling up to search."
Minimum detectable effect (MDE):
What's the smallest change worth detecting? A 0.1% improvement in conversion may not be worth the engineering effort. A 5% improvement would be. Set the MDE before the test, not after. This determines your required sample size.
Sample size calculation:
Depends on: baseline conversion rate, MDE, statistical power (typically 80%), significance level (typically 95% / alpha = 0.05). Don't guess — use the formula or a calculator.
Quick reference for common scenarios (two-sided test, 80% power, 95% significance, two variants):
| Baseline rate | MDE (relative) | Sample size per variant |
|---|---|---|
| 5% | 20% (5% → 6%) | ~8,200 |
| 10% | 10% (10% → 11%) | ~14,500 |
| 10% | 20% (10% → 12%) | ~3,800 |
| 25% | 10% (25% → 27.5%) | ~4,800 |
| 50% | 5% (50% → 52.5%) | ~6,000 |
Lower baseline rates and smaller MDEs require dramatically more traffic. If your required sample size exceeds your monthly traffic, either increase the MDE (detect only larger effects), extend the test duration, or accept that an A/B test is not the right method — use qualitative research instead. Underpowered tests produce inconclusive results that waste time.
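The table values can be reproduced with the standard sample-size formula for a two-sided two-proportion z-test; a sketch using only the Python standard library:

```python
from math import ceil, sqrt
from statistics import NormalDist

def sample_size_per_variant(p1, p2, alpha=0.05, power=0.80):
    """Per-variant sample size for a two-sided two-proportion z-test.

    p1 = baseline conversion rate; p2 = baseline after the minimum
    detectable effect is applied (e.g. 0.10 -> 0.12 for a 20% relative MDE).
    """
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # ~1.96 at 95% significance
    z_beta = NormalDist().inv_cdf(power)           # ~0.84 at 80% power
    p_bar = (p1 + p2) / 2
    numerator = (z_alpha * sqrt(2 * p_bar * (1 - p_bar))
                 + z_beta * sqrt(p1 * (1 - p1) + p2 * (1 - p2))) ** 2
    return ceil(numerator / (p1 - p2) ** 2)

# 10% baseline, 20% relative MDE (10% -> 12%): ~3,800 per variant
print(sample_size_per_variant(0.10, 0.12))
```

Halving the MDE roughly quadruples the required sample, which is why small effects at low baselines are so expensive to detect.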
Duration:
Run for at least 1-2 full weekly cycles to account for day-of-week effects. Longer for seasonal businesses. Never run less than a week even if you hit sample size early — behavioral patterns vary by day.
Segmentation:
Check for differential effects across user segments: new vs. returning users, mobile vs. desktop, geography, plan type. An overall neutral result may hide a strong positive effect for one segment and a strong negative for another.
Guardrail metrics:
Define what must NOT get worse. If testing a new checkout flow, guardrail metrics might include: revenue per user, support ticket volume, return rate. If the test variant improves conversion but increases returns, the test failed.
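A ship decision that respects guardrails can be expressed as an explicit rule. This is a sketch; the metric names and tolerances are hypothetical, and deltas are signed so that negative means worse for the user:

```python
def ab_test_decision(primary_lift, primary_significant, guardrail_deltas, tolerances):
    """Ship only if the primary metric improved significantly AND no
    guardrail metric regressed past its tolerance.

    Deltas are relative changes vs. control; negative means the guardrail
    got worse. Tolerances are the maximum acceptable regressions.
    """
    if not primary_significant or primary_lift <= 0:
        return "do not ship: primary metric did not improve"
    breached = [name for name, delta in guardrail_deltas.items()
                if delta < -tolerances.get(name, 0.0)]
    if breached:
        return f"do not ship: guardrail regression in {', '.join(breached)}"
    return "ship"

# Conversion up 4%, but returns worsened 6% against a 2% tolerance:
# per the rule above, the test failed despite the conversion win.
verdict = ab_test_decision(
    primary_lift=0.04, primary_significant=True,
    guardrail_deltas={"revenue_per_user": 0.01, "return_rate": -0.06},
    tolerances={"revenue_per_user": 0.02, "return_rate": 0.02},
)
print(verdict)  # do not ship: guardrail regression in return_rate
```

Writing the decision criteria as code before the test starts removes the temptation to rationalize a guardrail breach afterward.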
Common mistakes:
- Peeking at results before the test reaches statistical significance (inflates false positive rate)
- Running too many variants without adjusting for multiple comparisons
- Ignoring novelty effects (new things get clicked more just because they're new — wait for the effect to stabilize)
- Stopping tests too early because early results "look decisive"
- Not accounting for interaction effects when multiple tests run simultaneously
- Testing cosmetic changes when the real problem is structural
4. Funnel analysis
Funnels reveal where users fall out of a desired flow. But the value isn't in the numbers — it's in understanding why.
Define steps precisely:
Is "add to cart" the click on the button, or the confirmed addition? Is "checkout" the start of the payment form, or the submission? Imprecise step definitions produce misleading conversion rates. Define each step as a specific, observable, unambiguous event.
Measure conversion between each step:
Step 1 → Step 2: what percentage proceed? What percentage return to a previous step? What percentage leave entirely? Each transition tells a different story.
Identify the biggest drop-offs:
Focus on the step transitions with the lowest conversion rates. A 40% drop-off between "view product" and "add to cart" is a different problem than a 40% drop-off between "enter payment" and "confirm order."
Segment by everything:
User type (new vs. returning), device, traffic source, geography, time of day, day of week. Aggregate funnels hide the signal. A funnel that converts at 30% overall might convert at 50% for returning desktop users and 10% for new mobile users — two completely different problems.
Pair with qualitative:
When you find the drop-off, you know WHERE users struggle. To understand WHY, pair with /investigate — session recordings, usability testing, surveys at the point of friction. Numbers without context produce bad interventions.
Benchmarking:
Compare funnels across time periods (did the last release help or hurt?), across segments (who struggles most?), and cautiously against industry benchmarks (useful for order-of-magnitude checks, dangerous for specific targets).
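Once the steps are defined as unambiguous events, the transition math is straightforward. A sketch with hypothetical event names and counts:

```python
# Precisely defined, ordered funnel events (names are hypothetical).
STEPS = ["view_product", "add_to_cart", "begin_checkout",
         "enter_payment", "confirm_order"]

# Unique users observed at each step, e.g. from an analytics export.
counts = {"view_product": 10_000, "add_to_cart": 3_100,
          "begin_checkout": 1_800, "enter_payment": 1_500,
          "confirm_order": 1_250}

def funnel_report(steps, counts):
    """Conversion for each adjacent transition plus overall; the biggest
    drop-off is the transition with the lowest conversion rate."""
    transitions = {f"{a} -> {b}": counts[b] / counts[a]
                   for a, b in zip(steps, steps[1:])}
    worst = min(transitions, key=transitions.get)
    return {"transitions": transitions,
            "overall": counts[steps[-1]] / counts[steps[0]],
            "biggest_drop_off": worst}

report = funnel_report(STEPS, counts)
print(report["biggest_drop_off"])  # view_product -> add_to_cart
```

Run the same report per segment (device, user type, traffic source) before drawing conclusions; the aggregate numbers above could hide two very different segment-level funnels.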
5. Qualitative and quantitative triangulation
Numbers tell you WHAT happened. Qualitative tells you WHY. Neither alone is sufficient for design decisions.
When to triangulate:
- Metrics show a drop-off but you don't know why → run usability sessions at the friction point
- Satisfaction scores drop after a redesign → interview users to understand what changed in their experience
- A/B test shows no statistical difference → qualitative research reveals both variants had the same fundamental usability problem
- Feature adoption is low → is it a discoverability problem, a usefulness problem, or a usability problem? Only qualitative can distinguish.
How to triangulate:
- Start with quantitative to identify WHAT and WHERE
- Use qualitative to understand WHY
- Return to quantitative to verify that your intervention addressed the WHY
- Repeat
Never make major design decisions from one data type alone. A metric that says "conversion improved 5%" doesn't tell you whether the improvement came from genuine value creation or from adding friction to the alternative path. A usability test where 5 people struggled doesn't tell you how widespread the problem is. Both together tell you something real.
6. Ethical measurement
Metrics shape behavior — of teams, of products, and of users. Measure carefully.
Goodhart's Law:
"When a measure becomes a target, it ceases to be a good measure." This is not a theoretical concern. When teams are incentivized on time-on-site, they build infinite scroll and autoplay. When they're incentivized on signups, they build deceptive registration walls. When they're incentivized on engagement, they build notification spam. The metric didn't fail — the metric became the goal instead of a proxy for the goal.
Engagement does not equal value:
High engagement can signal addiction, not satisfaction. A user who checks their phone 200 times a day is engaged. They may also be anxious, distracted, and unhappy. Include satisfaction metrics alongside engagement metrics. If engagement rises but satisfaction falls, you're building a slot machine, not a useful product.
Dark metric patterns to watch for:
- Counting "successful" newsletter signups from prechecked boxes
- Measuring "engagement" from nagging notifications that users click to dismiss
- Celebrating "retention" that's actually cancellation friction
- Reporting "conversion" from misleading button labels or urgency timers
- Tracking "time on site" driven by confusing navigation
Connection to Intent's anti-pattern catalog:
Any metric that would improve by implementing a dark pattern is measuring the wrong thing. Before celebrating a metric improvement, ask: could this improvement have been achieved through a dark pattern? If yes, verify that it wasn't.
Ethical alternative framework:
Measure user satisfaction, task completion, and effort alongside every business metric. Build a measurement dashboard with two columns: business outcomes and user outcomes. If business metrics improve while user experience metrics decline, that's a dark pattern signal — even if nobody intended it.
- Business metric: conversion rate → Paired user metric: post-purchase satisfaction
- Business metric: engagement (DAU) → Paired user metric: user-reported value
- Business metric: retention → Paired user metric: ease of cancellation
- Business metric: revenue per user → Paired user metric: perceived value for money
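The two-column dashboard can be checked mechanically: any pair where the business metric rose while its paired user metric fell gets flagged for review. The pairings below come from the list above; the delta readings are hypothetical:

```python
# (business metric, paired user metric): (business delta, user delta).
# Positive deltas mean "improved". All readings are hypothetical examples.
paired_readings = {
    ("conversion_rate", "post_purchase_satisfaction"): (+0.05, -0.08),
    ("engagement_dau", "user_reported_value"):         (+0.02, +0.01),
    ("retention", "ease_of_cancellation"):             (+0.03, -0.04),
    ("revenue_per_user", "perceived_value_for_money"): (+0.01, +0.00),
}

def dark_pattern_signals(readings):
    """Flag pairs where the business metric improved while the paired
    user metric declined -- the dark-pattern signal described above."""
    return [pair for pair, (biz, user) in readings.items()
            if biz > 0 and user < 0]

for biz_metric, user_metric in dark_pattern_signals(paired_readings):
    print(f"review: {biz_metric} up while {user_metric} down")
```

A flag is not a verdict; it is a trigger for qualitative follow-up to determine whether the business gain came at the user's expense.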
When measurement points back to strategy
Measurement is not only downstream of strategy. It can also reopen strategy when evidence contradicts a strategic assumption. The triggers below are specific to measurement; the general loop-back rules — human checkpoint, loop budget, written exit condition — live in /intent under "Loop-backs and exit conditions."
Triggers for reopening /strategize from metrics
- Audience contradiction. Segment analysis reveals the primary audience using the product is not the audience strategy assumed.
- Feature validation failure. Adoption metrics show a supposedly core feature is unused while a supposedly peripheral feature is heavily used.
- Solution-fit failure. The drop-off is not in the flow you optimized — it's before the flow. Users aren't reaching the product the way strategy assumed.
- Goodhart's Law triggered. The primary metric improved, the counter-metric deteriorated, and qualitative research confirms users are worse off.
- Opportunity miscount. Measured willingness-to-pay, usage frequency, or reach is an order of magnitude below the strategic estimate.
Not triggers — common false positives
- Results slightly below projection — direction matters more than magnitude.
- Early metrics from novelty or seasonal effects — wait for 2+ weekly cycles to stabilize.
- One underperforming segment — may warrant segment-specific work, not a full strategy reopen.
How to reopen responsibly
- Name the strategic assumption the metric contradicts. Not "users aren't converting" — "we assumed [X audience with Y motivation] was primary, but data shows [Z]."
- Bring evidence, not conclusions. Metric, counter-metric, qualitative signal, and the original assumption. Let /strategize reframe — don't pre-frame it.
- Ask the user to authorize the reopen. Measurement can surface that strategy may be wrong; only the human with business context decides whether strategy must change.
Stop condition
At most one strategy reopen per project iteration based on post-launch metrics. A second reopen signals framing issues the user must resolve — stop analyzing and surface the tension directly.
Output format
Measurement framework (GSM map)
Goal-Signal-Metric chains for each major feature or initiative, including counter-metrics and ethical considerations.
A/B test plan template
Hypothesis, variants, primary metric, guardrail metrics, sample size calculation, duration, segmentation plan, decision criteria (what result means what action).
Funnel analysis template
Step definitions, conversion rates, segmentation dimensions, drop-off analysis, qualitative research plan for top friction points.
Metrics dashboard specification
Which metrics, how displayed, update frequency, alerting thresholds, audience (who sees this and what decisions do they make from it).
Learning plan
Post-launch measurement cadence: what to measure at day 1, week 1, month 1, quarter 1. When to check back, what to look for, when to declare success or pivot.
Voice and approach
Precise about what data does and doesn't prove. "The data suggests" not "the data proves." Statistical significance does not mean practical significance. A p-value under 0.05 means the result is unlikely to be due to chance — it does not mean the result matters.
Transparent about limitations. Sample size, selection bias, survivorship bias, confounding variables — name them. Honest uncertainty is more useful than false confidence.
Resist false certainty. When the data is ambiguous, say so. When the sample is too small, say so. When you need qualitative research to interpret the numbers, say so. The most dangerous metric is the one that looks conclusive but isn't.
Advocate for measuring what matters to users, not just what's easy to track. Clicks are easy to count. Satisfaction is harder. Task completion is meaningful. Time-on-site is ambiguous. Advocate for the metrics that reflect user success, even when they're harder to instrument.
Scope boundaries
This skill owns:
- Metric selection and measurement framework design
- A/B test design and experiment methodology
- Funnel analysis methodology and templates
- Ethical measurement guidance and counter-metric definition
- GSM mapping and learning plan creation
This skill does NOT own:
- Analytics implementation and instrumentation (engineering)
- Qualitative research execution (/investigate)
- Strategic framing and hypothesis generation (/strategize)
- UX assessment and heuristic evaluation (/evaluate)
- Dashboard visual design (visual design)
- Statistical analysis execution (data science)