usability-testing

Usability Testing

Plan and run tests that find usability problems before users hit them in production. Stack-agnostic. Tool-agnostic.

This skill is for testing existing designs or prototypes. For broader discovery research, use `ux-research`. For conversion testing in production, use `cro-optimization`.

When to use

Before launching a new flow or major redesign
After a redesign to verify it doesn't introduce new problems
When analytics show drop-off but you don't know why
When customer support tickets cluster around specific UI areas
Pre-launch user validation
Comparing two design directions

When NOT to use

Discovery / generative research (use `ux-research`)
Live conversion optimization (use `cro-optimization`)
Mapping the broader experience (use `journey-mapping`)
Pure quantitative measurement (use `analytics-strategy`)

Required inputs

The design or prototype to test (functional or near-functional)
Specific tasks users would do
The audience (who should be tested)
Testing infrastructure (moderated tool, unmoderated tool, in-person setup)

The framework: 5 phases

1. Define what to test

Don't test the whole product. Test specific tasks.

Task selection criteria:

The task represents a real user goal (not "click around and explore")
The task has a clear start and end
The task is achievable in 2 to 10 minutes
The task is one of: most common, most strategic, most problematic

Examples of testable tasks:

"You want to find a contractor near you who can install a fence. Show me how you'd do that on this site."

"You're a first-time visitor. You want to understand if this product fits your needs. Walk me through how you'd evaluate it."

"Your team needs a new tool to manage projects. Use this site to figure out which plan is right for a 12-person team."

Task framing rules:

State the user goal, not the system action ("find a place to stay" not "click the search button")
Provide context (why are you doing this?)
Don't reveal the path
Don't use product terminology in the task framing
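
The framing rules above can be partly checked before a session. The following is a rough sketch, not a validated linter: the `PATH_WORDS` list and the `framing_warnings` helper are hypothetical names introduced for illustration, and a task that passes can still be badly framed.

```python
import re

# Hypothetical list of words that tend to name a system action or UI element.
# Extend it with your own product's labels.
PATH_WORDS = ("click", "tap", "button", "menu", "tab", "dropdown", "scroll")


def framing_warnings(task: str, product_terms: tuple[str, ...] = ()) -> list[str]:
    """Return warnings for a task prompt; an empty list means none were found."""
    # Whole-word matching, so "table" does not trigger the "tab" check.
    words = set(re.findall(r"[a-z']+", task.lower()))
    warnings = []
    for word in PATH_WORDS:
        if word in words:
            warnings.append(f"may reveal the path: '{word}'")
    # Product terminology is matched as a case-insensitive substring.
    for term in product_terms:
        if term.lower() in task.lower():
            warnings.append(f"uses product terminology: '{term}'")
    return warnings
```

Running it over a draft script flags prompts such as "Click the menu and find the pricing page", while a goal-framed prompt like the fence contractor example returns no warnings.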

2. Choose moderated or unmoderated

Moderated (live, with researcher):

Researcher observes and probes in real time
Best for early-stage prototypes, complex tasks, novel concepts
Higher cost, smaller sample (5 to 8 participants typical)
Catches surprises and lets the researcher probe deeper

Unmoderated (recorded, asynchronous):

Participant completes alone, often via tool (UserTesting, Maze, Lookback)
Best for stable designs, simple tasks, larger sample
Lower cost, larger sample (15 to 30 participants typical)
Catches patterns at scale, less depth per session

For most teams: moderated for early/critical decisions, unmoderated for ongoing validation.

3. Recruit

Recruit from the target audience, not from whoever is convenient.

Recruitment criteria:

Match real users (target audience, not just "anyone")
Mix of experience levels with the product (new and existing if applicable)
Mix of relevant device types (mobile, desktop, tablet if relevant)
Exclude friends, family, employees

Sample size:

Moderated: 5 to 8 participants (Nielsen's "5 users find 85% of usability issues" for the most common segment)
Unmoderated: 15 to 30 participants (more participants compensate for less probing)
Multi-segment testing: 5 to 8 per segment
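
The "5 users" figure is usually traced to the Nielsen-Landauer model, in which the expected share of problems seen by n participants is 1 - (1 - L)^n, where L is the chance that a single participant hits a given problem. The sketch below assumes the commonly cited average L = 0.31. That value is an assumption here, not something measured for your product, and the function name is made up for illustration.

```python
def share_of_problems_found(n_participants: int, detect_rate: float = 0.31) -> float:
    """Expected share of usability problems seen at least once.

    detect_rate is the chance that one participant hits a given problem.
    0.31 is a commonly cited average, assumed here; real values vary by
    product, task, and audience.
    """
    if n_participants < 0:
        raise ValueError("n_participants must be non-negative")
    if not 0.0 <= detect_rate <= 1.0:
        raise ValueError("detect_rate must be between 0 and 1")
    return 1.0 - (1.0 - detect_rate) ** n_participants


if __name__ == "__main__":
    for n in (1, 3, 5, 8, 15):
        print(f"{n:>2} participants: {share_of_problems_found(n):.0%}")
```

With the assumed rate, five participants come out at roughly 84%, which is where the rounded "85%" comes from. A lower detection rate (rarer problems, more varied audience) pushes the required sample up, which is one reason to recruit 5 to 8 per segment rather than 5 in total.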

4. Run the test

Pre-task setup:

Confirm recording works
Brief participant (purpose, anonymity, recording, "no wrong answers")
Get verbal consent
Have participant share screen if remote

Moderated session structure:

Warm-up (2 to 3 min). Easy questions to put participant at ease.
Pre-test questions (3 to 5 min). Background context, current behavior with similar products.
Task 1 (5 to 10 min). Describe task. Have participant attempt while thinking aloud.
Post-task questions (1 to 2 min). What was easy/hard? Anything confusing?
Repeat for tasks 2, 3, 4 (typically 3 to 5 tasks per 60-minute session).
Overall debrief (5 to 10 min). General reactions, comparisons to alternatives, anything else.
Close (2 min).
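
As a sanity check on the structure above, the segment durations can be added up to see how many tasks fit in a session. This is a minimal sketch using only the minute ranges listed above; the names `FIXED_SEGMENTS`, `PER_TASK`, and `session_range` are illustrative.

```python
# (min_minutes, max_minutes) for segments that occur once per session.
FIXED_SEGMENTS = {
    "warm-up": (2, 3),
    "pre-test questions": (3, 5),
    "overall debrief": (5, 10),
    "close": (2, 2),
}
# Segments that repeat for every task.
PER_TASK = {
    "task": (5, 10),
    "post-task questions": (1, 2),
}


def session_range(n_tasks: int) -> tuple[int, int]:
    """Return (shortest, longest) session length in minutes for n_tasks tasks."""
    if n_tasks < 1:
        raise ValueError("a session needs at least one task")
    low = sum(lo for lo, _ in FIXED_SEGMENTS.values())
    high = sum(hi for _, hi in FIXED_SEGMENTS.values())
    low += n_tasks * sum(lo for lo, _ in PER_TASK.values())
    high += n_tasks * sum(hi for _, hi in PER_TASK.values())
    return low, high
```

Three tasks give a range of 30 to 56 minutes, which fits a 60-minute slot even at the upper bounds. Five tasks give 42 to 80 minutes, so a five-task session only fits if most tasks run toward the short end.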

Moderation principles:

Encourage think-aloud ("What's going through your mind?")
Don't help unless they're truly stuck (and even then, only after a long pause)
Don't lead ("Are you looking for the menu?" - bad)
Note where they hesitate, scroll, or backtrack
Note their language vs the product's language
Note emotional reactions

Anti-patterns:

Talking too much (researcher should talk maybe 20% of the time)
Defending the design when participants struggle
Helping prematurely
Asking participants to predict their future behavior
Treating participant suggestions as features ("Users want X" - test demand for X separately)

5. Synthesize and report

Patterns across participants are signal. Single-participant complaints are weaker (but worth investigating).

Synthesis steps:

Issue inventory. Every issue observed, with which participant, which task, severity.
Cluster. Issues that are the same root problem.
Severity.
- Critical: Blocks task completion. Most users hit this.
- Major: Significantly slows task. Many users hit this.
- Minor: Friction. Some users hit this. Workaround exists.
- Cosmetic: Polish. Doesn't affect task.
Recommendations. For each issue, propose specific fixes.
Prioritize. By severity and effort.
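
The synthesis steps above map onto a small data structure. The sketch below is one possible shape, not a prescribed tool: the `Observation` fields, the rule that a cluster takes the worst severity observed, and the tie-break on participant count are all assumptions made for illustration.

```python
from collections import defaultdict
from dataclasses import dataclass

# Lower rank sorts first.
SEVERITY_RANK = {"critical": 0, "major": 1, "minor": 2, "cosmetic": 3}


@dataclass(frozen=True)
class Observation:
    participant: str
    task: str
    root_problem: str  # label assigned by the researcher while clustering
    severity: str      # one of the SEVERITY_RANK keys


def cluster(observations):
    """Group observations by root problem, keeping the worst severity seen."""
    groups = defaultdict(list)
    for obs in observations:
        groups[obs.root_problem].append(obs)
    clusters = []
    for root, items in groups.items():
        worst = min(items, key=lambda o: SEVERITY_RANK[o.severity]).severity
        clusters.append({
            "root_problem": root,
            "severity": worst,
            "participants": sorted({o.participant for o in items}),
        })
    return clusters


def prioritize(clusters, effort):
    """Sort by severity, then lower effort, then more affected participants.

    effort maps root_problem to an estimate in any consistent unit;
    clusters without an estimate sort last within their severity.
    """
    return sorted(
        clusters,
        key=lambda c: (
            SEVERITY_RANK[c["severity"]],
            effort.get(c["root_problem"], float("inf")),
            -len(c["participants"]),
        ),
    )
```

Keeping the participant list on each cluster makes the frequency column of the report straightforward to fill in, and it makes single-participant issues easy to spot.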

Report structure:

Usability Test: [Design / flow]

Summary

[2 to 3 paragraphs covering: what was tested, headline findings, top 3 priorities]

Method

[Moderated/unmoderated, sample size, audience, dates, tasks]

Critical findings

[Each with description, frequency, supporting evidence (quotes/clips), recommendation]

Major findings

[Same structure]

Minor findings

[Brief]

Cosmetic findings

[Briefest]

What worked well

[Calibration: capture successes too]

Recommendations

[Prioritized list with effort estimates]

Next steps

[Test re-run schedule, design iteration plan]

---

Workflow

Define the goals. What decisions hinge on this? What tasks matter most?
Design tasks. 3 to 5 specific, realistic, goal-framed tasks.
Choose moderated vs unmoderated. Match to stage and depth needed.
Recruit. Specific to audience.
Pilot. 1 to 2 sessions before main batch. Refine tasks if needed.
Run. Follow the protocol. Stay disciplined.
Synthesize during, not just after. Patterns emerge by session 4 or 5.
Report. Multiple formats - written report + highlight clips.
Track fixes. Every critical issue should have an owner and date.
Re-test after fixes. Verify the fix worked and didn't introduce new issues.

Failure patterns

Testing the whole product instead of specific tasks. Vague results.
Tasks that reveal the path. ("Click the menu and find...")
Friends and family as participants. Biased, not representative.
Researcher leading the participant. Findings reflect the researcher.
Defending the design when participants struggle. Misses real issues.
Helping too quickly. Participant doesn't experience the friction.
Treating participant suggestions as features. Users solve their problem; product team designs the solution.
Treating one participant as a finding. A single strong opinion is a data point, not a finding.
Skipping severity scoring. All findings treated equally; team can't prioritize.
Reports no one reads. Highlight clips and live walkthroughs work better than 80-page decks.
Testing once, never re-testing. Fixes that introduce new problems go undetected.

Output format

Default outputs:

Test plan (before testing) - `usability-test-plan-[topic].md`
Task script (per session) - `usability-tasks-[topic].md`
Findings report (after synthesis) - `usability-findings-[topic].md`
Highlight clips (separately produced)

Reference files

`references/task-script-patterns.md` - Task framing patterns by common product type, with good and bad examples.
