usability-testing

Usability Testing

Plan and run tests that find usability problems before users hit them in production. Stack-agnostic. Tool-agnostic.
This skill is for testing existing designs or prototypes. For broader discovery research, use ux-research. For conversion testing in production, use cro-optimization.

When to use

  • Before launching a new flow or major redesign
  • After a redesign, to verify it didn't introduce new problems
  • When analytics show drop-off but you don't know why
  • When customer support tickets cluster around specific UI areas
  • Pre-launch user validation
  • Comparing two design directions

When NOT to use

  • Discovery / generative research (use ux-research)
  • Live conversion optimization (use cro-optimization)
  • Mapping the broader experience (use journey-mapping)
  • Pure quantitative measurement (use analytics-strategy)

Required inputs

  • The design or prototype to test (functional or near-functional)
  • Specific tasks users would do
  • The audience (who should be tested)
  • Testing infrastructure (moderated tool, unmoderated tool, in-person setup)

The framework: 5 phases

1. Define what to test

Don't test the whole product. Test specific tasks.
Task selection criteria:
  • The task represents a real user goal (not "click around and explore")
  • The task has a clear start and end
  • The task is achievable in 2 to 10 minutes
  • The task is one of: most common, most strategic, most problematic
Examples of testable tasks:
"You want to find a contractor near you who can install a fence. Show me how you'd do that on this site."
"You're a first-time visitor. You want to understand if this product fits your needs. Walk me through how you'd evaluate it."
"Your team needs a new tool to manage projects. Use this site to figure out which plan is right for a 12-person team."
Task framing rules:
  • State the user goal, not the system action ("find a place to stay" not "click the search button")
  • Provide context (why are you doing this?)
  • Don't reveal the path
  • Don't use product terminology in the task framing

2. Choose moderated or unmoderated

Moderated (live, with researcher):
  • Researcher observes and probes in real time
  • Best for early-stage prototypes, complex tasks, novel concepts
  • Higher cost, smaller sample (5 to 8 participants typical)
  • Catches surprises and lets the researcher probe deeper
Unmoderated (recorded, asynchronous):
  • Participant completes alone, often via tool (UserTesting, Maze, Lookback)
  • Best for stable designs, simple tasks, larger sample
  • Lower cost, larger sample (15 to 30 participants typical)
  • Catches patterns at scale, less depth per session
For most teams: moderated for early/critical decisions, unmoderated for ongoing validation.

3. Recruit

Recruit from the target audience, not just from whoever is convenient.
Recruit criteria:
  • Match real users (target audience, not just "anyone")
  • Mix of experience levels with the product (new and existing if applicable)
  • Mix of relevant device types (mobile, desktop, tablet if relevant)
  • Exclude friends, family, employees
Sample size:
  • Moderated: 5 to 8 participants (Nielsen's guideline that 5 users find about 85% of usability issues, applied to the most common segment)
  • Unmoderated: 15 to 30 participants (more participants compensate for less probing)
  • Multi-segment testing: 5 to 8 per segment
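
The 85% figure above comes from a cumulative-discovery model, 1 - (1 - L)^n, where L is the share of issues a single participant exposes. The sketch below is illustrative only: the function name is hypothetical, and the default L = 0.31 is the average Nielsen reports, which may not hold for a given product or task set.

```python
def share_of_issues_found(participants: int, detect_rate: float = 0.31) -> float:
    """Expected share of usability issues observed at least once.

    Implements 1 - (1 - L)^n. The default L = 0.31 is an assumed
    average per-participant detection rate; real rates vary by
    product, task, and audience.
    """
    if participants < 0:
        raise ValueError("participants must be non-negative")
    if not 0.0 <= detect_rate <= 1.0:
        raise ValueError("detect_rate must be between 0 and 1")
    return 1.0 - (1.0 - detect_rate) ** participants


if __name__ == "__main__":
    # Compare the sample sizes suggested above.
    for n in (1, 3, 5, 8):
        print(f"{n} participants: {share_of_issues_found(n):.1%}")
```

With L = 0.31, five participants work out to roughly 84%, which is where the commonly quoted 85% comes from. A lower detection rate (harder-to-spot issues, more varied users) pushes the required sample up, which is one reason to test 5 to 8 per segment rather than 5 overall.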

4. Run the test

Pre-task setup:
  • Confirm recording works
  • Brief participant (purpose, anonymity, recording, "no wrong answers")
  • Get verbal consent
  • Have participant share screen if remote
Moderated session structure:
  1. Warm-up (2 to 3 min). Easy questions to put participant at ease.
  2. Pre-test questions (3 to 5 min). Background context, current behavior with similar products.
  3. Task 1 (5 to 10 min). Describe task. Have participant attempt while thinking aloud.
  4. Post-task questions (1 to 2 min). What was easy/hard? Anything confusing?
  5. Repeat for tasks 2, 3, 4 (typically 3 to 5 tasks per 60-minute session).
  6. Overall debrief (5 to 10 min). General reactions, comparisons to alternatives, anything else.
  7. Close (2 min).
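
The time ranges in the session structure above can be added up to check whether a planned number of tasks fits the slot. This is a rough planning sketch, not part of the protocol: the segment names and the `session_minutes` helper are hypothetical, and it assumes every task gets the same 5 to 10 minute attempt plus 1 to 2 minutes of post-task questions.

```python
# (label, min_minutes, max_minutes), taken from the session structure above.
FIXED_SEGMENTS = [
    ("warm-up", 2, 3),
    ("pre-test questions", 3, 5),
    ("overall debrief", 5, 10),
    ("close", 2, 2),
]
# Segments repeated once per task.
PER_TASK_SEGMENTS = [
    ("task attempt", 5, 10),
    ("post-task questions", 1, 2),
]


def session_minutes(task_count: int) -> tuple[int, int]:
    """Return the (shortest, longest) plausible session length in minutes."""
    if task_count < 1:
        raise ValueError("a session needs at least one task")
    low = sum(lo for _, lo, _ in FIXED_SEGMENTS)
    high = sum(hi for _, _, hi in FIXED_SEGMENTS)
    low += task_count * sum(lo for _, lo, _ in PER_TASK_SEGMENTS)
    high += task_count * sum(hi for _, _, hi in PER_TASK_SEGMENTS)
    return low, high


if __name__ == "__main__":
    for tasks in (3, 4, 5):
        low, high = session_minutes(tasks)
        verdict = "fits" if high <= 60 else "may overrun"
        print(f"{tasks} tasks: {low} to {high} min ({verdict} a 60-minute slot)")
```

Under these assumptions, three tasks span 30 to 56 minutes, while five tasks only fit in an hour if most of them run near the short end of their range. A pilot session is a cheap way to find out which end a given task lands on.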
Moderation principles:
  • Encourage think-aloud ("What's going through your mind?")
  • Don't help unless they're truly stuck (and even then, only after a long pause)
  • Don't lead ("Are you looking for the menu?" - bad)
  • Note where they hesitate, scroll, or backtrack
  • Note their language vs the product's language
  • Note emotional reactions
Anti-patterns:
  • Talking too much (researcher should talk maybe 20% of the time)
  • Defending the design when participants struggle
  • Helping prematurely
  • Asking participants to predict their future behavior
  • Treating participant suggestions as features ("Users want X" - test demand for X separately)

5. Synthesize and report

Patterns across participants are signal. Single-participant complaints are weaker (but worth investigating).
Synthesis steps:
  1. Issue inventory. Every issue observed, with which participant, which task, severity.
  2. Cluster. Issues that are the same root problem.
  3. Severity.
    • Critical: Blocks task completion. Most users hit this.
    • Major: Significantly slows task. Many users hit this.
    • Minor: Friction. Some users hit this. Workaround exists.
    • Cosmetic: Polish. Doesn't affect task.
  4. Recommendations. For each issue, propose specific fixes.
  5. Prioritize. By severity and effort.
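
One way to keep the issue inventory and the final ordering consistent is to hold each clustered issue as a small record and sort on severity, then effort. The sketch below is a minimal illustration: the `Issue` fields, the effort unit (days), and the tie-break on number of affected participants are assumptions, not something the framework prescribes.

```python
from dataclasses import dataclass

# Severity levels from the list above, most urgent first.
SEVERITY_RANK = {"critical": 0, "major": 1, "minor": 2, "cosmetic": 3}


@dataclass(frozen=True)
class Issue:
    title: str
    severity: str            # one of the SEVERITY_RANK keys
    participants: frozenset  # IDs of participants who hit the issue
    effort_days: float       # hypothetical estimate of the fix effort


def prioritize(issues):
    """Order issues by severity, then lower effort, then wider reach."""
    return sorted(
        issues,
        key=lambda i: (SEVERITY_RANK[i.severity], i.effort_days, -len(i.participants)),
    )


if __name__ == "__main__":
    # Made-up example data, not results from a real study.
    backlog = [
        Issue("Filter labels unclear", "minor", frozenset({"P2"}), 0.5),
        Issue("Checkout button hidden on mobile", "critical", frozenset({"P1", "P3", "P4"}), 2),
        Issue("Plan comparison requires scrolling", "major", frozenset({"P1", "P5"}), 1),
    ]
    for issue in prioritize(backlog):
        print(f"{issue.severity:<9} {len(issue.participants)} participants  {issue.title}")
```

Recording which participants hit each issue also makes it easy to tell a cross-participant pattern from a single-participant complaint when writing up frequency in the report.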
Report structure:

Usability Test: [Design / flow]

Summary

[2 to 3 paragraphs covering: what was tested, headline findings, top 3 priorities]

Method

[Moderated/unmoderated, sample size, audience, dates, tasks]

Critical findings

[Each with description, frequency, supporting evidence (quotes/clips), recommendation]

Major findings

[Same structure]

Minor findings

[Brief]

Cosmetic findings

[Briefest]

What worked well

[Calibration: capture successes too]

Recommendations

[Prioritized list with effort estimates]

Next steps

[Test re-run schedule, design iteration plan]

---

Workflow

  1. Define the goals. What decisions hinge on this? What tasks matter most?
  2. Design tasks. 3 to 5 specific, realistic, goal-framed tasks.
  3. Choose moderated vs unmoderated. Match to stage and depth needed.
  4. Recruit. Specific to audience.
  5. Pilot. 1 to 2 sessions before main batch. Refine tasks if needed.
  6. Run. Follow the protocol. Stay disciplined.
  7. Synthesize during, not just after. Patterns emerge by session 4 or 5.
  8. Report. Multiple formats - written report + highlight clips.
  9. Track fixes. Every critical issue should have an owner and date.
  10. Re-test after fixes. Verify the fix worked and didn't introduce new issues.

Failure patterns

  • Testing the whole product instead of specific tasks. Vague results.
  • Tasks that reveal the path. ("Click the menu and find...")
  • Friends and family as participants. Biased, not representative.
  • Researcher leading the participant. Findings reflect the researcher.
  • Defending the design when participants struggle. Misses real issues.
  • Helping too quickly. Participant doesn't experience the friction.
  • Treating participant suggestions as features. Users solve their problem; product team designs the solution.
  • Treating one participant as a pattern. One participant is one data point; a single strong opinion isn't a finding.
  • Skipping severity scoring. All findings treated equally; team can't prioritize.
  • Reports no one reads. Highlight clips and live walkthroughs work better than 80-page decks.
  • Testing once, never re-testing. Fixes that introduce new problems go undetected.

Output format

Default outputs:
  1. Test plan (before testing) - usability-test-plan-[topic].md
  2. Task script (per session) - usability-tasks-[topic].md
  3. Findings report (after synthesis) - usability-findings-[topic].md
  4. Highlight clips (separately produced)
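
If the output files are generated by a script, the naming patterns above can be filled in from a topic string. This is a small sketch under an assumed convention: the document does not say how [topic] should be formatted, so the lowercase, hyphen-separated slug and the helper names here are hypothetical.

```python
import re

# File name patterns from the list above; "{topic}" stands in for [topic].
OUTPUT_PATTERNS = {
    "test_plan": "usability-test-plan-{topic}.md",
    "task_script": "usability-tasks-{topic}.md",
    "findings_report": "usability-findings-{topic}.md",
}


def slugify(topic: str) -> str:
    """Lowercase the topic and join alphanumeric runs with hyphens (assumed convention)."""
    slug = re.sub(r"[^a-z0-9]+", "-", topic.lower()).strip("-")
    if not slug:
        raise ValueError("topic must contain at least one ASCII letter or digit")
    return slug


def output_files(topic: str) -> dict:
    """Map each default output to its file name for the given topic."""
    slug = slugify(topic)
    return {kind: pattern.format(topic=slug) for kind, pattern in OUTPUT_PATTERNS.items()}


if __name__ == "__main__":
    for kind, name in output_files("Checkout Flow v2").items():
        print(f"{kind}: {name}")
```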

Reference files

  • references/task-script-patterns.md - Task framing patterns by common product type, with good and bad examples.