usability-testing
Usability Testing
Plan and run tests that find usability problems before users hit them in production. Stack-agnostic. Tool-agnostic.
This skill is for testing existing designs or prototypes. For broader discovery research, use ux-research. For conversion testing in production, use cro-optimization.
When to use
- Before launching a new flow or major redesign
- After a redesign to verify it doesn't introduce new problems
- When analytics show drop-off but you don't know why
- When customer support tickets cluster around specific UI areas
- Pre-launch user validation
- Comparing two design directions
When NOT to use
- Discovery / generative research (use ux-research)
- Live conversion optimization (use cro-optimization)
- Mapping the broader experience (use journey-mapping)
- Pure quantitative measurement (use analytics-strategy)
Required inputs
- The design or prototype to test (functional or near-functional)
- Specific tasks users would do
- The audience (who should be tested)
- Testing infrastructure (moderated tool, unmoderated tool, in-person setup)
The framework: 5 phases
1. Define what to test
Don't test the whole product. Test specific tasks.
Task selection criteria:
- The task represents a real user goal (not "click around and explore")
- The task has a clear start and end
- The task is achievable in 2 to 10 minutes
- The task is one of: most common, most strategic, most problematic
Examples of testable tasks:
"You want to find a contractor near you who can install a fence. Show me how you'd do that on this site."
"You're a first-time visitor. You want to understand if this product fits your needs. Walk me through how you'd evaluate it."
"Your team needs a new tool to manage projects. Use this site to figure out which plan is right for a 12-person team."
Task framing rules:
- State the user goal, not the system action ("find a place to stay" not "click the search button")
- Provide context (why are you doing this?)
- Don't reveal the path
- Don't use product terminology in the task framing
2. Choose moderated or unmoderated
Moderated (live, with researcher):
- Researcher observes and probes in real time
- Best for early-stage prototypes, complex tasks, novel concepts
- Higher cost, smaller sample (5 to 8 participants typical)
- Catches surprises and lets the researcher probe deeper
Unmoderated (recorded, asynchronous):
- Participant completes alone, often via tool (UserTesting, Maze, Lookback)
- Best for stable designs, simple tasks, larger sample
- Lower cost, larger sample (15 to 30 participants typical)
- Catches patterns at scale, less depth per session
For most teams: moderated for early/critical decisions, unmoderated for ongoing validation.
3. Recruit
Recruit from the target audience, not just whoever is convenient.
Recruit criteria:
- Match real users (target audience, not just "anyone")
- Mix of experience levels with the product (new and existing if applicable)
- Mix of relevant device types (mobile, desktop, tablet if relevant)
- Exclude friends, family, employees
Sample size:
- Moderated: 5 to 8 participants (Nielsen's "5 users find 85% of usability issues" for the most common segment)
- Unmoderated: 15 to 30 participants (more participants compensate for less probing)
- Multi-segment testing: 5 to 8 per segment
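The "5 users" figure comes from a simple problem-discovery model: if each participant has a fixed chance of hitting a given problem, the share of problems seen at least once grows as 1 - (1 - p)^n. A minimal sketch of that arithmetic, assuming the 31% per-participant rate Nielsen and Landauer reported as an average; the function name `discovery_rate` is illustrative, and the real rate for a given product may be higher or lower.

```python
def discovery_rate(participants: int, per_user_rate: float = 0.31) -> float:
    """Expected share of usability problems observed at least once.

    per_user_rate is the ASSUMED chance that a single participant hits a
    given problem. 0.31 is a published average, not a property of your design.
    """
    if participants < 0:
        raise ValueError("participants must be non-negative")
    if not 0.0 <= per_user_rate <= 1.0:
        raise ValueError("per_user_rate must be between 0 and 1")
    # Probability that nobody hits the problem is (1 - p)^n; invert it.
    return 1.0 - (1.0 - per_user_rate) ** participants


if __name__ == "__main__":
    for n in (1, 3, 5, 8, 15):
        print(f"{n:>2} participants: {discovery_rate(n):.0%} of problems expected")
```

Under this assumption 5 participants surface about 84% of problems (the figure usually rounded to 85%) and 8 participants about 95%, which is why adding moderated sessions within one segment pays off less and less.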
4. Run the test
Pre-task setup:
- Confirm recording works
- Brief participant (purpose, anonymity, recording, "no wrong answers")
- Get verbal consent
- Have participant share screen if remote
Moderated session structure:
- Warm-up (2 to 3 min). Easy questions to put participant at ease.
- Pre-test questions (3 to 5 min). Background context, current behavior with similar products.
- Task 1 (5 to 10 min). Describe task. Have participant attempt while thinking aloud.
- Post-task questions (1 to 2 min). What was easy/hard? Anything confusing?
- Repeat for tasks 2, 3, 4 (typically 3 to 5 tasks per 60-minute session).
- Overall debrief (5 to 10 min). General reactions, comparisons to alternatives, anything else.
- Close (2 min).
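The minute ranges in this structure can be added up to check whether a planned session fits its time slot. A minimal sketch, assuming the ranges are taken literally and that every task is followed by its post-task questions; `session_minutes` and the two dictionaries are illustrative names, not part of any testing tool.

```python
# (shortest, longest) minutes per segment, copied from the session structure.
FIXED_SEGMENTS = {
    "warm-up": (2, 3),
    "pre-test questions": (3, 5),
    "overall debrief": (5, 10),
    "close": (2, 2),
}
# Segments that repeat once per task.
PER_TASK_SEGMENTS = {
    "task": (5, 10),
    "post-task questions": (1, 2),
}


def session_minutes(task_count: int) -> tuple[int, int]:
    """Return the (shortest, longest) session length for task_count tasks."""
    if task_count < 1:
        raise ValueError("a session needs at least one task")
    fixed_low = sum(low for low, _ in FIXED_SEGMENTS.values())
    fixed_high = sum(high for _, high in FIXED_SEGMENTS.values())
    task_low = sum(low for low, _ in PER_TASK_SEGMENTS.values())
    task_high = sum(high for _, high in PER_TASK_SEGMENTS.values())
    return (fixed_low + task_count * task_low,
            fixed_high + task_count * task_high)


if __name__ == "__main__":
    for tasks in (3, 4, 5):
        low, high = session_minutes(tasks)
        print(f"{tasks} tasks: {low} to {high} minutes")
```

With these ranges, three tasks run 30 to 56 minutes, while five tasks can run up to 80, so a five-task plan only fits a 60-minute session if the tasks stay near the short end.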
Moderation principles:
- Encourage think-aloud ("What's going through your mind?")
- Don't help unless they're truly stuck (and even then, only after a long pause)
- Don't lead ("Are you looking for the menu?" - bad)
- Note where they hesitate, scroll, or backtrack
- Note their language vs the product's language
- Note emotional reactions
Anti-patterns:
- Talking too much (researcher should talk maybe 20% of the time)
- Defending the design when participants struggle
- Helping prematurely
- Asking participants to predict their future behavior
- Treating participant suggestions as features ("Users want X" - test demand for X separately)
5. Synthesize and report
Patterns across participants are signal. Single-participant complaints are weaker (but worth investigating).
Synthesis steps:
- Issue inventory. Every issue observed, with which participant, which task, severity.
- Cluster. Issues that are the same root problem.
- Severity. Rate each cluster on one of four levels:
  - Critical: Blocks task completion. Most users hit this.
  - Major: Significantly slows task. Many users hit this.
  - Minor: Friction. Some users hit this. Workaround exists.
  - Cosmetic: Polish. Doesn't affect task.
- Recommendations. For each issue, propose specific fixes.
- Prioritize. By severity and effort.
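The synthesis steps above can be sketched as a small data pipeline: log observations, group them by root problem, take the worst severity seen in each group, then sort. This is an illustration under stated assumptions, not a prescribed tool: the `Observation` fields, the `root_problem` label, and the choice to rank by severity first and estimated effort second are all hypothetical.

```python
from collections import defaultdict
from dataclasses import dataclass

# Lower rank = more severe. Level names follow the severity scale above.
SEVERITY_RANK = {"critical": 0, "major": 1, "minor": 2, "cosmetic": 3}


@dataclass(frozen=True)
class Observation:
    participant: str
    task: str
    root_problem: str  # label assigned by the researcher while clustering
    severity: str      # one of the keys in SEVERITY_RANK


def cluster(observations):
    """Group observations by root problem; keep worst severity and who hit it."""
    groups = defaultdict(list)
    for obs in observations:
        groups[obs.root_problem].append(obs)
    return [
        {
            "problem": problem,
            "severity": min((o.severity for o in items),
                            key=SEVERITY_RANK.__getitem__),
            "participants": sorted({o.participant for o in items}),
        }
        for problem, items in groups.items()
    ]


def prioritize(clusters, effort_days):
    """Sort by severity, then by estimated effort (assumed tie-breaker)."""
    return sorted(
        clusters,
        key=lambda c: (SEVERITY_RANK[c["severity"]],
                       effort_days.get(c["problem"], float("inf"))),
    )
```

Counting distinct participants per cluster keeps the frequency visible next to the severity, so a pattern across participants is easy to tell apart from a single complaint.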
Report structure:
Usability Test: [Design / flow]
Summary
[2 to 3 paragraphs covering: what was tested, headline findings, top 3 priorities]
Method
[Moderated/unmoderated, sample size, audience, dates, tasks]
Critical findings
[Each with description, frequency, supporting evidence (quotes/clips), recommendation]
Major findings
[Same structure]
Minor findings
[Brief]
Cosmetic findings
[Briefest]
What worked well
[Calibration: capture successes too]
Recommendations
[Prioritized list with effort estimates]
Next steps
[Test re-run schedule, design iteration plan]
Workflow
- Define the goals. What decisions hinge on this? What tasks matter most?
- Design tasks. 3 to 5 specific, realistic, goal-framed tasks.
- Choose moderated vs unmoderated. Match to stage and depth needed.
- Recruit. Specific to audience.
- Pilot. 1 to 2 sessions before main batch. Refine tasks if needed.
- Run. Follow the protocol. Stay disciplined.
- Synthesize during, not just after. Patterns emerge by session 4 or 5.
- Report. Multiple formats - written report + highlight clips.
- Track fixes. Every critical issue should have an owner and date.
- Re-test after fixes. Verify the fix worked, didn't introduce new issues.
Failure patterns
- Testing the whole product instead of specific tasks. Vague results.
- Tasks that reveal the path. ("Click the menu and find...")
- Friends and family as participants. Biased, not representative.
- Researcher leading the participant. Findings reflect the researcher.
- Defending the design when participants struggle. Misses real issues.
- Helping too quickly. Participant doesn't experience the friction.
- Treating participant suggestions as features. Users solve their problem; product team designs the solution.
- Treating one participant as a pattern. A single strong opinion is a data point, not a finding.
- Skipping severity scoring. All findings treated equally; team can't prioritize.
- Reports no one reads. Highlight clips and live walkthroughs work better than 80-page decks.
- Testing once, never re-testing. Fixes that introduce new problems go undetected.
Output format
Default outputs:
- Test plan (before testing) - usability-test-plan-[topic].md
- Task script (per session) - usability-tasks-[topic].md
- Findings report (after synthesis) - usability-findings-[topic].md
- Highlight clips (separately produced)
Reference files
- references/task-script-patterns.md - Task framing patterns by common product type, with good and bad examples.