computer-use-playbook

Computer Use Playbook

Overview

Use this skill for end-to-end computer automation across browser and desktop surfaces. Browser use is a major track, but not the only one. Prefer deterministic methods first, then escalate to visual/native automation only when required. For browser MCP workflows, treat tab_id as a required handle for all stateful actions.

Playbook Structure

  1. Browser use (primary for web tasks): browser MCP tools, DOM snapshots, scripts, screenshots.
  2. Filesystem use: shell-native operations for deterministic file/process work.
  3. Native desktop use: coordinate and window automation only when DOM/shell are insufficient.
  4. Human-in-the-loop checkpoints: login, CAPTCHA, security prompts, or policy-gated steps.

Decision Order

  1. Identify the active surface: browser page, filesystem/process, or native desktop UI.
  2. For browser pages, use browser MCP tools first and keep a strict tab_id contract.
  3. For filesystem/process work, use shell/system tools first (rg, ls, find, etc.).
  4. Escalate to vision or native UI automation only when deterministic methods are insufficient.
  5. If blocked by login, CAPTCHA, or security gates, switch to human-in-the-loop flow.
  6. Verify each critical step with state checks plus screenshot evidence.
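The decision order above can be read as a small routing function. The following is a minimal sketch, not part of the playbook itself: the `Track` enum, the `choose_track` name, and the surface strings ("browser", "filesystem", "native") are assumptions introduced for illustration.

```python
from enum import Enum


class Track(Enum):
    """Hypothetical labels for the four tracks named in the playbook."""
    BROWSER_TOOLS = "browser MCP tools"
    SHELL = "shell/system tools"
    NATIVE_UI = "vision or native UI automation"
    HUMAN = "human-in-the-loop"


def choose_track(surface: str, deterministic_ok: bool = True, gated: bool = False) -> Track:
    """Pick the track for one step, following the decision order.

    surface: "browser", "filesystem", or "native" (assumed labels).
    deterministic_ok: False once DOM/shell methods have proven insufficient.
    gated: True when blocked by login, CAPTCHA, or a security gate.
    """
    if surface not in ("browser", "filesystem", "native"):
        raise ValueError(f"unknown surface: {surface!r}")
    if gated:
        return Track.HUMAN
    if surface == "browser" and deterministic_ok:
        return Track.BROWSER_TOOLS
    if surface == "filesystem" and deterministic_ok:
        return Track.SHELL
    # Native desktop UI, or deterministic methods were insufficient.
    return Track.NATIVE_UI
```

Checking `gated` first reflects the rule that a login or CAPTCHA block overrides any automation choice; verification (step 6) still applies whichever track is returned.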

Browser Automation (Major Track)

Use browser tools + DOM-first for browser flows. Avoid jumping to native desktop clicks while the target is still reachable by browser tools.
Preferred sequence:
  1. open_tab and capture the returned tab_id.
  2. navigate_to(tab_id, url) for explicit page transitions.
  3. dom_snapshot(tab_id, ...) or run_script(tab_id, ...) to identify the target.
  4. run_script(tab_id, ...) to perform the action (click/type/submit).
  5. read_page(tab_id, ...) / run_script(tab_id, ...) to verify URL/title/content.
  6. screenshot(tab_id, ...) as evidence.
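The preferred sequence can be sketched end to end as follows. The tool names come from the list above, but their argument and return shapes are not specified in this playbook, so `StubBrowser` is a hypothetical in-memory stand-in and every signature here is an assumption made only to keep the example runnable.

```python
import json
from itertools import count


class StubBrowser:
    """In-memory stand-in for the browser MCP tools (assumed shapes)."""

    def __init__(self):
        self._ids = count(1)
        self.tabs = {}  # tab_id -> {"url": ..., "title": ...}

    def open_tab(self):
        tab_id = f"tab-{next(self._ids)}"
        self.tabs[tab_id] = {"url": "about:blank", "title": ""}
        return tab_id

    def navigate_to(self, tab_id, url):
        self.tabs[tab_id].update(url=url, title=f"Page at {url}")

    def dom_snapshot(self, tab_id):
        return [{"selector": "#submit", "text": "Submit"}]

    def run_script(self, tab_id, script):
        return {"ok": True, "script": script}

    def read_page(self, tab_id):
        return dict(self.tabs[tab_id])

    def screenshot(self, tab_id, path):
        return path  # a real tool would write an image to this path


def run_flow(browser, url, label, evidence_path):
    tab_id = browser.open_tab()                        # 1. capture tab_id
    browser.navigate_to(tab_id, url)                   # 2. explicit transition
    nodes = browser.dom_snapshot(tab_id)               # 3. identify the target
    target = next((n for n in nodes if n["text"] == label), None)
    if target is None:
        raise LookupError(f"no element labelled {label!r}")
    selector = json.dumps(target["selector"])          # quote safely for JS
    browser.run_script(tab_id, f"document.querySelector({selector}).click()")  # 4. act
    page = browser.read_page(tab_id)                   # 5. verify state
    if page["url"] != url:
        raise RuntimeError(f"unexpected URL after action: {page['url']}")
    shot = browser.screenshot(tab_id, evidence_path)   # 6. visual evidence
    return {"tab_id": tab_id, "url": page["url"], "screenshot": shot}
```

Every call after `open_tab` passes `tab_id` explicitly, which is the contract the rest of this section relies on.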
Session behavior guidance:
  • always pass tab_id for navigate_to, read_page, screenshot, dom_snapshot, run_script, and close_tab.
  • never rely on implicit active-tab behavior.
  • if a click opens a new tab/window, call list_tabs, detect the new tab_id, and continue explicitly on that tab_id.
  • keep a local map of purpose -> tab_id when handling multiple tabs.
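One way to keep the purpose -> tab_id map and pick up a newly opened tab is to diff two list_tabs results taken before and after the click. This is a sketch under the assumption that list_tabs yields a list of tab_id strings; `TabRegistry`, `detect_new_tabs`, and the example IDs are hypothetical names, not part of any tool API.

```python
def detect_new_tabs(before, after):
    """Return tab_ids present in `after` but not in `before`, in listing order."""
    known = set(before)
    return [tab_id for tab_id in after if tab_id not in known]


class TabRegistry:
    """Local purpose -> tab_id map, so no step depends on an implicit active tab."""

    def __init__(self):
        self._by_purpose = {}

    def register(self, purpose, tab_id):
        if purpose in self._by_purpose:
            raise ValueError(f"purpose already mapped: {purpose!r}")
        self._by_purpose[purpose] = tab_id

    def tab_for(self, purpose):
        # A KeyError here is deliberate: fail loudly rather than guess a tab.
        return self._by_purpose[purpose]

    def adopt_new_tab(self, purpose, before, after):
        """Register the single tab that appeared between two list_tabs calls."""
        new = detect_new_tabs(before, after)
        if len(new) != 1:
            raise RuntimeError(f"expected exactly one new tab, found {len(new)}")
        self.register(purpose, new[0])
        return new[0]
```

For example, after a click that opens a checkout window, `registry.adopt_new_tab("checkout", before, after)` records the new handle, and later steps call `registry.tab_for("checkout")` instead of assuming which tab is active.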
Escalation triggers:
  • dynamic overlays not stable via selectors,
  • canvas/rendered controls,
  • consent dialogs where selector path is inconsistent,
  • native picker launched from browser (file upload dialog).
Do not overuse fallback:
  • if a browser tool can do it, stay in browser tools.
  • use native automation only for cross-app boundaries (OS dialogs, non-DOM UI).

File Explorer and Filesystem Automation

Prefer shell-native methods before GUI clicking.
Use shell when possible:
  • search files: rg --files, find
  • move/copy/rename: mv, cp, mkdir
  • inspect metadata: ls -la, stat
Use native UI only when the workflow is GUI-only:
  • OS file picker from browser/app,
  • drag-drop interactions not scriptable via API,
  • app-specific explorer panes.
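When the shell steps need to be driven from a script, the same deterministic operations can be wrapped as below. This is a sketch rather than a prescribed helper: the function names are hypothetical, it shells out to `find` only on POSIX systems where it is on the PATH, and otherwise falls back to the standard library.

```python
import os
import shutil
import subprocess
from pathlib import Path


def find_files(root, name_pattern):
    """List regular files under `root` whose name matches a glob pattern."""
    root = Path(root)
    if os.name == "posix" and shutil.which("find"):
        out = subprocess.run(
            ["find", str(root), "-type", "f", "-name", name_pattern],
            check=True, capture_output=True, text=True,
        ).stdout
        return sorted(Path(line) for line in out.splitlines() if line)
    # Fallback when `find` is unavailable: pure-Python recursive glob.
    return sorted(p for p in root.rglob(name_pattern) if p.is_file())


def move_and_verify(src, dst_dir):
    """Equivalent of `mkdir -p`, `mv`, then `stat` to confirm the result."""
    src = Path(src)
    dst_dir = Path(dst_dir)
    dst_dir.mkdir(parents=True, exist_ok=True)
    target = Path(shutil.move(str(src), str(dst_dir / src.name)))
    size = target.stat().st_size  # raises if the move did not land
    return target, size
```

Returning the size from `stat` gives the caller a concrete state check after the move, in line with verifying each critical step.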

Native UI Automation

Use native UI automation for interactions outside application DOM/API.
Typical tools:
  • xdotool for key/click/type,
  • xprop / xwininfo for window targeting.
Guidelines:
  • ensure window focus before typing,
  • prefer keyboard-driven deterministic paths,
  • keep retries bounded and observable,
  • re-check application state after each action.
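The guidelines above (focus first, bounded and observable retries, re-check after each action) might be combined as in this sketch. The helper names are hypothetical; the xdotool subcommands used (`windowactivate --sync`, `type --delay`, `key`) are standard, but window IDs and the state check are supplied by the caller.

```python
import subprocess
import time


def xdotool_focus_cmd(window_id):
    """argv that activates a window and waits until it is active."""
    return ["xdotool", "windowactivate", "--sync", str(window_id)]


def xdotool_type_cmd(text, delay_ms=50):
    """argv that types text with a per-keystroke delay in milliseconds."""
    return ["xdotool", "type", "--delay", str(delay_ms), text]


def xdotool_key_cmd(keys):
    """argv that sends a key combination such as 'ctrl+l' or 'Return'."""
    return ["xdotool", "key", keys]


def run(cmd):
    """Execute one command; raises CalledProcessError on a non-zero exit."""
    subprocess.run(cmd, check=True)


def act_with_bounded_retry(action, state_ok, max_attempts=3, pause_s=0.5,
                           log=print, sleep=time.sleep):
    """Run `action`, re-check state after each try, and stop after max_attempts."""
    for attempt in range(1, max_attempts + 1):
        action()
        if state_ok():
            log(f"attempt {attempt}/{max_attempts}: state verified")
            return attempt
        log(f"attempt {attempt}/{max_attempts}: state not reached")
        if attempt < max_attempts:
            sleep(pause_s)
    raise RuntimeError(f"state not reached after {max_attempts} attempts")
```

A typical call would pass `action=lambda: run(xdotool_focus_cmd(wid))` together with a `state_ok` callable that confirms the expected window is active, so every attempt is logged and the loop cannot spin indefinitely.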

Human-in-the-Loop Rules

Pause and ask for user intervention when blocked by:
  • login/2FA challenges,
  • CAPTCHA or anti-bot checkpoints,
  • legal/security confirmation screens that require explicit human intent.
When waiting for user action:
  1. explain exactly what the user must do and where.
  2. issue an audible notification using speak so the user notices immediately.
  3. wait, then re-check state (url, title, element visibility, screenshot) before continuing.
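The three waiting steps can be sketched as a bounded polling loop. Here `speak` stands for the notification tool named above, but its real signature is not given in this playbook, so it is passed in as a plain callable; `wait_for_human`, `explain`, and `is_unblocked` are hypothetical names.

```python
import time


def wait_for_human(instructions, speak, is_unblocked, explain=print,
                   poll_s=5.0, max_polls=60, sleep=time.sleep):
    """Explain, notify audibly, then wait and re-check until unblocked.

    Returns the number of polls used; raises TimeoutError when the budget
    is exhausted so the caller can report a blocked state.
    """
    explain(instructions)   # 1. say exactly what to do and where
    speak(instructions)     # 2. audible notification
    for poll in range(1, max_polls + 1):
        sleep(poll_s)       # 3. wait ...
        if is_unblocked():  # ... then re-check state before continuing
            return poll
    raise TimeoutError(f"still blocked after {max_polls} checks")
```

In practice `is_unblocked` would wrap a fresh state read (URL, title, element visibility), so the flow resumes from verified page state rather than from the user's word alone.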

Special Cases

Consent dialogs

  • DOM-first click (Accept all / Reject all / localized variants).
  • if selector fails but button is visible, use coordinate/native fallback.
  • confirm modal is not visible and main interaction path works.
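Matching localized button labels is easier with normalized text than with a fixed selector. The sketch below assumes the DOM snapshot has already been reduced to a list of dicts with `text` and `visible` keys (a hypothetical shape), and the label tuples are illustrative examples, not an exhaustive list.

```python
# Illustrative label variants only; extend per site and locale.
ACCEPT_LABELS = ("Accept all", "Alle akzeptieren", "Tout accepter")
REJECT_LABELS = ("Reject all", "Alle ablehnen", "Tout refuser")


def pick_consent_button(buttons, labels):
    """Return the first visible button whose text matches a known label.

    Comparison ignores case and collapses whitespace. Returns None when
    nothing matches, which is the signal to consider the coordinate/native
    fallback.
    """
    wanted = {" ".join(label.split()).casefold() for label in labels}
    for button in buttons:
        text = " ".join(button["text"].split()).casefold()
        if button.get("visible", True) and text in wanted:
            return button
    return None
```

After clicking the returned button, the modal's absence should still be confirmed with a fresh snapshot, as the last bullet requires.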

CAPTCHA / anti-bot challenges

  • do not attempt bypass logic.
  • capture evidence and report blocked state clearly.
  • require human-in-the-loop completion.
  • notify user with speak when intervention is required.

Login and account security gates

  • try normal DOM steps first for username/password field fill and submit.
  • if SSO, passkey, device approval, or 2FA requires human action, pause and request user action.
  • after user confirms completion, re-snapshot and continue from verified page state.

File uploads

  • use DOM file input assignment if available.
  • if native picker opens, switch to native UI automation.
  • verify upload appears in page/app state.

Verification Standard

Every important step should end with both:
  1. state evidence (URL/title/content/element state), and
  2. visual evidence (screenshot path).
If blocked, report:
  • attempted method,
  • blocker reason,
  • evidence collected,
  • next safe fallback.
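The two evidence requirements and the four report fields map naturally onto small record types. This is one possible shape, with hypothetical class and field names; the playbook itself does not prescribe a data format.

```python
from dataclasses import dataclass, field


@dataclass
class StepEvidence:
    """Evidence for one important step: state plus a screenshot path."""
    step: str
    state: dict = field(default_factory=dict)  # e.g. url, title, element state
    screenshot_path: str = ""

    def is_complete(self):
        # Both kinds of evidence are required, not either one.
        return bool(self.state) and bool(self.screenshot_path)


@dataclass
class BlockedReport:
    """The four items to report when a step is blocked."""
    attempted_method: str
    blocker_reason: str
    evidence: list
    next_safe_fallback: str

    def to_lines(self):
        return [
            f"attempted method: {self.attempted_method}",
            f"blocker reason: {self.blocker_reason}",
            f"evidence collected: {', '.join(self.evidence) or 'none'}",
            f"next safe fallback: {self.next_safe_fallback}",
        ]
```

Gating progress on `is_complete()` makes it hard to skip the screenshot when the state check alone looks convincing.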

Learning Library Structure

Use references/learnings/ as the canonical knowledge base.
  • references/learnings/index.md: topic registry and folder convention.
  • references/learnings/general/: cross-task lessons.
  • references/learnings/<topic-slug>/: topic-specific lessons and experience log.
Topic folder convention:
  • lessons.md for stable workflow rules.
  • experience-log.md for incremental run learnings.
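The layout above can be resolved into concrete paths with a small helper. This sketch assumes the `references/` directory sits under a caller-supplied root; the function names and dictionary keys are hypothetical.

```python
from pathlib import Path


def learning_paths(root, topic_slug):
    """Resolve the knowledge-base files for one topic under `root`."""
    base = Path(root) / "references" / "learnings"
    return {
        "index": base / "index.md",
        "general_log": base / "general" / "experience-log.md",
        "lessons": base / topic_slug / "lessons.md",
        "topic_log": base / topic_slug / "experience-log.md",
    }


def missing_files(paths):
    """Names of expected files that do not exist yet, sorted for stable output."""
    return sorted(name for name, path in paths.items() if not path.is_file())
```

Calling `missing_files(learning_paths(".", "google-flow"))` before a run shows at a glance which parts of the convention still need to be created.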

Continuous Learning Loop (Required)

Treat each real run as training data for future runs.
Before starting similar work:
  1. Load references/learnings/index.md.
  2. Map the task to a topic slug (for example google-flow).
  3. Load references/learnings/general/experience-log.md.
  4. Load topic files when present:
    • references/learnings/<topic-slug>/lessons.md
    • references/learnings/<topic-slug>/experience-log.md
  5. If the topic folder does not exist, create it with lessons.md and experience-log.md.
During execution:
  1. Capture failure signal and the exact step where it appears.
  2. Record the minimal fix that resolved it.
  3. Keep one-action-at-a-time execution where UI state is fragile.
After completion (or meaningful failure):
  1. Append a short run note to references/learnings/<topic-slug>/experience-log.md.
  2. Include: date, context, failure signal, root cause, fix pattern, reusable rule.
  3. Keep entries concise and deduplicated by updating prior rules instead of adding noisy repeats.
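The post-run steps might look like the sketch below. The entry layout (a dated heading followed by one bullet per field) is an assumption, since the playbook lists the fields but not a format, and the dedup check is deliberately simple: it only skips a note whose reusable rule is already recorded verbatim, leaving the actual update of a prior rule to the author.

```python
from datetime import date
from pathlib import Path

# Field names follow the list in step 2; the date is added separately.
FIELDS = ("context", "failure signal", "root cause", "fix pattern", "reusable rule")


def append_run_note(root, topic_slug, note, day=None):
    """Append one run note to the topic's experience-log.md.

    Creates the topic folder with lessons.md and experience-log.md when
    missing. Returns False (and writes nothing) if the same reusable rule
    is already in the log.
    """
    missing = [name for name in FIELDS if not note.get(name)]
    if missing:
        raise ValueError(f"missing fields: {', '.join(missing)}")

    folder = Path(root) / "references" / "learnings" / topic_slug
    folder.mkdir(parents=True, exist_ok=True)
    (folder / "lessons.md").touch(exist_ok=True)
    log = folder / "experience-log.md"
    log.touch(exist_ok=True)

    rule_line = f"- reusable rule: {note['reusable rule']}"
    if rule_line in log.read_text(encoding="utf-8").splitlines():
        return False  # avoid a noisy repeat; update the earlier entry instead

    day = day or date.today()
    lines = [f"## {day.isoformat()}"] + [f"- {name}: {note[name]}" for name in FIELDS]
    with log.open("a", encoding="utf-8") as fh:
        fh.write("\n".join(lines) + "\n\n")
    return True
```

Requiring every field up front keeps entries comparable across runs, which is what makes later deduplication practical.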

References

Load references/computer-use-techniques.md for command snippets and fallback templates. Load references/learnings/index.md to select the right topic folder. Load references/learnings/general/experience-log.md for cross-task patterns. Load references/learnings/google-flow/lessons.md when automating Google Flow video creation. Load references/learnings/google-flow/experience-log.md for incremental Google Flow learnings.