expo-skill-eval


Expo Skill Eval


Evaluates skills in `plugins/expo/skills/` for trigger accuracy, generated code quality, and/or runtime rendering in Expo Go.

Requirements: macOS with Xcode (iOS simulators), Android SDK with at least one AVD, and `bun`. No other device tooling is assumed.

Workspace root: `/private/tmp/expo-skill-eval-<skill-name>/iteration-N/` (e.g. `/private/tmp/expo-skill-eval-expo-ui/iteration-4/`).


Before starting — clarify scope


Confirm all of the following up front, before any pipeline work. Don't skip any; skip a given item only if the request already states that choice. Batch them into `AskUserQuestion` calls of ≤4 questions each, in this order:

1. Which skill to eval (if not clear from the request).
2. Prompts: which prompts drive the eval. Built-in prompts (from the skill's eval cases) are all pre-selected; drop any, add a custom text prompt, or build from an uploaded screenshot (a target UI the skill must reproduce). See Prompts below.
3. What to verify: one multi-select of three options: Runtime + screenshots / Trigger accuracy / Code checks (no device). See What to verify below.
4. Expo SDK: latest (default, auto-detected) or a pinned version.
5. Runner: Expo Go (default) or development build.
6. Platforms: iOS / Android / web (always offer all three).
7. Permission flag for `claude -p`: skip-permissions (default) or accept-edits.
8. Viewer delivery: local only (default) or publish a shareable Artifact.
9. If trigger accuracy is selected: confirm the published `expo` plugin is disabled (or not installed).

Each is detailed below. Items 4–6 (SDK, runner, platforms) fit naturally in one `AskUserQuestion` call.

If the skill to eval is not clear from the request, list available skills from `plugins/expo/skills/` and ask which one to evaluate.

How the skill under test is loaded: two mechanisms, one per phase (don't pick one globally). Executor runs reference it by file path (`SKILL_PATH = plugins/expo/skills/<skill>/SKILL.md`, read explicitly), while the trigger eval loads it as a plugin (`--plugin-dir plugins/expo`, so the model can auto-select it from its description). Both point at the local, in-repo version, which is what you're evaluating. You do not need any special flag to launch the harness session itself (the harness finds the skill by repo path); the mechanisms apply to the `claude -p` subprocesses it spawns. See steps 1 and 3 for why each phase differs.

One pre-run check (required when the trigger eval is in scope): if the published `expo` plugin is installed/enabled, disable it (via `/plugin`) before launching the harness and re-enable after. A single disable is a global-config change that both this session and the spawned `claude -p` subprocesses inherit. Why it's required for the trigger eval: that phase loads the local skill via `--plugin-dir`, and a second installed `expo` collides with it. The model may trigger the published `expo:expo-ui`, and since detection only sees the tool-call name you'd silently score the published description instead of your local edits (the collision could also just error). The executor / runtime / static phases are not affected, because they read the skill under test by its local `SKILL_PATH` with no `--plugin-dir`, so a run with no trigger eval can skip the disable. Disabling `expo` does not disable `expo-skill-eval` (a standalone project skill, not part of the `expo` plugin), so the harness stays available.

Surface this to the user as an explicit up-front confirmation, the same way you confirm which skill to eval. When the trigger eval is in scope, ask the user to confirm the published `expo` plugin is disabled (or not installed) before you start step 1; if it's still enabled, pause and have them disable it via `/plugin`. Don't run the trigger eval until they confirm: the harness can't reliably detect installed plugins on its own (reading the global plugin config or running `claude plugin list` would prompt), so this is a manual confirmation, not an auto-check.

Pick the prompts: built-in, custom, or a target screenshot. The prompts are the inputs that drive the executor (with-skill and without-skill); they are separate from what you verify. Confirm them with `AskUserQuestion` (skip if the request already names a prompt):

- Built-in prompts: representative prompts you generate by reading the skill under test (its `SKILL.md` + `references/`) and `references/runtime-matrix.md`, covering the skill's standard use cases. (If the skill already ships eval cases under `evals/evals.json`, fold their `prompt` fields in too, but most skills don't, so you usually derive them.) Pre-select all so the default run exercises the skill's standard cases; let the user deselect any.
- Custom text prompt: a one-off prompt the user types. Don't spend a dedicated option slot on this: `AskUserQuestion` auto-adds a "Type something" / Other entry, and anything typed there becomes a custom text case.
- Build from an uploaded screenshot: the user gives the path to a target screenshot (a UI to reproduce). The executor is told to open it (`claude -p` reads PNGs with its Read tool) and build an app matching it; the case records the path as `reference_image`, and grading compares the generated app to that target (step 6). This is the strongest visual test for a UI skill: "build this."

Respect `AskUserQuestion`'s 4-option-per-question cap with this priority (the bug to avoid: the upload option silently dropped once the four slots fill up):

1. Always reserve a slot for "Build from an uploaded screenshot." It's the whole point of the visual eval and must never be the option that gets dropped.
2. Don't add an explicit "Custom text prompt" option; the auto "Type something" / Other entry already covers it.
3. Fill the remaining ≤3 slots with the built-in/representative prompts, pre-selected. If there are more than 3, collapse them into one pre-selected "All built-in prompts (default)" option and offer subset-picking in a short follow-up, so the upload option still fits.

Present it as a multi-select. When "Build from an uploaded screenshot" is picked, ask for the target image path in a follow-up. Each selected prompt (built-in, typed, or image) becomes one eval case (run with-skill and without-skill).
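The slot-priority rule can be sketched in a few lines of Python. This is only an illustration of the rule as described here: `build_prompt_options` and the constants are hypothetical names, not part of the harness or of `AskUserQuestion`.

```python
UPLOAD_OPTION = "Build from an uploaded screenshot"
ALL_BUILTINS_OPTION = "All built-in prompts (default)"
MAX_OPTIONS = 4  # AskUserQuestion's per-question option cap


def build_prompt_options(builtin_prompts):
    """Return (label, preselected) pairs for one multi-select question.

    The upload option always gets a slot; no explicit custom-text option
    is added because the tool supplies its own free-text entry.
    """
    if len(builtin_prompts) > MAX_OPTIONS - 1:
        # Too many built-ins to list individually: collapse them into one
        # pre-selected option so the upload slot still fits.
        options = [(ALL_BUILTINS_OPTION, True)]
    else:
        options = [(prompt, True) for prompt in builtin_prompts]
    options.append((UPLOAD_OPTION, False))
    return options
```

With five built-in prompts this yields two options (the collapsed default plus the upload slot); with three it yields exactly four.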

Always confirm what to verify unless the request makes it unambiguous. Present these options and let the user pick one or more (defaults in bold based on the skill's `references/runtime-matrix.md` entry):

| Option | What it does | When to suggest as default |
| --- | --- | --- |
| Runtime + screenshots | Full pipeline: fixture → executor → static gate → run the app on iOS/Android and screenshot it. The runner (Expo Go or dev build) is a separate question, so don't name it here. | Default for any skill that renders an app screen (the `expo-go` / `dev-build` rows in `references/runtime-matrix.md`). Requires a booted simulator/emulator. |
| Trigger accuracy | Run realistic prompts via `claude -p`, check whether the skill is read. Measures recall (should-trigger queries only). | Always useful as a standalone check. |
| Code checks (no device) | `tsc --noEmit` + diff-aware lint + `expo export`, plus the grader checks the generated code against any custom expectations you provide. No device. | Default for `static-only` and `n/a` skills, and whenever you want to verify code patterns (correct import path, a `Host` wrapper, …) without running the app. |

Present these as ONE multi-select question: "What do you want to verify?" These are grading dimensions (how to judge what gets built), distinct from the Prompts phase (what to build). The user may pick any combination. When a prompt is an uploaded screenshot (see Prompts), include "Runtime + screenshots" so the harness captures the generated app and the grader can score it against the target.

Read `references/runtime-matrix.md` to find the skill's default mode before suggesting. If the request already specifies a mode (e.g. "just check if it triggers", "run it on device"), skip the question and proceed.

Pick the Expo SDK version once, up front. Detect the latest with `bash /abs/path/expo-skill-eval/scripts/latest-sdk.sh`. It prints the SDK major; internally it uses `bun` to run `npm view expo dist-tags --json` and read the major via `JSON.parse` / `semver`, and it's covered by the bash-scripts rule, so don't run the registry query inline yourself, which would prompt. Then confirm with `AskUserQuestion`: default to that latest SDK, or let the user pin an older one (e.g. to reproduce a version-specific issue). Use the chosen version everywhere the fixture is built: pass it as the `<sdk>` arg to `make-fixture.sh` and write it into each eval case's `runtime.sdk`. If the request already names a version ("eval on SDK 54"), skip detection and use it.
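For orientation, here is a minimal Python sketch of the parsing step that detection relies on. It is not the contents of `latest-sdk.sh` (which uses `bun`); `latest_sdk_major` is a hypothetical helper, and the version strings in the usage note are made up.

```python
import json


def latest_sdk_major(dist_tags_json: str) -> str:
    """Extract the SDK major from `npm view expo dist-tags --json` output.

    The registry returns an object mapping tag names to versions; the
    `latest` tag carries the current stable release.
    """
    tags = json.loads(dist_tags_json)
    return tags["latest"].split(".")[0]
```

For example, an input of `{"latest": "56.0.3"}` yields `"56"`, which is the value to pass as the `<sdk>` arg and write into `runtime.sdk`.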

Default to the latest: it stays compatible with the Expo Go that `expo start` installs on the device. Pinning an SDK older than the device's installed Expo Go makes `expo start` try to prompt "Install the recommended Expo Go version?"; with no TTY (the snapshot scripts read stdin from `/dev/null`) it dies with `Input is required, but 'npx expo' is in non-interactive mode` and every snapshot fails. So only pin an older SDK when you also pre-install a matching Expo Go on the simulator/emulator; otherwise stick with latest.

Pick the runner: Expo Go (default) or a development build. Ask with `AskUserQuestion` (skip if the request already says which):

- Expo Go (default): the snapshot scripts run the app with `expo start --ios` / `expo start --android` as-is. Fast (no native compile), and it runs anything Expo Go bundles (including `@expo/ui` on SDK 56+). Cannot run custom native code (expo-modules, config plugins, native deps not in Expo Go).
- Development build: the snapshot scripts run `expo run:ios` / `expo run:android` instead, compiling a native dev client per fixture. Use this for skills whose output needs custom native code (the cases that would otherwise be `static-only`). Much slower: `expo run` prebuilds and natively compiles each fixture (minutes, especially the first), and needs the full iOS/Android build toolchain, so only choose it when the skill actually requires native code. Disk-heavy: each fixture's native build is multi-GB. The snapshot phase runs `clean-fixture.sh` after each fixture to keep peak usage to ~one build, but still prefer fewer eval cases and a single platform for dev-build runs, and keep a few GB free. `clean-fixture.sh` removes the per-fixture build output (`node_modules`, `ios`, `android`, `.expo`, `dist`, and the fixture's iOS DerivedData) and keeps the app source + git. The lever for dev-build disk is fewer eval cases + one platform: the script only reclaims per-fixture build output and never touches shared dependency caches, so nothing gets re-downloaded.

Pass the choice to the snapshot scripts via the `EXPO_SKILL_EVAL_RUNNER` env var (`expo-go` default, or `dev-build`), and reflect it in each eval case's `runtime.mode` (`expo-go` or `dev-build`). See step 5.
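One way an orchestrator might carry the runner choice to both places is sketched below. Only the `EXPO_SKILL_EVAL_RUNNER` variable and its two values come from this document; `snapshot_env` and `runtime_block` are hypothetical helper names.

```python
import os

VALID_RUNNERS = ("expo-go", "dev-build")


def snapshot_env(runner="expo-go", base_env=None):
    """Return a copy of the environment with the runner choice set."""
    if runner not in VALID_RUNNERS:
        raise ValueError(f"unknown runner: {runner!r}")
    env = dict(os.environ if base_env is None else base_env)
    env["EXPO_SKILL_EVAL_RUNNER"] = runner
    return env


def runtime_block(runner, platforms, sdk):
    """Build the eval case's `runtime` block so it mirrors the same choice."""
    return {"mode": runner, "platforms": list(platforms), "sdk": str(sdk)}
```

The resulting dict would be passed as `env=` to the `subprocess.run` call that invokes a snapshot script, so the choice is made once and never re-derived per script.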

Pick the platforms: always ask, regardless of skill. Offer iOS / Android / web (multi-select) with `AskUserQuestion`; default to iOS + Android, but always present web as an option and don't pre-filter by skill. Web is a valid choice for most skills: `@expo/ui`'s universal components (`Host`, `Row`, `Column`, `Button`, `List`, …) render on web, as do `use-dom`, NativeWind/Tailwind, API routes, and plain React Native. The only thing that won't show on web is a platform-specific native tree (`@expo/ui/swift-ui` or `@expo/ui/jetpack-compose`), which renders blank there. That blank is itself a useful signal, so it's still the user's call. Web runs via `snapshot-web.sh` (`expo start --web` + Playwright/Chromium) regardless of the runner (`expo run` is native-only; there's no web dev build), and it's the least-exercised path. Write the chosen set into each eval case's `runtime.platforms` and have `run_snapshots.py` loop over them.

Confirm how `claude -p` subprocesses run, once, before starting. Ask with `AskUserQuestion` whether they may run with `--dangerously-skip-permissions`, then apply the same answer to every subprocess this run (never re-prompt mid-run):

- Skip permissions (recommended): pass `--dangerously-skip-permissions`. Each subprocess runs unattended inside a throwaway fixture under `/private/tmp/expo-skill-eval-*` and can write files and run setup commands without prompting.
- Accept edits only: pass `--permission-mode acceptEdits` instead. Bash/installs are auto-denied (no TTY), so some evals may produce partial output.

A bare `claude -p` with neither flag can't write files at all. If the request already states a preference ("skip permissions", "don't use the dangerous flag"), skip the question.
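Since the same answer must apply to every subprocess, it helps to resolve it to an argument list once. A minimal sketch, where `permission_args` and the two choice keys are hypothetical names and the flags are the ones listed above:

```python
def permission_args(choice):
    """Map the up-front permission answer to `claude -p` arguments."""
    flags = {
        "skip-permissions": ["--dangerously-skip-permissions"],
        "accept-edits": ["--permission-mode", "acceptEdits"],
    }
    # KeyError on anything else: a bare `claude -p` can't write files,
    # so there is deliberately no "neither flag" fallback.
    return list(flags[choice])
```

Every orchestrator would then append `permission_args(choice)` to its command list instead of deciding per call.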

Confirm how to deliver the results viewer, once, up front. Publishing to claude.ai is outward-facing, so never do it mid-run by surprise; ask in the same up-front `AskUserQuestion` (alongside the permission flag):

- Local only (default): `generate_viewer.py` writes `viewer.html` and opens it in the local browser. Nothing leaves the machine.
- Publish a shareable Artifact: additionally render the viewer to a claude.ai Artifact (a default-private web page the user can share with teammates) at the very end. Only do this if the user opts in here.

If the request already says whether to share/publish, skip the question. See the Viewer section for the publish mechanics.


Eval case schema


You generate the run's eval cases, one per chosen prompt, and write them to `<workspace>/iteration-N/evals.json` (the viewer reads them from there). Each case extends the standard skill-creator eval-case shape with a `runtime` block and visual expectations:

```json
{
  "id": 1,
  "prompt": "Build me a settings screen with a dark mode toggle and a list of options",
  "expected_output": "Working Expo Router screen",
  "expectations": [
    "Uses Expo Router file-based routing",
    "TypeScript compiles with no errors"
  ],
  "runtime": {
    "mode": "expo-go",
    "platforms": ["ios", "android"],
    "sdk": "56"
  },
  "visual_expectations": [
    "No red error screen or Expo Go error overlay on any platform",
    "A settings screen with a visible toggle control is rendered"
  ]
}
```

- `runtime.mode`: how the eval runs after the static gate:
  - `"expo-go"`: run in Expo Go (`expo start --<platform>`) and screenshot. Fast, JS-only. Default.
  - `"dev-build"`: build a native dev client (`expo run:<platform>`) and screenshot. For skills whose output uses custom native code; much slower (native compile per fixture).
  - `"static-only"`: stop after the static gate. For skills that produce no UI, or when you don't want to run a device at all (CI).

  Consult `references/runtime-matrix.md` for which repo skills support which mode. (`dev-build` lets you actually run skills that previously had to be `static-only` for needing native code.)
- `runtime.platforms`: subset of `ios`, `android`, `web`, chosen up front (always offered, not gated on the skill; see Before starting). Defaults to `["ios", "android"]`.
- `runtime.sdk`: Expo SDK major for the fixture app. Set it to the version chosen up front (see Before starting). Omit to use the latest template.
- `reference_image` (optional, image prompt): absolute path to a target screenshot the skill must reproduce. When set, the executor is told to open it (via its Read tool) and build a matching app, and the grader scores how closely the generated app reproduces it (step 6) on top of the usual expectations. Set in the Prompts phase via "build from an uploaded screenshot."

An image-prompt case is a normal case with `reference_image` set; enable "Runtime + screenshots" so the harness captures the result to compare against the target:

```json
{
  "prompt": "Build an app whose UI matches the attached reference screenshot.",
  "reference_image": "/abs/path/to/target.png",
  "runtime": { "mode": "expo-go", "platforms": ["ios"], "sdk": "56" },
  "visual_expectations": ["Matches the reference's layout, components, and color treatment"]
}
```
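Because you write `evals.json` by hand for each run, a quick sanity check before the pipeline starts can catch a mistyped mode or a relative image path. The sketch below encodes only the field rules stated in this section; `validate_case` is a hypothetical helper, not something the harness ships.

```python
import os

VALID_MODES = {"expo-go", "dev-build", "static-only"}
VALID_PLATFORMS = {"ios", "android", "web"}


def validate_case(case):
    """Return a list of problems; an empty list means the case looks well-formed."""
    problems = []
    if not case.get("prompt"):
        problems.append("missing prompt")

    runtime = case.get("runtime", {})
    mode = runtime.get("mode", "expo-go")  # expo-go is the documented default
    if mode not in VALID_MODES:
        problems.append(f"unknown runtime.mode: {mode!r}")

    platforms = runtime.get("platforms", ["ios", "android"])
    unknown = sorted(set(platforms) - VALID_PLATFORMS)
    if unknown:
        problems.append(f"unknown platforms: {unknown}")

    reference = case.get("reference_image")
    if reference is not None and not os.path.isabs(reference):
        problems.append("reference_image must be an absolute path")
    return problems
```

Running it over every case right after writing `evals.json` keeps schema mistakes from surfacing later as confusing snapshot failures.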


Pipeline per eval case


Orchestration model: on the main thread you run `python3 <orchestrator>` and almost nothing else. Every phase is driven by a small Python orchestrator you `Write` into the workspace and run with `python3 /private/tmp/expo-skill-eval-<skill>/<phase>.py` (covered by the `python3` rule). The orchestrators are the only place the `scripts/*.sh` files are invoked, always via `subprocess.run(["bash", "<scripts>/<name>.sh", …])`, which runs as a child of `python3` and needs no rule of its own. They are also the only place parallelism, logging, and directory creation live. So on the main thread you only ever: Write orchestrators, run them with `python3`, inspect outputs with the `Read`/`Glob`/`Grep` tools, and spawn the grader subagent. Never put a command inside a chained/backgrounded/piped shell construct, and never run ad-hoc `mkdir`/`ls`/`cat`/`tail`/`echo`: that is what prompts. (A single standalone `bash …/scripts/<name>.sh …` is fine for one-off manual debugging, e.g. re-running one flaky snapshot, but the pipeline itself goes through the orchestrators.)

Run each orchestrator in the foreground: let the tool call block until it finishes. The orchestrators already parallelize within a phase, so you don't need to overlap phases. Do not shell-background a phase with `… & echo "$!"` / `wait` (the `&`, `echo`, and `wait` segments have no rule and prompt). If you genuinely must run a phase while continuing other work, use the Bash tool's `run_in_background` parameter on a plain `python3 <orchestrator> 2>&1 | tee <ws>/…log` call, never hand-rolled shell backgrounding.

Expect exactly one permission prompt at the very start: the first `Write` into the workspace. `allowed-tools` can suppress `Bash`/`Read` prompts but not `Write`/`Edit`, so choose "allow all edits in this directory for the session" on that first prompt. It covers every orchestrator, `evals.json`, and viewer file for the whole run.
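The shape of such an orchestrator can be sketched as follows. This is an assumption-laden outline, not the harness's actual code: `script_argv` and `run_phase` are hypothetical names, and the `runner` parameter exists only so the sketch can be exercised without real scripts.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor


def script_argv(scripts_dir, name, *args):
    """Build the argv for one harness script, always invoked through bash."""
    return ["bash", f"{scripts_dir}/{name}.sh", *map(str, args)]


def run_phase(jobs, max_workers=4, runner=subprocess.run):
    """Run every argv in `jobs` in parallel; return exit codes in job order.

    Parallelism lives here, inside the orchestrator, so the main thread
    only ever issues one foreground `python3 <orchestrator>` call.
    """
    def run_one(argv):
        return runner(argv, capture_output=True, text=True).returncode

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(run_one, jobs))
```

A phase file would build its job list with `script_argv(...)`, call `run_phase(jobs)`, and write a summary the main thread can inspect with `Read`.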


0. Workspace setup


Create the run's directory tree once, with the workspace script, never with ad-hoc `mkdir` (a raw `mkdir` prompts: there is no `mkdir` rule, and a `"$WORKSPACE/…"` variable can't match a path glob anyway):

```bash
bash /abs/path/expo-skill-eval/scripts/make-workspace.sh /private/tmp/expo-skill-eval-<skill> iteration-N <num-evals>
```

This creates `trigger-evals/scratch` and `iteration-N/eval-<i>/{with_skill,without_skill}/outputs` for every eval. It is covered by `Bash(bash *expo-skill-eval/scripts/*)`, and the `mkdir`s inside run as children of the script (no rule of their own). After this, every other directory is made by the scripts/orchestrators that need it (`make-fixture.sh`, the executor orchestrator's `os.makedirs`, the snapshot scripts) or by the `Write` tool auto-creating parents, so you never need another `mkdir`.
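To make the resulting layout concrete, the sketch below lists the directories described above for a given run. It mirrors the description only; `expected_dirs` is a hypothetical helper, and the 1-based `eval-<i>` numbering is an assumption (the script's actual indexing is not stated here).

```python
from pathlib import Path


def expected_dirs(root, iteration, num_evals):
    """List the directories the workspace script is described as creating."""
    root = Path(root)
    dirs = [root / "trigger-evals" / "scratch"]
    for i in range(1, num_evals + 1):  # assumption: evals are numbered from 1
        for config in ("with_skill", "without_skill"):
            dirs.append(root / iteration / f"eval-{i}" / config / "outputs")
    return dirs
```

An orchestrator could use such a list to assert the tree exists before fanning out executors, rather than probing with `ls`.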


1. Trigger eval (should-trigger only)


Write a `run_trigger_eval_real.py` script under the workspace's `trigger-evals/` directory. Use only `"should_trigger": true` queries: the expo plugin is a family of complementary skills, so multiple skills triggering on the same prompt is not a failure. Measure recall only: realistic prompts that should use the skill, scored by trigger rate.

The script should run `claude -p <query>` per query (with `--output-format=stream-json --verbose --include-partial-messages`, `CLAUDECODE` stripped from the env, and the permission flag confirmed up front in Before starting) and detect whether the target skill was triggered by watching for its `Skill`/`Read` tool call in the stream. Note: `--include-partial-messages` requires both `--output-format=stream-json` and `--verbose`; omitting either causes an immediate CLI error.

Load the skill under test: pass `--plugin-dir` to every trigger subprocess. The trigger eval measures whether the skill's description makes the model reach for it, so the subprocess must have the local skill (the version with your edits) loaded. A `claude -p` subprocess does not inherit the parent session's `--plugin-dir`, so add it explicitly: `--plugin-dir <plugin-root>`, where `<plugin-root>` is the absolute path to the plugin directory that owns the skill, i.e. the `plugins/expo` ancestor containing `.claude-plugin/plugin.json` (e.g. `--plugin-dir /Users/.../skills/plugins/expo`). It must be absolute: the subprocess runs from the throwaway `scratch/` cwd, so a relative `plugins/expo` won't resolve, and a missing plugin dir silently loads nothing, which masquerades as a 0% trigger rate. Then watch for the skill triggering under its plugin-qualified name (`<plugin>:<skill>`, e.g. `expo:expo-ui`).

Two caveats: (1) if the published `expo` plugin is also installed globally, disable it (via `/plugin`) for the run and re-enable after. Otherwise two copies of `expo` collide in the subprocess and the model may trigger the published `expo:expo-ui`, silently scoring its description instead of your local edits (trigger detection only sees the tool-call name, so it can't tell the copies apart; dev checkouts usually don't have it installed). (2) Never make a synthetic duplicate of the skill: a real loaded copy always wins, so the synthetic harness scores 0%. (Executors are unaffected by an installed plugin: they read the local `SKILL_PATH` directly and pass no `--plugin-dir`.)
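A rough sketch of the two pieces a trigger script needs, command construction and stream detection, is shown below. The CLI flags are the ones named in this section. The event shape is an assumption: the sketch treats any JSON object with `"type": "tool_use"`, a `name`, and an `input` as a tool call, and the `input` contents used in detection are guesses to verify against a real stream. All function names are hypothetical.

```python
import json
import os


def trigger_command(query, plugin_root, permission_args):
    """Build one `claude -p` invocation for a should-trigger query."""
    if not os.path.isabs(plugin_root):
        # A relative path won't resolve from the scratch cwd and would
        # silently load nothing, which looks like a 0% trigger rate.
        raise ValueError("plugin_root must be an absolute path")
    return ["claude", "-p", query,
            "--output-format=stream-json", "--verbose", "--include-partial-messages",
            "--plugin-dir", plugin_root, *permission_args]


def clean_env(base_env=None):
    """Copy the environment with CLAUDECODE stripped."""
    env = dict(os.environ if base_env is None else base_env)
    env.pop("CLAUDECODE", None)
    return env


def _tool_uses(node):
    """Yield every dict that looks like a tool_use block, at any depth."""
    if isinstance(node, dict):
        if node.get("type") == "tool_use":
            yield node
        for value in node.values():
            yield from _tool_uses(value)
    elif isinstance(node, list):
        for value in node:
            yield from _tool_uses(value)


def skill_triggered(stream_lines, qualified_name, skill_path_fragment):
    """True if a Skill call names the skill or a Read call opens its file."""
    for line in stream_lines:
        try:
            event = json.loads(line)
        except json.JSONDecodeError:
            continue  # tolerate non-JSON noise in the stream
        for block in _tool_uses(event):
            payload = json.dumps(block.get("input", {}))
            if block.get("name") == "Skill" and qualified_name in payload:
                return True
            if block.get("name") == "Read" and skill_path_fragment in payload:
                return True
    return False
```

Matching on the plugin-qualified name (for example `expo:expo-ui`) rather than the bare skill name keeps detection aligned with how the skill appears when loaded via `--plugin-dir`.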

Run each query's subprocess from an empty throwaway cwd (e.g. `trigger-evals/scratch/`), not the repo root. A should-trigger prompt like "build me a settings screen" can make the subprocess write files, and with `--dangerously-skip-permissions` those writes would otherwise land in the skills repo. Trigger detection only needs the skill's `Skill`/`Read` call to appear in the stream (it doesn't need a fixture), so any incidental writes are throwaway.

Set a per-query subprocess timeout of at least 300 seconds. A 180s limit is too short — some queries cause the model to start generating code before triggering the skill, which pushes total runtime past 3 minutes.
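The cwd and timeout guidance can be combined in one small wrapper. This is a sketch under the assumptions that the command list and environment are built elsewhere; `run_query` is a hypothetical name.

```python
import subprocess


def run_query(cmd, scratch_dir, env=None, timeout_s=300):
    """Run one trigger subprocess from the throwaway cwd.

    Returns (stdout, timed_out). On timeout, whatever output was captured
    so far is returned so a late trigger can still be inspected.
    """
    try:
        proc = subprocess.run(cmd, cwd=scratch_dir, env=env,
                              capture_output=True, text=True, timeout=timeout_s)
        return proc.stdout, False
    except subprocess.TimeoutExpired as exc:
        out = exc.stdout or ""
        if isinstance(out, bytes):  # partial output may arrive undecoded
            out = out.decode(errors="replace")
        return out, True
```

Keeping the timeout as a parameter makes it easy to raise above 300 seconds for prompts that tend to generate a lot of code before the skill is reached.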

Run trigger evals once per skill, not per code eval case.


2. Fixture

Each executor run gets a fresh Expo app, created by

scripts/make-fixture.sh <app-path> <sdk> [clean|full]

:

```bash
scripts/make-fixture.sh <workspace>/iteration-N/eval-X/<config>/app <sdk>          # blank app (default)
scripts/make-fixture.sh <workspace>/iteration-N/eval-X/<config>/app <sdk> full     # keep example tabs
```

The script creates the app with

bunx create-expo-app -t default@sdk-<version>

(or the latest template when no version is given) once per SDK version + variant, caches it under

~/.cache/expo-skill-eval/fixtures/

, and clones the cache with APFS copy-on-write — so the first run per variant pays the install cost and every later run is near-instant. The default

clean

variant runs the template's

reset-project

script, so executors start from a blank app and every screen in the output is theirs — a much cleaner grading signal. Use

full

only when the eval prompt assumes an existing app (e.g. "I have an app with two tabs..."). The script also resets git inside the clone, so

git diff

in the app shows exactly what the executor changed (useful evidence for the grader).

Build fixtures sequentially, then fan out executors — never create fixtures concurrently.

make-fixture.sh

shares a cache under

~/.cache/expo-skill-eval/fixtures/

keyed by SDK+variant. If two runs both find the cache cold and call

bunx create-expo-app

at the same time, bun's link step collides and one fails with

EEXIST

/ "could not determine executable to run for package create-expo-app". So in the executor orchestrator (step 3), create all fixtures one at a time first — a plain Python loop calling

subprocess.run(["bash", "<scripts>/make-fixture.sh", app, sdk, variant])

(where

sdk

is the version chosen up front) — then fan out the

claude -p

executors with a

ThreadPoolExecutor

. Sequential creation is cheap: only the first fixture per SDK+variant pays the install cost; the rest are ~1s APFS clones. (And never fan fixtures out with ad-hoc shell like

make-fixture.sh A & make-fixture.sh B & wait

— the

wait

segments prompt; the sequential Python loop avoids both the race and the prompt.)
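
The two-phase ordering above (fixtures strictly one at a time, executors fanned out afterwards) can be sketched like this. It is an illustrative sketch under assumptions: `run_pipeline` and the `app`/`sdk`/`variant` dict keys are hypothetical names, and the two callables stand in for the `subprocess.run` wrappers a real orchestrator would supply.

```python
from concurrent.futures import ThreadPoolExecutor


def run_pipeline(runs, make_fixture, run_executor, max_workers=4):
    """Create every fixture one at a time, then run executors concurrently.

    `runs` is a list of dicts with hypothetical keys app/sdk/variant.
    `make_fixture` and `run_executor` are callables supplied by the caller.
    """
    # Phase 1: strictly sequential, so two cold-cache creations never overlap.
    for run in runs:
        make_fixture(run["app"], run["sdk"], run["variant"])

    # Phase 2: each executor only touches its own app dir, so they can fan out.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        results = list(pool.map(run_executor, runs))
    return results
```

In a real orchestrator `make_fixture` could be a small function that calls `subprocess.run(["bash", "<scripts>/make-fixture.sh", app, sdk, variant], check=True)`, and `run_executor` the function that launches one `claude -p` executor.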

3. Generate (executor subagents)

Run executors as

claude -p

subprocess calls from a Python script, not via the

Agent

tool. The

Agent

tool spawns subagents with their own permission context — file edits inside the fixture app will prompt the user. A

claude -p

subprocess is a separate process outside the permission system entirely (the same pattern the trigger eval harness uses).

Write a Python script to

/private/tmp/expo-skill-eval-<skill>/run_executors.py

. First create every run's fixture in a sequential loop —

subprocess.run(["bash", "<scripts>/make-fixture.sh", app, sdk, variant], …)

one at a time (concurrent creation races the shared bun cache — see step 2). Then run the with-skill and without-skill

claude -p

calls in parallel via a

ThreadPoolExecutor

. Both phases live inside Python (covered by the

python3

rule), so nothing runs as ad-hoc shell on the main thread. Each executor prompt must include:

The skill path (with-skill runs only) and the eval prompt.
Image-prompt cases (`reference_image` set): the absolute path to the target screenshot plus an instruction like "Open the reference screenshot at `<path>` with your Read tool and build an app whose UI matches it as closely as you can: layout, components, spacing, and colors." (`claude -p` renders PNGs read this way, so the executor can actually see the target.)
The fixture app path: "Make your changes inside `<app-path>`. The project already exists and has dependencies installed. Use absolute paths for all file operations."
"Before writing any files, inspect the project layout (run `ls`, read `package.json` and `app.json`) to find the correct routes directory. Recent SDK default templates place Expo Router routes in `src/app/`; older ones use `app/` at the project root. Inspect to confirm which this fixture uses."
"Do NOT start the dev server, boot simulators, or take screenshots. The harness does that after you finish."
Where to save a short summary of what was built.

Flags for the

claude -p

subprocess:

Strip `CLAUDECODE` from the environment (`env = {k: v for k, v in os.environ.items() if k != "CLAUDECODE"}`); otherwise `claude -p` hangs silently when nested inside a running Claude Code session.
A permission flag, confirmed with the user up front (see the "Before starting" scope questions): either `--dangerously-skip-permissions` or `--permission-mode acceptEdits`. Bake the chosen flag into the generated script. A bare `claude -p` with neither flag can't write files: it has no TTY to approve the edit and emits code as text instead.
Do NOT pass `--plugin-dir` to executors (unlike the trigger eval). The with-skill run already reads the skill by its absolute `SKILL_PATH`, so it tests the local content directly; and the without-skill run must have no skill available at all, because loading the plugin would let the skill auto-trigger and contaminate the baseline. Keeping executors path-based also cleanly separates the two questions: the executor measures content quality (is the skill useful once read?), the trigger eval measures triggering (does the description get it picked?).

Capture stdout/stderr per run to a log file next to the fixture for grading evidence. Set timeout to 900s per executor — with-skill runs read multiple reference files before coding and regularly take 5–10 minutes.
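
A single executor run, as described above, might look like the sketch below. The function names, the exact prompt wording (including the "Read the skill at ..." sentence), and running with `cwd=app_path` are assumptions made for illustration; only the flags, the `CLAUDECODE` stripping, the log capture, and the 900s timeout come from the text.

```python
import os
import subprocess


def build_executor_prompt(eval_prompt, app_path, summary_path, skill_path=None):
    """Join the required prompt parts; skill_path is None for without-skill runs."""
    parts = []
    if skill_path:
        # Hypothetical wording for pointing the executor at the local skill.
        parts.append(f"Read the skill at {skill_path} before you start.")
    parts += [
        eval_prompt,
        f"Make your changes inside {app_path}. The project already exists and "
        "has dependencies installed. Use absolute paths for all file operations.",
        "Before writing any files, inspect the project layout to find the "
        "correct routes directory (src/app/ or app/).",
        "Do NOT start the dev server, boot simulators, or take screenshots.",
        f"Save a short summary of what you built to {summary_path}.",
    ]
    return "\n\n".join(parts)


def run_executor(prompt, app_path, permission_flag, log_path, timeout=900):
    """Run one claude -p executor and capture stdout/stderr to log_path."""
    env = {k: v for k, v in os.environ.items() if k != "CLAUDECODE"}
    # Deliberately no --plugin-dir: executors are path-based.
    cmd = ["claude", "-p", prompt, *permission_flag.split()]
    with open(log_path, "w") as log:
        proc = subprocess.run(cmd, cwd=app_path, env=env, stdout=log,
                              stderr=subprocess.STDOUT, timeout=timeout)
    return proc.returncode
```

Passing `skill_path=None` for the without-skill configuration keeps the baseline prompt free of any reference to the skill.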

4. Static gate

Write

run_static.py

and run it with

python3

. For each eval/config app it calls

subprocess.run(["bash", "<scripts>/check-static.sh", app, "ios,android"], capture_output=True, …)

across a

ThreadPoolExecutor

(static gates are independent — run them concurrently inside Python, never with shell

wait

), and writes each result to

eval-<i>/<config>/static.json

(exit code + captured output) for the grader.

check-static.sh

runs

tsc --noEmit

,

expo lint

, and

expo export

for the listed platforms. A passing export catches most import/syntax/missing-module failures without touching a device; a failing export short-circuits step 5 with a clean FAIL — record it and have the snapshot orchestrator skip that app.
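
The static-gate orchestrator above could be sketched as follows. This is a hedged sketch: `run_static_gates` is a hypothetical name, the `check` callable stands in for the `subprocess.run` call on `check-static.sh`, and the JSON keys are illustrative since the text only says the file holds the exit code and captured output.

```python
import json
import os
from concurrent.futures import ThreadPoolExecutor


def run_static_gates(apps, check, max_workers=4):
    """Run `check(app)` for every app concurrently and write static.json beside it.

    `check` returns (exit_code, output). Each app path is assumed to be
    eval-<i>/<config>/app, so static.json lands in eval-<i>/<config>/.
    """
    def one(app):
        exit_code, output = check(app)
        result = {"exit_code": exit_code, "output": output, "passed": exit_code == 0}
        out_path = os.path.join(os.path.dirname(app), "static.json")
        with open(out_path, "w") as fh:
            json.dump(result, fh, indent=2)
        return app, result["passed"]

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return dict(pool.map(one, apps))
```

The returned mapping of app path to pass/fail is one way to let the snapshot orchestrator skip apps whose export failed.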

5. Run + screenshot (serial across evals)

Write

run_snapshots.py

and run it with

python3

. Simulators and emulators are shared resources, so this orchestrator runs serially (no thread pool): for each app that passed the static gate, and each platform, it

os.makedirs

the

outputs/

dir and calls

subprocess.run(["bash", "<scripts>/snapshot-<platform>.sh", app, f"{outputs}/<platform>.png", port], env={**os.environ, "EXPO_SKILL_EVAL_RUNNER": runner}, …)

. Pass the port as a positional argument, with one port for iOS and a different one for Android.

expo run:ios/android --port N

is supported and using separate ports lets you run both platforms without port collisions if you ever parallelize. Screenshots land in the run's

outputs/

directory so the viewer renders them inline.

Reclaim disk after each fixture (essential for dev-build runs). Once all selected platforms for an app are captured (and before the next fixture builds), call

subprocess.run(["bash", "<scripts>/clean-fixture.sh", app])

. Each

expo run:<platform>

leaves multi-GB native build output (iOS Pods + DerivedData, Android Gradle build); without this, evals × configs × iterations pile up and fill the disk mid-run (the instability you'll see is the disk filling).

clean-fixture.sh

removes the heavy regenerable dirs (

node_modules

,

ios

,

android

,

.expo

,

dist

) and the fixture's iOS DerivedData, keeping the app source + git so the grader's

git diff

still works. With serial snapshots + per-fixture cleanup, peak disk stays at ~one fixture's build instead of all of them. (Harmless for

expo-go

runs too — they just have little to reclaim.)
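
The serial snapshot loop with per-fixture cleanup can be sketched as below. It is a sketch under assumptions: `run_snapshots` is a hypothetical name, `run_script` stands in for a `subprocess.run` wrapper, and the port values are left to the caller because no particular numbers are assumed here.

```python
import os


def run_snapshots(apps, platforms, runner, ports, run_script):
    """Capture every platform for one app, clean it, then move to the next app.

    `ports` maps platform name to a port string chosen by the caller.
    `run_script(argv, env)` is supplied by the caller. Everything runs
    serially because simulators and emulators are shared resources.
    """
    env = {**os.environ, "EXPO_SKILL_EVAL_RUNNER": runner}
    for app in apps:
        # Each app path is assumed to be eval-<i>/<config>/app.
        outputs = os.path.join(os.path.dirname(app), "outputs")
        os.makedirs(outputs, exist_ok=True)
        for platform in platforms:
            shot = os.path.join(outputs, f"{platform}.png")
            run_script(["bash", f"<scripts>/snapshot-{platform}.sh", app, shot,
                        ports[platform]], env)
        # Reclaim native build output before the next fixture starts building.
        run_script(["bash", "<scripts>/clean-fixture.sh", app], env)
```

Because cleanup runs once per app after all of its platforms, peak disk usage stays near one fixture's build output.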

runner

is the up-front choice (

expo-go

default, or

dev-build

). The snapshot scripts honor

EXPO_SKILL_EVAL_RUNNER

:

expo-go

launches with

expo start --<platform>

(and the Expo Go install/deep-link dance);

dev-build

launches with

expo run:<platform> --port <port>

, which compiles+installs a native dev client and skips the Expo Go steps. The scripts already default the

dev-build

timeout to 900s, but bump

EXPO_SKILL_EVAL_BUNDLE_TIMEOUT

higher if the first native compile needs it.

make-fixture.sh

pre-installs

expo-dev-client

in every fixture so the dev-client URL scheme is registered before

expo run

tries to deep-link the app open.

Snapshot scripts always capture the initial route /. They open the app via a deep link and take one screenshot; they cannot tap or navigate. Design eval prompts so the feature under test renders at the root route. If the executor places the main UI behind a navigation action (e.g. an "Open Settings" button on the index), the snapshot will miss the feature entirely and all visual expectations will fail.

Each

snapshot-<platform>.sh

frees its Metro port on startup (kills any stale process left on it by a crashed prior run) and tears Metro down on exit, so you never need to run

lsof

/

kill

/

pkill

yourself to clear ports (that would prompt, and it's already handled). It then starts Metro, waits for the "Bundled" line in the Metro log, settles, captures a screenshot, and tears Metro down. iOS boots the newest available iPhone simulator if none is booted; Android boots the first AVD if no device is attached (the slow path: boot once and reuse across the whole iteration). Android first recycles a wedged/offline emulator (graceful

adb emu kill

, then force-kill + adb reset) so a half-dead instance can't poison the run, and boots with hardware GPU (

-gpu host

, Metal-accelerated on Apple Silicon). If

host

self-aborts the emulator on a given machine (qemu

SIGABRT

deep in gfxstream/Metal, possible on Apple Silicon under load), edit

GPU_MODE

in

snapshot-android.sh

to a software mode (

guest

renders reliably but slowly — bump the settle; avoid

swiftshader_indirect

, which hangs at boot on arm64).

snapshot-web.sh

runs only when

platforms

includes web. Each script writes a Metro log next to the screenshot (

<name>.metro.log

) — include it in the grader's inputs. If a script exits non-zero it still attempts a best-effort screenshot (an error screen is evidence too). dev-build relaunch: after Metro is up, the scripts relaunch the app via

xcrun simctl launch

(iOS) and

adb shell am start -n <pkg>/.MainActivity

(Android) — both avoid the "Open in X?" system dialog that a URL-scheme deep link triggers on first launch.

After all screenshots for the iteration are captured, always generate the viewer — pass the workspace root to the checked-in script:

```bash
python3 /abs/path/expo-skill-eval/scripts/generate_viewer.py /private/tmp/expo-skill-eval-<skill>
```

It writes

viewer.html

into the workspace root (one level above

iteration-N/

) and opens it in the browser itself (via

webbrowser.open

) — so no separate

open

command (and no

Bash(open:*)

rule) is needed. See the Viewer section below.

6. Grade

Spawn a grader subagent in the foreground. Its prompt must include:

The eval prompt, expectations list, and visual_expectations from the eval case.
The instructions in `agents/visual-grader.md` (screenshot grading, redbox detection).
The screenshot files, Metro logs, and the step-4 `static.json` as inputs.
Image-prompt cases (case has `reference_image`): also include the target screenshot (`reference_image`), `references/design-rubric.md`, and the fixture's `git diff`. Tell the grader to compare the generated screenshot(s) to the target and emit the `reference_match` + `quality` blocks below.

The grader writes

grading.json

next to the outputs with this shape:

```json
{
  "score": 8.5,
  "max_score": 9,
  "expectations": [
    {"text": "...", "passed": true, "evidence": "..."}
  ],
  "reference_match": {
    "score": 7, "max": 10,
    "evidence": "ios.png vs target.png: same two-section grouped list + toggle; accent color differs (blue vs target's green); row spacing tighter than target"
  },
  "quality": {
    "dimensions": [
      {"name": "Layout & hierarchy", "score": 2, "max": 3, "evidence": "ios.png: …"}
    ],
    "subtotal": 17,
    "max": 24,
    "summary": "…"
  },
  "user_notes_summary": {"needs_review": false, "notes": ""}
}
```

Visual expectations go into the same

expectations

array with evidence naming the screenshot file and describing what is visible. The

reference_match

block (how closely the generated app reproduces the target screenshot) and the

quality

block (design-rubric scores from

references/design-rubric.md

) are emitted only for image-prompt cases — or when a quality grade is explicitly requested. Omit both for plain text-prompt runs.
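
A quick shape check on a loaded `grading.json` can catch a malformed grader output before aggregation. The sketch below is illustrative only: `check_grading` is a hypothetical helper, and the rules it enforces are inferred from the example shape above rather than taken from a formal schema.

```python
def check_grading(grading, image_prompt=False):
    """Return a list of problems found in a grading.json dict (empty if none)."""
    problems = []
    for key in ("score", "max_score", "expectations"):
        if key not in grading:
            problems.append(f"missing {key}")
    for i, exp in enumerate(grading.get("expectations", [])):
        missing = {"text", "passed", "evidence"} - set(exp)
        if missing:
            problems.append(f"expectations[{i}] missing {sorted(missing)}")
    # Image-prompt cases must carry both extra blocks; for text-prompt runs
    # they are optional, so their presence is never flagged.
    for block in ("reference_match", "quality"):
        if image_prompt and block not in grading:
            problems.append(f"image-prompt case missing {block}")
    quality = grading.get("quality")
    if quality:
        total = sum(d["score"] for d in quality["dimensions"])
        if total > quality["max"]:
            problems.append("quality dimension scores exceed max")
    return problems
```

An aggregation script could call this on every run and mark runs with a non-empty result for manual review.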

Rollout phases

Build out and debug the pipeline in this order — each phase is independently useful:

Static: steps 1–4 only (`runtime.mode: "static-only"` for everything). No devices needed; CI-friendly.
iOS: add `snapshot-ios.sh` to the loop. `simctl` is the most scriptable target.
Android: add `snapshot-android.sh`. Emulator boot is the slowest part, so keep one emulator running for the whole session.
Web: add `snapshot-web.sh` for skills that target web (uses Playwright via `bunx`; first run downloads Chromium).

Practical notes

Temp locations: all eval workspaces go under `/private/tmp/expo-skill-eval-<skill-name>/iteration-N/`. Everything in this run (`Read`, `Write`, `Edit`, and `Bash`) is covered by the `allowed-tools` frontmatter, so a correctly-loaded skill runs prompt-free.
Permission rule forms (why this skill stays prompt-free): the rule syntax matters and the two tool families behave differently:
- `Bash(...)` rules are path-scoped to the skill's own code (no broad interpreters). `Bash(python3 /private/tmp/expo-skill-eval-*)` (plus the `/tmp` alias) runs the Python orchestrators you generate under the workspace; `Bash(python3 *expo-skill-eval/scripts/*)` runs the checked-in `scripts/generate_viewer.py`; `Bash(tee /private/tmp/expo-skill-eval-*)` (+ `/tmp`) lets `python3 … 2>&1 | tee <workspace>/…log` write a log without prompting; `Bash(bash *expo-skill-eval/scripts/*)` runs only this skill's `scripts/*.sh`. Because every path is pinned, the escape hatches stay denied: `python3 -c …`, `bash -c …`, `tee /etc/…`, and running code anywhere else do not match (verified empirically: a scoped rule allows `bash <dir>/run.sh` but blocks `bash -c …` and any other path). Commands the scripts call internally (`bunx`, `xcrun simctl`, `adb`, `git`, `mkdir`, `expo`) are children of the script, not Bash tool calls, so they need no rule. Do not run ad-hoc `mkdir` / `ls` / `find` / `cat` / `grep` from the main thread (they have no rule and prompt, and a raw `mkdir "$WORKSPACE/…"` can't match a path glob because the path is an unexpanded variable): create the directory tree with `make-workspace.sh` (step 0), let orchestrators create their own dirs (`os.makedirs`), and inspect results with the Read / Glob / Grep tools (no Bash rule needed).
- Bash rule matching (tested, non-obvious): a Bash rule is a gitignore-style glob over the command string. `*` matches any run of characters including `/` and spaces and works mid-pattern, so `Bash(python3 /private/tmp/expo-skill-eval-*)` matches `python3 /private/tmp/expo-skill-eval-x/run.py 2>&1`, and `Bash(bash *expo-skill-eval/scripts/*)` matches `bash /any/abs/path/expo-skill-eval/scripts/foo.sh args`. Two gotchas that burned earlier attempts: `**` is matched literally (never use it in a Bash rule), and the `:*` suffix only works right after the command token (`Bash(python3:*)`), not after a partial path (`Bash(python3 /path-:*)` does not match). Compound commands split on `|`, `&&`, `||`, `;`, `&` and each segment needs its own matching rule.
- `Read` rules suppress prompts; `Write` / `Edit` rules do not. This is a Claude Code asymmetry (not a pattern bug, and not a reload issue: in a session where the `Bash` / `Read` rules from this same frontmatter are clearly working, `Write` still prompts): file creation/editing always goes through Claude Code's edit-approval flow regardless of `allowed-tools`. The frontmatter still scopes `Read` / `Write` / `Edit` to `…/expo-skill-eval-*/**` (both the `/tmp` and `/private/tmp` forms, since macOS doesn't auto-resolve the symlink) as documentation and a guardrail, but those `Write` / `Edit` entries won't silence the prompt on their own. Practical consequence: at the start of a run you get one Write prompt for the workspace. Choose "Yes, allow all edits in this directory for the session" and every later orchestrator / `evals.json` / viewer write under that workspace goes through silently. That single directory approval, not a rule, is what makes file-writing prompt-free.
- Reload after editing frontmatter means a full restart, not `/reload-skills`. `allowed-tools` is read once when the skill loads at session start; `/reload-skills` reloads the skill body but does not reliably refresh the permission rules. After editing this file, quit Claude Code entirely and start a new session, then re-run the skill; otherwise a stale (cached) ruleset keeps prompting even though the file on disk is correct.
- Grader subagents run with their own permission context and will still prompt for file access; that is expected and separate from the main thread's rules.
Calling eval scripts: one standalone command, never chained. Invoke each script as its own Bash call with an absolute path: `bash /abs/path/expo-skill-eval/scripts/snapshot-ios.sh arg1 arg2` (covered by `Bash(bash *expo-skill-eval/scripts/*)`). Do not combine it with `&`, `&&`, `||`, `;`, `wait`, `tail`, `head`, or `echo`: compound commands are checked per segment, and those extra segments have no rule, so the whole thing prompts even though the `bash …/scripts/…` part is allowed. (The one allowed pipe is `… 2>&1 | tee <workspace>/…log`, since the scoped `tee` rule covers it.) Need parallelism or output trimming? Put it in a Python orchestrator (covered by `python3 /…/expo-skill-eval-*`), which runs scripts via `subprocess` across a `ThreadPoolExecutor`. Inspect results with the `Read` / `Glob` / `Grep` tools, not `cat` / `ls` / `grep`. General rule: under this skill's tight scoping, any ad-hoc shell the agent improvises will prompt. The fix is to move it into a script/orchestrator (or use the scoped `tee`), never to broaden a rule.
Inspecting outputs (screenshots, logs, files): use tools, not shell. To find files use the Glob tool (e.g. `/private/tmp/expo-skill-eval-<skill>/iteration-N/**/ios.png`); to view them use the Read tool. Read renders PNGs visually, which is exactly what you need to confirm a screenshot rendered. To search file contents use Grep. Never use `find` / `ls` / `cat` for this: they prompt, and `find … -exec …` is deliberately not allowed because its `-exec` can run anything (e.g. `-exec rm`). The Glob, Read, and Grep tools are scoped and prompt-free; reach for them every time you'd otherwise type `find` / `ls` / `cat`.
Generated Python scripts: write orchestration/aggregation scripts under the workspace (e.g. `/private/tmp/expo-skill-eval-<skill>/aggregate.py`) and run them with `python3` (covered by `Bash(python3 /private/tmp/expo-skill-eval-*)`). The viewer is the exception: it's the checked-in `scripts/generate_viewer.py`, run via `Bash(python3 *expo-skill-eval/scripts/*)`. `Write` auto-creates parent dirs but prompts the first time, so approve the workspace directory once (see the `Write` / `Edit` note above). Capture output either by having the script write its own log or via `python3 … 2>&1 | tee <workspace>/…log` (covered by the scoped `tee` rule); read logs back with the `Read` tool. Don't use `python3 -c …` for setup (the scoped rule only matches a workspace script path, so a bare `-c` prompts).
Trigger evals vs installed plugin: detect the real installed skill name (e.g. `expo:expo-ui`) in the stream. A synthetic-duplicate harness always scores 0% when the real plugin is installed because the model picks the genuine skill over the synthetic copy.
Benchmark aggregation: save each run's `grading.json` + `timing.json` under `eval-<N>/<config>/run-1/`. Write a Python aggregation script under the workspace and run it with `python3`.
Expo Go ceiling: anything requiring custom native code (expo-module, App Clips, brownfield) cannot run in Expo Go. Use `static-only` mode for those, and see `references/runtime-matrix.md` before writing eval cases for a skill (note: `@expo/ui` does run in Expo Go on SDK 56+).
API-route skills: instead of a screenshot, verify with `curl` against the route while Metro is up; record the response as an output file for grading.
Timing data: capture token counts and duration into `timing.json` immediately after each executor run, because it is not recoverable later. To capture token counts, add `--output-format=stream-json --verbose` to the executor `claude -p` call and parse the `message_start` / `message_delta` events from the log. Without these flags the log only contains prose and elapsed seconds are the only recoverable metric.
First-launch dialogs: Expo Go occasionally shows a one-time prompt on a fresh simulator. If a screenshot captures a dialog instead of the app, re-run the snapshot script (it reopens the URL) and re-capture.

Viewer

After taking screenshots, always generate and open the HTML viewer so the user can see results immediately without being asked. The viewer is the checked-in

scripts/generate_viewer.py

— run it with the workspace root as its argument:

```bash
python3 /abs/path/expo-skill-eval/scripts/generate_viewer.py /private/tmp/expo-skill-eval-<skill>
```

It writes a self-contained

/private/tmp/expo-skill-eval-<skill>/viewer.html

and opens it in the browser itself (

webbrowser.open

). What it renders:

A tab per iteration (`iteration-*` under the workspace root; remembers the last active tab in `localStorage`).
For each eval case (read from `<iteration>/evals.json`): side-by-side with_skill / without_skill columns, each showing static-gate status, score, the platform screenshots (click to zoom; embedded as base64 `data:` URIs so the file is self-contained), the expectation list with PASS/FAIL badges, and reviewer notes.
For image-prompt cases (a `grading.json` with `reference_match` / `quality`): the target screenshot beside the generated ones, the `reference_match` score (generated vs target), the `quality` rubric per config (one bar per dimension with its score/max plus the subtotal), and the quality delta (with_skill − without_skill subtotal) in the summary bar alongside the correctness delta.
A summary bar with with_skill %, without_skill %, and delta.
A trigger accuracy table when `trigger-evals/trigger_results.json` exists.
A dark background with color-coded scores (green ≥85%, amber ≥65%, red below).
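
To make the rendering rules above concrete, here is a minimal sketch of the score banding, the summary delta, and the base64 embedding. It is not the code in `scripts/generate_viewer.py`; the function names and the returned dict keys are hypothetical, and only the thresholds and the `data:` URI idea come from the list above.

```python
import base64


def score_color(pct):
    """Map a percentage score to the color band described above."""
    if pct >= 85:
        return "green"
    if pct >= 65:
        return "amber"
    return "red"


def summary(with_skill_pct, without_skill_pct):
    """Build the summary-bar values: both percentages and their difference."""
    return {
        "with_skill": with_skill_pct,
        "without_skill": without_skill_pct,
        "delta": with_skill_pct - without_skill_pct,
    }


def png_data_uri(png_bytes):
    """Embed PNG bytes as a data: URI so the HTML file needs no external files."""
    encoded = base64.b64encode(png_bytes).decode("ascii")
    return "data:image/png;base64," + encoded
```

A positive `delta` means the skill helped; embedding screenshots this way is what lets a single `viewer.html` be opened or shared without its workspace.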

Practical notes

Temp location: all eval workspaces live under `/private/tmp/expo-skill-eval-<skill-name>/iteration-N/`. Every operation in this run (`Read`, `Write`, `Edit`, and `Bash`) is covered by the `allowed-tools` frontmatter, so a correctly loaded skill runs without prompts.
Permission rule shapes (why this skill runs without prompts): rule syntax matters, and the two tool families behave differently:
- `Bash(...)` rules are path-scoped to the skill's own code (no broad interpreters). `Bash(python3 /private/tmp/expo-skill-eval-*)` (plus the `/tmp` alias) runs the Python orchestrator you generate under the workspace; `Bash(python3 *expo-skill-eval/scripts/*)` runs the checked-in `scripts/generate_viewer.py`; `Bash(tee /private/tmp/expo-skill-eval-*)` (+ `/tmp`) lets `python3 … 2>&1 | tee <workspace>/…log` write logs without a prompt; `Bash(bash *expo-skill-eval/scripts/*)` runs only this skill's `scripts/*.sh`. Because every path is pinned, the loopholes are closed: `python3 -c …`, `bash -c …`, `tee /etc/…`, and running code from any other location do not match (verified: the scoped rule allows `bash <dir>/run.sh` but blocks `bash -c …` and any other path). Commands invoked inside the scripts (`bunx`, `xcrun simctl`, `adb`, `git`, `mkdir`, `expo`) are child processes of the script, not Bash tool calls, so they need no rule. Do not run ad-hoc `mkdir` / `ls` / `find` / `cat` / `grep` from the main thread (they have no rule and will prompt, and a bare `mkdir "$WORKSPACE/…"` cannot match a path glob because the path is an unexpanded variable): use `make-workspace.sh` to create the directory tree (step 0), let the orchestrator create its own directories (`os.makedirs`), and inspect results with the Read / Glob / Grep tools (no Bash rule needed).
- Bash rule matching (tested, non-obvious): a Bash rule is a gitignore-style glob over the command string. `*` matches any run of characters, including `/` and spaces, and works mid-pattern, so `Bash(python3 /private/tmp/expo-skill-eval-*)` matches `python3 /private/tmp/expo-skill-eval-x/run.py 2>&1`, and `Bash(bash *expo-skill-eval/scripts/*)` matches `bash /any/abs/path/expo-skill-eval/scripts/foo.sh args`. Two traps that have caused problems: `**` is matched literally (never use it in a Bash rule), and the `:*` suffix only works after the command token (`Bash(python3:*)`), not after a partial path (`Bash(python3 /path-:*)` does not match). Compound commands are split on `|`, `&&`, `||`, `;`, and `&`, and every part needs a matching rule.
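- `Read` rules suppress prompts; `Write` / `Edit` rules do not. This is a Claude Code asymmetry (not a pattern bug and not a reload issue: in a session where the same frontmatter's `Bash` / `Read` rules demonstrably work, `Write` still prompts): file creation and edits always go through Claude Code's edit-approval flow, regardless of `allowed-tools`. The frontmatter still scopes `Read` / `Write` / `Edit` to `…/expo-skill-eval-*/**` (in both the `/tmp` and `/private/tmp` forms, because macOS does not resolve the symlink automatically) as documentation and as a guard, but those `Write` / `Edit` entries cannot silence the prompt on their own. Practical effect: at the start of a run you get one Write prompt for the workspace; choose **"Yes, allow all edits in this directory during this session"**, and every orchestrator / `evals.json` / viewer write under that workspace is silent afterwards. It is this single directory approval, not the rules, that makes file writes prompt-free.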
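- Reloading after a frontmatter edit: do a full restart, not `/reload-skills`. `allowed-tools` is read once, when the skill is loaded at session start; `/reload-skills` reloads the skill body but does not reliably refresh permission rules. After editing this file, quit Claude Code completely, start a new session, and re-run the skill; otherwise the cached old rule set keeps prompting even though the file on disk is updated.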
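- Grader subagents run with a separate permission context and will still prompt for file access; this is expected and independent of the main-thread rules.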
Invoking eval scripts: one standalone command each, never chained. Call each script with an absolute path as its own Bash call: `bash /abs/path/expo-skill-eval/scripts/snapshot-ios.sh arg1 arg2` (matches the `Bash(bash *expo-skill-eval/scripts/*)` rule). Do not combine it with `&`, `&&`, `||`, `;`, `wait`, `tail`, `head`, or `echo`: compound commands are checked part by part, and those extra parts have no rule, so the whole command prompts even though the `bash …/scripts/…` part is allowed. (The only permitted pipe is `… 2>&1 | tee <workspace>/…log`, because the scoped `tee` rule covers it.) Need parallelism or output trimming? Put it in the Python orchestrator (matches the `python3 /…/expo-skill-eval-*` rule) and run the scripts through `subprocess` in a `ThreadPoolExecutor`. Inspect results with the `Read` / `Glob` / `Grep` tools, not `cat` / `ls` / `grep`. General rule: under this skill's strict scoping, any ad-hoc shell command the agent improvises will prompt; the fix is to move it into a script or the orchestrator (or use the scoped `tee`), never to loosen the rules.
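Inspecting outputs (screenshots, logs, files): use tools, not the shell. Find files with the Glob tool (e.g. `/private/tmp/expo-skill-eval-<skill>/iteration-N/**/ios.png`); view files with the Read tool. Read renders PNG images visually, which is exactly what you need to confirm that a screenshot rendered correctly. Search file contents with Grep. Never use `find` / `ls` / `cat`: they prompt, and `find … -exec …` is deliberately not allowed because its `-exec` can run any command (e.g. `-exec rm`). The tools are scoped and prompt-free; reach for them every time you are about to type `find` / `ls` / `cat`.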
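Generated Python scripts: write orchestration / aggregation scripts under the workspace (e.g. `/private/tmp/expo-skill-eval-<skill>/aggregate.py`) and run them with `python3` (matches the `Bash(python3 /private/tmp/expo-skill-eval-*)` rule). The viewer is the exception: it is the checked-in `scripts/generate_viewer.py`, run via `Bash(python3 *expo-skill-eval/scripts/*)`. `Write` creates parent directories automatically but prompts the first time; approve the workspace directory once (see the `Write` / `Edit` note above). Capture output by having the script write its own log or with `python3 … 2>&1 | tee <workspace>/…log` (matches the scoped `tee` rule); read the log with the `Read` tool. Do not use `python3 -c …` for setup (the scoped rule only matches workspace script paths, so a bare `-c` prompts).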
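Trigger eval vs. installed plugin: detect the real installed skill name in the stream (e.g. `expo:expo-ui`). When the real plugin is installed, the synthetic-copy harness always scores 0%, because the model picks the real skill instead of the synthetic copy.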
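
The Bash rule matching behaviour described in the permission notes (a single `*` matches anything, `**` is literal, compound commands are checked part by part) can be modelled in a few lines. This is a simplified model of the behaviour as described in these notes, not Claude Code's actual matcher: it ignores quoting and the `:*` suffix, and the function names are made up for illustration.

```python
import re

# Split on ||, &&, ;, | and a lone & (but not the & inside a redirect like 2>&1).
_OPERATORS = re.compile(r"\|\||&&|;|\||(?<!>)&")


def rule_to_regex(rule):
    """Compile a rule body where a single * matches any characters and ** stays literal."""
    pieces = []
    for part in re.split(r"(\*+)", rule):
        if part == "*":
            pieces.append(".*")  # includes '/' and spaces, works mid-pattern
        else:
            pieces.append(re.escape(part))  # '**' and plain text match literally
    return re.compile("".join(pieces), re.DOTALL)


def split_compound(command):
    """Split a command line into the parts that are checked independently."""
    return [part.strip() for part in _OPERATORS.split(command) if part.strip()]


def is_allowed(command, rules):
    """True only if every part of the command matches at least one rule."""
    compiled = [rule_to_regex(rule) for rule in rules]
    parts = split_compound(command)
    if not parts:
        return False
    return all(any(rx.fullmatch(part) for rx in compiled) for part in parts)
```

Under this model a scoped `tee` rule is what lets the `… 2>&1 | tee …` pipe through, while appending `&& echo done` to an allowed script call makes the whole command fail the check, which matches the guidance to keep each call standalone.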

Publishing the viewer (only if opted in up front)

The local `viewer.html` is always generated. Only when the user chose "Publish a shareable Artifact" in the up-front confirmation, additionally render it to a claude.ai Artifact at the very end; never publish without that opt-in (it's outward-facing and a published page can be cached/indexed). Mechanics:

The `Artifact` tool wraps the file in its own `<!doctype html>…<head></head><body>` skeleton, so the file you hand it must be page content only: inline `<style>` / `<script>`, base64 `data:` images, and a `<title>`, but no `<!DOCTYPE>/<html>/<head>/<body>` tags of its own (a full standalone document gets double-wrapped and renders wrong).
The script emits an Artifact-friendly variant when you add `--artifact`: `python3 /abs/path/expo-skill-eval/scripts/generate_viewer.py /private/tmp/expo-skill-eval-<skill> --artifact` writes `viewer_artifact.html` (same content, skeleton stripped, no browser open). Pass that file to the `Artifact` tool (`favicon: "📊"`), not the standalone one.
The viewer is already self-contained (base64 screenshots, inline CSS/JS), so it satisfies the Artifact CSP (no external hosts).
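
For intuition about what "skeleton stripped" means, the sketch below removes the doctype and the `html` / `head` / `body` wrapper tags from a standalone document while keeping everything inside them. It is an illustration only: how `generate_viewer.py --artifact` actually produces `viewer_artifact.html` is not shown here, and `strip_skeleton` is a hypothetical helper.

```python
import re

# Matches <!doctype ...> and opening/closing html, head, body tags (with attributes).
_SKELETON_TAGS = re.compile(
    r"<!doctype[^>]*>|</?(?:html|head|body)\b[^>]*>", re.IGNORECASE
)


def strip_skeleton(document):
    """Return the page content with the document skeleton tags removed."""
    # Caveat: a naive regex also strips these tags if they appear inside script strings.
    return _SKELETON_TAGS.sub("", document).strip()
```

Elements such as `<title>`, `<style>`, and `<script>` survive untouched, which is the "page content only" shape the Artifact tool expects.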

References

`references/runtime-matrix.md`: per-skill runtime applicability (expo-go vs static-only, platform notes).
`agents/visual-grader.md`: screenshot grading instructions for the grader subagent.
