expo-skill-eval
Expo Skill Eval
Evaluates skills in `plugins/expo/skills/` for trigger accuracy, generated code quality, and/or runtime rendering in Expo Go.
Requirements: macOS with Xcode (iOS simulators), Android SDK with at least one AVD, and `bun`. No other device tooling is assumed.
Workspace root: `/private/tmp/expo-skill-eval-<skill-name>/iteration-N/` (e.g. `/private/tmp/expo-skill-eval-expo-ui/iteration-4/`).
Before starting: clarify scope
Confirm all of the following up front, before any pipeline work. Skip an item only if the request already states that choice. Batch them into `AskUserQuestion` calls of ≤4 questions each, in this order:
1. Which skill to eval (if not clear from the request).
2. Prompts: which prompts drive the eval. Built-in prompts (from the skill's eval cases) are all pre-selected; drop any, add a custom text prompt, or build from an uploaded screenshot (a target UI the skill must reproduce). See Prompts below.
3. What to verify: one multi-select of three options (Runtime + screenshots / Trigger accuracy / Code checks (no device)). See What to verify below.
4. Expo SDK: latest (default, auto-detected) or a pinned version.
5. Runner: Expo Go (default) or development build.
6. Platforms: iOS / Android / web (always offer all three).
7. Permission flag for `claude -p`: skip-permissions (default) or accept-edits.
8. Viewer delivery: local only (default) or publish a shareable Artifact.
9. If trigger accuracy is selected: confirm the published `expo` plugin is disabled (or not installed).
Each is detailed below. Items 4–6 (SDK, runner, platforms) fit naturally in one `AskUserQuestion` call.
If the skill to eval is not clear from the request, list the available skills from `plugins/expo/skills/` and ask which one to evaluate.
How the skill under test is loaded: two mechanisms, one per phase (don't pick one globally). Executor runs reference it by file path (`SKILL_PATH = plugins/expo/skills/<skill>/SKILL.md`, read explicitly), while the trigger eval loads it as a plugin (`--plugin-dir plugins/expo`, so the model can auto-select it from its description). Both point at the local, in-repo version, which is what you're evaluating. You do not need any special flag to launch the harness session itself (the harness finds the skill by repo path); the mechanisms apply to the `claude -p` subprocesses it spawns. See steps 1 and 3 for why each phase differs.
One pre-run check (required when the trigger eval is in scope): if the published `expo` plugin is installed/enabled, disable it (via `/plugin`) before launching the harness and re-enable it after. A single disable is a global-config change that both this session and the spawned `claude -p` subprocesses inherit. Why it's required for the trigger eval: that phase loads the local skill via `--plugin-dir`, and a second installed `expo` collides with it. The model may trigger the published `expo:expo-ui`, and since detection only sees the tool-call name you'd silently score the published description instead of your local edits (the collision could also just error). The executor / runtime / static phases are not affected, because they read the skill under test by its local `SKILL_PATH` with no `--plugin-dir`, so a run with no trigger eval can skip the disable. Disabling `expo` does not disable `expo-skill-eval` (a standalone project skill, not part of the `expo` plugin), so the harness stays available.
Surface this to the user as an explicit up-front confirmation, the same way you confirm which skill to eval. When the trigger eval is in scope, ask the user to confirm the published `expo` plugin is disabled (or not installed) before you start step 1; if it's still enabled, pause and have them disable it via `/plugin`. Don't run the trigger eval until they confirm: the harness can't reliably detect installed plugins on its own (reading the global plugin config or `claude plugin list` would prompt), so this is a manual confirmation, not an auto-check.
Pick the prompts: built-in, custom, or a target screenshot. The prompts are the inputs that drive the executor (with-skill and without-skill); they are separate from what you verify. Confirm them with `AskUserQuestion` (skip if the request already names a prompt):
- Built-in prompts: representative prompts you generate by reading the skill under test (its `SKILL.md` + `references/`) and `references/runtime-matrix.md`, covering the skill's standard use cases. (If the skill already ships eval cases under `evals/evals.json`, fold their `prompt` fields in too, but most skills don't, so you usually derive them.) Pre-select all so the default run exercises the skill's standard cases; let the user deselect any.
- Custom text prompt: a one-off prompt the user types. Don't spend a dedicated option slot on this: `AskUserQuestion` auto-adds a "Type something" / Other entry, and anything typed there becomes a custom text case.
- Build from an uploaded screenshot: the user gives the path to a target screenshot (a UI to reproduce). The executor is told to open it (`claude -p` reads PNGs with its Read tool) and build an app matching it; the case records the path as `reference_image`, and grading compares the generated app to that target (step 6). This is the strongest visual test for a UI skill: "build this."
Respect `AskUserQuestion`'s 4-option-per-question cap with this priority (the bug to avoid: the upload option silently dropped once the four slots fill up):
- Always reserve a slot for "Build from an uploaded screenshot." It's the whole point of the visual eval and must never be the option that gets dropped.
- Don't add an explicit "Custom text prompt" option; the auto "Type something" / Other entry already covers it.
- Fill the remaining ≤3 slots with the built-in/representative prompts, pre-selected. If there are more than 3, collapse them into one pre-selected "All built-in prompts (default)" option and offer subset-picking in a short follow-up, so the upload option still fits.
Present it as a multi-select. When "Build from an uploaded screenshot" is picked, ask for the target image path in a follow-up. Each selected prompt (built-in, typed, or image) becomes one eval case (run with-skill and without-skill).
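The slot-priority rule above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: `Option`, `build_prompt_options`, and the constants are hypothetical names, not part of the harness or of `AskUserQuestion`.

```python
from dataclasses import dataclass

MAX_OPTIONS = 4  # per-question option cap described above
UPLOAD_LABEL = "Build from an uploaded screenshot"
ALL_BUILTIN_LABEL = "All built-in prompts (default)"


@dataclass(frozen=True)
class Option:
    """Hypothetical stand-in for one multi-select option."""
    label: str
    preselected: bool = False


def build_prompt_options(builtin_prompts: list[str]) -> list[Option]:
    """Return at most MAX_OPTIONS options, never dropping the upload slot."""
    slots_for_builtin = MAX_OPTIONS - 1  # one slot is reserved for the upload option
    if len(builtin_prompts) > slots_for_builtin:
        # Too many to list individually: collapse into one pre-selected
        # option; subset picking would happen in a follow-up question.
        options = [Option(ALL_BUILTIN_LABEL, preselected=True)]
    else:
        options = [Option(p, preselected=True) for p in builtin_prompts]
    # No explicit "custom text" option: the auto "Other" entry covers it.
    options.append(Option(UPLOAD_LABEL))
    return options
```

With five built-in prompts this yields two options (the collapsed entry plus the upload entry); with three it yields four, the upload entry last.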
Always confirm what to verify unless the request makes it unambiguous. Present these options and let the user pick one or more (defaults in bold, based on the skill's entry in `references/runtime-matrix.md`):
| Option | What it does | When to suggest as default |
|---|---|---|
| Runtime + screenshots | Full pipeline: fixture → executor → static gate → run the app on iOS/Android and screenshot it. The runner (Expo Go or dev build) is a separate question, so don't name it here. | Default for any skill that renders an app screen. |
| Trigger accuracy | Run realistic prompts via `claude -p` and score how often the skill triggers (step 1). | Always useful as a standalone check. |
| Code checks (no device) | Fixture → executor → static gate, with no device run. | Default for skills whose default mode is `static-only`. |
Present these as ONE multi-select question: "What do you want to verify?" These are grading dimensions (how to judge what gets built), distinct from the Prompts phase (what to build). The user may pick any combination. When a prompt is an uploaded screenshot (see Prompts), include "Runtime + screenshots" so the harness captures the generated app and the grader can score it against the target.
Read `references/runtime-matrix.md` to find the skill's default mode before suggesting. If the request already specifies a mode (e.g. "just check if it triggers", "run it on device"), skip the question and proceed.
Pick the Expo SDK version once, up front. Detect the latest with `bash /abs/path/expo-skill-eval/scripts/latest-sdk.sh` (it prints the major, e.g. `56`; internally it uses `bun` to run `npm view expo dist-tags --json` and read the major via `JSON.parse`/`semver`, and it's covered by the bash-scripts rule, so don't run the registry query inline yourself, which would prompt). Then confirm with `AskUserQuestion`: default to that latest SDK, or let the user pin an older one (e.g. to reproduce a version-specific issue). Use the chosen version everywhere the fixture is built: pass it as the `<sdk>` arg to `make-fixture.sh` and write it into each eval case's `runtime.sdk`. If the request already names a version ("eval on SDK 54"), skip detection and use it.
Default to the latest: it stays compatible with the Expo Go that `expo start` installs on the device. Pinning an SDK older than the device's installed Expo Go makes `expo start` try to prompt "Install the recommended Expo Go version?"; with no TTY (the snapshot scripts read stdin from `/dev/null`) it dies with `Input is required, but 'npx expo' is in non-interactive mode` and every snapshot fails. So only pin an older SDK when you also pre-install a matching Expo Go on the simulator/emulator; otherwise stick with latest.
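To make the "read the major from the dist-tags" step concrete, here is a rough Python equivalent of that parse. It is a sketch only: the real `latest-sdk.sh` is described as using `JSON.parse`/`semver`, `latest_sdk_major` is a hypothetical helper, and the version strings in the usage note are made up.

```python
import json


def latest_sdk_major(dist_tags_json: str) -> str:
    """Extract the SDK major from dist-tags JSON, e.g. '56.0.3' -> '56'.

    Assumes the input is the JSON object printed by
    `npm view expo dist-tags --json`, which carries a "latest" tag.
    """
    tags = json.loads(dist_tags_json)
    latest = tags["latest"]
    major = latest.split(".", 1)[0]
    if not major.isdigit():
        raise ValueError(f"unexpected version string: {latest!r}")
    return major
```

For example, `latest_sdk_major('{"latest": "56.0.3"}')` returns `"56"`, which is the value that would go into `runtime.sdk`.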
Pick the runner: Expo Go (default) or a development build. Ask with `AskUserQuestion` (skip if the request already says which):
- Expo Go (default): the snapshot scripts run the app with `expo start --ios` / `expo start --android` as-is. Fast (no native compile), and it runs anything Expo Go bundles (including `@expo/ui` on SDK 56+). Cannot run custom native code (expo-modules, config plugins, native deps not in Expo Go).
- Development build: the snapshot scripts run `expo run:ios` / `expo run:android` instead, compiling a native dev client per fixture. Use this for skills whose output needs custom native code (the cases that would otherwise be `static-only`). Much slower: `expo run` prebuilds and natively compiles each fixture (minutes, especially the first), and needs the full iOS/Android build toolchain, so only choose it when the skill actually requires native code. Disk-heavy: each fixture's native build is multi-GB. The snapshot phase runs `clean-fixture.sh` after each fixture to keep peak usage to ~one build, but still prefer fewer eval cases and a single platform for dev-build runs, and keep a few GB free. `clean-fixture.sh` removes the per-fixture build output (`node_modules`, `ios`, `android`, `.expo`, `dist`, and the fixture's iOS DerivedData) and keeps the app source + git. The lever for dev-build disk is fewer eval cases + one platform: the script only reclaims per-fixture build output and never touches shared dependency caches, so nothing gets re-downloaded.
Pass the choice to the snapshot scripts via the `EXPO_SKILL_EVAL_RUNNER` env var (`expo-go` default, or `dev-build`), and reflect it in each eval case's `runtime.mode` (`expo-go` or `dev-build`). See step 5.
Pick the platforms: always ask, regardless of skill. Offer iOS / Android / web (multi-select) with `AskUserQuestion`; default to iOS + Android, but always present web as an option, and don't pre-filter by skill. Web is a valid choice for most skills: `@expo/ui`'s universal components (`Host`, `Row`, `Column`, `Button`, `List`, …) render on web, as do `use-dom`, NativeWind/Tailwind, API routes, and plain React Native. The only thing that won't show on web is a platform-specific native tree (`@expo/ui/swift-ui` or `@expo/ui/jetpack-compose`), which renders blank there. That blank is itself a useful signal, so it's still the user's call. Web runs via `snapshot-web.sh` (`expo start --web` + Playwright/Chromium) regardless of the runner (`expo run` is native-only; there's no web dev build), and it's the least-exercised path. Write the chosen set into each eval case's `runtime.platforms` and have `run_snapshots.py` loop them.
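The runner and platform choices combine into one command per platform. The mapping described above could be written down as follows; this is an illustrative sketch, not the snapshot scripts themselves (those are shell scripts), and `snapshot_command` is a hypothetical helper.

```python
def snapshot_command(runner: str, platform: str) -> list[str]:
    """Map a runner ('expo-go' | 'dev-build') and platform to an Expo CLI command."""
    if platform == "web":
        # Web always goes through `expo start --web`; there is no web dev build.
        return ["expo", "start", "--web"]
    if platform not in ("ios", "android"):
        raise ValueError(f"unknown platform: {platform!r}")
    if runner == "expo-go":
        return ["expo", "start", f"--{platform}"]
    if runner == "dev-build":
        return ["expo", f"run:{platform}"]
    raise ValueError(f"unknown runner: {runner!r}")
```

Keeping the web branch first makes the "regardless of the runner" rule explicit: a `dev-build` run that includes web still uses `expo start --web` for that platform.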
Confirm how `claude -p` subprocesses run, once, before starting. Ask with `AskUserQuestion` whether they may run with `--dangerously-skip-permissions`, then apply the same answer to every subprocess this run (never re-prompt mid-run):
- Skip permissions (recommended): pass `--dangerously-skip-permissions`. Each subprocess runs unattended inside a throwaway fixture under `/private/tmp/expo-skill-eval-*` and can write files and run setup commands without prompting.
- Accept edits only: pass `--permission-mode acceptEdits` instead. Bash/installs are auto-denied (no TTY), so some evals may produce partial output.
A bare `claude -p` with neither flag can't write files at all. If the request already states a preference ("skip permissions", "don't use the dangerous flag"), skip the question.
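Because the answer is applied to every subprocess, it helps to resolve it to a flag list once and reuse it. A minimal sketch, assuming the two choices are carried as the strings `skip-permissions` and `accept-edits`; `permission_args` and `claude_argv` are hypothetical helper names.

```python
def permission_args(choice: str) -> list[str]:
    """Translate the up-front permission answer into `claude -p` flags."""
    if choice == "skip-permissions":
        return ["--dangerously-skip-permissions"]
    if choice == "accept-edits":
        return ["--permission-mode", "acceptEdits"]
    raise ValueError(f"unknown permission choice: {choice!r}")


def claude_argv(prompt: str, choice: str) -> list[str]:
    """Build the argv for one subprocess; the same `choice` is reused for the whole run."""
    return ["claude", "-p", prompt, *permission_args(choice)]
```

Raising on an unknown choice, rather than falling back to a bare `claude -p`, avoids silently launching a subprocess that cannot write files.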
Confirm how to deliver the results viewer, once, up front. Publishing to claude.ai is outward-facing, so never do it mid-run by surprise; ask in the same up-front `AskUserQuestion` (alongside the permission flag):
- Local only (default): `generate_viewer.py` writes `viewer.html` and opens it in the local browser. Nothing leaves the machine.
- Publish a shareable Artifact: additionally render the viewer to a claude.ai Artifact (a default-private web page the user can share with teammates) at the very end. Only do this if the user opts in here.
If the request already says whether to share/publish, skip the question. See the Viewer section for the publish mechanics.
Eval case schema
You generate the run's eval cases, one per chosen prompt, and write them to `<workspace>/iteration-N/evals.json` (the viewer reads them from there). Each case extends the standard skill-creator eval-case shape with a `runtime` block and visual expectations:
```json
{
  "id": 1,
  "prompt": "Build me a settings screen with a dark mode toggle and a list of options",
  "expected_output": "Working Expo Router screen",
  "expectations": [
    "Uses Expo Router file-based routing",
    "TypeScript compiles with no errors"
  ],
  "runtime": {
    "mode": "expo-go",
    "platforms": ["ios", "android"],
    "sdk": "56"
  },
  "visual_expectations": [
    "No red error screen or Expo Go error overlay on any platform",
    "A settings screen with a visible toggle control is rendered"
  ]
}
```
- `runtime.mode`: how the eval runs after the static gate.
  - `"expo-go"`: run in Expo Go (`expo start --<platform>`) and screenshot. Fast, JS-only. Default.
  - `"dev-build"`: build a native dev client (`expo run:<platform>`) and screenshot. For skills whose output uses custom native code; much slower (native compile per fixture).
  - `"static-only"`: stop after the static gate. For skills that produce no UI, or when you don't want to run a device at all (CI).
  Consult `references/runtime-matrix.md` for which repo skills support which mode. (`dev-build` lets you actually run skills that previously had to be `static-only` for needing native code.)
- `runtime.platforms`: subset of `ios`, `android`, `web`, chosen up front (always offered, not gated on the skill; see Before starting). Defaults to `["ios", "android"]`.
- `runtime.sdk`: Expo SDK major for the fixture app. Set it to the version chosen up front (see Before starting). Omit to use the latest template.
- `reference_image` (optional, image prompt): absolute path to a target screenshot the skill must reproduce. When set, the executor is told to open it (via its Read tool) and build a matching app, and the grader scores how closely the generated app reproduces it (step 6) on top of the usual expectations. Set in the Prompts phase via "build from an uploaded screenshot."
An image-prompt case is a normal case with `reference_image` set; enable "Runtime + screenshots" so the harness captures the result to compare against the target:
```json
{
  "prompt": "Build an app whose UI matches the attached reference screenshot.",
  "reference_image": "/abs/path/to/target.png",
  "runtime": { "mode": "expo-go", "platforms": ["ios"], "sdk": "56" },
  "visual_expectations": ["Matches the reference's layout, components, and color treatment"]
}
```
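Before writing `evals.json`, a quick sanity check of each case against the schema fields described in this section can catch typos early. The following is a sketch under stated assumptions: `validate_case` is a hypothetical helper rather than part of the harness, and the rule that an image prompt needs a screenshot-capturing mode is an interpretation of the "enable Runtime + screenshots" guidance.

```python
from pathlib import PurePosixPath

VALID_MODES = {"expo-go", "dev-build", "static-only"}
VALID_PLATFORMS = {"ios", "android", "web"}


def validate_case(case: dict) -> list[str]:
    """Return a list of human-readable problems; an empty list means the case looks fine."""
    problems = []
    if not case.get("prompt"):
        problems.append("missing prompt")

    runtime = case.get("runtime", {})
    mode = runtime.get("mode", "expo-go")  # expo-go is the documented default
    if mode not in VALID_MODES:
        problems.append(f"unknown runtime.mode: {mode!r}")

    platforms = runtime.get("platforms", ["ios", "android"])
    unknown = [p for p in platforms if p not in VALID_PLATFORMS]
    if unknown:
        problems.append(f"unknown platforms: {unknown}")

    ref = case.get("reference_image")
    if ref is not None:
        if not PurePosixPath(ref).is_absolute():
            problems.append("reference_image must be an absolute path")
        if mode == "static-only":
            # Assumption: comparing against a target needs screenshots.
            problems.append("reference_image needs a mode that captures screenshots")
    return problems
```

Returning a list of problems instead of raising on the first one lets an orchestrator report every issue across all cases in a single pass.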
Pipeline per eval case
Orchestration model: on the main thread you run `python3 <orchestrator>` and almost nothing else. Every phase is driven by a small Python orchestrator you `Write` into the workspace and run with `python3 /private/tmp/expo-skill-eval-<skill>/<phase>.py` (covered by the `python3` rule). The orchestrators are the only place the `scripts/*.sh` files are invoked (always via `subprocess.run(["bash", "<scripts>/<name>.sh", …])`, which runs as a child of `python3` and needs no rule of its own) and the only place parallelism, logging, and directory creation live. So on the main thread you only ever: Write orchestrators, run them with `python3`, inspect outputs with the `Read`/`Glob`/`Grep` tools, and spawn the grader subagent.
Never put a command inside a chained/backgrounded/piped shell construct, and never run ad-hoc `mkdir`/`ls`/`cat`/`tail`/`echo`; that is what prompts. (A single standalone `bash …/scripts/<name>.sh …` is fine for one-off manual debugging, e.g. re-running one flaky snapshot, but the pipeline itself goes through the orchestrators.) Run each orchestrator in the foreground and let the tool call block until it finishes; the orchestrators already parallelize within a phase, so you don't need to overlap phases. Do not shell-background a phase with `… & echo "$!"` / `wait` (the `&`, `echo`, and `wait` segments have no rule and prompt). If you genuinely must run a phase while continuing other work, use the Bash tool's `run_in_background` parameter on a plain `python3 <orchestrator> 2>&1 | tee <ws>/…log` call, never hand-rolled shell `&`.
Expect exactly one permission prompt at the very start: the first `Write` into the workspace. `allowed-tools` can suppress `Bash`/`Read` but not `Write`/`Edit`, so choose "allow all edits in this directory for the session" on that first prompt; it covers every orchestrator, `evals.json`, and viewer file for the whole run.
0. Workspace setup
Create the run's directory tree once, with the workspace script, never with ad-hoc `mkdir` (a raw `mkdir` prompts: there is no `mkdir` rule, and a `"$WORKSPACE/…"` variable can't match a path glob anyway):
```bash
bash /abs/path/expo-skill-eval/scripts/make-workspace.sh /private/tmp/expo-skill-eval-<skill> iteration-N <num-evals>
```
This creates `trigger-evals/scratch` and `iteration-N/eval-<i>/{with_skill,without_skill}/outputs` for every eval. It is covered by `Bash(bash *expo-skill-eval/scripts/*)`, and the `mkdir`s inside run as children of the script (no rule of their own). After this, every other directory is made by the scripts/orchestrators that need it (`make-fixture.sh`, the executor orchestrator's `os.makedirs`, the snapshot scripts) or by the `Write` tool auto-creating parents, so you never need another `mkdir`.
1. Trigger eval (should-trigger only)
Write a `run_trigger_eval_real.py` script under the workspace's `trigger-evals/` directory. Use only `"should_trigger": true` queries: the expo plugin is a family of complementary skills, so multiple skills triggering on the same prompt is not a failure. Measure recall only: realistic prompts that should use the skill, scored by trigger rate.
The script should run `claude -p <query>` per query (with `--output-format=stream-json --verbose --include-partial-messages`, `CLAUDECODE` stripped from the env, and the permission flag confirmed up front in Before starting) and detect whether the target skill was triggered by watching for its `Skill` or `Read` tool call in the stream. Note: `--include-partial-messages` requires both `--output-format=stream-json` and `--verbose`; omitting either causes an immediate CLI error.
Load the skill under test: pass `--plugin-dir` to every trigger subprocess. The trigger eval measures whether the skill's description makes the model reach for it, so the subprocess must have the local skill (the version with your edits) loaded. A `claude -p` subprocess does not inherit the parent session's `--plugin-dir`, so add it explicitly: `--plugin-dir <plugin-root>`, where `<plugin-root>` is the absolute path to the plugin directory that owns the skill, i.e. the `plugins/expo` ancestor containing `.claude-plugin/plugin.json` (e.g. `--plugin-dir /Users/.../skills/plugins/expo`). It must be absolute: the subprocess runs from the throwaway `scratch/` cwd, so a relative `plugins/expo` won't resolve, and a missing plugin dir silently loads nothing, which masquerades as a 0% trigger rate. Then watch for the skill triggering under its plugin-qualified name (`<plugin>:<skill>`, e.g. `expo:expo-ui`). Two caveats: (1) if the published `expo` plugin is also installed globally, disable it (via `/plugin`) for the run and re-enable it after; otherwise two copies of `expo` collide in the subprocess and the model may trigger the published `expo:expo-ui`, silently scoring its description instead of your local edits (trigger detection only sees the tool-call name, so it can't tell the copies apart; dev checkouts usually don't have it installed). (2) Never make a synthetic duplicate of the skill: a real loaded copy always wins, so the synthetic harness scores 0%. (Executors are unaffected by an installed plugin: they read the local `SKILL_PATH` directly and pass no `--plugin-dir`.)
Run each query's subprocess from an empty throwaway cwd (e.g. `trigger-evals/scratch/`), not the repo root. A should-trigger prompt like "build me a settings screen" can make the subprocess write files, and with `--dangerously-skip-permissions` those writes would otherwise land in the skills repo. Trigger detection only needs the skill's `Skill` / `Read` call to appear in the stream (it doesn't need a fixture), so any incidental writes are throwaway.
Set a per-query subprocess timeout of at least 300 seconds. A 180s limit is too short: some queries cause the model to start generating code before triggering the skill, which pushes total runtime past 3 minutes.
Run trigger evals once per skill, not per code eval case.
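The detection step in this section can be sketched as a scan over the stream's JSON lines. This is an illustration, not the harness's `run_trigger_eval_real.py`. It assumes each line is one JSON event and that a tool call appears somewhere inside an event as a block with `"type": "tool_use"`, a `name`, and an `input` object. The exact event layout and input field names are assumptions, which is why the code walks the event recursively and matches on the serialized input instead of a specific key.

```python
import json
from typing import Any, Iterable, Iterator


def iter_tool_uses(node: Any) -> Iterator[dict]:
    """Yield every dict shaped like a tool_use block, at any nesting depth."""
    if isinstance(node, dict):
        if node.get("type") == "tool_use":
            yield node
        for value in node.values():
            yield from iter_tool_uses(value)
    elif isinstance(node, list):
        for value in node:
            yield from iter_tool_uses(value)


def skill_triggered(stream_lines: Iterable[str], qualified_name: str, skill_md_path: str) -> bool:
    """True if a Skill call names the skill or a Read call opens its SKILL.md."""
    for line in stream_lines:
        line = line.strip()
        if not line:
            continue
        try:
            event = json.loads(line)
        except json.JSONDecodeError:
            continue  # ignore any non-JSON noise in the stream
        for tool_use in iter_tool_uses(event):
            payload = json.dumps(tool_use.get("input", {}))
            if tool_use.get("name") == "Skill" and qualified_name in payload:
                return True
            if tool_use.get("name") == "Read" and skill_md_path in payload:
                return True
    return False
```

The trigger rate for the skill is then the fraction of should-trigger queries for which `skill_triggered(...)` returns `True`.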
2. Fixture
Each executor run gets a fresh Expo app, created by `scripts/make-fixture.sh <app-path> <sdk> [clean|full]`:
```bash
scripts/make-fixture.sh <workspace>/iteration-N/eval-X/<config>/app <sdk>       # blank app (default)
scripts/make-fixture.sh <workspace>/iteration-N/eval-X/<config>/app <sdk> full  # keep example tabs
```
The script creates the app with `bunx create-expo-app -t default@sdk-<version>` (or the latest template when no version is given) once per SDK version + variant, caches it under `~/.cache/expo-skill-eval/fixtures/`, and clones the cache with APFS copy-on-write, so the first run per variant pays the install cost and every later run is near-instant. The default `clean` variant runs the template's `reset-project` script, so executors start from a blank app and every screen in the output is theirs, which is a much cleaner grading signal. Use `full` only when the eval prompt assumes an existing app (e.g. "I have an app with two tabs..."). The script also resets git inside the clone, so `git diff` in the app shows exactly what the executor changed (useful evidence for the grader).
Build fixtures sequentially, then fan out executors; never create fixtures concurrently. `make-fixture.sh` shares a cache under `~/.cache/expo-skill-eval/fixtures/` keyed by SDK+variant. If two runs both find the cache cold and call `bunx create-expo-app` at the same time, bun's link step collides and one fails with `EEXIST` / "could not determine executable to run for package create-expo-app". So in the executor orchestrator (step 3), create all fixtures one at a time first, with a plain Python loop calling `subprocess.run(["bash", "<scripts>/make-fixture.sh", app, sdk, variant])` (where `sdk` is the version chosen up front), then fan out the `claude -p` executors with a `ThreadPoolExecutor`. Sequential creation is cheap: only the first fixture per SDK+variant pays the install cost; the rest are ~1s APFS clones. (And never fan fixtures out with ad-hoc shell like `make-fixture.sh A & make-fixture.sh B & wait`: the `&` / `wait` segments prompt; the sequential Python loop avoids both the race and the prompt.)
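The "sequential fixtures, then parallel executors" ordering can be sketched as below. It is a minimal outline, not the actual orchestrator: `run_phase`, `make_fixture_argv`, and the injectable `run_cmd` / `run_executor` parameters are hypothetical names, and the executor body (the `claude -p` call) is left to the caller.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Sequence


def make_fixture_argv(scripts_dir: str, app: str, sdk: str, variant: str = "clean") -> list[str]:
    """argv for one make-fixture.sh call, always routed through bash."""
    return ["bash", f"{scripts_dir}/make-fixture.sh", app, sdk, variant]


def run_phase(
    apps: Sequence[str],
    scripts_dir: str,
    sdk: str,
    run_executor: Callable[[str], object],
    run_cmd: Callable[..., object] = subprocess.run,
    max_workers: int = 4,
) -> list:
    # 1) Fixtures one at a time: concurrent creation races the shared cache.
    for app in apps:
        run_cmd(make_fixture_argv(scripts_dir, app, sdk), check=True)
    # 2) Only then fan out the executors; pool.map keeps results in input order.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(run_executor, apps))
```

Passing `run_cmd` and `run_executor` in as parameters keeps the ordering logic testable without a real fixture script or a real `claude -p` process.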
trigger-evals/run_trigger_eval_real.py"should_trigger": true脚本需为每个查询运行(携带,从环境变量中移除,并使用开始前——明确范围部分中确认的权限标志),并通过监听流中的或工具调用检测目标技能是否被触发。注意:需要同时设置和——省略其中任何一个都会立即导致CLI错误。
claude -p <query>--output-format=stream-json --verbose --include-partial-messagesCLAUDECODESkillRead--include-partial-messages--output-format=stream-json--verbose加载被测技能——为每个触发子进程传递。触发评估衡量的是技能的描述是否会让模型选择它,因此子进程必须加载本地技能(包含你的修改的版本)。子进程不会继承父会话的,因此需显式添加:,其中是拥有该技能的插件目录的绝对路径——即包含的父目录(例如)。必须使用绝对路径:子进程在临时目录下运行,相对路径无法解析——且缺失插件目录会静默加载不到任何内容,表现为0%的触发率。然后监听技能是否以插件限定名称触发(,例如)。两个注意事项:(1) 如果已发布的插件也已全局安装,需在运行期间通过命令禁用它,并在之后重新启用——否则子进程中会存在两个副本冲突,模型可能触发已发布的,无意中为其描述打分而非你的本地修改版本(触发检测仅能看到工具调用名称,无法区分副本;开发环境通常不会安装已发布版本)。(2) 绝不要创建技能的合成副本——真实加载的副本始终优先,因此合成harness的得分会是0%。(执行器不受已安装插件影响:它们直接读取本地,无需传递。)
--plugin-dirclaude -p--plugin-dir--plugin-dir <plugin-root><plugin-root>.claude-plugin/plugin.jsonplugins/expo--plugin-dir /Users/.../skills/plugins/exposcratch/plugins/expo<plugin>:<skill>expo:expo-uiexpo/pluginexpoexpo:expo-uiSKILL_PATH--plugin-dir在空的临时工作目录(例如)中运行每个查询的子进程,而非仓库根目录。类似“build me a settings screen”的应触发提示词可能会让子进程写入文件,若使用,这些写入操作会直接写入技能仓库。触发检测仅需流中出现技能的/调用——无需测试夹具——因此任何临时写入的内容都可丢弃。
trigger-evals/scratch/--dangerously-skip-permissionsSkillRead为每个查询的子进程设置至少300秒的超时时间。180秒的限制太短——部分查询会导致模型在触发技能前开始生成代码,从而使总运行时间超过3分钟。
每个技能仅需运行一次触发评估,无需针对每个代码评估用例重复运行。
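The fixture rule from step 2 (create fixtures one at a time, then run executors in parallel) can be sketched as below. This is a minimal illustration, not the harness's actual orchestrator: the function names, the `runs` dictionary keys, and the injectable `runner` parameter are assumptions made for the example; only the `make-fixture.sh` argument order comes from the text.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor


def fixture_cmd(scripts_dir, app, sdk, variant="clean"):
    """Build the make-fixture.sh argv (scripts_dir is a hypothetical absolute path)."""
    return ["bash", f"{scripts_dir}/make-fixture.sh", app, sdk, variant]


def run_pipeline(runs, scripts_dir, run_executor, max_workers=4, runner=subprocess.run):
    """Create every fixture sequentially, then fan out the executors.

    runs: list of dicts with hypothetical keys "app", "sdk" and optional "variant".
    run_executor: callable taking one run dict (for example a claude -p wrapper).
    runner: stand-in for subprocess.run so the loop can be dry-run.
    """
    # Phase 1: strictly sequential, so two cold-cache installs never overlap.
    for run in runs:
        cmd = fixture_cmd(scripts_dir, run["app"], run["sdk"], run.get("variant", "clean"))
        runner(cmd, check=True)

    # Phase 2: executors are independent, so a thread pool is safe here.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(run_executor, runs))
```

Passing a recording function as `runner` makes it easy to confirm the fixture commands are issued in order before any executor starts.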
3. Generate (executor subagents)
Run executors as `claude -p` subprocess calls from a Python script, not via the `Agent` tool. The `Agent` tool spawns subagents with their own permission context, so file edits inside the fixture app will prompt the user. A `claude -p` subprocess is a separate process outside the permission system entirely (the same pattern the trigger eval harness uses).

Write a Python script to `/private/tmp/expo-skill-eval-<skill>/run_executors.py`. First create every run's fixture in a sequential loop, `subprocess.run(["bash", "<scripts>/make-fixture.sh", app, sdk, variant], …)` one at a time (concurrent creation races the shared bun cache; see step 2). Then run the with-skill and without-skill `claude -p` calls in parallel via a `ThreadPoolExecutor`. Both phases live inside Python (covered by the `python3` rule), so nothing runs as ad-hoc shell on the main thread. Each executor prompt must include:

- The skill path (with-skill runs only) and the eval prompt.
- Image-prompt cases (`reference_image` set): the absolute path to the target screenshot plus an instruction like "Open the reference screenshot at `<path>` with your Read tool and build an app whose UI matches it as closely as you can: layout, components, spacing, and colors." (`claude -p` renders PNGs read this way, so the executor can actually see the target.)
- The fixture app path: "Make your changes inside `<app-path>`. The project already exists and has dependencies installed. Use absolute paths for all file operations."
- "Before writing any files, inspect the project layout (run `ls`, read `package.json` and `app.json`) to find the correct routes directory. Recent SDK default templates place Expo Router routes in `src/app/`; older ones use `app/` at the project root. Inspect to confirm which this fixture uses."
- "Do NOT start the dev server, boot simulators, or take screenshots; the harness does that after you finish."
- Where to save a short summary of what was built.

Flags for the `claude -p` subprocess:

- Strip `CLAUDECODE` from the environment (`env = {k: v for k, v in os.environ.items() if k != "CLAUDECODE"}`); otherwise `claude -p` hangs silently when nested inside a running Claude Code session.
- A permission flag, confirmed with the user up front (see the "Before starting" section): either `--dangerously-skip-permissions` or `--permission-mode acceptEdits`. Bake the chosen flag into the generated script. A bare `claude -p` with neither flag can't write files: it has no TTY to approve the edit and emits code as text instead.
- Do NOT pass `--plugin-dir` to executors (unlike the trigger eval). The with-skill run already reads the skill by its absolute `SKILL_PATH`, so it tests the local content directly; and the without-skill run must have no skill available at all, because loading the plugin would let the skill auto-trigger and contaminate the baseline. Keeping executors path-based also cleanly separates the two questions: the executor measures content quality (is the skill useful once read?), the trigger eval measures triggering (does the description get it picked?).

Capture stdout/stderr per run to a log file next to the fixture for grading evidence. Set the timeout to 900s per executor: with-skill runs read multiple reference files before coding and regularly take 5–10 minutes.
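The executor flags above translate into a small wrapper like the following. It is a sketch under assumptions: the helper names and the `log_path` parameter are hypothetical, and the flags are exactly the ones named in the text (no `--plugin-dir`).

```python
import os
import subprocess

EXECUTOR_TIMEOUT_S = 900  # per-executor timeout named in the text
ALLOWED_PERMISSION_FLAGS = [
    ["--dangerously-skip-permissions"],
    ["--permission-mode", "acceptEdits"],
]


def executor_env(environ=None):
    """Copy the environment without CLAUDECODE so a nested claude -p does not hang."""
    environ = os.environ if environ is None else environ
    return {k: v for k, v in environ.items() if k != "CLAUDECODE"}


def executor_cmd(prompt, permission_flag):
    """Build the claude -p argv; executors stay path-based, so no --plugin-dir."""
    flag = list(permission_flag)
    if flag not in ALLOWED_PERMISSION_FLAGS:
        raise ValueError(f"unexpected permission flag: {flag}")
    return ["claude", "-p", prompt, *flag]


def run_executor(prompt, permission_flag, log_path):
    """Run one executor, writing stdout and stderr to a log next to the fixture."""
    with open(log_path, "w") as log:
        proc = subprocess.run(
            executor_cmd(prompt, permission_flag),
            env=executor_env(),
            stdout=log,
            stderr=subprocess.STDOUT,
            timeout=EXECUTOR_TIMEOUT_S,
        )
    return proc.returncode
```

`run_executor` raises `subprocess.TimeoutExpired` when the 900s limit is hit; an orchestrator would likely catch that and record it as a failed run rather than abort the whole pool.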
4. Static gate
Write `run_static.py` and run it with `python3`. For each eval/config app it calls `subprocess.run(["bash", "<scripts>/check-static.sh", app, "ios,android"], capture_output=True, …)` across a `ThreadPoolExecutor` (static gates are independent, so run them concurrently inside Python, never with shell `&`/`wait`), and writes each result to `eval-<i>/<config>/static.json` (exit code + captured output) for the grader. `check-static.sh` runs `tsc --noEmit`, `expo lint`, and `expo export`.
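A possible shape for the static-gate orchestrator described in step 4 is sketched below. The JSON keys `exit_code` and `output`, the function names, and the injectable `runner` are assumptions for illustration; the text only states that the file holds the exit code and captured output.

```python
import json
import os
import subprocess
from concurrent.futures import ThreadPoolExecutor


def run_static_gate(scripts_dir, app, result_path, platforms="ios,android", runner=subprocess.run):
    """Run check-static.sh for one app and persist the outcome as static.json."""
    proc = runner(
        ["bash", f"{scripts_dir}/check-static.sh", app, platforms],
        capture_output=True,
        text=True,
    )
    # Hypothetical schema: the text only requires exit code + captured output.
    result = {"exit_code": proc.returncode, "output": proc.stdout + proc.stderr}
    os.makedirs(os.path.dirname(result_path), exist_ok=True)
    with open(result_path, "w") as fh:
        json.dump(result, fh, indent=2)
    return result


def run_all_static(scripts_dir, jobs, max_workers=4, runner=subprocess.run):
    """jobs: list of (app, result_path) pairs; gates are independent, so run them concurrently."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = [
            pool.submit(run_static_gate, scripts_dir, app, path, runner=runner)
            for app, path in jobs
        ]
        return [f.result() for f in futures]
```

A non-zero exit code is recorded rather than raised, since a failed gate is still evidence the grader needs.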
5. Run + screenshot (serial across evals)
Write `run_snapshots.py` and run it with `python3`. Simulators and emulators are shared resources, so this orchestrator runs serially (no thread pool): for each app that passed the static gate, and each platform, it creates the `outputs/` dir with `os.makedirs` and calls `subprocess.run(["bash", "<scripts>/snapshot-<platform>.sh", app, f"{outputs}/<platform>.png", port], env={**os.environ, "EXPO_SKILL_EVAL_RUNNER": runner}, …)`. Pass the port as a positional argument: use `8081` for iOS and `8082` for Android. `expo run:ios/android --port N` is supported, and separate ports let you run both platforms without port collisions if you ever parallelize. Screenshots land in the run's `outputs/` directory so the viewer renders them inline.

Reclaim disk after each fixture (essential for `dev-build` runs). Once all selected platforms for an app are captured (and before the next fixture builds), call `subprocess.run(["bash", "<scripts>/clean-fixture.sh", app])`. Each `expo run:<platform>` leaves multi-GB native build output (iOS Pods + DerivedData, Android Gradle build); without this, evals × configs × iterations pile up and fill the disk mid-run (the instability you'll see is the disk filling). `clean-fixture.sh` removes the heavy regenerable dirs (`node_modules`, `ios`, `android`, `.expo`, `dist`) and the fixture's iOS DerivedData, keeping the app source + git so the grader's `git diff` still works. With serial snapshots + per-fixture cleanup, peak disk stays at about one fixture's build instead of all of them. (Harmless for `expo-go` runs too; they just have little to reclaim.)

Runner selection: the `runner` value is `expo-go` or `dev-build` and is passed to the snapshot scripts as `EXPO_SKILL_EVAL_RUNNER`. The identifiers that survive from this passage pair `expo-go` with `expo start --<platform>` and `dev-build` with `expo run:<platform> --port <port>`; it also names `EXPO_SKILL_EVAL_BUNDLE_TIMEOUT`, `make-fixture.sh`, `expo-dev-client`, and `expo run`, but the surrounding text is missing from the source.

Snapshot scripts always capture the initial route `/`. They open the app via a deep link and take one screenshot; they cannot tap or navigate. Design eval prompts so the feature under test renders at the root route. If the executor places the main UI behind a navigation action (e.g. an "Open Settings" button on the index), the snapshot will miss the feature entirely and all visual expectations will fail.

Each `snapshot-<platform>.sh` frees its Metro port on startup (kills any stale process left on it by a crashed prior run) and tears Metro down on exit, so you never need to run `lsof`/`kill`/`pkill` yourself to clear ports (that would prompt, and it's already handled). It then starts Metro, waits for the "Bundled" line in the Metro log, settles, captures a screenshot, and tears Metro down. iOS boots the newest available iPhone simulator if none is booted; Android boots the first AVD if no device is attached (the slow path: boot once and reuse across the whole iteration). Android first recycles a wedged/`offline` emulator (graceful `adb emu kill`, then force-kill + adb reset) so a half-dead instance can't poison the run, and boots with hardware GPU (`-gpu host`, Metal-accelerated on Apple Silicon). If `host` self-aborts the emulator on a given machine (qemu `SIGABRT` deep in gfxstream/Metal, possible on Apple Silicon under load), edit `GPU_MODE` in `snapshot-android.sh` to a software mode (`guest` renders reliably but slowly, so bump the settle; avoid `swiftshader_indirect`, which hangs at boot on arm64). `snapshot-web.sh` runs only when `platforms` includes web. Each script writes a Metro log next to the screenshot (`<name>.metro.log`); include it in the grader's inputs. If a script exits non-zero it still attempts a best-effort screenshot (an error screen is evidence too). dev-build relaunch: after Metro is up, the scripts relaunch the app via `xcrun simctl launch` (iOS) and `adb shell am start -n <pkg>/.MainActivity` (Android); both avoid the "Open in X?" system dialog that a URL-scheme deep link triggers on first launch.

After all screenshots for the iteration are captured, always generate the viewer by passing the workspace root to the checked-in script:

```bash
python3 /abs/path/expo-skill-eval/scripts/generate_viewer.py /private/tmp/expo-skill-eval-<skill>
```

It writes `viewer.html` into the workspace root (one level above `iteration-N/`) and opens it in the browser itself (via `webbrowser.open`), so no separate `open` command (and no `Bash(open:*)` rule) is needed. See the Viewer section below.
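The serial snapshot loop from step 5 could look roughly like this. It is a sketch, not the real `run_snapshots.py`: the function names and the `(app, outputs)` pair layout are assumptions, and only iOS and Android are handled because the text gives ports for those two platforms only.

```python
import os
import subprocess

# Ports from the text; web is left out because no port is given for it.
PORTS = {"ios": "8081", "android": "8082"}


def snapshot_cmd(scripts_dir, app, outputs, platform):
    """Build the snapshot-<platform>.sh argv with the port as a positional argument."""
    return [
        "bash",
        f"{scripts_dir}/snapshot-{platform}.sh",
        app,
        f"{outputs}/{platform}.png",
        PORTS[platform],
    ]


def run_snapshots(scripts_dir, apps, platforms, runner_mode, run=subprocess.run):
    """apps: list of (app, outputs) pairs that passed the static gate."""
    env = {**os.environ, "EXPO_SKILL_EVAL_RUNNER": runner_mode}
    for app, outputs in apps:  # serial on purpose: simulators are shared resources
        os.makedirs(outputs, exist_ok=True)
        for platform in platforms:
            # No check=True: a failing script still leaves a best-effort screenshot.
            run(snapshot_cmd(scripts_dir, app, outputs, platform), env=env)
        # Reclaim disk before the next fixture builds.
        run(["bash", f"{scripts_dir}/clean-fixture.sh", app])
```

Keeping cleanup inside the per-app loop is what bounds peak disk usage to roughly one fixture's build output.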
6. Grade
Spawn a grader subagent in the foreground. Its prompt must include:

- The eval prompt, expectations list, and visual_expectations from the eval case.
- The instructions in `agents/visual-grader.md` (screenshot grading, redbox detection).
- The screenshot files, Metro logs, and the step-4 `static.json` as inputs.
- Image-prompt cases (case has `reference_image`): also include the target screenshot (`reference_image`), `references/design-rubric.md`, and the fixture's `git diff`. Tell the grader to compare the generated screenshot(s) to the target and emit the `reference_match` + `quality` blocks below.

The grader writes `grading.json` next to the outputs with this shape:

```json
{
  "score": 8.5,
  "max_score": 9,
  "expectations": [
    {"text": "...", "passed": true, "evidence": "..."}
  ],
  "reference_match": {
    "score": 7, "max": 10,
    "evidence": "ios.png vs target.png: same two-section grouped list + toggle; accent color differs (blue vs target's green); row spacing tighter than target"
  },
  "quality": {
    "dimensions": [
      {"name": "Layout & hierarchy", "score": 2, "max": 3, "evidence": "ios.png: …"}
    ],
    "subtotal": 17,
    "max": 24,
    "summary": "…"
  },
  "user_notes_summary": {"needs_review": false, "notes": ""}
}
```

Visual expectations go into the same `expectations` array with evidence naming the screenshot file and describing what is visible. The `reference_match` block (how closely the generated app reproduces the target screenshot) and the `quality` block (design-rubric scores from `references/design-rubric.md`) are emitted only for image-prompt cases, or when a quality grade is explicitly requested. Omit both for plain text-prompt runs.
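To make the `grading.json` shape concrete, the sketch below reduces one grading record to a few summary numbers. It is illustrative only: `summarize_grading` and its output keys are hypothetical, and computing the percentage as `score / max_score` is an assumption about how the fields relate, not something the source states.

```python
import json


def summarize_grading(grading):
    """Reduce one grading.json dict to pass counts and percentages."""
    expectations = grading["expectations"]
    summary = {
        # Assumption: overall percentage is score divided by max_score.
        "percent": 100.0 * grading["score"] / grading["max_score"],
        "passed": sum(1 for e in expectations if e["passed"]),
        "total": len(expectations),
        # reference_match / quality appear only for image-prompt cases
        # (or when a quality grade was explicitly requested).
        "has_quality_blocks": "reference_match" in grading or "quality" in grading,
    }
    if "quality" in grading:
        quality = grading["quality"]
        summary["quality_percent"] = 100.0 * quality["subtotal"] / quality["max"]
    return summary


def load_grading(path):
    """Read a grading.json file from disk and summarize it."""
    with open(path) as fh:
        return summarize_grading(json.load(fh))
```

Checking `has_quality_blocks` first lets downstream code skip rubric handling entirely for plain text-prompt runs.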
Rollout phases
Build out and debug the pipeline in this order; each phase is independently useful:

- Static: steps 1–4 only (`runtime.mode: "static-only"` for everything). No devices needed; CI-friendly.
- iOS: add `snapshot-ios.sh` to the loop. `simctl` is the most scriptable target.
- Android: add `snapshot-android.sh`. Emulator boot is the slowest part, so keep one emulator running for the whole session.
- Web: add `snapshot-web.sh` for skills that target web (uses Playwright via `bunx`; first run downloads Chromium).
Practical notes
- Temp locations: all eval workspaces go under `/private/tmp/expo-skill-eval-<skill-name>/iteration-N/`. Everything in this run (`Read`, `Write`, `Edit`, and `Bash`) is covered by the `allowed-tools` frontmatter, so a correctly-loaded skill runs prompt-free.
- Permission rule forms (why this skill stays prompt-free): the rule syntax matters and the two tool families behave differently:
  - `Bash(...)` rules are path-scoped to the skill's own code (no broad interpreters). `Bash(python3 /private/tmp/expo-skill-eval-*)` (plus the `/tmp` alias) runs the Python orchestrators you generate under the workspace; `Bash(python3 *expo-skill-eval/scripts/*)` runs the checked-in `scripts/generate_viewer.py`; `Bash(tee /private/tmp/expo-skill-eval-*)` (+ `/tmp`) lets `python3 … 2>&1 | tee <workspace>/…log` write a log without prompting; `Bash(bash *expo-skill-eval/scripts/*)` runs only this skill's `scripts/*.sh`. Because every path is pinned, the escape hatches stay denied: `python3 -c …`, `bash -c …`, `tee /etc/…`, and running code anywhere else do not match (verified empirically: a scoped rule allows `bash <dir>/run.sh` but blocks `bash -c …` and any other path). Commands the scripts call internally (`bunx`, `xcrun simctl`, `adb`, `git`, `mkdir`, `expo`) are children of the script, not Bash tool calls, so they need no rule. Do not run ad-hoc `mkdir`/`ls`/`find`/`cat`/`grep` from the main thread (they have no rule and prompt, and a raw `mkdir "$WORKSPACE/…"` can't match a path glob because the path is an unexpanded variable): create the directory tree with `make-workspace.sh` (step 0), let orchestrators create their own dirs (`os.makedirs`), and inspect results with the `Read`/`Glob`/`Grep` tools (no Bash rule needed).
  - Bash rule matching (tested, non-obvious): a Bash rule is a gitignore-style glob over the command string. `*` matches any run of characters including `/` and spaces and works mid-pattern, so `Bash(python3 /private/tmp/expo-skill-eval-*)` matches `python3 /private/tmp/expo-skill-eval-x/run.py 2>&1`, and `Bash(bash *expo-skill-eval/scripts/*)` matches `bash /any/abs/path/expo-skill-eval/scripts/foo.sh args`. Two gotchas that burned earlier attempts: `**` is matched literally (never use it in a Bash rule), and the `:*` suffix only works right after the command token (`Bash(python3:*)`), not after a partial path (`Bash(python3 /path-:*)` does not match). Compound commands split on `|`, `&&`, `||`, `;`, `&` and each segment needs its own matching rule.
  - `Read` rules suppress prompts; `Write`/`Edit` rules do not. This is a Claude Code asymmetry (not a pattern bug, and not reload: in a session where the `Bash`/`Read` rules from this same frontmatter are clearly working, `Write` still prompts). File creation/editing always goes through Claude Code's edit-approval flow regardless of `allowed-tools`. The frontmatter still scopes `Read`/`Write`/`Edit` to `…/expo-skill-eval-*/**` (both the `/tmp` and `/private/tmp` forms, since macOS doesn't auto-resolve the symlink) as documentation and a guardrail, but those `Write`/`Edit` entries won't silence the prompt on their own. Practical consequence: at the start of a run you get one Write prompt for the workspace. Choose "Yes, allow all edits in this directory for the session" and every later orchestrator / `evals.json` / viewer write under that workspace goes through silently. That single directory approval, not a rule, is what makes file-writing prompt-free.
  - Reload after editing frontmatter: a full restart, not `/reload-skills`. `allowed-tools` is read once when the skill loads at session start; `/reload-skills` reloads the skill body but does not reliably refresh the permission rules. After editing this file, quit Claude Code entirely and start a new session, then re-run the skill. Otherwise a stale (cached) ruleset keeps prompting even though the file on disk is correct.
  - Grader subagents run with their own permission context and will still prompt for file access. That is expected and separate from the main thread's rules.
- Calling eval scripts: one standalone command, never chained. Invoke each script as its own Bash call with an absolute path: `bash /abs/path/expo-skill-eval/scripts/snapshot-ios.sh arg1 arg2` (covered by `Bash(bash *expo-skill-eval/scripts/*)`). Do not combine it with `&`, `&&`, `||`, `;`, `wait`, `tail`, `head`, or `echo`: compound commands are checked per segment, and those extra segments have no rule, so the whole thing prompts even though the `bash …/scripts/…` part is allowed. (The one allowed pipe is `… 2>&1 | tee <workspace>/…log`, since the scoped `tee` rule covers it.) Need parallelism or output trimming? Put it in a Python orchestrator (covered by `python3 /…/expo-skill-eval-*`), which runs scripts via `subprocess` across a `ThreadPoolExecutor`. Inspect results with the `Read`/`Glob`/`Grep` tools, not `cat`/`ls`/`grep`. General rule: under this skill's tight scoping, any ad-hoc shell the agent improvises will prompt. The fix is to move it into a script/orchestrator (or use the scoped `tee`), never to broaden a rule.
- Inspecting outputs (screenshots, logs, files): use tools, not shell. To find files use the Glob tool (e.g. `/private/tmp/expo-skill-eval-<skill>/iteration-N/**/ios.png`); to view them use the Read tool. Read renders PNGs visually, which is exactly what you need to confirm a screenshot rendered. To search file contents use Grep. Never use `find`/`ls`/`cat` for this: they prompt, and `find … -exec …` is deliberately not allowed because its `-exec` can run anything (e.g. `-exec rm`). These tools are scoped and prompt-free; reach for them every time you'd otherwise type `find`/`ls`/`cat`.
- Generated Python scripts: write orchestration/aggregation scripts under the workspace (e.g. `/private/tmp/expo-skill-eval-<skill>/aggregate.py`) and run them with `python3` (covered by `Bash(python3 /private/tmp/expo-skill-eval-*)`). The viewer is the exception: it's the checked-in `scripts/generate_viewer.py`, run via `Bash(python3 *expo-skill-eval/scripts/*)`. `Write` auto-creates parent dirs but prompts the first time, so approve the workspace directory once (see the `Write`/`Edit` note above). Capture output either by having the script write its own log or via `python3 … 2>&1 | tee <workspace>/…log` (covered by the scoped `tee` rule); read logs back with the `Read` tool. Don't use `python3 -c …` for setup (the scoped rule only matches a workspace script path, so a bare `-c` prompts).
- Trigger evals vs installed plugin: detect the real installed skill name (e.g. `expo:expo-ui`) in the stream. A synthetic-duplicate harness always scores 0% when the real plugin is installed because the model picks the genuine skill over the synthetic copy.
- Benchmark aggregation: save each run's `grading.json` + `timing.json` under `eval-<N>/<config>/run-1/`. Write a Python aggregation script under the workspace and run it with `python3`.
- Expo Go ceiling: anything requiring custom native code (expo-module, App Clips, brownfield) cannot run in Expo Go. Use `static-only` mode for those; see `references/runtime-matrix.md` before writing eval cases for a skill (note: `@expo/ui` does run in Expo Go on SDK 56+).
- API-route skills: instead of a screenshot, verify with `curl` against the route while Metro is up; record the response as an output file for grading.
- Timing data: capture token counts and duration into `timing.json` immediately after each executor run; it is not recoverable later. To capture token counts, add `--output-format=stream-json --verbose` to the executor `claude -p` call and parse the `message_start`/`message_delta` events from the log. Without these flags the log only contains prose and elapsed seconds are the only recoverable metric.
- First-launch dialogs: Expo Go occasionally shows a one-time prompt on a fresh simulator. If a screenshot captures a dialog instead of the app, re-run the snapshot script (it reopens the URL) and re-capture.
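The per-segment rule check described in the permission notes can be approximated with the sketch below, which is handy for reasoning about why a chained command prompts. It is only an approximation built on Python's `fnmatch`, not Claude Code's actual matcher: it ignores shell quoting, omits the `/tmp` alias rules, and does not model the literal `**` or `:*` behaviour. The rule strings are the ones quoted in the notes; everything else is hypothetical.

```python
import re
from fnmatch import fnmatchcase

# Rule bodies quoted from the notes (the text inside Bash(...)).
RULES = [
    "python3 /private/tmp/expo-skill-eval-*",
    "python3 *expo-skill-eval/scripts/*",
    "tee /private/tmp/expo-skill-eval-*",
    "bash *expo-skill-eval/scripts/*",
]

# Split on |, &&, ||, ; and & while leaving redirections such as 2>&1 intact.
_SEPARATORS = re.compile(r"\s*(?:&&|\|\||;|\||(?<!>)&(?!>))\s*")


def segments(command):
    """Naive split of a compound command into segments (quoting is ignored)."""
    return [part for part in _SEPARATORS.split(command) if part]


def would_prompt(command, rules=RULES):
    """True when at least one segment matches none of the rules."""
    return any(
        not any(fnmatchcase(segment, rule) for rule in rules)
        for segment in segments(command)
    )
```

For example, a script call piped through the scoped `tee` passes, while the same call followed by `&& echo done` is flagged because `echo done` has no rule.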
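For the timing-data note (parsing `message_start`/`message_delta` events from an executor log), a tolerant parser might look like the sketch below. The event shapes are assumptions: it presumes usage fields laid out as in the Anthropic Messages streaming format (`message.usage.input_tokens` on `message_start`, a cumulative `usage.output_tokens` on `message_delta`) and that events may be wrapped under an `event` key. Verify both against a real log before relying on the numbers. Cache-related token fields are ignored.

```python
import json


def token_usage(log_lines):
    """Sum input/output tokens from message_start and message_delta events."""
    input_tokens = 0
    output_tokens = 0
    current_output = 0  # running output count for the message being streamed
    for line in log_lines:
        line = line.strip()
        if not line.startswith("{"):
            continue  # skip prose or blank lines in the log
        try:
            obj = json.loads(line)
        except json.JSONDecodeError:
            continue
        # ASSUMPTION: the event is either top-level or nested under "event".
        event = obj.get("event", obj)
        if not isinstance(event, dict):
            continue
        kind = event.get("type")
        if kind == "message_start":
            output_tokens += current_output  # close out the previous message
            usage = event.get("message", {}).get("usage", {})
            input_tokens += usage.get("input_tokens", 0)
            current_output = usage.get("output_tokens", 0)
        elif kind == "message_delta":
            # Treated as cumulative for the current message, so overwrite.
            current_output = event.get("usage", {}).get("output_tokens", current_output)
    return {"input_tokens": input_tokens, "output_tokens": output_tokens + current_output}
```

The result can be written into `timing.json` together with the elapsed seconds measured around the `subprocess.run` call.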
Viewer
After taking screenshots, always generate and open the HTML viewer so the user can see results immediately without being asked. The viewer is the checked-in `scripts/generate_viewer.py`; run it with the workspace root as its argument:

```bash
python3 /abs/path/expo-skill-eval/scripts/generate_viewer.py /private/tmp/expo-skill-eval-<skill>
```

It writes a self-contained `/private/tmp/expo-skill-eval-<skill>/viewer.html` and opens it in the browser itself (`webbrowser.open`). What it renders:

- A tab per iteration (`iteration-*` under the workspace root; remembers the last active tab in `localStorage`).
- For each eval case (read from `<iteration>/evals.json`): side-by-side with_skill / without_skill columns, each showing static-gate status, score, the platform screenshots (click to zoom; embedded as base64 `data:` URIs so the file is self-contained), the expectation list with PASS/FAIL badges, and reviewer notes.
- For image-prompt cases (a `grading.json` with `reference_match`/`quality`): the target screenshot beside the generated ones, the `reference_match` score (generated vs target), the `quality` rubric per config (one bar per dimension with its score/max plus the subtotal), and the quality delta (with_skill − without_skill subtotal) in the summary bar alongside the correctness delta.
- A summary bar with with_skill %, without_skill %, and delta.
- A trigger accuracy table when `trigger-evals/trigger_results.json` exists.
- A dark background with color-coded scores (green ≥85%, amber ≥65%, red below).
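The summary bar and color coding listed above reduce to a few lines of arithmetic. The sketch below uses the thresholds from the text (green at 85% and above, amber at 65% and above, red below); the aggregation rule (total score divided by total max score per config) and all names are assumptions, since the viewer's own computation is not shown here.

```python
def score_color(percent):
    """Map a percentage to the viewer's color bands."""
    if percent >= 85:
        return "green"
    if percent >= 65:
        return "amber"
    return "red"


def summary_bar(with_skill, without_skill):
    """Each argument is a list of (score, max_score) pairs, one per eval case."""

    def percent(pairs):
        total_max = sum(max_score for _, max_score in pairs)
        if not total_max:
            return 0.0
        return 100.0 * sum(score for score, _ in pairs) / total_max

    with_pct = percent(with_skill)
    without_pct = percent(without_skill)
    return {
        "with_skill": with_pct,
        "without_skill": without_pct,
        "delta": with_pct - without_pct,
        "with_skill_color": score_color(with_pct),
        "without_skill_color": score_color(without_pct),
    }
```

A positive `delta` means the skill improved the score over the baseline run.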
Publishing the viewer (only if opted in up front)
The local `viewer.html` is always generated. Only when the user chose "Publish a shareable Artifact" in the up-front confirmation, additionally render it to a claude.ai Artifact at the very end. Never publish without that opt-in (it's outward-facing and a published page can be cached/indexed). Mechanics:

- The `Artifact` tool wraps the file in its own `<!doctype html>…<head></head><body>` skeleton, so the file you hand it must be page content only: inline `<style>`/`<script>`, base64 `data:` images, and a `<title>`, but no `<!DOCTYPE>/<html>/<head>/<body>` tags of its own (a full standalone document gets double-wrapped and renders wrong).
- The script emits an Artifact-friendly variant when you add `--artifact`: `python3 /abs/path/expo-skill-eval/scripts/generate_viewer.py /private/tmp/expo-skill-eval-<skill> --artifact` writes `viewer_artifact.html` (same content, skeleton stripped, no browser open). Pass that file to the `Artifact` tool (`favicon: "📊"`), not the standalone one.
- The viewer is already self-contained (base64 screenshots, inline CSS/JS), so it satisfies the Artifact CSP (no external hosts).
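To illustrate what "skeleton stripped" means for the Artifact variant, here is a minimal regex-based sketch. It is not how `generate_viewer.py --artifact` is implemented (that code is not shown here); `strip_skeleton` is a hypothetical helper that simply drops the doctype and the `<html>`, `<head>`, and `<body>` open/close tags while keeping everything between them.

```python
import re

# Matches the doctype plus opening/closing html, head and body tags only.
# The \b keeps tags such as <header> untouched.
_SKELETON = re.compile(r"<!doctype[^>]*>|</?(?:html|head|body)\b[^>]*>", re.IGNORECASE)


def strip_skeleton(document):
    """Return page content only: title, style, script and markup survive."""
    return _SKELETON.sub("", document).strip()
```

A regex is adequate for a generated file with a known, simple skeleton; arbitrary HTML would call for a real parser instead.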
References
- `references/runtime-matrix.md`: per-skill runtime applicability (expo-go vs static-only, platform notes).
- `agents/visual-grader.md`: screenshot grading instructions for the grader subagent.