benchmark-sandbox

Benchmark Sandbox — Remote Eval via Vercel Sandboxes

Run benchmark scenarios inside Vercel Sandboxes: ephemeral Firecracker microVMs with node24. Each sandbox gets a fresh Claude Code + Vercel CLI + agent-browser install, the local vercel-plugin uploaded, and runs a 3-phase eval pipeline:
  • Phase 1 (BUILD): Claude Code builds the app with `--dangerously-skip-permissions --debug`
  • Phase 2 (VERIFY): A follow-up Claude Code session uses `agent-browser` to walk through user stories, fixing issues until all pass (20 min timeout)
  • Phase 3 (DEPLOY): A third Claude Code session links to vercel-labs, runs `vercel deploy`, and fixes build errors (up to 3 retries). Deployed apps have deployment protection enabled by default.
Skills are tracked across all 3 phases; each phase may trigger additional skill injections as new files/patterns are created. After each phase, a haiku structured scoring step (`claude -p --json-schema --model haiku`) evaluates the results as structured JSON.

Proven Working Script

Use `run-eval.ts`, the proven eval runner:
bash
# Run default scenarios with full 3-phase pipeline
bun run .claude/skills/benchmark-sandbox/run-eval.ts

# With dynamic scenarios from a JSON file (recommended; see "Dynamic Scenarios" below)
bun run .claude/skills/benchmark-sandbox/run-eval.ts --scenarios-file /tmp/my-scenarios.json

# Keep sandboxes alive overnight with public URLs
bun run .claude/skills/benchmark-sandbox/run-eval.ts --keep-alive --keep-hours 8

# Build-only (skip verification and deploy)
bun run .claude/skills/benchmark-sandbox/run-eval.ts --skip-verify --skip-deploy

# Run specific scenarios by slug
bun run .claude/skills/benchmark-sandbox/run-eval.ts --scenarios splitwise-clone,calendly-clone

CLI Flags

| Flag | Default | Description |
| --- | --- | --- |
| `--concurrency N` | 5 | Max parallel sandboxes (max 10) |
| `--timeout MS` | 1800000 (30 min) | Per-phase timeout in ms |
| `--keep-alive` | off | Keep sandboxes running after eval |
| `--keep-hours N` | 8 | Hours to keep alive (with `--keep-alive`) |
| `--skip-verify` | off | Skip the agent-browser verification phase |
| `--skip-deploy` | off | Skip the Vercel deploy phase |
| `--scenarios a,b,c` | all | Only run specific scenarios by slug |
| `--scenarios-file path` | | Load scenarios from a JSON file instead of built-in defaults |
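
To make the defaults above concrete, here is a minimal sketch of how such flags could be parsed. The `EvalOptions` shape and `parseFlags` function are hypothetical names for illustration, and clamping `--concurrency` to 10 is an assumption about how the documented maximum is enforced; the real `run-eval.ts` may behave differently.

```typescript
// Hypothetical option shape mirroring the documented flags; not taken from run-eval.ts.
interface EvalOptions {
  concurrency: number;
  timeoutMs: number;
  keepAlive: boolean;
  keepHours: number;
  skipVerify: boolean;
  skipDeploy: boolean;
  scenarios: string[] | null; // null means "all"
  scenariosFile: string | null;
}

function parseFlags(argv: string[]): EvalOptions {
  // Defaults as documented for the CLI flags.
  const opts: EvalOptions = {
    concurrency: 5,
    timeoutMs: 1_800_000,
    keepAlive: false,
    keepHours: 8,
    skipVerify: false,
    skipDeploy: false,
    scenarios: null,
    scenariosFile: null,
  };
  const rest = argv.slice();
  while (rest.length > 0) {
    const flag = rest.shift();
    // Consumes the value that follows a flag such as `--concurrency 5`.
    const value = (): string => {
      const v = rest.shift();
      if (v === undefined) throw new Error(`${flag} requires a value`);
      return v;
    };
    switch (flag) {
      case "--concurrency":
        // Assumption: values above the documented maximum are clamped to 10.
        opts.concurrency = Math.min(10, Number(value()));
        break;
      case "--timeout": opts.timeoutMs = Number(value()); break;
      case "--keep-alive": opts.keepAlive = true; break;
      case "--keep-hours": opts.keepHours = Number(value()); break;
      case "--skip-verify": opts.skipVerify = true; break;
      case "--skip-deploy": opts.skipDeploy = true; break;
      case "--scenarios": opts.scenarios = value().split(","); break;
      case "--scenarios-file": opts.scenariosFile = value(); break;
      default: throw new Error(`Unknown flag: ${flag}`);
    }
  }
  return opts;
}
```

For example, `parseFlags(["--keep-alive"])` keeps every other default, so `keepHours` stays at 8.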

Dynamic Scenarios (Recommended Approach)

Instead of hardcoding tech-specific prompts, generate scenarios dynamically as a JSON file. Prompts should describe real-world apps people want to build using user stories — no tech name-dropping. Let the plugin figure out what Vercel tech to inject.

Scenario JSON Format

json
[
  {
    "slug": "pet-adoption-board",
    "prompt": "Build me a pet adoption listing board where shelters can post animals...",
    "expectedSkills": ["ai-sdk", "nextjs", "shadcn", "vercel-functions"],
    "userStories": [
      "As a visitor, I can see a grid of pet listings with photos and names",
      "As a visitor, I can click a pet card to see a detail page",
      "As a visitor, I can filter pets by type"
    ]
  }
]
Each scenario needs: `slug` (string), `prompt` (string), `expectedSkills` (string[]), `userStories` (tuple of exactly 3 strings).
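
A scenario file can be checked against these requirements before any sandbox is launched. The sketch below is one possible validator; `Scenario`, `isStringArray`, and `parseScenarios` are hypothetical names, and the error messages are illustrative rather than taken from `run-eval.ts`.

```typescript
// Shape described above: slug, prompt, expectedSkills, and exactly 3 user stories.
interface Scenario {
  slug: string;
  prompt: string;
  expectedSkills: string[];
  userStories: [string, string, string];
}

function isStringArray(value: unknown): value is string[] {
  return Array.isArray(value) && value.every((item) => typeof item === "string");
}

// Parses the JSON text of a scenarios file and rejects malformed entries early.
function parseScenarios(jsonText: string): Scenario[] {
  const data: unknown = JSON.parse(jsonText);
  if (!Array.isArray(data)) {
    throw new Error("Scenario file must contain a JSON array");
  }
  return data.map((entry: unknown, i: number): Scenario => {
    if (typeof entry !== "object" || entry === null) {
      throw new Error(`Scenario ${i} is not an object`);
    }
    const { slug, prompt, expectedSkills, userStories } = entry as Record<string, unknown>;
    if (typeof slug !== "string" || slug.length === 0) {
      throw new Error(`Scenario ${i}: slug must be a non-empty string`);
    }
    if (typeof prompt !== "string" || prompt.length === 0) {
      throw new Error(`Scenario ${slug}: prompt must be a non-empty string`);
    }
    if (!isStringArray(expectedSkills)) {
      throw new Error(`Scenario ${slug}: expectedSkills must be string[]`);
    }
    if (!isStringArray(userStories) || userStories.length !== 3) {
      throw new Error(`Scenario ${slug}: userStories must contain exactly 3 strings`);
    }
    return {
      slug,
      prompt,
      expectedSkills,
      userStories: userStories as [string, string, string],
    };
  });
}
```

Failing fast here is cheaper than discovering a malformed scenario after a sandbox has already been created.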

Prompt Design Guidelines

  • Focus on what the user wants, not what tech to use
  • Describe real-world apps that solve real problems with friendly, stylish UX
  • Include AI features naturally (recommendations, analysis, generation)
  • Always end with: "Link the project to my vercel-labs team. After building all files, start the dev server on port 3000 with `npx next dev --port 3000`."
  • Include storage needs (photos, uploads) to trigger vercel-storage
  • Include scheduled tasks (reminders, cleanup) to trigger cron-jobs
  • Include auth/middleware to trigger routing-middleware
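
The required closing sentence is easy to forget when scenarios are generated in bulk. A small helper can append it if it is missing; `REQUIRED_ENDING` and `withRequiredEnding` are hypothetical names introduced only for this sketch.

```typescript
// The closing sentence every prompt should end with, per the guidelines above.
const REQUIRED_ENDING =
  "Link the project to my vercel-labs team. After building all files, " +
  "start the dev server on port 3000 with `npx next dev --port 3000`.";

// Appends the required ending unless the prompt already ends with it.
function withRequiredEnding(prompt: string): string {
  const trimmed = prompt.trim();
  return trimmed.endsWith(REQUIRED_ENDING) ? trimmed : `${trimmed} ${REQUIRED_ENDING}`;
}
```

The helper is idempotent, so it can be applied to every generated prompt without checking first.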

Structured Scoring (Haiku)

Each phase gets a structured JSON score via `claude -p --json-schema --model haiku --setting-sources ""` running inside the sandbox. This is a separate quick pass (no tools, no hooks) that just reads the phase output and returns structured data.

Build Score Schema

json
{
  "completeness": "complete|partial|minimal|empty",
  "hasApiRoutes": true,
  "hasUIComponents": true,
  "hasAIFeature": true,
  "devServerRunning": true,
  "missingFeatures": ["feature1"],
  "summary": "Brief assessment"
}

Verify Score Schema (per user story)

json
{
  "stories": [
    { "index": 1, "status": "pass|fail", "reason": "Evidence from output" }
  ]
}
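
A verify score in this shape can be reduced to a per-scenario verdict. The following is a sketch only; `StoryResult`, `VerifyScore`, and `summarizeVerify` are hypothetical names, and the `3/3 PASS` label format is borrowed from how results are reported elsewhere in this document.

```typescript
// Types mirroring the verify score schema above.
interface StoryResult {
  index: number;
  status: "pass" | "fail";
  reason: string;
}

interface VerifyScore {
  stories: StoryResult[];
}

interface VerifySummary {
  passed: number;
  total: number;
  allPassed: boolean;
  failedIndexes: number[];
  label: string;
}

// Collapses per-story results into counts plus a short label.
function summarizeVerify(score: VerifyScore): VerifySummary {
  const total = score.stories.length;
  const passed = score.stories.filter((s) => s.status === "pass").length;
  const failedIndexes = score.stories
    .filter((s) => s.status === "fail")
    .map((s) => s.index);
  const allPassed = total > 0 && passed === total;
  return {
    passed,
    total,
    allPassed,
    failedIndexes,
    label: `${passed}/${total} ${allPassed ? "PASS" : "FAIL"}`,
  };
}
```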

Deploy Score Schema

json
{
  "deployed": true,
  "url": "https://xxx.vercel.app",
  "buildSucceeded": true,
  "errors": [],
  "summary": "Brief assessment"
}
Important: The `claude -p --output-format json` response wraps results. The actual schema data is in `parsed.structured_output`, not the top-level object.
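
In practice this means the scorer's stdout has to be unwrapped before the schema fields are read. A minimal sketch follows; `unwrapStructuredOutput` and `DeployScore` are hypothetical names, and the sample wrapper in the usage note shows only the `structured_output` field, with any other wrapper fields left out.

```typescript
// Mirrors the deploy score schema above.
interface DeployScore {
  deployed: boolean;
  url: string;
  buildSucceeded: boolean;
  errors: string[];
  summary: string;
}

// Extracts the schema data from the wrapper object printed by the scoring pass.
function unwrapStructuredOutput<T>(stdout: string): T {
  const parsed: unknown = JSON.parse(stdout);
  if (typeof parsed !== "object" || parsed === null || !("structured_output" in parsed)) {
    throw new Error("Response has no structured_output field");
  }
  return (parsed as { structured_output: T }).structured_output;
}
```

Usage, with an illustrative wrapper: `unwrapStructuredOutput<DeployScore>('{"structured_output":{"deployed":true,"url":"https://xxx.vercel.app","buildSucceeded":true,"errors":[],"summary":"ok"}}').deployed` evaluates to `true`, whereas reading `deployed` from the top-level object would give `undefined`.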

Critical Sandbox Environment Facts

| Property | Value |
| --- | --- |
| Home directory | `/home/vercel-sandbox` (NOT `/home/user/` or `/root/`) |
| User | `vercel-sandbox` (NOT `root`) |
| Claude binary | `/home/vercel-sandbox/.global/npm/bin/claude` |
| PATH (via sh -c) | Includes `~/.global/npm/bin`, so claude is findable by name |
| Port exposure | `sandbox.domain(3000)` → `https://subdomain.vercel.run` |
| Snapshot persistence | Files AND npm globals survive snapshot restore; use `sandbox.snapshot()` → `Sandbox.create({ source: { type: "snapshot", snapshotId } })` |
| SDK version | `@vercel/sandbox@1.8.0` (v2 beta's named sandbox endpoint returns 404 for this team) |
| Team tier | Enterprise (vercel-labs), no known sandbox time cap |

Key Discoveries (Hard-Won)

  1. Snapshots work: `sandbox.snapshot()` preserves files AND npm globals. Use it after build to create a restore point before verify/deploy. Note: snapshotting stops the source sandbox, so create a new one from the snapshot to continue.
  2. Plugin install: Use `npx add-plugin <path> -s project -y --target claude-code`. This works because claude is in PATH after `npm install -g`. The `--target claude-code` flag is required because add-plugin can't auto-detect Claude Code without an initialized `~/.claude/` dir.
  3. File uploads: Use `sandbox.writeFiles([{ path, content: Buffer }])`, NOT runCommand heredocs. Heredocs with special characters cause 400 errors from the sandbox API.
  4. Claude flags: Always use `--dangerously-skip-permissions --debug`. The `--debug` flag writes to `~/.claude/debug/`.
  5. Auth: API key from macOS Keychain (`ANTHROPIC_AUTH_TOKEN`, a `vck_*` Vercel Claude Key for AI Gateway), Vercel token from `~/.local/share/com.vercel.cli/auth.json` (a `vca_*` token).
  6. OIDC for sandbox SDK: Run `npx vercel link --scope vercel-labs -y` + `npx vercel env pull` once before first use.
  7. Port exposure: Pass `ports: [3000]` in `Sandbox.create()` to get a public URL immediately via `sandbox.domain(3000)`. Works on v1.8.0: the URL is assigned at creation time, before anything listens.
  8. extendTimeout: Use `sandbox.extendTimeout(ms)` to keep sandboxes alive past their initial timeout. Verified working: it extends by the requested duration. Use this for overnight keep-alive.
  9. Background commands: `runCommand` with backgrounded processes (`&` or `nohup`) may throw ZodError on v1. Write a script file first, then execute it.
  10. Session cleanup race: The `session-end-cleanup.mjs` hook deletes `/tmp/vercel-plugin-*-seen-skills.d/` on session end. Extract artifacts BEFORE the session completes, or rely on poll history data.
  11. agent-browser works in sandboxes: Install via `npm install -g agent-browser`. Claude Code can use it for browser-based verification inside the sandbox.
  12. No hobby tier cap: Early 301s timeouts were from lower default timeout values in earlier script iterations, not a tier limitation. Enterprise (vercel-labs) has no known sandbox time cap; sandboxes ran 10+ minutes successfully.
  13. claude -p works inside sandboxes: `claude -p --json-schema --output-format json --model haiku` works for structured scoring passes. No nesting issue when running inside a sandbox (only fails when running Claude inside Claude on the same machine).
  14. Deploy project naming: ALWAYS use timestamped slugs with minute precision (e.g., `pet-adoption-board-202603101853`) to avoid collisions when linking to vercel-labs team projects. These are demo projects, and we generate many per day. Format: `<slug>-<YYYYMMDDHHMM>`.
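
The naming rule in the last item can be captured in a small helper. This is a sketch: `timestampedSlug` is a hypothetical name, and using UTC for the timestamp is an assumption, since the source does not say which timezone the runner uses.

```typescript
// Builds <slug>-<YYYYMMDDHHMM> with minute precision.
// Assumption: the timestamp is rendered in UTC.
function timestampedSlug(slug: string, now: Date = new Date()): string {
  const pad = (n: number): string => String(n).padStart(2, "0");
  const stamp =
    String(now.getUTCFullYear()) +
    pad(now.getUTCMonth() + 1) +
    pad(now.getUTCDate()) +
    pad(now.getUTCHours()) +
    pad(now.getUTCMinutes());
  return `${slug}-${stamp}`;
}
```

Passing the date as a parameter keeps the function deterministic in tests while defaulting to the current time in normal use.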

When to Use This vs benchmark-agents

| | benchmark-agents (WezTerm) | benchmark-sandbox |
| --- | --- | --- |
| Environment | Local macOS terminal panes | Remote Vercel Sandboxes (Amazon Linux) |
| Parallelism | Limited by local resources | Up to 10 (Hobby) or 2,000 (Pro) concurrent |
| Session type | Interactive TTY via `/bin/zsh -ic` | Direct `sh -c` invocation (PTY not required) |
| Artifact access | Direct filesystem (`~/.claude/debug/`) | `sandbox.readFile()` / poll via `runCommand` |
| Port exposure | `localhost:3000` | Public `https://sb-XXX.vercel.run` URLs |
| Verification | Manual browser check | Automated agent-browser in Phase 2 |
| Deploy | Manual | Automated Phase 3 → permanent `*.vercel.app` URLs |
| Scoring | Manual review | Haiku structured JSON scoring per phase |
| Best for | Manual eval + iteration loop | Automated parallel coverage + verification + deploy runs |

How It Works

  1. Create fresh sandbox: `Sandbox.create({ runtime: "node24", ports: [3000], env: { ANTHROPIC_API_KEY, ... } })` (no snapshot)
  2. Install tools: `npm install -g @anthropic-ai/claude-code vercel agent-browser` (~20s per sandbox)
  3. Auth Vercel CLI: Write token to `~/.local/share/com.vercel.cli/auth.json`
  4. Upload plugin: `sandbox.writeFiles()` for 80 plugin files, then `npx add-plugin`
  5. Phase 1 (BUILD): Claude Code builds the app (30 min timeout)
  6. Score build: Haiku evaluates completeness, API routes, UI, AI features
  7. Start dev server: If not already running, start `npx next dev --port 3000`
  8. Extend timeout: `sandbox.extendTimeout()` for verify + deploy + keep-alive
  9. Phase 2 (VERIFY): Claude Code uses `agent-browser` to test user stories (20 min timeout). Prompt tells Claude to start dev server itself if not running.
  10. Score verify: Haiku evaluates each user story as pass/fail with reasons
  11. Re-extract skills: Skills re-collected after verify phase (agent-browser + code fixes trigger more)
  12. Phase 3 (DEPLOY): Claude Code runs `vercel link` + `vercel deploy`, fixes build errors (30 min timeout)
  13. Score deploy: Haiku evaluates deploy success, URL extraction, errors
  14. Re-extract skills: Skills re-collected after deploy phase
  15. Write incremental results: Each scenario writes its own `result.json` immediately on completion (survives crashes)
  16. Extract source archive: `source.tar.gz` of project files saved locally
  17. Generate report: Markdown report with build/verify/deploy scores, skill coverage, URLs

Sandbox Session Flow (Per Scenario)

Sandbox.create({ runtime: "node24", ports: [3000], env: { ANTHROPIC_API_KEY, ANTHROPIC_BASE_URL, VERCEL_PLUGIN_LOG_LEVEL: "trace" } })
  ├─ npm install -g @anthropic-ai/claude-code vercel agent-browser   (~20s)
  ├─ Write Vercel CLI auth token to ~/.local/share/com.vercel.cli/auth.json
  ├─ mkdir -p /home/vercel-sandbox/<slug> && npm init -y
  ├─ sandbox.writeFiles() → /home/vercel-sandbox/vercel-plugin/  (80 files, ~945KB)
  ├─ npx add-plugin /home/vercel-sandbox/vercel-plugin -s project -y --target claude-code
  ├─ Phase 1: BUILD
  │   ├─ sandbox.writeFiles() → /tmp/prompt.txt
  │   ├─ claude --dangerously-skip-permissions --debug --settings <path> "$(cat /tmp/prompt.txt)"
  │   │   (with AbortSignal.timeout(TIMEOUT_MS))
  │   ├─ Poll every 20s:
  │   │   ├─ ls /tmp/vercel-plugin-*-seen-skills.d/     (claimed skills)
  │   │   ├─ cat /tmp/vercel-plugin-*-seen-skills.txt    (seen skills snapshot)
  │   │   ├─ find ~/.claude/debug -type f                (debug log count)
  │   │   ├─ find <project> -newer /tmp/prompt.txt       (new project files)
  │   │   └─ curl localhost:3000                         (port status)
  │   ├─ Extract build artifacts
  │   └─ Haiku build score (structured JSON)
  ├─ Start dev server (if not already running)
  ├─ sandbox.extendTimeout(...)
  ├─ Phase 2: VERIFY (if >1 project file exists)
  │   ├─ sandbox.writeFiles() → /tmp/verify.txt  (agent-browser verification prompt)
  │   ├─ claude --dangerously-skip-permissions --debug "$(cat /tmp/verify.txt)"
  │   │   (with AbortSignal.timeout(1_200_000) — 20 min)
  │   ├─ Re-extract skills (verify phase triggers more)
  │   └─ Haiku verify score (per-story pass/fail JSON)
  ├─ Phase 3: DEPLOY (if >3 project files)
  │   ├─ sandbox.writeFiles() → /tmp/deploy.txt
  │   ├─ claude --dangerously-skip-permissions --debug "$(cat /tmp/deploy.txt)"
  │   │   (links to vercel-labs, deploys, fixes build errors up to 3x)
  │   ├─ Extract deploy URL from output (*.vercel.app)
  │   ├─ Re-extract skills (deploy phase triggers more)
  │   └─ Haiku deploy score (structured JSON)
  ├─ Write <slug>/result.json immediately (crash-safe)
  ├─ Update aggregate results.json (complete: false until all done)
  ├─ Extract source.tar.gz
  └─ sandbox.stop()  (skipped if --keep-alive)
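
The gating conditions in the flow above (verify needs more than 1 project file, deploy needs more than 3) combine with the skip flags. The sketch below shows one way that decision could be expressed; `PhaseFlags` and `followUpPhases` are hypothetical names and are not taken from `run-eval.ts`.

```typescript
interface PhaseFlags {
  skipVerify: boolean;
  skipDeploy: boolean;
}

type FollowUpPhase = "verify" | "deploy";

// Decides which phases run after BUILD, given how many project files it produced.
function followUpPhases(projectFileCount: number, flags: PhaseFlags): FollowUpPhase[] {
  const phases: FollowUpPhase[] = [];
  // VERIFY runs only if more than 1 project file exists.
  if (!flags.skipVerify && projectFileCount > 1) phases.push("verify");
  // DEPLOY runs only if more than 3 project files exist.
  if (!flags.skipDeploy && projectFileCount > 3) phases.push("deploy");
  return phases;
}
```

With no skip flags, a build that produced 2 files would be verified but not deployed.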

Verification Phase Details

The verify phase is the "closer": its job is to make the app work and prove it. Key behaviors:
  • Always runs if >1 project file exists (no longer gated on port 3000 being up)
  • Starts dev server itself if not already running: the prompt tells Claude to check `localhost:3000` and run `npx next dev --port 3000` if needed
  • 20 minute timeout: enough for agent-browser to open pages, screenshot, interact, fix broken code, restart server, and re-verify
  • Triggers skill injection: the verify session creates/edits files, triggering PreToolUse and PostToolUse hooks
  • Uses agent-browser workflow: `open` → `wait --load networkidle` → `screenshot --annotate` → `snapshot -i` → interact → fix → re-verify
  • Results scored by haiku: no more parsing `STORY_1: PASS` from free text

Deploy Phase Details

The deploy phase uses a full Claude Code session (for skill tracking) to:
  1. Run `vercel link --yes --scope vercel-labs --project <slug>-YYYYMMDDHHMM` (minute-precision timestamp, per the project naming rule)
  2. Run `vercel deploy --yes`
  3. If build fails, fix code and retry (up to 3 attempts)
  4. Important: the session unsets the `VERCEL_TOKEN` env var so the CLI falls back to `~/.local/share/com.vercel.cli/auth.json`
  5. Deployment protection is enabled by default on the vercel-labs team
Deploy URL is extracted by regex from Claude's output, with haiku as fallback URL extractor.
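
The regex step could look like the sketch below. The pattern and the `extractDeployUrl` name are assumptions for illustration (the actual expression in `run-eval.ts` is not shown in this document), and preferring the last match is also an assumption, made because retries can print several URLs.

```typescript
// Finds *.vercel.app URLs in free-form session output and returns the last one,
// or null so the caller can fall back to the haiku extractor.
function extractDeployUrl(output: string): string | null {
  const pattern = /https:\/\/[a-z0-9-]+(?:\.[a-z0-9-]+)*\.vercel\.app\b/gi;
  const matches = output.match(pattern);
  if (matches === null || matches.length === 0) return null;
  return matches[matches.length - 1] ?? null;
}
```

Returning `null` rather than throwing makes the fallback path explicit at the call site.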

DO NOT (Hard Rules)

Same rules as `benchmark-agents`, plus sandbox-specific:
  • DO NOT use `claude --print` or the `-p` flag for BUILD/VERIFY/DEPLOY phases: hooks don't fire without tool-calling sessions (use `-p` only for haiku scoring passes)
  • DO NOT let sandboxes run without extracting artifacts: the ephemeral filesystem is lost on stop
  • DO NOT pass API keys via `writeFiles()`; use `Sandbox.create({ env: { ... } })`
  • DO NOT skip snapshotting after build: it's your safety net if verify/deploy kills the sandbox
  • DO NOT use v2 beta SDK: named sandbox endpoint returns 404 for this team; use v1.8.0
  • DO NOT use `runCommand` heredocs to write file content; use `sandbox.writeFiles()` instead
  • DO NOT assume `/home/user/` exists: the home dir is `/home/vercel-sandbox/`
  • DO NOT use simple project names without timestamps: always append `-YYYYMMDDHHMM` to avoid collisions across runs

Prerequisites

bash
# One-time setup: link project for OIDC sandbox auth
npx vercel link --scope vercel-labs -y
npx vercel env pull .env.local

# Auth (auto-resolved from macOS Keychain + Vercel CLI auth):
# - ANTHROPIC_API_KEY: from Keychain "ANTHROPIC_AUTH_TOKEN" (vck_* key) or env var
# - VERCEL_TOKEN: from ~/.local/share/com.vercel.cli/auth.json (vca_* token) or env var
# - ANTHROPIC_BASE_URL: defaults to https://ai-gateway.vercel.sh

Commands

Run eval with dynamic scenarios (recommended)

bash
# Generate scenarios as JSON, then run
bun run .claude/skills/benchmark-sandbox/run-eval.ts --scenarios-file /tmp/my-scenarios.json

# With all phases + keep-alive for overnight
bun run .claude/skills/benchmark-sandbox/run-eval.ts --scenarios-file /tmp/scenarios.json --keep-alive --keep-hours 8

# Build-only, no verification or deploy
bun run .claude/skills/benchmark-sandbox/run-eval.ts --scenarios-file /tmp/scenarios.json --skip-verify --skip-deploy

# Filter to specific slugs from file or defaults
bun run .claude/skills/benchmark-sandbox/run-eval.ts --scenarios splitwise-clone,calendly-clone

Monitoring While Running

运行时监控

The orchestrator prints live status. For manual checks on a running sandbox:
typescript
// List claimed skills
const claims = await sandbox.runCommand("sh", ["-c",
  "ls /tmp/vercel-plugin-*-seen-skills.d/ 2>/dev/null"
]);

// Check hook firing count
const hooks = await sandbox.runCommand("sh", ["-c",
  "find /home/vercel-sandbox/.claude/debug -name '*.txt' -exec grep -c 'executePreToolHooks' {} +"
]);

// Check port 3000
const port = await sandbox.runCommand("sh", ["-c",
  "curl -s -o /dev/null -w '%{http_code}' http://localhost:3000"
]);

// Get public URL (after ports: [3000] in Sandbox.create)
const url = sandbox.domain(3000);

Artifact Export Layout

Results are written to `~/dev/vercel-plugin-testing/sandbox-results/<run-id>/`:
<run-id>/
  results.json             # Aggregate results (complete: false until all done, then true)
  report.md                # Markdown report with scores, coverage, URLs
  <slug>/
    result.json            # Per-scenario result (written immediately on completion)
    source.tar.gz          # Project source archive
Each scenario result includes:
  • `slug`, `sandboxId`, `success`, `durationMs`
  • `claimedSkills[]`, `expectedSkills[]`, `projectFiles[]`
  • `appUrl`: public `https://sb-XXX.vercel.run` URL (sandbox lifetime only)
  • `deployUrl`: permanent `https://xxx.vercel.app` URL (if deploy succeeded)
  • `pollHistory[]`: timestamped skill/file/port snapshots
  • `verification`: `{ ran, exitCode, stories: [{ index, status }], output }`
  • `buildScore`: haiku structured completeness assessment
  • `deployScore`: haiku structured deploy assessment
The markdown report (`report.md` / `.reports/<timestamp>.md`) includes:
  1. Summary table — slug, build status, skills, files, verify results, deploy URL, duration
  2. Per-scenario details — build score, deploy score, verification per-story pass/fail
  3. Skill coverage — expected vs actual per scenario, missing/bonus breakdown
  4. Total unique skills across all scenarios
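
The expected vs actual skill comparison in the report comes down to set differences over `expectedSkills[]` and `claimedSkills[]`. A sketch of that calculation is below; `skillCoverage` and its return shape are hypothetical, and treating an empty expected list as 100% coverage is an assumption.

```typescript
interface SkillCoverage {
  matched: string[];
  missing: string[]; // expected but never claimed
  bonus: string[];   // claimed but not expected
  percent: number;   // matched / expected, rounded to a whole percent
}

function skillCoverage(expected: string[], claimed: string[]): SkillCoverage {
  const expectedSet = new Set(expected);
  const claimedSet = new Set(claimed);
  const expectedList = Array.from(expectedSet);
  const matched = expectedList.filter((skill) => claimedSet.has(skill));
  const missing = expectedList.filter((skill) => !claimedSet.has(skill));
  const bonus = Array.from(claimedSet).filter((skill) => !expectedSet.has(skill));
  // Assumption: a scenario with no expected skills counts as fully covered.
  const percent =
    expectedList.length === 0 ? 100 : Math.round((matched.length / expectedList.length) * 100);
  return { matched, missing, bonus, percent };
}
```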

Proven Results (2026-03-10)

Across 34 scenarios run in 5 batches:
| Metric | Best | Typical |
| --- | --- | --- |
| Skills per scenario | 31 (ai-interior-designer) | 12-24 |
| Expected skill coverage | 100% (pet-adoption-board 4/4, apartment-hunting-copilot 7/7, splitwise-clone 6/6) | 50-86% |
| User stories verified | 3/3 PASS (ai-dream-journal, ai-gift-finder, ai-resume-roaster, ai-music-mood-radio, team-standup-bot, pet-adoption-board) | varies |
| Files built per scenario | 37 (student-study-groups) | 6-25 |
| Build time | 5-11 min | 5-7 min |
Key findings:
  • User-story-focused prompts (no tech name-dropping) work: the plugin detects patterns from actual code
  • `ai-sdk`, `shadcn`, `nextjs`, `vercel-functions` are the most consistently detected skills
  • `cron-jobs`, `routing-middleware` need Claude to write specific file patterns to trigger
  • Lexical prompt inject (UserPromptSubmit) working: skills injected before any files written
  • `session-end-cleanup` deletes claim dirs; use poll history for final skill counts
  • Enterprise tier (vercel-labs): no sandbox time cap; builds ran 10+ minutes

Known Limitations

  1. Snapshot stops the source sandbox: `sandbox.snapshot()` stops the original sandbox. Create a new sandbox from the snapshot to continue. Files and npm globals DO survive.
  2. v2 beta incompatible: `@vercel/sandbox@2.0.0-beta.3`'s named sandbox endpoint returns 404 for this team. Stick with v1.8.0.
  3. Artifact window: Must extract before `sandbox.stop()` because the filesystem is ephemeral. Session cleanup hook may delete claim dirs before extraction.
  4. Amazon Linux paths: User is `vercel-sandbox` (home at `/home/vercel-sandbox/`), NOT `/home/user/` or `/root/`.
  5. `--dangerously-skip-permissions` parity: Sandbox evals auto-approve all tool calls. WezTerm evals use normal permission flow. Coverage results may differ.
  6. `runCommand` timeout: Use `{ signal: AbortSignal.timeout(ms) }`; the `{ timeout }` option is silently ignored.
  7. BrotliDecompressionError: Transient Vercel API errors can kill sandbox creation. Retry logic recommended for production runs.
  8. Deploy reliability: Claude Code deploy sessions sometimes fail to output a parseable `*.vercel.app` URL. The haiku scoring step provides a fallback URL extraction attempt.
  9. Verify timeout: Complex apps may need the full 20 minutes for agent-browser to test all stories. Simpler apps finish in 2-5 minutes.
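
For the transient creation failures in item 7, a generic retry wrapper is usually enough. The sketch below is not part of `run-eval.ts`; `withRetry`, the attempt count, and the exponential backoff delays are all assumptions chosen for illustration.

```typescript
// Retries an async task with exponential backoff, rethrowing the last error
// once all attempts are exhausted.
async function withRetry<T>(
  task: () => Promise<T>,
  attempts: number = 3,
  baseDelayMs: number = 1000,
): Promise<T> {
  let lastError: unknown = new Error("withRetry called with attempts < 1");
  for (let attempt = 1; attempt <= attempts; attempt++) {
    try {
      return await task();
    } catch (error) {
      lastError = error;
      if (attempt < attempts) {
        // Wait baseDelayMs, then 2x, 4x, ... before the next attempt.
        const delay = baseDelayMs * Math.pow(2, attempt - 1);
        await new Promise<void>((resolve) => setTimeout(resolve, delay));
      }
    }
  }
  throw lastError;
}
```

A call site would wrap sandbox creation in it, for example `withRetry(() => Sandbox.create({ runtime: "node24", ports: [3000] }))`, using only the creation options already shown in this document.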