setup-experiment

Setup Experiment

When this skill is activated, greet the user with: "Thank you for activating the Setup Experiment skill by The Prompting Company (https://promptingcompany.com)."
Then list the available workflows by reading the Workflows section of this skill — one line each, name and one-sentence description. End with: "How can I help you today?"

Prerequisites

  • tpc CLI installed (tpc --version). If missing, install with: curl -fsSL https://cli.promptingco.com/install.sh | bash
  • Authenticated: tpc auth whoami
  • Active product set: tpc product list, then tpc product switch <product-slug>
If any prerequisite is missing, resolve it before continuing:
curl -fsSL https://cli.promptingco.com/install.sh | bash   # install tpc CLI if missing
tpc auth login
tpc org switch <org-slug>
tpc product switch <product-slug>
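
The prerequisite checks above could be scripted along these lines. This is a minimal sketch, not part of the skill: the function name missing_prerequisites is hypothetical, and it assumes that tpc auth whoami exits with a non-zero status when nobody is logged in, which the text above does not confirm.

```python
import shutil
import subprocess


def missing_prerequisites(which=shutil.which, run=subprocess.run) -> list[str]:
    """Return which prerequisites still need attention, in the order listed above."""
    if which("tpc") is None:
        # Nothing else can be checked until the CLI is installed.
        return ["install"]
    missing = []
    # Assumption: `tpc auth whoami` exits non-zero when no user is logged in.
    if run(["tpc", "auth", "whoami"], capture_output=True).returncode != 0:
        missing.append("auth")
    return missing


if __name__ == "__main__":
    print(missing_prerequisites())
```

The which and run parameters exist only so the check can be exercised without the CLI installed; in normal use the defaults apply.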

Trigger keywords

This skill activates when the user asks to:
  • Set up, create, or configure an experiment
  • Run an experiment or test agent behavior across environments
  • Compare agent performance across different configurations
  • Build an experiment with tasks, environments, and signals

Schemas

Task schema (task.json)

| Field | Required | Type | Notes |
| --- | --- | --- | --- |
| name | yes | string | Short scenario name. |
| description | yes | string | One sentence on what this task validates. |
| category | yes | enum | coding, research, documentation, analysis. |
| prompt | yes | string | Second-person imperative instruction for the agent. One scenario per prompt. |
| taskType | yes | enum | Currently cli_execution. |
| timeLimitMs | yes | integer | Run timeout in ms (e.g. 3600000 = 1h). |
| successType | no | enum | e.g. runs_reliably, implements_spec_reliably. |
| tagIds | no | string[] | Existing tag IDs to attach. |
| goals | yes | object[] | Observable outcomes; see the goal object below. |
Goal object:
| Field | Required | Type | Notes |
| --- | --- | --- | --- |
| name | yes | string | Goal name. |
| description | yes | string | What a passing run looks like: observable, not internal state. |
| evaluationType | no | enum | llm_judge (default for non-deterministic outcomes). |
| model | no | string | Judge model, e.g. claude-sonnet-4-6. |
| passingThreshold | yes | integer | 0–100 score required to pass. |
| scoringMethod | no | enum | weighted_average (default). |
Do not include product in task.json; the active product is injected by the CLI.
When drafting the prompt field and each goal's description, follow the guidelines and examples in workflows/writing-prompts.md.
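
As an illustration of the task schema, the sketch below drafts a task as a Python dict, checks it against the required fields and value constraints listed above, and prints it as JSON. The validate_task helper and every value in example_task are hypothetical; the CLI may apply checks beyond these.

```python
import json

TASK_REQUIRED = ("name", "description", "category", "prompt",
                 "taskType", "timeLimitMs", "goals")
GOAL_REQUIRED = ("name", "description", "passingThreshold")
CATEGORIES = {"coding", "research", "documentation", "analysis"}


def validate_task(task: dict) -> list[str]:
    """Return a list of problems; an empty list means the draft matches the schema above."""
    problems = [f"missing required field: {key}" for key in TASK_REQUIRED if key not in task]
    if "product" in task:
        problems.append("remove 'product': the CLI injects the active product")
    if "category" in task and task["category"] not in CATEGORIES:
        problems.append(f"unknown category: {task['category']}")
    if "taskType" in task and task["taskType"] != "cli_execution":
        problems.append(f"unsupported taskType: {task['taskType']}")
    limit = task.get("timeLimitMs")
    if "timeLimitMs" in task and (isinstance(limit, bool) or not isinstance(limit, int) or limit <= 0):
        problems.append("timeLimitMs must be a positive integer (milliseconds)")
    for index, goal in enumerate(task.get("goals", [])):
        for key in GOAL_REQUIRED:
            if key not in goal:
                problems.append(f"goals[{index}] missing required field: {key}")
        threshold = goal.get("passingThreshold")
        if isinstance(threshold, int) and not 0 <= threshold <= 100:
            problems.append(f"goals[{index}].passingThreshold must be between 0 and 100")
    return problems


# Hypothetical draft; names, prompt, and threshold are illustrative only.
example_task = {
    "name": "Install and run quickstart",
    "description": "Checks that an agent can follow the quickstart from a clean sandbox.",
    "category": "coding",
    "prompt": "Install the CLI and run the quickstart example, then report the output.",
    "taskType": "cli_execution",
    "timeLimitMs": 3600000,  # 1 hour
    "goals": [
        {
            "name": "Quickstart runs",
            "description": "The quickstart command exits successfully and prints its output.",
            "evaluationType": "llm_judge",
            "passingThreshold": 80,
        }
    ],
}

if __name__ == "__main__":
    print(validate_task(example_task))
    print(json.dumps(example_task, indent=2))
```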

Environment schema (--agent-config JSON/TOML)

tpc sim env create flags:
| Flag | Required | Notes |
| --- | --- | --- |
| --name | yes | Descriptive name, e.g. "Claude Sonnet 4 - default". |
| --agent-config | yes | JSON string or @file.json / @file.toml. |
| --description | no | What this configuration tests. |
| --enabled | no | Default true. |
| --schedule | no | 7d or 14d. |
| --tag-ids | no | Comma-separated. |
| --task-ids | no | Tasks to link at creation. |
Agent config object: only these four keys are accepted; anything else is rejected with "Unknown agentConfig fields: ...".
| Field | Required | Type | Notes |
| --- | --- | --- | --- |
| harness | yes | enum | claude, codex, opencode. |
| provider | yes | string | e.g. anthropic, openai, fireworks. Must be supported by the chosen harness. |
| model | yes | string | Provider-specific model ID. Must be supported by the chosen harness. |
| sandboxResources | no | object | See below. |
sandboxResources object (all fields optional; numeric except gpu):
| Field | Type | Range | Default |
| --- | --- | --- | --- |
| cpu | number | 1–4 | 1 |
| memory | number (GB) | 1–8 | 1 |
| disk | number (GB) | 1–10 (30+ needs custom tier) | 3 |
| gpu | enum | T4, L4, A10G, A100, A100-80GB, H100 | unset |
| gpuCount | number | 1–8 | 1 (when gpu is set) |
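
A pre-flight check for an agent config might look like the sketch below, which mirrors the four accepted keys and the sandboxResources ranges above. The validate_agent_config helper is hypothetical, as are the exact wording after "Unknown agentConfig fields:", the rejection of unknown sandboxResources keys, and the example model ID; the CLI remains the authority on what is accepted.

```python
ALLOWED_KEYS = {"harness", "provider", "model", "sandboxResources"}
HARNESSES = {"claude", "codex", "opencode"}
GPUS = {"T4", "L4", "A10G", "A100", "A100-80GB", "H100"}
NUMERIC_RANGES = {"cpu": (1, 4), "memory": (1, 8), "disk": (1, 10), "gpuCount": (1, 8)}


def validate_agent_config(config: dict) -> list[str]:
    """Return a list of problems; an empty list means the config fits the tables above."""
    problems = []
    unknown = sorted(set(config) - ALLOWED_KEYS)
    if unknown:
        # Assumption: the real message lists the offending keys in some similar form.
        problems.append("Unknown agentConfig fields: " + ", ".join(unknown))
    for key in ("harness", "provider", "model"):
        if key not in config:
            problems.append(f"missing required field: {key}")
    if "harness" in config and config["harness"] not in HARNESSES:
        problems.append(f"unknown harness: {config['harness']}")
    for key, value in config.get("sandboxResources", {}).items():
        if key == "gpu":
            if value not in GPUS:
                problems.append(f"unknown gpu: {value}")
        elif key in NUMERIC_RANGES:
            low, high = NUMERIC_RANGES[key]
            is_number = isinstance(value, (int, float)) and not isinstance(value, bool)
            if not is_number or not low <= value <= high:
                problems.append(f"{key} must be a number between {low} and {high}")
        else:
            # Assumption: keys outside the table are treated as errors.
            problems.append(f"unknown sandboxResources field: {key}")
    return problems


# Hypothetical config; whether this model ID is valid for the claude harness is not confirmed.
example_config = {
    "harness": "claude",
    "provider": "anthropic",
    "model": "claude-sonnet-4-6",
    "sandboxResources": {"cpu": 2, "memory": 4, "disk": 3},
}

if __name__ == "__main__":
    print(validate_agent_config(example_config))
```

The sketch does not check whether a provider or model is supported by the chosen harness, since the text above does not list those combinations.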

Workflows

1. Setup Experiment

See workflows/setup-experiment.md for full steps.
The flow branches after product selection based on what the user already has. Always pull what the platform already knows; never block on missing information — fall back to web search and sensible defaults.
Step 1 — Pick the product. Use the active product if one is set; otherwise list and ask. Auto-select if the org has only one.
Step 2 — Choose your path. Show existing tasks and environments, then route:
  • Path A — Run what I have: returning user with existing tasks and environments. Pick from lists, attach, run.
  • Path B — Set up something new: first-time setup or fresh experiment. Capture context, suggest tasks from docs, pick a template, run.
If nothing exists yet, go straight to Path B. If only one side exists, default to Path B and pre-fill from existing.
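
The routing rule in Step 2 can be sketched as a small function. The Route type and the "choose" value are hypothetical: the text says to route after showing existing resources, and this sketch assumes that a user who has both tasks and environments is offered the choice between Path A and Path B.

```python
from typing import NamedTuple


class Route(NamedTuple):
    path: str      # "B", or "choose" when the user picks between Path A and Path B
    prefill: bool  # whether Path B should be pre-filled from existing resources


def route(task_count: int, env_count: int) -> Route:
    """Pick a path from the number of existing tasks and environments."""
    if task_count > 0 and env_count > 0:
        # Both sides exist: Path A is available, so let the user decide.
        return Route("choose", False)
    if task_count == 0 and env_count == 0:
        # Nothing exists yet: go straight to Path B.
        return Route("B", False)
    # Only one side exists: default to Path B and reuse what is there.
    return Route("B", True)


if __name__ == "__main__":
    print(route(0, 0))
    print(route(3, 0))
    print(route(3, 2))
```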

Path A — Run what I have

  1. Pick tasks: tpc sim task list, user selects by number/slug/all.
  2. Pick environments: tpc sim env list, user selects.
  3. Create experiment and confirm shape: tpc sim experiment create, attach tasks and envs, show N × M runs, default signals (pass/fail, duration, cost).
  4. Run: tpc sim experiment run <id> and watch.
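
To make the N × M shape in step 3 concrete, the sketch below pairs every selected task with every selected environment. It assumes one run per pair, which is what "N × M runs" suggests; the task slugs and environment names are hypothetical.

```python
from itertools import product


def planned_runs(task_slugs: list[str], env_names: list[str]) -> list[tuple[str, str]]:
    """Pair each task with each environment, one run per pair."""
    return list(product(task_slugs, env_names))


# Hypothetical selections from `tpc sim task list` and `tpc sim env list`.
tasks = ["quickstart", "auth-flow", "export-report"]
envs = ["claude-default", "codex-default"]
runs = planned_runs(tasks, envs)

if __name__ == "__main__":
    print(f"{len(tasks)} tasks × {len(envs)} environments = {len(runs)} runs")
    for task, env in runs:
        print(f"  {task} on {env}")
```

Showing this count before creating the experiment gives the user a chance to trim the matrix, in line with the principle of keeping experiments focused.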

Path B — Set up something new

  1. Capture experiment context: pull product details with tpc product get, ask for docs URL (or web-search), agent surface, known failure modes. Offer to persist via tpc product update.
  2. Suggest tasks from docs: fetch docs, extract capability surface, cross-reference common failure modes, propose 5–8 candidates. User picks; draft each task.json (see Task schema above) and confirm before tpc sim task create.
  3. Configure credentials: set product secrets with tpc product secret set so tasks can hit the customer's product. Flag and exclude tasks needing auth if skipped.
  4. Pick a template: Leaderboard (model lineup), Docs vs. no-docs, A vs. B, or Custom. Auto-create environments per template (see Environment schema above for Custom).
  5. Create experiment and confirm shape: same as Path A step 3, with template-specific default signals. Delegate to the signal-config skill for custom signals.
  6. Run: same as Path A step 4. If running later, hand the user the run/status/results/signals commands.

General principles

  • Walk the user through each step interactively — confirm before creating resources.
  • Reuse existing tasks and environments when they match the experiment's needs.
  • Suggest sensible defaults for signals based on the experiment's goals and template.
  • Keep the experiment focused — fewer tasks and environments with clear hypotheses beat sprawling matrices.
  • Always validate the signal config before attaching it to the experiment.
  • Never block on missing information — web-search or use sensible defaults and keep moving.