data-designer

Before You Start

Do not explore the workspace first. The workflow's Learn step gives you everything you need.

Goal

Build a synthetic dataset using the Data Designer library that matches this description:

$ARGUMENTS

Workflow

Use Autopilot mode if the user implies they don't want to answer questions — e.g., they say something like "be opinionated", "you decide", "make reasonable assumptions", "just build it", "surprise me", etc. Otherwise, use Interactive mode (default).

Read only the workflow file that matches the selected mode, then follow it:

Interactive → read `workflows/interactive.md`
Autopilot → read `workflows/autopilot.md`

Rules

Keep all columns in the output by default. The only exceptions for dropping a column are: (1) the user explicitly asks, or (2) it is a helper column that exists solely to derive other columns (e.g., a sampled person object used to extract name, city, etc.). When in doubt, keep the column.
Do not suggest or ask about seed datasets. Only use one when the user explicitly provides seed data or asks to build from existing records. When using a seed, read `references/seed-datasets.md`.
When the dataset requires person data (names, demographics, addresses), read `references/person-sampling.md`.
If a dataset script that matches the dataset description already exists, ask the user whether to edit it or create a new one.

Usage Tips and Common Pitfalls

Sampler and validation columns need both a type and params. E.g., `sampler_type="category"` with `params=dd.CategorySamplerParams(...)`.
Jinja2 templates in `prompt`, `system_prompt`, and `expr` fields: reference columns with `{{ column_name }}`, nested fields with `{{ column_name.field }}`.
`SamplerColumnConfig`: Takes `params`, not `sampler_params`.
LLM judge score access: `LLMJudgeColumnConfig` produces a nested dict where each score name maps to `{reasoning: str, score: int}`. To get the numeric score, use the `.score` attribute. For example, for a judge column named `quality` with a score named `correctness`, use `{{ quality.correctness.score }}`. Using `{{ quality.correctness }}` returns the full dict, not the numeric score.
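
The difference between the full judge entry and its numeric score can be sketched in plain Python, without Jinja2 or Data Designer. This is only an illustration of the nested shape described above: the `resolve` helper and the sample values are hypothetical, and the helper merely mimics how a dotted template reference walks through dict-valued columns.

```python
# Hypothetical row value for a judge column named "quality"
# with a single score named "correctness".
quality = {
    "correctness": {"reasoning": "Matches the reference answer.", "score": 4},
}


def resolve(path: str, context: dict):
    """Follow a dotted path through nested dicts, one key per segment."""
    value = context
    for key in path.split("."):
        value = value[key]
    return value


context = {"quality": quality}

# Analogous to {{ quality.correctness }}: the whole reasoning/score dict.
full_entry = resolve("quality.correctness", context)

# Analogous to {{ quality.correctness.score }}: just the integer.
numeric_score = resolve("quality.correctness.score", context)

print(type(full_entry).__name__)
print(numeric_score)
```

Stopping one segment early yields a dict, which is rarely what a downstream `expr` or prompt expects when it wants a number to compare or display.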

Troubleshooting

`data-designer` CLI not found: Tell the user that `data-designer` is not installed in this environment (requires Python >= 3.10). Ask if they would like you to create a virtual environment and install it, or if they prefer to do it themselves. Do not install anything without the user's permission.
Network errors during preview: A sandbox environment may be blocking outbound requests. Ask the user for permission to retry the command with the sandbox disabled. Only as a last resort, if retrying outside the sandbox also fails, tell the user to run the command themselves.
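
Before reporting a missing CLI, both conditions mentioned above (the executable being on `PATH` and the interpreter being at least 3.10) can be checked with the standard library alone. The following is a minimal sketch; `check_environment` is a hypothetical helper name, and it only inspects the environment without installing anything.

```python
import shutil
import sys

MIN_PYTHON = (3, 10)  # minimum version stated in the troubleshooting note


def check_environment(cli_name: str = "data-designer") -> dict:
    """Report whether the CLI is on PATH and whether this Python is new enough."""
    return {
        "cli_path": shutil.which(cli_name),  # None when the executable is not found
        "python_ok": sys.version_info >= MIN_PYTHON,
    }


status = check_environment()
if status["cli_path"] is None:
    print("data-designer is not installed in this environment.")
if not status["python_ok"]:
    print("Python 3.10 or newer is required.")
```

The result is meant to inform the question put to the user; any installation step still waits for their explicit permission.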

Output Template

Write a Python file to the current directory with a `load_config_builder()` function returning a `DataDesignerConfigBuilder`. Name the file descriptively (e.g., `customer_reviews.py`). Use PEP 723 inline metadata for dependencies.

```python
# /// script
# dependencies = [
#   "data-designer",  # always required
#   "pydantic",  # only if this script imports from pydantic
#   # add additional dependencies here
# ]
# ///

import data_designer.config as dd
from pydantic import BaseModel, Field


# Use Pydantic models when the output needs to conform to a specific schema
class MyStructuredOutput(BaseModel):
    field_one: str = Field(description="...")
    field_two: int = Field(description="...")


# Use custom generators when built-in column types aren't enough
@dd.custom_column_generator(
    required_columns=["col_a"],
    side_effect_columns=["extra_col"],
)
def generator_function(row: dict) -> dict:
    # add custom logic here that depends on "col_a" and update row in place
    row["name_in_custom_column_config"] = "custom value"
    row["extra_col"] = "extra value"
    return row


def load_config_builder() -> dd.DataDesignerConfigBuilder:
    config_builder = dd.DataDesignerConfigBuilder()

    # Seed dataset (only if the user explicitly mentions a seed dataset path)
    # config_builder.with_seed_dataset(dd.LocalFileSeedSource(path="path/to/seed.parquet"))

    # config_builder.add_column(...)
    # config_builder.add_processor(...)

    return config_builder
```


Only include Pydantic models, custom generators, seed datasets, and extra dependencies when the task requires them.
