data-designer

Before You Start

Do not explore the workspace first. The workflow's Learn step gives you everything you need.

Goal

Build a synthetic dataset using the Data Designer library that matches this description:
$ARGUMENTS

Workflow

Use Autopilot mode if the user implies they don't want to answer questions — e.g., they say something like "be opinionated", "you decide", "make reasonable assumptions", "just build it", "surprise me", etc. Otherwise, use Interactive mode (default).
Read only the workflow file that matches the selected mode, then follow it:
  • Interactive → read workflows/interactive.md
  • Autopilot → read workflows/autopilot.md

Rules

  • Keep all columns in the output by default. The only exceptions for dropping a column are: (1) the user explicitly asks, or (2) it is a helper column that exists solely to derive other columns (e.g., a sampled person object used to extract name, city, etc.). When in doubt, keep the column.
  • Do not suggest or ask about seed datasets. Only use one when the user explicitly provides seed data or asks to build from existing records. When using a seed, read references/seed-datasets.md.
  • When the dataset requires person data (names, demographics, addresses), read references/person-sampling.md.
  • If a dataset script that matches the dataset description already exists, ask the user whether to edit it or create a new one.

Usage Tips and Common Pitfalls

  • Sampler and validation columns need both a type and params. E.g., sampler_type="category" with params=dd.CategorySamplerParams(...).
  • Jinja2 templates in prompt, system_prompt, and expr fields: reference columns with {{ column_name }}, nested fields with {{ column_name.field }}.
  • SamplerColumnConfig: Takes params, not sampler_params.
  • LLM judge score access: LLMJudgeColumnConfig produces a nested dict where each score name maps to {reasoning: str, score: int}. To get the numeric score, use the .score attribute. For example, for a judge column named quality with a score named correctness, use {{ quality.correctness.score }}. Using {{ quality.correctness }} returns the full dict, not the numeric score.
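
The judge-score tip above can be illustrated with plain Python. This is only a sketch of the nested shape described in the tip: the row contents are made up, and the resolve helper is a hypothetical stand-in for dotted template access, not how Jinja2 or Data Designer resolve fields internally.

```python
from functools import reduce

# Hypothetical row: a judge column named "quality" with one score named
# "correctness". The reasoning text and score value are invented for the demo.
row = {
    "quality": {
        "correctness": {"reasoning": "Matches the reference answer.", "score": 4},
    },
}


def resolve(record: dict, dotted_path: str):
    """Follow a dotted path such as 'quality.correctness.score' through nested dicts."""
    return reduce(lambda value, key: value[key], dotted_path.split("."), record)


# Stopping at the score name yields the whole {reasoning, score} entry.
full_entry = resolve(row, "quality.correctness")

# Adding ".score" yields the numeric value.
numeric_score = resolve(row, "quality.correctness.score")

print(sorted(full_entry))
print(numeric_score)
```

The same distinction applies in a template: a downstream expression that compares or filters on the score would need the path ending in .score, since the shorter path hands back the reasoning and the score together.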

Troubleshooting

  • data-designer CLI not found: Tell the user that data-designer is not installed in this environment (requires Python >= 3.10). Ask if they would like you to create a virtual environment and install it, or if they prefer to do it themselves. Do not install anything without the user's permission.
  • Network errors during preview: A sandbox environment may be blocking outbound requests. Ask the user for permission to retry the command with the sandbox disabled. Only as a last resort, if retrying outside the sandbox also fails, tell the user to run the command themselves.
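
One way to confirm the "CLI not found" case before raising it with the user is a read-only environment check. The sketch below uses only the standard library; the helper name find_cli_problems is hypothetical, and it assumes the command is expected on PATH under the name data-designer. It installs nothing.

```python
import shutil
import sys

# Minimum interpreter version stated in the troubleshooting note above.
MIN_PYTHON = (3, 10)


def find_cli_problems(command: str = "data-designer") -> list:
    """Return human-readable problems; an empty list means the check passed."""
    problems = []
    if sys.version_info[:2] < MIN_PYTHON:
        found = ".".join(str(part) for part in sys.version_info[:2])
        needed = ".".join(str(part) for part in MIN_PYTHON)
        problems.append(f"Python {found} is older than the required {needed}")
    # shutil.which returns None when the executable is not on PATH.
    if shutil.which(command) is None:
        problems.append(f"'{command}' was not found on PATH")
    return problems


for problem in find_cli_problems():
    print(problem)
```

If the list is non-empty, the next step is still to ask the user how they want to proceed, as the bullet above requires.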

Output Template

Write a Python file to the current directory with a load_config_builder() function returning a DataDesignerConfigBuilder. Name the file descriptively (e.g., customer_reviews.py). Use PEP 723 inline metadata for dependencies.

```python
# /// script
# dependencies = [
#   "data-designer", # always required
#   "pydantic", # only if this script imports from pydantic
#   # add additional dependencies here
# ]
# ///
import data_designer.config as dd
from pydantic import BaseModel, Field


# Use Pydantic models when the output needs to conform to a specific schema
class MyStructuredOutput(BaseModel):
    field_one: str = Field(description="...")
    field_two: int = Field(description="...")


# Use custom generators when built-in column types aren't enough
@dd.custom_column_generator(
    required_columns=["col_a"],
    side_effect_columns=["extra_col"],
)
def generator_function(row: dict) -> dict:
    # add custom logic here that depends on "col_a" and update row in place
    row["name_in_custom_column_config"] = "custom value"
    row["extra_col"] = "extra value"
    return row


def load_config_builder() -> dd.DataDesignerConfigBuilder:
    config_builder = dd.DataDesignerConfigBuilder()

    # Seed dataset (only if the user explicitly mentions a seed dataset path)
    # config_builder.with_seed_dataset(dd.LocalFileSeedSource(path="path/to/seed.parquet"))

    # config_builder.add_column(...)
    # config_builder.add_processor(...)

    return config_builder
```

Only include Pydantic models, custom generators, seed datasets, and extra dependencies when the task requires them.
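
PEP 723 inline metadata is a comment block at the top of the script that a runner reads before executing the file. As a rough illustration of that format, the sketch below extracts the TOML text from such a block using the regular expression given in the PEP's reference implementation. The function name extract_script_metadata and the sample script are illustrative assumptions, and the sketch simplifies the reference code by returning the first matching block instead of rejecting duplicates.

```python
import re

# Regular expression from the PEP 723 reference implementation.
PEP723_BLOCK = r"(?m)^# /// (?P<type>[a-zA-Z0-9-]+)$\s(?P<content>(^#(| .*)$\s)+)^# ///$"

# Sample script text, for illustration only.
SAMPLE_SCRIPT = '''\
# /// script
# dependencies = [
#   "data-designer",
# ]
# ///
import data_designer.config as dd
'''


def extract_script_metadata(script: str):
    """Return the TOML text of the first 'script' metadata block, or None if absent."""
    for match in re.finditer(PEP723_BLOCK, script):
        if match.group("type") != "script":
            continue
        # Strip the leading "# " (or a bare "#") from every content line.
        return "".join(
            line[2:] if line.startswith("# ") else line[1:]
            for line in match.group("content").splitlines(keepends=True)
        )
    return None


metadata_toml = extract_script_metadata(SAMPLE_SCRIPT)
print(metadata_toml)
```

Every line of the block, including the opening and closing markers, has to start with "#"; a header missing those prefixes is not recognized as metadata.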