Agentic Data Scientist

Skill by ara.so — AI Agent Skills collection.
Agentic Data Scientist is an adaptive multi-agent framework that automates complex data science tasks using a sophisticated workflow with planning, execution, validation, and self-correction. Built on Google's Agent Development Kit (ADK) and Claude Agent SDK, it separates planning from execution and continuously validates work against success criteria.

What It Does

  • Orchestrated Mode: Full multi-agent workflow with planning, iterative execution, validation, and adaptive replanning
  • Simple Mode: Direct coding without planning overhead for quick tasks
  • Multi-Agent Architecture: Specialized agents for planning, coding, reviewing, validation, and summarization
  • Continuous Validation: Tracks progress against success criteria at every stage
  • Self-Correcting: Adapts plans based on discoveries during execution
  • MCP Integration: Access to tools via Model Context Protocol servers
  • Claude Scientific Skills: 380+ advanced scientific computing skills available to the coding agent

Installation

```bash
# Install globally with uv
uv tool install agentic-data-scientist

# Or use directly with uvx (no installation)
uvx agentic-data-scientist --mode simple "your query"
```

Prerequisites

Required:
  1. Claude Code CLI (for coding agent):
```bash
npm install -g @anthropic-ai/claude-code
```
  2. API Keys (set as environment variables):
```bash
export OPENROUTER_API_KEY="your_openrouter_key"  # For planning/review agents
export ANTHROPIC_API_KEY="your_anthropic_key"    # For coding agent
```
Get the keys from your OpenRouter and Anthropic accounts.

Optional:
```bash
# Disable network access (web search, URL fetching)
export DISABLE_NETWORK_ACCESS=true
```

Configuration

Create a `.env` file in your project directory:
```bash
# Required
OPENROUTER_API_KEY=your_openrouter_key
ANTHROPIC_API_KEY=your_anthropic_key

# Optional
DISABLE_NETWORK_ACCESS=false  # Set to true to disable web tools
```
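Before starting a run, it can help to confirm that both keys are visible to the process. The following is a minimal sketch and not part of the package: `missing_keys` is a hypothetical helper that relies only on the two variable names listed above. It inspects the process environment and does not parse `.env`.

```python
import os

# Variable names taken from the configuration described above.
REQUIRED_KEYS = ("OPENROUTER_API_KEY", "ANTHROPIC_API_KEY")


def missing_keys(env=None):
    """Return the required variable names that are unset or empty.

    `env` defaults to the current process environment; passing a dict
    makes the check easy to test.
    """
    env = os.environ if env is None else env
    return [name for name in REQUIRED_KEYS if not env.get(name)]


missing = missing_keys()
if missing:
    print("Missing:", ", ".join(missing))
else:
    print("Both API keys are set.")
```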

Key Commands

Basic Usage

You must specify `--mode` for every command:
```bash
# Orchestrated mode: Full multi-agent workflow
agentic-data-scientist "Perform differential expression analysis" \
  --mode orchestrated \
  --files data.csv

# Simple mode: Direct coding, no planning
agentic-data-scientist "Write a CSV parser" \
  --mode simple
```

File Handling

```bash
# Single file
agentic-data-scientist "Analyze dataset" \
  --mode orchestrated \
  --files data.csv

# Multiple files
agentic-data-scientist "Compare datasets" \
  --mode orchestrated \
  -f data1.csv -f data2.csv -f metadata.json

# Directory upload (recursive)
agentic-data-scientist "Analyze all CSVs in folder" \
  --mode orchestrated \
  --files ./data_folder/
```

Working Directory Options

```bash
# Default: ./agentic_output/ (preserved after completion)
agentic-data-scientist "Analyze data" \
  --mode orchestrated \
  --files data.csv

# Custom working directory
agentic-data-scientist "Generate report" \
  --mode orchestrated \
  --files data.csv \
  --working-dir ./my_analysis

# Temporary directory (auto-cleanup)
agentic-data-scientist "Quick exploration" \
  --mode simple \
  --files data.csv \
  --temp-dir

# Force keep files (override temp-dir cleanup)
agentic-data-scientist "Analysis" \
  --mode orchestrated \
  --files data.csv \
  --temp-dir \
  --keep-files
```
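The interaction between a named directory, `--temp-dir`, and `--keep-files` can be modeled in a few lines. This is an illustration of the behavior described above, not the tool's actual implementation; `working_directory` and its parameters are hypothetical names.

```python
import shutil
import tempfile
from contextlib import contextmanager
from pathlib import Path


@contextmanager
def working_directory(temp_dir=False, keep_files=False, path="./agentic_output"):
    """Yield a working directory and remove it only when appropriate.

    A named directory is always preserved. A temporary directory is
    removed on exit unless keep_files is set.
    """
    if temp_dir:
        work = Path(tempfile.mkdtemp(prefix="agentic_"))
    else:
        work = Path(path)
        work.mkdir(parents=True, exist_ok=True)
    try:
        yield work
    finally:
        if temp_dir and not keep_files:
            shutil.rmtree(work, ignore_errors=True)
```

For example, `with working_directory(temp_dir=True) as work:` gives a scratch directory that disappears when the block ends, while adding `keep_files=True` leaves it in place for inspection.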

Logging and Debugging

```bash
# Custom log file location
agentic-data-scientist "Analyze" \
  --mode orchestrated \
  --files data.csv \
  --log-file ./analysis.log

# Verbose logging
agentic-data-scientist "Debug issue" \
  --mode simple \
  --verbose
```
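When the CLI is driven from a script, building the argument list in one place keeps the flags consistent. The sketch below uses only the flags documented in this section and the ones before it; `build_command` is a hypothetical helper, and the mode check simply reflects the two modes named above.

```python
import shlex


def build_command(query, mode, files=(), working_dir=None, temp_dir=False,
                  keep_files=False, verbose=False, log_file=None):
    """Assemble an argument list from the flags documented above."""
    if mode not in ("orchestrated", "simple"):
        raise ValueError("mode must be 'orchestrated' or 'simple'")
    cmd = ["agentic-data-scientist", query, "--mode", mode]
    for path in files:
        # Repeat -f once per input, as in the "Multiple files" example.
        cmd += ["-f", str(path)]
    if working_dir is not None:
        cmd += ["--working-dir", str(working_dir)]
    if temp_dir:
        cmd.append("--temp-dir")
    if keep_files:
        cmd.append("--keep-files")
    if verbose:
        cmd.append("--verbose")
    if log_file is not None:
        cmd += ["--log-file", str(log_file)]
    return cmd


cmd = build_command("Compare datasets", "orchestrated",
                    files=["data1.csv", "data2.csv"], verbose=True)
print(shlex.join(cmd))  # shell-quoted form, handy for logging
```

Passing the list to `subprocess.run(cmd, check=True)` rather than a single string avoids shell quoting problems with multi-word queries.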

Real-World Examples

Example 1: Complex Data Analysis (Orchestrated Mode)

```bash
# Comprehensive analysis with multiple stages
agentic-data-scientist \
  "Perform exploratory data analysis on sales data,
  identify trends, create visualizations,
  and build a predictive model for future sales" \
  --mode orchestrated \
  --files sales_2024.csv \
  --working-dir ./sales_analysis \
  --log-file analysis.log
```

**What happens:**
1. **Planning Phase**: Creates detailed plan with stages (EDA, visualization, modeling)
2. **Execution Phase**: Implements each stage iteratively with validation
3. **Validation**: Checks success criteria after each stage
4. **Adaptation**: Adjusts plan based on discoveries (e.g., data quality issues)
5. **Summary**: Generates comprehensive report with all findings

Example 2: Quick Scripting (Simple Mode)

```bash
# Fast coding without planning overhead
agentic-data-scientist \
  "Write a Python script that reads multiple CSV files,
  merges them on a common ID column,
  and exports to Excel with formatting" \
  --mode simple \
  -f data1.csv -f data2.csv -f data3.csv \
  --temp-dir
```

**What happens:**
- Direct execution with coding agent (no planning phase)
- Quick turnaround for straightforward tasks
- Temporary directory auto-cleanup

Example 3: Multi-File Statistical Analysis

```bash
# Compare multiple datasets
agentic-data-scientist \
  "Compare the distribution of features across treatment groups,
  perform statistical tests (t-test, ANOVA),
  and generate publication-ready plots" \
  --mode orchestrated \
  -f control.csv \
  -f treatment_a.csv \
  -f treatment_b.csv \
  --working-dir ./stats_analysis
```

Example 4: Directory-Based Analysis

```bash
# Process all files in a directory
agentic-data-scientist \
  "Analyze all patient data files in the folder,
  aggregate results, and create summary statistics" \
  --mode orchestrated \
  --files ./patient_data/ \
  --working-dir ./patient_analysis
```

Python API Usage

For programmatic access, use the Python API:
```python
import sys

from agentic_data_scientist.cli import main

# Prepare arguments
sys.argv = [
    'agentic-data-scientist',
    'Perform clustering analysis on customer data',
    '--mode', 'orchestrated',
    '--files', 'customers.csv',
    '--working-dir', './clustering_output',
]

# Run
main()
```

Or use the workflow directly:

```python
import asyncio
from pathlib import Path
from agentic_data_scientist.workflow import create_workflow

async def run_analysis():
    # Create workflow
    workflow = create_workflow(
        query="Analyze customer segments",
        mode="orchestrated",
        files=[Path("customers.csv")],
        working_dir=Path("./output"),
        disable_network=False
    )

    # Execute
    result = await workflow.execute()
    print(result)

asyncio.run(run_analysis())
```
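Assigning to `sys.argv` changes global state for the rest of the process, which can surprise other code in a notebook or long-running script. One way to contain that is sketched below. `patched_argv` and `run_cli` are hypothetical helpers, and the entry point is passed in as a parameter so the sketch runs without the package installed. Whether the real `main()` raises `SystemExit` is an assumption; many CLI entry points do.

```python
import sys
from contextlib import contextmanager


@contextmanager
def patched_argv(args):
    """Temporarily replace sys.argv, restoring the original on exit."""
    saved = sys.argv
    sys.argv = list(args)
    try:
        yield
    finally:
        sys.argv = saved


def run_cli(main, args):
    """Call a CLI entry point with `args` and return its exit code."""
    with patched_argv(args):
        try:
            main()
        except SystemExit as exc:
            # An explicit exit without a code counts as success.
            return 0 if exc.code is None else exc.code
    return 0
```

With the package installed, the call would look like `run_cli(main, ['agentic-data-scientist', 'Analyze', '--mode', 'simple'])`, where `main` is imported from `agentic_data_scientist.cli`.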

Common Patterns

Pattern 1: Iterative Data Exploration

```bash
# Start with simple mode for quick exploration
agentic-data-scientist \
  "Load dataset and show basic statistics" \
  --mode simple \
  --files data.csv

# Then use orchestrated mode for deep analysis
agentic-data-scientist \
  "Perform full statistical analysis including outlier detection,
  correlation analysis, and clustering" \
  --mode orchestrated \
  --files data.csv \
  --working-dir ./deep_analysis
```

Pattern 2: Pipeline Development

```bash
# Use orchestrated mode to develop a complete pipeline
agentic-data-scientist \
  "Create a data processing pipeline that: \
  1. Cleans and normalizes raw data \
  2. Engineers new features \
  3. Splits into train/test \
  4. Trains multiple models \
  5. Evaluates and selects best model \
  6. Exports model and metrics" \
  --mode orchestrated \
  --files raw_data.csv \
  --working-dir ./ml_pipeline
```

Pattern 3: Report Generation

```bash
# Generate comprehensive reports
agentic-data-scientist \
  "Analyze quarterly sales data and create an executive report
  with visualizations, key metrics, and recommendations" \
  --mode orchestrated \
  -f q1_sales.csv -f q2_sales.csv -f q3_sales.csv -f q4_sales.csv \
  --working-dir ./quarterly_report
```

Pattern 4: Debugging with Verbose Logs

```bash
# Enable verbose logging for troubleshooting
agentic-data-scientist \
  "Complex analysis task" \
  --mode orchestrated \
  --files data.csv \
  --verbose \
  --log-file debug.log \
  --keep-files
```

Multi-Agent Workflow Details

Agent Roles

  1. Plan Maker: Creates comprehensive plans with stages and success criteria
  2. Plan Reviewer: Validates plans are complete before execution
  3. Plan Parser: Converts plans to structured executable stages
  4. Stage Orchestrator: Manages execution cycle and adaptation
  5. Coding Agent: Implements stages (powered by Claude Code with 380+ scientific skills)
  6. Review Agent: Validates implementations against requirements
  7. Criteria Checker: Tracks progress against success criteria
  8. Stage Reflector: Adapts remaining stages based on learnings
  9. Summary Agent: Synthesizes work into final report

Workflow Phases

Planning Phase:
User Query → Plan Maker → Plan Reviewer → Plan Parser → Structured Plan
Execution Phase (per stage):
Stage → Coding Agent → Review Agent → Criteria Checker → Stage Reflector
Summary Phase:
All Completed Stages → Summary Agent → Final Report
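
The three phases can be read as a short control loop. The sketch below is a simplified model under stated assumptions, not the framework's code: every agent is a caller-supplied function, all names are hypothetical, and retry handling after a failed review is omitted.

```python
def run_workflow(query, *, make_plan, review_plan, parse_plan,
                 code, review, check_criteria, reflect, summarize):
    """Run planning, per-stage execution, and summary with injected agents."""
    # Planning phase: query -> plan -> reviewed plan -> list of stages.
    stages = list(parse_plan(review_plan(make_plan(query))))

    # Execution phase: each stage passes through the same four steps.
    completed = []
    while stages:
        stage = stages.pop(0)
        result = code(stage)
        review(stage, result)
        check_criteria(stage, result)
        completed.append((stage, result))
        # The reflector may reorder, drop, or add remaining stages.
        stages = list(reflect(completed, stages))

    # Summary phase: all completed stages -> final report.
    return summarize(completed)


report = run_workflow(
    "Analyze sales",
    make_plan=lambda q: "EDA; modeling",
    review_plan=lambda plan: plan,
    parse_plan=lambda plan: [s.strip() for s in plan.split(";")],
    code=lambda stage: f"output of {stage}",
    review=lambda stage, result: None,
    check_criteria=lambda stage, result: None,
    reflect=lambda done, remaining: remaining,
    summarize=lambda done: [stage for stage, _ in done],
)
print(report)  # ['EDA', 'modeling']
```

Because the reflector can add stages, a real loop would also need a stage budget or similar stopping rule; the sketch leaves that out.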

Troubleshooting

API Key Errors

```bash
# Verify keys are set
echo $OPENROUTER_API_KEY
echo $ANTHROPIC_API_KEY

# Set them if missing
export OPENROUTER_API_KEY="your_key"
export ANTHROPIC_API_KEY="your_key"
```

Claude Code Not Found

```bash
# Install Claude Code CLI
npm install -g @anthropic-ai/claude-code

# Verify installation (the package installs the `claude` command)
claude --version
```
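The same check can be made from Python before launching a run. Which executable name the npm package installs is treated as an assumption here, so this sketch accepts a list of candidate names; `find_cli` is a hypothetical helper built on the standard library's `shutil.which`.

```python
import shutil


def find_cli(candidates=("claude", "claude-code")):
    """Return the full path of the first candidate found on PATH, else None."""
    for name in candidates:
        path = shutil.which(name)
        if path is not None:
            return path
    return None


location = find_cli()
print(location or "Claude Code CLI not found on PATH")
```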

Network Access Issues

```bash
# Disable network tools if causing problems
export DISABLE_NETWORK_ACCESS=true

# Or in .env file
echo "DISABLE_NETWORK_ACCESS=true" >> .env
```

File Upload Failures

```bash
# Verify file exists
ls -la data.csv

# Use absolute paths
agentic-data-scientist "Analyze" \
  --mode orchestrated \
  --files /absolute/path/to/data.csv

# Check directory permissions for recursive upload
ls -la ./data_folder/
```
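When inputs are assembled in a script, resolving them to absolute paths and failing early on a missing file covers the first two checks above. This is a minimal sketch; `resolve_inputs` is a hypothetical helper and does not inspect permissions.

```python
from pathlib import Path


def resolve_inputs(paths):
    """Return absolute paths, raising early if any input is missing."""
    resolved = []
    for raw in paths:
        path = Path(raw).expanduser().resolve()
        if not path.exists():
            raise FileNotFoundError(f"Input not found: {path}")
        resolved.append(path)
    return resolved
```

The returned paths can be converted with `str()` and passed to `--files` or `-f` unchanged.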

Working Directory Issues

```bash
# Ensure directory is writable
mkdir -p ./output
chmod 755 ./output

# Use temp directory if permission issues
agentic-data-scientist "Analyze" \
  --mode orchestrated \
  --files data.csv \
  --temp-dir
```
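A script can test writability directly instead of reading permission bits. The sketch below creates the directory if needed and then tries to create a throwaway file inside it; `is_writable` is a hypothetical helper.

```python
import tempfile
from pathlib import Path


def is_writable(directory):
    """Return True if a file can be created inside `directory`."""
    path = Path(directory)
    try:
        path.mkdir(parents=True, exist_ok=True)
        # Creating a real file is more reliable than inspecting mode bits.
        with tempfile.TemporaryFile(dir=path):
            pass
    except OSError:
        return False
    return True
```

If `is_writable("./output")` returns `False`, falling back to `--temp-dir` as shown above is one option.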

Execution Hanging

```bash
# Use verbose mode to see what's happening
agentic-data-scientist "Query" \
  --mode orchestrated \
  --files data.csv \
  --verbose

# Try simple mode to isolate planning vs execution issues
agentic-data-scientist "Query" \
  --mode simple \
  --files data.csv
```
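When the CLI is launched from a script, a wall-clock limit keeps a stalled run from blocking indefinitely. This is a generic sketch using the standard library's `subprocess.run`; `run_with_timeout` is a hypothetical helper, and a suitable limit depends on the task.

```python
import subprocess


def run_with_timeout(cmd, timeout_seconds):
    """Run `cmd`; return its exit code, or None if it exceeded the timeout."""
    try:
        completed = subprocess.run(cmd, timeout=timeout_seconds)
    except subprocess.TimeoutExpired:
        # subprocess.run stops the child process before raising.
        return None
    return completed.returncode
```

A call might look like `run_with_timeout(["agentic-data-scientist", "Query", "--mode", "simple", "--files", "data.csv", "--verbose"], 1800)`.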

Output Not Preserved

```bash
# Default behavior preserves files in ./agentic_output/
ls -la ./agentic_output/

# Explicitly set working directory
agentic-data-scientist "Analyze" \
  --mode orchestrated \
  --files data.csv \
  --working-dir ./my_output

# Use --keep-files to override temp-dir cleanup
agentic-data-scientist "Analyze" \
  --mode orchestrated \
  --files data.csv \
  --temp-dir \
  --keep-files
```

Mode Selection Guide

Use Orchestrated Mode when:
  • Task is complex with multiple stages
  • Need thorough planning and validation
  • Quality and completeness are critical
  • Task requires iterative refinement
  • Want comprehensive final report
Use Simple Mode when:
  • Quick scripting or one-off tasks
  • Simple question answering
  • Prototyping or exploration
  • Want fast turnaround
  • Don't need multi-stage workflow

Advanced Configuration

Custom Prompts

Extend the framework by customizing agent prompts:
```python
from agentic_data_scientist.prompts import PLAN_MAKER_PROMPT

# Modify prompts for domain-specific needs
custom_prompt = PLAN_MAKER_PROMPT + """
Additional domain context:
- Focus on genomics data
- Use bioinformatics best practices
"""
```
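If several domains need their own variants, the concatenation can be wrapped in a small function. This is a sketch with a hypothetical helper name, `with_domain_context`, and a stand-in base prompt so it runs without the package. How a customized prompt is handed back to the framework is not shown in this document, so that step is left out.

```python
def with_domain_context(base_prompt, notes):
    """Append a bulleted 'Additional domain context' section to a prompt."""
    bullets = "\n".join(f"- {note}" for note in notes)
    return f"{base_prompt.rstrip()}\n\nAdditional domain context:\n{bullets}\n"


# Stand-in text; in practice this would be PLAN_MAKER_PROMPT.
base = "Create a staged analysis plan.\n"
print(with_domain_context(base, ["Focus on genomics data",
                                 "Use bioinformatics best practices"]))
```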

MCP Server Integration

The framework supports Model Context Protocol for custom tools:
```python
# Configure MCP servers in your workflow
# Agents automatically gain access to tools
```

Access to Claude Scientific Skills

The coding agent has access to 380+ scientific computing skills including:
  • Statistical analysis
  • Machine learning
  • Data visualization
  • Bioinformatics
  • Scientific computing libraries
These are automatically available during the execution phase.