skill-system-eda
Skill System EDA
Use `scripts/eda.py` for deterministic tabular analysis artifacts.
Core Commands
```bash
python3 scripts/eda.py profile-dataset --input data.csv --target Class --output /tmp/eda
python3 scripts/eda.py distribution-report --input data.csv --target Class --profile /tmp/eda/profile.yaml
python3 scripts/eda.py correlation-matrix --input data.csv --target Class --profile /tmp/eda/profile.yaml
python3 scripts/eda.py anomaly-profiling --input data.csv --target Class --profile /tmp/eda/profile.yaml
python3 scripts/eda.py feature-importance-scan --input data.csv --target Class --profile /tmp/eda/profile.yaml
python3 scripts/eda.py leakage-detector --input data.csv --target Class --profile /tmp/eda/profile.yaml
python3 scripts/eda.py save-contract --profile /tmp/eda/profile.yaml --output /tmp/eda/contract.yaml
python3 scripts/eda.py validate-contract --input new_data.csv --contract /tmp/eda/contract.yaml
```
Output Model
- `profile-dataset` creates `profile.yaml` and `report.md`
- later commands update `profile.yaml` and append sections to `report.md`
- `save-contract` emits `contract.yaml`
- `validate-contract` prints JSON `PASS`/`FAIL` with a violation list
Analysis Rules
- Use Polars (not pandas) for data IO/aggregation/profiling flows.
- Keep sampling deterministic with lazy `.head(N)` when `--sample` is used.
- Treat `profile.yaml` as the machine-readable source of truth; `report.md` is the human-readable companion.
- Use Polars + numpy + scipy for profiling, shifts, correlations, KS tests, and Cramér's V.
- Use sklearn feature ranking only when available; otherwise keep tree-based importance explicitly skipped.
- Use lazy scan strategy for large CSV/parquet inputs (`scan_csv`/`scan_parquet`), with materialization delayed until needed.
- Apply high-cardinality guards: `>50` unique skips one-hot in feature importance, and profile truncates categorical columns (`>100` unique or `>50%` row cardinality) to top-20 values.
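The high-cardinality guards above can be sketched in plain Python. This illustrates the threshold logic only and is not the code in `scripts/eda.py`, which per the rules above works through Polars. The function names `skip_one_hot` and `truncate_categories` are hypothetical, and reading "row cardinality" as the ratio of unique values to rows is an assumption.

```python
from collections import Counter

ONE_HOT_MAX_UNIQUE = 50    # more than 50 unique values: skip one-hot encoding
TRUNCATE_MAX_UNIQUE = 100  # more than 100 unique values: truncate in the profile
TRUNCATE_MAX_RATIO = 0.5   # unique/rows above 50% (assumed meaning): truncate
TOP_K = 20                 # number of most frequent values kept when truncating


def skip_one_hot(values):
    """Return True when a column has too many distinct values to one-hot encode."""
    return len(set(values)) > ONE_HOT_MAX_UNIQUE


def truncate_categories(values):
    """Return (value_counts, truncated) for a categorical column.

    When either guard trips, only the TOP_K most frequent values are kept.
    Ties are broken by the string form of the value so the result is
    deterministic across runs.
    """
    counts = Counter(values)
    n_unique = len(counts)
    ratio = n_unique / len(values) if values else 0.0
    truncated = n_unique > TRUNCATE_MAX_UNIQUE or ratio > TRUNCATE_MAX_RATIO
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], str(kv[0])))
    kept = ranked[:TOP_K] if truncated else ranked
    return dict(kept), truncated
```

Sorting on `(-count, str(value))` rather than relying on insertion order is one way to keep the top-20 list stable, which matters if `profile.yaml` is meant to be reproducible.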
Memory Integration
- By default, commands write a summary memory plus one memory per warning/critical finding.
- Prefer `skill-system-memory/scripts/mem.py store` when available.
- If memory writes fail or `EDA_DISABLE_MEM_PY=1` is set, write fallback payloads under `.memory/pending/`.
- Use `--no-memory` for deterministic tests or when no writeback is desired.
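The fallback behaviour described above might look roughly like the following sketch. The `store` callable stands in for the `mem.py store` invocation, whose actual arguments are not documented here; `write_memory`, the hash-based file name, and the JSON payload layout are all assumptions made for illustration.

```python
import hashlib
import json
import os
from pathlib import Path


def write_memory(payload, store, pending_dir=".memory/pending", env=None):
    """Try the primary store; fall back to a pending file on failure or opt-out.

    Returns "stored" when the primary store accepted the payload, otherwise
    the path of the fallback file that was written.
    """
    env = os.environ if env is None else env
    if env.get("EDA_DISABLE_MEM_PY") != "1":
        try:
            store(payload)  # hypothetical wrapper around `mem.py store`
            return "stored"
        except Exception:
            pass  # any failure falls through to the pending directory
    pending = Path(pending_dir)
    pending.mkdir(parents=True, exist_ok=True)
    # Content-derived file name (assumption) so reruns overwrite, not duplicate.
    body = json.dumps(payload, sort_keys=True)
    name = hashlib.sha256(body.encode("utf-8")).hexdigest()[:16] + ".json"
    path = pending / name
    path.write_text(body, encoding="utf-8")
    return str(path)
```

Passing `env` explicitly keeps the function easy to exercise in tests without touching the real process environment.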
Contract Lifecycle
- `save-contract` derives column requirements from `profile.yaml`.
- Numeric ranges use observed bounds for tiny datasets and profile-derived percentile bounds for larger datasets.
- Truncated categorical columns produce `cardinality_range` rules instead of `allowed_values`.
- `validate-contract` fails closed and returns machine-readable violations.
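As a rough illustration of the numeric-range rule and the fail-closed check, the sketch below derives bounds from raw values and validates a column against them. The cutoff `TINY_ROWS`, the 1st/99th percentiles, and the violation dictionary shape are hypothetical; the document does not state the real thresholds, and the actual tool reads its percentiles from `profile.yaml` rather than recomputing them.

```python
import math

TINY_ROWS = 30  # hypothetical cutoff for a "tiny" dataset


def percentile(sorted_vals, q):
    """Linearly interpolated percentile (q in [0, 100]) of a sorted list."""
    pos = (len(sorted_vals) - 1) * q / 100.0
    lo, hi = math.floor(pos), math.ceil(pos)
    return sorted_vals[lo] + (sorted_vals[hi] - sorted_vals[lo]) * (pos - lo)


def numeric_range(values, low_q=1.0, high_q=99.0):
    """Observed min/max for tiny inputs, percentile bounds otherwise."""
    vals = sorted(values)
    if len(vals) < TINY_ROWS:
        return {"min": vals[0], "max": vals[-1]}
    return {"min": percentile(vals, low_q), "max": percentile(vals, high_q)}


def validate_column(values, rule):
    """Fail closed: a missing rule or any unverifiable value is a violation."""
    if rule is None or "min" not in rule or "max" not in rule:
        return [{"kind": "missing_rule"}]
    violations = []
    for i, v in enumerate(values):
        is_number = isinstance(v, (int, float)) and not isinstance(v, bool)
        # NaN fails both comparisons, so it is reported rather than passed.
        if not (is_number and rule["min"] <= v <= rule["max"]):
            violations.append({"row": i, "value": v, "kind": "out_of_range"})
    return violations


def status(violations):
    """Map a violation list to the PASS/FAIL status string."""
    return "PASS" if not violations else "FAIL"
```

"Fail closed" here means the default answer is a violation: a value only passes when it is positively shown to be a number inside the range.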
```skill
{
  "schema_version": "2.0",
  "id": "skill-system-eda",
  "version": "1.0.0",
  "capabilities": [
    "eda-profile",
    "eda-distribution",
    "eda-correlation",
    "eda-anomaly",
    "eda-feature-importance",
    "eda-leakage",
    "eda-contract-save",
    "eda-contract-validate"
  ],
  "effects": ["fs.read", "fs.write", "proc.exec"],
  "operations": {
    "profile-dataset": {
      "description": "Profile a CSV/parquet dataset and generate profile.yaml plus report.md.",
      "input": {
        "input": { "type": "string", "required": true },
        "target": { "type": "string", "required": false },
        "output": { "type": "string", "required": true },
        "sample": { "type": "integer", "required": false },
        "no_memory": { "type": "boolean", "required": false }
      },
      "output": {
        "description": "Artifact paths for the generated EDA profile",
        "fields": { "profile": "string", "report": "string" }
      },
      "entrypoints": {
        "unix": ["python3", "scripts/eda.py", "profile-dataset", "--input", "{input}", "--output", "{output}"]
      }
    },
    "distribution-report": {
      "description": "Append distribution and class-conditional analysis to an existing profile/report.",
      "input": {
        "input": { "type": "string", "required": true },
        "target": { "type": "string", "required": true },
        "profile": { "type": "string", "required": true }
      },
      "output": { "description": "Updated profile/report paths", "fields": { "profile": "string", "report": "string" } },
      "entrypoints": {
        "unix": ["python3", "scripts/eda.py", "distribution-report", "--input", "{input}", "--target", "{target}", "--profile", "{profile}"]
      }
    },
    "correlation-matrix": {
      "description": "Compute feature and target correlations and append them to profile/report.",
      "input": {
        "input": { "type": "string", "required": true },
        "target": { "type": "string", "required": false },
        "profile": { "type": "string", "required": true }
      },
      "output": { "description": "Updated profile/report paths", "fields": { "profile": "string", "report": "string" } },
      "entrypoints": {
        "unix": ["python3", "scripts/eda.py", "correlation-matrix", "--input", "{input}", "--profile", "{profile}"]
      }
    },
    "anomaly-profiling": {
      "description": "Compare class-conditional distributions and effect sizes.",
      "input": {
        "input": { "type": "string", "required": true },
        "target": { "type": "string", "required": true },
        "profile": { "type": "string", "required": true }
      },
      "output": { "description": "Updated profile/report paths", "fields": { "profile": "string", "report": "string" } },
      "entrypoints": {
        "unix": ["python3", "scripts/eda.py", "anomaly-profiling", "--input", "{input}", "--target", "{target}", "--profile", "{profile}"]
      }
    },
    "feature-importance-scan": {
      "description": "Rank features with mutual information and optional tree importances.",
      "input": {
        "input": { "type": "string", "required": true },
        "target": { "type": "string", "required": true },
        "profile": { "type": "string", "required": true }
      },
      "output": { "description": "Updated profile/report paths", "fields": { "profile": "string", "report": "string" } },
      "entrypoints": {
        "unix": ["python3", "scripts/eda.py", "feature-importance-scan", "--input", "{input}", "--target", "{target}", "--profile", "{profile}"]
      }
    },
    "leakage-detector": {
      "description": "Detect high-correlation, target-encoding, and temporal leakage indicators.",
      "input": {
        "input": { "type": "string", "required": true },
        "target": { "type": "string", "required": true },
        "profile": { "type": "string", "required": true }
      },
      "output": { "description": "Updated profile/report paths", "fields": { "profile": "string", "report": "string" } },
      "entrypoints": {
        "unix": ["python3", "scripts/eda.py", "leakage-detector", "--input", "{input}", "--target", "{target}", "--profile", "{profile}"]
      }
    },
    "save-contract": {
      "description": "Generate a data contract from a saved EDA profile.",
      "input": {
        "profile": { "type": "string", "required": true },
        "output": { "type": "string", "required": true }
      },
      "output": { "description": "Contract path", "fields": { "contract": "string" } },
      "entrypoints": {
        "unix": ["python3", "scripts/eda.py", "save-contract", "--profile", "{profile}", "--output", "{output}"]
      }
    },
    "validate-contract": {
      "description": "Validate a new dataset against a saved contract and emit PASS/FAIL JSON.",
      "input": {
        "input": { "type": "string", "required": true },
        "contract": { "type": "string", "required": true }
      },
      "output": { "description": "Validation status and violations", "fields": { "status": "string", "violations": "array" } },
      "entrypoints": {
        "unix": ["python3", "scripts/eda.py", "validate-contract", "--input", "{input}", "--contract", "{contract}"]
      }
    }
  },
  "stdout_contract": {
    "last_line_json": true
  }
}
```
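A caller consuming this manifest needs two small pieces: filling the `{name}` placeholders in an entrypoint and reading the JSON from the last line of stdout, as `stdout_contract.last_line_json` implies. The helpers below are a minimal sketch under that reading; `render_entrypoint` and `parse_last_line_json` are hypothetical names, not part of the skill.

```python
import json


def render_entrypoint(template, params):
    """Fill `{name}` placeholders in an entrypoint argv template."""
    return [arg.format(**params) for arg in template]


def parse_last_line_json(stdout):
    """Return the JSON value found on the last non-empty line of stdout."""
    lines = [ln for ln in stdout.splitlines() if ln.strip()]
    if not lines:
        raise ValueError("no output to parse")
    return json.loads(lines[-1])
```

The rendered list can be passed to `subprocess.run(argv, capture_output=True, text=True)` and the resulting `stdout` handed to `parse_last_line_json`. Building an argv list instead of a shell string avoids quoting problems with paths that contain spaces.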