tao-route-visual-changenet-samples
TAO VCN Sample Routing Skill
You are the dispatcher between gap analysis and the augmentation modules in a VCN AOI SDA pipeline. Each augmentation module can only act on labels it knows how to handle:
- k-NN Mining can only mine real-image neighbors for labels that already exist in the source pool CSV. There is no point looking for `SHIFT` neighbors if the pool has no `SHIFT` rows.
- AnomalyGen (Cosmos SDG) can only generate synthetic anomalies for the classes its inference pipeline supports: `PASS`, `EXCESS_SOLDER`, `MISSING`, `BRIDGE`. A weak sample with a label outside this set is unroutable to AnomalyGen.
This skill runs once per SDA iteration immediately after gap analysis. It splits the gap-analysis parquet into one filtered parquet per module so each module operates on its own eligible subset, and it writes a human-readable summary of the per-label routing decisions.
The work is intentionally trivial: read a parquet, apply two `.isin(...)` filters, write two parquets, write one summary. The skill exists to make those decisions auditable: every label must show up in the summary with a yes/no verdict for each module, so a downstream reviewer can spot when a label is silently dropped because no module accepted it.
Inputs
- `gaps_parquet`: the gap-analysis output (typically `<exp_dir>/rca_results/<timestamp>/gaps.parquet` from `tao-analyze-gaps-visual-changenet`). Required columns: `filepath`, `label`. Other columns (`siamese_score`, `weakness`) are preserved verbatim.
- `source_pool_csv`: VCN-format mining source pool CSV with a `label` column. An empty string or a non-existent path is allowed; the mining subset will simply be empty in that case.
- Output directory: where the two routed parquets, the summary, and the report are written. Default: a timestamped folder under the gap-analysis result directory: `<rca_result_dir>/routing_results/<timestamp>/`.
- `anomalygen_supported_labels` (optional): override the default AnomalyGen-eligible label set. Default: `{"PASS", "EXCESS_SOLDER", "MISSING", "BRIDGE"}`. Warning: This must stay in sync with `ANOMALYGEN_SUPPORTED_LABELS` in `mdo-kratos-workflows/pipelines/sda/routing.py` and the AnomalyGen integration's actual generator coverage. Adding a new defect class to AnomalyGen means adding it here too.
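A small pre-flight check can catch a malformed gaps parquet before any filtering happens. The sketch below is an assumption rather than part of the reference component: the helper name `check_gaps_columns` and its error message are hypothetical, and only the two required column names come from the list above.

```python
import pandas as pd

REQUIRED_COLUMNS = ("filepath", "label")  # taken from the Inputs list above


def check_gaps_columns(df: pd.DataFrame) -> None:
    """Hypothetical helper: raise if a required column is absent."""
    missing = [c for c in REQUIRED_COLUMNS if c not in df.columns]
    if missing:
        raise ValueError(f"gaps parquet is missing required columns: {missing}")


# Toy frame standing in for pd.read_parquet(gaps_parquet).
gaps = pd.DataFrame(
    {
        "filepath": ["a.png", "b.png"],
        "label": ["SHIFT", "pass"],
        "siamese_score": [0.91, 0.12],
    }
)
check_gaps_columns(gaps)  # passes silently; extra columns are allowed
```

Failing early here keeps a missing `label` column from surfacing later as a less readable `KeyError` inside the filters.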
Method
The whole skill is two `.isin(...)` masks against the uppercased label column.
Step 1: Load and uppercase
```python
import pandas as pd

df = pd.read_parquet(gaps_parquet)
labels_upper = df["label"].astype(str).str.upper()  # comparison key only
```
The match is case-insensitive for both module checks. The original `label` column is preserved unchanged in the output parquets; only the comparison key is uppercased.
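To see what the uppercased comparison key does in practice, the sketch below runs it on a toy frame. The rows and the two-label `supported` set are made up for illustration and are not the real AnomalyGen set.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "filepath": ["a.png", "b.png", "c.png"],
        "label": ["pass", "Shift", "BRIDGE"],
    }
)
labels_upper = df["label"].astype(str).str.upper()

supported = {"PASS", "BRIDGE"}  # toy set for this example only
subset = df[labels_upper.isin(supported)]

# The mask matched on the uppercased key, but the rows keep their original casing.
print(subset["label"].tolist())  # → ['pass', 'BRIDGE']
```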
Step 2: Mining subset
```python
import os

pool_missing = False
if source_pool_csv and os.path.isfile(source_pool_csv):
    pool_df = pd.read_csv(source_pool_csv)
    pool_labels = {str(l).upper() for l in pool_df["label"].unique()}
    mn_mask = labels_upper.isin(pool_labels)
    mn_df = df[mn_mask]
else:
    pool_missing = True
    pool_labels = set()
    mn_df = df.iloc[0:0]  # empty, but with the same schema
mn_df.to_parquet(mining_gaps_parquet, index=False)
```
If `source_pool_csv` is an empty string or the file does not exist, the mining subset is an empty DataFrame with the same columns as the input, so downstream readers don't crash on a schema mismatch. Flag this case in the summary. A pool file that has a header but no rows also yields an empty subset, because the mask then matches nothing.
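The `df.iloc[0:0]` idiom is what keeps the schema intact. A quick check on a made-up frame (not the real gaps data) suggests that the empty slice still carries every column and dtype:

```python
import pandas as pd

df = pd.DataFrame(
    {"filepath": ["a.png"], "label": ["SHIFT"], "siamese_score": [0.4]}
)
empty = df.iloc[0:0]  # zero rows, same columns

assert len(empty) == 0
assert list(empty.columns) == list(df.columns)
assert empty.dtypes.to_dict() == df.dtypes.to_dict()
```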
Step 3: AnomalyGen subset
```python
ANOMALYGEN_SUPPORTED = {"PASS", "EXCESS_SOLDER", "MISSING", "BRIDGE"}
ag_mask = labels_upper.isin(ANOMALYGEN_SUPPORTED)
ag_df = df[ag_mask]
ag_df.to_parquet(anomalygen_gaps_parquet, index=False)
```
Rows whose label is in the AnomalyGen-supported set are written verbatim to `anomalygen_gaps.parquet`. The schema matches the input parquet exactly, so downstream AnomalyGen (Cosmos SDG) needs no other changes.
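If the optional `anomalygen_supported_labels` override is used, it has to be compared against the same uppercased key, or a lowercase entry would silently never match. A minimal sketch of that normalization follows; the helper name `resolve_supported_labels` and the `Tombstone` label are hypothetical and not part of the reference component.

```python
DEFAULT_ANOMALYGEN_SUPPORTED = frozenset(
    {"PASS", "EXCESS_SOLDER", "MISSING", "BRIDGE"}
)


def resolve_supported_labels(override=None):
    """Hypothetical helper: return the uppercased label set to filter with."""
    if override is None:
        return set(DEFAULT_ANOMALYGEN_SUPPORTED)
    return {str(label).upper() for label in override}


# "Tombstone" is an invented class name used only to show the normalization.
print(sorted(resolve_supported_labels(["bridge", "Tombstone"])))
# → ['BRIDGE', 'TOMBSTONE']
```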
Step 4: Per-label routing breakdown
For every distinct label in the input gaps parquet (uppercased), record:
- `count`: how many rows have this label
- `mining`: yes if the label is in `pool_labels`, otherwise no
- `anomalygen`: yes if the label is in `ANOMALYGEN_SUPPORTED`, otherwise no
A label can route to both modules (e.g. PASS rows route to AnomalyGen, and if the source pool also contains PASS rows they route to Mining too). A label can also route to none — flag those, since they are silently dropped and may signal a configuration mismatch.
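Since the point of the summary is to expose labels that no module accepted, it can help to compute that list explicitly. The sketch below works on plain Python collections with made-up counts; `find_dropped_labels` is a hypothetical helper and `TOMBSTONE` is an invented label, neither comes from the reference component.

```python
from collections import Counter


def find_dropped_labels(label_counts, pool_labels, anomalygen_supported):
    """Return {label: count} for labels accepted by neither module."""
    return {
        label: count
        for label, count in label_counts.items()
        if label not in pool_labels and label not in anomalygen_supported
    }


label_counts = Counter({"PASS": 50, "MISSING": 32, "SHIFT": 14, "TOMBSTONE": 3})
pool_labels = {"PASS", "SHIFT"}
anomalygen_supported = {"PASS", "EXCESS_SOLDER", "MISSING", "BRIDGE"}

print(find_dropped_labels(label_counts, pool_labels, anomalygen_supported))
# → {'TOMBSTONE': 3}
```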
Write the breakdown to `routing_summary.txt`. The format mirrors the reference component exactly:
```
Weak-sample routing summary
Total weak samples: <N>
Mining subset: <N_mn> -> <mining_gaps_parquet>
AnomalyGen subset: <N_ag> -> <anomalygen_gaps_parquet>

[If pool missing:]
No source pool CSV at '<path>'; mining subset is empty.

Per-label breakdown (count, mining, anomalygen):
 PASS: 50 (mining=yes, anomalygen=yes)
 MISSING: 32 (mining=no, anomalygen=yes)
 SHIFT: 14 (mining=yes, anomalygen=no)
 EXCESS_SOLDER: 9 (mining=yes, anomalygen=yes)
 ...
```
Step 5: Sanity checks
After both subsets are written, verify:
- The sum of subset sizes is not required to equal `len(df)`; overlap is allowed (a label can route to both modules). What matters is that every input row appears in at least one subset, OR appears in the "none" list with an explicit reason.
- If `len(mn_df) == 0` and `len(ag_df) == 0`, something is wrong; flag it prominently in the report.
- If an entire label group routes to no module, the `Recommended Actions` section must call this out so the user can either seed the source pool with that label or extend AnomalyGen's supported set.
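These checks can be expressed in a few lines. The sketch below assumes the subsets were taken with boolean masks as in Steps 2 and 3, so each subset's row index still refers back to the input frame, and that the input index is unique. The function name, the returned dictionary, and the `TOMBSTONE` label are illustrative only.

```python
import pandas as pd


def sanity_check(df, mn_df, ag_df):
    """Illustrative check: which input rows landed in no subset?"""
    routed = mn_df.index.union(ag_df.index)
    unrouted = df.loc[~df.index.isin(routed)]
    return {
        "both_empty": len(mn_df) == 0 and len(ag_df) == 0,
        "unrouted_rows": len(unrouted),
        "dropped_labels": sorted(
            unrouted["label"].astype(str).str.upper().unique()
        ),
    }


df = pd.DataFrame({"label": ["PASS", "SHIFT", "TOMBSTONE"]})
mn_df = df[df["label"].isin({"PASS", "SHIFT"})]
ag_df = df[df["label"].isin({"PASS"})]
print(sanity_check(df, mn_df, ag_df))
# → {'both_empty': False, 'unrouted_rows': 1, 'dropped_labels': ['TOMBSTONE']}
```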
Reference Python Recipe
This is the exact computation, lifted from `mdo-kratos-workflows/pipelines/sda/routing.py`. Run as a single Python script via Bash; it produces every artifact except the report.
```python
import os

import pandas as pd

# Expected to be defined before this point: gaps_parquet, source_pool_csv,
# mining_gaps_parquet, anomalygen_gaps_parquet, logs_dir.

ANOMALYGEN_SUPPORTED = {"PASS", "EXCESS_SOLDER", "MISSING", "BRIDGE"}

df = pd.read_parquet(gaps_parquet)
labels_upper = df["label"].astype(str).str.upper()

# Mining subset
pool_missing = False
if source_pool_csv and os.path.isfile(source_pool_csv):
    pool_df = pd.read_csv(source_pool_csv)
    pool_labels = {str(l).upper() for l in pool_df["label"].unique()}
    mn_mask = labels_upper.isin(pool_labels)
    mn_df = df[mn_mask]
else:
    pool_missing = True
    pool_labels = set()
    mn_df = df.iloc[0:0]
os.makedirs(os.path.dirname(mining_gaps_parquet) or ".", exist_ok=True)
mn_df.to_parquet(mining_gaps_parquet, index=False)

# AnomalyGen subset
ag_mask = labels_upper.isin(ANOMALYGEN_SUPPORTED)
ag_df = df[ag_mask]
os.makedirs(os.path.dirname(anomalygen_gaps_parquet) or ".", exist_ok=True)
ag_df.to_parquet(anomalygen_gaps_parquet, index=False)

# Per-label breakdown
summary_lines = [
    "Weak-sample routing summary",
    f"Total weak samples: {len(df)}",
    f"Mining subset: {len(mn_df)} -> {mining_gaps_parquet}",
    f"AnomalyGen subset: {len(ag_df)} -> {anomalygen_gaps_parquet}",
    "",
]
if pool_missing:
    summary_lines.append(f"No source pool CSV at {source_pool_csv!r}; mining subset is empty.")
    summary_lines.append("")
summary_lines.append("Per-label breakdown (count, mining, anomalygen):")
label_counts = labels_upper.value_counts()
for label, count in label_counts.items():
    in_mn = (not pool_missing) and label in pool_labels
    in_ag = label in ANOMALYGEN_SUPPORTED
    summary_lines.append(
        f" {label}: {count} "
        f"(mining={'yes' if in_mn else 'no'}, "
        f"anomalygen={'yes' if in_ag else 'no'})"
    )
summary_text = "\n".join(summary_lines) + "\n"
os.makedirs(logs_dir, exist_ok=True)
with open(os.path.join(logs_dir, "routing_summary.txt"), "w", encoding="utf-8") as f:
    f.write(summary_text)
print(summary_text.strip())
```
---
Outputs
Write everything into a timestamped folder. The packaging hook will copy `routing_config/` and `claude_session.jsonl` automatically when `Routing_Report.md` is written.
```
<output_dir>/routing_results/YYYY-MM-DD_HHMMSS/
├── Routing_Report.md          # Full routing report
├── mining_gaps.parquet        # Subset routed to k-NN Mining
├── anomalygen_gaps.parquet    # Subset routed to AnomalyGen (Cosmos SDG)
├── routing_summary.txt        # Plain-text per-label breakdown
├── routing_config/            # Auto-copied by hook
└── claude_session.jsonl       # Auto-copied by hook
```
At the start of the run, get the real timestamp by running `date +%Y-%m-%d_%H%M%S` in Bash. If the user specifies a custom output path, use it directly but maintain the internal layout.
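If the folder is created from Python rather than from Bash, the same timestamp format can be produced with `datetime.strftime`. This is a sketch of an equivalent, not what the skill prescribes (the skill runs `date` in Bash); the helper name `make_routing_dir` and the temporary directory used in the example are assumptions for illustration.

```python
import os
import tempfile
from datetime import datetime


def make_routing_dir(output_dir, now=None):
    """Create <output_dir>/routing_results/<timestamp>/ and return its path."""
    stamp = (now or datetime.now()).strftime("%Y-%m-%d_%H%M%S")
    path = os.path.join(output_dir, "routing_results", stamp)
    os.makedirs(path, exist_ok=True)
    return path


base = tempfile.mkdtemp()  # stand-in for the real output directory
routing_dir = make_routing_dir(base, now=datetime(2025, 1, 31, 9, 5, 7))
print(os.path.basename(routing_dir))  # → 2025-01-31_090507
```

Passing `now` explicitly keeps the example deterministic; in a real run it would be left as `None` so the current time is used.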
Report Structure
Keep the report short (400–800 words). Routing is a deterministic decision; the value is making the decisions auditable, not narrative.
VCN Routing Report: <Iteration / Experiment Name>
1. Verdict
- Total weak samples in: <N>
- Mining subset: <N_mn> rows → `mining_gaps.parquet`
- AnomalyGen subset: <N_ag> rows → `anomalygen_gaps.parquet`
- Source pool present? <yes/no, and the path>
- One-line headline: "<X> labels routed, <Y> labels dropped (no module accepted)"
2. Inputs
| Input | Path | Notes |
|---|---|---|
| gaps_parquet | … | rows=<N>, columns=<col list> |
| source_pool_csv | … | rows=<M> or "not provided" / "missing" |
3. Per-Label Routing Decisions
| Label | Count in gaps | In source pool? | Mining? | AnomalyGen? | Routed To |
|---|---|---|---|---|---|
(One row per distinct label in `gaps_parquet`, uppercased. `Routed To` is one of: `mining only`, `anomalygen only`, `mining+anomalygen`, `neither (DROPPED)`. Use `neither (DROPPED)` whenever no module accepted the label. Sort by count descending.)
4. Module-Level Summaries
4.1 k-NN Mining
- Pool labels (from source_pool_csv): <list, or "pool missing">
- Labels accepted from input: <list>
- Total rows routed: <N_mn>
- Per-label row counts: <breakdown>
4.2 AnomalyGen (Cosmos SDG)
- Eligible labels (configured): PASS, EXCESS_SOLDER, MISSING, BRIDGE
- Labels accepted from input: <list>
- Total rows routed: <N_ag>
- Per-label row counts: <breakdown>
5. Dropped Labels (routed to NEITHER module)
| Label | Count | Why dropped | Suggested fix |
|---|---|---|---|
(An empty table is OK and means no labels were dropped. If non-empty, every row needs a "why", typically one of: "not in source pool AND not in AnomalyGen supported set", "source pool missing entirely AND label not in AnomalyGen set", "label name doesn't match any module's expected canonicalization".)
6. Recommended Actions
- If any labels are dropped: seed the source pool with that label, OR extend `ANOMALYGEN_SUPPORTED_LABELS` (and the AnomalyGen generator coverage).
- If source pool is missing: provide `source_pool_csv` to enable the Mining branch. Without it, half of the augmentation pipeline is dark.
- If AnomalyGen subset is empty: gap analysis only surfaced labels AnomalyGen cannot generate; rely on Mining for this iteration, or extend the AnomalyGen integration.
- If both subsets are empty: stop the SDA iteration. Nothing downstream can run.
---
Execution Order
1. Run `date +%Y-%m-%d_%H%M%S` to get the timestamp; create `<output_dir>/routing_results/<timestamp>/`.
2. Run the Python recipe (Steps 1–4) to produce `mining_gaps.parquet`, `anomalygen_gaps.parquet`, and `routing_summary.txt`. Print summary stats to stdout so the script-check hook can verify it ran.
3. Build the per-label decision table by reading both parquets and computing the routed-to verdict per label.
4. Write `Routing_Report.md` last; writing it triggers the packaging hook, which copies session logs and skill config alongside.
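The table-building step above reduces to a four-way mapping from the two yes/no flags to the `Routed To` value used in the report. A minimal sketch follows; the function name `routed_to` and the sample rows (including the invented `TOMBSTONE` label) are hypothetical.

```python
def routed_to(in_mining: bool, in_anomalygen: bool) -> str:
    """Map the two module verdicts to the report's Routed To value."""
    if in_mining and in_anomalygen:
        return "mining+anomalygen"
    if in_mining:
        return "mining only"
    if in_anomalygen:
        return "anomalygen only"
    return "neither (DROPPED)"


# (label, count, in_mining, in_anomalygen); values are made up.
rows = [
    ("SHIFT", 14, True, False),
    ("PASS", 50, True, True),
    ("TOMBSTONE", 3, False, False),
]
# Sort by count descending, as the report table requires.
for label, count, in_mn, in_ag in sorted(rows, key=lambda r: r[1], reverse=True):
    print(f"| {label} | {count} | {routed_to(in_mn, in_ag)} |")
```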