tao-route-visual-changenet-samples

TAO VCN Sample Routing Skill

You are the dispatcher between gap analysis and the augmentation modules in a VCN AOI SDA pipeline. Each augmentation module can only act on labels it knows how to handle:
  • k-NN Mining can only mine real-image neighbors for labels that already exist in the source pool CSV. There is no point looking for SHIFT neighbors if the pool has no SHIFT rows.
  • AnomalyGen (Cosmos SDG) can only generate synthetic anomalies for the classes its inference pipeline supports: PASS, EXCESS_SOLDER, MISSING, BRIDGE. A weak sample with a label outside this set is unroutable to AnomalyGen.
This skill runs once per SDA iteration, immediately after gap analysis. It splits the gap-analysis parquet into one filtered parquet per module so each module operates on its own eligible subset, and it writes a human-readable summary of the per-label routing decisions.
The work is intentionally trivial: read a parquet, apply two .isin(...) filters, write two parquets, write one summary. The skill exists to make those decisions auditable: every label must show up in the summary with a yes/no verdict for each module so a downstream reviewer can spot when a label is silently dropped because no module accepted it.

Inputs

  1. gaps_parquet: the gap-analysis output (typically <exp_dir>/rca_results/<timestamp>/gaps.parquet from tao-analyze-gaps-visual-changenet). Required columns: filepath, label. Other columns (siamese_score, weakness) are preserved verbatim.
  2. source_pool_csv: VCN-format mining source pool CSV with a label column. An empty string or a non-existent path is allowed; the mining subset will simply be empty in that case.
  3. Output directory: where the two routed parquets, the summary, and the report are written. Default: a timestamped folder under the gap-analysis result directory, <rca_result_dir>/routing_results/<timestamp>/.
  4. anomalygen_supported_labels (optional): override the default AnomalyGen-eligible label set. Default: {"PASS", "EXCESS_SOLDER", "MISSING", "BRIDGE"}. Warning: this must stay in sync with ANOMALYGEN_SUPPORTED_LABELS in mdo-kratos-workflows/pipelines/sda/routing.py and with the AnomalyGen integration's actual generator coverage. Adding a new defect class to AnomalyGen means adding it here too.
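
The input rules above can be checked before any routing happens. The following is a minimal sketch, not part of the reference recipe: the helper names resolve_supported_labels and missing_required_columns are hypothetical, and uppercasing the override is an assumption made so the set lines up with the case-insensitive comparison used in the Method section.

```python
DEFAULT_ANOMALYGEN_LABELS = frozenset({"PASS", "EXCESS_SOLDER", "MISSING", "BRIDGE"})
REQUIRED_COLUMNS = ("filepath", "label")


def resolve_supported_labels(override=None):
    """Return the AnomalyGen-eligible label set, uppercased.

    Hypothetical helper: falls back to the documented default when no
    override is given (None or an empty collection).
    """
    if not override:
        return set(DEFAULT_ANOMALYGEN_LABELS)
    return {str(label).upper() for label in override}


def missing_required_columns(columns):
    """List the required gaps-parquet columns that are absent from `columns`."""
    present = set(columns)
    return [name for name in REQUIRED_COLUMNS if name not in present]


if __name__ == "__main__":
    print(sorted(resolve_supported_labels(["bridge", "Shift"])))
    print(missing_required_columns(["filepath", "siamese_score"]))
```

Failing early on a missing label column gives a clearer error than the KeyError that df["label"] would otherwise raise.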

Method

The whole skill is two .isin(...) masks against the uppercased label column.

Step 1 — Load and uppercase

```python
import pandas as pd

df = pd.read_parquet(gaps_parquet)
# Comparison key only; df["label"] itself is left untouched.
labels_upper = df["label"].astype(str).str.upper()
```
The match is case-insensitive for both module checks. The original label column is preserved unchanged in the output parquets; only the comparison key is uppercased.
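
To make the "comparison key only" point concrete, here is a small sketch on an in-memory DataFrame. The three rows are invented for illustration; a real run reads them from the gaps parquet.

```python
import pandas as pd

# Toy stand-in for the gaps parquet; real inputs come from pd.read_parquet.
df = pd.DataFrame(
    {
        "filepath": ["a.png", "b.png", "c.png"],
        "label": ["pass", "Bridge", "shift"],
    }
)

# Uppercase only the comparison key, never the stored column.
labels_upper = df["label"].astype(str).str.upper()
subset = df[labels_upper.isin({"PASS", "BRIDGE"})]

print(subset["label"].tolist())  # original casing survives in the subset
```

The mixed-case labels match the uppercase set, yet the filtered rows keep their original spelling.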

Step 2 — Mining subset

```python
import os

pool_missing = False
if source_pool_csv and os.path.isfile(source_pool_csv):
    pool_df = pd.read_csv(source_pool_csv)
    # Uppercase pool labels so they compare against labels_upper.
    pool_labels = {str(l).upper() for l in pool_df["label"].unique()}
    mn_mask = labels_upper.isin(pool_labels)
    mn_df = df[mn_mask]
else:
    pool_missing = True
    pool_labels = set()
    mn_df = df.iloc[0:0]   # empty, but with the same schema
mn_df.to_parquet(mining_gaps_parquet, index=False)
```
If the pool CSV is missing or empty, the mining subset is an empty DataFrame with the same columns as the input so downstream readers don't crash on schema mismatch. Flag this case in the summary.
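
One edge case worth guarding is a pool file that exists but is zero bytes, where pd.read_csv raises pandas.errors.EmptyDataError instead of returning an empty frame. The sketch below is one possible way to handle it. The helper load_pool_labels is hypothetical, and treating a zero-byte file or a CSV without a label column as "missing" is an assumption rather than documented behavior of the reference component.

```python
import os
import tempfile

import pandas as pd


def load_pool_labels(source_pool_csv):
    """Return (pool_labels, pool_missing) for a source pool CSV path.

    Hypothetical helper. Treating a zero-byte file or a CSV without a
    `label` column as "missing" is an assumption, not documented behavior.
    """
    if not source_pool_csv or not os.path.isfile(source_pool_csv):
        return set(), True
    try:
        pool_df = pd.read_csv(source_pool_csv)
    except pd.errors.EmptyDataError:
        # read_csv raises this when the file has no columns to parse.
        return set(), True
    if "label" not in pool_df.columns:
        return set(), True
    return {str(value).upper() for value in pool_df["label"].unique()}, False


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        empty_path = os.path.join(tmp, "empty.csv")
        open(empty_path, "w").close()
        pool_path = os.path.join(tmp, "pool.csv")
        with open(pool_path, "w", encoding="utf-8") as handle:
            handle.write("filepath,label\nx.png,shift\ny.png,PASS\n")
        print(load_pool_labels(""))
        print(load_pool_labels(empty_path))
        print(load_pool_labels(pool_path))
```

A header-only CSV needs no special case: it yields an empty label set, so the mask selects no rows and the subset is still written with the input schema.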

Step 3 — AnomalyGen subset

```python
ANOMALYGEN_SUPPORTED = {"PASS", "EXCESS_SOLDER", "MISSING", "BRIDGE"}
ag_mask = labels_upper.isin(ANOMALYGEN_SUPPORTED)
ag_df = df[ag_mask]
ag_df.to_parquet(anomalygen_gaps_parquet, index=False)
```
Rows whose label is in the AnomalyGen-supported set are written verbatim to anomalygen_gaps.parquet. The schema matches the input parquet exactly, so downstream AnomalyGen (Cosmos SDG) needs no other changes.

Step 4 — Per-label routing breakdown

For every distinct label in the input gaps parquet (uppercased), record:
  • count: how many rows have this label
  • mining: yes if the label is in pool_labels, otherwise no
  • anomalygen: yes if the label is in ANOMALYGEN_SUPPORTED, otherwise no
A label can route to both modules (e.g. PASS rows route to AnomalyGen, and if the source pool also contains PASS rows they route to Mining too). A label can also route to none; flag those, since they are silently dropped and may signal a configuration mismatch.
Write the breakdown to routing_summary.txt. The format mirrors the reference component exactly:
```
Weak-sample routing summary
Total weak samples: <N>
Mining subset:      <N_mn> -> <mining_gaps_parquet>
AnomalyGen subset:  <N_ag> -> <anomalygen_gaps_parquet>

[If pool missing:]
No source pool CSV at '<path>'; mining subset is empty.

Per-label breakdown (count, mining, anomalygen):
  PASS: 50 (mining=yes, anomalygen=yes)
  MISSING: 32 (mining=no, anomalygen=yes)
  SHIFT: 14 (mining=yes, anomalygen=no)
  EXCESS_SOLDER: 9 (mining=yes, anomalygen=yes)
  ...
```
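
Because the summary is plain text with a fixed line shape, a reviewer can parse it back to find labels that no module accepted. This is a minimal sketch, assuming the per-label lines follow the format shown above. The names parse_breakdown and dropped_labels are hypothetical, and TOMBSTONE is an invented label used only for illustration.

```python
import re

# Matches lines such as "  PASS: 50 (mining=yes, anomalygen=yes)".
LINE_RE = re.compile(
    r"^\s*(?P<label>[^:]+): (?P<count>\d+) "
    r"\(mining=(?P<mining>yes|no), anomalygen=(?P<anomalygen>yes|no)\)\s*$"
)


def parse_breakdown(summary_text):
    """Parse per-label lines into dicts; non-matching lines are skipped."""
    rows = []
    for line in summary_text.splitlines():
        match = LINE_RE.match(line)
        if match:
            rows.append(
                {
                    "label": match["label"],
                    "count": int(match["count"]),
                    "mining": match["mining"] == "yes",
                    "anomalygen": match["anomalygen"] == "yes",
                }
            )
    return rows


def dropped_labels(rows):
    """Labels that neither module accepted."""
    return [row["label"] for row in rows if not (row["mining"] or row["anomalygen"])]


if __name__ == "__main__":
    sample = (
        "Per-label breakdown (count, mining, anomalygen):\n"
        "  PASS: 50 (mining=yes, anomalygen=yes)\n"
        "  SHIFT: 14 (mining=yes, anomalygen=no)\n"
        "  TOMBSTONE: 3 (mining=no, anomalygen=no)\n"  # invented label
    )
    print(dropped_labels(parse_breakdown(sample)))
```

Header lines such as "Total weak samples: <N>" do not match the pattern, so only the per-label rows are returned.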

Step 5 — Sanity checks

After both subsets are written, verify:
  • The sum of subset sizes is not required to equal len(df): overlap is allowed (a label can route to both modules). What matters is that every input row appears in at least one subset, or appears in the "none" list with an explicit reason.
  • If len(mn_df) == 0 and len(ag_df) == 0, something is wrong; flag it prominently in the report.
  • If an entire label group routes to no module, the Recommended Actions section must call this out so the user can either seed the source pool with that label or extend AnomalyGen's supported set.
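
These checks can be expressed as simple row accounting. The sketch below works on a plain list of uppercased labels (one entry per input row) rather than on the DataFrames. The function name check_routing and the shape of its return value are assumptions made for illustration.

```python
def check_routing(labels, pool_labels, supported_labels):
    """Summarize row coverage for uppercased `labels` (one entry per row).

    Hypothetical helper: a row counts as routed to a module when its label
    is in that module's accepted set, mirroring the two .isin(...) masks.
    """
    in_mn = [label in pool_labels for label in labels]
    in_ag = [label in supported_labels for label in labels]
    n_mn, n_ag = sum(in_mn), sum(in_ag)
    n_both = sum(m and a for m, a in zip(in_mn, in_ag))
    n_none = sum(not (m or a) for m, a in zip(in_mn, in_ag))
    dropped = sorted(
        {label for label, m, a in zip(labels, in_mn, in_ag) if not (m or a)}
    )
    # Inclusion-exclusion: rows in at least one subset plus unrouted rows
    # must account for every input row.
    assert n_mn + n_ag - n_both + n_none == len(labels)
    return {
        "mining_rows": n_mn,
        "anomalygen_rows": n_ag,
        "dropped_labels": dropped,
        "nothing_routed": n_mn == 0 and n_ag == 0,
    }


if __name__ == "__main__":
    supported = {"PASS", "EXCESS_SOLDER", "MISSING", "BRIDGE"}
    rows = ["PASS", "PASS", "SHIFT", "MISSING", "TOMBSTONE"]  # TOMBSTONE is invented
    print(check_routing(rows, {"PASS", "SHIFT"}, supported))
```

Subset sizes may sum to more than the input size because of overlap, which is why the identity subtracts the rows routed to both modules.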

Reference Python Recipe

This is the exact computation, lifted from mdo-kratos-workflows/pipelines/sda/routing.py. Run it as a single Python script via Bash; it produces every artifact except the report.
```python
import os
import pandas as pd

# Set these before running: gaps_parquet, source_pool_csv,
# mining_gaps_parquet, anomalygen_gaps_parquet, logs_dir.

ANOMALYGEN_SUPPORTED = {"PASS", "EXCESS_SOLDER", "MISSING", "BRIDGE"}

df = pd.read_parquet(gaps_parquet)
labels_upper = df["label"].astype(str).str.upper()

# Mining subset
pool_missing = False
if source_pool_csv and os.path.isfile(source_pool_csv):
    pool_df = pd.read_csv(source_pool_csv)
    pool_labels = {str(l).upper() for l in pool_df["label"].unique()}
    mn_mask = labels_upper.isin(pool_labels)
    mn_df = df[mn_mask]
else:
    pool_missing = True
    pool_labels = set()
    mn_df = df.iloc[0:0]
os.makedirs(os.path.dirname(mining_gaps_parquet) or ".", exist_ok=True)
mn_df.to_parquet(mining_gaps_parquet, index=False)

# AnomalyGen subset
ag_mask = labels_upper.isin(ANOMALYGEN_SUPPORTED)
ag_df = df[ag_mask]
os.makedirs(os.path.dirname(anomalygen_gaps_parquet) or ".", exist_ok=True)
ag_df.to_parquet(anomalygen_gaps_parquet, index=False)

# Per-label breakdown
summary_lines = [
    "Weak-sample routing summary",
    f"Total weak samples: {len(df)}",
    f"Mining subset:      {len(mn_df)} -> {mining_gaps_parquet}",
    f"AnomalyGen subset:  {len(ag_df)} -> {anomalygen_gaps_parquet}",
    "",
]
if pool_missing:
    summary_lines.append(f"No source pool CSV at {source_pool_csv!r}; mining subset is empty.")
    summary_lines.append("")
summary_lines.append("Per-label breakdown (count, mining, anomalygen):")
label_counts = labels_upper.value_counts()
for label, count in label_counts.items():
    in_mn = (not pool_missing) and label in pool_labels
    in_ag = label in ANOMALYGEN_SUPPORTED
    summary_lines.append(
        f"  {label}: {count} "
        f"(mining={'yes' if in_mn else 'no'}, "
        f"anomalygen={'yes' if in_ag else 'no'})"
    )
summary_text = "\n".join(summary_lines) + "\n"

os.makedirs(logs_dir, exist_ok=True)
with open(os.path.join(logs_dir, "routing_summary.txt"), "w", encoding="utf-8") as f:
    f.write(summary_text)
print(summary_text.strip())
```

---

Outputs

Write everything into a timestamped folder. The packaging hook will copy routing_config/ and claude_session.jsonl automatically when Routing_Report.md is written.
```
<output_dir>/routing_results/YYYY-MM-DD_HHMMSS/
├── Routing_Report.md           # Full routing report
├── mining_gaps.parquet         # Subset routed to k-NN Mining
├── anomalygen_gaps.parquet     # Subset routed to AnomalyGen (Cosmos SDG)
├── routing_summary.txt         # Plain-text per-label breakdown
├── routing_config/             # Auto-copied by hook
└── claude_session.jsonl        # Auto-copied by hook
```
At the start of the run, get the real timestamp by running date +%Y-%m-%d_%H%M%S in Bash. If the user specifies a custom output path, use it directly but maintain the internal layout.
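
If the folder has to be created from inside Python rather than from Bash, the same layout can be produced with the standard library. This is only a sketch: make_routing_dir is a hypothetical helper, and the strftime pattern is chosen to match the date format string given above.

```python
import os
import tempfile
from datetime import datetime


def make_routing_dir(output_dir, now=None):
    """Create <output_dir>/routing_results/<timestamp>/ and return its path.

    Hypothetical helper; the strftime pattern matches `date +%Y-%m-%d_%H%M%S`.
    """
    stamp = (now or datetime.now()).strftime("%Y-%m-%d_%H%M%S")
    path = os.path.join(output_dir, "routing_results", stamp)
    os.makedirs(path, exist_ok=True)
    return path


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        # A fixed datetime keeps the example reproducible.
        print(make_routing_dir(tmp, datetime(2024, 1, 2, 3, 4, 5)))
```

Taking the timestamp once at the start and reusing the returned path keeps all artifacts of one run in the same folder.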

Report Structure

Keep the report short (400–800 words). Routing is a deterministic decision; the value is making the decisions auditable, not narrative.

VCN Routing Report: <Iteration / Experiment Name>

1. Verdict

  • Total weak samples in: <N>
  • Mining subset: <N_mn> rows → mining_gaps.parquet
  • AnomalyGen subset: <N_ag> rows → anomalygen_gaps.parquet
  • Source pool present? <yes/no — and the path>
  • One-line headline: "<X> labels routed, <Y> labels dropped (no module accepted)"

2. Inputs

| Input | Path | Notes |
|---|---|---|
| gaps_parquet | | rows=<N>, columns=<col list> |
| source_pool_csv | | rows=<M> or "not provided" / "missing" |

3. Per-Label Routing Decisions

| Label | Count in gaps | In source pool? | Mining? | AnomalyGen? | Routed To |
|---|---|---|---|---|---|

(One row per distinct label in gaps_parquet, uppercased. Routed To is one of: mining only, anomalygen only, mining+anomalygen, neither (DROPPED). Use neither (DROPPED) whenever no module accepted the label. Sort by count descending.)
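
The table rows follow mechanically from the two membership checks. The sketch below builds them from a list of uppercased labels. The helpers routed_to and decision_rows are hypothetical, and the pipe-delimited row layout is an assumption about how the report table is rendered.

```python
from collections import Counter


def routed_to(in_mining, in_anomalygen):
    """Map the two module verdicts to the report's Routed To value."""
    if in_mining and in_anomalygen:
        return "mining+anomalygen"
    if in_mining:
        return "mining only"
    if in_anomalygen:
        return "anomalygen only"
    return "neither (DROPPED)"


def _yes_no(flag):
    return "yes" if flag else "no"


def decision_rows(labels, pool_labels, supported_labels):
    """Build one table row per distinct label, sorted by count descending."""
    rows = []
    for label, count in Counter(labels).most_common():
        in_mn = label in pool_labels
        in_ag = label in supported_labels
        # "In source pool?" and "Mining?" coincide: pool membership is the
        # only mining criterion.
        rows.append(
            f"| {label} | {count} | {_yes_no(in_mn)} | {_yes_no(in_mn)} "
            f"| {_yes_no(in_ag)} | {routed_to(in_mn, in_ag)} |"
        )
    return rows


if __name__ == "__main__":
    supported = {"PASS", "EXCESS_SOLDER", "MISSING", "BRIDGE"}
    labels = ["PASS"] * 3 + ["SHIFT"] * 2 + ["TOMBSTONE"]  # TOMBSTONE is invented
    for row in decision_rows(labels, {"PASS", "SHIFT"}, supported):
        print(row)
```

Counter.most_common() already returns labels in descending count order, which satisfies the sort requirement without an extra step.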

4. Module-Level Summaries

4.1 k-NN Mining

  • Pool labels (from source_pool_csv): <list, or "pool missing">
  • Labels accepted from input: <list>
  • Total rows routed: <N_mn>
  • Per-label row counts: <breakdown>

4.2 AnomalyGen (Cosmos SDG)

  • Eligible labels (configured): PASS, EXCESS_SOLDER, MISSING, BRIDGE
  • Labels accepted from input: <list>
  • Total rows routed: <N_ag>
  • Per-label row counts: <breakdown>

5. Dropped Labels (routed to NEITHER module)

| Label | Count | Why dropped | Suggested fix |
|---|---|---|---|

(An empty table is OK and means no labels were dropped. If non-empty, every row needs a "why", typically one of: "not in source pool AND not in AnomalyGen supported set", "source pool missing entirely AND label not in AnomalyGen set", "label name doesn't match any module's expected canonicalization".)

6. Recommended Actions

  1. If any labels are dropped: seed the source pool with that label, OR extend ANOMALYGEN_SUPPORTED_LABELS (and the AnomalyGen generator coverage).
  2. If source pool is missing: provide source_pool_csv to enable the Mining branch. Without it, the Mining half of the augmentation pipeline cannot run.
  3. If AnomalyGen subset is empty: gap analysis only surfaced labels AnomalyGen cannot generate; rely on Mining for this iteration, or extend the AnomalyGen integration.
  4. If both subsets are empty: stop the SDA iteration. Nothing downstream can run.

---

Execution Order

  1. Run date +%Y-%m-%d_%H%M%S to get the timestamp; create <output_dir>/routing_results/<timestamp>/.
  2. Run the Python recipe (Steps 1–4) to produce mining_gaps.parquet, anomalygen_gaps.parquet, and routing_summary.txt. Print summary stats to stdout so the script-check hook can verify it ran.
  3. Build the per-label decision table by reading both parquets and computing the routed-to verdict per label.
  4. Write Routing_Report.md last: writing it triggers the packaging hook, which copies session logs and skill config alongside.