
Golden Dataset Curation

Curate high-quality documents for the golden dataset with multi-agent validation

Overview


This skill provides patterns and workflows for adding new documents to the golden dataset with thorough quality analysis. It complements golden-dataset-management, which handles backup/restore.
When to use this skill:
  • Adding new documents to the golden dataset
  • Classifying content types and difficulty levels
  • Generating test queries for new documents
  • Running multi-agent quality analysis

Content Types

| Type | Description | Quality Focus |
|------|-------------|---------------|
| article | Technical articles, blog posts | Depth, accuracy, actionability |
| tutorial | Step-by-step guides | Completeness, clarity, code quality |
| research_paper | Academic papers, whitepapers | Rigor, citations, methodology |
| documentation | API docs, reference materials | Accuracy, completeness, examples |
| video_transcript | Transcribed video content | Structure, coherence, key points |
| code_repository | README, code analysis | Code quality, documentation |

Difficulty Levels

| Level | Semantic Complexity | Expected Score | Characteristics |
|-------|---------------------|----------------|-----------------|
| trivial | Direct keyword match | >0.85 | Technical terms, exact phrases |
| easy | Common synonyms | >0.70 | Well-known concepts, slight variations |
| medium | Paraphrased intent | >0.55 | Conceptual queries, multi-topic |
| hard | Multi-hop reasoning | >0.40 | Cross-domain, comparative analysis |
| adversarial | Edge cases | Graceful degradation | Robustness tests, off-domain |
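For illustration, one query per level might look like this for a hypothetical document on vector-database indexing. The queries below are invented examples, not entries from the dataset:

```python
# Illustrative test queries, one per difficulty level, for a hypothetical
# document about vector-database indexing. None of these come from the dataset.
example_queries = {
    "trivial": "HNSW index pgvector",                                    # exact technical terms
    "easy": "approximate nearest neighbor search in Postgres",           # common synonyms
    "medium": "how do I speed up similarity search over embeddings?",    # paraphrased intent
    "hard": "compare index strategies for recall vs latency trade-offs", # multi-hop, comparative
    "adversarial": "best pizza recipe",                                  # off-domain robustness probe
}
```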

Quality Dimensions

| Dimension | Weight | Perfect | Acceptable | Failing |
|-----------|--------|---------|------------|---------|
| Accuracy | 0.25 | 0.95-1.0 | 0.70-0.94 | <0.70 |
| Coherence | 0.20 | 0.90-1.0 | 0.60-0.89 | <0.60 |
| Depth | 0.25 | 0.90-1.0 | 0.55-0.89 | <0.55 |
| Relevance | 0.30 | 0.95-1.0 | 0.70-0.94 | <0.70 |
Evaluation focuses:
  • Accuracy: Technical correctness, code validity, up-to-date info
  • Coherence: Logical structure, clear flow, consistent terminology
  • Depth: Comprehensive coverage, edge cases, appropriate detail
  • Relevance: Alignment with AI/ML, backend, frontend, DevOps domains
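As a sketch, the weighted aggregate implied by the table can be computed as follows (the function name and rounding are illustrative, not part of the pipeline):

```python
# Dimension weights from the table above.
WEIGHTS = {"accuracy": 0.25, "coherence": 0.20, "depth": 0.25, "relevance": 0.30}

def quality_total(scores: dict) -> float:
    """Aggregate per-dimension scores into a single weighted total."""
    return round(sum(WEIGHTS[dim] * scores[dim] for dim in WEIGHTS), 4)

print(quality_total({"accuracy": 0.85, "coherence": 0.90, "depth": 0.78, "relevance": 0.92}))
# 0.8635
```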

Multi-Agent Pipeline

INPUT: URL/Content
        |
        v
+------------------+
|   FETCH AGENT    |  Extract structure, detect type
+--------+---------+
         |
         v
+-----------------------------------------------+
|  PARALLEL ANALYSIS AGENTS                     |
|  Quality | Difficulty | Domain | Query Gen    |
+-----------------------------------------------+
         |
         v
+------------------+
| CONSENSUS        |  Weighted score + confidence
| AGGREGATOR       |  -> include/review/exclude
+--------+---------+
         |
         v
+------------------+
|  USER APPROVAL   |  Show scores, confirm
+--------+---------+
         |
         v
OUTPUT: Curated document entry
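The parallel analysis stage can be sketched with asyncio, using stub agents in place of the real LLM calls. Every agent function and return value here is illustrative:

```python
import asyncio

# Stub analysis agents; real implementations would call the managed prompts.
async def analyze_quality(doc):    return {"quality": 0.82}
async def analyze_difficulty(doc): return {"difficulty": "medium"}
async def tag_domains(doc):        return {"tags": ["ai-ml", "backend"]}
async def generate_queries(doc):   return {"queries": ["q1", "q2", "q3"]}

async def run_parallel_analysis(doc):
    """Fan out to the four analysis agents concurrently, then merge results."""
    results = await asyncio.gather(
        analyze_quality(doc), analyze_difficulty(doc),
        tag_domains(doc), generate_queries(doc),
    )
    merged = {}
    for partial in results:
        merged.update(partial)
    return merged

print(asyncio.run(run_parallel_analysis({"content": "..."})))
```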

Decision Thresholds


| Quality Score | Confidence | Decision |
|---------------|------------|----------|
| >= 0.75 | >= 0.70 | include |
| >= 0.55 | any | review |
| < 0.55 | any | exclude |
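A minimal sketch of the threshold logic in the table (the function name is illustrative):

```python
def curation_decision(quality: float, confidence: float) -> str:
    """Map an aggregated quality score and confidence to a curation decision."""
    if quality >= 0.75 and confidence >= 0.70:
        return "include"
    if quality >= 0.55:  # confidence may be anything here
        return "review"
    return "exclude"

print(curation_decision(0.80, 0.75))  # include
print(curation_decision(0.80, 0.60))  # review (high quality, low confidence)
print(curation_decision(0.50, 0.95))  # exclude
```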

Quality Thresholds

```yaml
# Recommended thresholds for golden dataset inclusion
minimum_quality_score: 0.70
minimum_confidence: 0.65
required_tags: 2       # At least 2 domain tags
required_queries: 3    # At least 3 test queries
```

Coverage Balance Guidelines


Maintain balanced coverage across:
  • Content types: Don't over-index on articles
  • Difficulty levels: Need trivial AND hard queries
  • Domains: Spread across AI/ML, backend, frontend, etc.
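The balance check above can be sketched as a simple tally. The entry keys ("content_type", "difficulty", "tags") are assumed field names, not a confirmed schema:

```python
from collections import Counter

def coverage_report(docs):
    """Tally content types, difficulty levels, and domain tags across entries."""
    report = {"content_type": Counter(), "difficulty": Counter(), "domain": Counter()}
    for doc in docs:
        report["content_type"][doc["content_type"]] += 1
        report["difficulty"][doc["difficulty"]] += 1
        for tag in doc["tags"]:  # one entry may carry several domain tags
            report["domain"][tag] += 1
    return report
```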
Duplicate Prevention Checklist

Before adding:
  1. Check URL against existing source_url_map.json
  2. Run semantic similarity against existing document embeddings
  3. Warn if >80% similar to existing document
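Step 2 can be sketched with plain cosine similarity. In practice the pgvector-search skill would handle this; the helpers below are illustrative:

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def is_near_duplicate(candidate, existing_embeddings, threshold=0.80):
    """Warn-worthy if any existing document exceeds the similarity threshold."""
    return any(cosine_similarity(candidate, e) > threshold for e in existing_embeddings)
```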
Provenance Tracking

Always record:
  • Source URL (canonical)
  • Curation date
  • Agent scores (for audit trail)
  • Langfuse trace ID
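One way to capture these fields is a small record type. The class and field names are illustrative, not a defined schema:

```python
from dataclasses import dataclass, asdict

@dataclass
class ProvenanceRecord:
    source_url: str        # canonical URL
    curation_date: str     # ISO date, e.g. "2025-12-01"
    agent_scores: dict     # per-dimension scores, kept for the audit trail
    langfuse_trace_id: str

record = ProvenanceRecord(
    source_url="https://example.com/article",
    curation_date="2025-12-01",
    agent_scores={"accuracy": 0.85, "coherence": 0.90},
    langfuse_trace_id="trace-abc123",
)
```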

Langfuse Integration

Trace Structure

```python
trace = langfuse.trace(
    name="golden-dataset-curation",
    metadata={"source_url": url, "document_id": doc_id}
)

# Log individual dimension scores
trace.score(name="accuracy", value=0.85)
trace.score(name="coherence", value=0.90)
trace.score(name="depth", value=0.78)
trace.score(name="relevance", value=0.92)

# Final aggregated score
trace.score(name="quality_total", value=0.87)
trace.event(name="curation_decision", metadata={"decision": "include"})
```

Managed Prompts


| Prompt Name | Purpose |
|-------------|---------|
| golden-content-classifier | Classify content_type |
| golden-difficulty-classifier | Assign difficulty |
| golden-domain-tagger | Extract tags |
| golden-query-generator | Generate test queries |

References

For detailed implementation patterns, see:
  • references/selection-criteria.md - Content type classification, difficulty stratification, quality evaluation dimensions, and best practices
  • references/annotation-patterns.md - Multi-agent pipeline architecture, agent specifications, consensus aggregation logic, and Langfuse integration

Related Skills

  • golden-dataset-management - Backup/restore operations
  • golden-dataset-validation - Validation rules and checks
  • langfuse-observability - Tracing patterns
  • pgvector-search - Duplicate detection
Version: 1.0.0 (December 2025) Issue: #599

Capability Details

content-classification

Keywords: content type, classification, document type, golden dataset
Solves:
  • Classify document content types for golden dataset
  • Categorize entries by domain and purpose
  • Identify content requiring special handling

difficulty-stratification

Keywords: difficulty, stratification, complexity level, challenge rating
Solves:
  • Assign difficulty levels to golden dataset entries
  • Ensure balanced difficulty distribution
  • Identify edge cases and challenging examples

quality-evaluation

Keywords: quality, evaluation, quality dimensions, quality criteria
Solves:
  • Evaluate entry quality against defined criteria
  • Score entries on multiple quality dimensions
  • Identify entries needing improvement

multi-agent-analysis

Keywords: multi-agent, parallel analysis, consensus, agent evaluation
Solves:
  • Run parallel agent evaluations on entries
  • Aggregate consensus from multiple analysts
  • Resolve disagreements in classifications