
Golden Dataset Curation

Curate high-quality documents for the golden dataset with multi-agent validation

Overview


This skill provides patterns and workflows for adding new documents to the golden dataset with thorough quality analysis. It complements golden-dataset-management, which handles backup/restore.
When to use this skill:
  • Adding new documents to the golden dataset
  • Classifying content types and difficulty levels
  • Generating test queries for new documents
  • Running multi-agent quality analysis

Content Types

| Type | Description | Quality Focus |
|------|-------------|---------------|
| article | Technical articles, blog posts | Depth, accuracy, actionability |
| tutorial | Step-by-step guides | Completeness, clarity, code quality |
| research_paper | Academic papers, whitepapers | Rigor, citations, methodology |
| documentation | API docs, reference materials | Accuracy, completeness, examples |
| video_transcript | Transcribed video content | Structure, coherence, key points |
| code_repository | README, code analysis | Code quality, documentation |

Difficulty Levels

| Level | Semantic Complexity | Expected Score | Characteristics |
|-------|---------------------|----------------|-----------------|
| trivial | Direct keyword match | >0.85 | Technical terms, exact phrases |
| easy | Common synonyms | >0.70 | Well-known concepts, slight variations |
| medium | Paraphrased intent | >0.55 | Conceptual queries, multi-topic |
| hard | Multi-hop reasoning | >0.40 | Cross-domain, comparative analysis |
| adversarial | Edge cases | Graceful degradation | Robustness tests, off-domain |
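For illustration, one query per level might look like this for a hypothetical document on vector-database indexing. The queries below are invented examples, not entries from the dataset:

```python
# Illustrative test queries, one per difficulty level, for a hypothetical
# document about vector-database indexing. None of these come from the dataset.
example_queries = {
    "trivial": "HNSW index pgvector",                                    # exact technical terms
    "easy": "approximate nearest neighbor search in Postgres",           # common synonyms
    "medium": "how do I speed up similarity search over embeddings?",    # paraphrased intent
    "hard": "compare index strategies for recall vs latency trade-offs", # multi-hop, comparative
    "adversarial": "best pizza recipe",                                  # off-domain robustness probe
}
```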

Quality Dimensions

| Dimension | Weight | Perfect | Acceptable | Failing |
|-----------|--------|---------|------------|---------|
| Accuracy | 0.25 | 0.95-1.0 | 0.70-0.94 | <0.70 |
| Coherence | 0.20 | 0.90-1.0 | 0.60-0.89 | <0.60 |
| Depth | 0.25 | 0.90-1.0 | 0.55-0.89 | <0.55 |
| Relevance | 0.30 | 0.95-1.0 | 0.70-0.94 | <0.70 |
Evaluation focuses:
  • Accuracy: Technical correctness, code validity, up-to-date info
  • Coherence: Logical structure, clear flow, consistent terminology
  • Depth: Comprehensive coverage, edge cases, appropriate detail
  • Relevance: Alignment with AI/ML, backend, frontend, DevOps domains
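As a sketch, the weighted aggregate implied by the table can be computed as follows (the function name and rounding are illustrative, not part of the pipeline):

```python
# Dimension weights from the table above.
WEIGHTS = {"accuracy": 0.25, "coherence": 0.20, "depth": 0.25, "relevance": 0.30}

def quality_total(scores: dict) -> float:
    """Aggregate per-dimension scores into a single weighted total."""
    return round(sum(WEIGHTS[dim] * scores[dim] for dim in WEIGHTS), 4)

print(quality_total({"accuracy": 0.85, "coherence": 0.90, "depth": 0.78, "relevance": 0.92}))
# 0.8635
```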

Multi-Agent Pipeline

INPUT: URL/Content
        |
        v
+------------------+
|   FETCH AGENT    |  Extract structure, detect type
+--------+---------+
         |
         v
+-----------------------------------------------+
|  PARALLEL ANALYSIS AGENTS                     |
|  Quality | Difficulty | Domain | Query Gen    |
+-----------------------------------------------+
         |
         v
+------------------+
| CONSENSUS        |  Weighted score + confidence
| AGGREGATOR       |  -> include/review/exclude
+--------+---------+
         |
         v
+------------------+
|  USER APPROVAL   |  Show scores, confirm
+--------+---------+
         |
         v
OUTPUT: Curated document entry
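The parallel analysis stage can be sketched with asyncio, using stub agents in place of the real LLM calls. Every agent function and return value here is illustrative:

```python
import asyncio

# Stub analysis agents; real implementations would call the managed prompts.
async def analyze_quality(doc):    return {"quality": 0.82}
async def analyze_difficulty(doc): return {"difficulty": "medium"}
async def tag_domains(doc):        return {"tags": ["ai-ml", "backend"]}
async def generate_queries(doc):   return {"queries": ["q1", "q2", "q3"]}

async def run_parallel_analysis(doc):
    """Fan out to the four analysis agents concurrently, then merge results."""
    results = await asyncio.gather(
        analyze_quality(doc), analyze_difficulty(doc),
        tag_domains(doc), generate_queries(doc),
    )
    merged = {}
    for partial in results:
        merged.update(partial)
    return merged

print(asyncio.run(run_parallel_analysis({"content": "..."})))
```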

Decision Thresholds


| Quality Score | Confidence | Decision |
|---------------|------------|----------|
| >= 0.75 | >= 0.70 | include |
| >= 0.55 | any | review |
| < 0.55 | any | exclude |
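A minimal sketch of the threshold logic in the table (the function name is illustrative):

```python
def curation_decision(quality: float, confidence: float) -> str:
    """Map an aggregated quality score and confidence to a curation decision."""
    if quality >= 0.75 and confidence >= 0.70:
        return "include"
    if quality >= 0.55:  # confidence may be anything here
        return "review"
    return "exclude"

print(curation_decision(0.80, 0.75))  # include
print(curation_decision(0.80, 0.60))  # review (high quality, low confidence)
print(curation_decision(0.50, 0.95))  # exclude
```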

Quality Thresholds

```yaml
# Recommended thresholds for golden dataset inclusion
minimum_quality_score: 0.70
minimum_confidence: 0.65
required_tags: 2       # At least 2 domain tags
required_queries: 3    # At least 3 test queries
```

Coverage Balance Guidelines


Maintain balanced coverage across:
  • Content types: Don't over-index on articles
  • Difficulty levels: Need trivial AND hard queries
  • Domains: Spread across AI/ML, backend, frontend, etc.
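The balance check above can be sketched as a simple tally. The entry keys ("content_type", "difficulty", "tags") are assumed field names, not a confirmed schema:

```python
from collections import Counter

def coverage_report(docs):
    """Tally content types, difficulty levels, and domain tags across entries."""
    report = {"content_type": Counter(), "difficulty": Counter(), "domain": Counter()}
    for doc in docs:
        report["content_type"][doc["content_type"]] += 1
        report["difficulty"][doc["difficulty"]] += 1
        for tag in doc["tags"]:  # one entry may carry several domain tags
            report["domain"][tag] += 1
    return report
```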
Duplicate Prevention Checklist

Before adding:
  1. Check URL against existing source_url_map.json
  2. Run semantic similarity against existing document embeddings
  3. Warn if >80% similar to existing document
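Step 2 can be sketched with plain cosine similarity. In practice the pgvector-search skill would handle this; the helpers below are illustrative:

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def is_near_duplicate(candidate, existing_embeddings, threshold=0.80):
    """Warn-worthy if any existing document exceeds the similarity threshold."""
    return any(cosine_similarity(candidate, e) > threshold for e in existing_embeddings)
```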
Provenance Tracking

Always record:
  • Source URL (canonical)
  • Curation date
  • Agent scores (for audit trail)
  • Langfuse trace ID
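One way to capture these fields is a small record type. The class and field names are illustrative, not a defined schema:

```python
from dataclasses import dataclass, asdict

@dataclass
class ProvenanceRecord:
    source_url: str        # canonical URL
    curation_date: str     # ISO date, e.g. "2025-12-01"
    agent_scores: dict     # per-dimension scores, kept for the audit trail
    langfuse_trace_id: str

record = ProvenanceRecord(
    source_url="https://example.com/article",
    curation_date="2025-12-01",
    agent_scores={"accuracy": 0.85, "coherence": 0.90},
    langfuse_trace_id="trace-abc123",
)
```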

Langfuse Integration

Trace Structure

```python
trace = langfuse.trace(
    name="golden-dataset-curation",
    metadata={"source_url": url, "document_id": doc_id}
)

# Log individual dimension scores
trace.score(name="accuracy", value=0.85)
trace.score(name="coherence", value=0.90)
trace.score(name="depth", value=0.78)
trace.score(name="relevance", value=0.92)

# Final aggregated score
trace.score(name="quality_total", value=0.87)
trace.event(name="curation_decision", metadata={"decision": "include"})
```

Managed Prompts


| Prompt Name | Purpose |
|-------------|---------|
| golden-content-classifier | Classify content_type |
| golden-difficulty-classifier | Assign difficulty |
| golden-domain-tagger | Extract tags |
| golden-query-generator | Generate test queries |

References

For detailed implementation patterns, see:
  • references/selection-criteria.md - Content type classification, difficulty stratification, quality evaluation dimensions, and best practices
  • references/annotation-patterns.md - Multi-agent pipeline architecture, agent specifications, consensus aggregation logic, and Langfuse integration

Related Skills

  • golden-dataset-management - Backup/restore operations
  • golden-dataset-validation - Validation rules and checks
  • langfuse-observability - Tracing patterns
  • pgvector-search - Duplicate detection
Version: 1.0.0 (December 2025) Issue: #599

Capability Details

content-classification

Keywords: content type, classification, document type, golden dataset
Solves:
  • Classify document content types for golden dataset
  • Categorize entries by domain and purpose
  • Identify content requiring special handling

difficulty-stratification

Keywords: difficulty, stratification, complexity level, challenge rating
Solves:
  • Assign difficulty levels to golden dataset entries
  • Ensure balanced difficulty distribution
  • Identify edge cases and challenging examples

quality-evaluation

Keywords: quality, evaluation, quality dimensions, quality criteria
Solves:
  • Evaluate entry quality against defined criteria
  • Score entries on multiple quality dimensions
  • Identify entries needing improvement

multi-agent-analysis

Keywords: multi-agent, parallel analysis, consensus, agent evaluation
Solves:
  • Run parallel agent evaluations on entries
  • Aggregate consensus from multiple analysts
  • Resolve disagreements in classifications