knowledge-graph-construction
Knowledge Graph Construction
What Is It?
This skill helps you design and build knowledge graphs from unstructured or semi-structured data sources. Given a domain and data corpus, it guides you through data model selection, schema design, entity/relation extraction pipelines, and layered architecture construction.
The payoff: Well-constructed knowledge graphs provide structured, verified facts that ground LLM reasoning, reduce hallucination, enable explainable retrieval, and support complex multi-hop queries that flat vector search cannot handle.
Workflow
COPY THIS CHECKLIST and work through each step:
KG Construction Progress:
- [ ] Step 1: Identify data sources and domain scope
- [ ] Step 2: Select graph data model
- [ ] Step 3: Design schema and ontology
- [ ] Step 4: Configure extraction pipeline
- [ ] Step 5: Define layered architecture
- [ ] Step 6: Validate and quality-check the graph
Step 1: Identify data sources and domain scope
Catalog the input data: document types (papers, clinical notes, web pages, logs), volume, update frequency, and language. Define the domain boundary -- what entity types and relation types matter for the target use case. Determine whether the KG will serve RAG retrieval, reasoning/inference, analytics, or a combination. This scoping step prevents over-extraction and keeps the schema focused.
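The scoping step can be made concrete as a small spec object that later pipeline stages consult. This is a minimal sketch; the field names and example values are illustrative, not a fixed format.

```python
from dataclasses import dataclass, field

@dataclass
class KGScope:
    """Scoping document produced in Step 1 (all names are illustrative)."""
    domain: str
    use_cases: list        # e.g. ["rag", "reasoning", "analytics"]
    entity_types: list     # entity classes inside the domain boundary
    relation_types: list   # relation classes inside the domain boundary
    sources: dict = field(default_factory=dict)  # name -> {type, volume, update_freq}

    def in_scope(self, entity_type: str) -> bool:
        # Anything outside the declared boundary is dropped downstream,
        # which is what prevents over-extraction.
        return entity_type in self.entity_types

scope = KGScope(
    domain="oncology literature",
    use_cases=["rag"],
    entity_types=["Drug", "Disease", "Gene"],
    relation_types=["TREATS", "TARGETS"],
    sources={"pubmed_abstracts": {"type": "papers", "volume": 50000, "update_freq": "weekly"}},
)
```

Keeping the boundary in one place means extraction prompts, schema checks, and validation can all reference the same list of in-scope types.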
Step 2: Select graph data model
Choose the underlying data model using the Architecture Selection Guide. Key trade-offs: LPG for flexibility and rapid prototyping, RDF/OWL for standards-based interoperability and inference, Hypergraphs for complex N-ary relations, Temporal Graphs for time-evolving knowledge. Consider query language, tooling maturity, and vector integration needs. For detailed model comparisons, see Data Models Reference.
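The "By Domain" recommendations later in this guide can be encoded as a simple lookup with a reasoning-driven fallback. A sketch, assuming the default rule "formal reasoning implies RDF/OWL, otherwise LPG" — that fallback is an assumption, not part of the guide's tables.

```python
# Lookup mirroring the "By Domain" table in the Architecture Selection Guide.
DOMAIN_MODEL = {
    "biomedical": "RDF/OWL",
    "enterprise_rag": "LPG",
    "event_centric": "Hypergraph or Temporal",
    "legal": "RDF/OWL",
    "scientific_literature": "LPG + Layered",
}

def recommend_model(domain: str, needs_formal_reasoning: bool = False) -> str:
    """Return the table's recommendation, or fall back on the reasoning trade-off."""
    if domain in DOMAIN_MODEL:
        return DOMAIN_MODEL[domain]
    # Assumed default: OWL-DL reasoning requires RDF/OWL; otherwise prefer
    # LPG for its flexibility and native vector integration.
    return "RDF/OWL" if needs_formal_reasoning else "LPG"
```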
Step 3: Design schema and ontology
Define node types (entity classes), edge types (relation classes), and property schemas. Apply patterns from Schema Patterns: entity-relation for simple domains, event reification for N-ary relations, layered tiers for multi-source integration. Decide on controlled vocabularies, cardinality constraints, and whether to adopt or extend an existing ontology (e.g., Schema.org, UMLS, SNOMED). For methodology details, see Methodology Reference.
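A schema definition of this kind can be captured as a small registry that doubles as the conformance check used in Step 6. A minimal sketch; the two node types are taken from the pattern examples below, and the property-typing convention is an assumption.

```python
# Illustrative schema registry: node/edge types with declared property types.
SCHEMA = {
    "nodes": {
        "Drug": {"name": str, "class": str},
        "Disease": {"name": str, "icd_code": str},
    },
    "edges": {
        # relation -> (source node type, target node type)
        "TREATS": ("Drug", "Disease"),
    },
}

def conforms(node_type: str, props: dict) -> bool:
    """True if the node type is declared and every declared property
    is present with the declared type."""
    declared = SCHEMA["nodes"].get(node_type)
    if declared is None:
        return False
    return all(k in props and isinstance(props[k], t) for k, t in declared.items())
```

Encoding the schema as data (rather than prose) lets the extraction pipeline enforce it mechanically during post-processing.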
Step 4: Configure extraction pipeline
Build the pipeline that populates the graph. Core components: LLM-assisted entity extraction with multi-round verification, relation extraction via prompt-based or dependency-parsing methods, entity normalization (synonym merging, ontology linking), and schema enforcement through post-processing validation. Use few-shot examples in prompts to improve extraction consistency. Include a second-pass LLM verification to catch missed entities. For full pipeline design, see Methodology Reference.
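The two-pass extraction described above can be sketched as follows. The `llm` argument is any callable mapping a prompt string to a response string — the prompt wording and the JSON triple format are assumptions for illustration, not a fixed API.

```python
import json

# Prompt template for the first extraction pass (wording is an assumption).
EXTRACT_PROMPT = (
    "Extract (subject, relation, object) triples from the text below as a "
    "JSON list of 3-element lists. Text:\n{text}"
)

def extract_triples(text: str, llm) -> list:
    """Two-pass LLM extraction: extract, then ask the model to verify and
    complete its own output, catching missed entities."""
    first_pass = json.loads(llm(EXTRACT_PROMPT.format(text=text)))
    verify_prompt = (
        EXTRACT_PROMPT.format(text=text)
        + f"\nAlready extracted: {json.dumps(first_pass)}. "
        + "Return the complete corrected list."
    )
    second_pass = json.loads(llm(verify_prompt))
    # Deduplicate while preserving order.
    seen, result = set(), []
    for triple in second_pass:
        key = tuple(triple)
        if key not in seen:
            seen.add(key)
            result.append(triple)
    return result
```

Because the model is injected as a callable, the pipeline can be unit-tested with a stub and swapped between providers without code changes.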
Step 5: Define layered architecture
Structure the KG into tiers for maintainability and trust. A common pattern: Layer 1 (instance data) holds user-specific or case-specific entities and relations; Layer 2 (domain knowledge) holds curated facts from literature or domain experts; Layer 3 (canonical ontology) holds the formal schema and upper ontology. Add provenance and evidence layering so every fact traces back to its source document, extraction method, and confidence score. Temporal subgraphs capture time-indexed state for domains where knowledge evolves.
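A fact record in this layered layout might look like the following sketch, with provenance fields attached directly to each fact. Layer numbers follow the convention above (1 = instance, 2 = domain, 3 = ontology); the field names are illustrative.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Fact:
    """A provenance-carrying fact in a three-tier knowledge graph."""
    subject: str
    relation: str
    obj: str
    layer: int          # 1 = instance, 2 = domain, 3 = canonical ontology
    source_doc: str     # provenance: originating document
    method: str         # e.g. "llm-extraction", "expert-curation"
    confidence: float   # extraction confidence in [0.0, 1.0]

def facts_at_or_above(facts, min_layer: int):
    """Query only the more trusted tiers, e.g. min_layer=2 skips raw instance data."""
    return [f for f in facts if f.layer >= min_layer]
```

Carrying provenance on every fact is what lets a downstream RAG application weight answers by trust tier and cite the source document for each retrieved triple.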
Step 6: Validate and quality-check the graph
Run validation at multiple levels: schema conformance (do all nodes and edges match declared types?), coverage (are expected entity types populated?), consistency (no contradictory edges), and completeness (sample-based human review). Use a second LLM as a validator to fact-check extracted triples against source documents. Compute graph statistics (node degree distribution, connected components, orphan nodes) to identify extraction gaps. Quality criteria are defined in Quality Rubric.
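The graph statistics mentioned above can be computed without a graph library. A plain-Python sketch that reports degree per node, orphan nodes, and connected components (treating edges as undirected for component counting):

```python
from collections import defaultdict, deque

def graph_stats(nodes, edges):
    """Compute degree distribution, orphan nodes, and connected-component
    count for a graph given as node ids and (src, dst) edge pairs."""
    adj = defaultdict(set)
    for src, dst in edges:
        adj[src].add(dst)
        adj[dst].add(src)
    degrees = {n: len(adj[n]) for n in nodes}
    orphans = [n for n in nodes if degrees[n] == 0]
    # BFS over undirected adjacency to count connected components.
    seen, components = set(), 0
    for n in nodes:
        if n in seen:
            continue
        components += 1
        queue = deque([n])
        seen.add(n)
        while queue:
            cur = queue.popleft()
            for nxt in adj[cur]:
                if nxt not in seen:
                    seen.add(nxt)
                    queue.append(nxt)
    return {"degrees": degrees, "orphans": orphans, "components": components}
```

A spike in orphan nodes or a fragmented component count is usually the first visible symptom of an extraction gap.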
Architecture Selection Guide
By Use Case
| Model | Flexibility | Standardization | Reasoning | Vector Integration | Query Language | Best For |
|---|---|---|---|---|---|---|
| LPG | High | Low | Limited | Native (Neo4j) | Cypher, Gremlin | Rapid development, RAG pipelines |
| RDF/OWL | Medium | High | Full (OWL-DL) | Via extensions | SPARQL | Interoperability, ontology-heavy domains |
| Hypergraph | High | Low | Limited | Custom | Custom APIs | N-ary relations, multi-entity events |
| Temporal | Medium | Low | Time-based | Via extensions | Temporal Cypher | Evolving knowledge, episodic memory |
By Domain
| Domain | Recommended Model | Rationale |
|---|---|---|
| Biomedical / Clinical | RDF/OWL | UMLS/SNOMED ontologies, reasoning needed |
| Enterprise / RAG | LPG | Fast iteration, vector search integration |
| Event-centric (news, logs) | Hypergraph or Temporal | Multi-participant events, time evolution |
| Legal / Compliance | RDF/OWL | Formal reasoning, provenance chains |
| Scientific Literature | LPG + Layered | Flexible extraction, layered trust |
Schema Patterns
Entity-Relation Pattern
The simplest pattern. Nodes represent entities, edges represent binary relations. Properties on nodes hold attributes; properties on edges hold relation metadata (confidence, source, timestamp).
(:Person {name, role}) -[:WORKS_AT {since}]-> (:Organization {name, type})
(:Drug {name, class}) -[:TREATS {efficacy}]-> (:Disease {name, icd_code})
Best for: domains with primarily binary relationships and moderate complexity.
Event Reification Pattern
Model N-ary relations and complex events as first-class nodes. An event node connects to all participants via typed role edges. This avoids information loss from forcing N-ary relations into binary edges.
(:ClinicalTrial {id, phase, start_date})
-[:HAS_DRUG]-> (:Drug {name})
-[:HAS_CONDITION]-> (:Disease {name})
-[:HAS_OUTCOME]-> (:Outcome {measure, value})
-[:CONDUCTED_BY]-> (:Organization {name})
Best for: events with multiple participants, clinical data, news events, financial transactions.
Layered Tier Pattern
Separate the graph into trust-differentiated layers that can be queried independently or together.
Layer 3 (Canonical Ontology): Formal class hierarchy, relation definitions, constraints
Layer 2 (Domain Knowledge): Curated facts from literature, expert-validated
Layer 1 (Instance Data): Extracted from user documents, case-specific, lower confidence
Cross-layer edges link instances to domain concepts and domain concepts to ontology classes. Provenance metadata on every edge records: source document, extraction method, confidence score, and timestamp.
Best for: multi-source integration, RAG with trust scoring, enterprise knowledge management.
Output Template
KNOWLEDGE GRAPH CONSTRUCTION SPECIFICATION
============================================
Domain: [Target domain and scope]
Use Case: [RAG / Reasoning / Analytics / Hybrid]
Data Sources: [List of input data types and volumes]
Data Model: [LPG / RDF / Hypergraph / Temporal]
Query Language: [Cypher / SPARQL / Gremlin / Custom]
Storage Backend: [Neo4j / Amazon Neptune / Virtuoso / etc.]
Schema Definition:
Node Types:
1. [EntityType] - [description]
Properties: [list with types]
2. [EntityType] - [description]
Properties: [list with types]
3. [Continue for each node type...]
Edge Types:
1. [RelationType] (source -> target) - [description]
Properties: [list with types]
2. [Continue for each edge type...]
Constraints:
- [Cardinality, uniqueness, required properties]
Extraction Pipeline:
1. Entity Extraction
- Method: [LLM-assisted / NER / Hybrid]
- Prompt template: [summary or reference]
- Verification: [Multi-round / Second-LLM / Manual sample]
2. Relation Extraction
- Method: [Prompt-based / Dependency parsing / Hybrid]
- Few-shot examples: [count and source]
3. Normalization
- Deduplication: [method]
- Ontology linking: [target ontology]
- Synonym resolution: [approach]
Layered Architecture:
Layer 1 (Instance): [description of instance-level data]
Layer 2 (Domain): [description of curated domain knowledge]
Layer 3 (Ontology): [description of formal schema]
Provenance: [How source/confidence/timestamp are tracked]
Validation Plan:
- Schema conformance: [automated checks]
- Coverage: [expected entity/relation counts]
- Consistency: [contradiction detection method]
- Human review: [sampling strategy]
Estimated Scale: [node count, edge count, properties per node]
Key Dependencies: [libraries, APIs, ontologies]
NEXT STEPS:
- Implement extraction pipeline on sample data
- Populate graph and run validation suite
- Iterate schema based on extraction results
- Integrate with downstream application (RAG, reasoning, etc.)