algo-nlp-ner


Named Entity Recognition

Overview

NER identifies and classifies named entities in text into predefined categories (Person, Organization, Location, Date, Money, etc.). Approaches: rule-based (regex, gazetteers), statistical (CRF), and neural (BiLSTM-CRF, transformer-based). Modern NER typically uses spaCy or Hugging Face transformer models, reaching F1 scores of 85-95% on standard benchmarks.
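The rule-based approach above can be sketched with a gazetteer lookup plus a date regex. A minimal illustration, assuming a hand-built gazetteer (the entries below are invented for the example):

```python
import re

# Hypothetical gazetteer mapping surface forms to entity types
GAZETTEER = {"Tim Cook": "PER", "Apple": "ORG", "Taipei": "LOC"}

# Simple "Month DD" date pattern
DATE_RE = re.compile(
    r"\b(?:January|February|March|April|May|June|July|August|"
    r"September|October|November|December)\s+\d{1,2}\b"
)

def rule_based_ner(text):
    """Return (text, type, start, end) tuples from gazetteer + regex matches."""
    entities = []
    for surface, label in GAZETTEER.items():
        for m in re.finditer(re.escape(surface), text):
            entities.append((m.group(), label, m.start(), m.end()))
    for m in DATE_RE.finditer(text):
        entities.append((m.group(), "DATE", m.start(), m.end()))
    return sorted(entities, key=lambda e: e[2])  # order by start offset

ents = rule_based_ner(
    "Tim Cook announced that Apple will open a new store in Taipei on March 15."
)
# -> Tim Cook/PER, Apple/ORG, Taipei/LOC, March 15/DATE
```

Rule-based systems are precise on known vocabularies but miss anything outside them, which is why statistical and neural approaches dominate in practice.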

When to Use


Trigger conditions:
  • Extracting structured entities from unstructured text
  • Building knowledge graphs from documents
  • Preprocessing for information retrieval or question answering
When NOT to use:
  • For text classification (categorizing whole documents, not extracting entities)
  • For relation extraction between entities (need additional RE model)

Algorithm


IRON LAW: NER Performance Depends on DOMAIN Match
A model trained on news text (OntoNotes) performs poorly on medical
records or legal documents. Domain-specific entities (drug names,
legal citations, product SKUs) require domain-specific training data
or fine-tuning. Always evaluate on YOUR domain's data.

Phase 1: Input Validation


Determine: target entity types (standard: PER, ORG, LOC, DATE, MONEY or custom), input language, domain. Select appropriate pre-trained model or prepare training data. Gate: Entity types defined, model or training data available.

Phase 2: Core Algorithm


Pre-trained model approach:
  1. Load model (spaCy, Hugging Face NER pipeline)
  2. Process text through the pipeline
  3. Extract entity spans with type labels and confidence scores
Fine-tuning approach:
  1. Annotate 200+ domain-specific examples in BIO format
  2. Fine-tune transformer model (BERT, RoBERTa) on annotated data
  3. Evaluate on held-out test set
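The BIO format in step 1 tags each token as B(eginning), I(nside), or O(utside) an entity. A minimal sketch that converts character-level spans to token-level BIO tags, assuming whitespace tokenization for illustration:

```python
def spans_to_bio(text, spans):
    """spans: list of (start, end, label) character offsets; whitespace tokens."""
    tokens, tags = [], []
    pos = 0
    for tok in text.split():
        start = text.index(tok, pos)  # locate token at/after current position
        end = start + len(tok)
        pos = end
        tag = "O"
        for s, e, label in spans:
            if start >= s and end <= e:  # token lies inside an entity span
                tag = ("B-" if start == s else "I-") + label
                break
        tokens.append(tok)
        tags.append(tag)
    return tokens, tags

text = "Tim Cook announced that Apple will open a store in Taipei"
spans = [(0, 8, "PER"), (24, 29, "ORG"), (51, 57, "LOC")]
tokens, tags = spans_to_bio(text, spans)
# tags -> ['B-PER', 'I-PER', 'O', 'O', 'B-ORG', 'O', 'O', 'O', 'O', 'O', 'B-LOC']
```

Real fine-tuning pipelines must additionally align these word-level tags to subword tokens (e.g. WordPiece), typically labeling only each word's first subword.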

Phase 3: Verification


Evaluate: precision, recall, F1 per entity type. Check: boundary detection (exact span match) and type classification accuracy. Gate: F1 > 0.80 per entity type on domain-relevant test data.
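The per-type metrics in this phase can be computed by exact span matching over (start, end, type) triples. A minimal sketch (production evaluations typically use a library such as seqeval):

```python
def per_type_prf(gold, pred):
    """gold, pred: sets of (start, end, type) triples; exact-match P/R/F1 per type."""
    scores = {}
    for t in sorted({t for *_, t in gold | pred}):
        g = {e for e in gold if e[2] == t}
        p = {e for e in pred if e[2] == t}
        tp = len(g & p)  # exact match: boundaries AND type must agree
        precision = tp / len(p) if p else 0.0
        recall = tp / len(g) if g else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        scores[t] = {"precision": precision, "recall": recall, "f1": f1}
    return scores

gold = {(0, 8, "PER"), (24, 29, "ORG"), (51, 57, "LOC")}
pred = {(0, 8, "PER"), (24, 29, "ORG"), (51, 55, "LOC")}  # LOC boundary error
scores = per_type_prf(gold, pred)
# The LOC boundary error counts as both a false positive and a false negative
```

Note that under exact match a boundary error costs both precision and recall, which is why partial-match metrics are worth tracking alongside it.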

Phase 4: Output


Return extracted entities with types, positions, and confidence.

Output Format


```json
{
  "entities": [{"text": "Apple Inc.", "type": "ORG", "start": 0, "end": 10, "confidence": 0.95}],
  "metadata": {"model": "en_core_web_trf", "entities_found": 15, "types": {"PER": 5, "ORG": 6, "LOC": 4}}
}
```
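Assembling this format from raw entity tuples is mechanical; a small sketch (field names follow the format above, and the metadata counts are derived rather than hand-written):

```python
import json
from collections import Counter

def to_output(entities, model_name):
    """entities: list of (text, type, start, end, confidence) tuples."""
    ents = [
        {"text": t, "type": ty, "start": s, "end": e, "confidence": c}
        for t, ty, s, e, c in entities
    ]
    return {
        "entities": ents,
        "metadata": {
            "model": model_name,
            "entities_found": len(ents),
            "types": dict(Counter(e["type"] for e in ents)),  # per-type counts
        },
    }

out = to_output([("Apple Inc.", "ORG", 0, 10, 0.95)], "en_core_web_trf")
print(json.dumps(out, indent=2))
```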

Examples


Sample I/O


Input: "Tim Cook announced that Apple will open a new store in Taipei on March 15." Expected: [Tim Cook/PER, Apple/ORG, Taipei/LOC, March 15/DATE]

Edge Cases


| Input | Expected | Why |
|---|---|---|
| "Apple" (no context) | Ambiguous (fruit or company) | Context-dependent entity typing |
| Nested entities | Depends on scheme | "Bank of America" = ORG, "America" = LOC within |
| Misspelled entity | May miss | "Appel" not in training data |

Gotchas


  • Boundary errors: NER often gets the entity type right but the span wrong ("New" vs "New York City"). Evaluate with both exact and partial match metrics.
  • Ambiguity: "Jordan" can be a person, country, or brand. Context-dependent disambiguation is hard; some models output the most likely type.
  • Chinese/Japanese NER: No whitespace tokenization makes boundary detection harder. Use language-specific tokenizers (jieba for Chinese).
  • Annotation consistency: Training data quality is critical. Inconsistent annotations (sometimes labeling "Dr." as part of name, sometimes not) degrade model performance.
  • Entity linking: NER identifies mentions; entity linking resolves them to knowledge base entries. "Apple" → Apple Inc. (Q312) or apple (fruit). These are separate tasks.

References


  • For BIO annotation format and guidelines, see
    references/bio-annotation.md
  • For fine-tuning NER with transformers, see
    references/transformer-ner.md