algo-nlp-ner


Named Entity Recognition

Overview

NER identifies and classifies named entities in text into predefined categories (Person, Organization, Location, Date, Money, etc.). Approaches: rule-based (regex, gazetteers), statistical (CRF), and neural (BiLSTM-CRF, transformer-based). Modern NER typically uses spaCy or Hugging Face transformer models, reaching F1 scores of 85-95% on standard benchmarks.
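The rule-based approach above can be sketched with a gazetteer lookup plus a date regex. A minimal illustration, assuming a hand-built gazetteer (the entries below are invented for the example):

```python
import re

# Hypothetical gazetteer mapping surface forms to entity types
GAZETTEER = {"Tim Cook": "PER", "Apple": "ORG", "Taipei": "LOC"}

# Simple "Month DD" date pattern
DATE_RE = re.compile(
    r"\b(?:January|February|March|April|May|June|July|August|"
    r"September|October|November|December)\s+\d{1,2}\b"
)

def rule_based_ner(text):
    """Return (text, type, start, end) tuples from gazetteer + regex matches."""
    entities = []
    for surface, label in GAZETTEER.items():
        for m in re.finditer(re.escape(surface), text):
            entities.append((m.group(), label, m.start(), m.end()))
    for m in DATE_RE.finditer(text):
        entities.append((m.group(), "DATE", m.start(), m.end()))
    return sorted(entities, key=lambda e: e[2])  # order by start offset

ents = rule_based_ner(
    "Tim Cook announced that Apple will open a new store in Taipei on March 15."
)
# -> Tim Cook/PER, Apple/ORG, Taipei/LOC, March 15/DATE
```

Rule-based systems are precise on known vocabularies but miss anything outside them, which is why statistical and neural approaches dominate in practice.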

When to Use


Trigger conditions:
  • Extracting structured entities from unstructured text
  • Building knowledge graphs from documents
  • Preprocessing for information retrieval or question answering
When NOT to use:
  • For text classification (categorizing whole documents, not extracting entities)
  • For relation extraction between entities (need additional RE model)

Algorithm


IRON LAW: NER Performance Depends on DOMAIN Match
A model trained on news text (OntoNotes) performs poorly on medical
records or legal documents. Domain-specific entities (drug names,
legal citations, product SKUs) require domain-specific training data
or fine-tuning. Always evaluate on YOUR domain's data.

Phase 1: Input Validation


Determine: target entity types (standard: PER, ORG, LOC, DATE, MONEY or custom), input language, domain. Select appropriate pre-trained model or prepare training data. Gate: Entity types defined, model or training data available.

Phase 2: Core Algorithm


Pre-trained model approach:
  1. Load model (spaCy, Hugging Face NER pipeline)
  2. Process text through the pipeline
  3. Extract entity spans with type labels and confidence scores
Fine-tuning approach:
  1. Annotate 200+ domain-specific examples in BIO format
  2. Fine-tune transformer model (BERT, RoBERTa) on annotated data
  3. Evaluate on held-out test set
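The BIO format in step 1 tags each token as B(eginning), I(nside), or O(utside) an entity. A minimal sketch that converts character-level spans to token-level BIO tags, assuming whitespace tokenization for illustration:

```python
def spans_to_bio(text, spans):
    """spans: list of (start, end, label) character offsets; whitespace tokens."""
    tokens, tags = [], []
    pos = 0
    for tok in text.split():
        start = text.index(tok, pos)  # locate token at/after current position
        end = start + len(tok)
        pos = end
        tag = "O"
        for s, e, label in spans:
            if start >= s and end <= e:  # token lies inside an entity span
                tag = ("B-" if start == s else "I-") + label
                break
        tokens.append(tok)
        tags.append(tag)
    return tokens, tags

text = "Tim Cook announced that Apple will open a store in Taipei"
spans = [(0, 8, "PER"), (24, 29, "ORG"), (51, 57, "LOC")]
tokens, tags = spans_to_bio(text, spans)
# tags -> ['B-PER', 'I-PER', 'O', 'O', 'B-ORG', 'O', 'O', 'O', 'O', 'O', 'B-LOC']
```

Real fine-tuning pipelines must additionally align these word-level tags to subword tokens (e.g. WordPiece), typically labeling only each word's first subword.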

Phase 3: Verification


Evaluate: precision, recall, F1 per entity type. Check: boundary detection (exact span match) and type classification accuracy. Gate: F1 > 0.80 per entity type on domain-relevant test data.
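The per-type metrics in this phase can be computed by exact span matching over (start, end, type) triples. A minimal sketch (production evaluations typically use a library such as seqeval):

```python
def per_type_prf(gold, pred):
    """gold, pred: sets of (start, end, type) triples; exact-match P/R/F1 per type."""
    scores = {}
    for t in sorted({t for *_, t in gold | pred}):
        g = {e for e in gold if e[2] == t}
        p = {e for e in pred if e[2] == t}
        tp = len(g & p)  # exact match: boundaries AND type must agree
        precision = tp / len(p) if p else 0.0
        recall = tp / len(g) if g else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        scores[t] = {"precision": precision, "recall": recall, "f1": f1}
    return scores

gold = {(0, 8, "PER"), (24, 29, "ORG"), (51, 57, "LOC")}
pred = {(0, 8, "PER"), (24, 29, "ORG"), (51, 55, "LOC")}  # LOC boundary error
scores = per_type_prf(gold, pred)
# The LOC boundary error counts as both a false positive and a false negative
```

Note that under exact match a boundary error costs both precision and recall, which is why partial-match metrics are worth tracking alongside it.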

Phase 4: Output


Return extracted entities with types, positions, and confidence.

Output Format


```json
{
  "entities": [{"text": "Apple Inc.", "type": "ORG", "start": 0, "end": 10, "confidence": 0.95}],
  "metadata": {"model": "en_core_web_trf", "entities_found": 15, "types": {"PER": 5, "ORG": 6, "LOC": 4}}
}
```
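Assembling this format from raw entity tuples is mechanical; a small sketch (field names follow the format above, and the metadata counts are derived rather than hand-written):

```python
import json
from collections import Counter

def to_output(entities, model_name):
    """entities: list of (text, type, start, end, confidence) tuples."""
    ents = [
        {"text": t, "type": ty, "start": s, "end": e, "confidence": c}
        for t, ty, s, e, c in entities
    ]
    return {
        "entities": ents,
        "metadata": {
            "model": model_name,
            "entities_found": len(ents),
            "types": dict(Counter(e["type"] for e in ents)),  # per-type counts
        },
    }

out = to_output([("Apple Inc.", "ORG", 0, 10, 0.95)], "en_core_web_trf")
print(json.dumps(out, indent=2))
```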

Examples


Sample I/O


Input: "Tim Cook announced that Apple will open a new store in Taipei on March 15." Expected: [Tim Cook/PER, Apple/ORG, Taipei/LOC, March 15/DATE]

Edge Cases


| Input | Expected | Why |
|---|---|---|
| "Apple" (no context) | Ambiguous (fruit or company) | Context-dependent entity typing |
| Nested entities | Depends on scheme | "Bank of America" = ORG, "America" = LOC within |
| Misspelled entity | May miss | "Appel" not in training data |

Gotchas


  • Boundary errors: NER often gets the entity type right but the span wrong ("New" vs "New York City"). Evaluate with both exact and partial match metrics.
  • Ambiguity: "Jordan" can be a person, country, or brand. Context-dependent disambiguation is hard; some models output the most likely type.
  • Chinese/Japanese NER: No whitespace tokenization makes boundary detection harder. Use language-specific tokenizers (jieba for Chinese).
  • Annotation consistency: Training data quality is critical. Inconsistent annotations (sometimes labeling "Dr." as part of name, sometimes not) degrade model performance.
  • Entity linking: NER identifies mentions; entity linking resolves them to knowledge base entries. "Apple" → Apple Inc. (Q312) or apple (fruit). These are separate tasks.

References


  • For BIO annotation format and guidelines, see
    references/bio-annotation.md
  • For fine-tuning NER with transformers, see
    references/transformer-ner.md