algo-nlp-ner
Named Entity Recognition
Overview
NER identifies and classifies named entities in text into predefined categories (Person, Organization, Location, Date, Money, etc.). Approaches: rule-based (regex, gazetteers), statistical (CRF), neural (BiLSTM-CRF, transformer-based). Modern NER typically uses spaCy or Hugging Face models, reaching F1 scores of 85-95% on standard benchmarks.
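The rule-based approach can be illustrated with a minimal sketch combining a small gazetteer lookup with a date regex (the entity lists and patterns here are illustrative, not a production system):

```python
import re

# Toy gazetteer; a real system would load curated lists per type.
GAZETTEER = {
    "Tim Cook": "PER",
    "Apple": "ORG",
    "Taipei": "LOC",
}
DATE_RE = re.compile(r"\b(?:January|February|March|April|May|June|July|"
                     r"August|September|October|November|December) \d{1,2}\b")

def rule_based_ner(text):
    """Return (text, type, start, end) tuples via gazetteer + regex matching."""
    entities = []
    for name, etype in GAZETTEER.items():
        for m in re.finditer(re.escape(name), text):
            entities.append((m.group(), etype, m.start(), m.end()))
    for m in DATE_RE.finditer(text):
        entities.append((m.group(), "DATE", m.start(), m.end()))
    return sorted(entities, key=lambda e: e[2])

print(rule_based_ner("Tim Cook announced that Apple will open a store in Taipei on March 15."))
```

Rule-based systems are brittle (no generalization beyond the lists and patterns) but fully interpretable, which is why they survive as a baseline or as a post-processing layer on top of neural models.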
When to Use
Trigger conditions:
- Extracting structured entities from unstructured text
- Building knowledge graphs from documents
- Preprocessing for information retrieval or question answering
When NOT to use:
- For text classification (categorizing whole documents, not extracting entities)
- For relation extraction between entities (need additional RE model)
Algorithm
IRON LAW: NER Performance Depends on DOMAIN Match
A model trained on news text (OntoNotes) performs poorly on medical
records or legal documents. Domain-specific entities (drug names,
legal citations, product SKUs) require domain-specific training data
or fine-tuning. Always evaluate on YOUR domain's data.
Phase 1: Input Validation
Determine: target entity types (standard: PER, ORG, LOC, DATE, MONEY or custom), input language, domain. Select appropriate pre-trained model or prepare training data.
Gate: Entity types defined, model or training data available.
Phase 2: Core Algorithm
Pre-trained model approach:
- Load model (spaCy, Hugging Face NER pipeline)
- Process text through the pipeline
- Extract entity spans with type labels and confidence scores
Fine-tuning approach:
- Annotate 200+ domain-specific examples in BIO format
- Fine-tune transformer model (BERT, RoBERTa) on annotated data
- Evaluate on held-out test set
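The BIO annotation step can be made concrete with a small helper that converts character-level span annotations into per-token B-/I-/O tags (tokenization here is naive whitespace splitting, for illustration only):

```python
def spans_to_bio(text, spans):
    """Convert (start, end, type) character spans to BIO tags over whitespace tokens."""
    tokens, tags = [], []
    pos = 0
    for token in text.split():
        start = text.index(token, pos)
        end = start + len(token)
        pos = end
        tag = "O"
        for s, e, etype in spans:
            if start == s:
                tag = "B-" + etype       # token opens an entity
            elif s < start < e:
                tag = "I-" + etype       # token continues an entity
        tokens.append(token)
        tags.append(tag)
    return tokens, tags

tokens, tags = spans_to_bio("Tim Cook joined Apple", [(0, 8, "PER"), (16, 21, "ORG")])
print(list(zip(tokens, tags)))
# → [('Tim', 'B-PER'), ('Cook', 'I-PER'), ('joined', 'O'), ('Apple', 'B-ORG')]
```

Real annotation pipelines use the model's own tokenizer (subwords for BERT-family models) and must handle tokens that straddle span boundaries, which this sketch ignores.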
Phase 3: Verification
Evaluate: precision, recall, F1 per entity type. Check: boundary detection (exact span match) and type classification accuracy.
Gate: F1 > 0.80 per entity type on domain-relevant test data.
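Per-type exact-match scoring can be sketched as follows, treating each entity as a (start, end, type) triple (in practice a library such as seqeval is commonly used instead):

```python
def per_type_f1(gold, pred):
    """Exact-match precision/recall/F1 per entity type.

    gold, pred: sets of (start, end, type) triples.
    """
    scores = {}
    types = {t for _, _, t in gold | pred}
    for etype in sorted(types):
        g = {e for e in gold if e[2] == etype}
        p = {e for e in pred if e[2] == etype}
        tp = len(g & p)                                  # exact span + type match
        precision = tp / len(p) if p else 0.0
        recall = tp / len(g) if g else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        scores[etype] = {"precision": precision, "recall": recall, "f1": f1}
    return scores

gold = {(0, 8, "PER"), (24, 29, "ORG"), (55, 61, "LOC")}
pred = {(0, 8, "PER"), (24, 29, "ORG"), (55, 58, "LOC")}  # LOC boundary error
print(per_type_f1(gold, pred))
```

Note how a single boundary error drives the LOC F1 to zero under exact match, which is exactly why the gate is evaluated per entity type rather than as one aggregate score.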
Phase 4: Output
Return extracted entities with types, positions, and confidence.
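Assembling this payload can be sketched as follows (field names follow the Output Format section; the entity tuple and confidence value are illustrative):

```python
import json
from collections import Counter

def to_output(entities, model_name):
    """Build the result payload.

    entities: list of (text, type, start, end, confidence) tuples.
    """
    return {
        "entities": [
            {"text": t, "type": ty, "start": s, "end": e, "confidence": c}
            for t, ty, s, e, c in entities
        ],
        "metadata": {
            "model": model_name,
            "entities_found": len(entities),
            "types": dict(Counter(ty for _, ty, _, _, _ in entities)),
        },
    }

result = to_output([("Apple Inc.", "ORG", 0, 10, 0.95)], "en_core_web_trf")
print(json.dumps(result, indent=2))
```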
Output Format
```json
{
  "entities": [{"text": "Apple Inc.", "type": "ORG", "start": 0, "end": 10, "confidence": 0.95}],
  "metadata": {"model": "en_core_web_trf", "entities_found": 15, "types": {"PER": 5, "ORG": 6, "LOC": 4}}
}
```
Examples
Sample I/O
Input: "Tim Cook announced that Apple will open a new store in Taipei on March 15."
Expected: [Tim Cook/PER, Apple/ORG, Taipei/LOC, March 15/DATE]
Edge Cases
| Input | Expected | Why |
|---|---|---|
| "Apple" (no context) | Ambiguous (fruit or company) | Context-dependent entity typing |
| Nested entities | Depends on scheme | "Bank of America" = ORG, "America" = LOC within |
| Misspelled entity | May miss | "Appel" not in training data |
Gotchas
- Boundary errors: NER often gets the entity type right but the span wrong ("New" vs "New York City"). Evaluate with both exact and partial match metrics.
- Ambiguity: "Jordan" can be a person, country, or brand. Context-dependent disambiguation is hard; some models output the most likely type.
- Chinese/Japanese NER: No whitespace tokenization makes boundary detection harder. Use language-specific tokenizers (jieba for Chinese).
- Annotation consistency: Training data quality is critical. Inconsistent annotations (sometimes labeling "Dr." as part of name, sometimes not) degrade model performance.
- Entity linking: NER identifies mentions; entity linking resolves them to knowledge base entries. "Apple" → Apple Inc. (Q312) or apple (fruit). These are separate tasks.
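The boundary-error point above can be made concrete: exact match requires identical spans, while a partial (overlap) metric also credits predictions that intersect a gold span of the same type. A minimal sketch:

```python
def overlaps(a, b):
    """True if spans (start, end) intersect."""
    return a[0] < b[1] and b[0] < a[1]

def match_counts(gold, pred):
    """Count exact matches and partial (overlap, same-type) matches.

    gold, pred: lists of (start, end, type) triples.
    """
    exact = sum(1 for p in pred if p in gold)
    partial = sum(
        1 for p in pred
        if any(g[2] == p[2] and overlaps(g[:2], p[:2]) for g in gold)
    )
    return exact, partial

gold = [(0, 13, "LOC")]          # "New York City"
pred = [(0, 3, "LOC")]           # "New" -- boundary error
print(match_counts(gold, pred))  # exact=0, partial=1
```

Reporting both numbers separates "the model missed the entity entirely" from "the model found it but clipped the span", which call for different fixes (more training data vs. better tokenization or annotation guidelines).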
References
- For BIO annotation format and guidelines, see references/bio-annotation.md
- For fine-tuning NER with transformers, see references/transformer-ner.md