Named Entity Recognition
Overview
NER identifies and classifies named entities in text into predefined categories (Person, Organization, Location, Date, Money, etc.). Approaches: rule-based (regex, gazetteers), statistical (CRF), neural (BiLSTM-CRF, transformer-based). Modern NER uses spaCy or Hugging Face models with F1 scores 85-95%.
When to Use
Trigger conditions:
- Extracting structured entities from unstructured text
- Building knowledge graphs from documents
- Preprocessing for information retrieval or question answering
When NOT to use:
- For text classification (categorizing whole documents, not extracting entities)
- For relation extraction between entities (need additional RE model)
Algorithm
IRON LAW: NER Performance Depends on DOMAIN Match
A model trained on news text (OntoNotes) performs poorly on medical
records or legal documents. Domain-specific entities (drug names,
legal citations, product SKUs) require domain-specific training data
or fine-tuning. Always evaluate on YOUR domain's data.
Phase 1: Input Validation
Determine: target entity types (standard: PER, ORG, LOC, DATE, MONEY or custom), input language, domain. Select appropriate pre-trained model or prepare training data.
Gate: Entity types defined, model or training data available.
Phase 2: Core Algorithm
Pre-trained model approach:
- Load model (spaCy, Hugging Face NER pipeline)
- Process text through the pipeline
- Extract entity spans with type labels and confidence scores
Fine-tuning approach:
- Annotate 200+ domain-specific examples in BIO format
- Fine-tune transformer model (BERT, RoBERTa) on annotated data
- Evaluate on held-out test set
Phase 3: Verification
Evaluate: precision, recall, F1 per entity type. Check: boundary detection (exact span match) and type classification accuracy.
Gate: F1 > 0.80 per entity type on domain-relevant test data.
Phase 4: Output
Return extracted entities with types, positions, and confidence.
Output Format
json
{
"entities": [{"text": "Apple Inc.", "type": "ORG", "start": 0, "end": 10, "confidence": 0.95}],
"metadata": {"model": "en_core_web_trf", "entities_found": 15, "types": {"PER": 5, "ORG": 6, "LOC": 4}}
}
Examples
Sample I/O
Input: "Tim Cook announced that Apple will open a new store in Taipei on March 15."
Expected: [Tim Cook/PER, Apple/ORG, Taipei/LOC, March 15/DATE]
Edge Cases
| Input | Expected | Why |
|---|
| "Apple" (no context) | Ambiguous (fruit or company) | Context-dependent entity typing |
| Nested entities | Depends on scheme | "Bank of America" = ORG, "America" = LOC within |
| Misspelled entity | May miss | "Appel" not in training data |
Gotchas
- Boundary errors: NER often gets the entity type right but the span wrong ("New" vs "New York City"). Evaluate with both exact and partial match metrics.
- Ambiguity: "Jordan" can be a person, country, or brand. Context-dependent disambiguation is hard; some models output the most likely type.
- Chinese/Japanese NER: No whitespace tokenization makes boundary detection harder. Use language-specific tokenizers (jieba for Chinese).
- Annotation consistency: Training data quality is critical. Inconsistent annotations (sometimes labeling "Dr." as part of name, sometimes not) degrade model performance.
- Entity linking: NER identifies mentions; entity linking resolves them to knowledge base entries. "Apple" → Apple Inc. (Q312) or apple (fruit). These are separate tasks.
References
- For BIO annotation format and guidelines, see
references/bio-annotation.md
- For fine-tuning NER with transformers, see
references/transformer-ner.md