regex-vs-llm-structured-text
Compare original and translation side by side
🇺🇸
Original
English🇨🇳
Translation
ChineseRegex vs LLM for Structured Text Parsing
结构化文本解析:Regex vs LLM
A practical decision framework for parsing structured text (quizzes, forms, invoices, documents). The key insight: regex handles 95-98% of cases cheaply and deterministically. Reserve expensive LLM calls for the remaining edge cases.
这是一个解析结构化文本(测验、表单、发票、文档)的实用决策框架。核心思路:Regex可低成本、确定性地处理95-98%的案例。仅将昂贵的LLM调用留给剩余的边缘案例。
When to Activate
适用场景
- Parsing structured text with repeating patterns (questions, forms, tables)
- Deciding between regex and LLM for text extraction
- Building hybrid pipelines that combine both approaches
- Optimizing cost/accuracy tradeoffs in text processing
- 解析具有重复模式的结构化文本(问题、表单、表格)
- 为文本提取在Regex和LLM之间做选择
- 构建结合两种方法的混合管道
- 优化文本处理中的成本/准确性权衡
Decision Framework
决策框架
Is the text format consistent and repeating?
├── Yes (>90% follows a pattern) → Start with Regex
│ ├── Regex handles 95%+ → Done, no LLM needed
│ └── Regex handles <95% → Add LLM for edge cases only
└── No (free-form, highly variable) → Use LLM directlyIs the text format consistent and repeating?
├── Yes (>90% follows a pattern) → Start with Regex
│ ├── Regex handles 95%+ → Done, no LLM needed
│ └── Regex handles <95% → Add LLM for edge cases only
└── No (free-form, highly variable) → Use LLM directlyArchitecture Pattern
架构模式
Source Text
│
▼
[Regex Parser] ─── Extracts structure (95-98% accuracy)
│
▼
[Text Cleaner] ─── Removes noise (markers, page numbers, artifacts)
│
▼
[Confidence Scorer] ─── Flags low-confidence extractions
│
├── High confidence (≥0.95) → Direct output
│
└── Low confidence (<0.95) → [LLM Validator] → OutputSource Text
│
▼
[Regex Parser] ─── Extracts structure (95-98% accuracy)
│
▼
[Text Cleaner] ─── Removes noise (markers, page numbers, artifacts)
│
▼
[Confidence Scorer] ─── Flags low-confidence extractions
│
├── High confidence (≥0.95) → Direct output
│
└── Low confidence (<0.95) → [LLM Validator] → OutputImplementation
实现
1. Regex Parser (Handles the Majority)
1. Regex Parser (Handles the Majority)
python
import re
from dataclasses import dataclass
@dataclass(frozen=True)
class ParsedItem:
id: str
text: str
choices: tuple[str, ...]
answer: str
confidence: float = 1.0
def parse_structured_text(content: str) -> list[ParsedItem]:
"""Parse structured text using regex patterns."""
pattern = re.compile(
r"(?P<id>\d+)\.\s*(?P<text>.+?)\n"
r"(?P<choices>(?:[A-D]\..+?\n)+)"
r"Answer:\s*(?P<answer>[A-D])",
re.MULTILINE | re.DOTALL,
)
items = []
for match in pattern.finditer(content):
choices = tuple(
c.strip() for c in re.findall(r"[A-D]\.\s*(.+)", match.group("choices"))
)
items.append(ParsedItem(
id=match.group("id"),
text=match.group("text").strip(),
choices=choices,
answer=match.group("answer"),
))
return itemspython
import re
from dataclasses import dataclass
@dataclass(frozen=True)
class ParsedItem:
id: str
text: str
choices: tuple[str, ...]
answer: str
confidence: float = 1.0
def parse_structured_text(content: str) -> list[ParsedItem]:
"""Parse structured text using regex patterns."""
pattern = re.compile(
r"(?P<id>\d+)\.\s*(?P<text>.+?)\n"
r"(?P<choices>(?:[A-D]\..+?\n)+)"
r"Answer:\s*(?P<answer>[A-D])",
re.MULTILINE | re.DOTALL,
)
items = []
for match in pattern.finditer(content):
choices = tuple(
c.strip() for c in re.findall(r"[A-D]\.\s*(.+)", match.group("choices"))
)
items.append(ParsedItem(
id=match.group("id"),
text=match.group("text").strip(),
choices=choices,
answer=match.group("answer"),
))
return items2. Confidence Scoring
2. Confidence Scoring
Flag items that may need LLM review:
python
@dataclass(frozen=True)
class ConfidenceFlag:
item_id: str
score: float
reasons: tuple[str, ...]
def score_confidence(item: ParsedItem) -> ConfidenceFlag:
"""Score extraction confidence and flag issues."""
reasons = []
score = 1.0
if len(item.choices) < 3:
reasons.append("few_choices")
score -= 0.3
if not item.answer:
reasons.append("missing_answer")
score -= 0.5
if len(item.text) < 10:
reasons.append("short_text")
score -= 0.2
return ConfidenceFlag(
item_id=item.id,
score=max(0.0, score),
reasons=tuple(reasons),
)
def identify_low_confidence(
items: list[ParsedItem],
threshold: float = 0.95,
) -> list[ConfidenceFlag]:
"""Return items below confidence threshold."""
flags = [score_confidence(item) for item in items]
return [f for f in flags if f.score < threshold]Flag items that may need LLM review:
python
@dataclass(frozen=True)
class ConfidenceFlag:
item_id: str
score: float
reasons: tuple[str, ...]
def score_confidence(item: ParsedItem) -> ConfidenceFlag:
"""Score extraction confidence and flag issues."""
reasons = []
score = 1.0
if len(item.choices) < 3:
reasons.append("few_choices")
score -= 0.3
if not item.answer:
reasons.append("missing_answer")
score -= 0.5
if len(item.text) < 10:
reasons.append("short_text")
score -= 0.2
return ConfidenceFlag(
item_id=item.id,
score=max(0.0, score),
reasons=tuple(reasons),
)
def identify_low_confidence(
items: list[ParsedItem],
threshold: float = 0.95,
) -> list[ConfidenceFlag]:
"""Return items below confidence threshold."""
flags = [score_confidence(item) for item in items]
return [f for f in flags if f.score < threshold]3. LLM Validator (Edge Cases Only)
3. LLM Validator (Edge Cases Only)
python
def validate_with_llm(
item: ParsedItem,
original_text: str,
client,
) -> ParsedItem:
"""Use LLM to fix low-confidence extractions."""
response = client.messages.create(
model="claude-haiku-4-5-20251001", # Cheapest model for validation
max_tokens=500,
messages=[{
"role": "user",
"content": (
f"Extract the question, choices, and answer from this text.\n\n"
f"Text: {original_text}\n\n"
f"Current extraction: {item}\n\n"
f"Return corrected JSON if needed, or 'CORRECT' if accurate."
),
}],
)
# Parse LLM response and return corrected item...
return corrected_itempython
def validate_with_llm(
item: ParsedItem,
original_text: str,
client,
) -> ParsedItem:
"""Use LLM to fix low-confidence extractions."""
response = client.messages.create(
model="claude-haiku-4-5-20251001", # Cheapest model for validation
max_tokens=500,
messages=[{
"role": "user",
"content": (
f"Extract the question, choices, and answer from this text.\n\n"
f"Text: {original_text}\n\n"
f"Current extraction: {item}\n\n"
f"Return corrected JSON if needed, or 'CORRECT' if accurate."
),
}],
)
# Parse LLM response and return corrected item...
return corrected_item4. Hybrid Pipeline
4. Hybrid Pipeline
python
def process_document(
content: str,
*,
llm_client=None,
confidence_threshold: float = 0.95,
) -> list[ParsedItem]:
"""Full pipeline: regex -> confidence check -> LLM for edge cases."""
# Step 1: Regex extraction (handles 95-98%)
items = parse_structured_text(content)
# Step 2: Confidence scoring
low_confidence = identify_low_confidence(items, confidence_threshold)
if not low_confidence or llm_client is None:
return items
# Step 3: LLM validation (only for flagged items)
low_conf_ids = {f.item_id for f in low_confidence}
result = []
for item in items:
if item.id in low_conf_ids:
result.append(validate_with_llm(item, content, llm_client))
else:
result.append(item)
return resultpython
def process_document(
content: str,
*,
llm_client=None,
confidence_threshold: float = 0.95,
) -> list[ParsedItem]:
"""Full pipeline: regex -> confidence check -> LLM for edge cases."""
# Step 1: Regex extraction (handles 95-98%)
items = parse_structured_text(content)
# Step 2: Confidence scoring
low_confidence = identify_low_confidence(items, confidence_threshold)
if not low_confidence or llm_client is None:
return items
# Step 3: LLM validation (only for flagged items)
low_conf_ids = {f.item_id for f in low_confidence}
result = []
for item in items:
if item.id in low_conf_ids:
result.append(validate_with_llm(item, content, llm_client))
else:
result.append(item)
return resultReal-World Metrics
真实世界指标
From a production quiz parsing pipeline (410 items):
| Metric | Value |
|---|---|
| Regex success rate | 98.0% |
| Low confidence items | 8 (2.0%) |
| LLM calls needed | ~5 |
| Cost savings vs all-LLM | ~95% |
| Test coverage | 93% |
来自一个生产环境的测验解析管道(410个条目):
| 指标 | 数值 |
|---|---|
| Regex成功率 | 98.0% |
| 低置信度条目 | 8个(2.0%) |
| 所需LLM调用次数 | ~5次 |
| 与全LLM方案相比的成本节省 | ~95% |
| 测试覆盖率 | 93% |
Best Practices
最佳实践
- Start with regex — even imperfect regex gives you a baseline to improve
- Use confidence scoring to programmatically identify what needs LLM help
- Use the cheapest LLM for validation (Haiku-class models are sufficient)
- Never mutate parsed items — return new instances from cleaning/validation steps
- TDD works well for parsers — write tests for known patterns first, then edge cases
- Log metrics (regex success rate, LLM call count) to track pipeline health
- 从Regex开始——即使是不完善的Regex也能为你提供一个可改进的基准
- 使用置信度评分以程序化方式识别需要LLM协助的内容
- 使用最便宜的LLM进行验证(Haiku级模型已足够)
- 绝不修改已解析的条目——从清理/验证步骤返回新实例
- **测试驱动开发(TDD)**对解析器很有效——先为已知模式编写测试,再处理边缘案例
- 记录指标(Regex成功率、LLM调用次数)以跟踪管道健康状况
Anti-Patterns to Avoid
需避免的反模式
- Sending all text to an LLM when regex handles 95%+ of cases (expensive and slow)
- Using regex for free-form, highly variable text (LLM is better here)
- Skipping confidence scoring and hoping regex "just works"
- Mutating parsed objects during cleaning/validation steps
- Not testing edge cases (malformed input, missing fields, encoding issues)
- 当Regex可处理95%以上案例时,仍将所有文本发送给LLM(成本高且速度慢)
- 对自由格式、高度可变的文本使用Regex(LLM在这方面表现更好)
- 跳过置信度评分,寄希望于Regex“刚好能用”
- 在清理/验证步骤中修改已解析的对象
- 不测试边缘案例(格式错误的输入、缺失字段、编码问题)
When to Use
适用场景
- Quiz/exam question parsing
- Form data extraction
- Invoice/receipt processing
- Document structure parsing (headers, sections, tables)
- Any structured text with repeating patterns where cost matters
- 测验/考试题目解析
- 表单数据提取
- 发票/收据处理
- 文档结构解析(标题、章节、表格)
- 任何具有重复模式且关注成本的结构化文本