privacy-parser-pii-extraction
Privacy Parser: PII Span Extraction
Skill by ara.so — Daily 2026 Skills collection.
privacy-parser is the inverse of OpenAI's Privacy Filter. Where the filter masks PII with `<REDACTED>`, this library returns structured spans (label, text, and character offsets) using the same `opf` 1.5B model weights and label taxonomy.

Installation
```bash
# Clone the repo (includes both subpackages)
git clone https://github.com/chiefautism/privacy-parser
cd privacy-parser
uv venv
uv pip install -e ./privacy-filter   # installs the opf model + weights loader
uv pip install -e ./pii_parser       # installs the parser library
```

First run downloads the `opf` 1.5B checkpoint (~3 GB) to `~/.opf/privacy_filter/`.
Quick Start

```python
from pii_parser.hybrid import HybridPIIParser

parser = HybridPIIParser(device="cpu")  # or "cuda" / "mps"
result = parser.parse(
    "Hi Quindle Testwick (quindle.testwick@openai.com / +1-415-555-0102), "
    "account 40702810500001234567, 14 Beautiful Ct, Anytown USA, "
    "password Priv4cy-Filt3r-2026."
)
for span in result.spans:
    print(f"{span.label:18} {span.text}")
```

Output:

```
private_person     Quindle Testwick
private_email      quindle.testwick@openai.com
private_phone      +1-415-555-0102
account_number     40702810500001234567
private_address    14 Beautiful Ct, Anytown USA
secret             Priv4cy-Filt3r-2026
```
Three Backends

Choose the backend based on your speed/accuracy tradeoff:

| Backend | Weights | Speed | F1 | When to use |
|---|---|---|---|---|
| `PIIParser` | none | µs | 1.000 | Tests, known-format structured data |
| `ModelPIIParser` | 1.5B | ~500 ms CPU | 0.733 | Model-only, no post-processing |
| `HybridPIIParser` | 1.5B | ~600 ms CPU | 0.929 | Production; ship this one |

```python
# Regex-only (no model, instant, high precision on structured formats)
from pii_parser import PIIParser
parser = PIIParser()

# Model-only (raw BIOES logits → Viterbi → spans)
from pii_parser.model import ModelPIIParser
parser = ModelPIIParser(device="cpu")

# Hybrid: model + span-merge + regex backstop (recommended)
from pii_parser.hybrid import HybridPIIParser
parser = HybridPIIParser(device="cpu")
```
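If the backend needs to be chosen from configuration, a small factory can defer the import until a backend is actually requested. The following is only a sketch: `BACKENDS` and `make_parser` are hypothetical names, and it assumes the three classes accept exactly the constructor arguments shown in the snippets above.

```python
import importlib

# Hypothetical lookup table: short name -> (module path, class name).
# Module and class names are taken from the import lines shown above.
BACKENDS = {
    "regex": ("pii_parser", "PIIParser"),
    "model": ("pii_parser.model", "ModelPIIParser"),
    "hybrid": ("pii_parser.hybrid", "HybridPIIParser"),
}


def make_parser(kind="hybrid", device="cpu"):
    """Build a parser by short name; raises KeyError for unknown names."""
    module_name, class_name = BACKENDS[kind]
    # Import lazily so the regex-only path never loads the model code.
    cls = getattr(importlib.import_module(module_name), class_name)
    # Assumption: only the model-backed parsers take a device argument.
    return cls() if kind == "regex" else cls(device=device)
```

Usage would look like `parser = make_parser("hybrid", device="cpu")`, after which `parser.parse(text)` behaves as in the Quick Start.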
Span Object

Each `span` in `result.spans` has:

```python
span.label   # str: one of the 8 label types
span.text    # str: the extracted substring
span.start   # int: char offset in original string
span.end     # int: char offset (exclusive)
```
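Because `end` is exclusive, slicing the parsed string with `start:end` should reproduce `span.text`. The sketch below checks that invariant and shows how offsets go stale if the string is stripped after parsing. `Span` here is a hypothetical stand-in with the same four fields, not the library's class, and `mismatched` is an illustrative helper.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Span:
    """Hypothetical stand-in for the library's span object."""
    label: str
    text: str
    start: int
    end: int


def mismatched(text, spans):
    """Return the spans whose text differs from text[start:end]."""
    return [s for s in spans if text[s.start:s.end] != s.text]


original = "  Email Bob at bob@example.com  "
email = "bob@example.com"
start = original.index(email)
span = Span("private_email", email, start, start + len(email))

# Offsets are valid against the exact string they were computed from.
print(mismatched(original, [span]))
# After stripping, the same offsets point at different characters.
print(mismatched(original.strip(), [span]))
```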
Label Taxonomy (opf v2)

- `private_person`: full names of individuals
- `private_email`: email addresses
- `private_phone`: phone numbers (any format)
- `private_address`: street/postal addresses
- `private_url`: personal/private URLs
- `private_date`: dates tied to individuals
- `account_number`: bank/card/account identifiers
- `secret`: passwords, tokens, API keys
Common Patterns

Batch processing

```python
from pii_parser.hybrid import HybridPIIParser

parser = HybridPIIParser(device="cpu")
texts = [
    "Email Bob at bob@example.com",
    "SSN: 123-45-6789, DOB: 1990-03-15",
    "Token: ghp_abc123XYZ",
]
for text in texts:
    result = parser.parse(text)
    if result.spans:
        print(f"Text: {text!r}")
        for s in result.spans:
            print(f"  [{s.start}:{s.end}] {s.label} → {s.text!r}")
        print()
```
Filter by label type

```python
result = parser.parse(long_document)
emails = [s for s in result.spans if s.label == "private_email"]
phones = [s for s in result.spans if s.label == "private_phone"]
secrets = [s for s in result.spans if s.label == "secret"]
accounts = [s for s in result.spans if s.label == "account_number"]
```
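When several labels are needed at once, a single pass that groups spans by label avoids re-scanning `result.spans` for each one. This is a sketch only: `group_by_label` is a hypothetical helper, and `Span` is a stand-in namedtuple with the same fields as the library's span objects.

```python
from collections import defaultdict, namedtuple

# Hypothetical stand-in for the library's span objects (same four fields).
Span = namedtuple("Span", "label text start end")


def group_by_label(spans):
    """Map each label to the list of extracted texts, in document order."""
    groups = defaultdict(list)
    for s in spans:
        groups[s.label].append(s.text)
    return dict(groups)


spans = [
    Span("private_email", "bob@example.com", 13, 28),
    Span("private_phone", "555-0100", 40, 48),
    Span("private_email", "jane@corp.io", 60, 72),
]
grouped = group_by_label(spans)
print(grouped)
```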
Redact after inspection

```python
def redact(text: str, spans) -> str:
    """Replace extracted PII with [LABEL] tokens."""
    result = list(text)
    # Work right to left so earlier offsets stay valid after each replacement.
    for span in sorted(spans, key=lambda s: s.start, reverse=True):
        result[span.start:span.end] = f"[{span.label.upper()}]"
    return "".join(result)

result = parser.parse("Call Alice at 555-0100 re: account 9988776655.")
clean = redact("Call Alice at 555-0100 re: account 9988776655.", result.spans)
# "Call [PRIVATE_PERSON] at [PRIVATE_PHONE] re: account [ACCOUNT_NUMBER]."
```
Export to JSON

```python
import json

result = parser.parse("Jane Doe, jane@corp.io, +44 20 7946 0958")
payload = [
    {"label": s.label, "text": s.text, "start": s.start, "end": s.end}
    for s in result.spans
]
print(json.dumps(payload, indent=2))
```
GPU acceleration

```python
import torch
from pii_parser.hybrid import HybridPIIParser

# Fall back to CPU when no CUDA device is visible.
device = "cuda" if torch.cuda.is_available() else "cpu"
parser = HybridPIIParser(device=device)
```
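The Quick Start also lists `"mps"` as a device option. A slightly broader selection routine could look like the sketch below; `pick_device` is a hypothetical helper, and the preference order (CUDA, then MPS, then CPU) is an assumption rather than something the library prescribes.

```python
def pick_device():
    """Return "cuda", "mps", or "cpu" depending on what PyTorch reports."""
    try:
        import torch
    except ImportError:
        # Without PyTorch there is nothing to accelerate; stay on CPU.
        return "cpu"
    if torch.cuda.is_available():
        return "cuda"
    # torch.backends.mps is absent on older PyTorch builds, so probe defensively.
    mps = getattr(torch.backends, "mps", None)
    if mps is not None and mps.is_available():
        return "mps"
    return "cpu"


print(pick_device())
```

The returned string can be passed straight through as `HybridPIIParser(device=pick_device())`.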
CLI

```bash
# Parse a string directly
python -m pii_parser.cli_model "Alice paid 40702810500001234567 on 2026-05-17."

# Pipe text from a file
cat dump.txt | python -m pii_parser.cli_model -
```
Architecture

```
text
  ↓
opf 1.5B → BIOES logits → Viterbi (tuned transitions) → char spans
  ↓
span-merge      (glues multi-token names: "Quindle" + "Testwick" → one span)
  ↓
regex backstop  (URL, secret, account_number; fills model gaps)
  ↓
result.spans[]
```

- BIOES tagging: Beginning / Inside / Outside / End / Single, the standard NER scheme
- Viterbi: enforces valid tag transitions (no I- without B-)
- Span-merge: heuristic that joins adjacent same-label spans separated only by whitespace
- Regex backstop: high-precision patterns for labels the 1.5B model under-predicts (secrets, account numbers, URLs)
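To make the Viterbi step concrete, here is a minimal sketch of constrained decoding over BIOES tags for a single entity type. It assumes hard allow/deny transitions instead of the tuned transition scores mentioned above, and the names `viterbi`, `ALLOWED`, `tags_to_token_spans`, and the toy scores are illustrative, not the library's code.

```python
import math

TAGS = ["O", "B", "I", "E", "S"]
# Tags that may follow each tag under BIOES (single entity type).
ALLOWED = {
    "O": {"O", "B", "S"},
    "B": {"I", "E"},
    "I": {"I", "E"},
    "E": {"O", "B", "S"},
    "S": {"O", "B", "S"},
}
START_OK = ("O", "B", "S")  # a sequence cannot open inside an entity
END_OK = ("O", "E", "S")    # and cannot end with an unfinished one


def viterbi(scores):
    """scores: one {tag: log-score} dict per token. Returns the best valid path."""
    if not scores:
        return []
    best = {t: (scores[0][t] if t in START_OK else -math.inf) for t in TAGS}
    back = []
    for row in scores[1:]:
        new_best, pointers = {}, {}
        for t in TAGS:
            # Best predecessor among tags that are allowed to precede t.
            prev = max((p for p in TAGS if t in ALLOWED[p]), key=lambda p: best[p])
            new_best[t] = best[prev] + row[t]
            pointers[t] = prev
        best = new_best
        back.append(pointers)
    path = [max(END_OK, key=lambda t: best[t])]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    return path[::-1]


def tags_to_token_spans(tags):
    """Convert a valid BIOES path into (start_token, end_token_exclusive) pairs."""
    spans, start = [], None
    for i, t in enumerate(tags):
        if t == "S":
            spans.append((i, i + 1))
        elif t == "B":
            start = i
        elif t == "E" and start is not None:
            spans.append((start, i + 1))
            start = None
    return spans


# Toy scores where per-token argmax gives the invalid sequence O, I, E.
scores = [
    {"O": -0.1, "B": -0.5, "I": -5.0, "E": -5.0, "S": -3.0},
    {"O": -3.0, "B": -3.0, "I": -0.1, "E": -2.0, "S": -3.0},
    {"O": -3.0, "B": -5.0, "I": -5.0, "E": -0.1, "S": -3.0},
]
greedy = [max(TAGS, key=lambda t: row[t]) for row in scores]
path = viterbi(scores)
print(greedy, path, tags_to_token_spans(path))
```

In this toy case the constrained decoder gives up a little score on the first token to pick `B`, which makes the rest of the sequence valid; that trade is the point of decoding globally instead of per token.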
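The two post-processing stages can be sketched in a few lines. This is an illustration under stated assumptions, not the library's implementation: `Span` is a stand-in dataclass, `merge_adjacent` and `regex_backstop` are hypothetical names, and the three patterns are simplified examples rather than the patterns the project ships.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Span:
    """Hypothetical stand-in for the library's span object."""
    label: str
    text: str
    start: int
    end: int


def merge_adjacent(text, spans):
    """Join neighbouring same-label spans separated only by whitespace."""
    merged = []
    for s in sorted(spans, key=lambda s: s.start):
        if merged:
            prev = merged[-1]
            gap = text[prev.end:s.start]
            if prev.label == s.label and prev.end <= s.start and gap.strip() == "":
                merged[-1] = Span(prev.label, text[prev.start:s.end], prev.start, s.end)
                continue
        merged.append(s)
    return merged


# Illustrative patterns only (assumptions, not the project's real regexes).
BACKSTOP = {
    "private_url": re.compile(r"https?://[^\s)]+"),
    "secret": re.compile(r"\bghp_[A-Za-z0-9]+\b"),
    "account_number": re.compile(r"\b\d{16,20}\b"),
}


def regex_backstop(text, spans):
    """Add regex matches that do not overlap any span already present."""
    out = list(spans)
    for label, pattern in BACKSTOP.items():
        for m in pattern.finditer(text):
            if not any(m.start() < s.end and s.start < m.end() for s in out):
                out.append(Span(label, m.group(), m.start(), m.end()))
    return sorted(out, key=lambda s: s.start)


text = "Quindle Testwick paid 40702810500001234567 via https://example.com/pay"
model_spans = [
    Span("private_person", "Quindle", 0, 7),
    Span("private_person", "Testwick", 8, 16),
]
final = regex_backstop(text, merge_adjacent(text, model_spans))
for s in final:
    print(f"{s.label:18} {s.text}")
```

Letting existing spans win over regex matches keeps the model's decisions intact and uses the patterns only to fill gaps, which matches the "backstop" role described above.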
Running Tests / Benchmarks

```bash
# Full fixture suite + latency benchmark
python pii_parser/tests/test_hybrid.py

# Expected output:
#   Fixture F1: 0.929
#   Scenarios: 8/8 passed
#   Latency: ~600 ms CPU
```
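To check latency on your own inputs, a generic timing helper is enough. The sketch below is not the repository's benchmark: `median_latency_ms` is a hypothetical helper that times any callable, and `str.upper` stands in for `parser.parse` so the snippet runs without the model.

```python
import statistics
import time


def median_latency_ms(parse, text, runs=5, warmup=1):
    """Median wall-clock time of parse(text) in milliseconds."""
    for _ in range(warmup):
        parse(text)  # warm-up calls absorb one-time costs such as model loading
    samples = []
    for _ in range(runs):
        t0 = time.perf_counter()
        parse(text)
        samples.append((time.perf_counter() - t0) * 1000.0)
    return statistics.median(samples)


# Stand-in callable; with the library installed, pass parser.parse instead.
latency = median_latency_ms(str.upper, "Email Bob at bob@example.com")
print(f"{latency:.3f} ms")
```

The median is used instead of the mean so that a single slow run (for example, a cold cache) does not dominate the result.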
Troubleshooting

- Slow first run: The checkpoint (~3 GB) downloads to `~/.opf/privacy_filter/` on first use. Subsequent runs load from cache.
- CUDA out of memory: Use `device="cpu"` or reduce batch size; the 1.5B model requires ~3 GB VRAM on GPU.
- Low recall on secrets/URLs: Use `HybridPIIParser` (not `ModelPIIParser`); the regex backstop specifically covers these labels.
- Span text doesn't match offsets: Offsets are byte-safe character indices into the original string passed to `parse()`. Do not preprocess or strip the string before parsing if you need offsets to remain valid.
- Import error on `privacy_filter`: Ensure you installed both packages: `uv pip install -e ./privacy-filter` AND `uv pip install -e ./pii_parser`.
- Model not found: Delete `~/.opf/privacy_filter/` and re-run to trigger a fresh download.