privacy-parser-pii-extraction

Privacy Parser — PII Span Extraction

Skill by ara.so, part of the Daily 2026 Skills collection.

privacy-parser is the inverse of OpenAI's Privacy Filter. Where the filter masks PII with `<REDACTED>`, this library returns structured spans (label, text, and character offsets) using the same 1.5B `opf` model weights and label taxonomy.

Installation

```bash
# Clone the repo (includes both subpackages), then run from the repo root:
uv venv
uv pip install -e ./privacy-filter   # installs the opf model + weights loader
uv pip install -e ./pii_parser       # installs the parser library
```

First run downloads the `opf` 1.5B checkpoint (~3 GB) to `~/.opf/privacy_filter/`.

Quick Start

```python
from pii_parser.hybrid import HybridPIIParser

parser = HybridPIIParser(device="cpu")  # or "cuda" / "mps"
result = parser.parse(
    "Hi Quindle Testwick (quindle.testwick@openai.com / +1-415-555-0102), "
    "account 40702810500001234567, 14 Beautiful Ct, Anytown USA, "
    "password Priv4cy-Filt3r-2026."
)

for span in result.spans:
    print(f"{span.label:18}  {span.text}")
```

Output:

```text
private_person      Quindle Testwick
private_email       quindle.testwick@openai.com
private_phone       +1-415-555-0102
account_number      40702810500001234567
private_address     14 Beautiful Ct, Anytown USA
secret              Priv4cy-Filt3r-2026
```

Three Backends

Choose the backend based on your speed/accuracy tradeoff:

| Backend | Weights | Speed | F1 | When to use |
|---|---|---|---|---|
| `PIIParser` | none | µs | 1.000 | Tests, known-format structured data |
| `ModelPIIParser` | 1.5B | ~500 ms CPU | 0.733 | Model-only, no post-processing |
| `HybridPIIParser` | 1.5B | ~600 ms CPU | 0.929 | Production (ship this one) |

```python
# Regex-only (no model, instant, high precision on structured formats)
from pii_parser import PIIParser
parser = PIIParser()

# Model-only (raw BIOES logits → Viterbi → spans)
from pii_parser.model import ModelPIIParser
parser = ModelPIIParser(device="cpu")

# Hybrid: model + span-merge + regex backstop (recommended)
from pii_parser.hybrid import HybridPIIParser
parser = HybridPIIParser(device="cpu")
```
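
To make the difference between the model-only and hybrid backends concrete, here is a minimal sketch of the two post-processing steps the hybrid backend adds: span-merge and a regex backstop. It is not the library's implementation. The `Span` dataclass, the function names `merge_adjacent` and `regex_backstop`, and all three regex patterns are assumptions made up for illustration, and the "model" spans are hand-written so the example runs without the checkpoint.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Span:
    """Stand-in for the library's span object (assumed shape, not its real class)."""
    label: str
    text: str
    start: int
    end: int


def merge_adjacent(source, spans):
    """Join neighbouring same-label spans separated only by whitespace."""
    merged = []
    for s in sorted(spans, key=lambda s: s.start):
        prev = merged[-1] if merged else None
        if (
            prev is not None
            and prev.label == s.label
            and s.start >= prev.end
            and not source[prev.end:s.start].strip()
        ):
            merged[-1] = Span(s.label, source[prev.start:s.end], prev.start, s.end)
        else:
            merged.append(s)
    return merged


# Illustrative patterns only; the library's real patterns are not shown here.
BACKSTOP = {
    "private_url": re.compile(r"https?://\S+"),
    "account_number": re.compile(r"\b\d{12,20}\b"),
    "secret": re.compile(r"\bghp_[A-Za-z0-9]+\b"),
}


def regex_backstop(source, spans):
    """Add regex matches that do not overlap any span already found."""
    out = list(spans)
    for label, pattern in BACKSTOP.items():
        for m in pattern.finditer(source):
            if all(m.end() <= s.start or m.start() >= s.end for s in out):
                out.append(Span(label, m.group(), m.start(), m.end()))
    return sorted(out, key=lambda s: s.start)


source = "Quindle Testwick, token ghp_abc123XYZ, account 40702810500001234567"
# Pretend the model found the two name tokens separately and missed the rest.
model_spans = [
    Span("private_person", "Quindle", 0, 7),
    Span("private_person", "Testwick", 8, 16),
]
final = regex_backstop(source, merge_adjacent(source, model_spans))
for s in final:
    print(f"{s.label:18}  {s.text}")
```

The backstop only adds matches that do not overlap an existing span, so in this sketch the model's output always takes priority over the regexes.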

Span Object

Each `span` in `result.spans` has:

```python
span.label    # str: one of the 8 label types
span.text     # str: the extracted substring
span.start    # int: char offset in original string (inclusive)
span.end      # int: char offset (exclusive)
```
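
Because `start` is inclusive and `end` is exclusive, slicing the original string with a span's offsets should give back `span.text`. The sketch below checks that invariant. It uses a stand-in `Span` dataclass and a helper named `offsets_consistent`, both hypothetical, so it runs without loading the model; the library's own span class may differ.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Span:
    """Stand-in for the library's span object (assumed shape, not its real class)."""
    label: str
    text: str
    start: int  # inclusive character offset
    end: int    # exclusive character offset


def offsets_consistent(source, spans):
    """Return True if every span's offsets slice back to its text."""
    return all(source[s.start:s.end] == s.text for s in spans)


source = "Email Bob at bob@example.com"
spans = [
    Span("private_person", "Bob", 6, 9),
    Span("private_email", "bob@example.com", 13, 28),
]
print(offsets_consistent(source, spans))        # offsets match the parsed string
print(offsets_consistent(" " + source, spans))  # a shifted string no longer matches
```

A check like this is cheap to run in tests whenever text is transformed between parsing and use.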

Label Taxonomy (opf v2)

```text
private_person    full names of individuals
private_email     email addresses
private_phone     phone numbers (any format)
private_address   street/postal addresses
private_url       personal/private URLs
private_date      dates tied to individuals
account_number    bank/card/account identifiers
secret            passwords, tokens, API keys
```

Common Patterns

Batch processing

```python
from pii_parser.hybrid import HybridPIIParser

# Build the parser once and reuse it for every text.
parser = HybridPIIParser(device="cpu")

texts = [
    "Email Bob at bob@example.com",
    "SSN: 123-45-6789, DOB: 1990-03-15",
    "Token: ghp_abc123XYZ",
]

for text in texts:
    result = parser.parse(text)
    if result.spans:
        print(f"Text: {text!r}")
        for s in result.spans:
            print(f"  [{s.start}:{s.end}] {s.label} {s.text!r}")
        print()
```

Filter by label type

```python
result = parser.parse(long_document)

emails   = [s for s in result.spans if s.label == "private_email"]
phones   = [s for s in result.spans if s.label == "private_phone"]
secrets  = [s for s in result.spans if s.label == "secret"]
accounts = [s for s in result.spans if s.label == "account_number"]
```
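
When several labels are needed at once, a single pass that buckets spans by label avoids rescanning the list for each one. The sketch below shows one way to do that. The `Span` dataclass and `group_by_label` helper are hypothetical stand-ins so the example runs without the model; with the real parser you would pass `result.spans` instead.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Span:
    """Stand-in for the library's span object (assumed shape, not its real class)."""
    label: str
    text: str
    start: int
    end: int


def group_by_label(spans):
    """Bucket spans by label in a single pass, preserving input order."""
    groups = defaultdict(list)
    for s in spans:
        groups[s.label].append(s)
    return dict(groups)


# Hand-written spans for illustration; offsets are arbitrary but self-consistent.
spans = [
    Span("private_email", "jane@corp.io", 10, 22),
    Span("private_phone", "+44 20 7946 0958", 24, 40),
    Span("private_email", "j.doe@corp.io", 45, 58),
]
by_label = group_by_label(spans)
print([s.text for s in by_label.get("private_email", [])])
print([s.text for s in by_label.get("secret", [])])  # labels with no spans are simply absent
```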

Redact after inspection

```python
def redact(text: str, spans) -> str:
    """Replace extracted PII with [LABEL] tokens."""
    result = list(text)
    # Work right to left so earlier offsets stay valid after each replacement.
    for span in sorted(spans, key=lambda s: s.start, reverse=True):
        result[span.start:span.end] = f"[{span.label.upper()}]"
    return "".join(result)

text = "Call Alice at 555-0100 re: account 9988776655."
result = parser.parse(text)
clean = redact(text, result.spans)
# "Call [PRIVATE_PERSON] at [PRIVATE_PHONE] re: account [ACCOUNT_NUMBER]."
```

Export to JSON

```python
import json

result = parser.parse("Jane Doe, jane@corp.io, +44 20 7946 0958")
payload = [
    {"label": s.label, "text": s.text, "start": s.start, "end": s.end}
    for s in result.spans
]
print(json.dumps(payload, indent=2))
```

GPU acceleration

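```python
import torch
from pii_parser.hybrid import HybridPIIParser

# Fall back to CPU when no CUDA device is available.
device = "cuda" if torch.cuda.is_available() else "cpu"
parser = HybridPIIParser(device=device)
```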

CLI

```bash
# Parse a string directly
python -m pii_parser.cli_model "Alice paid 40702810500001234567 on 2026-05-17."

# Pipe text from a file
cat dump.txt | python -m pii_parser.cli_model -
```

Architecture

```text
opf 1.5B → BIOES logits → Viterbi (tuned transitions) → char spans
  → span-merge      (glues multi-token names: "Quindle" + "Testwick" → one span)
  → regex backstop  (URL, secret, account_number; fills model gaps)
  → result.spans[]
```

- BIOES tagging: Beginning / Inside / Outside / End / Single, the standard NER scheme
- Viterbi: enforces valid tag transitions (no I- without B-)
- Span-merge: heuristic that joins adjacent same-label spans separated only by whitespace
- Regex backstop: high-precision patterns for labels the 1.5B model under-predicts (secrets, account numbers, URLs)
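
The Viterbi step is the least obvious part of the pipeline, so here is a small pure-Python sketch of constrained decoding over BIOES tags, followed by turning the tag path into character spans. It is an illustration under assumptions, not the library's decoder: the tuned transition scores are not given in this document, so the sketch uses hard allow/deny rules instead, and the per-token scores and offsets are invented for the example.

```python
import math

NEG_INF = -math.inf


def allowed(prev, nxt):
    """BIOES rule: I-/E- may only continue an open B-/I- of the same label."""
    p_kind, _, p_label = prev.partition("-")
    n_kind, _, n_label = nxt.partition("-")
    if p_kind in ("B", "I"):          # span is open: must continue or close it
        return n_kind in ("I", "E") and n_label == p_label
    return n_kind in ("O", "B", "S")  # span is closed: may only start fresh


def viterbi(scores, tags):
    """scores[t][k] is the score of tags[k] at token t; returns the best valid path."""
    n = len(tags)
    # A sequence may not start inside a span.
    best = [scores[0][k] if tags[k][0] in "OBS" else NEG_INF for k in range(n)]
    back = []
    for row in scores[1:]:
        step, ptr = [], []
        for k in range(n):
            cands = [(best[j], j) for j in range(n) if allowed(tags[j], tags[k])]
            prev_score, j = max(cands) if cands else (NEG_INF, 0)
            step.append(prev_score + row[k])
            ptr.append(j)
        best = step
        back.append(ptr)
    # A sequence may not end with a span still open.
    k = max((best[k], k) for k in range(n) if tags[k][0] in "OES")[1]
    path = [k]
    for ptr in reversed(back):
        k = ptr[k]
        path.append(k)
    return [tags[k] for k in reversed(path)]


def decode(tag_path, token_offsets):
    """Turn a valid BIOES path plus per-token (start, end) offsets into spans."""
    spans, open_start = [], None
    for tag, (start, end) in zip(tag_path, token_offsets):
        kind, _, label = tag.partition("-")
        if kind == "S":
            spans.append((label, start, end))
        elif kind == "B":
            open_start = start
        elif kind == "E":
            spans.append((label, open_start, end))
            open_start = None
    return spans


TAGS = ["O", "B-private_person", "I-private_person", "E-private_person", "S-private_person"]
text = "Hi Quindle Testwick"
offsets = [(0, 2), (3, 10), (11, 19)]  # character offsets of the three tokens
scores = [                             # invented scores, one row per token
    [2.0, 0.1, 0.0, 0.0, 0.3],
    [0.2, 1.0, 1.2, 0.1, 0.4],
    [0.1, 0.2, 0.3, 2.0, 0.5],
]

greedy = [TAGS[max(range(len(TAGS)), key=row.__getitem__)] for row in scores]
path = viterbi(scores, TAGS)
print("greedy :", greedy)  # per-token argmax, may contain an I- with no B-
print("viterbi:", path)
print(decode(path, offsets))
```

In this example the per-token argmax picks an `I-` tag for "Quindle" with no preceding `B-`, which is not a valid BIOES sequence. The constrained search gives up a little score on that token to get a valid `B-` ... `E-` pair, which then decodes to a single name span.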

Running Tests / Benchmarks

```bash
# Full fixture suite + latency benchmark
python pii_parser/tests/test_hybrid.py

# Expected output:
#   Fixture F1: 0.929
#   Scenarios: 8/8 passed
#   Latency: ~600 ms CPU
```

Troubleshooting

- Slow first run: The checkpoint (~3 GB) downloads to `~/.opf/privacy_filter/` on first use. Subsequent runs load from cache.
- CUDA out of memory: Use `device="cpu"` or reduce batch size; the 1.5B model requires ~3 GB VRAM on GPU.
- Low recall on secrets/URLs: Use `HybridPIIParser` (not `ModelPIIParser`); the regex backstop specifically covers these labels.
- Span text doesn't match offsets: Offsets are byte-safe character indices into the original string passed to `parse()`. Do not preprocess or strip the string before parsing if you need offsets to remain valid.
- Import error on `privacy_filter`: Ensure you installed both packages: `uv pip install -e ./privacy-filter` and `uv pip install -e ./pii_parser`.
- Model not found: Delete `~/.opf/privacy_filter/` and re-run to trigger a fresh download.
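
On the offsets item above: if "character indices" means ordinary Python `str` indices (an assumption, though it matches how the spans are sliced elsewhere in this document), then they stop lining up with UTF-8 byte positions as soon as the text contains non-ASCII characters. The sketch below converts a character range to a byte range for cases where a downstream system works on encoded bytes. The helper name `char_to_byte_offsets` and the hand-computed offsets are illustrative, not part of the library.

```python
def char_to_byte_offsets(source: str, start: int, end: int) -> tuple[int, int]:
    """Convert character offsets into UTF-8 byte offsets for the same substring."""
    byte_start = len(source[:start].encode("utf-8"))
    byte_end = byte_start + len(source[start:end].encode("utf-8"))
    return byte_start, byte_end


source = "Zo\u00eb: zoe@example.com"  # the name contains one two-byte character
start, end = 5, 20                    # character offsets of the email, computed by hand
b_start, b_end = char_to_byte_offsets(source, start, end)

print(source[start:end])
print(source.encode("utf-8")[b_start:b_end].decode("utf-8"))
print((start, end), (b_start, b_end))
```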