# OpenAI Privacy Filter
Skill by ara.so — Daily 2026 Skills collection.
OpenAI Privacy Filter is a bidirectional token-classification model (1.5B params, 50M active) for detecting and masking PII spans in text. It runs in a single forward pass with constrained Viterbi decoding, supports a 128k-token context window, and is licensed Apache 2.0.
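The constrained-Viterbi idea can be sketched with a toy example. This is illustrative only: a BIO tag scheme and the helper names below are assumptions, not the model's actual implementation. Per-token label scores are decoded into the highest-scoring tag sequence that never opens an entity with an `I-` tag.

```python
# Toy constrained Viterbi decode over BIO tags (illustrative sketch only).

def allowed(prev: str, cur: str) -> bool:
    """An I-X tag may only continue a B-X or I-X tag; all else is allowed."""
    if cur.startswith("I-"):
        return prev in (f"B-{cur[2:]}", f"I-{cur[2:]}")
    return True

def constrained_viterbi(scores: list[dict[str, float]]) -> list[str]:
    """scores[t][tag] = score of `tag` at token t; returns the best valid path."""
    # Position 0 behaves as if preceded by "O" (no open entity).
    best = {t: (s, [t]) for t, s in scores[0].items() if allowed("O", t)}
    for step in scores[1:]:
        nxt = {}
        for cur in step:
            cands = [(s + step[cur], path + [cur])
                     for prev, (s, path) in best.items() if allowed(prev, cur)]
            if cands:
                nxt[cur] = max(cands, key=lambda c: c[0])
        best = nxt
    return max(best.values(), key=lambda c: c[0])[1]

# Greedy argmax would start with an invalid lone "I-email"; the constrained
# decode instead commits to a valid B-email / I-email span.
scores = [
    {"O": 0.1, "B-email": 0.0, "I-email": 0.9},
    {"O": 0.0, "B-email": 0.1, "I-email": 0.8},
]
print(constrained_viterbi(scores))  # ['B-email', 'I-email']
```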
## Installation

```bash
pip install -e .
```

or from a cloned repo:

```bash
git clone https://github.com/openai/privacy-filter
cd privacy-filter
pip install -e .
```

After install, the `opf` CLI is available. On first use it downloads the model checkpoint to `~/.opf/privacy_filter` unless `OPF_CHECKPOINT` is set:

```bash
export OPF_CHECKPOINT=/path/to/local/checkpoint_dir
```

## Detected PII Categories
The model detects the following PII categories:

- Bank/card/account numbers
- Physical addresses
- Email addresses
- Personal names
- Phone numbers
- Personal URLs
- Dates of birth / personal dates
- Credentials, tokens, API keys
## CLI Usage

### One-shot redaction

```bash
# Redact inline text
opf "Alice was born on 1990-01-02 and her email is alice@example.com."

# Force CPU inference
opf --device cpu "Alice was born on 1990-01-02."

# Use a specific checkpoint
opf --checkpoint /path/to/checkpoint_dir "Alice Johnson, SSN 123-45-6789"

# Redact an entire file
opf -f /path/to/document.txt

# Pipe input
cat document.txt | grep "sensitive" | opf

# Interactive mode (no input provided)
opf
```
## Evaluation

```bash
# Evaluate on a labeled JSONL dataset
opf eval examples/data/sample_eval_five_examples.jsonl

# See all eval options
opf eval --help
```
## Finetuning

```bash
# Finetune on your labeled dataset
opf train /path/to/train.jsonl --output-dir /path/to/finetuned_checkpoint

# See all training options
opf train --help
```
## Python API

```python
from opf import PrivacyFilter

# Load with default checkpoint (~/.opf/privacy_filter or OPF_CHECKPOINT)
pf = PrivacyFilter()

# Or specify a checkpoint explicitly
pf = PrivacyFilter(checkpoint="/path/to/checkpoint_dir")

# Redact a single string
result = pf.redact("Alice Johnson called from +1-800-555-0199.")
print(result.redacted_text)
# "██████████████ called from ██████████████."

# Access detected spans
for span in result.spans:
    print(span.label, span.text, span.start, span.end)
```
### Batch processing

```python
from opf import PrivacyFilter

pf = PrivacyFilter(device="cuda")  # or "cpu"

texts = [
    "Contact Bob Smith at bob@example.com",
    "Her SSN is 123-45-6789 and DOB is 1985-03-15",
    "API key: sk-abc123xyz789",
]

results = pf.redact_batch(texts)
for r in results:
    print(r.redacted_text)
    print(r.spans)
```

### Precision/Recall tuning via operating points
```python
from opf import PrivacyFilter

# High recall (broader masking, more false positives)
pf_recall = PrivacyFilter(operating_point="high_recall")

# High precision (stricter masking, fewer false positives)
pf_precision = PrivacyFilter(operating_point="high_precision")

# Default balanced
pf_default = PrivacyFilter()
```
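Operating points like these are commonly implemented by thresholding span confidence scores. A toy illustration of the tradeoff follows; `spans_at_threshold`, the threshold values, and the `private_address` label are purely illustrative and not part of the `opf` API:

```python
def spans_at_threshold(spans: list[dict], threshold: float) -> list[dict]:
    """Keep only spans whose confidence clears the threshold.

    Lower threshold -> higher recall (more masking, more false positives);
    higher threshold -> higher precision (fewer false positives, more misses).
    """
    return [s for s in spans if s["score"] >= threshold]

detected = [
    {"text": "Alice", "label": "private_person", "score": 0.98},
    {"text": "Main St", "label": "private_address", "score": 0.55},
]
print(len(spans_at_threshold(detected, 0.3)))  # 2 -- high-recall setting
print(len(spans_at_threshold(detected, 0.9)))  # 1 -- high-precision setting
```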
## Data Format

### Input for eval and training (JSONL)

Each line is a JSON object:

```jsonl
{"text": "Alice was born on 1990-01-02.", "spans": [{"start": 0, "end": 5, "label": "private_person"}, {"start": 18, "end": 28, "label": "private_date"}]}
{"text": "Email bob@corp.com for details.", "spans": [{"start": 6, "end": 18, "label": "private_email"}]}
```

### JSON output schema
```json
{
  "redacted_text": "██████ was born on ██████████.",
  "spans": [
    {
      "label": "private_person",
      "text": "Alice",
      "start": 0,
      "end": 5,
      "score": 0.987
    },
    {
      "label": "private_date",
      "text": "1990-01-02",
      "start": 18,
      "end": 28,
      "score": 0.973
    }
  ]
}
```

See `OUTPUT_SCHEMAS.md` in the repo for the full payload spec.
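A small loader that sanity-checks files in the JSONL format above before running eval or training can be sketched as follows. This is a hedged sketch: `load_jsonl` is not part of the `opf` package, and the checks reflect only the format documented here.

```python
import json

def load_jsonl(path: str) -> list[dict]:
    """Load labeled examples, checking that span offsets fall inside the text."""
    examples = []
    with open(path) as f:
        for lineno, line in enumerate(f, 1):
            ex = json.loads(line)
            if "text" not in ex or "spans" not in ex:
                raise ValueError(f"line {lineno}: expected 'text' and 'spans' keys")
            for span in ex["spans"]:
                if not 0 <= span["start"] < span["end"] <= len(ex["text"]):
                    raise ValueError(f"line {lineno}: span offsets out of range")
            examples.append(ex)
    return examples
```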
## Finetuning Workflow

```bash
# Prepare labeled JSONL (see data format above)

# Run finetuning
opf train train.jsonl \
  --output-dir ./my_finetuned_model \
  --eval-file eval.jsonl \
  --epochs 3 \
  --batch-size 8

# Use the finetuned model
opf --checkpoint ./my_finetuned_model "redact this text"
```

See `FINETUNING.md` and `examples/scripts/finetuning/` for runnable demo harnesses.

## Environment Variables
| Variable | Purpose |
|---|---|
| `OPF_CHECKPOINT` | Path to the model checkpoint directory (overrides the default `~/.opf/privacy_filter`) |
## Project Structure

```
opf/
├── __main__.py           # CLI entrypoint (redact, eval, train)
├── _api.py               # Python-facing API
├── _cli/                 # Argument parsing, terminal rendering
├── _core/                # Runtime loading, span conversion, decoding
├── _eval/                # Dataset loading, metrics, eval runners
├── _train/               # Finetuning argument parsing and runners
├── _model/               # Transformer impl, checkpoint config, weight loading
examples/
├── data/                 # Sample eval/finetune JSONL fixtures
├── scripts/finetuning/   # Runnable finetuning demo scripts
```

## Common Patterns
### Pipeline: sanitize files before uploading to an LLM

```python
from opf import PrivacyFilter

pf = PrivacyFilter()

def sanitize_for_llm(raw_text: str) -> str:
    result = pf.redact(raw_text)
    return result.redacted_text

with open("raw_data.txt") as f:
    clean = sanitize_for_llm(f.read())
print(clean)
```

### Audit: log all detected PII spans without redacting
```python
import json

from opf import PrivacyFilter

pf = PrivacyFilter()

def audit_pii(text: str) -> list[dict]:
    result = pf.redact(text)
    return [
        {"label": s.label, "text": s.text, "start": s.start, "end": s.end}
        for s in result.spans
    ]

findings = audit_pii("Bob Jones (DOB: 1978-06-15) owes $1,200.")
print(json.dumps(findings, indent=2))
```

### Filter specific label types only
```python
from opf import PrivacyFilter

pf = PrivacyFilter()

def redact_only(text: str, labels: list[str]) -> str:
    result = pf.redact(text)
    # Rebuild the text, masking only spans with the chosen labels
    chars = list(text)
    for span in result.spans:
        if span.label in labels:
            for i in range(span.start, span.end):
                chars[i] = "█"
    return "".join(chars)

# Only redact emails and phones, keep names
output = redact_only(
    "Call Alice at 555-1234 or alice@example.com",
    labels=["private_phone", "private_email"],
)
print(output)
# "Call Alice at ████████ or █████████████████"
```

## Troubleshooting
### Model not found / auto-download fails

- Set `OPF_CHECKPOINT` to a local checkpoint directory, or ensure internet access for the first run.
- The checkpoint is downloaded from https://huggingface.co/openai/privacy-filter.

### CUDA out of memory

- Use `--device cpu`, or reduce the batch size with `--batch-size 1`.

### Low recall on domain-specific identifiers

- Finetune on representative labeled examples using `opf train`.
- Try `operating_point="high_recall"` for broader masking.

### Fragmented span boundaries

- Expected in heavy-punctuation or mixed-format text; the Viterbi decoder mitigates this but is not perfect.
- Finetuning on in-domain data is the recommended fix.

### Non-English / non-Latin text

- The model is primarily English; multilingual performance is not guaranteed. Evaluate on your target language before production use.
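As a post-processing workaround for fragmented boundaries, adjacent same-label spans separated only by short punctuation or whitespace gaps can be merged before masking. A sketch under assumptions: `merge_fragmented` is not part of the `opf` API, and spans are represented here as plain `(start, end, label)` tuples.

```python
import string

def merge_fragmented(text: str, spans: list[tuple[int, int, str]],
                     max_gap: int = 2) -> list[tuple[int, int, str]]:
    """Merge same-label spans separated only by short punctuation/space gaps."""
    merged: list[tuple[int, int, str]] = []
    for start, end, label in sorted(spans):
        if merged:
            pstart, pend, plabel = merged[-1]
            gap = text[pend:start]
            if (label == plabel and len(gap) <= max_gap
                    and all(c in string.punctuation + " " for c in gap)):
                merged[-1] = (pstart, end, label)
                continue
        merged.append((start, end, label))
    return merged

# Three fragments of one phone number collapse into a single span.
print(merge_fragmented(
    "555-123-4567",
    [(0, 3, "private_phone"), (4, 7, "private_phone"), (8, 12, "private_phone")],
))  # [(0, 12, 'private_phone')]
```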
## References

- Model weights (HuggingFace)
- Live demo
- Model card (PDF)
- `FINETUNING.md`: finetuning workflow
- `OUTPUT_SCHEMAS.md`: JSON response formats
- `EVAL_AND_OUTPUT_MODES.md`: output and eval mode details