openai-privacy-filter

OpenAI Privacy Filter

Skill by ara.so — Daily 2026 Skills collection.
OpenAI Privacy Filter is a bidirectional token-classification model (1.5B params, 50M active) for detecting and masking PII spans in text. It runs in a single forward pass with constrained Viterbi decoding, supports a 128k-token context window, and is licensed Apache 2.0.
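Constrained Viterbi decoding here means a dynamic-programming pass over per-token label scores that forbids invalid span-tag transitions (an inside tag cannot start a sequence or follow an unrelated tag). The snippet below is a generic illustration of that idea over BIO tags, not the actual `opf` implementation:

```python
import math

# BIO label set for one entity type (e.g. private_person); a real tagger
# would carry B-/I- pairs for every PII category.
LABELS = ["O", "B-PER", "I-PER"]

def allowed(prev, cur):
    """BIO constraint: an inside tag may only continue a matching entity."""
    if cur == "I-PER":
        return prev in ("B-PER", "I-PER")
    return True

def viterbi(scores):
    """scores: one {label: log-score} dict per token.
    Returns the best label sequence that respects the BIO constraints."""
    best = [{} for _ in scores]          # best[i][label] = (score, backpointer)
    for lab in LABELS:
        if lab != "I-PER":               # a sequence cannot start inside an entity
            best[0][lab] = (scores[0].get(lab, -math.inf), None)
    for i in range(1, len(scores)):
        for cur in LABELS:
            cands = [
                (best[i - 1][prev][0] + scores[i].get(cur, -math.inf), prev)
                for prev in best[i - 1]
                if allowed(prev, cur)
            ]
            if cands:
                best[i][cur] = max(cands)
    lab = max(best[-1], key=lambda l: best[-1][l][0])  # best final label
    path = [lab]
    for i in range(len(scores) - 1, 0, -1):            # trace back
        lab = best[i][lab][1]
        path.append(lab)
    return path[::-1]
```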

Installation

```bash
pip install -e .
```

or from a cloned repo:

```bash
git clone https://github.com/openai/privacy-filter
cd privacy-filter
pip install -e .
```

After install, the `opf` CLI is available. On first use it downloads the model checkpoint to `~/.opf/privacy_filter` unless `OPF_CHECKPOINT` is set.

```bash
export OPF_CHECKPOINT=/path/to/local/checkpoint_dir
```

Detected PII Categories

| Label | Description |
| --- | --- |
| `account_number` | Bank/card/account numbers |
| `private_address` | Physical addresses |
| `private_email` | Email addresses |
| `private_person` | Personal names |
| `private_phone` | Phone numbers |
| `private_url` | Personal URLs |
| `private_date` | Dates of birth / personal dates |
| `secret` | Credentials, tokens, API keys |
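For downstream reporting it can be handy to tally findings by these labels. A small illustrative helper; the `Span` tuple here merely mirrors the span fields shown in the Python API section below and is not part of `opf`:

```python
from collections import Counter, namedtuple

# Stand-in for opf's span objects (label/text/start/end fields).
Span = namedtuple("Span", "label text start end")

def label_histogram(spans):
    """Count detected spans per PII category."""
    return Counter(s.label for s in spans)
```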

CLI Usage

One-shot redaction

```bash
# Redact inline text
opf "Alice was born on 1990-01-02 and her email is alice@example.com."

# Force CPU inference
opf --device cpu "Alice was born on 1990-01-02."

# Use a specific checkpoint
opf --checkpoint /path/to/checkpoint_dir "Alice Johnson, SSN 123-45-6789"

# Redact an entire file
opf -f /path/to/document.txt

# Pipe input
cat document.txt | grep "sensitive" | opf

# Interactive mode (no input provided)
opf
```

Evaluation

```bash
# Evaluate on a labeled JSONL dataset
opf eval examples/data/sample_eval_five_examples.jsonl

# See all eval options
opf eval --help
```

Finetuning

```bash
# Finetune on your labeled dataset
opf train /path/to/train.jsonl --output-dir /path/to/finetuned_checkpoint

# See all training options
opf train --help
```

Python API

```python
from opf import PrivacyFilter

# Load with default checkpoint (~/.opf/privacy_filter or OPF_CHECKPOINT)
pf = PrivacyFilter()

# Or specify a checkpoint explicitly
pf = PrivacyFilter(checkpoint="/path/to/checkpoint_dir")

# Redact a single string
result = pf.redact("Alice Johnson called from +1-800-555-0199.")
print(result.redacted_text)
# "██████████████ called from ██████████████."

# Access detected spans
for span in result.spans:
    print(span.label, span.text, span.start, span.end)
```

Batch processing

```python
from opf import PrivacyFilter

pf = PrivacyFilter(device="cuda")  # or "cpu"

texts = [
    "Contact Bob Smith at bob@example.com",
    "Her SSN is 123-45-6789 and DOB is 1985-03-15",
    "API key: sk-abc123xyz789",
]

results = pf.redact_batch(texts)
for r in results:
    print(r.redacted_text)
    print(r.spans)
```

Precision/Recall tuning via operating points

```python
from opf import PrivacyFilter

# High recall (broader masking, more false positives)
pf_recall = PrivacyFilter(operating_point="high_recall")

# High precision (stricter masking, fewer false positives)
pf_precision = PrivacyFilter(operating_point="high_precision")

# Default balanced operating point
pf_default = PrivacyFilter()
```

Data Format

Input for eval and training (JSONL)

Each line is a JSON object:

```jsonl
{"text": "Alice was born on 1990-01-02.", "spans": [{"start": 0, "end": 5, "label": "private_person"}, {"start": 18, "end": 28, "label": "private_date"}]}
{"text": "Email bob@corp.com for details.", "spans": [{"start": 6, "end": 18, "label": "private_email"}]}
```
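Before running `opf eval` or `opf train`, a quick check that every span's offsets actually line up with its text can save a failed run. This validator is a generic sketch, not part of the `opf` CLI:

```python
import json

def check_jsonl(path):
    """Load a labeled JSONL file and sanity-check span offsets."""
    examples = []
    with open(path) as f:
        for lineno, line in enumerate(f, 1):
            ex = json.loads(line)
            for span in ex["spans"]:
                # start/end are character offsets into "text" (end exclusive).
                if not 0 <= span["start"] < span["end"] <= len(ex["text"]):
                    raise ValueError(f"bad span offsets on line {lineno}")
            examples.append(ex)
    return examples
```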

JSON output schema

```json
{
  "redacted_text": "██████ was born on ██████████.",
  "spans": [
    {
      "label": "private_person",
      "text": "Alice",
      "start": 0,
      "end": 5,
      "score": 0.987
    },
    {
      "label": "private_date",
      "text": "1990-01-02",
      "start": 18,
      "end": 28,
      "score": 0.973
    }
  ]
}
```

See `OUTPUT_SCHEMAS.md` in the repo for the full payload spec.
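Because `start`/`end` are character offsets into the original text (end exclusive), a stored payload can be re-applied without rerunning the model. A minimal illustrative helper; it masks character-for-character, which may differ from `opf`'s own mask rendering:

```python
def apply_spans(text: str, spans: list[dict], mask: str = "█") -> str:
    """Mask every character covered by a span from a JSON payload."""
    chars = list(text)
    for span in spans:
        for i in range(span["start"], span["end"]):
            chars[i] = mask
    return "".join(chars)
```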

Finetuning Workflow

```bash
# Prepare labeled JSONL (see data format above), then run finetuning
opf train train.jsonl \
  --output-dir ./my_finetuned_model \
  --eval-file eval.jsonl \
  --epochs 3 \
  --batch-size 8

# Use the finetuned model
opf --checkpoint ./my_finetuned_model "redact this text"
```

See `FINETUNING.md` and `examples/scripts/finetuning/` for runnable demo harnesses.

Environment Variables

| Variable | Purpose |
| --- | --- |
| `OPF_CHECKPOINT` | Path to model checkpoint directory (overrides default `~/.opf/privacy_filter`) |

Project Structure

```
opf/
├── __main__.py          # CLI entrypoint (redact, eval, train)
├── _api.py              # Python-facing API
├── _cli/                # Argument parsing, terminal rendering
├── _core/               # Runtime loading, span conversion, decoding
├── _eval/               # Dataset loading, metrics, eval runners
├── _train/              # Finetuning argument parsing and runners
├── _model/              # Transformer impl, checkpoint config, weight loading
examples/
├── data/                # Sample eval/finetune JSONL fixtures
├── scripts/finetuning/  # Runnable finetuning demo scripts
```

Common Patterns

Pipeline: sanitize files before uploading to an LLM

```python
from opf import PrivacyFilter

pf = PrivacyFilter()

def sanitize_for_llm(raw_text: str) -> str:
    result = pf.redact(raw_text)
    return result.redacted_text

with open("raw_data.txt") as f:
    clean = sanitize_for_llm(f.read())

print(clean)
```

Audit: log all detected PII spans without redacting

```python
import json

from opf import PrivacyFilter

pf = PrivacyFilter()

def audit_pii(text: str) -> list[dict]:
    result = pf.redact(text)
    return [
        {"label": s.label, "text": s.text, "start": s.start, "end": s.end}
        for s in result.spans
    ]

findings = audit_pii("Bob Jones (DOB: 1978-06-15) owes $1,200.")
print(json.dumps(findings, indent=2))
```

Filter specific label types only

```python
from opf import PrivacyFilter

pf = PrivacyFilter()

def redact_only(text: str, labels: list[str]) -> str:
    result = pf.redact(text)
    # Rebuild text, redacting only spans with the chosen labels
    chars = list(text)
    for span in result.spans:
        if span.label in labels:
            for i in range(span.start, span.end):
                chars[i] = "█"
    return "".join(chars)

# Only redact emails and phones, keep names
output = redact_only(
    "Call Alice at 555-1234 or alice@example.com",
    labels=["private_phone", "private_email"],
)
print(output)
# "Call Alice at ████████ or █████████████████"
```

Troubleshooting

Model not found / auto-download fails
- Set `OPF_CHECKPOINT` to a local checkpoint directory, or pass `--checkpoint /path/to/checkpoint_dir`.

CUDA out of memory
- Use `--device cpu` or reduce the batch size with `--batch-size 1`.

Low recall on domain-specific identifiers
- Finetune on representative labeled examples using `opf train`.
- Try `operating_point="high_recall"` for broader masking.

Fragmented span boundaries
- Expected in heavy-punctuation or mixed-format text; the Viterbi decoder mitigates this but is not perfect.
- Finetuning on in-domain data is the recommended fix.

Non-English / non-Latin text
- The model is primarily English; multilingual performance is not guaranteed. Evaluate on your target language before production use.

References
