openai-privacy-filter

OpenAI Privacy Filter

Skill by ara.so — Daily 2026 Skills collection.
OpenAI Privacy Filter is a bidirectional token-classification model (1.5B params, 50M active) for detecting and masking PII spans in text. It runs in a single forward pass with constrained Viterbi decoding, supports a 128k-token context window, and is licensed Apache 2.0.
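Constrained Viterbi decoding here means a dynamic-programming pass over per-token label scores that forbids invalid span-tag transitions (an inside tag cannot start a sequence or follow an unrelated tag). The snippet below is a generic illustration of that idea over BIO tags, not the actual `opf` implementation:

```python
import math

# BIO label set for one entity type (e.g. private_person); a real tagger
# would carry B-/I- pairs for every PII category.
LABELS = ["O", "B-PER", "I-PER"]

def allowed(prev, cur):
    """BIO constraint: an inside tag may only continue a matching entity."""
    if cur == "I-PER":
        return prev in ("B-PER", "I-PER")
    return True

def viterbi(scores):
    """scores: one {label: log-score} dict per token.
    Returns the best label sequence that respects the BIO constraints."""
    best = [{} for _ in scores]          # best[i][label] = (score, backpointer)
    for lab in LABELS:
        if lab != "I-PER":               # a sequence cannot start inside an entity
            best[0][lab] = (scores[0].get(lab, -math.inf), None)
    for i in range(1, len(scores)):
        for cur in LABELS:
            cands = [
                (best[i - 1][prev][0] + scores[i].get(cur, -math.inf), prev)
                for prev in best[i - 1]
                if allowed(prev, cur)
            ]
            if cands:
                best[i][cur] = max(cands)
    lab = max(best[-1], key=lambda l: best[-1][l][0])  # best final label
    path = [lab]
    for i in range(len(scores) - 1, 0, -1):            # trace back
        lab = best[i][lab][1]
        path.append(lab)
    return path[::-1]
```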

Installation

```bash
pip install -e .
```

or from a cloned repo:

```bash
git clone https://github.com/openai/privacy-filter
cd privacy-filter
pip install -e .
```

After install, the `opf` CLI is available. On first use it downloads the model checkpoint to `~/.opf/privacy_filter` unless `OPF_CHECKPOINT` is set.

```bash
export OPF_CHECKPOINT=/path/to/local/checkpoint_dir
```

Detected PII Categories

| Label | Description |
| --- | --- |
| `account_number` | Bank/card/account numbers |
| `private_address` | Physical addresses |
| `private_email` | Email addresses |
| `private_person` | Personal names |
| `private_phone` | Phone numbers |
| `private_url` | Personal URLs |
| `private_date` | Dates of birth / personal dates |
| `secret` | Credentials, tokens, API keys |
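For downstream reporting it can be handy to tally findings by these labels. A small illustrative helper; the `Span` tuple here merely mirrors the span fields shown in the Python API section below and is not part of `opf`:

```python
from collections import Counter, namedtuple

# Stand-in for opf's span objects (label/text/start/end fields).
Span = namedtuple("Span", "label text start end")

def label_histogram(spans):
    """Count detected spans per PII category."""
    return Counter(s.label for s in spans)
```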

CLI Usage

One-shot redaction

```bash
# Redact inline text
opf "Alice was born on 1990-01-02 and her email is alice@example.com."

# Force CPU inference
opf --device cpu "Alice was born on 1990-01-02."

# Use a specific checkpoint
opf --checkpoint /path/to/checkpoint_dir "Alice Johnson, SSN 123-45-6789"

# Redact an entire file
opf -f /path/to/document.txt

# Pipe input
cat document.txt | grep "sensitive" | opf

# Interactive mode (no input provided)
opf
```

Evaluation

```bash
# Evaluate on a labeled JSONL dataset
opf eval examples/data/sample_eval_five_examples.jsonl

# See all eval options
opf eval --help
```

Finetuning

```bash
# Finetune on your labeled dataset
opf train /path/to/train.jsonl --output-dir /path/to/finetuned_checkpoint

# See all training options
opf train --help
```

Python API

```python
from opf import PrivacyFilter

# Load with default checkpoint (~/.opf/privacy_filter or OPF_CHECKPOINT)
pf = PrivacyFilter()

# Or specify a checkpoint explicitly
pf = PrivacyFilter(checkpoint="/path/to/checkpoint_dir")

# Redact a single string
result = pf.redact("Alice Johnson called from +1-800-555-0199.")
print(result.redacted_text)
# "██████████████ called from ██████████████."

# Access detected spans
for span in result.spans:
    print(span.label, span.text, span.start, span.end)
```

Batch processing

```python
from opf import PrivacyFilter

pf = PrivacyFilter(device="cuda")  # or "cpu"

texts = [
    "Contact Bob Smith at bob@example.com",
    "Her SSN is 123-45-6789 and DOB is 1985-03-15",
    "API key: sk-abc123xyz789",
]

results = pf.redact_batch(texts)
for r in results:
    print(r.redacted_text)
    print(r.spans)
```

Precision/Recall tuning via operating points

```python
from opf import PrivacyFilter

# High recall (broader masking, more false positives)
pf_recall = PrivacyFilter(operating_point="high_recall")

# High precision (stricter masking, fewer false positives)
pf_precision = PrivacyFilter(operating_point="high_precision")

# Default balanced operating point
pf_default = PrivacyFilter()
```

Data Format

Input for eval and training (JSONL)

Each line is a JSON object:

```jsonl
{"text": "Alice was born on 1990-01-02.", "spans": [{"start": 0, "end": 5, "label": "private_person"}, {"start": 18, "end": 28, "label": "private_date"}]}
{"text": "Email bob@corp.com for details.", "spans": [{"start": 6, "end": 18, "label": "private_email"}]}
```
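Before running `opf eval` or `opf train`, a quick check that every span's offsets actually line up with its text can save a failed run. This validator is a generic sketch, not part of the `opf` CLI:

```python
import json

def check_jsonl(path):
    """Load a labeled JSONL file and sanity-check span offsets."""
    examples = []
    with open(path) as f:
        for lineno, line in enumerate(f, 1):
            ex = json.loads(line)
            for span in ex["spans"]:
                # start/end are character offsets into "text" (end exclusive).
                if not 0 <= span["start"] < span["end"] <= len(ex["text"]):
                    raise ValueError(f"bad span offsets on line {lineno}")
            examples.append(ex)
    return examples
```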

JSON output schema

```json
{
  "redacted_text": "██████ was born on ██████████.",
  "spans": [
    {
      "label": "private_person",
      "text": "Alice",
      "start": 0,
      "end": 5,
      "score": 0.987
    },
    {
      "label": "private_date",
      "text": "1990-01-02",
      "start": 18,
      "end": 28,
      "score": 0.973
    }
  ]
}
```

See `OUTPUT_SCHEMAS.md` in the repo for the full payload spec.
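Because `start`/`end` are character offsets into the original text (end exclusive), a stored payload can be re-applied without rerunning the model. A minimal illustrative helper; it masks character-for-character, which may differ from `opf`'s own mask rendering:

```python
def apply_spans(text: str, spans: list[dict], mask: str = "█") -> str:
    """Mask every character covered by a span from a JSON payload."""
    chars = list(text)
    for span in spans:
        for i in range(span["start"], span["end"]):
            chars[i] = mask
    return "".join(chars)
```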

Finetuning Workflow

```bash
# Prepare labeled JSONL (see data format above), then run finetuning
opf train train.jsonl \
  --output-dir ./my_finetuned_model \
  --eval-file eval.jsonl \
  --epochs 3 \
  --batch-size 8

# Use the finetuned model
opf --checkpoint ./my_finetuned_model "redact this text"
```

See `FINETUNING.md` and `examples/scripts/finetuning/` for runnable demo harnesses.

Environment Variables

| Variable | Purpose |
| --- | --- |
| `OPF_CHECKPOINT` | Path to model checkpoint directory (overrides default `~/.opf/privacy_filter`) |

Project Structure

```
opf/
├── __main__.py          # CLI entrypoint (redact, eval, train)
├── _api.py              # Python-facing API
├── _cli/                # Argument parsing, terminal rendering
├── _core/               # Runtime loading, span conversion, decoding
├── _eval/               # Dataset loading, metrics, eval runners
├── _train/              # Finetuning argument parsing and runners
├── _model/              # Transformer impl, checkpoint config, weight loading
examples/
├── data/                # Sample eval/finetune JSONL fixtures
├── scripts/finetuning/  # Runnable finetuning demo scripts
```

Common Patterns

Pipeline: sanitize files before uploading to an LLM

```python
from opf import PrivacyFilter

pf = PrivacyFilter()

def sanitize_for_llm(raw_text: str) -> str:
    result = pf.redact(raw_text)
    return result.redacted_text

with open("raw_data.txt") as f:
    clean = sanitize_for_llm(f.read())

print(clean)
```

Audit: log all detected PII spans without redacting

```python
import json

from opf import PrivacyFilter

pf = PrivacyFilter()

def audit_pii(text: str) -> list[dict]:
    result = pf.redact(text)
    return [
        {"label": s.label, "text": s.text, "start": s.start, "end": s.end}
        for s in result.spans
    ]

findings = audit_pii("Bob Jones (DOB: 1978-06-15) owes $1,200.")
print(json.dumps(findings, indent=2))
```

Filter specific label types only

```python
from opf import PrivacyFilter

pf = PrivacyFilter()

def redact_only(text: str, labels: list[str]) -> str:
    result = pf.redact(text)
    # Rebuild text, redacting only spans with the chosen labels
    chars = list(text)
    for span in result.spans:
        if span.label in labels:
            for i in range(span.start, span.end):
                chars[i] = "█"
    return "".join(chars)

# Only redact emails and phones, keep names
output = redact_only(
    "Call Alice at 555-1234 or alice@example.com",
    labels=["private_phone", "private_email"],
)
print(output)
# "Call Alice at ████████ or █████████████████"
```

Troubleshooting

Model not found / auto-download fails
- Set `OPF_CHECKPOINT` to a local checkpoint directory, or pass `--checkpoint /path/to/checkpoint_dir`.

CUDA out of memory
- Use `--device cpu` or reduce the batch size with `--batch-size 1`.

Low recall on domain-specific identifiers
- Finetune on representative labeled examples using `opf train`.
- Try `operating_point="high_recall"` for broader masking.

Fragmented span boundaries
- Expected in heavy-punctuation or mixed-format text; the Viterbi decoder mitigates this but is not perfect.
- Finetuning on in-domain data is the recommended fix.

Non-English / non-Latin text
- The model is primarily English; multilingual performance is not guaranteed. Evaluate on your target language before production use.

References
