data-anonymizer
Data Anonymizer
Detect and mask personally identifiable information (PII) in text documents and structured data. Supports multiple masking strategies and can process CSV files at scale.
Quick Start
```python
from scripts.data_anonymizer import DataAnonymizer

# Anonymize text
anonymizer = DataAnonymizer()
result = anonymizer.anonymize("Contact John Smith at john@email.com or 555-123-4567")
print(result)
# "Contact [NAME] at [EMAIL] or [PHONE]"

# Anonymize CSV
anonymizer.anonymize_csv("customers.csv", "customers_anon.csv")
```
Features
- PII Detection: Names, emails, phones, SSN, addresses, credit cards, dates
- Multiple Strategies: Mask, redact, hash, fake data replacement
- CSV Processing: Anonymize specific columns or auto-detect
- Reversible Tokens: Optional mapping for de-anonymization
- Custom Patterns: Add your own PII patterns
- Audit Report: List all detected PII with locations
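The four strategies listed above can be pictured as small functions that each turn a detected value into its replacement. The sketch below is illustrative only and is not the library's implementation: `apply_strategy`, the length-preserving redaction, the 8-character SHA-256 prefix, and the fixed `FAKE_VALUES` table (a stand-in for the `faker` dependency) are all assumptions.

```python
import hashlib

# Hypothetical stand-ins for generated fake data (the real tool depends on faker).
FAKE_VALUES = {"email": "jane@example.org", "phone": "555-9876"}


def apply_strategy(value: str, pii_type: str, strategy: str = "mask") -> str:
    """Return the replacement for one detected PII value."""
    if strategy == "mask":
        # Replace with a type label such as [EMAIL].
        return f"[{pii_type.upper()}]"
    if strategy == "redact":
        # Assumption: one asterisk per original character.
        return "*" * len(value)
    if strategy == "hash":
        # Assumption: a short, deterministic SHA-256 prefix.
        return hashlib.sha256(value.encode("utf-8")).hexdigest()[:8]
    if strategy == "fake":
        # Fall back to a label when no fake value is available for the type.
        return FAKE_VALUES.get(pii_type, f"[{pii_type.upper()}]")
    raise ValueError(f"unknown strategy: {strategy}")


for name in ("mask", "redact", "hash", "fake"):
    print(name, "->", apply_strategy("john@test.com", "email", name))
```

Mask and redact discard the value entirely, while hash keeps equal inputs equal, which is what makes joins across files possible.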
API Reference
Initialization
```python
anonymizer = DataAnonymizer(
    strategy="mask",   # mask, redact, hash, fake
    reversible=False,  # Enable token mapping
)
```
Text Anonymization
```python
# Basic anonymization
result = anonymizer.anonymize(text)

# With specific PII types
result = anonymizer.anonymize(text, pii_types=["email", "phone"])

# Get detected PII report
result, report = anonymizer.anonymize(text, return_report=True)
```
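To make the `pii_types` and `return_report` options more concrete, here is a minimal regex-only sketch of how such a call could behave. It is not the library's code: the standalone `anonymize` function, the two simplified patterns, and the report fields (`type`, `value`, `start`, `end`) are assumptions chosen for illustration, and the sketch assumes matches do not overlap.

```python
import re

# Simplified, illustrative patterns; real detectors need broader coverage.
PATTERNS = {
    "email": re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),
    "phone": re.compile(r"\(?\d{3}\)?[ -]?\d{3}-\d{4}"),
}


def anonymize(text, pii_types=None, return_report=False):
    """Replace matches with [TYPE] labels; optionally return the findings."""
    selected = pii_types if pii_types is not None else list(PATTERNS)
    findings = []
    for pii_type in selected:
        for match in PATTERNS[pii_type].finditer(text):
            findings.append({
                "type": pii_type,
                "value": match.group(),
                "start": match.start(),
                "end": match.end(),
            })
    # Replace from right to left so earlier offsets stay valid.
    result = text
    for item in sorted(findings, key=lambda f: f["start"], reverse=True):
        label = f"[{item['type'].upper()}]"
        result = result[: item["start"]] + label + result[item["end"]:]
    report = sorted(findings, key=lambda f: f["start"])
    return (result, report) if return_report else result


masked, report = anonymize("Contact john@email.com or 555-123-4567", return_report=True)
print(masked)
print(report)
```

Offsets in the report refer to the original text, not the masked output, so they can be used to locate each finding in the source document.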
Masking Strategies
```python
text = "Email john@test.com, call 555-1234"

# Mask (default) - replace with type labels
anonymizer.strategy = "mask"
# "Email [EMAIL], call [PHONE]"

# Redact - replace with asterisks
anonymizer.strategy = "redact"
# "Email ***************, call ********"

# Hash - replace with hash
anonymizer.strategy = "hash"
# "Email a1b2c3d4, call e5f6g7h8"

# Fake - replace with realistic fake data
anonymizer.strategy = "fake"
# "Email jane@example.org, call 555-9876"
```
CSV Processing
```python
# Auto-detect PII columns
anonymizer.anonymize_csv("input.csv", "output.csv")

# Specify columns
anonymizer.anonymize_csv(
    "input.csv",
    "output.csv",
    columns=["name", "email", "phone"]
)

# Different strategies per column
anonymizer.anonymize_csv(
    "input.csv",
    "output.csv",
    column_strategies={
        "name": "fake",
        "email": "hash",
        "ssn": "redact"
    }
)
```
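A per-column strategy map can be applied row by row while streaming the file, which keeps memory use flat for large CSVs. The following is a rough standard-library sketch under assumptions: `anonymize_csv_stream` and the two strategy helpers are hypothetical names, only `redact` and `hash` are covered, and the real tool lists pandas as a dependency and may work quite differently.

```python
import csv
import hashlib
import io


def redact(value: str) -> str:
    return "*" * len(value)


def hash_value(value: str) -> str:
    return hashlib.sha256(value.encode("utf-8")).hexdigest()[:8]


STRATEGIES = {"redact": redact, "hash": hash_value}


def anonymize_csv_stream(src, dst, column_strategies):
    """Copy CSV rows from src to dst, transforming the listed columns."""
    reader = csv.DictReader(src)
    writer = csv.DictWriter(dst, fieldnames=reader.fieldnames)
    writer.writeheader()
    for row in reader:
        for column, strategy in column_strategies.items():
            # Skip columns that are absent or empty in this row.
            if column in row and row[column]:
                row[column] = STRATEGIES[strategy](row[column])
        writer.writerow(row)


source = io.StringIO("name,email,ssn\nAnn,ann@x.org,123-45-6789\n")
target = io.StringIO()
anonymize_csv_stream(source, target, {"email": "hash", "ssn": "redact"})
print(target.getvalue())
```

With real files, `src` and `dst` would be handles opened with `newline=""`, as the `csv` module documentation recommends.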
Reversible Anonymization
```python
anonymizer = DataAnonymizer(reversible=True)

# Anonymize with token mapping
result = anonymizer.anonymize("John Smith: john@test.com")
mapping = anonymizer.get_mapping()

# Save mapping securely
anonymizer.save_mapping("mapping.json", encrypt=True, password="secret")

# Later, de-anonymize
anonymizer.load_mapping("mapping.json", password="secret")
original = anonymizer.deanonymize(result)
```
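The idea behind reversible tokens is that each distinct value gets its own numbered token and the token-to-value table is kept aside. Below is a small sketch of that idea for emails only. `ReversibleMasker` and the `[EMAIL_n]` token format are assumptions made for the example; the source does not say what the library's tokens look like.

```python
import re

EMAIL = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


class ReversibleMasker:
    """Hypothetical helper: swap emails for numbered tokens and back."""

    def __init__(self):
        self.mapping = {}   # token -> original value
        self._tokens = {}   # original value -> token

    def anonymize(self, text: str) -> str:
        def replace(match):
            value = match.group()
            if value not in self._tokens:
                token = f"[EMAIL_{len(self._tokens) + 1}]"
                self._tokens[value] = token
                self.mapping[token] = value
            # Repeated values reuse the same token.
            return self._tokens[value]

        return EMAIL.sub(replace, text)

    def deanonymize(self, text: str) -> str:
        for token, value in self.mapping.items():
            text = text.replace(token, value)
        return text


masker = ReversibleMasker()
masked = masker.anonymize("a@x.org wrote to b@y.org and a@x.org")
print(masked)
print(masker.deanonymize(masked))
```

Because the mapping fully undoes the anonymization, it is as sensitive as the original data, which is why the API above offers an encrypted save.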
Custom Patterns
```python
# Add custom PII pattern
anonymizer.add_pattern(
    name="employee_id",
    pattern=r"EMP-\d{6}",
    label="[EMPLOYEE_ID]"
)
```
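A custom pattern is essentially a named regular expression paired with a replacement label. The sketch below shows one way such a registry could work; `PatternRegistry` and its `apply` method are hypothetical and only mirror the `add_pattern` arguments shown above.

```python
import re


class PatternRegistry:
    """Hypothetical registry of named regex patterns and their labels."""

    def __init__(self):
        self._patterns = {}

    def add_pattern(self, name: str, pattern: str, label: str) -> None:
        # Compile once so invalid expressions fail at registration time.
        self._patterns[name] = (re.compile(pattern), label)

    def apply(self, text: str) -> str:
        for regex, label in self._patterns.values():
            # A function replacement avoids backslash escapes in the label.
            text = regex.sub(lambda _match, label=label: label, text)
        return text


registry = PatternRegistry()
registry.add_pattern(name="employee_id", pattern=r"EMP-\d{6}", label="[EMPLOYEE_ID]")
print(registry.apply("Badge EMP-004217 issued to the new hire"))
```

Note that `EMP-\d{6}` also matches the first six digits of a longer number; wrapping the pattern in `\b...\b` is one way to require an exact length.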
CLI Usage
```bash
# Anonymize text file
python data_anonymizer.py --input document.txt --output document_anon.txt

# Anonymize CSV
python data_anonymizer.py --input customers.csv --output customers_anon.csv

# Specific strategy
python data_anonymizer.py --input data.csv --output anon.csv --strategy fake

# Generate audit report
python data_anonymizer.py --input document.txt --report audit.json

# Specific PII types only
python data_anonymizer.py --input doc.txt --types email phone ssn
```
CLI Arguments
| Argument | Description | Default |
|---|---|---|
| `--input` | Input file | Required |
| `--output` | Output file | Required |
| `--strategy` | Masking strategy | mask |
| `--types` | PII types to detect | all |
| | CSV columns to process | auto |
| `--report` | Generate audit report | - |
| | Enable token mapping | False |
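For readers who want to see how these options fit together, here is a hedged `argparse` sketch of a parser with the same shape. It is not taken from `data_anonymizer.py`. The flags `--input`, `--output`, `--strategy`, `--types` and `--report` appear in the usage examples, while `--columns` and `--reversible` are guessed names for the two options whose flags are not shown. `--output` is left optional here because the audit-report example omits it.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(
        prog="data_anonymizer.py",
        description="Detect and mask PII in text and CSV files.",
    )
    parser.add_argument("--input", required=True, help="Input file")
    parser.add_argument("--output", help="Output file")
    parser.add_argument(
        "--strategy",
        choices=["mask", "redact", "hash", "fake"],
        default="mask",
        help="Masking strategy",
    )
    parser.add_argument("--types", nargs="+", default=None,
                        help="PII types to detect (default: all)")
    # Assumed flag name; the source table does not show it.
    parser.add_argument("--columns", nargs="+", default=None,
                        help="CSV columns to process (default: auto-detect)")
    parser.add_argument("--report", metavar="PATH", help="Write an audit report")
    # Assumed flag name; the source table does not show it.
    parser.add_argument("--reversible", action="store_true",
                        help="Enable token mapping")
    return parser


args = build_parser().parse_args(["--input", "doc.txt", "--types", "email", "phone", "ssn"])
print(args)
```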
Supported PII Types
| Type | Examples | Detection method |
|---|---|---|
| Name | John Smith, Mary Johnson | NLP-based |
| Email | user@domain.com | Regex |
| Phone | 555-123-4567, (555) 123-4567 | Regex |
| SSN | 123-45-6789 | Regex |
| Credit card | 4111-1111-1111-1111 | Regex + Luhn |
| Address | 123 Main St, City, ST 12345 | NLP + Regex |
| Date | 01/15/1990, January 15, 1990 | Regex |
| IP address | 192.168.1.1 | Regex |
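The "Regex + Luhn" entry for credit cards means a digit pattern finds candidates and the Luhn checksum filters out digit runs that cannot be card numbers. A minimal sketch of that two-step check follows. The `CARD` pattern, the 13 to 19 digit range, and the helper names are assumptions rather than the library's actual rules.

```python
import re

# Candidate: 13 to 19 digits, optionally separated by spaces or hyphens.
CARD = re.compile(r"\b(?:\d[ -]?){12,18}\d\b")


def luhn_valid(number: str) -> bool:
    """Return True if the digits in `number` pass the Luhn checksum."""
    digits = [int(ch) for ch in number if ch.isdigit()]
    if len(digits) < 13:
        return False
    total = 0
    # Walk from the rightmost digit, doubling every second one.
    for index, digit in enumerate(reversed(digits)):
        if index % 2 == 1:
            digit *= 2
            if digit > 9:
                digit -= 9
        total += digit
    return total % 10 == 0


def find_cards(text: str) -> list:
    """Return candidate card numbers in `text` that pass the Luhn check."""
    return [m.group() for m in CARD.finditer(text) if luhn_valid(m.group())]


print(find_cards("Card 4111-1111-1111-1111 on file, ref 1234-5678-9012-3456"))
```

The checksum step reduces false positives such as order or reference numbers that merely look like card numbers.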
Examples
Anonymize Customer Support Logs
```python
anonymizer = DataAnonymizer(strategy="mask")

log = """
Ticket #1234: Customer John Doe (john.doe@company.com) called about
billing issue. SSN on file: 123-45-6789. Callback number: 555-867-5309.
Address: 123 Oak Street, Springfield, IL 62701.
"""

result = anonymizer.anonymize(log)
print(result)
# Ticket #1234: Customer [NAME] ([EMAIL]) called about
# billing issue. SSN on file: [SSN]. Callback number: [PHONE].
# Address: [ADDRESS].
```
GDPR Compliance for Database Export
```python
anonymizer = DataAnonymizer(strategy="hash")

# Consistent hashing for joins
anonymizer.anonymize_csv(
    "users.csv",
    "users_anon.csv",
    columns=["email", "name", "phone"]
)

anonymizer.anonymize_csv(
    "orders.csv",
    "orders_anon.csv",
    columns=["customer_email"]  # Same hash as users.email
)
```
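Joins keep working after hashing only if the same input always yields the same token in every file. One common way to get that property, while making dictionary attacks on guessable values such as emails harder, is a keyed hash. The sketch below uses HMAC-SHA256 from the standard library; whether the tool itself uses a key or salt is not stated in the source, and `pseudonymize`, the normalization step, and the 16-character length are assumptions.

```python
import hashlib
import hmac


def pseudonymize(value: str, key: bytes, length: int = 16) -> str:
    """Return a stable keyed hash of `value`, truncated to `length` hex chars."""
    # Normalize so "Ann@Example.org" and " ann@example.org " join correctly.
    normalized = value.strip().lower()
    digest = hmac.new(key, normalized.encode("utf-8"), hashlib.sha256).hexdigest()
    return digest[:length]


key = b"demo-key-not-for-production"
users_token = pseudonymize("Ann@Example.org", key)
orders_token = pseudonymize(" ann@example.org ", key)
print(users_token, orders_token, users_token == orders_token)
```

The same key must be used for every file that needs to be joined, and it should be stored separately from the exported data.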
Generate Test Data from Production
```python
anonymizer = DataAnonymizer(strategy="fake")

# Replace real PII with realistic fake data
anonymizer.anonymize_csv(
    "production_data.csv",
    "test_data.csv"
)
# Test data has same structure but fake PII
```
Dependencies
```
pandas>=2.0.0
faker>=18.0.0
```
Limitations
- Name detection may miss unusual names
- Address detection works best for US formats
- Custom patterns may be needed for domain-specific PII
- Fake data replacement doesn't preserve exact format
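Where the exact format matters, for example a phone column validated by a fixed pattern, one possible workaround for the last limitation is to randomize characters in place instead of generating a whole new value. The sketch below is not part of the tool: `fake_like` is a hypothetical helper that swaps each digit for a random digit and each letter for a random letter of the same case, leaving punctuation untouched.

```python
import random
import string


def fake_like(value: str, seed=None) -> str:
    """Return a random string with the same layout as `value`."""
    rng = random.Random(seed)
    out = []
    for ch in value:
        if ch.isdigit():
            out.append(rng.choice(string.digits))
        elif ch.isalpha():
            letters = string.ascii_uppercase if ch.isupper() else string.ascii_lowercase
            out.append(rng.choice(letters))
        else:
            # Keep separators such as "-", "(", ")" and spaces as they are.
            out.append(ch)
    return "".join(out)


print(fake_like("555-867-5309"))
print(fake_like("EMP-004217"))
```

Values produced this way keep their shape but not their validity, so a randomized card number will usually fail a Luhn check.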