anonymize-logs
Anonymize Logs
Sanitize customer-provided log files so they are safe to commit to source control.
What you provide
| Input | How to provide |
|---|---|
| Log file(s) to sanitize | @-mention the file or paste the content directly |
| Output location (optional) | free text path, defaults to same directory with |
| In-place override (optional) | say "in place" to overwrite the original |
Golden rule — never reformat the content
Only replace sensitive values. Do not touch anything else.
The ingest pipeline parser depends on exact whitespace, delimiters, quoting, and line structure.
- NDJSON (one JSON object per line): do not pretty-print, re-indent, or restructure. Each line must remain a single compact JSON object on one line.
- Syslog / CEF / key-value logs: do not add or remove spaces, change quoting, or normalize field order.
- Multiline logs: preserve line grouping exactly.
- Replace only the values that identify real people, systems, or organizations — preserve field names, delimiters, structural tokens, and everything else character-for-character.
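A minimal sketch of the rule above, assuming the sensitive value is already known and appears literally in the raw line. The function name `replace_value_in_place` and the sample event are hypothetical. The point it illustrates is that editing the raw text keeps every byte of structure, while parsing and re-serializing the JSON does not.

```python
import json


def replace_value_in_place(line: str, original: str, replacement: str) -> str:
    """Swap one sensitive value in the raw text of a line, touching nothing else."""
    return line.replace(original, replacement)


# Hypothetical NDJSON event; note the compact separators and the double space.
raw = '{"user":"alice@corp.example",  "action":"ALLOW","port":443}'

# Raw-text replacement: whitespace, quoting, and key order are untouched.
safe = replace_value_in_place(raw, "alice@corp.example", "user1@example.com")

# What NOT to do: a parse/serialize round trip rewrites the separators.
reserialized = json.dumps(json.loads(raw))

print(safe)
print(reserialized == raw)  # False: the round trip changed the line
```

Plain substring replacement can also hit a longer value that merely contains the original, and it will miss values that are JSON-escaped differently in the raw line, so treat this as a starting point rather than a complete sanitizer.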
Workflow
Step 1 — Line-by-line replacement
Read every line and replace all sensitive values inline. Cover at minimum:
- Authentication artifacts: API keys, bearer tokens, passwords, OAuth tokens, base64-encoded credentials, private keys and certs (PEM blocks, SSH private keys), TLS/SSH fingerprints (JA3/JA4 hashes, SSH host key fingerprints, certificate fingerprints), DHCP fingerprints, partial secrets (`token_prefix`, `password_hash_prefix`, `hashed_token`); partial exposure still identifies the credential
- Personal identifiers: email addresses (including CC/BCC lists, delegate/owner/creator email variants), usernames, display names, employee IDs, phone numbers, principal names (e.g. `user@tenant.onmicrosoft.com`), email subjects and body text
- Organizational identifiers: company names, tenant IDs (including `home_tenant_id`, `resource_tenant_id`, `aad_tenant_id` variants), account IDs, subscription IDs, billing account IDs, org slugs embedded in paths or JSON fields, org unit paths (e.g. `orgunit_path`, `org_unit_path`), department names and IDs, cost center IDs, Windows SIDs (Security Identifiers) in pipe names, task names, or registry paths
- Infrastructure identifiers: internal hostnames, FQDNs, private IP addresses, MAC addresses, internal URLs (staging/prod hostnames, internal tool domains), cloud resource IDs (ARNs, S3 bucket names, GCP project IDs, Azure subscription/resource names), Kubernetes cluster names, node names, pod names, and namespace names, container names and IDs, database hostnames and names (including `database.host`, `database.name`, `database_principal_name`), Windows domain topology fields (domain controller hostnames, NT domain names, `domain_controller_object_guid`, `domain_controller_object_sid`)
- Device and hardware identifiers: serial numbers, hardware UUIDs, machine IDs, device UUIDs, BIOS/firmware version strings that are unique to a specific device
- File system paths: process command lines, file paths, registry key paths, and log file paths that embed usernames, org names, or internal system structure (e.g. `C:\Users\alice\`, `/home/bob/`, `HKLM\...\S-1-5-21-...`)
- Connection strings: database URIs, Redis URLs, any connection string that includes credentials or internal hostnames
- Resource ownership: owner email, creator email, last-modified-by identity, delegate user email, assignee email, impersonator fields — any field that names a specific person as the actor on a resource
- Tracking identifiers: session IDs, request/correlation IDs, transaction IDs, or any long opaque string tied to a real entity
- Hash values: replace when they could be derived from sensitive input (password hashes, HMAC secrets) — preserve file hashes (MD5, SHA1, SHA256 of file content) and other content-addressable references (git SHAs, TLS cert hashes used as identifiers)
- Geographic specifics: precise GPS coordinates, real street addresses — city and country names are generally safe to keep
Apply placeholder conventions and shape rules (see below) as you go.
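As a rough illustration of the line-by-line pass, the sketch below flags two of the categories above (email addresses and private IPv4 addresses) in a single line. The patterns, the function name `find_sensitive`, and the sample syslog line are assumptions made for the example; real logs need far broader coverage than two regexes.

```python
import ipaddress
import re

# Deliberately simple patterns; they are illustrative, not exhaustive.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
IPV4_RE = re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b")


def find_sensitive(line: str) -> list[str]:
    """Return email addresses and private IPv4 addresses found in one line."""
    hits = EMAIL_RE.findall(line)
    for candidate in IPV4_RE.findall(line):
        try:
            if ipaddress.ip_address(candidate).is_private:
                hits.append(candidate)
        except ValueError:
            pass  # not a valid address (e.g. 999.1.1.1); leave it alone
    return hits


# Hypothetical syslog-style line.
sample = "<134>Jan 12 10:00:01 fw01 src=10.1.2.3 dst=8.8.8.8 user=bob@corp.example act=DENY"
print(find_sensitive(sample))  # ['bob@corp.example', '10.1.2.3']
```

The function only reports candidates; the replacement itself should still be done on the raw text so the line structure stays intact.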
Step 2 — Verify structure is intact
Confirm after sanitization:
- Line count is unchanged
- JSON lines are still valid JSON (for NDJSON files):
```bash
# Replace FILE with the path to the sanitized log.
python3 -c "
import json, sys
with open('FILE') as f:
    for i, line in enumerate(f, 1):
        line = line.strip()
        if line:
            try:
                json.loads(line)
            except Exception as e:
                print(f'Line {i}: {e}')
"
```
- Timestamps still match the format the pipeline uses for date parsing
- Enum / status / action values that pipeline conditions branch on are untouched
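The first two checks can be combined into one helper. This is a sketch only: `verify_structure` is a hypothetical name, it compares the texts in memory rather than reading files, and it does not cover the timestamp or enum checks, which depend on the specific pipeline.

```python
import json


def verify_structure(original: str, sanitized: str, ndjson: bool = False) -> list[str]:
    """Return a list of human-readable problems; an empty list means both checks passed."""
    problems = []
    orig_lines = original.split("\n")
    new_lines = sanitized.split("\n")

    # Check 1: sanitizing must never add or drop lines.
    if len(orig_lines) != len(new_lines):
        problems.append(f"line count changed: {len(orig_lines)} -> {len(new_lines)}")

    # Check 2: for NDJSON, every non-empty line must still parse on its own.
    if ndjson:
        for number, line in enumerate(new_lines, 1):
            if not line.strip():
                continue
            try:
                json.loads(line)
            except json.JSONDecodeError as exc:
                problems.append(f"line {number}: {exc}")
    return problems


before = '{"user":"alice"}\n{"user":"bob"}\n'
after = '{"user":"user1"}\n{"user":"user2"}\n'
print(verify_structure(before, after, ndjson=True))  # []
```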
Placeholder conventions
Use consistent, realistic-looking replacements, not `REDACTED` strings, which break format-sensitive parsers.

| Type | Replacement |
|---|---|
| Email | |
| IPv4 | RFC 5737 ranges: |
| IPv6 | |
| Hostname / FQDN | |
| Domain | |
| UUID | |
| API key / token | |
| Username | |
| Display name | |
| Org / company name | |
| Account / tenant ID | |
| Cloud resource ID | |
| S3 bucket name | |
| MAC address | |
| Serial number | |
| Device / machine ID | use a synthetic UUID or |
| Windows SID | |
| File path (Windows) | |
| File path (Unix) | |
| Kubernetes cluster | |
| Phone number | |
| Database host / name | |
| Department / org unit | |
| Hashed / partial token | replace with full synthetic token of same format |
| DHCP fingerprint | |
| JA4 fingerprint | replace with same-length hex string |
Consistency rule: map identical original values to identical placeholders throughout the file. If the same IP appears 10 times, it must become the same replacement IP all 10 times — so cross-event correlations remain testable.
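One way to honour the consistency rule is to keep a lookup table for the whole file, so a value seen before always gets the replacement it got the first time. The class below is a hypothetical sketch for IPv4 only; it hands out addresses from 192.0.2.0/24, one of the documentation ranges reserved by RFC 5737, and the sequential numbering scheme is an assumption of this example.

```python
class PlaceholderMap:
    """Hand out one stable replacement per distinct original value."""

    def __init__(self) -> None:
        self._ipv4: dict[str, str] = {}

    def ipv4(self, original: str) -> str:
        """Return the replacement for an IP, allocating a new one on first sight."""
        if original not in self._ipv4:
            host = len(self._ipv4) + 1
            if host > 254:
                raise ValueError("more than 254 distinct IPs; extend the sketch to another range")
            self._ipv4[original] = f"192.0.2.{host}"
        return self._ipv4[original]


mapping = PlaceholderMap()
first = mapping.ipv4("10.0.0.5")
second = mapping.ipv4("10.0.0.9")
again = mapping.ipv4("10.0.0.5")
print(first, second, again)  # 192.0.2.1 192.0.2.2 192.0.2.1
```

The same pattern extends to hostnames, usernames, and IDs by adding one dictionary per value type.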
Shape rule — replacements must match the original format
Every replacement must have the same shape as the original value. The parser and pipeline conditions depend on value format, not just field presence.
- Numeric ID → numeric ID: `/d/123/edit` → `/d/456/edit`, not `/d/example-document-id/edit`
- UUID → UUID: a real UUID must become a synthetic UUID of the same version, not a descriptive string
- URL → URL: replace only the sensitive segment (hostname, path ID); preserve the scheme, path structure, and query string shape
  - `https://docs.google.com/drawings/d/123/edit` → `https://docs.google.com/drawings/d/000000000000/edit` (replace the ID, not the host; `docs.google.com` is a public service name, not an org identifier)
  - `https://internal.corp.com/api/v1/resource` → `https://host-redacted.example.local/api/v1/resource` (replace the internal hostname, keep the path)
- String ID → same-length or same-format string: opaque alphanumeric IDs should become opaque alphanumeric placeholders of similar length, not descriptive names
- Hostname in a URL vs. standalone hostname: only replace hostnames that identify real internal infrastructure; public well-known hostnames (`docs.google.com`, `api.github.com`, `s3.amazonaws.com`) identify a service, not an organization, and do not need to be replaced
Malformed or garbage values must not be replaced. If a value looks broken, synthetic, or contains no real identifying information (e.g. `http://1=Y +z\\`, `00/00/0000`, `N/A`, empty strings, placeholder-looking values), leave it exactly as-is. Replacing a malformed value with a well-formed placeholder changes the shape and can alter pipeline behaviour: a grok pattern that fails on the original will now succeed on the sanitized version, masking the real error.

If you are unsure what shape to use, look at neighbouring values of the same field type in the same file and match their format.
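For opaque alphanumeric IDs, the shape rule can be approximated by swapping each character for another of the same class. The helper below is a hypothetical sketch: `same_shape` and its optional `salt` parameter are names invented for this example. It is deterministic, so identical inputs give identical outputs, and it is not suitable for UUIDs or hex strings, because letters may map outside the hex alphabet and the UUID version digit may change.

```python
import hashlib
import string


def same_shape(value: str, salt: str = "") -> str:
    """Map digits to digits and ASCII letters to letters of the same case; keep the rest."""
    digest = hashlib.sha256((salt + value).encode("utf-8")).digest()
    out = []
    for index, char in enumerate(value):
        seed = digest[index % len(digest)]
        if char in string.digits:
            out.append(string.digits[seed % 10])
        elif char in string.ascii_lowercase:
            out.append(string.ascii_lowercase[seed % 26])
        elif char in string.ascii_uppercase:
            out.append(string.ascii_uppercase[seed % 26])
        else:
            out.append(char)  # separators such as '-', '/', '.' stay put
    return "".join(out)


replaced = same_shape("AbC-123/xyz")
print(len(replaced) == len("AbC-123/xyz"))  # True: length and separators are preserved
```

Passing a private `salt` makes it harder to recover short originals by hashing guesses; whether that matters depends on how widely the sanitized file is shared.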
What to preserve
Do not replace:
- Protocol names, action verbs, event types, severity levels (`ALLOW`, `DENY`, `INFO`, `ERROR`)
- HTTP status codes, port numbers, numeric metric values
- Field names and keys
- Timestamps (format and timezone must stay intact)
- Structural tokens (brackets, braces, pipes, commas, tabs)
- Public well-known service hostnames in URLs (`docs.google.com`, `api.github.com`, etc.): replace the path ID if it is sensitive, not the host
- File hashes (MD5, SHA1, SHA256 of file content): these are content-addressable and safe; do not replace them
- User agent strings (`Mozilla/5.0 ...`): these reveal browser/OS type but not identity; safe to keep
- City and country names in geo enrichment fields: replace only precise coordinates and street addresses
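The hostname item in the list above could be handled along these lines. Everything here is an assumption for illustration: the `PUBLIC_HOSTS` allowlist contains only the hostnames this document names, `sanitize_url_host` is a hypothetical helper, and the hostname pattern is a conservative guard so that malformed values are left untouched. The URL is edited as raw text so the scheme, port, path, and query string keep their exact shape.

```python
import re
from urllib.parse import urlsplit

# Assumed allowlist; extend it with the public services that appear in your logs.
PUBLIC_HOSTS = {"docs.google.com", "api.github.com", "s3.amazonaws.com"}
HOSTNAME_RE = re.compile(r"^[a-z0-9.-]+$")


def sanitize_url_host(url: str, replacement_host: str) -> str:
    """Replace an internal hostname inside a URL; return the URL unchanged otherwise."""
    try:
        host = urlsplit(url).hostname
    except ValueError:
        return url  # unparseable: leave malformed values exactly as they are
    if host is None or host in PUBLIC_HOSTS or not HOSTNAME_RE.match(host):
        return url
    # urlsplit lowercases the hostname, so search a lowercased copy of the URL.
    start = url.lower().find(host)
    if start == -1:
        return url
    return url[:start] + replacement_host + url[start + len(host):]


print(sanitize_url_host("https://internal.corp.com/api/v1/resource", "host-redacted.example.local"))
# https://host-redacted.example.local/api/v1/resource
```

The sketch assumes ASCII URLs and replaces the first occurrence of the hostname text, which is adequate for typical log URLs but would need more care when the same string also appears in a userinfo segment.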