Anonymize Logs

Sanitize customer-provided log files so they are safe to commit to source control.

What you provide

| Input | How to provide |
|---|---|
| Log file(s) to sanitize | @-mention files or paste inline |
| Output location (optional) | free text path, defaults to same directory with .sanitized suffix |
| In-place override (optional) | say "in place" to overwrite the original |

Golden rule — never reformat the content

Only replace sensitive values. Do not touch anything else.
The ingest pipeline parser depends on exact whitespace, delimiters, quoting, and line structure.
  • NDJSON (one JSON object per line): do not pretty-print, re-indent, or restructure. Each line must remain a single compact JSON object on one line.
  • Syslog / CEF / key-value logs: do not add or remove spaces, change quoting, or normalize field order.
  • Multiline logs: preserve line grouping exactly.
  • Replace only the values that identify real people, systems, or organizations — preserve field names, delimiters, structural tokens, and everything else character-for-character.
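As a minimal sketch of this rule (the sample line, its field names, and the helper `replace_value` are hypothetical, not part of any pipeline), the snippet below swaps one sensitive value with plain string substitution instead of parsing and re-serializing the JSON, so spacing and key order survive character-for-character.

```python
import json

def replace_value(line: str, original: str, placeholder: str) -> str:
    """Swap one sensitive value in a raw log line without re-serializing it.

    Hypothetical helper. It assumes the log stores the value with standard
    JSON escaping, so the escaped form can be substituted directly.
    """
    # json.dumps(...)[1:-1] yields the escaped text as it appears between quotes.
    escaped_original = json.dumps(original)[1:-1]
    escaped_placeholder = json.dumps(placeholder)[1:-1]
    return line.replace(escaped_original, escaped_placeholder)

# The irregular spacing after "user": is deliberate; it must survive sanitization.
raw = '{"ts":"2024-05-01T12:00:00Z","user":  "jane.doe@corp.example","action":"LOGIN"}'
sanitized = replace_value(raw, "jane.doe@corp.example", "user@example.com")
print(sanitized)

# Anti-pattern: json.dumps(json.loads(raw)) would normalize the spacing
# and hand the parser a line it never saw in production.
```

Substituting on the raw text is the design choice here: any round trip through a JSON library is free to change whitespace, escaping, or number formatting.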

Workflow

Step 1 — Line-by-line replacement

Read every line and replace all sensitive values inline. Cover at minimum:
  • Authentication artifacts: API keys, bearer tokens, passwords, OAuth tokens, base64-encoded credentials, private keys and certs (PEM blocks, SSH private keys), TLS/SSH fingerprints (JA3/JA4 hashes, SSH host key fingerprints, certificate fingerprints), DHCP fingerprints, partial secrets (token_prefix, password_hash_prefix, hashed_token); partial exposure still identifies the credential
  • Personal identifiers: email addresses (including CC/BCC lists, delegate/owner/creator email variants), usernames, display names, employee IDs, phone numbers, principal names (e.g. user@tenant.onmicrosoft.com), email subjects and body text
  • Organizational identifiers: company names, tenant IDs (including home_tenant_id, resource_tenant_id, aad_tenant_id variants), account IDs, subscription IDs, billing account IDs, org slugs embedded in paths or JSON fields, org unit paths (e.g. orgunit_path, org_unit_path), department names and IDs, cost center IDs, Windows SIDs (Security Identifiers) in pipe names, task names, or registry paths
  • Infrastructure identifiers: internal hostnames, FQDNs, private IP addresses, MAC addresses, internal URLs (staging/prod hostnames, internal tool domains), cloud resource IDs (ARNs, S3 bucket names, GCP project IDs, Azure subscription/resource names), Kubernetes cluster names, node names, pod names, and namespace names, container names and IDs, database hostnames and names (including database.host, database.name, database_principal_name), Windows domain topology fields (domain controller hostnames, NT domain names, domain_controller_object_guid, domain_controller_object_sid)
  • Device and hardware identifiers: serial numbers, hardware UUIDs, machine IDs, device UUIDs, BIOS/firmware version strings that are unique to a specific device
  • File system paths: process command lines, file paths, registry key paths, and log file paths that embed usernames, org names, or internal system structure (e.g. C:\Users\alice\, /home/bob/, HKLM\...\S-1-5-21-...)
  • Connection strings: database URIs, Redis URLs, any connection string that includes credentials or internal hostnames
  • Resource ownership: owner email, creator email, last-modified-by identity, delegate user email, assignee email, impersonator fields — any field that names a specific person as the actor on a resource
  • Tracking identifiers: session IDs, request/correlation IDs, transaction IDs, or any long opaque string tied to a real entity
  • Hash values: replace when they could be derived from sensitive input (password hashes, HMAC secrets) — preserve file hashes (MD5, SHA1, SHA256 of file content) and other content-addressable references (git SHAs, TLS cert hashes used as identifiers)
  • Geographic specifics: precise GPS coordinates, real street addresses — city and country names are generally safe to keep
Apply placeholder conventions and shape rules (see below) as you go.
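A first pass over each line can be partly mechanized. The sketch below is illustrative only: the three regular expressions and the helper `find_candidates` are assumptions covering a small slice of the categories above, and anything they miss still needs a manual read.

```python
import ipaddress
import re

# Illustrative patterns only; a real pass needs far more categories than these.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
IPV4_RE = re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b")
MAC_RE = re.compile(r"\b(?:[0-9A-Fa-f]{2}[:-]){5}[0-9A-Fa-f]{2}\b")

def find_candidates(line: str) -> list[tuple[str, str]]:
    """Return (category, value) pairs that look sensitive in one log line."""
    found = [("email", match) for match in EMAIL_RE.findall(line)]
    for candidate in IPV4_RE.findall(line):
        try:
            address = ipaddress.ip_address(candidate)
        except ValueError:
            continue  # not a valid address (e.g. 999.1.1.1); leave it untouched
        if address.is_private:
            found.append(("private_ip", candidate))
    found += [("mac", match) for match in MAC_RE.findall(line)]
    return found

# Hypothetical key-value log line.
sample = "src=10.1.2.3 dst=8.8.8.8 user=jane.doe@corp.example mac=3c:22:fb:01:02:03 action=ALLOW"
for category, value in find_candidates(sample):
    print(category, value)
```

The scanner only reports candidates; it does not rewrite anything, which keeps the replacement step under the conventions and shape rules described below.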

Step 2 — Verify structure is intact

Confirm after sanitization:
  • Line count is unchanged
  • JSON lines are still valid JSON (for NDJSON files):
    python3 -c "
    import json
    # Report every non-empty line that no longer parses as JSON.
    with open('FILE', encoding='utf-8') as f:
        for i, line in enumerate(f, 1):
            line = line.strip()
            if line:
                try:
                    json.loads(line)
                except json.JSONDecodeError as e:
                    print(f'Line {i}: {e}')
    "
  • Timestamps still match the format the pipeline uses for date parsing
  • Enum / status / action values that pipeline conditions branch on are untouched
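The line-count and structure checks can also be run against the original file. The following is a sketch under stated assumptions: the function `structure_problems` and the sample lines are hypothetical, and it only covers NDJSON input.

```python
import json

_INVALID = object()  # sentinel: distinguishes "unparseable" from a JSON null

def _parse(line: str):
    try:
        return json.loads(line)
    except json.JSONDecodeError:
        return _INVALID

def structure_problems(original_lines, sanitized_lines):
    """List structural differences between an NDJSON log and its sanitized copy."""
    problems = []
    if len(original_lines) != len(sanitized_lines):
        problems.append(f"line count changed: {len(original_lines)} -> {len(sanitized_lines)}")
    for number, (before, after) in enumerate(zip(original_lines, sanitized_lines), 1):
        before_obj, after_obj = _parse(before), _parse(after)
        if before_obj is _INVALID:
            continue  # already malformed in the original; it should stay that way
        if after_obj is _INVALID:
            problems.append(f"line {number}: no longer valid JSON")
        elif type(before_obj) is not type(after_obj):
            problems.append(f"line {number}: JSON type changed")
        elif isinstance(before_obj, dict) and list(before_obj) != list(after_obj):
            problems.append(f"line {number}: keys or key order changed")
    return problems

original = ['{"user":"jane.doe@corp.example","action":"LOGIN"}',
            '{"user":"jane.doe@corp.example","action":"LOGOUT"}']
good = ['{"user":"user@example.com","action":"LOGIN"}',
        '{"user":"user@example.com","action":"LOGOUT"}']
bad = ['{"user":"user@example.com,"action":"LOGIN"}']  # lost a quote and a line

print(structure_problems(original, good))
print(structure_problems(original, bad))
```

Lines that were already invalid in the original are skipped on purpose, since a sanitized file should fail in the same places the original did.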

Placeholder conventions

Use consistent, realistic-looking replacements, not REDACTED strings, which break format-sensitive parsers.

| Type | Replacement |
|---|---|
| Email | user@example.com, admin@example.org |
| IPv4 | RFC 5737 ranges: 198.51.100.10, 203.0.113.20, 192.0.2.30 |
| IPv6 | 2001:db8::10 |
| Hostname / FQDN | host-1.example.local, srv-web-01.example.internal |
| Domain | example.com, example.org, example.net |
| UUID | 89a1d5c1-2b3e-4f67-8a9b-0c1d2e3f4a5b |
| API key / token | sk_test_example_key_1234567890, dGVzdC10b2tlbi0xMjM0NTY3ODk= |
| Username | alice.johnson, bob.smith |
| Display name | Alice Johnson, Bob Smith |
| Org / company name | Example Corp, Acme Inc |
| Account / tenant ID | 000000000000, example-tenant-id |
| Cloud resource ID | arn:aws:iam::000000000000:user/example-user |
| S3 bucket name | example-bucket |
| MAC address | 00-00-5E-00-53-23 (RFC 7042 documentation range) |
| Serial number | SN000000000001 |
| Device / machine ID | use a synthetic UUID or device-id-example-000001 |
| Windows SID | S-1-5-21-000000000-000000000-000000000-1000 |
| File path (Windows) | C:\Users\example-user\AppData\... |
| File path (Unix) | /home/example-user/... or use ~ |
| Kubernetes cluster | example-cluster, example-node-1 |
| Phone number | 734-555-0100 (555 range is reserved for fiction) |
| Database host / name | db-host.example.local, example_database |
| Department / org unit | example-department, /example-org/example-unit |
| Hashed / partial token | replace with full synthetic token of same format |
| DHCP fingerprint | example-dhcp-fingerprint-000001 |
| JA4 fingerprint | replace with same-length hex string |

Consistency rule: map identical original values to identical placeholders throughout the file. If the same IP appears 10 times, it must become the same replacement IP all 10 times — so cross-event correlations remain testable.
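One way to keep the mapping consistent is a small lookup that hands out each placeholder once and remembers it. This is a sketch, not a prescribed implementation: the class `PlaceholderMap`, the sample lines, and the choice to start at host .10 are all assumptions.

```python
import re

class PlaceholderMap:
    """Hand out placeholders so equal originals always get equal replacements."""

    def __init__(self, placeholders):
        self._pool = iter(placeholders)
        self._seen = {}

    def get(self, original: str) -> str:
        if original not in self._seen:
            # next() raises StopIteration if the pool runs dry; size it for the file.
            self._seen[original] = next(self._pool)
        return self._seen[original]

# RFC 5737 documentation range 198.51.100.0/24 (starting at .10 is arbitrary).
ipv4_map = PlaceholderMap(f"198.51.100.{host}" for host in range(10, 255))
IPV4_RE = re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b")

# Hypothetical log lines: the same two addresses appear in both events.
lines = [
    "src=10.20.30.40 dst=10.20.30.41 action=ALLOW",
    "src=10.20.30.41 dst=10.20.30.40 action=DENY",
]
sanitized = [IPV4_RE.sub(lambda m: ipv4_map.get(m.group()), line) for line in lines]
for line in sanitized:
    print(line)
```

Because the map is keyed on the original value, the source of the first event is still recognizable as the destination of the second, which is what keeps cross-event correlation testable.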

Shape rule — replacements must match the original format

Every replacement must have the same shape as the original value. The parser and pipeline conditions depend on value format, not just field presence.
  • Numeric ID → numeric ID: /d/123/edit → /d/456/edit, not /d/example-document-id/edit
  • UUID → UUID: a real UUID must become a synthetic UUID of the same version, not a descriptive string
  • URL → URL: replace only the sensitive segment (hostname, path ID) and preserve the scheme, path structure, and query string shape
    • https://docs.google.com/drawings/d/123/edit → https://docs.google.com/drawings/d/000000000000/edit (replace the ID, not the host; docs.google.com is a public service name, not an org identifier)
    • https://internal.corp.com/api/v1/resource → https://host-redacted.example.local/api/v1/resource (replace the internal hostname, keep the path)
  • String ID → same-length or same-format string: opaque alphanumeric IDs should become opaque alphanumeric placeholders of similar length, not descriptive names
  • Hostname in a URL vs. standalone hostname: only replace hostnames that identify real internal infrastructure. Public well-known hostnames (docs.google.com, api.github.com, s3.amazonaws.com) identify a service, not an organization, and do not need to be replaced
Malformed or garbage values must not be replaced. If a value looks broken, synthetic, or contains no real identifying information (e.g. http://1=Y +z\\, 00/00/0000, N/A, empty strings, placeholder-looking values), leave it exactly as-is. Replacing a malformed value with a well-formed placeholder changes the shape and can alter pipeline behaviour: a grok pattern that fails on the original will now succeed on the sanitized version, masking the real error.
If you are unsure what shape to use, look at neighbouring values of the same field type in the same file and match their format.
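The shape rule can be approximated character by character. The helper below is a hypothetical sketch: `same_shape` keeps every separator, maps digits to digits and letters to letters of the same case, and for UUIDs stays within hex digits while preserving the version and variant nibbles. A cache keeps repeated values consistent.

```python
import re
import secrets
import string
import uuid

UUID_RE = re.compile(
    r"[0-9a-fA-F]{8}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{12}"
)
_cache: dict[str, str] = {}  # identical originals must map to identical placeholders

def same_shape(original: str) -> str:
    """Return a random value with the same length and character classes."""
    if original in _cache:
        return _cache[original]
    is_uuid = UUID_RE.fullmatch(original) is not None
    out = []
    for index, ch in enumerate(original):
        if is_uuid and index in (14, 19):
            out.append(ch)  # keep the UUID version and variant nibbles
        elif ch in string.digits:
            out.append(secrets.choice(string.digits))
        elif ch in string.ascii_lowercase:
            out.append(secrets.choice("abcdef" if is_uuid else string.ascii_lowercase))
        elif ch in string.ascii_uppercase:
            out.append(secrets.choice("ABCDEF" if is_uuid else string.ascii_uppercase))
        else:
            out.append(ch)  # separators and punctuation stay exactly where they were
    _cache[original] = "".join(out)
    return _cache[original]

real = "89a1d5c1-2b3e-4f67-8a9b-0c1d2e3f4a5b"  # stand-in for a real UUID
synthetic = same_shape(real)
print(synthetic, uuid.UUID(synthetic).version)
print(same_shape("123"), same_shape("aB3-x_9"))
```

Random characters are used rather than a hash of the original so that the placeholder carries no information derived from the real value. Malformed or placeholder-looking inputs should be filtered out before this helper is called, since they must be left untouched.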

What to preserve

Do not replace:
  • Protocol names, action verbs, event types, severity levels (ALLOW, DENY, INFO, ERROR)
  • HTTP status codes, port numbers, numeric metric values
  • Field names and keys
  • Timestamps (format and timezone must stay intact)
  • Structural tokens (brackets, braces, pipes, commas, tabs)
  • Public well-known service hostnames in URLs (docs.google.com, api.github.com, etc.): replace the path ID if it is sensitive, not the host
  • File hashes (MD5, SHA1, SHA256 of file content): these are content-addressable and safe; do not replace them
  • User agent strings (Mozilla/5.0 ...): these reveal browser/OS type but not identity; safe to keep
  • City and country names in geo enrichment fields — replace only precise coordinates and street addresses
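The public-hostname exception can be expressed as an allowlist check. This is a sketch under assumptions: the set `PUBLIC_HOSTS` and the helper `sanitize_url_host` are hypothetical, and the allowlist would need to be extended for the services that actually appear in a given log.

```python
import re
from urllib.parse import urlsplit

# Hypothetical allowlist of public service hostnames that may stay as-is.
PUBLIC_HOSTS = {"docs.google.com", "api.github.com", "s3.amazonaws.com"}
HOSTNAME_RE = re.compile(r"[A-Za-z0-9-]+(?:\.[A-Za-z0-9-]+)+")

def sanitize_url_host(url: str, placeholder_host: str = "host-redacted.example.local") -> str:
    """Replace the host of a URL unless it is public or the URL is malformed."""
    try:
        host = urlsplit(url).hostname  # lowercased by urlsplit
    except ValueError:
        return url  # unparseable: leave malformed values exactly as-is
    if not host or not HOSTNAME_RE.fullmatch(host) or host in PUBLIC_HOSTS:
        return url
    # Substitute on the raw text so scheme, path, and query keep their exact characters.
    return re.sub(re.escape(host), lambda _m: placeholder_host, url,
                  count=1, flags=re.IGNORECASE)

print(sanitize_url_host("https://docs.google.com/drawings/d/123/edit"))
print(sanitize_url_host("https://internal.corp.com/api/v1/resource?x=1"))
print(sanitize_url_host(r"http://1=Y +z\\"))
```

One limitation of this sketch: a URL whose host is an IP literal would also match the hostname pattern and receive a hostname placeholder, so such hosts should be routed to an IP placeholder instead to respect the shape rule.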