fix-sentry-issues

# Fix Sentry Issues

Systematically discover, triage, investigate, and fix production issues using Sentry MCP. One PR per issue, root-cause analysis required.

## Critical Rule: Truth-Seek, Don't Suppress

NEVER treat log level changes as fixes. Changing `logger.error` to `logger.warn` or `logger.info` silences Sentry but doesn't fix the user's experience. For every failing code path, ask "Why does this fail?" — not "How do I make Sentry quiet?"

## Anti-patterns to avoid

These are specific failure modes from real experience. Do NOT do these:

1. Batch-classifying issues as "expected" without investigating each one. Reading an error message and seeing a fallback path does NOT mean you understand the failure. You must trace the full input path to understand what's being sent and why it fails.
2. Treating "has a fallback" as "not a problem." A fallback means the user gets degraded results. Ask: why does the primary path fail? Can we prevent the failure upstream? Is the input wrong? Is the timeout too tight? Is there a missing filter?
3. Combining multiple issues into one "noise reduction" PR. Each issue has its own root cause. Investigate and fix them individually. The only exception is issues that share an identical root cause discovered through investigation.
4. Throwing away error details. Never change `catch (error) { logger.error(..., error) }` to `catch { logger.info(...) }`. The structured error data (status codes, messages, stack traces) is exactly what you need to understand the failure.
5. Deciding the fix during triage. The triage table should classify issues as "Investigate" or "Ignore" — never pre-decide that the fix is a log level change. You don't know the fix until you've completed investigation.
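Anti-pattern 4 can be made concrete. Below is a minimal sketch of keeping structured error data at the catch site, assuming a TypeScript codebase; `extractErrorDetails` is a hypothetical helper, not a name from the real code:

```typescript
type ErrorDetails = { message: string; status?: number; stack?: string };

// Pull structured fields out of an unknown caught value so the log
// (and therefore Sentry) keeps status codes, messages, and stack traces.
function extractErrorDetails(error: unknown): ErrorDetails {
  if (error instanceof Error) {
    // Surface an HTTP status if the error object happens to carry one.
    const status = (error as Error & { status?: number }).status;
    return { message: error.message, status, stack: error.stack };
  }
  return { message: String(error) };
}

// Good: catch (error) { logger.error("scrape failed", extractErrorDetails(error)) }
// Bad:  catch { logger.info("scrape failed") }  // details gone forever
```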

## When a log level change IS valid

A downgrade to `logger.info` is valid ONLY for genuinely expected operational states — NOT for failures with fallbacks. Examples:

- Valid: User's Notion database doesn't have an optional "Author" column → property skipped. This is user configuration, not a failure.
- Valid: Supabase returns 404 for a link the user deleted. The resource genuinely doesn't exist.
- Invalid: Firecrawl scrape fails 300 times/day → downgrade to info. WHY is it failing? Are we sending URLs it can't handle? Are we hitting rate limits?
- Invalid: Summary generation times out → downgrade to info. WHY is the API slow? Is the content too large? Is there a network issue?

## Phase 1: Discover

Use Sentry MCP to find the org, project, and all unresolved issues. Use `ToolSearch` first to load the Sentry MCP tools.

```
mcp__sentry__find_organizations()
mcp__sentry__find_projects(organizationSlug, regionUrl)
mcp__sentry__search_issues(
  organizationSlug, projectSlugOrId, regionUrl,
  naturalLanguageQuery: "all unresolved issues sorted by events",
  limit: 25
)
```

Build a triage table. The Action column should be Investigate or Ignore — never a pre-decided fix:

```markdown
| ID | Title | Events | Action | Reason |
|----|-------|--------|--------|--------|
| PROJ-A | Error in save | 14 | Investigate | User-facing save failure |
| PROJ-B | GM_register... | 3 | Ignore | Greasemonkey extension |
```

## Phase 2: Triage

Classify every issue before writing any code. Only two categories at this stage:

### Investigate (our code, worth understanding)

- Multiple events establishing a pattern
- User sees degraded experience (error status, missing data, broken UI)
- High-volume warnings that might indicate an upstream problem
- Recurring on every run/sync (stale references, cron-triggered)

### Ignore (third-party noise)

- Browser extension code (`GM_registerMenuCommand`, `CONFIG`, `currentInset`, MetaMask JSON-RPC)
- Stale module imports after deploy (`ChunkLoadError` — self-resolving)
- Single-event transients with no reproduction path
- Issues already fixed by a recent commit

Apply triage decisions:

```
mcp__sentry__update_issue(issueId, organizationSlug, regionUrl, status: "ignored")  // noise
mcp__sentry__update_issue(issueId, organizationSlug, regionUrl, status: "resolved") // already fixed
```

## Phase 3: Investigate (one issue at a time)

For each "Investigate" issue, work through these steps in order. Do NOT skip steps or batch multiple issues together.

### 3a. Pull event-level data

Issue summaries hide the details you need. Always pull actual events AND the full issue details:

```
mcp__sentry__get_issue_details(issueId, organizationSlug, regionUrl)
mcp__sentry__search_issue_events(
  issueId, organizationSlug, regionUrl,
  naturalLanguageQuery: "all events with extra data",
  limit: 15
)
```

Extract from the events: actual URLs, request parameters, stack traces, timestamps, user context, extra data fields (status codes, content lengths, etc.). These are the real inputs that triggered the failure.

### 3b. Cross-reference with Axiom logs

Axiom events include `traceId` fields that correlate with Sentry errors. Use the Axiom CLI to pull surrounding logs for richer context:

```bash
# Get the traceId from the Sentry event's trace context,
# then query Axiom for all events with that traceId
axiom query "['shiori-events'] | where traceId == '<traceId>'" -f json

# Or search by userId around the error timestamp for broader context
axiom query "['shiori-events'] | where userId == '<userId>' | where _time > datetime('2025-01-01T00:00:00Z') and _time < datetime('2025-01-01T01:00:00Z')" -f json
```

Axiom logs include fields like `authMethod`, `client_version`, `event` type, and request metadata that Sentry often lacks. This helps you understand what the user was doing before and after the error.

### 3c. Read the failing code path

Follow the stack trace. Read every file in the chain. Understand what the code does before proposing changes. Use subagents for parallel file exploration if the stack is deep.

### 3d. Trace the input path upstream

This is the step most often skipped, and the most important:

- **What data reaches the failing function?** Trace backwards from the error to the original input. What URL/payload/parameters were passed?
- **Should this input have reached this code path at all?** Is there a missing filter, validation, or early return upstream?
- **What does the input look like?** For URL-based failures: is it a binary file? A redirect? A localhost URL? Something the API can't handle?
- **Is the failure in our code or an external service?** If external: can we prevent sending bad inputs? Can we add better pre-filtering?
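An upstream pre-filter for URL-based failures might look like the sketch below. The extension list, hostname checks, and function name are illustrative assumptions, not part of the original codebase:

```typescript
// Hypothetical guard: reject inputs the scraper is known to choke on
// before they ever reach the external API.
const BINARY_EXTENSIONS = [".png", ".jpg", ".jpeg", ".gif", ".pdf", ".zip"];

function isScrapableUrl(raw: string): boolean {
  let url: URL;
  try {
    url = new URL(raw);
  } catch {
    return false; // malformed input should never reach the scraper
  }
  if (url.protocol !== "http:" && url.protocol !== "https:") return false;
  if (url.hostname === "localhost" || url.hostname === "127.0.0.1") return false;
  const path = url.pathname.toLowerCase();
  return !BINARY_EXTENSIONS.some((ext) => path.endsWith(ext));
}
```

A filter like this turns "the API fails on 300 URLs a day" into "those URLs never get sent", which is a root-cause fix rather than log suppression.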

### 3e. Reproduce and verify

Use the actual failing inputs from Sentry events:

- Call the function with the exact data that failed
- `fetch()` the actual URLs that timed out — are they reachable?
- Add temporary `console.log` statements to verify your understanding of the code flow
- Check if the failure is in our code or an external service
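The `fetch()` bullet can be sketched as a small repro helper, assuming a runtime with global `fetch` and `AbortSignal.timeout` (Node 18+ or Bun); the timeout value is an arbitrary placeholder:

```typescript
// Probe a failing URL taken from a Sentry event: is it reachable, and how slow?
async function probeUrl(url: string, timeoutMs = 10_000) {
  const started = Date.now();
  try {
    const res = await fetch(url, { signal: AbortSignal.timeout(timeoutMs) });
    return { ok: res.ok, status: res.status, ms: Date.now() - started };
  } catch (error) {
    // Timeouts and unreachable hosts land here
    return { ok: false as const, error: String(error), ms: Date.now() - started };
  }
}
```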

### 3f. Identify root cause

Ask these questions in order:

1. Why does this specific input fail? (e.g., "Firecrawl can't scrape a .png URL")
2. Why does this input reach this code path? (e.g., "No extension check before calling Firecrawl")
3. What's the right fix? (e.g., "Filter binary URLs before calling Firecrawl" — not "suppress the log")
4. Should we also improve observability? (e.g., "Add status code to the log so we can see the failure distribution")

Common root causes:

| Pattern | Root Cause | Real Fix |
|---------|------------|----------|
| External API fails on certain URLs | Wrong inputs being sent (binary files, bad formats) | Filter/validate inputs before sending |
| External API timeout | Timeout too tight, or input too large, or missing retry | Investigate what's slow, adjust timeout or input size |
| DB rejects "invalid json" | Unsanitized input (null bytes, control chars) | Sanitize before insert |
| Processing stuck in "error" | Timeout budget doesn't account for full pipeline | Adjust timeouts, save partial results on timeout |
| Same error on every cron run | Stale reference to deleted external resource | Detect staleness, auto-clean |
| Error logged but details not useful | Error object not included, or status code missing | Improve the log to include actionable details |
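As one illustration of the "DB rejects invalid json" row, here is a sanitize-before-insert sketch. The exact character set to strip depends on your database (Postgres jsonb rejects `\u0000`, for example), so treat the regex as an assumption to verify:

```typescript
// Strip null bytes and non-printable control characters that commonly break
// JSON columns, while keeping tab, newline, and carriage return.
function sanitizeForDb(input: string): string {
  return input.replace(/[\u0000-\u0008\u000B\u000C\u000E-\u001F]/g, "");
}
```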

### 3g. Know your log levels

Log levels control what reaches Sentry:

| Level | Sends to Sentry? | Use for |
|-------|------------------|---------|
| `logger.error` | Yes (error) | Unexpected bugs, states that should never occur |
| `logger.warn` | Yes (warning) | Handled failures worth monitoring — keep until you understand the pattern |
| `logger.info` | No | Genuinely expected operational states (not "failures with fallbacks") |
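The table implies a simple routing rule that a logger wrapper might encode. This sketch is illustrative only; the real logger's transport and names will differ:

```typescript
type Level = "error" | "warn" | "info";

// Mirror of the table above: error and warn reach Sentry, info stays local.
const SENT_TO_SENTRY: Record<Level, boolean> = {
  error: true,
  warn: true,
  info: false,
};

function shouldForwardToSentry(level: Level): boolean {
  return SENT_TO_SENTRY[level];
}
```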

## Phase 4: Fix

### 4a. Branch from main

```bash
git checkout main && git pull
git checkout -b fix/<descriptive-name>
```

One branch per issue. Keep fixes focused.

### 4b. Write tests first

Tests must use data derived from actual Sentry events, not hypothetical inputs. The test should fail before the fix and pass after.

### 4c. Implement the fix

Fix the root cause, not the symptom.

Self-check before committing: If the fix is primarily a log level change, STOP. Ask yourself:

- Did I investigate why this fails, or did I just see a fallback and suppress?
- Can I prevent the failure upstream instead of silencing it?
- Am I throwing away error details that would help debug future occurrences?
- Would a staff engineer look at this PR and say "but why does it fail in the first place?"

### 4d. Verify

- Run tests (e.g., `bun run test`)
- Run lint
- Confirm the fix handles the actual failing inputs from Sentry events
- Remove any temporary `console.log` statements

### 4e. Create PR

```bash
git push -u origin fix/<descriptive-name>
gh pr create --title "<short title>" --body "$(cat <<'EOF'
## Summary

- Root cause: [What was actually wrong — the upstream reason, not just "it throws an error"]
- Fix: [What changed and why this prevents the failure, not just silences it]

## Test plan

- Tests written using data from Sentry events
- All tests pass
- Lint passes
EOF
)"
```

### 4f. Resolve in Sentry

After PR is merged:

```bash
git checkout main && git pull
```

```
mcp__sentry__update_issue(issueId, organizationSlug, regionUrl, status: "resolved")
```

## Phase 5: Repeat

Work through issues by priority (most events first). After each PR:

1. Return to main, pull latest
2. Pick next issue from the triage table
3. Start Phase 3 again — full investigation for each issue

## Checklist Per Issue

- [ ] Pulled event-level data (not just issue summary)
- [ ] Cross-referenced with Axiom logs using traceId for surrounding context
- [ ] Read the failing code path end-to-end
- [ ] Traced the input path upstream — understood what data triggers the failure
- [ ] Identified root cause (not just "it has a fallback")
- [ ] Fix prevents the failure, not just suppresses the log
- [ ] Tests use real-world data from Sentry events
- [ ] Tests pass, lint passes
- [ ] No error details thrown away (catch variables, status codes, etc.)
- [ ] PR created with upstream root cause explanation
- [ ] Sentry issue resolved after merge