# PR Miner Skill
## Operator Context
This skill operates as an operator for deterministic GitHub data extraction, configuring Claude's behavior for mining PR review comments. It implements the Pipeline architectural pattern — authenticate, mine, validate — with strict separation between extraction (this skill) and analysis (Code Archaeologist agent).
## Hardcoded Behaviors (Always Apply)
- CLAUDE.md Compliance: Read and follow repository CLAUDE.md before execution
- Over-Engineering Prevention: Extract only requested data. No analysis, interpretation, or pattern detection — that is the Code Archaeologist agent's job
- Deterministic Extraction Only: Output raw JSON. Never analyze patterns, generate rules, or interpret comments
- Authenticate First: Verify GitHub token before starting any mining operation
- Rate Limit Respect: Honor GitHub API rate limits with exponential backoff
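The exponential-backoff behavior can be sketched as follows. This is illustrative only: the `RateLimited` exception and the injectable `sleep` hook are assumptions of this sketch, not part of miner.py's actual interface.

```python
import time
from typing import Callable

class RateLimited(Exception):
    """Hypothetical exception raised by a request wrapper on HTTP 403/429."""

def with_backoff(call: Callable[[], dict], max_retries: int = 5,
                 sleep: Callable[[float], None] = time.sleep) -> dict:
    """Retry `call`, sleeping 1s, 2s, 4s, ... between rate-limited attempts."""
    for attempt in range(max_retries):
        try:
            return call()
        except RateLimited:
            sleep(2 ** attempt)  # exponential backoff
    raise RuntimeError(f"still rate-limited after {max_retries} retries")
```

Injecting `sleep` keeps the retry logic testable without real waits.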
## Default Behaviors (ON unless disabled)
- Merged PRs Only: Focus on merged PRs representing accepted standards
- Imperative Filtering: Filter for imperative language keywords (should, must, avoid, prefer)
- Progress Reporting: Display mining progress during long operations
- Temporary File Cleanup: Remove partial JSON and temp files at completion; keep only final output
- Summary Output: Report interaction count and reviewer distribution after mining
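A minimal sketch of the imperative-filtering default. The keyword set below is an assumed subset of `references/imperative-keywords.txt`, not the full list.

```python
import re

# Assumed subset of the keywords in references/imperative-keywords.txt.
IMPERATIVE_KEYWORDS = {"should", "must", "avoid", "prefer", "never", "always"}

def is_imperative(comment: str) -> bool:
    """True if the comment contains at least one imperative-language keyword."""
    words = set(re.findall(r"[a-z]+", comment.lower()))
    return bool(words & IMPERATIVE_KEYWORDS)
```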
## Optional Behaviors (OFF unless enabled)
- All Comments Mode: Capture every review comment regardless of language (`--all-comments`)
- Reviewer Filter: Focus on comments from particular reviewers (`--reviewer`)
- Date Range: Limit mining to specific time periods (`--since`/`--until`)
- Multi-Repo: Mine across multiple repositories in single operation
## What This Skill CAN Do
- Extract review comments with code context (before/after) from merged PRs
- Track resolution status per comment (changed, resolved, dismissed, unresolved)
- Filter by imperative language keywords or capture all comments
- Mine multiple repositories in a single operation
- Respect GitHub API rate limits with retry logic
- Generate structured JSON output for downstream analysis
- Validate GitHub authentication before mining
## What This Skill CANNOT Do
- Analyze patterns or generate rules (Code Archaeologist agent's job)
- Interpret comment meaning or intent (pure extraction only)
- Create enforcement rules (no Semgrep/golangci-lint generation)
- Mine private repos without proper token permissions (requires `repo` scope)
- Process non-GitHub platforms (GitHub-specific implementation)
- Monitor PRs in real-time (snapshot-based mining only)
## Instructions
### Phase 1: AUTHENTICATE
Goal: Verify GitHub access and validate target repositories before mining.
Step 1: Verify token
```bash
python3 ~/.claude/scripts/miner.py --check-auth
```
Confirm output shows valid authentication with `repo` scope.

Step 2: Validate target repositories
Confirm each target repository:
- Exists and is accessible with current token
- Has merged PRs with review comments
- Uses a code review workflow (not just self-merges)
Step 3: Check rate limits
```bash
gh api rate_limit --jq '.resources.core | "Remaining: \(.remaining)/\(.limit), Resets: \(.reset)"'
```
Ensure sufficient API calls remain for the planned mining scope (estimate 3-5 calls per PR).
Gate: Token is valid, repositories are accessible, rate limits are sufficient. Proceed only when gate passes.
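The 3-5 calls-per-PR estimate can be turned into a quick budget check before mining. This is a sketch; the helper names are ours, not miner.py's, and it uses the worst-case 5 calls per PR.

```python
def estimate_call_budget(pr_count: int, calls_per_pr: int = 5) -> int:
    """Worst-case GitHub API calls for a planned mining run."""
    return pr_count * calls_per_pr

def can_mine(remaining: int, pr_count: int) -> bool:
    """True if the remaining core-API quota covers the planned scope."""
    return remaining >= estimate_call_budget(pr_count)
```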
### Phase 2: MINE
Goal: Extract raw review comment data with code context.
Step 1: Determine scope
Choose mining parameters based on the task:
- Single repo: `python3 ~/.claude/scripts/miner.py org/repo output.json --limit 50`
- Multi-repo: `python3 ~/.claude/scripts/miner.py org/repo-a,org/repo-b output.json --limit 50`
- Filtered: Add `--reviewer name`, `--since date`, or `--all-comments`

Start with 50 PRs. Expand only after validating output quality.
Step 2: Execute mining
```bash
python3 ~/.claude/scripts/miner.py <repos> mined_data/<output>.json --limit <N>
```
Monitor progress output. Watch for:
- Rate limit warnings
- Authentication errors
- Empty PR responses (may indicate bot-only reviews)
Step 3: Verify extraction
Check the output file exists and contains data:
```bash
python3 -c "import json; d=json.load(open('mined_data/<output>.json')); print(f'PRs: {d[\"metadata\"][\"pr_count\"]}, Interactions: {d[\"metadata\"][\"interaction_count\"]}')"
```
If interaction count is below 20, consider expanding scope (higher `--limit`, `--all-comments`, or a broader date range).

Gate: Mining completed without errors. Output JSON contains meaningful interaction data. Proceed only when gate passes.
### Phase 3: VALIDATE
Goal: Confirm output quality and completeness.
Step 1: Run validation script
```bash
python3 ~/.claude/scripts/validate.py
```
Step 2: Spot-check data quality
Review 3-5 interactions manually:
- `comment_text` contains actionable review feedback (not just "LGTM")
- `code_before` and `code_after` fields are populated where resolution is "changed"
- `reviewer` and `author` fields are not empty
- URLs resolve to valid GitHub locations
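The checklist above can be mechanized. Field names follow the Output Format section below, but the function itself is an illustrative sketch, not part of validate.py.

```python
def spot_check(interaction: dict) -> list[str]:
    """Return quality problems found in one mined interaction (empty list = OK)."""
    problems = []
    if interaction.get("comment_text", "").strip().lower() in {"", "lgtm"}:
        problems.append("comment_text is empty or non-actionable")
    if interaction.get("resolution") == "changed" and not (
            interaction.get("code_before") and interaction.get("code_after")):
        problems.append("resolution is 'changed' but code_before/code_after missing")
    for field in ("reviewer", "author"):
        if not interaction.get(field):
            problems.append(f"{field} is empty")
    if not interaction.get("url", "").startswith("https://github.com/"):
        problems.append("url is not a GitHub location")
    return problems
```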
Step 3: Generate summary statistics
```bash
python3 ~/.claude/scripts/miner.py <repos> <output>.json --summary
```
Verify:
- Reviewer distribution is not dominated by a single person (unless `--reviewer` was used)
- Interaction count is proportional to PR count (expect 2-5 interactions per PR)
- Resolution types include a mix (not all "unknown")
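The distribution check can also be run directly on the output JSON. This is a sketch; the 80% dominance threshold is an arbitrary assumption, not a documented rule.

```python
from collections import Counter

def reviewer_distribution(interactions: list[dict]) -> Counter:
    """Count mined interactions per reviewer."""
    return Counter(i["reviewer"] for i in interactions)

def dominated(dist: Counter, threshold: float = 0.8) -> bool:
    """True if one reviewer accounts for more than `threshold` of all comments."""
    total = sum(dist.values())
    return total > 0 and dist.most_common(1)[0][1] / total > threshold
```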
Step 4: Clean up temporary files
Remove any partial JSON, debug logs, or temp files created during mining. Keep only the final output in `mined_data/`.

Gate: Validation passes. Data quality is sufficient for downstream analysis. Mining is complete.
## Output Format
```json
{
  "metadata": {
    "repo": "org/repo",
    "mined_at": "2025-11-20T14:30:00Z",
    "pr_count": 50,
    "interaction_count": 127
  },
  "interactions": [
    {
      "source": "pr_review",
      "pr_number": 234,
      "pr_title": "Add error wrapping",
      "author": "developer",
      "reviewer": "senior-developer",
      "file": "service/user.go",
      "line": 45,
      "comment_text": "Please use errors.Is() instead of == for error comparison",
      "diff_hunk": "@@ -42,7 +42,7 @@...",
      "code_before": "if err == ErrNotFound {",
      "code_after": "if errors.Is(err, ErrNotFound) {",
      "resolution": "changed",
      "url": "https://github.com/org/repo/pull/234#discussion_r123456",
      "created_at": "2025-10-15T10:23:45Z"
    }
  ]
}
```

## Resolution Types
- `changed`: Comment led to code modification
- `resolved`: Marked resolved without code change
- `dismissed`: Dismissed by author
- `unresolved`: Still open when PR merged
- `unknown`: Cannot determine resolution
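One way the five types could be derived from a review-thread snapshot. The thread field names here are invented for illustration and do not match the real GitHub API or miner.py's internals.

```python
def classify_resolution(thread: dict) -> str:
    """Map a review-thread snapshot to a documented resolution type (sketch)."""
    if thread.get("code_changed"):
        return "changed"
    if thread.get("dismissed_by_author"):
        return "dismissed"
    if thread.get("marked_resolved"):
        return "resolved"
    if thread.get("still_open"):
        return "unresolved"
    return "unknown"
```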
## File Naming Conventions

Raw data (`mined_data/`): `{reviewer}_{repos}_{date}.json` or `{repos}_all_{date}.json`
Distilled rules (`rules/`): `{repos}_coding_rules.md` or `{reviewer}_{repos}_patterns.md`
## Examples
### Example 1: Mine Single Repository
User says: "Extract review patterns from our Go service"
Actions:
- Verify token and repo access (AUTHENTICATE)
- Mine last 50 merged PRs with imperative filtering (MINE)
- Validate output has sufficient interactions (VALIDATE)
Result: `mined_data/go-service_all_2026-02-13.json` with 100+ interactions
### Example 2: Mine Specific Reviewer Across Repos
User says: "Get all of Alice's review comments across our backend repos"
Actions:
- Verify token and all repo access (AUTHENTICATE)
- Mine with `--reviewer alice` across 3 repos (MINE)
- Validate reviewer field shows only Alice (VALIDATE)

Result: `mined_data/alice_backend_2026-02-13.json` for Code Archaeologist analysis
### Example 3: Team Standards Extraction
User says: "Build a dataset of our team's coding standards from PRs"
Actions:
- Verify token for all team repos (AUTHENTICATE)
- Mine 50 PRs from each of 4 team repos (MINE)
- Validate cross-repo coverage and interaction quality (VALIDATE)

Result: Multi-repo dataset ready for pattern analysis
## Error Handling
### Error: "Bad credentials" or Authentication Failure
Cause: Token expired, revoked, or missing `repo` scope

Solution:
- Verify `GITHUB_TOKEN` is set: `echo $GITHUB_TOKEN | head -c 10`
- Check token permissions: `gh auth status`
- Regenerate token with `repo` scope if needed
- Retry authentication check before proceeding
### Error: "Rate limit exceeded"
Cause: Too many API calls in the current hour
Solution:
- Check reset time: `gh api rate_limit --jq '.resources.core.reset'`
- Wait for reset or reduce mining scope (lower `--limit`)
- For large mining operations, split across multiple sessions
- Consider using a dedicated token with higher rate limits
### Error: "No interactions found"
Cause: Repo has few review comments, or filters too restrictive
Solution:
- Try `--all-comments` to disable imperative filtering
- Increase `--limit` to mine more PRs
- Broaden date range with `--since`
- Verify the repo uses code review (check for review activity manually)
## Anti-Patterns
### Anti-Pattern 1: Analyzing During Mining
What it looks like: "I mined the data and found 5 key patterns: always use errors.Is()..."
Why wrong: This skill extracts raw data. Pattern analysis is the Code Archaeologist agent's job. Mixing extraction with interpretation creates unreliable, non-deterministic output.
Do instead: Mine data, validate output, hand off JSON to Code Archaeologist.
### Anti-Pattern 2: Mining Without Authentication Check
What it looks like: Running `miner.py` immediately, failing 10 minutes later on "Bad credentials"
Why wrong: Wastes time and API rate limits. No early validation of token permissions.
Do instead: Complete Phase 1 (AUTHENTICATE) before any mining.
### Anti-Pattern 3: Mining Entire Repository History
What it looks like: `--limit 10000` to get "everything"
Why wrong: Extremely slow, burns rate limits, old PRs reflect outdated standards, massive output files are hard to process.
Do instead: Start with `--limit 50 --since <6-months-ago>`. Expand only after validating output quality.

### Anti-Pattern 4: Skipping Output Validation
What it looks like: Mining completes, immediately passing output to Code Archaeologist without checking
Why wrong: May contain zero useful interactions, incomplete data from API errors, or bot-generated noise. Garbage in, garbage out.
Do instead: Complete Phase 3 (VALIDATE). Spot-check interactions, verify counts, review distribution.
## References
This skill uses these shared patterns:
- Anti-Rationalization - Prevents shortcut rationalizations
- Verification Checklist - Pre-completion checks
## Domain-Specific Anti-Rationalization

| Rationalization | Why It's Wrong | Required Action |
|---|---|---|
| "Token worked last time" | Tokens expire, permissions change | Run `--check-auth` every session |
| "50 PRs is enough" | Depends on review density | Validate interaction count before proceeding |
| "I can summarize the patterns" | Extraction skill, not analysis skill | Output raw JSON only |
| "All comments mode wastes time" | Imperative filter may miss valuable feedback | Consider `--all-comments` for the first run |
## Reference Files

- `${CLAUDE_SKILL_DIR}/references/imperative-keywords.txt`: Full list of detected imperative keywords
- `${CLAUDE_SKILL_DIR}/references/examples.md`: Real-world mining examples and expected output
- `${CLAUDE_SKILL_DIR}/scripts/miner.py`: Main mining script (GitHub API extraction)
- `${CLAUDE_SKILL_DIR}/scripts/validate.py`: Output validation script