# PR Miner Skill
## Operator Context
This skill operates as an operator for deterministic GitHub data extraction, configuring Claude's behavior for mining PR review comments. It implements the Pipeline architectural pattern — authenticate, mine, validate — with strict separation between extraction (this skill) and analysis (Code Archaeologist agent).
## Hardcoded Behaviors (Always Apply)
- CLAUDE.md Compliance: Read and follow repository CLAUDE.md before execution
- Over-Engineering Prevention: Extract only requested data. No analysis, interpretation, or pattern detection — that is the Code Archaeologist agent's job
- Deterministic Extraction Only: Output raw JSON. Never analyze patterns, generate rules, or interpret comments
- Authenticate First: Verify GitHub token before starting any mining operation
- Rate Limit Respect: Honor GitHub API rate limits with exponential backoff
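The exponential-backoff behavior can be sketched as follows. This is illustrative only: the `RateLimited` exception and the injectable `sleep` hook are assumptions of this sketch, not part of miner.py's actual interface.

```python
import time
from typing import Callable

class RateLimited(Exception):
    """Hypothetical exception raised by a request wrapper on HTTP 403/429."""

def with_backoff(call: Callable[[], dict], max_retries: int = 5,
                 sleep: Callable[[float], None] = time.sleep) -> dict:
    """Retry `call`, sleeping 1s, 2s, 4s, ... between rate-limited attempts."""
    for attempt in range(max_retries):
        try:
            return call()
        except RateLimited:
            sleep(2 ** attempt)  # exponential backoff
    raise RuntimeError(f"still rate-limited after {max_retries} retries")
```

Injecting `sleep` keeps the retry logic testable without real waits.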
## Default Behaviors (ON unless disabled)
- Merged PRs Only: Focus on merged PRs representing accepted standards
- Imperative Filtering: Filter for imperative language keywords (should, must, avoid, prefer)
- Progress Reporting: Display mining progress during long operations
- Temporary File Cleanup: Remove partial JSON and temp files at completion; keep only final output
- Summary Output: Report interaction count and reviewer distribution after mining
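A minimal sketch of the imperative-filtering default. The keyword set below is an assumed subset of `references/imperative-keywords.txt`, not the full list.

```python
import re

# Assumed subset of the keywords in references/imperative-keywords.txt.
IMPERATIVE_KEYWORDS = {"should", "must", "avoid", "prefer", "never", "always"}

def is_imperative(comment: str) -> bool:
    """True if the comment contains at least one imperative-language keyword."""
    words = set(re.findall(r"[a-z]+", comment.lower()))
    return bool(words & IMPERATIVE_KEYWORDS)
```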
## Optional Behaviors (OFF unless enabled)
- All Comments Mode: Capture every review comment regardless of language (`--all-comments`)
- Reviewer Filter: Focus on comments from particular reviewers (`--reviewer`)
- Date Range: Limit mining to specific time periods (`--since`/`--until`)
- Multi-Repo: Mine across multiple repositories in single operation
## What This Skill CAN Do
- Extract review comments with code context (before/after) from merged PRs
- Track resolution status per comment (changed, resolved, dismissed, unresolved)
- Filter by imperative language keywords or capture all comments
- Mine multiple repositories in a single operation
- Respect GitHub API rate limits with retry logic
- Generate structured JSON output for downstream analysis
- Validate GitHub authentication before mining
## What This Skill CANNOT Do
- Analyze patterns or generate rules (Code Archaeologist agent's job)
- Interpret comment meaning or intent (pure extraction only)
- Create enforcement rules (no Semgrep/golangci-lint generation)
- Mine private repos without proper token permissions (requires `repo` scope)
- Process non-GitHub platforms (GitHub-specific implementation)
- Monitor PRs in real-time (snapshot-based mining only)
## Instructions
### Phase 1: AUTHENTICATE
Goal: Verify GitHub access and validate target repositories before mining.
Step 1: Verify token
```bash
python3 ~/.claude/scripts/miner.py --check-auth
```
Confirm output shows valid authentication with `repo` scope.

Step 2: Validate target repositories
Confirm each target repository:
- Exists and is accessible with current token
- Has merged PRs with review comments
- Uses a code review workflow (not just self-merges)
Step 3: Check rate limits
```bash
gh api rate_limit --jq '.resources.core | "Remaining: \(.remaining)/\(.limit), Resets: \(.reset)"'
```
Ensure sufficient API calls remain for the planned mining scope (estimate 3-5 calls per PR).
Gate: Token is valid, repositories are accessible, rate limits are sufficient. Proceed only when gate passes.
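The 3-5 calls-per-PR estimate can be turned into a quick budget check before mining. This is a sketch; the helper names are ours, not miner.py's, and it uses the worst-case 5 calls per PR.

```python
def estimate_call_budget(pr_count: int, calls_per_pr: int = 5) -> int:
    """Worst-case GitHub API calls for a planned mining run."""
    return pr_count * calls_per_pr

def can_mine(remaining: int, pr_count: int) -> bool:
    """True if the remaining core-API quota covers the planned scope."""
    return remaining >= estimate_call_budget(pr_count)
```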
### Phase 2: MINE
Goal: Extract raw review comment data with code context.
Step 1: Determine scope
Choose mining parameters based on the task:
- Single repo: `python3 ~/.claude/scripts/miner.py org/repo output.json --limit 50`
- Multi-repo: `python3 ~/.claude/scripts/miner.py org/repo-a,org/repo-b output.json --limit 50`
- Filtered: Add `--reviewer name`, `--since date`, or `--all-comments`

Start with 50 PRs. Expand only after validating output quality.
Step 2: Execute mining
```bash
python3 ~/.claude/scripts/miner.py <repos> mined_data/<output>.json --limit <N>
```
Monitor progress output. Watch for:
- Rate limit warnings
- Authentication errors
- Empty PR responses (may indicate bot-only reviews)
Step 3: Verify extraction
Check the output file exists and contains data:
```bash
python3 -c "import json; d=json.load(open('mined_data/<output>.json')); print(f'PRs: {d[\"metadata\"][\"pr_count\"]}, Interactions: {d[\"metadata\"][\"interaction_count\"]}')"
```
If interaction count is below 20, consider expanding scope (higher `--limit`, `--all-comments`, or a broader date range).

Gate: Mining completed without errors. Output JSON contains meaningful interaction data. Proceed only when gate passes.
### Phase 3: VALIDATE
Goal: Confirm output quality and completeness.
Step 1: Run validation script
```bash
python3 ~/.claude/scripts/validate.py
```
Step 2: Spot-check data quality
Review 3-5 interactions manually:
- `comment_text` contains actionable review feedback (not just "LGTM")
- `code_before` and `code_after` fields are populated where resolution is "changed"
- `reviewer` and `author` fields are not empty
- URLs resolve to valid GitHub locations
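The checklist above can be mechanized. Field names follow the Output Format section below, but the function itself is an illustrative sketch, not part of validate.py.

```python
def spot_check(interaction: dict) -> list[str]:
    """Return quality problems found in one mined interaction (empty list = OK)."""
    problems = []
    if interaction.get("comment_text", "").strip().lower() in {"", "lgtm"}:
        problems.append("comment_text is empty or non-actionable")
    if interaction.get("resolution") == "changed" and not (
            interaction.get("code_before") and interaction.get("code_after")):
        problems.append("resolution is 'changed' but code_before/code_after missing")
    for field in ("reviewer", "author"):
        if not interaction.get(field):
            problems.append(f"{field} is empty")
    if not interaction.get("url", "").startswith("https://github.com/"):
        problems.append("url is not a GitHub location")
    return problems
```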
Step 3: Generate summary statistics
```bash
python3 ~/.claude/scripts/miner.py <repos> <output>.json --summary
```
Verify:
- Reviewer distribution is not dominated by a single person (unless `--reviewer` was used)
- Interaction count is proportional to PR count (expect 2-5 interactions per PR)
- Resolution types include a mix (not all "unknown")
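The distribution check can also be run directly on the output JSON. This is a sketch; the 80% dominance threshold is an arbitrary assumption, not a documented rule.

```python
from collections import Counter

def reviewer_distribution(interactions: list[dict]) -> Counter:
    """Count mined interactions per reviewer."""
    return Counter(i["reviewer"] for i in interactions)

def dominated(dist: Counter, threshold: float = 0.8) -> bool:
    """True if one reviewer accounts for more than `threshold` of all comments."""
    total = sum(dist.values())
    return total > 0 and dist.most_common(1)[0][1] / total > threshold
```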
Step 4: Clean up temporary files
Remove any partial JSON, debug logs, or temp files created during mining. Keep only the final output in `mined_data/`.

Gate: Validation passes. Data quality is sufficient for downstream analysis. Mining is complete.
## Output Format
```json
{
  "metadata": {
    "repo": "org/repo",
    "mined_at": "2025-11-20T14:30:00Z",
    "pr_count": 50,
    "interaction_count": 127
  },
  "interactions": [
    {
      "source": "pr_review",
      "pr_number": 234,
      "pr_title": "Add error wrapping",
      "author": "developer",
      "reviewer": "senior-developer",
      "file": "service/user.go",
      "line": 45,
      "comment_text": "Please use errors.Is() instead of == for error comparison",
      "diff_hunk": "@@ -42,7 +42,7 @@...",
      "code_before": "if err == ErrNotFound {",
      "code_after": "if errors.Is(err, ErrNotFound) {",
      "resolution": "changed",
      "url": "https://github.com/org/repo/pull/234#discussion_r123456",
      "created_at": "2025-10-15T10:23:45Z"
    }
  ]
}
```

## Resolution Types
- `changed`: Comment led to code modification
- `resolved`: Marked resolved without code change
- `dismissed`: Dismissed by author
- `unresolved`: Still open when PR merged
- `unknown`: Cannot determine resolution
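One way the five types could be derived from a review-thread snapshot. The thread field names here are invented for illustration and do not match the real GitHub API or miner.py's internals.

```python
def classify_resolution(thread: dict) -> str:
    """Map a review-thread snapshot to a documented resolution type (sketch)."""
    if thread.get("code_changed"):
        return "changed"
    if thread.get("dismissed_by_author"):
        return "dismissed"
    if thread.get("marked_resolved"):
        return "resolved"
    if thread.get("still_open"):
        return "unresolved"
    return "unknown"
```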
## File Naming Conventions

Raw data (`mined_data/`): `{reviewer}_{repos}_{date}.json` or `{repos}_all_{date}.json`
Distilled rules (`rules/`): `{repos}_coding_rules.md` or `{reviewer}_{repos}_patterns.md`
## Examples
### Example 1: Mine Single Repository
User says: "Extract review patterns from our Go service"
Actions:
- Verify token and repo access (AUTHENTICATE)
- Mine last 50 merged PRs with imperative filtering (MINE)
- Validate output has sufficient interactions (VALIDATE)
Result: `mined_data/go-service_all_2026-02-13.json` with 100+ interactions
### Example 2: Mine Specific Reviewer Across Repos
User says: "Get all of Alice's review comments across our backend repos"
Actions:
- Verify token and all repo access (AUTHENTICATE)
- Mine with `--reviewer alice` across 3 repos (MINE)
- Validate reviewer field shows only Alice (VALIDATE)

Result: `mined_data/alice_backend_2026-02-13.json` for Code Archaeologist analysis
### Example 3: Team Standards Extraction
User says: "Build a dataset of our team's coding standards from PRs"
Actions:
- Verify token for all team repos (AUTHENTICATE)
- Mine 50 PRs from each of 4 team repos (MINE)
- Validate cross-repo coverage and interaction quality (VALIDATE)

Result: Multi-repo dataset ready for pattern analysis
## Error Handling
### Error: "Bad credentials" or Authentication Failure
Cause: Token expired, revoked, or missing `repo` scope

Solution:
- Verify `GITHUB_TOKEN` is set: `echo $GITHUB_TOKEN | head -c 10`
- Check token permissions: `gh auth status`
- Regenerate token with `repo` scope if needed
- Retry authentication check before proceeding
### Error: "Rate limit exceeded"
Cause: Too many API calls in the current hour
Solution:
- Check reset time: `gh api rate_limit --jq '.resources.core.reset'`
- Wait for reset or reduce mining scope (lower `--limit`)
- For large mining operations, split across multiple sessions
- Consider using a dedicated token with higher rate limits
### Error: "No interactions found"
Cause: Repo has few review comments, or filters too restrictive
Solution:
- Try `--all-comments` to disable imperative filtering
- Increase `--limit` to mine more PRs
- Broaden date range with `--since`
- Verify the repo uses code review (check for review activity manually)
## Anti-Patterns
### Anti-Pattern 1: Analyzing During Mining
What it looks like: "I mined the data and found 5 key patterns: always use errors.Is()..."
Why wrong: This skill extracts raw data. Pattern analysis is the Code Archaeologist agent's job. Mixing extraction with interpretation creates unreliable, non-deterministic output.
Do instead: Mine data, validate output, hand off JSON to Code Archaeologist.
### Anti-Pattern 2: Mining Without Authentication Check
What it looks like: Running `miner.py` immediately, failing 10 minutes later on "Bad credentials"
Why wrong: Wastes time and API rate limits. No early validation of token permissions.
Do instead: Complete Phase 1 (AUTHENTICATE) before any mining.
### Anti-Pattern 3: Mining Entire Repository History
What it looks like: `--limit 10000` to get "everything"
Why wrong: Extremely slow, burns rate limits, old PRs reflect outdated standards, massive output files are hard to process.
Do instead: Start with `--limit 50 --since <6-months-ago>`. Expand only after validating output quality.

### Anti-Pattern 4: Skipping Output Validation
What it looks like: Mining completes, immediately passing output to Code Archaeologist without checking
Why wrong: May contain zero useful interactions, incomplete data from API errors, or bot-generated noise. Garbage in, garbage out.
Do instead: Complete Phase 3 (VALIDATE). Spot-check interactions, verify counts, review distribution.
## References
This skill uses these shared patterns:
- Anti-Rationalization - Prevents shortcut rationalizations
- Verification Checklist - Pre-completion checks
## Domain-Specific Anti-Rationalization

| Rationalization | Why It's Wrong | Required Action |
|---|---|---|
| "Token worked last time" | Tokens expire, permissions change | Run `--check-auth` every session |
| "50 PRs is enough" | Depends on review density | Validate interaction count before proceeding |
| "I can summarize the patterns" | Extraction skill, not analysis skill | Output raw JSON only |
| "All comments mode wastes time" | Imperative filter may miss valuable feedback | Consider `--all-comments` for the first run |
## Reference Files

- `${CLAUDE_SKILL_DIR}/references/imperative-keywords.txt`: Full list of detected imperative keywords
- `${CLAUDE_SKILL_DIR}/references/examples.md`: Real-world mining examples and expected output
- `${CLAUDE_SKILL_DIR}/scripts/miner.py`: Main mining script (GitHub API extraction)
- `${CLAUDE_SKILL_DIR}/scripts/validate.py`: Output validation script