idea-creator


Research Idea Creator


Generate publishable research ideas for: $ARGUMENTS

Overview


Given a broad research direction from the user, systematically generate, validate, and rank concrete research ideas. This skill composes with /research-lit, /novelty-check, and /research-review to form a complete idea discovery pipeline.

Constants


  • PILOT_MAX_HOURS = 2 — Skip any pilot estimated to take > 2 hours per GPU. Flag as "needs manual pilot".
  • PILOT_TIMEOUT_HOURS = 3 — Hard timeout: kill pilots exceeding 3 hours. Collect partial results if available.
  • MAX_PILOT_IDEAS = 3 — Pilot at most 3 ideas in parallel. Additional ideas are validated on paper only.
  • MAX_TOTAL_GPU_HOURS = 8 — Total GPU budget for all pilots combined.
  • REVIEWER_MODEL = gpt-5.4 — Model used via Codex MCP for brainstorming and review. Must be an OpenAI model (e.g., gpt-5.4, o3, gpt-4o).
💡 Override via argument, e.g., /idea-creator "topic" — pilot budget: 4h per idea, 20h total.
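The budget override can be parsed with a small helper. A minimal sketch, assuming the override always follows the "Nh per idea, Mh total" phrasing shown above (the function name and regexes are illustrative, not part of the skill):

```python
import re

# Defaults mirroring the constants above.
DEFAULTS = {"pilot_max_hours": 2, "max_total_gpu_hours": 8}

def parse_budget_override(arguments: str) -> dict:
    """Extract per-idea and total GPU budgets from the skill arguments,
    falling back to the defaults when no override is present."""
    budget = dict(DEFAULTS)
    per_idea = re.search(r"pilot budget:\s*(\d+)h per idea", arguments)
    total = re.search(r"(\d+)h total", arguments)
    if per_idea:
        budget["pilot_max_hours"] = int(per_idea.group(1))
    if total:
        budget["max_total_gpu_hours"] = int(total.group(1))
    return budget

print(parse_budget_override('"topic" — pilot budget: 4h per idea, 20h total'))
# → {'pilot_max_hours': 4, 'max_total_gpu_hours': 20}
```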

Workflow


Phase 1: Landscape Survey (5-10 min)


Map the research area to understand what exists and where the gaps are.
  1. Scan local paper library first: Check papers/ and literature/ in the project directory for existing PDFs. Read the first 3 pages of relevant papers to build a baseline understanding before searching online. This avoids re-discovering what the user already knows.
  2. Search recent literature using WebSearch:
    • Top venues in the last 2 years (NeurIPS, ICML, ICLR, ACL, EMNLP, etc.)
    • Recent arXiv preprints (last 6 months)
    • Use 5+ different query formulations
    • Read abstracts and introductions of the top 10-15 papers
  3. Build a landscape map:
    • Group papers by sub-direction / approach
    • Identify what has been tried and what hasn't
    • Note recurring limitations mentioned in "Future Work" sections
    • Flag any open problems explicitly stated by multiple papers
  4. Identify structural gaps:
    • Methods that work in domain A but haven't been tried in domain B
    • Contradictory findings between papers (opportunity for resolution)
    • Assumptions that everyone makes but nobody has tested
    • Scaling regimes that haven't been explored
    • Diagnostic questions that nobody has asked
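Step 1's local-library scan amounts to globbing for PDFs. A minimal sketch, assuming the directory names from the skill (the helper name is illustrative):

```python
from pathlib import Path

def find_local_papers(project_root: str = ".") -> list[Path]:
    """Collect existing PDFs from the project's papers/ and literature/
    directories so they can be skimmed before any online search."""
    pdfs: list[Path] = []
    for folder in ("papers", "literature"):
        directory = Path(project_root) / folder
        if directory.is_dir():
            pdfs.extend(sorted(directory.rglob("*.pdf")))
    return pdfs

for pdf in find_local_papers():
    print(pdf)  # read the first 3 pages of each relevant paper
```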

Phase 2: Idea Generation (brainstorm with external LLM)


Use the external LLM via Codex MCP for divergent thinking:
mcp__codex__codex:
  model: REVIEWER_MODEL
  config: {"model_reasoning_effort": "xhigh"}
  prompt: |
    You are a senior ML researcher brainstorming research ideas.

    Research direction: [user's direction]

    Here is the current landscape:
    [paste landscape map from Phase 1]

    Key gaps identified:
    [paste gaps from Phase 1]

    Generate 8-12 concrete research ideas. For each idea:
    1. One-sentence summary
    2. Core hypothesis (what you expect to find and why)
    3. Minimum viable experiment (what's the cheapest way to test this?)
    4. Expected contribution type: empirical finding / new method / theoretical result / diagnostic
    5. Risk level: LOW (likely works) / MEDIUM (50-50) / HIGH (speculative)
    6. Estimated effort: days / weeks / months

    Prioritize ideas that are:
    - Testable with moderate compute (8x RTX 3090 or less)
    - Likely to produce a clear positive OR negative result (both are publishable)
    - Not "apply X to Y" unless the application reveals genuinely surprising insights
    - Differentiated from the 10-15 papers above

    Be creative but grounded. A great idea is one where the answer matters regardless of which way it goes.
Save the threadId for follow-up.

Phase 3: First-Pass Filtering


For each generated idea, quickly evaluate:
  1. Feasibility check: Can we actually run this experiment with available resources?
    • Compute requirements (estimate GPU-hours)
    • Data availability
    • Implementation complexity
    • Skip ideas requiring > 1 week of GPU time or unavailable datasets
  2. Novelty quick-check: For each idea, do 2-3 targeted searches to see if it's already been done. Full /novelty-check comes later for survivors.
  3. Impact estimation: Would a reviewer care about the result?
    • "So what?" test: if the experiment succeeds, does it change how people think?
    • Is the finding actionable or just interesting?
Eliminate ideas that fail any of these. Typically 8-12 ideas reduce to 4-6.
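The feasibility check in step 1 reduces to a simple threshold filter. A sketch under assumptions: the Idea structure, its field names, and the single-GPU interpretation of "1 week of GPU time" are illustrative, not prescribed by the skill:

```python
from dataclasses import dataclass

WEEK_GPU_HOURS = 7 * 24  # "> 1 week of GPU time" cutoff, single-GPU

@dataclass
class Idea:
    title: str
    est_gpu_hours: float
    data_available: bool

def passes_feasibility(idea: Idea) -> bool:
    """Keep only ideas that fit the compute budget and have usable data."""
    return idea.est_gpu_hours <= WEEK_GPU_HOURS and idea.data_available

ideas = [
    Idea("probe attention heads", 12, True),
    Idea("retrain from scratch", 400, True),
]
print([i.title for i in ideas if passes_feasibility(i)])
# → ['probe attention heads']
```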

Phase 4: Deep Validation (for top ideas)


For each surviving idea, run a deeper evaluation:
  1. Novelty check: Use the /novelty-check workflow (multi-source search + GPT-5.4 cross-verification) for each idea.
  2. Critical review: Use GPT-5.4 via mcp__codex__codex-reply (same thread):
    Here are our top ideas after filtering:
    [paste surviving ideas with novelty check results]
    
    For each, play devil's advocate:
    - What's the strongest objection a reviewer would raise?
    - What's the most likely failure mode?
    - How would you rank these for a top venue submission?
    - Which 2-3 would you actually work on?
  3. Combine rankings: Merge your assessment with GPT-5.4's ranking. Select top 2-3 ideas for pilot experiments.
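Step 3's rank merge can be done by summing positions across the two ranked lists. A minimal sketch; the average-position scheme and tie-breaking by title are one possible choice, not prescribed by the skill:

```python
def merge_rankings(ours: list[str], reviewers: list[str]) -> list[str]:
    """Combine two ranked lists of idea titles; lower summed position
    wins, and ideas missing from a list rank last in it."""
    def rank(order: list[str], title: str) -> int:
        return order.index(title) if title in order else len(order)
    return sorted(set(ours) | set(reviewers),
                  key=lambda t: (rank(ours, t) + rank(reviewers, t), t))

print(merge_rankings(["A", "B", "C"], ["B", "A", "C"]))  # → ['A', 'B', 'C']
```

The top 2-3 titles of the merged list are the ones sent to pilot experiments.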

Phase 5: Parallel Pilot Experiments (for top 2-3 ideas)


Before committing to a full research effort, run cheap pilot experiments to get empirical signal. This is the key differentiator from paper-only validation.
  1. Design pilots: For each top idea, define the minimal experiment that would give a positive or negative signal:
    • Single seed, small scale (e.g., small dataset subset, fewer epochs)
    • Target: 30 min - PILOT_MAX_HOURS per pilot on 1 GPU
    • Estimate GPU-hours BEFORE launching. If estimated time > PILOT_MAX_HOURS, reduce scale (fewer epochs, smaller subset) or flag as "needs manual pilot"
    • Clear success metric defined upfront (e.g., "if metric improves by > 1%, signal is positive")
  2. Deploy in parallel: Use /run-experiment to launch pilots on different GPUs simultaneously:
    GPU 0: Pilot for Idea 1
    GPU 1: Pilot for Idea 2
    GPU 2: Pilot for Idea 3
    Use run_in_background: true to launch all at once.
  3. Collect results: Use /monitor-experiment to check progress. If any pilot exceeds PILOT_TIMEOUT_HOURS, kill it and collect partial results. Once all pilots complete (or timeout), compare:
    • Which ideas showed positive signal?
    • Which showed null/negative results? (eliminate or deprioritize)
    • Any surprising findings that suggest a pivot?
    • Total GPU-hours consumed (track against MAX_TOTAL_GPU_HOURS budget)
  4. Re-rank based on empirical evidence: Update the idea ranking using pilot results. An idea with strong pilot signal jumps ahead of a theoretically appealing but untested idea.
Note: Skip this phase if the ideas are purely theoretical or if no GPU is available. Flag skipped ideas as "needs pilot validation" in the report.
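The hard timeout in step 3 can be sketched as a per-pilot wrapper. Assumptions: the command lists and CUDA_VISIBLE_DEVICES pinning reflect a typical setup, not something the skill mandates, and a real deployment would launch pilots concurrently via /run-experiment with run_in_background: true rather than calling this sequentially:

```python
import os
import subprocess

PILOT_TIMEOUT_HOURS = 3  # mirrors the constant above

def run_pilot(cmd: list[str], gpu: int,
              timeout_hours: float = PILOT_TIMEOUT_HOURS) -> tuple[str, bool]:
    """Run one pilot pinned to a GPU. Kill it at the hard timeout and
    return (captured stdout, timed_out flag); on timeout the stdout is
    whatever partial output the pilot produced."""
    env = {**os.environ, "CUDA_VISIBLE_DEVICES": str(gpu)}
    try:
        proc = subprocess.run(cmd, env=env, capture_output=True,
                              text=True, timeout=timeout_hours * 3600)
        return proc.stdout, False
    except subprocess.TimeoutExpired as exc:
        partial = exc.stdout or ""
        if isinstance(partial, bytes):
            partial = partial.decode(errors="replace")
        return partial, True

out, timed_out = run_pilot(["echo", "pilot done"], gpu=0)
```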

Phase 6: Output — Ranked Idea Report


Write a structured report to IDEA_REPORT.md in the project root:

```markdown
# Research Idea Report

**Direction:** [user's research direction]
**Generated:** [date]
**Ideas evaluated:** X generated → Y survived filtering → Z piloted → W recommended

## Landscape Summary

[3-5 paragraphs on the current state of the field]

## Recommended Ideas (ranked)

### Idea 1: [title]

- **Hypothesis:** [one sentence]
- **Minimum experiment:** [concrete description]
- **Expected outcome:** [what success/failure looks like]
- **Novelty:** X/10 — closest work: [paper]
- **Feasibility:** [compute, data, implementation estimates]
- **Risk:** LOW/MEDIUM/HIGH
- **Contribution type:** empirical / method / theory / diagnostic
- **Pilot result:** [POSITIVE: metric +X% / NEGATIVE: no signal / SKIPPED: needs GPU]
- **Reviewer's likely objection:** [strongest counterargument]
- **Why we should do this:** [1-2 sentences]

### Idea 2: [title]

...

## Eliminated Ideas (for reference)

| Idea | Reason eliminated |
|------|-------------------|
| ...  | Already done by [paper] |
| ...  | Requires > 1 week GPU time |
| ...  | Result wouldn't be interesting either way |

## Pilot Experiment Results

| Idea   | GPU   | Time   | Key Metric | Signal        |
|--------|-------|--------|------------|---------------|
| Idea 1 | GPU 0 | 45 min | +2.3% CE   | POSITIVE      |
| Idea 2 | GPU 1 | 30 min | -0.1% CE   | NEGATIVE      |
| Idea 3 | GPU 2 | 1.5 hr | +0.8% CE   | WEAK POSITIVE |

## Suggested Execution Order

1. Start with Idea 1 (positive pilot signal, lowest risk)
2. Idea 3 as backup (weak signal, may need larger scale to confirm)
3. Idea 2 eliminated by pilot — negative result documented

## Next Steps

- Scale up Idea 1 to full experiment (multi-seed, full dataset)
- If confirmed, invoke /auto-review-loop for full iteration
```

Key Rules


  • The user provides a DIRECTION, not an idea. Your job is to generate the ideas.
  • Quantity first, quality second: brainstorm broadly, then filter ruthlessly.
  • A good negative result is just as publishable as a positive one. Prioritize ideas where the answer matters regardless of direction.
  • Don't fall in love with any idea before validating it. Be willing to kill ideas.
  • Always estimate compute cost. An idea that needs 1000 GPU-hours is not actionable for most researchers.
  • "Apply X to Y" is the lowest form of research idea. Push for deeper questions.
  • Include eliminated ideas in the report — they save future time by documenting dead ends.
  • If the user's direction is too broad, ask them to narrow it before proceeding.

Composing with Other Skills


After this skill produces the ranked report:
/idea-creator "direction"     → ranked ideas
/novelty-check "top idea"     → deep novelty verification (already done in Phase 4, but user can re-run)
/research-review "top idea"   → external critical feedback
implement                     → write code
/run-experiment               → deploy to GPU
/auto-review-loop             → iterate until submission-ready