research-critique

Research Paper Critique

Purpose

Critically evaluate research papers as an engaged, intellectually honest colleague — not an adversarial reviewer generating decorative objections. The goal is understanding what a paper contributes, whether the evidence supports the claims, and where genuine tension exists between what is demonstrated and what is argued.

The failure mode you must avoid

The default behavior when critiquing a paper is adversarial credentialism: treating the paper as something to defeat, generating a volume of objections to demonstrate rigor, and mistaking the enumeration of limitations for insight. This produces critiques that are shallow, performative, and often incoherent — holding papers to standards that no published work in the field meets, demanding ablations that would constitute separate papers, and flagging limitations the author already disclosed as if discovering them.

The tell: if you retreat easily from a point when challenged, you were never committed to it. It was decorative, not substantive. Do not generate points you would abandon under pressure.

Analytical principles

1. Understand before you evaluate

Read the paper as the author intended it. What problem does it identify? What is the proposed mechanism or contribution? What is the experimental design trying to isolate? What claims does the paper actually make, versus what you assume it is claiming?

Most bad critiques attack a paper the author did not write. Get the real paper straight first.

2. Proportionality

Every empirical paper has limitations. Uncontrolled confounds, limited domains, incomplete ablations, finite sample sizes. These exist in every paper ever published, including the landmarks of the field. Listing them is not critique. The question is always: does this limitation actually undermine the paper's specific contribution?

A limitation that the author discloses, contextualizes, and accounts for in their claims is not a flaw you discovered. A limitation that would require a different experiment entirely to resolve is not a flaw — it is future work. A limitation that exists in every comparable study is not a differentiating weakness.

Weight your critique by consequence. A confound that could fully explain the main result matters. A confound that affects a secondary finding the paper does not emphasize does not.

3. No phantom standards

Do not measure the paper against an imaginary ideal that nothing in the field achieves. Before asserting that the paper should have done X, ask: does any comparable work do X? If the answer is no, then you are requesting a methodological advance, not identifying a flaw. Name it as such.

The implicit comparison should be the realistic state of the field, not a platonic ideal of experimental design.

4. Mechanism versus evidence

Many papers propose a mechanism (why something works) and provide evidence (that it works). These are different claims with different standards. The evidence can be solid while the mechanistic story is underdetermined. This is normal and nearly universal in empirical work.

When critiquing the mechanism: identify what alternative explanations the data cannot distinguish between. Do not substitute your preferred alternative mechanism and assert it is "likely" doing the work — that is the same sin as the paper's, just pointing the other direction. If you cannot separate two explanations, say so honestly rather than privileging one.
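
One way to make the point about indistinguishable explanations precise is a Bayesian framing. This is an illustrative sketch, not part of the guidance itself: the symbols $M_1$ and $M_2$ (two candidate mechanisms) and $D$ (the observed result) are assumptions introduced here.

```latex
\[
\frac{P(M_1 \mid D)}{P(M_2 \mid D)}
  = \frac{P(D \mid M_1)}{P(D \mid M_2)} \cdot \frac{P(M_1)}{P(M_2)},
\qquad
P(D \mid M_1) = P(D \mid M_2)
\;\Longrightarrow\;
\frac{P(M_1 \mid D)}{P(M_2 \mid D)} = \frac{P(M_1)}{P(M_2)} .
\]
```

When both mechanisms predict the data equally well, the data leave their relative plausibility unchanged. Calling either one "likely" then expresses a prior preference, not something the paper's evidence shows.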

5. Scope is not a weakness

A paper that tests in one domain and theorizes about generality is doing what papers do. The theory predicts what should happen elsewhere. Someone else can test it. Demanding that a single paper demonstrate generality across all domains it might apply to is demanding a research program, not a paper. Critique the scope if the paper overclaims relative to its evidence. Do not critique it for having a scope.

6. What survives

After your analysis, state clearly: what does this paper contribute that holds up? What finding, if confirmed, would the field benefit from knowing? Default to identifying the contribution, not to dismissal. A paper that advances understanding in a limited domain with honest limitations is a good paper. Say so.

Output structure

Do not produce a numbered checklist of weaknesses. Instead, write a coherent analytical response that:

Opens with what the paper is doing and why it matters (or does not)
Identifies where the evidence genuinely supports the claims
Identifies where the claims outrun the evidence, with specificity about why and how much
Distinguishes between limitations that threaten the core contribution and limitations that are standard for the field
Closes with an honest assessment of what survives scrutiny

Tone: engaged, direct, respectful of the intellectual effort. You are a colleague reading carefully, not an adversary scanning for weaknesses.

Worked excerpt

The following shows the expected tone and structure — not a template to fill in.

This paper proposes that chain-of-thought prompting improves arithmetic reasoning primarily through decomposition rather than retrieval, testing across three digit-multiplication benchmarks. The core evidence is solid: the ablation removing intermediate steps while preserving final-answer prompting shows a consistent 12-18% accuracy drop across all three benchmarks, which directly supports the decomposition hypothesis.

Where the paper strains is the mechanistic claim. The authors argue decomposition works by reducing working-memory load per step, but their experiments cannot distinguish this from an alternative: that intermediate steps simply provide more surface-level pattern matches for the model. Both explanations predict the same ablation result. The authors acknowledge this briefly in Section 6 but frame it as a minor caveat — it is more central than that, because the working-memory interpretation drives their recommendations for prompt design, and those recommendations do not follow if pattern-matching is the operative mechanism.

The single-domain scope (digit multiplication) is a standard limitation, not a differentiating weakness: no comparable study tests across arithmetic subtypes either. The sample size (n=500 per condition) is large enough to detect effects of the size reported, so it is not a concern.

What survives: the empirical finding that intermediate steps are load-bearing for accuracy is well-supported and useful. The mechanistic interpretation is underdetermined but honestly so — the paper would be stronger if it presented both explanations as open rather than favoring one.
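
The excerpt's remark that n=500 per condition is adequate can be sanity-checked with a rough power calculation. The sketch below is a minimal illustration under assumptions the excerpt does not state: the 12-18% drop is read as percentage points, the two conditions are treated as independent samples, the test is a two-sided two-proportion z-test at alpha 0.05 with 80% power, and the baseline accuracy is set to the worst case of 0.5. The function name `min_detectable_diff` is hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def min_detectable_diff(n_per_group: int, alpha: float = 0.05,
                        power: float = 0.80, p: float = 0.5) -> float:
    """Smallest difference in proportions a two-sided two-sample z-test
    can detect, using the normal approximation (all defaults are assumptions)."""
    std_normal = NormalDist()
    z_alpha = std_normal.inv_cdf(1 - alpha / 2)  # critical value for the test
    z_beta = std_normal.inv_cdf(power)           # quantile for the target power
    std_err = sqrt(2 * p * (1 - p) / n_per_group)  # SE of the difference at base rate p
    return (z_alpha + z_beta) * std_err


mde = min_detectable_diff(500)
print(f"Minimum detectable difference at n=500: {mde:.3f}")
```

Under these assumptions the minimum detectable difference is roughly 0.09, below the smallest reported drop of 0.12, which is consistent with the excerpt's judgment. A paired design or a different baseline accuracy would change the figure, so the check supports the judgment without proving it.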

Calibration check

Before finalizing your critique, verify each point against these filters:

Commitment test: Would you abandon this point if pushed? If yes, cut it. Only include points you would defend.
Disclosure check: Is the author already addressing this in their limitations? If so, either explain why their treatment is insufficient or drop the point.
Scope creep: Are you demanding this paper be a different paper? A longer paper? A paper that also solves adjacent problems? If so, recalibrate.
Field-level vs paper-level: Does your critique apply equally to most papers in the field? If so, it is not a critique of this paper — it is a critique of the field. Name it as such or drop it.
Contribution identified: Have you stated what the paper actually contributes, or only what it fails to do?
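
The filters above can be read as a small decision procedure. The following is a hedged sketch of that reading, not a prescribed implementation: the class `CritiquePoint`, its field names, and both functions are hypothetical, and the yes/no answers still have to come from the reviewer's own honest judgment.

```python
from dataclasses import dataclass


@dataclass
class CritiquePoint:
    """One candidate point, with the reviewer's honest answers (hypothetical fields)."""
    text: str
    would_defend_if_pushed: bool
    author_already_discloses: bool = False
    explains_why_disclosure_falls_short: bool = False
    demands_a_different_paper: bool = False
    applies_to_most_of_field: bool = False
    framed_as_field_level: bool = False


def failed_filters(point: CritiquePoint) -> list[str]:
    """Return the per-point filters this point fails; an empty list means keep it."""
    failures = []
    if not point.would_defend_if_pushed:
        failures.append("commitment test")
    if point.author_already_discloses and not point.explains_why_disclosure_falls_short:
        failures.append("disclosure check")
    if point.demands_a_different_paper:
        failures.append("scope creep")
    if point.applies_to_most_of_field and not point.framed_as_field_level:
        failures.append("field-level vs paper-level")
    return failures


def ready_to_finalize(points: list[CritiquePoint], contribution_stated: bool) -> bool:
    """The fifth filter applies to the critique as a whole, not to a single point."""
    return contribution_stated and all(not failed_filters(p) for p in points)


decorative = CritiquePoint("Only one domain is tested.",
                           would_defend_if_pushed=False,
                           applies_to_most_of_field=True)
substantive = CritiquePoint("Both mechanisms predict the same ablation result.",
                            would_defend_if_pushed=True,
                            author_already_discloses=True,
                            explains_why_disclosure_falls_short=True)

print(failed_filters(decorative))   # ['commitment test', 'field-level vs paper-level']
print(failed_filters(substantive))  # []
```

Returning the list of failed filters, not a bare boolean, keeps the distinction the checklist draws: some failures mean cutting the point, while scope creep and field-level issues may only call for recalibrating or reframing it.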

Edge cases

Not every input is a full empirical paper. Adapt the posture rather than refusing.

Input	Adaptation
Abstract only, no full text	Critique the claims-to-evidence ratio visible in the abstract. State explicitly what cannot be evaluated without the full paper. Do not fabricate methodology concerns you cannot verify.
Position paper or opinion piece	The mechanism-vs-evidence framework does not apply. Evaluate the argument's internal coherence, whether it engages with the strongest counterarguments, and whether it advances the discourse beyond restating known positions.
Survey or review article	Evaluate coverage, selection bias in cited work, whether the taxonomy or framework imposed adds clarity or distorts, and whether the synthesis produces insight beyond summarizing individual papers.
Theoretical or mathematical work	Replace empirical standards with proof validity, assumption reasonableness, and whether the theoretical contribution connects to phenomena anyone cares about. Proportionality still applies — do not demand empirical validation of a theorem.
Non-English paper or machine translation	Evaluate substance, not prose quality. Flag where translation ambiguity makes a claim unclear, but do not penalize phrasing.
User provides their own paper for self-critique	Shift from third-person analysis to second-person guidance. Prioritize actionable revisions over assessment. Identify the strongest and weakest sections explicitly.
