nowait-reasoning-optimizer
NOWAIT Reasoning Optimizer
Implements the NOWAIT technique from the paper "Wait, We Don't Need to 'Wait'! Removing Thinking Tokens Improves Reasoning Efficiency" (Wang et al., 2025).
Overview
NOWAIT is a training-free inference-time intervention that suppresses self-reflection tokens (e.g., "Wait", "Hmm", "Alternatively") during generation, reducing chain-of-thought (CoT) trajectory length by 27-51% without compromising model utility.
When to Use
- Deploying R1-style reasoning models with limited compute
- Reducing inference latency for production systems
- Optimizing token costs for reasoning tasks
- Working with verbose CoT outputs that need streamlining
Supported Models
支持的模型
| Model Series | Type | Token Reduction |
|---|---|---|
| QwQ-32B | RL-based | 16-31% |
| Phi4-Reasoning-Plus | RL-based | 23-28% |
| Qwen3-32B | RL-based | 13-16% |
| Kimi-VL-A3B | Multimodal | 40-60% |
| QvQ-72B-Preview | Multimodal | 20-30% |
Important: NOWAIT works best with RL-based models. Distilled models (Qwen3-4B/8B/14B) show degraded performance when reflection tokens are suppressed.
Quick Start
1. Basic Implementation
```python
from scripts.nowait_processor import NOWAITLogitProcessor

# Initialize processor for your model's tokenizer
processor = NOWAITLogitProcessor(tokenizer)

# Use during generation
outputs = model.generate(
    inputs,
    logits_processor=[processor],
    max_new_tokens=32768
)
```
2. Keywords Suppressed

See references/keywords.md for the complete list. Core keywords:

wait, alternatively, hmm, but, however, check,
double-check, maybe, verify, again, oh, ah

How It Works
- Initialize Keywords: Identify reflection keywords from empirical analysis
- Expand to Token Variants: Map keywords to all token variants in vocabulary (e.g., "wait" → " wait", "Wait", " Wait", ".wait", "WAIT")
- Suppress During Inference: Set logits of reflection tokens to large negative values during decoding
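The three steps above can be sketched in plain Python. This is a hypothetical re-implementation for illustration; the repository's NOWAITLogitProcessor may differ in its details.

```python
# Minimal sketch of the NOWAIT suppression logic (illustrative helper names,
# not the repository's actual API).
import math

REFLECTION_KEYWORDS = {"wait", "hmm", "alternatively"}

def expand_keywords(vocab, keywords=REFLECTION_KEYWORDS):
    """Map keywords to every matching token variant in the vocabulary.

    BPE/SentencePiece vocabs mark leading spaces with 'Ġ' or '▁', so
    ' Wait', 'Wait', '.wait', and 'WAIT' all normalize to 'wait' here.
    """
    banned = set()
    for token, token_id in vocab.items():
        stripped = token.lstrip("Ġ▁ .").lower()
        if stripped in keywords:
            banned.add(token_id)
    return banned

def suppress(logits, banned_ids):
    """Set logits of reflection tokens to -inf so they are never sampled."""
    return [-math.inf if i in banned_ids else x for i, x in enumerate(logits)]

# Toy vocabulary mirroring the diagram below
vocab = {"Wait": 0, "First": 1, "Ġwait": 2, "Hmm": 3, "Let": 4}
banned = expand_keywords(vocab)
print(suppress([0.8, 0.6, 0.3, 0.5, 0.4], banned))
# → [-inf, 0.6, -inf, -inf, 0.4]
```

In a real deployment this logic runs inside a HuggingFace LogitsProcessor's `__call__`, operating on the score tensor at every decoding step.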
Logits (Before)          Logits (After)
Wait   0.8        →      Wait   -inf
First  0.6        →      First  0.6
Hmm    0.5        →      Hmm    -inf
Let    0.4        →      Let    0.4

Key Findings
Why It Works
- NOWAIT doesn't eliminate self-reflection entirely—it guides models to skip unnecessary "waiting" reasoning
- Models still perform essential verification at key decision points
- Results in more linear, straightforward reasoning paths
RL vs Distilled Models
| Model Type | NOWAIT Effect | Recommendation |
|---|---|---|
| RL-based (QwQ, Phi4, Qwen3-32B) | Stable accuracy, significant token reduction | ✅ Recommended |
| Distilled (Qwen3-4B/8B/14B) | Accuracy degradation on hard tasks | ⚠️ Use with caution |
Distilled models rely heavily on CoT structure from training data—removing reflection tokens disrupts their reasoning patterns.
Integration Examples
HuggingFace Transformers
```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from scripts.nowait_processor import NOWAITLogitProcessor

model = AutoModelForCausalLM.from_pretrained("Qwen/QwQ-32B")
tokenizer = AutoTokenizer.from_pretrained("Qwen/QwQ-32B")
processor = NOWAITLogitProcessor(tokenizer)

response = model.generate(
    tokenizer(prompt, return_tensors="pt").input_ids,
    logits_processor=[processor],
    max_new_tokens=32768,
    do_sample=True,
    temperature=0.7
)
```

vLLM
```python
from vllm import LLM, SamplingParams
from scripts.nowait_processor import get_nowait_bad_words_ids

llm = LLM(model="Qwen/QwQ-32B")
bad_words_ids = get_nowait_bad_words_ids(llm.get_tokenizer())

sampling_params = SamplingParams(
    max_tokens=32768,
    bad_words_ids=bad_words_ids
)
```

Expected Results
| Task Type | Original Tokens | NOWAIT Tokens | Reduction |
|---|---|---|---|
| Math (AIME) | 15,000 | 10,500 | 30% |
| Visual QA (MMMU) | 2,900 | 1,450 | 50% |
| Video QA (MMVU) | 1,700 | 1,250 | 27% |
Limitations
- Less effective on very simple problems where CoT overhead is already minimal
- Distilled models may suffer accuracy loss on challenging tasks
- Some domains may require model-specific keyword tuning
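On the last point, keyword tuning can be sketched as building the keyword set yourself before constructing the processor. The keyword list mirrors references/keywords.md, but `tuned_keywords` and its parameters are illustrative, not part of the repository API.

```python
# Illustrative sketch of model-specific keyword tuning; tuned_keywords() is a
# hypothetical helper, not a function in scripts/nowait_processor.py.
CORE_KEYWORDS = {
    "wait", "alternatively", "hmm", "but", "however", "check",
    "double-check", "maybe", "verify", "again", "oh", "ah",
}

def tuned_keywords(domain_extra=(), keep=()):
    """Extend the core set with domain-specific keywords, while exempting
    words the target domain legitimately needs (e.g., 'check' in code review).
    """
    return (CORE_KEYWORDS | set(domain_extra)) - set(keep)

# A code-review deployment might additionally suppress 'rethink'
# but keep literal 'check'/'verify' tokens available:
kw = tuned_keywords(domain_extra={"rethink"}, keep={"check", "verify"})
```

The resulting set would then be expanded to token variants exactly as in the core keyword list.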
References
- Paper: arXiv:2506.08343v2
- Complete keyword list: references/keywords.md
- Implementation: scripts/nowait_processor.py