nowait-reasoning-optimizer
NOWAIT Reasoning Optimizer
Implements the NOWAIT technique from the paper "Wait, We Don't Need to 'Wait'! Removing Thinking Tokens Improves Reasoning Efficiency" (Wang et al., 2025).
Overview
NOWAIT is a training-free inference-time intervention that suppresses self-reflection tokens (e.g., "Wait", "Hmm", "Alternatively") during generation, reducing chain-of-thought (CoT) trajectory length by 27-51% without compromising model utility.
When to Use
- Deploying R1-style reasoning models with limited compute
- Reducing inference latency for production systems
- Optimizing token costs for reasoning tasks
- Working with verbose CoT outputs that need streamlining
Supported Models
支持的模型
| Model Series | Type | Token Reduction |
|---|---|---|
| QwQ-32B | RL-based | 16-31% |
| Phi4-Reasoning-Plus | RL-based | 23-28% |
| Qwen3-32B | RL-based | 13-16% |
| Kimi-VL-A3B | Multimodal | 40-60% |
| QvQ-72B-Preview | Multimodal | 20-30% |
Important: NOWAIT works best with RL-based models. Distilled models (Qwen3-4B/8B/14B) show degraded performance when reflection tokens are suppressed.
Quick Start
1. Basic Implementation
```python
from scripts.nowait_processor import NOWAITLogitProcessor

# Initialize processor for your model's tokenizer
processor = NOWAITLogitProcessor(tokenizer)

# Use during generation
outputs = model.generate(
    inputs,
    logits_processor=[processor],
    max_new_tokens=32768
)
```
2. Keywords Suppressed

See references/keywords.md for the complete list. Core keywords:

wait, alternatively, hmm, but, however, check,
double-check, maybe, verify, again, oh, ah

How It Works
- Initialize Keywords: Identify reflection keywords from empirical analysis
- Expand to Token Variants: Map keywords to all token variants in vocabulary (e.g., "wait" → " wait", "Wait", " Wait", ".wait", "WAIT")
- Suppress During Inference: Set logits of reflection tokens to large negative values during decoding
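The three steps above can be sketched in plain Python. This is a hypothetical re-implementation for illustration; the repository's NOWAITLogitProcessor may differ in its details.

```python
# Minimal sketch of the NOWAIT suppression logic (illustrative helper names,
# not the repository's actual API).
import math

REFLECTION_KEYWORDS = {"wait", "hmm", "alternatively"}

def expand_keywords(vocab, keywords=REFLECTION_KEYWORDS):
    """Map keywords to every matching token variant in the vocabulary.

    BPE/SentencePiece vocabs mark leading spaces with 'Ġ' or '▁', so
    ' Wait', 'Wait', '.wait', and 'WAIT' all normalize to 'wait' here.
    """
    banned = set()
    for token, token_id in vocab.items():
        stripped = token.lstrip("Ġ▁ .").lower()
        if stripped in keywords:
            banned.add(token_id)
    return banned

def suppress(logits, banned_ids):
    """Set logits of reflection tokens to -inf so they are never sampled."""
    return [-math.inf if i in banned_ids else x for i, x in enumerate(logits)]

# Toy vocabulary mirroring the diagram below
vocab = {"Wait": 0, "First": 1, "Ġwait": 2, "Hmm": 3, "Let": 4}
banned = expand_keywords(vocab)
print(suppress([0.8, 0.6, 0.3, 0.5, 0.4], banned))
# → [-inf, 0.6, -inf, -inf, 0.4]
```

In a real deployment this logic runs inside a HuggingFace LogitsProcessor's `__call__`, operating on the score tensor at every decoding step.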
Logits (Before)          Logits (After)
Wait   0.8        →      Wait   -inf
First  0.6        →      First  0.6
Hmm    0.5        →      Hmm    -inf
Let    0.4        →      Let    0.4

Key Findings
Why It Works
- NOWAIT doesn't eliminate self-reflection entirely—it guides models to skip unnecessary "waiting" reasoning
- Models still perform essential verification at key decision points
- Results in more linear, straightforward reasoning paths
RL vs Distilled Models
| Model Type | NOWAIT Effect | Recommendation |
|---|---|---|
| RL-based (QwQ, Phi4, Qwen3-32B) | Stable accuracy, significant token reduction | ✅ Recommended |
| Distilled (Qwen3-4B/8B/14B) | Accuracy degradation on hard tasks | ⚠️ Use with caution |
Distilled models rely heavily on CoT structure from training data—removing reflection tokens disrupts their reasoning patterns.
Integration Examples
HuggingFace Transformers
```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from scripts.nowait_processor import NOWAITLogitProcessor

model = AutoModelForCausalLM.from_pretrained("Qwen/QwQ-32B")
tokenizer = AutoTokenizer.from_pretrained("Qwen/QwQ-32B")
processor = NOWAITLogitProcessor(tokenizer)

response = model.generate(
    tokenizer(prompt, return_tensors="pt").input_ids,
    logits_processor=[processor],
    max_new_tokens=32768,
    do_sample=True,
    temperature=0.7
)
```

vLLM
```python
from vllm import LLM, SamplingParams
from scripts.nowait_processor import get_nowait_bad_words_ids

llm = LLM(model="Qwen/QwQ-32B")
bad_words_ids = get_nowait_bad_words_ids(llm.get_tokenizer())

sampling_params = SamplingParams(
    max_tokens=32768,
    bad_words_ids=bad_words_ids
)
```

Expected Results
| Task Type | Original Tokens | NOWAIT Tokens | Reduction |
|---|---|---|---|
| Math (AIME) | 15,000 | 10,500 | 30% |
| Visual QA (MMMU) | 2,900 | 1,450 | 50% |
| Video QA (MMVU) | 1,700 | 1,250 | 27% |
Limitations
- Less effective on very simple problems where CoT overhead is already minimal
- Distilled models may suffer accuracy loss on challenging tasks
- Some domains may require model-specific keyword tuning
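On the last point, keyword tuning can be sketched as building the keyword set yourself before constructing the processor. The keyword list mirrors references/keywords.md, but `tuned_keywords` and its parameters are illustrative, not part of the repository API.

```python
# Illustrative sketch of model-specific keyword tuning; tuned_keywords() is a
# hypothetical helper, not a function in scripts/nowait_processor.py.
CORE_KEYWORDS = {
    "wait", "alternatively", "hmm", "but", "however", "check",
    "double-check", "maybe", "verify", "again", "oh", "ah",
}

def tuned_keywords(domain_extra=(), keep=()):
    """Extend the core set with domain-specific keywords, while exempting
    words the target domain legitimately needs (e.g., 'check' in code review).
    """
    return (CORE_KEYWORDS | set(domain_extra)) - set(keep)

# A code-review deployment might additionally suppress 'rethink'
# but keep literal 'check'/'verify' tokens available:
kw = tuned_keywords(domain_extra={"rethink"}, keep={"check", "verify"})
```

The resulting set would then be expanded to token variants exactly as in the core keyword list.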
References
- Paper: arXiv:2506.08343v2
- Complete keyword list: references/keywords.md
- Implementation: scripts/nowait_processor.py