# nnsight: Transparent Access to Neural Network Internals
nnsight (/ɛn.saɪt/) enables researchers to interpret and manipulate the internals of any PyTorch model, with the unique capability of running the same code locally on small models or remotely on massive models (70B+) via NDIF.
GitHub: ndif-team/nnsight (730+ stars)
Paper: NNsight and NDIF: Democratizing Access to Foundation Model Internals (ICLR 2025)
## Key Value Proposition
Write once, run anywhere: the same interpretability code works on GPT-2 locally or Llama-3.1-405B remotely. Just toggle `remote=True`.

```python
# Local execution (small model)
with model.trace("Hello world"):
    hidden = model.transformer.h[5].output[0].save()

# Remote execution (massive model) - same code!
with model.trace("Hello world", remote=True):
    hidden = model.model.layers[40].output[0].save()
```
## When to Use nnsight
Use nnsight when you need to:
- Run interpretability experiments on models too large for local GPUs (70B, 405B)
- Work with any PyTorch architecture (transformers, Mamba, custom models)
- Perform multi-token generation interventions
- Share activations between different prompts
- Access full model internals without reimplementation
Consider alternatives when:
- You want consistent API across models → Use TransformerLens
- You need declarative, shareable interventions → Use pyvene
- You're training SAEs → Use SAELens
- You only work with small models locally → TransformerLens may be simpler
## Installation
```bash
# Basic installation
pip install nnsight

# For vLLM support
pip install "nnsight[vllm]"
```

For remote NDIF execution, sign up at [login.ndif.us](https://login.ndif.us) for an API key.

## Core Concepts
### LanguageModel Wrapper
```python
from nnsight import LanguageModel

# Load model (uses HuggingFace under the hood)
model = LanguageModel("openai-community/gpt2", device_map="auto")

# For larger models
model = LanguageModel("meta-llama/Llama-3.1-8B", device_map="auto")
```
### Tracing Context
The `trace` context manager enables deferred execution: operations are collected into a computation graph and run when the context exits.

```python
from nnsight import LanguageModel

model = LanguageModel("gpt2", device_map="auto")

with model.trace("The Eiffel Tower is in") as tracer:
    # Access any module's output
    hidden_states = model.transformer.h[5].output[0].save()
    # Access attention patterns
    attn = model.transformer.h[5].attn.attn_dropout.input[0][0].save()
    # Modify activations
    model.transformer.h[8].output[0][:] = 0  # Zero out layer 8
    # Get final output
    logits = model.output.save()

# After context exits, access saved values
print(hidden_states.shape)  # [batch, seq, hidden]
```
### Proxy Objects
Inside `trace`, module accesses return Proxy objects that record operations:

```python
with model.trace("Hello"):
    # These are all Proxy objects - operations are deferred
    h5_out = model.transformer.h[5].output[0]  # Proxy
    h5_mean = h5_out.mean(dim=-1)              # Proxy
    h5_saved = h5_mean.save()                  # Save for later access
```
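The deferred-execution pattern can be illustrated without nnsight at all. The toy class below (not nnsight's actual implementation, just a sketch of the idea) records method calls into a list instead of running them, which is the essence of building a computation graph from Proxy objects:

```python
class ToyProxy:
    """Toy illustration of deferred execution: method calls are
    recorded as operations instead of being executed immediately."""

    def __init__(self, ops=None):
        self.ops = ops or []

    def mean(self, dim=None):
        # Record the operation; return a new proxy carrying the history
        return ToyProxy(self.ops + [("mean", dim)])

    def save(self):
        return ToyProxy(self.ops + [("save", None)])


p = ToyProxy().mean(dim=-1).save()
print(p.ops)  # [('mean', -1), ('save', None)]
```

In nnsight, an analogous recorded graph is what gets executed locally, or serialized and shipped to NDIF for remote execution.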
## Workflow 1: Activation Analysis
### Step-by-Step
```python
from nnsight import LanguageModel
import torch

model = LanguageModel("gpt2", device_map="auto")

prompt = "The capital of France is"

with model.trace(prompt) as tracer:
    # 1. Collect activations from multiple layers
    layer_outputs = []
    for i in range(12):  # GPT-2 has 12 layers
        layer_out = model.transformer.h[i].output[0].save()
        layer_outputs.append(layer_out)
    # 2. Get attention patterns
    attn_patterns = []
    for i in range(12):
        # Access attention weights (after softmax)
        attn = model.transformer.h[i].attn.attn_dropout.input[0][0].save()
        attn_patterns.append(attn)
    # 3. Get final logits
    logits = model.output.save()

# 4. Analyze outside context
for i, layer_out in enumerate(layer_outputs):
    print(f"Layer {i} output shape: {layer_out.shape}")
    print(f"Layer {i} norm: {layer_out.norm().item():.3f}")

# 5. Find top predictions
probs = torch.softmax(logits[0, -1], dim=-1)
top_tokens = probs.topk(5)
for token, prob in zip(top_tokens.indices, top_tokens.values):
    print(f"{model.tokenizer.decode(token)}: {prob.item():.3f}")
```
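With the saved layer outputs in hand, a typical next step is a summary statistic across layers. The sketch below computes cosine similarity of the final-token representation between consecutive layers, using random tensors as stand-ins for real saved activations:

```python
import torch
import torch.nn.functional as F


def layer_similarity(layer_outputs):
    """Cosine similarity of the final-token vector between consecutive
    layers - a quick view of where representations shift the most."""
    sims = []
    for a, b in zip(layer_outputs, layer_outputs[1:]):
        sims.append(F.cosine_similarity(a[0, -1], b[0, -1], dim=0).item())
    return sims


# Stand-ins shaped like GPT-2 activations: [batch, seq, hidden]
fake = [torch.randn(1, 5, 768) for _ in range(3)]
print(layer_similarity(fake))  # two values, each in [-1, 1]
```

In a real run you would pass the `layer_outputs` list saved in the trace above instead of the random stand-ins.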
### Checklist
- Load model with the `LanguageModel` wrapper
- Use the `trace` context for operations
- Call `.save()` on values you need after the context
- Access saved values outside the context
- Use `.shape`, `.norm()`, etc. for analysis
## Workflow 2: Activation Patching
### Step-by-Step
```python
from nnsight import LanguageModel
import torch

model = LanguageModel("gpt2", device_map="auto")

clean_prompt = "The Eiffel Tower is in"
corrupted_prompt = "The Colosseum is in"

# 1. Get clean activations
with model.trace(clean_prompt) as tracer:
    clean_hidden = model.transformer.h[8].output[0].save()

# 2. Patch clean into corrupted run
with model.trace(corrupted_prompt) as tracer:
    # Replace layer 8 output with clean activations
    model.transformer.h[8].output[0][:] = clean_hidden
    patched_logits = model.output.save()

# 3. Compare predictions
paris_token = model.tokenizer.encode(" Paris")[0]
rome_token = model.tokenizer.encode(" Rome")[0]
patched_probs = torch.softmax(patched_logits[0, -1], dim=-1)
print(f"Paris prob: {patched_probs[paris_token].item():.3f}")
print(f"Rome prob: {patched_probs[rome_token].item():.3f}")
```
### Systematic Patching Sweep
```python
def patch_layer_position(layer, position, clean_cache, corrupted_prompt):
    """Patch single layer/position from clean to corrupted."""
    with model.trace(corrupted_prompt) as tracer:
        # Get current activation
        current = model.transformer.h[layer].output[0]
        # Patch only specific position
        current[:, position, :] = clean_cache[layer][:, position, :]
        logits = model.output.save()
    return logits

# Sweep over all layers and positions
results = torch.zeros(12, seq_len)
for layer in range(12):
    for pos in range(seq_len):
        logits = patch_layer_position(layer, pos, clean_hidden, corrupted)
        results[layer, pos] = compute_metric(logits)
```
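`compute_metric` is left unspecified in the sweep above; a common choice in patching experiments is the logit difference between the correct and competing answer tokens. A minimal sketch (the token ids below are synthetic, not real GPT-2 ids):

```python
import torch


def logit_diff_metric(logits, correct_token, wrong_token):
    """Logit difference at the final position; higher means the patch
    restored more of the clean behavior."""
    final = logits[0, -1]
    return (final[correct_token] - final[wrong_token]).item()


# Tiny synthetic check with a vocabulary of 4 tokens
logits = torch.tensor([[[0.0, 1.0, 3.0, -1.0]]])
print(logit_diff_metric(logits, correct_token=2, wrong_token=1))  # 2.0
```

For the Paris/Rome example above, you would pass `paris_token` and `rome_token` as the correct and wrong tokens.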
## Workflow 3: Remote Execution with NDIF
Run the same experiments on massive models without local GPUs.
### Step-by-Step
```python
from nnsight import LanguageModel

# 1. Load large model (will run remotely)
model = LanguageModel("meta-llama/Llama-3.1-70B")

# 2. Same code, just add remote=True
with model.trace("The meaning of life is", remote=True) as tracer:
    # Access internals of a 70B model!
    layer_40_out = model.model.layers[40].output[0].save()
    logits = model.output.save()

# 3. Results returned from NDIF
print(f"Layer 40 shape: {layer_40_out.shape}")

# 4. Generation with interventions
with model.trace(remote=True) as tracer:
    with tracer.invoke("What is 2+2?"):
        # Intervene during generation
        model.model.layers[20].output[0][:, -1, :] *= 1.5
        output = model.generate(max_new_tokens=50)
```

### NDIF Setup
- Sign up at [login.ndif.us](https://login.ndif.us)
- Get an API key
- Set an environment variable or configure nnsight directly:

```python
import os
os.environ["NDIF_API_KEY"] = "your_key"

# Or configure directly
from nnsight import CONFIG
CONFIG.API_KEY = "your_key"
```
### Available Models on NDIF
- Llama-3.1-8B, 70B, 405B
- DeepSeek-R1 models
- Various open-weight models (check ndif.us for current list)
## Workflow 4: Cross-Prompt Activation Sharing
Share activations between different inputs in a single trace.
```python
from nnsight import LanguageModel

model = LanguageModel("gpt2", device_map="auto")

with model.trace() as tracer:
    # First prompt
    with tracer.invoke("The cat sat on the"):
        cat_hidden = model.transformer.h[6].output[0].save()
    # Second prompt - inject cat's activations
    with tracer.invoke("The dog ran through the"):
        # Replace with cat's activations at layer 6
        model.transformer.h[6].output[0][:] = cat_hidden
        dog_with_cat = model.output.save()

# The dog prompt now has cat's internal representations
```
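A common variant of cross-prompt injection interpolates between the two runs instead of overwriting outright. Sketched here on stand-in tensors (`blend` is a hypothetical helper, not an nnsight API; inside a real trace you would assign the blend back into the layer output):

```python
import torch


def blend(a, b, alpha=0.5):
    """Linearly interpolate between two activation tensors:
    alpha=1.0 keeps a entirely, alpha=0.0 keeps b entirely."""
    return alpha * a + (1 - alpha) * b


cat = torch.ones(1, 4, 768)   # stand-in for cat_hidden
dog = torch.zeros(1, 4, 768)  # stand-in for the dog run's layer 6 output
mixed = blend(cat, dog, alpha=0.25)
print(mixed[0, 0, 0].item())  # 0.25
```

Sweeping `alpha` from 0 to 1 gives a rough dose-response curve for how strongly the injected representation steers the output.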
## Workflow 5: Gradient-Based Analysis
Access gradients during backward pass.
```python
from nnsight import LanguageModel
import torch

model = LanguageModel("gpt2", device_map="auto")

with model.trace("The quick brown fox") as tracer:
    # Save activations and enable gradient
    hidden = model.transformer.h[5].output[0].save()
    hidden.retain_grad()
    logits = model.output
    # Compute loss on a specific token
    target_token = model.tokenizer.encode(" jumps")[0]
    loss = -logits[0, -1, target_token]
    # Backward pass
    loss.backward()

# Access gradients
grad = hidden.grad
print(f"Gradient shape: {grad.shape}")
print(f"Gradient norm: {grad.norm().item():.3f}")
```

**Note**: Gradient access is not supported for vLLM or remote execution.
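The saved gradient pairs naturally with the activation for a first-order attribution score. The sketch below computes gradient-times-activation on small stand-in tensors; with the real `hidden` and `grad` from the trace above, it estimates each token position's contribution to the loss:

```python
import torch


def grad_x_act(hidden, grad):
    """Per-position gradient-times-activation: a first-order estimate
    of how much each token position contributes to the loss."""
    return (hidden * grad).sum(dim=-1)  # [batch, seq]


h = torch.tensor([[[1.0, 2.0], [0.5, -1.0]]])  # stand-in activations
g = torch.tensor([[[0.1, 0.1], [1.0, 1.0]]])   # stand-in gradients
print(grad_x_act(h, g))  # position scores: 0.3 and -0.5
```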
## Common Issues & Solutions
### Issue: Module path differs between models
```python
# GPT-2 structure
model.transformer.h[5].output[0]

# LLaMA structure
model.model.layers[5].output[0]

# Solution: check the model structure
print(model._model)  # See actual module names
```
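One way to hide the path difference is a small dispatch helper. This is not part of nnsight; it is a sketch that assumes only the two layouts shown above and would need extending for other architectures:

```python
from types import SimpleNamespace


def get_layers(model):
    """Return the decoder block list for common layouts:
    GPT-2-style (model.transformer.h) or LLaMA-style (model.model.layers)."""
    if hasattr(model, "transformer"):
        return model.transformer.h
    if hasattr(model, "model"):
        return model.model.layers
    raise ValueError("Unknown layout - inspect print(model._model)")


# Quick check with dummy objects standing in for wrapped models
gpt2_like = SimpleNamespace(transformer=SimpleNamespace(h=["b0", "b1"]))
llama_like = SimpleNamespace(model=SimpleNamespace(layers=["b0"]))
print(len(get_layers(gpt2_like)), len(get_layers(llama_like)))  # 2 1
```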
### Issue: Forgetting to save
```python
# WRONG: value not accessible outside the trace
with model.trace("Hello"):
    hidden = model.transformer.h[5].output[0]  # Not saved!
print(hidden)  # Error or wrong value

# RIGHT: call .save()
with model.trace("Hello"):
    hidden = model.transformer.h[5].output[0].save()
print(hidden)  # Works!
```
### Issue: Remote timeout
```python
# For long operations, increase the timeout
with model.trace("prompt", remote=True, timeout=300) as tracer:
    ...  # Long operation
```
### Issue: Memory with many saved activations
```python
with model.trace("prompt"):
    # Don't save everything
    for i in range(12):  # every GPT-2 layer - memory heavy!
        model.transformer.h[i].output[0].save()

    # Better: save only the specific layers you need
    key_layers = [0, 5, 11]
    for i in key_layers:
        model.transformer.h[i].output[0].save()
```
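When the full tensors aren't needed, saving a reduced statistic shrinks memory much further. The idea on a stand-in tensor (inside a real trace you would call `.save()` on the reduced proxy instead of the full layer output):

```python
import torch

# A full GPT-2 activation: [batch, seq, hidden]
hidden = torch.randn(1, 10, 768)

# Keep only the per-position L2 norm - 768x smaller
norms = hidden.norm(dim=-1)
print(norms.shape)  # torch.Size([1, 10])
```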
### Issue: vLLM gradient limitation
```python
# vLLM doesn't support gradients
# Use standard execution for gradient analysis
model = LanguageModel("gpt2", device_map="auto")  # Not vLLM
```
## Key API Reference
| Method/Property | Purpose |
|---|---|
| `model.trace(...)` | Start tracing context |
| `.save()` | Save value for access after trace |
| `proxy[...]` | Slice/index proxy (assignment patches) |
| `tracer.invoke(...)` | Add prompt within trace |
| `model.generate(...)` | Generate with interventions |
| `model.output` | Final model output logits |
| `model._model` | Underlying HuggingFace model |
## Comparison with Other Tools
| Feature | nnsight | TransformerLens | pyvene |
|---|---|---|---|
| Any architecture | Yes | Transformers only | Yes |
| Remote execution | Yes (NDIF) | No | No |
| Consistent API | No | Yes | Yes |
| Deferred execution | Yes | No | No |
| HuggingFace native | Yes | Reimplemented | Yes |
| Shareable configs | No | No | Yes |
## Reference Documentation
For detailed API documentation, tutorials, and advanced usage, see the `references/` folder:

| File | Contents |
|---|---|
| references/README.md | Overview and quick start guide |
| references/api.md | Complete API reference for LanguageModel, tracing, proxy objects |
| references/tutorials.md | Step-by-step tutorials for local and remote interpretability |
## External Resources
### Tutorials
### Official Documentation
### Papers
- NNsight and NDIF Paper - Fiotto-Kaufman et al. (ICLR 2025)
## Architecture Support
nnsight works with any PyTorch model:
- Transformers: GPT-2, LLaMA, Mistral, etc.
- State-space models: Mamba
- Vision models: ViT, CLIP
- Custom architectures: any `nn.Module`

The key is knowing the module structure so you can access the right components.
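To discover the module structure of an unfamiliar architecture, PyTorch's `named_modules` lists every addressable component. Shown here on a toy `nn.Module`; the printed names are exactly the kind of paths you would use when accessing internals in a trace:

```python
import torch.nn as nn

# Any nn.Module works; inspect its names before writing interventions
toy = nn.Sequential(nn.Linear(4, 8), nn.ReLU(), nn.Linear(8, 2))
for name, module in toy.named_modules():
    print(name or "(root)", type(module).__name__)
```

For a wrapped language model, `print(model._model)` (as shown in the troubleshooting section) serves the same purpose.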