torchcode-pytorch-interview-practice


TorchCode — PyTorch Interview Practice

Skill by ara.so — Daily 2026 Skills collection.
TorchCode is a Jupyter-based, self-hosted coding practice environment for ML engineers. It provides 40 curated problems covering PyTorch fundamentals and architectures (softmax, LayerNorm, MultiHeadAttention, GPT-2, etc.) with an automated judge that gives instant pass/fail feedback, gradient verification, and timing — like LeetCode but for tensors.


Installation & Setup

Option 1: Online (zero install)

Option 2: pip (for use inside Colab or existing environment)

```bash
pip install torch-judge
```

Option 3: Docker (pre-built image)

```bash
docker run -p 8888:8888 -e PORT=8888 ghcr.io/duoan/torchcode:latest
```

Option 4: Build locally

```bash
git clone https://github.com/duoan/TorchCode.git
cd TorchCode
make run
```

`make run` auto-detects Docker or Podman and falls back to local build if the registry image is unavailable (common on Apple Silicon/arm64).

---


Judge API

The `torch_judge` package provides the core API used in every notebook.

```python
from torch_judge import check, status, hint, reset_progress
```

List all 40 problems and your progress

```python
status()
```

Run tests for a specific problem

```python
check("relu")
check("softmax")
check("layernorm")
check("attention")
check("gpt2")
```

Get a hint without spoilers

```python
hint("softmax")
```

Reset progress for a problem

```python
reset_progress("relu")
```

`check()` return values

  • Colored pass/fail per test case
  • Correctness check against PyTorch reference implementation
  • Gradient verification (autograd compatibility)
  • Timing measurement

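The gradient-verification bullet can be sanity-checked by hand: if an implementation is built purely from PyTorch ops, autograd can differentiate through it. A minimal sketch, using the ReLU problem as the example (the judge's actual mechanism may differ):

```python
import torch

def my_relu(x: torch.Tensor) -> torch.Tensor:
    return torch.clamp(x, min=0)

# Autograd can only flow through PyTorch ops; this is what
# "gradient verification (autograd compatibility)" relies on.
x = torch.randn(4, requires_grad=True)
my_relu(x).sum().backward()
assert x.grad is not None          # gradients reached the input
assert x.grad.shape == x.shape
```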

Problem Set Overview

Difficulty levels: Easy → Medium → Hard

| # | Problem | Key Concepts |
|---|---------|--------------|
| 1 | ReLU | Activation functions, element-wise ops |
| 2 | Softmax | Numerical stability, exp/log tricks |
| 3 | Linear Layer | `y = xW^T + b`, Kaiming init, `nn.Parameter` |
| 4 | LayerNorm | Normalization, affine transform |
| 5 | Self-Attention | QKV projections, scaled dot-product |
| 6 | Multi-Head Attention | Head splitting, concatenation |
| 7 | BatchNorm | Batch vs layer statistics, train/eval |
| 8 | RMSNorm | LLaMA-style norm |
| 16 | Cross-Entropy Loss | Log-softmax, logsumexp trick |
| 17 | Dropout | Train/eval mode, inverted scaling |
| 18 | Embedding | Lookup table, `weight[indices]` |
| 19 | GELU | `torch.erf`, Gaussian error linear unit |
| 20 | Kaiming Init | `std = sqrt(2/fan_in)` |
| 21 | Gradient Clipping | Norm-based clipping |
| 31 | Gradient Accumulation | Micro-batching, loss scaling |
| 40 | Linear Regression | Normal equation, GD from scratch |

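Not every problem in the table has a worked example below. As one illustration, the exact GELU from problem 19 can be written with `torch.erf` (a sketch; `my_gelu` is a hypothetical name, not necessarily the signature the judge expects):

```python
import math
import torch

def my_gelu(x: torch.Tensor) -> torch.Tensor:
    # Exact GELU: x * Phi(x), where Phi is the standard normal CDF,
    # expressed via the error function: Phi(x) = 0.5 * (1 + erf(x / sqrt(2)))
    return x * 0.5 * (1.0 + torch.erf(x / math.sqrt(2.0)))
```

This matches `torch.nn.functional.gelu` in its default exact (non-tanh) formulation.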

Working Through a Problem

Each problem notebook has the same structure:

```
templates/
  01_relu.ipynb       # Blank template — your workspace
  02_softmax.ipynb
  ...
solutions/
  01_relu.ipynb       # Reference solution (study after attempt)
```

Typical notebook workflow


Cell 1: Import judge

```python
from torch_judge import check, hint
import torch
import torch.nn as nn
```

Cell 2: Your implementation

```python
def my_relu(x: torch.Tensor) -> torch.Tensor:
    # TODO: implement ReLU without using torch.relu or F.relu
    raise NotImplementedError
```

Cell 3: Run the judge

```python
check("relu")
```

---

Real Implementation Examples

ReLU (Problem 1 — Easy)

```python
def my_relu(x: torch.Tensor) -> torch.Tensor:
    return torch.clamp(x, min=0)
    # Alternative: return x * (x > 0)
    # Alternative: return torch.where(x > 0, x, torch.zeros_like(x))
```

Softmax (Problem 2 — Easy, numerically stable)

```python
def my_softmax(x: torch.Tensor, dim: int = -1) -> torch.Tensor:
    # Subtract max for numerical stability (prevents overflow)
    x_max = x.max(dim=dim, keepdim=True).values
    x_shifted = x - x_max
    exp_x = torch.exp(x_shifted)
    return exp_x / exp_x.sum(dim=dim, keepdim=True)
```
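The judge compares against PyTorch's reference implementation; the same sanity check can be run locally, including on inputs large enough to overflow a naive `exp`:

```python
import torch
import torch.nn.functional as F

x = torch.randn(4, 10) * 50   # values this large overflow exp() in float32
x_shifted = x - x.max(dim=-1, keepdim=True).values
mine = torch.exp(x_shifted) / torch.exp(x_shifted).sum(dim=-1, keepdim=True)

assert torch.allclose(mine, F.softmax(x, dim=-1), atol=1e-6)
assert torch.allclose(mine.sum(dim=-1), torch.ones(4), atol=1e-5)  # rows sum to 1
```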

LayerNorm (Problem 4 — Medium)

```python
def my_layer_norm(
    x: torch.Tensor,
    weight: torch.Tensor,   # gamma (scale)
    bias: torch.Tensor,     # beta (shift)
    eps: float = 1e-5
) -> torch.Tensor:
    mean = x.mean(dim=-1, keepdim=True)
    var = x.var(dim=-1, keepdim=True, unbiased=False)
    x_norm = (x - mean) / torch.sqrt(var + eps)
    return weight * x_norm + bias
```
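A quick self-check against the built-in `F.layer_norm` (same `eps`, elementwise affine):

```python
import torch
import torch.nn.functional as F

x = torch.randn(2, 5, 16)
weight, bias = torch.randn(16), torch.randn(16)

# Same computation as my_layer_norm above, inlined
mean = x.mean(dim=-1, keepdim=True)
var = x.var(dim=-1, keepdim=True, unbiased=False)
manual = weight * (x - mean) / torch.sqrt(var + 1e-5) + bias

assert torch.allclose(manual, F.layer_norm(x, (16,), weight, bias, eps=1e-5), atol=1e-5)
```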

RMSNorm (Problem 8 — Medium, LLaMA-style)

```python
def rms_norm(x: torch.Tensor, weight: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    rms = torch.sqrt((x ** 2).mean(dim=-1, keepdim=True) + eps)
    return (x / rms) * weight
```

Scaled Dot-Product Self-Attention (Problem 5 — Medium)

```python
import torch.nn.functional as F
import math

def scaled_dot_product_attention(
    Q: torch.Tensor,  # (B, heads, T, head_dim)
    K: torch.Tensor,
    V: torch.Tensor,
    mask: torch.Tensor = None
) -> torch.Tensor:
    d_k = Q.size(-1)
    scores = torch.matmul(Q, K.transpose(-2, -1)) / math.sqrt(d_k)
    if mask is not None:
        scores = scores.masked_fill(mask == 0, float('-inf'))
    attn_weights = F.softmax(scores, dim=-1)
    return torch.matmul(attn_weights, V)
```
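Since PyTorch 2.0 the same computation ships as `F.scaled_dot_product_attention`; comparing against it is a good test of the manual version (unmasked case shown):

```python
import math
import torch
import torch.nn.functional as F

Q = torch.randn(2, 4, 8, 16)   # (B, heads, T, head_dim)
K = torch.randn(2, 4, 8, 16)
V = torch.randn(2, 4, 8, 16)

scores = torch.matmul(Q, K.transpose(-2, -1)) / math.sqrt(Q.size(-1))
manual = torch.matmul(F.softmax(scores, dim=-1), V)

assert torch.allclose(manual, F.scaled_dot_product_attention(Q, K, V), atol=1e-4)
```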

Multi-Head Attention (Problem 6 — Medium)

```python
class MyMultiHeadAttention(nn.Module):
    def __init__(self, d_model: int, num_heads: int):
        super().__init__()
        assert d_model % num_heads == 0
        self.num_heads = num_heads
        self.head_dim = d_model // num_heads
        self.d_model = d_model

        self.W_q = nn.Linear(d_model, d_model)
        self.W_k = nn.Linear(d_model, d_model)
        self.W_v = nn.Linear(d_model, d_model)
        self.W_o = nn.Linear(d_model, d_model)

    def forward(self, x: torch.Tensor, mask: torch.Tensor = None) -> torch.Tensor:
        B, T, C = x.shape

        def split_heads(t):
            return t.view(B, T, self.num_heads, self.head_dim).transpose(1, 2)

        Q = split_heads(self.W_q(x))
        K = split_heads(self.W_k(x))
        V = split_heads(self.W_v(x))

        attn_out = scaled_dot_product_attention(Q, K, V, mask)
        # (B, heads, T, head_dim) -> (B, T, d_model)
        attn_out = attn_out.transpose(1, 2).contiguous().view(B, T, C)
        return self.W_o(attn_out)
```
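The head split/merge in `forward` is a pure reshape; a round trip recovers the input exactly, which is a useful invariant to test in isolation:

```python
import torch

B, T, d_model, num_heads = 2, 5, 16, 4
head_dim = d_model // num_heads

x = torch.randn(B, T, d_model)
split = x.view(B, T, num_heads, head_dim).transpose(1, 2)       # (B, heads, T, head_dim)
merged = split.transpose(1, 2).contiguous().view(B, T, d_model)

assert torch.equal(merged, x)   # lossless round trip
```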

Cross-Entropy Loss (Problem 16 — Easy)

```python
def cross_entropy_loss(logits: torch.Tensor, targets: torch.Tensor) -> torch.Tensor:
    # logits: (B, C), targets: (B,) with class indices
    # Use logsumexp trick for numerical stability
    log_sum_exp = torch.logsumexp(logits, dim=-1)  # (B,)
    log_probs = logits[torch.arange(len(targets)), targets]  # (B,)
    return (log_sum_exp - log_probs).mean()
```
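This should agree with `F.cross_entropy` (default mean reduction):

```python
import torch
import torch.nn.functional as F

logits = torch.randn(4, 10)
targets = torch.randint(0, 10, (4,))

# Same logsumexp formulation as above, inlined
manual = (torch.logsumexp(logits, dim=-1)
          - logits[torch.arange(len(targets)), targets]).mean()

assert torch.allclose(manual, F.cross_entropy(logits, targets), atol=1e-6)
```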

Dropout (Problem 17 — Easy)

```python
class MyDropout(nn.Module):
    def __init__(self, p: float = 0.5):
        super().__init__()
        self.p = p

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        if not self.training or self.p == 0:
            return x
        mask = torch.bernoulli(torch.ones_like(x) * (1 - self.p))
        return x * mask / (1 - self.p)  # inverted scaling
```

Kaiming Init (Problem 20 — Easy)

```python
import math

def kaiming_init(weight: torch.Tensor) -> torch.Tensor:
    fan_in = weight.size(1)
    std = math.sqrt(2.0 / fan_in)
    with torch.no_grad():
        weight.normal_(0, std)
    return weight
```
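With a large fan-in, the sample standard deviation should land very close to `sqrt(2/fan_in)`; this is also what `nn.init.kaiming_normal_` produces in its default `fan_in` mode:

```python
import math
import torch

fan_out, fan_in = 1000, 1000
w = torch.empty(fan_out, fan_in)
with torch.no_grad():
    w.normal_(0, math.sqrt(2.0 / fan_in))

expected = math.sqrt(2.0 / fan_in)            # ≈ 0.0447
assert abs(w.std().item() - expected) < 1e-3  # 10^6 samples: very tight
```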

Gradient Clipping (Problem 21 — Easy)

```python
def clip_grad_norm(parameters, max_norm: float) -> float:
    params = [p for p in parameters if p.grad is not None]
    total_norm = torch.sqrt(sum(p.grad.data.norm() ** 2 for p in params))
    clip_coef = max_norm / (total_norm + 1e-6)
    if clip_coef < 1:
        for p in params:
            p.grad.data.mul_(clip_coef)
    return total_norm.item()
```
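The built-in equivalent is `torch.nn.utils.clip_grad_norm_`; after clipping, the total gradient norm should not exceed `max_norm` (it returns the pre-clip norm):

```python
import torch

p = torch.nn.Parameter(torch.randn(100))
(10 * p).pow(2).sum().backward()   # inflate the gradient norm well past 1

total = torch.nn.utils.clip_grad_norm_([p], max_norm=1.0)
assert p.grad.norm() <= 1.0 + 1e-4   # clipped in place
assert float(total) > 1.0            # returned value is the pre-clip norm
```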

Gradient Accumulation (Problem 31 — Easy)

```python
def train_with_accumulation(model, optimizer, criterion, dataloader, accumulation_steps=4):
    optimizer.zero_grad()
    for i, (inputs, targets) in enumerate(dataloader):
        outputs = model(inputs)
        loss = criterion(outputs, targets) / accumulation_steps  # scale loss
        loss.backward()

        if (i + 1) % accumulation_steps == 0:
            optimizer.step()
            optimizer.zero_grad()
```
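Why the `/ accumulation_steps` scaling: summing the gradients of scaled micro-batch losses reproduces the full-batch gradient exactly, for a mean-reduced loss and equal micro-batch sizes. A quick check:

```python
import torch

torch.manual_seed(0)
w = torch.nn.Parameter(torch.randn(3))
x = torch.randn(8, 3)

# Full-batch gradient of a mean-reduced loss
(x @ w).pow(2).mean().backward()
full = w.grad.clone()

# Four micro-batches of 2, each loss scaled by 1/4, gradients accumulated
w.grad = None
for chunk in x.split(2):
    ((chunk @ w).pow(2).mean() / 4).backward()

assert torch.allclose(w.grad, full, atol=1e-6)
```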

Common Patterns & Tips

Numerical stability pattern

Always subtract the max before `exp()`:

```python
# WRONG — can overflow for large values
exp_x = torch.exp(x)

# CORRECT — numerically stable
exp_x = torch.exp(x - x.max(dim=-1, keepdim=True).values)
```

Causal attention mask (for GPT-style models)

```python
def causal_mask(T: int, device) -> torch.Tensor:
    return torch.tril(torch.ones(T, T, device=device)).unsqueeze(0).unsqueeze(0)
```
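The result is a lower-triangular mask with singleton batch/head dims that broadcast over `(B, heads, T, T)` scores; the zeros mark future positions, which `masked_fill` then sets to `-inf`:

```python
import torch

m = torch.tril(torch.ones(4, 4)).unsqueeze(0).unsqueeze(0)

assert m.shape == (1, 1, 4, 4)    # broadcastable over (B, heads, T, T)
assert m[0, 0, 3, 0] == 1         # past position: attended
assert m[0, 0, 0, 3] == 0         # future position: masked out
```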

nn.Module skeleton (used in many problems)

```python
class MyLayer(nn.Module):
    def __init__(self, ...):
        super().__init__()
        self.weight = nn.Parameter(torch.empty(...))
        self.bias = nn.Parameter(torch.zeros(...))
        self._init_weights()

    def _init_weights(self):
        nn.init.kaiming_uniform_(self.weight)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        ...
```
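Filled in for a concrete, made-up elementwise scale-and-shift layer; `nn.Parameter` is what makes the tensors appear in `parameters()` for the optimizer:

```python
import torch
import torch.nn as nn

class Scale(nn.Module):
    def __init__(self, dim: int):
        super().__init__()
        self.weight = nn.Parameter(torch.empty(dim))
        self.bias = nn.Parameter(torch.zeros(dim))
        self._init_weights()

    def _init_weights(self):
        # kaiming_uniform_ needs a >=2-D tensor; a 1-D layer just uses ones
        nn.init.ones_(self.weight)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x * self.weight + self.bias

layer = Scale(4)
out = layer(torch.randn(2, 4))
assert out.shape == (2, 4)
assert len(list(layer.parameters())) == 2   # weight and bias are registered
```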

Train vs eval mode pattern

```python
def forward(self, x):
    if self.training:
        # use batch statistics
        mean = x.mean(dim=0)
        var = x.var(dim=0, unbiased=False)
        # update running stats
        self.running_mean = (1 - self.momentum) * self.running_mean + self.momentum * mean
        self.running_var = (1 - self.momentum) * self.running_var + self.momentum * var
    else:
        # use running statistics
        mean = self.running_mean
        var = self.running_var
    return (x - mean) / torch.sqrt(var + self.eps) * self.weight + self.bias
```
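The built-in `nn.BatchNorm1d` exhibits the same behavioral split: for the same input, its output differs between `train()` and `eval()` because different statistics are used:

```python
import torch

torch.manual_seed(0)
bn = torch.nn.BatchNorm1d(4)
x = torch.randn(8, 4)

bn.train()
y_train = bn(x)   # normalized with batch statistics (also updates running stats)

bn.eval()
y_eval = bn(x)    # normalized with the running statistics instead

assert not torch.allclose(y_train, y_eval)
```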

Project Structure

```
TorchCode/
├── templates/          # Blank notebooks for each problem (your workspace)
│   ├── 01_relu.ipynb
│   ├── 02_softmax.ipynb
│   └── ...
├── solutions/          # Reference solutions (study after attempting)
│   └── ...
├── torch_judge/        # Auto-grading package
│   ├── __init__.py     # check(), status(), hint(), reset_progress()
│   └── tasks/          # Per-problem test cases
├── Dockerfile
├── Makefile
└── pyproject.toml      # torch-judge package definition
```

Troubleshooting

Docker image not available for Apple Silicon (arm64)

`make run` auto-falls back to local build, or force it:

```bash
make build
make start
```

`check()` not found in Colab

```bash
!pip install torch-judge
```

then restart the runtime.

Notebook reset to blank template

Use the toolbar "Reset" button in JupyterLab to reset any notebook to its original blank state — useful for re-practicing a problem.

Gradient check fails but output is correct

Ensure your implementation uses PyTorch operations (not NumPy) so autograd works:

```python
# WRONG — breaks autograd
import numpy as np
result = np.exp(x.numpy())

# CORRECT — autograd compatible
result = torch.exp(x)
```

Viewing reference solution

After attempting a problem, open the matching file in `solutions/`: `solutions/02_softmax.ipynb`

Key Concepts Tested

| Concept | Problems |
|---------|----------|
| Numerical stability | Softmax, Cross-Entropy, LogSumExp |
| Autograd / `nn.Parameter` | Linear, LayerNorm, all nn.Module problems |
| Train vs eval behavior | BatchNorm, Dropout |
| Broadcasting | LayerNorm, RMSNorm, attention masking |
| Shape manipulation | Multi-Head Attention (view, transpose, contiguous) |
| Weight initialization | Kaiming Init, Linear Layer |
| Memory-efficient training | Gradient Accumulation, Gradient Clipping |