TensorBoard: Visualization Toolkit for ML
When to Use This Skill
Use TensorBoard when you need to:
- Visualize training metrics like loss and accuracy over time
- Debug models with histograms and distributions
- Compare experiments across multiple runs
- Visualize model graphs and architecture
- Project embeddings to lower dimensions (t-SNE, PCA)
- Track hyperparameter experiments
- Profile performance and identify bottlenecks
- Visualize images and text during training
Users: 20M+ downloads/year | GitHub Stars: 27k+ | License: Apache 2.0
Install TensorBoard
PyTorch integration
```bash
pip install torch torchvision tensorboard
```
TensorFlow integration (TensorBoard included)
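TensorBoard ships as a dependency of TensorFlow, so a plain install is enough:

```bash
pip install tensorflow
```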
Launch TensorBoard
```bash
tensorboard --logdir=runs
```
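TensorBoard serves on http://localhost:6006 by default; the --port and --bind_all flags (the latter exposes the server beyond localhost, e.g. on a remote training box) adjust that:

```bash
tensorboard --logdir=runs --port 6007 --bind_all
```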
PyTorch

```python
from torch.utils.tensorboard import SummaryWriter

writer = SummaryWriter('runs/experiment_1')

for epoch in range(10):
    train_loss = train_epoch()
    val_acc = validate()

    # Log metrics
    writer.add_scalar('Loss/train', train_loss, epoch)
    writer.add_scalar('Accuracy/val', val_acc, epoch)

writer.close()
```
Launch: tensorboard --logdir=runs
TensorFlow/Keras
```python
import tensorflow as tf

tensorboard_callback = tf.keras.callbacks.TensorBoard(
    log_dir='logs/fit',
    histogram_freq=1
)

model.fit(
    x_train, y_train,
    epochs=10,
    validation_data=(x_val, y_val),
    callbacks=[tensorboard_callback]
)
```
Launch: tensorboard --logdir=logs
1. SummaryWriter (PyTorch)
```python
from torch.utils.tensorboard import SummaryWriter

# Explicit log directory (default is runs/CURRENT_DATETIME_HOSTNAME)
writer = SummaryWriter('runs/experiment_1')

# Or: a custom comment appended to the default directory name
writer = SummaryWriter(comment='baseline')

writer.add_scalar('Loss/train', 0.5, global_step=0)
writer.add_scalar('Loss/train', 0.3, global_step=1)

writer.flush()
writer.close()
```
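One knob worth knowing when runs crash early: flush_secs controls how often buffered events are written to disk (the default is every 120 seconds):

```python
# Assumed tweak: flush every 30 s instead of the default 120 s
writer = SummaryWriter('runs/experiment_1', flush_secs=30)
```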
2. Logging Scalars
```python
from torch.utils.tensorboard import SummaryWriter

writer = SummaryWriter()

for epoch in range(100):
    # train() and validate() are assumed to return (loss, accuracy)
    train_loss, train_acc = train()
    val_loss, val_acc = validate()

    # Log individual metrics
    writer.add_scalar('Loss/train', train_loss, epoch)
    writer.add_scalar('Loss/val', val_loss, epoch)
    writer.add_scalar('Accuracy/train', train_acc, epoch)
    writer.add_scalar('Accuracy/val', val_acc, epoch)

    # Learning rate
    lr = optimizer.param_groups[0]['lr']
    writer.add_scalar('Learning_rate', lr, epoch)

writer.close()
```
```python
import tensorflow as tf

train_summary_writer = tf.summary.create_file_writer('logs/train')
val_summary_writer = tf.summary.create_file_writer('logs/val')

for epoch in range(100):
    with train_summary_writer.as_default():
        tf.summary.scalar('loss', train_loss, step=epoch)
        tf.summary.scalar('accuracy', train_acc, step=epoch)

    with val_summary_writer.as_default():
        tf.summary.scalar('loss', val_loss, step=epoch)
        tf.summary.scalar('accuracy', val_acc, step=epoch)
```
3. Logging Multiple Scalars
PyTorch: Group related metrics
```python
writer.add_scalars('Loss', {
    'train': train_loss,
    'validation': val_loss,
    'test': test_loss
}, epoch)

writer.add_scalars('Metrics', {
    'accuracy': accuracy,
    'precision': precision,
    'recall': recall,
    'f1': f1_score
}, epoch)
```

Note that add_scalars writes each key with its own writer in a subdirectory of the log dir, so each call produces several event files.
4. Logging Images

```python
import torch
from torchvision.utils import make_grid

# Single image (CHW tensor)
writer.add_image('Input/sample', img_tensor, epoch)

# Multiple images as a grid
img_grid = make_grid(images[:64], nrow=8)
writer.add_image('Batch/inputs', img_grid, epoch)

# Predictions visualization
pred_grid = make_grid(predictions[:16], nrow=4)
writer.add_image('Predictions', pred_grid, epoch)
```
```python
import tensorflow as tf

with file_writer.as_default():
    # Images are encoded as PNG; max_outputs caps how many are logged
    tf.summary.image('Training samples', images, step=epoch, max_outputs=25)
```
5. Logging Histograms
PyTorch: Track weight distributions
```python
for name, param in model.named_parameters():
    writer.add_histogram(name, param, epoch)

    # Track gradients
    if param.grad is not None:
        writer.add_histogram(f'{name}.grad', param.grad, epoch)

# Activations (captured elsewhere, e.g. with a forward hook)
writer.add_histogram('Activations/relu1', activations, epoch)
```
```python
with file_writer.as_default():
    tf.summary.histogram('weights/layer1', layer1.kernel, step=epoch)
    tf.summary.histogram('activations/relu1', activations, step=epoch)
```
6. Logging Model Graph
```python
import torch

model = MyModel()
dummy_input = torch.randn(1, 3, 224, 224)

writer.add_graph(model, dummy_input)
writer.close()
```
TensorFlow (automatic with Keras)
```python
tensorboard_callback = tf.keras.callbacks.TensorBoard(
    log_dir='logs',
    write_graph=True
)
model.fit(x, y, callbacks=[tensorboard_callback])
```
Embedding Projector

Visualize high-dimensional data (embeddings, features) in 2D/3D.

```python
import torch
from torch.utils.tensorboard import SummaryWriter

# Get embeddings (e.g., word embeddings, image features)
embeddings = model.get_embeddings(data)  # Shape: (N, embedding_dim)

# Metadata (a label for each point)
metadata = ['class_1', 'class_2', 'class_1', ...]

# Images (optional, for image embeddings)
label_images = torch.stack([img1, img2, img3, ...])

# Log to TensorBoard
writer.add_embedding(
    embeddings,
    metadata=metadata,
    label_img=label_images,
    global_step=epoch
)
```

**In TensorBoard:**
- Navigate to the "Projector" tab
- Choose PCA, t-SNE, or UMAP visualization
- Search, filter, and explore clusters
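A self-contained sketch with random vectors and made-up class labels (no real model assumed), enough for the Projector tab to render:

```python
import torch
from torch.utils.tensorboard import SummaryWriter

writer = SummaryWriter('runs/embedding_demo')

# 100 random 64-dimensional points cycled through 5 fake classes
embeddings = torch.randn(100, 64)
metadata = [f'class_{i % 5}' for i in range(100)]

writer.add_embedding(embeddings, metadata=metadata, global_step=0)
writer.close()
```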
Hyperparameter Tuning
```python
from torch.utils.tensorboard import SummaryWriter

# Try different hyperparameters
for lr in [0.001, 0.01, 0.1]:
    for batch_size in [16, 32, 64]:
        # Create a unique run directory per configuration
        writer = SummaryWriter(f'runs/lr{lr}_bs{batch_size}')

        # Train and log per-epoch metrics
        for epoch in range(10):
            loss = train(lr, batch_size)
            writer.add_scalar('Loss/train', loss, epoch)

        # Log hyperparameters together with the final metrics
        # (after training, once final_acc and final_loss exist)
        writer.add_hparams(
            {'lr': lr, 'batch_size': batch_size},
            {'hparam/accuracy': final_acc, 'hparam/loss': final_loss}
        )

        writer.close()
```
Compare in TensorBoard's "HParams" tab
Logging Text

PyTorch: Log text (e.g., model predictions, summaries)
```python
writer.add_text('Predictions', f'Epoch {epoch}: {predictions}', epoch)
writer.add_text('Config', str(config), 0)
```
Log markdown tables
markdown_table = """
| Metric | Value |
|---|
| Accuracy | 0.95 |
| F1 Score | 0.93 |
| """ | |
| writer.add_text('Results', markdown_table, epoch) | |
markdown_table = """
| Metric | Value |
|---|
| Accuracy | 0.95 |
| F1 Score | 0.93 |
| """ | |
| writer.add_text('Results', markdown_table, epoch) | |
PR Curves

Precision-Recall curves for classification.

```python
from torch.utils.tensorboard import SummaryWriter

# Get predictions and labels; predictions should be probabilities in [0, 1]
predictions = model(test_data)  # Shape: (N, num_classes)
labels = test_labels            # Shape: (N,)

# Log a PR curve for each class
for i in range(num_classes):
    writer.add_pr_curve(
        f'PR_curve/class_{i}',
        labels == i,
        predictions[:, i],
        global_step=epoch
    )
```
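A minimal runnable sketch with random data, purely to exercise the API (shapes and values are fabricated):

```python
import torch
from torch.utils.tensorboard import SummaryWriter

writer = SummaryWriter('runs/pr_demo')

labels = torch.randint(0, 2, (100,))  # binary ground truth
probs = torch.rand(100)               # predicted probability of the positive class

writer.add_pr_curve('PR_curve/demo', labels, probs, global_step=0)
writer.close()
```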
PyTorch Training Loop
```python
import torch
import torch.nn as nn
from torch.utils.tensorboard import SummaryWriter
from torchvision.utils import make_grid

writer = SummaryWriter('runs/resnet_experiment')
model = ResNet50()
optimizer = torch.optim.Adam(model.parameters(), lr=0.001)
criterion = nn.CrossEntropyLoss()

# Log the model graph once with a dummy input
dummy_input = torch.randn(1, 3, 224, 224)
writer.add_graph(model, dummy_input)

for epoch in range(50):
    model.train()
    train_loss = 0.0
    train_correct = 0

    for batch_idx, (data, target) in enumerate(train_loader):
        optimizer.zero_grad()
        output = model(data)
        loss = criterion(output, target)
        loss.backward()
        optimizer.step()

        train_loss += loss.item()
        pred = output.argmax(dim=1)
        train_correct += pred.eq(target).sum().item()

        # Log batch metrics (every 100 batches)
        if batch_idx % 100 == 0:
            global_step = epoch * len(train_loader) + batch_idx
            writer.add_scalar('Loss/train_batch', loss.item(), global_step)

    # Epoch metrics
    train_loss /= len(train_loader)
    train_acc = train_correct / len(train_loader.dataset)

    # Validation
    model.eval()
    val_loss = 0.0
    val_correct = 0
    with torch.no_grad():
        for data, target in val_loader:
            output = model(data)
            val_loss += criterion(output, target).item()
            pred = output.argmax(dim=1)
            val_correct += pred.eq(target).sum().item()

    val_loss /= len(val_loader)
    val_acc = val_correct / len(val_loader.dataset)

    # Log epoch metrics
    writer.add_scalars('Loss', {'train': train_loss, 'val': val_loss}, epoch)
    writer.add_scalars('Accuracy', {'train': train_acc, 'val': val_acc}, epoch)

    # Log learning rate
    writer.add_scalar('Learning_rate', optimizer.param_groups[0]['lr'], epoch)

    # Log histograms (every 5 epochs)
    if epoch % 5 == 0:
        for name, param in model.named_parameters():
            writer.add_histogram(name, param, epoch)

    # Log sample inputs (every 10 epochs)
    if epoch % 10 == 0:
        sample_images = data[:8]
        writer.add_image('Sample_inputs', make_grid(sample_images), epoch)

writer.close()
```
TensorFlow/Keras Training
```python
import tensorflow as tf

model = tf.keras.models.Sequential([
    tf.keras.layers.Conv2D(32, 3, activation='relu', input_shape=(28, 28, 1)),
    tf.keras.layers.MaxPooling2D(),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(128, activation='relu'),
    tf.keras.layers.Dense(10, activation='softmax')
])

model.compile(
    optimizer='adam',
    loss='sparse_categorical_crossentropy',
    metrics=['accuracy']
)
```
TensorBoard callback
```python
tensorboard_callback = tf.keras.callbacks.TensorBoard(
    log_dir='logs/fit',
    histogram_freq=1,          # Log histograms every epoch
    write_graph=True,          # Visualize the model graph
    write_images=True,         # Visualize weights as images
    update_freq='epoch',       # Log metrics every epoch
    profile_batch='500,520',   # Profile batches 500-520
    embeddings_freq=1          # Log embeddings every epoch
)

model.fit(
    x_train, y_train,
    epochs=10,
    validation_data=(x_val, y_val),
    callbacks=[tensorboard_callback]
)
```
Comparing Experiments
Run experiments with different configs
```bash
python train.py --lr 0.001 --logdir runs/exp1
python train.py --lr 0.01 --logdir runs/exp2
python train.py --lr 0.1 --logdir runs/exp3
```
View all runs together
```bash
tensorboard --logdir=runs
```

**In TensorBoard:**
- All runs appear in the same dashboard
- Toggle runs on and off for comparison
- Use regex to filter run names
- Overlay charts to compare metrics
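For reference, a minimal train.py sketch that wires up those flags; the argparse names and the train_epoch placeholder are assumptions chosen to match the commands above:

```python
import argparse
from torch.utils.tensorboard import SummaryWriter

parser = argparse.ArgumentParser()
parser.add_argument('--lr', type=float, default=0.001)
parser.add_argument('--logdir', type=str, default='runs/exp1')
args = parser.parse_args()

writer = SummaryWriter(args.logdir)
for epoch in range(10):
    loss = train_epoch(lr=args.lr)  # hypothetical training step
    writer.add_scalar('Loss/train', loss, epoch)
writer.close()
```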
Organizing Experiments
Hierarchical organization
```
runs/
├── baseline/
│   ├── run_1/
│   └── run_2/
├── improved/
│   ├── run_1/
│   └── run_2/
└── final/
    └── run_1/
```

```python
writer = SummaryWriter('runs/baseline/run_1')
```
1. Use Descriptive Run Names
✅ Good: Descriptive names
```python
from datetime import datetime

timestamp = datetime.now().strftime('%Y%m%d_%H%M%S')
writer = SummaryWriter(f'runs/resnet50_lr0.001_bs32_{timestamp}')
```
❌ Bad: Auto-generated names
```python
writer = SummaryWriter()  # Creates runs/Jan01_12-34-56_hostname
```
2. Group Related Metrics
✅ Good: Grouped metrics
```python
writer.add_scalar('Loss/train', train_loss, step)
writer.add_scalar('Loss/val', val_loss, step)
writer.add_scalar('Accuracy/train', train_acc, step)
writer.add_scalar('Accuracy/val', val_acc, step)
```
❌ Bad: Flat namespace
```python
writer.add_scalar('train_loss', train_loss, step)
writer.add_scalar('val_loss', val_loss, step)
```
3. Log Regularly but Not Too Often
✅ Good: Log epoch metrics always, batch metrics occasionally
```python
for epoch in range(100):
    for batch_idx, (data, target) in enumerate(train_loader):
        loss = train_step(data, target)

        # Log every 100 batches
        if batch_idx % 100 == 0:
            writer.add_scalar('Loss/batch', loss, global_step)

    # Always log epoch metrics
    writer.add_scalar('Loss/epoch', epoch_loss, epoch)
```
❌ Bad: Log every batch (creates huge log files)
```python
for batch in train_loader:
    writer.add_scalar('Loss', loss, step)  # Too frequent
```
4. Close Writer When Done
✅ Good: Use context manager
```python
with SummaryWriter('runs/exp1') as writer:
    for epoch in range(10):
        writer.add_scalar('Loss', loss, epoch)
# Automatically closed on exit
```

Or, when creating the writer manually, close it explicitly:

```python
writer = SummaryWriter('runs/exp1')
# ... logging ...
writer.close()
```
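If the logging code can raise, a try/finally guarantees the manual close; a small defensive variant of the same advice:

```python
writer = SummaryWriter('runs/exp1')
try:
    for epoch in range(10):
        writer.add_scalar('Loss', loss, epoch)
finally:
    writer.close()  # runs even if training fails mid-loop
```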
5. Use Separate Writers for Train/Val
✅ Good: Separate log directories
```python
train_writer = SummaryWriter('runs/exp1/train')
val_writer = SummaryWriter('runs/exp1/val')

train_writer.add_scalar('loss', train_loss, epoch)
val_writer.add_scalar('loss', val_loss, epoch)
```
Performance Profiling
TensorFlow Profiler
```python
tensorboard_callback = tf.keras.callbacks.TensorBoard(
    log_dir='logs',
    profile_batch='10,20'  # Profile batches 10-20
)
model.fit(x, y, callbacks=[tensorboard_callback])
```
View in TensorBoard Profile tab
Shows: GPU utilization, kernel stats, memory usage, bottlenecks
PyTorch Profiler
```python
import torch.profiler as profiler

with profiler.profile(
    activities=[
        profiler.ProfilerActivity.CPU,
        profiler.ProfilerActivity.CUDA
    ],
    # A schedule keeps traces small: skip 1 step, warm up 1, record 3
    schedule=profiler.schedule(wait=1, warmup=1, active=3),
    on_trace_ready=profiler.tensorboard_trace_handler('./runs/profiler'),
    record_shapes=True,
    with_stack=True
) as prof:
    for batch in train_loader:
        loss = train_step(batch)
        prof.step()  # advance the profiler's step counter
```
View in TensorBoard Profile tab
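Note: viewing PyTorch traces in the Profile tab typically requires the separate plugin package:

```bash
pip install torch-tb-profiler
```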
references/visualization.md
- Comprehensive visualization guide
- Performance profiling patterns

references/integrations.md
- Framework-specific integration examples