ml-inference-optimization
ML Inference Optimization
When to Use This Skill
Use this skill when:
- Optimizing ML inference latency
- Reducing model size for deployment
- Implementing model compression techniques
- Designing inference caching strategies
- Deploying models at the edge
- Balancing accuracy vs. latency trade-offs
Keywords: inference optimization, latency, model compression, distillation, pruning, quantization, caching, edge ML, TensorRT, ONNX, model serving, batching, hardware acceleration
Inference Optimization Overview
┌─────────────────────────────────────────────────────────────────────┐
│ Inference Optimization Stack │
├─────────────────────────────────────────────────────────────────────┤
│ │
│ ┌──────────────────────────────────────────────────────────────┐ │
│ │ Model Level │ │
│ │ Distillation │ Pruning │ Quantization │ Architecture Search │ │
│ └──────────────────────────────────────────────────────────────┘ │
│ │ │
│ ▼ │
│ ┌──────────────────────────────────────────────────────────────┐ │
│ │ Compiler Level │ │
│ │ Graph optimization │ Operator fusion │ Memory planning │ │
│ └──────────────────────────────────────────────────────────────┘ │
│ │ │
│ ▼ │
│ ┌──────────────────────────────────────────────────────────────┐ │
│ │ Runtime Level │ │
│ │ Batching │ Caching │ Async execution │ Multi-threading │ │
│ └──────────────────────────────────────────────────────────────┘ │
│ │ │
│ ▼ │
│ ┌──────────────────────────────────────────────────────────────┐ │
│ │ Hardware Level │ │
│ │ GPU │ TPU │ NPU │ CPU SIMD │ Custom accelerators │ │
│ └──────────────────────────────────────────────────────────────┘ │
│ │
└─────────────────────────────────────────────────────────────────────┘
Model Compression Techniques
Technique Overview
| Technique | Size Reduction | Speed Improvement | Accuracy Impact |
|---|---|---|---|
| Quantization | 2-4x | 2-4x | Low (1-2%) |
| Pruning | 2-10x | 1-3x | Low-Medium |
| Distillation | 3-10x | 3-10x | Medium |
| Low-rank factorization | 2-5x | 1.5-3x | Low-Medium |
| Weight sharing | 10-100x | Variable | Medium-High |
Knowledge Distillation
┌─────────────────────────────────────────────────────────────────────┐
│ Knowledge Distillation │
├─────────────────────────────────────────────────────────────────────┤
│ │
│ ┌──────────────┐ │
│ │ Teacher Model│ (Large, accurate, slow) │
│ │ GPT-4 │ │
│ └──────────────┘ │
│ │ │
│ ▼ Soft labels (probability distributions) │
│ ┌──────────────────────────────────────────────────────────────┐ │
│ │ Training Process │ │
│ │ Loss = α × CrossEntropy(student, hard_labels) │ │
│ │ + (1-α) × KL_Div(student, teacher_soft_labels) │ │
│ └──────────────────────────────────────────────────────────────┘ │
│ │ │
│ ▼ │
│ ┌──────────────┐ │
│ │Student Model │ (Small, nearly as accurate, fast) │
│ │ DistilBERT │ │
│ └──────────────┘ │
│ │
└─────────────────────────────────────────────────────────────────────┘
Distillation Types:
| Type | Description | Use Case |
|---|---|---|
| Response distillation | Match teacher outputs | General compression |
| Feature distillation | Match intermediate layers | Better transfer |
| Relation distillation | Match sample relationships | Structured data |
| Self-distillation | Model teaches itself | Regularization |
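The combined loss above can be sketched in plain Python (the `distillation_loss` helper is illustrative; real training uses framework tensors, but the arithmetic is the same):

```python
import math

def softmax(logits, temperature=1.0):
    """Convert logits to a probability distribution at a given temperature."""
    exps = [math.exp(l / temperature) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(student_logits, teacher_logits, hard_label,
                      alpha=0.5, temperature=2.0):
    """Loss = alpha * CE(student, hard label) + (1 - alpha) * KL(teacher || student).

    Soft targets use temperature T; the KL term is scaled by T^2 so its
    gradient magnitude stays comparable to the hard-label term.
    """
    student_probs = softmax(student_logits)
    ce = -math.log(student_probs[hard_label])

    teacher_soft = softmax(teacher_logits, temperature)
    student_soft = softmax(student_logits, temperature)
    kl = sum(t * math.log(t / s) for t, s in zip(teacher_soft, student_soft))

    return alpha * ce + (1 - alpha) * (temperature ** 2) * kl

loss = distillation_loss([2.0, 0.5, 0.1], [3.0, 1.0, 0.2], hard_label=0)
```

A student that matches both the teacher's soft distribution and the hard label drives both terms toward zero.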
Pruning Strategies
Unstructured Pruning (Weight-level):
Before: [0.1, 0.8, 0.2, 0.9, 0.05, 0.7]
After: [0.0, 0.8, 0.0, 0.9, 0.0, 0.7] (50% sparse)
• Flexible, high sparsity possible
• Needs sparse hardware/libraries
Structured Pruning (Channel/Layer-level):
Before: ┌───┬───┬───┬───┐
│ C1│ C2│ C3│ C4│
└───┴───┴───┴───┘
After: ┌───┬───┬───┐
│ C1│ C3│ C4│ (Removed C2 entirely)
└───┴───┴───┘
• Works with standard hardware
• Lower compression ratio
Pruning Decision Criteria:
| Method | Description | Effectiveness |
|---|---|---|
| Magnitude-based | Remove smallest weights | Simple, effective |
| Gradient-based | Remove low-gradient weights | Better accuracy |
| Second-order | Use Hessian information | Best but expensive |
| Lottery ticket | Find winning subnetwork | Theoretical insight |
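Magnitude-based unstructured pruning from the example above, as a minimal sketch (plain lists stand in for weight tensors):

```python
def magnitude_prune(weights, sparsity):
    """Zero out the smallest-magnitude fraction of weights.

    The weight layout stays intact (unstructured pruning), so the result
    only speeds things up on sparse-aware hardware or libraries.
    """
    n_prune = int(len(weights) * sparsity)
    # Indices of the n_prune smallest-magnitude weights.
    order = sorted(range(len(weights)), key=lambda i: abs(weights[i]))
    to_zero = set(order[:n_prune])
    return [0.0 if i in to_zero else w for i, w in enumerate(weights)]

pruned = magnitude_prune([0.1, 0.8, 0.2, 0.9, 0.05, 0.7], sparsity=0.5)
# Reproduces the Before/After example: [0.0, 0.8, 0.0, 0.9, 0.0, 0.7]
```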
Quantization (Detailed)
Precision Hierarchy:
FP32 (32 bits): ████████████████████████████████
FP16 (16 bits): ████████████████
BF16 (16 bits): ████████████████ (different mantissa/exponent)
INT8 (8 bits): ████████
INT4 (4 bits): ████
Binary (1 bit): █
Memory and Compute Scale Proportionally
Quantization Approaches:
| Approach | When Applied | Quality | Effort |
|---|---|---|---|
| Dynamic quantization | Runtime | Good | Low |
| Static quantization | Post-training with calibration | Better | Medium |
| QAT | During training | Best | High |
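At its core, INT8 quantization picks a scale and zero point from observed value ranges, then rounds. A minimal affine-quantization sketch (in static quantization the min/max would come from a calibration set rather than the tensor itself):

```python
def quantize_int8(values):
    """Affine (asymmetric) quantization of floats to INT8 codes in [-128, 127]."""
    lo, hi = min(values), max(values)
    scale = (hi - lo) / 255.0 or 1.0  # avoid zero scale for constant tensors
    zero_point = round(-128 - lo / scale)
    q = [max(-128, min(127, round(v / scale + zero_point))) for v in values]
    return q, scale, zero_point

def dequantize(q, scale, zero_point):
    """Map INT8 codes back to approximate float values."""
    return [(qi - zero_point) * scale for qi in q]

vals = [-1.0, -0.25, 0.0, 0.5, 1.5]
q, scale, zp = quantize_int8(vals)
restored = dequantize(q, scale, zp)
# Round-trip error per element is bounded by scale/2.
```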
Compiler-Level Optimization
Graph Optimization
Original Graph:
Input → Conv → BatchNorm → ReLU → Conv → BatchNorm → ReLU → Output
Optimized Graph (Operator Fusion):
Input → FusedConvBNReLU → FusedConvBNReLU → Output
Benefits:
• Fewer kernel launches
• Better memory locality
• Reduced memory bandwidth
Common Optimizations
| Optimization | Description | Speedup |
|---|---|---|
| Operator fusion | Combine sequential ops | 1.2-2x |
| Constant folding | Pre-compute constants | 1.1-1.5x |
| Dead code elimination | Remove unused ops | Variable |
| Layout optimization | Optimize tensor memory layout | 1.1-1.3x |
| Memory planning | Optimize buffer allocation | 1.1-1.2x |
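Operator fusion is essentially pattern matching over the graph: scan for known op sequences and replace each with a single fused kernel. A toy sketch over a linear op list (real compilers work on a DAG, and the pattern table here is illustrative):

```python
# Fusion patterns: a sequence of op types collapses into one fused op.
FUSION_PATTERNS = {
    ("Conv", "BatchNorm", "ReLU"): "FusedConvBNReLU",
    ("MatMul", "Add"): "FusedGemm",
}

def fuse_operators(ops):
    """Greedily replace known op sequences with their fused equivalents."""
    result, i = [], 0
    while i < len(ops):
        for pattern, fused in FUSION_PATTERNS.items():
            if tuple(ops[i:i + len(pattern)]) == pattern:
                result.append(fused)
                i += len(pattern)
                break
        else:
            result.append(ops[i])
            i += 1
    return result

graph = ["Input", "Conv", "BatchNorm", "ReLU", "Conv", "BatchNorm", "ReLU"]
fused = fuse_operators(graph)
# Matches the example graph above: ["Input", "FusedConvBNReLU", "FusedConvBNReLU"]
```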
Optimization Frameworks
| Framework | Vendor | Best For |
|---|---|---|
| TensorRT | NVIDIA | NVIDIA GPUs, lowest latency |
| ONNX Runtime | Microsoft | Cross-platform, broad support |
| OpenVINO | Intel | Intel CPUs/GPUs |
| Core ML | Apple | Apple devices |
| TFLite | Google | Mobile, embedded |
| Apache TVM | Open source | Custom hardware, research |
Runtime Optimization
Batching Strategies
No Batching:
Request 1: [Process] → Response 1 10ms
Request 2: [Process] → Response 2 10ms
Request 3: [Process] → Response 3 10ms
Total: 30ms, GPU underutilized
Dynamic Batching:
Requests 1-3: [Wait 5ms] → [Process batch] → Responses
Total: 15ms, 2x throughput
Trade-off: Latency vs. Throughput
• Larger batch: Higher throughput, higher latency
• Smaller batch: Lower latency, lower throughput
Batching Parameters:
| Parameter | Trade-off |
|---|---|
| Maximum batch size | Throughput vs. latency |
| Wait time for batch fill | Latency vs. efficiency |
| Minimum batch before processing | Latency predictability |
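A dynamic batcher collects requests until the batch is full or the wait window expires, whichever comes first. A minimal single-threaded sketch (a real server would flush from a background thread):

```python
import time

class DynamicBatcher:
    """Flush when max_batch_size is reached or max_wait_s has elapsed."""

    def __init__(self, max_batch_size=8, max_wait_s=0.005):
        self.max_batch_size = max_batch_size
        self.max_wait_s = max_wait_s
        self.pending = []
        self.first_arrival = None

    def submit(self, request):
        """Queue a request; returns a full batch if the size limit is hit."""
        if not self.pending:
            self.first_arrival = time.monotonic()
        self.pending.append(request)
        if len(self.pending) >= self.max_batch_size:
            return self.flush()
        return None

    def poll(self):
        """Called periodically: flush a partial batch once the window expires."""
        if self.pending and time.monotonic() - self.first_arrival >= self.max_wait_s:
            return self.flush()
        return None

    def flush(self):
        batch, self.pending = self.pending, []
        return batch

batcher = DynamicBatcher(max_batch_size=3)
batcher.submit("r1")
batcher.submit("r2")
batch = batcher.submit("r3")  # size limit reached → flush
```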
Caching Strategies
┌─────────────────────────────────────────────────────────────────────┐
│ Inference Caching Layers │
├─────────────────────────────────────────────────────────────────────┤
│ │
│ Layer 1: Input Cache │
│ ┌─────────────────────────────────────────────────────────────┐ │
│ │ Cache exact inputs → Return cached outputs │ │
│ │ Hit rate: Low (inputs rarely repeat exactly) │ │
│ └─────────────────────────────────────────────────────────────┘ │
│ │
│ Layer 2: Embedding Cache │
│ ┌─────────────────────────────────────────────────────────────┐ │
│ │ Cache computed embeddings for repeated tokens/entities │ │
│ │ Hit rate: Medium (common tokens repeat) │ │
│ └─────────────────────────────────────────────────────────────┘ │
│ │
│ Layer 3: KV Cache (for transformers) │
│ ┌─────────────────────────────────────────────────────────────┐ │
│ │ Cache key-value pairs for attention │ │
│ │ Hit rate: High (reuse across tokens in sequence) │ │
│ └─────────────────────────────────────────────────────────────┘ │
│ │
│ Layer 4: Result Cache │
│ ┌─────────────────────────────────────────────────────────────┐ │
│ │ Cache semantic equivalents (fuzzy matching) │ │
│ │ Hit rate: Variable (depends on query distribution) │ │
│ └─────────────────────────────────────────────────────────────┘ │
│ │
└─────────────────────────────────────────────────────────────────────┘
Semantic Caching for LLMs:
Query: "What's the capital of France?"
↓
Hash + Embed query
↓
Search cache (similarity > threshold)
↓
├── Hit: Return cached response
└── Miss: Generate → Cache → Return
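The lookup flow above can be sketched with any embedding function plus a cosine-similarity threshold. Here a trivial bag-of-words embedding stands in for a real sentence encoder (an assumption for the sketch; production systems use learned embeddings and a vector index):

```python
import math
from collections import Counter

def embed(text):
    """Toy embedding: bag-of-words counts, a stand-in for a sentence encoder."""
    return Counter(text.lower().split())

def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

class SemanticCache:
    def __init__(self, threshold=0.8):
        self.threshold = threshold
        self.entries = []  # list of (embedding, cached response)

    def lookup(self, query):
        q = embed(query)
        best = max(self.entries, key=lambda e: cosine(q, e[0]), default=None)
        if best and cosine(q, best[0]) >= self.threshold:
            return best[1]  # hit: similar enough to a cached query
        return None         # miss: caller generates, then calls store()

    def store(self, query, response):
        self.entries.append((embed(query), response))

cache = SemanticCache(threshold=0.8)
cache.store("what's the capital of france?", "Paris")
hit = cache.lookup("what's the capital of france?")   # repeat query → hit
miss = cache.lookup("how tall is the eiffel tower?")  # unrelated → miss
```

The threshold trades hit rate against the risk of returning a cached answer to a semantically different question.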
Async and Parallel Execution
Sequential:
┌─────┐ ┌─────┐ ┌─────┐
│Prep │→│Model│→│Post │ Total: 30ms
│10ms │ │15ms │ │5ms │
└─────┘ └─────┘ └─────┘
Pipelined:
Request 1: │Prep│Model│Post│
Request 2: │Prep│Model│Post│
Request 3: │Prep│Model│Post│
Throughput: 3x higher
Latency per request: Same
Hardware Acceleration
Hardware Comparison
| Hardware | Strengths | Limitations | Best For |
|---|---|---|---|
| GPU (NVIDIA) | High parallelism, mature ecosystem | Power, cost | Training, large batch inference |
| TPU (Google) | Matrix ops, cloud integration | Vendor lock-in | Google Cloud workloads |
| NPU (Apple/Qualcomm) | Power efficient, on-device | Limited models | Mobile, edge |
| CPU | Flexible, available | Slower for ML | Low-batch, CPU-bound |
| FPGA | Customizable, low latency | Development complexity | Specialized workloads |
GPU Optimization
| Optimization | Description | Impact |
|---|---|---|
| Tensor Cores | Use FP16/INT8 tensor operations | 2-8x speedup |
| CUDA graphs | Reduce kernel launch overhead | 1.5-2x for small models |
| Multi-stream | Parallel execution | Higher throughput |
| Memory pooling | Reduce allocation overhead | Lower latency variance |
Edge Deployment
Edge Constraints
┌─────────────────────────────────────────────────────────────────────┐
│ Edge Deployment Constraints │
├─────────────────────────────────────────────────────────────────────┤
│ │
│ Resource Constraints: │
│ ├── Memory: 1-4 GB (vs. 64+ GB cloud) │
│ ├── Compute: 1-10 TOPS (vs. 100+ TFLOPS cloud) │
│ ├── Power: 5-15W (vs. 300W+ cloud) │
│ └── Storage: 16-128 GB (vs. TB cloud) │
│ │
│ Operational Constraints: │
│ ├── No network (offline operation) │
│ ├── Variable ambient conditions │
│ ├── Infrequent updates │
│ └── Long deployment lifetime │
│ │
└─────────────────────────────────────────────────────────────────────┘
Edge Optimization Strategies
| Strategy | Description | Use When |
|---|---|---|
| Model selection | Use edge-native models (MobileNet, EfficientNet) | Accuracy acceptable |
| Aggressive quantization | INT8 or lower | Memory/power constrained |
| On-device distillation | Distill to tiny model | Extreme constraints |
| Split inference | Edge preprocessing, cloud inference | Network available |
| Model caching | Cache results locally | Repeated queries |
Edge ML Frameworks
| Framework | Platform | Features |
|---|---|---|
| TensorFlow Lite | Android, iOS, embedded | Quantization, delegates |
| Core ML | iOS, macOS | Neural Engine optimization |
| ONNX Runtime Mobile | Cross-platform | Broad model support |
| PyTorch Mobile | Android, iOS | Familiar API |
| TensorRT | NVIDIA Jetson | Maximum performance |
Latency Profiling
Profiling Methodology
┌─────────────────────────────────────────────────────────────────────┐
│ Latency Breakdown Analysis │
├─────────────────────────────────────────────────────────────────────┤
│ │
│ 1. Data Loading: ████████░░░░░░░░░░ 15% │
│ 2. Preprocessing: ██████░░░░░░░░░░░░ 10% │
│ 3. Model Inference: ████████████████░░ 60% │
│ 4. Postprocessing: ████░░░░░░░░░░░░░░ 8% │
│ 5. Response Serialization:███░░░░░░░░░░░░░░░ 7% │
│ │
│ Target: Model inference (60% = biggest optimization opportunity) │
│ │
└─────────────────────────────────────────────────────────────────────┘
Profiling Tools
| Tool | Use For |
|---|---|
| PyTorch Profiler | PyTorch model profiling |
| TensorBoard | TensorFlow visualization |
| NVIDIA Nsight | GPU profiling |
| Chrome Tracing | General timeline visualization |
| perf | CPU profiling |
Key Metrics
| Metric | Description | Target |
|---|---|---|
| P50 latency | Median latency | < SLA |
| P99 latency | Tail latency | < 2x P50 |
| Throughput | Requests/second | Meet demand |
| GPU utilization | Compute usage | > 80% |
| Memory bandwidth | Memory usage | < limit |
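P50/P99 must be computed from raw latency samples, not averages, because tail requests vanish in a mean. A stdlib-only sketch using nearest-rank percentiles:

```python
import math

def percentile(samples, p):
    """Nearest-rank percentile: the smallest value covering p% of samples."""
    ordered = sorted(samples)
    rank = math.ceil(p / 100 * len(ordered))
    return ordered[max(rank - 1, 0)]

# 100 latency samples (ms): mostly fast, with a slow tail.
latencies = [10.0] * 95 + [40.0] * 4 + [120.0]
p50 = percentile(latencies, 50)  # 10.0 ms
p99 = percentile(latencies, 99)  # 40.0 ms
# Against the target above (P99 < 2x P50), this distribution fails:
# the tail is 4x the median, so tail latency needs attention.
```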
Optimization Workflow
Systematic Approach
┌─────────────────────────────────────────────────────────────────────┐
│ Optimization Workflow │
├─────────────────────────────────────────────────────────────────────┤
│ │
│ 1. Baseline │
│ └── Measure current performance (latency, throughput, accuracy) │
│ │
│ 2. Profile │
│ └── Identify bottlenecks (model, data, system) │
│ │
│ 3. Optimize (in order of effort/impact): │
│ ├── Hardware: Use right accelerator │
│ ├── Compiler: Enable optimizations (TensorRT, ONNX) │
│ ├── Runtime: Batching, caching, async │
│ ├── Model: Quantization, pruning │
│ └── Architecture: Distillation, model change │
│ │
│ 4. Validate │
│ └── Verify accuracy maintained, latency improved │
│ │
│ 5. Deploy and Monitor │
│ └── Track real-world performance │
│ │
└─────────────────────────────────────────────────────────────────────┘
Optimization Priority Matrix
High Impact
│
Compiler Opts ────┼──── Quantization
(easy win) │ (best ROI)
│
Low Effort ──────────────┼──────────────── High Effort
│
Batching ────┼──── Distillation
(quick win) │ (major effort)
│
Low Impact
Common Patterns
Multi-Model Serving
┌─────────────────────────────────────────────────────────────────────┐
│ │
│ Request → ┌─────────┐ │
│ │ Router │ │
│ └─────────┘ │
│ │ │ │ │
│ ┌────────┘ │ └────────┐ │
│ ▼ ▼ ▼ │
│ ┌───────┐ ┌───────┐ ┌───────┐ │
│ │ Tiny │ │ Small │ │ Large │ │
│ │ <10ms │ │ <50ms │ │<500ms │ │
│ └───────┘ └───────┘ └───────┘ │
│ │
│ Routing strategies: │
│ • Complexity-based: Simple→Tiny, Complex→Large │
│ • Confidence-based: Try Tiny, escalate if low confidence │
│ • SLA-based: Route based on latency requirements │
│ │
└─────────────────────────────────────────────────────────────────────┘
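Confidence-based routing from the diagram reduces to: try the tiny model first, escalate only when its confidence is below a threshold. A sketch with stub models (the `tiny`/`large` callables are placeholders for real predictors):

```python
def route(request, tiny, large, confidence_threshold=0.9):
    """Escalate to the large model only when the tiny model is unsure."""
    label, confidence = tiny(request)
    if confidence >= confidence_threshold:
        return label, "tiny"
    label, _ = large(request)
    return label, "large"

# Stub models: a real system would wrap actual inference calls.
def tiny(req):
    return ("positive", 0.95) if "great" in req else ("unknown", 0.4)

def large(req):
    return ("negative", 0.99)

easy = route("this is great", tiny, large)    # confident → served by tiny
hard = route("ambiguous input", tiny, large)  # unsure → escalated to large
```

The threshold sets the cost/accuracy balance: higher thresholds send more traffic to the large model.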
Speculative Execution
Query: "Translate: Hello"
│
├──▶ Small model (draft): "Bonjour" (5ms)
│
└──▶ Large model (verify): Check "Bonjour" (10ms parallel)
│
├── Accept: Return immediately
└── Reject: Generate with large model
Speedup: 2-3x when drafts are often accepted
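The accept/reject step above comes down to: take the small model's draft, verify it with the large model, and pay full generation cost only on rejection. A sketch with stub models (function names and the draft table are illustrative):

```python
def speculative_infer(query, draft_fn, verify_fn, generate_fn):
    """Accept the small model's cheap draft if the large model verifies it;
    otherwise fall back to full generation with the large model."""
    draft = draft_fn(query)
    if verify_fn(query, draft):   # one cheap verification pass
        return draft, "accepted"
    return generate_fn(query), "rejected"

# Stubs; real ones would be small/large model calls run in parallel.
drafts = {"Translate: Hello": "Bonjour"}
def draft_fn(q): return drafts.get(q, "?")
def verify_fn(q, d): return d == "Bonjour"
def generate_fn(q): return "Bonjour (slow path)"

fast = speculative_infer("Translate: Hello", draft_fn, verify_fn, generate_fn)
slow = speculative_infer("Translate: Goodbye", draft_fn, verify_fn, generate_fn)
```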
Cascade Models
Input → ┌────────┐
│ Filter │ ← Cheap filter (reject obvious negatives)
└────────┘
│ (candidates only)
▼
┌────────┐
│ Stage 1│ ← Fast model (coarse ranking)
└────────┘
│ (top-100)
▼
┌────────┐
│ Stage 2│ ← Accurate model (fine ranking)
└────────┘
│ (top-10)
▼
Output
Benefit: 10x cheaper, similar accuracy
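The cascade is a funnel of increasingly expensive scorers over shrinking candidate sets. A sketch (the scoring functions are stand-ins for real models):

```python
def cascade_rank(candidates, cheap_filter, fast_score, accurate_score,
                 stage1_k=100, stage2_k=10):
    """Filter, coarse-rank with the fast model, fine-rank the survivors."""
    survivors = [c for c in candidates if cheap_filter(c)]
    coarse = sorted(survivors, key=fast_score, reverse=True)[:stage1_k]
    return sorted(coarse, key=accurate_score, reverse=True)[:stage2_k]

# Toy setup: items are ints, "relevance" grows with the value.
items = list(range(1000))
top = cascade_rank(
    items,
    cheap_filter=lambda x: x % 2 == 0,   # reject obvious negatives
    fast_score=lambda x: x,              # coarse proxy score
    accurate_score=lambda x: x,          # expensive exact score
)
# Only ~100 items ever reach the accurate model instead of all 1000.
```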
Optimization Checklist
Pre-Deployment
- Profile baseline performance
- Identify primary bottleneck (model, data, system)
- Apply compiler optimizations (TensorRT, ONNX)
- Evaluate quantization (INT8 usually safe)
- Tune batch size for target throughput
- Test accuracy after optimization
Deployment
- Configure appropriate hardware
- Enable caching where applicable
- Set up monitoring (latency, throughput, errors)
- Configure auto-scaling policies
- Implement graceful degradation
Post-Deployment
- Monitor p99 latency
- Track accuracy metrics
- Analyze cache hit rates
- Review cost efficiency
- Plan iterative improvements
Related Skills
- llm-serving-patterns - LLM-specific serving optimization
- ml-system-design - End-to-end ML pipeline design
- quality-attributes-taxonomy - Performance as quality attribute
- estimation-techniques - Capacity planning for ML systems
Version History
- v1.0.0 (2025-12-26): Initial release - ML inference optimization patterns
Last Updated
Date: 2025-12-26