ml-inference-optimization
ML Inference Optimization
When to Use This Skill
Use this skill when:
- Optimizing ML inference latency
- Reducing model size for deployment
- Implementing model compression techniques
- Designing inference caching strategies
- Deploying models at the edge
- Balancing accuracy vs. latency trade-offs
Keywords: inference optimization, latency, model compression, distillation, pruning, quantization, caching, edge ML, TensorRT, ONNX, model serving, batching, hardware acceleration
Inference Optimization Overview
┌─────────────────────────────────────────────────────────────────────┐
│ Inference Optimization Stack │
├─────────────────────────────────────────────────────────────────────┤
│ │
│ ┌──────────────────────────────────────────────────────────────┐ │
│ │ Model Level │ │
│ │ Distillation │ Pruning │ Quantization │ Architecture Search │ │
│ └──────────────────────────────────────────────────────────────┘ │
│ │ │
│ ▼ │
│ ┌──────────────────────────────────────────────────────────────┐ │
│ │ Compiler Level │ │
│ │ Graph optimization │ Operator fusion │ Memory planning │ │
│ └──────────────────────────────────────────────────────────────┘ │
│ │ │
│ ▼ │
│ ┌──────────────────────────────────────────────────────────────┐ │
│ │ Runtime Level │ │
│ │ Batching │ Caching │ Async execution │ Multi-threading │ │
│ └──────────────────────────────────────────────────────────────┘ │
│ │ │
│ ▼ │
│ ┌──────────────────────────────────────────────────────────────┐ │
│ │ Hardware Level │ │
│ │ GPU │ TPU │ NPU │ CPU SIMD │ Custom accelerators │ │
│ └──────────────────────────────────────────────────────────────┘ │
│ │
└─────────────────────────────────────────────────────────────────────┘
Model Compression Techniques
Technique Overview
| Technique | Size Reduction | Speed Improvement | Accuracy Impact |
|---|---|---|---|
| Quantization | 2-4x | 2-4x | Low (1-2%) |
| Pruning | 2-10x | 1-3x | Low-Medium |
| Distillation | 3-10x | 3-10x | Medium |
| Low-rank factorization | 2-5x | 1.5-3x | Low-Medium |
| Weight sharing | 10-100x | Variable | Medium-High |
Knowledge Distillation
┌─────────────────────────────────────────────────────────────────────┐
│ Knowledge Distillation │
├─────────────────────────────────────────────────────────────────────┤
│ │
│ ┌──────────────┐ │
│ │ Teacher Model│ (Large, accurate, slow) │
│ │ GPT-4 │ │
│ └──────────────┘ │
│ │ │
│ ▼ Soft labels (probability distributions) │
│ ┌──────────────────────────────────────────────────────────────┐ │
│ │ Training Process │ │
│ │ Loss = α × CrossEntropy(student, hard_labels) │ │
│ │ + (1-α) × KL_Div(student, teacher_soft_labels) │ │
│ └──────────────────────────────────────────────────────────────┘ │
│ │ │
│ ▼ │
│ ┌──────────────┐ │
│ │Student Model │ (Small, nearly as accurate, fast) │
│ │ DistilBERT │ │
│ └──────────────┘ │
│ │
└─────────────────────────────────────────────────────────────────────┘
Distillation Types:
| Type | Description | Use Case |
|---|---|---|
| Response distillation | Match teacher outputs | General compression |
| Feature distillation | Match intermediate layers | Better transfer |
| Relation distillation | Match sample relationships | Structured data |
| Self-distillation | Model teaches itself | Regularization |
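The combined loss above can be sketched in plain Python (the `distillation_loss` helper is illustrative; real training uses framework tensors, but the arithmetic is the same):

```python
import math

def softmax(logits, temperature=1.0):
    """Convert logits to a probability distribution at a given temperature."""
    exps = [math.exp(l / temperature) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(student_logits, teacher_logits, hard_label,
                      alpha=0.5, temperature=2.0):
    """Loss = alpha * CE(student, hard label) + (1 - alpha) * KL(teacher || student).

    Soft targets use temperature T; the KL term is scaled by T^2 so its
    gradient magnitude stays comparable to the hard-label term.
    """
    student_probs = softmax(student_logits)
    ce = -math.log(student_probs[hard_label])

    teacher_soft = softmax(teacher_logits, temperature)
    student_soft = softmax(student_logits, temperature)
    kl = sum(t * math.log(t / s) for t, s in zip(teacher_soft, student_soft))

    return alpha * ce + (1 - alpha) * (temperature ** 2) * kl

loss = distillation_loss([2.0, 0.5, 0.1], [3.0, 1.0, 0.2], hard_label=0)
```

A student that matches both the teacher's soft distribution and the hard label drives both terms toward zero.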
Pruning Strategies
Unstructured Pruning (Weight-level):
Before: [0.1, 0.8, 0.2, 0.9, 0.05, 0.7]
After: [0.0, 0.8, 0.0, 0.9, 0.0, 0.7] (50% sparse)
• Flexible, high sparsity possible
• Needs sparse hardware/libraries
Structured Pruning (Channel/Layer-level):
Before: ┌───┬───┬───┬───┐
│ C1│ C2│ C3│ C4│
└───┴───┴───┴───┘
After: ┌───┬───┬───┐
│ C1│ C3│ C4│ (Removed C2 entirely)
└───┴───┴───┘
• Works with standard hardware
• Lower compression ratio
Pruning Decision Criteria:
| Method | Description | Effectiveness |
|---|---|---|
| Magnitude-based | Remove smallest weights | Simple, effective |
| Gradient-based | Remove low-gradient weights | Better accuracy |
| Second-order | Use Hessian information | Best but expensive |
| Lottery ticket | Find winning subnetwork | Theoretical insight |
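Magnitude-based unstructured pruning from the example above, as a minimal sketch (plain lists stand in for weight tensors):

```python
def magnitude_prune(weights, sparsity):
    """Zero out the smallest-magnitude fraction of weights.

    The weight layout stays intact (unstructured pruning), so the result
    only speeds things up on sparse-aware hardware or libraries.
    """
    n_prune = int(len(weights) * sparsity)
    # Indices of the n_prune smallest-magnitude weights.
    order = sorted(range(len(weights)), key=lambda i: abs(weights[i]))
    to_zero = set(order[:n_prune])
    return [0.0 if i in to_zero else w for i, w in enumerate(weights)]

pruned = magnitude_prune([0.1, 0.8, 0.2, 0.9, 0.05, 0.7], sparsity=0.5)
# Reproduces the Before/After example: [0.0, 0.8, 0.0, 0.9, 0.0, 0.7]
```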
Quantization (Detailed)
Precision Hierarchy:
FP32 (32 bits): ████████████████████████████████
FP16 (16 bits): ████████████████
BF16 (16 bits): ████████████████ (different mantissa/exponent)
INT8 (8 bits): ████████
INT4 (4 bits): ████
Binary (1 bit): █
Memory and Compute Scale Proportionally
Quantization Approaches:
| Approach | When Applied | Quality | Effort |
|---|---|---|---|
| Dynamic quantization | Runtime | Good | Low |
| Static quantization | Post-training with calibration | Better | Medium |
| QAT | During training | Best | High |
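At its core, INT8 quantization picks a scale and zero point from observed value ranges, then rounds. A minimal affine-quantization sketch (in static quantization the min/max would come from a calibration set rather than the tensor itself):

```python
def quantize_int8(values):
    """Affine (asymmetric) quantization of floats to INT8 codes in [-128, 127]."""
    lo, hi = min(values), max(values)
    scale = (hi - lo) / 255.0 or 1.0  # avoid zero scale for constant tensors
    zero_point = round(-128 - lo / scale)
    q = [max(-128, min(127, round(v / scale + zero_point))) for v in values]
    return q, scale, zero_point

def dequantize(q, scale, zero_point):
    """Map INT8 codes back to approximate float values."""
    return [(qi - zero_point) * scale for qi in q]

vals = [-1.0, -0.25, 0.0, 0.5, 1.5]
q, scale, zp = quantize_int8(vals)
restored = dequantize(q, scale, zp)
# Round-trip error per element is bounded by scale/2.
```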
Compiler-Level Optimization
Graph Optimization
Original Graph:
Input → Conv → BatchNorm → ReLU → Conv → BatchNorm → ReLU → Output
Optimized Graph (Operator Fusion):
Input → FusedConvBNReLU → FusedConvBNReLU → Output
Benefits:
• Fewer kernel launches
• Better memory locality
• Reduced memory bandwidth
Common Optimizations
| Optimization | Description | Speedup |
|---|---|---|
| Operator fusion | Combine sequential ops | 1.2-2x |
| Constant folding | Pre-compute constants | 1.1-1.5x |
| Dead code elimination | Remove unused ops | Variable |
| Layout optimization | Optimize tensor memory layout | 1.1-1.3x |
| Memory planning | Optimize buffer allocation | 1.1-1.2x |
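Operator fusion is essentially pattern matching over the graph: scan for known op sequences and replace each with a single fused kernel. A toy sketch over a linear op list (real compilers work on a DAG, and the pattern table here is illustrative):

```python
# Fusion patterns: a sequence of op types collapses into one fused op.
FUSION_PATTERNS = {
    ("Conv", "BatchNorm", "ReLU"): "FusedConvBNReLU",
    ("MatMul", "Add"): "FusedGemm",
}

def fuse_operators(ops):
    """Greedily replace known op sequences with their fused equivalents."""
    result, i = [], 0
    while i < len(ops):
        for pattern, fused in FUSION_PATTERNS.items():
            if tuple(ops[i:i + len(pattern)]) == pattern:
                result.append(fused)
                i += len(pattern)
                break
        else:
            result.append(ops[i])
            i += 1
    return result

graph = ["Input", "Conv", "BatchNorm", "ReLU", "Conv", "BatchNorm", "ReLU"]
fused = fuse_operators(graph)
# Matches the example graph above: ["Input", "FusedConvBNReLU", "FusedConvBNReLU"]
```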
Optimization Frameworks
| Framework | Vendor | Best For |
|---|---|---|
| TensorRT | NVIDIA | NVIDIA GPUs, lowest latency |
| ONNX Runtime | Microsoft | Cross-platform, broad support |
| OpenVINO | Intel | Intel CPUs/GPUs |
| Core ML | Apple | Apple devices |
| TFLite | Google | Mobile, embedded |
| Apache TVM | Open source | Custom hardware, research |
Runtime Optimization
Batching Strategies
No Batching:
Request 1: [Process] → Response 1 10ms
Request 2: [Process] → Response 2 10ms
Request 3: [Process] → Response 3 10ms
Total: 30ms, GPU underutilized
Dynamic Batching:
Requests 1-3: [Wait 5ms] → [Process batch] → Responses
Total: 15ms, 2x throughput
Trade-off: Latency vs. Throughput
• Larger batch: Higher throughput, higher latency
• Smaller batch: Lower latency, lower throughput
Batching Parameters:
| Parameter | Trade-off |
|---|---|
| Maximum batch size | Throughput vs. latency |
| Wait time for batch fill | Latency vs. efficiency |
| Minimum batch before processing | Latency predictability |
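A dynamic batcher collects requests until the batch is full or the wait window expires, whichever comes first. A minimal single-threaded sketch (a real server would flush from a background thread):

```python
import time

class DynamicBatcher:
    """Flush when max_batch_size is reached or max_wait_s has elapsed."""

    def __init__(self, max_batch_size=8, max_wait_s=0.005):
        self.max_batch_size = max_batch_size
        self.max_wait_s = max_wait_s
        self.pending = []
        self.first_arrival = None

    def submit(self, request):
        """Queue a request; returns a full batch if the size limit is hit."""
        if not self.pending:
            self.first_arrival = time.monotonic()
        self.pending.append(request)
        if len(self.pending) >= self.max_batch_size:
            return self.flush()
        return None

    def poll(self):
        """Called periodically: flush a partial batch once the window expires."""
        if self.pending and time.monotonic() - self.first_arrival >= self.max_wait_s:
            return self.flush()
        return None

    def flush(self):
        batch, self.pending = self.pending, []
        return batch

batcher = DynamicBatcher(max_batch_size=3)
batcher.submit("r1")
batcher.submit("r2")
batch = batcher.submit("r3")  # size limit reached → flush
```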
Caching Strategies
┌─────────────────────────────────────────────────────────────────────┐
│ Inference Caching Layers │
├─────────────────────────────────────────────────────────────────────┤
│ │
│ Layer 1: Input Cache │
│ ┌─────────────────────────────────────────────────────────────┐ │
│ │ Cache exact inputs → Return cached outputs │ │
│ │ Hit rate: Low (inputs rarely repeat exactly) │ │
│ └─────────────────────────────────────────────────────────────┘ │
│ │
│ Layer 2: Embedding Cache │
│ ┌─────────────────────────────────────────────────────────────┐ │
│ │ Cache computed embeddings for repeated tokens/entities │ │
│ │ Hit rate: Medium (common tokens repeat) │ │
│ └─────────────────────────────────────────────────────────────┘ │
│ │
│ Layer 3: KV Cache (for transformers) │
│ ┌─────────────────────────────────────────────────────────────┐ │
│ │ Cache key-value pairs for attention │ │
│ │ Hit rate: High (reuse across tokens in sequence) │ │
│ └─────────────────────────────────────────────────────────────┘ │
│ │
│ Layer 4: Result Cache │
│ ┌─────────────────────────────────────────────────────────────┐ │
│ │ Cache semantic equivalents (fuzzy matching) │ │
│ │ Hit rate: Variable (depends on query distribution) │ │
│ └─────────────────────────────────────────────────────────────┘ │
│ │
└─────────────────────────────────────────────────────────────────────┘
Semantic Caching for LLMs:
Query: "What's the capital of France?"
↓
Hash + Embed query
↓
Search cache (similarity > threshold)
↓
├── Hit: Return cached response
└── Miss: Generate → Cache → Return
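The lookup flow above can be sketched with any embedding function plus a cosine-similarity threshold. Here a trivial bag-of-words embedding stands in for a real sentence encoder (an assumption for the sketch; production systems use learned embeddings and a vector index):

```python
import math
from collections import Counter

def embed(text):
    """Toy embedding: bag-of-words counts, a stand-in for a sentence encoder."""
    return Counter(text.lower().split())

def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

class SemanticCache:
    def __init__(self, threshold=0.8):
        self.threshold = threshold
        self.entries = []  # list of (embedding, cached response)

    def lookup(self, query):
        q = embed(query)
        best = max(self.entries, key=lambda e: cosine(q, e[0]), default=None)
        if best and cosine(q, best[0]) >= self.threshold:
            return best[1]  # hit: similar enough to a cached query
        return None         # miss: caller generates, then calls store()

    def store(self, query, response):
        self.entries.append((embed(query), response))

cache = SemanticCache(threshold=0.8)
cache.store("what's the capital of france?", "Paris")
hit = cache.lookup("what's the capital of france?")   # repeat query → hit
miss = cache.lookup("how tall is the eiffel tower?")  # unrelated → miss
```

The threshold trades hit rate against the risk of returning a cached answer to a semantically different question.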
Async and Parallel Execution
Sequential:
┌─────┐ ┌─────┐ ┌─────┐
│Prep │→│Model│→│Post │ Total: 30ms
│10ms │ │15ms │ │5ms │
└─────┘ └─────┘ └─────┘
Pipelined:
Request 1: │Prep│Model│Post│
Request 2: │Prep│Model│Post│
Request 3: │Prep│Model│Post│
Throughput: 3x higher
Latency per request: Same
Hardware Acceleration
Hardware Comparison
| Hardware | Strengths | Limitations | Best For |
|---|---|---|---|
| GPU (NVIDIA) | High parallelism, mature ecosystem | Power, cost | Training, large batch inference |
| TPU (Google) | Matrix ops, cloud integration | Vendor lock-in | Google Cloud workloads |
| NPU (Apple/Qualcomm) | Power efficient, on-device | Limited models | Mobile, edge |
| CPU | Flexible, available | Slower for ML | Low-batch, CPU-bound |
| FPGA | Customizable, low latency | Development complexity | Specialized workloads |
GPU Optimization
| Optimization | Description | Impact |
|---|---|---|
| Tensor Cores | Use FP16/INT8 tensor operations | 2-8x speedup |
| CUDA graphs | Reduce kernel launch overhead | 1.5-2x for small models |
| Multi-stream | Parallel execution | Higher throughput |
| Memory pooling | Reduce allocation overhead | Lower latency variance |
Edge Deployment
Edge Constraints
┌─────────────────────────────────────────────────────────────────────┐
│ Edge Deployment Constraints │
├─────────────────────────────────────────────────────────────────────┤
│ │
│ Resource Constraints: │
│ ├── Memory: 1-4 GB (vs. 64+ GB cloud) │
│ ├── Compute: 1-10 TOPS (vs. 100+ TFLOPS cloud) │
│ ├── Power: 5-15W (vs. 300W+ cloud) │
│ └── Storage: 16-128 GB (vs. TB cloud) │
│ │
│ Operational Constraints: │
│ ├── No network (offline operation) │
│ ├── Variable ambient conditions │
│ ├── Infrequent updates │
│ └── Long deployment lifetime │
│ │
└─────────────────────────────────────────────────────────────────────┘
Edge Optimization Strategies
| Strategy | Description | Use When |
|---|---|---|
| Model selection | Use edge-native models (MobileNet, EfficientNet) | Accuracy acceptable |
| Aggressive quantization | INT8 or lower | Memory/power constrained |
| On-device distillation | Distill to tiny model | Extreme constraints |
| Split inference | Edge preprocessing, cloud inference | Network available |
| Model caching | Cache results locally | Repeated queries |
Edge ML Frameworks
| Framework | Platform | Features |
|---|---|---|
| TensorFlow Lite | Android, iOS, embedded | Quantization, delegates |
| Core ML | iOS, macOS | Neural Engine optimization |
| ONNX Runtime Mobile | Cross-platform | Broad model support |
| PyTorch Mobile | Android, iOS | Familiar API |
| TensorRT | NVIDIA Jetson | Maximum performance |
Latency Profiling
Profiling Methodology
┌─────────────────────────────────────────────────────────────────────┐
│ Latency Breakdown Analysis │
├─────────────────────────────────────────────────────────────────────┤
│ │
│ 1. Data Loading: ████████░░░░░░░░░░ 15% │
│ 2. Preprocessing: ██████░░░░░░░░░░░░ 10% │
│ 3. Model Inference: ████████████████░░ 60% │
│ 4. Postprocessing: ████░░░░░░░░░░░░░░ 8% │
│ 5. Response Serialization:███░░░░░░░░░░░░░░░ 7% │
│ │
│ Target: Model inference (60% = biggest optimization opportunity) │
│ │
└─────────────────────────────────────────────────────────────────────┘
Profiling Tools
| Tool | Use For |
|---|---|
| PyTorch Profiler | PyTorch model profiling |
| TensorBoard | TensorFlow visualization |
| NVIDIA Nsight | GPU profiling |
| Chrome Tracing | General timeline visualization |
| perf | CPU profiling |
Key Metrics
| Metric | Description | Target |
|---|---|---|
| P50 latency | Median latency | < SLA |
| P99 latency | Tail latency | < 2x P50 |
| Throughput | Requests/second | Meet demand |
| GPU utilization | Compute usage | > 80% |
| Memory bandwidth | Memory usage | < limit |
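P50/P99 must be computed from raw latency samples, not averages, because tail requests vanish in a mean. A stdlib-only sketch using nearest-rank percentiles:

```python
import math

def percentile(samples, p):
    """Nearest-rank percentile: the smallest value covering p% of samples."""
    ordered = sorted(samples)
    rank = math.ceil(p / 100 * len(ordered))
    return ordered[max(rank - 1, 0)]

# 100 latency samples (ms): mostly fast, with a slow tail.
latencies = [10.0] * 95 + [40.0] * 4 + [120.0]
p50 = percentile(latencies, 50)  # 10.0 ms
p99 = percentile(latencies, 99)  # 40.0 ms
# Against the target above (P99 < 2x P50), this distribution fails:
# the tail is 4x the median, so tail latency needs attention.
```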
Optimization Workflow
Systematic Approach
┌─────────────────────────────────────────────────────────────────────┐
│ Optimization Workflow │
├─────────────────────────────────────────────────────────────────────┤
│ │
│ 1. Baseline │
│ └── Measure current performance (latency, throughput, accuracy) │
│ │
│ 2. Profile │
│ └── Identify bottlenecks (model, data, system) │
│ │
│ 3. Optimize (in order of effort/impact): │
│ ├── Hardware: Use right accelerator │
│ ├── Compiler: Enable optimizations (TensorRT, ONNX) │
│ ├── Runtime: Batching, caching, async │
│ ├── Model: Quantization, pruning │
│ └── Architecture: Distillation, model change │
│ │
│ 4. Validate │
│ └── Verify accuracy maintained, latency improved │
│ │
│ 5. Deploy and Monitor │
│ └── Track real-world performance │
│ │
└─────────────────────────────────────────────────────────────────────┘
Optimization Priority Matrix
High Impact
│
Compiler Opts ────┼──── Quantization
(easy win) │ (best ROI)
│
Low Effort ──────────────┼──────────────── High Effort
│
Batching ────┼──── Distillation
(quick win) │ (major effort)
│
Low Impact
Common Patterns
Multi-Model Serving
┌─────────────────────────────────────────────────────────────────────┐
│ │
│ Request → ┌─────────┐ │
│ │ Router │ │
│ └─────────┘ │
│ │ │ │ │
│ ┌────────┘ │ └────────┐ │
│ ▼ ▼ ▼ │
│ ┌───────┐ ┌───────┐ ┌───────┐ │
│ │ Tiny │ │ Small │ │ Large │ │
│ │ <10ms │ │ <50ms │ │<500ms │ │
│ └───────┘ └───────┘ └───────┘ │
│ │
│ Routing strategies: │
│ • Complexity-based: Simple→Tiny, Complex→Large │
│ • Confidence-based: Try Tiny, escalate if low confidence │
│ • SLA-based: Route based on latency requirements │
│ │
└─────────────────────────────────────────────────────────────────────┘
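Confidence-based routing from the diagram reduces to: try the tiny model first, escalate only when its confidence is below a threshold. A sketch with stub models (the `tiny`/`large` callables are placeholders for real predictors):

```python
def route(request, tiny, large, confidence_threshold=0.9):
    """Escalate to the large model only when the tiny model is unsure."""
    label, confidence = tiny(request)
    if confidence >= confidence_threshold:
        return label, "tiny"
    label, _ = large(request)
    return label, "large"

# Stub models: a real system would wrap actual inference calls.
def tiny(req):
    return ("positive", 0.95) if "great" in req else ("unknown", 0.4)

def large(req):
    return ("negative", 0.99)

easy = route("this is great", tiny, large)    # confident → served by tiny
hard = route("ambiguous input", tiny, large)  # unsure → escalated to large
```

The threshold sets the cost/accuracy balance: higher thresholds send more traffic to the large model.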
Speculative Execution
Query: "Translate: Hello"
│
├──▶ Small model (draft): "Bonjour" (5ms)
│
└──▶ Large model (verify): Check "Bonjour" (10ms parallel)
│
├── Accept: Return immediately
└── Reject: Generate with large model
Speedup: 2-3x when drafts are often accepted
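The accept/reject step above comes down to: take the small model's draft, verify it with the large model, and pay full generation cost only on rejection. A sketch with stub models (function names and the draft table are illustrative):

```python
def speculative_infer(query, draft_fn, verify_fn, generate_fn):
    """Accept the small model's cheap draft if the large model verifies it;
    otherwise fall back to full generation with the large model."""
    draft = draft_fn(query)
    if verify_fn(query, draft):   # one cheap verification pass
        return draft, "accepted"
    return generate_fn(query), "rejected"

# Stubs; real ones would be small/large model calls run in parallel.
drafts = {"Translate: Hello": "Bonjour"}
def draft_fn(q): return drafts.get(q, "?")
def verify_fn(q, d): return d == "Bonjour"
def generate_fn(q): return "Bonjour (slow path)"

fast = speculative_infer("Translate: Hello", draft_fn, verify_fn, generate_fn)
slow = speculative_infer("Translate: Goodbye", draft_fn, verify_fn, generate_fn)
```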
Cascade Models
Input → ┌────────┐
│ Filter │ ← Cheap filter (reject obvious negatives)
└────────┘
│ (candidates only)
▼
┌────────┐
│ Stage 1│ ← Fast model (coarse ranking)
└────────┘
│ (top-100)
▼
┌────────┐
│ Stage 2│ ← Accurate model (fine ranking)
└────────┘
│ (top-10)
▼
Output
Benefit: 10x cheaper, similar accuracy
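The cascade is a funnel of increasingly expensive scorers over shrinking candidate sets. A sketch (the scoring functions are stand-ins for real models):

```python
def cascade_rank(candidates, cheap_filter, fast_score, accurate_score,
                 stage1_k=100, stage2_k=10):
    """Filter, coarse-rank with the fast model, fine-rank the survivors."""
    survivors = [c for c in candidates if cheap_filter(c)]
    coarse = sorted(survivors, key=fast_score, reverse=True)[:stage1_k]
    return sorted(coarse, key=accurate_score, reverse=True)[:stage2_k]

# Toy setup: items are ints, "relevance" grows with the value.
items = list(range(1000))
top = cascade_rank(
    items,
    cheap_filter=lambda x: x % 2 == 0,   # reject obvious negatives
    fast_score=lambda x: x,              # coarse proxy score
    accurate_score=lambda x: x,          # expensive exact score
)
# Only ~100 items ever reach the accurate model instead of all 1000.
```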
Optimization Checklist
Pre-Deployment
- Profile baseline performance
- Identify primary bottleneck (model, data, system)
- Apply compiler optimizations (TensorRT, ONNX)
- Evaluate quantization (INT8 usually safe)
- Tune batch size for target throughput
- Test accuracy after optimization
Deployment
- Configure appropriate hardware
- Enable caching where applicable
- Set up monitoring (latency, throughput, errors)
- Configure auto-scaling policies
- Implement graceful degradation
Post-Deployment
- Monitor p99 latency
- Track accuracy metrics
- Analyze cache hit rates
- Review cost efficiency
- Plan iterative improvements
Related Skills
- llm-serving-patterns - LLM-specific serving optimization
- ml-system-design - End-to-end ML pipeline design
- quality-attributes-taxonomy - Performance as quality attribute
- estimation-techniques - Capacity planning for ML systems
Version History
- v1.0.0 (2025-12-26): Initial release - ML inference optimization patterns
Last Updated
Date: 2025-12-26