ML Inference Optimization


When to Use This Skill


Use this skill when:
  • Optimizing ML inference latency
  • Reducing model size for deployment
  • Implementing model compression techniques
  • Designing inference caching strategies
  • Deploying models at the edge
  • Balancing accuracy vs. latency trade-offs
Keywords: inference optimization, latency, model compression, distillation, pruning, quantization, caching, edge ML, TensorRT, ONNX, model serving, batching, hardware acceleration

Inference Optimization Overview


┌─────────────────────────────────────────────────────────────────────┐
│                 Inference Optimization Stack                        │
├─────────────────────────────────────────────────────────────────────┤
│                                                                     │
│  ┌──────────────────────────────────────────────────────────────┐  │
│  │                    Model Level                                │  │
│  │  Distillation │ Pruning │ Quantization │ Architecture Search │  │
│  └──────────────────────────────────────────────────────────────┘  │
│                              │                                      │
│                              ▼                                      │
│  ┌──────────────────────────────────────────────────────────────┐  │
│  │                   Compiler Level                              │  │
│  │  Graph optimization │ Operator fusion │ Memory planning       │  │
│  └──────────────────────────────────────────────────────────────┘  │
│                              │                                      │
│                              ▼                                      │
│  ┌──────────────────────────────────────────────────────────────┐  │
│  │                  Runtime Level                                │  │
│  │  Batching │ Caching │ Async execution │ Multi-threading      │  │
│  └──────────────────────────────────────────────────────────────┘  │
│                              │                                      │
│                              ▼                                      │
│  ┌──────────────────────────────────────────────────────────────┐  │
│  │                  Hardware Level                               │  │
│  │  GPU │ TPU │ NPU │ CPU SIMD │ Custom accelerators            │  │
│  └──────────────────────────────────────────────────────────────┘  │
│                                                                     │
└─────────────────────────────────────────────────────────────────────┘

Model Compression Techniques


Technique Overview


| Technique | Size Reduction | Speed Improvement | Accuracy Impact |
| --- | --- | --- | --- |
| Quantization | 2-4x | 2-4x | Low (1-2%) |
| Pruning | 2-10x | 1-3x | Low-Medium |
| Distillation | 3-10x | 3-10x | Medium |
| Low-rank factorization | 2-5x | 1.5-3x | Low-Medium |
| Weight sharing | 10-100x | Variable | Medium-High |

Knowledge Distillation


┌─────────────────────────────────────────────────────────────────────┐
│                    Knowledge Distillation                           │
├─────────────────────────────────────────────────────────────────────┤
│                                                                     │
│  ┌──────────────┐                                                   │
│  │ Teacher Model│ (Large, accurate, slow)                          │
│  │   GPT-4      │                                                   │
│  └──────────────┘                                                   │
│         │                                                           │
│         ▼ Soft labels (probability distributions)                   │
│  ┌──────────────────────────────────────────────────────────────┐  │
│  │                    Training Process                           │  │
│  │  Loss = α × CrossEntropy(student, hard_labels)               │  │
│  │       + (1-α) × KL_Div(student, teacher_soft_labels)         │  │
│  └──────────────────────────────────────────────────────────────┘  │
│         │                                                           │
│         ▼                                                           │
│  ┌──────────────┐                                                   │
│  │Student Model │ (Small, nearly as accurate, fast)                │
│  │  DistilBERT  │                                                   │
│  └──────────────┘                                                   │
│                                                                     │
└─────────────────────────────────────────────────────────────────────┘
Distillation Types:

| Type | Description | Use Case |
| --- | --- | --- |
| Response distillation | Match teacher outputs | General compression |
| Feature distillation | Match intermediate layers | Better transfer |
| Relation distillation | Match sample relationships | Structured data |
| Self-distillation | Model teaches itself | Regularization |
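The combined loss in the diagram can be sketched in plain numpy. This is a minimal illustration rather than a training loop; the temperature `T` used to soften both distributions is a standard knob not shown in the diagram, and the `T²` rescaling follows the usual formulation for keeping gradient magnitudes comparable across temperatures:

```python
import numpy as np

def softmax(logits, T=1.0):
    z = np.asarray(logits, dtype=float) / T
    e = np.exp(z - z.max())
    return e / e.sum()

def distillation_loss(student_logits, teacher_logits, hard_label, alpha=0.5, T=2.0):
    # Loss = alpha * CrossEntropy(student, hard_label)
    #      + (1 - alpha) * KL(teacher_soft || student_soft), softened by temperature T
    p_student = softmax(student_logits)
    cross_entropy = -np.log(p_student[hard_label] + 1e-12)
    p_s = softmax(student_logits, T)
    p_t = softmax(teacher_logits, T)
    kl = float(np.sum(p_t * (np.log(p_t + 1e-12) - np.log(p_s + 1e-12))))
    return alpha * cross_entropy + (1 - alpha) * (T ** 2) * kl
```

When the student already matches the teacher, the KL term vanishes and only the hard-label cross-entropy remains.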

Pruning Strategies


Unstructured Pruning (Weight-level):
Before: [0.1, 0.8, 0.2, 0.9, 0.05, 0.7]
After:  [0.0, 0.8, 0.0, 0.9, 0.0, 0.7]  (50% sparse)
• Flexible, high sparsity possible
• Needs sparse hardware/libraries

Structured Pruning (Channel/Layer-level):
Before: ┌───┬───┬───┬───┐
        │ C1│ C2│ C3│ C4│
        └───┴───┴───┴───┘
After:  ┌───┬───┬───┐
        │ C1│ C3│ C4│  (Removed C2 entirely)
        └───┴───┴───┘
• Works with standard hardware
• Lower compression ratio
Pruning Decision Criteria:

| Method | Description | Effectiveness |
| --- | --- | --- |
| Magnitude-based | Remove smallest weights | Simple, effective |
| Gradient-based | Remove low-gradient weights | Better accuracy |
| Second-order | Use Hessian information | Best but expensive |
| Lottery ticket | Find winning subnetwork | Theoretical insight |
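Magnitude-based pruning, the first criterion above, is a few lines over the weight tensor; this sketch reproduces the unstructured example shown earlier (ties at the threshold may prune slightly more than the requested fraction):

```python
import numpy as np

def magnitude_prune(weights, sparsity=0.5):
    """Zero out the smallest-magnitude weights until `sparsity` fraction are zero."""
    w = np.asarray(weights, dtype=float)
    k = int(w.size * sparsity)              # number of weights to remove
    if k == 0:
        return w.copy()
    threshold = np.sort(np.abs(w).ravel())[k - 1]
    return np.where(np.abs(w) <= threshold, 0.0, w)

# magnitude_prune([0.1, 0.8, 0.2, 0.9, 0.05, 0.7], 0.5)
# → [0.0, 0.8, 0.0, 0.9, 0.0, 0.7]  (50% sparse, as in the example above)
```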

Quantization (Detailed)


Precision Hierarchy:

FP32 (32 bits): ████████████████████████████████
FP16 (16 bits): ████████████████
BF16 (16 bits): ████████████████  (different mantissa/exponent)
INT8 (8 bits):  ████████
INT4 (4 bits):  ████
Binary (1 bit): █

Memory and Compute Scale Proportionally
Quantization Approaches:

| Approach | When Applied | Quality | Effort |
| --- | --- | --- | --- |
| Dynamic quantization | Runtime | Good | Low |
| Static quantization | Post-training, with calibration data | Better | Medium |
| Quantization-aware training (QAT) | During training | Best | High |
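The arithmetic behind 8-bit quantization is a simple affine map from float range to integer range. A sketch of asymmetric (uint8) post-training quantization, illustrative rather than any framework's API:

```python
import numpy as np

def quantize_uint8(x):
    """Asymmetric (affine) 8-bit quantization: x ≈ scale * (q - zero_point)."""
    x = np.asarray(x, dtype=np.float32)
    x_min, x_max = float(x.min()), float(x.max())
    scale = (x_max - x_min) / 255.0 or 1.0   # avoid zero scale for constant tensors
    zero_point = int(round(-x_min / scale))
    q = np.clip(np.round(x / scale) + zero_point, 0, 255).astype(np.uint8)
    return q, scale, zero_point

def dequantize(q, scale, zero_point):
    return scale * (q.astype(np.float32) - zero_point)
```

Round-tripping a tensor through `quantize_uint8`/`dequantize` bounds the per-element error by about one quantization step (`scale`), which is why accuracy impact stays low when the value range is well behaved.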

Compiler-Level Optimization


Graph Optimization


Original Graph:
Input → Conv → BatchNorm → ReLU → Conv → BatchNorm → ReLU → Output

Optimized Graph (Operator Fusion):
Input → FusedConvBNReLU → FusedConvBNReLU → Output

Benefits:
• Fewer kernel launches
• Better memory locality
• Reduced memory bandwidth
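Conv + BatchNorm fusion works because BatchNorm at inference time is a fixed per-channel affine map, so it folds directly into the preceding layer's weights. A sketch with a linear layer standing in for the convolution (names are illustrative):

```python
import numpy as np

def fold_batchnorm(W, b, gamma, beta, mean, var, eps=1e-5):
    """Fold inference-time BatchNorm into the preceding conv/linear layer.

    y = gamma * (W @ x + b - mean) / sqrt(var + eps) + beta
      = (s * W) @ x + (s * (b - mean) + beta),   with s = gamma / sqrt(var + eps)
    """
    s = gamma / np.sqrt(var + eps)
    W_folded = W * s[:, None]            # scale each output channel's weights
    b_folded = s * (b - mean) + beta
    return W_folded, b_folded
```

After folding, the BatchNorm op disappears from the graph entirely: one kernel launch and one pass over memory instead of two.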

Common Optimizations


| Optimization | Description | Speedup |
| --- | --- | --- |
| Operator fusion | Combine sequential ops | 1.2-2x |
| Constant folding | Pre-compute constants | 1.1-1.5x |
| Dead code elimination | Remove unused ops | Variable |
| Layout optimization | Optimize tensor memory layout | 1.1-1.3x |
| Memory planning | Optimize buffer allocation | 1.1-1.2x |

Optimization Frameworks


| Framework | Vendor | Best For |
| --- | --- | --- |
| TensorRT | NVIDIA | NVIDIA GPUs, lowest latency |
| ONNX Runtime | Microsoft | Cross-platform, broad support |
| OpenVINO | Intel | Intel CPUs/GPUs |
| Core ML | Apple | Apple devices |
| TFLite | Google | Mobile, embedded |
| Apache TVM | Open source | Custom hardware, research |

Runtime Optimization


Batching Strategies


No Batching:
Request 1: [Process] → Response 1      10ms
Request 2: [Process] → Response 2      10ms
Request 3: [Process] → Response 3      10ms
Total: 30ms, GPU underutilized

Dynamic Batching:
Requests 1-3: [Wait 5ms] → [Process batch] → Responses
Total: 15ms, 2x throughput

Trade-off: Latency vs. Throughput
• Larger batch: Higher throughput, higher latency
• Smaller batch: Lower latency, lower throughput
Batching Parameters:

| Parameter | Description | Trade-off |
| --- | --- | --- |
| batch_size | Maximum batch size | Throughput vs. latency |
| max_wait_time | Wait time for batch fill | Latency vs. efficiency |
| min_batch_size | Minimum before processing | Latency predictability |

Caching Strategies


┌─────────────────────────────────────────────────────────────────────┐
│                    Inference Caching Layers                         │
├─────────────────────────────────────────────────────────────────────┤
│                                                                     │
│  Layer 1: Input Cache                                               │
│  ┌─────────────────────────────────────────────────────────────┐   │
│  │ Cache exact inputs → Return cached outputs                   │   │
│  │ Hit rate: Low (inputs rarely repeat exactly)                 │   │
│  └─────────────────────────────────────────────────────────────┘   │
│                                                                     │
│  Layer 2: Embedding Cache                                           │
│  ┌─────────────────────────────────────────────────────────────┐   │
│  │ Cache computed embeddings for repeated tokens/entities       │   │
│  │ Hit rate: Medium (common tokens repeat)                      │   │
│  └─────────────────────────────────────────────────────────────┘   │
│                                                                     │
│  Layer 3: KV Cache (for transformers)                               │
│  ┌─────────────────────────────────────────────────────────────┐   │
│  │ Cache key-value pairs for attention                          │   │
│  │ Hit rate: High (reuse across tokens in sequence)             │   │
│  └─────────────────────────────────────────────────────────────┘   │
│                                                                     │
│  Layer 4: Result Cache                                              │
│  ┌─────────────────────────────────────────────────────────────┐   │
│  │ Cache semantic equivalents (fuzzy matching)                  │   │
│  │ Hit rate: Variable (depends on query distribution)           │   │
│  └─────────────────────────────────────────────────────────────┘   │
│                                                                     │
└─────────────────────────────────────────────────────────────────────┘
Semantic Caching for LLMs:
Query: "What's the capital of France?"
Hash + Embed query
Search cache (similarity > threshold)
├── Hit: Return cached response
└── Miss: Generate → Cache → Return
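The flow above can be sketched as follows. Everything here is illustrative: `embed_fn` is any query-embedding function you supply, and the linear scan stands in for a real vector index:

```python
import hashlib
import numpy as np

class SemanticCache:
    """Exact-hash lookup first, then cosine-similarity search over cached queries."""

    def __init__(self, embed_fn, threshold=0.95):
        self.embed_fn = embed_fn
        self.threshold = threshold
        self.exact = {}          # sha256(query) -> response
        self.entries = []        # (unit-norm embedding, response)

    def _unit(self, query):
        v = np.asarray(self.embed_fn(query), dtype=float)
        return v / (np.linalg.norm(v) + 1e-12)

    def get(self, query):
        key = hashlib.sha256(query.encode()).hexdigest()
        if key in self.exact:                      # exact hit
            return self.exact[key]
        q = self._unit(query)
        for emb, response in self.entries:         # fuzzy hit: cosine similarity
            if float(q @ emb) >= self.threshold:
                return response
        return None                                # miss: caller generates, then put()

    def put(self, query, response):
        key = hashlib.sha256(query.encode()).hexdigest()
        self.exact[key] = response
        self.entries.append((self._unit(query), response))
```

The threshold is the accuracy/hit-rate dial: too low and semantically different queries collide, too high and the cache degenerates to exact matching.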

Async and Parallel Execution


Sequential:
┌─────┐ ┌─────┐ ┌─────┐
│Prep │→│Model│→│Post │  Total: 30ms
│10ms │ │15ms │ │5ms  │
└─────┘ └─────┘ └─────┘

Pipelined:
Request 1: │Prep│Model│Post│
Request 2:      │Prep│Model│Post│
Request 3:           │Prep│Model│Post│

Throughput: 3x higher
Latency per request: Same
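The pipelined schedule can be sketched with one single-worker executor per stage, so different requests occupy different stages at the same time (stage timings mirror the diagram and are illustrative):

```python
import time
from concurrent.futures import ThreadPoolExecutor

def prep(x):  time.sleep(0.010); return x          # 10ms
def model(x): time.sleep(0.015); return x * 2      # 15ms
def post(x):  time.sleep(0.005); return x + 1      # 5ms

def pipelined(requests):
    # One single-worker executor per stage: while request 2 is in `model`,
    # request 3 can already be in `prep` -- throughput rises, per-request
    # latency stays the same.
    with ThreadPoolExecutor(1) as s1, ThreadPoolExecutor(1) as s2, ThreadPoolExecutor(1) as s3:
        results = []
        for r in requests:
            f1 = s1.submit(prep, r)
            f2 = s2.submit(lambda f=f1: model(f.result()))
            f3 = s3.submit(lambda f=f2: post(f.result()))
            results.append(f3)
        return [f.result() for f in results]
```

With stage times of 10/15/5 ms, steady-state throughput is bounded by the slowest stage (15 ms per request) rather than the 30 ms sum.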

Hardware Acceleration


Hardware Comparison


| Hardware | Strengths | Limitations | Best For |
| --- | --- | --- | --- |
| GPU (NVIDIA) | High parallelism, mature ecosystem | Power, cost | Training, large-batch inference |
| TPU (Google) | Matrix ops, cloud integration | Vendor lock-in | Google Cloud workloads |
| NPU (Apple/Qualcomm) | Power efficient, on-device | Limited model support | Mobile, edge |
| CPU | Flexible, available | Slower for ML | Low-batch, CPU-bound workloads |
| FPGA | Customizable, low latency | Development complexity | Specialized workloads |

GPU Optimization


| Optimization | Description | Impact |
| --- | --- | --- |
| Tensor Cores | Use FP16/INT8 tensor operations | 2-8x speedup |
| CUDA graphs | Reduce kernel launch overhead | 1.5-2x for small models |
| Multi-stream | Parallel execution | Higher throughput |
| Memory pooling | Reduce allocation overhead | Lower latency variance |

Edge Deployment


Edge Constraints


┌─────────────────────────────────────────────────────────────────────┐
│                      Edge Deployment Constraints                    │
├─────────────────────────────────────────────────────────────────────┤
│                                                                     │
│  Resource Constraints:                                              │
│  ├── Memory: 1-4 GB (vs. 64+ GB cloud)                             │
│  ├── Compute: 1-10 TOPS (vs. 100+ TFLOPS cloud)                    │
│  ├── Power: 5-15W (vs. 300W+ cloud)                                │
│  └── Storage: 16-128 GB (vs. TB cloud)                             │
│                                                                     │
│  Operational Constraints:                                           │
│  ├── No network (offline operation)                                 │
│  ├── Variable ambient conditions                                    │
│  ├── Infrequent updates                                            │
│  └── Long deployment lifetime                                       │
│                                                                     │
└─────────────────────────────────────────────────────────────────────┘

Edge Optimization Strategies


| Strategy | Description | Use When |
| --- | --- | --- |
| Model selection | Use edge-native models (MobileNet, EfficientNet) | Accuracy acceptable |
| Aggressive quantization | INT8 or lower | Memory/power constrained |
| On-device distillation | Distill to tiny model | Extreme constraints |
| Split inference | Edge preprocessing, cloud inference | Network available |
| Model caching | Cache results locally | Repeated queries |
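A back-of-the-envelope check makes the memory constraint concrete and shows why aggressive quantization is usually the first lever on the edge (weight storage only; activations, KV cache, and runtime overhead are extra):

```python
def weight_memory_mb(num_params: int, bits_per_weight: int) -> float:
    """Rough weight-only footprint of a model in megabytes."""
    return num_params * bits_per_weight / 8 / 1e6

# A 100M-parameter model: 400 MB at FP32 vs. 100 MB at INT8 --
# only the quantized version fits comfortably in a 1 GB edge memory budget.
```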

Edge ML Frameworks


| Framework | Platform | Features |
| --- | --- | --- |
| TensorFlow Lite | Android, iOS, embedded | Quantization, delegates |
| Core ML | iOS, macOS | Neural Engine optimization |
| ONNX Runtime Mobile | Cross-platform | Broad model support |
| PyTorch Mobile | Android, iOS | Familiar API |
| TensorRT | NVIDIA Jetson | Maximum performance |

Latency Profiling


Profiling Methodology


┌─────────────────────────────────────────────────────────────────────┐
│                    Latency Breakdown Analysis                       │
├─────────────────────────────────────────────────────────────────────┤
│                                                                     │
│  1. Data Loading:          ████████░░░░░░░░░░  15%                 │
│  2. Preprocessing:         ██████░░░░░░░░░░░░  10%                 │
│  3. Model Inference:       ████████████████░░  60%                 │
│  4. Postprocessing:        ████░░░░░░░░░░░░░░   8%                 │
│  5. Response Serialization:███░░░░░░░░░░░░░░░   7%                 │
│                                                                     │
│  Target: Model inference (60% = biggest optimization opportunity)  │
│                                                                     │
└─────────────────────────────────────────────────────────────────────┘
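A framework-agnostic way to produce a breakdown like the one above is to accumulate wall-clock time per stage; a minimal sketch (the tools in the next table give operator-level detail within the model-inference slice):

```python
import time
from collections import defaultdict
from contextlib import contextmanager

class StageTimer:
    """Accumulate wall-clock time per pipeline stage to find the biggest target."""

    def __init__(self):
        self.totals = defaultdict(float)

    @contextmanager
    def stage(self, name):
        start = time.perf_counter()
        try:
            yield
        finally:
            self.totals[name] += time.perf_counter() - start

    def breakdown(self):
        """Fraction of total measured time spent in each stage."""
        total = sum(self.totals.values()) or 1.0
        return {name: t / total for name, t in self.totals.items()}
```

Wrap each stage of a request (`with timer.stage("inference"): ...`) and optimize whichever fraction dominates; if model inference is only 30% of the wall clock, no amount of quantization will halve end-to-end latency.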

Profiling Tools


| Tool | Use For |
| --- | --- |
| PyTorch Profiler | PyTorch model profiling |
| TensorBoard | TensorFlow visualization |
| NVIDIA Nsight | GPU profiling |
| Chrome Tracing | General timeline visualization |
| perf | CPU profiling |

Key Metrics


| Metric | Description | Target |
| --- | --- | --- |
| P50 latency | Median latency | < SLA |
| P99 latency | Tail latency | < 2x P50 |
| Throughput | Requests/second | Meet demand |
| GPU utilization | Compute usage | > 80% |
| Memory bandwidth | Memory usage | < limit |
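The latency targets in the table can be checked directly from recorded samples (a sketch; `np.percentile` interpolates linearly between samples by default):

```python
import numpy as np

def latency_report(samples_ms, sla_ms):
    """Summarize recorded request latencies against P50/P99 targets."""
    a = np.asarray(samples_ms, dtype=float)
    p50 = float(np.percentile(a, 50))
    p99 = float(np.percentile(a, 99))
    return {
        "p50_ms": p50,
        "p99_ms": p99,
        "meets_sla": p50 < sla_ms,
        "tail_ok": p99 < 2 * p50,   # the "< 2x P50" rule of thumb above
    }
```

Reporting P99 alongside P50 matters because batching and GC pauses inflate the tail long before they move the median.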

Optimization Workflow


Systematic Approach


┌─────────────────────────────────────────────────────────────────────┐
│                  Optimization Workflow                              │
├─────────────────────────────────────────────────────────────────────┤
│                                                                     │
│  1. Baseline                                                        │
│     └── Measure current performance (latency, throughput, accuracy) │
│                                                                     │
│  2. Profile                                                         │
│     └── Identify bottlenecks (model, data, system)                  │
│                                                                     │
│  3. Optimize (in order of effort/impact):                           │
│     ├── Hardware: Use right accelerator                             │
│     ├── Compiler: Enable optimizations (TensorRT, ONNX)            │
│     ├── Runtime: Batching, caching, async                          │
│     ├── Model: Quantization, pruning                                │
│     └── Architecture: Distillation, model change                    │
│                                                                     │
│  4. Validate                                                        │
│     └── Verify accuracy maintained, latency improved                │
│                                                                     │
│  5. Deploy and Monitor                                              │
│     └── Track real-world performance                                │
│                                                                     │
└─────────────────────────────────────────────────────────────────────┘

Optimization Priority Matrix


                    High Impact
    Compiler Opts    ────┼──── Quantization
    (easy win)           │     (best ROI)
Low Effort ──────────────┼──────────────── High Effort
    Batching         ────┼──── Distillation
    (quick win)          │     (major effort)
                    Low Impact

Common Patterns


Multi-Model Serving


┌─────────────────────────────────────────────────────────────────────┐
│                                                                     │
│  Request → ┌─────────┐                                              │
│            │ Router  │                                              │
│            └─────────┘                                              │
│               │   │   │                                             │
│      ┌────────┘   │   └────────┐                                    │
│      ▼            ▼            ▼                                    │
│  ┌───────┐   ┌───────┐   ┌───────┐                                 │
│  │ Tiny  │   │ Small │   │ Large │                                 │
│  │ <10ms │   │ <50ms │   │<500ms │                                 │
│  └───────┘   └───────┘   └───────┘                                 │
│                                                                     │
│  Routing strategies:                                                │
│  • Complexity-based: Simple→Tiny, Complex→Large                    │
│  • Confidence-based: Try Tiny, escalate if low confidence          │
│  • SLA-based: Route based on latency requirements                  │
│                                                                     │
└─────────────────────────────────────────────────────────────────────┘
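The routing strategies above can be sketched as a small dispatcher. This is a minimal illustration of the confidence-based variant, assuming each model returns a `(label, confidence)` pair; the stub models and the 0.9 threshold are illustrative, not from the original text.

```python
# Minimal sketch of confidence-based routing: try the tiny model first
# and escalate to the large model only when confidence is low. The
# model stubs and the 0.9 threshold are illustrative assumptions.

def route(request, tiny_model, large_model, threshold=0.9):
    """Return (label, tier): fast path when the tiny model is confident."""
    label, confidence = tiny_model(request)
    if confidence >= threshold:
        return label, "tiny"                 # fast path (<10ms tier)
    label, _ = large_model(request)          # escalate to the slow tier
    return label, "large"

# Stub models standing in for real inference calls.
def tiny_model(text):
    conf = 0.95 if "obviously" in text else 0.40
    return "positive", conf

def large_model(text):
    return "negative", 0.99
```

A complexity-based or SLA-based router has the same shape: only the predicate that picks the tier changes.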

Speculative Execution


text
Query: "Translate: Hello"
        ├──▶ Small model (draft): "Bonjour" (5ms)
        └──▶ Large model (verify): Check "Bonjour" (10ms parallel)
             ├── Accept: Return immediately
             └── Reject: Generate with large model

Typical speedup: 2-3x when the draft acceptance rate is high
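One draft-and-verify round from the diagram can be sketched as follows. The `DraftModel`/`VerifyModel` classes are illustrative stubs; the speedup in a real system comes from the verify step being a single (often parallel) pass that is much cheaper than full generation with the large model.

```python
# Sketch of one draft-and-verify round. In a real system the accept
# check is a cheap parallel pass; full generation with the large model
# only happens on rejection.

def speculative_answer(query, draft_model, verify_model):
    draft = draft_model.generate(query)          # fast draft (~5ms)
    if verify_model.accepts(query, draft):       # cheap parallel check
        return draft                             # accept: return immediately
    return verify_model.generate(query)          # reject: slow fallback

# Illustrative stubs standing in for real models.
class DraftModel:
    def generate(self, query):
        return "Bonjour" if "Hello" in query else "???"

class VerifyModel:
    def accepts(self, query, draft):
        return draft == self.generate(query)

    def generate(self, query):
        return "Bonjour" if "Hello" in query else "Salut"
```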

Cascade Models


text
Input → ┌────────┐
        │ Filter │ ← Cheap filter (reject obvious negatives)
        └────────┘
             │ (candidates only)
        ┌────────┐
        │ Stage 1│ ← Fast model (coarse ranking)
        └────────┘
             │ (top-100)
        ┌────────┐
        │ Stage 2│ ← Accurate model (fine ranking)
        └────────┘
             │ (top-10)
         Output

Benefit: often ~10x cheaper with near-identical accuracy, since the expensive model only scores a small candidate set
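The three-stage cascade can be sketched in a few lines. The cut-offs (top-100, top-10) follow the diagram; the filter and scoring functions are illustrative stand-ins for real models of increasing cost.

```python
# Sketch of a filter -> coarse -> fine ranking cascade. Each stage is
# strictly more expensive per item, so cheap stages shrink the set
# before costly ones run.

def cascade_rank(items, cheap_filter, coarse_score, fine_score):
    candidates = [x for x in items if cheap_filter(x)]        # reject obvious negatives
    top100 = sorted(candidates, key=coarse_score, reverse=True)[:100]
    return sorted(top100, key=fine_score, reverse=True)[:10]  # fine ranking
```

The cost saving falls out of the structure: the expensive `fine_score` model is only ever called on the at-most-100 survivors of the earlier stages.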

Optimization Checklist


Pre-Deployment


  • Profile baseline performance
  • Identify primary bottleneck (model, data, system)
  • Apply compiler optimizations (TensorRT, ONNX Runtime)
  • Evaluate quantization (INT8 usually safe)
  • Tune batch size for target throughput
  • Test accuracy after optimization
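The "tune batch size" step above can be sketched as a simple sweep: pick the largest batch whose measured latency stays within budget, which maximizes throughput when latency grows with batch size. Here `measure_latency_ms` is a parameter (in practice it would wrap a timed inference call) so the sweep itself stays deterministic; the function name is an illustrative assumption.

```python
# Sketch of a batch-size sweep: keep the largest batch size whose
# latency fits the budget, tracking the implied throughput.

def tune_batch_size(measure_latency_ms, sizes, budget_ms):
    best = None
    for b in sorted(sizes):
        latency = measure_latency_ms(b)          # e.g. time one run of batch b
        if latency <= budget_ms:
            throughput = b * 1000.0 / latency    # items per second
            best = (b, latency, throughput)      # largest qualifying b wins
    return best
```

For example, with a toy model whose latency grows 2 ms per item, a 10 ms budget selects batch size 4.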

Deployment


  • Configure appropriate hardware
  • Enable caching where applicable
  • Set up monitoring (latency, throughput, errors)
  • Configure auto-scaling policies
  • Implement graceful degradation
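For the "enable caching" step, a minimal result cache memoizes predictions keyed by a hash of the input, with LRU eviction. The capacity and eviction policy here are illustrative choices, not prescriptions from the original text.

```python
# Sketch of an inference result cache: hash the input, return the
# cached prediction on a hit, and evict least-recently-used entries
# once capacity is exceeded.
import hashlib
from collections import OrderedDict

class InferenceCache:
    def __init__(self, capacity=10_000):
        self.capacity = capacity
        self._store = OrderedDict()

    def _key(self, inp):
        return hashlib.sha256(repr(inp).encode()).hexdigest()

    def get_or_compute(self, inp, predict):
        k = self._key(inp)
        if k in self._store:
            self._store.move_to_end(k)           # refresh LRU position
            return self._store[k]
        result = predict(inp)                    # miss: run the model
        self._store[k] = result
        if len(self._store) > self.capacity:
            self._store.popitem(last=False)      # evict least recently used
        return result
```

Caching like this only pays off when inputs repeat; the cache hit rate tracked post-deployment tells you whether it is earning its memory.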

Post-Deployment


  • Monitor p99 latency
  • Track accuracy metrics
  • Analyze cache hit rates
  • Review cost efficiency
  • Plan iterative improvements
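The "monitor p99 latency" item can be sketched as a nearest-rank percentile over a sliding window of recent request latencies. The window size is an illustrative choice; production systems often prefer streaming sketches (e.g. t-digest) over sorting.

```python
# Sketch of a sliding-window latency monitor using the nearest-rank
# percentile method.
import math
from collections import deque

class LatencyMonitor:
    def __init__(self, window=1000):
        self.samples = deque(maxlen=window)      # keep only recent requests

    def record(self, latency_ms):
        self.samples.append(latency_ms)

    def percentile(self, p):
        """Nearest-rank percentile, e.g. p=99 for the p99 latency."""
        ordered = sorted(self.samples)
        rank = math.ceil(p / 100 * len(ordered))
        return ordered[rank - 1]
```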

Related Skills


  • llm-serving-patterns
    - LLM-specific serving optimization
  • ml-system-design
    - End-to-end ML pipeline design
  • quality-attributes-taxonomy
    - Performance as quality attribute
  • estimation-techniques
    - Capacity planning for ML systems

Version History


  • v1.0.0 (2025-12-26): Initial release - ML inference optimization patterns


Last Updated


Date: 2025-12-26