ml-model-training


ML Model Training

Train machine learning models with proper data handling and evaluation.

Training Workflow

  1. Data Preparation → 2. Feature Engineering → 3. Model Selection → 4. Training → 5. Evaluation

Data Preparation

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, LabelEncoder

# Load and clean data
df = pd.read_csv('data.csv')
df = df.dropna()

# Encode categorical variables
le = LabelEncoder()
df['category'] = le.fit_transform(df['category'])

# Split data (70/15/15)
X = df.drop('target', axis=1)
y = df['target']
X_train, X_temp, y_train, y_temp = train_test_split(X, y, test_size=0.3, random_state=42)
X_val, X_test, y_val, y_test = train_test_split(X_temp, y_temp, test_size=0.5, random_state=42)

# Scale features (fit on train only; see Data Leakage below)
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_val = scaler.transform(X_val)
X_test = scaler.transform(X_test)
```

Scikit-learn Training

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, accuracy_score

model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train, y_train)

y_pred = model.predict(X_val)
print(f"Validation accuracy: {accuracy_score(y_val, y_pred):.3f}")
print(classification_report(y_val, y_pred))
```

PyTorch Training

```python
import torch
import torch.nn as nn

class Model(nn.Module):
    def __init__(self, input_dim):
        super().__init__()
        self.layers = nn.Sequential(
            nn.Linear(input_dim, 64),
            nn.ReLU(),
            nn.Dropout(0.3),
            nn.Linear(64, 32),
            nn.ReLU(),
            nn.Linear(32, 1),
            nn.Sigmoid()
        )

    def forward(self, x):
        return self.layers(x)

# Convert the scaled NumPy arrays to tensors
X_train_tensor = torch.tensor(X_train, dtype=torch.float32)
y_train_tensor = torch.tensor(y_train.values, dtype=torch.float32).unsqueeze(1)

model = Model(X_train.shape[1])
optimizer = torch.optim.Adam(model.parameters(), lr=0.001)
criterion = nn.BCELoss()

for epoch in range(100):
    model.train()
    optimizer.zero_grad()
    output = model(X_train_tensor)
    loss = criterion(output, y_train_tensor)
    loss.backward()
    optimizer.step()
```
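Before measuring validation performance, the model should be switched to eval mode so dropout is disabled. A minimal sketch of that step, using a hypothetical stand-in model and validation tensors (names like `X_val_tensor` and the tiny `nn.Sequential` are illustrative assumptions, not part of the original):

```python
import torch
import torch.nn as nn

torch.manual_seed(42)

# Hypothetical stand-ins for the trained model and validation tensors above
model = nn.Sequential(nn.Linear(4, 1), nn.Sigmoid())
X_val_tensor = torch.randn(8, 4)
y_val_tensor = torch.randint(0, 2, (8, 1)).float()

# Disable dropout and gradient tracking for evaluation
model.eval()
with torch.no_grad():
    probs = model(X_val_tensor)     # sigmoid probabilities in (0, 1)
    preds = (probs >= 0.5).float()  # threshold at 0.5
    accuracy = (preds == y_val_tensor).float().mean().item()
```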

Evaluation Metrics

| Task           | Metrics                                  |
|----------------|------------------------------------------|
| Classification | Accuracy, Precision, Recall, F1, AUC-ROC |
| Regression     | MSE, RMSE, MAE, R²                       |
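The regression row can be made concrete with a small worked example (the toy `y_true`/`y_pred` arrays are illustrative, not from the original):

```python
import numpy as np
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score

y_true = np.array([3.0, 2.5, 4.0, 5.0])
y_pred = np.array([2.8, 2.7, 3.6, 5.2])

mse = mean_squared_error(y_true, y_pred)   # mean of squared errors -> 0.07
rmse = np.sqrt(mse)                        # same units as the target
mae = mean_absolute_error(y_true, y_pred)  # mean of absolute errors -> 0.25
r2 = r2_score(y_true, y_pred)              # 1 - SS_res / SS_tot
```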

Complete Framework Examples

  • PyTorch: See references/pytorch-training.md for complete training with:
    • Custom model classes with BatchNorm and Dropout
    • Training/validation loops with early stopping
    • Learning rate scheduling
    • Model checkpointing
    • Full evaluation with classification report
  • TensorFlow/Keras: See references/tensorflow-keras.md for:
    • Sequential model architecture
    • Callbacks (EarlyStopping, ReduceLROnPlateau, ModelCheckpoint, TensorBoard)
    • Training history visualization
    • TFLite conversion for mobile deployment
    • Custom training loops

Best Practices

Do:
  • Use cross-validation for robust evaluation
  • Track experiments with MLflow
  • Save model checkpoints regularly
  • Monitor for overfitting
  • Document hyperparameters
  • Use 70/15/15 train/val/test split
Don't:
  • Train without a validation set
  • Ignore class imbalance
  • Skip feature scaling
  • Use test set for hyperparameter tuning
  • Forget to set random seeds
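The first "Do" item, cross-validation, can be sketched in a few lines (the synthetic dataset from `make_classification` stands in for real features and targets):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Synthetic data stands in for the real features/target
X, y = make_classification(n_samples=200, n_features=10, random_state=42)

model = RandomForestClassifier(n_estimators=100, random_state=42)
# 5-fold cross-validation: five accuracy scores, one per held-out fold
scores = cross_val_score(model, X, y, cv=5, scoring='accuracy')
mean_acc, std_acc = scores.mean(), scores.std()
```

Reporting the mean plus or minus the standard deviation across folds gives a more robust estimate than a single validation split.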

Known Issues Prevention

1. Data Leakage

Problem: Scaling or transforming data before splitting leaks test set information into training.
Solution: Always split the data first, then fit transformers only on the training data:

```python
# ✅ Correct: fit on train, transform train/val/test
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_val = scaler.transform(X_val)    # Only transform
X_test = scaler.transform(X_test)  # Only transform

# ❌ Wrong: fitting on all data
X_all = scaler.fit_transform(X)  # Leaks test info!
```
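One way to make the correct pattern hard to get wrong is a scikit-learn Pipeline: the scaler is fit only on whatever data `fit` receives, so test statistics never leak in. A sketch with synthetic data (the `make_classification` dataset is illustrative):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=200, n_features=8, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

pipe = Pipeline([
    ('scaler', StandardScaler()),  # fit on training data only
    ('clf', RandomForestClassifier(n_estimators=100, random_state=42)),
])
pipe.fit(X_train, y_train)   # no test-set statistics leak into the scaler
score = pipe.score(X_test, y_test)
```

This also plays well with `cross_val_score` and `GridSearchCV`, which refit the whole pipeline per fold.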

2. Class Imbalance Ignored

Problem: Training on imbalanced datasets (e.g., 95% class A, 5% class B) yields models that predict only the majority class.
Solution: Use class weights or resampling:

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight
from imblearn.over_sampling import SMOTE

# Compute class weights
class_weights = compute_class_weight('balanced', classes=np.unique(y_train), y=y_train)
model = RandomForestClassifier(class_weight='balanced')

# Or use SMOTE to oversample the minority class
smote = SMOTE()
X_resampled, y_resampled = smote.fit_resample(X_train, y_train)
```
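To see what the 'balanced' strategy actually computes, here is a small worked example (the 90/10 label array is hypothetical):

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Hypothetical imbalanced labels: 90 samples of class 0, 10 of class 1
y = np.array([0] * 90 + [1] * 10)

weights = compute_class_weight('balanced', classes=np.unique(y), y=y)
weight_dict = dict(zip(np.unique(y), weights))
# 'balanced' weight = n_samples / (n_classes * count_per_class)
# class 0: 100 / (2 * 90) ≈ 0.556, class 1: 100 / (2 * 10) = 5.0
```

The minority class gets a proportionally larger weight, so each misclassified minority sample costs the model more during training.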

3. Overfitting Due to No Regularization

Problem: Complex models memorize the training data and perform poorly on validation/test sets.
Solution: Add regularization techniques:

```python
# Dropout in PyTorch
nn.Dropout(0.3)

# Complexity constraints in scikit-learn (depth and split limits regularize trees)
RandomForestClassifier(max_depth=10, min_samples_split=20)

# Early stopping in Keras
from tensorflow.keras.callbacks import EarlyStopping
early_stop = EarlyStopping(monitor='val_loss', patience=10, restore_best_weights=True)
model.fit(X_train, y_train, validation_data=(X_val, y_val), callbacks=[early_stop])
```

4. Not Setting Random Seeds

Problem: Results are not reproducible across runs, making debugging and comparison impossible.
Solution: Set all random seeds:

```python
import random
import numpy as np
import torch

random.seed(42)
np.random.seed(42)
torch.manual_seed(42)
if torch.cuda.is_available():
    torch.cuda.manual_seed_all(42)
```

5. Using Test Set for Hyperparameter Tuning

Problem: Optimizing hyperparameters on the test set overfits the model to the test data.
Solution: Tune with cross-validation on the training set; reserve the test set for final evaluation:

```python
from sklearn.model_selection import GridSearchCV

# ✅ Correct: tune via cross-validation on the training set, evaluate on test
param_grid = {'n_estimators': [50, 100, 200], 'max_depth': [5, 10, 15]}
grid_search = GridSearchCV(RandomForestClassifier(), param_grid, cv=5)
grid_search.fit(X_train, y_train)  # Cross-validation on training set
best_model = grid_search.best_estimator_

# Final evaluation on the held-out test set
final_score = best_model.score(X_test, y_test)
```

When to Load References

Load reference files when you need:
  • PyTorch implementation details: Load references/pytorch-training.md for complete training loops with early stopping, learning rate scheduling, and checkpointing
  • TensorFlow/Keras patterns: Load references/tensorflow-keras.md for callback usage, custom training loops, and mobile deployment with TFLite