ml-best-practices

ML Best Practices


I want to read a story about the data, not just run code. Ensure every code cell is followed by a markdown cell analyzing the results. End the notebook with a summary comprehensively answering the prompt.
If there is a good match between the user's request and a corresponding example plan, then adapt the example plan to fully answer the user's request:

Clustering:


Identify distinct groups based on their features.
  • Understand the schema and field descriptions.
  • Visualize features referenced in the prompt (e.g., with histograms, scatterplots).
  • Transform dates into timestamps.
  • Before applying encoders, check if the dataset already contains pre-encoded features and prefer existing numerical representations.
  • Prefer to keep data instead of dropping it when possible.
  • Transform ordinal data with an ordinal encoder.
  • Transform nominal data with a one-hot encoder.
  • Standardize numerical features.
  • Perform clustering over a range of cluster counts, and record the silhouette score for each.
  • Choose the optimal number of clusters based on the silhouette score.
  • Use dimensionality reduction (e.g., PCA) to project the data into two dimensions.
  • Scatterplot the samples in two dimensions with cluster labels as the hue.
  • Scatterplot the samples in two dimensions with a discrete feature as the hue.
  • Describe the clusters in text by feature distributions or typical feature values.
  • Conclusion: comprehensively answer the prompt in a final markdown cell.
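The model-selection and projection steps of this plan can be sketched as follows. This is a minimal illustration, not a prescribed implementation: the synthetic `make_blobs` data stands in for a real, already-encoded feature table, and KMeans is an assumed choice of clustering algorithm (the plan itself does not name one).

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.decomposition import PCA
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler

# Hypothetical data: three well-separated groups in four numeric features.
centers = [[-10, -10, 0, 0], [0, 10, 5, -5], [10, -10, -5, 5]]
X, _ = make_blobs(n_samples=300, centers=centers, cluster_std=1.0, random_state=0)

# Standardize numerical features before distance-based clustering.
X_scaled = StandardScaler().fit_transform(X)

# Cluster over a range of k and record the silhouette score for each.
scores = {}
for k in range(2, 7):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X_scaled)
    scores[k] = silhouette_score(X_scaled, labels)

# Choose the k with the highest silhouette score and refit.
best_k = max(scores, key=scores.get)
final_labels = KMeans(n_clusters=best_k, n_init=10, random_state=0).fit_predict(X_scaled)

# Project to two dimensions for plotting; `coords` and `final_labels`
# can then be passed to a scatterplot with the labels as the hue.
coords = PCA(n_components=2).fit_transform(X_scaled)
```

On real data the silhouette curve is often flatter than in this toy case, so it is worth reporting the full `scores` dictionary rather than only the winning value.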

Time Series Forecasting:


Develop a predictive model to estimate future values based on historical trends. How might different modeling approaches impact the prediction accuracy?
  • Understand the schema and field descriptions.
  • Visualize the target feature over time at a reasonable granularity.
  • Always perform a chronological split on the data to create training, validation, and test sets.
  • Are there seasonal trends?
  • Test for stationarity.
  • Discuss possible modeling approaches. How might different modeling approaches impact the prediction accuracy?
  • Train two time series forecasting models to predict the target feature. Use the seasonality and stationarity findings from the earlier steps to inform model hyperparameters (e.g., seasonal period, differencing order).
  • Predict the target feature for the training and validation sets.
  • Optionally, tune model hyperparameters using the validation set.
  • Visualize the actual and predicted target feature vs time for each model on the training and validation sets.
  • Evaluate the validation performance with error metrics.
  • Select a model.
  • Retrain the selected model on the training and validation sets.
  • Predict the test values with the selected model.
  • Visualize the actual target feature and the predicted test values.
  • Conclusion: comprehensively answer the prompt in a final markdown cell.
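The split, compare, select, and retrain steps of this plan might look like the sketch below. Everything here is illustrative: the synthetic monthly-style series, the 70/15/15 split fractions, and the two deliberately simple forecasters (seasonal naive and historical mean) are assumptions chosen to keep the example dependency-light, not recommended models.

```python
import numpy as np

def chronological_split(series, train_frac=0.7, val_frac=0.15):
    """Split a time-ordered array into train/val/test without shuffling."""
    n = len(series)
    train_end = round(n * train_frac)
    val_end = train_end + round(n * val_frac)
    return series[:train_end], series[train_end:val_end], series[val_end:]

def seasonal_naive_forecast(history, horizon, period):
    """Repeat the last full season of `history` for `horizon` steps."""
    last_season = history[-period:]
    reps = int(np.ceil(horizon / period))
    return np.tile(last_season, reps)[:horizon]

def mean_forecast(history, horizon):
    """Predict the historical mean for every future step."""
    return np.full(horizon, history.mean())

def mae(actual, predicted):
    return float(np.mean(np.abs(actual - predicted)))

# Hypothetical series: period-12 seasonality plus noise.
rng = np.random.default_rng(0)
t = np.arange(200)
series = 10 + 3 * np.sin(2 * np.pi * t / 12) + rng.normal(0, 0.3, size=t.size)

train, val, test = chronological_split(series)

# Evaluate both candidates on the validation set only.
val_mae = {
    "seasonal_naive": mae(val, seasonal_naive_forecast(train, len(val), period=12)),
    "mean": mae(val, mean_forecast(train, len(val))),
}
selected = min(val_mae, key=val_mae.get)

# Retrain (here: extend the history) on training + validation, then predict test.
history = np.concatenate([train, val])
if selected == "seasonal_naive":
    test_pred = seasonal_naive_forecast(history, len(test), period=12)
else:
    test_pred = mean_forecast(history, len(test))
test_mae = mae(test, test_pred)
```

The test set is touched exactly once, after model selection, which is what keeps the final error estimate honest.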

Exploratory Data Analysis / Anomaly Detection:


Identify and describe any outliers, unusual patterns, or significant trends observed in the data. Provide visualizations to support your findings.
  • Understand the schema and field descriptions.
  • Visualize the target feature distribution in a way that shows outliers.
  • Identify and describe any outliers in the target feature.
  • Visualize relationships between the target feature and other features.
  • Identify and describe unusual patterns or significant trends.
  • Visualize patterns and trends.
  • Conclusion: comprehensively answer the prompt in a final markdown cell.
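One common way to make "identify outliers in the target feature" concrete is Tukey's interquartile-range rule, sketched below. The rule and its conventional multiplier of 1.5 are an assumption for illustration; the plan does not mandate a specific outlier definition, and the sample values are made up.

```python
import numpy as np

def iqr_outlier_mask(values, k=1.5):
    """Flag values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, q3 = np.percentile(values, [25, 75])
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return (values < lower) | (values > upper)

# Hypothetical target values with one obvious outlier.
values = np.array([10, 12, 11, 13, 12, 11, 10, 95, 12, 11], dtype=float)
mask = iqr_outlier_mask(values)
outliers = values[mask]
```

A box plot of the same column uses the same fences by default in most plotting libraries, so the flagged points should match what the visualization shows.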

Classification:


Given the data, can we classify by the target feature?
  • Understand the schema and field descriptions.
  • Identify rows that don't make sense. How many are there and what do they contain?
  • Identify rows without a target value. How many are there and what do they contain?
  • Drop rows that don't match the schema or don't have the target value (if it is reasonable to do so).
  • Split data into training, validation, and test sets.
  • Create features to represent when data are missing, if this is meaningful.
  • Handle missing data. Prefer to keep data instead of dropping it when possible.
  • Before applying encoders, check if the dataset already contains pre-encoded features and prefer existing numerical representations.
  • Transform ordinal data with an ordinal encoder.
  • Transform nominal data with a one-hot encoder.
  • Standardize numerical features.
  • Train multiple models.
  • If there is evidence of overfitting, regularize and retrain the model.
  • If there is evidence of underfitting, consider adding or engineering features.
  • Evaluate the models.
  • Create confusion matrices.
  • Conclusion: comprehensively answer the prompt in a final markdown cell.
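The encoding, scaling, training, and confusion-matrix steps can be combined in a single scikit-learn pipeline, as in this sketch. The toy DataFrame, its column names (`size`, `color`, `weight`, `label`), and the choice of logistic regression are all hypothetical. For brevity it uses a single train/test split rather than the three-way split the plan calls for.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder, StandardScaler

# Hypothetical data: one ordinal, one nominal, and one numeric feature.
df = pd.DataFrame({
    "size": ["small", "medium", "large"] * 20,
    "color": ["red", "blue", "green", "blue"] * 15,
    "weight": [float(i % 10) for i in range(60)],
})
df["label"] = (df["weight"] >= 5).astype(int)

# Split BEFORE fitting any preprocessing.
X_train, X_test, y_train, y_test = train_test_split(
    df.drop(columns="label"), df["label"],
    test_size=0.25, stratify=df["label"], random_state=0,
)

preprocess = ColumnTransformer([
    # The explicit category order is what makes the encoding ordinal.
    ("ordinal", OrdinalEncoder(categories=[["small", "medium", "large"]]), ["size"]),
    ("nominal", OneHotEncoder(handle_unknown="ignore"), ["color"]),
    ("numeric", StandardScaler(), ["weight"]),
])

model = Pipeline([
    ("preprocess", preprocess),
    ("clf", LogisticRegression(max_iter=1000)),
])

# Fitting the pipeline fits encoders and scaler on training rows only.
model.fit(X_train, y_train)
cm = confusion_matrix(y_test, model.predict(X_test))
```

Wrapping preprocessing and estimator in one `Pipeline` means the test rows are only ever transformed, never fitted on.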

Regression:


Predict the continuous valued target feature.
  • Understand the schema and field descriptions.
  • Identify rows that don't make sense. How many are there and what do they contain?
  • Identify rows without a target value. How many are there and what do they contain?
  • Develop an understanding of the data and determine how to handle missing values. This should make sense in the business context.
  • Identify any potential sources of group leakage. Aggregate where appropriate to prevent this.
  • Visualize the target feature.
  • Split data into training, validation, and test sets.
  • Handle missing data. Prefer to keep data instead of dropping it when possible.
  • Before applying encoders, check if the dataset already contains pre-encoded features and prefer existing numerical representations.
  • Transform ordinal data with an ordinal encoder.
  • Transform nominal data with a one-hot encoder. Restrict high-cardinality categorical features to a tractable size.
  • Standardize numerical features.
  • Train multiple models.
  • Visualize the actual vs predicted values on training and validation data.
  • If there is evidence of overfitting, regularize and retrain the model.
  • If there is evidence of underfitting, consider adding or engineering features.
  • Evaluate the model error.
  • Conclusion: comprehensively answer the prompt in a final markdown cell.
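Restricting a high-cardinality categorical feature before one-hot encoding can be done by keeping only the most frequent training categories and bucketing the rest. The helper below is a sketch under that assumption: `cap_cardinality`, the `"__other__"` label, and the sample values are illustrative names, not part of any library.

```python
import pandas as pd

def cap_cardinality(train_col, other_col, top_n, other_label="__other__"):
    """Keep the top_n most frequent TRAINING categories; bucket the rest.

    The kept set is learned from the training column only, then applied
    unchanged to any other split, so no information leaks from it.
    """
    keep = set(train_col.value_counts().nlargest(top_n).index)

    def cap(col):
        return col.where(col.isin(keep), other_label)

    return cap(train_col), cap(other_col)

# Hypothetical categorical column in a training and a validation split.
train_city = pd.Series(["a", "a", "a", "b", "b", "c", "d"])
val_city = pd.Series(["a", "e", "b", "c"])

train_capped, val_capped = cap_cardinality(train_city, val_city, top_n=2)
```

Unseen validation categories (such as `"e"` here) fall into the same bucket as rare training categories, so the downstream one-hot encoder sees a fixed, small vocabulary.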

Comparing ML Models:


Evaluate and compare multiple models to determine which is most suitable for production based on predictive power, robustness, and viability.
  • Understand the schema and align metrics with business goals (e.g., cost of false positives vs. false negatives).
  • Establish baselines: define a naive baseline (majority class/mean) and a simple ML baseline (e.g., Logistic/Linear Regression).
  • Ensure rigorous validation: use identical, fixed data splits for all models and perform $k$-fold cross-validation.
  • If data is temporal, use chronological splits for validation.
  • Select and report metrics beyond accuracy (e.g., F1-Score, PR-AUC, MAE, RMSE) that reflect business impact.
  • Use bootstrapping to calculate 95% confidence intervals for key metrics, and use them to judge whether differences between models are statistically meaningful.
  • Perform slice-based error analysis: evaluate model performance across key subpopulations and demographics to identify bias or specific failure modes.
  • Inspect and compare confusion matrices, residual plots, and calibration curves.
  • Evaluate operational trade-offs: consider inference latency, training time, compute cost, and model size.
  • Assess interpretability using tools like SHAP or LIME where transparency is required.
  • Conclusion: Recommend the optimal model for the specific use case, justifying the choice with both performance and production viability.
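The bootstrapped confidence interval step can be sketched with a percentile bootstrap over the evaluation set. The function name `bootstrap_ci`, the 2,000 resamples, and the synthetic labels are assumptions for illustration; other bootstrap variants (e.g., BCa) exist and may be preferable for small samples.

```python
import numpy as np

def bootstrap_ci(y_true, y_pred, metric, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for metric(y_true, y_pred)."""
    rng = np.random.default_rng(seed)
    n = len(y_true)
    stats = np.empty(n_boot)
    for b in range(n_boot):
        # Resample evaluation rows with replacement, keeping pairs aligned.
        idx = rng.integers(0, n, size=n)
        stats[b] = metric(y_true[idx], y_pred[idx])
    lower, upper = np.quantile(stats, [alpha / 2, 1 - alpha / 2])
    return float(lower), float(upper)

def accuracy(y_true, y_pred):
    return float(np.mean(y_true == y_pred))

# Hypothetical evaluation set: predictions are wrong on every fifth row.
y_true = np.arange(500) % 2
y_pred = y_true.copy()
y_pred[::5] = 1 - y_pred[::5]

point_estimate = accuracy(y_true, y_pred)
lower, upper = bootstrap_ci(y_true, y_pred, accuracy)
```

When comparing two models, running the same procedure on the per-resample difference in the metric (using the same `idx` for both models) gives an interval for the gap itself, which is usually more informative than checking whether two separate intervals overlap.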

No match:


  • Understand the schema and field descriptions.
  • Identify rows that don't make sense. How many are there and what do they contain?
  • Identify rows without a target value. How many are there and what do they contain?
  • Drop rows that don't match the schema or don't have the target value (if it is reasonable to do so).
  • Create features to represent when data are missing, if this is meaningful.
  • Handle missing data. Prefer to keep data instead of dropping it when possible.
  • Before applying encoders, check if the dataset already contains pre-encoded features and prefer existing numerical representations.
  • Transform ordinal data with an ordinal encoder.
  • Transform nominal data with a one-hot encoder.
  • Standardize numerical features.
  • Conclusion: comprehensively answer the prompt in a final markdown cell.

Essential ML Practices


[!IMPORTANT] ALWAYS follow these ML practices
  • Strict Featurization Ordering: For supervised learning ALWAYS split the dataset into training and test data BEFORE fitting preprocessing pipelines (e.g. scaling, encoding). Fit the pipelines on the training data only, then use the fitted pipelines to transform the test data.
  • Handling Missing or NULL Values: ALWAYS check for and handle missing and NULL values. First, analyze their frequency. Then, decide whether to keep them, drop them or impute them with a contextually appropriate value, and explain your reasoning.
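Both practices can be illustrated together: measure missingness first, split, then fit the imputer and scaler on the training rows only. The synthetic data, the 10% missing rate, and median imputation are assumptions for this sketch; the right imputation strategy depends on the dataset and should be justified in context.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical numeric features with roughly 10% of values missing.
rng = np.random.default_rng(0)
X = rng.normal(loc=50.0, scale=10.0, size=(200, 3))
X[rng.random(X.shape) < 0.1] = np.nan
y = rng.integers(0, 2, size=200)

# Step 1: analyze how frequent missing values are, per column.
missing_rate = np.isnan(X).mean(axis=0)

# Step 2: split BEFORE fitting any preprocessing.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# Step 3: fit on training data only, then transform the test data.
preprocess = Pipeline([
    ("impute", SimpleImputer(strategy="median")),
    ("scale", StandardScaler()),
])
X_train_prepared = preprocess.fit_transform(X_train)
X_test_prepared = preprocess.transform(X_test)  # transform only, no fit
```

Calling `fit` or `fit_transform` on the test split would let test-set statistics influence the preprocessing, which is the leakage this ordering rule is meant to prevent.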