data-scientist
Data Scientist
Purpose
Provides statistical analysis and predictive modeling expertise specializing in machine learning, experimental design, and causal inference. Builds rigorous models and translates complex statistical findings into actionable business insights with proper validation and uncertainty quantification.
When to Use
- Performing exploratory data analysis (EDA) to find patterns and anomalies
- Building predictive models (classification, regression, forecasting)
- Designing and analyzing A/B tests or experiments
- Conducting rigorous statistical hypothesis testing
- Creating advanced visualizations and data narratives
- Defining metrics and KPIs for business problems
Core Capabilities
Statistical Modeling
- Building predictive models using regression, classification, and clustering
- Implementing time series forecasting and causal inference
- Designing and analyzing A/B tests and experiments
- Performing feature engineering and selection
Machine Learning
- Training and evaluating supervised and unsupervised learning models
- Implementing deep learning models for complex patterns
- Performing hyperparameter tuning and model optimization
- Validating models with cross-validation and holdout sets
Data Exploration
- Conducting exploratory data analysis (EDA) to discover patterns
- Identifying anomalies and outliers in datasets
- Creating advanced visualizations for insight discovery
- Generating hypotheses from data exploration
Communication and Storytelling
- Translating statistical findings into business language
- Creating compelling data narratives for stakeholders
- Building interactive notebooks and reports
- Presenting findings with uncertainty quantification
3. Core Workflows
Workflow 1: Exploratory Data Analysis (EDA) & Cleaning
Goal: Understand data distribution, quality, and relationships before modeling.
Steps:
- **Load and Profile Data**

```python
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

# Load data
df = pd.read_csv("customer_data.csv")

# Basic profiling
print(df.info())
print(df.describe())

# Missing values analysis
missing = df.isnull().sum() / len(df)
print(missing[missing > 0].sort_values(ascending=False))
```

- **Univariate Analysis (Distributions)**

```python
# Numerical features
num_cols = df.select_dtypes(include=[np.number]).columns
for col in num_cols:
    plt.figure(figsize=(10, 4))
    plt.subplot(1, 2, 1)
    sns.histplot(df[col], kde=True)
    plt.subplot(1, 2, 2)
    sns.boxplot(x=df[col])
    plt.show()

# Categorical features
cat_cols = df.select_dtypes(exclude=[np.number]).columns
for col in cat_cols:
    print(df[col].value_counts(normalize=True))
```

- **Bivariate Analysis (Relationships)**

```python
# Correlation matrix (numeric columns only, to avoid errors on mixed dtypes)
corr = df.corr(numeric_only=True)
sns.heatmap(corr, annot=True, cmap='coolwarm')

# Target vs Features
target = 'churn'
sns.boxplot(x=target, y='tenure', data=df)
```

- **Data Cleaning**

```python
# Impute missing values (assign back rather than using inplace on a column view)
df['age'] = df['age'].fillna(df['age'].median())
df['category'] = df['category'].fillna('Unknown')

# Handle outliers (example: cap at the 99th percentile)
cap = df['income'].quantile(0.99)
df['income'] = np.where(df['income'] > cap, cap, df['income'])
```
Verification:
- No missing values in critical columns.
- Distributions understood (normal vs skewed).
- Target variable balance checked.
Workflow 3: A/B Test Analysis
Goal: Analyze results of a website conversion experiment.
Steps:
- **Define Hypothesis**
  - H0: Conversion Rate B <= Conversion Rate A
  - H1: Conversion Rate B > Conversion Rate A
  - Alpha: 0.05

- **Load and Aggregate Data**

```python
# data columns: ['user_id', 'group', 'converted']
results = df.groupby('group')['converted'].agg(['count', 'sum', 'mean'])
results.columns = ['n_users', 'conversions', 'conversion_rate']
print(results)
```

- **Statistical Test (Proportions Z-test)**

```python
from statsmodels.stats.proportion import proportions_ztest

control = results.loc['A']
treatment = results.loc['B']

count = np.array([treatment['conversions'], control['conversions']])
nobs = np.array([treatment['n_users'], control['n_users']])

stat, p_value = proportions_ztest(count, nobs, alternative='larger')
print(f"Z-statistic: {stat:.4f}")
print(f"P-value: {p_value:.4f}")
```

- **Confidence Intervals**

```python
from statsmodels.stats.proportion import proportion_confint

# The first element of count/nobs is the treatment group,
# so the treatment bounds come first when unpacking.
(lower_treat, lower_con), (upper_treat, upper_con) = proportion_confint(count, nobs, alpha=0.05)
print(f"Control CI: [{lower_con:.4f}, {upper_con:.4f}]")
print(f"Treatment CI: [{lower_treat:.4f}, {upper_treat:.4f}]")
```

- **Conclusion**
  - If p-value < 0.05: reject H0; Variation B is statistically significantly better.
  - Check practical significance (lift magnitude).
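Practical significance can be read directly off the aggregated results. A minimal sketch, using illustrative counts in place of the real aggregates:

```python
# Sketch: absolute and relative lift from aggregated A/B results.
# The counts below are illustrative, not real experiment data.
import pandas as pd

results = pd.DataFrame(
    {'n_users': [10_000, 10_000], 'conversions': [500, 560]},
    index=['A', 'B'],
)
results['conversion_rate'] = results['conversions'] / results['n_users']

absolute_lift = results.loc['B', 'conversion_rate'] - results.loc['A', 'conversion_rate']
relative_lift = absolute_lift / results.loc['A', 'conversion_rate']
print(f"Absolute lift: {absolute_lift:.4f} ({relative_lift:.1%} relative)")
```

Even a statistically significant result may not be worth shipping if the relative lift is below the cost of maintaining the feature.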
Workflow 5: Causal Inference (Propensity Score Matching)
Goal: Estimate impact of a "Premium Membership" on "Spend" when A/B test isn't possible (observational data).
Steps:
- **Problem Setup**
  - Treatment: Premium Member (1) vs Free (0)
  - Outcome: Annual Spend ($)
  - Confounders: Age, Income, Location, Tenure (factors affecting both membership and spend)

- **Calculate Propensity Scores**

```python
from sklearn.linear_model import LogisticRegression

# P(Treatment=1 | Confounders)
confounders = ['age', 'income', 'tenure']
logit = LogisticRegression()
logit.fit(df[confounders], df['is_premium'])
df['propensity_score'] = logit.predict_proba(df[confounders])[:, 1]

# Check overlap (common support)
sns.histplot(data=df, x='propensity_score', hue='is_premium', element='step')
```

- **Matching (Nearest Neighbor)**

```python
from sklearn.neighbors import NearestNeighbors

# Separate groups
treatment = df[df['is_premium'] == 1]
control = df[df['is_premium'] == 0]

# Find the nearest control neighbor for each treated unit
nn = NearestNeighbors(n_neighbors=1, algorithm='ball_tree')
nn.fit(control[['propensity_score']])
distances, indices = nn.kneighbors(treatment[['propensity_score']])

# Create matched dataframe
matched_control = control.iloc[indices.flatten()]

# Compare outcomes; matching treated units to controls estimates
# the Average Treatment effect on the Treated (ATT)
att = treatment['spend'].mean() - matched_control['spend'].mean()
print(f"Average Treatment Effect on the Treated (ATT): ${att:.2f}")
```

- **Validation (Balance Check)**
  - Check that confounders are balanced after matching (e.g., mean age of the treatment vs matched control group should be similar).
  - Rule of thumb (Standardized Mean Difference): `abs(mean_diff) / pooled_std < 0.1`
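The SMD rule of thumb above is easy to compute directly. A minimal sketch, using synthetic `treatment` and `matched_control` samples for illustration:

```python
# Sketch: standardized mean difference (SMD) balance check after matching.
# The two DataFrames below are synthetic stand-ins for the real matched groups.
import numpy as np
import pandas as pd

rng = np.random.default_rng(7)
treatment = pd.DataFrame({'age': rng.normal(40, 8, 500),
                          'income': rng.normal(70_000, 9_000, 500)})
matched_control = pd.DataFrame({'age': rng.normal(40.5, 8, 500),
                                'income': rng.normal(69_500, 9_000, 500)})

def smd(a: pd.Series, b: pd.Series) -> float:
    """Absolute mean difference divided by the pooled standard deviation."""
    pooled_std = np.sqrt((a.var() + b.var()) / 2)
    return abs(a.mean() - b.mean()) / pooled_std

for col in ['age', 'income']:
    value = smd(treatment[col], matched_control[col])
    print(f"{col}: SMD = {value:.3f} ({'balanced' if value < 0.1 else 'imbalanced'})")
```

If any confounder exceeds the 0.1 threshold, revisit the propensity model or use caliper matching before trusting the effect estimate.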
5. Anti-Patterns & Gotchas
❌ Anti-Pattern 1: Data Leakage
What it looks like:
- Scaling/Standardizing the entire dataset before train/test split.
- Using future information (e.g., "next_month_churn") as a feature.
- Including target-derived features (e.g., mean target encoding) calculated on the whole set.
Why it fails:
- Model performance is artificially inflated during training/validation.
- Fails completely in production on new, unseen data.
Correct approach:
- Split FIRST, then transform.
- Fit scalers/encoders ONLY on `X_train`, then transform `X_test`.
- Use `Pipeline` objects to ensure safety.
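The leakage-safe pattern can be sketched with scikit-learn. The data here is synthetic and the model choice is illustrative; the point is that the scaler only ever sees the training fold:

```python
# Sketch: leakage-safe preprocessing with a scikit-learn Pipeline.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

# Hypothetical data: the label depends on the first feature
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))
y = (X[:, 0] + rng.normal(scale=0.5, size=200) > 0).astype(int)

# Split FIRST...
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42
)

# ...then let the Pipeline fit the scaler on X_train only.
pipe = Pipeline([
    ("scale", StandardScaler()),
    ("model", LogisticRegression()),
])
pipe.fit(X_train, y_train)
print(f"Held-out accuracy: {pipe.score(X_test, y_test):.3f}")
```

Because the scaler lives inside the `Pipeline`, cross-validation with `cross_val_score(pipe, ...)` is also automatically leakage-free.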
❌ Anti-Pattern 2: P-Hacking (Data Dredging)
What it looks like:
- Testing 50 different hypotheses or subgroups.
- Reporting only the one result with p < 0.05.
- Stopping an A/B test exactly when significance is reached (peeking).
Why it fails:
- High probability of False Positives (Type I error).
- Findings are random noise, not reproducible effects.
Correct approach:
- Pre-register hypotheses.
- Apply Bonferroni correction or False Discovery Rate (FDR) control for multiple comparisons.
- Determine sample size before the experiment and stick to it.
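Multiple-comparison correction is a one-liner with statsmodels. A minimal sketch, using simulated p-values (three strong effects buried in noise) for illustration:

```python
# Sketch: controlling the false discovery rate across many hypothesis tests.
import numpy as np
from statsmodels.stats.multitest import multipletests

# Simulated p-values: three real effects plus 47 noise tests
rng = np.random.default_rng(42)
p_values = np.concatenate([
    rng.uniform(0, 0.001, size=3),
    rng.uniform(0, 1, size=47),
])

# Benjamini-Hochberg FDR control; use method="bonferroni" for the stricter family-wise correction
reject, p_adjusted, _, _ = multipletests(p_values, alpha=0.05, method="fdr_bh")
print(f"Raw 'significant' count: {(p_values < 0.05).sum()}")
print(f"Significant after FDR control: {reject.sum()}")
```

The corrected count is never larger than the raw count: the adjustment exists precisely to discard the spurious hits that naive p < 0.05 screening lets through.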
❌ Anti-Pattern 3: Ignoring Imbalanced Classes
What it looks like:
- Training a fraud detection model on data with 0.1% fraud.
- Reporting 99.9% Accuracy as "Success".
Why it fails:
- The model simply predicts "No Fraud" for everyone.
- Fails to detect the actual class of interest.
Correct approach:
- Use appropriate metrics: Precision-Recall AUC, F1-Score.
- Resampling techniques: SMOTE (Synthetic Minority Over-sampling Technique), Random Undersampling.
- Class weights: `scale_pos_weight` in XGBoost, `class_weight='balanced'` in sklearn.
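Both remedies can be sketched together with sklearn. Synthetic data (~1% positive class) and a logistic model stand in for the real fraud setup:

```python
# Sketch: evaluating an imbalanced classifier with PR-based metrics and class weights.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import average_precision_score, f1_score
from sklearn.model_selection import train_test_split

# Synthetic data with roughly 1% positives (stand-in for fraud labels)
X, y = make_classification(n_samples=5000, weights=[0.99], flip_y=0, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=42)

# class_weight='balanced' reweights the loss by inverse class frequency
clf = LogisticRegression(class_weight='balanced', max_iter=1000)
clf.fit(X_train, y_train)

scores = clf.predict_proba(X_test)[:, 1]
pr_auc = average_precision_score(y_test, scores)
print(f"PR-AUC (average precision): {pr_auc:.3f}")
print(f"F1: {f1_score(y_test, clf.predict(X_test)):.3f}")
```

Note that plain accuracy would be near 99% here for a model that predicts "negative" for everyone; PR-AUC and F1 expose that failure mode.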
7. Quality Checklist
Methodology & Rigor:
- Hypothesis defined clearly before analysis.
- Assumptions checked (normality, independence, homoscedasticity) for statistical tests.
- Train/Test/Validation split performed correctly (no leakage).
- Imbalanced classes handled appropriately (metrics, resampling).
- Cross-validation used for model assessment.
Code & Reproducibility:
- Code stored in git with `requirements.txt` or `environment.yml`.
- Random seeds set for reproducibility (`random_state=42`).
- Hardcoded paths replaced with relative paths or config variables.
- Complex logic wrapped in functions/classes with docstrings.
Interpretation & Communication:
- Results interpreted in business terms (e.g., "Revenue lift" vs "Log-loss decrease").
- Confidence intervals provided for estimates.
- "Black box" models explained using SHAP or LIME if needed.
- Caveats and limitations explicitly stated.
Performance:
- EDA performed on sampled data if dataset > 10GB.
- Vectorized operations used (pandas/numpy) instead of loops.
- Query optimized (filtering early, selecting only needed columns).
Examples
Example 1: A/B Test Analysis for Feature Launch
Scenario: Product team wants to know if a new recommendation algorithm increases user engagement.
Analysis Approach:
- Experimental Design: Random assignment (50/50), minimum sample size calculation
- Data Collection: Tracked click-through rate, time on page, conversion
- Statistical Testing: Two-sample t-test with bootstrapped confidence intervals
- Results: Significant improvement in CTR (p < 0.01), 12% lift
Key Analysis:

```python
# Bootstrap confidence interval for difference in means
# (bootstrap_diffs holds resampled treatment-minus-control differences)
diff = treatment_means - control_means
ci = np.percentile(bootstrap_diffs, [2.5, 97.5])
```

**Outcome:** Feature launched with 95% probability of positive impact
Example 2: Time Series Forecasting for Demand Planning
Scenario: Retail chain needs to forecast next-quarter sales for inventory planning.
Modeling Approach:
- Exploratory Analysis: Identified trends, seasonality (weekly, holiday)
- Feature Engineering: Promotions, weather, economic indicators
- Model Selection: Compared ARIMA, Prophet, and gradient boosting
- Validation: Walk-forward validation on last 12 months
Results:
| Model | MAPE | 90% CI Width |
|---|---|---|
| ARIMA | 12.3% | ±15% |
| Prophet | 9.8% | ±12% |
| XGBoost | 7.2% | ±9% |
Deliverable: Production model with automated retraining pipeline
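The walk-forward validation used above can be sketched with scikit-learn's `TimeSeriesSplit`. The monthly series here is synthetic, and `GradientBoostingRegressor` stands in for the boosted model in the comparison:

```python
# Sketch: walk-forward (expanding window) validation for a forecasting model.
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import TimeSeriesSplit

# Synthetic monthly sales: trend + yearly seasonality + noise
rng = np.random.default_rng(0)
t = np.arange(120)
y = 100 + 0.5 * t + 10 * np.sin(2 * np.pi * t / 12) + rng.normal(0, 2, 120)
X = np.column_stack([t, np.sin(2 * np.pi * t / 12), np.cos(2 * np.pi * t / 12)])

# Each fold trains on the past and tests on the following 12 months
mape_scores = []
for train_idx, test_idx in TimeSeriesSplit(n_splits=5, test_size=12).split(X):
    model = GradientBoostingRegressor(random_state=42)
    model.fit(X[train_idx], y[train_idx])
    pred = model.predict(X[test_idx])
    mape_scores.append(np.mean(np.abs((y[test_idx] - pred) / y[test_idx])))

print(f"Mean MAPE across folds: {np.mean(mape_scores):.2%}")
```

Unlike a random K-fold split, every test window lies strictly after its training window, so the score reflects genuine out-of-time forecasting performance.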
Example 3: Causal Attribution Analysis
Scenario: Marketing wants to understand which channels drive actual conversions vs. appear correlated.
Causal Methods:
- Propensity Score Matching: Match users with similar characteristics
- Difference-in-Differences: Compare changes before/after campaigns
- Instrumental Variables: Address selection bias in observational data
Key Findings:
- TV ads: 3.2x ROAS (strongest attribution)
- Social media: 1.1x ROAS (attribution unclear)
- Email: 5.8x ROAS (highest efficiency)
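The difference-in-differences method listed above reduces to an OLS regression with an interaction term. A minimal sketch on synthetic campaign data, where a true lift of 5 is baked in for illustration:

```python
# Sketch: difference-in-differences as an OLS interaction term.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic panel: exposure to the campaign x before/after launch
rng = np.random.default_rng(1)
n = 2000
df = pd.DataFrame({
    'treated': rng.integers(0, 2, n),  # exposed to the campaign
    'post': rng.integers(0, 2, n),     # observed after launch
})
df['conversions'] = (
    20 + 3 * df['treated'] + 2 * df['post']
    + 5 * df['treated'] * df['post']   # the causal effect we want to recover
    + rng.normal(0, 2, n)
)

# The coefficient on treated:post is the DiD estimate of the campaign effect
model = smf.ols('conversions ~ treated * post', data=df).fit()
print(f"DiD estimate: {model.params['treated:post']:.2f}")
```

The group and time main effects absorb pre-existing differences and secular trends, so the interaction isolates the campaign's incremental effect under the parallel-trends assumption.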
Best Practices
Experimental Design
- Randomization: Ensure true random assignment to treatment/control
- Sample Size Calculation: Power analysis before starting experiments
- Multiple Testing: Adjust significance levels when testing multiple hypotheses
- Control Variables: Include relevant covariates to reduce variance
- Duration Planning: Run experiments long enough for stable results
Model Development
- Feature Engineering: Create interpretable, predictive features
- Cross-Validation: Use time-aware splits for time series data
- Model Interpretability: Use SHAP/LIME to explain predictions
- Validation Metrics: Choose metrics aligned with business objectives
- Overfitting Prevention: Regularization, early stopping, held-out data
Statistical Rigor
- Uncertainty Quantification: Always report confidence intervals
- Significance Interpretation: P-value is not effect size
- Assumption Checking: Validate statistical test assumptions
- Sensitivity Analysis: Test robustness to modeling choices
- Pre-registration: Document analysis plan before seeing results
Communication and Impact
- Business Translation: Convert statistical terms to business impact
- Actionable Recommendations: Tie findings to specific decisions
- Visual Storytelling: Create compelling narratives from data
- Stakeholder Communication: Tailor level of technical detail
- Documentation: Maintain reproducible analysis records
Ethical Data Science
- Fairness Considerations: Check for bias across protected groups
- Privacy Protection: Anonymize sensitive data appropriately
- Transparency: Document data sources and methodology
- Responsible AI: Consider societal impact of models
- Data Quality: Acknowledge limitations and potential biases