datanalysis-credit-risk
Compare original and translation side by side
🇺🇸
Original
English🇨🇳
Translation
ChineseData Cleaning and Variable Screening
数据清洗与变量筛选
Quick Start
快速开始
bash
undefinedbash
undefinedRun the complete data cleaning pipeline
Run the complete data cleaning pipeline
python ".github/skills/datanalysis-credit-risk/scripts/example.py"
undefinedpython ".github/skills/datanalysis-credit-risk/scripts/example.py"
undefinedComplete Process Description
完整流程说明
The data cleaning pipeline consists of the following 11 steps, each executed independently without deleting the original data:
- Get Data - Load and format raw data
- Organization Sample Analysis - Statistics of sample count and bad sample rate for each organization
- Separate OOS Data - Separate out-of-sample (OOS) samples from modeling samples
- Filter Abnormal Months - Remove months with insufficient bad sample count or total sample count
- Calculate Missing Rate - Calculate overall and organization-level missing rates for each feature
- Drop High Missing Rate Features - Remove features with overall missing rate exceeding threshold
- Drop Low IV Features - Remove features with overall IV too low or IV too low in too many organizations
- Drop High PSI Features - Remove features with unstable PSI
- Null Importance Denoising - Remove noise features using label permutation method
- Drop High Correlation Features - Remove high correlation features based on original gain
- Export Report - Generate Excel report containing details and statistics of all steps
该数据清洗流水线包含以下11个步骤,每个步骤独立执行,不会删除原始数据:
- 获取数据 - 加载并格式化原始数据
- 机构样本分析 - 统计各机构的样本数量与坏样本率
- 分离OOS数据 - 将样本外(OOS)样本与建模样本分离
- 过滤异常月份 - 移除坏样本数量或总样本数量不足的月份
- 计算缺失率 - 计算每个特征的整体缺失率及机构维度缺失率
- 删除高缺失率特征 - 移除整体缺失率超过阈值的特征
- 删除低IV特征 - 移除整体IV过低或在过多机构中IV过低的特征
- 删除高PSI特征 - 移除PSI不稳定的特征
- Null Importance去噪 - 使用标签置换法移除噪声特征
- 删除高相关性特征 - 基于原始增益移除高相关性特征
- 导出报告 - 生成包含所有步骤详情与统计数据的Excel报告
Core Functions
核心功能
| Function | Purpose | Module |
|---|---|---|
| Load and format data | references.func |
| Organization sample analysis | references.func |
| Calculate missing rate | references.func |
| Filter abnormal months | references.analysis |
| Drop high missing rate features | references.analysis |
| Drop low IV features | references.analysis |
| Drop high PSI features | references.analysis |
| Null Importance denoising | references.analysis |
| Drop high correlation features | references.analysis |
| IV distribution statistics | references.analysis |
| PSI distribution statistics | references.analysis |
| Value ratio distribution statistics | references.analysis |
| Export cleaning report | references.analysis |
| 函数 | 用途 | 模块 |
|---|---|---|
| 加载并格式化数据 | references.func |
| 机构样本分析 | references.func |
| 计算缺失率 | references.func |
| 过滤异常月份 | references.analysis |
| 删除高缺失率特征 | references.analysis |
| 删除低IV特征 | references.analysis |
| 删除高PSI特征 | references.analysis |
| Null Importance去噪 | references.analysis |
| 删除高相关性特征 | references.analysis |
| IV分布统计 | references.analysis |
| PSI分布统计 | references.analysis |
| 值占比分布统计 | references.analysis |
| 导出清洗报告 | references.analysis |
Parameter Description
参数说明
Data Loading Parameters
数据加载参数
- : Data file path (best are parquet format)
DATA_PATH - : Date column name
DATE_COL - : Label column name
Y_COL - : Organization column name
ORG_COL - : Primary key column name list
KEY_COLS
- : 数据文件路径(最佳为parquet格式)
DATA_PATH - : 日期列名
DATE_COL - : 标签列名
Y_COL - : 机构列名
ORG_COL - : 主键列名列表
KEY_COLS
OOS Organization Configuration
OOS机构配置
- : Out-of-sample organization list
OOS_ORGS
- : 样本外机构列表
OOS_ORGS
Abnormal Month Filtering Parameters
异常月份过滤参数
- : Minimum bad sample count per month (default 10)
min_ym_bad_sample - : Minimum total sample count per month (default 500)
min_ym_sample
- : 单月最小坏样本数量(默认10)
min_ym_bad_sample - : 单月最小总样本数量(默认500)
min_ym_sample
Missing Rate Parameters
缺失率参数
- : Overall missing rate threshold (default 0.6)
missing_ratio
- : 整体缺失率阈值(默认0.6)
missing_ratio
IV Parameters
IV参数
- : Overall IV threshold (default 0.1)
overall_iv_threshold - : Single organization IV threshold (default 0.1)
org_iv_threshold - : Maximum tolerated low IV organization count (default 2)
max_org_threshold
- : 整体IV阈值(默认0.1)
overall_iv_threshold - : 单机构IV阈值(默认0.1)
org_iv_threshold - : 可容忍的低IV机构最大数量(默认2)
max_org_threshold
PSI Parameters
PSI参数
- : PSI threshold (default 0.1)
psi_threshold - : Maximum unstable month ratio (default 1/3)
max_months_ratio - : Maximum unstable organization count (default 6)
max_orgs
- : PSI阈值(默认0.1)
psi_threshold - : 最大不稳定月份占比(默认1/3)
max_months_ratio - : 最大不稳定机构数量(默认6)
max_orgs
Null Importance Parameters
Null Importance参数
- : Number of trees (default 100)
n_estimators - : Maximum tree depth (default 5)
max_depth - : Gain difference threshold (default 50)
gain_threshold
- : 树数量(默认100)
n_estimators - : 树最大深度(默认5)
max_depth - : 增益差值阈值(默认50)
gain_threshold
High Correlation Parameters
高相关性参数
- : Correlation threshold (default 0.9)
max_corr - : Keep top N features by original gain ranking (default 20)
top_n_keep
- : 相关性阈值(默认0.9)
max_corr - : 按原始增益排名保留前N个特征(默认20)
top_n_keep
Output Report
输出报告
The generated Excel report contains the following sheets:
- 汇总 - Summary information of all steps, including operation results and conditions
- 机构样本统计 - Sample count and bad sample rate for each organization
- 分离OOS数据 - OOS sample and modeling sample counts
- Step4-异常月份处理 - Abnormal months that were removed
- 缺失率明细 - Overall and organization-level missing rates for each feature
- Step5-有值率分布统计 - Distribution of features in different value ratio ranges
- Step6-高缺失率处理 - High missing rate features that were removed
- Step7-IV明细 - IV values of each feature in each organization and overall
- Step7-IV处理 - Features that do not meet IV conditions and low IV organizations
- Step7-IV分布统计 - Distribution of features in different IV ranges
- Step8-PSI明细 - PSI values of each feature in each organization each month
- Step8-PSI处理 - Features that do not meet PSI conditions and unstable organizations
- Step8-PSI分布统计 - Distribution of features in different PSI ranges
- Step9-null importance处理 - Noise features that were removed
- Step10-高相关性剔除 - High correlation features that were removed
生成的Excel报告包含以下工作表:
- 汇总 - 所有步骤的汇总信息,包含操作结果与执行条件
- 机构样本统计 - 各机构的样本数量与坏样本率
- 分离OOS数据 - OOS样本与建模样本数量
- Step4-异常月份处理 - 被移除的异常月份
- 缺失率明细 - 每个特征的整体缺失率及机构维度缺失率
- Step5-有值率分布统计 - 不同有值率区间的特征分布情况
- Step6-高缺失率处理 - 被移除的高缺失率特征
- Step7-IV明细 - 每个特征在各机构及整体的IV值
- Step7-IV处理 - 不满足IV条件的特征与低IV机构
- Step7-IV分布统计 - 不同IV区间的特征分布情况
- Step8-PSI明细 - 每个特征在各机构各月份的PSI值
- Step8-PSI处理 - 不满足PSI条件的特征与不稳定机构
- Step8-PSI分布统计 - 不同PSI区间的特征分布情况
- Step9-null importance处理 - 被移除的噪声特征
- Step10-高相关性剔除 - 被移除的高相关性特征
Features
特性
- Interactive Input: Parameters can be input before each step execution, with default values supported
- Independent Execution: Each step is executed independently without deleting original data, facilitating comparative analysis
- Complete Report: Generate complete Excel report containing details, statistics, and distributions
- Multi-process Support: IV and PSI calculations support multi-process acceleration
- Organization-level Analysis: Support organization-level statistics and modeling/OOS distinction
- 交互式输入: 每个步骤执行前可输入参数,支持默认值
- 独立执行: 每个步骤独立执行,不会删除原始数据,便于对比分析
- 完整报告: 生成包含详情、统计数据与分布情况的完整Excel报告
- 多进程支持: IV与PSI计算支持多进程加速
- 机构级分析: 支持机构维度统计以及建模/OOS划分