datanalysis-credit-risk

Compare original and translation side by side

🇺🇸

Original

English
🇨🇳

Translation

Chinese

Data Cleaning and Variable Screening

数据清洗与变量筛选

Quick Start

快速开始

bash
undefined
bash
undefined

Run the complete data cleaning pipeline

Run the complete data cleaning pipeline

python ".github/skills/datanalysis-credit-risk/scripts/example.py"
undefined
python ".github/skills/datanalysis-credit-risk/scripts/example.py"
undefined

Complete Process Description

完整流程说明

The data cleaning pipeline consists of the following 11 steps, each executed independently without deleting the original data:
  1. Get Data - Load and format raw data
  2. Organization Sample Analysis - Statistics of sample count and bad sample rate for each organization
  3. Separate OOS Data - Separate out-of-sample (OOS) samples from modeling samples
  4. Filter Abnormal Months - Remove months with insufficient bad sample count or total sample count
  5. Calculate Missing Rate - Calculate overall and organization-level missing rates for each feature
  6. Drop High Missing Rate Features - Remove features with overall missing rate exceeding threshold
  7. Drop Low IV Features - Remove features with overall IV too low or IV too low in too many organizations
  8. Drop High PSI Features - Remove features with unstable PSI
  9. Null Importance Denoising - Remove noise features using label permutation method
  10. Drop High Correlation Features - Remove high correlation features based on original gain
  11. Export Report - Generate Excel report containing details and statistics of all steps
该数据清洗流水线包含以下11个步骤,每个步骤独立执行,不会删除原始数据:
  1. 获取数据 - 加载并格式化原始数据
  2. 机构样本分析 - 统计各机构的样本数量与坏样本率
  3. 分离OOS数据 - 将样本外(OOS)样本与建模样本分离
  4. 过滤异常月份 - 移除坏样本数量或总样本数量不足的月份
  5. 计算缺失率 - 计算每个特征的整体缺失率及机构维度缺失率
  6. 删除高缺失率特征 - 移除整体缺失率超过阈值的特征
  7. 删除低IV特征 - 移除整体IV过低或在过多机构中IV过低的特征
  8. 删除高PSI特征 - 移除PSI不稳定的特征
  9. Null Importance去噪 - 使用标签置换法移除噪声特征
  10. 删除高相关性特征 - 基于原始增益移除高相关性特征
  11. 导出报告 - 生成包含所有步骤详情与统计数据的Excel报告

Core Functions

核心功能

FunctionPurposeModule
get_dataset()
Load and format datareferences.func
org_analysis()
Organization sample analysisreferences.func
missing_check()
Calculate missing ratereferences.func
drop_abnormal_ym()
Filter abnormal monthsreferences.analysis
drop_highmiss_features()
Drop high missing rate featuresreferences.analysis
drop_lowiv_features()
Drop low IV featuresreferences.analysis
drop_highpsi_features()
Drop high PSI featuresreferences.analysis
drop_highnoise_features()
Null Importance denoisingreferences.analysis
drop_highcorr_features()
Drop high correlation featuresreferences.analysis
iv_distribution_by_org()
IV distribution statisticsreferences.analysis
psi_distribution_by_org()
PSI distribution statisticsreferences.analysis
value_ratio_distribution_by_org()
Value ratio distribution statisticsreferences.analysis
export_cleaning_report()
Export cleaning reportreferences.analysis
函数用途模块
get_dataset()
加载并格式化数据references.func
org_analysis()
机构样本分析references.func
missing_check()
计算缺失率references.func
drop_abnormal_ym()
过滤异常月份references.analysis
drop_highmiss_features()
删除高缺失率特征references.analysis
drop_lowiv_features()
删除低IV特征references.analysis
drop_highpsi_features()
删除高PSI特征references.analysis
drop_highnoise_features()
Null Importance去噪references.analysis
drop_highcorr_features()
删除高相关性特征references.analysis
iv_distribution_by_org()
IV分布统计references.analysis
psi_distribution_by_org()
PSI分布统计references.analysis
value_ratio_distribution_by_org()
值占比分布统计references.analysis
export_cleaning_report()
导出清洗报告references.analysis

Parameter Description

参数说明

Data Loading Parameters

数据加载参数

  • DATA_PATH
    : Data file path (best are parquet format)
  • DATE_COL
    : Date column name
  • Y_COL
    : Label column name
  • ORG_COL
    : Organization column name
  • KEY_COLS
    : Primary key column name list
  • DATA_PATH
    : 数据文件路径(最佳为parquet格式)
  • DATE_COL
    : 日期列名
  • Y_COL
    : 标签列名
  • ORG_COL
    : 机构列名
  • KEY_COLS
    : 主键列名列表

OOS Organization Configuration

OOS机构配置

  • OOS_ORGS
    : Out-of-sample organization list
  • OOS_ORGS
    : 样本外机构列表

Abnormal Month Filtering Parameters

异常月份过滤参数

  • min_ym_bad_sample
    : Minimum bad sample count per month (default 10)
  • min_ym_sample
    : Minimum total sample count per month (default 500)
  • min_ym_bad_sample
    : 单月最小坏样本数量(默认10)
  • min_ym_sample
    : 单月最小总样本数量(默认500)

Missing Rate Parameters

缺失率参数

  • missing_ratio
    : Overall missing rate threshold (default 0.6)
  • missing_ratio
    : 整体缺失率阈值(默认0.6)

IV Parameters

IV参数

  • overall_iv_threshold
    : Overall IV threshold (default 0.1)
  • org_iv_threshold
    : Single organization IV threshold (default 0.1)
  • max_org_threshold
    : Maximum tolerated low IV organization count (default 2)
  • overall_iv_threshold
    : 整体IV阈值(默认0.1)
  • org_iv_threshold
    : 单机构IV阈值(默认0.1)
  • max_org_threshold
    : 可容忍的低IV机构最大数量(默认2)

PSI Parameters

PSI参数

  • psi_threshold
    : PSI threshold (default 0.1)
  • max_months_ratio
    : Maximum unstable month ratio (default 1/3)
  • max_orgs
    : Maximum unstable organization count (default 6)
  • psi_threshold
    : PSI阈值(默认0.1)
  • max_months_ratio
    : 最大不稳定月份占比(默认1/3)
  • max_orgs
    : 最大不稳定机构数量(默认6)

Null Importance Parameters

Null Importance参数

  • n_estimators
    : Number of trees (default 100)
  • max_depth
    : Maximum tree depth (default 5)
  • gain_threshold
    : Gain difference threshold (default 50)
  • n_estimators
    : 树数量(默认100)
  • max_depth
    : 树最大深度(默认5)
  • gain_threshold
    : 增益差值阈值(默认50)

High Correlation Parameters

高相关性参数

  • max_corr
    : Correlation threshold (default 0.9)
  • top_n_keep
    : Keep top N features by original gain ranking (default 20)
  • max_corr
    : 相关性阈值(默认0.9)
  • top_n_keep
    : 按原始增益排名保留前N个特征(默认20)

Output Report

输出报告

The generated Excel report contains the following sheets:
  1. 汇总 - Summary information of all steps, including operation results and conditions
  2. 机构样本统计 - Sample count and bad sample rate for each organization
  3. 分离OOS数据 - OOS sample and modeling sample counts
  4. Step4-异常月份处理 - Abnormal months that were removed
  5. 缺失率明细 - Overall and organization-level missing rates for each feature
  6. Step5-有值率分布统计 - Distribution of features in different value ratio ranges
  7. Step6-高缺失率处理 - High missing rate features that were removed
  8. Step7-IV明细 - IV values of each feature in each organization and overall
  9. Step7-IV处理 - Features that do not meet IV conditions and low IV organizations
  10. Step7-IV分布统计 - Distribution of features in different IV ranges
  11. Step8-PSI明细 - PSI values of each feature in each organization each month
  12. Step8-PSI处理 - Features that do not meet PSI conditions and unstable organizations
  13. Step8-PSI分布统计 - Distribution of features in different PSI ranges
  14. Step9-null importance处理 - Noise features that were removed
  15. Step10-高相关性剔除 - High correlation features that were removed
生成的Excel报告包含以下工作表:
  1. 汇总 - 所有步骤的汇总信息,包含操作结果与执行条件
  2. 机构样本统计 - 各机构的样本数量与坏样本率
  3. 分离OOS数据 - OOS样本与建模样本数量
  4. Step4-异常月份处理 - 被移除的异常月份
  5. 缺失率明细 - 每个特征的整体缺失率及机构维度缺失率
  6. Step5-有值率分布统计 - 不同有值率区间的特征分布情况
  7. Step6-高缺失率处理 - 被移除的高缺失率特征
  8. Step7-IV明细 - 每个特征在各机构及整体的IV值
  9. Step7-IV处理 - 不满足IV条件的特征与低IV机构
  10. Step7-IV分布统计 - 不同IV区间的特征分布情况
  11. Step8-PSI明细 - 每个特征在各机构各月份的PSI值
  12. Step8-PSI处理 - 不满足PSI条件的特征与不稳定机构
  13. Step8-PSI分布统计 - 不同PSI区间的特征分布情况
  14. Step9-null importance处理 - 被移除的噪声特征
  15. Step10-高相关性剔除 - 被移除的高相关性特征

Features

特性

  • Interactive Input: Parameters can be input before each step execution, with default values supported
  • Independent Execution: Each step is executed independently without deleting original data, facilitating comparative analysis
  • Complete Report: Generate complete Excel report containing details, statistics, and distributions
  • Multi-process Support: IV and PSI calculations support multi-process acceleration
  • Organization-level Analysis: Support organization-level statistics and modeling/OOS distinction
  • 交互式输入: 每个步骤执行前可输入参数,支持默认值
  • 独立执行: 每个步骤独立执行,不会删除原始数据,便于对比分析
  • 完整报告: 生成包含详情、统计数据与分布情况的完整Excel报告
  • 多进程支持: IV与PSI计算支持多进程加速
  • 机构级分析: 支持机构维度统计以及建模/OOS划分