datanalysis-credit-risk

Compare original and translation side by side

🇺🇸

Original

English

🇨🇳

Translation

Chinese

Data Cleaning and Variable Screening

数据清洗与变量筛选

Quick Start

快速开始

bash

undefined

bash

undefined

Run the complete data cleaning pipeline

python ".github/skills/datanalysis-credit-risk/scripts/example.py"

undefined

python ".github/skills/datanalysis-credit-risk/scripts/example.py"

undefined

Complete Process Description

完整流程说明

The data cleaning pipeline consists of the following 11 steps, each executed independently without deleting the original data:

Get Data - Load and format raw data
Organization Sample Analysis - Statistics of sample count and bad sample rate for each organization
Separate OOS Data - Separate out-of-sample (OOS) samples from modeling samples
Filter Abnormal Months - Remove months with insufficient bad sample count or total sample count
Calculate Missing Rate - Calculate overall and organization-level missing rates for each feature
Drop High Missing Rate Features - Remove features with overall missing rate exceeding threshold
Drop Low IV Features - Remove features with overall IV too low or IV too low in too many organizations
Drop High PSI Features - Remove features with unstable PSI
Null Importance Denoising - Remove noise features using label permutation method
Drop High Correlation Features - Remove high correlation features based on original gain
Export Report - Generate Excel report containing details and statistics of all steps

该数据清洗流水线包含以下11个步骤，每个步骤独立执行，不会删除原始数据：

获取数据 - 加载并格式化原始数据
机构样本分析 - 统计各机构的样本数量与坏样本率
分离OOS数据 - 将样本外（OOS）样本与建模样本分离
过滤异常月份 - 移除坏样本数量或总样本数量不足的月份
计算缺失率 - 计算每个特征的整体缺失率及机构维度缺失率
删除高缺失率特征 - 移除整体缺失率超过阈值的特征
删除低IV特征 - 移除整体IV过低或在过多机构中IV过低的特征
删除高PSI特征 - 移除PSI不稳定的特征
Null Importance去噪 - 使用标签置换法移除噪声特征
删除高相关性特征 - 基于原始增益移除高相关性特征
导出报告 - 生成包含所有步骤详情与统计数据的Excel报告

Core Functions

核心功能

Function	Purpose	Module
`get_dataset()`	Load and format data	references.func
`org_analysis()`	Organization sample analysis	references.func
`missing_check()`	Calculate missing rate	references.func
`drop_abnormal_ym()`	Filter abnormal months	references.analysis
`drop_highmiss_features()`	Drop high missing rate features	references.analysis
`drop_lowiv_features()`	Drop low IV features	references.analysis
`drop_highpsi_features()`	Drop high PSI features	references.analysis
`drop_highnoise_features()`	Null Importance denoising	references.analysis
`drop_highcorr_features()`	Drop high correlation features	references.analysis
`iv_distribution_by_org()`	IV distribution statistics	references.analysis
`psi_distribution_by_org()`	PSI distribution statistics	references.analysis
`value_ratio_distribution_by_org()`	Value ratio distribution statistics	references.analysis
`export_cleaning_report()`	Export cleaning report	references.analysis

函数	用途	模块
`get_dataset()`	加载并格式化数据	references.func
`org_analysis()`	机构样本分析	references.func
`missing_check()`	计算缺失率	references.func
`drop_abnormal_ym()`	过滤异常月份	references.analysis
`drop_highmiss_features()`	删除高缺失率特征	references.analysis
`drop_lowiv_features()`	删除低IV特征	references.analysis
`drop_highpsi_features()`	删除高PSI特征	references.analysis
`drop_highnoise_features()`	Null Importance去噪	references.analysis
`drop_highcorr_features()`	删除高相关性特征	references.analysis
`iv_distribution_by_org()`	IV分布统计	references.analysis
`psi_distribution_by_org()`	PSI分布统计	references.analysis
`value_ratio_distribution_by_org()`	值占比分布统计	references.analysis
`export_cleaning_report()`	导出清洗报告	references.analysis

Parameter Description

参数说明

Data Loading Parameters

数据加载参数

```
DATA_PATH
```
: Data file path (best are parquet format)
```
DATE_COL
```
: Date column name
```
Y_COL
```
: Label column name
```
ORG_COL
```
: Organization column name
```
KEY_COLS
```
: Primary key column name list

```
DATA_PATH
```
: 数据文件路径（最佳为parquet格式）
```
DATE_COL
```
: 日期列名
```
Y_COL
```
: 标签列名
```
ORG_COL
```
: 机构列名
```
KEY_COLS
```
: 主键列名列表

OOS Organization Configuration

OOS机构配置

```
OOS_ORGS
```
: Out-of-sample organization list

```
OOS_ORGS
```
: 样本外机构列表

Abnormal Month Filtering Parameters

异常月份过滤参数

```
min_ym_bad_sample
```
: Minimum bad sample count per month (default 10)
```
min_ym_sample
```
: Minimum total sample count per month (default 500)

```
min_ym_bad_sample
```
: 单月最小坏样本数量（默认10）
```
min_ym_sample
```
: 单月最小总样本数量（默认500）

Missing Rate Parameters

缺失率参数

```
missing_ratio
```
: Overall missing rate threshold (default 0.6)

```
missing_ratio
```
: 整体缺失率阈值（默认0.6）

IV Parameters

IV参数

```
overall_iv_threshold
```
: Overall IV threshold (default 0.1)
```
org_iv_threshold
```
: Single organization IV threshold (default 0.1)
```
max_org_threshold
```
: Maximum tolerated low IV organization count (default 2)

```
overall_iv_threshold
```
: 整体IV阈值（默认0.1）
```
org_iv_threshold
```
: 单机构IV阈值（默认0.1）
```
max_org_threshold
```
: 可容忍的低IV机构最大数量（默认2）

PSI Parameters

PSI参数

```
psi_threshold
```
: PSI threshold (default 0.1)
```
max_months_ratio
```
: Maximum unstable month ratio (default 1/3)
```
max_orgs
```
: Maximum unstable organization count (default 6)

```
psi_threshold
```
: PSI阈值（默认0.1）
```
max_months_ratio
```
: 最大不稳定月份占比（默认1/3）
```
max_orgs
```
: 最大不稳定机构数量（默认6）

Null Importance Parameters

Null Importance参数

```
n_estimators
```
: Number of trees (default 100)
```
max_depth
```
: Maximum tree depth (default 5)
```
gain_threshold
```
: Gain difference threshold (default 50)

```
n_estimators
```
: 树数量（默认100）
```
max_depth
```
: 树最大深度（默认5）
```
gain_threshold
```
: 增益差值阈值（默认50）

High Correlation Parameters

高相关性参数

```
max_corr
```
: Correlation threshold (default 0.9)
```
top_n_keep
```
: Keep top N features by original gain ranking (default 20)

```
max_corr
```
: 相关性阈值（默认0.9）
```
top_n_keep
```
: 按原始增益排名保留前N个特征（默认20）

Output Report

输出报告

The generated Excel report contains the following sheets:

汇总 - Summary information of all steps, including operation results and conditions
机构样本统计 - Sample count and bad sample rate for each organization
分离OOS数据 - OOS sample and modeling sample counts
Step4-异常月份处理 - Abnormal months that were removed
缺失率明细 - Overall and organization-level missing rates for each feature
Step5-有值率分布统计 - Distribution of features in different value ratio ranges
Step6-高缺失率处理 - High missing rate features that were removed
Step7-IV明细 - IV values of each feature in each organization and overall
Step7-IV处理 - Features that do not meet IV conditions and low IV organizations
Step7-IV分布统计 - Distribution of features in different IV ranges
Step8-PSI明细 - PSI values of each feature in each organization each month
Step8-PSI处理 - Features that do not meet PSI conditions and unstable organizations
Step8-PSI分布统计 - Distribution of features in different PSI ranges
Step9-null importance处理 - Noise features that were removed
Step10-高相关性剔除 - High correlation features that were removed

生成的Excel报告包含以下工作表：

汇总 - 所有步骤的汇总信息，包含操作结果与执行条件
机构样本统计 - 各机构的样本数量与坏样本率
分离OOS数据 - OOS样本与建模样本数量
Step4-异常月份处理 - 被移除的异常月份
缺失率明细 - 每个特征的整体缺失率及机构维度缺失率
Step5-有值率分布统计 - 不同有值率区间的特征分布情况
Step6-高缺失率处理 - 被移除的高缺失率特征
Step7-IV明细 - 每个特征在各机构及整体的IV值
Step7-IV处理 - 不满足IV条件的特征与低IV机构
Step7-IV分布统计 - 不同IV区间的特征分布情况
Step8-PSI明细 - 每个特征在各机构各月份的PSI值
Step8-PSI处理 - 不满足PSI条件的特征与不稳定机构
Step8-PSI分布统计 - 不同PSI区间的特征分布情况
Step9-null importance处理 - 被移除的噪声特征
Step10-高相关性剔除 - 被移除的高相关性特征

Features

特性

Interactive Input: Parameters can be input before each step execution, with default values supported
Independent Execution: Each step is executed independently without deleting original data, facilitating comparative analysis
Complete Report: Generate complete Excel report containing details, statistics, and distributions
Multi-process Support: IV and PSI calculations support multi-process acceleration
Organization-level Analysis: Support organization-level statistics and modeling/OOS distinction

交互式输入: 每个步骤执行前可输入参数，支持默认值
独立执行: 每个步骤独立执行，不会删除原始数据，便于对比分析
完整报告: 生成包含详情、统计数据与分布情况的完整Excel报告
多进程支持: IV与PSI计算支持多进程加速
机构级分析: 支持机构维度统计以及建模/OOS划分