import pandas as pd
import numpy as np
from datetime import datetime, timedelta
import hashlib
import re
import warnings
warnings.filterwarnings('ignore')
pd.set_option('display.max_columns', None)
pd.set_option('display.float_format', lambda x: f'{x:.4f}')

Typical validity rules to encode up front: `amount > 0`, `status IN ('active','inactive')`, `end_date >= start_date`.
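A minimal sketch of those three rule checks, assuming `df` carries columns named `amount`, `status`, `start_date`, and `end_date` (illustrative names, not fixed by the rules above):

```python
# Hypothetical rule checks; column names are illustrative assumptions.
rules = {
    'amount_positive': df['amount'] > 0,
    'status_in_domain': df['status'].isin(['active', 'inactive']),
    'dates_ordered': df['end_date'] >= df['start_date'],
}
for name, mask in rules.items():
    # Rows where the rule does not hold (NaN comparisons count as violations).
    print(f"{name}: {(~mask).sum()} violating rows")
```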
completeness = pd.DataFrame({
    'column': df.columns,
    'null_count': df.isnull().sum().values,
    'null_pct': (df.isnull().sum() / len(df) * 100).round(2).values,
    'empty_string_count': [(df[col] == '').sum() if df[col].dtype == 'object' else 0 for col in df.columns],
    'disguised_null_count': [
        df[col].isin(['N/A', 'n/a', 'NA', 'null', 'NULL', 'None', 'none', '-', '--', 'unknown', 'UNKNOWN', 'TBD', 'tbd']).sum()
        if df[col].dtype == 'object' else 0
        for col in df.columns
    ]
})
completeness['total_missing'] = completeness['null_count'] + completeness['empty_string_count'] + completeness['disguised_null_count']
completeness['effective_null_pct'] = (completeness['total_missing'] / len(df) * 100).round(2)

row_completeness = df.notnull().sum(axis=1) / len(df.columns) * 100
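One way to collapse the per-column table into a single completeness score for the scorecard at the end of this document; the unweighted mean here is an assumption, since the text does not prescribe an aggregation:

```python
# Unweighted mean across columns; per-column weighting is a judgment call.
completeness_score = 100 - completeness['effective_null_pct'].mean()
```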
pk_cols = ['id']  # or composite key
total_rows = len(df)
unique_rows = df[pk_cols].drop_duplicates().shape[0]
duplicate_count = total_rows - unique_rows

full_dupes = df.duplicated(keep=False).sum()

Beyond exact duplicates, also check identifier columns such as `user_id` for near-duplicate records.
**Uniqueness Score** = 100 if primary key is fully unique, else (unique_pk_count / total_rows) * 100
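A direct translation of that formula into code, reusing the counts computed above:

```python
# 100 when the primary key is fully unique, else the unique share of rows.
uniqueness_score = 100.0 if duplicate_count == 0 else unique_rows / total_rows * 100
```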
def detect_formats(series):
    patterns = {
        'email': r'^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}$',
        'phone_us': r'^\+?1?[-.\s]?\(?\d{3}\)?[-.\s]?\d{3}[-.\s]?\d{4}$',
        'uuid': r'^[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}$',
        'date_iso': r'^\d{4}-\d{2}-\d{2}$',
        'url': r'^https?://[^\s]+$',
        'zip_us': r'^\d{5}(-\d{4})?$',
    }
    results = {}
    non_null = series.dropna()
    for name, pattern in patterns.items():
        # Share of non-null values matching each known format.
        match_count = non_null.str.match(pattern).sum()
        if match_count > 0:
            results[name] = match_count / non_null.shape[0] * 100
    return results
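A usage sketch, running the detector over every text column of whatever frame is under assessment:

```python
# Report any column where a known format matches at least once.
for col in df.select_dtypes(include='object').columns:
    hits = detect_formats(df[col])
    if hits:
        print(col, {k: round(v, 1) for k, v in hits.items()})
```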
Cross-field consistency rules worth checking: `start_date <= end_date`; `quantity * unit_price ~= total_price`; `city`, `state`, and `country` should agree with one another; rows with `status = 'active'` should have a null `deleted_at`.
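A sketch of those checks, with illustrative column names and an assumed one-cent tolerance on the price arithmetic:

```python
# Dates out of order.
date_violations = (df['end_date'] < df['start_date']).sum()
# Line totals that disagree with quantity * unit_price beyond tolerance.
price_mismatch = ((df['quantity'] * df['unit_price'] - df['total_price']).abs() > 0.01).sum()
# Rows marked active that nonetheless carry a soft-delete timestamp.
zombie_rows = ((df['status'] == 'active') & df['deleted_at'].notnull()).sum()
```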
orphan_count = df1[~df1['foreign_key'].isin(df2['primary_key'])].shape[0]

valid_values = {'active', 'inactive', 'suspended'}
invalid = df[~df['status'].isin(valid_values) & df['status'].notnull()]

max_timestamp = df[timestamp_col].max()
freshness_lag = datetime.now() - max_timestamp

daily_counts = df.set_index(timestamp_col).resample('D').size()
missing_days = daily_counts[daily_counts == 0]
low_days = daily_counts[daily_counts < daily_counts.median() * 0.1]

If the data carries both an event timestamp (`event_time` or `created_at`) and a load timestamp (`loaded_at`), ingestion latency can be measured directly:

latency = (df['loaded_at'] - df['event_time']).dt.total_seconds()

range_checks = {
    'age': (0, 120),
    'price': (0, None),  # None = no upper bound
    'latitude': (-90, 90),
    'longitude': (-180, 180),
    'percentage': (0, 100),
}
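Applying the declared ranges is then mechanical; this sketch assumes the dict keys are actual column names in `df`:

```python
import pandas as pd

for col, (lo, hi) in range_checks.items():
    if col not in df.columns:
        continue
    out_of_range = pd.Series(False, index=df.index)
    if lo is not None:
        out_of_range |= df[col] < lo
    if hi is not None:
        out_of_range |= df[col] > hi
    print(f"{col}: {out_of_range.sum()} out-of-range values")
```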
expected_total_revenue = 1_234_567
actual_total_revenue = df['revenue'].sum()
variance_pct = abs(actual_total_revenue - expected_total_revenue) / expected_total_revenue * 100

============================================================
DATA QUALITY SCORECARD
============================================================
Dataset: [name]
Assessed: [timestamp]
Rows: [count]
Columns: [count]
------------------------------------------------------------
Dimension Score Grade Issues Found
------------------------------------------------------------
Completeness [XX]% [A-F] [count] issues
Uniqueness [XX]% [A-F] [count] issues
Consistency [XX]% [A-F] [count] issues
Timeliness [XX]% [A-F] [count] issues
Accuracy [XX]% [A-F] [count] issues
Validity [XX]% [A-F] [count] issues
------------------------------------------------------------
OVERALL SCORE [XX]% [A-F]
============================================================
Grading Scale: A (95-100) | B (85-94) | C (70-84) | D (50-69) | F (<50)
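A small helper to map scores onto that scale; averaging the six dimension scores equally for the overall grade is an assumption, since the text does not specify weights:

```python
def grade(score: float) -> str:
    # Thresholds mirror the grading scale above.
    if score >= 95:
        return 'A'
    if score >= 85:
        return 'B'
    if score >= 70:
        return 'C'
    if score >= 50:
        return 'D'
    return 'F'

dimension_scores = {
    'completeness': 97.2, 'uniqueness': 99.5, 'consistency': 92.0,
    'timeliness': 88.4, 'accuracy': 95.1, 'validity': 90.3,
}  # illustrative values
overall = sum(dimension_scores.values()) / len(dimension_scores)
print(f"{overall:.1f}% -> {grade(overall)}")
```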
| # | Dimension | Severity | Column(s) | Description | Records Affected | Recommended Action |
|---|---|---|---|---|---|---|
| 1 | Uniqueness | CRITICAL | order_id | 2,341 duplicate primary keys | 2,341 (0.5%) | Deduplicate; investigate pipeline |
| 2 | Accuracy | WARNING | price | 89 negative values | 89 (0.02%) | Validate business logic for refunds |
| ... |