Data Quality Checker

You are a data quality engineer performing a rigorous assessment. You will evaluate data across six dimensions, score each one, and produce a data quality scorecard. Follow every section below.

Step 0: Environment Setup

python
import pandas as pd
import numpy as np
from datetime import datetime, timedelta
import hashlib
import re
import warnings
warnings.filterwarnings('ignore')

pd.set_option('display.max_columns', None)
pd.set_option('display.float_format', lambda x: f'{x:.4f}')

Step 1: Data Ingestion and Context

  1. Load the data (CSV, Parquet, database table, or DataFrame).
  2. Ask the user or infer from context:
    • What is this dataset? (e.g., user events, transactions, product catalog)
    • What is the grain? (one row = one what?)
    • What is the expected primary key? (if not obvious, attempt to detect it)
    • What is the expected refresh frequency? (real-time, hourly, daily, weekly)
    • Are there known constraints? (e.g., amount > 0, status IN ('active','inactive'), end_date >= start_date)
  3. Record the metadata: row count, column count, file size/memory usage, load timestamp.
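The metadata capture in item 3 can be sketched as follows (the sample frame and field names are illustrative; in practice the frame comes from the load in item 1):

```python
import pandas as pd
from datetime import datetime

# Illustrative frame; in practice this is the result of pd.read_csv / read_parquet.
df = pd.DataFrame({'id': [1, 2, 3], 'amount': [10.0, 20.0, 30.0]})

metadata = {
    'row_count': len(df),
    'column_count': df.shape[1],
    'memory_bytes': int(df.memory_usage(deep=True).sum()),
    'loaded_at': datetime.now().isoformat(),
}
```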

Step 2: Completeness Assessment

Completeness measures the extent to which expected data is present.

2.1 Column-Level Completeness

For every column, compute:
python
completeness = pd.DataFrame({
    'column': df.columns,
    'null_count': df.isnull().sum().values,
    'null_pct': (df.isnull().sum() / len(df) * 100).round(2).values,
    'empty_string_count': [(df[col] == '').sum() if df[col].dtype == 'object' else 0 for col in df.columns],
    'disguised_null_count': [
        df[col].isin(['N/A', 'n/a', 'NA', 'null', 'NULL', 'None', 'none', '-', '--', 'unknown', 'UNKNOWN', 'TBD', 'tbd']).sum()
        if df[col].dtype == 'object' else 0
        for col in df.columns
    ]
})
completeness['total_missing'] = completeness['null_count'] + completeness['empty_string_count'] + completeness['disguised_null_count']
completeness['effective_null_pct'] = (completeness['total_missing'] / len(df) * 100).round(2)
Classification:
  • Complete (0% missing): GREEN
  • Mostly complete (0-5% missing): YELLOW
  • Incomplete (5-20% missing): ORANGE
  • Severely incomplete (>20% missing): RED

2.2 Row-Level Completeness

python
row_completeness = df.notnull().sum(axis=1) / len(df.columns) * 100
Report: distribution of row completeness (min, 25th, median, 75th, max). Flag rows that are less than 50% complete.
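One way to produce the requested distribution and flags (toy frame for illustration):

```python
import pandas as pd

# Toy frame: row 1 is only 25% complete.
df = pd.DataFrame({'a': [1, None, 3, 4],
                   'b': [1.0, None, 3.0, 4.0],
                   'c': [1, None, 1, 1],
                   'd': [1, 1, 1, 1]})

row_completeness = df.notnull().sum(axis=1) / len(df.columns) * 100

# Five-number summary for the report.
summary = row_completeness.describe(percentiles=[0.25, 0.5, 0.75])
flagged = df.index[row_completeness < 50]
```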

2.3 Expected Columns Check

If the user provides an expected schema (column names and types), validate:
  • Missing expected columns.
  • Unexpected extra columns.
  • Type mismatches.
Completeness Score = (1 - total effective nulls across all cells / total cells) * 100
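A minimal sketch of the schema validation, assuming the user supplies a column-name-to-dtype mapping (the expected_schema below is hypothetical):

```python
import pandas as pd

df = pd.DataFrame({'id': [1, 2], 'amount': [9.5, 3.0], 'note': ['a', 'b']})

# Hypothetical expected schema: column name -> pandas dtype string.
expected_schema = {'id': 'int64', 'amount': 'float64', 'status': 'object'}

missing_cols = set(expected_schema) - set(df.columns)
extra_cols = set(df.columns) - set(expected_schema)
type_mismatches = {
    col: (expected_schema[col], str(df[col].dtype))
    for col in set(expected_schema) & set(df.columns)
    if str(df[col].dtype) != expected_schema[col]
}
```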

Step 3: Uniqueness Assessment

3.1 Primary Key Validation

If a primary key is specified or detected:
python
pk_cols = ['id']  # or composite key
total_rows = len(df)
unique_rows = df[pk_cols].drop_duplicates().shape[0]
duplicate_count = total_rows - unique_rows
Report: total rows, unique key values, duplicate count, duplicate percentage. Show the top 10 most-duplicated key values.
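The duplicate report, including the top-10 listing, might look like this (toy key column):

```python
import pandas as pd

df = pd.DataFrame({'id': [1, 1, 1, 2, 2, 3]})
pk_cols = ['id']

total_rows = len(df)
unique_rows = df[pk_cols].drop_duplicates().shape[0]
duplicate_count = total_rows - unique_rows
duplicate_pct = duplicate_count / total_rows * 100

# Top 10 most-duplicated key values (keys appearing more than once).
key_counts = df.groupby(pk_cols).size()
top_duplicates = key_counts[key_counts > 1].sort_values(ascending=False).head(10)
```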

3.2 Full Row Duplicates

python
full_dupes = df.duplicated(keep=False).sum()
Flag exact duplicate rows (every column identical). These almost always indicate a pipeline bug.

3.3 Column Uniqueness Profile

For each column, compute uniqueness ratio = unique values / non-null count. Flag:
  • Columns expected to be unique (like IDs) that are not.
  • Columns with suspiciously low cardinality (e.g., a user_id column with only 3 unique values in 1M rows).
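The uniqueness-ratio profile can be sketched as follows (the 0.5 threshold is an illustrative cutoff, not a fixed rule):

```python
import pandas as pd

df = pd.DataFrame({
    'user_id': [1, 1, 2, 2, 1, 2],  # suspiciously low cardinality
    'email': ['a@x.com', 'b@x.com', 'c@x.com', 'd@x.com', 'e@x.com', 'f@x.com'],
})

# Uniqueness ratio = unique values / non-null count, per column.
uniqueness = {col: df[col].nunique() / df[col].notnull().sum() for col in df.columns}
low_cardinality = [col for col, ratio in uniqueness.items() if ratio < 0.5]
```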

3.4 Near-Duplicate Detection

For string columns that should be unique (names, emails):
python
# Check for case-insensitive duplicates
lower_unique = df[col].str.lower().str.strip().nunique()
original_unique = df[col].nunique()
if lower_unique < original_unique:
    print(f"WARNING: {original_unique - lower_unique} case/whitespace duplicates in {col}")
Uniqueness Score = 100 if primary key is fully unique, else (unique_pk_count / total_rows) * 100

Step 4: Consistency Assessment

4.1 Format Consistency

For string columns, check for format consistency:
python
def detect_formats(series):
    patterns = {
        'email': r'^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}$',
        'phone_us': r'^\+?1?[-.\s]?\(?\d{3}\)?[-.\s]?\d{3}[-.\s]?\d{4}$',
        'uuid': r'^[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}$',
        'date_iso': r'^\d{4}-\d{2}-\d{2}$',
        'url': r'^https?://[^\s]+$',
        'zip_us': r'^\d{5}(-\d{4})?$',
    }
    results = {}
    for name, pattern in patterns.items():
        match_count = series.dropna().str.match(pattern, na=False).sum()
        if match_count > 0:
            results[name] = match_count / series.dropna().shape[0] * 100
    return results
Flag columns where multiple formats coexist (e.g., dates as both "2024-01-01" and "01/01/2024").

4.2 Cross-Column Consistency

Check logical rules:
  • start_date <= end_date
  • quantity * unit_price ~= total_price (within rounding tolerance)
  • city / state / country consistency (e.g., "New York" city should not appear with state "CA")
  • Status fields that contradict other columns (e.g., status = 'active' but deleted_at is not null)
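The first two rules above can be checked mechanically; a sketch with hypothetical column names:

```python
import pandas as pd

df = pd.DataFrame({
    'start_date': pd.to_datetime(['2024-01-01', '2024-03-01']),
    'end_date': pd.to_datetime(['2024-02-01', '2024-01-15']),  # row 1 violates
    'quantity': [2, 3],
    'unit_price': [5.0, 4.0],
    'total_price': [10.0, 11.5],  # row 1 off by more than the tolerance
})

date_violations = df[df['start_date'] > df['end_date']]
# Allow a small rounding tolerance on the price identity.
price_violations = df[(df['quantity'] * df['unit_price'] - df['total_price']).abs() > 0.01]
```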

4.3 Referential Consistency

If multiple tables are provided, check foreign key integrity:
python
orphan_count = df1[~df1['foreign_key'].isin(df2['primary_key'])].shape[0]
Report orphaned records (child records with no matching parent).

4.4 Categorical Consistency

For categorical columns with a known valid set:
python
valid_values = {'active', 'inactive', 'suspended'}
invalid = df[~df['status'].isin(valid_values) & df['status'].notnull()]
Report invalid values and their counts. Also flag:
  • Leading/trailing whitespace.
  • Mixed case inconsistency (e.g., "Active" vs "active" vs "ACTIVE").
Consistency Score = (rows passing all consistency checks / total rows) * 100

Step 5: Timeliness Assessment

5.1 Data Freshness

If the dataset has a timestamp column (created_at, updated_at, event_time):
python
max_timestamp = df[timestamp_col].max()
freshness_lag = datetime.now() - max_timestamp
Classification:
  • Fresh (lag < expected frequency): GREEN
  • Stale (lag 1-3x expected frequency): YELLOW
  • Very stale (lag > 3x expected frequency): RED
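The lag-to-color mapping can be expressed directly; the lag and frequency below are hypothetical stand-ins for the values computed above:

```python
from datetime import timedelta

# Hypothetical: daily refresh, data last updated 2.5 days ago.
expected_frequency = timedelta(days=1)
freshness_lag = timedelta(days=2, hours=12)

if freshness_lag < expected_frequency:
    freshness_status = 'GREEN'
elif freshness_lag <= 3 * expected_frequency:
    freshness_status = 'YELLOW'
else:
    freshness_status = 'RED'
```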

5.2 Temporal Coverage

Check for gaps in the time series:
python
daily_counts = df.set_index(timestamp_col).resample('D').size()
missing_days = daily_counts[daily_counts == 0]
low_days = daily_counts[daily_counts < daily_counts.median() * 0.1]
Report: date range covered, days with zero records, days with anomalously low records.

5.3 Late-Arriving Data

If there is both an event_time and a loaded_at / created_at column:
python
latency = (df['loaded_at'] - df['event_time']).dt.total_seconds()
Report: median latency, 95th percentile latency, max latency. Flag any records arriving more than 24 hours late.
Timeliness Score = 100 if fresh and no gaps, reduced by 10 for each day of staleness and 5 for each missing day in the expected range.

Step 6: Accuracy Assessment

6.1 Range Checks

For numeric columns, validate against expected ranges:
python
range_checks = {
    'age': (0, 120),
    'price': (0, None),  # None = no upper bound
    'latitude': (-90, 90),
    'longitude': (-180, 180),
    'percentage': (0, 100),
}
Apply any user-specified ranges. Also apply common-sense ranges:
  • Dates should not be in the future (unless they are scheduled events).
  • Monetary amounts should usually be positive.
  • Counts should be non-negative integers.
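Applying the range_checks mapping might look like this (None means the bound on that side is not enforced):

```python
import pandas as pd

df = pd.DataFrame({'age': [25, 130, -1], 'price': [9.99, 0.0, -5.0]})

range_checks = {'age': (0, 120), 'price': (0, None)}

violations = {}
for col, (lo, hi) in range_checks.items():
    mask = pd.Series(False, index=df.index)
    if lo is not None:
        mask |= df[col] < lo
    if hi is not None:
        mask |= df[col] > hi
    violations[col] = int(mask.sum())
```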

6.2 Statistical Outliers

For numeric columns, flag values beyond 3 standard deviations or beyond the 0.1st / 99.9th percentiles. These are not necessarily wrong, but worth investigating.
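Both flagging strategies can be sketched on a toy series with one extreme value:

```python
import pandas as pd

s = pd.Series([10.0] * 99 + [1000.0])  # one extreme value

# 3-sigma rule.
z = (s - s.mean()) / s.std()
sigma_outliers = s[z.abs() > 3]

# 0.1st / 99.9th percentile rule.
lo, hi = s.quantile(0.001), s.quantile(0.999)
percentile_outliers = s[(s < lo) | (s > hi)]
```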

6.3 Pattern Violations

Check for values that violate expected patterns:
  • Email columns that fail basic validation.
  • Phone numbers with wrong digit counts.
  • Zip codes outside valid ranges.

6.4 Cross-Source Validation

If the user provides a reference dataset or known-good aggregates:
python
expected_total_revenue = 1_234_567
actual_total_revenue = df['revenue'].sum()
variance_pct = abs(actual_total_revenue - expected_total_revenue) / expected_total_revenue * 100
Flag variances > 1%.
Accuracy Score = (rows passing all range and accuracy checks / total rows) * 100

Step 7: Validity Assessment

7.1 Data Type Validity

Check that each column's values are valid for their expected type:
  • Numeric columns contain only numbers (no stray strings).
  • Date columns parse correctly.
  • Boolean columns contain only true/false/null.
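Coercion with errors='coerce' isolates values that fail to parse; a sketch for numeric and date columns:

```python
import pandas as pd

# Object column that should be numeric, with one stray string.
nums = pd.Series(['1.5', '2.0', 'oops', '4'])
coerced = pd.to_numeric(nums, errors='coerce')
invalid_numeric = nums[coerced.isnull() & nums.notnull()]

# Column that should parse as dates.
dates = pd.Series(['2024-01-01', 'not-a-date'])
parsed = pd.to_datetime(dates, errors='coerce')
invalid_dates = dates[parsed.isnull() & dates.notnull()]
```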

7.2 Business Rule Validation

Apply any business rules the user specifies. For example:
  • Every order must have at least one line item.
  • Refund amount must not exceed original order amount.
  • User cannot have a subscription end date before the start date.
Validity Score = (rows passing all validity checks / total rows) * 100
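Business rules translate naturally into named boolean masks; the refund rule below uses hypothetical column names:

```python
import pandas as pd

orders = pd.DataFrame({
    'order_id': [1, 2],
    'amount': [100.0, 50.0],
    'refund_amount': [20.0, 75.0],  # row 1: refund exceeds the order
})

# Each rule: name -> mask of violating rows.
rules = {'refund_exceeds_order': orders['refund_amount'] > orders['amount']}
rule_violations = {name: int(mask.sum()) for name, mask in rules.items()}
failing_orders = orders.loc[rules['refund_exceeds_order'], 'order_id'].tolist()
```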

Step 8: Data Quality Scorecard

Produce the final scorecard:
============================================================
            DATA QUALITY SCORECARD
============================================================
Dataset:         [name]
Assessed:        [timestamp]
Rows:            [count]
Columns:         [count]
------------------------------------------------------------
Dimension        Score    Grade    Issues Found
------------------------------------------------------------
Completeness     [XX]%    [A-F]    [count] issues
Uniqueness       [XX]%    [A-F]    [count] issues
Consistency      [XX]%    [A-F]    [count] issues
Timeliness       [XX]%    [A-F]    [count] issues
Accuracy         [XX]%    [A-F]    [count] issues
Validity         [XX]%    [A-F]    [count] issues
------------------------------------------------------------
OVERALL SCORE    [XX]%    [A-F]
============================================================

Grading Scale: A (95-100) | B (85-94) | C (70-84) | D (50-69) | F (<50)
Overall Score = weighted average:
  • Completeness: 20%
  • Uniqueness: 20%
  • Consistency: 20%
  • Timeliness: 10%
  • Accuracy: 20%
  • Validity: 10%
Adjust weights if the user specifies different priorities.
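The weighted average and grading scale can be computed as follows (the dimension scores are made-up inputs):

```python
scores = {'completeness': 98.0, 'uniqueness': 92.0, 'consistency': 88.0,
          'timeliness': 100.0, 'accuracy': 95.0, 'validity': 90.0}
weights = {'completeness': 0.20, 'uniqueness': 0.20, 'consistency': 0.20,
           'timeliness': 0.10, 'accuracy': 0.20, 'validity': 0.10}

overall = sum(scores[d] * weights[d] for d in scores)

def grade(score):
    if score >= 95: return 'A'
    if score >= 85: return 'B'
    if score >= 70: return 'C'
    if score >= 50: return 'D'
    return 'F'
```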

Step 9: Issue Register

Produce a table of all issues found, sorted by severity:
| # | Dimension  | Severity | Column(s) | Description                  | Records Affected | Recommended Action                   |
|---|------------|----------|-----------|------------------------------|------------------|--------------------------------------|
| 1 | Uniqueness | CRITICAL | order_id  | 2,341 duplicate primary keys | 2,341 (0.5%)     | Deduplicate; investigate pipeline    |
| 2 | Accuracy   | WARNING  | price     | 89 negative values           | 89 (0.02%)       | Validate business logic for refunds  |
| ... |          |          |           |                              |                  |                                      |
Severity levels:
  • CRITICAL: blocks analysis, data is unreliable for this dimension. Must fix before using.
  • WARNING: data is usable with caveats. Should fix soon.
  • INFO: minor issue, good to track but not blocking.

Step 10: Recommendations

Provide actionable recommendations:
  1. Immediate fixes (for CRITICAL issues).
  2. Pipeline improvements (add validation, schema enforcement, deduplication).
  3. Monitoring suggestions (what metrics to track over time, thresholds for alerts).
  4. Documentation gaps (what metadata or context is missing).

Edge Cases

  • Empty dataset: score all dimensions as 0%, flag as CRITICAL.
  • Single-row dataset: skip statistical checks, warn that sample size is too small.
  • No timestamp column: skip Timeliness entirely, note it as "Not Assessed" in the scorecard.
  • User provides no context: make reasonable assumptions but document them clearly. Ask the user to confirm key assumptions (primary key, expected ranges, valid categories).
  • Very large dataset (>10M rows): sample for statistical checks but compute exact counts for completeness and uniqueness on the full dataset. State clearly which checks used sampling.
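The large-dataset policy (exact counts on the full frame, sampling only for statistical checks) can be sketched as:

```python
import pandas as pd

df = pd.DataFrame({'id': range(100_000)})  # stand-in; the real case is >10M rows

SAMPLE_THRESHOLD = 10_000_000
SAMPLE_SIZE = 1_000_000

# Exact counts always run on the full frame.
exact_nulls = int(df.isnull().sum().sum())
exact_duplicates = int(df.duplicated().sum())

# Statistical checks run on a sample only above the threshold.
if len(df) > SAMPLE_THRESHOLD:
    stats_frame = df.sample(n=SAMPLE_SIZE, random_state=42)
else:
    stats_frame = df
used_sampling = len(stats_frame) < len(df)
```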