clean-data-xls

Compare original and translation side by side

🇺🇸

Original

English
🇨🇳

Translation

Chinese

Clean Data

清理数据

Clean messy data in the active sheet or a specified range.
清理活动工作表或指定区域内的杂乱数据。

Environment

运行环境

  • If running inside Excel (Office Add-in / Office JS): Use Office JS directly (
    Excel.run(async (context) => {...})
    ). Read via
    range.values
    , write helper-column formulas via
    range.formulas = [["=TRIM(A2)"]]
    . The in-place vs helper-column decision still applies.
  • If operating on a standalone .xlsx file: Use Python/openpyxl.
  • 在Excel内运行(Office 插件 / Office JS): 直接使用Office JS(
    Excel.run(async (context) => {...})
    )。通过
    range.values
    读取数据,通过
    range.formulas = [["=TRIM(A2)"]]
    写入辅助列公式。仍需决定是直接修改原数据还是使用辅助列。
  • 处理独立.xlsx文件: 使用Python/openpyxl。

Workflow

工作流程

Step 1: Scope

步骤1:确定范围

  • If a range is given (e.g.
    A1:F200
    ), use it
  • Otherwise use the full used range of the active sheet
  • Profile each column: detect its dominant type (text / number / date) and identify outliers
  • 如果指定了数据范围(如
    A1:F200
    ),则使用该范围
  • 否则使用活动工作表的全部已用区域
  • 分析每一列:检测其主要数据类型(文本/数字/日期)并识别异常值

Step 2: Detect issues

步骤2:检测问题

IssueWhat to look for
Whitespaceleading/trailing spaces, double spaces
Casinginconsistent casing in categorical columns (
usa
/
USA
/
Usa
)
Number-as-textnumeric values stored as text; stray
$
,
,
,
%
in number cells
Datesmixed formats in the same column (
3/8/26
,
2026-03-08
,
March 8 2026
)
Duplicatesexact-duplicate rows and near-duplicates (case/whitespace differences)
Blanksempty cells in otherwise-populated columns
Mixed typesa column that's 98% numbers but has 3 text entries
Encodingmojibake (
é
,
’
), non-printing characters
Errors
#REF!
,
#N/A
,
#VALUE!
,
#DIV/0!
问题类型检测要点
空格问题前导/尾随空格、连续空格
大小写问题分类列中大小写不一致(如
usa
/
USA
/
Usa
文本格式存储数字数值以文本格式存储;数字单元格中存在多余的
$
,
%
符号
日期格式问题同一列中日期格式混合(如
3/8/26
2026-03-08
March 8 2026
重复项完全重复的行和近似重复项(大小写/空格差异导致)
空值非空列中的空单元格
混合类型某列98%为数字但包含3个文本条目
编码问题乱码(如
é
’
)、不可打印字符
公式错误
#REF!
#N/A
#VALUE!
#DIV/0!

Step 3: Propose fixes

步骤3:提出修复方案

Show a summary table before changing anything:
ColumnIssueCountProposed Fix
在进行任何修改前,展示汇总表格:
问题数量建议修复方案

Step 4: Apply

步骤4:执行修复

  • Prefer formulas over hardcoded cleaned values — where the cleaned output can be expressed as a formula (e.g.
    =TRIM(A2)
    ,
    =VALUE(SUBSTITUTE(B2,"$",""))
    ,
    =UPPER(C2)
    ,
    =DATEVALUE(D2)
    ), write the formula in an adjacent helper column rather than computing the result in Python and overwriting the original. This keeps the transformation transparent and auditable.
  • Only overwrite in place with computed values when the user explicitly asks for it, or when no sensible formula equivalent exists (e.g. encoding/mojibake repair)
  • For destructive operations (removing duplicates, filling blanks, overwriting originals), confirm with the user first
  • After each category of fix (whitespace → casing → number conversion → dates → dedup), show the user a sample of what changed and get confirmation before moving to the next category
  • Report a before/after summary of what changed
  • 优先使用公式而非硬编码清理后的值——如果清理后的结果可以用公式表示(如
    =TRIM(A2)
    =VALUE(SUBSTITUTE(B2,"$",""))
    =UPPER(C2)
    =DATEVALUE(D2)
    ),则在相邻的辅助列中写入公式,而非通过Python计算结果后覆盖原数据。这样可以保持转换过程的透明性和可审计性。
  • 仅当用户明确要求,或没有合适的公式替代方案时(如修复编码乱码),才直接覆盖原数据
  • 对于破坏性操作(删除重复项、填充空值、覆盖原数据),需先征得用户确认
  • 在完成每一类修复(空格→大小写→数字转换→日期→去重)后,向用户展示修改示例,确认后再进行下一类修复
  • 展示修改前后的汇总信息