clean-data-xls
Compare original and translation side by side
🇺🇸
Original
English🇨🇳
Translation
ChineseClean Data
清理数据
Clean messy data in the active sheet or a specified range.
清理活动工作表或指定区域内的杂乱数据。
Environment
运行环境
- If running inside Excel (Office Add-in / Office JS): Use Office JS directly (). Read via
Excel.run(async (context) => {...}), write helper-column formulas viarange.values. The in-place vs helper-column decision still applies.range.formulas = [["=TRIM(A2)"]] - If operating on a standalone .xlsx file: Use Python/openpyxl.
- 在Excel内运行(Office 插件 / Office JS): 直接使用Office JS()。通过
Excel.run(async (context) => {...})读取数据,通过range.values写入辅助列公式。仍需决定是直接修改原数据还是使用辅助列。range.formulas = [["=TRIM(A2)"]] - 处理独立.xlsx文件: 使用Python/openpyxl。
Workflow
工作流程
Step 1: Scope
步骤1:确定范围
- If a range is given (e.g. ), use it
A1:F200 - Otherwise use the full used range of the active sheet
- Profile each column: detect its dominant type (text / number / date) and identify outliers
- 如果指定了数据范围(如),则使用该范围
A1:F200 - 否则使用活动工作表的全部已用区域
- 分析每一列:检测其主要数据类型(文本/数字/日期)并识别异常值
Step 2: Detect issues
步骤2:检测问题
| Issue | What to look for |
|---|---|
| Whitespace | leading/trailing spaces, double spaces |
| Casing | inconsistent casing in categorical columns ( |
| Number-as-text | numeric values stored as text; stray |
| Dates | mixed formats in the same column ( |
| Duplicates | exact-duplicate rows and near-duplicates (case/whitespace differences) |
| Blanks | empty cells in otherwise-populated columns |
| Mixed types | a column that's 98% numbers but has 3 text entries |
| Encoding | mojibake ( |
| Errors | |
| 问题类型 | 检测要点 |
|---|---|
| 空格问题 | 前导/尾随空格、连续空格 |
| 大小写问题 | 分类列中大小写不一致(如 |
| 文本格式存储数字 | 数值以文本格式存储;数字单元格中存在多余的 |
| 日期格式问题 | 同一列中日期格式混合(如 |
| 重复项 | 完全重复的行和近似重复项(大小写/空格差异导致) |
| 空值 | 非空列中的空单元格 |
| 混合类型 | 某列98%为数字但包含3个文本条目 |
| 编码问题 | 乱码(如 |
| 公式错误 | |
Step 3: Propose fixes
步骤3:提出修复方案
Show a summary table before changing anything:
| Column | Issue | Count | Proposed Fix |
|---|
在进行任何修改前,展示汇总表格:
| 列 | 问题 | 数量 | 建议修复方案 |
|---|
Step 4: Apply
步骤4:执行修复
- Prefer formulas over hardcoded cleaned values — where the cleaned output can be expressed as a formula (e.g. ,
=TRIM(A2),=VALUE(SUBSTITUTE(B2,"$","")),=UPPER(C2)), write the formula in an adjacent helper column rather than computing the result in Python and overwriting the original. This keeps the transformation transparent and auditable.=DATEVALUE(D2) - Only overwrite in place with computed values when the user explicitly asks for it, or when no sensible formula equivalent exists (e.g. encoding/mojibake repair)
- For destructive operations (removing duplicates, filling blanks, overwriting originals), confirm with the user first
- After each category of fix (whitespace → casing → number conversion → dates → dedup), show the user a sample of what changed and get confirmation before moving to the next category
- Report a before/after summary of what changed
- 优先使用公式而非硬编码清理后的值——如果清理后的结果可以用公式表示(如、
=TRIM(A2)、=VALUE(SUBSTITUTE(B2,"$",""))、=UPPER(C2)),则在相邻的辅助列中写入公式,而非通过Python计算结果后覆盖原数据。这样可以保持转换过程的透明性和可审计性。=DATEVALUE(D2) - 仅当用户明确要求,或没有合适的公式替代方案时(如修复编码乱码),才直接覆盖原数据
- 对于破坏性操作(删除重复项、填充空值、覆盖原数据),需先征得用户确认
- 在完成每一类修复(空格→大小写→数字转换→日期→去重)后,向用户展示修改示例,确认后再进行下一类修复
- 展示修改前后的汇总信息