sn-da-large-file-analysis

Compare original and translation side by side

🇺🇸

Original

English

🇨🇳

Translation

Chinese

Large Scale Excel Analysis Skill

大规模Excel分析技能

Mandatory Rules

强制规则

When total rows >= 10,000, you MUST use the methods in this skill.

Data Scale	Read Strategy	Reason
< 10k rows	`pd.read_excel()` directly	No memory pressure
10k–100k rows	`pd.read_excel()` → convert to Parquet → `pd.read_parquet()` for analysis	Avoid repeated slow reads
100k–1M rows	openpyxl `read_only` + `iter_rows` streaming → Parquet	`pd.read_excel()` will OOM or timeout
> 1M rows	Streaming read + multi-sheet split (Excel max 1,048,576 rows per sheet)	Must chunk

Prohibited:

Do NOT use
```
pd.read_excel()
```
to fully load 100k+ row files
Do NOT search for fonts with
```
fc-list
```
,
```
find ... fonts
```
, or install packages with
```
pip install
```
Do NOT use
```
df.iterrows()
```
on large DataFrames (use
```
itertuples()
```
or vectorized ops)
Do NOT use
```
df.apply(lambda...)
```
for operations that can be vectorized

当总行数≥10000时，必须使用本技能中的方法。

数据规模	读取策略	原因
<10000行	直接使用 `pd.read_excel()`	无内存压力
10000–100000行	`pd.read_excel()` → 转换为Parquet → 使用 `pd.read_parquet()` 进行分析	避免重复慢读取
100000–1000000行	openpyxl `read_only` + `iter_rows` 流式读取 → Parquet	`pd.read_excel()` 会导致内存溢出（OOM）或超时
>1000000行	流式读取 + 多工作表拆分（Excel单工作表最大支持1,048,576行）	必须分块处理

禁止操作：

请勿使用
```
pd.read_excel()
```
完整加载10万行以上的文件
请勿使用
```
fc-list
```
、
```
find ... fonts
```
搜索字体，或使用
```
pip install
```
安装包
请勿在大型DataFrame上使用
```
df.iterrows()
```
（使用
```
itertuples()
```
或向量化操作）
对于可向量化的操作，请勿使用
```
df.apply(lambda...)
```

Environment Setup

环境配置

python

import pandas as pd
import numpy as np
import os
import gc

pd.options.mode.copy_on_write = True

python

import pandas as pd
import numpy as np
import os
import gc

pd.options.mode.copy_on_write = True

CJK font setup (fixed paths — do NOT search for fonts)

⚠️ Copy this block as-is. Do NOT use fc-list, find, subprocess, or glob to locate fonts.

import matplotlib import matplotlib.pyplot as plt import matplotlib.font_manager as fm

_FONT_PATHS = [ '/mnt/afs_agents/SimHei.ttf', '/mnt/afs_agents/mnt/data/SimHei.ttf', os.path.expanduser('~/.fonts/SimHei.ttf'), '/usr/share/fonts/truetype/wqy/wqy-zenhei.ttc', '/usr/share/fonts/SimHei.ttf', ] for _p in _FONT_PATHS: if os.path.exists(_p): fm.fontManager.addfont(_p) matplotlib.rcParams['font.family'] = fm.FontProperties(fname=_p).get_name() break matplotlib.rcParams['axes.unicode_minus'] = False

---

import matplotlib import matplotlib.pyplot as plt import matplotlib.font_manager as fm

---

Core Method 1: Inspect File Structure (Without Loading Data)

核心方法1：检查文件结构（无需加载数据）

Before any operation on a large file, inspect sheets and row counts without loading data into memory:

python

import openpyxl

def inspect_excel(file_path):
    """Stream-inspect Excel structure. Returns {sheet_name: {rows, columns}}."""
    wb = openpyxl.load_workbook(file_path, read_only=True, data_only=True)
    info = {}
    for name in wb.sheetnames:
        ws = wb[name]
        row_count = 0
        header = None
        for i, row in enumerate(ws.iter_rows(values_only=True)):
            if i == 0:
                header = [str(c) if c is not None else f"Col_{j}" for j, c in enumerate(row)]
            else:
                row_count += 1
        info[name] = {"rows": row_count, "columns": header}
    wb.close()
    return info

在对大文件执行任何操作前，无需将数据加载到内存即可检查工作表和行数：

python

import openpyxl

def inspect_excel(file_path):
    """Stream-inspect Excel structure. Returns {sheet_name: {rows, columns}}."""
    wb = openpyxl.load_workbook(file_path, read_only=True, data_only=True)
    info = {}
    for name in wb.sheetnames:
        ws = wb[name]
        row_count = 0
        header = None
        for i, row in enumerate(ws.iter_rows(values_only=True)):
            if i == 0:
                header = [str(c) if c is not None else f"Col_{j}" for j, c in enumerate(row)]
            else:
                row_count += 1
        info[name] = {"rows": row_count, "columns": header}
    wb.close()
    return info

Usage

file_info = inspect_excel(file_path) for sheet, meta in file_info.items(): print(f"Sheet '{sheet}': {meta['rows']} rows, {len(meta['columns'])} cols") print(f" Columns: {meta['columns'][:10]}...") total_rows = sum(m['rows'] for m in file_info.values()) print(f"Total rows: {total_rows}")

---

---

Core Method 2: Streaming Read → Parquet (100k+ Rows)

核心方法2：流式读取→Parquet（10万行以上）

For 100k+ row files, never use

pd.read_excel()

. Use openpyxl streaming → Parquet:

python

import openpyxl
import pyarrow as pa
import pyarrow.parquet as pq

def stream_excel_to_parquet(excel_path, parquet_path, sheet_name=None, chunk_size=50000):
    """Stream Excel rows to Parquet with constant memory usage.

    All columns are cast to string to avoid cross-chunk schema mismatches
    (Excel mixed-type columns may be all-None in some chunks, causing PyArrow
    to infer null type instead of string). Convert numeric columns after loading
    Parquet with pd.to_numeric() as needed.
    """
    wb = openpyxl.load_workbook(excel_path, read_only=True, data_only=True)
    ws = wb[sheet_name] if sheet_name else wb.active

    header = None
    writer = None
    chunk_rows = []
    total_written = 0

    def _flush(rows):
        nonlocal writer
        table = pa.table({
            col: pa.array(
                [str(r[idx]) if r[idx] is not None else None for r in rows],
                type=pa.string(),
            )
            for idx, col in enumerate(header)
        })
        if writer is None:
            writer = pq.ParquetWriter(parquet_path, table.schema)
        writer.write_table(table)

    for i, row in enumerate(ws.iter_rows(values_only=True)):
        if i == 0:
            header = [str(c) if c is not None else f"Col_{j}" for j, c in enumerate(row)]
            continue

        chunk_rows.append(list(row))

        if len(chunk_rows) >= chunk_size:
            _flush(chunk_rows)
            total_written += len(chunk_rows)
            print(f"  Written {total_written:,} rows...")
            chunk_rows = []
            gc.collect()

    if chunk_rows:
        _flush(chunk_rows)
        total_written += len(chunk_rows)

    if writer:
        writer.close()
    wb.close()
    print(f"Done: {total_written:,} rows -> {parquet_path}")
    return total_written

对于10万行以上的文件，绝对不要使用

pd.read_excel()

。使用openpyxl流式读取→Parquet：

python

import openpyxl
import pyarrow as pa
import pyarrow.parquet as pq

def stream_excel_to_parquet(excel_path, parquet_path, sheet_name=None, chunk_size=50000):
    """Stream Excel rows to Parquet with constant memory usage.

    All columns are cast to string to avoid cross-chunk schema mismatches
    (Excel mixed-type columns may be all-None in some chunks, causing PyArrow
    to infer null type instead of string). Convert numeric columns after loading
    Parquet with pd.to_numeric() as needed.
    """
    wb = openpyxl.load_workbook(excel_path, read_only=True, data_only=True)
    ws = wb[sheet_name] if sheet_name else wb.active

    header = None
    writer = None
    chunk_rows = []
    total_written = 0

    def _flush(rows):
        nonlocal writer
        table = pa.table({
            col: pa.array(
                [str(r[idx]) if r[idx] is not None else None for r in rows],
                type=pa.string(),
            )
            for idx, col in enumerate(header)
        })
        if writer is None:
            writer = pq.ParquetWriter(parquet_path, table.schema)
        writer.write_table(table)

    for i, row in enumerate(ws.iter_rows(values_only=True)):
        if i == 0:
            header = [str(c) if c is not None else f"Col_{j}" for j, c in enumerate(row)]
            continue

        chunk_rows.append(list(row))

        if len(chunk_rows) >= chunk_size:
            _flush(chunk_rows)
            total_written += len(chunk_rows)
            print(f"  Written {total_written:,} rows...")
            chunk_rows = []
            gc.collect()

    if chunk_rows:
        _flush(chunk_rows)
        total_written += len(chunk_rows)

    if writer:
        writer.close()
    wb.close()
    print(f"Done: {total_written:,} rows -> {parquet_path}")
    return total_written

Core Method 3: Medium File Parquet Conversion (10k–100k Rows)

核心方法3：中等文件Parquet转换（1万–10万行）

For 10k–100k rows,

pd.read_excel()

won't OOM, but Parquet is much faster for repeated analysis:

python

def convert_excel_to_parquet(excel_path, parquet_path, sheet_name=0):
    """Medium file: pd.read_excel -> Parquet cache."""
    if os.path.exists(parquet_path):
        print(f"Cache exists: {parquet_path}")
        return
    df = pd.read_excel(excel_path, sheet_name=sheet_name)
    df.columns = df.columns.astype(str)
    df.to_parquet(parquet_path, engine='pyarrow', compression='snappy')
    row_count = len(df)
    del df
    gc.collect()
    print(f"Converted {row_count:,} rows -> {parquet_path}")

对于1万–10万行的文件，

pd.read_excel()

不会导致内存溢出，但Parquet在重复分析时速度快得多：

python

def convert_excel_to_parquet(excel_path, parquet_path, sheet_name=0):
    """Medium file: pd.read_excel -> Parquet cache."""
    if os.path.exists(parquet_path):
        print(f"Cache exists: {parquet_path}")
        return
    df = pd.read_excel(excel_path, sheet_name=sheet_name)
    df.columns = df.columns.astype(str)
    df.to_parquet(parquet_path, engine='pyarrow', compression='snappy')
    row_count = len(df)
    del df
    gc.collect()
    print(f"Converted {row_count:,} rows -> {parquet_path}")

Core Method 4: Memory Optimization (Type Downcasting)

核心方法4：内存优化（类型向下转换）

After loading Parquet, further reduce memory footprint:

python

def optimize_dtypes(df):
    """Auto-downcast numeric types + convert low-cardinality strings to Category.
    Typically saves 50-80% memory."""
    start_mb = df.memory_usage(deep=True).sum() / 1024**2

    for col in df.select_dtypes(include=['int64', 'int32']).columns:
        c_min, c_max = df[col].min(), df[col].max()
        if c_min >= np.iinfo(np.int8).min and c_max <= np.iinfo(np.int8).max:
            df[col] = df[col].astype(np.int8)
        elif c_min >= np.iinfo(np.int16).min and c_max <= np.iinfo(np.int16).max:
            df[col] = df[col].astype(np.int16)
        elif c_min >= np.iinfo(np.int32).min and c_max <= np.iinfo(np.int32).max:
            df[col] = df[col].astype(np.int32)

    for col in df.select_dtypes(include=['float64']).columns:
        df[col] = df[col].astype(np.float32)

    for col in df.select_dtypes(include=['object', 'string']).columns:
        if df[col].nunique() / max(len(df), 1) < 0.5:
            df[col] = df[col].astype('category')

    end_mb = df.memory_usage(deep=True).sum() / 1024**2
    print(f"Memory: {start_mb:.1f} MB -> {end_mb:.1f} MB (saved {(1 - end_mb/start_mb)*100:.0f}%)")
    return df

加载Parquet后，进一步减少内存占用：

python

def optimize_dtypes(df):
    """Auto-downcast numeric types + convert low-cardinality strings to Category.
    Typically saves 50-80% memory."""
    start_mb = df.memory_usage(deep=True).sum() / 1024**2

    for col in df.select_dtypes(include=['int64', 'int32']).columns:
        c_min, c_max = df[col].min(), df[col].max()
        if c_min >= np.iinfo(np.int8).min and c_max <= np.iinfo(np.int8).max:
            df[col] = df[col].astype(np.int8)
        elif c_min >= np.iinfo(np.int16).min and c_max <= np.iinfo(np.int16).max:
            df[col] = df[col].astype(np.int16)
        elif c_min >= np.iinfo(np.int32).min and c_max <= np.iinfo(np.int32).max:
            df[col] = df[col].astype(np.int32)

    for col in df.select_dtypes(include=['float64']).columns:
        df[col] = df[col].astype(np.float32)

    for col in df.select_dtypes(include=['object', 'string']).columns:
        if df[col].nunique() / max(len(df), 1) < 0.5:
            df[col] = df[col].astype('category')

    end_mb = df.memory_usage(deep=True).sum() / 1024**2
    print(f"Memory: {start_mb:.1f} MB -> {end_mb:.1f} MB (saved {(1 - end_mb/start_mb)*100:.0f}%)")
    return df

Core Method 5: Large File Writing

核心方法5：大文件写入

python

def write_large_excel(df, output_path, sheet_name="Sheet1"):
    """Auto-select write strategy based on data size."""
    total_cells = len(df) * len(df.columns)

    if len(df) > 1_000_000:
        csv_path = output_path.rsplit('.', 1)[0] + '.csv'
        df.to_csv(csv_path, index=False)
        print(f"Over 1M rows — exported as CSV: {csv_path}")
        return csv_path

    if total_cells > 50_000:
        from openpyxl import Workbook
        from openpyxl.cell import WriteOnlyCell

        wb = Workbook(write_only=True)
        ws = wb.create_sheet(title=sheet_name)
        ws.append(list(df.columns))
        for idx, row in enumerate(df.itertuples(index=False)):
            ws.append([None if pd.isna(v) else v for v in row])
            if (idx + 1) % 100_000 == 0:
                print(f"  Written {idx + 1:,} rows...")
        wb.save(output_path)
        wb.close()
        print(f"write_only mode: {len(df):,} rows -> {output_path}")
    else:
        df.to_excel(output_path, index=False, sheet_name=sheet_name)
        print(f"Standard write: {len(df):,} rows -> {output_path}")
    return output_path

python

def write_large_excel(df, output_path, sheet_name="Sheet1"):
    """Auto-select write strategy based on data size."""
    total_cells = len(df) * len(df.columns)

    if len(df) > 1_000_000:
        csv_path = output_path.rsplit('.', 1)[0] + '.csv'
        df.to_csv(csv_path, index=False)
        print(f"Over 1M rows — exported as CSV: {csv_path}")
        return csv_path

    if total_cells > 50_000:
        from openpyxl import Workbook
        from openpyxl.cell import WriteOnlyCell

        wb = Workbook(write_only=True)
        ws = wb.create_sheet(title=sheet_name)
        ws.append(list(df.columns))
        for idx, row in enumerate(df.itertuples(index=False)):
            ws.append([None if pd.isna(v) else v for v in row])
            if (idx + 1) % 100_000 == 0:
                print(f"  Written {idx + 1:,} rows...")
        wb.save(output_path)
        wb.close()
        print(f"write_only mode: {len(df):,} rows -> {output_path}")
    else:
        df.to_excel(output_path, index=False, sheet_name=sheet_name)
        print(f"Standard write: {len(df):,} rows -> {output_path}")
    return output_path

Example 1: 100k-Row Table — Column Distribution + Chart

示例1：10万行表格——列分布+图表

Scenario: User has a 100k-row sales Excel file and wants regional sales distribution with a bar chart.

python

import pandas as pd
import os, gc

excel_path = "sales_100k.xlsx"
parquet_path = "sales_100k.parquet"

场景：用户有一个10万行的销售Excel文件，想要生成区域销售分布柱状图。

python

import pandas as pd
import os, gc

excel_path = "sales_100k.xlsx"
parquet_path = "sales_100k.parquet"

=== Step 1: Inspect structure ===

file_info = inspect_excel(excel_path) total_rows = sum(m['rows'] for m in file_info.values()) print(f"Total rows: {total_rows}")

=== Step 2: Choose read strategy by row count ===

if total_rows >= 100_000: stream_excel_to_parquet(excel_path, parquet_path) else: convert_excel_to_parquet(excel_path, parquet_path)

=== Step 3: Load Parquet + optimize memory ===

df = pd.read_parquet(parquet_path) df = optimize_dtypes(df) print(f"Shape: {df.shape}") print(df.head(3))

=== Step 4: Analysis ===

region_sales = df.groupby('Region')['Sales'].sum().sort_values(ascending=False) print(region_sales)

=== Step 5: Visualization ===

fig, ax = plt.subplots(figsize=(10, 6)) region_sales.plot(kind='bar', ax=ax, color='#4C72B0') ax.set_title('Sales by Region') ax.set_ylabel('Sales') plt.tight_layout() plt.savefig('region_sales.png', dpi=150, bbox_inches='tight') plt.show()

=== Step 6: Cleanup ===

del df gc.collect()

---

del df gc.collect()

---

Example 2: 1M-Row Table — Streaming Read + Filter + Export

示例2：100万行表格——流式读取+过滤+导出

Scenario: User has a 1M-row transaction log and wants records with amount > 10,000 exported.

python

import pandas as pd
import os, gc

excel_path = "transactions_1m.xlsx"
parquet_path = "transactions_1m.parquet"

场景：用户有一个100万行的交易日志，想要导出金额>10000的记录。

python

import pandas as pd
import os, gc

excel_path = "transactions_1m.xlsx"
parquet_path = "transactions_1m.parquet"

=== Step 1: Stream to Parquet (1M rows — MUST use streaming, never pd.read_excel) ===

stream_excel_to_parquet(excel_path, parquet_path, chunk_size=50000)

=== Step 2: Load only needed columns (saves memory) ===

df = pd.read_parquet(parquet_path, columns=['TransactionID', 'Amount', 'Date', 'Type']) df = optimize_dtypes(df) print(f"Shape: {df.shape}, Memory: {df.memory_usage(deep=True).sum()/1024**2:.1f} MB")

=== Step 3: Vectorized filtering (never use apply/iterrows) ===

mask = df['Amount'] > 10000 high_value = df[mask].copy() print(f"Filtered: {len(high_value):,} / {len(df):,} rows")

=== Step 4: Export ===

output_path = write_large_excel(high_value, 'high_value_transactions.xlsx')

=== Step 5: Cleanup ===

del df, high_value gc.collect()

---

del df, high_value gc.collect()

---

Vectorized Operations Cheat Sheet

向量化操作速查表

On large files, never use slow operations — use vectorized alternatives:

Slow (Prohibited)	Fast (Use This)
`df.apply(lambda x: x*2)`	`df['col'] * 2`
`df.iterrows()`	`df.itertuples(index=False)`
`for i in range(len(df)): df.iloc[i]`	Vectorized boolean indexing `df[mask]`
`df['a'].map(lambda x: 'Y' if x>0 else 'N')`	`np.where(df['a']>0, 'Y', 'N')`
`df.groupby('a').apply(custom_func)`	`df.groupby('a').agg({'b':'sum','c':'mean'})`

在处理大文件时，绝对不要使用慢速操作——改用向量化替代方案：

慢速（禁止）	快速（推荐）
`df.apply(lambda x: x*2)`	`df['col'] * 2`
`df.iterrows()`	`df.itertuples(index=False)`
`for i in range(len(df)): df.iloc[i]`	向量化布尔索引 `df[mask]`
`df['a'].map(lambda x: 'Y' if x>0 else 'N')`	`np.where(df['a']>0, 'Y', 'N')`
`df.groupby('a').apply(custom_func)`	`df.groupby('a').agg({'b':'sum','c':'mean'})`

Memory Estimation

内存估算

Estimate memory before loading to avoid OOM:

Estimated MB ≈ rows × cols × 8 / 1024² (numeric columns)
Estimated MB ≈ rows × cols × 50 / 1024² (with text columns)

Rows	20 cols (numeric)	20 cols (with text)
100k	~15 MB	~95 MB
500k	~76 MB	~477 MB
1M	~153 MB	~953 MB

When estimated memory exceeds 80% of available RAM, use column-selective loading (

pd.read_parquet(columns=[...])

) or chunked processing.

加载前估算内存以避免OOM：

Estimated MB ≈ rows × cols × 8 / 1024² (numeric columns)
Estimated MB ≈ rows × cols × 50 / 1024² (with text columns)

行数	20列（数值型）	20列（含文本）
10万	~15 MB	~95 MB
50万	~76 MB	~477 MB
100万	~153 MB	~953 MB

当估算内存超过可用RAM的80%时，使用选择性列加载（

pd.read_parquet(columns=[...])

）或分块处理。

Best Practices

最佳实践

Parquet is king: For any file >= 10k rows, convert to Parquet before analysis. Parquet supports columnar reads, compressed storage, and loads 10-50x faster than xlsx.
Streaming is the safety net: For 100k+ rows, always use openpyxl
```
read_only
```
+
```
iter_rows
```
. Never
```
pd.read_excel()
```
for full load.
Release memory promptly:
```
del df; gc.collect()
```
after every intermediate DataFrame.
Excel row limit: Max 1,048,576 rows per sheet. Auto-split to multiple sheets or export as CSV when exceeded.
Use write_only for output: Files with >50k cells must use
```
openpyxl Workbook(write_only=True)
```
.

Parquet是首选：对于任何≥1万行的文件，分析前先转换为Parquet。Parquet支持列读取、压缩存储，加载速度比xlsx快10-50倍。
流式读取是安全保障：对于10万行以上的文件，始终使用openpyxl
```
read_only
```
+
```
iter_rows
```
。绝对不要用
```
pd.read_excel()
```
完整加载。
及时释放内存：每个中间DataFrame处理完后执行
```
del df; gc.collect()
```
。
Excel行数限制：单工作表最大支持1,048,576行。超过时自动拆分为多个工作表或导出为CSV。
输出使用write_only模式：单元格数>5万的文件必须使用
```
openpyxl Workbook(write_only=True)
```
。