sn-da-large-file-analysis

Compare original and translation side by side

🇺🇸

Original

English
🇨🇳

Translation

Chinese

Large Scale Excel Analysis Skill

大规模Excel分析技能

Mandatory Rules

强制规则

When total rows >= 10,000, you MUST use the methods in this skill.
Data ScaleRead StrategyReason
< 10k rows
pd.read_excel()
directly
No memory pressure
10k–100k rows
pd.read_excel()
→ convert to Parquet →
pd.read_parquet()
for analysis
Avoid repeated slow reads
100k–1M rowsopenpyxl
read_only
+
iter_rows
streaming
→ Parquet
pd.read_excel()
will OOM or timeout
> 1M rowsStreaming read + multi-sheet split (Excel max 1,048,576 rows per sheet)Must chunk
Prohibited:
  • Do NOT use
    pd.read_excel()
    to fully load 100k+ row files
  • Do NOT search for fonts with
    fc-list
    ,
    find ... fonts
    , or install packages with
    pip install
  • Do NOT use
    df.iterrows()
    on large DataFrames (use
    itertuples()
    or vectorized ops)
  • Do NOT use
    df.apply(lambda...)
    for operations that can be vectorized

当总行数≥10000时,必须使用本技能中的方法。
数据规模读取策略原因
<10000行直接使用
pd.read_excel()
无内存压力
10000–100000行
pd.read_excel()
→ 转换为Parquet → 使用
pd.read_parquet()
进行分析
避免重复慢读取
100000–1000000行openpyxl
read_only
+
iter_rows
流式读取
→ Parquet
pd.read_excel()
会导致内存溢出(OOM)或超时
>1000000行流式读取 + 多工作表拆分(Excel单工作表最大支持1,048,576行)必须分块处理
禁止操作:
  • 请勿使用
    pd.read_excel()
    完整加载10万行以上的文件
  • 请勿使用
    fc-list
    find ... fonts
    搜索字体,或使用
    pip install
    安装包
  • 请勿在大型DataFrame上使用
    df.iterrows()
    (使用
    itertuples()
    或向量化操作)
  • 对于可向量化的操作,请勿使用
    df.apply(lambda...)

Environment Setup

环境配置

python
import pandas as pd
import numpy as np
import os
import gc

pd.options.mode.copy_on_write = True
python
import pandas as pd
import numpy as np
import os
import gc

pd.options.mode.copy_on_write = True

CJK font setup (fixed paths — do NOT search for fonts)

CJK font setup (fixed paths — do NOT search for fonts)

⚠️ Copy this block as-is. Do NOT use fc-list, find, subprocess, or glob to locate fonts.

⚠️ Copy this block as-is. Do NOT use fc-list, find, subprocess, or glob to locate fonts.

import matplotlib import matplotlib.pyplot as plt import matplotlib.font_manager as fm
_FONT_PATHS = [ '/mnt/afs_agents/SimHei.ttf', '/mnt/afs_agents/mnt/data/SimHei.ttf', os.path.expanduser('~/.fonts/SimHei.ttf'), '/usr/share/fonts/truetype/wqy/wqy-zenhei.ttc', '/usr/share/fonts/SimHei.ttf', ] for _p in _FONT_PATHS: if os.path.exists(_p): fm.fontManager.addfont(_p) matplotlib.rcParams['font.family'] = fm.FontProperties(fname=_p).get_name() break matplotlib.rcParams['axes.unicode_minus'] = False

---
import matplotlib import matplotlib.pyplot as plt import matplotlib.font_manager as fm
_FONT_PATHS = [ '/mnt/afs_agents/SimHei.ttf', '/mnt/afs_agents/mnt/data/SimHei.ttf', os.path.expanduser('~/.fonts/SimHei.ttf'), '/usr/share/fonts/truetype/wqy/wqy-zenhei.ttc', '/usr/share/fonts/SimHei.ttf', ] for _p in _FONT_PATHS: if os.path.exists(_p): fm.fontManager.addfont(_p) matplotlib.rcParams['font.family'] = fm.FontProperties(fname=_p).get_name() break matplotlib.rcParams['axes.unicode_minus'] = False

---

Core Method 1: Inspect File Structure (Without Loading Data)

核心方法1:检查文件结构(无需加载数据)

Before any operation on a large file, inspect sheets and row counts without loading data into memory:
python
import openpyxl

def inspect_excel(file_path):
    """Stream-inspect Excel structure. Returns {sheet_name: {rows, columns}}."""
    wb = openpyxl.load_workbook(file_path, read_only=True, data_only=True)
    info = {}
    for name in wb.sheetnames:
        ws = wb[name]
        row_count = 0
        header = None
        for i, row in enumerate(ws.iter_rows(values_only=True)):
            if i == 0:
                header = [str(c) if c is not None else f"Col_{j}" for j, c in enumerate(row)]
            else:
                row_count += 1
        info[name] = {"rows": row_count, "columns": header}
    wb.close()
    return info
在对大文件执行任何操作前,无需将数据加载到内存即可检查工作表和行数:
python
import openpyxl

def inspect_excel(file_path):
    """Stream-inspect Excel structure. Returns {sheet_name: {rows, columns}}."""
    wb = openpyxl.load_workbook(file_path, read_only=True, data_only=True)
    info = {}
    for name in wb.sheetnames:
        ws = wb[name]
        row_count = 0
        header = None
        for i, row in enumerate(ws.iter_rows(values_only=True)):
            if i == 0:
                header = [str(c) if c is not None else f"Col_{j}" for j, c in enumerate(row)]
            else:
                row_count += 1
        info[name] = {"rows": row_count, "columns": header}
    wb.close()
    return info

Usage

Usage

file_info = inspect_excel(file_path) for sheet, meta in file_info.items(): print(f"Sheet '{sheet}': {meta['rows']} rows, {len(meta['columns'])} cols") print(f" Columns: {meta['columns'][:10]}...") total_rows = sum(m['rows'] for m in file_info.values()) print(f"Total rows: {total_rows}")

---
file_info = inspect_excel(file_path) for sheet, meta in file_info.items(): print(f"Sheet '{sheet}': {meta['rows']} rows, {len(meta['columns'])} cols") print(f" Columns: {meta['columns'][:10]}...") total_rows = sum(m['rows'] for m in file_info.values()) print(f"Total rows: {total_rows}")

---

Core Method 2: Streaming Read → Parquet (100k+ Rows)

核心方法2:流式读取→Parquet(10万行以上)

For 100k+ row files, never use
pd.read_excel()
. Use openpyxl streaming → Parquet:
python
import openpyxl
import pyarrow as pa
import pyarrow.parquet as pq

def stream_excel_to_parquet(excel_path, parquet_path, sheet_name=None, chunk_size=50000):
    """Stream Excel rows to Parquet with constant memory usage.

    All columns are cast to string to avoid cross-chunk schema mismatches
    (Excel mixed-type columns may be all-None in some chunks, causing PyArrow
    to infer null type instead of string). Convert numeric columns after loading
    Parquet with pd.to_numeric() as needed.
    """
    wb = openpyxl.load_workbook(excel_path, read_only=True, data_only=True)
    ws = wb[sheet_name] if sheet_name else wb.active

    header = None
    writer = None
    chunk_rows = []
    total_written = 0

    def _flush(rows):
        nonlocal writer
        table = pa.table({
            col: pa.array(
                [str(r[idx]) if r[idx] is not None else None for r in rows],
                type=pa.string(),
            )
            for idx, col in enumerate(header)
        })
        if writer is None:
            writer = pq.ParquetWriter(parquet_path, table.schema)
        writer.write_table(table)

    for i, row in enumerate(ws.iter_rows(values_only=True)):
        if i == 0:
            header = [str(c) if c is not None else f"Col_{j}" for j, c in enumerate(row)]
            continue

        chunk_rows.append(list(row))

        if len(chunk_rows) >= chunk_size:
            _flush(chunk_rows)
            total_written += len(chunk_rows)
            print(f"  Written {total_written:,} rows...")
            chunk_rows = []
            gc.collect()

    if chunk_rows:
        _flush(chunk_rows)
        total_written += len(chunk_rows)

    if writer:
        writer.close()
    wb.close()
    print(f"Done: {total_written:,} rows -> {parquet_path}")
    return total_written

对于10万行以上的文件,绝对不要使用
pd.read_excel()
。使用openpyxl流式读取→Parquet:
python
import openpyxl
import pyarrow as pa
import pyarrow.parquet as pq

def stream_excel_to_parquet(excel_path, parquet_path, sheet_name=None, chunk_size=50000):
    """Stream Excel rows to Parquet with constant memory usage.

    All columns are cast to string to avoid cross-chunk schema mismatches
    (Excel mixed-type columns may be all-None in some chunks, causing PyArrow
    to infer null type instead of string). Convert numeric columns after loading
    Parquet with pd.to_numeric() as needed.
    """
    wb = openpyxl.load_workbook(excel_path, read_only=True, data_only=True)
    ws = wb[sheet_name] if sheet_name else wb.active

    header = None
    writer = None
    chunk_rows = []
    total_written = 0

    def _flush(rows):
        nonlocal writer
        table = pa.table({
            col: pa.array(
                [str(r[idx]) if r[idx] is not None else None for r in rows],
                type=pa.string(),
            )
            for idx, col in enumerate(header)
        })
        if writer is None:
            writer = pq.ParquetWriter(parquet_path, table.schema)
        writer.write_table(table)

    for i, row in enumerate(ws.iter_rows(values_only=True)):
        if i == 0:
            header = [str(c) if c is not None else f"Col_{j}" for j, c in enumerate(row)]
            continue

        chunk_rows.append(list(row))

        if len(chunk_rows) >= chunk_size:
            _flush(chunk_rows)
            total_written += len(chunk_rows)
            print(f"  Written {total_written:,} rows...")
            chunk_rows = []
            gc.collect()

    if chunk_rows:
        _flush(chunk_rows)
        total_written += len(chunk_rows)

    if writer:
        writer.close()
    wb.close()
    print(f"Done: {total_written:,} rows -> {parquet_path}")
    return total_written

Core Method 3: Medium File Parquet Conversion (10k–100k Rows)

核心方法3:中等文件Parquet转换(1万–10万行)

For 10k–100k rows,
pd.read_excel()
won't OOM, but Parquet is much faster for repeated analysis:
python
def convert_excel_to_parquet(excel_path, parquet_path, sheet_name=0):
    """Medium file: pd.read_excel -> Parquet cache."""
    if os.path.exists(parquet_path):
        print(f"Cache exists: {parquet_path}")
        return
    df = pd.read_excel(excel_path, sheet_name=sheet_name)
    df.columns = df.columns.astype(str)
    df.to_parquet(parquet_path, engine='pyarrow', compression='snappy')
    row_count = len(df)
    del df
    gc.collect()
    print(f"Converted {row_count:,} rows -> {parquet_path}")

对于1万–10万行的文件,
pd.read_excel()
不会导致内存溢出,但Parquet在重复分析时速度快得多:
python
def convert_excel_to_parquet(excel_path, parquet_path, sheet_name=0):
    """Medium file: pd.read_excel -> Parquet cache."""
    if os.path.exists(parquet_path):
        print(f"Cache exists: {parquet_path}")
        return
    df = pd.read_excel(excel_path, sheet_name=sheet_name)
    df.columns = df.columns.astype(str)
    df.to_parquet(parquet_path, engine='pyarrow', compression='snappy')
    row_count = len(df)
    del df
    gc.collect()
    print(f"Converted {row_count:,} rows -> {parquet_path}")

Core Method 4: Memory Optimization (Type Downcasting)

核心方法4:内存优化(类型向下转换)

After loading Parquet, further reduce memory footprint:
python
def optimize_dtypes(df):
    """Auto-downcast numeric types + convert low-cardinality strings to Category.
    Typically saves 50-80% memory."""
    start_mb = df.memory_usage(deep=True).sum() / 1024**2

    for col in df.select_dtypes(include=['int64', 'int32']).columns:
        c_min, c_max = df[col].min(), df[col].max()
        if c_min >= np.iinfo(np.int8).min and c_max <= np.iinfo(np.int8).max:
            df[col] = df[col].astype(np.int8)
        elif c_min >= np.iinfo(np.int16).min and c_max <= np.iinfo(np.int16).max:
            df[col] = df[col].astype(np.int16)
        elif c_min >= np.iinfo(np.int32).min and c_max <= np.iinfo(np.int32).max:
            df[col] = df[col].astype(np.int32)

    for col in df.select_dtypes(include=['float64']).columns:
        df[col] = df[col].astype(np.float32)

    for col in df.select_dtypes(include=['object', 'string']).columns:
        if df[col].nunique() / max(len(df), 1) < 0.5:
            df[col] = df[col].astype('category')

    end_mb = df.memory_usage(deep=True).sum() / 1024**2
    print(f"Memory: {start_mb:.1f} MB -> {end_mb:.1f} MB (saved {(1 - end_mb/start_mb)*100:.0f}%)")
    return df

加载Parquet后,进一步减少内存占用:
python
def optimize_dtypes(df):
    """Auto-downcast numeric types + convert low-cardinality strings to Category.
    Typically saves 50-80% memory."""
    start_mb = df.memory_usage(deep=True).sum() / 1024**2

    for col in df.select_dtypes(include=['int64', 'int32']).columns:
        c_min, c_max = df[col].min(), df[col].max()
        if c_min >= np.iinfo(np.int8).min and c_max <= np.iinfo(np.int8).max:
            df[col] = df[col].astype(np.int8)
        elif c_min >= np.iinfo(np.int16).min and c_max <= np.iinfo(np.int16).max:
            df[col] = df[col].astype(np.int16)
        elif c_min >= np.iinfo(np.int32).min and c_max <= np.iinfo(np.int32).max:
            df[col] = df[col].astype(np.int32)

    for col in df.select_dtypes(include=['float64']).columns:
        df[col] = df[col].astype(np.float32)

    for col in df.select_dtypes(include=['object', 'string']).columns:
        if df[col].nunique() / max(len(df), 1) < 0.5:
            df[col] = df[col].astype('category')

    end_mb = df.memory_usage(deep=True).sum() / 1024**2
    print(f"Memory: {start_mb:.1f} MB -> {end_mb:.1f} MB (saved {(1 - end_mb/start_mb)*100:.0f}%)")
    return df

Core Method 5: Large File Writing

核心方法5:大文件写入

python
def write_large_excel(df, output_path, sheet_name="Sheet1"):
    """Auto-select write strategy based on data size."""
    total_cells = len(df) * len(df.columns)

    if len(df) > 1_000_000:
        csv_path = output_path.rsplit('.', 1)[0] + '.csv'
        df.to_csv(csv_path, index=False)
        print(f"Over 1M rows — exported as CSV: {csv_path}")
        return csv_path

    if total_cells > 50_000:
        from openpyxl import Workbook
        from openpyxl.cell import WriteOnlyCell

        wb = Workbook(write_only=True)
        ws = wb.create_sheet(title=sheet_name)
        ws.append(list(df.columns))
        for idx, row in enumerate(df.itertuples(index=False)):
            ws.append([None if pd.isna(v) else v for v in row])
            if (idx + 1) % 100_000 == 0:
                print(f"  Written {idx + 1:,} rows...")
        wb.save(output_path)
        wb.close()
        print(f"write_only mode: {len(df):,} rows -> {output_path}")
    else:
        df.to_excel(output_path, index=False, sheet_name=sheet_name)
        print(f"Standard write: {len(df):,} rows -> {output_path}")
    return output_path

python
def write_large_excel(df, output_path, sheet_name="Sheet1"):
    """Auto-select write strategy based on data size."""
    total_cells = len(df) * len(df.columns)

    if len(df) > 1_000_000:
        csv_path = output_path.rsplit('.', 1)[0] + '.csv'
        df.to_csv(csv_path, index=False)
        print(f"Over 1M rows — exported as CSV: {csv_path}")
        return csv_path

    if total_cells > 50_000:
        from openpyxl import Workbook
        from openpyxl.cell import WriteOnlyCell

        wb = Workbook(write_only=True)
        ws = wb.create_sheet(title=sheet_name)
        ws.append(list(df.columns))
        for idx, row in enumerate(df.itertuples(index=False)):
            ws.append([None if pd.isna(v) else v for v in row])
            if (idx + 1) % 100_000 == 0:
                print(f"  Written {idx + 1:,} rows...")
        wb.save(output_path)
        wb.close()
        print(f"write_only mode: {len(df):,} rows -> {output_path}")
    else:
        df.to_excel(output_path, index=False, sheet_name=sheet_name)
        print(f"Standard write: {len(df):,} rows -> {output_path}")
    return output_path

Example 1: 100k-Row Table — Column Distribution + Chart

示例1:10万行表格——列分布+图表

Scenario: User has a 100k-row sales Excel file and wants regional sales distribution with a bar chart.
python
import pandas as pd
import os, gc

excel_path = "sales_100k.xlsx"
parquet_path = "sales_100k.parquet"
场景:用户有一个10万行的销售Excel文件,想要生成区域销售分布柱状图。
python
import pandas as pd
import os, gc

excel_path = "sales_100k.xlsx"
parquet_path = "sales_100k.parquet"

=== Step 1: Inspect structure ===

=== Step 1: Inspect structure ===

file_info = inspect_excel(excel_path) total_rows = sum(m['rows'] for m in file_info.values()) print(f"Total rows: {total_rows}")
file_info = inspect_excel(excel_path) total_rows = sum(m['rows'] for m in file_info.values()) print(f"Total rows: {total_rows}")

=== Step 2: Choose read strategy by row count ===

=== Step 2: Choose read strategy by row count ===

if total_rows >= 100_000: stream_excel_to_parquet(excel_path, parquet_path) else: convert_excel_to_parquet(excel_path, parquet_path)
if total_rows >= 100_000: stream_excel_to_parquet(excel_path, parquet_path) else: convert_excel_to_parquet(excel_path, parquet_path)

=== Step 3: Load Parquet + optimize memory ===

=== Step 3: Load Parquet + optimize memory ===

df = pd.read_parquet(parquet_path) df = optimize_dtypes(df) print(f"Shape: {df.shape}") print(df.head(3))
df = pd.read_parquet(parquet_path) df = optimize_dtypes(df) print(f"Shape: {df.shape}") print(df.head(3))

=== Step 4: Analysis ===

=== Step 4: Analysis ===

region_sales = df.groupby('Region')['Sales'].sum().sort_values(ascending=False) print(region_sales)
region_sales = df.groupby('Region')['Sales'].sum().sort_values(ascending=False) print(region_sales)

=== Step 5: Visualization ===

=== Step 5: Visualization ===

fig, ax = plt.subplots(figsize=(10, 6)) region_sales.plot(kind='bar', ax=ax, color='#4C72B0') ax.set_title('Sales by Region') ax.set_ylabel('Sales') plt.tight_layout() plt.savefig('region_sales.png', dpi=150, bbox_inches='tight') plt.show()
fig, ax = plt.subplots(figsize=(10, 6)) region_sales.plot(kind='bar', ax=ax, color='#4C72B0') ax.set_title('Sales by Region') ax.set_ylabel('Sales') plt.tight_layout() plt.savefig('region_sales.png', dpi=150, bbox_inches='tight') plt.show()

=== Step 6: Cleanup ===

=== Step 6: Cleanup ===

del df gc.collect()

---
del df gc.collect()

---

Example 2: 1M-Row Table — Streaming Read + Filter + Export

示例2:100万行表格——流式读取+过滤+导出

Scenario: User has a 1M-row transaction log and wants records with amount > 10,000 exported.
python
import pandas as pd
import os, gc

excel_path = "transactions_1m.xlsx"
parquet_path = "transactions_1m.parquet"
场景:用户有一个100万行的交易日志,想要导出金额>10000的记录。
python
import pandas as pd
import os, gc

excel_path = "transactions_1m.xlsx"
parquet_path = "transactions_1m.parquet"

=== Step 1: Stream to Parquet (1M rows — MUST use streaming, never pd.read_excel) ===

=== Step 1: Stream to Parquet (1M rows — MUST use streaming, never pd.read_excel) ===

stream_excel_to_parquet(excel_path, parquet_path, chunk_size=50000)
stream_excel_to_parquet(excel_path, parquet_path, chunk_size=50000)

=== Step 2: Load only needed columns (saves memory) ===

=== Step 2: Load only needed columns (saves memory) ===

df = pd.read_parquet(parquet_path, columns=['TransactionID', 'Amount', 'Date', 'Type']) df = optimize_dtypes(df) print(f"Shape: {df.shape}, Memory: {df.memory_usage(deep=True).sum()/1024**2:.1f} MB")
df = pd.read_parquet(parquet_path, columns=['TransactionID', 'Amount', 'Date', 'Type']) df = optimize_dtypes(df) print(f"Shape: {df.shape}, Memory: {df.memory_usage(deep=True).sum()/1024**2:.1f} MB")

=== Step 3: Vectorized filtering (never use apply/iterrows) ===

=== Step 3: Vectorized filtering (never use apply/iterrows) ===

mask = df['Amount'] > 10000 high_value = df[mask].copy() print(f"Filtered: {len(high_value):,} / {len(df):,} rows")
mask = df['Amount'] > 10000 high_value = df[mask].copy() print(f"Filtered: {len(high_value):,} / {len(df):,} rows")

=== Step 4: Export ===

=== Step 4: Export ===

output_path = write_large_excel(high_value, 'high_value_transactions.xlsx')
output_path = write_large_excel(high_value, 'high_value_transactions.xlsx')

=== Step 5: Cleanup ===

=== Step 5: Cleanup ===

del df, high_value gc.collect()

---
del df, high_value gc.collect()

---

Vectorized Operations Cheat Sheet

向量化操作速查表

On large files, never use slow operations — use vectorized alternatives:
Slow (Prohibited)Fast (Use This)
df.apply(lambda x: x*2)
df['col'] * 2
df.iterrows()
df.itertuples(index=False)
for i in range(len(df)): df.iloc[i]
Vectorized boolean indexing
df[mask]
df['a'].map(lambda x: 'Y' if x>0 else 'N')
np.where(df['a']>0, 'Y', 'N')
df.groupby('a').apply(custom_func)
df.groupby('a').agg({'b':'sum','c':'mean'})

在处理大文件时,绝对不要使用慢速操作——改用向量化替代方案:
慢速(禁止)快速(推荐)
df.apply(lambda x: x*2)
df['col'] * 2
df.iterrows()
df.itertuples(index=False)
for i in range(len(df)): df.iloc[i]
向量化布尔索引
df[mask]
df['a'].map(lambda x: 'Y' if x>0 else 'N')
np.where(df['a']>0, 'Y', 'N')
df.groupby('a').apply(custom_func)
df.groupby('a').agg({'b':'sum','c':'mean'})

Memory Estimation

内存估算

Estimate memory before loading to avoid OOM:
Estimated MB ≈ rows × cols × 8 / 1024² (numeric columns)
Estimated MB ≈ rows × cols × 50 / 1024² (with text columns)
Rows20 cols (numeric)20 cols (with text)
100k~15 MB~95 MB
500k~76 MB~477 MB
1M~153 MB~953 MB
When estimated memory exceeds 80% of available RAM, use column-selective loading (
pd.read_parquet(columns=[...])
) or chunked processing.

加载前估算内存以避免OOM:
Estimated MB ≈ rows × cols × 8 / 1024² (numeric columns)
Estimated MB ≈ rows × cols × 50 / 1024² (with text columns)
行数20列(数值型)20列(含文本)
10万~15 MB~95 MB
50万~76 MB~477 MB
100万~153 MB~953 MB
当估算内存超过可用RAM的80%时,使用选择性列加载
pd.read_parquet(columns=[...])
)或分块处理。

Best Practices

最佳实践

  1. Parquet is king: For any file >= 10k rows, convert to Parquet before analysis. Parquet supports columnar reads, compressed storage, and loads 10-50x faster than xlsx.
  2. Streaming is the safety net: For 100k+ rows, always use openpyxl
    read_only
    +
    iter_rows
    . Never
    pd.read_excel()
    for full load.
  3. Release memory promptly:
    del df; gc.collect()
    after every intermediate DataFrame.
  4. Excel row limit: Max 1,048,576 rows per sheet. Auto-split to multiple sheets or export as CSV when exceeded.
  5. Use write_only for output: Files with >50k cells must use
    openpyxl Workbook(write_only=True)
    .
  1. Parquet是首选:对于任何≥1万行的文件,分析前先转换为Parquet。Parquet支持列读取、压缩存储,加载速度比xlsx快10-50倍。
  2. 流式读取是安全保障:对于10万行以上的文件,始终使用openpyxl
    read_only
    +
    iter_rows
    。绝对不要用
    pd.read_excel()
    完整加载。
  3. 及时释放内存:每个中间DataFrame处理完后执行
    del df; gc.collect()
  4. Excel行数限制:单工作表最大支持1,048,576行。超过时自动拆分为多个工作表或导出为CSV。
  5. 输出使用write_only模式:单元格数>5万的文件必须使用
    openpyxl Workbook(write_only=True)