sn-da-large-file-analysis
Compare original and translation side by side
🇺🇸
Original
English🇨🇳
Translation
ChineseLarge Scale Excel Analysis Skill
大规模Excel分析技能
Mandatory Rules
强制规则
When total rows >= 10,000, you MUST use the methods in this skill.
| Data Scale | Read Strategy | Reason |
|---|---|---|
| < 10k rows | | No memory pressure |
| 10k–100k rows | | Avoid repeated slow reads |
| 100k–1M rows | openpyxl | |
| > 1M rows | Streaming read + multi-sheet split (Excel max 1,048,576 rows per sheet) | Must chunk |
Prohibited:
- Do NOT use to fully load 100k+ row files
pd.read_excel() - Do NOT search for fonts with ,
fc-list, or install packages withfind ... fontspip install - Do NOT use on large DataFrames (use
df.iterrows()or vectorized ops)itertuples() - Do NOT use for operations that can be vectorized
df.apply(lambda...)
当总行数≥10000时,必须使用本技能中的方法。
| 数据规模 | 读取策略 | 原因 |
|---|---|---|
| <10000行 | 直接使用 | 无内存压力 |
| 10000–100000行 | | 避免重复慢读取 |
| 100000–1000000行 | openpyxl | |
| >1000000行 | 流式读取 + 多工作表拆分(Excel单工作表最大支持1,048,576行) | 必须分块处理 |
禁止操作:
- 请勿使用完整加载10万行以上的文件
pd.read_excel() - 请勿使用、
fc-list搜索字体,或使用find ... fonts安装包pip install - 请勿在大型DataFrame上使用(使用
df.iterrows()或向量化操作)itertuples() - 对于可向量化的操作,请勿使用
df.apply(lambda...)
Environment Setup
环境配置
python
import pandas as pd
import numpy as np
import os
import gc
pd.options.mode.copy_on_write = Truepython
import pandas as pd
import numpy as np
import os
import gc
pd.options.mode.copy_on_write = TrueCJK font setup (fixed paths — do NOT search for fonts)
CJK font setup (fixed paths — do NOT search for fonts)
⚠️ Copy this block as-is. Do NOT use fc-list, find, subprocess, or glob to locate fonts.
⚠️ Copy this block as-is. Do NOT use fc-list, find, subprocess, or glob to locate fonts.
import matplotlib
import matplotlib.pyplot as plt
import matplotlib.font_manager as fm
_FONT_PATHS = [
'/mnt/afs_agents/SimHei.ttf',
'/mnt/afs_agents/mnt/data/SimHei.ttf',
os.path.expanduser('~/.fonts/SimHei.ttf'),
'/usr/share/fonts/truetype/wqy/wqy-zenhei.ttc',
'/usr/share/fonts/SimHei.ttf',
]
for _p in _FONT_PATHS:
if os.path.exists(_p):
fm.fontManager.addfont(_p)
matplotlib.rcParams['font.family'] = fm.FontProperties(fname=_p).get_name()
break
matplotlib.rcParams['axes.unicode_minus'] = False
---import matplotlib
import matplotlib.pyplot as plt
import matplotlib.font_manager as fm
_FONT_PATHS = [
'/mnt/afs_agents/SimHei.ttf',
'/mnt/afs_agents/mnt/data/SimHei.ttf',
os.path.expanduser('~/.fonts/SimHei.ttf'),
'/usr/share/fonts/truetype/wqy/wqy-zenhei.ttc',
'/usr/share/fonts/SimHei.ttf',
]
for _p in _FONT_PATHS:
if os.path.exists(_p):
fm.fontManager.addfont(_p)
matplotlib.rcParams['font.family'] = fm.FontProperties(fname=_p).get_name()
break
matplotlib.rcParams['axes.unicode_minus'] = False
---Core Method 1: Inspect File Structure (Without Loading Data)
核心方法1:检查文件结构(无需加载数据)
Before any operation on a large file, inspect sheets and row counts without loading data into memory:
python
import openpyxl
def inspect_excel(file_path):
"""Stream-inspect Excel structure. Returns {sheet_name: {rows, columns}}."""
wb = openpyxl.load_workbook(file_path, read_only=True, data_only=True)
info = {}
for name in wb.sheetnames:
ws = wb[name]
row_count = 0
header = None
for i, row in enumerate(ws.iter_rows(values_only=True)):
if i == 0:
header = [str(c) if c is not None else f"Col_{j}" for j, c in enumerate(row)]
else:
row_count += 1
info[name] = {"rows": row_count, "columns": header}
wb.close()
return info在对大文件执行任何操作前,无需将数据加载到内存即可检查工作表和行数:
python
import openpyxl
def inspect_excel(file_path):
"""Stream-inspect Excel structure. Returns {sheet_name: {rows, columns}}."""
wb = openpyxl.load_workbook(file_path, read_only=True, data_only=True)
info = {}
for name in wb.sheetnames:
ws = wb[name]
row_count = 0
header = None
for i, row in enumerate(ws.iter_rows(values_only=True)):
if i == 0:
header = [str(c) if c is not None else f"Col_{j}" for j, c in enumerate(row)]
else:
row_count += 1
info[name] = {"rows": row_count, "columns": header}
wb.close()
return infoUsage
Usage
file_info = inspect_excel(file_path)
for sheet, meta in file_info.items():
print(f"Sheet '{sheet}': {meta['rows']} rows, {len(meta['columns'])} cols")
print(f" Columns: {meta['columns'][:10]}...")
total_rows = sum(m['rows'] for m in file_info.values())
print(f"Total rows: {total_rows}")
---file_info = inspect_excel(file_path)
for sheet, meta in file_info.items():
print(f"Sheet '{sheet}': {meta['rows']} rows, {len(meta['columns'])} cols")
print(f" Columns: {meta['columns'][:10]}...")
total_rows = sum(m['rows'] for m in file_info.values())
print(f"Total rows: {total_rows}")
---Core Method 2: Streaming Read → Parquet (100k+ Rows)
核心方法2:流式读取→Parquet(10万行以上)
For 100k+ row files, never use . Use openpyxl streaming → Parquet:
pd.read_excel()python
import openpyxl
import pyarrow as pa
import pyarrow.parquet as pq
def stream_excel_to_parquet(excel_path, parquet_path, sheet_name=None, chunk_size=50000):
"""Stream Excel rows to Parquet with constant memory usage.
All columns are cast to string to avoid cross-chunk schema mismatches
(Excel mixed-type columns may be all-None in some chunks, causing PyArrow
to infer null type instead of string). Convert numeric columns after loading
Parquet with pd.to_numeric() as needed.
"""
wb = openpyxl.load_workbook(excel_path, read_only=True, data_only=True)
ws = wb[sheet_name] if sheet_name else wb.active
header = None
writer = None
chunk_rows = []
total_written = 0
def _flush(rows):
nonlocal writer
table = pa.table({
col: pa.array(
[str(r[idx]) if r[idx] is not None else None for r in rows],
type=pa.string(),
)
for idx, col in enumerate(header)
})
if writer is None:
writer = pq.ParquetWriter(parquet_path, table.schema)
writer.write_table(table)
for i, row in enumerate(ws.iter_rows(values_only=True)):
if i == 0:
header = [str(c) if c is not None else f"Col_{j}" for j, c in enumerate(row)]
continue
chunk_rows.append(list(row))
if len(chunk_rows) >= chunk_size:
_flush(chunk_rows)
total_written += len(chunk_rows)
print(f" Written {total_written:,} rows...")
chunk_rows = []
gc.collect()
if chunk_rows:
_flush(chunk_rows)
total_written += len(chunk_rows)
if writer:
writer.close()
wb.close()
print(f"Done: {total_written:,} rows -> {parquet_path}")
return total_written对于10万行以上的文件,绝对不要使用。使用openpyxl流式读取→Parquet:
pd.read_excel()python
import openpyxl
import pyarrow as pa
import pyarrow.parquet as pq
def stream_excel_to_parquet(excel_path, parquet_path, sheet_name=None, chunk_size=50000):
"""Stream Excel rows to Parquet with constant memory usage.
All columns are cast to string to avoid cross-chunk schema mismatches
(Excel mixed-type columns may be all-None in some chunks, causing PyArrow
to infer null type instead of string). Convert numeric columns after loading
Parquet with pd.to_numeric() as needed.
"""
wb = openpyxl.load_workbook(excel_path, read_only=True, data_only=True)
ws = wb[sheet_name] if sheet_name else wb.active
header = None
writer = None
chunk_rows = []
total_written = 0
def _flush(rows):
nonlocal writer
table = pa.table({
col: pa.array(
[str(r[idx]) if r[idx] is not None else None for r in rows],
type=pa.string(),
)
for idx, col in enumerate(header)
})
if writer is None:
writer = pq.ParquetWriter(parquet_path, table.schema)
writer.write_table(table)
for i, row in enumerate(ws.iter_rows(values_only=True)):
if i == 0:
header = [str(c) if c is not None else f"Col_{j}" for j, c in enumerate(row)]
continue
chunk_rows.append(list(row))
if len(chunk_rows) >= chunk_size:
_flush(chunk_rows)
total_written += len(chunk_rows)
print(f" Written {total_written:,} rows...")
chunk_rows = []
gc.collect()
if chunk_rows:
_flush(chunk_rows)
total_written += len(chunk_rows)
if writer:
writer.close()
wb.close()
print(f"Done: {total_written:,} rows -> {parquet_path}")
return total_writtenCore Method 3: Medium File Parquet Conversion (10k–100k Rows)
核心方法3:中等文件Parquet转换(1万–10万行)
For 10k–100k rows, won't OOM, but Parquet is much faster for repeated analysis:
pd.read_excel()python
def convert_excel_to_parquet(excel_path, parquet_path, sheet_name=0):
"""Medium file: pd.read_excel -> Parquet cache."""
if os.path.exists(parquet_path):
print(f"Cache exists: {parquet_path}")
return
df = pd.read_excel(excel_path, sheet_name=sheet_name)
df.columns = df.columns.astype(str)
df.to_parquet(parquet_path, engine='pyarrow', compression='snappy')
row_count = len(df)
del df
gc.collect()
print(f"Converted {row_count:,} rows -> {parquet_path}")对于1万–10万行的文件,不会导致内存溢出,但Parquet在重复分析时速度快得多:
pd.read_excel()python
def convert_excel_to_parquet(excel_path, parquet_path, sheet_name=0):
"""Medium file: pd.read_excel -> Parquet cache."""
if os.path.exists(parquet_path):
print(f"Cache exists: {parquet_path}")
return
df = pd.read_excel(excel_path, sheet_name=sheet_name)
df.columns = df.columns.astype(str)
df.to_parquet(parquet_path, engine='pyarrow', compression='snappy')
row_count = len(df)
del df
gc.collect()
print(f"Converted {row_count:,} rows -> {parquet_path}")Core Method 4: Memory Optimization (Type Downcasting)
核心方法4:内存优化(类型向下转换)
After loading Parquet, further reduce memory footprint:
python
def optimize_dtypes(df):
"""Auto-downcast numeric types + convert low-cardinality strings to Category.
Typically saves 50-80% memory."""
start_mb = df.memory_usage(deep=True).sum() / 1024**2
for col in df.select_dtypes(include=['int64', 'int32']).columns:
c_min, c_max = df[col].min(), df[col].max()
if c_min >= np.iinfo(np.int8).min and c_max <= np.iinfo(np.int8).max:
df[col] = df[col].astype(np.int8)
elif c_min >= np.iinfo(np.int16).min and c_max <= np.iinfo(np.int16).max:
df[col] = df[col].astype(np.int16)
elif c_min >= np.iinfo(np.int32).min and c_max <= np.iinfo(np.int32).max:
df[col] = df[col].astype(np.int32)
for col in df.select_dtypes(include=['float64']).columns:
df[col] = df[col].astype(np.float32)
for col in df.select_dtypes(include=['object', 'string']).columns:
if df[col].nunique() / max(len(df), 1) < 0.5:
df[col] = df[col].astype('category')
end_mb = df.memory_usage(deep=True).sum() / 1024**2
print(f"Memory: {start_mb:.1f} MB -> {end_mb:.1f} MB (saved {(1 - end_mb/start_mb)*100:.0f}%)")
return df加载Parquet后,进一步减少内存占用:
python
def optimize_dtypes(df):
"""Auto-downcast numeric types + convert low-cardinality strings to Category.
Typically saves 50-80% memory."""
start_mb = df.memory_usage(deep=True).sum() / 1024**2
for col in df.select_dtypes(include=['int64', 'int32']).columns:
c_min, c_max = df[col].min(), df[col].max()
if c_min >= np.iinfo(np.int8).min and c_max <= np.iinfo(np.int8).max:
df[col] = df[col].astype(np.int8)
elif c_min >= np.iinfo(np.int16).min and c_max <= np.iinfo(np.int16).max:
df[col] = df[col].astype(np.int16)
elif c_min >= np.iinfo(np.int32).min and c_max <= np.iinfo(np.int32).max:
df[col] = df[col].astype(np.int32)
for col in df.select_dtypes(include=['float64']).columns:
df[col] = df[col].astype(np.float32)
for col in df.select_dtypes(include=['object', 'string']).columns:
if df[col].nunique() / max(len(df), 1) < 0.5:
df[col] = df[col].astype('category')
end_mb = df.memory_usage(deep=True).sum() / 1024**2
print(f"Memory: {start_mb:.1f} MB -> {end_mb:.1f} MB (saved {(1 - end_mb/start_mb)*100:.0f}%)")
return dfCore Method 5: Large File Writing
核心方法5:大文件写入
python
def write_large_excel(df, output_path, sheet_name="Sheet1"):
"""Auto-select write strategy based on data size."""
total_cells = len(df) * len(df.columns)
if len(df) > 1_000_000:
csv_path = output_path.rsplit('.', 1)[0] + '.csv'
df.to_csv(csv_path, index=False)
print(f"Over 1M rows — exported as CSV: {csv_path}")
return csv_path
if total_cells > 50_000:
from openpyxl import Workbook
from openpyxl.cell import WriteOnlyCell
wb = Workbook(write_only=True)
ws = wb.create_sheet(title=sheet_name)
ws.append(list(df.columns))
for idx, row in enumerate(df.itertuples(index=False)):
ws.append([None if pd.isna(v) else v for v in row])
if (idx + 1) % 100_000 == 0:
print(f" Written {idx + 1:,} rows...")
wb.save(output_path)
wb.close()
print(f"write_only mode: {len(df):,} rows -> {output_path}")
else:
df.to_excel(output_path, index=False, sheet_name=sheet_name)
print(f"Standard write: {len(df):,} rows -> {output_path}")
return output_pathpython
def write_large_excel(df, output_path, sheet_name="Sheet1"):
"""Auto-select write strategy based on data size."""
total_cells = len(df) * len(df.columns)
if len(df) > 1_000_000:
csv_path = output_path.rsplit('.', 1)[0] + '.csv'
df.to_csv(csv_path, index=False)
print(f"Over 1M rows — exported as CSV: {csv_path}")
return csv_path
if total_cells > 50_000:
from openpyxl import Workbook
from openpyxl.cell import WriteOnlyCell
wb = Workbook(write_only=True)
ws = wb.create_sheet(title=sheet_name)
ws.append(list(df.columns))
for idx, row in enumerate(df.itertuples(index=False)):
ws.append([None if pd.isna(v) else v for v in row])
if (idx + 1) % 100_000 == 0:
print(f" Written {idx + 1:,} rows...")
wb.save(output_path)
wb.close()
print(f"write_only mode: {len(df):,} rows -> {output_path}")
else:
df.to_excel(output_path, index=False, sheet_name=sheet_name)
print(f"Standard write: {len(df):,} rows -> {output_path}")
return output_pathExample 1: 100k-Row Table — Column Distribution + Chart
示例1:10万行表格——列分布+图表
Scenario: User has a 100k-row sales Excel file and wants regional sales distribution with a bar chart.
python
import pandas as pd
import os, gc
excel_path = "sales_100k.xlsx"
parquet_path = "sales_100k.parquet"场景:用户有一个10万行的销售Excel文件,想要生成区域销售分布柱状图。
python
import pandas as pd
import os, gc
excel_path = "sales_100k.xlsx"
parquet_path = "sales_100k.parquet"=== Step 1: Inspect structure ===
=== Step 1: Inspect structure ===
file_info = inspect_excel(excel_path)
total_rows = sum(m['rows'] for m in file_info.values())
print(f"Total rows: {total_rows}")
file_info = inspect_excel(excel_path)
total_rows = sum(m['rows'] for m in file_info.values())
print(f"Total rows: {total_rows}")
=== Step 2: Choose read strategy by row count ===
=== Step 2: Choose read strategy by row count ===
if total_rows >= 100_000:
stream_excel_to_parquet(excel_path, parquet_path)
else:
convert_excel_to_parquet(excel_path, parquet_path)
if total_rows >= 100_000:
stream_excel_to_parquet(excel_path, parquet_path)
else:
convert_excel_to_parquet(excel_path, parquet_path)
=== Step 3: Load Parquet + optimize memory ===
=== Step 3: Load Parquet + optimize memory ===
df = pd.read_parquet(parquet_path)
df = optimize_dtypes(df)
print(f"Shape: {df.shape}")
print(df.head(3))
df = pd.read_parquet(parquet_path)
df = optimize_dtypes(df)
print(f"Shape: {df.shape}")
print(df.head(3))
=== Step 4: Analysis ===
=== Step 4: Analysis ===
region_sales = df.groupby('Region')['Sales'].sum().sort_values(ascending=False)
print(region_sales)
region_sales = df.groupby('Region')['Sales'].sum().sort_values(ascending=False)
print(region_sales)
=== Step 5: Visualization ===
=== Step 5: Visualization ===
fig, ax = plt.subplots(figsize=(10, 6))
region_sales.plot(kind='bar', ax=ax, color='#4C72B0')
ax.set_title('Sales by Region')
ax.set_ylabel('Sales')
plt.tight_layout()
plt.savefig('region_sales.png', dpi=150, bbox_inches='tight')
plt.show()
fig, ax = plt.subplots(figsize=(10, 6))
region_sales.plot(kind='bar', ax=ax, color='#4C72B0')
ax.set_title('Sales by Region')
ax.set_ylabel('Sales')
plt.tight_layout()
plt.savefig('region_sales.png', dpi=150, bbox_inches='tight')
plt.show()
=== Step 6: Cleanup ===
=== Step 6: Cleanup ===
del df
gc.collect()
---del df
gc.collect()
---Example 2: 1M-Row Table — Streaming Read + Filter + Export
示例2:100万行表格——流式读取+过滤+导出
Scenario: User has a 1M-row transaction log and wants records with amount > 10,000 exported.
python
import pandas as pd
import os, gc
excel_path = "transactions_1m.xlsx"
parquet_path = "transactions_1m.parquet"场景:用户有一个100万行的交易日志,想要导出金额>10000的记录。
python
import pandas as pd
import os, gc
excel_path = "transactions_1m.xlsx"
parquet_path = "transactions_1m.parquet"=== Step 1: Stream to Parquet (1M rows — MUST use streaming, never pd.read_excel) ===
=== Step 1: Stream to Parquet (1M rows — MUST use streaming, never pd.read_excel) ===
stream_excel_to_parquet(excel_path, parquet_path, chunk_size=50000)
stream_excel_to_parquet(excel_path, parquet_path, chunk_size=50000)
=== Step 2: Load only needed columns (saves memory) ===
=== Step 2: Load only needed columns (saves memory) ===
df = pd.read_parquet(parquet_path, columns=['TransactionID', 'Amount', 'Date', 'Type'])
df = optimize_dtypes(df)
print(f"Shape: {df.shape}, Memory: {df.memory_usage(deep=True).sum()/1024**2:.1f} MB")
df = pd.read_parquet(parquet_path, columns=['TransactionID', 'Amount', 'Date', 'Type'])
df = optimize_dtypes(df)
print(f"Shape: {df.shape}, Memory: {df.memory_usage(deep=True).sum()/1024**2:.1f} MB")
=== Step 3: Vectorized filtering (never use apply/iterrows) ===
=== Step 3: Vectorized filtering (never use apply/iterrows) ===
mask = df['Amount'] > 10000
high_value = df[mask].copy()
print(f"Filtered: {len(high_value):,} / {len(df):,} rows")
mask = df['Amount'] > 10000
high_value = df[mask].copy()
print(f"Filtered: {len(high_value):,} / {len(df):,} rows")
=== Step 4: Export ===
=== Step 4: Export ===
output_path = write_large_excel(high_value, 'high_value_transactions.xlsx')
output_path = write_large_excel(high_value, 'high_value_transactions.xlsx')
=== Step 5: Cleanup ===
=== Step 5: Cleanup ===
del df, high_value
gc.collect()
---del df, high_value
gc.collect()
---Vectorized Operations Cheat Sheet
向量化操作速查表
On large files, never use slow operations — use vectorized alternatives:
| Slow (Prohibited) | Fast (Use This) |
|---|---|
| |
| |
| Vectorized boolean indexing |
| |
| |
在处理大文件时,绝对不要使用慢速操作——改用向量化替代方案:
| 慢速(禁止) | 快速(推荐) |
|---|---|
| |
| |
| 向量化布尔索引 |
| |
| |
Memory Estimation
内存估算
Estimate memory before loading to avoid OOM:
Estimated MB ≈ rows × cols × 8 / 1024² (numeric columns)
Estimated MB ≈ rows × cols × 50 / 1024² (with text columns)| Rows | 20 cols (numeric) | 20 cols (with text) |
|---|---|---|
| 100k | ~15 MB | ~95 MB |
| 500k | ~76 MB | ~477 MB |
| 1M | ~153 MB | ~953 MB |
When estimated memory exceeds 80% of available RAM, use column-selective loading () or chunked processing.
pd.read_parquet(columns=[...])加载前估算内存以避免OOM:
Estimated MB ≈ rows × cols × 8 / 1024² (numeric columns)
Estimated MB ≈ rows × cols × 50 / 1024² (with text columns)| 行数 | 20列(数值型) | 20列(含文本) |
|---|---|---|
| 10万 | ~15 MB | ~95 MB |
| 50万 | ~76 MB | ~477 MB |
| 100万 | ~153 MB | ~953 MB |
当估算内存超过可用RAM的80%时,使用选择性列加载()或分块处理。
pd.read_parquet(columns=[...])Best Practices
最佳实践
- Parquet is king: For any file >= 10k rows, convert to Parquet before analysis. Parquet supports columnar reads, compressed storage, and loads 10-50x faster than xlsx.
- Streaming is the safety net: For 100k+ rows, always use openpyxl +
read_only. Neveriter_rowsfor full load.pd.read_excel() - Release memory promptly: after every intermediate DataFrame.
del df; gc.collect() - Excel row limit: Max 1,048,576 rows per sheet. Auto-split to multiple sheets or export as CSV when exceeded.
- Use write_only for output: Files with >50k cells must use .
openpyxl Workbook(write_only=True)
- Parquet是首选:对于任何≥1万行的文件,分析前先转换为Parquet。Parquet支持列读取、压缩存储,加载速度比xlsx快10-50倍。
- 流式读取是安全保障:对于10万行以上的文件,始终使用openpyxl +
read_only。绝对不要用iter_rows完整加载。pd.read_excel() - 及时释放内存:每个中间DataFrame处理完后执行。
del df; gc.collect() - Excel行数限制:单工作表最大支持1,048,576行。超过时自动拆分为多个工作表或导出为CSV。
- 输出使用write_only模式:单元格数>5万的文件必须使用。
openpyxl Workbook(write_only=True)