# pandas
pandas (v3.0+) is Python's primary library for in-memory tabular data analysis. It provides `DataFrame` and `Series` as core structures with vectorized operations, I/O adapters, groupby, merging, and time-series support.

## Core Data Structures
- `DataFrame` - 2-dimensional labeled table with columns of potentially different types.
- `Series` - 1-dimensional labeled array; a single column or row of a DataFrame.
```python
import pandas as pd
import numpy as np

# Create DataFrame from dict
df = pd.DataFrame({
    "name": ["Alice", "Bob", "Charlie"],
    "score": [95, 82, 78],
    "grade": pd.Categorical(["A", "B", "C"]),
})

# Create Series
s = pd.Series([10, 20, 30], index=["a", "b", "c"], name="values")
```
## Indexing and Selection

Use `.loc` (label-based) and `.iloc` (position-based) for explicit indexing.

```python
# Label-based selection
df.loc[0, "name"]               # single value
df.loc[:, ["name", "score"]]    # multiple columns
df.loc[df["score"] > 80]        # boolean mask

# Position-based selection
df.iloc[0, 1]      # row 0, col 1
df.iloc[1:3, :]    # rows 1-2, all columns
```
**Never use chained indexing** (`df["col"][condition]`). In pandas 3.0 with Copy-on-Write enabled, chained assignment raises `ChainedAssignmentError`. Always use `.loc` for conditional assignment:

```python
# Wrong — chained assignment, silently fails or raises in pandas 3.0
df["score"][df["name"] == "Bob"] = 90

# Correct
df.loc[df["name"] == "Bob", "score"] = 90
```
## Copy-on-Write (pandas 3.0 default)
Copy-on-Write (CoW) is the default behavior in pandas 3.0. Every derived object (slice, subset, result of a method) behaves as an independent copy — modifications never propagate to the parent.
Key rule: assign back to the same variable to avoid unnecessary copies:
```python
# Triggers a copy on every subsequent mutation (wasteful)
df2 = df.reset_index(drop=True)
df2.iloc[0, 0] = 100   # triggers copy because df and df2 share data

# Reassign to same variable — no extra copy needed
df = df.reset_index(drop=True)
df.iloc[0, 0] = 100    # operates in-place, no shared data
```

**Avoid keeping unnecessary references alive** — they prevent CoW from optimizing copies away.
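The independence guarantee above is easy to verify directly — a derived subset behaves as a copy, so mutating it never touches the parent (a minimal sketch; the same result holds on pandas 2.x, where a boolean-mask selection also returns a copy):

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3]})
subset = df[df["a"] > 1]    # derived object
subset.loc[:, "a"] = 99     # mutate the subset only

print(df["a"].tolist())     # parent unchanged: [1, 2, 3]
print(subset["a"].tolist()) # [99, 99]
```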
## Efficient Data Types
Default dtypes are not memory-optimal. Apply the correct dtype at read time or immediately after.
| Data | Default dtype | Optimal dtype |
|---|---|---|
| Low-cardinality strings | `object` | `category` |
| Large integers | `int64` | `int32` / `uint32` |
| Booleans with NA | `object` | `boolean` |
| Floats | `float64` | `float32` |

```python
# Convert low-cardinality string column to Categorical (~10x memory reduction)
df["status"] = df["status"].astype("category")

# Downcast numerics
df["id"] = pd.to_numeric(df["id"], downcast="unsigned")
df["value"] = pd.to_numeric(df["value"], downcast="float")

# Check memory usage
df.memory_usage(deep=True)
```
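As a rough check of the memory claim, converting a repetitive string column to `category` shrinks it substantially — the column stores small integer codes plus one lookup table (a minimal sketch with synthetic data):

```python
import pandas as pd

s_obj = pd.Series(["low", "mid", "high"] * 10_000)   # repetitive strings
s_cat = s_obj.astype("category")                     # integer codes + 3-entry lookup

print(s_obj.memory_usage(deep=True) > s_cat.memory_usage(deep=True))  # True
```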
Use `pd.Categorical` for any column with fewer than ~50% unique values relative to row count.

## Vectorized Operations
Prefer vectorized operations over explicit Python loops. Vectorized code operates on entire arrays using optimized C/NumPy code.
```python
# Wrong — Python loop, slow
results = []
for _, row in df.iterrows():
    results.append(row["x"] * row["y"])

# Correct — vectorized
df["product"] = df["x"] * df["y"]

# Use .str accessor for string operations (vectorized)
df["name_upper"] = df["name"].str.upper()
df["name_len"] = df["name"].str.len()

# Use .dt accessor for datetime operations
df["year"] = df["timestamp"].dt.year
df["month"] = df["timestamp"].dt.month
```
**Iteration order** (fastest to slowest):
1. Vectorized column operations (preferred)
2. `df.apply(func, axis=1)` with `raw=True` for numeric
3. `df.apply(func, axis=1)` — calls func per row as Series
4. `itertuples()` — avoid unless necessary
5. `iterrows()` — avoid, slowest, loses dtypes
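When a computation truly cannot be vectorized, `itertuples()` is the least-bad loop — its namedtuples preserve dtypes, unlike `iterrows()`. A quick sanity check that it matches the vectorized result (a minimal sketch):

```python
import pandas as pd

df = pd.DataFrame({"x": [1, 2, 3], "y": [4, 5, 6]})

# Vectorized — preferred
df["product"] = df["x"] * df["y"]

# itertuples fallback — attribute access on namedtuples
looped = [row.x * row.y for row in df.itertuples(index=False)]

print(looped)  # [4, 10, 18]
```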
## Missing Data
pandas uses `NaN` (float), `pd.NaT` (datetime), and `pd.NA` (nullable extension types) to represent missing values.

```python
# Detect missing
df.isna()            # boolean mask
df["col"].notna()    # inverse mask

# Fill missing values
df["score"].fillna(0)
df["score"].fillna(df["score"].median())

# Drop rows with missing values
df.dropna(subset=["score", "name"])

# Use nullable integer (preserves int type with NAs)
s = pd.array([1, 2, None], dtype="Int64")   # capital I = nullable
```
Prefer nullable extension types (`Int64`, `Float64`, `boolean`, `string`) over NumPy types when the column may contain missing values. This avoids silent upcasting to `float64`.
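The upcasting pitfall is easy to demonstrate — a `None` in a plain numeric column silently produces `float64`, while the nullable `Int64` dtype stays integer (a minimal sketch):

```python
import pandas as pd

s_np = pd.Series([1, 2, None])                        # NaN forces float64
s_nullable = pd.Series([1, 2, None], dtype="Int64")   # pd.NA, stays integer

print(s_np.dtype)        # float64
print(s_nullable.dtype)  # Int64
```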
## GroupBy: Split-Apply-Combine
```python
# Aggregation — returns one row per group
df.groupby("grade")["score"].mean()
df.groupby("grade").agg({"score": ["mean", "std", "count"]})

# Named aggregation (pandas 0.25+)
df.groupby("grade").agg(
    avg_score=("score", "mean"),
    count=("name", "count"),
)

# Transformation — returns same shape as input
df["score_zscore"] = df.groupby("grade")["score"].transform(
    lambda s: (s - s.mean()) / s.std()
)

# Filter — keep groups matching a condition
df.groupby("grade").filter(lambda g: len(g) >= 2)
```
Avoid passing a scalar-returning function to `apply` when `transform` fits — use `transform` directly with a named aggregation function for best performance.
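For example, broadcasting a per-group mean is fastest when the aggregation's name is passed straight to `transform`, which dispatches to the built-in aggregation path instead of calling a Python lambda per group (a minimal sketch):

```python
import pandas as pd

df = pd.DataFrame({"grade": ["A", "B", "A"], "score": [95, 82, 78]})

# Named aggregation — no Python-level lambda per group
df["grade_mean"] = df.groupby("grade")["score"].transform("mean")

print(df["grade_mean"].tolist())  # [86.5, 82.0, 86.5]
```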
## Merging and Reshaping
```python
# Merge (SQL-style join)
result = pd.merge(left, right, on="key", how="left")
result = pd.merge(left, right, left_on="id", right_on="user_id", how="inner")

# Concat along rows
pd.concat([df1, df2], ignore_index=True)

# Pivot — long to wide
pivot = df.pivot_table(index="date", columns="category", values="amount", aggfunc="sum")

# Melt — wide to long
melted = df.melt(id_vars=["id"], value_vars=["q1", "q2", "q3"], var_name="quarter")
```
Always pass `ignore_index=True` to `pd.concat` when the original indices are meaningless row numbers — prevents duplicate index values.
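The duplicate-index hazard is easy to see on two small frames (a minimal sketch):

```python
import pandas as pd

df1 = pd.DataFrame({"v": [1, 2]})
df2 = pd.DataFrame({"v": [3, 4]})

bad = pd.concat([df1, df2])                      # keeps original row labels
good = pd.concat([df1, df2], ignore_index=True)  # fresh RangeIndex

print(bad.index.tolist())   # [0, 1, 0, 1] — duplicates
print(good.index.tolist())  # [0, 1, 2, 3]
```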
## I/O Best Practices
| Format | Read | Write | Notes |
|---|---|---|---|
| CSV | `read_csv` | `to_csv` | Use `dtype=` and `usecols=` at read time |
| Parquet | `read_parquet` | `to_parquet` | Preferred for large datasets |
| JSON | `read_json` | `to_json` | Use for APIs |
| Excel | `read_excel` | `to_excel` | Slow; avoid for large files |
| SQL | `read_sql` | `to_sql` | Use `chunksize=` for large writes |
undefinedSpecify dtypes at read time — avoids post-hoc conversion cost
Specify dtypes at read time — avoids post-hoc conversion cost
df = pd.read_csv(
"data.csv",
usecols=["id", "name", "amount"],
dtype={"id": "int32", "name": "category"},
parse_dates=["created_at"],
)
df = pd.read_csv(
"data.csv",
usecols=["id", "name", "amount"],
dtype={"id": "int32", "name": "category"},
parse_dates=["created_at"],
)
Parquet with column selection — reads only needed columns from disk
Parquet with column selection — reads only needed columns from disk
df = pd.read_parquet("data.parquet", columns=["id", "amount"])
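For CSVs too large for memory, `chunksize=` streams the file and lets you aggregate incrementally. A minimal sketch, using an in-memory buffer to stand in for a large on-disk file (the `amount` column follows the example above):

```python
import io

import pandas as pd

# Small in-memory CSV standing in for a large file on disk
csv_text = "id,amount\n1,10.0\n2,20.0\n3,30.0\n"

# Stream in bounded-memory chunks and aggregate incrementally
total = 0.0
for chunk in pd.read_csv(io.StringIO(csv_text), usecols=["amount"], chunksize=2):
    total += chunk["amount"].sum()

print(total)  # 60.0
```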
## User-Defined Functions
Minimize UDF usage. Built-in pandas/NumPy methods are almost always faster than custom Python functions.
When a UDF is necessary:
```python
# Use raw=True for numeric operations — passes NumPy array, not Series
df["result"] = df["value"].rolling(10).apply(np.sum, raw=True)

# Use engine="numba" for large datasets (>1M rows)
df["result"] = df["value"].rolling(10).apply(my_func, raw=True, engine="numba")

# Prefer map over apply for element-wise Series operations
df["label"] = df["code"].map({1: "low", 2: "mid", 3: "high"})
```
Never use `apply` with `axis=1` on large DataFrames when a vectorized alternative exists.
## Quick Reference
| Task | Method |
|---|---|
| Shape | `df.shape` |
| Column dtypes | `df.dtypes` |
| Summary statistics | `df.describe()` |
| Unique values | `df["col"].unique()` |
| Value counts | `df["col"].value_counts()` |
| Sort | `df.sort_values("col")` |
| Rename columns | `df.rename(columns={"old": "new"})` |
| Drop columns | `df.drop(columns=["col"])` |
| Reset index | `df.reset_index(drop=True)` |
| Filter rows | `df.query("score > 80")` or boolean mask |
| Select by dtype | `df.select_dtypes(include="number")` |
| Duplicate detection | `df.duplicated()` |
| Apply function | `df.apply(func)` |
## Additional Resources

### Reference Files
- `references/data-types-and-memory.md` - dtype selection guide, nullable types, Categorical patterns, memory profiling
- `references/performance-and-scaling.md` - CoW patterns, vectorization, chunking large files, Numba/Cython integration, eval()