# pandas
pandas (v3.0+) is Python's primary library for in-memory tabular data analysis. It provides `DataFrame` and `Series` as core structures with vectorized operations, I/O adapters, groupby, merging, and time-series support.

## Core Data Structures
**DataFrame** - 2-dimensional labeled table with columns of potentially different types.
**Series** - 1-dimensional labeled array; a single column or row of a DataFrame.

```python
import pandas as pd
import numpy as np

# Create DataFrame from dict
df = pd.DataFrame({
    "name": ["Alice", "Bob", "Charlie"],
    "score": [95, 82, 78],
    "grade": pd.Categorical(["A", "B", "C"]),
})

# Create Series
s = pd.Series([10, 20, 30], index=["a", "b", "c"], name="values")
```

## Indexing and Selection

Use `.loc` (label-based) and `.iloc` (position-based) for explicit indexing.

```python
# Label-based selection
df.loc[0, "name"]             # single value
df.loc[:, ["name", "score"]]  # multiple columns
df.loc[df["score"] > 80]      # boolean mask

# Position-based selection
df.iloc[0, 1]    # row 0, col 1
df.iloc[1:3, :]  # rows 1-2, all columns
```

**Never use chained indexing** (`df["col"][condition]`). In pandas 3.0 with Copy-on-Write enabled, chained assignment raises `ChainedAssignmentError`. Always use `.loc` for conditional assignment:

```python
# Wrong — chained assignment, silently fails or raises in pandas 3.0
df["score"][df["name"] == "Bob"] = 90

# Correct
df.loc[df["name"] == "Bob", "score"] = 90
```
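A related pitfall when building boolean masks: conditions must be combined with the bitwise operators `&` and `|` plus parentheses, because Python's `and`/`or` raise on a Series. A minimal sketch with a hypothetical frame:

```python
import pandas as pd

df = pd.DataFrame({"score": [95, 82, 78], "grade": ["A", "B", "C"]})

# Combine masks with & / | and parentheses; `and`/`or` raise ValueError on Series
mask = (df["score"] > 80) & (df["grade"] != "C")
print(df.loc[mask, "score"].tolist())  # [95, 82]
```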

## Copy-on-Write (pandas 3.0 default)

Copy-on-Write (CoW) is the default behavior in pandas 3.0. Every derived object (slice, subset, result of a method) behaves as an independent copy — modifications never propagate to the parent.

Key rule: assign back to the same variable to avoid unnecessary copies:

```python
# Triggers a copy on every subsequent mutation (wasteful)
df2 = df.reset_index(drop=True)
df2.iloc[0, 0] = 100  # triggers copy because df and df2 share data

# Reassign to same variable — no extra copy needed
df = df.reset_index(drop=True)
df.iloc[0, 0] = 100   # operates in-place, no shared data
```

**Avoid keeping unnecessary references alive** — they prevent CoW from optimizing copies away.
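The independence of derived objects is easy to verify; a quick sketch with a throwaway frame (boolean indexing also returns a copy in earlier pandas versions, so the parent stays untouched either way):

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3]})
subset = df[df["a"] > 1]   # derived object

subset.loc[:, "a"] = 0     # writes to the derived copy only
print(df["a"].tolist())    # parent unchanged: [1, 2, 3]
```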

## Efficient Data Types

Default dtypes are not memory-optimal. Apply the correct dtype at read time or immediately after.

| Data | Default dtype | Optimal dtype |
|---|---|---|
| Low-cardinality strings | `object` / `str` | `category` |
| Large integers | `int64` | `int32` / `int16` |
| Booleans with NA | `object` | `boolean` (nullable) |
| Floats | `float64` | `float32` |

```python
# Convert low-cardinality string column to Categorical (~10x memory reduction)
df["status"] = df["status"].astype("category")

# Downcast numerics
df["id"] = pd.to_numeric(df["id"], downcast="unsigned")
df["value"] = pd.to_numeric(df["value"], downcast="float")

# Check memory usage
df.memory_usage(deep=True)
```

Use `pd.Categorical` for any column with fewer than ~50% unique values relative to row count.
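The effect of a `category` conversion can be measured directly with `memory_usage(deep=True)`; a sketch with synthetic data (exact ratios vary by pandas version and string backend):

```python
import pandas as pd

# Synthetic low-cardinality column: many rows, three unique values
s_obj = pd.Series(["active", "inactive", "pending"] * 100_000)
s_cat = s_obj.astype("category")

obj_bytes = s_obj.memory_usage(deep=True)
cat_bytes = s_cat.memory_usage(deep=True)
print(f"category dtype is ~{obj_bytes / cat_bytes:.0f}x smaller")
```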

## Vectorized Operations

Prefer vectorized operations over explicit Python loops. Vectorized code operates on entire arrays using optimized C/NumPy code.

```python
# Wrong — Python loop, slow
results = []
for _, row in df.iterrows():
    results.append(row["x"] * row["y"])

# Correct — vectorized
df["product"] = df["x"] * df["y"]

# Use .str accessor for string operations (vectorized)
df["name_upper"] = df["name"].str.upper()
df["name_len"] = df["name"].str.len()

# Use .dt accessor for datetime operations
df["year"] = df["timestamp"].dt.year
df["month"] = df["timestamp"].dt.month
```

**Iteration order** (fastest to slowest):
1. Vectorized column operations (preferred)
2. `df.apply(func, axis=1)` with `raw=True` for numeric
3. `df.apply(func, axis=1)` — calls func per row as Series
4. `itertuples()` — avoid unless necessary
5. `iterrows()` — avoid, slowest, loses dtypes
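Conditional column logic, a common reason people reach for row loops, also vectorizes. A sketch using NumPy helpers on a hypothetical frame:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"score": [95, 82, 78, 60]})

# Single condition: np.where
df["passed"] = np.where(df["score"] >= 70, "yes", "no")

# Multiple conditions: np.select, evaluated top to bottom
conditions = [df["score"] >= 90, df["score"] >= 80]
df["band"] = np.select(conditions, ["A", "B"], default="C")
print(df["band"].tolist())  # ['A', 'B', 'C', 'C']
```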

## Missing Data

pandas uses `NaN` (float), `pd.NaT` (datetime), and `pd.NA` (nullable extension types) to represent missing values.

```python
# Detect missing
df.isna()          # boolean mask
df["col"].notna()  # inverse mask

# Fill missing values
df["score"].fillna(0)
df["score"].fillna(df["score"].median())

# Drop rows with missing values
df.dropna(subset=["score", "name"])

# Use nullable integer (preserves int type with NAs)
s = pd.array([1, 2, None], dtype="Int64")  # capital I = nullable
```

Prefer nullable extension types (`Int64`, `Float64`, `boolean`, `string`) over NumPy types when the column may contain missing values. This avoids silent upcasting to `float64`.
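The silent upcast is easy to demonstrate; a minimal sketch:

```python
import pandas as pd

# NumPy-backed ints upcast to float64 the moment an NA appears
s_np = pd.Series([1, 2, None])
print(s_np.dtype)        # float64

# Nullable extension type keeps integer dtype and uses pd.NA
s_nullable = pd.Series([1, 2, None], dtype="Int64")
print(s_nullable.dtype)  # Int64
```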

## GroupBy: Split-Apply-Combine

```python
# Aggregation — returns one row per group
df.groupby("grade")["score"].mean()
df.groupby("grade").agg({"score": ["mean", "std", "count"]})

# Named aggregation (pandas 0.25+)
df.groupby("grade").agg(
    avg_score=("score", "mean"),
    count=("name", "count"),
)

# Transformation — returns same shape as input
df["score_zscore"] = df.groupby("grade")["score"].transform(
    lambda s: (s - s.mean()) / s.std()
)

# Filter — keep groups matching a condition
df.groupby("grade").filter(lambda g: len(g) >= 2)
```

When each group reduces to a single scalar, pass a built-in aggregation name to `transform` (e.g. `transform("mean")`) rather than a Python lambda; named aggregations dispatch to fast compiled paths.
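The z-score lambda above runs Python code per group; for simple reductions, a built-in name broadcasts the scalar back through a compiled path. A sketch with hypothetical data:

```python
import pandas as pd

df = pd.DataFrame({"grade": ["A", "A", "B"], "score": [90, 100, 80]})

# "mean" is computed per group and broadcast back to the original shape
df["grade_mean"] = df.groupby("grade")["score"].transform("mean")
print(df["grade_mean"].tolist())  # [95.0, 95.0, 80.0]
```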

## Merging and Reshaping

```python
# Merge (SQL-style join)
result = pd.merge(left, right, on="key", how="left")
result = pd.merge(left, right, left_on="id", right_on="user_id", how="inner")

# Concat along rows
pd.concat([df1, df2], ignore_index=True)

# Pivot — long to wide
pivot = df.pivot_table(index="date", columns="category", values="amount", aggfunc="sum")

# Melt — wide to long
melted = df.melt(id_vars=["id"], value_vars=["q1", "q2", "q3"], var_name="quarter")
```

Always pass `ignore_index=True` to `pd.concat` when the original indices are meaningless row numbers — prevents duplicate index values.
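Joins are a common source of silent row duplication; the `validate=` and `indicator=` parameters of `pd.merge` make join assumptions explicit and auditable. A sketch with toy frames:

```python
import pandas as pd

left = pd.DataFrame({"key": [1, 2, 3], "x": ["a", "b", "c"]})
right = pd.DataFrame({"key": [2, 3, 4], "y": [20, 30, 40]})

# validate= raises MergeError if the join is not one-to-one;
# indicator= adds a _merge column showing each row's provenance
result = pd.merge(left, right, on="key", how="outer",
                  validate="one_to_one", indicator=True)
print(result["_merge"].value_counts().to_dict())
```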

## I/O Best Practices

| Format | Read | Write | Notes |
|---|---|---|---|
| CSV | `pd.read_csv()` | `df.to_csv(index=False)` | Use `usecols=` to limit columns |
| Parquet | `pd.read_parquet()` | `df.to_parquet()` | Preferred for large datasets |
| JSON | `pd.read_json()` | `df.to_json()` | Use `orient="records"` for APIs |
| Excel | `pd.read_excel()` | `df.to_excel(index=False)` | Slow; avoid for large files |
| SQL | `pd.read_sql(query, conn)` | `df.to_sql(name, conn)` | Use `chunksize=` for large writes |

```python
# Specify dtypes at read time — avoids post-hoc conversion cost
df = pd.read_csv(
    "data.csv",
    usecols=["id", "name", "amount"],
    dtype={"id": "int32", "name": "category"},
    parse_dates=["created_at"],
)

# Parquet with column selection — reads only needed columns from disk
df = pd.read_parquet("data.parquet", columns=["id", "amount"])
```
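For CSVs too large for memory, `chunksize=` turns `read_csv` into an iterator of DataFrames. A self-contained sketch using an in-memory buffer in place of a real file:

```python
import io
import pandas as pd

csv_data = io.StringIO("amount\n1\n2\n3\n4\n5\n")

# Each chunk is an ordinary DataFrame; aggregate incrementally
total = 0
for chunk in pd.read_csv(csv_data, chunksize=2):
    total += chunk["amount"].sum()
print(total)  # 15
```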

## User-Defined Functions

Minimize UDF usage. Built-in pandas/NumPy methods are almost always faster than custom Python functions.

When a UDF is necessary:

```python
# Use raw=True for numeric operations — passes NumPy array, not Series
df["result"] = df["value"].rolling(10).apply(np.sum, raw=True)

# Use engine="numba" for large datasets (>1M rows)
df["result"] = df["value"].rolling(10).apply(my_func, raw=True, engine="numba")

# Prefer map over apply for element-wise Series operations
df["label"] = df["code"].map({1: "low", 2: "mid", 3: "high"})
```

Never use `apply` with `axis=1` on large DataFrames when a vectorized alternative exists.

## Quick Reference

| Task | Method |
|---|---|
| Shape | `df.shape`, `len(df)` |
| Column dtypes | `df.dtypes` |
| Summary statistics | `df.describe()` |
| Unique values | `df["col"].unique()`, `df["col"].nunique()` |
| Value counts | `df["col"].value_counts()` |
| Sort | `df.sort_values("col", ascending=False)` |
| Rename columns | `df.rename(columns={"old": "new"})` |
| Drop columns | `df.drop(columns=["a", "b"])` |
| Reset index | `df.reset_index(drop=True)` |
| Filter rows | `df.query("score > 80 and grade == 'A'")` |
| Select by dtype | `df.select_dtypes(include="number")` |
| Duplicate detection | `df.duplicated()`, `df.drop_duplicates()` |
| Apply function | `df["col"].map(func)` |

## Additional Resources

### Reference Files

- `references/data-types-and-memory.md` - dtype selection guide, nullable types, Categorical patterns, memory profiling
- `references/performance-and-scaling.md` - CoW patterns, vectorization, chunking large files, Numba/Cython integration, `eval()`