# pandas
pandas (v3.0+) is Python's primary library for in-memory tabular data analysis. It provides `DataFrame` and `Series` as core structures with vectorized operations, I/O adapters, groupby, merging, and time-series support.

## Core Data Structures
- `DataFrame` - 2-dimensional labeled table with columns of potentially different types.
- `Series` - 1-dimensional labeled array; a single column or row of a DataFrame.
```python
import pandas as pd
import numpy as np

# Create DataFrame from dict
df = pd.DataFrame({
    "name": ["Alice", "Bob", "Charlie"],
    "score": [95, 82, 78],
    "grade": pd.Categorical(["A", "B", "C"]),
})

# Create Series
s = pd.Series([10, 20, 30], index=["a", "b", "c"], name="values")
```
## Indexing and Selection

Use `.loc` (label-based) and `.iloc` (position-based) for explicit indexing.

```python
# Label-based selection
df.loc[0, "name"]               # single value
df.loc[:, ["name", "score"]]    # multiple columns
df.loc[df["score"] > 80]        # boolean mask

# Position-based selection
df.iloc[0, 1]      # row 0, col 1
df.iloc[1:3, :]    # rows 1-2, all columns
```
**Never use chained indexing** (`df["col"][condition]`). In pandas 3.0 with Copy-on-Write enabled, chained assignment raises `ChainedAssignmentError`. Always use `.loc` for conditional assignment:

```python
# Wrong — chained assignment, silently fails or raises in pandas 3.0
df["score"][df["name"] == "Bob"] = 90

# Correct
df.loc[df["name"] == "Bob", "score"] = 90
```
## Copy-on-Write (pandas 3.0 default)
Copy-on-Write (CoW) is the default behavior in pandas 3.0. Every derived object (slice, subset, result of a method) behaves as an independent copy — modifications never propagate to the parent.
Key rule: assign back to the same variable to avoid unnecessary copies:
```python
# Triggers a copy on every subsequent mutation (wasteful)
df2 = df.reset_index(drop=True)
df2.iloc[0, 0] = 100   # triggers copy because df and df2 share data

# Reassign to same variable — no extra copy needed
df = df.reset_index(drop=True)
df.iloc[0, 0] = 100    # operates in-place, no shared data
```

**Avoid keeping unnecessary references alive** — they prevent CoW from optimizing copies away.
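The independence guarantee above is easy to verify directly — a derived subset behaves as a copy, so mutating it never touches the parent (a minimal sketch; the same result holds on pandas 2.x, where a boolean-mask selection also returns a copy):

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3]})
subset = df[df["a"] > 1]    # derived object
subset.loc[:, "a"] = 99     # mutate the subset only

print(df["a"].tolist())     # parent unchanged: [1, 2, 3]
print(subset["a"].tolist()) # [99, 99]
```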
## Efficient Data Types
Default dtypes are not memory-optimal. Apply the correct dtype at read time or immediately after.
| Data | Default dtype | Optimal dtype |
|---|---|---|
| Low-cardinality strings | `object` | `category` |
| Large integers | `int64` | `int32` / `uint32` |
| Booleans with NA | `object` | `boolean` |
| Floats | `float64` | `float32` |

```python
# Convert low-cardinality string column to Categorical (~10x memory reduction)
df["status"] = df["status"].astype("category")

# Downcast numerics
df["id"] = pd.to_numeric(df["id"], downcast="unsigned")
df["value"] = pd.to_numeric(df["value"], downcast="float")

# Check memory usage
df.memory_usage(deep=True)
```
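As a rough check of the memory claim, converting a repetitive string column to `category` shrinks it substantially — the column stores small integer codes plus one lookup table (a minimal sketch with synthetic data):

```python
import pandas as pd

s_obj = pd.Series(["low", "mid", "high"] * 10_000)   # repetitive strings
s_cat = s_obj.astype("category")                     # integer codes + 3-entry lookup

print(s_obj.memory_usage(deep=True) > s_cat.memory_usage(deep=True))  # True
```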
Use `pd.Categorical` for any column with fewer than ~50% unique values relative to row count.

## Vectorized Operations
Prefer vectorized operations over explicit Python loops. Vectorized code operates on entire arrays using optimized C/NumPy code.
```python
# Wrong — Python loop, slow
results = []
for _, row in df.iterrows():
    results.append(row["x"] * row["y"])

# Correct — vectorized
df["product"] = df["x"] * df["y"]

# Use .str accessor for string operations (vectorized)
df["name_upper"] = df["name"].str.upper()
df["name_len"] = df["name"].str.len()

# Use .dt accessor for datetime operations
df["year"] = df["timestamp"].dt.year
df["month"] = df["timestamp"].dt.month
```
**Iteration order** (fastest to slowest):
1. Vectorized column operations (preferred)
2. `df.apply(func, axis=1)` with `raw=True` for numeric
3. `df.apply(func, axis=1)` — calls func per row as Series
4. `itertuples()` — avoid unless necessary
5. `iterrows()` — avoid, slowest, loses dtypes
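When a computation truly cannot be vectorized, `itertuples()` is the least-bad loop — its namedtuples preserve dtypes, unlike `iterrows()`. A quick sanity check that it matches the vectorized result (a minimal sketch):

```python
import pandas as pd

df = pd.DataFrame({"x": [1, 2, 3], "y": [4, 5, 6]})

# Vectorized — preferred
df["product"] = df["x"] * df["y"]

# itertuples fallback — attribute access on namedtuples
looped = [row.x * row.y for row in df.itertuples(index=False)]

print(looped)  # [4, 10, 18]
```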
## Missing Data
pandas uses `NaN` (float), `pd.NaT` (datetime), and `pd.NA` (nullable extension types) to represent missing values.

```python
# Detect missing
df.isna()            # boolean mask
df["col"].notna()    # inverse mask

# Fill missing values
df["score"].fillna(0)
df["score"].fillna(df["score"].median())

# Drop rows with missing values
df.dropna(subset=["score", "name"])

# Use nullable integer (preserves int type with NAs)
s = pd.array([1, 2, None], dtype="Int64")   # capital I = nullable
```
Prefer nullable extension types (`Int64`, `Float64`, `boolean`, `string`) over NumPy types when the column may contain missing values. This avoids silent upcasting to `float64`.
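The upcasting pitfall is easy to demonstrate — a `None` in a plain numeric column silently produces `float64`, while the nullable `Int64` dtype stays integer (a minimal sketch):

```python
import pandas as pd

s_np = pd.Series([1, 2, None])                        # NaN forces float64
s_nullable = pd.Series([1, 2, None], dtype="Int64")   # pd.NA, stays integer

print(s_np.dtype)        # float64
print(s_nullable.dtype)  # Int64
```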
## GroupBy: Split-Apply-Combine
```python
# Aggregation — returns one row per group
df.groupby("grade")["score"].mean()
df.groupby("grade").agg({"score": ["mean", "std", "count"]})

# Named aggregation (pandas 0.25+)
df.groupby("grade").agg(
    avg_score=("score", "mean"),
    count=("name", "count"),
)

# Transformation — returns same shape as input
df["score_zscore"] = df.groupby("grade")["score"].transform(
    lambda s: (s - s.mean()) / s.std()
)

# Filter — keep groups matching a condition
df.groupby("grade").filter(lambda g: len(g) >= 2)
```
Avoid passing a scalar-returning function to `apply` when `transform` fits — use `transform` directly with a named aggregation function for best performance.
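For example, broadcasting a per-group mean is fastest when the aggregation's name is passed straight to `transform`, which dispatches to the built-in aggregation path instead of calling a Python lambda per group (a minimal sketch):

```python
import pandas as pd

df = pd.DataFrame({"grade": ["A", "B", "A"], "score": [95, 82, 78]})

# Named aggregation — no Python-level lambda per group
df["grade_mean"] = df.groupby("grade")["score"].transform("mean")

print(df["grade_mean"].tolist())  # [86.5, 82.0, 86.5]
```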
## Merging and Reshaping
```python
# Merge (SQL-style join)
result = pd.merge(left, right, on="key", how="left")
result = pd.merge(left, right, left_on="id", right_on="user_id", how="inner")

# Concat along rows
pd.concat([df1, df2], ignore_index=True)

# Pivot — long to wide
pivot = df.pivot_table(index="date", columns="category", values="amount", aggfunc="sum")

# Melt — wide to long
melted = df.melt(id_vars=["id"], value_vars=["q1", "q2", "q3"], var_name="quarter")
```
Always pass `ignore_index=True` to `pd.concat` when the original indices are meaningless row numbers — prevents duplicate index values.
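The duplicate-index hazard is easy to see on two small frames (a minimal sketch):

```python
import pandas as pd

df1 = pd.DataFrame({"v": [1, 2]})
df2 = pd.DataFrame({"v": [3, 4]})

bad = pd.concat([df1, df2])                      # keeps original row labels
good = pd.concat([df1, df2], ignore_index=True)  # fresh RangeIndex

print(bad.index.tolist())   # [0, 1, 0, 1] — duplicates
print(good.index.tolist())  # [0, 1, 2, 3]
```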
## I/O Best Practices
| Format | Read | Write | Notes |
|---|---|---|---|
| CSV | `read_csv` | `to_csv` | Use `dtype=` and `usecols=` at read time |
| Parquet | `read_parquet` | `to_parquet` | Preferred for large datasets |
| JSON | `read_json` | `to_json` | Use for APIs |
| Excel | `read_excel` | `to_excel` | Slow; avoid for large files |
| SQL | `read_sql` | `to_sql` | Use `chunksize=` for large writes |
undefinedSpecify dtypes at read time — avoids post-hoc conversion cost
Specify dtypes at read time — avoids post-hoc conversion cost
df = pd.read_csv(
"data.csv",
usecols=["id", "name", "amount"],
dtype={"id": "int32", "name": "category"},
parse_dates=["created_at"],
)
df = pd.read_csv(
"data.csv",
usecols=["id", "name", "amount"],
dtype={"id": "int32", "name": "category"},
parse_dates=["created_at"],
)
Parquet with column selection — reads only needed columns from disk
Parquet with column selection — reads only needed columns from disk
df = pd.read_parquet("data.parquet", columns=["id", "amount"])
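For CSVs too large for memory, `chunksize=` streams the file and lets you aggregate incrementally. A minimal sketch, using an in-memory buffer to stand in for a large on-disk file (the `amount` column follows the example above):

```python
import io

import pandas as pd

# Small in-memory CSV standing in for a large file on disk
csv_text = "id,amount\n1,10.0\n2,20.0\n3,30.0\n"

# Stream in bounded-memory chunks and aggregate incrementally
total = 0.0
for chunk in pd.read_csv(io.StringIO(csv_text), usecols=["amount"], chunksize=2):
    total += chunk["amount"].sum()

print(total)  # 60.0
```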
## User-Defined Functions
Minimize UDF usage. Built-in pandas/NumPy methods are almost always faster than custom Python functions.
When a UDF is necessary:
```python
# Use raw=True for numeric operations — passes NumPy array, not Series
df["result"] = df["value"].rolling(10).apply(np.sum, raw=True)

# Use engine="numba" for large datasets (>1M rows)
df["result"] = df["value"].rolling(10).apply(my_func, raw=True, engine="numba")

# Prefer map over apply for element-wise Series operations
df["label"] = df["code"].map({1: "low", 2: "mid", 3: "high"})
```
Never use `apply` with `axis=1` on large DataFrames when a vectorized alternative exists.
## Quick Reference
| Task | Method |
|---|---|
| Shape | `df.shape` |
| Column dtypes | `df.dtypes` |
| Summary statistics | `df.describe()` |
| Unique values | `df["col"].unique()` |
| Value counts | `df["col"].value_counts()` |
| Sort | `df.sort_values("col")` |
| Rename columns | `df.rename(columns={"old": "new"})` |
| Drop columns | `df.drop(columns=["col"])` |
| Reset index | `df.reset_index(drop=True)` |
| Filter rows | `df.query("score > 80")` or boolean mask |
| Select by dtype | `df.select_dtypes(include="number")` |
| Duplicate detection | `df.duplicated()` |
| Apply function | `df.apply(func)` |
## Additional Resources

### Reference Files
- `references/data-types-and-memory.md` - dtype selection guide, nullable types, Categorical patterns, memory profiling
- `references/performance-and-scaling.md` - CoW patterns, vectorization, chunking large files, Numba/Cython integration, eval()