# Pandera: DataFrame Validation

Pandera is an open-source framework for validating DataFrame-like objects at runtime. Define schemas once and reuse them across pandas, polars, Dask, Modin, PySpark, and Ibis backends.
## Import Convention

Since pandera v0.24.0, use the backend-specific module. Importing the top-level `pandera` module produces a `FutureWarning` and will be deprecated in v0.29.0.

```python
import pandera.pandas as pa  # pandas (recommended)
import pandera.polars as pa  # polars
from pandera.typing.pandas import DataFrame, Series, Index
```

## Two Schema Styles
### Object-based API (`DataFrameSchema`)

Suitable for dynamic schema construction, or when schemas must be built programmatically.

```python
import pandas as pd
import pandera.pandas as pa

schema = pa.DataFrameSchema({
    "user_id": pa.Column(int, pa.Check.gt(0)),
    "email": pa.Column(str, pa.Check.str_matches(r"^[^@]+@[^@]+\.[^@]+$")),
    "score": pa.Column(float, [pa.Check.ge(0.0), pa.Check.le(1.0)]),
    "status": pa.Column(str, pa.Check.isin(["active", "inactive", "banned"])),
})
validated = schema.validate(df)
```

### Class-based API (`DataFrameModel`), preferred
Pydantic-style syntax with type annotations. Produces cleaner, reusable schemas that integrate with `@pa.check_types`.

```python
import pandera.pandas as pa
from pandera.typing.pandas import DataFrame, Series

class UserSchema(pa.DataFrameModel):
    user_id: int = pa.Field(gt=0)
    email: str = pa.Field(str_matches=r"^[^@]+@[^@]+\.[^@]+$")
    score: float = pa.Field(ge=0.0, le=1.0)
    status: str = pa.Field(isin=["active", "inactive", "banned"])

    class Config:
        strict = True   # reject extra columns
        coerce = False  # do not silently cast types

# Validate directly
UserSchema.validate(df)

# Or via typing annotation + decorator
@pa.check_types
def process(df: DataFrame[UserSchema]) -> DataFrame[UserSchema]:
    return df
```

## Checks
### Built-in Checks (prefer these over lambdas)

```python
pa.Check.gt(0)                  # greater than
pa.Check.ge(0)                  # greater than or equal
pa.Check.lt(100)                # less than
pa.Check.le(100)                # less than or equal
pa.Check.eq("value")            # equal to
pa.Check.ne("value")            # not equal to
pa.Check.isin(["a", "b"])       # membership
pa.Check.notin(["x"])           # exclusion
pa.Check.str_matches(r"^\d+$")  # regex match
pa.Check.in_range(0, 100)       # closed interval
pa.Check.str_startswith("prefix")
pa.Check.str_endswith("suffix")
pa.Check.str_length(1, 255)     # min/max string length
```

### Custom Checks
```python
# Vectorized (default, faster: operates on the whole Series)
pa.Check(lambda s: s.str.len() <= 255)

# Element-wise (scalar input; use only when vectorized is impractical)
pa.Check(lambda x: x > 0, element_wise=True)

# Always add an error message
pa.Check(lambda s: s > 0, error="values must be positive")
```

### DataFrame-level Checks
```python
schema = pa.DataFrameSchema(
    columns={...},
    checks=pa.Check(lambda df: df["end_date"] >= df["start_date"]),
)
```

In `DataFrameModel`, use `@pa.dataframe_check`:

```python
class Schema(pa.DataFrameModel):
    start_date: int
    end_date: int

    @pa.dataframe_check
    @classmethod
    def end_after_start(cls, df: pd.DataFrame) -> pd.Series:
        return df["end_date"] >= df["start_date"]
```

## Nullable and Optional Columns
```python
# Object API: allow nulls in a column
pa.Column(float, nullable=True)

# DataFrameModel: make a column optional (it may be absent)
from typing import Optional

class Schema(pa.DataFrameModel):
    required_col: Series[int]
    optional_col: Optional[Series[float]]
```

## Coercion
Enable coercion to cast data to the declared type before validation. Use it deliberately: coercion can hide upstream data issues.

```python
# Per-column
pa.Column(int, coerce=True)

# Schema-wide via Config
class Schema(pa.DataFrameModel):
    year: int = pa.Field(gt=2000, coerce=True)

    class Config:
        coerce = True
```

## Lazy Validation: Collect All Errors
By default pandera raises on the first error. Use `lazy=True` to collect all failures before raising, useful for batch reporting.

```python
try:
    schema.validate(df, lazy=True)
except pa.errors.SchemaErrors as exc:
    print(exc.failure_cases)  # DataFrame of all failures
```

## Decorator Integration
Integrate validation transparently into pipelines using decorators.

```python
# DataFrameModel + check_types (recommended)
@pa.check_types
def transform(df: DataFrame[InputSchema]) -> DataFrame[OutputSchema]:
    return df.assign(revenue=df["units"] * df["price"])

# Object API: check_input / check_output
@pa.check_input(input_schema)
@pa.check_output(output_schema)
def pipeline_step(df):
    return df

# check_io: concisely specify both
@pa.check_io(raw=input_schema, out=output_schema)
def pipeline_step(raw):
    return raw
```

Decorators work on sync/async functions, methods, class methods, and static methods.
## Schema Inheritance
Build specialized schemas from a base to avoid repetition.

```python
class BaseEvent(pa.DataFrameModel):
    event_id: str
    timestamp: int = pa.Field(gt=0)

class ClickEvent(BaseEvent):
    url: str
    user_agent: str

    class Config:
        strict = True
```

## Schema Persistence (YAML / Script)
Serialize and reload schemas to keep validation reproducible.

```python
import pandera.io

# Save
pandera.io.to_yaml(schema, "./schema.yaml")

# Load
schema = pandera.io.from_yaml("./schema.yaml")

# Generate a Python script
pandera.io.to_script(schema, "./schema_definition.py")
```

## Schema Inference (Prototyping Only)
Infer a schema from existing data to bootstrap development. Always review and tighten the generated schema before using it in production.

```python
import pandera.pandas as pa

inferred = pa.infer_schema(df)
print(inferred.to_script())  # inspect, then copy and edit
```

## Dropping Invalid Rows
Use `drop_invalid_rows=True` on `DataFrameSchema` to filter out failing rows instead of raising an error. Supported on pandas and polars.

```python
schema = pa.DataFrameSchema(
    {"score": pa.Column(float, pa.Check.ge(0))},
    drop_invalid_rows=True,
)
cleaned = schema.validate(df_with_bad_rows)
```

## Error Handling
```python
from pandera.errors import SchemaError, SchemaErrors

# Single error (eager validation)
try:
    schema.validate(df)
except SchemaError as exc:
    print(exc.failure_cases)  # Series/DataFrame of failures

# Multiple errors (lazy validation)
try:
    schema.validate(df, lazy=True)
except SchemaErrors as exc:
    # Structured dict with SCHEMA and DATA keys
    print(exc.error_counts)
    print(exc.failure_cases)
```

## Key Configuration Options (`Config`)
| Option | Type | Effect |
|---|---|---|
| `strict` | `bool` | Raise if extra columns are present |
| `coerce` | `bool` | Cast columns to declared dtypes |
| `ordered` | `bool` | Require columns in declared order |
| `name` | `str` | Schema name shown in error messages |
| `add_missing_columns` | `bool` | Insert missing columns with default values |
## Best Practices

- Use `DataFrameModel` over `DataFrameSchema` for new code: cleaner syntax, inheritance, and type-annotation integration.
- Prefer `strict = True` to catch unexpected extra columns early.
- Use built-in checks (`Check.gt`, `Check.isin`, etc.) over custom lambdas where possible; they produce better error messages.
- Write vectorized checks (`element_wise=False`, the default) for performance; only use `element_wise=True` when the logic is truly scalar.
- Always add `error=` messages to custom `Check` objects to improve debuggability.
- Use lazy validation in pipelines that process large batches so all failures surface in one pass.
- Never rely on inferred schemas in production; always explicitly define constraints.
- Use `coerce=True` deliberately: set it at the column level to limit scope, and avoid schema-wide coercion unless certain.
- Prefer `raise_warning=True` only for non-critical informational checks (e.g., normality tests), not for data integrity constraints.
## Additional Resources

- `references/checks-and-validation.md`: Built-in check catalog, groupby checks, wide checks, hypothesis testing
- `references/dataframe-models.md`: Field spec, schema inheritance, MultiIndex, aliases, parsers, Polars usage