Vaex

Overview

Vaex is a high-performance Python library that provides lazy, out-of-core DataFrames for processing and visualizing tabular datasets too large to fit into RAM. Its documentation cites aggregation speeds on the order of a billion rows per second, which makes interactive exploration and analysis practical on datasets with billions of rows.
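
To make "out-of-core" concrete, the following plain-Python sketch streams a binary file of doubles in fixed-size chunks and keeps only running totals in memory. It illustrates the general idea only and is not how Vaex is implemented; the helper name `streaming_mean`, the chunk size, and the demo file are assumptions made for this example.

```python
import array
import os
import tempfile


def streaming_mean(path, chunk_items=1024):
    """Mean of a file of float64 values, reading chunk_items values at a time."""
    total = 0.0
    count = 0
    with open(path, "rb") as f:
        while True:
            chunk = array.array("d")
            try:
                chunk.fromfile(f, chunk_items)
            except EOFError:
                pass  # final, partial chunk: the items read so far are kept
            if not chunk:
                break
            total += sum(chunk)
            count += len(chunk)
    return total / count if count else float("nan")


# Hypothetical demo data: 10,000 doubles written to a temporary file.
with tempfile.NamedTemporaryFile(delete=False, suffix=".bin") as tmp:
    array.array("d", (float(i) for i in range(10_000))).tofile(tmp)
    demo_path = tmp.name

demo_mean = streaming_mean(demo_path)
os.remove(demo_path)
print(demo_mean)
```

Memory use depends on `chunk_items`, not on the size of the file, which is the property that lets an out-of-core library work on data larger than RAM.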

When to Use This Skill

Use Vaex when:

Processing tabular datasets larger than available RAM (gigabytes to terabytes)
Performing fast statistical aggregations on massive datasets
Creating visualizations and heatmaps of large datasets
Building machine learning pipelines on big data
Converting between data formats (CSV, HDF5, Arrow, Parquet)
Needing lazy evaluation and virtual columns to avoid memory overhead
Working with astronomical data, financial time series, or other large-scale scientific datasets

Core Capabilities

核心功能

Vaex provides six primary capability areas, each documented in detail in the references directory:

1. DataFrames and Data Loading

Load and create Vaex DataFrames from various sources including files (HDF5, CSV, Arrow, Parquet), pandas DataFrames, NumPy arrays, and dictionaries. Reference `references/core_dataframes.md` for:

Opening large files efficiently
Converting from pandas/NumPy/Arrow
Working with example datasets
Understanding DataFrame structure

2. Data Processing and Manipulation

Perform filtering, create virtual columns, use expressions, and aggregate data without loading everything into memory. Reference `references/data_processing.md` for:

Filtering and selections
Virtual columns and expressions
Groupby operations and aggregations
String operations and datetime handling
Working with missing data
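
The idea behind virtual columns and filters can be sketched in plain Python: a column is stored as a recipe (a function of other columns) and evaluated chunk by chunk only when a result is needed. This is a conceptual illustration, not Vaex's implementation; the `LazyColumn` class and the tiny in-memory `table` are hypothetical.

```python
class LazyColumn:
    """A column defined by a function of other columns, evaluated on demand."""

    def __init__(self, func, *sources):
        self.func = func
        self.sources = sources

    def chunks(self, table, chunk_size):
        """Yield computed values chunk by chunk; nothing is stored."""
        n = len(table[self.sources[0]])
        for start in range(0, n, chunk_size):
            cols = [table[name][start:start + chunk_size] for name in self.sources]
            yield [self.func(*row) for row in zip(*cols)]


table = {"x": [1.0, 2.0, 3.0, 4.0], "y": [10.0, 20.0, 30.0, 40.0]}

# Defining the columns costs no memory: only the recipes are kept.
new_column = LazyColumn(lambda x, y: x ** 2 + y, "x", "y")
mask = LazyColumn(lambda x: x > 1, "x")  # a filter is just a lazy boolean column

# Filtered sum, computed two rows at a time without materializing either column.
total = 0.0
for values, keep in zip(new_column.chunks(table, 2), mask.chunks(table, 2)):
    total += sum(v for v, k in zip(values, keep) if k)
print(total)
```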

3. Performance and Optimization

Leverage Vaex's lazy evaluation, caching strategies, and memory-efficient operations. Reference `references/performance.md` for:

Understanding lazy evaluation
Using `delay=True` for batching operations
Materializing columns when needed
Caching strategies
Asynchronous operations
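
Why batching helps can be shown with a small plain-Python sketch: aggregations are registered first and then all updated during a single pass over the data, instead of one pass per statistic. The `Batch` class below is a hypothetical illustration of that scheduling idea, not Vaex's API.

```python
class Batch:
    """Collect aggregations first, then compute them all in one pass."""

    def __init__(self, chunks):
        self.chunks = chunks  # callable returning an iterator of row chunks
        self.tasks = []       # (name, initial value, update function)
        self.passes = 0       # how many times the data was scanned

    def add(self, name, init, update):
        self.tasks.append((name, init, update))

    def execute(self):
        state = {name: init for name, init, _ in self.tasks}
        self.passes += 1
        for chunk in self.chunks():
            # Every registered aggregation sees the chunk while it is in memory.
            for name, _, update in self.tasks:
                state[name] = update(state[name], chunk)
        return state


data = [[1, 2, 3], [4, 5], [6]]  # hypothetical chunks of one column
batch = Batch(lambda: iter(data))
batch.add("sum", 0, lambda acc, c: acc + sum(c))
batch.add("count", 0, lambda acc, c: acc + len(c))
batch.add("max", float("-inf"), lambda acc, c: max(acc, max(c)))

results = batch.execute()
mean = results["sum"] / results["count"]
print(results, mean, batch.passes)
```

When reading the data dominates the cost, three statistics computed this way cost roughly one scan rather than three.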

4. Data Visualization

Create interactive visualizations of large datasets including heatmaps, histograms, and scatter plots. Reference `references/visualization.md` for:

Creating 1D and 2D plots
Heatmap visualizations
Working with selections
Customizing plots and subplots
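
Heatmaps of very large datasets are built from counts on a regular grid rather than from individual points, so the plotting cost depends on the grid size, not the row count. The plain-Python sketch below shows that binning step under simple assumptions; `bin_counts_2d` and the sample points are hypothetical and do not reproduce Vaex's internals.

```python
def bin_counts_2d(xs, ys, x_limits, y_limits, shape):
    """Count points per cell of an nx-by-ny grid; points outside the limits are skipped."""
    nx, ny = shape
    (x_min, x_max), (y_min, y_max) = x_limits, y_limits
    grid = [[0] * ny for _ in range(nx)]
    for x, y in zip(xs, ys):
        if not (x_min <= x < x_max and y_min <= y < y_max):
            continue
        # Map each coordinate to a cell index; min() guards against rounding at the edge.
        i = min(int((x - x_min) / (x_max - x_min) * nx), nx - 1)
        j = min(int((y - y_min) / (y_max - y_min) * ny), ny - 1)
        grid[i][j] += 1
    return grid


xs = [0.1, 0.2, 0.6, 0.9, 1.5]
ys = [0.1, 0.7, 0.6, 0.95, 0.5]
grid = bin_counts_2d(xs, ys, (0.0, 1.0), (0.0, 1.0), shape=(2, 2))
print(grid)
```

The resulting grid is what a heatmap renders; a 1D histogram is the same computation with a single axis.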

5. Machine Learning Integration

Build ML pipelines with transformers, encoders, and integration with scikit-learn, XGBoost, and other frameworks. Reference `references/machine_learning.md` for:

Feature scaling and encoding
PCA and dimensionality reduction
K-means clustering
Integration with scikit-learn/XGBoost/CatBoost
Model serialization and deployment
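
Feature scaling on data that does not fit in memory follows a fit-then-transform pattern: the statistics are gathered in one streaming pass, and the scaled values are produced lazily afterwards. The sketch below shows that pattern in plain Python; `StreamingStandardScaler` is a hypothetical class written for this example and is not the Vaex or scikit-learn implementation.

```python
import math


class StreamingStandardScaler:
    """Standard scaling (zero mean, unit variance) over chunked data."""

    def fit(self, chunks):
        n = 0
        total = 0.0
        total_sq = 0.0
        for chunk in chunks:
            n += len(chunk)
            total += sum(chunk)
            total_sq += sum(v * v for v in chunk)
        self.mean = total / n
        # Population standard deviation; this formula is simple but can lose
        # precision on data with a large mean and a small spread.
        self.std = math.sqrt(total_sq / n - self.mean ** 2)
        return self

    def transform(self, chunks):
        """Lazily yield scaled chunks; nothing is stored."""
        for chunk in chunks:
            yield [(v - self.mean) / self.std for v in chunk]


data = [[2.0, 4.0], [4.0, 4.0], [5.0, 5.0], [7.0, 9.0]]  # hypothetical chunks
scaler = StreamingStandardScaler().fit(data)
first_scaled = next(scaler.transform(data))
print(scaler.mean, scaler.std, first_scaled)
```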

6. I/O Operations

Efficiently read and write data in various formats with optimal performance. Reference `references/io_operations.md` for:

File format recommendations
Export strategies
Working with Apache Arrow
CSV handling for large files
Server and remote data access
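
Binary columnar formats are recommended largely because they can be memory-mapped: opening the file reads nothing, and the operating system pages in only the bytes that are actually touched. The sketch below demonstrates memory mapping with Python's standard `mmap` module on a hypothetical flat file of doubles; real HDF5 and Arrow files have richer layouts, so treat this as an illustration of the mechanism only.

```python
import mmap
import os
import struct
import tempfile

# Hypothetical demo file: 1,000 little-endian float64 values.
with tempfile.NamedTemporaryFile(delete=False, suffix=".bin") as tmp:
    tmp.write(struct.pack("<1000d", *(float(i) for i in range(1000))))
    path = tmp.name

with open(path, "rb") as f:
    with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mapped:
        n_values = len(mapped) // 8  # known from the file size alone
        # Random access to row 500 without reading the rows before it.
        (value,) = struct.unpack_from("<d", mapped, 500 * 8)

os.remove(path)
print(n_values, value)
```

A CSV file offers no such shortcut, because the position of row 500 is unknown until every preceding line has been parsed.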

Quick Start Pattern

For most Vaex tasks, follow this pattern:

```python
import vaex

# 1. Open or create a DataFrame
df = vaex.open('large_file.hdf5')  # or .csv, .arrow, .parquet
# OR, from an existing pandas DataFrame named pandas_df:
# df = vaex.from_pandas(pandas_df)

# 2. Explore the data
print(df)      # Shows first/last rows and column info
df.describe()  # Statistical summary

# 3. Create virtual columns (no memory overhead)
df['new_column'] = df.x ** 2 + df.y

# 4. Filter rows (returns a filtered view, not a copy of the data)
df_filtered = df[df.age > 25]

# 5. Compute statistics (evaluated out-of-core, in chunks)
mean_val = df.x.mean()
stats = df.groupby('category').agg({'value': 'sum'})

# 6. Visualize (recent Vaex releases expose these as df.viz.histogram and df.viz.heatmap)
df.plot1d(df.x, limits=[0, 100])
df.plot(df.x, df.y, limits='99.7%')

# 7. Export if needed
df.export_hdf5('output.hdf5')
```

Working with References

The reference files contain detailed information about each capability area. Load references into context based on the specific task:

Basic operations: Start with `references/core_dataframes.md` and `references/data_processing.md`
Performance issues: Check `references/performance.md`
Visualization tasks: Use `references/visualization.md`
ML pipelines: Reference `references/machine_learning.md`
File I/O: Consult `references/io_operations.md`

Best Practices

Use HDF5 or Apache Arrow formats for optimal performance with large datasets
Leverage virtual columns instead of materializing data to save memory
Batch operations using `delay=True` when performing multiple calculations
Export to efficient formats rather than keeping data in CSV
Use expressions for complex calculations without intermediate storage
Profile with `df.stat()` to understand memory usage and optimize operations

Common Patterns

Pattern: Converting Large CSV to HDF5

```python
import vaex

# Read the CSV in chunks and convert it to an HDF5 file on disk.
# Without convert=True, from_csv reads the whole file into memory through pandas.
df = vaex.from_csv('large_file.csv', convert=True, chunk_size=5_000_000)

# Export to HDF5 under an explicit name for faster future access
df.export_hdf5('large_file.hdf5')

# Future loads are memory-mapped and effectively instant
df = vaex.open('large_file.hdf5')
```

Pattern: Efficient Aggregations

```python
# df is an existing Vaex DataFrame with numeric columns x, y, and z.
# Use delay=True to schedule several aggregations without running them yet
mean_x = df.mean(df.x, delay=True)
std_y = df.std(df.y, delay=True)
sum_z = df.sum(df.z, delay=True)

# Execute all at once, in a single pass over the data
df.execute()

# Each delayed result is a promise; .get() returns the computed value
results = [mean_x.get(), std_y.get(), sum_z.get()]
```

Pattern: Virtual Columns for Feature Engineering

```python
# df is an existing Vaex DataFrame with columns age, first_name, and last_name.
# No memory overhead: virtual columns are computed on the fly
df['age_squared'] = df.age ** 2
df['full_name'] = df.first_name + ' ' + df.last_name
df['is_adult'] = df.age >= 18
```

Resources

This skill includes reference documentation in the `references/` directory:

`core_dataframes.md` - DataFrame creation, loading, and basic structure
`data_processing.md` - Filtering, expressions, aggregations, and transformations
`performance.md` - Optimization strategies and lazy evaluation
`visualization.md` - Plotting and interactive visualizations
`machine_learning.md` - ML pipelines and model integration
`io_operations.md` - File formats and data import/export
