Vaex

Overview

Vaex is a high-performance Python library for lazy, out-of-core DataFrames, built to process and visualize tabular datasets that are too large to fit into RAM. The project reports computing grid-based statistics (mean, sum, count, standard deviation) at over a billion rows per second, which makes interactive exploration and analysis of billion-row datasets practical.

When to Use This Skill

Use Vaex when:
  • Processing tabular datasets larger than available RAM (gigabytes to terabytes)
  • Performing fast statistical aggregations on massive datasets
  • Creating visualizations and heatmaps of large datasets
  • Building machine learning pipelines on big data
  • Converting between data formats (CSV, HDF5, Arrow, Parquet)
  • Needing lazy evaluation and virtual columns to avoid memory overhead
  • Working with astronomical data, financial time series, or other large-scale scientific datasets

Core Capabilities

Vaex provides six primary capability areas, each documented in detail in the references directory:

1. DataFrames and Data Loading

Load and create Vaex DataFrames from various sources, including files (HDF5, CSV, Arrow, Parquet), pandas DataFrames, NumPy arrays, and dictionaries. See references/core_dataframes.md for:
  • Opening large files efficiently
  • Converting from pandas/NumPy/Arrow
  • Working with example datasets
  • Understanding DataFrame structure

2. Data Processing and Manipulation

Perform filtering, create virtual columns, use expressions, and aggregate data without loading everything into memory. See references/data_processing.md for:
  • Filtering and selections
  • Virtual columns and expressions
  • Groupby operations and aggregations
  • String operations and datetime handling
  • Working with missing data
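
The filter-then-aggregate idea can be illustrated without Vaex at all. The following is a minimal pure-Python sketch, not Vaex's implementation or API: the function name `streaming_groupby_sum` and the sample rows are hypothetical. It only shows how a filter and a grouped sum can be computed in one pass over a row stream, so the dataset never has to sit in memory at once.

```python
from collections import defaultdict


def streaming_groupby_sum(rows, key, value, predicate=None):
    """Sum `value` per `key` in a single pass over an iterable of dict rows.

    Hypothetical helper for illustration; Vaex does this natively and in
    compiled code, this only mirrors the idea.
    """
    totals = defaultdict(float)
    for row in rows:
        # Apply the filter row by row instead of building a filtered copy.
        if predicate is not None and not predicate(row):
            continue
        totals[row[key]] += row[value]
    return dict(totals)


# A generator stands in for a dataset that is too large to hold in memory.
rows = (
    {"category": c, "value": v}
    for c, v in [("a", 1.0), ("b", 2.0), ("a", 3.0), ("b", 4.0), ("c", 5.0)]
)

result = streaming_groupby_sum(
    rows, "category", "value", predicate=lambda r: r["value"] > 1.0
)
print(result)
```

Only the running totals are kept, one entry per group, so memory use scales with the number of groups rather than the number of rows.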

3. Performance and Optimization

Leverage Vaex's lazy evaluation, caching strategies, and memory-efficient operations. See references/performance.md for:
  • Understanding lazy evaluation
  • Using delay=True for batching operations
  • Materializing columns when needed
  • Caching strategies
  • Asynchronous operations
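
To make lazy evaluation concrete, here is a small conceptual sketch in plain Python. It is not how Vaex is implemented, and the `LazyColumn` class is a hypothetical name introduced only for illustration. A column is stored as a function of other columns, and values are produced chunk by chunk only when a statistic is requested.

```python
class LazyColumn:
    """A column defined by a function of source columns, evaluated on demand.

    Hypothetical illustration of the 'virtual column' idea; not a Vaex class.
    """

    def __init__(self, func, *sources):
        self.func = func        # how to compute one value from the sources
        self.sources = sources  # the underlying (real) columns

    def chunks(self, chunk_size):
        """Yield computed values one chunk at a time."""
        n = len(self.sources[0])
        for start in range(0, n, chunk_size):
            stop = min(start + chunk_size, n)
            yield [
                self.func(*(src[i] for src in self.sources))
                for i in range(start, stop)
            ]

    def mean(self, chunk_size=2):
        """Compute the mean without ever storing the full derived column."""
        total, count = 0.0, 0
        for chunk in self.chunks(chunk_size):
            total += sum(chunk)
            count += len(chunk)
        return total / count


x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [10.0, 20.0, 30.0, 40.0, 50.0]

# Defining the column does no work; it only records the expression.
new_column = LazyColumn(lambda a, b: a ** 2 + b, x, y)

# Work happens here, one chunk at a time.
lazy_mean = new_column.mean()
print(lazy_mean)
```

Materializing a column corresponds to running the expression once and storing the result, which trades memory for faster repeated access.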

4. Data Visualization

Create interactive visualizations of large datasets, including heatmaps, histograms, and scatter plots. See references/visualization.md for:
  • Creating 1D and 2D plots
  • Heatmap visualizations
  • Working with selections
  • Customizing plots and subplots
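
A heatmap of a very large dataset is practical because only a small grid of counts is plotted, not the individual points. The sketch below shows that binning step in plain Python under stated assumptions: `bin_2d`, its `limits` and `shape` parameters, and the sample points are all hypothetical and chosen for illustration, and the code does not reproduce Vaex's plotting API.

```python
def bin_2d(xs, ys, limits, shape):
    """Count points per cell on a regular 2D grid (rows index y, columns index x)."""
    (xmin, xmax), (ymin, ymax) = limits
    nx, ny = shape
    grid = [[0] * nx for _ in range(ny)]
    for x, y in zip(xs, ys):
        # Points outside the limits are ignored.
        if not (xmin <= x < xmax and ymin <= y < ymax):
            continue
        # Map each coordinate to a cell index; clamp to guard against rounding.
        ix = min(int((x - xmin) / (xmax - xmin) * nx), nx - 1)
        iy = min(int((y - ymin) / (ymax - ymin) * ny), ny - 1)
        grid[iy][ix] += 1
    return grid


xs = [0.1, 0.2, 0.6, 0.9, 0.95, 1.5]
ys = [0.1, 0.7, 0.6, 0.2, 0.9, 0.5]

grid = bin_2d(xs, ys, limits=[(0.0, 1.0), (0.0, 1.0)], shape=(2, 2))
print(grid)
```

The grid size is fixed by `shape`, so the amount of data handed to the plotting layer stays constant no matter how many rows are binned.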

5. Machine Learning Integration

Build ML pipelines with transformers, encoders, and integration with scikit-learn, XGBoost, and other frameworks. See references/machine_learning.md for:
  • Feature scaling and encoding
  • PCA and dimensionality reduction
  • K-means clustering
  • Integration with scikit-learn/XGBoost/CatBoost
  • Model serialization and deployment
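
Feature scaling on out-of-core data usually splits into two steps: fit the statistics in one streaming pass, then apply the transformation lazily. The following plain-Python sketch shows that split. It is an illustration only, not the vaex.ml API; `fit_standard_scaler`, `scaled`, and the sample ages are hypothetical names and data.

```python
import math


def fit_standard_scaler(values):
    """Return (mean, population std) from one pass, using Welford's algorithm."""
    n, mean, m2 = 0, 0.0, 0.0
    for v in values:
        n += 1
        delta = v - mean
        mean += delta / n
        m2 += delta * (v - mean)
    return mean, math.sqrt(m2 / n)


def scaled(values, mean, std):
    """Lazily yield standardized values; nothing is stored until consumed."""
    return ((v - mean) / std for v in values)


ages = [20.0, 30.0, 40.0, 50.0]

# Fit: a single pass over the data, keeping only three running numbers.
mean_age, std_age = fit_standard_scaler(iter(ages))

# Transform: a generator, evaluated only when something iterates over it.
scaled_ages = list(scaled(ages, mean_age, std_age))
print(mean_age, std_age, scaled_ages)
```

Because the fitted state is just two numbers, it is cheap to serialize and reapply to new data, which is the property that makes such transformers easy to deploy.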

6. I/O Operations

Efficiently read and write data in various formats with optimal performance. See references/io_operations.md for:
  • File format recommendations
  • Export strategies
  • Working with Apache Arrow
  • CSV handling for large files
  • Server and remote data access
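
CSV is the slowest of the listed formats for large files because it must be parsed as text, so it is normally read in chunks. The sketch below shows chunked reading with only the Python standard library. It is a conceptual illustration, not Vaex's CSV reader; `iter_csv_chunks` and the in-memory sample file are hypothetical.

```python
import csv
import io
from itertools import islice


def iter_csv_chunks(handle, chunk_size):
    """Yield lists of at most `chunk_size` parsed rows from an open CSV file."""
    reader = csv.DictReader(handle)
    while True:
        chunk = list(islice(reader, chunk_size))
        if not chunk:
            return
        yield chunk


# An in-memory file stands in for a CSV that is too large to load at once.
sample = io.StringIO("x,y\n1,10\n2,20\n3,30\n4,40\n5,50\n")

chunk_sizes = []
total_y = 0.0
for chunk in iter_csv_chunks(sample, chunk_size=2):
    chunk_sizes.append(len(chunk))
    total_y += sum(float(row["y"]) for row in chunk)

print(chunk_sizes, total_y)
```

Binary columnar formats such as HDF5 and Arrow skip the parsing step entirely, which is why converting once and reopening the converted file is the recommended workflow.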

Quick Start Pattern

For most Vaex tasks, follow this pattern:

    import vaex

    # 1. Open or create a DataFrame
    df = vaex.open('large_file.hdf5')  # or .csv, .arrow, .parquet
    # OR, starting from an existing pandas DataFrame named pandas_df:
    df = vaex.from_pandas(pandas_df)

    # 2. Explore the data
    print(df)      # Shows first/last rows and column info
    df.describe()  # Statistical summary

    # 3. Create virtual columns (no memory overhead)
    df['new_column'] = df.x ** 2 + df.y

    # 4. Filter rows (returns a lazy filtered view, not a copy)
    df_filtered = df[df.age > 25]

    # 5. Compute statistics (fast, evaluated out-of-core)
    mean_val = df.x.mean()
    stats = df.groupby('category').agg({'value': 'sum'})

    # 6. Visualize
    # (recent Vaex releases also expose these as df.viz.histogram and df.viz.heatmap)
    df.plot1d(df.x, limits=[0, 100])
    df.plot(df.x, df.y, limits='99.7%')

    # 7. Export if needed
    df.export_hdf5('output.hdf5')

Working with References

The reference files contain detailed information about each capability area. Load references into context based on the specific task:
  • Basic operations: Start with references/core_dataframes.md and references/data_processing.md
  • Performance issues: Check references/performance.md
  • Visualization tasks: Use references/visualization.md
  • ML pipelines: Reference references/machine_learning.md
  • File I/O: Consult references/io_operations.md

Best Practices

  1. Use HDF5 or Apache Arrow formats for optimal performance with large datasets
  2. Leverage virtual columns instead of materializing data to save memory
  3. Batch operations using delay=True when performing multiple calculations
  4. Export to efficient formats rather than keeping data in CSV
  5. Use expressions for complex calculations without intermediate storage
  6. Check memory usage with df.byte_size() to see what the DataFrame requires and decide which operations to optimize

Common Patterns

Pattern: Converting Large CSV to HDF5

    import vaex

    # Read the CSV in chunks and convert it to HDF5 on disk.
    # Without convert/chunk_size, from_csv loads the whole file into memory via pandas.
    df = vaex.from_csv('large_file.csv', convert='large_file.hdf5', chunk_size=5_000_000)

    # Future loads are memory-mapped, so opening is near-instant
    df = vaex.open('large_file.hdf5')

Pattern: Efficient Aggregations

    # Assumes df is an open Vaex DataFrame with numeric columns x, y, and z.
    # Use delay=True to schedule several aggregations without running them yet
    mean_x = df.x.mean(delay=True)
    std_y = df.y.std(delay=True)
    sum_z = df.z.sum(delay=True)

    # Execute all scheduled tasks in a single pass over the data
    df.execute()

    # Each delayed call returned a promise; read the computed values
    results = [mean_x.get(), std_y.get(), sum_z.get()]
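
Batching pays off because reading the data dominates the cost, so several statistics computed in one pass cost little more than one. The following plain-Python sketch mimics the schedule-then-execute pattern. It is a conceptual illustration, not Vaex's scheduler; the `BatchedPass` class, its methods, and the sample rows are hypothetical.

```python
import math


class BatchedPass:
    """Collect several aggregations, then compute them all in one pass."""

    def __init__(self, rows):
        self.rows = rows
        self.tasks = []

    def schedule(self, column, kind):
        """Register an aggregation; its 'result' stays None until execute()."""
        task = {"column": column, "kind": kind,
                "n": 0, "s": 0.0, "ss": 0.0, "result": None}
        self.tasks.append(task)
        return task

    def execute(self):
        """Run every scheduled aggregation in a single pass over the rows."""
        for row in self.rows:
            for task in self.tasks:
                v = row[task["column"]]
                task["n"] += 1
                task["s"] += v
                task["ss"] += v * v
        for task in self.tasks:
            mean = task["s"] / task["n"]
            if task["kind"] == "sum":
                task["result"] = task["s"]
            elif task["kind"] == "mean":
                task["result"] = mean
            elif task["kind"] == "std":
                # Population standard deviation from running sums.
                variance = max(task["ss"] / task["n"] - mean * mean, 0.0)
                task["result"] = math.sqrt(variance)
            else:
                raise ValueError(f"unknown aggregation: {task['kind']}")


rows = [
    {"x": 1.0, "y": 2.0, "z": 10.0},
    {"x": 2.0, "y": 4.0, "z": 20.0},
    {"x": 3.0, "y": 4.0, "z": 30.0},
    {"x": 4.0, "y": 6.0, "z": 40.0},
]

batch = BatchedPass(rows)
mean_x = batch.schedule("x", "mean")
std_y = batch.schedule("y", "std")
sum_z = batch.schedule("z", "sum")

batch.execute()  # one pass computes all three
print(mean_x["result"], std_y["result"], sum_z["result"])
```

The running-sums formula for the standard deviation is used here for brevity; it can lose precision on large values, so a production implementation would prefer a numerically stable method.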

Pattern: Virtual Columns for Feature Engineering

    # No memory overhead: virtual columns are computed on the fly
    df['age_squared'] = df.age ** 2
    df['full_name'] = df.first_name + ' ' + df.last_name
    df['is_adult'] = df.age >= 18

Resources

This skill includes reference documentation in the references/ directory:
  • core_dataframes.md - DataFrame creation, loading, and basic structure
  • data_processing.md - Filtering, expressions, aggregations, and transformations
  • performance.md - Optimization strategies and lazy evaluation
  • visualization.md - Plotting and interactive visualizations
  • machine_learning.md - ML pipelines and model integration
  • io_operations.md - File formats and data import/export