Dask

Overview

Dask is a Python library for parallel and distributed computing that enables three critical capabilities:
  • Larger-than-memory execution on single machines for data exceeding available RAM
  • Parallel processing for improved computational speed across multiple cores
  • Distributed computation supporting terabyte-scale datasets across multiple machines
Dask scales from laptops (processing ~100 GiB) to clusters (processing ~100 TiB) while maintaining familiar Python APIs.

When to Use This Skill

This skill should be used when you need to:
  • Process datasets that exceed available RAM
  • Scale pandas or NumPy operations to larger datasets
  • Parallelize computations for performance improvements
  • Process multiple files efficiently (CSVs, Parquet, JSON, text logs)
  • Build custom parallel workflows with task dependencies
  • Distribute workloads across multiple cores or machines

Core Capabilities

Dask provides five main components, each suited to different use cases:

1. DataFrames - Parallel Pandas Operations

Purpose: Scale pandas operations to larger datasets through parallel processing.
When to Use:
  • Tabular data exceeds available RAM
  • Need to process multiple CSV/Parquet files together
  • Pandas operations are slow and need parallelization
  • Scaling from pandas prototype to production
Reference Documentation: For comprehensive guidance on Dask DataFrames, refer to references/dataframes.md, which includes:
  • Reading data (single files, multiple files, glob patterns)
  • Common operations (filtering, groupby, joins, aggregations)
  • Custom operations with map_partitions
  • Performance optimization tips
  • Common patterns (ETL, time series, multi-file processing)
Quick Example:
```python
import dask.dataframe as dd

# Read multiple files as a single DataFrame
ddf = dd.read_csv('data/2024-*.csv')

# Operations are lazy until compute()
filtered = ddf[ddf['value'] > 100]
result = filtered.groupby('category').mean().compute()
```

**Key Points**:
- Operations are lazy (build a task graph) until `.compute()` is called
- Use `map_partitions` for efficient custom operations
- Convert to DataFrame early when working with structured data from other sources

2. Arrays - Parallel NumPy Operations

Purpose: Extend NumPy capabilities to datasets larger than memory using blocked algorithms.
When to Use:
  • Arrays exceed available RAM
  • NumPy operations need parallelization
  • Working with scientific datasets (HDF5, Zarr, NetCDF)
  • Need parallel linear algebra or array operations
Reference Documentation: For comprehensive guidance on Dask Arrays, refer to references/arrays.md, which includes:
  • Creating arrays (from NumPy, random, from disk)
  • Chunking strategies and optimization
  • Common operations (arithmetic, reductions, linear algebra)
  • Custom operations with map_blocks
  • Integration with HDF5, Zarr, and XArray
Quick Example:
```python
import dask.array as da

# Create large array with chunks
x = da.random.random((100000, 100000), chunks=(10000, 10000))

# Operations are lazy
y = x + 100
z = y.mean(axis=0)

# Compute result
result = z.compute()
```

**Key Points**:
- Chunk size is critical (aim for ~100 MB per chunk)
- Operations work on chunks in parallel
- Rechunk data when needed for efficient operations
- Use `map_blocks` for operations not available in Dask
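To illustrate the rechunking and `map_blocks` points, a small sketch (the array here is deliberately tiny so it runs instantly; real chunks should target ~100 MB):

```python
import dask.array as da

# Tiny array with many chunks, purely for illustration
x = da.ones((1000, 1000), chunks=(100, 100))

# Rechunk when the next operation favors a different layout
x = x.rechunk((500, 500))

# map_blocks applies a function to each chunk independently,
# receiving each chunk as a plain NumPy array
def double(block):
    return block * 2

y = x.map_blocks(double)
result = y.sum().compute()
```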

3. Bags - Parallel Processing of Unstructured Data

Purpose: Process unstructured or semi-structured data (text, JSON, logs) with functional operations.
When to Use:
  • Processing text files, logs, or JSON records
  • Data cleaning and ETL before structured analysis
  • Working with Python objects that don't fit array/dataframe formats
  • Need memory-efficient streaming processing
Reference Documentation: For comprehensive guidance on Dask Bags, refer to references/bags.md, which includes:
  • Reading text and JSON files
  • Functional operations (map, filter, fold, groupby)
  • Converting to DataFrames
  • Common patterns (log analysis, JSON processing, text processing)
  • Performance considerations
Quick Example:
```python
import dask.bag as db
import json

# Read and parse JSON files
bag = db.read_text('logs/*.json').map(json.loads)

# Filter and transform
valid = bag.filter(lambda x: x['status'] == 'valid')
processed = valid.map(lambda x: {'id': x['id'], 'value': x['value']})

# Convert to DataFrame for analysis
ddf = processed.to_dataframe()
```

**Key Points**:
- Use for initial data cleaning, then convert to DataFrame/Array
- Use `foldby` instead of `groupby` for better performance
- Operations are streaming and memory-efficient
- Convert to structured formats (DataFrame) for complex operations
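The `foldby` recommendation can be sketched as follows: `binop` folds one record into a per-key accumulator within each partition, and `combine` merges accumulators across partitions, avoiding the full shuffle a `groupby` would trigger. The records here are hypothetical.

```python
import dask.bag as db

records = [{'category': 'a', 'value': 1},
           {'category': 'b', 'value': 2},
           {'category': 'a', 'value': 3}]
bag = db.from_sequence(records, npartitions=2)

# Per-key sum without a full shuffle
totals = bag.foldby(key=lambda r: r['category'],
                    binop=lambda acc, r: acc + r['value'],
                    initial=0,
                    combine=lambda a, b: a + b,
                    combine_initial=0)
result = dict(totals.compute())
```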

4. Futures - Task-Based Parallelization

Purpose: Build custom parallel workflows with fine-grained control over task execution and dependencies.
When to Use:
  • Building dynamic, evolving workflows
  • Need immediate task execution (not lazy)
  • Computations depend on runtime conditions
  • Implementing custom parallel algorithms
  • Need stateful computations
Reference Documentation: For comprehensive guidance on Dask Futures, refer to references/futures.md, which includes:
  • Setting up a distributed client
  • Submitting tasks and working with futures
  • Task dependencies and data movement
  • Advanced coordination (queues, locks, events, actors)
  • Common patterns (parameter sweeps, dynamic tasks, iterative algorithms)
Quick Example:
```python
from dask.distributed import Client

client = Client()  # Create local cluster

# Submit tasks (executes immediately)
def process(x):
    return x ** 2

futures = client.map(process, range(100))

# Gather results
results = client.gather(futures)
client.close()
```

**Key Points**:
- Requires a distributed client (even on a single machine)
- Tasks execute immediately when submitted
- Pre-scatter large data to avoid repeated transfers
- ~1 ms overhead per task (not suitable for millions of tiny tasks)
- Use actors for stateful workflows
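When tasks finish at different times, `as_completed` lets you handle each result as soon as it is ready rather than waiting for the whole batch; a minimal sketch (the `square` function is illustrative):

```python
from dask.distributed import Client, as_completed

# processes=False keeps this example lightweight (threads in one process)
client = Client(processes=False)

def square(x):
    return x ** 2

futures = client.map(square, range(5))

# Consume results in completion order, not submission order
total = 0
for future in as_completed(futures):
    total += future.result()

client.close()
```

This pattern is the basis for dynamic workflows: inside the loop you can submit new tasks based on results already seen.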

5. Schedulers - Execution Backends

Purpose: Control how and where Dask tasks execute (threads, processes, distributed).
When to Choose Scheduler:
  • Threads (default): NumPy/Pandas operations, GIL-releasing libraries, shared memory benefit
  • Processes: Pure Python code, text processing, GIL-bound operations
  • Synchronous: Debugging with pdb, profiling, understanding errors
  • Distributed: Need dashboard, multi-machine clusters, advanced features
Reference Documentation: For comprehensive guidance on Dask Schedulers, refer to references/schedulers.md, which includes:
  • Detailed scheduler descriptions and characteristics
  • Configuration methods (global, context manager, per-compute)
  • Performance considerations and overhead
  • Common patterns and troubleshooting
  • Thread configuration for optimal performance
Quick Example:
```python
import dask
import dask.dataframe as dd

# Use threads for DataFrames (default, good for numeric work)
ddf = dd.read_csv('data.csv')
result1 = ddf.mean().compute()  # Uses threads

# Use processes for Python-heavy work
import dask.bag as db
bag = db.read_text('logs/*.txt')
result2 = bag.map(python_function).compute(scheduler='processes')

# Use synchronous for debugging
dask.config.set(scheduler='synchronous')
result3 = problematic_computation.compute()  # Can use pdb

# Use distributed for monitoring and scaling
from dask.distributed import Client
client = Client()
result4 = computation.compute()  # Uses distributed scheduler with dashboard
```

**Key Points**:
- Threads: lowest overhead (~10 µs/task), best for numeric work
- Processes: avoids the GIL (~10 ms/task), best for Python work
- Distributed: monitoring dashboard (~1 ms/task), scales to clusters
- Can switch schedulers per computation or globally
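Switching schedulers per computation or within a scope, as the last key point notes, looks like this (the bag itself is an arbitrary small example):

```python
import dask
import dask.bag as db

bag = db.from_sequence(range(8), npartitions=4)

# Context manager: the scheduler applies only inside the block,
# leaving the global default untouched
with dask.config.set(scheduler='synchronous'):
    result_sync = bag.map(lambda x: x + 1).sum().compute()

# Per-call override on a single compute()
result_threads = bag.map(lambda x: x + 1).sum().compute(scheduler='threads')
```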

Best Practices

For comprehensive performance optimization guidance, memory management strategies, and common pitfalls to avoid, refer to references/best-practices.md. Key principles include:

Start with Simpler Solutions

Before using Dask, explore:
  • Better algorithms
  • Efficient file formats (Parquet instead of CSV)
  • Compiled code (Numba, Cython)
  • Data sampling
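For instance, sampling often answers a question without any parallelism at all; a sketch using an in-memory CSV as a stand-in for a large file on disk:

```python
import io
import pandas as pd

# Stand-in for a large CSV on disk
csv_file = io.StringIO('\n'.join(['value'] + [str(i) for i in range(10000)]))

# Prototype on the first rows with plain pandas before reaching for Dask
sample = pd.read_csv(csv_file, nrows=100)
mean_estimate = sample['value'].mean()
```

If the sampled answer is good enough, or the logic validated on the sample fits in memory for the full data, Dask may be unnecessary.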

Critical Performance Rules

**1. Don't Load Data Locally Then Hand to Dask**
```python
# Wrong: Loads all data in memory first
import pandas as pd
df = pd.read_csv('large.csv')
ddf = dd.from_pandas(df, npartitions=10)

# Correct: Let Dask handle loading
import dask.dataframe as dd
ddf = dd.read_csv('large.csv')
```

**2. Avoid Repeated compute() Calls**
```python
# Wrong: Each compute is separate
for item in items:
    result = dask_computation(item).compute()

# Correct: Single compute for all
computations = [dask_computation(item) for item in items]
results = dask.compute(*computations)
```

**3. Don't Build Excessively Large Task Graphs**
- Increase chunk sizes if you have millions of tasks
- Use `map_partitions`/`map_blocks` to fuse operations
- Check task graph size: `len(ddf.__dask_graph__())`

**4. Choose Appropriate Chunk Sizes**
- Target: ~100 MB per chunk (or 10 chunks per core in worker memory)
- Too large: memory overflow
- Too small: scheduling overhead

**5. Use the Dashboard**
```python
from dask.distributed import Client
client = Client()
print(client.dashboard_link)  # Monitor performance, identify bottlenecks
```

Common Workflow Patterns

ETL Pipeline

```python
import dask.dataframe as dd

# Extract: Read data
ddf = dd.read_csv('raw_data/*.csv')

# Transform: Clean and process
ddf = ddf[ddf['status'] == 'valid']
ddf['amount'] = ddf['amount'].astype('float64')
ddf = ddf.dropna(subset=['important_col'])

# Load: Aggregate and save
summary = ddf.groupby('category').agg({'amount': ['sum', 'mean']})
summary.to_parquet('output/summary.parquet')
```

Unstructured to Structured Pipeline

```python
import dask.bag as db
import json

# Start with Bag for unstructured data
bag = db.read_text('logs/*.json').map(json.loads)
bag = bag.filter(lambda x: x['status'] == 'valid')

# Convert to DataFrame for structured analysis
ddf = bag.to_dataframe()
result = ddf.groupby('category').mean().compute()
```

Large-Scale Array Computation

```python
import dask.array as da

# Load or create large array
x = da.from_zarr('large_dataset.zarr')

# Process in chunks
normalized = (x - x.mean()) / x.std()

# Save result
da.to_zarr(normalized, 'normalized.zarr')
```

Custom Parallel Workflow

```python
from dask.distributed import Client

client = Client()

# Scatter large dataset once
data = client.scatter(large_dataset)

# Process in parallel with dependencies
futures = []
for param in parameters:
    future = client.submit(process, data, param)
    futures.append(future)

# Gather results
results = client.gather(futures)
```

Selecting the Right Component

Use this decision guide to choose the appropriate Dask component:
Data Type:
  • Tabular data → DataFrames
  • Numeric arrays → Arrays
  • Text/JSON/logs → Bags (then convert to DataFrame)
  • Custom Python objects → Bags or Futures
Operation Type:
  • Standard pandas operations → DataFrames
  • Standard NumPy operations → Arrays
  • Custom parallel tasks → Futures
  • Text processing/ETL → Bags
Control Level:
  • High-level, automatic → DataFrames/Arrays
  • Low-level, manual → Futures
Workflow Type:
  • Static computation graph → DataFrames/Arrays/Bags
  • Dynamic, evolving → Futures

Integration Considerations

File Formats

  • Efficient: Parquet, HDF5, Zarr (columnar, compressed, parallel-friendly)
  • Compatible but slower: CSV (use for initial ingestion only)
  • For Arrays: HDF5, Zarr, NetCDF

Conversion Between Collections

```python
# Bag → DataFrame
ddf = bag.to_dataframe()

# DataFrame → Array (for numeric data)
arr = ddf.to_dask_array(lengths=True)

# Array → DataFrame
ddf = dd.from_dask_array(arr, columns=['col1', 'col2'])
```

With Other Libraries

  • XArray: Wraps Dask arrays with labeled dimensions (geospatial, imaging)
  • Dask-ML: Machine learning with scikit-learn compatible APIs
  • Distributed: Advanced cluster management and monitoring

Debugging and Development

Iterative Development Workflow

1. **Test on small data with the synchronous scheduler**:
```python
dask.config.set(scheduler='synchronous')
result = computation.compute()  # Can use pdb, easy debugging
```

2. **Validate with threads on a sample**:
```python
sample = ddf.head(1000)  # Small sample
# Test logic, then scale to full dataset
```

3. **Scale with distributed for monitoring**:
```python
from dask.distributed import Client
client = Client()
print(client.dashboard_link)  # Monitor performance
result = computation.compute()
```

Common Issues

Memory Errors:
  • Decrease chunk sizes
  • Use persist() strategically and delete results when done
  • Check for memory leaks in custom functions
Slow Start:
  • Task graph too large (increase chunk sizes)
  • Use map_partitions or map_blocks to reduce the number of tasks
Poor Parallelization:
  • Chunks too large (increase the number of partitions)
  • Using threads with pure-Python code (switch to processes)
  • Data dependencies preventing parallelism
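The persist() advice can be sketched with the distributed scheduler as follows (the computation is trivial on purpose; the point is reusing one materialized intermediate across several results, then releasing it):

```python
import dask.array as da
from dask.distributed import Client

client = Client(processes=False)

x = da.ones((1000, 1000), chunks=(250, 250))

# persist() keeps the intermediate result in memory so downstream
# computations reuse it instead of recomputing from scratch
y = (x + 1).persist()

total = y.sum().compute()
mean = y.mean().compute()

# Release the cached result once finished
del y
client.close()
```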

Reference Files

All reference documentation files can be read as needed for detailed information:
  • references/dataframes.md - Complete Dask DataFrame guide
  • references/arrays.md - Complete Dask Array guide
  • references/bags.md - Complete Dask Bag guide
  • references/futures.md - Complete Dask Futures and distributed computing guide
  • references/schedulers.md - Complete scheduler selection and configuration guide
  • references/best-practices.md - Comprehensive performance optimization and troubleshooting guide
Load these files when users need detailed information about specific Dask components, operations, or patterns beyond the quick guidance provided here.