Use when the user mentions "Dask", "parallel computing", "distributed computing", or "larger than memory", or asks about "parallel pandas", "parallel numpy", "out-of-core", "multi-file processing", "cluster computing", or "lazy evaluation dataframe".
Install: `npx skill4agent add eyadsibai/ltk dask`

| Collection | Like | Use Case |
|---|---|---|
| DataFrame | pandas | Tabular data, CSV/Parquet |
| Array | NumPy | Numerical arrays, matrices |
| Bag | list | Unstructured data, JSON logs |
| Delayed | Custom | Arbitrary Python functions |
| Function | Behavior | Use |
|---|---|---|
| `dd.read_csv()` | Lazy load | Large CSVs |
| `dd.read_parquet()` | Lazy load | Large Parquet |
| Operations | Build graph | Chain transforms |
| `.compute()` | Execute | Get final result |
| Scheduler | Best For | Start |
|---|---|---|
| threaded | NumPy/Pandas (releases GIL) | Default |
| processes | Pure Python (GIL bound) | `compute(scheduler="processes")` |
| synchronous | Debugging | `compute(scheduler="synchronous")` |
| distributed | Monitoring, scaling, clusters | `Client()` |
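Schedulers can be chosen per call or set globally via config; a small sketch:

```python
import dask
import dask.array as da

x = da.ones((100, 100), chunks=(50, 50))

# Per-call scheduler choice
print(x.sum().compute(scheduler="threads"))      # 10000.0
print(x.sum().compute(scheduler="synchronous"))  # single-threaded, easy to step through

# Scoped default via config
with dask.config.set(scheduler="processes"):
    print(x.sum().compute())                     # runs in worker processes
```

The synchronous scheduler is the one to reach for under `pdb`, since everything runs in the calling thread.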
| Feature | Benefit |
|---|---|
| Dashboard | Real-time progress monitoring |
| Cluster scaling | Add/remove workers |
| Fault tolerance | Retry failed tasks |
| Worker resources | Memory management |
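A minimal distributed setup (assumes the `distributed` extra is installed); `processes=False` starts an in-process cluster, and `dashboard_address=None` disables the dashboard for this sketch:

```python
from dask.distributed import Client
import dask.array as da

# In-process "cluster": no separate worker processes, no dashboard
client = Client(processes=False, dashboard_address=None)

x = da.ones((100, 100), chunks=(50, 50))
print(x.sum().compute())  # 10000.0, scheduled through the client

client.close()
```

With a real cluster you would pass the scheduler address to `Client(...)` and open the dashboard URL it reports.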
| Concept | Description |
|---|---|
| Partition | Subset of rows (like a mini DataFrame) |
| npartitions | Number of partitions |
| divisions | Index boundaries between partitions |
| Concept | Description |
|---|---|
| Chunk | Subset of array (n-dimensional block) |
| chunks | Tuple of chunk sizes per dimension |
| Optimal size | ~100 MB per chunk |
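Chunking in action (this toy array uses tiny chunks; real workloads should aim for the ~100 MB guideline above):

```python
import dask.array as da

# A 1000x1000 array split into four 500x500 blocks
x = da.zeros((1000, 1000), chunks=(500, 500))
print(x.chunks)     # ((500, 500), (500, 500)) — chunk sizes per dimension
print(x.numblocks)  # (2, 2) — blocks per dimension
```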
| Category | Operations |
|---|---|
| Selection | `df[df.x > 0]`, `df["col"]` |
| Aggregation | `groupby()`, `sum()`, `mean()` |
| Transforms | `assign()`, `map_partitions()` |
| Joins | `merge()`, `join()` |
| I/O | `read_csv()`, `to_parquet()` |
| Operation | Issue | Alternative |
|---|---|---|
| `.compute()` mid-pipeline | Kills parallelism | Stay lazy until the end |
| Row-wise `apply()` | Slow | `map_partitions()` |
| Repeated `.compute()` | Inefficient | Single `dask.compute()` |
| `sort_values()` / `set_index()` | Expensive shuffle | Avoid if possible |
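The repeated-compute anti-pattern and its fix, sketched:

```python
import dask
import dask.array as da

x = da.ones((100,), chunks=50)

# Anti-pattern: two separate compute() calls execute the graph twice
# s = x.sum().compute()
# m = x.mean().compute()

# Better: one call shares the loading and intermediate work
s, m = dask.compute(x.sum(), x.mean())
print(s, m)  # 100.0 1.0
```

`dask.compute(*collections)` accepts any mix of collections and returns their results as a tuple.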
| Pattern | Description |
|---|---|
| Glob patterns | `dd.read_csv("data/*.csv")` loads many files at once |
| Partition per file | Natural parallelism |
| Output partitioned | `to_parquet()` writes one file per partition |
| Method | Use Case |
|---|---|
| `map_partitions()` | Apply function to each partition |
| `map_blocks()` | Apply function to each array block |
| `dask.delayed` | Wrap arbitrary Python functions |
| Practice | Why |
|---|---|
| Don't load locally first | Let Dask handle loading |
| Single compute() at end | Avoid redundant computation |
| Use Parquet | Faster than CSV, columnar |
| Match partition to files | One partition per file |
| Check task graph size | Huge graphs slow the scheduler |
| Use distributed for debugging | Dashboard shows progress |
| Pitfall | Solution |
|---|---|
| Loading with pandas first | Use `dd.read_*` directly |
| compute() in loops | Collect all, single compute() |
| Too many partitions | Repartition to ~100 MB each |
| Memory errors | Reduce chunk size, add workers |
| Slow shuffles | Avoid sorts/joins when possible |
| Tool | Best For | Trade-off |
|---|---|---|
| Dask | Scale pandas/NumPy, clusters | Setup complexity |
| Polars | Fast in-memory | Must fit in RAM |
| Vaex | Out-of-core single machine | Limited operations |
| Spark | Enterprise, SQL-heavy | Infrastructure |