# Apache Spark

Apache Spark distributed computing. Use for big data processing.
Spark is the king of Big Data. Spark 4.0 (2025) makes Spark Connect a first-class mode, allowing thin clients (like VS Code) to connect to massive clusters easily.
## When to Use

- Data Engineering: ETL at petabyte scale.
- Streaming: Structured Streaming for real-time analytics.
- Legacy ML: `spark.ml` (though mostly replaced by XGBoost/Torch).
## Core Concepts

### Spark Connect

Decouples the client (your laptop) from the server (the cluster). Allows using Spark from Go/Rust/TypeScript.
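As a sketch, a thin PySpark client can attach to a remote cluster through a Spark Connect endpoint (the host below is a placeholder; assumes `pip install "pyspark[connect]"` and a running Spark Connect server):

```python
# Sketch: thin PySpark client talking to a remote cluster via Spark Connect.
# "my-cluster-host" is a hypothetical endpoint -- replace with your server.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .remote("sc://my-cluster-host:15002")  # Spark Connect gRPC endpoint
    .getOrCreate()
)

# The query plan is built locally and executed on the cluster.
df = spark.range(10)
print(df.count())
```

Nothing heavyweight runs on the client; only the unresolved plan is shipped over gRPC, which is what makes non-JVM clients practical.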
### Catalyst Optimizer

Optimizes your SQL/DataFrame queries before execution.
### RDD

The low-level API. Almost never used directly in modern Spark.
## Best Practices (2025)

Do:

- Use PySpark: It is now a first-class citizen with Python UDF profiling.
- Use Delta Lake / Iceberg: Spark works best with modern table formats.
- Use `pandas_udf`: For vectorized Python UDFs.
Don't:

- Don't use `rdd.map`: It is slow (Python serialization). Use DataFrames.