Loading...
Loading...
Found 26 Skills
Use this skill for Hugging Face Dataset Viewer API workflows that fetch subset/split metadata, paginate rows, search text, apply filters, download parquet URLs, and read size or statistics.
An analytical in-process SQL database management system. Designed for fast analytical queries (OLAP). Highly interoperable with Python's data ecosystem (Pandas, NumPy, Arrow, Polars). Supports querying files (CSV, Parquet, JSON) directly without an ingestion step. Use for complex SQL queries on Pandas/Polars data, querying large Parquet/CSV files directly, joining data from different sources, analytical pipelines, local datasets too big for Excel, intermediate data storage and feature engineering for ML.
万行以上 Excel 数据集的高性能分析引擎。提供 openpyxl read_only 流式读取(iter_rows 支持 10 万行以上)、Parquet 转换加速、内存优化、分块处理和大文件写入模式。**遇到以下任一情况就主动使用本 skill**:①数据行数 ≥ 10k(由 sn-da-excel-workflow 的行数评估步骤触发);②用户出现触发词:大文件 / 大数据量 / 性能优化 / 内存不足 / OOM / 百万行 / 十万行 / 流式读取 / Parquet / 分块处理 / large file / big data / streaming read / chunked processing;③直接使用 pd.read_excel() 导致超时或内存溢出;④用户明确要求对大规模数据集进行高性能处理。仅不用于:小于 10k 行的常规 Excel 分析(使用 sn-da-excel-workflow 即可)。
Native Arrow filesystem integration with PyArrow. Optimized for Parquet workflows, zero-copy data transfer, predicate pushdown, and column pruning. Covers S3, GCS, HDFS with PyArrow datasets.
Exports Amazon RDS or Aurora database snapshots to Amazon S3 in Apache Parquet format for analytics, backup, or data migration. Handles snapshot selection or creation, IAM role setup, KMS encryption, S3 bucket preparation, export task execution, progress monitoring, and data verification. Use when exporting RDS/Aurora data to S3 for Athena, Glue, or Redshift Spectrum consumption.
Convert any data file to another format: CSV, Parquet, JSON, Excel, GeoJSON, and more. Use when the user says "convert to parquet", "save as xlsx", "export as JSON", "make this a CSV", "turn into parquet", or any variation of format-to-format conversion for data files. Also triggers when the user wants to write Parquet, Excel, or other binary formats that Claude cannot produce natively.
In-process ClickHouse SQL engine for Python — run ClickHouse SQL queries directly on local files, remote databases, and cloud storage without a server. Use when the user wants to write SQL queries against Parquet/CSV/ JSON files, use ClickHouse table functions (mysql(), s3(), postgresql(), iceberg(), deltaLake() etc.), build stateful analytical pipelines with Session, use parametrized queries, window functions, or other advanced ClickHouse SQL features. Also use when the user explicitly mentions chdb.query(), ClickHouse SQL syntax, or wants cross-source SQL joins. Do NOT use for pandas-style DataFrame operations — use chdb-datastore instead.
Scalable data processing for ML workloads. Streaming execution across CPU/GPU, supports Parquet/CSV/JSON/images. Integrates with Ray Train, PyTorch, TensorFlow. Scales from single machine to 100s of nodes. Use for batch inference, data preprocessing, multi-modal data loading, or distributed ETL pipelines.
This skill provides guidance for merging data from multiple heterogeneous sources (JSON, CSV, Parquet, XML, etc.) into a unified dataset. Use this skill when tasks involve combining records from different file formats, applying field mappings, resolving conflicts based on priority rules, or generating merged outputs with conflict reports. Applicable to ETL pipelines, data consolidation, and record deduplication scenarios.
Profile datasets to understand schema, quality, and characteristics. Use when analyzing data files (CSV, JSON, Parquet), discovering dataset properties, assessing data quality, or when user mentions data profiling, schema detection, data analysis, or quality metrics. Provides basic and intermediate profiling including distributions, uniqueness, and pattern detection.
Use when writing or running Nushell commands, scripts, or pipelines - via the Nushell MCP server (mcp__nushell__evaluate), via Bash (nu -c), or in .nu script files. Also use when working with structured data (JSON, YAML, TOML, CSV, Parquet, SQLite), doing ad-hoc data analysis or exploration, or when the user's shell is Nushell.
Routes the weakest VCN samples (output of `tao-analyze-gaps-visual-changenet`) into per-augmentation-module subsets — one parquet for k-NN mining, one for AnomalyGen (Cosmos SDG) — based on each module's label eligibility. Use as the immediate next step after DEFT gap analysis in a VCN AOI SDA iteration.