Resolve data lake and lakehouse asset references across Glue Data Catalog, S3, S3 Tables, and Redshift. Triggers on: find the table, where is our data, which table has, locate dataset, find data for, search catalog, what tables match, Redshift table, lakehouse table, data lake table, warehouse table, reverse lookup S3 path. Do NOT use for: full catalog audits (use exploring-data-catalog), running queries (use querying-data-lake), creating tables (use creating-data-lake-table).
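For illustration, a minimal sketch of the kind of reverse lookup this skill performs against the Glue Data Catalog, assuming boto3 credentials are already configured; the search text is a placeholder.

```python
import boto3

def find_tables(search_text: str, max_results: int = 10) -> None:
    """Search the Glue Data Catalog and print matching tables with their S3 locations."""
    glue = boto3.client("glue")
    resp = glue.search_tables(SearchText=search_text, MaxResults=max_results)
    for table in resp["TableList"]:
        location = table.get("StorageDescriptor", {}).get("Location", "n/a")
        print(f"{table['DatabaseName']}.{table['Name']} -> {location}")

find_tables("orders")  # e.g. "which table has order data?"
```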
Create managed Iceberg tables using Amazon S3 Tables (s3tables API namespace) with automatic compaction and snapshot management. Sets up table bucket, namespace, table, schema, Glue catalog registration, partitioning, IAM access control. Triggers on: create table, data lake table, analytics table, structured data storage, S3 Tables, Iceberg, Athena table, partitioning strategy, access permissions. Do NOT use for: importing files (use ingesting-into-data-lake), vector storage (use storing-and-querying-vectors), querying existing tables (use querying-data-lake), or locating an existing table (use finding-data-lake-assets).
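A hedged sketch of the core s3tables calls (bucket, namespace, table), assuming boto3 is configured; all names are placeholders, and the Glue registration, partitioning, and IAM steps the skill also covers are omitted here.

```python
import boto3

s3tables = boto3.client("s3tables")

# 1. Table bucket: the managed storage container for Iceberg tables.
bucket_arn = s3tables.create_table_bucket(name="analytics-lake")["arn"]

# 2. Namespace: logical grouping, analogous to a database.
s3tables.create_namespace(tableBucketARN=bucket_arn, namespace=["sales"])

# 3. Iceberg table; schema and partitioning are typically applied afterwards
#    through an Iceberg-aware engine such as Athena or Spark.
table = s3tables.create_table(
    tableBucketARN=bucket_arn,
    namespace="sales",
    name="daily_orders",
    format="ICEBERG",
)
print(table["tableARN"])
```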
Troubleshoots and debugs AWS Clean Rooms collaboration issues related to IAM roles, S3 bucket policies, KMS keys, Lake Formation permissions, and CloudWatch logging for custom ML model training and inference jobs. Use when a customer reports permission failures, access errors, or log publishing issues in Clean Rooms.
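One plausible diagnostic step for the permission failures described above is simulating the job role's identity policies with IAM; the role and bucket ARNs below are hypothetical, and resource-based policies (bucket policy, KMS key policy, Lake Formation grants) still have to be checked separately.

```python
import boto3

iam = boto3.client("iam")

# Hypothetical ARNs: the collaboration's ML job role and the S3 objects
# backing its input/output channels.
role_arn = "arn:aws:iam::111122223333:role/cleanrooms-ml-job-role"
objects_arn = "arn:aws:s3:::cleanrooms-training-data/*"

# Evaluates only the role's identity-based policies, not resource policies.
resp = iam.simulate_principal_policy(
    PolicySourceArn=role_arn,
    ActionNames=["s3:GetObject", "s3:PutObject"],
    ResourceArns=[objects_arn],
)
for result in resp["EvaluationResults"]:
    print(result["EvalActionName"], result["EvalDecision"])
```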
Convert laboratory instrument output files (PDF, CSV, Excel, TXT) to Allotrope Simple Model (ASM) JSON format or flattened 2D CSV. Use this skill when scientists need to standardize instrument data for LIMS systems, data lakes, or downstream analysis. Supports auto-detection of instrument types. Outputs include full ASM JSON, flattened CSV for easy import, and exportable Python code for data engineers. Common triggers include converting instrument files, standardizing lab data, preparing data for upload to LIMS/ELN systems, or generating parser code for production pipelines.
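A rough sketch of the flattened-CSV output path, assuming an ASM JSON file is already on disk; the nested key path is an illustrative assumption, since real ASM layouts vary by instrument technique.

```python
import json
import pandas as pd

with open("plate_reader_output.asm.json") as f:
    asm = json.load(f)

# Assumed location of the per-measurement records; actual ASM key paths
# differ by technique, so adjust for the file at hand.
measurements = asm["measurement aggregate document"]["measurement document"]

flat = pd.json_normalize(measurements, sep=".")
flat.to_csv("plate_reader_output_flat.csv", index=False)
```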
Bronze/Silver/Gold layer design patterns and templates for building scalable data lakehouse architectures. Includes incremental processing, data quality checks, and optimization strategies.
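A minimal bronze-to-silver sketch in PySpark illustrating the incremental-processing and data-quality ideas above; paths, columns, and the partition filter are placeholders, and a Spark environment with Delta Lake support is assumed.

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

# Bronze: raw events landed as-is (path and columns are placeholders).
bronze = spark.read.format("delta").load("/lake/bronze/orders")

# Silver: one incremental slice, deduplicated and quality-checked.
silver = (
    bronze.filter(F.col("ingest_date") == "2024-06-01")
          .dropDuplicates(["order_id"])
          .filter(F.col("order_id").isNotNull() & (F.col("amount") > 0))
)

silver.write.format("delta").mode("append").save("/lake/silver/orders")
```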
Create data analytics and data pipeline diagrams using PlantUML syntax with analytics/database stencil icons. Best for ETL pipelines, data lakes, real-time streaming, data warehousing, and BI dashboards. NOT for simple flowcharts (use mermaid) or general cloud infra (use cloud skill).
This skill should be used when working with LaminDB, an open-source data framework for biology that makes data queryable, traceable, reproducible, and FAIR. Use when managing biological datasets (scRNA-seq, spatial, flow cytometry, etc.), tracking computational workflows, curating and validating data with biological ontologies, building data lakehouses, or ensuring data lineage and reproducibility in biological research. Covers data management, annotation, ontologies (genes, cell types, diseases, tissues), schema validation, integrations with workflow managers (Nextflow, Snakemake) and MLOps platforms (W&B, MLflow), and deployment strategies.
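A minimal LaminDB sketch, assuming an instance has already been initialized (e.g. via `lamin init`) and that `ln.track()` / `Artifact.from_df` behave as in the current quickstart; the dataframe and key are placeholders.

```python
import lamindb as ln
import pandas as pd

ln.track()  # record this run so saved artifacts carry lineage

df = pd.DataFrame({"cell_id": ["c1", "c2"], "n_genes": [2450, 1980]})

# Save the dataframe as a versioned, queryable artifact.
artifact = ln.Artifact.from_df(
    df, key="scrna/qc_summary.parquet", description="per-cell QC summary"
)
artifact.save()
```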
Manage the full lifecycle of Alibaba Cloud E-MapReduce (EMR) ECS clusters—creation, scaling, renewal, and status queries. Use this Skill when users want to set up big data clusters, view cluster status, add nodes, release nodes, configure auto-scaling, check cluster and node states, or diagnose creation failures. Also applicable for scenarios like "create a Hadoop cluster", "data lake cluster", "running out of resources", "check my cluster", "renew", etc. NOTE: This Skill does NOT support cluster deletion, release, or termination under any circumstances. Any request to delete or terminate a cluster will be refused and redirected to the EMR console.
Execute authoring T-SQL (DDL, DML, data ingestion, transactions, schema changes) against Microsoft Fabric Data Warehouse and SQL endpoints from agentic CLI environments. Use when the user wants to: (1) create/alter/drop tables from terminal, (2) insert/update/delete/merge data via CLI, (3) run COPY INTO or OPENROWSET ingestion, (4) manage transactions or stored procedures, (5) perform schema evolution, (6) use time travel or snapshots, (7) generate ETL/ELT shell scripts, (8) create views/functions/procedures on Lakehouse SQLEP. Triggers: "create table in warehouse", "insert data via T-SQL", "load from ADLS", "COPY INTO", "run ETL with T-SQL", "alter warehouse table", "upsert with T-SQL", "merge into warehouse", "create T-SQL procedure", "warehouse time travel", "recover deleted warehouse data", "create warehouse schema", "deploy warehouse", "transaction conflict", "snapshot isolation error".
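The skill targets CLI execution, but the same authoring T-SQL can be driven from Python via pyodbc; the server name below follows the Fabric warehouse endpoint pattern and is a placeholder, as are the database and table.

```python
import pyodbc

conn_str = (
    "Driver={ODBC Driver 18 for SQL Server};"
    "Server=your-workspace.datawarehouse.fabric.microsoft.com;"  # placeholder endpoint
    "Database=SalesWarehouse;"
    "Authentication=ActiveDirectoryInteractive;"
    "Encrypt=yes;"
)

with pyodbc.connect(conn_str) as conn:
    cur = conn.cursor()
    cur.execute("""
        CREATE TABLE dbo.staging_orders (
            order_id   BIGINT NOT NULL,
            amount     DECIMAL(18, 2),
            order_date DATE
        );
    """)
    cur.execute(
        "INSERT INTO dbo.staging_orders (order_id, amount, order_date) "
        "VALUES (?, ?, ?);",
        1001, 19.99, "2024-06-01",
    )
    conn.commit()
```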
Analyze lakehouse data interactively using Fabric Livy sessions and PySpark/Spark SQL for advanced analytics, DataFrames, cross-lakehouse joins, Delta time-travel, and unstructured/JSON data. Use when the user explicitly asks for PySpark, Spark DataFrames, Livy sessions, or Python-based analysis — NOT for simple SQL queries. Triggers: "PySpark", "Spark SQL", "analyze with PySpark", "Spark DataFrame", "Livy session", "lakehouse with Python", "PySpark analysis", "PySpark data quality", "Delta time-travel with Spark".
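A small PySpark sketch of the Delta time-travel and data-quality checks mentioned above; in a Fabric notebook or Livy session `spark` is already provided, and the OneLake path is a placeholder.

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()  # pre-created in a Fabric/Livy session

# Placeholder OneLake path to a lakehouse Delta table.
path = ("abfss://workspace@onelake.dfs.fabric.microsoft.com/"
        "SalesLakehouse.Lakehouse/Tables/orders")

orders = spark.read.format("delta").load(path)
orders_v0 = spark.read.format("delta").option("versionAsOf", 0).load(path)
print(orders.count() - orders_v0.count(), "rows added since version 0")

# Simple data-quality check via Spark SQL aggregations.
orders.select(
    F.count("*").alias("rows"),
    F.sum(F.col("order_id").isNull().cast("int")).alias("null_order_ids"),
).show()
```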
Use when reading from or writing to Neo4j with Apache Spark or Databricks using the Neo4j Connector for Apache Spark (org.neo4j:neo4j-connector-apache-spark). Covers SparkSession setup, DataFrame reads via labels/Cypher/relationship scan, DataFrame writes with SaveMode, node.keys for MERGE, relationship write mapping, partition and batch tuning, PySpark and Scala examples, Databricks cluster config, Databricks secrets for credentials, Delta Lake to Neo4j pipelines. Does NOT handle Cypher authoring — use neo4j-cypher-skill. Does NOT handle the Python bolt driver — use neo4j-driver-python-skill. Does NOT handle GDS algorithms — use neo4j-gds-skill.
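A condensed PySpark read/write example using the connector's documented options; the URL and credentials are placeholders (in Databricks they would come from secrets), and the connector jar is assumed to be on the cluster classpath.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
url = "neo4j+s://demo.databases.neo4j.io"  # placeholder

# Read all :Person nodes into a DataFrame.
people = (
    spark.read.format("org.neo4j.spark.DataSource")
    .option("url", url)
    .option("authentication.basic.username", "neo4j")
    .option("authentication.basic.password", "secret")
    .option("labels", "Person")
    .load()
)

# Write back, merging on the `id` property rather than creating duplicates.
(people.write.format("org.neo4j.spark.DataSource")
       .mode("Overwrite")
       .option("url", url)
       .option("authentication.basic.username", "neo4j")
       .option("authentication.basic.password", "secret")
       .option("labels", ":Person")
       .option("node.keys", "id")
       .save())
```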
Use when "data pipelines", "ETL", "data warehousing", "data lakes", or asking about "Airflow", "Spark", "dbt", "Snowflake", "BigQuery", "data modeling"