carto-create-workflow
CARTO Workflows is a visual DAG builder that compiles to warehouse SQL. Each workflow runs
inside a connected warehouse — no CARTO compute is involved at execution time. This skill covers the full lifecycle: building the DAG (the bulk of this file), operating it via the CLI (CRUD, schedules), and
cross-profile copy (
promotion, customer-segregated workspaces via
) — see the references below.
For one-off ad-hoc SQL, use
carto-query-datawarehouse
— workflows are for repeatable, scheduled, multi-step DAGs.
Bundle structure, component schemas, input formats, and gotchas are all served by the CLI — never hardcode or assume them. The CLI is the source of truth.
Live introspection commands (use these before reaching for any reference file):
| Command | What it serves |
|---|
| Index of all bundle/DAG schema sections |
carto workflows schema bundle
| Top-level bundle shape (id, title, connectionId, config, privacy, tags). is a — fetch its shape with carto workflows schema privacy
. Minimal valid form: "privacy": { "privacy": "private" }
(the inner string is not a bare — lists the allowed values for that inner field). |
carto workflows schema config
| Full DAG config (schemaVersion, connectionProvider enum, nodes, edges, variables, viewport, useCache, executionSettings, schedule) |
carto workflows schema node
| Generic node shape, including requirement and vs |
carto workflows schema node.source
| Source/ node shape and the data.id == data.inputs[0].value
invariant |
carto workflows schema node.customsql
| Full customsql node spec |
carto workflows schema customsql
| Copy-paste customsql node template (with ) |
carto workflows schema edge
| Edge shape |
carto workflows schema handles
| Edge handle naming reference — sourceHandle/targetHandle by node type, by operator, by component. Critical for valid edges. |
carto workflows schema variable
| Variable (parameter) shape — { order, name, type, value, public }
|
carto workflows schema schedule
| Declarative schedule metadata fields |
carto workflows schema enums
| All valid enums (node types, providers, privacies, schedule frequencies) |
carto workflows components list --connection <conn> --json
| Component catalog for the connected warehouse |
carto workflows components get <names> --connection <conn> --json
| Per-component , , |
carto workflows components get <names> --connection <conn> --input-formats --json
| Input-type , , |
| Full command reference, including schedule-expression dialects per engine |
References (only for what the CLI doesn't serve):
- — per-warehouse details (BigQuery, Snowflake, Databricks): identifier quoting, column casing, AT path.
- — vs semantics, bundle-level schedule warning, activity-log verification.
references/mcp-and-api-publish.md
— publishing a workflow as an MCP tool or callable API endpoint: bundle requirements ( + scoped variables + draft descriptions), vs substitution syntax, gotcha, post-publish verification.
references/cross-profile-copy.md
— mechanics, connection mapping ( / ), , why copies are always new workflows.
references/schedule-readd.md
— schedules don't transfer across ; how to re-add them, including dialect translation when source and destination engines differ.
must match the connection. config.connectionProvider
(enum in
) must match the connection's actual provider — mismatches generate the wrong SQL dialect and error at runtime. Look it up with
carto connections list --search <name> --json
(
requires a UUID).
Development process
Follow these 6 phases in order for every workflow request. Do not skip or reorder them.
Phase 1 — Gather information
-
Identify data sources. If the user named tables, note them. Otherwise discover what's available with
and
carto connections describe <connection> "<fqn>"
.
-
Clarify the goal. What transformation? What output? What filters/conditions?
-
Determine the connection. carto connections list | head -n 20
. Note its
(
/
/
) — you will need it for the next step.
-
Read the provider reference.
<critical-rule id="read-provider-reference">
Before writing any node, you MUST open `references/providers/<provider>.md` (e.g. `references/providers/bigquery.md`) and read it end-to-end. This is non-negotiable.
<why>It contains identifier-quoting rules, column-casing behaviour, Analytics Toolbox path, schedule-expression dialect, and customsql
/
placeholder requirements that
cannot catch. These only surface later as
failures or runtime SQL errors, and are the single most common cause of late-stage rework.</why>
<do-not>Do not skip this step because the next phases look concrete. Do not rely on memory of a previous run — provider files change.</do-not>
</critical-rule>
-
Fetch the component catalog. carto workflows components list --connection <connection> --json
— your only source of truth for component names.
Phase 2 — Design the approach
- Select components from the catalog you fetched.
- Fetch schemas for every component you plan to use.
carto workflows components get <name1>,<name2>,<name3> --connection <connection> --json
returns , , and . Read the array carefully — it contains gotchas.
- Fetch input type formats.
carto workflows components get <component1>,<component2> --connection <connection> --input-formats --json
returns , , and for each input/output type. Pass component names (e.g. ), NOT input-type names.
- Design principles:
- Preserve identifier and spatial columns throughout.
- Prefer native components over . This is not a soft preference. See Native-first rule.
- H3/Quadbin columns work for visualization without geometry extraction.
- Use standard names for visualization: , , .
Phase 3 — Present plan, surface gaps, confirm
Present the workflow plan (components, data flow, decisions). Then explicitly enumerate every gap before building:
- Unresolved parameters — thresholds, radii, filter values, time windows, k for k-NN, aggregation columns, output table names, etc.
- Analytical decisions left to the user — significance levels, distance metrics, join types, null-handling, dedup keys, CRS, H3/quadbin resolution.
- Ambiguities in the request — anything where you had to guess intent.
For each gap,
propose a sensible default with its rationale (e.g. "p-value threshold: suggest
— conventional significance level", "buffer distance: suggest
— matches the city-block scale of the input"), and
ask the user to confirm or override. Never silently pick a value for a user-facing analytical parameter.
Wait for confirmation before building.
Phase 4 — Build the workflow
-
Create the workflow file. Get the bundle/node/edge/variable shapes from
carto workflows schema [section]
(start with
, then
,
,
,
,
). For customsql nodes, copy the template from
carto workflows schema customsql
.
If you set the optional top-level
, it must be an
object, not a string:
"privacy": { "privacy": "private" }
(the field name nests). Omit the field entirely if you don't need it —
will fail
.
Source nodes (
) — treat
like any other component: fetch its spec with
carto workflows components get ReadTable --connection <conn> --json
to get the canonical
and
. (
is hidden from
because it's grouped
, but
returns it normally.) Two source-only rules
cannot tell you, both from
:
- The canvas display name lives in , NOT . Generic nodes use ; source nodes use .
- and must be the same FQN.
Canvas layout & naming — apply on every node, every workflow. None of this affects execution, but the user opens the DAG in Builder and a sloppy canvas reads as low quality. The numbers are small and stable; just apply them.
- Snap grid is 16 px. Every and you write must be . Builder snaps drags to this grid; off-grid values look subtly misaligned next to anything the user nudged.
- Card widths are fixed by node type: source nodes render at 192 px (12 cells), generic components at 64 px (4 cells). Knowing this is what lets you reason about gaps.
- Card heights are fixed: every component card and source card is 80 px (5 cells) tall, with a 16 px label rendered below the card body. The label is not part of the card — it lives in the gap to the next card.
- Canonical inter-card gap (right edge → next left edge): 80 px (5 cells) for tight linear placement; 128 px (8 cells) at a fan-in (a join's left input, where an edge from another row needs room). The gap is the constant; left-edge-to-left-edge Δx differs across patterns only because cards have different widths. So a generic→generic linear step is Δx=144 (9 cells); a source→generic step at the same gap is Δx=272 (17 cells); a generic→generic fan-in step is Δx=192 (12 cells).
- Canonical vertical gap (card body bottom → next card body top): 80 px (5 cells), of which the first 16 px is the card's label and the remaining 64 px is whitespace. The label always sits inside the gap, never inside the card. So a stacked-card step is top-to-top Δy = 160 px (10 cells) — 80 (body) + 16 (label) + 64 (whitespace).
- Layout. Source nodes stack at the leftmost column with the same , Δy = 144 px (9 cells). The main pipeline runs at the y-midline of the source rows — e.g. sources at y=80 and y=224 → pipeline at y=160. Joins on the midline visually receive both inputs symmetrically.
- and are different fields — never duplicate. = short instance-specific verb (≤ 15 chars) describing what this node does in this DAG (, , ). = the component's canonical type name as Builder shows it on a fresh drop (, , ) — read from
carto workflows components get <name> --json
→ . Source nodes only render on canvas (treat it as a short alias for the table: , ).
-
Run after every write to the file. It's offline, fast, and catches structural errors immediately:
bash
carto workflows validate workflow.json --json
Treat any save without a passing
as broken — fix before continuing to the next node/edge.
is authoritative. If a component schema from
disagrees with what
accepts, trust
and adjust the bundle to satisfy it. Do not "fix" the bundle to match the schema if it's already passing validation.
-
Run at branch boundaries, not on every save. It hits the warehouse (slower, requires auth), so reserve it for whole sub-DAGs once their structure validates clean, and once at the end before presenting:
bash
carto workflows verify-remote workflow.json --connection <connection-name> --json
is what catches column-type mismatches, missing tables, and AT resolution — things
cannot see.
-
Fix errors silently — don't expose implementation details to the user.
-
Iterate until complete, with both
and a final
clean.
Phase 5 — Present result
Summarize what was built. Confirm validation success. Wait for user confirmation.
Phase 6 — Upload to CARTO
- Ask if the user wants to upload.
- Upload and provide the URL:
bash
carto workflows create --file workflow.json --verify
The connection comes from inside the bundle — no flag here.
- Do NOT auto-execute unless explicitly requested.
Native-first rule
is the
last tool to reach for, not the first. Before writing a customsql node, attempt the native chain. Fall back to customsql only if at least one of these is true:
- The native chain would require more than ~4-5 nodes to express the same logic.
- A specific operation has no native equivalent at all (verified via
carto workflows components list
).
- The expression genuinely needs raw warehouse SQL (e.g., , , , last-N windowing).
Common operations and their native equivalents — try these first:
| If you'd write SQL like… | Use natives |
|---|
| / multi-condition filter | (predicate), (UI builder), (geometry-based match/unmatch split), (column projection) |
| (multi-column projection / rename / multi-expression) | (one node, free-form SELECT body) |
SELECT ..., expr AS c FROM t
(add one computed column) | (one column + one expression per node) |
GROUP BY k, SUM(x), AVG(y), COUNT(*)
(single key) | — input is a single , not multi-column. For multi-key grouping use . |
| (any join type) | |
JOIN ... ON ST_INTERSECTS / ST_CONTAINS / ST_WITHIN
| |
MIN(ST_DISTANCE(a.geom, b.geom))
across two tables | (augments the main table in place with + — rename them per source if you chain two nodes for two reference tables) |
| |
| H3 binning / boundary / center / polyfill | , (output geometry column is named , e.g. — not ), , |
| + |
| z-score / standardization | |
| weighted composite score | native.spatialcompositeunsupervised
(weighted/PCA), native.spatialcompositesupervised
(target-driven) |
| Getis-Ord Gi*, GWR, isolines | , , |
| Save final node to a table | |
Signals you're reaching for customsql too early — stop and look for a native chain instead:
- The customsql is just a clause, a single , a with one or two aggregates, or a column projection.
- It wraps a single warehouse function (, , etc.) for which a dedicated native exists.
- Its only purpose is to project/rename/re-cast columns — use (free-form SELECT body, one node) for multiple columns; is for adding a single computed column.
- You're chaining customsql outputs through more customsql nodes — chain natives instead.
When customsql is genuinely the right call, the per-warehouse SQL-dialect footguns live in the matching
references/providers/*.md
(BigQuery backticks, Snowflake casing, Databricks identifiers).
Fetching component & input information
Do not rely on memorized component schemas or input formats. Always fetch live data from the CLI.
| Command | Purpose |
|---|
carto workflows components list --connection <conn> --json
| List all available components |
carto workflows components get <names> --connection <conn> --json
| Component schemas with , , and |
carto workflows components get <names> --connection <conn> --input-formats --json
| Input type , , for the types those components use |
What to look for in the response:
- Component — gotcha strings: non-obvious behavior, deprecated status, output column naming.
- Input — prose describing the expected value shape.
- Input — concrete JSON snippets showing correct usage.
- Input — common mistakes, evaluation order, format quirks.
- Component — copy verbatim into the authored node's (string). Generic nodes without it are flagged OUTDATED in Builder.
- Input (Selection / Enum) — the engine matches values exactly. Copy each option string verbatim — preserve case, never paraphrase or Title-Case (e.g. spatialjoin's accepts , not ).
For values that may evolve over time (component versions, bundle/config defaults, enum option lists), treat the CLI's
/
output as the single source of truth — never hardcode values in your own templates. Specifically:
- — read the current default from
carto workflows schema config --json
→ properties.schemaVersion.default
. Today it's (string), but resolve at author time so future bumps don't require a skill update.
Provider-specific notes
Different warehouses have different SQL dialects, table-naming conventions, and column-casing rules. Always check the matching provider guide:
references/providers/bigquery.md
references/providers/snowflake.md
references/providers/databricks.md
Input-type formats (
,
,
,
, etc.) and per-component gotchas (including the "AT components need
, not
" rule) are served by the CLI itself — see
Fetching component & input information.
Operating a workflow (after it's built)
Once a workflow exists in CARTO, the CLI exposes CRUD and schedule management. Quick reference:
bash
# List / inspect
carto workflows list --json
carto workflows get <id>
# Update with edited JSON
carto workflows update <id> --file workflow.json
# Add / remove a schedule
carto workflows schedule add <id> --expression "every day 08:00"
carto workflows schedule remove <id>
Always-on guidance:
- Workflows run on the connection's warehouse. A workflow with a BigQuery connection cannot use Snowflake-specific SQL.
- Schedule expression syntax depends on the engine — natural-language for BQ/CARTO DW (), cron for Snowflake/Postgres (), Quartz cron for Databricks (). See . Picking the wrong dialect fails at schedule-add time.
- Copying a workflow across profiles (dev → prod, customer-segregated workspaces) is covered in
references/cross-profile-copy.md
. Schedules don't transfer — see references/schedule-readd.md
.
- Deleting a workflow doesn't delete its outputs. Tables/views the workflow created in the warehouse persist; clean them up with if needed.
- replaces the whole DAG. There's no per-node patch. Always first, edit, then .
- Workflow execution status lives in the activity log (,
WorkflowExecutionComplete
event types). For health monitoring of scheduled workflows, query that log via carto-query-datawarehouse
— see references/activity-queries.md
in that skill.