# Debug Production Renders
Telecine renders flow through a queue-based pipeline backed by Valkey (Redis-compatible). Each render is a workflow that progresses through multiple queues, each served by a dedicated Cloud Run worker service. Debugging a render means checking its state at three layers: the Postgres database, the Valkey queue state, and the Cloud Run worker logs.
## Quick Start

All debug scripts run inside Docker. From the monorepo root:

```bash
# Full render status: DB row, fragment breakdown, error detail
telecine/scripts/debug-render <render-id>

# Add live Valkey state (queued/claimed/completed/failed jobs)
telecine/scripts/debug-render <render-id> --redis

# Add docker compose log grep (local dev only)
telecine/scripts/debug-render <render-id> --logs

# All three
telecine/scripts/debug-render <render-id> --redis --logs
```
## Render Pipeline Flow
A render progresses through queues in this order:
- `process-html-initializer` -- Preprocesses raw HTML from the API submission.
- `process-html-finalizer` -- Sets workflow data, then enqueues the `render-initializer` job. This is the bridge from the HTML pipeline into the render pipeline.
- `render-initializer` -- Checks out render source, extracts render info (dimensions, duration, fps) via Electron RPC, creates an assets metadata bundle, then fans out N fragment jobs (see the sketch below). For still images (png/jpeg/webp), renders the image directly and skips fragments.
- `render-fragment` (N parallel) -- Each job renders one time slice of the video via Electron RPC, writes the fragment file to storage, and reports progress.
- `render-finalizer` -- Auto-triggered when all fragment jobs complete. Merges fragment files into the final output. Marks the render as complete.
The pipeline is triggered by a Hasura insert event on the render record. The workflow system automatically routes the finalizer queue job when all child jobs complete or when any job fails with exhausted retries.
Queue names, worker service names, and resource allocations are defined in source files -- see "Key Source Files" below.
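To make the fan-out step concrete, here is a hypothetical sketch of how fragment jobs could be planned from the extracted duration. `SEGMENT_MS` and the job shape are invented for illustration; the actual logic lives in `telecine/lib/queues/units-of-work/Render/`:

```typescript
// Hypothetical fan-out sketch; SEGMENT_MS and the fragment job shape
// are illustrative names, not telecine's actual API.
const SEGMENT_MS = 2_000; // assumed length of one time slice

function planFragments(durationMs: number) {
  const n = Math.ceil(durationMs / SEGMENT_MS);
  return Array.from({ length: n }, (_, segmentId) => ({
    segmentId,
    startMs: segmentId * SEGMENT_MS,
    endMs: Math.min((segmentId + 1) * SEGMENT_MS, durationMs),
  }));
}

// e.g. a 9.5 s render fans out into 5 fragment jobs
console.log(planFragments(9_500).length); // 5
```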
## Debug Scripts

| Script | Purpose |
|---|---|
| `telecine/scripts/debug-render <id> [--redis] [--logs]` | Primary debug tool: DB state, fragments, errors, optional Valkey + logs |
| `telecine/scripts/inspect-render.ts` | Lower-level Valkey inspection: workflow data, claimed jobs with ages, all workflow keys |
| `telecine/scripts/check-queue.ts` | Queue-level Valkey state: queued/claimed/failed counts, org membership |
| `telecine/scripts/restart-render.ts` | Restart a failed render: resets DB status, re-enqueues initializer job |
| `telecine/scripts/create-render.ts` | Create a test render from an existing render's org context |
| `telecine/scripts/render-logs [-f] <id>` | Grep docker compose logs for initializer/fragment/finalizer services |
| `telecine/scripts/smoke-test` (runs `telecine/scripts/smoke-test.ts`) | End-to-end smoke tests via the public API |
| `telecine/scripts/smoke-test-waveform.ts` | Waveform-specific smoke test: inserts renders directly into DB, exercises all ef-waveform modes concurrently |
| | Node REPL with project imports (`db`, `valkey`, queues available) |
Run `.ts` scripts via `telecine/scripts/run tsx scripts/<script>.ts <args>`.
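For example, to run the Valkey inspector against a specific render:

```bash
telecine/scripts/run tsx scripts/inspect-render.ts <render-id>
```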
## Querying Cloud Run Logs

There are no dedicated log-querying scripts; use `gcloud logging read` directly. Workers log structured JSON via pino, so use `jsonPayload` filters:
```bash
# All render workers for a specific render ID
gcloud logging read \
  'resource.type="cloud_run_revision"
   AND resource.labels.service_name=~"telecine-worker-render"
   AND jsonPayload.renderId="<RENDER-ID>"' \
  --project=editframe --limit=100 --format=json

# Specific worker stage
gcloud logging read \
  'resource.type="cloud_run_revision"
   AND resource.labels.service_name="telecine-worker-render-initializer"
   AND jsonPayload.renderId="<RENDER-ID>"' \
  --project=editframe --limit=100

# Errors only across all workers
gcloud logging read \
  'resource.type="cloud_run_revision"
   AND resource.labels.service_name=~"telecine-worker"
   AND severity>=ERROR
   AND jsonPayload.renderId="<RENDER-ID>"' \
  --project=editframe --limit=50
```
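To skim just the messages, the JSON output can be piped through `jq`; this assumes pino's default `msg` message key (adjust if the workers rename it):

```bash
gcloud logging read \
  'resource.type="cloud_run_revision"
   AND resource.labels.service_name=~"telecine-worker"
   AND jsonPayload.renderId="<RENDER-ID>"' \
  --project=editframe --limit=100 --format=json \
  | jq -r '.[] | "\(.timestamp) \(.jsonPayload.msg)"'
```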
Cloud Run service names follow the pattern `telecine-worker-{queue-name}`. See `telecine/deploy/resources/queues/configs.ts` for the full list of queue names.
## Valkey Key Schema
Queues and workflows use predictable Valkey key patterns. Understanding these lets you query state directly when scripts don't cover your case.
Queue-level keys (per queue name):

- `queues:{queueName}:queued` -- zset of job keys waiting to be claimed
- `queues:{queueName}:claimed` -- zset of job keys being processed (score = claim timestamp)
- `queues:{queueName}:completed` -- zset of completed job keys
- `queues:{queueName}:failed` -- zset of failed job keys
- `queues:{queueName}:jobs:{jobId}` -- serialized job data (SuperJSON)
- A per-queue org-membership zset of org keys with active jobs (exact key pattern in `Queue.ts`)
Workflow-level keys (per render ID):

- `workflows:{renderId}:queued` -- zset of workflow-level queued jobs
- `workflows:{renderId}:claimed` -- zset of claimed jobs
- `workflows:{renderId}:completed` -- zset of completed jobs
- `workflows:{renderId}:failed` -- zset of failed jobs
- `workflows:{renderId}:data` -- SuperJSON workflow payload (render config)
- `workflows:{renderId}:status` -- workflow status string
Progress tracking:

- A per-render Redis stream records fragment completion progress (exact key pattern in `Workflow.ts`)
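When the scripts don't cover your case, these keys can be read directly with `redis-cli` (or `valkey-cli`); a minimal sketch, substituting a real render ID and queue name:

```bash
# Workflow status and payload
redis-cli GET "workflows:<RENDER-ID>:status"
redis-cli GET "workflows:<RENDER-ID>:data"

# Job sets, with claim timestamps as zset scores
redis-cli ZRANGE "workflows:<RENDER-ID>:claimed" 0 -1 WITHSCORES
redis-cli ZRANGE "queues:render-fragment:queued" 0 -1 WITHSCORES
```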
Stalled jobs are detected by checking claimed jobs whose score (claim timestamp) is older than 10 seconds.
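As an illustration of that stall check, a minimal TypeScript sketch using `ioredis`, assuming claim scores are millisecond timestamps and a `VALKEY_URL` env var (the real detection lives in `Job.ts`):

```typescript
import Redis from "ioredis";

const STALL_MS = 10_000; // claimed longer than 10 s = stalled

async function findStalledJobs(queueName: string): Promise<string[]> {
  const valkey = new Redis(process.env.VALKEY_URL ?? "redis://localhost:6379");
  // The claimed set is a zset scored by claim timestamp; anything
  // claimed before (now - 10 s) has missed its keepalive window.
  const cutoff = Date.now() - STALL_MS;
  const stalled = await valkey.zrangebyscore(
    `queues:${queueName}:claimed`,
    "-inf",
    cutoff,
  );
  valkey.disconnect();
  return stalled;
}

findStalledJobs("render-fragment").then((jobs) =>
  console.log(`${jobs.length} stalled job(s):`, jobs),
);
```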
## Database Tables
Render state is persisted in Postgres via Kysely (not Prisma); scripts and workers import a shared `db` client (see the query sketch below).

- Main render record: status, html, org_id, dimensions, fps, duration_ms, failure_detail, timestamps
- Per-segment fragment records: render_id, segment_id, attempt_number, timestamps, last_error
- HTML processing records
- File records (used by process-isobmff and ingest-image)
Render status values advance in a fixed order as the pipeline progresses (a stuck render typically shows `rendering`), ending in a terminal completed or failed state.
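For ad-hoc queries, a minimal Kysely sketch; the import path and the table and column names used here (`renders`, `render_fragments`) are illustrative assumptions, so verify them against the actual schema first:

```typescript
import { db } from "../lib/db"; // hypothetical import path

// Fetch a render row plus its fragment attempts. Table names are
// assumptions for illustration; verify against the Kysely schema.
async function showRender(renderId: string) {
  const render = await db
    .selectFrom("renders")
    .selectAll()
    .where("id", "=", renderId)
    .executeTakeFirst();

  const fragments = await db
    .selectFrom("render_fragments")
    .select(["segment_id", "attempt_number", "last_error"])
    .where("render_id", "=", renderId)
    .orderBy("segment_id")
    .execute();

  console.log(render?.status, render?.failure_detail);
  console.table(fragments);
}
```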
## Debugging Workflow
- Start with `debug-render` -- get the DB status, error detail, and fragment breakdown.
- If stuck in "rendering" -- add `--redis` to see whether jobs are queued, claimed (possibly stalled), or silently failed in Valkey.
- If Valkey state is unclear -- use `inspect-render.ts` for detailed claimed-job ages and `check-queue.ts` for queue-level counts.
- If you need worker logs -- query Cloud Run logs with `gcloud logging read` using the render ID (see commands above). In local dev, use the `--logs` flag or the `render-logs` script.
- To retry -- use `restart-render.ts` to reset DB state and re-enqueue the initializer.
- For production DB access -- use `telecine/scripts/debug-prod-web --use-prod-db --shell` to get a container with production database connectivity.
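End to end, a typical triage session using only commands from this document looks like:

```bash
# 1. DB status + fragment breakdown
telecine/scripts/debug-render <render-id>

# 2. Still "rendering"? Check live queue state
telecine/scripts/debug-render <render-id> --redis

# 3. Pull worker logs for the stuck stage
gcloud logging read \
  'resource.type="cloud_run_revision"
   AND resource.labels.service_name=~"telecine-worker"
   AND jsonPayload.renderId="<render-id>"' \
  --project=editframe --limit=50

# 4. Reset and re-enqueue
telecine/scripts/run tsx scripts/restart-render.ts <render-id>
```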
## Diagnosing Electron RPC Failures
Fragment renders fail via RPC timeout. The stack trace tells you how far rendering got:

- Timeout with no keepalives -- the initial 5 s timer fired without any keepalives being received, so Electron never started rendering. Causes: the scheduler opened more connections than the single local container can handle (see below), Electron failed to load the page, or the render context couldn't be created.
- Timeout after keepalives -- the keepalive-reset timer fired mid-render, so at least one frame rendered before the hang. Causes: a race condition aborted an in-flight fetch (the `AbortError` case below), a frame took too long, or Electron crashed mid-segment.
### AbortError / FrameController race
If Electron logs show `[EF_FRAMEGEN.beginFrame] error: [object DOMException]` / `AbortError: The user aborted a request`, a fetch started during `beginFrame` was cancelled by an autonomous re-render firing concurrently. This is a timing-dependent race: a media-load handler fires when media finishes loading and triggers a re-render, killing the in-flight GCS fetch.

Fix: set `data-no-playback-controller` on the timegroup before frame generation starts to suppress autonomous re-renders -- the same attribute used on render clones. Check the FrameController and the playback-controller sources (see the sketch below).
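A sketch of the fix; the element selector and ordering are illustrative, not telecine's actual frame-generation code:

```typescript
// Illustrative only: mark the timegroup before frame generation so
// media-load handlers can't trigger a concurrent re-render that
// aborts in-flight fetches. The selector is a placeholder.
const timegroup = document.querySelector<HTMLElement>("[data-timegroup]");
timegroup?.setAttribute("data-no-playback-controller", "");
// ...now begin frame generation; in-flight GCS fetches survive.
```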
### Scheduler over-scaling in local dev
In production, a scaling setting controls how many Cloud Run instances the scheduler spins up. Locally there is one container per queue. If the scheduler opens more WebSocket connections than the single container expects (e.g. 30 connections for 30 queued jobs), every concurrent RPC call beyond the worker's concurrency quota times out (the no-keepalive case above) before Electron starts processing it.

The two dials and how they interact:

| Dial | Production | Local dev |
|---|---|---|
| Scheduler instance cap | scales container count | must match docker-compose (usually 1) |
| Per-worker job concurrency | jobs per container | effective parallelism when only one container runs |

Both are read from the environment (via the worker containers' env and variable substitution in `scheduler-go/docker-compose.yaml`). Changing them requires `telecine/scripts/docker-compose up -d` (not just a restart) to force recreation with the new env.
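To confirm a recreated container actually picked up new values (the service name below is a placeholder):

```bash
# Force recreation with the new env (a plain restart re-uses the old one)
telecine/scripts/docker-compose up -d

# Inspect the env the container was created with
docker inspect --format '{{json .Config.Env}}' \
  "$(docker ps --filter name=<worker-service> --format '{{.ID}}' | head -1)"
```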
## Key Source Files

- `telecine/scripts/debug-render.ts` -- Primary debug tool implementation
- `telecine/scripts/inspect-render.ts` -- Low-level Valkey render inspection
- `telecine/scripts/check-queue.ts` -- Queue-level Valkey state checker
- `telecine/scripts/restart-render.ts` -- Render restart/retry tool
- `telecine/lib/queues/Queue.ts` -- Queue base class, key patterns
- `telecine/lib/queues/Workflow.ts` -- Workflow base class, workflow key patterns
- `telecine/lib/queues/Job.ts` -- Job serialization, enqueue, stall detection
- `telecine/lib/queues/units-of-work/Render/` -- Render pipeline queue/worker definitions
- `telecine/lib/queues/units-of-work/ProcessHtml/` -- HTML pipeline queue/worker definitions
- `telecine/deploy/resources/queues/configs.ts` -- Production queue names and scaling config
- `telecine/deploy/resources/queues/defineWorker.ts` -- Cloud Run service definition template
- `telecine/deploy/resources/constants.ts` -- GCP project/region constants
- `telecine/lib/valkey/valkey.ts` -- Valkey connection setup
## When to Use This Skill
Use this skill when:
- A production render is stuck, failed, or producing unexpected results
- You need to trace a render through the pipeline to find where it stalled
- You need to query Cloud Run logs for a specific render
- You need to inspect or manipulate Valkey queue state
- You need to restart a failed render
- You need to understand the render pipeline architecture