k8s Cluster Operations — joelclaw on Talos

Architecture

Mac Mini (localhost ports)
  └─ Lima SSH mux (~/.colima/_lima/colima/ssh.sock) ← NEVER KILL
      └─ Colima VM (8 CPU, 16 GiB, 100 GiB, VZ framework, aarch64)
          └─ Docker 29.x + buildx (joelclaw-builder, docker-container driver)
              └─ Talos v1.12.4 container (joelclaw-controlplane-1)
                  └─ k8s v1.35.0 (single node, Flannel CNI)
                      └─ joelclaw namespace (privileged PSA)
⚠️ Talos has NO shell. No bash, no /bin/sh, nothing. You cannot `docker exec` into the Talos container. Use `talosctl` for node operations and the Colima VM (`ssh lima-colima`) for host-level operations like `modprobe`.
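For example (a minimal sketch; the `--nodes 127.0.0.1` endpoint matches the kubeconfig regeneration command later in this doc):

```bash
# Node-level operations go through talosctl against the mapped Talos API
talosctl --talosconfig ~/.talos/config --nodes 127.0.0.1 version
talosctl --talosconfig ~/.talos/config --nodes 127.0.0.1 dmesg | tail -20

# Host-level operations go through the Colima VM instead
ssh lima-colima "sudo modprobe br_netfilter"
```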

Colima Stability Rules (2026-03-17 incident)

| Setting | Value | Reason |
|---|---|---|
| CPU | 8 | Match k8s workload requests (~2.8 CPU, 72%) |
| Memory | 16 GiB | 32 GB causes macOS memory pressure → VM kill |
| nestedVirtualization | OFF by default | Crashes VM under load (image builds, heavy scheduling). Toggle ON only for Firecracker testing |
| vmType | vz | Required for Apple Silicon |
| mountType | virtiofs | Fastest option with VZ |
`nestedVirtualization: true` is unstable on M4 Pro under load. It causes the Colima VM to silently crash during Docker builds/pushes. Each crash:
  • Kills the Talos container mid-operation
  • Corrupts Redis AOF (if caught mid-write) → crash-loop on restart
  • Breaks Lima socket forwarding → `docker` CLI on macOS disconnects
  • Creates stale k8s pods that re-pull images → amplifies pressure
Recovery from Colima crash-loop:
  1. `colima stop && colima start` — basic restart
  2. If Redis crash-loops: `redis-check-aof --fix` (see Redis AOF Recovery below)
  3. If Restate has stuck invocations: purge PVC or kill via admin API
  4. If native Docker socket dead: use SSH tunnel `ssh -L /tmp/docker.sock:/var/run/docker.sock`
Docker image builds should use the buildx container builder (`docker buildx build --builder joelclaw-builder`) to isolate build IO from k8s workloads.
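A minimal sketch of running a build through the container builder (the builder name comes from the architecture diagram above; the image name and Dockerfile path are placeholders):

```bash
# Create the isolated builder once (no-op if it already exists)
docker buildx create --name joelclaw-builder --driver docker-container --bootstrap || true

# Build and push through it instead of the default builder
docker buildx build \
  --builder joelclaw-builder \
  --platform linux/arm64 \
  -t ghcr.io/joelhooks/<image>:<tag> \
  --push .
```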

Redis AOF Recovery

If Redis crash-loops after a VM restart with `Bad file format reading the append only file`:

1. Scale down Redis (or use a temp pod if StatefulSet can't mount PVC concurrently)

```bash
kubectl -n joelclaw apply -f - <<'EOF'
apiVersion: v1
kind: Pod
metadata:
  name: redis-fix
  namespace: joelclaw
spec:
  tolerations:
    - key: node-role.kubernetes.io/control-plane
      operator: Exists
      effect: NoSchedule
  containers:
    - name: fix
      image: redis:7-alpine
      command: ["sh", "-c", "cd /data/appendonlydir && echo y | redis-check-aof --fix *.incr.aof && redis-check-aof *.incr.aof"]
      volumeMounts:
        - name: data
          mountPath: /data
  restartPolicy: Never
  volumes:
    - name: data
      persistentVolumeClaim:
        claimName: data-redis-0
EOF
```
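If the StatefulSet is still holding the PVC, scaling it down first is the simpler variant (a minimal sketch, assuming the StatefulSet is named `redis`):

```bash
kubectl -n joelclaw scale statefulset redis --replicas=0
# run the redis-fix pod above, then
kubectl -n joelclaw scale statefulset redis --replicas=1
```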

2. Wait, check logs, then clean up

```bash
kubectl -n joelclaw logs redis-fix
kubectl -n joelclaw delete pod redis-fix --force
```

3. Restart Redis

kubectl -n joelclaw delete pod redis-0

For port mappings, recovery procedures, and cluster recreation steps, read [references/operations.md](references/operations.md).

Kubeconfig Port Drift (2026-03-21 incident)

Docker port mappings for k8s API (6443) and Talos API (50000) are not pinned — they use random host ports assigned at container creation. All service ports (3111, 8288, 6379, etc.) ARE pinned 1:1.
When the Colima VM or Talos container restarts, Docker may reassign different random ports for 6443/50000. Kubeconfig goes stale, kubectl fails, and everything that depends on it (joelclaw CLI, health checks, pod inspection) breaks silently.
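To see which random host ports Docker currently assigned, inspect the Talos container directly (container name from the architecture diagram above):

```bash
docker port joelclaw-controlplane-1          # all current host→container mappings
docker port joelclaw-controlplane-1 6443     # just the k8s API mapping
```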
Symptoms: `kubectl` returns `tls: internal error` or `connection refused`. All pods are actually running — only the kubeconfig routing is wrong.
Fix:

1. Regenerate kubeconfig from talosctl (which has the correct port)

talosctl --talosconfig ~/.talos/config --nodes 127.0.0.1 kubeconfig --force

2. Switch to the new context

kubectl config use-context "$(kubectl config get-contexts -o name | grep joelclaw | head -1)"

3. Clean stale contexts (optional)

kubectl config delete-context admin@joelclaw # if stale entry exists

**Self-heal**: `health.sh` now auto-detects and fixes this before running checks.

**Root cause**: Container was created without pinning these ports. To permanently fix, recreate the container with explicit port bindings for 6443:6443 and 50000:50000. This requires cluster recreation — a bigger operation.

Quick Health Check

```bash
kubectl get pods -n joelclaw                           # all pods
curl -s localhost:3111/api/inngest                     # system-bus-worker → 200
curl -s localhost:7880/                                # LiveKit → "OK"
curl -s localhost:8108/health                          # Typesense → {"ok":true}
curl -s localhost:8288/health                          # Inngest → {"status":200}
curl -s localhost:9070/deployments                     # Restate admin → deployments list
curl -s localhost:9627/xrpc/_health                    # PDS → {"version":"..."}
kubectl exec -n joelclaw redis-0 -- redis-cli ping     # → PONG
joelclaw restate cron status                           # Dkron scheduler → healthy via temporary CLI tunnel
```

Services

| Service | Type | Pod | Ports (Mac→NodePort) | Helm? |
|---|---|---|---|---|
| Redis | StatefulSet | redis-0 | 6379→6379 | No |
| Typesense | StatefulSet | typesense-0 | 8108→8108 | No |
| Inngest | StatefulSet | inngest-0 | 8288→8288, 8289→8289 | No |
| Restate | StatefulSet | restate-0 | 8080→8080, 9070→9070, 9071→9071 | No |
| system-bus-worker | Deployment | system-bus-worker-* | 3111→3111 | No |
| restate-worker | Deployment | restate-worker-* | in-cluster only (`restate-worker:9080`) | No |
| docs-api | Deployment | docs-api-* | 3838→3838 | No |
| LiveKit | Deployment | livekit-server-* | 7880→7880, 7881→7881 | Yes (livekit/livekit-server 1.9.0) |
| PDS | Deployment | bluesky-pds-* | 9627→3000 | Yes (nerkho/bluesky-pds 0.4.2) |
| MinIO | StatefulSet | minio-0 | 30900→30900, 30901→30901 | No |
| Dkron | StatefulSet | dkron-0 | in-cluster only (`dkron-svc:8080`) | No |
| AIStor Operator (`aistor` ns) | Deployments | adminjob-operator, object-store-operator | n/a | Yes (`minio/aistor-operator`) |
| AIStor ObjectStore (`aistor` ns) | StatefulSet | aistor-s3-pool-0-0 | 31000 (S3 TLS), 31001 (console) | Yes (`minio/aistor-objectstore`) |

Restate / Firecracker runtime notes

  • `deployment/restate-worker` is intentionally privileged and mounts `/dev/kvm` (hostPath type `""` — optional).
  • PVC `firecracker-images` at `/tmp/firecracker-test` stores kernel, rootfs, and snapshot artifacts.
  • When `nestedVirtualization` is OFF: `/dev/kvm` absent, `microvm` DAG handler fails, but `shell` / `infer` / `noop` handlers work normally.
  • When `nestedVirtualization` is ON: Firecracker one-shot exec works (create workspace ext4 → write command → boot VM → guest executes → poweroff → read results).
  • Restate retry caps: dagWorker maxAttempts=5, dagOrchestrator maxAttempts=3. Prevents journal poisoning.
  • Restate journal purge (if stuck invocations block work): scale down Restate, mount PVC with temp pod, `rm -rf /restate-data/*`, scale back up, re-register worker (see the sketch after this list).
  • Re-register worker: `curl -X POST http://localhost:9070/deployments -H 'content-type: application/json' -d '{"uri":"http://restate-worker:9080"}'`
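A minimal sketch of the journal purge; the Restate data PVC name is shown as a placeholder — check the real claim name with `kubectl -n joelclaw get pvc` before running:

```bash
kubectl -n joelclaw scale statefulset restate --replicas=0

kubectl -n joelclaw apply -f - <<'EOF'
apiVersion: v1
kind: Pod
metadata:
  name: restate-purge
  namespace: joelclaw
spec:
  tolerations:
    - key: node-role.kubernetes.io/control-plane
      operator: Exists
      effect: NoSchedule
  containers:
    - name: purge
      image: busybox
      command: ["sh", "-c", "rm -rf /restate-data/*"]
      volumeMounts:
        - name: data
          mountPath: /restate-data
  restartPolicy: Never
  volumes:
    - name: data
      persistentVolumeClaim:
        claimName: <restate-data-pvc>   # placeholder: look up the real claim name first
EOF

kubectl -n joelclaw logs restate-purge && kubectl -n joelclaw delete pod restate-purge --force
kubectl -n joelclaw scale statefulset restate --replicas=1
curl -X POST http://localhost:9070/deployments -H 'content-type: application/json' -d '{"uri":"http://restate-worker:9080"}'
```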
⚠️ PDS port trap: Docker maps `9627→3000` (host→container). NodePort must be 3000 to match the container-side port. If set to 9627, traffic won't route.
Rule: NodePort value = Docker's container-side port, not host-side.
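A quick way to confirm the live NodePort matches the container-side port (service name as deployed by the Helm chart in Deploy Commands below; adjust if the release name differs):

```bash
kubectl -n joelclaw get svc bluesky-pds -o jsonpath='{.spec.ports[0].nodePort}'   # expect 3000
```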

Agent Runner (Cold k8s Jobs)

Status: local sandbox remains the default/live path; the k8s backend is now code-landed and opt-in, but still needs supervised rollout before calling it earned runtime.
The agent runner executes sandboxed story runs as isolated k8s Jobs. Jobs are created dynamically via `@joelclaw/agent-execution/job-spec` — no static manifests.

Runtime Image Contract

See `k8s/agent-runner.yaml` for the full specification.
Required components:
  • Git (checkout, diff, commit)
  • Bun runtime
  • runner-installed agent tooling (currently `claude` and/or other installed CLIs)
  • `/workspace` working directory
  • runtime entrypoint at `/app/packages/agent-execution/src/job-runner.ts`
Configuration via environment variables:
  • Request metadata: `WORKFLOW_ID`, `REQUEST_ID`, `STORY_ID`, `SANDBOX_PROFILE`, `BASE_SHA`, `EXECUTION_BACKEND`, `JOB_NAME`, `JOB_NAMESPACE`
  • Repo materialization: `REPO_URL`, `REPO_BRANCH`, optional `HOST_REQUESTED_CWD`
  • Agent identity: `AGENT_NAME`, `AGENT_MODEL`, `AGENT_VARIANT`, `AGENT_PROGRAM`
  • Execution config: `SESSION_ID`, `TIMEOUT_SECONDS`
  • Task prompt: `TASK_PROMPT_B64` (base64-encoded)
  • Verification: `VERIFICATION_COMMANDS_B64` (base64-encoded JSON array)
  • Callback path: `RESULT_CALLBACK_URL`, `RESULT_CALLBACK_TOKEN`
Expected behavior:
  1. Decode task from `TASK_PROMPT_B64`
  2. Materialize repo from `REPO_URL` / `REPO_BRANCH` at `BASE_SHA`
  3. Execute the requested `AGENT_PROGRAM`
  4. Run verification commands (if set)
  5. Print `SandboxExecutionResult` markers to stdout and POST the same result to `/internal/agent-result`
  6. Exit 0 (success) or non-zero (failure)
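For orientation, a minimal sketch of steps 1 and 5 as shell commands inside the runtime container; the auth header name and payload shape are assumptions, the real runner is the Bun entrypoint above:

```bash
# Step 1: decode the task prompt handed in by the Job spec
TASK_PROMPT="$(printf '%s' "$TASK_PROMPT_B64" | base64 -d)"

# Step 5: POST the terminal result back to the host worker
curl -fsS -X POST "$RESULT_CALLBACK_URL" \
  -H "authorization: Bearer $RESULT_CALLBACK_TOKEN" \
  -H 'content-type: application/json' \
  -d '{"requestId":"'"$REQUEST_ID"'","status":"success"}'   # payload shape is illustrative only
```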
Current truthful limit:
  • `pi` remains local-backend only for now; do not pretend the pod runner can execute pi story runs yet.

Job Lifecycle

```typescript
import { generateJobSpec, generateJobDeletion } from "@joelclaw/agent-execution";

// 1. Generate Job spec
const spec = generateJobSpec(request, {
  runtime: {
    image: "ghcr.io/joelhooks/agent-runner:latest",
    imagePullPolicy: "Always",
    command: ["bun", "run", "/app/packages/agent-execution/src/job-runner.ts"],
  },
  namespace: "joelclaw",
  imagePullSecret: "ghcr-pull",
  resultCallbackUrl: "http://host.docker.internal:3111/internal/agent-result",
  resultCallbackToken: process.env.OTEL_EMIT_TOKEN,
});

// 2. Apply to cluster (via kubectl or k8s client library)
// 3. Job runs → Pod materializes repo, executes agent, posts SandboxExecutionResult callback
// 4. Host worker can recover the same terminal result from log markers if callback delivery fails
// 5. Job auto-deletes after TTL (default: 5 minutes)

// Cancel a running Job
const deletion = generateJobDeletion("req-xyz");
// kubectl delete job ${deletion.name} -n ${deletion.namespace}
```
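One way to hand the generated spec to the cluster from a script, assuming it has been serialized to JSON on disk and `jq` is available (file name is a placeholder):

```bash
kubectl apply -f job-spec.json -n joelclaw
kubectl wait --for=condition=complete "job/$(jq -r .metadata.name job-spec.json)" -n joelclaw --timeout=1h
```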

Resource Defaults

  • CPU: `500m` request, `2` limit
  • Memory: `1Gi` request, `4Gi` limit
  • Active deadline: 1 hour
  • TTL after completion: 5 minutes
  • Backoff limit: `0` (no retries)

Security

  • Non-root execution (UID 1000, GID 1000)
  • No privilege escalation
  • All capabilities dropped
  • RuntimeDefault seccomp profile
  • Control plane toleration for single-node cluster

Verification Commands


List agent runner Jobs

kubectl get jobs -n joelclaw -l app.kubernetes.io/name=agent-runner

Check Job status

kubectl describe job <job-name> -n joelclaw

View logs

kubectl logs job/<job-name> -n joelclaw

Check for stale Jobs (should be auto-deleted by TTL)

kubectl get jobs -n joelclaw

Current State

  • ✅ Job spec generator (`packages/agent-execution/src/job-spec.ts`)
  • ✅ Runtime contract (`k8s/agent-runner.yaml`)
  • ✅ Tests (`packages/agent-execution/__tests__/job-spec.test.ts`)
  • ⏳ Runtime image not yet built (Story 3)
  • ⏳ Hot-image CronJob not yet implemented (Story 4)
  • ⏳ Warm-pool scheduler not yet implemented (Story 5)
  • ⏳ Restate integration not yet wired (Story 6)

NAS NFS Access from k8s (ADR-0088 Phase 2.5)

k8s pods can mount NAS storage over NFS via a LAN route through the Colima bridge.

How it works

k8s pod → Talos container (10.5.0.x) → Docker NAT → Colima VM
  → ip route 192.168.1.0/24 via 192.168.64.1 dev col0
  → macOS host (IP forwarding enabled) → LAN → NAS (192.168.1.163)
Root cause of prior failures: VZ framework's shared networking on eth0 doesn't properly forward LAN-bound traffic. The fix routes LAN traffic through col0 (Colima bridge → macOS host) instead.
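A quick way to confirm each hop is in place after a restart (the sysctl name is the standard macOS IP-forwarding knob; adjust if the host uses a different mechanism):

```bash
# Route inside the Colima VM
colima ssh -- ip route | grep 192.168.1.0

# IP forwarding on the macOS host (expect: 1)
sysctl net.inet.ip.forwarding

# NFS reachability from the VM
colima ssh -- timeout 2 bash -c "echo > /dev/tcp/192.168.1.163/2049" && echo "NFS OK"
```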

Route persistence

The LAN route is set in two places for reliability:
  1. Colima provision script (`~/.colima/default/colima.yaml`) — runs on `colima start` (cold boot)
  2. colima-tunnel script (`~/.local/bin/colima-tunnel`) — runs on tunnel restart (covers warm resume)
Both execute:
`ip route replace 192.168.1.0/24 via 192.168.64.1 dev col0`

Available PVs

| PV | NFS Path | Capacity | Access | Use |
|---|---|---|---|---|
| nas-nvme | 192.168.1.163:/volume2/data | 1.5TB | RWX | NVMe RAID1: backups, snapshots, models, sessions |
| nas-hdd | 192.168.1.163:/volume1/joelclaw | 50TB | RWX | HDD RAID5: books, docs-artifacts, archives, otel |
| minio-nfs-pv | 192.168.1.163:/volume1/joelclaw | 1TB | RWO | HDD tier: MinIO object storage (same export) |

Mounting NAS in a pod

```yaml
volumes:
  - name: nas
    persistentVolumeClaim:
      claimName: nas-nvme
containers:
  - volumeMounts:
      - name: nas
        mountPath: /nas
        # Optional: subPath for specific dir
        subPath: typesense
```

Rules

  • Always use IP (192.168.1.163), never hostname (three-body). DNS doesn't resolve from inside k8s.
  • Always use `nfsvers=3,tcp,resvport,noatime` mount options. NFSv4 has issues with Asustor ADM (see the manual-mount sketch after this list).
  • NAS unavailability degrades gracefully with the `soft` mount option — returns errors, doesn't hang pods.
  • NFS write performance: ~660 MiB/s over 10GbE with jumbo frames. Good for sequential I/O (backups, snapshots). Latency-sensitive workloads (Redis, active Typesense indexes) stay on local SSD.
  • If NFS mount fails after Colima restart, verify the route exists: `colima ssh -- ip route | grep 192.168.1.0`
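To test the export and mount options by hand from the Colima VM (a debugging sketch; the mount point is a throwaway directory):

```bash
colima ssh -- sudo mkdir -p /mnt/nas-test
colima ssh -- sudo mount -t nfs -o nfsvers=3,tcp,resvport,noatime,soft 192.168.1.163:/volume2/data /mnt/nas-test
colima ssh -- ls /mnt/nas-test
colima ssh -- sudo umount /mnt/nas-test
```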

Verify connectivity


From Colima VM

colima ssh -- timeout 2 bash -c "echo > /dev/tcp/192.168.1.163/2049" && echo "NFS OK"

From k8s pod

```bash
kubectl run nfs-test --image=busybox --restart=Never -n joelclaw \
  --overrides='{"spec":{"tolerations":[{"key":"node-role.kubernetes.io/control-plane","operator":"Exists","effect":"NoSchedule"}],"containers":[{"name":"t","image":"busybox","command":["sh","-c","ls /nas && echo OK"],"volumeMounts":[{"name":"n","mountPath":"/nas"}]}],"volumes":[{"name":"n","persistentVolumeClaim":{"claimName":"nas-nvme"}}]}}'
kubectl logs nfs-test -n joelclaw && kubectl delete pod nfs-test -n joelclaw --force
```

Deploy Commands


Manifests (redis, typesense, inngest, dkron)

kubectl apply -f ~/Code/joelhooks/joelclaw/k8s/

Restate runtime

```bash
kubectl apply -f ~/Code/joelhooks/joelclaw/k8s/restate.yaml
kubectl apply -f ~/Code/joelhooks/joelclaw/k8s/firecracker-pvc.yaml
kubectl rollout status statefulset/restate -n joelclaw
~/Code/joelhooks/joelclaw/k8s/publish-restate-worker.sh
curl -fsS http://localhost:9070/deployments
```

Dkron phase-1 scheduler (ClusterIP API + CLI-managed short-lived tunnel access)

```bash
kubectl apply -f ~/Code/joelhooks/joelclaw/k8s/dkron.yaml
kubectl rollout status statefulset/dkron -n joelclaw
joelclaw restate cron status
joelclaw restate cron sync-tier1   # seed/update ADR-0216 tier-1 jobs
```

system-bus worker (build + push GHCR + apply + rollout wait)

~/Code/joelhooks/joelclaw/k8s/publish-system-bus-worker.sh

LiveKit (Helm + reconcile patches)

~/Code/joelhooks/joelclaw/k8s/reconcile-livekit.sh joelclaw

AIStor (Helm operator + objectstore)

Defaults to isolated `aistor` namespace to avoid service-name collisions with legacy `joelclaw/minio`.


Cutover override (explicit only): `AISTOR_OBJECTSTORE_NAMESPACE=joelclaw AISTOR_ALLOW_JOELCLAW_NAMESPACE=true`

~/Code/joelhooks/joelclaw/k8s/reconcile-aistor.sh

PDS (Helm) — always patch NodePort to 3000

(export current values first if the release already exists)

```bash
helm get values bluesky-pds -n joelclaw > /tmp/pds-values-live.yaml 2>/dev/null || true
helm upgrade --install bluesky-pds nerkho/bluesky-pds \
  -n joelclaw -f /tmp/pds-values-live.yaml
kubectl patch svc bluesky-pds -n joelclaw --type='json' \
  -p='[{"op":"replace","path":"/spec/ports/0/nodePort","value":3000}]'
```

Auto Deploy (GitHub Actions)

  • Workflow: `.github/workflows/system-bus-worker-deploy.yml`
  • Trigger: push to `main` touching `packages/system-bus/**` or worker deploy files
  • Behavior:
    • builds/pushes `ghcr.io/joelhooks/system-bus-worker:${GITHUB_SHA}` + `:latest`
    • runs deploy job on `self-hosted` runner
    • updates k8s deployment image + waits for rollout + probes worker health
  • If deploy job is queued forever, check that a `self-hosted` runner is online on the Mac Mini (see the sketch below).
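A couple of `gh` commands that are handy for checking on the workflow from the Mac Mini (assuming the `gh` CLI is authenticated against this repo):

```bash
gh run list --workflow=system-bus-worker-deploy.yml --limit 5   # recent runs and their status
gh run watch                                                    # pick a run to follow live
```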

GHCR push 403 Forbidden

Cause: `GITHUB_TOKEN` (default Actions token) does not have `packages:write` scope for this repo. A dedicated PAT is required.
Fix already applied: the workflow uses `secrets.GHCR_PAT` (not `secrets.GITHUB_TOKEN`) for the GHCR login step. The PAT is stored in:
  • GitHub repo secrets as `GHCR_PAT` (set via GitHub UI)
  • agent-secrets as `ghcr_pat` (`secrets lease ghcr_pat`)
If this breaks again: the PAT may have expired. Regenerate at github.com → Settings → Developer settings → PATs, update both stores.
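Rotating the GitHub-side copy can also be done from the CLI (repo path assumed from the file paths above; the agent-secrets side must be updated through that store's own tooling):

```bash
gh secret set GHCR_PAT --repo joelhooks/joelclaw   # paste the new PAT when prompted
```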
Local fallback (bypass GHA entirely):
```bash
DOCKER_CONFIG_DIR=$(mktemp -d)
echo '{"credsStore":""}' > "$DOCKER_CONFIG_DIR/config.json"
export DOCKER_CONFIG="$DOCKER_CONFIG_DIR"
secrets lease ghcr_pat | docker login ghcr.io -u joelhooks --password-stdin
~/Code/joelhooks/joelclaw/k8s/publish-system-bus-worker.sh
```
Note: `publish-system-bus-worker.sh` uses `gh auth token` internally — if `gh auth` is stale, use the Docker login above before running the script, or patch it to use `secrets lease ghcr_pat` directly.

Resilience Rules (ADR-0148)

  1. NEVER use `kubectl port-forward` for persistent service exposure. All long-lived operator surfaces MUST use NodePort + Docker port mappings. The narrow exception is a CLI-managed, short-lived tunnel for an otherwise in-cluster-only control surface (for example `joelclaw restate cron *` tunneling to `dkron-svc`). Port-forwards silently die on idle/restart/pod changes, so do not leave them running.
  2. All workloads MUST have liveness + readiness + startup probes. Missing probes = silent hangs that never recover.
  3. After any Docker/Colima/node restart: remove control-plane taint, uncordon node, verify flannel, check all pods reach Running.
  4. PVC reclaimPolicy is Delete — deleting a PVC = permanent data loss. Never delete PVCs without backup.
  5. `firecracker-images` is stateful runtime data. Treat it like a real runtime PVC: kernel, rootfs, and snapshot loss will break the microVM path.
  6. Colima VM disk is limited (19GB). Monitor with `colima ssh -- df -h /`. Alert at >80% (see the sketch after this list).
  7. All launchd plists MUST set PATH including `/opt/homebrew/bin`. Colima shells to `limactl`, kubectl/talosctl live in homebrew. launchd's default PATH is `/usr/bin:/bin:/usr/sbin:/sbin` — no homebrew. The canonical PATH for infra plists is: `/opt/homebrew/bin:/Users/joel/.local/bin:/usr/local/bin:/usr/bin:/bin:/usr/sbin:/sbin`. Discovered Feb 2026: missing PATH caused 6 days of silent recovery failures.
  8. Shell scripts run by launchd MUST export PATH at the top. Even if the plist sets EnvironmentVariables, belt-and-suspenders — add `export PATH="/opt/homebrew/bin:..."` to the script itself.
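A one-liner that could back the 80% alert, suitable for a cron/launchd check (threshold and message are illustrative):

```bash
usage=$(colima ssh -- df -h / | awk 'NR==2 {print $5}' | tr -d '%')
[ "${usage:-0}" -gt 80 ] && echo "Colima VM disk at ${usage}%: clean up images or grow the disk"
```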

Current Probe Gaps (fix when touching these services)

  • Typesense: missing liveness probe (hangs won't be detected)
  • Bluesky PDS: missing readiness and startup probes
  • system-bus-worker: missing startup probe

Danger Zones

  1. Stale SSH mux socket after Colima restart — When Colima restarts (disk resize, crash recovery, `colima stop && start`), the SSH port changes but the mux socket (`~/.colima/_lima/colima/ssh.sock`) caches the old connection. Symptoms: `kubectl port-forward` fails with "tls: internal error", `kubectl get nodes` may intermittently work then fail. Fix: `rm -f ~/.colima/_lima/colima/ssh.sock && pkill -f "ssh.*colima"`, then re-establish tunnels with `ssh -o ControlPath=none`. Always verify SSH port with `colima ssh-config | grep Port` after restart.
  2. Adding Docker port mappings — can be hot-added without cluster recreation via `hostconfig.json` edit. See references/operations.md for the procedure.
  3. Inngest legacy host alias in manifests — old container-host alias may still appear in legacy configs. Worker uses connect mode, so it usually still works, but prefer explicit Talos/Colima hostnames.
  4. Colima zombie state — `colima status` reports "Running" but docker socket / SSH tunnels are dead. All k8s ports unresponsive. `colima start` is a no-op. Only `colima restart` recovers. Detect with: `ssh -F ~/.colima/_lima/colima/ssh.config lima-colima "docker info"` — if that fails while `colima status` passes, it's a zombie. The heal script handles this automatically.
  5. Talos container has NO shell — No bash, no /bin/sh. Cannot `docker exec` into it. Kernel modules like `br_netfilter` must be loaded at the Colima VM level: `ssh lima-colima "sudo modprobe br_netfilter"`.
  6. AIStor service-name collision — if AIStor objectstore is deployed in `joelclaw`, it can claim `svc/minio` and break legacy MinIO assumptions. Keep AIStor objectstore in isolated namespace (`aistor`) unless intentionally cutting over.
  7. AIStor operator webhook SSA conflict — repeated `helm upgrade` can fail on `MutatingWebhookConfiguration` `caBundle` ownership conflict. Current mitigation in this cluster: set `operators.object-store.webhook.enabled=false` in `k8s/aistor-operator-values.yaml`.
  8. MinIO pinned tag trap — `minio/minio:RELEASE.2025-10-15T17-29-55Z` is not available on Docker Hub in this environment (ErrImagePull). Legacy fallback currently relies on `minio/minio:latest`.
  9. `restate-worker` privilege is intentional. Do not “harden” away `/dev/kvm`, `privileged: true`, or the unconfined seccomp profile unless you are simultaneously changing the Firecracker runtime contract.
  10. Dkron service-name collision — never create a bare `svc/dkron`. Kubernetes injects `DKRON_*` env vars into pods, which collides with Dkron's own config parsing. Use `dkron-peer` and `dkron-svc`.
  11. Dkron PVC permissions — upstream `dkron/dkron:latest` currently needs root on the local-path PVC. Non-root hardening caused `permission denied` under `/data/raft/snapshots/permTest` and CrashLoopBackOff.

Key Files

| Path | What |
|---|---|
| `~/Code/joelhooks/joelclaw/k8s/*.yaml` | Service manifests |
| `~/Code/joelhooks/joelclaw/k8s/livekit-values.yaml` | LiveKit Helm values (source controlled) |
| `~/Code/joelhooks/joelclaw/k8s/reconcile-livekit.sh` | LiveKit Helm deploy + post-upgrade reconcile |
| `~/Code/joelhooks/joelclaw/k8s/aistor-operator-values.yaml` | AIStor operator Helm values |
| `~/Code/joelhooks/joelclaw/k8s/aistor-objectstore-values.yaml` | AIStor objectstore Helm values |
| `~/Code/joelhooks/joelclaw/k8s/reconcile-aistor.sh` | AIStor deploy + upgrade reconcile script |
| `~/Code/joelhooks/joelclaw/k8s/dkron.yaml` | Dkron scheduler StatefulSet + services |
| `~/Code/joelhooks/joelclaw/k8s/publish-system-bus-worker.sh` | Build/push/deploy system-bus worker to k8s |
| `~/Code/joelhooks/joelclaw/infra/k8s-reboot-heal.sh` | Reboot auto-heal script for Colima/Talos/taint/flannel |
| `~/Code/joelhooks/joelclaw/infra/launchd/com.joel.k8s-reboot-heal.plist` | launchd timer for reboot auto-heal |
| `~/Code/joelhooks/joelclaw/skills/k8s/references/operations.md` | Cluster operations + recovery notes |
| `~/.talos/config` | Talos client config |
| `~/.kube/config` | Kubeconfig (context: `admin@joelclaw-1`) |
| `~/.colima/default/colima.yaml` | Colima VM config |
| `~/.local/bin/colima-tunnel` | Persistent SSH tunnel + NAS route (launchd: `com.joel.colima-tunnel`) |
| `~/.local/caddy/Caddyfile` | Caddy HTTPS proxy (Tailscale) |
| `~/Code/joelhooks/joelclaw/k8s/nas-nvme-pv.yaml` | NAS NVMe NFS PV/PVC (1.5TB) |
| `~/Code/joelhooks/joelclaw/k8s/nas-hdd-pv.yaml` | NAS HDD NFS PV/PVC (50TB) |

Troubleshooting

Read references/operations.md for:
  • Recovery after Colima restart
  • Recovery after Mac reboot
  • Flannel br_netfilter crash fix
  • Full cluster recreation (nuclear option)
  • Caddy/Tailscale HTTPS proxy details
  • All port mapping details with explanation