k8s Cluster Operations — joelclaw on Talos

Architecture

Mac Mini (localhost ports)
  └─ Lima SSH mux (~/.colima/_lima/colima/ssh.sock) ← NEVER KILL
      └─ Colima VM (8 CPU, 16 GiB, 100 GiB, VZ framework, aarch64)
          └─ Docker 29.x + buildx (joelclaw-builder, docker-container driver)
              └─ Talos v1.12.4 container (joelclaw-controlplane-1)
                  └─ k8s v1.35.0 (single node, Flannel CNI)
                      └─ joelclaw namespace (privileged PSA)
⚠️ Talos has NO shell. No bash, no /bin/sh, nothing. You cannot `docker exec` into the Talos container. Use `talosctl` for node operations and the Colima VM (`ssh lima-colima`) for host-level operations like `modprobe`.
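For example (a minimal sketch; the `--nodes 127.0.0.1` endpoint matches the kubeconfig regeneration command later in this doc):

```bash
# Node-level operations go through talosctl against the mapped Talos API
talosctl --talosconfig ~/.talos/config --nodes 127.0.0.1 version
talosctl --talosconfig ~/.talos/config --nodes 127.0.0.1 dmesg | tail -20

# Host-level operations go through the Colima VM instead
ssh lima-colima "sudo modprobe br_netfilter"
```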

Colima Stability Rules (2026-03-17 incident)

| Setting | Value | Reason |
|---|---|---|
| CPU | 8 | Match k8s workload requests (~2.8 CPU, 72%) |
| Memory | 16 GiB | 32 GB causes macOS memory pressure → VM kill |
| nestedVirtualization | OFF by default | Crashes VM under load (image builds, heavy scheduling). Toggle ON only for Firecracker testing |
| vmType | vz | Required for Apple Silicon |
| mountType | virtiofs | Fastest option with VZ |
`nestedVirtualization: true` is unstable on M4 Pro under load. It causes the Colima VM to silently crash during Docker builds/pushes. Each crash:
  • Kills the Talos container mid-operation
  • Corrupts Redis AOF (if caught mid-write) → crash-loop on restart
  • Breaks Lima socket forwarding → `docker` CLI on macOS disconnects
  • Creates stale k8s pods that re-pull images → amplifies pressure
Recovery from Colima crash-loop:
  1. `colima stop && colima start` — basic restart
  2. If Redis crash-loops: `redis-check-aof --fix` (see Redis AOF Recovery below)
  3. If Restate has stuck invocations: purge PVC or kill via admin API
  4. If native Docker socket dead: use SSH tunnel `ssh -L /tmp/docker.sock:/var/run/docker.sock`
Docker image builds should use the buildx container builder (`docker buildx build --builder joelclaw-builder`) to isolate build IO from k8s workloads.
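A minimal sketch of running a build through the container builder (the builder name comes from the architecture diagram above; the image name and Dockerfile path are placeholders):

```bash
# Create the isolated builder once (no-op if it already exists)
docker buildx create --name joelclaw-builder --driver docker-container --bootstrap || true

# Build and push through it instead of the default builder
docker buildx build \
  --builder joelclaw-builder \
  --platform linux/arm64 \
  -t ghcr.io/joelhooks/<image>:<tag> \
  --push .
```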

Redis AOF Recovery

If Redis crash-loops after a VM restart with `Bad file format reading the append only file`:

1. Scale down Redis (or use a temp pod if StatefulSet can't mount PVC concurrently)

```bash
kubectl -n joelclaw apply -f - <<'EOF'
apiVersion: v1
kind: Pod
metadata:
  name: redis-fix
  namespace: joelclaw
spec:
  tolerations:
    - key: node-role.kubernetes.io/control-plane
      operator: Exists
      effect: NoSchedule
  containers:
    - name: fix
      image: redis:7-alpine
      command: ["sh", "-c", "cd /data/appendonlydir && echo y | redis-check-aof --fix *.incr.aof && redis-check-aof *.incr.aof"]
      volumeMounts:
        - name: data
          mountPath: /data
  restartPolicy: Never
  volumes:
    - name: data
      persistentVolumeClaim:
        claimName: data-redis-0
EOF
```
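If the StatefulSet is still holding the PVC, scaling it down first is the simpler variant (a minimal sketch, assuming the StatefulSet is named `redis`):

```bash
kubectl -n joelclaw scale statefulset redis --replicas=0
# run the redis-fix pod above, then
kubectl -n joelclaw scale statefulset redis --replicas=1
```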

2. Wait, check logs, then clean up

```bash
kubectl -n joelclaw logs redis-fix
kubectl -n joelclaw delete pod redis-fix --force
```

3. Restart Redis

kubectl -n joelclaw delete pod redis-0

For port mappings, recovery procedures, and cluster recreation steps, read [references/operations.md](references/operations.md).

Kubeconfig Port Drift (2026-03-21 incident)

Docker port mappings for k8s API (6443) and Talos API (50000) are not pinned — they use random host ports assigned at container creation. All service ports (3111, 8288, 6379, etc.) ARE pinned 1:1.
When the Colima VM or Talos container restarts, Docker may reassign different random ports for 6443/50000. Kubeconfig goes stale, kubectl fails, and everything that depends on it (joelclaw CLI, health checks, pod inspection) breaks silently.
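To see which random host ports Docker currently assigned, inspect the Talos container directly (container name from the architecture diagram above):

```bash
docker port joelclaw-controlplane-1          # all current host→container mappings
docker port joelclaw-controlplane-1 6443     # just the k8s API mapping
```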
Symptoms: `kubectl` returns `tls: internal error` or `connection refused`. All pods are actually running — only the kubeconfig routing is wrong.
Fix:

1. Regenerate kubeconfig from talosctl (which has the correct port)

talosctl --talosconfig ~/.talos/config --nodes 127.0.0.1 kubeconfig --force

2. Switch to the new context

kubectl config use-context "$(kubectl config get-contexts -o name | grep joelclaw | head -1)"

3. Clean stale contexts (optional)

kubectl config delete-context admin@joelclaw # if stale entry exists

**Self-heal**: `health.sh` now auto-detects and fixes this before running checks.

**Root cause**: Container was created without pinning these ports. To permanently fix, recreate the container with explicit port bindings for 6443:6443 and 50000:50000. This requires cluster recreation — a bigger operation.

Quick Health Check

```bash
kubectl get pods -n joelclaw                           # all pods
curl -s localhost:3111/api/inngest                     # system-bus-worker → 200
curl -s localhost:7880/                                # LiveKit → "OK"
curl -s localhost:8108/health                          # Typesense → {"ok":true}
curl -s localhost:8288/health                          # Inngest → {"status":200}
curl -s localhost:9070/deployments                     # Restate admin → deployments list
curl -s localhost:9627/xrpc/_health                    # PDS → {"version":"..."}
kubectl exec -n joelclaw redis-0 -- redis-cli ping     # → PONG
joelclaw restate cron status                           # Dkron scheduler → healthy via temporary CLI tunnel
```

Services

| Service | Type | Pod | Ports (Mac→NodePort) | Helm? |
|---|---|---|---|---|
| Redis | StatefulSet | redis-0 | 6379→6379 | No |
| Typesense | StatefulSet | typesense-0 | 8108→8108 | No |
| Inngest | StatefulSet | inngest-0 | 8288→8288, 8289→8289 | No |
| Restate | StatefulSet | restate-0 | 8080→8080, 9070→9070, 9071→9071 | No |
| system-bus-worker | Deployment | system-bus-worker-* | 3111→3111 | No |
| restate-worker | Deployment | restate-worker-* | in-cluster only (`restate-worker:9080`) | No |
| docs-api | Deployment | docs-api-* | 3838→3838 | No |
| LiveKit | Deployment | livekit-server-* | 7880→7880, 7881→7881 | Yes (livekit/livekit-server 1.9.0) |
| PDS | Deployment | bluesky-pds-* | 9627→3000 | Yes (nerkho/bluesky-pds 0.4.2) |
| MinIO | StatefulSet | minio-0 | 30900→30900, 30901→30901 | No |
| Dkron | StatefulSet | dkron-0 | in-cluster only (`dkron-svc:8080`) | No |
| AIStor Operator (`aistor` ns) | Deployments | adminjob-operator, object-store-operator | n/a | Yes (`minio/aistor-operator`) |
| AIStor ObjectStore (`aistor` ns) | StatefulSet | aistor-s3-pool-0-0 | 31000 (S3 TLS), 31001 (console) | Yes (`minio/aistor-objectstore`) |

Restate / Firecracker runtime notes

  • `deployment/restate-worker` is intentionally privileged and mounts `/dev/kvm` (hostPath type `""` — optional).
  • PVC `firecracker-images` at `/tmp/firecracker-test` stores kernel, rootfs, and snapshot artifacts.
  • When `nestedVirtualization` is OFF: `/dev/kvm` absent, `microvm` DAG handler fails, but `shell` / `infer` / `noop` handlers work normally.
  • When `nestedVirtualization` is ON: Firecracker one-shot exec works (create workspace ext4 → write command → boot VM → guest executes → poweroff → read results).
  • Restate retry caps: dagWorker maxAttempts=5, dagOrchestrator maxAttempts=3. Prevents journal poisoning.
  • Restate journal purge (if stuck invocations block work): scale down Restate, mount PVC with temp pod, `rm -rf /restate-data/*`, scale back up, re-register worker (see the sketch after this list).
  • Re-register worker: `curl -X POST http://localhost:9070/deployments -H 'content-type: application/json' -d '{"uri":"http://restate-worker:9080"}'`
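A minimal sketch of the journal purge; the Restate data PVC name is shown as a placeholder — check the real claim name with `kubectl -n joelclaw get pvc` before running:

```bash
kubectl -n joelclaw scale statefulset restate --replicas=0

kubectl -n joelclaw apply -f - <<'EOF'
apiVersion: v1
kind: Pod
metadata:
  name: restate-purge
  namespace: joelclaw
spec:
  tolerations:
    - key: node-role.kubernetes.io/control-plane
      operator: Exists
      effect: NoSchedule
  containers:
    - name: purge
      image: busybox
      command: ["sh", "-c", "rm -rf /restate-data/*"]
      volumeMounts:
        - name: data
          mountPath: /restate-data
  restartPolicy: Never
  volumes:
    - name: data
      persistentVolumeClaim:
        claimName: <restate-data-pvc>   # placeholder: look up the real claim name first
EOF

kubectl -n joelclaw logs restate-purge && kubectl -n joelclaw delete pod restate-purge --force
kubectl -n joelclaw scale statefulset restate --replicas=1
curl -X POST http://localhost:9070/deployments -H 'content-type: application/json' -d '{"uri":"http://restate-worker:9080"}'
```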
⚠️ PDS port trap: Docker maps `9627→3000` (host→container). NodePort must be 3000 to match the container-side port. If set to 9627, traffic won't route.
Rule: NodePort value = Docker's container-side port, not host-side.
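A quick way to confirm the live NodePort matches the container-side port (service name as deployed by the Helm chart in Deploy Commands below; adjust if the release name differs):

```bash
kubectl -n joelclaw get svc bluesky-pds -o jsonpath='{.spec.ports[0].nodePort}'   # expect 3000
```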

Agent Runner (Cold k8s Jobs)

Status: local sandbox remains the default/live path; the k8s backend is now code-landed and opt-in, but still needs supervised rollout before calling it earned runtime.
The agent runner executes sandboxed story runs as isolated k8s Jobs. Jobs are created dynamically via `@joelclaw/agent-execution/job-spec` — no static manifests.

Runtime Image Contract

See `k8s/agent-runner.yaml` for the full specification.
Required components:
  • Git (checkout, diff, commit)
  • Bun runtime
  • runner-installed agent tooling (currently `claude` and/or other installed CLIs)
  • `/workspace` working directory
  • runtime entrypoint at `/app/packages/agent-execution/src/job-runner.ts`
Configuration via environment variables:
  • Request metadata: `WORKFLOW_ID`, `REQUEST_ID`, `STORY_ID`, `SANDBOX_PROFILE`, `BASE_SHA`, `EXECUTION_BACKEND`, `JOB_NAME`, `JOB_NAMESPACE`
  • Repo materialization: `REPO_URL`, `REPO_BRANCH`, optional `HOST_REQUESTED_CWD`
  • Agent identity: `AGENT_NAME`, `AGENT_MODEL`, `AGENT_VARIANT`, `AGENT_PROGRAM`
  • Execution config: `SESSION_ID`, `TIMEOUT_SECONDS`
  • Task prompt: `TASK_PROMPT_B64` (base64-encoded)
  • Verification: `VERIFICATION_COMMANDS_B64` (base64-encoded JSON array)
  • Callback path: `RESULT_CALLBACK_URL`, `RESULT_CALLBACK_TOKEN`
Expected behavior:
  1. Decode task from `TASK_PROMPT_B64`
  2. Materialize repo from `REPO_URL` / `REPO_BRANCH` at `BASE_SHA`
  3. Execute the requested `AGENT_PROGRAM`
  4. Run verification commands (if set)
  5. Print `SandboxExecutionResult` markers to stdout and POST the same result to `/internal/agent-result`
  6. Exit 0 (success) or non-zero (failure)
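For orientation, a minimal sketch of steps 1 and 5 as shell commands inside the runtime container; the auth header name and payload shape are assumptions, the real runner is the Bun entrypoint above:

```bash
# Step 1: decode the task prompt handed in by the Job spec
TASK_PROMPT="$(printf '%s' "$TASK_PROMPT_B64" | base64 -d)"

# Step 5: POST the terminal result back to the host worker
curl -fsS -X POST "$RESULT_CALLBACK_URL" \
  -H "authorization: Bearer $RESULT_CALLBACK_TOKEN" \
  -H 'content-type: application/json' \
  -d '{"requestId":"'"$REQUEST_ID"'","status":"success"}'   # payload shape is illustrative only
```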
Current truthful limit:
  • `pi` remains local-backend only for now; do not pretend the pod runner can execute pi story runs yet.

Job Lifecycle

```typescript
import { generateJobSpec, generateJobDeletion } from "@joelclaw/agent-execution";

// 1. Generate Job spec
const spec = generateJobSpec(request, {
  runtime: {
    image: "ghcr.io/joelhooks/agent-runner:latest",
    imagePullPolicy: "Always",
    command: ["bun", "run", "/app/packages/agent-execution/src/job-runner.ts"],
  },
  namespace: "joelclaw",
  imagePullSecret: "ghcr-pull",
  resultCallbackUrl: "http://host.docker.internal:3111/internal/agent-result",
  resultCallbackToken: process.env.OTEL_EMIT_TOKEN,
});

// 2. Apply to cluster (via kubectl or k8s client library)
// 3. Job runs → Pod materializes repo, executes agent, posts SandboxExecutionResult callback
// 4. Host worker can recover the same terminal result from log markers if callback delivery fails
// 5. Job auto-deletes after TTL (default: 5 minutes)

// Cancel a running Job
const deletion = generateJobDeletion("req-xyz");
// kubectl delete job ${deletion.name} -n ${deletion.namespace}
```
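One way to hand the generated spec to the cluster from a script, assuming it has been serialized to JSON on disk and `jq` is available (file name is a placeholder):

```bash
kubectl apply -f job-spec.json -n joelclaw
kubectl wait --for=condition=complete "job/$(jq -r .metadata.name job-spec.json)" -n joelclaw --timeout=1h
```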

Resource Defaults

  • CPU: `500m` request, `2` limit
  • Memory: `1Gi` request, `4Gi` limit
  • Active deadline: 1 hour
  • TTL after completion: 5 minutes
  • Backoff limit: `0` (no retries)

Security

  • Non-root execution (UID 1000, GID 1000)
  • No privilege escalation
  • All capabilities dropped
  • RuntimeDefault seccomp profile
  • Control plane toleration for single-node cluster

Verification Commands


List agent runner Jobs

kubectl get jobs -n joelclaw -l app.kubernetes.io/name=agent-runner

Check Job status

kubectl describe job <job-name> -n joelclaw

View logs

kubectl logs job/<job-name> -n joelclaw

Check for stale Jobs (should be auto-deleted by TTL)

kubectl get jobs -n joelclaw

Current State

  • ✅ Job spec generator (`packages/agent-execution/src/job-spec.ts`)
  • ✅ Runtime contract (`k8s/agent-runner.yaml`)
  • ✅ Tests (`packages/agent-execution/__tests__/job-spec.test.ts`)
  • ⏳ Runtime image not yet built (Story 3)
  • ⏳ Hot-image CronJob not yet implemented (Story 4)
  • ⏳ Warm-pool scheduler not yet implemented (Story 5)
  • ⏳ Restate integration not yet wired (Story 6)

NAS NFS Access from k8s (ADR-0088 Phase 2.5)

k8s pods can mount NAS storage over NFS via a LAN route through the Colima bridge.

How it works

k8s pod → Talos container (10.5.0.x) → Docker NAT → Colima VM
  → ip route 192.168.1.0/24 via 192.168.64.1 dev col0
  → macOS host (IP forwarding enabled) → LAN → NAS (192.168.1.163)
Root cause of prior failures: VZ framework's shared networking on eth0 doesn't properly forward LAN-bound traffic. The fix routes LAN traffic through col0 (Colima bridge → macOS host) instead.
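A quick way to confirm each hop is in place after a restart (the sysctl name is the standard macOS IP-forwarding knob; adjust if the host uses a different mechanism):

```bash
# Route inside the Colima VM
colima ssh -- ip route | grep 192.168.1.0

# IP forwarding on the macOS host (expect: 1)
sysctl net.inet.ip.forwarding

# NFS reachability from the VM
colima ssh -- timeout 2 bash -c "echo > /dev/tcp/192.168.1.163/2049" && echo "NFS OK"
```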

Route persistence

The LAN route is set in two places for reliability:
  1. Colima provision script (`~/.colima/default/colima.yaml`) — runs on `colima start` (cold boot)
  2. colima-tunnel script (`~/.local/bin/colima-tunnel`) — runs on tunnel restart (covers warm resume)
Both execute:
`ip route replace 192.168.1.0/24 via 192.168.64.1 dev col0`

Available PVs

| PV | NFS Path | Capacity | Access | Use |
|---|---|---|---|---|
| nas-nvme | 192.168.1.163:/volume2/data | 1.5TB | RWX | NVMe RAID1: backups, snapshots, models, sessions |
| nas-hdd | 192.168.1.163:/volume1/joelclaw | 50TB | RWX | HDD RAID5: books, docs-artifacts, archives, otel |
| minio-nfs-pv | 192.168.1.163:/volume1/joelclaw | 1TB | RWO | HDD tier: MinIO object storage (same export) |

Mounting NAS in a pod

```yaml
volumes:
  - name: nas
    persistentVolumeClaim:
      claimName: nas-nvme
containers:
  - volumeMounts:
      - name: nas
        mountPath: /nas
        # Optional: subPath for specific dir
        subPath: typesense
```

Rules

  • Always use IP (192.168.1.163), never hostname (three-body). DNS doesn't resolve from inside k8s.
  • Always use `nfsvers=3,tcp,resvport,noatime` mount options. NFSv4 has issues with Asustor ADM (see the manual-mount sketch after this list).
  • NAS unavailability degrades gracefully with the `soft` mount option — returns errors, doesn't hang pods.
  • NFS write performance: ~660 MiB/s over 10GbE with jumbo frames. Good for sequential I/O (backups, snapshots). Latency-sensitive workloads (Redis, active Typesense indexes) stay on local SSD.
  • If NFS mount fails after Colima restart, verify the route exists: `colima ssh -- ip route | grep 192.168.1.0`
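To test the export and mount options by hand from the Colima VM (a debugging sketch; the mount point is a throwaway directory):

```bash
colima ssh -- sudo mkdir -p /mnt/nas-test
colima ssh -- sudo mount -t nfs -o nfsvers=3,tcp,resvport,noatime,soft 192.168.1.163:/volume2/data /mnt/nas-test
colima ssh -- ls /mnt/nas-test
colima ssh -- sudo umount /mnt/nas-test
```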

Verify connectivity


From Colima VM

colima ssh -- timeout 2 bash -c "echo > /dev/tcp/192.168.1.163/2049" && echo "NFS OK"

From k8s pod

```bash
kubectl run nfs-test --image=busybox --restart=Never -n joelclaw \
  --overrides='{"spec":{"tolerations":[{"key":"node-role.kubernetes.io/control-plane","operator":"Exists","effect":"NoSchedule"}],"containers":[{"name":"t","image":"busybox","command":["sh","-c","ls /nas && echo OK"],"volumeMounts":[{"name":"n","mountPath":"/nas"}]}],"volumes":[{"name":"n","persistentVolumeClaim":{"claimName":"nas-nvme"}}]}}'
kubectl logs nfs-test -n joelclaw && kubectl delete pod nfs-test -n joelclaw --force
```

Deploy Commands


Manifests (redis, typesense, inngest, dkron)

kubectl apply -f ~/Code/joelhooks/joelclaw/k8s/

Restate runtime

```bash
kubectl apply -f ~/Code/joelhooks/joelclaw/k8s/restate.yaml
kubectl apply -f ~/Code/joelhooks/joelclaw/k8s/firecracker-pvc.yaml
kubectl rollout status statefulset/restate -n joelclaw
~/Code/joelhooks/joelclaw/k8s/publish-restate-worker.sh
curl -fsS http://localhost:9070/deployments
```

Dkron phase-1 scheduler (ClusterIP API + CLI-managed short-lived tunnel access)

```bash
kubectl apply -f ~/Code/joelhooks/joelclaw/k8s/dkron.yaml
kubectl rollout status statefulset/dkron -n joelclaw
joelclaw restate cron status
joelclaw restate cron sync-tier1   # seed/update ADR-0216 tier-1 jobs
```

system-bus worker (build + push GHCR + apply + rollout wait)

~/Code/joelhooks/joelclaw/k8s/publish-system-bus-worker.sh

LiveKit (Helm + reconcile patches)

~/Code/joelhooks/joelclaw/k8s/reconcile-livekit.sh joelclaw

AIStor (Helm operator + objectstore)

Defaults to isolated `aistor` namespace to avoid service-name collisions with legacy `joelclaw/minio`.


Cutover override (explicit only): `AISTOR_OBJECTSTORE_NAMESPACE=joelclaw AISTOR_ALLOW_JOELCLAW_NAMESPACE=true`

~/Code/joelhooks/joelclaw/k8s/reconcile-aistor.sh

PDS (Helm) — always patch NodePort to 3000

(export current values first if the release already exists)

```bash
helm get values bluesky-pds -n joelclaw > /tmp/pds-values-live.yaml 2>/dev/null || true
helm upgrade --install bluesky-pds nerkho/bluesky-pds \
  -n joelclaw -f /tmp/pds-values-live.yaml
kubectl patch svc bluesky-pds -n joelclaw --type='json' \
  -p='[{"op":"replace","path":"/spec/ports/0/nodePort","value":3000}]'
```

Auto Deploy (GitHub Actions)

  • Workflow: `.github/workflows/system-bus-worker-deploy.yml`
  • Trigger: push to `main` touching `packages/system-bus/**` or worker deploy files
  • Behavior:
    • builds/pushes `ghcr.io/joelhooks/system-bus-worker:${GITHUB_SHA}` + `:latest`
    • runs deploy job on `self-hosted` runner
    • updates k8s deployment image + waits for rollout + probes worker health
  • If deploy job is queued forever, check that a `self-hosted` runner is online on the Mac Mini (see the sketch below).
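A couple of `gh` commands that are handy for checking on the workflow from the Mac Mini (assuming the `gh` CLI is authenticated against this repo):

```bash
gh run list --workflow=system-bus-worker-deploy.yml --limit 5   # recent runs and their status
gh run watch                                                    # pick a run to follow live
```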

GHCR push 403 Forbidden

Cause: `GITHUB_TOKEN` (default Actions token) does not have `packages:write` scope for this repo. A dedicated PAT is required.
Fix already applied: the workflow uses `secrets.GHCR_PAT` (not `secrets.GITHUB_TOKEN`) for the GHCR login step. The PAT is stored in:
  • GitHub repo secrets as `GHCR_PAT` (set via GitHub UI)
  • agent-secrets as `ghcr_pat` (`secrets lease ghcr_pat`)
If this breaks again: the PAT may have expired. Regenerate at github.com → Settings → Developer settings → PATs, update both stores.
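Rotating the GitHub-side copy can also be done from the CLI (repo path assumed from the file paths above; the agent-secrets side must be updated through that store's own tooling):

```bash
gh secret set GHCR_PAT --repo joelhooks/joelclaw   # paste the new PAT when prompted
```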
Local fallback (bypass GHA entirely):
```bash
DOCKER_CONFIG_DIR=$(mktemp -d)
echo '{"credsStore":""}' > "$DOCKER_CONFIG_DIR/config.json"
export DOCKER_CONFIG="$DOCKER_CONFIG_DIR"
secrets lease ghcr_pat | docker login ghcr.io -u joelhooks --password-stdin
~/Code/joelhooks/joelclaw/k8s/publish-system-bus-worker.sh
```
Note: `publish-system-bus-worker.sh` uses `gh auth token` internally — if `gh auth` is stale, use the Docker login above before running the script, or patch it to use `secrets lease ghcr_pat` directly.

Resilience Rules (ADR-0148)

  1. NEVER use `kubectl port-forward` for persistent service exposure. All long-lived operator surfaces MUST use NodePort + Docker port mappings. The narrow exception is a CLI-managed, short-lived tunnel for an otherwise in-cluster-only control surface (for example `joelclaw restate cron *` tunneling to `dkron-svc`). Port-forwards silently die on idle/restart/pod changes, so do not leave them running.
  2. All workloads MUST have liveness + readiness + startup probes. Missing probes = silent hangs that never recover.
  3. After any Docker/Colima/node restart: remove control-plane taint, uncordon node, verify flannel, check all pods reach Running.
  4. PVC reclaimPolicy is Delete — deleting a PVC = permanent data loss. Never delete PVCs without backup.
  5. `firecracker-images` is stateful runtime data. Treat it like a real runtime PVC: kernel, rootfs, and snapshot loss will break the microVM path.
  6. Colima VM disk is limited (19GB). Monitor with `colima ssh -- df -h /`. Alert at >80% (see the sketch after this list).
  7. All launchd plists MUST set PATH including `/opt/homebrew/bin`. Colima shells to `limactl`, kubectl/talosctl live in homebrew. launchd's default PATH is `/usr/bin:/bin:/usr/sbin:/sbin` — no homebrew. The canonical PATH for infra plists is: `/opt/homebrew/bin:/Users/joel/.local/bin:/usr/local/bin:/usr/bin:/bin:/usr/sbin:/sbin`. Discovered Feb 2026: missing PATH caused 6 days of silent recovery failures.
  8. Shell scripts run by launchd MUST export PATH at the top. Even if the plist sets EnvironmentVariables, belt-and-suspenders — add `export PATH="/opt/homebrew/bin:..."` to the script itself.
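A one-liner that could back the 80% alert, suitable for a cron/launchd check (threshold and message are illustrative):

```bash
usage=$(colima ssh -- df -h / | awk 'NR==2 {print $5}' | tr -d '%')
[ "${usage:-0}" -gt 80 ] && echo "Colima VM disk at ${usage}%: clean up images or grow the disk"
```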

Current Probe Gaps (fix when touching these services)

  • Typesense: missing liveness probe (hangs won't be detected)
  • Bluesky PDS: missing readiness and startup probes
  • system-bus-worker: missing startup probe

Danger Zones

  1. Stale SSH mux socket after Colima restart — When Colima restarts (disk resize, crash recovery, `colima stop && start`), the SSH port changes but the mux socket (`~/.colima/_lima/colima/ssh.sock`) caches the old connection. Symptoms: `kubectl port-forward` fails with "tls: internal error", `kubectl get nodes` may intermittently work then fail. Fix: `rm -f ~/.colima/_lima/colima/ssh.sock && pkill -f "ssh.*colima"`, then re-establish tunnels with `ssh -o ControlPath=none`. Always verify SSH port with `colima ssh-config | grep Port` after restart.
  2. Adding Docker port mappings — can be hot-added without cluster recreation via `hostconfig.json` edit. See references/operations.md for the procedure.
  3. Inngest legacy host alias in manifests — old container-host alias may still appear in legacy configs. Worker uses connect mode, so it usually still works, but prefer explicit Talos/Colima hostnames.
  4. Colima zombie state — `colima status` reports "Running" but docker socket / SSH tunnels are dead. All k8s ports unresponsive. `colima start` is a no-op. Only `colima restart` recovers. Detect with: `ssh -F ~/.colima/_lima/colima/ssh.config lima-colima "docker info"` — if that fails while `colima status` passes, it's a zombie. The heal script handles this automatically.
  5. Talos container has NO shell — No bash, no /bin/sh. Cannot `docker exec` into it. Kernel modules like `br_netfilter` must be loaded at the Colima VM level: `ssh lima-colima "sudo modprobe br_netfilter"`.
  6. AIStor service-name collision — if AIStor objectstore is deployed in `joelclaw`, it can claim `svc/minio` and break legacy MinIO assumptions. Keep AIStor objectstore in isolated namespace (`aistor`) unless intentionally cutting over.
  7. AIStor operator webhook SSA conflict — repeated `helm upgrade` can fail on `MutatingWebhookConfiguration` `caBundle` ownership conflict. Current mitigation in this cluster: set `operators.object-store.webhook.enabled=false` in `k8s/aistor-operator-values.yaml`.
  8. MinIO pinned tag trap — `minio/minio:RELEASE.2025-10-15T17-29-55Z` is not available on Docker Hub in this environment (ErrImagePull). Legacy fallback currently relies on `minio/minio:latest`.
  9. `restate-worker` privilege is intentional. Do not “harden” away `/dev/kvm`, `privileged: true`, or the unconfined seccomp profile unless you are simultaneously changing the Firecracker runtime contract.
  10. Dkron service-name collision — never create a bare `svc/dkron`. Kubernetes injects `DKRON_*` env vars into pods, which collides with Dkron's own config parsing. Use `dkron-peer` and `dkron-svc`.
  11. Dkron PVC permissions — upstream `dkron/dkron:latest` currently needs root on the local-path PVC. Non-root hardening caused `permission denied` under `/data/raft/snapshots/permTest` and CrashLoopBackOff.

Key Files

| Path | What |
|---|---|
| `~/Code/joelhooks/joelclaw/k8s/*.yaml` | Service manifests |
| `~/Code/joelhooks/joelclaw/k8s/livekit-values.yaml` | LiveKit Helm values (source controlled) |
| `~/Code/joelhooks/joelclaw/k8s/reconcile-livekit.sh` | LiveKit Helm deploy + post-upgrade reconcile |
| `~/Code/joelhooks/joelclaw/k8s/aistor-operator-values.yaml` | AIStor operator Helm values |
| `~/Code/joelhooks/joelclaw/k8s/aistor-objectstore-values.yaml` | AIStor objectstore Helm values |
| `~/Code/joelhooks/joelclaw/k8s/reconcile-aistor.sh` | AIStor deploy + upgrade reconcile script |
| `~/Code/joelhooks/joelclaw/k8s/dkron.yaml` | Dkron scheduler StatefulSet + services |
| `~/Code/joelhooks/joelclaw/k8s/publish-system-bus-worker.sh` | Build/push/deploy system-bus worker to k8s |
| `~/Code/joelhooks/joelclaw/infra/k8s-reboot-heal.sh` | Reboot auto-heal script for Colima/Talos/taint/flannel |
| `~/Code/joelhooks/joelclaw/infra/launchd/com.joel.k8s-reboot-heal.plist` | launchd timer for reboot auto-heal |
| `~/Code/joelhooks/joelclaw/skills/k8s/references/operations.md` | Cluster operations + recovery notes |
| `~/.talos/config` | Talos client config |
| `~/.kube/config` | Kubeconfig (context: `admin@joelclaw-1`) |
| `~/.colima/default/colima.yaml` | Colima VM config |
| `~/.local/bin/colima-tunnel` | Persistent SSH tunnel + NAS route (launchd: `com.joel.colima-tunnel`) |
| `~/.local/caddy/Caddyfile` | Caddy HTTPS proxy (Tailscale) |
| `~/Code/joelhooks/joelclaw/k8s/nas-nvme-pv.yaml` | NAS NVMe NFS PV/PVC (1.5TB) |
| `~/Code/joelhooks/joelclaw/k8s/nas-hdd-pv.yaml` | NAS HDD NFS PV/PVC (50TB) |

Troubleshooting

Read references/operations.md for:
  • Recovery after Colima restart
  • Recovery after Mac reboot
  • Flannel br_netfilter crash fix
  • Full cluster recreation (nuclear option)
  • Caddy/Tailscale HTTPS proxy details
  • All port mapping details with explanation