k8s Cluster Operations — joelclaw on Talos
Architecture
```
Mac Mini (localhost ports)
└─ Lima SSH mux (~/.colima/_lima/colima/ssh.sock) ← NEVER KILL
   └─ Colima VM (8 CPU, 16 GiB, 100 GiB, VZ framework, aarch64)
      └─ Docker 29.x + buildx (joelclaw-builder, docker-container driver)
         └─ Talos v1.12.4 container (joelclaw-controlplane-1)
            └─ k8s v1.35.0 (single node, Flannel CNI)
               └─ joelclaw namespace (privileged PSA)
```

⚠️ Talos has NO shell. No bash, no /bin/sh, nothing. You cannot `docker exec` into the Talos container. Use `talosctl` for node operations and the Colima VM (`ssh lima-colima`) for host-level operations like `modprobe`.
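For example, a host-level operation like loading a kernel module goes through the Colima VM, while node-level inspection goes through the Talos API (a minimal sketch; the talosconfig path and node address match the kubeconfig-regeneration commands later in this document):

```bash
# Host-level: load br_netfilter inside the Colima VM (the Talos container has no shell)
ssh lima-colima "sudo modprobe br_netfilter"

# Node-level: talk to the Talos API instead of exec-ing into the container
talosctl --talosconfig ~/.talos/config --nodes 127.0.0.1 version
talosctl --talosconfig ~/.talos/config --nodes 127.0.0.1 services
```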
Colima Stability Rules (2026-03-17 incident)
| Setting | Value | Reason |
|---|---|---|
| CPU | 8 | Match k8s workload requests (~2.8 CPU, 72%) |
| Memory | 16 GiB | 32GB causes macOS memory pressure → VM kill |
| nestedVirtualization | OFF by default | Crashes VM under load (image builds, heavy scheduling). Toggle ON only for Firecracker testing |
| vmType | vz | Required for Apple Silicon |
| mountType | virtiofs | Fastest option with VZ |
What `nestedVirtualization: true` does under load:
- Kills the Talos container mid-operation
- Corrupts Redis AOF (if caught mid-write) → crash-loop on restart
- Breaks Lima socket forwarding → `docker` CLI on macOS disconnects
- Creates stale k8s pods that re-pull images → amplifies pressure

Recovery from Colima crash-loop:
- `colima stop && colima start` — basic restart
- If Redis crash-loops: `redis-check-aof --fix` (see Redis AOF Recovery below)
- If Restate has stuck invocations: purge PVC or kill via admin API
- If native Docker socket dead: use SSH tunnel: `ssh -L /tmp/docker.sock:/var/run/docker.sock`

Docker image builds should use the buildx container builder (`docker buildx build --builder joelclaw-builder`) to isolate build IO from k8s workloads.
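A typical recovery pass plus an isolated image build, using the values above (a sketch, not a managed script; the image tag is illustrative):

```bash
# Basic restart, then confirm the node and workloads come back
colima stop && colima start
kubectl get nodes
kubectl get pods -n joelclaw

# Build on the dedicated buildx builder so build IO stays off the k8s workloads
# (image tag below is illustrative)
docker buildx build --builder joelclaw-builder \
  -t ghcr.io/joelhooks/system-bus-worker:dev --load .
```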
Redis AOF Recovery
If Redis crash-loops after a VM restart with `Bad file format reading the append only file`:
undefined1. Scale down Redis (or use a temp pod if StatefulSet can't mount PVC concurrently)
```bash
kubectl -n joelclaw apply -f - <<'EOF'
apiVersion: v1
kind: Pod
metadata:
  name: redis-fix
  namespace: joelclaw
spec:
  tolerations:
    - key: node-role.kubernetes.io/control-plane
      operator: Exists
      effect: NoSchedule
  containers:
    - name: fix
      image: redis:7-alpine
      command: ["sh", "-c", "cd /data/appendonlydir && echo y | redis-check-aof --fix *.incr.aof && redis-check-aof *.incr.aof"]
      volumeMounts:
        - name: data
          mountPath: /data
  restartPolicy: Never
  volumes:
    - name: data
      persistentVolumeClaim:
        claimName: data-redis-0
EOF
```
2. Wait, check logs, then clean up
```bash
kubectl -n joelclaw logs redis-fix
kubectl -n joelclaw delete pod redis-fix --force
```
3. Restart Redis
```bash
kubectl -n joelclaw delete pod redis-0
```
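Once the pod is recreated, a quick sanity check (same pod name as above):

```bash
kubectl -n joelclaw wait --for=condition=Ready pod/redis-0 --timeout=120s
kubectl -n joelclaw exec redis-0 -- redis-cli ping   # expect PONG
```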
For port mappings, recovery procedures, and cluster recreation steps, read [references/operations.md](references/operations.md).

Kubeconfig Port Drift (2026-03-21 incident)
Docker port mappings for k8s API (6443) and Talos API (50000) are not pinned — they use random host ports assigned at container creation. All service ports (3111, 8288, 6379, etc.) ARE pinned 1:1.
When the Colima VM or Talos container restarts, Docker may reassign different random ports for 6443/50000. Kubeconfig goes stale, kubectl fails, and everything that depends on it (joelclaw CLI, health checks, pod inspection) breaks silently.
Symptoms: `kubectl` returns `tls: internal error` or `connection refused`. All pods are actually running — only the kubeconfig routing is wrong.

Fix:
undefined1. Regenerate kubeconfig from talosctl (which has the correct port)
```bash
talosctl --talosconfig ~/.talos/config --nodes 127.0.0.1 kubeconfig --force
```
2. Switch to the new context
```bash
kubectl config use-context "$(kubectl config get-contexts -o name | grep joelclaw | head -1)"
```
3. Clean stale contexts (optional)
```bash
kubectl config delete-context admin@joelclaw   # if stale entry exists
```
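To confirm the drift is actually gone, compare the API server port in the active kubeconfig with the host port Docker published (a sketch; the container name comes from the Architecture section):

```bash
kubectl config view --minify -o jsonpath='{.clusters[0].cluster.server}'; echo
docker port joelclaw-controlplane-1 6443
kubectl get nodes   # should answer without tls/connection errors
```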
**Self-heal**: `health.sh` now auto-detects and fixes this before running checks.
**Root cause**: Container was created without pinning these ports. To permanently fix, recreate the container with explicit port bindings for 6443:6443 and 50000:50000. This requires cluster recreation — a bigger operation.

Quick Health Check
```bash
kubectl get pods -n joelclaw                           # all pods
curl -s localhost:3111/api/inngest                     # system-bus-worker → 200
curl -s localhost:7880/                                # LiveKit → "OK"
curl -s localhost:8108/health                          # Typesense → {"ok":true}
curl -s localhost:8288/health                          # Inngest → {"status":200}
curl -s localhost:9070/deployments                     # Restate admin → deployments list
curl -s localhost:9627/xrpc/_health                    # PDS → {"version":"..."}
kubectl exec -n joelclaw redis-0 -- redis-cli ping     # → PONG
joelclaw restate cron status                           # Dkron scheduler → healthy via temporary CLI tunnel
```
Services
| Service | Type | Pod | Ports (Mac→NodePort) | Helm? |
|---|---|---|---|---|
| Redis | StatefulSet | redis-0 | 6379→6379 | No |
| Typesense | StatefulSet | typesense-0 | 8108→8108 | No |
| Inngest | StatefulSet | inngest-0 | 8288→8288, 8289→8289 | No |
| Restate | StatefulSet | restate-0 | 8080→8080, 9070→9070, 9071→9071 | No |
| system-bus-worker | Deployment | system-bus-worker-* | 3111→3111 | No |
| restate-worker | Deployment | restate-worker-* | in-cluster only | No |
| docs-api | Deployment | docs-api-* | 3838→3838 | No |
| LiveKit | Deployment | livekit-server-* | 7880→7880, 7881→7881 | Yes (livekit/livekit-server 1.9.0) |
| PDS | Deployment | bluesky-pds-* | 9627→3000 | Yes (nerkho/bluesky-pds 0.4.2) |
| MinIO | StatefulSet | minio-0 | 30900→30900, 30901→30901 | No |
| Dkron | StatefulSet | dkron-0 | in-cluster only | No |
| AIStor Operator | Deployments | adminjob-operator, object-store-operator | n/a | Yes |
| AIStor ObjectStore | StatefulSet | aistor-s3-pool-0-0 | 31000 (S3 TLS), 31001 (console) | Yes |
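To cross-check this table against the live cluster (sketch):

```bash
kubectl get svc,statefulsets,deployments -n joelclaw
docker ps --format '{{.Names}}\t{{.Ports}}' | grep joelclaw-controlplane-1   # host-side port mappings
```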
Restate / Firecracker runtime notes
- `deployment/restate-worker` is intentionally privileged and mounts `/dev/kvm` (hostPath type `""` — optional).
- `firecracker-images` PVC at `/tmp/firecracker-test` stores kernel, rootfs, and snapshot artifacts.
- When `nestedVirtualization` is OFF: `/dev/kvm` absent, `microvm` DAG handler fails, but `shell`/`infer`/`noop` handlers work normally.
- When `nestedVirtualization` is ON: Firecracker one-shot exec works (create workspace ext4 → write command → boot VM → guest executes → poweroff → read results).
- Restate retry caps: dagWorker maxAttempts=5, dagOrchestrator maxAttempts=3. Prevents journal poisoning.
- Restate journal purge (if stuck invocations block work): scale down Restate, mount PVC with temp pod, `rm -rf /restate-data/*`, scale back up, re-register worker (spelled out below).
- Re-register worker: `curl -X POST http://localhost:9070/deployments -H 'content-type: application/json' -d '{"uri":"http://restate-worker:9080"}'`

⚠️ PDS port trap: Docker maps `9627→3000` (host→container). NodePort must be 3000 to match the container-side port. If set to 9627, traffic won't route.
Rule: NodePort value = Docker's container-side port, not host-side.
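The journal purge above, spelled out as commands. This is a sketch: it assumes the Restate StatefulSet is named `restate` and that its PVC follows the usual `data-restate-0` naming — confirm both with `kubectl get pvc -n joelclaw` before running.

```bash
# 1. Stop Restate so nothing holds the journal open
kubectl -n joelclaw scale statefulset/restate --replicas=0

# 2. Wipe the journal from a throwaway pod (PVC name is an assumption — verify first)
kubectl -n joelclaw run restate-purge --image=busybox --restart=Never \
  --overrides='{"spec":{"tolerations":[{"key":"node-role.kubernetes.io/control-plane","operator":"Exists","effect":"NoSchedule"}],"containers":[{"name":"purge","image":"busybox","command":["sh","-c","rm -rf /restate-data/*"],"volumeMounts":[{"name":"d","mountPath":"/restate-data"}]}],"volumes":[{"name":"d","persistentVolumeClaim":{"claimName":"data-restate-0"}}]}}'
kubectl -n joelclaw logs restate-purge
kubectl -n joelclaw delete pod restate-purge

# 3. Scale back up and re-register the worker
kubectl -n joelclaw scale statefulset/restate --replicas=1
curl -X POST http://localhost:9070/deployments -H 'content-type: application/json' \
  -d '{"uri":"http://restate-worker:9080"}'
```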
Agent Runner (Cold k8s Jobs)
Status: local sandbox remains the default/live path; the k8s backend is now code-landed and opt-in, but still needs supervised rollout before calling it earned runtime.
The agent runner executes sandboxed story runs as isolated k8s Jobs. Jobs are created dynamically via `@joelclaw/agent-execution/job-spec` — no static manifests.

Runtime Image Contract
See `k8s/agent-runner.yaml` for the full specification.

Required components:
- Git (checkout, diff, commit)
- Bun runtime
- runner-installed agent tooling (currently `claude` and/or other installed CLIs)
- `/workspace` working directory
- runtime entrypoint at `/app/packages/agent-execution/src/job-runner.ts`

Configuration via environment variables:
- Request metadata: `WORKFLOW_ID`, `REQUEST_ID`, `STORY_ID`, `SANDBOX_PROFILE`, `BASE_SHA`, `EXECUTION_BACKEND`, `JOB_NAME`, `JOB_NAMESPACE`
- Repo materialization: `REPO_URL`, `REPO_BRANCH`, optional `HOST_REQUESTED_CWD`
- Agent identity: `AGENT_NAME`, `AGENT_MODEL`, `AGENT_VARIANT`, `AGENT_PROGRAM`
- Execution config: `SESSION_ID`, `TIMEOUT_SECONDS`
- Task prompt: `TASK_PROMPT_B64` (base64-encoded)
- Verification: `VERIFICATION_COMMANDS_B64` (base64-encoded JSON array)
- Callback path: `RESULT_CALLBACK_URL`, `RESULT_CALLBACK_TOKEN`

Expected behavior:
- Decode task from `TASK_PROMPT_B64`
- Materialize repo from `REPO_URL`/`REPO_BRANCH` at `BASE_SHA`
- Execute the requested `AGENT_PROGRAM`
- Run verification commands (if set)
- Print `SandboxExecutionResult` markers to stdout and POST the same result to `/internal/agent-result`
- Exit 0 (success) or non-zero (failure)

Current truthful limit:
- `pi` remains local-backend only for now; do not pretend the pod runner can execute pi story runs yet.
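To inspect what a generated Job actually received, the env vars can be read back off the Job spec (a sketch; the label selector matches the Verification Commands section below):

```bash
JOB=$(kubectl get jobs -n joelclaw -l app.kubernetes.io/name=agent-runner -o name | head -1)
kubectl get "$JOB" -n joelclaw -o jsonpath='{.spec.template.spec.containers[0].env[*].name}'; echo
# Decode the task prompt for a quick read
kubectl get "$JOB" -n joelclaw -o jsonpath='{.spec.template.spec.containers[0].env[?(@.name=="TASK_PROMPT_B64")].value}' | base64 --decode
```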
Job Lifecycle
```typescript
import { generateJobSpec, generateJobDeletion } from "@joelclaw/agent-execution";

// 1. Generate Job spec
const spec = generateJobSpec(request, {
  runtime: {
    image: "ghcr.io/joelhooks/agent-runner:latest",
    imagePullPolicy: "Always",
    command: ["bun", "run", "/app/packages/agent-execution/src/job-runner.ts"],
  },
  namespace: "joelclaw",
  imagePullSecret: "ghcr-pull",
  resultCallbackUrl: "http://host.docker.internal:3111/internal/agent-result",
  resultCallbackToken: process.env.OTEL_EMIT_TOKEN,
});

// 2. Apply to cluster (via kubectl or k8s client library)
// 3. Job runs → Pod materializes repo, executes agent, posts SandboxExecutionResult callback
// 4. Host worker can recover the same terminal result from log markers if callback delivery fails
// 5. Job auto-deletes after TTL (default: 5 minutes)

// Cancel a running Job
const deletion = generateJobDeletion("req-xyz");
// kubectl delete job ${deletion.name} -n ${deletion.namespace}
```
Resource Defaults
- CPU: `500m` request, `2` limit
- Memory: `1Gi` request, `4Gi` limit
- Active deadline: `1 hour`
- TTL after completion: `5 minutes`
- Backoff limit: `0` (no retries)
Security
- Non-root execution (UID 1000, GID 1000)
- No privilege escalation
- All capabilities dropped
- RuntimeDefault seccomp profile
- Control plane toleration for single-node cluster
Verification Commands
List agent runner Jobs
```bash
kubectl get jobs -n joelclaw -l app.kubernetes.io/name=agent-runner
```

Check Job status
```bash
kubectl describe job <job-name> -n joelclaw
```

View logs
```bash
kubectl logs job/<job-name> -n joelclaw
```

Check for stale Jobs (should be auto-deleted by TTL)
```bash
kubectl get jobs -n joelclaw
```
Current State
- ✅ Job spec generator (`packages/agent-execution/src/job-spec.ts`)
- ✅ Runtime contract (`k8s/agent-runner.yaml`)
- ✅ Tests (`packages/agent-execution/__tests__/job-spec.test.ts`)
- ⏳ Runtime image not yet built (Story 3)
- ⏳ Hot-image CronJob not yet implemented (Story 4)
- ⏳ Warm-pool scheduler not yet implemented (Story 5)
- ⏳ Restate integration not yet wired (Story 6)
NAS NFS Access from k8s (ADR-0088 Phase 2.5)
k8s pods can mount NAS storage over NFS via a LAN route through the Colima bridge.
How it works
```
k8s pod → Talos container (10.5.0.x) → Docker NAT → Colima VM
  → ip route 192.168.1.0/24 via 192.168.64.1 dev col0
  → macOS host (IP forwarding enabled) → LAN → NAS (192.168.1.163)
```

Root cause of prior failures: VZ framework's shared networking on eth0 doesn't properly forward LAN-bound traffic. The fix routes LAN traffic through col0 (Colima bridge → macOS host) instead.
Route persistence
The LAN route is set in two places for reliability:
- Colima provision script (`~/.colima/default/colima.yaml`) — runs on `colima start` (cold boot)
- colima-tunnel script (`~/.local/bin/colima-tunnel`) — runs on tunnel restart (covers warm resume)

Both execute:
```bash
ip route replace 192.168.1.0/24 via 192.168.64.1 dev col0
```
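If the route is missing after a restart, it can be reapplied by hand from the host (sketch; same route as above):

```bash
colima ssh -- sudo ip route replace 192.168.1.0/24 via 192.168.64.1 dev col0
colima ssh -- ip route | grep 192.168.1.0   # confirm the route is present
```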
Available PVs
| PV | NFS Path | Capacity | Access | Use |
|---|---|---|---|---|
| | | 1.5TB | RWX | NVMe RAID1: backups, snapshots, models, sessions |
| | | 50TB | RWX | HDD RAID5: books, docs-artifacts, archives, otel |
| | | 1TB | RWO | HDD tier: MinIO object storage (same export) |
Mounting NAS in a pod
```yaml
volumes:
  - name: nas
    persistentVolumeClaim:
      claimName: nas-nvme
containers:
  - volumeMounts:
      - name: nas
        mountPath: /nas
        # Optional: subPath for specific dir
        subPath: typesense
```
Rules
- Always use IP (192.168.1.163), never hostname (three-body). DNS doesn't resolve from inside k8s.
- Always use `nfsvers=3,tcp,resvport,noatime` mount options. NFSv4 has issues with Asustor ADM.
- NAS unavailability degrades gracefully with the `soft` mount option — returns errors, doesn't hang pods.
- NFS write performance: ~660 MiB/s over 10GbE with jumbo frames. Good for sequential I/O (backups, snapshots). Latency-sensitive workloads (Redis, active Typesense indexes) stay on local SSD.
- If NFS mount fails after Colima restart: verify the route exists: `colima ssh -- ip route | grep 192.168.1.0`
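A PV following these rules would look roughly like this — a sketch only, with an illustrative name and export path (the real definitions live in the NAS PV/PVC manifests listed under Key Files); `--dry-run=client` keeps it from touching the cluster:

```bash
kubectl apply --dry-run=client -f - <<'EOF'
apiVersion: v1
kind: PersistentVolume
metadata:
  name: nas-example          # illustrative name
spec:
  capacity:
    storage: 1500Gi
  accessModes: ["ReadWriteMany"]
  nfs:
    server: 192.168.1.163    # always the IP, never the hostname
    path: /volume1/example   # illustrative export path
  mountOptions:
    - nfsvers=3
    - tcp
    - resvport
    - noatime
    - soft                   # fail with errors instead of hanging pods
EOF
```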
Verify connectivity
From Colima VM
```bash
colima ssh -- timeout 2 bash -c "echo > /dev/tcp/192.168.1.163/2049" && echo "NFS OK"
```

From k8s pod
```bash
kubectl run nfs-test --image=busybox --restart=Never -n joelclaw \
  --overrides='{"spec":{"tolerations":[{"key":"node-role.kubernetes.io/control-plane","operator":"Exists","effect":"NoSchedule"}],"containers":[{"name":"t","image":"busybox","command":["sh","-c","ls /nas && echo OK"],"volumeMounts":[{"name":"n","mountPath":"/nas"}]}],"volumes":[{"name":"n","persistentVolumeClaim":{"claimName":"nas-nvme"}}]}}'
kubectl logs nfs-test -n joelclaw && kubectl delete pod nfs-test -n joelclaw --force
```
Deploy Commands
Manifests (redis, typesense, inngest, dkron)
```bash
kubectl apply -f ~/Code/joelhooks/joelclaw/k8s/
```
Restate runtime
```bash
kubectl apply -f ~/Code/joelhooks/joelclaw/k8s/restate.yaml
kubectl apply -f ~/Code/joelhooks/joelclaw/k8s/firecracker-pvc.yaml
kubectl rollout status statefulset/restate -n joelclaw
~/Code/joelhooks/joelclaw/k8s/publish-restate-worker.sh
curl -fsS http://localhost:9070/deployments
```
Dkron phase-1 scheduler (ClusterIP API + CLI-managed short-lived tunnel access)
```bash
kubectl apply -f ~/Code/joelhooks/joelclaw/k8s/dkron.yaml
kubectl rollout status statefulset/dkron -n joelclaw
joelclaw restate cron status
joelclaw restate cron sync-tier1   # seed/update ADR-0216 tier-1 jobs
```
system-bus worker (build + push GHCR + apply + rollout wait)
```bash
~/Code/joelhooks/joelclaw/k8s/publish-system-bus-worker.sh
```
LiveKit (Helm + reconcile patches)
```bash
~/Code/joelhooks/joelclaw/k8s/reconcile-livekit.sh joelclaw
```
AIStor (Helm operator + objectstore)
Defaults to the isolated `aistor` namespace to avoid service-name collisions with legacy `joelclaw`/`minio`.

Cutover override (explicit only): `AISTOR_OBJECTSTORE_NAMESPACE=joelclaw AISTOR_ALLOW_JOELCLAW_NAMESPACE=true`

```bash
~/Code/joelhooks/joelclaw/k8s/reconcile-aistor.sh
```
PDS (Helm) — always patch NodePort to 3000
(export current values first if the release already exists)
```bash
helm get values bluesky-pds -n joelclaw > /tmp/pds-values-live.yaml 2>/dev/null || true
helm upgrade --install bluesky-pds nerkho/bluesky-pds \
  -n joelclaw -f /tmp/pds-values-live.yaml
kubectl patch svc bluesky-pds -n joelclaw --type='json' \
  -p='[{"op":"replace","path":"/spec/ports/0/nodePort","value":3000}]'
```

Auto Deploy (GitHub Actions)
- Workflow: `.github/workflows/system-bus-worker-deploy.yml`
- Trigger: push to `main` touching `packages/system-bus/**` or worker deploy files
- Behavior:
  - builds/pushes `ghcr.io/joelhooks/system-bus-worker:${GITHUB_SHA}` + `:latest`
  - runs deploy job on `self-hosted` runner
  - updates k8s deployment image + waits for rollout + probes worker health
- If deploy job is queued forever, check that a `self-hosted` runner is online on the Mac Mini.
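To check whether the workflow is actually picking up runs (a sketch; assumes an authenticated `gh` CLI and the `joelhooks/joelclaw` repo slug implied by the paths above):

```bash
gh run list --workflow=system-bus-worker-deploy.yml --limit 5
# If everything sits in "queued", the self-hosted runner on the Mac Mini is probably offline
gh api repos/joelhooks/joelclaw/actions/runners --jq '.runners[] | {name, status}'
```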
GHCR push 403 Forbidden
Cause: `GITHUB_TOKEN` (default Actions token) does not have `packages:write` scope for this repo. A dedicated PAT is required.

Fix already applied: Workflow uses `secrets.GHCR_PAT` (not `secrets.GITHUB_TOKEN`) for the GHCR login step. The PAT is stored in:
- GitHub repo secrets as `GHCR_PAT` (set via GitHub UI)
- agent-secrets as `ghcr_pat` (`secrets lease ghcr_pat`)

If this breaks again: PAT may have expired. Regenerate at github.com → Settings → Developer settings → PATs, update both stores.

Local fallback (bypass GHA entirely):
```bash
DOCKER_CONFIG_DIR=$(mktemp -d)
echo '{"credsStore":""}' > "$DOCKER_CONFIG_DIR/config.json"
export DOCKER_CONFIG="$DOCKER_CONFIG_DIR"
secrets lease ghcr_pat | docker login ghcr.io -u joelhooks --password-stdin
~/Code/joelhooks/joelclaw/k8s/publish-system-bus-worker.sh
```

Note: `publish-system-bus-worker.sh` uses `gh auth token` internally — if `gh auth` is stale, use the Docker login above before running the script, or patch it to use `secrets lease ghcr_pat` directly.
Resilience Rules (ADR-0148)
- NEVER use `kubectl port-forward` for persistent service exposure. All long-lived operator surfaces MUST use NodePort + Docker port mappings. The narrow exception is a CLI-managed, short-lived tunnel for an otherwise in-cluster-only control surface (for example `joelclaw restate cron *` tunneling to `dkron-svc`). Port-forwards silently die on idle/restart/pod changes, so do not leave them running.
- All workloads MUST have liveness + readiness + startup probes. Missing probes = silent hangs that never recover.
- After any Docker/Colima/node restart: remove control-plane taint, uncordon node, verify flannel, check all pods reach Running.
- PVC reclaimPolicy is Delete — deleting a PVC = permanent data loss. Never delete PVCs without backup.
- `firecracker-images` is stateful runtime data. Treat it like a real runtime PVC: kernel, rootfs, and snapshot loss will break the microVM path.
- Colima VM disk is limited (19GB). Monitor with `colima ssh -- df -h /`. Alert at >80%.
- All launchd plists MUST set PATH including `/opt/homebrew/bin`. Colima shells to `limactl`, kubectl/talosctl live in homebrew. launchd's default PATH is `/usr/bin:/bin:/usr/sbin:/sbin` — no homebrew. The canonical PATH for infra plists is: `/opt/homebrew/bin:/Users/joel/.local/bin:/usr/local/bin:/usr/bin:/bin:/usr/sbin:/sbin`. Discovered Feb 2026: missing PATH caused 6 days of silent recovery failures.
- Shell scripts run by launchd MUST export PATH at the top. Even if the plist sets EnvironmentVariables, belt-and-suspenders — add `export PATH="/opt/homebrew/bin:..."` to the script itself.
Current Probe Gaps (fix when touching these services)
- Typesense: missing liveness probe (hangs won't be detected)
- Bluesky PDS: missing readiness and startup probes
- system-bus-worker: missing startup probe
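As an example of the shape these fixes take, a liveness probe for Typesense could be patched in like this — a sketch only; the health endpoint matches the Quick Health Check section, but confirm the StatefulSet name and container layout against the live manifest before applying:

```bash
kubectl -n joelclaw patch statefulset typesense --type='json' -p='[
  {"op":"add","path":"/spec/template/spec/containers/0/livenessProbe","value":{
    "httpGet":{"path":"/health","port":8108},
    "initialDelaySeconds":10,"periodSeconds":15,"failureThreshold":3
  }}
]'
```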
Danger Zones
- Stale SSH mux socket after Colima restart — When Colima restarts (disk resize, crash recovery, `colima stop && start`), the SSH port changes but the mux socket (`~/.colima/_lima/colima/ssh.sock`) caches the old connection. Symptoms: `kubectl port-forward` fails with "tls: internal error", `kubectl get nodes` may intermittently work then fail. Fix: `rm -f ~/.colima/_lima/colima/ssh.sock && pkill -f "ssh.*colima"`, then re-establish tunnels with `ssh -o ControlPath=none`. Always verify SSH port with `colima ssh-config | grep Port` after restart (see the sketch after this list).
- Adding Docker port mappings — can be hot-added without cluster recreation via `hostconfig.json` edit. See references/operations.md for the procedure.
- Inngest legacy host alias in manifests — old container-host alias may still appear in legacy configs. Worker uses connect mode, so it usually still works, but prefer explicit Talos/Colima hostnames.
- Colima zombie state — `colima status` reports "Running" but docker socket / SSH tunnels are dead. All k8s ports unresponsive. `colima start` is a no-op. Only `colima restart` recovers. Detect with: `ssh -F ~/.colima/_lima/colima/ssh.config lima-colima "docker info"` — if that fails while `colima status` passes, it's a zombie. The heal script handles this automatically.
- Talos container has NO shell — No bash, no /bin/sh. Cannot `docker exec` into it. Kernel modules like `br_netfilter` must be loaded at the Colima VM level: `ssh lima-colima "sudo modprobe br_netfilter"`.
- AIStor service-name collision — if AIStor objectstore is deployed in `joelclaw`, it can claim `svc/minio` and break legacy MinIO assumptions. Keep AIStor objectstore in isolated namespace (`aistor`) unless intentionally cutting over.
- AIStor operator webhook SSA conflict — repeated `helm upgrade` can fail on `MutatingWebhookConfiguration` `caBundle` ownership conflict. Current mitigation in this cluster: set `operators.object-store.webhook.enabled=false` in `k8s/aistor-operator-values.yaml`.
- MinIO pinned tag trap — `minio/minio:RELEASE.2025-10-15T17-29-55Z` is not available on Docker Hub in this environment (ErrImagePull). Legacy fallback currently relies on `minio/minio:latest`.
- `restate-worker` privilege is intentional. Do not "harden" away `/dev/kvm`, `privileged: true`, or the unconfined seccomp profile unless you are simultaneously changing the Firecracker runtime contract.
- Dkron service-name collision — never create a bare `svc/dkron`. Kubernetes injects `DKRON_*` env vars into pods, which collides with Dkron's own config parsing. Use `dkron-peer` and `dkron-svc`.
- Dkron PVC permissions — upstream `dkron/dkron:latest` currently needs root on the local-path PVC. Non-root hardening caused `permission denied` under `/data/raft/snapshots/permTest` and CrashLoopBackOff.
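The stale-mux fix from the first bullet, as one copy-pasteable sequence (sketch; same commands, combined):

```bash
rm -f ~/.colima/_lima/colima/ssh.sock
pkill -f "ssh.*colima"
colima ssh-config | grep Port     # note the new SSH port
ssh -o ControlPath=none -F ~/.colima/_lima/colima/ssh.config lima-colima "docker info" >/dev/null && echo "VM reachable"
kubectl get nodes                 # should work again once tunnels are re-established
```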
Key Files
| Path | What |
|---|---|
| | Service manifests |
| | LiveKit Helm values (source controlled) |
| | LiveKit Helm deploy + post-upgrade reconcile |
| | AIStor operator Helm values |
| | AIStor objectstore Helm values |
| | AIStor deploy + upgrade reconcile script |
| | Dkron scheduler StatefulSet + services |
| | Build/push/deploy system-bus worker to k8s |
| | Reboot auto-heal script for Colima/Talos/taint/flannel |
| | launchd timer for reboot auto-heal |
| | Cluster operations + recovery notes |
| | Talos client config |
| | Kubeconfig (context: ) |
| | Colima VM config |
| | Persistent SSH tunnel + NAS route (launchd: ) |
| | Caddy HTTPS proxy (Tailscale) |
| | NAS NVMe NFS PV/PVC (1.5TB) |
| | NAS HDD NFS PV/PVC (50TB) |
Troubleshooting
Read references/operations.md for:
- Recovery after Colima restart
- Recovery after Mac reboot
- Flannel br_netfilter crash fix
- Full cluster recreation (nuclear option)
- Caddy/Tailscale HTTPS proxy details
- All port mapping details with explanation