vpe-advisor


VP of Engineering Advisor


Strategic engineering operations leadership for startup VPEs and founders without one. Four decisions, no generic engineering survey:
  1. Are we delivering at the right throughput? — DORA 4 metrics + bottleneck identification (where work waits)
  2. How do we scale the eng hiring funnel? — funnel math + pipeline gap + time-to-fill discipline
  3. What's our team structure — and when do we add a tech-lead manager? — squad/tribe/chapter design + manager-trigger
  4. What's our production discipline? — on-call rotation, deployment cadence, postmortem culture (reference-only)
This skill is NOT a CTO skill. CTO owns what to build (architecture, scaling cliffs, build-vs-buy). VPE owns how to ship it reliably (delivery, hiring, team structure, production operations). At early stage these are often the same person; at scale they're distinct roles.
This skill is NOT a cs-engineering-lead replacement. Engineering-lead owns day-to-day incident and on-call coordination. VPE owns the operating model that engineering-lead executes.

Keywords


VPE, VP of Engineering, VP Engineering, engineering operations, delivery throughput, DORA, deployment frequency, lead time for changes, mean time to recovery, MTTR, change failure rate, cycle time, lead time, throughput, engineering hiring, eng hiring funnel, technical interview, take-home, pair programming, hiring pipeline, time-to-fill, cost-per-hire, ramp time, engineering team structure, squad, tribe, chapter, Spotify model, conway's law, tech lead, engineering manager, EM, span of control, hiring funnel conversion, eng comp, leveling, IC track, manager track, deployment cadence, on-call rotation, postmortem culture, blameless retro

Quick Start


Decision A: DORA 4 metrics + bottleneck identification

python scripts/delivery_throughput_analyzer.py # embedded sprint sample
python scripts/delivery_throughput_analyzer.py path/to/sprint_metrics.json

Decision B: Hiring funnel health + pipeline gap

python scripts/eng_hiring_funnel_calculator.py # embedded 3-quarter sample
python scripts/eng_hiring_funnel_calculator.py path/to/funnel.json

Decision C: Team structure recommendation + manager-trigger

python scripts/eng_team_structure_designer.py # embedded 25-engineer sample
python scripts/eng_team_structure_designer.py path/to/team.json

Key Questions (ask these first)


  • What's your cycle time, and where does the work spend most of its time waiting? (If you don't know, you can't improve it.)
  • How long from commit to production? (DORA "lead time for changes" — best predictor of overall team health.)
  • What's the escape rate? (Bugs found in production vs caught in CI/staging. > 15% = quality discipline broken.)
  • When did the eng manager last write code? (Manager-IC ratio is wrong if managers can't review code at all.)
  • What's the hiring funnel conversion at each stage? (Source → screen → onsite → offer → accept. The leakage is the answer.)
  • What's the on-call rotation, and who's on it? (If the same 3 people are always paged, the operating model is broken.)
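For the escape-rate question, a minimal sketch of the arithmetic follows. It assumes escape rate means bugs found in production divided by all bugs found (production plus CI/staging). The function name and sample counts are hypothetical; only the 15% threshold comes from the list above.

```python
def escape_rate(found_in_production: int, caught_before_production: int) -> float:
    """Share of all found bugs that escaped to production (assumed definition)."""
    total = found_in_production + caught_before_production
    if total == 0:
        return 0.0  # no bugs recorded, so nothing escaped
    return found_in_production / total


ESCAPE_RATE_THRESHOLD = 0.15  # "> 15% = quality discipline broken"

# Hypothetical quarter: 9 bugs reached production, 41 were caught in CI/staging.
rate = escape_rate(9, 41)
print(f"escape rate: {rate:.0%}, above threshold: {rate > ESCAPE_RATE_THRESHOLD}")
```

If your bug tracker counts escapes differently (for example per release rather than per bug), adjust the denominator before comparing against the threshold.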

Core Responsibilities


1. Delivery Throughput (DORA Metrics)


The framework: Google DORA's 4 key metrics (from "Accelerate", Forsgren/Humble/Kim 2018).
| Metric | What it measures | Elite | High | Medium | Low |
|---|---|---|---|---|---|
| Deployment Frequency | How often code reaches prod | Multiple/day | Daily-weekly | Weekly-monthly | < monthly |
| Lead Time for Changes | Commit → production | < 1 hour | 1 day-1 week | 1 week-1 month | > 1 month |
| Mean Time to Recovery (MTTR) | Incident detection → resolved | < 1 hour | < 1 day | 1-7 days | > 7 days |
| Change Failure Rate | % of deploys causing incidents | 0-15% | 16-30% | 31-45% | 46-60% |
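The bands in the table can be read as a simple lookup per metric. The sketch below is illustrative only and is not the logic of delivery_throughput_analyzer.py. The function names and input units (deploys per day, hours, percent) are assumptions. It also assumes that lead times between 1 hour and 1 day count as High, that Medium change failure rate covers anything above 30% up to 45%, and that a month is 30 days.

```python
def classify_deploys_per_day(deploys_per_day: float) -> str:
    if deploys_per_day > 1:        # multiple deploys per day
        return "Elite"
    if deploys_per_day >= 1 / 7:   # daily to weekly
        return "High"
    if deploys_per_day >= 1 / 30:  # weekly to monthly (30-day month assumed)
        return "Medium"
    return "Low"


def classify_lead_time_hours(hours: float) -> str:
    if hours < 1:
        return "Elite"
    if hours <= 24 * 7:            # up to 1 week (gap below 1 day assumed High)
        return "High"
    if hours <= 24 * 30:           # up to 1 month
        return "Medium"
    return "Low"


def classify_mttr_hours(hours: float) -> str:
    if hours < 1:
        return "Elite"
    if hours < 24:
        return "High"
    if hours <= 24 * 7:
        return "Medium"
    return "Low"


def classify_change_failure_rate(percent: float) -> str:
    if percent <= 15:
        return "Elite"
    if percent <= 30:
        return "High"
    if percent <= 45:              # assumed upper edge of Medium
        return "Medium"
    return "Low"


# Hypothetical sprint: 3 deploys/day, 36 h lead time, 30 h MTTR, 12% failures.
verdicts = {
    "deployment_frequency": classify_deploys_per_day(3),
    "lead_time_for_changes": classify_lead_time_hours(36),
    "mttr": classify_mttr_hours(30),
    "change_failure_rate": classify_change_failure_rate(12),
}
print(verdicts)
```

Keeping one function per metric makes it easy to see which metric drags the team down, rather than hiding it inside a single blended score.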
Bottleneck identification — where does work wait?
Cycle time = (PR creation → first review) + (review → approval) + (approval → merge) + (merge → deploy). The longest segment is the bottleneck.
Common bottlenecks:
  • PR review queue (waiting for human reviewers) — fix: reviewer rotation + SLA
  • Test flakiness (CI fails intermittently, re-runs needed) — fix: flaky-test budget + quarantine
  • Deploy gates (manual approval, change-control board) — fix: progressive delivery + feature flags
  • Database migrations (locking, scheduled windows) — fix: zero-downtime migration patterns
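The cycle-time decomposition above can be sketched in a few lines. This is a minimal illustration, not the script's implementation: the timestamp field names (`created`, `first_review`, `approved`, `merged`, `deployed`) and the sample pull request are hypothetical.

```python
from datetime import datetime

# Event order matches the cycle-time formula: creation, first review,
# approval, merge, deploy. Field names are assumptions for illustration.
EVENTS = ["created", "first_review", "approved", "merged", "deployed"]
SEGMENTS = [
    "creation_to_first_review",
    "review_to_approval",
    "approval_to_merge",
    "merge_to_deploy",
]


def cycle_time_segments(pr: dict[str, str]) -> dict[str, float]:
    """Return the hours spent in each of the four cycle-time segments."""
    stamps = [datetime.fromisoformat(pr[event]) for event in EVENTS]
    return {
        name: (end - start).total_seconds() / 3600
        for name, start, end in zip(SEGMENTS, stamps, stamps[1:])
    }


def bottleneck(segments: dict[str, float]) -> str:
    """The longest segment is the bottleneck."""
    return max(segments, key=segments.get)


# Hypothetical pull request.
sample_pr = {
    "created": "2024-03-04T09:00:00",
    "first_review": "2024-03-05T15:00:00",
    "approved": "2024-03-05T17:00:00",
    "merged": "2024-03-05T17:30:00",
    "deployed": "2024-03-06T10:30:00",
}
segments = cycle_time_segments(sample_pr)
print(segments)
print("cycle time (h):", sum(segments.values()))
print("bottleneck:", bottleneck(segments))
```

In practice you would aggregate segment durations over many pull requests (a median per segment, for instance) before naming a bottleneck, since a single PR can be an outlier.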
Run delivery_throughput_analyzer.py with sprint data to get DORA verdict + top bottleneck.
See references/delivery_throughput.md for the full DORA framework, anti-patterns, and what to fix first.

2. Engineering Hiring Funnel


The trap: "We can't find good engineers."
The reality: the funnel has multiple stages (seven in the standard funnel below), each with a conversion rate. Find which stage is leakiest; fix that one. "Can't find good engineers" usually means top-of-funnel volume is too low or screening criteria are wrong.
Standard funnel stages:
| Stage | Healthy conversion | What it measures |
|---|---|---|
| Applied → Sourcer screen | 30-50% | Resume quality |
| Sourcer → Recruiter screen | 50-70% | Basic fit |
| Recruiter → Hiring manager | 60-80% | Team fit |
| Hiring manager → Technical interview | 70-85% | Technical baseline |
| Technical → Onsite (full loop) | 30-50% | Technical depth |
| Onsite → Offer | 25-40% | Final go/no-go |
| Offer → Accept | 70-90% | Comp + close discipline |
Funnel math: to hire N engineers, you need N / (product of all conversion rates) candidates at top of funnel.
Example: 4 hires needed, with stage conversions of 30% × 60% × 70% × 75% × 40% × 35% × 80% ≈ 1.06% end-to-end, means 4 / 0.0106 ≈ 378 candidates at top of funnel.
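The funnel math is easy to check by hand or in code. The sketch below multiplies the seven sample rates from the example (about 1.06% end-to-end) and divides the hiring target by that product. The function names are hypothetical and this is not the implementation of eng_hiring_funnel_calculator.py.

```python
import math


def end_to_end_conversion(stage_rates: list[float]) -> float:
    """Product of all per-stage conversion rates."""
    return math.prod(stage_rates)


def required_top_of_funnel(hires: int, stage_rates: list[float]) -> int:
    """Candidates needed at the top of the funnel, rounded up."""
    return math.ceil(hires / end_to_end_conversion(stage_rates))


# Sample rates, one per stage from Applied through Offer → Accept.
rates = [0.30, 0.60, 0.70, 0.75, 0.40, 0.35, 0.80]

print(f"end-to-end conversion: {end_to_end_conversion(rates):.2%}")
print("candidates needed for 4 hires:", required_top_of_funnel(4, rates))
```

Because the rates multiply, lifting one weak stage (say Onsite → Offer from 35% to 40%) lowers the required top-of-funnel volume in direct proportion, which is why fixing the leakiest stage usually beats adding more sourcing.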
Run eng_hiring_funnel_calculator.py with funnel data to compute conversion per stage, time-to-fill, and pipeline gap.
See references/engineering_hiring_funnel.md for the full funnel framework, common leakage points, and sourcing channel diversification.

3. Engineering Team Structure


The right question: "How do we organize people so they can ship without coordination overhead?"
Three-axis model (adapted from Spotify, refined by reality):
  • Squad: small autonomous team (5-9 engineers) owning a service or product area end-to-end
  • Chapter: functional discipline cutting across squads (backend chapter, frontend chapter, etc.) — for skill development, NOT for ownership
  • Tribe: group of related squads working toward a shared goal (e.g., "platform tribe" = 3 squads on infra)
When to evolve:
| Stage | Structure |
|---|---|
| 1-5 engineers | One team. No structure. |
| 6-15 engineers | 2-3 informal pods around major work streams. Founder-CTO can still know everyone. |
| 16-40 engineers | 4-6 squads. First eng manager hires. Chapter structure emerges for cross-squad skill alignment. |
| 41-100 engineers | 2-3 tribes (clusters of squads). Director of engineering layer. Chapters are formal. |
| 100+ engineers | Multiple tribes + group EM/director per tribe. VPE + director(s) + EMs + tech leads. |
Manager-trigger thresholds:
  • 5-7 ICs without a manager = first EM hire (or internal promotion)
  • 3+ EMs without a director = director hire
  • 8+ teams in one tribe = split the tribe
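The headcount stages and manager-trigger thresholds above can be expressed as two small lookups. This is a sketch under stated assumptions: the function names and input fields are hypothetical and do not describe how eng_team_structure_designer.py works, and the "5-7 ICs" rule is read as firing once five ICs have no manager.

```python
def structure_for_headcount(engineers: int) -> str:
    """Map engineering headcount to the structure stage listed above."""
    if engineers <= 5:
        return "One team. No structure."
    if engineers <= 15:
        return "2-3 informal pods around major work streams."
    if engineers <= 40:
        return "4-6 squads. First eng manager hires. Chapters emerge."
    if engineers <= 100:
        return "2-3 tribes. Director of engineering layer. Formal chapters."
    return "Multiple tribes + group EM/director per tribe."


def manager_triggers(
    ics_without_manager: int,
    ems_without_director: int,
    teams_in_largest_tribe: int,
) -> list[str]:
    """Return the hiring or restructuring actions whose threshold is met."""
    actions = []
    if ics_without_manager >= 5:      # lower bound of the 5-7 IC rule (assumed)
        actions.append("Hire or promote first EM")
    if ems_without_director >= 3:
        actions.append("Hire a director")
    if teams_in_largest_tribe >= 8:
        actions.append("Split the tribe")
    return actions


# Hypothetical 25-engineer org: 6 unmanaged ICs, 3 EMs, 4 teams in one tribe.
print(structure_for_headcount(25))
print(manager_triggers(6, 3, 4))
```

Returning a list rather than a single verdict matters here, because several thresholds can trip in the same quarter during fast growth.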
Run eng_team_structure_designer.py with team profile for structure recommendation + manager-trigger.
See references/eng_team_structure.md for the full framework, Conway's Law implications, and EM-vs-tech-lead split.

4. Production Discipline


Production discipline is the operating model that lets the team sleep. Four pillars:
  • On-call rotation: broad enough to avoid burnout (≥ 6 people per rotation; primary + secondary)
  • Incident response: runbooks, severity definitions, blameless postmortems
  • Deployment cadence: continuous deployment OR scheduled releases; both work; surprise releases don't
  • SLO discipline: every customer-facing service has documented SLOs + error budgets (pair with engineering/slo-architect/)
See references/production_discipline.md for the full operating model.
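To make the SLO pillar concrete, here is a minimal sketch of error-budget arithmetic for a time-based availability SLO. The 99.9% target, the 30-day window, and the function names are assumptions chosen for illustration; the skill itself does not prescribe specific targets.

```python
def error_budget_minutes(slo_target: float, window_days: int = 30) -> float:
    """Allowed downtime in minutes for an availability SLO over the window."""
    window_minutes = window_days * 24 * 60
    return (1.0 - slo_target) * window_minutes


def budget_remaining_minutes(
    slo_target: float, downtime_minutes: float, window_days: int = 30
) -> float:
    """Error budget left after subtracting downtime already incurred."""
    return error_budget_minutes(slo_target, window_days) - downtime_minutes


# Hypothetical service: 99.9% availability target, 12 minutes of downtime so far.
budget = error_budget_minutes(0.999)
remaining = budget_remaining_minutes(0.999, downtime_minutes=12)
print(f"budget: {budget:.1f} min, remaining: {remaining:.1f} min")
```

A negative remaining budget is a natural signal to slow the deployment cadence until reliability work catches up, though where to draw that line is a policy decision for the VPE.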

Workflows


Workflow 1: Quarterly Delivery Health Review (4 hours)


Goal: Diagnose throughput + identify top bottleneck.
  1. Pull sprint metrics: deployment frequency, lead time, MTTR, change failure rate
     python ../../skills/vpe-advisor/scripts/delivery_throughput_analyzer.py sprint_metrics.json
  2. Review DORA verdict per metric
  3. Identify top bottleneck (longest wait stage)
  4. Cross-check with cs-cto-advisor on architectural causes
  5. Output: 90-day fix plan with one bottleneck owned by one engineer
  6. Log via /cs:decide

Workflow 2: Hiring Funnel Diagnosis (1 day)


Goal: Identify funnel leakage + compute pipeline gap for hiring target.
  1. Pull funnel data from ATS for last 90 days
     python ../../skills/vpe-advisor/scripts/eng_hiring_funnel_calculator.py funnel.json
  2. Identify weakest conversion stage
  3. Compute pipeline volume needed for next quarter's hiring target
  4. Cross-check with cs-chro-advisor on comp/leveling competitiveness
  5. Cross-check with cs-cfo-advisor on cost-per-hire envelope
  6. Output: top-3 fixes + sourcing channel diversification plan

Workflow 3: Team Structure Audit (1 day)


Goal: Confirm team structure matches headcount + work streams.
  1. Build team.json: headcount, work streams, manager count, IC distribution
     python ../../skills/vpe-advisor/scripts/eng_team_structure_designer.py team.json
  2. Check manager-trigger thresholds (5-7 IC rule)
  3. Identify squad sizes outside 5-9 range
  4. Cross-check with cs-cto-advisor on Conway's Law alignment
  5. Output: structure recommendations + manager hire plan

Workflow 4: Production Discipline Audit (1 week)


Goal: Confirm operating model can scale through current growth.
  1. Inventory: on-call coverage, incident frequency by severity, MTTR trend
  2. Confirm every customer-facing service has SLOs (pair with engineering/slo-architect/)
  3. Review last 5 postmortems — are they blameless? Are action items closed?
  4. Cross-check deployment cadence against DORA verdict
  5. Output: production-discipline maturity score + 90-day improvement plan

Output Standards


**Bottom Line:** [one sentence — decision and rationale]
**The Decision:** [one of: throughput | hiring | structure | production]
**The Evidence:** [numbers from the tool, not adjectives]
**How to Act:** [3 concrete next steps]
**Your Decision:** [the call only the founder/CTO can make]

Adjacent Skills


  • ../cto-advisor/ — Architecture, scaling cliffs, tech debt strategy (CTO decides what to build; VPE decides how to ship)
  • ../chro-advisor/ — Hiring systems (ladders, bands, leveling rubrics company-wide); VPE owns eng-specific funnel execution
  • ../coo-advisor/ — Operating cadence company-wide; VPE owns eng-specific cadence
  • ../../../engineering/slo-architect/ — SLO design (tactical; VPE owns the policy that SLOs are required)
  • ../../../engineering/chaos-engineering/ — Chaos experiment design (tactical resilience)
  • ../../../engineering/feature-flags-architect/ — Progressive delivery (tactical deployment)
  • ../../../engineering/kubernetes-operator/ — K8s operator pattern (tactical infra)
  • cs-engineering-lead agent — Day-to-day incident + on-call coordination (VPE owns the operating model that engineering-lead executes)

References


  • delivery_throughput.md — Full DORA framework + 4 common bottlenecks + what to fix first + anti-patterns
  • engineering_hiring_funnel.md — 7-stage funnel + conversion benchmarks + common leakage + sourcing channel diversification + technical interview design
  • eng_team_structure.md — Squad/chapter/tribe model + headcount-to-structure map + Conway's Law + EM-vs-tech-lead split + span-of-control
  • production_discipline.md — On-call rotation design + incident response + blameless postmortem culture + deployment cadence + SLO discipline integration

Version: 1.0.0 Status: Production Ready