Build AI-native products with agency-control tradeoffs, calibration loops, and eval strategies. Use when building AI agents, LLM features, or products where AI handles user tasks autonomously. Part of the Modern Product Operating Model collection.
npx skill4agent add yannickyamo/skills ai-native-product

"AI products aren't deterministic. They require continuous calibration, not just A/B tests."
product-strategy, product-discovery, product-architecture, product-delivery, product-leadership

| Dimension | Traditional Software | AI-Native Products |
|---|---|---|
| Behavior | Deterministic | Probabilistic |
| Testing | Unit tests, QA | Evals, calibration |
| Correctness | Binary (works or doesn't) | Spectrum (good enough?) |
| User role | Operator | Delegator + Reviewer |
| Failure mode | Error messages | Plausible but wrong outputs |
| Iteration | Ship → Measure → Iterate | Ship → Observe → Calibrate |
| Trust building | Feature completeness | Demonstrated reliability |
Credit: Aishwarya Goel & Kiriti Gavini
┌──────────────────────────────────────────────────────────────────────┐
│                              CCCD LOOP                               │
│                                                                      │
│  CALIBRATE  →  CONFIDENCE  →  CONTINUOUS DISCOVERY  →  CALIBRATE     │
│      ↓             ↓                   ↓                   ↓         │
│  Eval and      Build user     Observe AI               Update evals  │
│  adjust AI     trust over     interactions             and models    │
│  behavior      time           at scale                               │
└──────────────────────────────────────────────────────────────────────┘

| Component | Purpose | Activities |
|---|---|---|
| Calibrate | Tune AI behavior to match user expectations | Run evals, adjust prompts/models, set guardrails |
| Confidence | Build appropriate user trust | Show AI reasoning, enable verification, demonstrate reliability |
| Continuous Discovery | Observe AI-user interactions at scale | Log interactions, identify failure patterns, surface edge cases |
| → Back to Calibrate | Update based on learnings | Improve evals, retrain, adjust prompts |
| Level | Description | AI Does | User Does | Example |
|---|---|---|---|---|
| 1. Assist | AI suggests, user executes | Generates options | Chooses and acts | Autocomplete, suggestions |
| 2. Recommend | AI ranks, user approves | Analyzes and recommends | Reviews and approves | "AI recommends these 3 actions" |
| 3. Execute with confirmation | AI acts after approval | Prepares action | Confirms before execution | "Send this email?" → Yes/No |
| 4. Execute with notification | AI acts, notifies after | Acts autonomously | Reviews outcomes | "I scheduled the meeting and sent invites" |
| 5. Fully autonomous | AI acts without notification | Handles end-to-end | Sets goals, reviews exceptions | AI handles routine tasks silently |
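The five levels above can be encoded directly so that product code branches on them explicitly. This is a minimal sketch; the enum and helper names are illustrative assumptions, not part of the skill.

```python
from enum import IntEnum


class AgencyLevel(IntEnum):
    """The five agency levels from the table above (names are illustrative)."""
    ASSIST = 1
    RECOMMEND = 2
    EXECUTE_WITH_CONFIRMATION = 3
    EXECUTE_WITH_NOTIFICATION = 4
    FULLY_AUTONOMOUS = 5


def requires_user_decision(level: AgencyLevel) -> bool:
    """At levels 1-3 nothing runs until the user chooses, approves, or confirms."""
    return level <= AgencyLevel.EXECUTE_WITH_CONFIRMATION


def notifies_after_acting(level: AgencyLevel) -> bool:
    """Only level 4 acts first and then reports back to the user."""
    return level == AgencyLevel.EXECUTE_WITH_NOTIFICATION
```

Using an ordered enum keeps comparisons such as "level 3 or lower" readable and makes a change of level a one-line edit per task type.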
Level 1 → Build trust → Level 2 → Demonstrate reliability → Level 3 → ...

| Transition | From → To | Requires |
|---|---|---|
| 1 → 2 | Assist → Recommend | User accepts suggestions > 70% |
| 2 → 3 | Recommend → Execute with confirm | User approves recommendations > 80% |
| 3 → 4 | Execute+confirm → Execute+notify | User confirms without edit > 90% |
| 4 → 5 | Execute+notify → Autonomous | User overrides < 5%, high-stakes scenarios excluded |
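The graduation thresholds above could be checked in code along these lines. The thresholds come from the table; the metric key names and the `high_stakes` flag are assumptions made for illustration, and how each rate is measured is left to the product.

```python
# Thresholds copied from the graduation table. Metric names are hypothetical.
GRADUATION_RULES = {
    1: ("acceptance_rate", lambda rate: rate > 0.70),
    2: ("approval_rate", lambda rate: rate > 0.80),
    3: ("confirm_without_edit_rate", lambda rate: rate > 0.90),
    4: ("override_rate", lambda rate: rate < 0.05),
}


def next_level(current: int, metrics: dict, high_stakes: bool = False) -> int:
    """Return the agency level a task type qualifies for after this review."""
    if current not in GRADUATION_RULES:
        return current  # level 5 (or an unknown level) has nowhere to graduate to
    if current == 4 and high_stakes:
        return current  # high-stakes scenarios are excluded from full autonomy
    metric_name, passes = GRADUATION_RULES[current]
    rate = metrics.get(metric_name)
    if rate is None or not passes(rate):
        return current  # missing or insufficient evidence: stay put
    return current + 1
```

The function only ever moves one level at a time, which matches the idea of earning autonomy progressively.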
| Standard Discovery | AI-Native Adaptation |
|---|---|
| "What job are you trying to do?" | + "How much do you want to delegate?" |
| "What's your current workflow?" | + "Which steps are you comfortable AI handling?" |
| "What would success look like?" | + "What errors would be unacceptable?" |
| "Show me how you do this today" | + "Show me how you verify AI work today" |
| Method | What to Look For |
|---|---|
| Session recordings | Where do users override AI? Where do they accept blindly? |
| Interaction logs | Patterns in edits, rejections, corrections |
| Feedback analysis | Explicit signals (thumbs down, ratings) |
| Support tickets | AI-related complaints and confusion |
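As one way to mine interaction logs for the patterns above, the sketch below computes an override rate per task type. The `Interaction` record and its `outcome` values are assumed for illustration; real logs will have their own schema.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class Interaction:
    task_type: str
    outcome: str  # assumed values: "accepted", "edited", or "rejected"


def override_rate_by_task(logs):
    """Share of interactions per task type where the user edited or rejected the AI output."""
    totals = Counter()
    overrides = Counter()
    for item in logs:
        totals[item.task_type] += 1
        if item.outcome in ("edited", "rejected"):
            overrides[item.task_type] += 1
    return {task: overrides[task] / totals[task] for task in totals}
```

A task type with a very low override rate is worth a second look too, since it may indicate users accepting output without checking it.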
AI-SPECIFIC SECTION
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
AGENCY LEVEL
Target: [Level 1-5]
Graduation path: [How might this evolve?]
FAILURE MODES
• [Failure mode 1]: [Consequence] → [Mitigation]
• [Failure mode 2]: [Consequence] → [Mitigation]
EVAL STRATEGY
• [Eval type 1]: [What we measure, how often]
• [Eval type 2]: [What we measure, how often]
CALIBRATION PLAN
• Initial calibration: [Approach]
• Ongoing calibration: [Cadence, triggers]
CONFIDENCE BUILDING
• How AI explains itself: [Approach]
• How users verify: [Mechanisms]
• Trust-building milestones: [Progression]

| Category | Description | Example |
|---|---|---|
| Capability expansion | AI can handle new task types | "AI can now summarize documents" |
| Agency graduation | Move to higher autonomy level | "AI sends emails without confirmation" |
| Calibration improvement | Better accuracy/reliability | "Reduce hallucination rate from 5% to 2%" |
| Confidence building | Better user trust | "Show AI reasoning before action" |
| Guardrail strengthening | Prevent harmful outputs | "Add content policy enforcement" |
| Eval Type | Purpose | When to Run |
|---|---|---|
| Unit evals | Test specific capabilities | Every code change |
| Behavioral evals | Test end-to-end flows | Daily/weekly |
| Adversarial evals | Test edge cases and attacks | Before major releases |
| Human evals | Test subjective quality | Weekly sample |
| Production evals | Test on real traffic | Continuous |
| Metric | What It Measures | Target |
|---|---|---|
| Task success rate | Does AI complete the intended task? | > 95% |
| Factual accuracy | Is output factually correct? | > 98% |
| Hallucination rate | Does AI make things up? | < 2% |
| Harmful output rate | Does AI produce unsafe content? | < 0.1% |
| User acceptance rate | Do users accept AI output? | > 80% |
| Override rate | How often do users correct AI? | < 15% |
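The targets above can be turned into an automated check. In this sketch the metric keys are assumed names, rates are expressed as fractions (0.95 for 95%), and a missing metric counts as a failure.

```python
import operator

# Targets from the metrics table, expressed as fractions.
TARGETS = {
    "task_success_rate": (operator.gt, 0.95),
    "factual_accuracy": (operator.gt, 0.98),
    "hallucination_rate": (operator.lt, 0.02),
    "harmful_output_rate": (operator.lt, 0.001),
    "user_acceptance_rate": (operator.gt, 0.80),
    "override_rate": (operator.lt, 0.15),
}


def failing_metrics(measured: dict) -> list:
    """Return the names of metrics that are missing or do not meet their target."""
    return [
        name
        for name, (compare, target) in TARGETS.items()
        if name not in measured or not compare(measured[name], target)
    ]
```

An empty result means every target is met; anything else names the metrics that need attention.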
Code change → Unit evals (automated)
Daily → Behavioral evals (automated)
Weekly → Human evals (sample)
Release → Adversarial evals (red team)
Continuous → Production evals (monitoring)

| Stage | Audience | Focus | Duration |
|---|---|---|---|
| Internal | Team | Find obvious failures | 1 week |
| Alpha | 5-10 trusted users | Qualitative feedback on AI behavior | 2 weeks |
| Beta | 5% of users | Quantitative eval metrics | 2-4 weeks |
| Gradual GA | 5% → 25% → 50% → 100% | Monitor at each stage | 4+ weeks |
| Gate | Criteria to Proceed |
|---|---|
| Alpha → Beta | Eval metrics above threshold, no harmful outputs |
| Beta → Gradual GA | User acceptance > 80%, override rate < 15% |
| Each GA increment | Metrics stable, no new failure modes |
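The Beta and GA gates above might be encoded as follows. The thresholds and rollout steps come from the tables; the function names and the way "metrics stable" and "new failure modes" are supplied are assumptions for illustration.

```python
GA_STEPS = [5, 25, 50, 100]  # percentage of users at each Gradual GA stage


def can_enter_gradual_ga(acceptance_rate: float, override_rate: float) -> bool:
    """Beta → Gradual GA gate: acceptance above 80% and override rate below 15%."""
    return acceptance_rate > 0.80 and override_rate < 0.15


def next_ga_percentage(current: int, metrics_stable: bool, new_failure_modes: int) -> int:
    """Advance one rollout step only when metrics are stable and nothing new has broken."""
    if current not in GA_STEPS:
        raise ValueError(f"unknown rollout step: {current}")
    if not metrics_stable or new_failure_modes > 0:
        return current  # hold at the current exposure
    position = GA_STEPS.index(current)
    return GA_STEPS[min(position + 1, len(GA_STEPS) - 1)]
```

Holding at the current percentage, rather than rolling back, is a simplification; a real rollout would likely also define a rollback trigger.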
OBSERVE → IDENTIFY → CALIBRATE → VALIDATE → DEPLOY
   ↑                                           │
   └───────────────────────────────────────────┘

| Step | Activities | Cadence |
|---|---|---|
| Observe | Monitor production interactions, logs, feedback | Continuous |
| Identify | Surface failure patterns, edge cases, drift | Daily/weekly |
| Calibrate | Adjust prompts, fine-tune, add guardrails | As needed |
| Validate | Run evals on calibrated version | Before deploy |
| Deploy | Ship updates, continue observing | Staged |
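One pass through the loop above can be sketched as a function that takes each step as a callable. The step implementations are deliberately left to the caller; the function name and return values are assumptions for illustration.

```python
def calibration_cycle(observe, identify, calibrate, validate, deploy) -> str:
    """Run one Observe → Identify → Calibrate → Validate → Deploy pass."""
    interactions = observe()           # production interactions, logs, feedback
    issues = identify(interactions)    # failure patterns, edge cases, drift
    if not issues:
        return "no_change"             # nothing to calibrate this cycle
    candidate = calibrate(issues)      # e.g. adjusted prompt or added guardrail
    if not validate(candidate):        # evals must pass before anything ships
        return "rejected"
    deploy(candidate)
    return "deployed"
```

Keeping validation as a hard gate between calibrate and deploy reflects the "Before deploy" cadence in the table.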
LAGGING
├── User retention (AI users vs. non-AI users)
├── Task completion rate (with AI assist)
└── Revenue from AI features
CORE
├── User acceptance rate
├── Override rate
├── Time-to-completion (with AI)
└── User-reported satisfaction
LEADING
├── Eval metrics (accuracy, hallucination, etc.)
├── Interaction volume
├── Feature discovery rate
└── Feedback sentiment
GUARDRAILS
├── Harmful output rate
├── Latency P95
├── Error rate
└── Cost per interaction

| Anti-Pattern | Why It Fails | Instead |
|---|---|---|
| Ship and hope | AI behavior drifts without monitoring | Continuous calibration |
| Autonomous by default | Users don't trust, don't adopt | Earn autonomy progressively |
| Black box AI | Users can't verify, won't trust | Show reasoning, enable verification |
| No evals | Quality degrades silently | Comprehensive eval strategy |
| Ignore overrides | Miss calibration signals | Override patterns inform calibration |
| One-size-fits-all agency | Different tasks need different levels | Task-specific agency levels |
templates/
├── agency-assessment.md
├── eval-strategy.md
└── calibration-plan.md

| When you need to... | Use skill |
|---|---|
| Define overall product strategy | |
| Run discovery (with AI adaptations) | |
| Structure bets and roadmap | |
| Plan rollout and metrics | |
| Scale AI products across teams | |