Systematic LLM prompt engineering: analyzes existing prompts for failure modes, generates structured variants (direct, few-shot, chain-of-thought), designs evaluation rubrics with weighted criteria, and produces test case suites for comparing prompt performance. Triggers on: "prompt engineering", "prompt lab", "generate prompt variants", "A/B test prompts", "evaluate prompt", "optimize prompt", "write a better prompt", "prompt design", "prompt iteration", "few-shot examples", "chain-of-thought prompt", "prompt failure modes", "improve this prompt". Use this skill when designing, improving, or evaluating LLM prompts specifically. NOT for evaluating Claude Code skills or SKILL.md files — use skill-evaluator instead.
Install:

```shell
npx skill4agent add mathews-tom/praxis-skills prompt-lab
```

### Reference Files

| File | Contents | Load When |
|---|---|---|
| | Prompt structure catalog: zero-shot, few-shot, CoT, persona, structured output | Always |
| | Quality metrics (accuracy, format compliance, completeness), rubric design | Evaluation needed |
| `references/failure-modes.md` | Common prompt failure taxonomy, detection strategies, mitigations | Failure analysis requested |
| | Techniques for constraining LLM output format, JSON mode, schema enforcement | Format control needed |

### Variant Strategies

| Variant Type | Hypothesis | When to Use |
|---|---|---|
| Direct instruction | Clear instruction is sufficient | Simple tasks, capable models |
| Few-shot | Examples improve output consistency | Pattern-following tasks |
| Chain-of-thought | Reasoning improves accuracy | Multi-step logic, math, analysis |
| Persona/role | Role framing improves tone/expertise | Domain-specific tasks |
| Structured output | Format specification prevents errors | JSON, CSV, specific templates |
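The strategies above can be sketched as plain prompt builders. The function names, the sentiment-classification task, and the example shots below are illustrative assumptions, not part of this skill's API:

```python
# Minimal sketch of three variant strategies from the table above.
# Task text and helper names are hypothetical.

def direct(task: str) -> str:
    # Direct instruction: state the task plainly, constrain the response.
    return f"{task}\nRespond with the answer only."

def few_shot(task: str, examples: list[tuple[str, str]]) -> str:
    # Few-shot: prepend input/output pairs to anchor the expected pattern.
    shots = "\n".join(f"Input: {i}\nOutput: {o}" for i, o in examples)
    return f"{shots}\nInput: {task}\nOutput:"

def chain_of_thought(task: str) -> str:
    # Chain-of-thought: request explicit reasoning before the answer.
    return f"{task}\nThink step by step, then state the final answer."

task = "Classify the sentiment of: 'great service'"
variants = {
    "A-direct": direct(task),
    "B-few-shot": few_shot(task, [("terrible food", "negative"),
                                  ("loved it", "positive")]),
    "C-cot": chain_of_thought(task),
}
```

Each builder encodes one hypothesis from the table, so the variants differ in exactly one dimension and comparisons stay interpretable.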

### Rubric Criteria

| Criterion | What It Measures | Typical Weight |
|---|---|---|
| Correctness | Output matches expected answer | 30-50% |
| Format compliance | Follows specified structure | 15-25% |
| Completeness | All required elements present | 15-25% |
| Conciseness | No unnecessary content | 5-15% |
| Tone/style | Matches requested voice | 5-10% |
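A weighted rubric combines per-criterion ratings into one comparable score. A minimal sketch, assuming a 0-3 scale and weights picked from the midpoints of the ranges above (tune these per task):

```python
# Weighted rubric scoring: each criterion is rated 0-3, normalized,
# and combined by weight. Weights below are assumed midpoints of the
# table's ranges and must sum to 1.0.

WEIGHTS = {
    "correctness": 0.40,
    "format_compliance": 0.20,
    "completeness": 0.20,
    "conciseness": 0.12,
    "tone": 0.08,
}

def weighted_score(scores: dict[str, int], max_score: int = 3) -> float:
    """Combine per-criterion 0..max_score ratings into a 0..1 overall score."""
    assert abs(sum(WEIGHTS.values()) - 1.0) < 1e-9
    return sum(WEIGHTS[c] * (scores[c] / max_score) for c in WEIGHTS)

# A perfect response scores 1.0; a failed criterion costs its full weight.
print(round(weighted_score(dict.fromkeys(WEIGHTS, 3)), 2))  # → 1.0
```

Because weights normalize to 1.0, scores are comparable across prompt variants even if you later add or rebalance criteria.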
## Prompt Lab: {Task Name}
### Objective
{What the prompt should achieve — specific and measurable}
### Success Criteria
- [ ] {Criterion 1 — measurable}
- [ ] {Criterion 2 — measurable}
### Current Prompt Analysis
{If existing prompt provided}
- **Strengths:** {what works}
- **Weaknesses:** {what fails or is ambiguous}
- **Missing:** {what's not specified}
### Variants
#### Variant A: {Strategy Name}

**Hypothesis:** {Why this approach might work}
**Risk:** {What could go wrong}
#### Variant B: {Strategy Name}

**Hypothesis:** {Why this approach might work}
**Risk:** {What could go wrong}
#### Variant C: {Strategy Name}

**Hypothesis:** {Why this approach might work}
**Risk:** {What could go wrong}
### Evaluation Rubric
| Criterion | Weight | Scoring |
|-----------|--------|---------|
| {criterion} | {%} | {how to score: 0-3 scale or pass/fail} |
### Test Cases
| # | Input | Expected Output | Tests Criteria |
|---|-------|-----------------|---------------|
| 1 | {standard input} | {expected} | Correctness, Format |
| 2 | {edge case} | {expected} | Completeness |
| 3 | {adversarial} | {expected} | Robustness |
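Running variants against such a suite can be sketched as a small loop. `call_llm` is a placeholder for your model client, and the exact-match check stands in for the full rubric; both are assumptions for illustration:

```python
# Sketch of an A/B harness: run each prompt variant over the test cases
# and report a pass rate. Replace `call_llm` with a real model call and
# the exact-match check with rubric scoring.

def call_llm(prompt: str, test_input: str) -> str:
    # Placeholder stub so the harness runs offline.
    return "positive"

def evaluate(variants: dict[str, str], cases: list[dict]) -> dict[str, float]:
    """Return the fraction of test cases each variant passes."""
    results = {}
    for name, prompt in variants.items():
        passed = sum(
            1 for case in cases
            if call_llm(prompt, case["input"]).strip() == case["expected"]
        )
        results[name] = passed / len(cases)
    return results

cases = [
    {"input": "great service", "expected": "positive"},  # standard input
    {"input": "loved it, will return", "expected": "positive"},  # variation
]
scores = evaluate({"A-direct": "{prompt text}"}, cases)
```

Keeping the harness deterministic (fixed cases, fixed scoring) is what makes variant comparisons meaningful run to run.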
### Failure Modes to Monitor
- {Failure mode 1}: {detection method}
- {Failure mode 2}: {detection method}
### Recommended Next Steps
1. Run all variants against the test suite
2. Score using the rubric
3. Select the highest-scoring variant
4. Iterate on the winner with targeted improvements

### Troubleshooting

| Problem | Resolution |
|---|---|
| No clear objective | Ask the user to define what "good output" looks like with 2-3 examples. |
| Prompt is for a task LLMs are bad at (math, counting) | Flag the limitation. Suggest tool-augmented approaches or pre/post-processing. |
| Too many variables to test | Focus on the highest-impact variable first. Iterative refinement beats combinatorial testing. |
| No existing prompt to analyze | Start with the simplest possible prompt. The first variant IS the baseline. |
| Output format requirements are strict | Use structured output mode (JSON mode, function calling) instead of prompt-only constraints. |