Use when the user needs prompt design, optimization, few-shot examples, chain-of-thought patterns, structured output, evaluation metrics, or prompt versioning. Triggers: new prompt creation, prompt optimization, few-shot example design, structured output specification, A/B testing prompts, evaluation framework setup.
npx skill4agent add pixel-process-ug/superkit-agents senior-prompt-engineer

| Layer | Purpose | Example |
|---|---|---|
| 1. Identity | Who the model is | "You are a sentiment classifier..." |
| 2. Context | What it knows/has access to | "You have access to product reviews..." |
| 3. Task | What to do | "Classify each review as positive/negative/neutral" |
| 4. Constraints | What NOT to do | "Never include PII in output" |
| 5. Format | How to structure output | "Respond in JSON: {classification, confidence}" |
| 6. Examples | Demonstrations | 3-5 representative input/output pairs |
| 7. Metacognition | Handling uncertainty | "If uncertain, classify as neutral and explain" |
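
The seven layers above can be assembled mechanically. The following is a minimal sketch, assuming each layer is kept as a separate string; `PromptLayers` and `build_prompt` are hypothetical names used for illustration and are not part of this skill.

```python
from dataclasses import dataclass, field


@dataclass
class PromptLayers:
    """One field per layer, in the same order as the table (names are illustrative)."""
    identity: str
    context: str
    task: str
    constraints: list[str] = field(default_factory=list)
    output_format: str = ""
    examples: list[tuple[str, str]] = field(default_factory=list)
    metacognition: str = ""


def build_prompt(layers: PromptLayers) -> str:
    """Join the non-empty layers in table order, separated by blank lines."""
    parts = [layers.identity, layers.context, layers.task]
    if layers.constraints:
        parts.append("\n".join(f"- {c}" for c in layers.constraints))
    if layers.output_format:
        parts.append(layers.output_format)
    for sample_input, sample_output in layers.examples:
        parts.append(
            f"<example>\nInput: {sample_input}\nOutput: {sample_output}\n</example>"
        )
    if layers.metacognition:
        parts.append(layers.metacognition)
    return "\n\n".join(p for p in parts if p)


prompt = build_prompt(
    PromptLayers(
        identity="You are a sentiment classifier.",
        context="You have access to product reviews.",
        task="Classify each review as positive/negative/neutral.",
        constraints=["Never include PII in output"],
        output_format="Respond in JSON: {classification, confidence}",
        examples=[("Great value.", '{"classification": "positive", "confidence": 0.9}')],
        metacognition="If uncertain, classify as neutral and explain.",
    )
)
print(prompt)
```

Keeping the layers separate makes it easier to change one layer at a time when comparing prompt variants.
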
[Role] You are a [specific role] that [specific capability].
[Context] You have access to [tools/knowledge]. The user will provide [input type].
[Instructions]
1. First, [step 1]
2. Then, [step 2]
3. Finally, [step 3]
[Constraints]
- Always [requirement]
- Never [prohibition]
- If uncertain, [fallback behavior]
[Output Format]
Respond in the following format:
[format specification]
[Examples]
<example>
Input: [sample input]
Output: [sample output]
</example>

| Criterion | Explanation | Example |
|---|---|---|
| Representative | Cover most common input types | Include typical emails, not just edge cases |
| Diverse | Include edge cases and boundaries | Short + long, positive + negative |
| Ordered | Simple to complex progression | Obvious case first, ambiguous last |
| Balanced | Equal representation of categories | Not 4 positive and 1 negative |
| Task Complexity | Examples Needed |
|---|---|
| Simple classification | 2-3 |
| Moderate generation | 3-5 |
| Complex reasoning | 5-8 |
| Format-sensitive | 3-5 (focus on format consistency) |
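
Two of the selection criteria above, Balanced and Ordered, can be checked in code before a prompt ships. This is a rough sketch, not a definitive rule: the `max_ratio` threshold and the use of input length as a stand-in for complexity are assumptions, and the sample reviews are made up for illustration.

```python
from collections import Counter


def is_balanced(examples, max_ratio=2.0):
    """True if the most common label is at most max_ratio times the rarest one.

    Only labels that appear at least once are counted, so a category that is
    missing entirely has to be caught separately.
    """
    counts = Counter(label for _, label in examples)
    if not counts:
        return False
    return max(counts.values()) <= max_ratio * min(counts.values())


def order_simple_to_complex(examples):
    """Approximate 'simple to complex' by input length (a crude proxy)."""
    return sorted(examples, key=lambda ex: len(ex[0]))


# Illustrative (input, label) pairs.
examples = [
    ("The packaging was fine, though I expected more for the price.", "neutral"),
    ("Love it!", "positive"),
    ("Broke after two days.", "negative"),
]

ordered = order_simple_to_complex(examples)
print(is_balanced(examples), [label for _, label in ordered])
```

A set with 4 positive examples and 1 negative fails the check at the default ratio, which matches the "Not 4 positive and 1 negative" guidance.
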
<example>
<input>
[Representative input]
</input>
<reasoning>
[Optional: show the thinking process]
</reasoning>
<output>
[Expected output in exact target format]
</output>
</example>

| Pattern | Use When | Example |
|---|---|---|
| Standard CoT | Multi-step reasoning | "Think step by step: 1. Identify... 2. Analyze..." |
| Structured CoT | Need parseable reasoning | Wrap reasoning and final answer in separate XML tags |
| Self-Consistency | High-stakes decisions | Generate 3 solutions, pick most common |
| No CoT | Simple factual lookups, format conversion | Skip reasoning overhead |
| Task Type | Use CoT? | Rationale |
|---|---|---|
| Mathematical reasoning | Yes | Step-by-step prevents errors |
| Multi-step logic | Yes | Makes reasoning transparent |
| Classification with justification | Yes | Improves accuracy and explainability |
| Simple factual lookup | No | Adds latency without accuracy gain |
| Direct format conversion | No | No reasoning needed |
| Very short responses | No | CoT overhead exceeds benefit |
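
The Self-Consistency pattern ("generate 3 solutions, pick most common") amounts to a majority vote over sampled answers. A minimal sketch follows, assuming a caller-supplied `sample` function that returns one final answer per call; here it is stubbed with canned strings rather than a real model call.

```python
from collections import Counter
from typing import Callable


def self_consistent_answer(sample: Callable[[], str], n: int = 3) -> str:
    """Draw n answers and return the most frequent one after light normalization."""
    answers = [sample().strip().lower() for _ in range(n)]
    winner, _count = Counter(answers).most_common(1)[0]
    return winner


# Stub standing in for a model call (hypothetical); a real sampler would need
# a non-zero temperature so the samples can differ.
canned = iter(["42", "41", " 42 "])
result = self_consistent_answer(lambda: next(canned), n=3)
print(result)
```

If every answer is different, `most_common` returns the first one drawn, so a production version would likely want an explicit tie-breaking rule.
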
| Format | Use When | Parsing |
|---|---|---|
| JSON | Machine-consumed output | JSON parser |
| Markdown | Human-readable structured text | Regex or markdown parser |
| XML tags | Sections need clear boundaries | XML parser or regex |
| YAML | Configuration-like output | YAML parser |
| Plain text | Simple, unstructured response | No parsing needed |
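
For JSON output, parsing alone does not confirm that the expected fields are present or well-typed. Below is a sketch of a stricter check for the `{classification, confidence}` shape used in the sentiment example; `parse_classification` is a hypothetical helper, and the validation rules are assumptions about what a consumer would want.

```python
import json

ALLOWED_LABELS = {"positive", "negative", "neutral"}


def parse_classification(raw: str) -> dict:
    """Parse a model response; raise ValueError if it violates the expected shape."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ValueError(f"not valid JSON: {exc}") from exc
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object")
    if data.get("classification") not in ALLOWED_LABELS:
        raise ValueError("classification must be positive, negative, or neutral")
    confidence = data.get("confidence")
    # bool is a subclass of int in Python, so exclude it explicitly.
    if isinstance(confidence, bool) or not isinstance(confidence, (int, float)):
        raise ValueError("confidence must be a number")
    if not 0 <= confidence <= 1:
        raise ValueError("confidence must be between 0 and 1")
    return data


parsed = parse_classification('{"classification": "neutral", "confidence": 0.62}')
print(parsed["classification"], parsed["confidence"])
```

Raising on the first violation keeps the failure reason specific, which is useful when the error is logged or fed back into a retry.
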
Respond with a JSON object matching this schema:
{
  "classification": "positive" | "negative" | "neutral",
  "confidence": number between 0 and 1,
  "reasoning": "brief explanation",
  "key_phrases": ["array", "of", "phrases"]
}
Do not include any text outside the JSON object.

| Use Case | Temperature | Top-P | Rationale |
|---|---|---|---|
| Code generation | 0.0-0.2 | 0.9 | Deterministic, correct |
| Classification | 0.0 | 1.0 | Consistent results |
| Creative writing | 0.7-1.0 | 0.95 | Diverse, interesting |
| Summarization | 0.2-0.4 | 0.9 | Faithful but fluent |
| Brainstorming | 0.8-1.2 | 0.95 | Maximum diversity |
| Data extraction | 0.0 | 0.9 | Precise, reliable |
| Metric | Measures | Use For |
|---|---|---|
| Exact Match | Output equals expected | Classification, extraction |
| F1 Score | Precision + recall balance | Multi-label tasks |
| BLEU/ROUGE | N-gram overlap | Summarization, translation |
| JSON validity | Parseable structured output | Structured generation |
| Regex match | Output matches pattern | Format compliance |
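
Two of the automated metrics above are simple enough to compute without a library. The sketch below shows exact-match accuracy over a batch and F1 for a single multi-label item; the label names are made up for illustration, and a real evaluation would aggregate F1 over the whole dataset.

```python
def exact_match(predictions, references):
    """Fraction of predictions that equal their reference exactly."""
    if not references or len(predictions) != len(references):
        raise ValueError("predictions and references must be non-empty and equal length")
    hits = sum(p == r for p, r in zip(predictions, references))
    return hits / len(references)


def f1_score(predicted: set, expected: set) -> float:
    """F1 for one multi-label item: harmonic mean of precision and recall."""
    true_positives = len(predicted & expected)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(predicted)
    recall = true_positives / len(expected)
    return 2 * precision * recall / (precision + recall)


accuracy = exact_match(
    ["positive", "neutral", "negative", "positive"],
    ["positive", "negative", "negative", "positive"],
)
item_f1 = f1_score({"billing", "refund"}, {"billing", "shipping"})
print(accuracy, item_f1)  # 0.75 0.5
```
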
| Dimension | Scale | Description |
|---|---|---|
| Accuracy | 1-5 | Factual correctness |
| Relevance | 1-5 | Addresses the actual question |
| Coherence | 1-5 | Logical flow and structure |
| Completeness | 1-5 | Covers all required aspects |
| Tone | 1-5 | Matches desired voice |
| Conciseness | 1-5 | No unnecessary content |
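
Rubric scores from several raters (human or LLM-as-judge) are usually averaged per dimension before prompt versions are compared. This is one possible sketch, assuming each rating is a dict keyed by lowercase dimension name; the ratings shown are invented for illustration.

```python
from statistics import mean

DIMENSIONS = ("accuracy", "relevance", "coherence", "completeness", "tone", "conciseness")


def summarize_ratings(ratings):
    """Average each rubric dimension across raters; reject scores outside 1-5."""
    for rating in ratings:
        for dimension in DIMENSIONS:
            score = rating[dimension]
            if not 1 <= score <= 5:
                raise ValueError(f"{dimension} score {score} is outside the 1-5 scale")
    return {d: mean(r[d] for r in ratings) for d in DIMENSIONS}


# Illustrative ratings from two raters.
ratings = [
    {"accuracy": 5, "relevance": 4, "coherence": 4, "completeness": 3, "tone": 5, "conciseness": 4},
    {"accuracy": 4, "relevance": 4, "coherence": 5, "completeness": 3, "tone": 4, "conciseness": 2},
]
summary = summarize_ratings(ratings)
print(summary)
```

Reporting per-dimension means instead of a single overall score keeps trade-offs visible, for example a gain in completeness that costs conciseness.
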
| Variable | Expected Impact |
|---|---|
| Instruction phrasing (imperative vs descriptive) | Moderate |
| Number of few-shot examples | Moderate |
| Example ordering | Low-moderate |
| CoT presence/absence | High for reasoning tasks |
| Output format specification | High for structured output |
| Constraint placement (beginning vs end) | Low |
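
When an A/B test changes one of these variables, the difference in pass rates should be checked for statistical significance before it is trusted. A minimal sketch using a two-proportion z-test follows; the choice of test and the counts in the usage example are assumptions for illustration, not results from this skill.

```python
import math


def two_proportion_z_test(pass_a: int, n_a: int, pass_b: int, n_b: int):
    """Return (z, two-sided p-value) for the difference between two pass rates."""
    rate_a = pass_a / n_a
    rate_b = pass_b / n_b
    pooled = (pass_a + pass_b) / (n_a + n_b)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if std_err == 0:
        raise ValueError("pass rates are both 0 or both 1; the test is undefined")
    z = (rate_b - rate_a) / std_err
    # Two-sided p-value under the standard normal: 2 * (1 - Phi(|z|)) = erfc(|z| / sqrt(2)).
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


# Hypothetical counts: prompt A passes 170 of 200 eval cases, prompt B passes 188 of 200.
z, p_value = two_proportion_z_test(170, 200, 188, 200)
print(f"z = {z:.2f}, p = {p_value:.4f}")
```

The normal approximation behind this test is unreliable for small evaluation sets, where an exact test is the safer choice.
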
id: classify-sentiment
version: 2.1
model: claude-sonnet-4-20250514
temperature: 0.0
created: 2025-03-01
author: team
changelog: "Added edge case examples for sarcasm detection"
metrics:
  accuracy: 0.94
  f1: 0.92
  eval_dataset: sentiment-eval-v3
system_prompt: |
  You are a sentiment classifier...
examples:
  - input: "..."
    output: "..."

| Anti-Pattern | Why It Is Wrong | Correct Approach |
|---|---|---|
| Vague instructions ("be helpful") | Unreliable, inconsistent output | Specific instructions with examples |
| Contradictory constraints | Model cannot satisfy both | Review for consistency |
| Examples that do not match task | Confuses the model | Examples must reflect real use |
| Over-engineering simple tasks | Wasted tokens, slower | Match prompt complexity to task complexity |
| No evaluation framework | Guessing at quality | Define metrics before iterating |
| Optimizing for single example | Overfitting to one case | Optimize for the distribution |
| Assuming cross-model portability | Different models need different prompts | Test on target model |
| Skipping version control | Cannot rollback or compare | Version every prompt with metrics |
| Skill | Relationship |
|---|---|
| | LLM-as-judge evaluates prompt output quality |
| | Prompt evaluation datasets serve as acceptance tests |
| | Prompt testing follows the evaluation methodology |
| | Statistical testing validates A/B results |
| | Prompt changes reviewed like code changes |
| | Prompt readability follows clean code naming principles |