bare-eval
Original:🇺🇸 English
Translated
Run isolated eval and grading calls using CC 2.1.81 --bare mode. Constructs claude -p --bare invocations for skill evaluation, trigger testing, and LLM grading without plugin/hook interference. Use when running eval pipelines, grading skill outputs, benchmarking prompt quality, or testing trigger accuracy in isolation.
3installs
Sourceyonatangross/orchestkit
Added on
NPX Install
npx skill4agent add yonatangross/orchestkit bare-evalTags
Translated version includes tags in frontmatterSKILL.md Content
View Translation Comparison →Bare Eval — Isolated Evaluation Calls
Run for fast, clean eval/grading without plugin overhead.
claude -p --bareCC 2.1.81 required. The flag skips hooks, LSP, plugin sync, and skill directory walks.
--bareWhen to Use
- Grading skill outputs against assertions
- Trigger classification (which skill matches a prompt)
- Description optimization iterations
- Any scripted call that doesn't need plugins
-p
When NOT to Use
- Testing skill routing (needs )
--plugin-dir - Testing agent orchestration (needs full plugin context)
- Interactive sessions
Prerequisites
bash
# --bare requires ANTHROPIC_API_KEY (OAuth/keychain disabled)
export ANTHROPIC_API_KEY="sk-ant-..."
# Verify CC version
claude --version # Must be >= 2.1.81Quick Reference
| Call Type | Command Pattern |
|---|---|
| Grading | |
| Trigger | |
| Optimize | |
| Force-skill | |
Invocation Patterns
Load detailed patterns and examples:
Read("${CLAUDE_SKILL_DIR}/references/invocation-patterns.md")Grading Schemas
JSON schemas for structured eval output:
Read("${CLAUDE_SKILL_DIR}/references/grading-schemas.md")Pipeline Integration
OrchestKit's eval scripts () auto-detect bare mode:
npm run eval:skillbash
# eval-common.sh detects ANTHROPIC_API_KEY → sets BARE_MODE=true
# Scripts add --bare to all non-plugin calls automaticallyBare calls: Trigger classification, force-skill, baseline, all grading.
Never bare: (needs plugin context for routing tests).
run_with_skillPerformance
| Scenario | Without --bare | With --bare | Savings |
|---|---|---|---|
| Single grading call | ~3-5s startup | ~0.5-1s | 2-4x |
| Trigger (per prompt) | ~3-5s | ~0.5-1s | 2-4x |
| Full eval (50 calls) | ~150-250s overhead | ~25-50s | 3-5x |
Rules
Read("${CLAUDE_SKILL_DIR}/rules/_sections.md")Troubleshooting
Read("${CLAUDE_SKILL_DIR}/references/troubleshooting.md")Related
- npm script — unified skill evaluation runner
eval:skill - — trigger accuracy testing
eval:trigger - — A/B quality comparison
eval:quality - — iterative description improvement
optimize-description.sh - Version compatibility:
doctor/references/version-compatibility.md