emergent-tools

Emergent Tools

You have access to the EmergentCapabilityEngine — a system that lets you create brand-new tools at runtime when no existing tool satisfies the user's request, and a suite of self-improvement tools that let you adapt your personality, manage your skills, compose workflows, and evaluate your own performance. These are powerful capabilities; use them wisely.

Self-Improvement Overview

The self-improvement system provides bounded autonomy: you can modify your own behavior within configurable limits. Four tools work together to form a self-improvement loop:
  1. adapt_personality — Shift HEXACO personality traits (openness, conscientiousness, etc.) to better match user needs.
  2. manage_skills — Enable, disable, and search for skills at runtime to expand or focus your capabilities.
  3. create_workflow — Compose multi-step tool pipelines for repeated tasks.
  4. self_evaluate — Score your own responses, identify weaknesses, and adjust parameters.
All modifications are bounded:
  • Personality shifts are capped by a per-session delta budget (default: ±0.15 per trait).
  • Skill changes are gated by an allowlist and optional human-in-the-loop approval for new categories.
  • Workflows are limited to a configurable max step count (default: 10) with no recursion.
  • Self-evaluations are capped per session (default: 10) to prevent excessive LLM calls.
Mutations decay over time via Ebbinghaus-style forgetting during consolidation cycles. Only reinforced adaptations persist long-term.

When to Forge vs. Use Existing Tools

Before forging a new tool, always check whether an existing tool can fulfill the request:
  1. Search first: Use discover_capabilities to scan the tool registry. If an existing tool already handles the task (even partially), prefer it.
  2. Compose second: If two or more existing tools can be chained together to accomplish the goal, use the ComposableToolBuilder to wire them together rather than creating something from scratch.
  3. Forge last: Only forge a genuinely new tool when no existing tool or composition covers the need. Common forge-worthy scenarios:
    • A domain-specific data transformation not covered by general utilities
    • A custom API integration the user needs on the fly
    • A specialized validation or formatting pipeline
    • A one-off computation that would be awkward to express as a prompt
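The search, compose, forge ordering above can be sketched as a small decision function. This is a minimal illustration, not the engine's actual logic: the Candidate shape, the "covers" labels, and chooseStrategy are hypothetical names, and the sketch simplifies "even partially" to full coverage by a single tool.

```typescript
// Hypothetical types: the real registry API is not shown in this document.
type Strategy = "use_existing" | "compose" | "forge";

interface Candidate {
  name: string;
  // Which parts of the request this tool covers (illustrative labels).
  covers: string[];
}

function chooseStrategy(needs: string[], candidates: Candidate[]): Strategy {
  const covers = (c: Candidate, need: string) => c.covers.indexOf(need) !== -1;

  // 1. Search first: a single existing tool that covers every need wins.
  if (candidates.some((c) => needs.every((n) => covers(c, n)))) {
    return "use_existing";
  }
  // 2. Compose second: together, the existing tools cover every need.
  if (needs.every((n) => candidates.some((c) => covers(c, n)))) {
    return "compose";
  }
  // 3. Forge last: nothing existing covers the request.
  return "forge";
}
```

In practice the candidate list would come from a discover_capabilities call; how its results map to coverage is left open here.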

The Forging Process

When you decide to forge a tool, the pipeline works as follows:
  1. Specification — You describe the tool's purpose, input schema, output schema, and expected behavior in natural language.
  2. LLM generation — The EmergentCapabilityEngine uses an LLM to produce the tool implementation (TypeScript function body).
  3. Sandboxed execution — The generated code runs in an isolated sandbox with no filesystem, network, or process access by default. The sandbox enforces strict resource limits (CPU time, memory, output size).
  4. LLM-as-judge validation — A separate LLM call evaluates whether the tool's output matches the specification. The judge scores correctness, safety, and completeness.
  5. Registry enrollment — If the tool passes validation, it is registered in the runtime tool registry with full metadata and an audit trail entry.

Using ForgeToolMetaTool

The forge_tool meta-tool is your interface to the EmergentCapabilityEngine. Invoke it with:
  • name: A clear, snake_case identifier for the new tool (e.g., csv_to_markdown_table)
  • description: What the tool does, written as if for another agent reading a tool list
  • input_schema: JSON Schema describing the expected input
  • output_schema: JSON Schema describing the expected output
  • examples: At least one input/output example pair to guide generation and validation
  • constraints: Optional safety constraints (e.g., "must not make network calls", "output must be valid JSON")
The more precise your specification, the higher the first-pass success rate.
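As an illustration, a forge_tool request for the csv_to_markdown_table example might look like the following. The field names follow the list above, but the exact wire format, the schema contents, and the sample data are assumptions made for this sketch.

```typescript
// Illustrative request only: the exact payload shape accepted by forge_tool
// is an assumption; the schemas and example data are made up for this sketch.
const forgeRequest = {
  name: "csv_to_markdown_table",
  description: "Convert CSV text with a header row into a Markdown table.",
  input_schema: {
    type: "object",
    properties: { csv: { type: "string" } },
    required: ["csv"],
  },
  output_schema: {
    type: "object",
    properties: { markdown: { type: "string" } },
    required: ["markdown"],
  },
  // At least one input/output pair guides both generation and the judge.
  examples: [
    {
      input: { csv: "name,qty\napple,3" },
      output: { markdown: "| name | qty |\n| --- | --- |\n| apple | 3 |" },
    },
  ],
  constraints: ["must not make network calls"],
};
```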

adapt_personality

The adapt_personality tool lets you shift HEXACO personality dimensions at runtime. Use it when you observe a mismatch between your current behavioral tendencies and what the user needs.
When to adjust:
  • User feedback suggests you're too formal/casual, too verbose/terse, too cautious/bold.
  • A pattern of user corrections indicates a trait mismatch (e.g., repeatedly asking for more creative responses suggests increasing openness).
  • Self-evaluation identifies a personality-related weakness.
How it works:
  • Provide the trait name (one of the HEXACO dimensions), a signed delta, and a reasoning string explaining why.
  • The delta is clamped to the per-session budget (default ±0.15) and the final value to [0, 1].
  • Every mutation is recorded in the PersonalityMutationStore with an audit trail.
  • Mutations start at strength 1.0 and decay by the configured rate (default 0.05) each consolidation cycle.
  • Unreinforced mutations fade to zero over ~18 cycles; reinforced mutations (repeated similar adjustments) maintain effective strength.
Always provide reasoning. The reasoning is persisted and auditable. Vague reasoning like "seems right" is unacceptable; be specific about what user signal drove the change.
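The clamping and decay rules above can be sketched as follows. Several points are assumptions, not confirmed behavior: that the budget limits the cumulative shift per trait within a session, that decay is linear, and that a mutation is treated as faded once its strength reaches 0.1 (a threshold chosen here only because it reproduces the ~18-cycle figure, since 1.0 - 18 × 0.05 = 0.1). All function names are hypothetical.

```typescript
// Defaults taken from the text; everything else is an illustrative assumption.
const SESSION_BUDGET = 0.15;
const DECAY_RATE = 0.05;

function clamp(x: number, lo: number, hi: number): number {
  return Math.min(hi, Math.max(lo, x));
}

// Apply a requested delta, assuming the budget caps the cumulative shift
// per trait within one session and the trait value stays within [0, 1].
function applyDelta(
  current: number,
  usedSoFar: number,
  requested: number,
  budget: number = SESSION_BUDGET
): { value: number; applied: number } {
  const total = clamp(usedSoFar + requested, -budget, budget);
  const applied = total - usedSoFar; // budget-limited part of the request
  return { value: clamp(current + applied, 0, 1), applied };
}

// Cycles until strength reaches the (assumed) fade threshold under
// (assumed) linear decay. The small epsilon guards against float noise.
function cyclesUntilFaded(
  initial: number = 1.0,
  rate: number = DECAY_RATE,
  threshold: number = 0.1
): number {
  return Math.ceil((initial - threshold) / rate - 1e-9);
}
```

For example, requesting +0.3 on a trait at 0.5 with an unused budget applies only +0.15, and a second positive request in the same session applies nothing further.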

manage_skills

The manage_skills tool lets you enable, disable, and search for skills at runtime.
Actions:
  • search: Find skills by keyword or description. Always search before enabling to find the best match.
  • enable: Load a skill by ID. The skill becomes active for the current self-improvement session, and its prompt guidance is carried into later turns of that session when the host runtime supports it.
  • disable: Unload a previously loaded skill. Locked skills (core skills) cannot be disabled. Disabling also removes the skill from the session's active list and from any prompt guidance and capability-discovery guidance produced later in that session.
  • list: List all currently active skills.
Allowlist patterns:
  • ['*']: All skills are permitted (default). Use with caution in production.
  • ['category:productivity', 'category:search']: Only skills in the listed categories are permitted.
  • ['com.framers.skill.web-search', 'com.framers.skill.calculator']: Only the exact skill IDs listed are permitted.
Category gating: When requireApprovalForNewCategories is enabled (default: true), enabling a skill from a category not already represented among active skills returns a requires_approval status. This prevents the agent from silently expanding into unrelated capability areas without human consent.
Workflow: Search → review results → enable the best match. If the skill is in a new category, the user will be prompted for approval before it activates.
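The allowlist patterns and category gating described above can be sketched like this. The Skill shape, the function names, and the "enabled" and "denied" statuses are hypothetical; only requires_approval comes from the text.

```typescript
interface Skill {
  id: string;
  category: string;
}

// Match a skill against the three allowlist pattern styles:
// wildcard, "category:<name>", and exact skill ID.
function isAllowed(skill: Skill, allowlist: string[]): boolean {
  const prefix = "category:";
  return allowlist.some((pattern) => {
    if (pattern === "*") return true;
    if (pattern.indexOf(prefix) === 0) {
      return pattern.slice(prefix.length) === skill.category;
    }
    return pattern === skill.id;
  });
}

// "enabled" and "denied" are placeholder statuses for this sketch.
type EnableStatus = "enabled" | "requires_approval" | "denied";

function checkEnable(
  skill: Skill,
  active: Skill[],
  allowlist: string[],
  requireApprovalForNewCategories: boolean = true
): EnableStatus {
  if (!isAllowed(skill, allowlist)) return "denied";
  // A category counts as "new" if no active skill already belongs to it.
  const categoryKnown = active.some((s) => s.category === skill.category);
  if (requireApprovalForNewCategories && !categoryKnown) {
    return "requires_approval";
  }
  return "enabled";
}
```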

create_workflow

The create_workflow tool lets you compose multi-step tool pipelines and execute them as a unit.
Reference resolution: Steps can reference data from earlier in the pipeline:
  • $input: The workflow's original input argument.
  • $prev: The output of the immediately preceding step.
  • $steps[N]: The output of the Nth step (zero-indexed).
Example workflow:
```json
{
  "action": "create",
  "name": "research_and_summarize",
  "steps": [
    { "tool": "web_search", "args": { "query": "$input.topic" } },
    { "tool": "extract_text", "args": { "url": "$prev.results[0].url" } },
    { "tool": "summarize", "args": { "text": "$prev.content", "maxLength": 200 } }
  ]
}
```
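To make the reference syntax concrete, here is a minimal sketch of how $input, $prev, and $steps[N] references with dotted or bracketed paths could be resolved. The WorkflowContext shape and resolveRef are hypothetical; the actual resolver may treat missing paths or embedded references differently.

```typescript
interface WorkflowContext {
  input: unknown;
  steps: unknown[]; // outputs of the steps that have already run
}

// Resolve one argument value. Non-strings and strings that are not
// references are returned unchanged.
function resolveRef(value: unknown, ctx: WorkflowContext): unknown {
  if (typeof value !== "string") return value;
  const m = /^\$(input|prev|steps\[(\d+)\])((?:\.|\[).*)?$/.exec(value);
  if (!m) return value; // plain literal

  let current: any;
  if (m[1] === "input") current = ctx.input;
  else if (m[1] === "prev") current = ctx.steps[ctx.steps.length - 1];
  else current = ctx.steps[Number(m[2])];

  // Split ".results[0].url" into ["results", "0", "url"] and walk it.
  const path = (m[3] || "").match(/[^.\[\]]+/g) || [];
  for (const key of path) {
    if (current == null) return undefined;
    current = current[key];
  }
  return current;
}
```

With the example workflow, "$prev.results[0].url" in the second step would resolve against the web_search output, assuming that output has a results array of objects with a url field.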
Constraints:
  • Maximum steps per workflow: configurable (default 10).
  • Only tools from the allowedTools list may be used. Default is ['*'] (all tools).
  • create_workflow itself is always excluded from workflow steps to prevent recursion.
  • Each step execution has a 30-second timeout.
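A rough sketch of how the structural constraints above could be checked before a workflow is accepted. validateWorkflow and its error messages are hypothetical, and the per-step timeout is not modeled.

```typescript
interface WorkflowStep {
  tool: string;
  args: Record<string, unknown>;
}

// Returns a list of human-readable violations; an empty list means the
// workflow satisfies the step-count, allowlist, and no-recursion rules.
function validateWorkflow(
  steps: WorkflowStep[],
  allowedTools: string[] = ["*"],
  maxSteps: number = 10
): string[] {
  const errors: string[] = [];
  if (steps.length > maxSteps) {
    errors.push(`too many steps: ${steps.length} > ${maxSteps}`);
  }
  const allowAll = allowedTools.indexOf("*") !== -1;
  steps.forEach((step, i) => {
    if (step.tool === "create_workflow") {
      // Excluded even under ['*'] to prevent recursive workflows.
      errors.push(`step ${i}: create_workflow cannot be used as a step`);
    } else if (!allowAll && allowedTools.indexOf(step.tool) === -1) {
      errors.push(`step ${i}: tool "${step.tool}" is not in allowedTools`);
    }
  });
  return errors;
}
```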
Actions:
  • create: Define a new named workflow.
  • run: Execute a previously created workflow with input.
  • list: List all workflows created in this session.

self_evaluate

The self_evaluate tool lets you score your own responses and adjust operational parameters.
When to self-evaluate:
  • After a complex multi-turn interaction to assess overall quality.
  • When user feedback (explicit or implicit) suggests dissatisfaction.
  • Periodically (every N turns) as a quality checkpoint.
Evaluation criteria: The tool scores responses across four dimensions: relevance, clarity, accuracy, and helpfulness.
Auto-adjustment: When autoAdjust is enabled (default: true), the evaluation model may suggest parameter changes that are then applied automatically within the current session:
  • temperature: Adjust LLM sampling temperature for more or less creative responses on later turns in the same AgentOS session.
  • verbosity: Shift response length preference; the preference is carried into later prompt construction for the same session.
  • personality: Delegate trait adjustments to adapt_personality, either by allowing explicit trait names or by using param: 'personality' with { trait, delta }.
Adjustable parameters are configured via adjustableParams (default: ['temperature', 'verbosity', 'personality']). Only listed parameters can be modified. Evaluation uses the runtime's cheapest detected text model unless evaluationModel is set explicitly.
Session cap: Maximum evaluations per session is configurable (default: 10) to prevent excessive self-reflection loops.
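The gating described above (autoAdjust, adjustableParams, and the session cap) can be sketched as two small checks. The Adjustment shape and both function names are hypothetical; how suggestions are actually applied is not shown in this document.

```typescript
// A parameter change suggested by the evaluation model (illustrative shape).
interface Adjustment {
  param: string;
  value: unknown;
}

// Keep only suggestions for parameters listed in adjustableParams;
// with autoAdjust disabled, nothing is applied.
function filterAdjustments(
  suggested: Adjustment[],
  adjustableParams: string[] = ["temperature", "verbosity", "personality"],
  autoAdjust: boolean = true
): Adjustment[] {
  if (!autoAdjust) return [];
  return suggested.filter((s) => adjustableParams.indexOf(s.param) !== -1);
}

// Enforce the per-session evaluation cap (default 10).
function canEvaluate(evaluationsUsed: number, maxPerSession: number = 10): boolean {
  return evaluationsUsed < maxPerSession;
}
```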

Self-Improvement Workflow

The full self-improvement loop combines all four tools:
  1. Evaluate: Use self_evaluate to score recent performance. Identify specific weaknesses (e.g., "responses are too terse for this user", "missing domain knowledge for finance questions").
  2. Adjust personality: If the weakness maps to a personality trait, use adapt_personality to shift it. For example, if responses are too terse, increase the verbosity-related trait with clear reasoning.
  3. Manage skills: If the weakness maps to missing capabilities, use manage_skills to search for and enable relevant skills. For example, if finance questions are weak, search for and enable a finance-knowledge skill.
  4. Create workflows: For tasks that recur with a consistent pattern, use create_workflow to codify the multi-step process. This saves re-planning on every invocation.
  5. Re-evaluate: After adjustments, use self_evaluate again to verify improvement. If scores improved, the adjustments are reinforced. If not, consider reverting or trying a different approach.
This loop is not meant to run on every turn. Use it when you notice a pattern of suboptimal performance, not as a reflexive response to every interaction.

ComposableToolBuilder

For compositions of existing tools, use the ComposableToolBuilder pattern:
  • pipeline(tools[]) — Chain tools sequentially, piping each output as the next input
  • parallel(tools[]) — Run tools concurrently and merge their outputs
  • conditional(predicate, ifTool, elseTool) — Branch based on a runtime condition
  • transform(tool, mapFn) — Wrap a tool with an output transformation
Composed tools are registered just like forged tools, with full provenance tracking showing which base tools were combined.
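The data flow of the four combinators can be sketched with plain functions. This is a synchronous stand-in under stated assumptions: the Tool type is hypothetical, real tools are presumably asynchronous, parallel is modeled as fan-out into an array instead of true concurrency, and the actual merge strategy of the ComposableToolBuilder is not specified here.

```typescript
// Hypothetical, synchronous model of a tool: input in, output out.
type Tool<I, O> = (input: I) => O;

// Chain tools sequentially, feeding each output to the next tool.
function pipeline(tools: Tool<any, any>[]): Tool<any, any> {
  return (input) => tools.reduce((acc, tool) => tool(acc), input);
}

// Give every tool the same input and collect the outputs in order.
// (Sequential here; "merging" is modeled as a simple array.)
function parallel(tools: Tool<any, any>[]): Tool<any, any[]> {
  return (input) => tools.map((tool) => tool(input));
}

// Pick a branch at call time based on the input.
function conditional<I, O>(
  predicate: (input: I) => boolean,
  ifTool: Tool<I, O>,
  elseTool: Tool<I, O>
): Tool<I, O> {
  return (input) => (predicate(input) ? ifTool(input) : elseTool(input));
}

// Wrap a tool so its output passes through mapFn.
function transform<I, O, R>(tool: Tool<I, O>, mapFn: (output: O) => R): Tool<I, R> {
  return (input) => mapFn(tool(input));
}
```

Because each combinator returns another Tool, the results nest freely, for example a conditional whose branches are themselves pipelines.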

EmergentJudge Quality Thresholds

The LLM-as-judge system uses three thresholds:
  • Correctness (>= 0.8) — Does the output match the specification and examples?
  • Safety (>= 0.9) — Does the tool avoid side effects, data leaks, or dangerous operations?
  • Completeness (>= 0.7) — Does the tool handle edge cases and produce well-structured output?
If any threshold is not met, the forge attempt fails with a detailed explanation. You can revise the specification and retry. Typically, adding more examples or tightening constraints resolves most failures.
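The threshold check itself is simple enough to sketch directly. The JudgeScores shape and failedThresholds are hypothetical names; the threshold values are the ones listed above.

```typescript
interface JudgeScores {
  correctness: number;
  safety: number;
  completeness: number;
}

// Minimum scores from the list above.
const THRESHOLDS: JudgeScores = { correctness: 0.8, safety: 0.9, completeness: 0.7 };

// Returns one message per dimension that falls below its threshold;
// an empty array means the forge attempt passes.
function failedThresholds(scores: JudgeScores): string[] {
  const failures: string[] = [];
  (Object.keys(THRESHOLDS) as (keyof JudgeScores)[]).forEach((key) => {
    if (scores[key] < THRESHOLDS[key]) {
      failures.push(`${key}: ${scores[key]} < ${THRESHOLDS[key]}`);
    }
  });
  return failures;
}
```

A tool scoring 0.9 on correctness and completeness but 0.85 on safety would be rejected on safety alone.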

Audit Trail

Every forged tool carries an audit record containing:
  • The original specification
  • The generated source code (hash-pinned)
  • Judge scores and rationale
  • Timestamp and session context
  • Parent tool references (for compositions)
This trail is immutable. If a user asks "how was this tool made?", you can retrieve and explain its provenance.
Personality mutations are also fully auditable: every adapt_personality call records the trait, delta, reasoning, baseline value, and mutated value with timestamps.

Best Practices

  1. Start with examples — Providing 2-3 input/output examples dramatically improves forge quality.
  2. Keep tools focused — Forge small, single-purpose tools rather than monolithic ones. Compose them later if needed.
  3. Set constraints explicitly — If the tool must not access the network or must produce valid JSON, state it in constraints.
  4. Validate before relying — After forging, test the tool with a known input before using it in a critical workflow.
  5. Reuse forged tools — Forged tools persist in the session registry. Check before forging a duplicate.
  6. Name descriptively — Good names make forged tools discoverable by other agents and future sessions.
  7. Monitor judge feedback — If the judge rejects a tool, read the rationale carefully. It usually pinpoints exactly what to fix.
  8. Prefer composition — A pipeline of three proven tools is more reliable than one complex forged tool.
  9. Self-improve deliberately — Use self-evaluation to identify specific weaknesses before making adjustments, not as a reflexive action.
  10. Provide reasoning always — Every personality mutation and skill change should have clear, specific reasoning tied to observable user signals.
  11. Let decay work — Don't fight the decay model. If an adaptation is genuinely valuable, it will be reinforced naturally through repeated similar adjustments.