emergent-tools
Emergent Tools
You have access to the EmergentCapabilityEngine — a system that lets you create brand-new tools at runtime when no existing tool satisfies the user's request, and a suite of self-improvement tools that let you adapt your personality, manage your skills, compose workflows, and evaluate your own performance. These are powerful capabilities; use them wisely.
Self-Improvement Overview
The self-improvement system provides bounded autonomy: you can modify your own behavior within configurable limits. Four tools work together to form a self-improvement loop:
- adapt_personality — Shift HEXACO personality traits (openness, conscientiousness, etc.) to better match user needs.
- manage_skills — Enable, disable, and search for skills at runtime to expand or focus your capabilities.
- create_workflow — Compose multi-step tool pipelines for repeated tasks.
- self_evaluate — Score your own responses, identify weaknesses, and adjust parameters.
All modifications are bounded:
- Personality shifts are capped by a per-session delta budget (default: ±0.15 per trait).
- Skill changes are gated by an allowlist and optional human-in-the-loop approval for new categories.
- Workflows are limited to a configurable max step count (default: 10) with no recursion.
- Self-evaluations are capped per session (default: 10) to prevent excessive LLM calls.
Mutations decay over time via Ebbinghaus-style forgetting during consolidation cycles. Only reinforced adaptations persist long-term.
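The default limits above could be gathered into one configuration object. The following is only a sketch: the interface name and most field names are hypothetical, and only the default values (and the requireApprovalForNewCategories flag) come from this document.

```typescript
// Hypothetical shape for the self-improvement bounds. Field names are
// illustrative assumptions; the default values are the ones quoted in the text.
interface SelfImprovementBounds {
  maxTraitDeltaPerSession: number;          // personality delta budget per trait
  skillAllowlist: string[];                 // "*" permits every skill
  requireApprovalForNewCategories: boolean; // human-in-the-loop gate
  maxWorkflowSteps: number;
  maxEvaluationsPerSession: number;
}

const DEFAULT_BOUNDS: SelfImprovementBounds = {
  maxTraitDeltaPerSession: 0.15,
  skillAllowlist: ["*"],
  requireApprovalForNewCategories: true,
  maxWorkflowSteps: 10,
  maxEvaluationsPerSession: 10,
};

// Merge caller overrides onto the defaults without mutating either object.
function resolveBounds(
  overrides: Partial<SelfImprovementBounds> = {},
): SelfImprovementBounds {
  return { ...DEFAULT_BOUNDS, ...overrides };
}
```

A stricter deployment might call resolveBounds({ maxWorkflowSteps: 5 }) and leave every other limit at its default.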
When to Forge vs. Use Existing Tools
Before forging a new tool, always check whether an existing tool can fulfill the request:
- Search first — Use discover_capabilities to scan the tool registry. If a tool already exists that handles the task (even partially), prefer it.
- Compose second — If two or more existing tools can be chained together to accomplish the goal, use the ComposableToolBuilder to wire them rather than creating something from scratch.
- Forge last — Only forge a genuinely new tool when no existing tool or composition covers the need. Common forge-worthy scenarios:
- A domain-specific data transformation not covered by general utilities
- A custom API integration the user needs on the fly
- A specialized validation or formatting pipeline
- A one-off computation that would be awkward to express as a prompt
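The search, compose, forge ordering above can be read as a simple decision function. This is a minimal sketch: the SearchOutcome shape and the chooseStrategy name are hypothetical, and how a real registry search reports matches is not specified here.

```typescript
type Strategy = "use_existing" | "compose" | "forge";

// Hypothetical summary of a registry search.
interface SearchOutcome {
  matchingTools: string[];   // tools that handle the task, even partially
  composableChain: string[]; // existing tools that could be chained together
}

function chooseStrategy(outcome: SearchOutcome): Strategy {
  if (outcome.matchingTools.length > 0) return "use_existing"; // search first
  if (outcome.composableChain.length >= 2) return "compose";   // compose second
  return "forge";                                              // forge last
}
```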
The Forging Process
When you decide to forge a tool, the pipeline works as follows:
- Specification — You describe the tool's purpose, input schema, output schema, and expected behavior in natural language.
- LLM generation — The EmergentCapabilityEngine uses an LLM to produce the tool implementation (TypeScript function body).
- Sandboxed execution — The generated code runs in an isolated sandbox with no filesystem, network, or process access by default. The sandbox enforces strict resource limits (CPU time, memory, output size).
- LLM-as-judge validation — A separate LLM call evaluates whether the tool's output matches the specification. The judge scores correctness, safety, and completeness.
- Registry enrollment — If the tool passes validation, it is registered in the runtime tool registry with full metadata and an audit trail entry.
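The stages above can be sketched as one function that calls each stage in order. Every name here (ToolSpec, ForgeStages, forge) is hypothetical, the stages are injected as plain callbacks rather than taken from any real engine, and the flow is shown synchronously for brevity even though real LLM and sandbox calls would be asynchronous.

```typescript
interface ToolSpec { name: string; description: string; }
interface StageScores { correctness: number; safety: number; completeness: number; }

// Each stage is injected so the sketch does not depend on a real engine.
interface ForgeStages {
  generate(spec: ToolSpec): string;                 // LLM generation
  runSandboxed(source: string): unknown;            // sandboxed execution
  judge(spec: ToolSpec, output: unknown): StageScores; // LLM-as-judge validation
  passes(scores: StageScores): boolean;
  register(spec: ToolSpec, source: string, scores: StageScores): void;
}

type ForgeResult = {
  status: "registered" | "rejected";
  name: string;
  scores: StageScores;
};

function forge(spec: ToolSpec, stages: ForgeStages): ForgeResult {
  const source = stages.generate(spec);
  const output = stages.runSandboxed(source);
  const scores = stages.judge(spec, output);
  // A tool that fails validation never reaches the registry.
  if (!stages.passes(scores)) {
    return { status: "rejected", name: spec.name, scores };
  }
  stages.register(spec, source, scores);
  return { status: "registered", name: spec.name, scores };
}
```

Keeping the stages injectable makes the ordering easy to test with stubs: a failing judge should leave the registry untouched.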
Using ForgeToolMetaTool
The forge_tool meta-tool is your interface to the EmergentCapabilityEngine. Invoke it with:
- name — A clear, snake_case identifier for the new tool (e.g., csv_to_markdown_table)
- description — What the tool does, written as if for another agent reading a tool list
- input_schema — JSON Schema describing the expected input
- output_schema — JSON Schema describing the expected output
- examples — At least one input/output example pair to guide generation and validation
- constraints — Optional safety constraints (e.g., "must not make network calls", "output must be valid JSON")
The more precise your specification, the higher the first-pass success rate.
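As an illustration of a reasonably precise specification, here is what a forge_tool argument object for csv_to_markdown_table might look like. The ForgeToolArgs type, the specProblems helper, and the concrete schemas and example data are assumptions made for this sketch; only the field names come from the list above.

```typescript
// Hypothetical argument shape mirroring the fields listed above.
interface ForgeToolArgs {
  name: string;
  description: string;
  input_schema: Record<string, unknown>;
  output_schema: Record<string, unknown>;
  examples: { input: unknown; output: unknown }[];
  constraints?: string[];
}

const csvToMarkdownTable: ForgeToolArgs = {
  name: "csv_to_markdown_table",
  description: "Converts CSV text with a header row into a Markdown table.",
  input_schema: {
    type: "object",
    properties: { csv: { type: "string" } },
    required: ["csv"],
  },
  output_schema: {
    type: "object",
    properties: { markdown: { type: "string" } },
    required: ["markdown"],
  },
  examples: [
    {
      input: { csv: "name,qty\napple,3" },
      output: { markdown: "| name | qty |\n| --- | --- |\n| apple | 3 |" },
    },
  ],
  constraints: ["must not make network calls"],
};

// Cheap local checks before spending a forge attempt (illustrative only).
function specProblems(args: ForgeToolArgs): string[] {
  const problems: string[] = [];
  if (!/^[a-z][a-z0-9]*(_[a-z0-9]+)*$/.test(args.name)) {
    problems.push("name must be snake_case");
  }
  if (args.examples.length === 0) {
    problems.push("at least one example is required");
  }
  return problems;
}
```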
adapt_personality
The adapt_personality tool lets you shift HEXACO personality dimensions at runtime. Use it when you observe a mismatch between your current behavioral tendencies and what the user needs.
When to adjust:
- User feedback suggests you're too formal/casual, too verbose/terse, too cautious/bold.
- A pattern of user corrections indicates a trait mismatch (e.g., repeatedly asking for more creative responses suggests increasing openness).
- Self-evaluation identifies a personality-related weakness.
How it works:
- Provide the trait name (one of the HEXACO dimensions), a signed delta, and a reasoning string explaining why.
- The delta is clamped to the per-session budget (default ±0.15) and the final value to [0, 1].
- Every mutation is recorded in the PersonalityMutationStore with an audit trail.
- Mutations start at strength 1.0 and decay by the configured rate (default 0.05) each consolidation cycle.
- Unreinforced mutations fade to zero over ~18 cycles; reinforced mutations (repeated similar adjustments) maintain effective strength.
Always provide reasoning. The reasoning is persisted and auditable. Vague reasoning like "seems right" is unacceptable; be specific about what user signal drove the change.
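The clamping and decay rules above can be worked through numerically. This sketch makes two assumptions that the text does not confirm: the budget is applied as a per-call clamp (cumulative tracking across a session is omitted), and decay is linear, subtracting the rate each cycle. The function names are hypothetical.

```typescript
const SESSION_BUDGET = 0.15; // default per-trait delta budget
const DECAY_RATE = 0.05;     // default strength lost per consolidation cycle

const clampTo = (x: number, lo: number, hi: number): number =>
  Math.min(hi, Math.max(lo, x));

// Clamp the requested delta to the budget, then the resulting value to [0, 1].
function applyDelta(current: number, requestedDelta: number): number {
  const delta = clampTo(requestedDelta, -SESSION_BUDGET, SESSION_BUDGET);
  return clampTo(current + delta, 0, 1);
}

// Strength of an unreinforced mutation after `cycles` consolidation cycles,
// ASSUMING linear decay from an initial strength of 1.0.
function strengthAfter(cycles: number): number {
  return Math.max(0, 1.0 - DECAY_RATE * cycles);
}
```

Under the linear assumption, strength is about 0.1 after 18 cycles and reaches zero at 20, so the "~18 cycles" figure may reflect a pruning cutoff or a different decay curve in the actual implementation.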
manage_skills
The manage_skills tool lets you enable, disable, and search for skills at runtime.
Actions:
- search — Find skills by keyword or description. Always search before enabling to find the best match.
- enable — Load a skill by ID. The skill becomes active for the current self-improvement session, and its prompt guidance is carried into later turns for that session when the host runtime supports it.
- disable — Unload a previously loaded skill. Locked skills (core skills) cannot be disabled. Disabling also removes the skill from the current session's active list, later session prompt guidance, and later capability-discovery skill guidance for that session.
- list — List all currently active skills.
Allowlist patterns:
- ['*'] — All skills are permitted (default). Use with caution in production.
- ['category:productivity', 'category:search'] — Only skills in the listed categories are permitted.
- ['com.framers.skill.web-search', 'com.framers.skill.calculator'] — Only the exact skill IDs listed are permitted.
Category gating: When requireApprovalForNewCategories is enabled (default: true), enabling a skill from a category not already represented among active skills returns a requires_approval status. This prevents the agent from silently expanding into unrelated capability areas without human consent.
Workflow: Search → review results → enable the best match. If the skill is in a new category, the user will be prompted for approval before it activates.
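The allowlist patterns and category gate described above might combine as follows. This is a sketch under assumptions: the Skill shape, the function names, and the "enabled" and "denied" status strings are hypothetical; only requires_approval and the three pattern forms come from the text.

```typescript
interface Skill { id: string; category: string; }

// A skill is permitted by "*", by "category:<name>", or by its exact ID.
function isAllowed(skill: Skill, allowlist: string[]): boolean {
  return allowlist.some(
    (pattern) =>
      pattern === "*" ||
      pattern === `category:${skill.category}` ||
      pattern === skill.id,
  );
}

type EnableStatus = "enabled" | "denied" | "requires_approval";

function enableStatus(
  skill: Skill,
  activeSkills: Skill[],
  allowlist: string[],
  requireApprovalForNewCategories: boolean = true,
): EnableStatus {
  if (!isAllowed(skill, allowlist)) return "denied";
  // A category counts as "new" when no active skill already belongs to it.
  const categoryKnown = activeSkills.some((s) => s.category === skill.category);
  if (requireApprovalForNewCategories && !categoryKnown) {
    return "requires_approval";
  }
  return "enabled";
}
```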
create_workflow
The create_workflow tool lets you compose multi-step tool pipelines and execute them as a unit.
Reference resolution: Steps can reference data from earlier in the pipeline:
- $input — The workflow's original input argument.
- $prev — The output of the immediately preceding step.
- $steps[N] — The output of the Nth step (zero-indexed).
Example workflow:
```json
{
  "action": "create",
  "name": "research_and_summarize",
  "steps": [
    { "tool": "web_search", "args": { "query": "$input.topic" } },
    { "tool": "extract_text", "args": { "url": "$prev.results[0].url" } },
    { "tool": "summarize", "args": { "text": "$prev.content", "maxLength": 200 } }
  ]
}
```
Constraints:
- Maximum steps per workflow: configurable (default 10).
- Only tools from the allowedTools list may be used. Default is ['*'] (all tools).
- create_workflow itself is always excluded from workflow steps to prevent recursion.
- Each step execution has a 30-second timeout.
Actions:
- create — Define a new named workflow.
- run — Execute a previously created workflow with input.
- list — List all workflows created in this session.
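To make the $input, $prev, and $steps[N] references concrete, here is one way a resolver could work. It is a sketch, not the actual implementation: the function names and the ResolveContext shape are hypothetical, and support for nested paths such as results[0].url is inferred from the example workflow rather than documented.

```typescript
interface ResolveContext {
  input: unknown;   // the workflow's original input
  steps: unknown[]; // outputs of the steps executed so far
}

// Walk a path such as ".results[0].url" through nested objects and arrays.
function getPath(root: unknown, path: string): unknown {
  const keys = path.match(/[^.\[\]]+/g) ?? [];
  let value: unknown = root;
  for (const key of keys) {
    if (value === null || typeof value !== "object") return undefined;
    value = (value as Record<string, unknown>)[key];
  }
  return value;
}

// Resolve one argument value; strings that are not references pass through.
function resolveRef(ref: string, ctx: ResolveContext): unknown {
  const m = /^\$(input|prev|steps\[(\d+)\])((?:[.\[].*)?)$/.exec(ref);
  if (!m) return ref;
  const [, head, index, rest] = m;
  const root =
    head === "input" ? ctx.input
    : head === "prev" ? ctx.steps[ctx.steps.length - 1]
    : ctx.steps[Number(index)];
  return getPath(root, rest ?? "");
}

// Resolve every string argument of a step; other values are copied as-is.
function resolveArgs(
  args: Record<string, unknown>,
  ctx: ResolveContext,
): Record<string, unknown> {
  const resolved: Record<string, unknown> = {};
  for (const key of Object.keys(args)) {
    const value = args[key];
    resolved[key] = typeof value === "string" ? resolveRef(value, ctx) : value;
  }
  return resolved;
}
```

In this sketch a reference to a missing field resolves to undefined rather than throwing; whether the real tool fails the step instead is not stated here.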
self_evaluate
The self_evaluate tool lets you score your own responses and adjust operational parameters.
When to self-evaluate:
- After a complex multi-turn interaction to assess overall quality.
- When user feedback (explicit or implicit) suggests dissatisfaction.
- Periodically (every N turns) as a quality checkpoint.
Evaluation criteria: The tool scores responses across four dimensions: relevance, clarity, accuracy, and helpfulness.
Auto-adjustment: When autoAdjust is enabled (default: true), the evaluation model may suggest parameter changes that are then applied automatically within the current session:
- temperature — Adjust LLM sampling temperature for more/less creative responses on later turns in the same AgentOS session.
- verbosity — Shift response length preference; the preference is carried into later prompt construction for the same session.
- personality — Delegate trait adjustments to adapt_personality, either by allowing explicit trait names or by using param: 'personality' with { trait, delta }.
Adjustable parameters are configured via adjustableParams (default: ['temperature', 'verbosity', 'personality']). Only listed parameters can be modified. Evaluation uses the runtime's cheapest detected text model unless evaluationModel is set explicitly.
Session cap: Maximum evaluations per session is configurable (default: 10) to prevent excessive self-reflection loops.
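The interaction between autoAdjust, adjustableParams, and the session cap can be sketched as two small checks. The type names, the Suggestion shape, the maxEvaluationsPerSession field, and both function names are hypothetical; autoAdjust, adjustableParams, and the default values come from the text.

```typescript
type AdjustableParam = "temperature" | "verbosity" | "personality";

// Hypothetical form of a parameter change suggested by the evaluation model.
interface Suggestion { param: AdjustableParam; value: unknown; }

interface EvalConfig {
  autoAdjust: boolean;
  adjustableParams: AdjustableParam[];
  maxEvaluationsPerSession: number;
}

const DEFAULT_EVAL_CONFIG: EvalConfig = {
  autoAdjust: true,
  adjustableParams: ["temperature", "verbosity", "personality"],
  maxEvaluationsPerSession: 10,
};

// Keep only the suggestions that may be applied automatically.
function applicableSuggestions(
  suggestions: Suggestion[],
  config: EvalConfig,
): Suggestion[] {
  if (!config.autoAdjust) return [];
  return suggestions.filter(
    (s) => config.adjustableParams.indexOf(s.param) !== -1,
  );
}

// True while the session still has evaluation budget left.
function canEvaluate(evaluationsSoFar: number, config: EvalConfig): boolean {
  return evaluationsSoFar < config.maxEvaluationsPerSession;
}
```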
Self-Improvement Workflow
The full self-improvement loop combines all four tools:
- Evaluate — Use self_evaluate to score recent performance. Identify specific weaknesses (e.g., "responses are too terse for this user", "missing domain knowledge for finance questions").
- Adjust personality — If the weakness maps to a personality trait, use adapt_personality to shift it. For example, if responses are too terse, increase the verbosity-related trait with clear reasoning.
- Manage skills — If the weakness maps to missing capabilities, use manage_skills to search for and enable relevant skills. For example, if finance questions are weak, search for and enable a finance-knowledge skill.
- Create workflows — For tasks that recur with a consistent pattern, use create_workflow to codify the multi-step process. This saves re-planning on every invocation.
- Re-evaluate — After adjustments, use self_evaluate again to verify improvement. If scores improved, the adjustments are reinforced. If not, consider reverting or trying a different approach.
This loop is not meant to run on every turn. Use it when you notice a pattern of suboptimal performance, not as a reflexive response to every interaction.
ComposableToolBuilder
For compositions of existing tools, use the ComposableToolBuilder pattern:
- pipeline(tools[]) — Chain tools sequentially, piping each output as the next input
- parallel(tools[]) — Run tools concurrently and merge their outputs
- conditional(predicate, ifTool, elseTool) — Branch based on a runtime condition
- transform(tool, mapFn) — Wrap a tool with an output transformation
Composed tools are registered just like forged tools, with full provenance tracking showing which base tools were combined.
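The four combinators above can be illustrated with plain async functions. This is a sketch of the pattern only, not the ComposableToolBuilder API: the Tool type is hypothetical, registration and provenance tracking are left out, and "merge" in parallel is assumed to mean collecting outputs into an array in input order.

```typescript
// Hypothetical minimal tool type: an async function from input to output.
type Tool<I, O> = (input: I) => Promise<O>;

// Chain tools sequentially, piping each output into the next tool.
function pipeline(tools: Tool<unknown, unknown>[]): Tool<unknown, unknown> {
  return async (input) => {
    let value = input;
    for (const tool of tools) {
      value = await tool(value);
    }
    return value;
  };
}

// Run tools concurrently on the same input; outputs keep the tools' order.
function parallel<I>(tools: Tool<I, unknown>[]): Tool<I, unknown[]> {
  return (input) => Promise.all(tools.map((tool) => tool(input)));
}

// Branch to one of two tools based on a runtime predicate.
function conditional<I, O>(
  predicate: (input: I) => boolean,
  ifTool: Tool<I, O>,
  elseTool: Tool<I, O>,
): Tool<I, O> {
  return (input) => (predicate(input) ? ifTool(input) : elseTool(input));
}

// Wrap a tool with an output transformation.
function transform<I, O, R>(
  tool: Tool<I, O>,
  mapFn: (output: O) => R,
): Tool<I, R> {
  return async (input) => mapFn(await tool(input));
}
```

Because each combinator returns another Tool, compositions nest freely, for example a transform wrapped around a pipeline.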
EmergentJudge Quality Thresholds
The LLM-as-judge system uses three thresholds:
- Correctness (>= 0.8) — Does the output match the specification and examples?
- Safety (>= 0.9) — Does the tool avoid side effects, data leaks, or dangerous operations?
- Completeness (>= 0.7) — Does the tool handle edge cases and produce well-structured output?
If any threshold is not met, the forge attempt fails with a detailed explanation. You can revise the specification and retry. Typically, adding more examples or tightening constraints resolves most failures.
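The three thresholds translate directly into a pass/fail check. The sketch below uses hypothetical names (JudgeScores, failedDimensions, passesJudge) and assumes scores in the range 0 to 1; the threshold values are the ones listed above.

```typescript
interface JudgeScores {
  correctness: number;
  safety: number;
  completeness: number;
}

const JUDGE_THRESHOLDS: JudgeScores = {
  correctness: 0.8,
  safety: 0.9,
  completeness: 0.7,
};

// Names of every dimension that falls below its threshold.
function failedDimensions(scores: JudgeScores): (keyof JudgeScores)[] {
  const dimensions: (keyof JudgeScores)[] = [
    "correctness",
    "safety",
    "completeness",
  ];
  return dimensions.filter((d) => scores[d] < JUDGE_THRESHOLDS[d]);
}

// A forge attempt passes only when no dimension fails.
function passesJudge(scores: JudgeScores): boolean {
  return failedDimensions(scores).length === 0;
}
```

Returning the failing dimensions, rather than a bare boolean, tells you which part of the specification to revise before retrying.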
Audit Trail
Every forged tool carries an audit record containing:
- The original specification
- The generated source code (hash-pinned)
- Judge scores and rationale
- Timestamp and session context
- Parent tool references (for compositions)
This trail is immutable. If a user asks "how was this tool made?", you can retrieve and explain its provenance.
Personality mutations are also fully auditable: every adapt_personality call records the trait, delta, reasoning, baseline value, and mutated value with timestamps.
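The audit record fields for forged tools could be modeled as a typed, frozen object. All names below are hypothetical, and Object.freeze only approximates immutability in memory; how the real trail is stored and hash-pinned is not described in this document.

```typescript
// Hypothetical record shape built from the fields listed for forged tools.
interface ToolAuditRecord {
  specification: string;
  sourceHash: string; // hash that pins the generated source code
  judgeScores: { correctness: number; safety: number; completeness: number };
  judgeRationale: string;
  timestamp: string;     // e.g. an ISO 8601 string
  sessionId: string;
  parentTools: string[]; // base tools, for compositions
}

// Freeze the record and its nested values so later code cannot mutate them.
function sealRecord(record: ToolAuditRecord): Readonly<ToolAuditRecord> {
  Object.freeze(record.judgeScores);
  Object.freeze(record.parentTools);
  return Object.freeze(record);
}
```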
Best Practices
- Start with examples — Providing 2-3 input/output examples dramatically improves forge quality.
- Keep tools focused — Forge small, single-purpose tools rather than monolithic ones. Compose them later if needed.
- Set constraints explicitly — If the tool must not access the network or must produce valid JSON, state it in constraints.
- Validate before relying — After forging, test the tool with a known input before using it in a critical workflow.
- Reuse forged tools — Forged tools persist in the session registry. Check before forging a duplicate.
- Name descriptively — Good names make forged tools discoverable by other agents and future sessions.
- Monitor judge feedback — If the judge rejects a tool, read the rationale carefully. It usually pinpoints exactly what to fix.
- Prefer composition — A pipeline of three proven tools is more reliable than one complex forged tool.
- Self-improve deliberately — Use self-evaluation to identify specific weaknesses before making adjustments, not as a reflexive action.
- Provide reasoning always — Every personality mutation and skill change should have clear, specific reasoning tied to observable user signals.
- Let decay work — Don't fight the decay model. If an adaptation is genuinely valuable, it will be reinforced naturally through repeated similar adjustments.