memory-collector

Memory collector

Your richest memory source is sitting untouched on disk: every coding-agent session ever run on this machine. Transcripts full of decisions, gotchas, open questions, people, and the occasional quotable outburst — none of it queryable, all of it rotting in JSONL.
This skill is the collector: a budgeted, cursor-tracked harvest that mines those transcripts and plants durable knowledge into whatever memory stores the current environment exposes. It is storage-agnostic (capabilities discovered at run time, like memory-gardener) and source-aware (it knows where coding tools keep their sessions). The extraction prompts ship in `prompts/`, carried from EI's extraction pipeline by Jeremy Scherer (MIT).
The composition: the collector plants; the gardener prunes. Collection deliberately tolerates near-duplicates and overgrowth — the gardener's validate gate, dedup curator, and bloat-split exist precisely to tend what collection produces. Don't make the collector perfect; make the pair converge.

Ground rules — the safety contract

  1. Sources are read-only. Never modify, move, or delete a transcript file.
  2. Stores are additive. The collector creates and updates memory items; it never deletes. Anything that looks delete-worthy is the gardener's job.
  3. Budget every run. Default: 3 sessions (or ~150 messages) per run, oldest unprocessed first. Stop at the budget; the cursor makes the next run continue cleanly.
  4. Skip live sessions. A transcript modified in the last ~30 minutes (or whose tool is plainly mid-session) gets skipped — half-written sessions extract badly. It will be there next run.
  5. Never store secrets. Coding transcripts contain tokens, connection strings, ARNs, and keys. If an extracted value is shaped like a credential, drop it. The shipped prompts already exclude these from quotes; apply the same bar to every field you store.
  6. Conservative is the law. The shipped prompts are tuned so that empty results are the most common response. Honor that — noise is worse than gaps.
  7. Provenance is mandatory. Every stored item carries its source id (see Phase 4). An item you can't trace back to a session is a rumor.
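Rule 5 can be approximated with a shape check applied to every string field before anything is stored. The following is a minimal sketch, not the shipped prompts' logic: the pattern list, `looksLikeSecret`, `containsSecret`, and `dropSecrets` are all hypothetical names, and this version drops the whole item when any field matches, which is stricter than dropping a single value.

```javascript
// Hypothetical patterns (assumption): extend for the credential formats you actually see.
const SECRET_PATTERNS = [
  /\bAKIA[0-9A-Z]{16}\b/,                         // AWS access key id
  /\barn:aws[a-z-]*:[^\s]+/i,                     // AWS ARN
  /-----BEGIN [A-Z ]*PRIVATE KEY-----/,           // PEM private key header
  /\b[a-z][a-z0-9+.-]*:\/\/[^\s:/@]+:[^\s@]+@/i,  // connection string with user:password@
  /\b(?:token|secret|password|api[_-]?key)\b\s*[:=]\s*\S+/i, // key=value style assignments
];

// True when a single string is shaped like a credential.
function looksLikeSecret(value) {
  return typeof value === "string" && SECRET_PATTERNS.some((re) => re.test(value));
}

// Recursively check every field of an extracted item (strings, arrays, nested objects).
function containsSecret(item) {
  if (typeof item === "string") return looksLikeSecret(item);
  if (Array.isArray(item)) return item.some(containsSecret);
  if (item && typeof item === "object") return Object.values(item).some(containsSecret);
  return false;
}

// Keep clean items and count the dropped ones for the collection report.
function dropSecrets(items) {
  const kept = items.filter((item) => !containsSecret(item));
  return { kept, dropped: items.length - kept.length };
}
```

The `dropped` count maps onto the "secrets-shaped values" line of the collection report. False positives are acceptable here, in keeping with rule 6.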

Phase 0 — survey

Transcript sources. The skill bundles dependency-free Node readers (`readers/`); run each with `node readers/<tool>.mjs --list --since <cursor high-water mark>` to discover what exists. Do not parse session stores by hand: the readers already encode the format traps (sidechain files, tool-result records masquerading as user messages, lossy cwd encodings).
| Tool | Reader | Where sessions live |
| --- | --- | --- |
| Claude Code | `readers/claude_code.mjs` | `~/.claude/projects/<encoded>/<uuid>.jsonl` |
| Pi / OMP | `readers/pi.mjs` | `~/.pi/agent/sessions/` (and `~/.omp/…`) |
| Codex | `readers/codex.mjs` | `~/.codex/state_<N>.sqlite` + rollout JSONL |
| OpenCode, Cursor | none yet; see `readers/README.md` to add one | local app data |
Memory stores. Discover capabilities from the tool surface exactly as the gardener's Phase 0 does — memory search/mutation, knowledge graph, diary, stats. Don't assume tool names.
The cursor. Find the previous collection state: a memory item or artifact tagged `collector-cursor` holding, per source, a `highWater` timestamp (pass it as `--since` when listing) plus maps of processed and skipped-trivial session ids → timestamps. The maps are the sole source of truth (EI's `processed_sessions` pattern); `highWater` is only the cheap pre-filter that keeps a noisy source from re-listing hundreds of already-judged sessions every run. No cursor means a first run: start with the most recent few sessions, not all of history, and backfill over subsequent runs.
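To make the cursor concrete, here is one possible layout and the membership check it implies. This is a sketch under assumptions: only `highWater` and the `collector-cursor` tag are named by this skill, while the `processed` and `skippedTrivial` field names, the session `id` field, and the example ids and timestamps are hypothetical.

```javascript
// Hypothetical cursor layout (assumption): one entry per source.
const exampleCursor = {
  claude_code: {
    highWater: "2026-01-10T08:00:00Z",                   // passed as --since when listing
    processed: { "sess-a": "2026-01-09T21:14:00Z" },     // session id -> timestamp
    skippedTrivial: { "sess-b": "2026-01-10T08:00:00Z" } // session id -> timestamp
  },
};

// The maps decide whether a session was already judged; highWater never does.
function isJudged(sourceCursor, sessionId) {
  if (!sourceCursor) return false; // no cursor for this source: first run
  return Object.hasOwn(sourceCursor.processed ?? {}, sessionId) ||
         Object.hasOwn(sourceCursor.skippedTrivial ?? {}, sessionId);
}

// Filter a reader listing down to sessions that still need a decision.
function unjudged(sourceCursor, listedSessions) {
  return listedSessions.filter((s) => !isJudged(sourceCursor, s.id));
}
```

Because the maps are authoritative, a listing that overlaps `highWater` is harmless: already-judged sessions are filtered out again here.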

Phase 1 — select

From each available source, list sessions not in the cursor, oldest first, and take sessions up to the budget. Apply the live-session guard (rule 4). The readers supply `title` (cwd-derived) and `messageCount` per session.
Prefer real conversations. Agent automation produces sessions too: a Pi store can hold a thousand mechanical runner-job sessions for every human one. Skip sessions that are tiny (fewer than ~4 messages) or whose opening message is plainly a machine-generated job prompt, and record them in the cursor as `skipped-trivial` so they are never re-listed. Spending the budget on noise is how a collector starves.
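The selection pass can be sketched as a single function over a reader listing. Several things here are assumptions rather than documented behavior: the session objects are taken to look like `{ id, messageCount, lastMessageAt }`, `lastMessageAt` stands in for the transcript's modification time, `looksMachineGenerated` is a hypothetical predicate for job-prompt detection, and the ~150-message budget is left out for brevity.

```javascript
const LIVE_WINDOW_MS = 30 * 60 * 1000; // rule 4: roughly 30 minutes
const MIN_MESSAGES = 4;                // fewer than this counts as trivial
const MAX_SESSIONS = 3;                // rule 3: default session budget

// `now` is a millisecond timestamp; `looksMachineGenerated` is a hypothetical predicate.
function selectSessions(sessions, now, looksMachineGenerated = () => false) {
  const isLive = (s) => now - Date.parse(s.lastMessageAt) < LIVE_WINDOW_MS;
  const skippedLive = sessions.filter(isLive);
  // Oldest first among settled sessions, without mutating the input array.
  const settled = sessions
    .filter((s) => !isLive(s))
    .sort((a, b) => Date.parse(a.lastMessageAt) - Date.parse(b.lastMessageAt));

  const selected = [];
  const skippedTrivial = [];
  for (const s of settled) {
    if (selected.length >= MAX_SESSIONS) break; // budget spent; the rest waits for the next run
    if (s.messageCount < MIN_MESSAGES || looksMachineGenerated(s)) skippedTrivial.push(s);
    else selected.push(s);
  }
  return { selected, skippedTrivial, skippedLive };
}
```

Stopping at the budget, rather than classifying every remaining session, keeps everything that was judged older than everything left over, so the next run resumes cleanly.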

Phase 2 — convert

`node readers/<tool>.mjs --session <id>` returns the session already reduced to a clean conversation: human text and assistant text only; thinking blocks, tool calls/results, system noise, and sub-agent chatter are stripped by the reader. From that output:
  • Build fully qualified message ids: `<tool>:<machine>:<session>:<reader msg id>` (e.g., `claudecode:mbp:0a1f…:42`). Quotes and provenance point at these.
  • Process the session in windows (~20–40 messages). For each window, the window itself is the "Most Recent Messages" and a compact tail of what came before is the "Earlier Conversation" — the shipped prompts are built around exactly this split and only ever analyze the recent window.
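The two steps above can be sketched as follows. This is an illustration under assumptions: the window size of 30 and tail of 6 are arbitrary picks (30 falls inside the stated ~20–40 range), and the "Earlier Conversation" here is just the last few raw messages, whereas a real run might compact or summarize them.

```javascript
// Build the fully qualified id: <tool>:<machine>:<session>:<reader msg id>
function qualifiedId(tool, machine, session, readerMsgId) {
  return [tool, machine, session, readerMsgId].join(":");
}

// Split a session into consecutive windows. `recent` is what the prompts analyze;
// `earlier` is context only and is never analyzed itself.
function windows(messages, size = 30, tail = 6) {
  const out = [];
  for (let start = 0; start < messages.length; start += size) {
    out.push({
      earlier: messages.slice(Math.max(0, start - tail), start),
      recent: messages.slice(start, start + size),
    });
  }
  return out;
}
```

Windows do not overlap in `recent`, so each message is analyzed exactly once even though it may reappear as context for the following window.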

Phase 3 — extract

Run the shipped pipelines over each window, with `technical_context: true` for coding-tool sessions (it makes Technical a priority category):
  1. Topics (`prompts/topics.md`): scan flags candidate topics → match checks each against existing memory (conservative: unsure ⇒ "new") → update writes the record under the right discipline (Event narratives; Technical accumulate, don't synthesize; everything else synthesize, don't accumulate). Quotes ride along.
  2. People (`prompts/people.md`): scan flags people (confidence 1–5, identifier capture, self/hypothetical guards) → match by identifiers first, then name → update under the person disciplines. For coding sessions most windows yield nobody; that's correct.
  3. Events (`prompts/events.md`): once per session, the campaign-recap test ("The Night We Debugged the CPU"). Empty is the norm.
  4. Facts (`prompts/facts.md`): only if you maintain a missing-facts list (kept beside the cursor). No list, no run.

Phase 4 — store

Write extractions into the discovered stores, mapping fields onto the store's schema (confidence/exposure-impact → importance-like fields; categories → tags/containers; drop fields the store can't hold rather than inventing homes). Tag everything `source:<tool>:<machine>:<session>` plus `collected:<ISO date>`.
Where the store distinguishes recent/unreviewed items, leave new items visibly new — the gardener's validate gate (its Phase 1) is the door these newcomers are supposed to walk through. If both a fast store and a structured knowledge store exist, put summaries where retrieval happens and structure (entities, links) where the graph lives.
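As a rough illustration of the tagging and field-dropping rules, the helpers below build the two provenance tags and trim an item to what a store accepts. Both function names are hypothetical, `supportedFields` stands for whatever Phase 0 discovered, and rendering the ISO date as a UTC `YYYY-MM-DD` string is an assumption about what "ISO date" means here.

```javascript
// Build the provenance tags: source:<tool>:<machine>:<session> and collected:<ISO date>.
function provenanceTags(tool, machine, session, when = new Date()) {
  const isoDate = when.toISOString().slice(0, 10); // YYYY-MM-DD in UTC (assumption)
  return [`source:${tool}:${machine}:${session}`, `collected:${isoDate}`];
}

// Keep only fields the store can hold; anything else is dropped, not rehomed.
function fitToStore(item, supportedFields) {
  return Object.fromEntries(
    Object.entries(item).filter(([key]) => supportedFields.includes(key))
  );
}
```

Dropping unsupported fields loses information on purpose: the source tag still points back at the session, so nothing is unrecoverable.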

Phase 5 — advance the cursor & report

Update the cursor only for sessions fully processed; a budget-truncated session stays uncursored and resumes next run. Also advance each source's `highWater` to the newest `lastMessageAt` among sessions you resolved (processed or skipped-trivial), but never at-or-past a skipped-live session's timestamp: live sessions must re-list once they settle. Then report:
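The `highWater` rule can be sketched as a small pure function. This is one reading of the rule, not the skill's own code: `advanceHighWater` is a hypothetical name, sessions are assumed to carry an ISO `lastMessageAt`, and resolved sessions at or past the earliest live timestamp are simply ignored when picking the new mark.

```javascript
// Return the new highWater for one source as an ISO string (or `previous` if nothing qualifies).
function advanceHighWater(previous, resolved, skippedLive) {
  const ms = (s) => Date.parse(s.lastMessageAt);
  // Earliest skipped-live timestamp is a strict ceiling; Infinity when nothing was live.
  const ceiling = skippedLive.length ? Math.min(...skippedLive.map(ms)) : Infinity;
  const eligible = resolved.map(ms).filter((t) => t < ceiling);
  const prev = previous ? Date.parse(previous) : -Infinity;
  const next = Math.max(prev, ...eligible); // never moves backward
  return Number.isFinite(next) ? new Date(next).toISOString() : previous;
}
```

Even if this pre-filter is set too low, the processed and skipped-trivial maps still prevent double processing; the cost is only a longer listing.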

Collection report — <ISO timestamp>

Sources: <tool: sessions found / processed / skipped-live>
Windows analyzed: N · budget used: <sessions>/<max>
Planted: topics N (new X, updated Y) · people N · events N · facts N · quotes N
Dropped: secrets-shaped values N · low-confidence extractions N
Cursor: advanced to <session id / timestamp> per source
Handoff: <n> new items awaiting the gardener's validate gate

Running periodically

Same hosting story as the gardener: any scheduler that can invoke an agent with this skill. A good rhythm — collector daily, gardener nightly after it — so each harvest is tended within a day. Both are budget-capped; worst case is a report that says "nothing new."

Provenance & credit

The extraction pipeline (scan → match → update), its conservative defaults, the three description disciplines, the quote bar-test, and the prompts in `prompts/` come from Flare576/ei by Jeremy Scherer (MIT, © 2026 Jeremy Scherer). EI runs this pipeline against five coding tools as its importer layer. This skill generalizes the storage side and pairs it with memory-gardener.