memory-collector

Memory collector

Your richest memory source is sitting untouched on disk: every coding-agent session ever run on this machine. Transcripts full of decisions, gotchas, open questions, people, and the occasional quotable outburst — none of it queryable, all of it rotting in JSONL.

This skill is the collector: a budgeted, cursor-tracked harvest that mines those transcripts and plants durable knowledge into whatever memory stores the current environment exposes. It is storage-agnostic (capabilities discovered at run time, like memory-gardener) and source-aware (it knows where coding tools keep their sessions). The extraction prompts ship in `prompts/`, carried from EI's extraction pipeline by Jeremy Scherer (MIT).

The composition: the collector plants; the gardener prunes. Collection deliberately tolerates near-duplicates and overgrowth — the gardener's validate gate, dedup curator, and bloat-split exist precisely to tend what collection produces. Don't make the collector perfect; make the pair converge.

Ground rules — the safety contract

1. Sources are read-only. Never modify, move, or delete a transcript file.
2. Stores are additive. The collector creates and updates memory items; it never deletes. Anything that looks delete-worthy is the gardener's job.
3. Budget every run. Default: 3 sessions (or ~150 messages) per run, oldest unprocessed first. Stop at the budget; the cursor makes the next run continue cleanly.
4. Skip live sessions. A transcript modified in the last ~30 minutes (or whose tool is plainly mid-session) gets skipped, because half-written sessions extract badly. It will be there next run.
5. Never store secrets. Coding transcripts contain tokens, connection strings, ARNs, and keys. If an extracted value is shaped like a credential, drop it. The shipped prompts already exclude these from quotes; apply the same bar to every field you store.
6. Conservative is the law. The shipped prompts are tuned so that empty results are the most common response. Honor that: noise is worse than gaps.
7. Provenance is mandatory. Every stored item carries its source id (see Phase 4). An item you can't trace back to a session is a rumor.
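
As an illustration of the "never store secrets" rule, here is a minimal Python sketch of a credential-shape check applied to every field before storage. The patterns, the function names (`looks_like_credential`, `scrub_item`), and the flat-dict item shape are assumptions made for this example; they are not the rules used by the shipped prompts, and a real deployment would likely need a broader pattern set.

```python
import re

# Illustrative patterns only (assumptions, not the shipped prompts' rules).
CREDENTIAL_PATTERNS = [
    re.compile(r"\bAKIA[0-9A-Z]{16}\b"),                      # AWS access key id shape
    re.compile(r"\barn:aws[a-z-]*:\S+"),                      # AWS ARN
    re.compile(r"\b[a-z][a-z0-9+.-]*://[^\s/:@]+:[^\s/@]+@"),  # URL carrying user:password@
    re.compile(r"-----BEGIN [A-Z ]*PRIVATE KEY-----"),        # PEM private key header
    re.compile(r"\bgh[pousr]_[A-Za-z0-9]{36,}\b"),            # GitHub token shape
]


def looks_like_credential(value: str) -> bool:
    """True if any illustrative credential pattern occurs in the value."""
    return any(p.search(value) for p in CREDENTIAL_PATTERNS)


def scrub_item(item: dict) -> tuple[dict, int]:
    """Return a copy of item without credential-shaped string fields, plus the drop count."""
    kept, dropped = {}, 0
    for key, value in item.items():
        if isinstance(value, str) and looks_like_credential(value):
            dropped += 1  # drop the field entirely rather than masking it
        else:
            kept[key] = value
    return kept, dropped
```

The drop count is returned so it can feed the "secrets-shaped values" line of the collection report.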

Phase 0 — survey

Transcript sources. The skill bundles dependency-free Node readers (`readers/`); run each with `node readers/<tool>.mjs --list --since <cursor high-water mark>` to discover what exists. Do not parse session stores by hand: the readers already encode the format traps (sidechain files, tool-result records masquerading as user messages, lossy cwd encodings).

| Tool | Reader | Where sessions live |
| --- | --- | --- |
| Claude Code | `readers/claude_code.mjs` | `~/.claude/projects/<encoded>/<uuid>.jsonl` |
| Pi / OMP | `readers/pi.mjs` | `~/.pi/agent/sessions/` (and `~/.omp/…`) |
| Codex | `readers/codex.mjs` | `~/.codex/state_<N>.sqlite` + rollout JSONL |
| OpenCode, Cursor | none yet; see `readers/README.md` to add one | local app data |

Memory stores. Discover capabilities from the tool surface exactly as the gardener's Phase 0 does — memory search/mutation, knowledge graph, diary, stats. Don't assume tool names.

The cursor. Find the previous collection state: a memory item or artifact tagged `collector-cursor`. For each source it holds a `highWater` timestamp (pass it as `--since` when listing) plus maps of processed and skipped-trivial session ids → timestamps. The maps are the sole source of truth (EI's `processed_sessions` pattern); `highWater` is the cheap pre-filter that keeps a noisy source from re-listing hundreds of already-judged sessions every run. No cursor means a first run: start with the most recent few sessions, not all of history, and backfill over subsequent runs.
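
One possible shape for the cursor, and for the filtering it drives, is sketched below in Python. The key names `processed` and `skippedTrivial`, the session field `id`, and both helper functions are hypothetical; only `highWater` and `lastMessageAt` are named in this document. The sketch also assumes timestamps are UTC ISO-8601 strings in a single format, so that string comparison matches time order.

```python
# Hypothetical cursor layout: one entry per source.
EXAMPLE_CURSOR = {
    "claude_code": {
        "highWater": "2026-01-03T09:00:00Z",
        "processed": {"s1": "2026-01-02T10:00:00Z"},
        "skippedTrivial": {"s2": "2026-01-03T09:00:00Z"},
    }
}


def unjudged_sessions(listed, cursor_for_source):
    """Keep only sessions the cursor maps have not already judged."""
    judged = set(cursor_for_source.get("processed", {})) | set(
        cursor_for_source.get("skippedTrivial", {})
    )
    return [s for s in listed if s["id"] not in judged]


def first_run_selection(listed, count=3):
    """No cursor yet: take the most recent few sessions, returned oldest first."""
    newest = sorted(listed, key=lambda s: s["lastMessageAt"], reverse=True)[:count]
    return sorted(newest, key=lambda s: s["lastMessageAt"])
```

The maps decide what counts as judged; `highWater` never appears in the filter because it is only the `--since` argument used when listing.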

Phase 1 — select

From each available source, list sessions not in the cursor, oldest first, and take sessions up to the budget. Apply the live-session guard (rule 4). The readers supply `title` (cwd-derived) and `messageCount` per session.

Prefer real conversations. Agent automation produces sessions too: a Pi store can hold a thousand mechanical runner-job sessions for every human one. Skip sessions that are tiny (fewer than ~4 messages) or whose opening message is plainly a machine-generated job prompt, and record them in the cursor as `skipped-trivial` so they are never re-listed. Spending the budget on noise is how a collector starves.
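
The selection rules above can be sketched in Python roughly as follows. This is one reading of the budget ("stop once 3 sessions are taken or about 150 messages are committed"), not a definitive implementation. It assumes `lastMessageAt` has been parsed into timezone-aware datetimes and stands in for the transcript's modification time, and that `looks_machine_generated` is a caller-supplied, hypothetical predicate for job-prompt detection.

```python
from datetime import datetime, timedelta, timezone

LIVE_WINDOW = timedelta(minutes=30)  # rule 4: skip recently modified sessions
MIN_MESSAGES = 4                     # fewer than this counts as trivial
MAX_SESSIONS = 3                     # rule 3 defaults
MAX_MESSAGES = 150


def select_sessions(candidates, now, looks_machine_generated=lambda s: False):
    """Split unjudged sessions into (selected, skipped_live, skipped_trivial)."""
    selected, skipped_live, skipped_trivial = [], [], []
    messages = 0
    for s in sorted(candidates, key=lambda s: s["lastMessageAt"]):  # oldest first
        if now - s["lastMessageAt"] < LIVE_WINDOW:
            skipped_live.append(s)      # not recorded in the cursor; retried next run
        elif s["messageCount"] < MIN_MESSAGES or looks_machine_generated(s):
            skipped_trivial.append(s)   # recorded in the cursor as skipped-trivial
        elif len(selected) < MAX_SESSIONS and messages < MAX_MESSAGES:
            selected.append(s)
            messages += s["messageCount"]
        # anything else is over budget and simply waits for a later run
    return selected, skipped_live, skipped_trivial
```

Classifying trivial sessions even after the budget is spent is a design choice in this sketch: it costs nothing beyond reading `messageCount` and keeps them from being re-listed.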

Phase 2 — convert

`node readers/<tool>.mjs --session <id>` returns the session already reduced to a clean conversation: human text and assistant text only. Thinking blocks, tool calls/results, system noise, and sub-agent chatter are stripped by the reader. From that output:

- Build fully qualified message ids: `<tool>:<machine>:<session>:<reader msg id>` (e.g., `claudecode:mbp:0a1f…:42`). Quotes and provenance point at these.
- Process the session in windows (~20–40 messages). For each window, the window itself is the "Most Recent Messages" and a compact tail of what came before is the "Earlier Conversation". The shipped prompts are built around exactly this split and only ever analyze the recent window.
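
A minimal Python sketch of the two steps above follows. The function names and the `size` and `tail` defaults are assumptions for illustration, and using the last few raw messages as the "compact tail" is a simplification; a summary of earlier windows would also fit the description.

```python
def qualify(tool, machine, session, reader_msg_id):
    """Build a fully qualified message id: <tool>:<machine>:<session>:<reader msg id>."""
    return f"{tool}:{machine}:{session}:{reader_msg_id}"


def windows(messages, size=30, tail=10):
    """Yield (earlier, recent) pairs covering the session in order.

    recent  -> the window to analyze ("Most Recent Messages")
    earlier -> up to `tail` messages immediately before it ("Earlier Conversation")
    """
    for start in range(0, len(messages), size):
        recent = messages[start:start + size]
        earlier = messages[max(0, start - tail):start]
        yield earlier, recent
```

For example, a 70-message session yields windows of 30, 30, and 10 messages; the first has no earlier context and each later one carries the 10 messages before it.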

Phase 3 — extract

Run the shipped pipelines over each window, with `technical_context: true` for coding-tool sessions (it makes Technical a priority category):

- Topics (`prompts/topics.md`): scan flags candidate topics → match checks each against existing memory (conservative: unsure ⇒ "new") → update writes the record under the right discipline (Event narratives; Technical accumulate, don't synthesize; everything else synthesize, don't accumulate). Quotes ride along.
- People (`prompts/people.md`): scan flags people (confidence 1–5, identifier capture, self/hypothetical guards) → match by identifiers first, then name → update under the person disciplines. For coding sessions most windows yield nobody; that's correct.
- Events (`prompts/events.md`): once per session, the campaign-recap test ("The Night We Debugged the CPU"). Empty is the norm.
- Facts (`prompts/facts.md`): only if you maintain a missing-facts list (kept beside the cursor). No list, no run.
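
In the shipped pipeline the match step is performed by the prompts. Purely to illustrate the ordering "identifiers first, then name" and the conservative default, here is a deterministic Python stand-in. The record fields (`name`, `identifiers`) and the rule that an ambiguous name match counts as "new" are assumptions of this sketch, not behavior confirmed by the prompts.

```python
def match_person(candidate, known_people):
    """Return the matching known person, or None to signal "treat as new"."""
    ids = set(candidate.get("identifiers", []))
    # 1. Any shared identifier (email, handle, ...) wins outright.
    for person in known_people:
        if ids & set(person.get("identifiers", [])):
            return person
    # 2. Otherwise fall back to a case-insensitive name comparison.
    name = candidate.get("name", "").strip().lower()
    if not name:
        return None
    by_name = [p for p in known_people if p.get("name", "").strip().lower() == name]
    # Conservative: zero or several name matches means unsure, so report "new".
    return by_name[0] if len(by_name) == 1 else None
```

Returning `None` on ambiguity leans on the division of labor described earlier: a near-duplicate person record is something the gardener's dedup pass can merge later, while a wrong merge is harder to undo.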

Phase 4 — store

Write extractions into the discovered stores, mapping fields onto the store's schema (confidence/exposure-impact → importance-like fields; categories → tags/containers; drop fields the store can't hold rather than inventing homes). Tag everything `source:<tool>:<machine>:<session>` plus `collected:<ISO date>`.

Where the store distinguishes recent/unreviewed items, leave new items visibly new: the gardener's validate gate (its Phase 1) is the door these newcomers are supposed to walk through. If both a fast store and a structured knowledge store exist, put summaries where retrieval happens and structure (entities, links) where the graph lives.
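
The field mapping and tagging described in this phase might look like the Python sketch below. The extraction field names, the target store fields (`content`, `importance`), and the `field_map` mechanism are all hypothetical; a real store's schema is discovered at run time.

```python
from datetime import date


def provenance_tags(tool, machine, session, collected_on):
    """Build the two mandatory tags: source:<tool>:<machine>:<session> and collected:<ISO date>."""
    return [
        f"source:{tool}:{machine}:{session}",
        f"collected:{collected_on.isoformat()}",
    ]


def map_to_store(extraction, field_map, tags):
    """Rename fields the store can hold, drop the rest, and attach tags.

    field_map maps extraction field names to store field names; any
    extraction field missing from it is dropped rather than given an
    invented home. The category, if present, becomes a tag.
    """
    item = {field_map[k]: v for k, v in extraction.items() if k in field_map}
    category = extraction.get("category")
    item["tags"] = ([category] if category else []) + list(tags)
    return item
```

With `field_map = {"description": "content", "confidence": "importance"}`, an extraction's `description` and `confidence` are kept under the store's names and every other field is discarded.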

Phase 5 — advance the cursor & report

Update the cursor only for sessions fully processed; a budget-truncated session stays uncursored and resumes next run. Also advance each source's `highWater` to the newest `lastMessageAt` among sessions you resolved (processed or skipped-trivial), but never at-or-past a skipped-live session's timestamp, because live sessions must re-list once they settle. Then report:

Collection report — <ISO timestamp>

Sources: <tool: sessions found / processed / skipped-live>
Windows analyzed: N · budget used: <sessions>/<max>
Planted: topics N (new X, updated Y) · people N · events N · facts N · quotes N
Dropped: secrets-shaped values N · low-confidence extractions N
Cursor: advanced to <session id / timestamp> per source
Handoff: <n> new items awaiting the gardener's validate gate
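
The `highWater` rule from this phase can be sketched in Python as shown below. The function name and argument shapes are hypothetical, and the sketch assumes every timestamp is a UTC ISO-8601 string in one consistent format, so that plain string comparison follows time order. It reads "never at-or-past a skipped-live session's timestamp" as "only consider resolved timestamps strictly earlier than the earliest skipped-live one".

```python
def advance_high_water(current, resolved, skipped_live):
    """Return the new highWater for one source.

    current      -> existing highWater, or None on a first run
    resolved     -> lastMessageAt of sessions processed or skipped-trivial this run
    skipped_live -> lastMessageAt of sessions skipped because they were live
    """
    ceiling = min(skipped_live) if skipped_live else None
    eligible = [t for t in resolved if ceiling is None or t < ceiling]
    if current is not None:
        eligible.append(current)  # never move the mark backwards
    return max(eligible) if eligible else None
```

Holding the mark below the earliest live session means that session still appears in the next `--since` listing once it has settled.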

Running periodically

Same hosting story as the gardener: any scheduler that can invoke an agent with this skill. A good rhythm — collector daily, gardener nightly after it — so each harvest is tended within a day. Both are budget-capped; worst case is a report that says "nothing new."

Provenance & credit

The extraction pipeline (scan → match → update), its conservative defaults, the three description disciplines, the quote bar-test, and the prompts in `prompts/` come from Flare576/ei by Jeremy Scherer (MIT, © 2026 Jeremy Scherer). EI runs this pipeline against five coding tools as its importer layer. This skill generalizes the storage side and pairs it with memory-gardener.
