dstl8

Dstl8 — AI-Native Observability Skill

Dstl8 distills logs across dev, staging, and production into root cause analysis, impact assessment, and fix recommendations. All environments are queryable through the Dstl8 MCP server using the same tools.

Setup gate

Before running any workflow below, verify Dstl8 is set up:
  1. CLI installed and authenticated (`dstl8 profiles` shows an active profile)
  2. At least one source connected and ingesting (`dstl8 sources` lists it)
  3. MCP server installed and the AI client restarted (`dstl8 install status`)
If any of these are missing, read `setup.md` from this skill directory and complete setup first. Do not attempt setup from memory.
If Dstl8 tools aren't visible even after setup is reportedly complete:
"I don't see a Dstl8 MCP server connected. Check `dstl8 install status`, restart your AI client, or re-run setup. See `setup.md`."
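The three gate checks above can be run as a small script. This is a minimal sketch, not part of Dstl8: it assumes each command exits with status 0 when its check passes, which the text above does not state, so the command output still needs to be read to confirm an active profile or an ingesting source. The names `SETUP_CHECKS`, `run_command`, and `failed_setup_checks` are hypothetical.

```python
import subprocess
from typing import Callable, List, Tuple

# The three checks named in the setup gate, in order.
SETUP_CHECKS: List[Tuple[str, List[str]]] = [
    ("CLI installed and authenticated", ["dstl8", "profiles"]),
    ("At least one source connected", ["dstl8", "sources"]),
    ("MCP server installed", ["dstl8", "install", "status"]),
]


def run_command(argv: List[str]) -> bool:
    """Run a command; True if it exits with status 0 (assumed to mean 'ok')."""
    try:
        return subprocess.run(argv, capture_output=True, text=True).returncode == 0
    except FileNotFoundError:
        # The binary is not on PATH, so the CLI is not installed.
        return False


def failed_setup_checks(runner: Callable[[List[str]], bool] = run_command) -> List[str]:
    """Return the labels of the checks that did not pass, in gate order."""
    return [label for label, argv in SETUP_CHECKS if not runner(argv)]
```

If `failed_setup_checks()` returns a non-empty list, the next step is `setup.md`, not an improvised fix. The `runner` parameter exists only so the logic can be exercised without the CLI installed.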

Tool surface preference

This skill exposes Dstl8 functionality through two surfaces. Default correctly between them; the wrong choice wastes turns and produces worse answers.
MCP tools (`query_log_samples`, `list_incidents`, `query_patterns`, `get_sentiment_heatmap`, `query_insights_params`, `search_nodes`, etc.) are the right surface for investigation, queries, incident triage, and any run-time use of the data. These are the high-leverage tools the user installed Dstl8 to get. Default here for any question shaped like "show me X", "what happened with Y", "why is Z broken", "investigate W", "did my deploy fix it", "what's going on in prod".
CLI via bash (`dstl8 profiles`, `dstl8 sources`, `dstl8 install`, `dstl8 logs fetch`, etc.) is for setup, configuration, source management, and installation: rare, admin-flavored actions.
If a user explicitly asks for the CLI ("run dstl8 sources" / "use the CLI to..."), use bash. Otherwise, when both surfaces could serve the question, prefer MCP. `dstl8 logs fetch` via bash is a fallback for when MCP is unavailable, not a default.
When MCP isn't loaded, prefer asking the user to restart over substituting via CLI. If the user asks an investigation question and MCP tools aren't available in the session (e.g., they just signed up and Claude Code hasn't been restarted yet), tell them directly: "MCP tools aren't loaded in this session. Restart Claude Code and ask again." Don't paper over it with parallel `dstl8 logs fetch` calls. That produces a degraded answer and burns turns. CLI fallback is fine for setup verification (e.g., `dstl8 logs fetch -n 5` to confirm ingestion), but not for investigation flows.
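The preference rules above reduce to a short decision function. The sketch below is illustrative only; `choose_surface` and its three boolean inputs are hypothetical names, and deciding whether a request counts as an admin task is left to the caller.

```python
def choose_surface(explicit_cli: bool, mcp_loaded: bool, admin_task: bool) -> str:
    """Return 'cli', 'mcp', or 'ask_restart' following the surface preference rules."""
    if explicit_cli:
        # The user asked for the CLI by name ("run dstl8 sources").
        return "cli"
    if admin_task:
        # Setup, configuration, source management, installation.
        return "cli"
    if mcp_loaded:
        # Investigation, queries, triage: the default surface.
        return "mcp"
    # Investigation question without MCP: ask for a restart rather than
    # substituting parallel `dstl8 logs fetch` calls.
    return "ask_restart"
```

The last branch is the one that is easy to get wrong: a missing MCP server is reported to the user, not worked around.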

Starting moves

Most workflows start with one of these:
| Start with | When |
| --- | --- |
| `query_insights_params` | You need to discover available environments, services, or time ranges. Good default first call. |
| `list_incidents` | "What's going on?": get active incidents |
| `get_sentiment_heatmap` | Quick health pulse across services |
| `query_log_samples` + severity filter | "Why is X broken?": find specific errors |

Entry patterns

"What's going on?" — Situational awareness

`query_insights_params` → `list_incidents` (active, filtered by environment if specified) → `get_sentiment_heatmap` by service. Present active incidents and health across environments. When the user names a specific environment, pass it as a filter; don't return all incidents and leave the user to sort through them.

"Why is X broken?" — Targeted investigation

`query_log_samples` (service + keyword + error) → `query_patterns` (recurring?) → `list_incidents` (already tracked?). Then check across environments: does the same pattern appear elsewhere? The same error in local, staging, and prod is systematic; an error only in prod is environment-specific. Present: root cause → impact → fix.
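The cross-environment rule above (everywhere means systematic, prod only means environment-specific) can be written as a small classifier. This is a sketch under assumptions: the function name and labels are hypothetical, "only in prod" is generalized to "only in one environment", and the in-between case is not defined by the text, so it gets its own placeholder label.

```python
from typing import Iterable


def classify_scope(envs_with_error: Iterable[str], envs_checked: Iterable[str]) -> str:
    """Label how widely an error pattern is spread across the environments checked."""
    seen = set(envs_with_error)
    checked = set(envs_checked)
    if not seen:
        return "not observed"
    if len(checked) > 1 and seen >= checked:
        # Present in every environment that was checked.
        return "systematic"
    if len(seen) == 1:
        # Confined to a single environment, e.g. only prod.
        return "environment-specific"
    # Assumption: the source does not name the "some but not all" case.
    return "partial"
```

For example, `classify_scope(["prod"], ["local", "staging", "prod"])` yields `"environment-specific"`, which points the investigation at configuration or data that differs in prod.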

"Check staging" / "Check production" / "Check <env>" — Environment-specific

`query_insights_params` → `query_log_samples` for that environment → `query_patterns` → `get_anomalies`. Compare against the production baseline. A new pattern in staging that is not in prod should be flagged before promoting. A pattern in a dev environment that matches a known prod incident is a good signal: the developer is reproducing it. Present with a "safe to promote" or "flag before promoting" verdict.
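The promotion verdict above is a set difference between the patterns seen in staging and the production baseline. The sketch below assumes patterns can be compared by some stable identifier (a string here); how `query_patterns` actually identifies a pattern is not stated, and `promotion_verdict` is a hypothetical helper.

```python
from typing import Iterable, List, Tuple


def promotion_verdict(staging_patterns: Iterable[str],
                      prod_patterns: Iterable[str]) -> Tuple[str, List[str]]:
    """Return the verdict plus the patterns that are new in staging."""
    # Patterns present in staging but absent from the production baseline.
    new_in_staging = sorted(set(staging_patterns) - set(prod_patterns))
    verdict = "flag before promoting" if new_in_staging else "safe to promote"
    return verdict, new_in_staging
```

Returning the new patterns alongside the verdict keeps the answer concrete: the user sees which patterns triggered the flag.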

"Did my deploy fix it?" — Verification

`get_current_time` to anchor windows → `query_severity_data` before vs. after → `query_sentiment_data` over the same windows → `get_anomalies`. If the deploy went to staging, compare staging post-deploy against production: are they converging? Give a clear verdict.
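Anchoring the before and after windows is simple date arithmetic once the current time and the deploy time are known. The sketch below is one way to do it, assuming equal-length windows on either side of the deploy; the helper name and the 24-hour cap are illustrative choices, not Dstl8 behavior.

```python
from datetime import datetime, timedelta, timezone
from typing import Tuple

Window = Tuple[datetime, datetime]


def before_after_windows(deploy_time: datetime, now: datetime,
                         max_span: timedelta = timedelta(hours=24)) -> Tuple[Window, Window]:
    """Return equal-length (start, end) windows before and after the deploy."""
    if deploy_time >= now:
        raise ValueError("deploy_time must be earlier than now")
    # Use all post-deploy time available, capped so the windows stay comparable.
    span = min(now - deploy_time, max_span)
    before = (deploy_time - span, deploy_time)
    after = (deploy_time, deploy_time + span)
    return before, after


# Example: a deploy two hours ago gives two-hour windows on each side.
deploy = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
now = datetime(2024, 1, 1, 14, 0, tzinfo=timezone.utc)
before, after = before_after_windows(deploy, now)
```

Passing the same two windows to both the severity and sentiment queries keeps the comparison like for like.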

"I'm about to make changes" — Pre-coding context (Loop 1)

`search_nodes` for the service → `list_incidents` across all environments → `query_patterns` for recent issues. Surface what the developer should know before writing code.

Defensive patterns

  • Several tools require `group_by`. `query_patterns`, `query_summary`, `query_severity_data`, `query_sentiment_data`, and `get_sentiment_heatmap` all need a `group_by` parameter (typically `service` or `environment`). They'll fail without it.
  • CRITICAL: `list_incident_events` MUST include a state or time range filter. Unfiltered calls return 10-15k tokens and blow up context. NEVER call without passing `state` (e.g. `state: "open"`) or `start` / `end` timestamps. If the filtered response is still large (>5k tokens), use a narrower time window or pipe the response through a local script to extract what you need rather than re-fetching.
  • Discover, don't guess. Call `query_insights_params` when unsure about environment or service names.
  • CLI time flag is `--start`, not `--since`. `dstl8 logs fetch` and `dstl8 logs tail` accept `--start <duration>` (e.g., `--start 1h`, `--start 24h`, `--start 7d`) and `--end <duration>`. Don't use `--since`, `--from`, or other common variants; they don't exist on this CLI and will error.
  • Respect environment scope. When the user specifies an environment ("in brewhaus", "check staging"), filter queries to that environment. Cross-environment data is supplementary context, not the main answer. When no environment is specified, infer from git branch, repo name, or conversation context. Only ask if you can't determine it.
  • Always think cross-environment. When investigating one environment, check if the same pattern exists in others. But respect the user's scope: if they ask about a specific environment, lead with that environment's data and present cross-environment findings as secondary context, not the primary answer.
  • Persist findings. After triage reaches a root cause, write it to the knowledge graph. This feeds future sessions.
  • Verify after fixing. Proactively offer a before/after comparison post-deploy.
  • Check before creating. Search for existing incidents/entities before creating; Möbius may have already created them. Ask the user before creating incidents.
  • Convert timestamps. Always present human-readable times, not raw Unix timestamps.
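Two of the rules above are mechanical enough to check before a call goes out: the `group_by` requirement and the `list_incident_events` filter. The sketch below is a hypothetical pre-flight guard, not part of the Dstl8 MCP server. It treats a time range as both `start` and `end` being present, which is an interpretation of the rule, and it adds a timestamp formatter that assumes Unix seconds and renders UTC.

```python
from datetime import datetime, timezone
from typing import Any, Dict

# Tools the text above lists as requiring a group_by parameter.
GROUP_BY_TOOLS = {
    "query_patterns",
    "query_summary",
    "query_severity_data",
    "query_sentiment_data",
    "get_sentiment_heatmap",
}


def check_call(tool: str, args: Dict[str, Any]) -> None:
    """Raise ValueError if a planned tool call breaks one of the defensive rules."""
    if tool in GROUP_BY_TOOLS and not args.get("group_by"):
        raise ValueError(f"{tool} needs group_by (typically 'service' or 'environment')")
    if tool == "list_incident_events":
        has_state = args.get("state") is not None
        # Assumption: a "time range" means both start and end are supplied.
        has_range = args.get("start") is not None and args.get("end") is not None
        if not (has_state or has_range):
            raise ValueError("list_incident_events needs a state or start/end filter")


def human_time(unix_seconds: float) -> str:
    """Format a Unix timestamp (assumed to be in seconds) as readable UTC."""
    moment = datetime.fromtimestamp(unix_seconds, tz=timezone.utc)
    return moment.strftime("%Y-%m-%d %H:%M:%S UTC")
```

A guard like this catches the expensive mistake (an unfiltered `list_incident_events` call) before it costs context, instead of after.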

Incident status mapping

| Code | Label |
| --- | --- |
| 0 | Open |
| 1 | Investigating |
| 2 | Active |
| 3 | Resolved |
| 4 | Closed |
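When incident data carries the numeric code, the mapping above can be held in an enum so that labels are never typed by hand. This is a minimal sketch; the class and function names are hypothetical, and which response field holds the code is not stated here.

```python
from enum import IntEnum


class IncidentStatus(IntEnum):
    """Numeric incident status codes and their labels, per the table above."""
    OPEN = 0
    INVESTIGATING = 1
    ACTIVE = 2
    RESOLVED = 3
    CLOSED = 4


def status_label(code: int) -> str:
    """Return the human-readable label for a status code."""
    try:
        return IncidentStatus(code).name.capitalize()
    except ValueError:
        # A code outside the table is reported as-is rather than guessed.
        return f"Unknown ({code})"
```

For instance, `status_label(3)` gives `"Resolved"`, and an unexpected code is surfaced instead of being silently mapped.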

Output conventions

Present investigation results as: Summary (one sentence) → Root cause → Impact (quantified) → Recommended fix (concrete) → Confidence level.
Default to roughly 250 words. Expand to a longer post-mortem format only when the user explicitly asks for one ("write up a full post-mortem," "give me the long version"). For routine investigation queries, brevity beats thoroughness — the user is iterating, not archiving.
For post-mortems add: timeline table, action items with owner/priority.
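The result structure and the word budget above can be kept honest with a small template. The sketch below is illustrative only: `Finding`, `render`, and `over_budget` are hypothetical names, and counting whitespace-separated words is just one reasonable reading of "roughly 250 words".

```python
from dataclasses import dataclass


@dataclass
class Finding:
    """One investigation result, in the order the output convention prescribes."""
    summary: str
    root_cause: str
    impact: str
    fix: str
    confidence: str


def render(finding: Finding) -> str:
    """Lay out the finding as labeled lines in the prescribed order."""
    sections = [
        ("Summary", finding.summary),
        ("Root cause", finding.root_cause),
        ("Impact", finding.impact),
        ("Recommended fix", finding.fix),
        ("Confidence", finding.confidence),
    ]
    return "\n".join(f"{label}: {text}" for label, text in sections)


def over_budget(text: str, budget: int = 250) -> bool:
    """True if the text exceeds the default word budget for routine answers."""
    return len(text.split()) > budget
```

An answer that trips `over_budget` is a prompt to cut, unless the user asked for the long version.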

Feedback loops

Three loops drive compounding value:
  • Loop 1 (Intent): Before coding — surface past incidents and patterns via knowledge graph. Only works if Loop 3 persisted findings.
  • Loop 2 (Iteration): During dev — validate in dev environments and staging before promoting to production. Cross-environment comparison catches regressions.
  • Loop 3 (Production intelligence): After deploy — triage → fix → verify → persist to graph for Loop 1.