Analyzing a Single MLflow Trace
Trace Structure
A trace captures the full execution of an AI/ML application as a tree of spans. Each span represents one operation (LLM call, tool invocation, retrieval step, etc.) and records its inputs, outputs, timing, and status. Traces also carry assessments — feedback from humans or LLM judges about quality.
It is recommended to read references/trace-structure.md before analyzing a trace — it covers the complete data model, all fields and types, analysis guidance, and OpenTelemetry compatibility notes.
Handling CLI Output
Traces can be 100KB+ for complex agent executions. Always redirect the output of `mlflow traces get` to a file — do not pipe directly to `jq`, `head`, or other commands, as piping can silently produce no output.
```bash
# Fetch full trace to a file (traces get always outputs JSON, no --output flag needed)
mlflow traces get --trace-id <ID> > /tmp/trace.json

# Then process the file
jq '.info.state' /tmp/trace.json
jq '.data.spans | length' /tmp/trace.json
```
**Prefer fetching the full trace and parsing the JSON directly** rather than using `--extract-fields`. The `--extract-fields` flag has limited support for nested span data (e.g., span inputs/outputs may return empty objects). Fetch the complete trace once and parse it as needed.

JSON Structure
The trace JSON has two top-level keys: `info` (metadata, assessments) and `data` (spans).

{
  "info": { "trace_id", "state", "request_time", "assessments", ... },
  "data": { "spans": [ { "span_id", "name", "status", "attributes", ... } ] }
}

Key paths (verified against actual CLI output):
| What | jq path |
|---|---|
| Trace state | `.info.state` |
| All spans | `.data.spans` |
| Root span | `.data.spans[0]` (typically the first span) |
| Span status code | `.data.spans[].status.code` |
| Span status message | `.data.spans[].status.message` |
| Span inputs | `.data.spans[].attributes."mlflow.spanInputs"` |
| Span outputs | `.data.spans[].attributes."mlflow.spanOutputs"` |
| Assessments | `.info.assessments` |
| Assessment name | `.info.assessments[].assessment_name` |
| Feedback value | `.info.assessments[].feedback.value` |
| Feedback error | `.info.assessments[].feedback.error` |
| Assessment rationale | `.info.assessments[].rationale` |
Important: Span inputs and outputs are stored as serialized JSON strings inside `attributes`, not as top-level span fields. Traces from third-party OpenTelemetry clients may use different attribute names (e.g., GenAI Semantic Conventions, OpenInference, or custom keys) — check the raw `attributes` dict to find the equivalent fields.

If paths don't match (structure may vary by MLflow version), discover them:
```bash
# Top-level keys
jq 'keys' /tmp/trace.json

# Span keys
jq '.data.spans[0] | keys' /tmp/trace.json

# Status structure
jq '.data.spans[0].status' /tmp/trace.json
```
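To make the serialized-inputs note above concrete, here is a minimal sketch of unwrapping a span's inputs with jq's `fromjson`. It assumes MLflow's default attribute keys (`mlflow.spanInputs` / `mlflow.spanOutputs`); the tiny inline trace and the span name `generate_response` are hypothetical stand-ins for a real `mlflow traces get` download.

```bash
# Hypothetical minimal trace standing in for a real `mlflow traces get` download
cat > /tmp/trace.json <<'EOF'
{"data":{"spans":[{"name":"generate_response",
  "attributes":{"mlflow.spanInputs":"{\"query\": \"refund policy\"}"}}]}}
EOF

# Select a span by name, then parse its serialized inputs with fromjson
jq '.data.spans[]
    | select(.name == "generate_response")
    | .attributes."mlflow.spanInputs"
    | fromjson' /tmp/trace.json
```

If `fromjson` fails with a parse error, the attribute is likely not a serialized string in your trace; inspect the raw `attributes` dict first.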
Quick Health Check
After fetching a trace to a file, run this to get a summary:
```bash
jq '{
  state: .info.state,
  span_count: (.data.spans | length),
  error_spans: [.data.spans[] | select(.status.code == "STATUS_CODE_ERROR") | .name],
  assessment_errors: [.info.assessments[] | select(.feedback.error) | .assessment_name]
}' /tmp/trace.json
```

Analysis Insights
- `state: OK` does not mean correct output. It only means no unhandled exception. Check assessments for quality signals, and if none exist, analyze the trace's inputs, outputs, and intermediate span data directly for issues.
- Always consult the `rationale` when interpreting assessment values. The `value` alone can be misleading — for example, a `user_frustration` assessment with `value: "no"` could mean "no frustration detected" or "the frustration check did not pass" (i.e., frustration is present), depending on how the scorer was configured. The `rationale` field (a top-level assessment field, not nested under `.feedback`) explains what the value means in context and often describes the issue in plain language before you need to examine any spans.
- **Assessments tell you what went wrong; spans tell you where.** If assessments exist, use feedback/expectations to form a hypothesis, then confirm it in the span tree. If no assessments exist, examine span inputs/outputs to identify where the execution diverged from expected behavior.
- Assessment errors are not trace errors. If an assessment has an `error` field, it means the scorer or judge that evaluated the trace failed — not that the trace itself has a problem. The trace may be perfectly fine; the assessment's `value` is just unreliable. This can happen when a scorer crashes (e.g., timed out, returned unparseable output) or when a scorer was applied to a trace type it wasn't designed for (e.g., a retrieval relevance scorer applied to a trace with no retrieval steps). The latter is a scorer configuration issue, not a trace issue.
- Span timing reveals performance issues. Gaps between parent and child spans indicate overhead; repeated span names suggest retries; compare individual span durations to find bottlenecks.
- Token usage explains latency and cost. Look for token usage in trace metadata (e.g., `mlflow.trace.tokenUsage`) or span attributes (e.g., `mlflow.chat.tokenUsage`). Not all clients set these — check the raw `attributes` dict for equivalent fields. Spikes in input tokens may indicate prompt injection or overly large context.
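The timing bullet above can be sketched with a jq one-liner that ranks spans by duration. The timestamp field names here (`start_time_unix_nano` / `end_time_unix_nano`) are an assumption based on OpenTelemetry conventions and the inline trace is a hypothetical fixture; confirm the real field names with the discovery commands shown earlier.

```bash
# Hypothetical fixture with OTel-style nanosecond timestamps; confirm the
# real field names first with: jq '.data.spans[0] | keys' /tmp/trace.json
cat > /tmp/trace.json <<'EOF'
{"data":{"spans":[
  {"name":"customer_support_agent","start_time_unix_nano":0,"end_time_unix_nano":3000000000},
  {"name":"generate_response","start_time_unix_nano":1000000000,"end_time_unix_nano":2500000000}
]}}
EOF

# Per-span duration in milliseconds, slowest first
jq '[.data.spans[]
     | {name, duration_ms: ((.end_time_unix_nano - .start_time_unix_nano) / 1e6)}]
    | sort_by(-.duration_ms)' /tmp/trace.json
```

A span near the top that is not an LLM call is often the real bottleneck worth investigating.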
Codebase Correlation
MLflow Tracing captures inputs, outputs, and metadata from different parts of an application's call stack. By correlating trace contents with the source code, issues can be root-caused more precisely than from the trace alone.
- Span names map to functions. Span names typically match the function decorated with `@mlflow.trace` or wrapped in `mlflow.start_span()`. For autologged spans (LangChain, OpenAI, etc.), names follow framework conventions instead (e.g., `ChatOpenAI`, `RetrievalQA`).
- The span tree mirrors the call stack. If span A is the parent of span B, then function A called function B.
- Span inputs/outputs correspond to function parameters/return values. Comparing them against the code logic reveals whether the function behaved as designed or produced an unexpected result.
- **The trace shows what happened; the code shows why.** A retriever returning irrelevant results might trace back to a faulty similarity threshold. Incorrect span inputs might reveal wrong model parameters or environment variables left unset in the code.
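The call-stack bullet above can be checked mechanically by printing "caller -> callee" edges from the span tree. The `span_id` / `parent_span_id` field names are OTel-style assumptions (some versions use e.g. `parent_id`), and the inline trace is a hypothetical fixture; verify field names with the discovery commands before relying on this.

```bash
# Hypothetical fixture; real traces may name the parent field differently —
# check with: jq '.data.spans[0] | keys' /tmp/trace.json
cat > /tmp/trace.json <<'EOF'
{"data":{"spans":[
  {"name":"customer_support_agent","span_id":"a1","parent_span_id":null},
  {"name":"search_knowledge_base","span_id":"b2","parent_span_id":"a1"}
]}}
EOF

# Print "caller -> callee" edges: parent span name (or ROOT) -> span name
jq -r '.data.spans as $s
       | $s[]
       | .parent_span_id as $p
       | ((first($s[] | select(.span_id == $p)) | .name) // "ROOT") + " -> " + .name' \
   /tmp/trace.json
```

Each edge should correspond to a function call you can find in the source; an edge with no matching call site usually means autologging instrumented a framework layer rather than your code.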
Example: Investigating a Wrong Answer
A user reports that their customer support agent gave an incorrect answer for the query "What is our refund policy?" There are no assessments on the trace.
1. Fetch the trace and check high-level signals.
The trace has `state: OK` — no crash occurred. No assessments are present, so examine the trace's inputs and outputs directly. The `response_preview` says "Our shipping policy states that orders are delivered within 3-5 business days..." — this answers a different question than what was asked.

2. Examine spans to locate the problem.
The span tree shows:
customer_support_agent (AGENT) — OK
├── plan_action (LLM) — OK
│ outputs: {"tool_call": "search_knowledge_base", "args": {"query": "refund policy"}}
├── search_knowledge_base (TOOL) — OK
│ inputs: {"query": "refund policy"}
│ outputs: [{"doc": "Shipping takes 3-5 business days...", "score": 0.82}]
├── generate_response (LLM) — OK
│ inputs: {"messages": [..., {"role": "user", "content": "Context: Shipping takes 3-5 business days..."}]}
│     outputs: {"content": "Our shipping policy states..."}

The agent correctly decided to search for "refund policy," but the tool returned a shipping document. The LLM then faithfully answered using the wrong context. The problem is in the `search_knowledge_base` tool's retrieval, not the agent's reasoning or the LLM's generation.

3. Correlate with the codebase.
The `search_knowledge_base` span maps to a function in the application code. Investigating reveals the vector index was built from only the shipping FAQ — the refund policy documents were never indexed.

4. Recommendations.
- Re-index the knowledge base to include refund policy documents.
- Add a retrieval relevance scorer to detect when retrieved context doesn't match the query topic.
- Consider adding expectation assessments with correct answers for common queries to enable regression testing.