langfuse-observability

Langfuse Observability

Overview

Langfuse is the open-source LLM observability platform that OrchestKit uses for tracing, monitoring, evaluation, and prompt management. Unlike LangSmith (now deprecated in OrchestKit), Langfuse is self-hosted, free, and designed for production LLM applications.
When to use this skill:
  • Setting up LLM observability from scratch
  • Debugging slow or incorrect LLM responses
  • Tracking token usage and costs
  • Managing prompts in production
  • Evaluating LLM output quality
  • Migrating from LangSmith to Langfuse
OrchestKit Integration:
  • Status: Migrated from LangSmith (Dec 2025)
  • Location:
    backend/app/shared/services/langfuse/
  • MCP Server:
    orchestkit-langfuse
    (optional)

Quick Start

Setup

backend/app/shared/services/langfuse/client.py

python
from langfuse import Langfuse

from app.core.config import settings

langfuse_client = Langfuse(
    public_key=settings.LANGFUSE_PUBLIC_KEY,
    secret_key=settings.LANGFUSE_SECRET_KEY,
    host=settings.LANGFUSE_HOST  # Self-hosted or cloud
)

Basic Tracing with @observe


python
from langfuse.decorators import observe, langfuse_context

@observe()  # Automatic tracing
async def analyze_content(content: str):
    langfuse_context.update_current_observation(
        metadata={"content_length": len(content)}
    )
    return await llm.generate(content)

Session & User Tracking

python
langfuse_client.trace(
    name="analysis",
    user_id="user_123",
    session_id="session_abc",
    metadata={"content_type": "article", "agent_count": 8},
    tags=["production", "orchestkit"]
)

Core Features Summary

  • Distributed Tracing: track LLM calls with parent-child spans (references/tracing-setup.md)
  • Cost Tracking: automatic token & cost calculation (references/cost-tracking.md)
  • Prompt Management: version control for prompts (references/prompt-management.md)
  • LLM Evaluation: custom scoring with G-Eval (references/evaluation-scores.md)
  • Session Tracking: group related traces (references/session-tracking.md)
  • Experiments API: A/B testing & benchmarks (references/experiments-api.md)
  • Multi-Judge Eval: ensemble LLM evaluation (references/multi-judge-evaluation.md)

References

Tracing Setup

See:
references/tracing-setup.md
Key topics covered:
  • Initializing Langfuse client with @observe decorator
  • Creating nested traces and spans
  • Tracking LLM generations with metadata
  • LangChain/LangGraph CallbackHandler integration
  • Workflow integration patterns
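
The parent-child span model that the reference describes can be sketched without the SDK. The `Span` class below is a hypothetical stand-in for a Langfuse observation, for illustration only; in practice the @observe decorator builds this tree for you:

```python
from __future__ import annotations

import time
from dataclasses import dataclass, field


@dataclass
class Span:
    """Hypothetical stand-in for a Langfuse observation (illustration only)."""
    name: str
    parent: Span | None = None
    children: list[Span] = field(default_factory=list)
    started_at: float = field(default_factory=time.monotonic)
    ended_at: float | None = None

    def child(self, name: str) -> Span:
        # Nested spans record parent-child structure, as in a Langfuse trace.
        span = Span(name, parent=self)
        self.children.append(span)
        return span

    def finish(self) -> None:
        self.ended_at = time.monotonic()


# A trace is the root span; each LLM call or tool step hangs off it.
trace = Span("analysis")
retrieval = trace.child("retrieve_context")
retrieval.finish()
generation = trace.child("llm_generation")
generation.finish()
trace.finish()

assert [c.name for c in trace.children] == ["retrieve_context", "llm_generation"]
assert generation.parent is trace
```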

Cost Tracking

See:
references/cost-tracking.md
Key topics covered:
  • Automatic cost calculation from token usage
  • Custom model pricing configuration
  • Monitoring dashboard SQL queries
  • Cost tracking per analysis/user
  • Daily cost trend analysis
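
The cost arithmetic Langfuse performs from usage data is simple to sketch. The per-token rates below are illustrative placeholders, not real prices; in Langfuse they come from built-in model pricing or your custom overrides:

```python
# Illustrative USD rates per token (placeholders, not real prices).
PRICING = {
    "example-model": {"input": 2.50 / 1_000_000, "output": 10.00 / 1_000_000},
}


def generation_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Cost of one generation, derived from token usage and per-token rates."""
    rates = PRICING[model]
    return input_tokens * rates["input"] + output_tokens * rates["output"]


cost = generation_cost("example-model", input_tokens=1_000, output_tokens=500)
assert round(cost, 4) == 0.0075
```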

Prompt Management

See:
references/prompt-management.md
Key topics covered:
  • Prompt versioning and labels (production/staging/draft)
  • Template variables with Jinja2 syntax
  • A/B testing prompt versions
  • OrchestKit 4-level caching architecture (L1-L4)
  • Linking prompts to generation spans
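
Label-based prompt resolution can be sketched with an in-memory registry. The registry and `compile_prompt` helper are hypothetical, for illustration only; Langfuse stores versions and labels server-side and handles {{variable}} substitution itself:

```python
# Hypothetical in-memory registry; Langfuse stores this server-side.
PROMPTS = {
    ("summarize", 1): "Summarize: {{content}}",
    ("summarize", 2): "Summarize in 3 bullets: {{content}}",
}
LABELS = {"production": 1, "staging": 2}  # label -> pinned version


def get_prompt(name: str, label: str = "production") -> str:
    return PROMPTS[(name, LABELS[label])]


def compile_prompt(template: str, **variables: object) -> str:
    # Minimal {{var}} substitution standing in for Jinja2-style templates.
    for key, value in variables.items():
        template = template.replace("{{" + key + "}}", str(value))
    return template


text = compile_prompt(get_prompt("summarize", label="staging"), content="hello")
assert text == "Summarize in 3 bullets: hello"
```

Promoting a version is then just repointing a label, which is what makes A/B tests and rollbacks cheap.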

LLM Evaluation

See:
references/evaluation-scores.md
Key topics covered:
  • Custom scoring with numeric/categorical values
  • G-Eval automated quality assessment
  • Score trends and comparisons
  • Filtering traces by score thresholds
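
Filtering traces by score threshold reduces to a comparison over score records. The trace dicts below are hypothetical; in Langfuse these records come from the scores API and UI:

```python
# Hypothetical trace records with attached scores (illustration only).
traces = [
    {"id": "t1", "scores": {"quality": 0.92}},
    {"id": "t2", "scores": {"quality": 0.41}},
    {"id": "t3", "scores": {"quality": 0.78}},
]


def below_threshold(traces: list[dict], metric: str, threshold: float) -> list[str]:
    """Surface low-scoring traces for review, mirroring a score filter."""
    return [t["id"] for t in traces if t["scores"].get(metric, 1.0) < threshold]


assert below_threshold(traces, "quality", 0.75) == ["t2"]
```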

Session Tracking

See:
references/session-tracking.md
Key topics covered:
  • Grouping traces by session_id
  • Multi-turn conversation tracking
  • User and metadata analytics
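
Grouping traces by session_id is a plain group-by. The flat trace list is hypothetical; Langfuse's sessions view performs this grouping for you:

```python
from collections import defaultdict

# Hypothetical flat trace list; Langfuse's sessions view does this grouping.
traces = [
    {"id": "t1", "session_id": "s1", "user_id": "u1"},
    {"id": "t2", "session_id": "s1", "user_id": "u1"},
    {"id": "t3", "session_id": "s2", "user_id": "u2"},
]


def by_session(traces: list[dict]) -> dict[str, list[str]]:
    sessions: dict[str, list[str]] = defaultdict(list)
    for trace in traces:
        sessions[trace["session_id"]].append(trace["id"])
    return dict(sessions)


assert by_session(traces) == {"s1": ["t1", "t2"], "s2": ["t3"]}
```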

Experiments API

See:
references/experiments-api.md
Key topics covered:
  • Creating test datasets in Langfuse
  • Running automated evaluations
  • Regression testing for LLMs
  • Benchmarking prompt versions
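
Benchmarking two prompt versions comes down to comparing per-item scores across two runs of the same dataset. The scores below are invented for illustration; real ones come from the evaluations the Experiments API runs:

```python
from statistics import mean

# Hypothetical per-item scores from two runs over the same dataset.
run_v1 = [0.80, 0.75, 0.90]  # prompt version 1
run_v2 = [0.85, 0.88, 0.91]  # prompt version 2


def compare_runs(a: list[float], b: list[float]) -> dict:
    """Mean-score comparison of two experiment runs."""
    return {
        "mean_a": mean(a),
        "mean_b": mean(b),
        "winner": "b" if mean(b) > mean(a) else "a",
    }


result = compare_runs(run_v1, run_v2)
assert result["winner"] == "b"
```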

Multi-Judge Evaluation

See:
references/multi-judge-evaluation.md
Key topics covered:
  • Multiple LLM judges for quality assessment
  • Weighted scoring across judges
  • OrchestKit langfuse_evaluators.py integration
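
Weighted scoring across judges is a weighted average. This is a minimal sketch of the idea, not OrchestKit's langfuse_evaluators.py; the judge names and weights are placeholders:

```python
def ensemble_score(judge_scores: dict[str, float],
                   weights: dict[str, float]) -> float:
    """Weighted average across judges; weights need not sum to 1."""
    total_weight = sum(weights[judge] for judge in judge_scores)
    weighted = sum(score * weights[judge] for judge, score in judge_scores.items())
    return weighted / total_weight


# Judge names are placeholders, not the models OrchestKit actually uses.
scores = {"judge_a": 0.9, "judge_b": 0.8, "judge_c": 0.7}
weights = {"judge_a": 2.0, "judge_b": 1.0, "judge_c": 1.0}
assert abs(ensemble_score(scores, weights) - 0.825) < 1e-9
```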

Best Practices

  1. Always use @observe decorator for automatic tracing
  2. Set user_id and session_id for better analytics
  3. Add meaningful metadata (content_type, analysis_id, etc.)
  4. Score all production traces for quality monitoring
  5. Use prompt management instead of hardcoded prompts
  6. Monitor costs daily to catch spikes early
  7. Create datasets for regression testing
  8. Tag production vs staging traces
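
Most of these practices hang off the @observe decorator. A toy version (illustration only, not the real SDK) shows what it does: wrap the function, time it, and record a trace entry with tags:

```python
import functools
import time

TRACES: list[dict] = []  # stand-in for the Langfuse backend (illustration only)


def observe_sketch(fn):
    """Toy @observe: records name, duration, and tags for each call."""
    @functools.wraps(fn)
    def wrapper(*args, **kwargs):
        start = time.monotonic()
        result = fn(*args, **kwargs)
        TRACES.append({
            "name": fn.__name__,
            "duration_s": time.monotonic() - start,
            "tags": ["production"],  # practice 8: tag the environment
        })
        return result
    return wrapper


@observe_sketch
def analyze_content(content: str) -> int:
    return len(content)


assert analyze_content("hello") == 5
assert TRACES[0]["name"] == "analyze_content"
```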

LangSmith Migration Notes

Key Differences:
  • Hosting: Langfuse is self-hosted and open-source; LangSmith is cloud-only and proprietary
  • Cost: Langfuse is free; LangSmith is paid
  • Prompts: Langfuse has built-in prompt management; LangSmith needs external storage
  • Decorator: Langfuse uses @observe; LangSmith uses @traceable

External References

Related Skills

  • observability-monitoring
    - General observability patterns for metrics, logging, and alerting
  • llm-evaluation
    - Evaluation patterns that integrate with Langfuse scoring
  • llm-streaming
    - Streaming response patterns with trace instrumentation
  • prompt-caching
    - Caching strategies that reduce costs tracked by Langfuse

Key Decisions

  • Observability platform: Langfuse (not LangSmith); open-source, self-hosted, free, built-in prompt management
  • Tracing approach: @observe decorator; automatic, low-overhead instrumentation
  • Cost tracking: automatic token counting; built-in model pricing with custom overrides
  • Prompt management: Langfuse native; version control, A/B testing, and labels in one place

Capability Details

distributed-tracing

Keywords: trace, tracing, observability, span, nested, parent-child, observe
Solves:
  • How do I trace LLM calls across my application?
  • How to debug slow LLM responses?
  • Track execution flow in multi-agent workflows
  • Create nested trace spans

cost-tracking

Keywords: cost, token usage, pricing, budget, spend, expense
Solves:
  • How do I track LLM costs?
  • Calculate token usage and pricing
  • Monitor AI budget and spending
  • Track cost per user or session

prompt-management

Keywords: prompt version, prompt template, prompt control, prompt registry
Solves:
  • How do I version control prompts?
  • Manage prompts in production
  • A/B test different prompt versions
  • Link prompts to traces

llm-evaluation

Keywords: score, quality, evaluation, rating, assessment, g-eval
Solves:
  • How do I evaluate LLM output quality?
  • Score responses with custom metrics
  • Track quality trends over time
  • Compare prompt versions by quality

session-tracking

Keywords: session, user tracking, conversation, group traces
Solves:
  • How do I group related traces?
  • Track multi-turn conversations
  • Monitor per-user performance
  • Organize traces by session

langchain-integration

Keywords: langchain, callback, handler, langgraph integration
Solves:
  • How do I integrate Langfuse with LangChain?
  • Use CallbackHandler for tracing
  • Automatic LangGraph workflow tracing
  • LangChain observability setup

datasets-evaluation

Keywords: dataset, test set, evaluation dataset, benchmark
Solves:
  • How do I create test datasets in Langfuse?
  • Run automated evaluations
  • Regression testing for LLMs
  • Benchmark prompt versions

ab-testing

Keywords: a/b test, experiment, compare prompts, variant testing
Solves:
  • How do I A/B test prompts?
  • Compare two prompt versions
  • Experimental prompt evaluation
  • Statistical prompt testing

monitoring-dashboard

Keywords: dashboard, analytics, metrics, monitoring, queries
Solves:
  • What are the most expensive traces?
  • Average cost by agent type
  • Quality score trends
  • Custom monitoring queries

orchestkit-integration

Keywords: orchestkit, migration, setup, workflow integration
Solves:
  • How does OrchestKit use Langfuse?
  • Migrate from LangSmith to Langfuse
  • OrchestKit workflow tracing patterns
  • Cost tracking per analysis

multi-judge-evaluation

Keywords: multi judge, g-eval, multiple evaluators, ensemble evaluation, weighted scoring
Solves:
  • How do I use multiple LLM judges to evaluate quality?
  • Set up G-Eval criteria evaluation
  • Configure weighted scoring across judges
  • Wire OrchestKit's existing langfuse_evaluators.py

experiments-api

Keywords: experiment, dataset, benchmark, regression test, prompt testing
Solves:
  • How do I run experiments across datasets?
  • A/B test models and prompts systematically
  • Track quality regression over time
  • Compare experiment results