langfuse-observability
Langfuse Observability
Overview
Langfuse is an open-source LLM observability platform that OrchestKit uses for tracing, monitoring, evaluation, and prompt management. Unlike LangSmith (which OrchestKit has deprecated), Langfuse is self-hosted, free, and designed for production LLM applications.
When to use this skill:
- Setting up LLM observability from scratch
- Debugging slow or incorrect LLM responses
- Tracking token usage and costs
- Managing prompts in production
- Evaluating LLM output quality
- Migrating from LangSmith to Langfuse
OrchestKit Integration:
- Status: Migrated from LangSmith (Dec 2025)
- Location:
- Location: `backend/app/shared/services/langfuse/`
- MCP Server: `orchestkit-langfuse` (optional)
Quick Start
Setup
`backend/app/shared/services/langfuse/client.py`:

```python
from langfuse import Langfuse
from app.core.config import settings

langfuse_client = Langfuse(
    public_key=settings.LANGFUSE_PUBLIC_KEY,
    secret_key=settings.LANGFUSE_SECRET_KEY,
    host=settings.LANGFUSE_HOST  # Self-hosted or cloud
)
```
Basic Tracing with @observe
```python
from langfuse.decorators import observe, langfuse_context

@observe()  # Automatic tracing
async def analyze_content(content: str):
    langfuse_context.update_current_observation(
        metadata={"content_length": len(content)}
    )
    return await llm.generate(content)
```

Session & User Tracking
```python
langfuse_client.trace(
    name="analysis",
    user_id="user_123",
    session_id="session_abc",
    metadata={"content_type": "article", "agent_count": 8},
    tags=["production", "orchestkit"]
)
```

Core Features Summary
| Feature | Description | Reference |
|---|---|---|
| Distributed Tracing | Track LLM calls with parent-child spans | references/tracing-setup.md |
| Cost Tracking | Automatic token & cost calculation | references/cost-tracking.md |
| Prompt Management | Version control for prompts | references/prompt-management.md |
| LLM Evaluation | Custom scoring with G-Eval | references/evaluation-scores.md |
| Session Tracking | Group related traces | references/session-tracking.md |
| Experiments API | A/B testing & benchmarks | references/experiments-api.md |
| Multi-Judge Eval | Ensemble LLM evaluation | references/multi-judge-evaluation.md |
References
Tracing Setup
See: references/tracing-setup.md

Key topics covered:
- Initializing Langfuse client with @observe decorator
- Creating nested traces and spans
- Tracking LLM generations with metadata
- LangChain/LangGraph CallbackHandler integration
- Workflow integration patterns
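To make the parent-child span structure concrete, here is a self-contained toy sketch of how an `@observe`-style decorator can record a span tree. This is not the Langfuse SDK implementation — `Span`, `ROOT_SPANS`, and the function names are invented for illustration; the real SDK also records timing, inputs/outputs, and exports to the Langfuse backend:

```python
# Toy sketch of how an @observe-style decorator can record a
# parent-child span tree. Illustrative only -- not the Langfuse SDK.
import contextvars
import functools

_current_span = contextvars.ContextVar("current_span", default=None)
ROOT_SPANS = []  # completed top-level traces

class Span:
    def __init__(self, name):
        self.name = name
        self.children = []

def observe(func):
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        span = Span(func.__name__)
        parent = _current_span.get()
        # Attach to the enclosing span if one exists, else start a new trace
        (parent.children if parent else ROOT_SPANS).append(span)
        token = _current_span.set(span)
        try:
            return func(*args, **kwargs)
        finally:
            _current_span.reset(token)
    return wrapper

@observe
def fetch_content():
    return "raw content"

@observe
def analyze():
    return fetch_content().upper()

result = analyze()
# ROOT_SPANS now holds one "analyze" trace with a nested "fetch_content" span
```

A `contextvars.ContextVar` (rather than a plain global) is what makes this pattern safe under async code, where several traces may be in flight concurrently.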
Cost Tracking
See: references/cost-tracking.md

Key topics covered:
- Automatic cost calculation from token usage
- Custom model pricing configuration
- Monitoring dashboard SQL queries
- Cost tracking per analysis/user
- Daily cost trend analysis
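The cost math itself is simple; a minimal sketch with illustrative per-1M-token prices (`MODEL_PRICES` and the model names are placeholders, not Langfuse's actual price table — Langfuse ships its own model pricing and lets you override it per model):

```python
# Sketch of cost calculation from token usage. Prices are illustrative
# placeholders: USD per 1M tokens, as (input, output) pairs.
MODEL_PRICES = {
    "gpt-4o-mini": (0.15, 0.60),
    "claude-sonnet": (3.00, 15.00),
}

def generation_cost(model, input_tokens, output_tokens, overrides=None):
    # Custom overrides take precedence over the built-in price table
    prices = {**MODEL_PRICES, **(overrides or {})}
    input_price, output_price = prices[model]
    return (input_tokens * input_price + output_tokens * output_price) / 1_000_000

cost = generation_cost("gpt-4o-mini", input_tokens=12_000, output_tokens=800)
# 12_000 * 0.15 + 800 * 0.60 = 2280 -> $0.00228
```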
Prompt Management
See: references/prompt-management.md

Key topics covered:
- Prompt versioning and labels (production/staging/draft)
- Template variables with Jinja2 syntax
- A/B testing prompt versions
- OrchestKit 4-level caching architecture (L1-L4)
- Linking prompts to generation spans
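At its core, template compilation is variable substitution; a minimal sketch of `{{variable}}`-style replacement (`compile_prompt` is a hypothetical stand-in for the compile step the Langfuse SDK performs on fetched prompts):

```python
import re

def compile_prompt(template: str, **variables) -> str:
    # Replace {{name}} placeholders with supplied values.
    # Hypothetical stand-in for prompt compilation in the Langfuse SDK.
    return re.sub(
        r"\{\{\s*(\w+)\s*\}\}",
        lambda m: str(variables[m.group(1)]),
        template,
    )

prompt = compile_prompt(
    "Summarize this {{content_type}} in {{language}}.",
    content_type="article",
    language="English",
)
# -> "Summarize this article in English."
```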
LLM Evaluation
See: references/evaluation-scores.md

Key topics covered:
- Custom scoring with numeric/categorical values
- G-Eval automated quality assessment
- Score trends and comparisons
- Filtering traces by score thresholds
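Threshold filtering is straightforward once scores are attached to traces; a toy sketch over in-memory records (the record shape is invented for illustration — in Langfuse you would filter in the UI or via the fetch API):

```python
# Toy trace records with attached scores; the dict shape is invented.
traces = [
    {"id": "t1", "scores": {"quality": 0.92}},
    {"id": "t2", "scores": {"quality": 0.41}},
    {"id": "t3", "scores": {"quality": 0.78}},
]

def below_threshold(traces, metric, threshold):
    """Return ids of traces whose score on `metric` falls below `threshold`."""
    return [t["id"] for t in traces if t["scores"].get(metric, 0.0) < threshold]

flagged = below_threshold(traces, "quality", 0.75)
# -> ["t2"]
```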
Session Tracking
See: references/session-tracking.md

Key topics covered:
- Grouping traces by session_id
- Multi-turn conversation tracking
- User and metadata analytics
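Grouping amounts to bucketing traces by their `session_id`; a self-contained sketch with toy records (in Langfuse the grouping is done server-side, so this is only to show the shape of the data):

```python
from collections import defaultdict

# Toy trace records; only session_id matters for grouping.
traces = [
    {"id": "t1", "session_id": "s1"},
    {"id": "t2", "session_id": "s2"},
    {"id": "t3", "session_id": "s1"},
]

def group_by_session(traces):
    # Bucket trace ids under their session, preserving arrival order
    sessions = defaultdict(list)
    for trace in traces:
        sessions[trace["session_id"]].append(trace["id"])
    return dict(sessions)

conversations = group_by_session(traces)
# -> {"s1": ["t1", "t3"], "s2": ["t2"]}
```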
Experiments API
See: references/experiments-api.md

Key topics covered:
- Creating test datasets in Langfuse
- Running automated evaluations
- Regression testing for LLMs
- Benchmarking prompt versions
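A regression run reduces to scoring each dataset item and aggregating; a toy sketch with an exact-match check (the dataset contents and the check are illustrative — the Langfuse Experiments API manages datasets and runs server-side, and real evaluations usually score rather than exact-match):

```python
# Toy regression check: run each dataset item through a model function
# and compute the pass rate.
dataset = [
    {"input": "2+2", "expected": "4"},
    {"input": "3+3", "expected": "6"},
    {"input": "5+5", "expected": "10"},
]

def run_experiment(dataset, model_fn):
    passed = sum(1 for item in dataset if model_fn(item["input"]) == item["expected"])
    return passed / len(dataset)

# Canned "model" answers with one deliberate failure on the last item
answers = {"2+2": "4", "3+3": "6", "5+5": "8"}
pass_rate = run_experiment(dataset, answers.get)
# -> 2/3
```

Running the same dataset against two prompt versions and comparing their pass rates is the essence of the benchmarking workflow described above.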
Multi-Judge Evaluation
See: references/multi-judge-evaluation.md

Key topics covered:
- Multiple LLM judges for quality assessment
- Weighted scoring across judges
- OrchestKit langfuse_evaluators.py integration
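The weighted-scoring step can be sketched in a few lines (judge names and weights here are hypothetical — see `langfuse_evaluators.py` for the actual OrchestKit wiring):

```python
# Sketch of weighted ensemble scoring across multiple LLM judges.
def ensemble_score(judge_scores, weights):
    # Normalize by the total weight of the judges that actually responded,
    # so a missing judge does not drag the score down
    total_weight = sum(weights[judge] for judge in judge_scores)
    weighted = sum(score * weights[judge] for judge, score in judge_scores.items())
    return weighted / total_weight

overall = ensemble_score(
    {"gpt_judge": 0.9, "claude_judge": 0.8, "gemini_judge": 0.7},
    {"gpt_judge": 0.5, "claude_judge": 0.3, "gemini_judge": 0.2},
)
# 0.9*0.5 + 0.8*0.3 + 0.7*0.2 = 0.83
```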
Best Practices
- Always use @observe decorator for automatic tracing
- Set user_id and session_id for better analytics
- Add meaningful metadata (content_type, analysis_id, etc.)
- Score all production traces for quality monitoring
- Use prompt management instead of hardcoded prompts
- Monitor costs daily to catch spikes early
- Create datasets for regression testing
- Tag production vs staging traces
LangSmith Migration Notes
Key Differences:
| Aspect | Langfuse | LangSmith |
|---|---|---|
| Hosting | Self-hosted, open-source | Cloud-only, proprietary |
| Cost | Free | Paid |
| Prompts | Built-in management | External storage needed |
| Decorator | @observe | @traceable |
External References
Related Skills
- observability-monitoring - General observability patterns for metrics, logging, and alerting
- llm-evaluation - Evaluation patterns that integrate with Langfuse scoring
- llm-streaming - Streaming response patterns with trace instrumentation
- prompt-caching - Caching strategies that reduce costs tracked by Langfuse
Key Decisions
| Decision | Choice | Rationale |
|---|---|---|
| Observability platform | Langfuse (not LangSmith) | Open-source, self-hosted, free, built-in prompt management |
| Tracing approach | @observe decorator | Automatic, low-overhead instrumentation |
| Cost tracking | Automatic token counting | Built-in model pricing with custom overrides |
| Prompt management | Langfuse native | Version control, A/B testing, labels in one place |
Capability Details
distributed-tracing
Keywords: trace, tracing, observability, span, nested, parent-child, observe
Solves:
- How do I trace LLM calls across my application?
- How to debug slow LLM responses?
- Track execution flow in multi-agent workflows
- Create nested trace spans
cost-tracking
Keywords: cost, token usage, pricing, budget, spend, expense
Solves:
- How do I track LLM costs?
- Calculate token usage and pricing
- Monitor AI budget and spending
- Track cost per user or session
prompt-management
Keywords: prompt version, prompt template, prompt control, prompt registry
Solves:
- How do I version control prompts?
- Manage prompts in production
- A/B test different prompt versions
- Link prompts to traces
llm-evaluation
Keywords: score, quality, evaluation, rating, assessment, g-eval
Solves:
- How do I evaluate LLM output quality?
- Score responses with custom metrics
- Track quality trends over time
- Compare prompt versions by quality
session-tracking
Keywords: session, user tracking, conversation, group traces
Solves:
- How do I group related traces?
- Track multi-turn conversations
- Monitor per-user performance
- Organize traces by session
langchain-integration
Keywords: langchain, callback, handler, langgraph integration
Solves:
- How do I integrate Langfuse with LangChain?
- Use CallbackHandler for tracing
- Automatic LangGraph workflow tracing
- LangChain observability setup
datasets-evaluation
Keywords: dataset, test set, evaluation dataset, benchmark
Solves:
- How do I create test datasets in Langfuse?
- Run automated evaluations
- Regression testing for LLMs
- Benchmark prompt versions
ab-testing
Keywords: a/b test, experiment, compare prompts, variant testing
Solves:
- How do I A/B test prompts?
- Compare two prompt versions
- Experimental prompt evaluation
- Statistical prompt testing
monitoring-dashboard
Keywords: dashboard, analytics, metrics, monitoring, queries
Solves:
- What are the most expensive traces?
- Average cost by agent type
- Quality score trends
- Custom monitoring queries
orchestkit-integration
Keywords: orchestkit, migration, setup, workflow integration
Solves:
- How does OrchestKit use Langfuse?
- Migrate from LangSmith to Langfuse
- OrchestKit workflow tracing patterns
- Cost tracking per analysis
multi-judge-evaluation
Keywords: multi judge, g-eval, multiple evaluators, ensemble evaluation, weighted scoring
Solves:
- How do I use multiple LLM judges to evaluate quality?
- Set up G-Eval criteria evaluation
- Configure weighted scoring across judges
- Wire OrchestKit's existing langfuse_evaluators.py
experiments-api
Keywords: experiment, dataset, benchmark, regression test, prompt testing
Solves:
- How do I run experiments across datasets?
- A/B test models and prompts systematically
- Track quality regression over time
- Compare experiment results