langfuse-observability

Langfuse Observability

Overview

Langfuse is the open-source LLM observability platform that OrchestKit uses for tracing, monitoring, evaluation, and prompt management. Unlike LangSmith (now deprecated in OrchestKit), Langfuse is self-hosted, free, and designed for production LLM applications.
When to use this skill:
  • Setting up LLM observability from scratch
  • Debugging slow or incorrect LLM responses
  • Tracking token usage and costs
  • Managing prompts in production
  • Evaluating LLM output quality
  • Migrating from LangSmith to Langfuse
OrchestKit Integration:
  • Status: Migrated from LangSmith (Dec 2025)
  • Location:
    backend/app/shared/services/langfuse/
  • MCP Server:
    orchestkit-langfuse
    (optional)

Quick Start

Setup

backend/app/shared/services/langfuse/client.py

python
from langfuse import Langfuse

from app.core.config import settings

langfuse_client = Langfuse(
    public_key=settings.LANGFUSE_PUBLIC_KEY,
    secret_key=settings.LANGFUSE_SECRET_KEY,
    host=settings.LANGFUSE_HOST  # Self-hosted or cloud
)

Basic Tracing with @observe


python
from langfuse.decorators import observe, langfuse_context

@observe()  # Automatic tracing
async def analyze_content(content: str):
    langfuse_context.update_current_observation(
        metadata={"content_length": len(content)}
    )
    return await llm.generate(content)

Session & User Tracking

python
langfuse_client.trace(
    name="analysis",
    user_id="user_123",
    session_id="session_abc",
    metadata={"content_type": "article", "agent_count": 8},
    tags=["production", "orchestkit"]
)

Core Features Summary

  • Distributed Tracing: track LLM calls with parent-child spans (references/tracing-setup.md)
  • Cost Tracking: automatic token & cost calculation (references/cost-tracking.md)
  • Prompt Management: version control for prompts (references/prompt-management.md)
  • LLM Evaluation: custom scoring with G-Eval (references/evaluation-scores.md)
  • Session Tracking: group related traces (references/session-tracking.md)
  • Experiments API: A/B testing & benchmarks (references/experiments-api.md)
  • Multi-Judge Eval: ensemble LLM evaluation (references/multi-judge-evaluation.md)

References

Tracing Setup

See:
references/tracing-setup.md
Key topics covered:
  • Initializing Langfuse client with @observe decorator
  • Creating nested traces and spans
  • Tracking LLM generations with metadata
  • LangChain/LangGraph CallbackHandler integration
  • Workflow integration patterns
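
The parent-child span model that the reference describes can be sketched without the SDK. The `Span` class below is a hypothetical stand-in for a Langfuse observation, for illustration only; in practice the @observe decorator builds this tree for you:

```python
from __future__ import annotations

import time
from dataclasses import dataclass, field


@dataclass
class Span:
    """Hypothetical stand-in for a Langfuse observation (illustration only)."""
    name: str
    parent: Span | None = None
    children: list[Span] = field(default_factory=list)
    started_at: float = field(default_factory=time.monotonic)
    ended_at: float | None = None

    def child(self, name: str) -> Span:
        # Nested spans record parent-child structure, as in a Langfuse trace.
        span = Span(name, parent=self)
        self.children.append(span)
        return span

    def finish(self) -> None:
        self.ended_at = time.monotonic()


# A trace is the root span; each LLM call or tool step hangs off it.
trace = Span("analysis")
retrieval = trace.child("retrieve_context")
retrieval.finish()
generation = trace.child("llm_generation")
generation.finish()
trace.finish()

assert [c.name for c in trace.children] == ["retrieve_context", "llm_generation"]
assert generation.parent is trace
```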

Cost Tracking

See:
references/cost-tracking.md
Key topics covered:
  • Automatic cost calculation from token usage
  • Custom model pricing configuration
  • Monitoring dashboard SQL queries
  • Cost tracking per analysis/user
  • Daily cost trend analysis
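
The cost arithmetic Langfuse performs from usage data is simple to sketch. The per-token rates below are illustrative placeholders, not real prices; in Langfuse they come from built-in model pricing or your custom overrides:

```python
# Illustrative USD rates per token (placeholders, not real prices).
PRICING = {
    "example-model": {"input": 2.50 / 1_000_000, "output": 10.00 / 1_000_000},
}


def generation_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Cost of one generation, derived from token usage and per-token rates."""
    rates = PRICING[model]
    return input_tokens * rates["input"] + output_tokens * rates["output"]


cost = generation_cost("example-model", input_tokens=1_000, output_tokens=500)
assert round(cost, 4) == 0.0075
```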

Prompt Management

See:
references/prompt-management.md
Key topics covered:
  • Prompt versioning and labels (production/staging/draft)
  • Template variables with Jinja2 syntax
  • A/B testing prompt versions
  • OrchestKit 4-level caching architecture (L1-L4)
  • Linking prompts to generation spans
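
Label-based prompt resolution can be sketched with an in-memory registry. The registry and `compile_prompt` helper are hypothetical, for illustration only; Langfuse stores versions and labels server-side and handles {{variable}} substitution itself:

```python
# Hypothetical in-memory registry; Langfuse stores this server-side.
PROMPTS = {
    ("summarize", 1): "Summarize: {{content}}",
    ("summarize", 2): "Summarize in 3 bullets: {{content}}",
}
LABELS = {"production": 1, "staging": 2}  # label -> pinned version


def get_prompt(name: str, label: str = "production") -> str:
    return PROMPTS[(name, LABELS[label])]


def compile_prompt(template: str, **variables: object) -> str:
    # Minimal {{var}} substitution standing in for Jinja2-style templates.
    for key, value in variables.items():
        template = template.replace("{{" + key + "}}", str(value))
    return template


text = compile_prompt(get_prompt("summarize", label="staging"), content="hello")
assert text == "Summarize in 3 bullets: hello"
```

Promoting a version is then just repointing a label, which is what makes A/B tests and rollbacks cheap.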

LLM Evaluation

See:
references/evaluation-scores.md
Key topics covered:
  • Custom scoring with numeric/categorical values
  • G-Eval automated quality assessment
  • Score trends and comparisons
  • Filtering traces by score thresholds
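
Filtering traces by score threshold reduces to a comparison over score records. The trace dicts below are hypothetical; in Langfuse these records come from the scores API and UI:

```python
# Hypothetical trace records with attached scores (illustration only).
traces = [
    {"id": "t1", "scores": {"quality": 0.92}},
    {"id": "t2", "scores": {"quality": 0.41}},
    {"id": "t3", "scores": {"quality": 0.78}},
]


def below_threshold(traces: list[dict], metric: str, threshold: float) -> list[str]:
    """Surface low-scoring traces for review, mirroring a score filter."""
    return [t["id"] for t in traces if t["scores"].get(metric, 1.0) < threshold]


assert below_threshold(traces, "quality", 0.75) == ["t2"]
```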

Session Tracking

See:
references/session-tracking.md
Key topics covered:
  • Grouping traces by session_id
  • Multi-turn conversation tracking
  • User and metadata analytics
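
Grouping traces by session_id is a plain group-by. The flat trace list is hypothetical; Langfuse's sessions view performs this grouping for you:

```python
from collections import defaultdict

# Hypothetical flat trace list; Langfuse's sessions view does this grouping.
traces = [
    {"id": "t1", "session_id": "s1", "user_id": "u1"},
    {"id": "t2", "session_id": "s1", "user_id": "u1"},
    {"id": "t3", "session_id": "s2", "user_id": "u2"},
]


def by_session(traces: list[dict]) -> dict[str, list[str]]:
    sessions: dict[str, list[str]] = defaultdict(list)
    for trace in traces:
        sessions[trace["session_id"]].append(trace["id"])
    return dict(sessions)


assert by_session(traces) == {"s1": ["t1", "t2"], "s2": ["t3"]}
```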

Experiments API

See:
references/experiments-api.md
Key topics covered:
  • Creating test datasets in Langfuse
  • Running automated evaluations
  • Regression testing for LLMs
  • Benchmarking prompt versions
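
Benchmarking two prompt versions comes down to comparing per-item scores across two runs of the same dataset. The scores below are invented for illustration; real ones come from the evaluations the Experiments API runs:

```python
from statistics import mean

# Hypothetical per-item scores from two runs over the same dataset.
run_v1 = [0.80, 0.75, 0.90]  # prompt version 1
run_v2 = [0.85, 0.88, 0.91]  # prompt version 2


def compare_runs(a: list[float], b: list[float]) -> dict:
    """Mean-score comparison of two experiment runs."""
    return {
        "mean_a": mean(a),
        "mean_b": mean(b),
        "winner": "b" if mean(b) > mean(a) else "a",
    }


result = compare_runs(run_v1, run_v2)
assert result["winner"] == "b"
```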

Multi-Judge Evaluation

See:
references/multi-judge-evaluation.md
Key topics covered:
  • Multiple LLM judges for quality assessment
  • Weighted scoring across judges
  • OrchestKit langfuse_evaluators.py integration
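
Weighted scoring across judges is a weighted average. This is a minimal sketch of the idea, not OrchestKit's langfuse_evaluators.py; the judge names and weights are placeholders:

```python
def ensemble_score(judge_scores: dict[str, float],
                   weights: dict[str, float]) -> float:
    """Weighted average across judges; weights need not sum to 1."""
    total_weight = sum(weights[judge] for judge in judge_scores)
    weighted = sum(score * weights[judge] for judge, score in judge_scores.items())
    return weighted / total_weight


# Judge names are placeholders, not the models OrchestKit actually uses.
scores = {"judge_a": 0.9, "judge_b": 0.8, "judge_c": 0.7}
weights = {"judge_a": 2.0, "judge_b": 1.0, "judge_c": 1.0}
assert abs(ensemble_score(scores, weights) - 0.825) < 1e-9
```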

Best Practices

  1. Always use @observe decorator for automatic tracing
  2. Set user_id and session_id for better analytics
  3. Add meaningful metadata (content_type, analysis_id, etc.)
  4. Score all production traces for quality monitoring
  5. Use prompt management instead of hardcoded prompts
  6. Monitor costs daily to catch spikes early
  7. Create datasets for regression testing
  8. Tag production vs staging traces
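
Most of these practices hang off the @observe decorator. A toy version (illustration only, not the real SDK) shows what it does: wrap the function, time it, and record a trace entry with tags:

```python
import functools
import time

TRACES: list[dict] = []  # stand-in for the Langfuse backend (illustration only)


def observe_sketch(fn):
    """Toy @observe: records name, duration, and tags for each call."""
    @functools.wraps(fn)
    def wrapper(*args, **kwargs):
        start = time.monotonic()
        result = fn(*args, **kwargs)
        TRACES.append({
            "name": fn.__name__,
            "duration_s": time.monotonic() - start,
            "tags": ["production"],  # practice 8: tag the environment
        })
        return result
    return wrapper


@observe_sketch
def analyze_content(content: str) -> int:
    return len(content)


assert analyze_content("hello") == 5
assert TRACES[0]["name"] == "analyze_content"
```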

LangSmith Migration Notes

Key Differences:
  • Hosting: Langfuse is self-hosted and open-source; LangSmith is cloud-only and proprietary
  • Cost: Langfuse is free; LangSmith is paid
  • Prompts: Langfuse has built-in prompt management; LangSmith needs external storage
  • Decorator: Langfuse uses @observe; LangSmith uses @traceable

External References

Related Skills

  • observability-monitoring
    - General observability patterns for metrics, logging, and alerting
  • llm-evaluation
    - Evaluation patterns that integrate with Langfuse scoring
  • llm-streaming
    - Streaming response patterns with trace instrumentation
  • prompt-caching
    - Caching strategies that reduce costs tracked by Langfuse

Key Decisions

  • Observability platform: Langfuse (not LangSmith); open-source, self-hosted, free, built-in prompt management
  • Tracing approach: @observe decorator; automatic, low-overhead instrumentation
  • Cost tracking: automatic token counting; built-in model pricing with custom overrides
  • Prompt management: Langfuse native; version control, A/B testing, and labels in one place

Capability Details

distributed-tracing

Keywords: trace, tracing, observability, span, nested, parent-child, observe
Solves:
  • How do I trace LLM calls across my application?
  • How to debug slow LLM responses?
  • Track execution flow in multi-agent workflows
  • Create nested trace spans

cost-tracking

Keywords: cost, token usage, pricing, budget, spend, expense
Solves:
  • How do I track LLM costs?
  • Calculate token usage and pricing
  • Monitor AI budget and spending
  • Track cost per user or session

prompt-management

Keywords: prompt version, prompt template, prompt control, prompt registry
Solves:
  • How do I version control prompts?
  • Manage prompts in production
  • A/B test different prompt versions
  • Link prompts to traces

llm-evaluation

Keywords: score, quality, evaluation, rating, assessment, g-eval
Solves:
  • How do I evaluate LLM output quality?
  • Score responses with custom metrics
  • Track quality trends over time
  • Compare prompt versions by quality

session-tracking

Keywords: session, user tracking, conversation, group traces
Solves:
  • How do I group related traces?
  • Track multi-turn conversations
  • Monitor per-user performance
  • Organize traces by session

langchain-integration

Keywords: langchain, callback, handler, langgraph integration
Solves:
  • How do I integrate Langfuse with LangChain?
  • Use CallbackHandler for tracing
  • Automatic LangGraph workflow tracing
  • LangChain observability setup

datasets-evaluation

Keywords: dataset, test set, evaluation dataset, benchmark
Solves:
  • How do I create test datasets in Langfuse?
  • Run automated evaluations
  • Regression testing for LLMs
  • Benchmark prompt versions

ab-testing

Keywords: a/b test, experiment, compare prompts, variant testing
Solves:
  • How do I A/B test prompts?
  • Compare two prompt versions
  • Experimental prompt evaluation
  • Statistical prompt testing

monitoring-dashboard

Keywords: dashboard, analytics, metrics, monitoring, queries
Solves:
  • What are the most expensive traces?
  • Average cost by agent type
  • Quality score trends
  • Custom monitoring queries

orchestkit-integration

Keywords: orchestkit, migration, setup, workflow integration
Solves:
  • How does OrchestKit use Langfuse?
  • Migrate from LangSmith to Langfuse
  • OrchestKit workflow tracing patterns
  • Cost tracking per analysis

multi-judge-evaluation

Keywords: multi judge, g-eval, multiple evaluators, ensemble evaluation, weighted scoring
Solves:
  • How do I use multiple LLM judges to evaluate quality?
  • Set up G-Eval criteria evaluation
  • Configure weighted scoring across judges
  • Wire OrchestKit's existing langfuse_evaluators.py

experiments-api

Keywords: experiment, dataset, benchmark, regression test, prompt testing
Solves:
  • How do I run experiments across datasets?
  • A/B test models and prompts systematically
  • Track quality regression over time
  • Compare experiment results