
Groq Observability

Overview

Set up comprehensive observability for Groq integrations.

Prerequisites

  • Prometheus or compatible metrics backend
  • OpenTelemetry SDK installed
  • Grafana or similar dashboarding tool
  • AlertManager configured

Metrics Collection

Key Metrics

| Metric | Type | Description |
| --- | --- | --- |
| groq_requests_total | Counter | Total API requests |
| groq_request_duration_seconds | Histogram | Request latency |
| groq_errors_total | Counter | Error count by type |
| groq_rate_limit_remaining | Gauge | Rate limit headroom |

Prometheus Metrics

```typescript
import { Registry, Counter, Histogram, Gauge } from 'prom-client';

const registry = new Registry();

const requestCounter = new Counter({
  name: 'groq_requests_total',
  help: 'Total Groq API requests',
  labelNames: ['method', 'status'],
  registers: [registry],
});

const requestDuration = new Histogram({
  name: 'groq_request_duration_seconds',
  help: 'Groq request duration',
  labelNames: ['method'],
  buckets: [0.05, 0.1, 0.25, 0.5, 1, 2.5, 5],
  registers: [registry],
});

const errorCounter = new Counter({
  name: 'groq_errors_total',
  help: 'Groq errors by type',
  labelNames: ['error_type'],
  registers: [registry],
});
```

Instrumented Client

```typescript
async function instrumentedRequest<T>(
  method: string,
  operation: () => Promise<T>
): Promise<T> {
  const timer = requestDuration.startTimer({ method });

  try {
    const result = await operation();
    requestCounter.inc({ method, status: 'success' });
    return result;
  } catch (error: any) {
    requestCounter.inc({ method, status: 'error' });
    errorCounter.inc({ error_type: error.code || 'unknown' });
    throw error;
  } finally {
    timer();
  }
}
```

Distributed Tracing

OpenTelemetry Setup

```typescript
import { trace, SpanStatusCode } from '@opentelemetry/api';

const tracer = trace.getTracer('groq-client');

async function tracedGroqCall<T>(
  operationName: string,
  operation: () => Promise<T>
): Promise<T> {
  return tracer.startActiveSpan(`groq.${operationName}`, async (span) => {
    try {
      const result = await operation();
      span.setStatus({ code: SpanStatusCode.OK });
      return result;
    } catch (error: any) {
      span.setStatus({ code: SpanStatusCode.ERROR, message: error.message });
      span.recordException(error);
      throw error;
    } finally {
      span.end();
    }
  });
}
```

Logging Strategy

Structured Logging

```typescript
import pino from 'pino';

const logger = pino({
  name: 'groq',
  level: process.env.LOG_LEVEL || 'info',
});

function logGroqOperation(
  operation: string,
  data: Record<string, any>,
  duration: number
) {
  logger.info({
    service: 'groq',
    operation,
    duration_ms: duration,
    ...data,
  });
}
```

Alert Configuration

Prometheus AlertManager Rules

groq_alerts.yaml

```yaml
groups:
  - name: groq_alerts
    rules:
      - alert: GroqHighErrorRate
        expr: |
          rate(groq_errors_total[5m]) / rate(groq_requests_total[5m]) > 0.05
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Groq error rate > 5%"
      - alert: GroqHighLatency
        expr: |
          histogram_quantile(0.95,
            rate(groq_request_duration_seconds_bucket[5m])
          ) > 2
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Groq P95 latency > 2s"
      - alert: GroqDown
        expr: up{job="groq"} == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Groq integration is down"
```
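
The rules above only evaluate conditions; delivering the resulting alerts is AlertManager's routing job. A minimal routing sketch, where the receiver names, Slack channel, and both credential values are placeholders to replace with your own:

```yaml
route:
  receiver: default
  routes:
    - match:
        severity: critical
      receiver: oncall
receivers:
  - name: default
    slack_configs:
      - channel: '#groq-alerts'
        api_url: 'https://hooks.slack.com/services/REPLACE_ME'
  - name: oncall
    pagerduty_configs:
      - service_key: 'REPLACE_ME'
```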

Dashboard

Grafana Panel Queries

```json
{
  "panels": [
    {
      "title": "Groq Request Rate",
      "targets": [
        { "expr": "rate(groq_requests_total[5m])" }
      ]
    },
    {
      "title": "Groq Latency P50/P95/P99",
      "targets": [
        { "expr": "histogram_quantile(0.5, rate(groq_request_duration_seconds_bucket[5m]))" },
        { "expr": "histogram_quantile(0.95, rate(groq_request_duration_seconds_bucket[5m]))" },
        { "expr": "histogram_quantile(0.99, rate(groq_request_duration_seconds_bucket[5m]))" }
      ]
    }
  ]
}
```

Instructions

Step 1: Set Up Metrics Collection

Implement Prometheus counters, histograms, and gauges for key operations.

Step 2: Add Distributed Tracing

Integrate OpenTelemetry for end-to-end request tracing.

Step 3: Configure Structured Logging

Set up JSON logging with consistent field names.

Step 4: Create Alert Rules

Define Prometheus alerting rules for error rates and latency.

Output

  • Metrics collection enabled
  • Distributed tracing configured
  • Structured logging implemented
  • Alert rules deployed

Error Handling

| Issue | Cause | Solution |
| --- | --- | --- |
| Missing metrics | No instrumentation | Wrap client calls |
| Trace gaps | Missing propagation | Check context headers |
| Alert storms | Wrong thresholds | Tune alert rules |
| High cardinality | Too many labels | Reduce label values |

Examples

Quick Metrics Endpoint

```typescript
app.get('/metrics', async (req, res) => {
  res.set('Content-Type', registry.contentType);
  res.send(await registry.metrics());
});
```

Resources

Next Steps

For incident response, see groq-incident-runbook.