instantly-observability

Instantly Observability

Overview

Set up metrics, distributed tracing, structured logging, and alerting for Instantly integrations.

Prerequisites

  • Prometheus or compatible metrics backend
  • OpenTelemetry SDK installed
  • Grafana or similar dashboarding tool
  • AlertManager configured

Metrics Collection

Key Metrics

Metric                               Type       Description
instantly_requests_total             Counter    Total API requests
instantly_request_duration_seconds   Histogram  Request latency
instantly_errors_total               Counter    Error count by type
instantly_rate_limit_remaining       Gauge      Rate limit headroom

Prometheus Metrics

typescript
import { Registry, Counter, Histogram, Gauge } from 'prom-client';

const registry = new Registry();

const requestCounter = new Counter({
  name: 'instantly_requests_total',
  help: 'Total Instantly API requests',
  labelNames: ['method', 'status'],
  registers: [registry],
});

const requestDuration = new Histogram({
  name: 'instantly_request_duration_seconds',
  help: 'Instantly request duration',
  labelNames: ['method'],
  buckets: [0.05, 0.1, 0.25, 0.5, 1, 2.5, 5],
  registers: [registry],
});

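const errorCounter = new Counter({
  name: 'instantly_errors_total',
  help: 'Instantly errors by type',
  labelNames: ['error_type'],
  registers: [registry],
});

// Gauge for the rate limit headroom listed under Key Metrics.
const rateLimitRemaining = new Gauge({
  name: 'instantly_rate_limit_remaining',
  help: 'Remaining Instantly API rate limit headroom',
  registers: [registry],
});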
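
The instantly_rate_limit_remaining gauge listed under Key Metrics needs a numeric value to report. The following is a minimal sketch of reading one from response headers. It assumes the API returns a remaining-quota header; the header name `x-ratelimit-remaining` and the helper `parseRateLimitRemaining` are hypothetical, so confirm the real header against actual responses before relying on it.

```typescript
// Hypothetical header name: confirm against real Instantly API responses.
const RATE_LIMIT_HEADER = 'x-ratelimit-remaining';

type HeaderMap = Record<string, string | undefined>;

// Returns the remaining request quota, or undefined when the header is
// absent or is not a non-negative integer, so a gauge is never set to NaN.
function parseRateLimitRemaining(headers: HeaderMap): number | undefined {
  // HTTP header names are case-insensitive, so compare in lower case.
  for (const [name, value] of Object.entries(headers)) {
    if (name.toLowerCase() !== RATE_LIMIT_HEADER || value === undefined) continue;
    const trimmed = value.trim();
    if (!/^\d+$/.test(trimmed)) return undefined;
    return Number(trimmed);
  }
  return undefined;
}
```

When the helper returns a number, it can be passed to the `set()` method of a prom-client Gauge registered under instantly_rate_limit_remaining; when it returns undefined, leaving the gauge untouched avoids reporting a misleading zero.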

Instrumented Client

typescript
async function instrumentedRequest<T>(
  method: string,
  operation: () => Promise<T>
): Promise<T> {
  const timer = requestDuration.startTimer({ method });

  try {
    const result = await operation();
    requestCounter.inc({ method, status: 'success' });
    return result;
  } catch (error: any) {
    requestCounter.inc({ method, status: 'error' });
    errorCounter.inc({ error_type: error.code || 'unknown' });
    throw error;
  } finally {
    timer();
  }
}
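
The wrapper above uses `error.code` directly as the `error_type` label, so every distinct code creates a new time series. One way to keep that label bounded is to map errors onto a fixed set of categories first. The sketch below is an illustration only: the category names, the `status` property on the error object, and the helper `normalizeErrorType` are assumptions, not part of the Instantly API.

```typescript
// Fixed allow-list of label values; anything else collapses to 'unknown'.
// These category names are illustrative, not defined by the Instantly API.
const KNOWN_ERROR_TYPES = [
  'timeout',
  'rate_limited',
  'unauthorized',
  'not_found',
  'server_error',
] as const;

type ErrorType = (typeof KNOWN_ERROR_TYPES)[number] | 'unknown';

function normalizeErrorType(error: unknown): ErrorType {
  if (typeof error !== 'object' || error === null) return 'unknown';
  // Assumption: the HTTP client attaches `status` and/or `code` to errors.
  const { code, status } = error as { code?: unknown; status?: unknown };

  // Map HTTP status codes to a category when one is present.
  if (typeof status === 'number') {
    if (status === 401 || status === 403) return 'unauthorized';
    if (status === 404) return 'not_found';
    if (status === 429) return 'rate_limited';
    if (status >= 500) return 'server_error';
  }
  // Node.js network timeouts commonly surface as ETIMEDOUT.
  if (code === 'ETIMEDOUT') return 'timeout';
  // Pass through codes that are already in the allow-list.
  if (typeof code === 'string' && (KNOWN_ERROR_TYPES as readonly string[]).indexOf(code) !== -1) {
    return code as ErrorType;
  }
  return 'unknown';
}
```

With this in place, the catch branch could call `errorCounter.inc({ error_type: normalizeErrorType(error) })`, which caps the label at six possible values.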

Distributed Tracing

OpenTelemetry Setup

typescript
import { trace, SpanStatusCode } from '@opentelemetry/api';

const tracer = trace.getTracer('instantly-client');

async function tracedInstantlyCall<T>(
  operationName: string,
  operation: () => Promise<T>
): Promise<T> {
  return tracer.startActiveSpan(`instantly.${operationName}`, async (span) => {
    try {
      const result = await operation();
      span.setStatus({ code: SpanStatusCode.OK });
      return result;
    } catch (error: any) {
      span.setStatus({ code: SpanStatusCode.ERROR, message: error.message });
      span.recordException(error);
      throw error;
    } finally {
      span.end();
    }
  });
}

Logging Strategy

Structured Logging

typescript
import pino from 'pino';

const logger = pino({
  name: 'instantly',
  level: process.env.LOG_LEVEL || 'info',
});

function logInstantlyOperation(
  operation: string,
  data: Record<string, any>,
  duration: number
) {
  logger.info({
    service: 'instantly',
    operation,
    duration_ms: duration,
    ...data,
  });
}
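
The logging function above expects the caller to measure `duration` itself. A possible companion, sketched here under assumptions, times the operation and emits one record with the same field names whether the call succeeds or fails. The `outcome` field, the `LogRecord` type, and the helper `withOperationLog` are additions for illustration; the log sink is injected so that any logger with an info-style method can be plugged in.

```typescript
// Field names mirror the structured log shown above; `outcome` is an
// illustrative addition and not part of the original record.
type LogRecord = {
  service: string;
  operation: string;
  duration_ms: number;
  outcome: 'success' | 'error';
};

// Runs `run`, measures wall-clock duration, and hands exactly one record
// to `log`, on both the success and the failure path.
async function withOperationLog<T>(
  operation: string,
  run: () => Promise<T>,
  log: (record: LogRecord) => void
): Promise<T> {
  const start = Date.now();
  let outcome: LogRecord['outcome'] = 'success';
  try {
    return await run();
  } catch (error) {
    outcome = 'error';
    throw error; // rethrow so callers still see the failure
  } finally {
    log({
      service: 'instantly',
      operation,
      duration_ms: Date.now() - start,
      outcome,
    });
  }
}
```

Passing `(record) => logger.info(record)` as the third argument would route the record through a pino logger; keeping the sink as a parameter also makes the helper easy to exercise in tests with a plain array.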

Alert Configuration

Prometheus AlertManager Rules

yaml
# instantly_alerts.yaml
groups:
  - name: instantly_alerts
    rules:
      # sum() drops the differing label sets (error_type vs. method/status)
      # so the two sides of the division can match.
      - alert: InstantlyHighErrorRate
        expr: |
          sum(rate(instantly_errors_total[5m]))
            / sum(rate(instantly_requests_total[5m])) > 0.05
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Instantly error rate > 5%"
      # Aggregate buckets by le before computing the quantile.
      - alert: InstantlyHighLatency
        expr: |
          histogram_quantile(0.95,
            sum by (le) (rate(instantly_request_duration_seconds_bucket[5m]))
          ) > 2
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Instantly P95 latency > 2s"
      - alert: InstantlyDown
        expr: up{job="instantly"} == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Instantly integration is down"
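
The latency alert compares a P95 estimate against 2 seconds, and that estimate is interpolated from bucket counts rather than measured directly. The sketch below reimplements the interpolation in simplified form to show how the bucket boundaries shape the result. It is an approximation of what Prometheus's `histogram_quantile` does, not its actual source; `estimateQuantile` and the `Bucket` type are hypothetical names, and out-of-range quantiles simply return NaN here.

```typescript
// One histogram bucket: `le` is the upper bound, `count` is cumulative,
// which matches how Prometheus exposes *_bucket series.
type Bucket = { le: number; count: number };

function estimateQuantile(q: number, buckets: Bucket[]): number {
  if (q < 0 || q > 1) return NaN; // simplification for this sketch
  const sorted = [...buckets].sort((a, b) => a.le - b.le);
  const last = sorted.length - 1;
  // A usable histogram needs at least one finite bucket plus +Inf.
  if (sorted.length < 2 || sorted[last].le !== Infinity) return NaN;

  const total = sorted[last].count;
  if (total === 0) return NaN;

  // Rank of the requested observation among all observations.
  const rank = q * total;
  const i = sorted.findIndex((b) => b.count >= rank);

  // Quantile falls in the +Inf bucket: report the largest finite bound.
  if (i === last) return sorted[last - 1].le;

  const lower = i === 0 ? 0 : sorted[i - 1].le;
  const below = i === 0 ? 0 : sorted[i - 1].count;
  const inBucket = sorted[i].count - below;
  if (inBucket === 0) return lower;

  // Assume observations are spread evenly inside the bucket.
  return lower + (sorted[i].le - lower) * ((rank - below) / inBucket);
}
```

With the bucket boundaries used earlier (0.05 through 5) and 100 observations of which 90 finish within 1 s and 96 within 2.5 s, the P95 estimate lands at about 2.25 s, inside the 1 to 2.5 bucket. Because the result can only be interpolated within a bucket, a threshold of 2 s sits in a fairly wide bucket; adding a boundary near the alert threshold would tighten the estimate.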

Dashboard

Grafana Panel Queries

json
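{
  "panels": [
    {
      "title": "Instantly Request Rate",
      "targets": [{
        "expr": "rate(instantly_requests_total[5m])"
      }]
    },
    {
      "title": "Instantly Latency P50/P95/P99",
      "targets": [
        {
          "expr": "histogram_quantile(0.5, sum by (le) (rate(instantly_request_duration_seconds_bucket[5m])))",
          "legendFormat": "P50"
        },
        {
          "expr": "histogram_quantile(0.95, sum by (le) (rate(instantly_request_duration_seconds_bucket[5m])))",
          "legendFormat": "P95"
        },
        {
          "expr": "histogram_quantile(0.99, sum by (le) (rate(instantly_request_duration_seconds_bucket[5m])))",
          "legendFormat": "P99"
        }
      ]
    }
  ]
}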

Instructions

Step 1: Set Up Metrics Collection

Implement Prometheus counters, histograms, and gauges for key operations.

Step 2: Add Distributed Tracing

Integrate OpenTelemetry for end-to-end request tracing.

Step 3: Configure Structured Logging

Set up JSON logging with consistent field names.

Step 4: Create Alert Rules

Define Prometheus alerting rules for error rates and latency.

Output

  • Metrics collection enabled
  • Distributed tracing configured
  • Structured logging implemented
  • Alert rules deployed

Error Handling

Issue             Cause                Solution
Missing metrics   No instrumentation   Wrap client calls
Trace gaps        Missing propagation  Check context headers
Alert storms      Wrong thresholds     Tune alert rules
High cardinality  Too many labels      Reduce label values
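
For the "Trace gaps" row, checking context headers usually means confirming that a valid W3C Trace Context `traceparent` header reaches each service hop. The sketch below validates that header's shape; it handles only version `00` of the format, and `parseTraceparent` and the `TraceParent` type are hypothetical helper names rather than part of the OpenTelemetry API.

```typescript
type TraceParent = { traceId: string; parentId: string; sampled: boolean };

// Parses a W3C `traceparent` header of the form
//   00-<32 hex trace id>-<16 hex parent id>-<2 hex flags>
// and returns undefined when the header is missing or malformed.
function parseTraceparent(header: string | undefined): TraceParent | undefined {
  if (!header) return undefined;
  const match = /^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$/.exec(header.trim());
  if (!match) return undefined;

  const traceId = match[1];
  const parentId = match[2];
  const flags = match[3];

  // All-zero trace or parent IDs are invalid under the specification.
  if (/^0+$/.test(traceId) || /^0+$/.test(parentId)) return undefined;

  // Bit 0 of the flags byte is the "sampled" flag.
  return { traceId, parentId, sampled: (parseInt(flags, 16) & 1) === 1 };
}
```

Logging the parsed trace ID at each hop (or the fact that parsing returned undefined) makes it easier to spot the exact service where propagation stops.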

Examples

Quick Metrics Endpoint

typescript
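import express from 'express'; // assumes an Express app; adapt for other frameworks

const app = express();

// Serves everything registered on `registry` (defined under Prometheus Metrics).
app.get('/metrics', async (_req, res) => {
  res.set('Content-Type', registry.contentType);
  res.send(await registry.metrics());
});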

Resources

Next Steps

For incident response, see instantly-incident-runbook.