
name: observability
description: Logging, metrics, and tracing patterns for application observability. Use when implementing monitoring, debugging, or production visibility.

Observability Skill

Three Pillars

  1. Logs - Discrete events with context
  2. Metrics - Aggregated measurements over time
  3. Traces - Request flow across services

Structured Logging

Python (structlog)

import logging

import structlog
from structlog.types import Processor

def configure_logging(json_output: bool = True) -> None:
    """Configure structured logging."""
    processors: list[Processor] = [
        structlog.contextvars.merge_contextvars,
        structlog.processors.add_log_level,
        structlog.processors.TimeStamper(fmt="iso"),
        structlog.processors.StackInfoRenderer(),
    ]

    if json_output:
        processors.append(structlog.processors.JSONRenderer())
    else:
        processors.append(structlog.dev.ConsoleRenderer())

    structlog.configure(
        processors=processors,
        wrapper_class=structlog.make_filtering_bound_logger(logging.INFO),
        context_class=dict,
        logger_factory=structlog.PrintLoggerFactory(),
        cache_logger_on_first_use=True,
    )

# Usage
logger = structlog.get_logger()

# Add context that persists across log calls
structlog.contextvars.bind_contextvars(
    request_id="req-123",
    user_id="user-456",
)

logger.info("order_created", order_id="order-789", total=150.00)
# {"event": "order_created", "order_id": "order-789", "total": 150.0, "request_id": "req-123", "user_id": "user-456", "level": "info", "timestamp": "2024-01-15T10:30:00Z"}

logger.error("payment_failed", order_id="order-789", error="insufficient_funds")
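
In a web service, the context binding above usually happens in middleware. A minimal FastAPI sketch (assumptions: the x-request-id header mirrors the Express example below; app is a placeholder):

import uuid

from fastapi import FastAPI, Request

app = FastAPI()

@app.middleware("http")
async def request_context_middleware(request: Request, call_next):
    # Bind per-request context so every log line in this request carries it
    structlog.contextvars.clear_contextvars()
    structlog.contextvars.bind_contextvars(
        request_id=request.headers.get("x-request-id") or str(uuid.uuid4()),
        path=request.url.path,
    )
    return await call_next(request)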

TypeScript (pino)

import pino from 'pino';

const logger = pino({
  level: process.env.LOG_LEVEL || 'info',
  formatters: {
    level: (label) => ({ level: label }),
  },
  timestamp: pino.stdTimeFunctions.isoTime,
  redact: ['password', 'token', 'authorization'],
});

// Create child logger with bound context
const requestLogger = logger.child({
  requestId: 'req-123',
  userId: 'user-456',
});

requestLogger.info({ orderId: 'order-789', total: 150.0 }, 'order_created');
requestLogger.error({ orderId: 'order-789', error: 'insufficient_funds' }, 'payment_failed');

// Express middleware
import { randomUUID } from 'crypto';

const loggingMiddleware = (req, res, next) => {
  const requestId = req.headers['x-request-id'] || randomUUID();

  req.log = logger.child({
    requestId,
    method: req.method,
    path: req.path,
    userAgent: req.headers['user-agent'],
  });

  const startTime = Date.now();

  res.on('finish', () => {
    req.log.info({
      statusCode: res.statusCode,
      durationMs: Date.now() - startTime,
    }, 'request_completed');
  });

  next();
};

Log Levels

Level  When to Use
error  Failures requiring attention
warn   Unexpected but handled situations
info   Business events (order created, user logged in)
debug  Technical details for debugging
trace  Very detailed tracing (rarely used in prod)
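
For example, with the structlog logger configured above (event names are illustrative):

logger.warning("cache_miss", key="user-456", fallback="database")   # handled, but notable
logger.error("payment_failed", order_id="order-789")                # needs attention
logger.debug("query_executed", query="SELECT ...", duration_ms=12)  # debugging detail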

Metrics

Python (prometheus-client)

from prometheus_client import Counter, Histogram, Gauge, start_http_server
import time

# Define metrics
REQUEST_COUNT = Counter(
    'http_requests_total',
    'Total HTTP requests',
    ['method', 'endpoint', 'status']
)

REQUEST_LATENCY = Histogram(
    'http_request_duration_seconds',
    'HTTP request latency',
    ['method', 'endpoint'],
    buckets=[0.01, 0.05, 0.1, 0.5, 1.0, 5.0]
)

ACTIVE_CONNECTIONS = Gauge(
    'active_connections',
    'Number of active connections'
)

ORDERS_PROCESSED = Counter(
    'orders_processed_total',
    'Total orders processed',
    ['status']  # success, failed
)

# Usage
def process_request(method: str, endpoint: str):
    ACTIVE_CONNECTIONS.inc()
    start_time = time.time()

    try:
        # Process request...
        REQUEST_COUNT.labels(method=method, endpoint=endpoint, status='200').inc()
    except Exception:
        REQUEST_COUNT.labels(method=method, endpoint=endpoint, status='500').inc()
        raise
    finally:
        REQUEST_LATENCY.labels(method=method, endpoint=endpoint).observe(
            time.time() - start_time
        )
        ACTIVE_CONNECTIONS.dec()
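
# Business metric usage (illustrative helper): record order outcomes on the
# ORDERS_PROCESSED counter defined above, alongside the technical metrics.
def record_order_outcome(success: bool) -> None:
    ORDERS_PROCESSED.labels(status="success" if success else "failed").inc()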

# FastAPI middleware
from fastapi import FastAPI, Request
from prometheus_client import generate_latest, CONTENT_TYPE_LATEST
from starlette.responses import Response

app = FastAPI()

@app.middleware("http")
async def metrics_middleware(request: Request, call_next):
    start_time = time.time()
    response = await call_next(request)

    REQUEST_COUNT.labels(
        method=request.method,
        endpoint=request.url.path,
        status=response.status_code
    ).inc()

    REQUEST_LATENCY.labels(
        method=request.method,
        endpoint=request.url.path
    ).observe(time.time() - start_time)

    return response

@app.get("/metrics")
async def metrics():
    return Response(generate_latest(), media_type=CONTENT_TYPE_LATEST)

TypeScript (prom-client)

import { Registry, Counter, Histogram, Gauge, collectDefaultMetrics } from 'prom-client';

const register = new Registry();
collectDefaultMetrics({ register });

const httpRequestsTotal = new Counter({
  name: 'http_requests_total',
  help: 'Total HTTP requests',
  labelNames: ['method', 'path', 'status'],
  registers: [register],
});

const httpRequestDuration = new Histogram({
  name: 'http_request_duration_seconds',
  help: 'HTTP request duration',
  labelNames: ['method', 'path'],
  buckets: [0.01, 0.05, 0.1, 0.5, 1, 5],
  registers: [register],
});

// Express middleware
const metricsMiddleware = (req, res, next) => {
  const end = httpRequestDuration.startTimer({ method: req.method, path: req.path });

  res.on('finish', () => {
    httpRequestsTotal.inc({ method: req.method, path: req.path, status: res.statusCode });
    end();
  });

  next();
};

// Metrics endpoint
app.get('/metrics', async (req, res) => {
  res.set('Content-Type', register.contentType);
  res.end(await register.metrics());
});

Key Metrics (RED Method)

Metric    Description
Rate      Requests per second
Errors    Error rate (%)
Duration  Latency (p50, p95, p99)

Key Metrics (USE Method for Resources)

Metric       Description
Utilization  % time resource is busy
Saturation   Queue depth, backlog
Errors       Error count
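
The RED signals fall out of the counters and histograms defined above. A sketch that evaluates them as PromQL via Prometheus' HTTP API (the server address is an assumption):

import requests

PROMETHEUS = "http://localhost:9090/api/v1/query"

RED_QUERIES = {
    # Rate: requests per second over the last 5 minutes
    "rate": 'sum(rate(http_requests_total[5m]))',
    # Errors: fraction of responses that were 5xx
    "errors": 'sum(rate(http_requests_total{status=~"5.."}[5m])) '
              '/ sum(rate(http_requests_total[5m]))',
    # Duration: p95 latency from the histogram buckets
    "duration_p95": 'histogram_quantile(0.95, '
                    'sum(rate(http_request_duration_seconds_bucket[5m])) by (le))',
}

for name, query in RED_QUERIES.items():
    data = requests.get(PROMETHEUS, params={"query": query}).json()
    print(name, data["data"]["result"])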

Distributed Tracing

Python (OpenTelemetry)

from opentelemetry import trace
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.sdk.resources import Resource
from opentelemetry.instrumentation.fastapi import FastAPIInstrumentor
from opentelemetry.instrumentation.sqlalchemy import SQLAlchemyInstrumentor
from opentelemetry.instrumentation.httpx import HTTPXClientInstrumentor

def configure_tracing(service_name: str, otlp_endpoint: str) -> None:
    """Configure OpenTelemetry tracing."""
    resource = Resource.create({"service.name": service_name})

    provider = TracerProvider(resource=resource)
    processor = BatchSpanProcessor(OTLPSpanExporter(endpoint=otlp_endpoint))
    provider.add_span_processor(processor)

    trace.set_tracer_provider(provider)

    # Auto-instrument libraries
    FastAPIInstrumentor().instrument()
    SQLAlchemyInstrumentor().instrument()
    HTTPXClientInstrumentor().instrument()

# Manual instrumentation
tracer = trace.get_tracer(__name__)

async def process_order(order_id: str) -> dict:
    with tracer.start_as_current_span("process_order") as span:
        span.set_attribute("order.id", order_id)

        # Child span for validation
        with tracer.start_as_current_span("validate_order"):
            validated = await validate_order(order_id)

        # Child span for payment
        with tracer.start_as_current_span("process_payment") as payment_span:
            payment_span.set_attribute("payment.method", "card")
            result = await charge_payment(order_id)

        span.set_attribute("order.status", "completed")
        return result
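
To correlate traces with the structured logs above, attach the active trace and span IDs to every event. A sketch of a structlog processor (the trace_id/span_id field names are a common convention, not required by OpenTelemetry):

from opentelemetry import trace as otel_trace

def add_trace_context(logger, method_name, event_dict):
    """structlog processor: stamp the current trace/span IDs onto the event."""
    ctx = otel_trace.get_current_span().get_span_context()
    if ctx.is_valid:
        event_dict["trace_id"] = format(ctx.trace_id, "032x")  # W3C hex form
        event_dict["span_id"] = format(ctx.span_id, "016x")
    return event_dict

# Append add_trace_context to the processors list in configure_logging(),
# before the renderer.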

TypeScript (OpenTelemetry)

import { NodeSDK } from '@opentelemetry/sdk-node';
import { getNodeAutoInstrumentations } from '@opentelemetry/auto-instrumentations-node';
import { OTLPTraceExporter } from '@opentelemetry/exporter-trace-otlp-grpc';
import { Resource } from '@opentelemetry/resources';
import { SemanticResourceAttributes } from '@opentelemetry/semantic-conventions';

const sdk = new NodeSDK({
  resource: new Resource({
    [SemanticResourceAttributes.SERVICE_NAME]: 'my-service',
  }),
  traceExporter: new OTLPTraceExporter({
    url: process.env.OTLP_ENDPOINT,
  }),
  instrumentations: [getNodeAutoInstrumentations()],
});

sdk.start();

// Manual instrumentation
import { trace, SpanStatusCode } from '@opentelemetry/api';

const tracer = trace.getTracer('my-service');

async function processOrder(orderId: string) {
  return tracer.startActiveSpan('process_order', async (span) => {
    try {
      span.setAttribute('order.id', orderId);

      await tracer.startActiveSpan('validate_order', async (validateSpan) => {
        await validateOrder(orderId);
        validateSpan.end();
      });

      const result = await tracer.startActiveSpan('process_payment', async (paymentSpan) => {
        paymentSpan.setAttribute('payment.method', 'card');
        const res = await chargePayment(orderId);
        paymentSpan.end();
        return res;
      });

      span.setStatus({ code: SpanStatusCode.OK });
      return result;
    } catch (error) {
      span.setStatus({ code: SpanStatusCode.ERROR, message: error.message });
      span.recordException(error);
      throw error;
    } finally {
      span.end();
    }
  });
}

Health Checks

import asyncio
from enum import Enum

from fastapi import FastAPI, Response
from pydantic import BaseModel

app = FastAPI()

# `db` and `redis` below are assumed to be already-configured async clients.

class HealthStatus(str, Enum):
    HEALTHY = "healthy"
    DEGRADED = "degraded"
    UNHEALTHY = "unhealthy"

class ComponentHealth(BaseModel):
    name: str
    status: HealthStatus
    message: str | None = None

class HealthResponse(BaseModel):
    status: HealthStatus
    version: str
    components: list[ComponentHealth]

async def check_database() -> ComponentHealth:
    try:
        await db.execute("SELECT 1")
        return ComponentHealth(name="database", status=HealthStatus.HEALTHY)
    except Exception as e:
        return ComponentHealth(name="database", status=HealthStatus.UNHEALTHY, message=str(e))

async def check_redis() -> ComponentHealth:
    try:
        await redis.ping()
        return ComponentHealth(name="redis", status=HealthStatus.HEALTHY)
    except Exception as e:
        return ComponentHealth(name="redis", status=HealthStatus.DEGRADED, message=str(e))

@app.get("/health", response_model=HealthResponse)
async def health_check(response: Response):
    components = await asyncio.gather(
        check_database(),
        check_redis(),
    )

    # Overall status is worst component status
    if any(c.status == HealthStatus.UNHEALTHY for c in components):
        overall = HealthStatus.UNHEALTHY
        response.status_code = 503
    elif any(c.status == HealthStatus.DEGRADED for c in components):
        overall = HealthStatus.DEGRADED
    else:
        overall = HealthStatus.HEALTHY

    return HealthResponse(
        status=overall,
        version="1.0.0",
        components=components,
    )

@app.get("/ready")
async def readiness_check():
    """Kubernetes readiness probe - can we serve traffic?"""
    # Check critical dependencies
    await check_database()
    return {"status": "ready"}

@app.get("/live")
async def liveness_check():
    """Kubernetes liveness probe - is the process healthy?"""
    return {"status": "alive"}

Alerting Rules

# prometheus-rules.yaml
groups:
  - name: application
    rules:
      # High error rate
      - alert: HighErrorRate
        expr: |
          sum(rate(http_requests_total{status=~"5.."}[5m]))
          /
          sum(rate(http_requests_total[5m])) > 0.05
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "High error rate detected"
          description: "Error rate is {{ $value | humanizePercentage }}"

      # High latency
      - alert: HighLatency
        expr: |
          histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m])) > 1
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High latency detected"
          description: "p95 latency is {{ $value }}s"

      # Service down
      - alert: ServiceDown
        expr: up == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Service is down"

Best Practices

Logging

  • Use structured JSON logs
  • Include correlation/request IDs
  • Redact sensitive data
  • Use appropriate log levels
  • Don't log in hot paths (use sampling; see the sketch after this list)
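
A minimal sampling sketch for hot paths (the rate and event name are illustrative):

import random

SAMPLE_RATE = 0.01  # keep roughly 1% of hot-path debug events

def sampled_debug(event: str, **kwargs) -> None:
    if random.random() < SAMPLE_RATE:
        logger.debug(event, sampled=True, **kwargs)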

Metrics

  • Use consistent naming conventions
  • Keep cardinality under control (see the sketch after this list)
  • Use histograms for latency (not averages)
  • Export business metrics alongside technical ones
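
A common cardinality trap is labeling metrics by raw URL path, as the middleware examples above do for simplicity. A sketch that collapses variable segments instead (assumes numeric and UUID IDs are the main offenders; your framework's route template is more robust where available):

import re

UUID_SEGMENT = r"/[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}"

def normalize_path(path: str) -> str:
    """Collapse IDs into placeholders so the label set stays bounded."""
    path = re.sub(UUID_SEGMENT, "/{uuid}", path)  # UUIDs first, then plain numbers
    path = re.sub(r"/\d+", "/{id}", path)
    return path

# normalize_path("/orders/12345") -> "/orders/{id}"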

Tracing

  • Instrument at service boundaries
  • Propagate context across services
  • Sample appropriately in production
  • Add relevant attributes to spans