
name: observability
description: Logging, metrics, and tracing patterns for application observability. Use when implementing monitoring, debugging, or production visibility.

Observability Skill

Three Pillars

  1. Logs - Discrete events with context
  2. Metrics - Aggregated measurements over time
  3. Traces - Request flow across services

Structured Logging

Python (structlog)

import logging

import structlog
from structlog.types import Processor

def configure_logging(json_output: bool = True) -> None:
    """Configure structured logging."""
    processors: list[Processor] = [
        structlog.contextvars.merge_contextvars,
        structlog.processors.add_log_level,
        structlog.processors.TimeStamper(fmt="iso"),
        structlog.processors.StackInfoRenderer(),
    ]

    if json_output:
        processors.append(structlog.processors.JSONRenderer())
    else:
        processors.append(structlog.dev.ConsoleRenderer())

    structlog.configure(
        processors=processors,
        wrapper_class=structlog.make_filtering_bound_logger(logging.INFO),
        context_class=dict,
        logger_factory=structlog.PrintLoggerFactory(),
        cache_logger_on_first_use=True,
    )

# Usage
logger = structlog.get_logger()

# Add context that persists across log calls
structlog.contextvars.bind_contextvars(
    request_id="req-123",
    user_id="user-456",
)

logger.info("order_created", order_id="order-789", total=150.00)
# {"event": "order_created", "order_id": "order-789", "total": 150.0, "request_id": "req-123", "user_id": "user-456", "level": "info", "timestamp": "2024-01-15T10:30:00Z"}

logger.error("payment_failed", order_id="order-789", error="insufficient_funds")
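
In a web service, the context binding above usually happens in middleware. A minimal FastAPI sketch (assumptions: the x-request-id header mirrors the Express example below; app is a placeholder):

import uuid

from fastapi import FastAPI, Request

app = FastAPI()

@app.middleware("http")
async def request_context_middleware(request: Request, call_next):
    # Bind per-request context so every log line in this request carries it
    structlog.contextvars.clear_contextvars()
    structlog.contextvars.bind_contextvars(
        request_id=request.headers.get("x-request-id") or str(uuid.uuid4()),
        path=request.url.path,
    )
    return await call_next(request)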

TypeScript (pino)

import pino from 'pino';

const logger = pino({
  level: process.env.LOG_LEVEL || 'info',
  formatters: {
    level: (label) => ({ level: label }),
  },
  timestamp: pino.stdTimeFunctions.isoTime,
  redact: ['password', 'token', 'authorization'],
});

// Create child logger with bound context
const requestLogger = logger.child({
  requestId: 'req-123',
  userId: 'user-456',
});

requestLogger.info({ orderId: 'order-789', total: 150.0 }, 'order_created');
requestLogger.error({ orderId: 'order-789', error: 'insufficient_funds' }, 'payment_failed');

// Express middleware
import { randomUUID } from 'crypto';

const loggingMiddleware = (req, res, next) => {
  const requestId = req.headers['x-request-id'] || randomUUID();

  req.log = logger.child({
    requestId,
    method: req.method,
    path: req.path,
    userAgent: req.headers['user-agent'],
  });

  const startTime = Date.now();

  res.on('finish', () => {
    req.log.info({
      statusCode: res.statusCode,
      durationMs: Date.now() - startTime,
    }, 'request_completed');
  });

  next();
};

Log Levels

Level  When to Use
error  Failures requiring attention
warn   Unexpected but handled situations
info   Business events (order created, user logged in)
debug  Technical details for debugging
trace  Very detailed tracing (rarely used in prod)
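
For example, with the structlog logger configured above (event names are illustrative):

logger.warning("cache_miss", key="user-456", fallback="database")   # handled, but notable
logger.error("payment_failed", order_id="order-789")                # needs attention
logger.debug("query_executed", query="SELECT ...", duration_ms=12)  # debugging detail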

Metrics

Python (prometheus-client)

from prometheus_client import Counter, Histogram, Gauge, start_http_server
import time

# Define metrics
REQUEST_COUNT = Counter(
    'http_requests_total',
    'Total HTTP requests',
    ['method', 'endpoint', 'status']
)

REQUEST_LATENCY = Histogram(
    'http_request_duration_seconds',
    'HTTP request latency',
    ['method', 'endpoint'],
    buckets=[0.01, 0.05, 0.1, 0.5, 1.0, 5.0]
)

ACTIVE_CONNECTIONS = Gauge(
    'active_connections',
    'Number of active connections'
)

ORDERS_PROCESSED = Counter(
    'orders_processed_total',
    'Total orders processed',
    ['status']  # success, failed
)

# Usage
def process_request(method: str, endpoint: str):
    ACTIVE_CONNECTIONS.inc()
    start_time = time.time()

    try:
        # Process request...
        REQUEST_COUNT.labels(method=method, endpoint=endpoint, status='200').inc()
    except Exception:
        REQUEST_COUNT.labels(method=method, endpoint=endpoint, status='500').inc()
        raise
    finally:
        REQUEST_LATENCY.labels(method=method, endpoint=endpoint).observe(
            time.time() - start_time
        )
        ACTIVE_CONNECTIONS.dec()
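
# Business metric usage (illustrative helper): record order outcomes on the
# ORDERS_PROCESSED counter defined above, alongside the technical metrics.
def record_order_outcome(success: bool) -> None:
    ORDERS_PROCESSED.labels(status="success" if success else "failed").inc()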

# FastAPI middleware
from fastapi import FastAPI, Request
from prometheus_client import generate_latest, CONTENT_TYPE_LATEST
from starlette.responses import Response

app = FastAPI()

@app.middleware("http")
async def metrics_middleware(request: Request, call_next):
    start_time = time.time()
    response = await call_next(request)

    REQUEST_COUNT.labels(
        method=request.method,
        endpoint=request.url.path,
        status=response.status_code
    ).inc()

    REQUEST_LATENCY.labels(
        method=request.method,
        endpoint=request.url.path
    ).observe(time.time() - start_time)

    return response

@app.get("/metrics")
async def metrics():
    return Response(generate_latest(), media_type=CONTENT_TYPE_LATEST)

TypeScript (prom-client)

import { Registry, Counter, Histogram, Gauge, collectDefaultMetrics } from 'prom-client';

const register = new Registry();
collectDefaultMetrics({ register });

const httpRequestsTotal = new Counter({
  name: 'http_requests_total',
  help: 'Total HTTP requests',
  labelNames: ['method', 'path', 'status'],
  registers: [register],
});

const httpRequestDuration = new Histogram({
  name: 'http_request_duration_seconds',
  help: 'HTTP request duration',
  labelNames: ['method', 'path'],
  buckets: [0.01, 0.05, 0.1, 0.5, 1, 5],
  registers: [register],
});

// Express middleware
const metricsMiddleware = (req, res, next) => {
  const end = httpRequestDuration.startTimer({ method: req.method, path: req.path });

  res.on('finish', () => {
    httpRequestsTotal.inc({ method: req.method, path: req.path, status: res.statusCode });
    end();
  });

  next();
};

// Metrics endpoint
app.get('/metrics', async (req, res) => {
  res.set('Content-Type', register.contentType);
  res.end(await register.metrics());
});

Key Metrics (RED Method)

Metric    Description
Rate      Requests per second
Errors    Error rate (%)
Duration  Latency (p50, p95, p99)

Key Metrics (USE Method for Resources)

Metric       Description
Utilization  % time resource is busy
Saturation   Queue depth, backlog
Errors       Error count
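
The RED signals fall out of the counters and histograms defined above. A sketch that evaluates them as PromQL via Prometheus' HTTP API (the server address is an assumption):

import requests

PROMETHEUS = "http://localhost:9090/api/v1/query"

RED_QUERIES = {
    # Rate: requests per second over the last 5 minutes
    "rate": 'sum(rate(http_requests_total[5m]))',
    # Errors: fraction of responses that were 5xx
    "errors": 'sum(rate(http_requests_total{status=~"5.."}[5m])) '
              '/ sum(rate(http_requests_total[5m]))',
    # Duration: p95 latency from the histogram buckets
    "duration_p95": 'histogram_quantile(0.95, '
                    'sum(rate(http_request_duration_seconds_bucket[5m])) by (le))',
}

for name, query in RED_QUERIES.items():
    data = requests.get(PROMETHEUS, params={"query": query}).json()
    print(name, data["data"]["result"])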

Distributed Tracing

Python (OpenTelemetry)

from opentelemetry import trace
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.sdk.resources import Resource
from opentelemetry.instrumentation.fastapi import FastAPIInstrumentor
from opentelemetry.instrumentation.sqlalchemy import SQLAlchemyInstrumentor
from opentelemetry.instrumentation.httpx import HTTPXClientInstrumentor

def configure_tracing(service_name: str, otlp_endpoint: str) -> None:
    """Configure OpenTelemetry tracing."""
    resource = Resource.create({"service.name": service_name})

    provider = TracerProvider(resource=resource)
    processor = BatchSpanProcessor(OTLPSpanExporter(endpoint=otlp_endpoint))
    provider.add_span_processor(processor)

    trace.set_tracer_provider(provider)

    # Auto-instrument libraries
    FastAPIInstrumentor().instrument()
    SQLAlchemyInstrumentor().instrument()
    HTTPXClientInstrumentor().instrument()

# Manual instrumentation
tracer = trace.get_tracer(__name__)

async def process_order(order_id: str) -> dict:
    with tracer.start_as_current_span("process_order") as span:
        span.set_attribute("order.id", order_id)

        # Child span for validation
        with tracer.start_as_current_span("validate_order"):
            validated = await validate_order(order_id)

        # Child span for payment
        with tracer.start_as_current_span("process_payment") as payment_span:
            payment_span.set_attribute("payment.method", "card")
            result = await charge_payment(order_id)

        span.set_attribute("order.status", "completed")
        return result
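
To correlate traces with the structured logs above, attach the active trace and span IDs to every event. A sketch of a structlog processor (the trace_id/span_id field names are a common convention, not required by OpenTelemetry):

from opentelemetry import trace as otel_trace

def add_trace_context(logger, method_name, event_dict):
    """structlog processor: stamp the current trace/span IDs onto the event."""
    ctx = otel_trace.get_current_span().get_span_context()
    if ctx.is_valid:
        event_dict["trace_id"] = format(ctx.trace_id, "032x")  # W3C hex form
        event_dict["span_id"] = format(ctx.span_id, "016x")
    return event_dict

# Append add_trace_context to the processors list in configure_logging(),
# before the renderer.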

TypeScript (OpenTelemetry)

import { NodeSDK } from '@opentelemetry/sdk-node';
import { getNodeAutoInstrumentations } from '@opentelemetry/auto-instrumentations-node';
import { OTLPTraceExporter } from '@opentelemetry/exporter-trace-otlp-grpc';
import { Resource } from '@opentelemetry/resources';
import { SemanticResourceAttributes } from '@opentelemetry/semantic-conventions';

const sdk = new NodeSDK({
  resource: new Resource({
    [SemanticResourceAttributes.SERVICE_NAME]: 'my-service',
  }),
  traceExporter: new OTLPTraceExporter({
    url: process.env.OTLP_ENDPOINT,
  }),
  instrumentations: [getNodeAutoInstrumentations()],
});

sdk.start();

// Manual instrumentation
import { trace, SpanStatusCode } from '@opentelemetry/api';

const tracer = trace.getTracer('my-service');

async function processOrder(orderId: string) {
  return tracer.startActiveSpan('process_order', async (span) => {
    try {
      span.setAttribute('order.id', orderId);

      await tracer.startActiveSpan('validate_order', async (validateSpan) => {
        await validateOrder(orderId);
        validateSpan.end();
      });

      const result = await tracer.startActiveSpan('process_payment', async (paymentSpan) => {
        paymentSpan.setAttribute('payment.method', 'card');
        const res = await chargePayment(orderId);
        paymentSpan.end();
        return res;
      });

      span.setStatus({ code: SpanStatusCode.OK });
      return result;
    } catch (error) {
      span.setStatus({ code: SpanStatusCode.ERROR, message: error.message });
      span.recordException(error);
      throw error;
    } finally {
      span.end();
    }
  });
}

Health Checks

import asyncio
from enum import Enum

from fastapi import FastAPI, Response
from pydantic import BaseModel

app = FastAPI()

# `db` and `redis` below are assumed to be already-configured async clients.

class HealthStatus(str, Enum):
    HEALTHY = "healthy"
    DEGRADED = "degraded"
    UNHEALTHY = "unhealthy"

class ComponentHealth(BaseModel):
    name: str
    status: HealthStatus
    message: str | None = None

class HealthResponse(BaseModel):
    status: HealthStatus
    version: str
    components: list[ComponentHealth]

async def check_database() -> ComponentHealth:
    try:
        await db.execute("SELECT 1")
        return ComponentHealth(name="database", status=HealthStatus.HEALTHY)
    except Exception as e:
        return ComponentHealth(name="database", status=HealthStatus.UNHEALTHY, message=str(e))

async def check_redis() -> ComponentHealth:
    try:
        await redis.ping()
        return ComponentHealth(name="redis", status=HealthStatus.HEALTHY)
    except Exception as e:
        return ComponentHealth(name="redis", status=HealthStatus.DEGRADED, message=str(e))

@app.get("/health", response_model=HealthResponse)
async def health_check(response: Response):
    components = await asyncio.gather(
        check_database(),
        check_redis(),
    )

    # Overall status is worst component status
    if any(c.status == HealthStatus.UNHEALTHY for c in components):
        overall = HealthStatus.UNHEALTHY
        response.status_code = 503
    elif any(c.status == HealthStatus.DEGRADED for c in components):
        overall = HealthStatus.DEGRADED
    else:
        overall = HealthStatus.HEALTHY

    return HealthResponse(
        status=overall,
        version="1.0.0",
        components=components,
    )

@app.get("/ready")
async def readiness_check():
    """Kubernetes readiness probe - can we serve traffic?"""
    # Check critical dependencies
    await check_database()
    return {"status": "ready"}

@app.get("/live")
async def liveness_check():
    """Kubernetes liveness probe - is the process healthy?"""
    return {"status": "alive"}

Alerting Rules

# prometheus-rules.yaml
groups:
  - name: application
    rules:
      # High error rate
      - alert: HighErrorRate
        expr: |
          sum(rate(http_requests_total{status=~"5.."}[5m]))
          /
          sum(rate(http_requests_total[5m])) > 0.05
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "High error rate detected"
          description: "Error rate is {{ $value | humanizePercentage }}"

      # High latency
      - alert: HighLatency
        expr: |
          histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m])) > 1
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High latency detected"
          description: "p95 latency is {{ $value }}s"

      # Service down
      - alert: ServiceDown
        expr: up == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Service is down"

Best Practices

Logging

  • Use structured JSON logs
  • Include correlation/request IDs
  • Redact sensitive data
  • Use appropriate log levels
  • Don't log in hot paths (use sampling; see the sketch after this list)
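
A minimal sampling sketch for hot paths (the rate and event name are illustrative):

import random

SAMPLE_RATE = 0.01  # keep roughly 1% of hot-path debug events

def sampled_debug(event: str, **kwargs) -> None:
    if random.random() < SAMPLE_RATE:
        logger.debug(event, sampled=True, **kwargs)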

Metrics

  • Use consistent naming conventions
  • Keep cardinality under control (see the sketch after this list)
  • Use histograms for latency (not averages)
  • Export business metrics alongside technical ones
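
A common cardinality trap is labeling metrics by raw URL path, as the middleware examples above do for simplicity. A sketch that collapses variable segments instead (assumes numeric and UUID IDs are the main offenders; your framework's route template is more robust where available):

import re

UUID_SEGMENT = r"/[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}"

def normalize_path(path: str) -> str:
    """Collapse IDs into placeholders so the label set stays bounded."""
    path = re.sub(UUID_SEGMENT, "/{uuid}", path)  # UUIDs first, then plain numbers
    path = re.sub(r"/\d+", "/{id}", path)
    return path

# normalize_path("/orders/12345") -> "/orders/{id}"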

Tracing

  • Instrument at service boundaries
  • Propagate context across services
  • Sample appropriately in production
  • Add relevant attributes to spans