Observability Fundamentals & Tracing Concepts
- The three pillars: Logs, metrics, traces — what each delivers and where each falls short
- Trace anatomy: Spans, trace IDs, parent-child relationships, span attributes and events
- W3C Trace Context: How context is propagated across service boundaries
- Why grep isn't enough: live demo of a distributed failure without and with traces
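As a small illustration of trace IDs, parent span IDs, and propagation, the sketch below parses a `traceparent` header in the four-field layout defined by W3C Trace Context version 00. The names `SpanContext` and `parse_traceparent` are hypothetical helpers for this example, not part of any library; real services would normally rely on the propagators shipped with their tracing SDK.

```python
import re
from typing import NamedTuple, Optional

# Layout of the header value (version 00):
#   version "-" trace-id "-" parent-id "-" trace-flags
# This sketch only accepts that four-field layout.
_TRACEPARENT = re.compile(
    r"([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})"
)


class SpanContext(NamedTuple):
    """Hypothetical container for the propagated identifiers."""
    trace_id: str   # shared by every span of one request
    span_id: str    # the caller's span, which becomes the parent downstream
    sampled: bool   # lowest bit of trace-flags


def parse_traceparent(header: str) -> Optional[SpanContext]:
    """Return the propagated context, or None if the header is invalid."""
    match = _TRACEPARENT.fullmatch(header.strip())
    if match is None:
        return None
    version, trace_id, span_id, flags = match.groups()
    # Version ff and all-zero IDs are invalid.
    if version == "ff" or set(trace_id) == {"0"} or set(span_id) == {"0"}:
        return None
    return SpanContext(trace_id, span_id, bool(int(flags, 16) & 0x01))


incoming = "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01"
context = parse_traceparent(incoming)
```

A receiving service would start its own span with `context.trace_id` as the trace ID and `context.span_id` as the parent, which is what links the two sides into one trace.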
OpenTelemetry SDK & Instrumentation
- Auto-instrumentation: Java agent for Spring Boot, plus the equivalent auto-instrumentation for Node.js and Python, with zero code changes
- Manual spans: Making business logic traceable — which spans are actually valuable
- Attributes & events: Enriching spans with relevant context (user.id, order.value, error.type)
- Context propagation: HTTP headers, gRPC metadata, Kafka message headers
- Hands-on: instrumenting a Spring Boot microservice from scratch
OTel Collector, Sampling & Backends
- OTel Collector: Architecture, pipeline configuration (receivers → processors → exporters)
- Tail-based sampling: Keep all errors, sample healthy requests — configuration and trade-offs
- Attribute and batch processors: Filtering sensitive data, adding labels, optimizing batching
- Backend integration: Exporting to Grafana Tempo, Jaeger, Datadog, Honeycomb via OTLP
- Latency dashboard in Grafana: combining traces and metrics, p99 alerting
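The tail-based sampling rule named above (keep every trace that contains an error, sample the healthy ones) can be sketched as a plain decision function. This is an illustration of the logic only, not Collector configuration; `SpanRecord`, `keep_trace`, and the `healthy_ratio` parameter are hypothetical names for this example.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class SpanRecord:
    trace_id: str   # 32 lowercase hex characters
    name: str
    is_error: bool


def keep_trace(spans: List[SpanRecord], healthy_ratio: float = 0.1) -> bool:
    """Decide whether to keep a trace once all of its spans have arrived."""
    if not spans:
        return False
    # Rule 1: any error anywhere in the trace keeps the whole trace.
    if any(span.is_error for span in spans):
        return True
    # Rule 2: healthy traces are sampled. Deriving the decision from the
    # trace ID makes it deterministic for a given trace.
    bucket = int(spans[0].trace_id[-8:], 16) / 0x100000000  # in [0, 1)
    return bucket < healthy_ratio


trace_id = "4bf92f3577b34da6a3ce929d0e0e4736"
failed = [
    SpanRecord(trace_id, "GET /cart", False),
    SpanRecord(trace_id, "SELECT items", True),
]
healthy = [SpanRecord(trace_id, "GET /cart", False)]
```

The trade-off is visible in the signature: the function needs the complete list of spans, so whatever runs it has to buffer each trace until it is finished, which costs memory and delays export compared with a head-based decision made on the first span.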
Workshop: Diagnosing a production problem
- Provided microservice application with a built-in latency problem
- Reading and interpreting traces: waterfall view, critical path, span gaps
- Locating an N+1 query, a slow external call, and a race condition using traces alone
- Setting up an alert rule on trace-based metrics: error rate and latency SLOs
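An N+1 query shows up in a trace as many sibling database spans with the same statement under one parent. A minimal sketch of that check, assuming spans have already been exported into simple records; `SpanRecord`, `find_n_plus_one`, the `db_statement` field, and the threshold of 5 are all assumptions made for this example.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple


@dataclass
class SpanRecord:
    span_id: str
    parent_id: Optional[str]
    name: str
    db_statement: Optional[str] = None  # parameterized SQL text, if any


def find_n_plus_one(
    spans: List[SpanRecord], threshold: int = 5
) -> Dict[Tuple[str, str], int]:
    """Return (parent_id, statement) pairs that repeat at least `threshold` times."""
    counts = Counter(
        (span.parent_id, span.db_statement)
        for span in spans
        if span.parent_id is not None and span.db_statement is not None
    )
    return {key: n for key, n in counts.items() if n >= threshold}


# One request handler that loads an order list, then queries items per order.
spans = [SpanRecord("a", None, "GET /orders")]
spans.append(SpanRecord("b", "a", "db", "SELECT * FROM orders WHERE user_id = ?"))
for i in range(10):
    spans.append(
        SpanRecord(f"c{i}", "a", "db", "SELECT * FROM items WHERE order_id = ?")
    )

suspects = find_n_plus_one(spans)
```

Here `suspects` contains a single entry: the items query, repeated 10 times under span `a`. The check depends on statements being recorded in parameterized form, so that each repetition has identical text.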
Building observability sounds like an infrastructure concern, but in practice it changes how teams think about production systems. After this day, you will be able to:
- Instrument services with OpenTelemetry without any vendor lock-in
- Read distributed traces and name the slow or failing service precisely
- Configure the OTel Collector for sampling, routing, and multi-backend export
- Build Grafana dashboards that combine traces and metrics in one view
- Diagnose production problems that hide behind logs — in minutes instead of hours
For teams who want a 2-day version: day one can be expanded with Prometheus, Grafana, and log aggregation with Loki, making a complete observability stack training in two focused days.
Book the observability training
Whether it's a focused OpenTelemetry introduction or a full observability stack workshop: we start with a brief conversation about your setup, then follow up with a concrete proposal.
Get in touch