Picture this: a user reports that checkout is slow. You have logs from the API gateway, the order service, the inventory service, the payment service, and the notification service. All of them look fine individually. The latency is hiding somewhere in the handoffs — and you have no way to see it.
This is the wall that logs and metrics alone hit in microservice architectures. Each signal tells you what happened inside one service. None of them tell you what happened across the whole request. That's the gap distributed tracing fills, and it's why observability in 2025 means all three signals working together — not just structured logs and dashboards.
The Three Pillars — and Why Traces Are the Missing One
Logs: What happened inside a single process at a point in time. Excellent for debugging known error classes. Blind to cross-service causality.
Metrics: Aggregated system health over time. Tells you that something is slow or failing. Doesn't tell you where in a distributed call chain.
Traces: The full journey of a request across all services, with timing for each hop. Connects the other two signals to a causal story.
In a monolith, a stack trace gives you the full picture. In a distributed system, the equivalent is a trace — a structured record of every service, database call, and queue interaction that contributed to a single user request.
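To make "a structured record" concrete, here is a minimal sketch of a trace as a list of spans linked by parent IDs. The `Span` class, the service names, and the timings are all hypothetical, and the self-time calculation assumes that a span's children run one after another. It is not the OTel data model, only an illustration of why the parent links matter.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Span:
    span_id: str
    parent_id: Optional[str]  # None marks the root span
    service: str
    name: str
    start_ms: int
    end_ms: int

    @property
    def duration_ms(self) -> int:
        return self.end_ms - self.start_ms


def self_times(spans):
    """Map span_id -> time spent in the span itself, excluding direct children.

    Assumes the children of a span run sequentially (no overlap).
    """
    child_total = {}
    for s in spans:
        if s.parent_id is not None:
            child_total[s.parent_id] = child_total.get(s.parent_id, 0) + s.duration_ms
    return {s.span_id: s.duration_ms - child_total.get(s.span_id, 0) for s in spans}


# Hypothetical checkout trace with made-up timings.
TRACE = [
    Span("a", None, "api-gateway", "POST /checkout", 0, 900),
    Span("b", "a", "order-service", "create_order", 20, 880),
    Span("c", "b", "inventory-service", "reserve_stock", 40, 700),
    Span("d", "b", "payment-service", "charge_card", 710, 860),
]

times = self_times(TRACE)
slowest = max(TRACE, key=lambda s: times[s.span_id])
print(f"{slowest.service}/{slowest.name}: {times[slowest.span_id]} ms of self time")
```

The gateway span lasts 900 ms, but almost all of that is time spent waiting on children. Only the parent links let you attribute the latency to the inventory call.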
What OpenTelemetry Actually Is
OpenTelemetry (OTel) is a CNCF project that provides vendor-neutral APIs, SDKs, and instrumentation for capturing traces, metrics, and logs. Before OTel, every observability vendor had its own SDK — switching backends meant rewriting all your instrumentation. OTel solves this with a standard data model and wire protocol (OTLP) that every major backend now speaks: Jaeger, Grafana Tempo, Honeycomb, Datadog, Dynatrace, and more.
The practical implication: instrument once, route anywhere. You're not betting on a vendor when you adopt OTel — you're adopting the standard.
A trace is held together by a trace-id that is carried across service boundaries via HTTP headers or message metadata.
How Context Propagation Works
The mechanism that makes distributed tracing possible is context propagation: when service A calls service B, it injects the current trace context into the request headers (the W3C Trace Context standard defines the traceparent header for this). Service B extracts that context, creates a child span whose parent is A's span, and continues the trace. The result is a single trace ID that links every service involved in handling one request.
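As an illustration of what travels in that header, the sketch below builds and parses a version-00 traceparent value by hand. The function names `inject` and `extract` are hypothetical helpers written for this example, not the OTel propagator API, and the sketch ignores the companion tracestate header.

```python
import re
import secrets

# version "00", 32-hex trace-id, 16-hex parent span id, 2-hex trace flags
TRACEPARENT_RE = re.compile(r"00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})")


def inject(headers: dict, trace_id: str, span_id: str, sampled: bool = True) -> None:
    """Caller side: write the current trace context into outgoing headers."""
    flags = "01" if sampled else "00"
    headers["traceparent"] = f"00-{trace_id}-{span_id}-{flags}"


def extract(headers: dict):
    """Callee side: read the caller's context, or None if absent or invalid."""
    match = TRACEPARENT_RE.fullmatch(headers.get("traceparent", ""))
    if match is None:
        return None
    trace_id, parent_span_id, flags = match.groups()
    if trace_id == "0" * 32 or parent_span_id == "0" * 16:
        return None  # all-zero IDs are invalid
    return {
        "trace_id": trace_id,
        "parent_span_id": parent_span_id,
        "sampled": bool(int(flags, 16) & 0x01),
    }


# Service A starts a trace and calls service B.
trace_id = secrets.token_hex(16)
span_a = secrets.token_hex(8)
outgoing = {}
inject(outgoing, trace_id, span_a)

# Service B continues the same trace with a new child span.
ctx = extract(outgoing)
span_b = {
    "trace_id": ctx["trace_id"],
    "span_id": secrets.token_hex(8),
    "parent_span_id": ctx["parent_span_id"],
}
print(outgoing["traceparent"])
```

Service B never generates a trace ID of its own. It reuses the one it received and records A's span ID as its parent, which is what lets a backend reassemble the tree.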
With OTel, this propagation is handled automatically by the HTTP client and server instrumentation — you don't write it by hand. For asynchronous systems like Kafka, you inject the context into message headers when producing and extract it when consuming, maintaining the causal link even across queue boundaries.
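The asynchronous case can be sketched the same way. Here a `queue.Queue` stands in for a Kafka topic, and message headers are modeled as a list of (key, bytes) pairs. The `produce` and `consume` functions are hypothetical and do not correspond to any Kafka client API.

```python
import queue
import secrets

broker = queue.Queue()  # stand-in for a Kafka topic, for illustration only


def produce(payload: bytes, trace_id: str, span_id: str) -> None:
    """Producer side: attach the trace context as a message header."""
    traceparent = f"00-{trace_id}-{span_id}-01"
    headers = [("traceparent", traceparent.encode("ascii"))]
    broker.put({"value": payload, "headers": headers})


def consume() -> dict:
    """Consumer side: read the header back and start a child span."""
    message = broker.get()
    carrier = {key: value.decode("ascii") for key, value in message["headers"]}
    _version, trace_id, parent_span_id, _flags = carrier["traceparent"].split("-")
    return {
        "trace_id": trace_id,
        "span_id": secrets.token_hex(8),
        "parent_span_id": parent_span_id,
        "payload": message["value"],
    }


producer_trace_id = secrets.token_hex(16)
producer_span_id = secrets.token_hex(8)
produce(b"order-created", producer_trace_id, producer_span_id)

consumer_span = consume()
print(consumer_span["trace_id"] == producer_trace_id)
```

The consumer may run seconds or minutes after the producer, on a different host. Because the context rides with the message rather than with a live connection, the causal link survives the delay.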
Instrumentation in Practice
Auto-instrumentation vs manual spans
OTel SDKs ship with auto-instrumentation for most common frameworks — Spring Boot, Express, FastAPI, Django, gRPC, JDBC, and many more. Auto-instrumentation handles the framework boundaries (HTTP requests, DB calls) without any code changes. You add it as a Java agent or SDK import and get immediate baseline traces.
Manual instrumentation is for business logic that matters for debugging: "which pricing rule fired?", "how long did the inventory reservation hold the lock?", "which customer segment triggered the slow path?". These are the spans that turn a generic slow trace into a useful one.
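A manual span is essentially a timed, named block that knows its parent. The sketch below imitates that with the standard library only. The `span` helper, the `FINISHED` list, and the pricing logic are hypothetical. In the OTel Python API the corresponding calls are `tracer.start_as_current_span(...)` and `span.set_attribute(...)`.

```python
import contextvars
import time
from contextlib import contextmanager

_current = contextvars.ContextVar("current_span", default=None)
FINISHED = []  # stand-in for an exporter


@contextmanager
def span(name, attributes=None):
    """Time a block of code and record it as a child of the active span."""
    parent = _current.get()
    record = {
        "name": name,
        "parent": parent["name"] if parent else None,
        "attributes": dict(attributes or {}),
    }
    token = _current.set(record)
    start = time.perf_counter()
    try:
        yield record
    except Exception as exc:
        record["attributes"]["error.type"] = type(exc).__name__
        raise
    finally:
        record["duration_ms"] = (time.perf_counter() - start) * 1000
        _current.reset(token)
        FINISHED.append(record)


def price_order(order_total: float) -> float:
    # Business-level span: records which (hypothetical) pricing rule fired.
    with span("price_order", {"order.value": order_total}) as current:
        rule = "bulk_discount" if order_total > 100 else "standard"
        current["attributes"]["pricing.rule"] = rule
        return order_total * (0.9 if rule == "bulk_discount" else 1.0)


# The outer span is what auto-instrumentation would normally create.
with span("POST /checkout", {"http.route": "/checkout"}):
    total = price_order(250.0)
```

The inner span answers "which pricing rule fired?" directly from the trace, without a matching log line to hunt for.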
Attributes are everything
A span without attributes is just a box on a timeline. Useful spans include the data that lets you filter, group, and correlate: user.id, order.value, db.table, http.route, error.type. The OTel semantic conventions define standard attribute names for common operations — use them, so your tooling and dashboards work out of the box.
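To show what "filter, group, and correlate" buys you, here is a small sketch that groups span durations by an attribute. The span data is made up, and `customer.segment` is a hypothetical application-specific attribute rather than one defined by the semantic conventions.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical finished spans for the same operation.
SPANS = [
    {"name": "reserve_stock", "duration_ms": 820, "attributes": {"customer.segment": "enterprise"}},
    {"name": "reserve_stock", "duration_ms": 120, "attributes": {"customer.segment": "retail"}},
    {"name": "reserve_stock", "duration_ms": 780, "attributes": {"customer.segment": "enterprise"}},
    {"name": "reserve_stock", "duration_ms": 80, "attributes": {"customer.segment": "retail"}},
    {"name": "reserve_stock", "duration_ms": 400, "attributes": {}},  # no attributes
]


def mean_duration_by(spans, key):
    """Average duration per attribute value; spans lacking the key are skipped."""
    groups = defaultdict(list)
    for s in spans:
        if key in s["attributes"]:
            groups[s["attributes"][key]].append(s["duration_ms"])
    return {value: mean(durations) for value, durations in groups.items()}


by_segment = mean_duration_by(SPANS, "customer.segment")
print(by_segment)
```

The last span has a duration but no attributes, so it cannot contribute to the breakdown. That is the "box on a timeline" problem in miniature.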
Sampling: The Problem Nobody Plans For
A high-traffic service might handle tens of thousands of requests per second. Capturing a full trace for every request is neither practical nor useful — most traces look identical. Sampling is how you decide which traces to keep.
- Head-based sampling: The decision is made at the start of the trace, before the outcome is known. It is simple, but a 10% sample rate also discards roughly 90% of your error traces, which hurts most when errors are rare.
- Tail-based sampling: The decision is made after the trace completes, so you can keep all errors, all slow traces, and sample the rest. It requires a trace collector that buffers spans, such as the OTel Collector with its tail sampling processor.
Most teams start with head-based sampling (10–20%) and graduate to tail-based as their trace volume grows. Plan for this transition early — retrofitting tail sampling into a system that wasn't designed for it is painful.
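The difference between the two strategies fits in a few lines. This is a simplified sketch: `head_sample` uses one common approach (comparing the low 64 bits of the trace ID against a threshold so that every service reaches the same decision), and `tail_sample` applies a hypothetical "keep errors, keep slow, sample the rest" policy. Neither is the exact algorithm used by the OTel SDK or Collector.

```python
def head_sample(trace_id: str, ratio: float) -> bool:
    """Decide at trace start from the trace ID alone (deterministic per trace)."""
    return int(trace_id[-16:], 16) < ratio * 2**64


def tail_sample(spans, ratio: float, slow_ms: int = 1000) -> bool:
    """Decide after the trace completes, when errors and latency are known."""
    if any(s["error"] for s in spans):
        return True  # always keep failed requests
    duration = max(s["end_ms"] for s in spans) - min(s["start_ms"] for s in spans)
    if duration >= slow_ms:
        return True  # always keep slow requests
    return head_sample(spans[0]["trace_id"], ratio)  # sample the ordinary rest


# A trace ID whose low 64 bits are all ones never passes a 10% head sampler.
UNLUCKY_ID = "4bf92f3577b34da6" + "f" * 16

failed_trace = [
    {"trace_id": UNLUCKY_ID, "start_ms": 0, "end_ms": 120, "error": False},
    {"trace_id": UNLUCKY_ID, "start_ms": 10, "end_ms": 110, "error": True},
]
healthy_trace = [dict(s, error=False) for s in failed_trace]

print(head_sample(UNLUCKY_ID, 0.10))     # head-based: dropped before the error happens
print(tail_sample(failed_trace, 0.10))   # tail-based: kept because a span failed
print(tail_sample(healthy_trace, 0.10))  # fast and healthy: falls back to the ratio
```

Note that `tail_sample` needs the full list of spans for a trace, which is why tail-based sampling requires a buffering component that sees every span of the trace.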
The Collector Layer
The OpenTelemetry Collector is a proxy that receives telemetry from your services, processes it (filtering, batching, attribute enrichment), and exports it to one or more backends. Running a Collector between your services and your observability backend decouples the two — you can switch backends, add fan-out to multiple backends, or add sampling logic without touching application code.
In Kubernetes environments, a common layout is a DaemonSet Collector on each node that receives telemetry from local pods and collects infrastructure metrics, with a central gateway Collector handling tail sampling and routing. This mirrors the agent and gateway deployment patterns described in the OTel documentation.
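The receive, process, export flow of a Collector can be sketched as a toy pipeline. `MiniCollector`, its processors, and the two list-backed "backends" are hypothetical stand-ins. The real Collector is configured declaratively rather than written as application code, so treat this only as a model of the data flow.

```python
class MiniCollector:
    """Toy pipeline: processors run in order, then batches fan out to exporters."""

    def __init__(self, processors, exporters, batch_size=2):
        self.processors = processors
        self.exporters = exporters
        self.batch_size = batch_size
        self._batch = []

    def receive(self, span):
        for process in self.processors:
            span = process(span)
            if span is None:
                return  # a processor filtered the span out
        self._batch.append(span)
        if len(self._batch) >= self.batch_size:
            self.flush()

    def flush(self):
        if not self._batch:
            return
        batch, self._batch = self._batch, []
        for export in self.exporters:
            export(list(batch))  # each backend gets its own copy of the batch


def drop_health_checks(span):
    return None if span["name"] == "GET /healthz" else span


def add_environment(span):
    # Enrichment step; the attribute key and value are illustrative.
    attributes = {**span.get("attributes", {}), "deployment.environment": "prod"}
    return {**span, "attributes": attributes}


backend_a, backend_b = [], []  # stand-ins for two observability backends
collector = MiniCollector(
    processors=[drop_health_checks, add_environment],
    exporters=[backend_a.extend, backend_b.extend],
)

for name in ["GET /healthz", "POST /checkout", "charge_card"]:
    collector.receive({"name": name})

print([s["name"] for s in backend_a])
```

The services only ever call `receive`. Adding a second backend or a new filter changes the Collector's wiring, not the application code, which is the decoupling described above.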
What Changes in Practice
Teams I've worked with that invest in proper distributed tracing report a consistent shift: production incidents that previously took hours to diagnose because they required correlating logs across multiple services now take minutes. The trace shows you exactly where time was spent, which service threw the error, and what the request looked like at each step.
More subtly, traces change how teams think about performance. When you can see that 80% of your checkout latency is in a single N+1 query in the inventory service, you stop guessing and start fixing the right thing.
Observability isn't a dashboard. It's the ability to answer questions about your system that you didn't know you'd need to ask. Distributed tracing is what makes that possible in a microservice world.