Observability · May 20, 2025

Distributed Tracing with OpenTelemetry:
Why Logs Alone Aren't Enough

Raffael Hühnerschulte · 6 min read

Picture this: a user reports that checkout is slow. You have logs from the API gateway, the order service, the inventory service, the payment service, and the notification service. All of them look fine individually. The latency is hiding somewhere in the handoffs — and you have no way to see it.

This is the wall that logs and metrics alone hit in microservice architectures. Each signal tells you what happened inside one service. None of them tell you what happened across the whole request. That's the gap distributed tracing fills, and it's why observability in 2025 means all three signals working together — not just structured logs and dashboards.

The Three Pillars — and Why Traces Are the Missing One

Logs

What happened inside a single process at a point in time. Excellent for debugging known error classes. Blind to cross-service causality.

Metrics

Aggregated system health over time. Tells you that something is slow or failing. Doesn't tell you where in a distributed call chain.

Traces

The full journey of a request across all services, with timing for each hop. Connects the other two signals to a causal story.

In a monolith, a stack trace gives you the full picture. In a distributed system, the equivalent is a trace — a structured record of every service, database call, and queue interaction that contributed to a single user request.

What OpenTelemetry Actually Is

OpenTelemetry (OTel) is a CNCF project that provides vendor-neutral APIs, SDKs, and instrumentation for capturing traces, metrics, and logs. Before OTel, every observability vendor had its own SDK, so switching backends meant rewriting all your instrumentation. OTel solves this with a standard data model and wire protocol (OTLP) that most major backends now accept, including Jaeger, Grafana Tempo, Honeycomb, Datadog, and Dynatrace.

The practical implication: instrument once, route anywhere. You're not betting on a vendor when you adopt OTel — you're adopting the standard.

Key concept: A trace is a tree of spans. Each span represents one operation (an HTTP handler, a DB query, a downstream call) with a start time, duration, status, and arbitrary key-value attributes. The trace ID and the parent span ID travel across service boundaries in HTTP headers or message metadata, which is how spans from different services end up in the same tree.
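
To make the "tree of spans" idea concrete, here is a minimal sketch in plain Python. It does not use the OTel SDK; the Span class, the render helper, and the sample checkout trace are all made up for illustration.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Span:
    """Illustrative span record, not the OTel SDK's Span type."""
    name: str
    span_id: str
    parent_id: Optional[str]  # None marks the root span
    start_ms: int
    duration_ms: int
    attributes: dict = field(default_factory=dict)


def render(spans):
    """Return indented lines showing the span tree in start-time order."""
    children = {}
    for s in spans:
        children.setdefault(s.parent_id, []).append(s)

    lines = []

    def walk(parent_id, depth):
        for s in sorted(children.get(parent_id, []), key=lambda s: s.start_ms):
            lines.append(f"{'  ' * depth}{s.name} ({s.duration_ms} ms)")
            walk(s.span_id, depth + 1)

    walk(None, 0)
    return lines


# Hypothetical checkout trace: one root span, two children, one grandchild.
trace = [
    Span("POST /checkout", "a1", None, 0, 480),
    Span("inventory.reserve", "b2", "a1", 10, 120),
    Span("SELECT stock", "c3", "b2", 15, 90, {"db.table": "stock"}),
    Span("payment.charge", "d4", "a1", 140, 300),
]

print("\n".join(render(trace)))
```

The parent_id links are what turn a flat list of timed operations into a tree. A tracing backend does essentially this grouping when it draws the waterfall view.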

How Context Propagation Works

The mechanism that makes distributed tracing possible is context propagation: when service A calls service B, it injects the current trace context into the request headers (the W3C Trace Context standard defines the traceparent header for this). Service B extracts that context, creates a child span whose parent is the calling span, and continues the trace. The result is a single trace ID that links every service involved in handling one request.
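
The inject and extract steps can be sketched in a few lines of standard-library Python. This is an illustration of the traceparent format (version, trace ID, parent span ID, flags), not the OTel propagator itself; the function names are my own, it only handles version 00, and it ignores the companion tracestate header.

```python
import re
import secrets

# traceparent: version "00", 32-hex trace ID, 16-hex parent span ID, 2-hex flags
TRACEPARENT_RE = re.compile(r"^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")


def inject(headers, trace_id, span_id, sampled=True):
    """Caller side: write the current trace context into outgoing headers."""
    flags = "01" if sampled else "00"
    headers["traceparent"] = f"00-{trace_id}-{span_id}-{flags}"


def extract(headers):
    """Callee side: return (trace_id, parent_span_id, sampled), or None.

    Simplification: real HTTP header names are case-insensitive,
    this dict lookup is not.
    """
    match = TRACEPARENT_RE.match(headers.get("traceparent", ""))
    if not match:
        return None
    trace_id, parent_id, flags = match.groups()
    if trace_id == "0" * 32 or parent_id == "0" * 16:
        return None  # all-zero IDs are invalid
    return trace_id, parent_id, bool(int(flags, 16) & 0x01)


# Service A starts a trace and calls service B.
trace_id = secrets.token_hex(16)   # 32 hex characters
span_a = secrets.token_hex(8)      # 16 hex characters
outgoing = {}
inject(outgoing, trace_id, span_a)

# Service B continues the same trace with a new child span.
incoming_trace_id, parent_span_id, sampled = extract(outgoing)
span_b = secrets.token_hex(8)
```

Service B keeps the trace ID it received and records span_a as the parent of span_b. That one link is what stitches the two services into one trace.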

With OTel, this propagation is handled automatically by the HTTP client and server instrumentation — you don't write it by hand. For asynchronous systems like Kafka, you inject the context into message headers when producing and extract it when consuming, maintaining the causal link even across queue boundaries.
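
For the queue case, the same idea applies to message headers instead of HTTP headers. The sketch below uses a deque as a stand-in for a topic and models record headers as a list of (key, bytes) pairs; both are assumptions for illustration, not any particular Kafka client's API, and the helper names are hypothetical.

```python
from collections import deque


def inject_into_message(headers, trace_id, span_id):
    """Producer side: return headers with the trace context appended."""
    value = f"00-{trace_id}-{span_id}-01"
    return headers + [("traceparent", value.encode("ascii"))]


def extract_from_message(headers):
    """Consumer side: return (trace_id, parent_span_id), or None if absent."""
    for key, value in headers:
        if key == "traceparent":
            parts = value.decode("ascii").split("-")
            if len(parts) == 4:
                return parts[1], parts[2]
    return None


topic = deque()  # stand-in for a Kafka topic

# Producer: attach the context of the span that publishes the message.
trace_id = "4bf92f3577b34da6a3ce929d0e0e4736"
producer_span_id = "00f067aa0ba902b7"
topic.append({
    "value": b"order-created",
    "headers": inject_into_message([], trace_id, producer_span_id),
})

# Consumer: possibly much later, in another process.
message = topic.popleft()
parent = extract_from_message(message["headers"])
```

Because the context rides inside the message, the consumer's span can name the producer's span as its parent even if the message sat in the queue for minutes.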

Instrumentation in Practice

Auto-instrumentation vs manual spans

OTel SDKs ship with auto-instrumentation for most common frameworks, including Spring Boot, Express, FastAPI, Django, gRPC, and JDBC. Auto-instrumentation handles the framework boundaries (HTTP requests, DB calls) with little or no change to application code. Depending on the language, you attach it as an agent (as in Java) or load it through an SDK import at startup, and you get baseline traces immediately.

Manual instrumentation is for business logic that matters for debugging: "which pricing rule fired?", "how long did the inventory reservation hold the lock?", "which customer segment triggered the slow path?". These are the spans that turn a generic slow trace into a useful one.

Attributes are everything

A span without attributes is just a box on a timeline. Useful spans include the data that lets you filter, group, and correlate: user.id, order.value, db.table, http.route, error.type. The OTel semantic conventions define standard attribute names for common operations — use them, so your tooling and dashboards work out of the box.
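
Here is a rough sketch of what a manual span with attributes looks like. To keep it runnable without dependencies, it uses a small stand-in context manager rather than the OTel SDK; in the OTel Python API the equivalent calls are tracer.start_as_current_span(...) and span.set_attribute(...). The pricing function, the thresholds, and the attribute names customer.segment and pricing.rule are hypothetical.

```python
import time
from contextlib import contextmanager

finished_spans = []  # stand-in for an exporter


@contextmanager
def span(name, **attributes):
    """Minimal stand-in for a tracer: times a block and records attributes."""
    record = {"name": name, "attributes": dict(attributes), "status": "ok"}
    start = time.perf_counter()
    try:
        yield record
    except Exception as exc:
        record["status"] = "error"
        record["attributes"]["error.type"] = type(exc).__name__
        raise
    finally:
        record["duration_ms"] = (time.perf_counter() - start) * 1000
        finished_spans.append(record)


def apply_pricing(order_value, segment):
    """Hypothetical business logic wrapped in a manual span."""
    initial = {"order.value": order_value, "customer.segment": segment}
    with span("pricing.apply", **initial) as s:
        rule = "bulk_discount" if order_value >= 100 else "standard"
        # Record which rule fired, so slow traces can be grouped by it later.
        s["attributes"]["pricing.rule"] = rule
        return order_value * (0.9 if rule == "bulk_discount" else 1.0)


total = apply_pricing(200, "enterprise")
```

The span itself is cheap to add. The value comes from the attributes: with pricing.rule recorded, "which pricing rule fired?" becomes a filter in your tracing UI instead of a log search.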

Sampling: The Problem Nobody Plans For

A high-traffic service might handle tens of thousands of requests per second. Capturing a full trace for every request is neither practical nor useful — most traces look identical. Sampling is how you decide which traces to keep.

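There are two broad approaches. Head-based sampling decides when the trace starts, before anything is known about how the request will turn out; it is cheap, but it discards slow and failed requests at the same rate as healthy ones. Tail-based sampling decides after the trace has completed, so it can keep every error and every slow request, at the cost of buffering whole traces first.

Most teams start with head-based sampling (10–20%) and graduate to tail-based as their trace volume grows. Plan for this transition early: retrofitting tail sampling into a system that wasn't designed for it is painful.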
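
The difference between the two strategies is easiest to see side by side. This is a simplified sketch, not the OTel samplers: head_sample derives a deterministic decision from the trace ID alone, and tail_sample looks at the finished spans. The function names, the span dictionaries, and the 500 ms threshold are assumptions for illustration.

```python
def head_sample(trace_id_hex, ratio):
    """Head-based: decide at the start of the trace, using only the trace ID.

    Every service that sees the same trace ID reaches the same decision,
    so a trace is kept or dropped as a whole.
    """
    low64 = int(trace_id_hex, 16) & 0xFFFFFFFFFFFFFFFF
    return low64 < int(ratio * (1 << 64))


def tail_sample(spans, slow_ms=500):
    """Tail-based: decide after the trace has finished, with all spans in hand.

    Keeps traces that contain an error or are slow. The longest span is
    used here as a proxy for total request latency.
    """
    has_error = any(s["status"] == "error" for s in spans)
    longest_ms = max(s["duration_ms"] for s in spans)
    return has_error or longest_ms >= slow_ms


healthy = [{"status": "ok", "duration_ms": 40}]
failed = [{"status": "ok", "duration_ms": 40}, {"status": "error", "duration_ms": 5}]
slow = [{"status": "ok", "duration_ms": 800}]

keep_healthy = tail_sample(healthy)  # fast and error-free: dropped
keep_failed = tail_sample(failed)    # contains an error: kept
keep_slow = tail_sample(slow)        # over the threshold: kept
```

Note what tail_sample needs that head_sample does not: every span of the trace, collected in one place, before the decision can be made. That requirement is what makes the later migration expensive.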

The Collector Layer

The OpenTelemetry Collector is a proxy that receives telemetry from your services, processes it (filtering, batching, attribute enrichment), and exports it to one or more backends. Running a Collector between your services and your observability backend decouples the two — you can switch backends, add fan-out to multiple backends, or add sampling logic without touching application code.
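
Conceptually, a Collector pipeline is receive, process, export. The toy model below mimics that flow in plain Python to show why the decoupling works; it is not how the Collector is configured in practice (that is done declaratively), and the processor names, the /healthz route, and the environment attribute are illustrative assumptions.

```python
def drop_health_checks(spans):
    """Processor: filter out noise before it reaches any backend."""
    return [s for s in spans if s.get("http.route") != "/healthz"]


def add_environment(spans, env="prod"):
    """Processor: enrich every span with an attribute the app never set."""
    return [{**s, "deployment.environment": env} for s in spans]


def batches(spans, size):
    """Split spans into export batches of at most `size`."""
    for i in range(0, len(spans), size):
        yield spans[i:i + size]


def run_pipeline(received, processors, exporters, batch_size=2):
    """Apply processors in order, then fan each batch out to every exporter."""
    spans = received
    for process in processors:
        spans = process(spans)
    for batch in batches(spans, batch_size):
        for export in exporters:
            export(batch)


received = [
    {"name": "GET /healthz", "http.route": "/healthz"},
    {"name": "POST /checkout", "http.route": "/checkout"},
    {"name": "GET /cart", "http.route": "/cart"},
    {"name": "GET /orders", "http.route": "/orders"},
]

backend_a, backend_b = [], []  # stand-ins for two observability backends
run_pipeline(
    received,
    processors=[drop_health_checks, add_environment],
    exporters=[backend_a.append, backend_b.append],
)
```

Adding a second backend meant adding one entry to the exporters list. The services that produced the spans did not change, which is the point of putting a Collector in the middle.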

In Kubernetes environments, a common layout is a DaemonSet Collector per node that handles infrastructure metrics, with a central gateway Collector that handles tail sampling and routing. This corresponds to the agent and gateway deployment patterns described in the OTel documentation.

What Changes in Practice

Teams I've worked with that invest in proper distributed tracing report a consistent shift. Production incidents that used to take hours to diagnose, because someone had to correlate logs across multiple services by hand, now take minutes. The trace shows you where time was spent, which service threw the error, and what the request looked like at each step.

More subtly, traces change how teams think about performance. When you can see that 80% of your checkout latency is in a single N+1 query in the inventory service, you stop guessing and start fixing the right thing.
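
The arithmetic behind a finding like that is simple once the spans are in hand. The sketch below groups spans by service and operation and reports each group's share of the root span's duration. The trace data is invented, and the calculation assumes the listed spans run sequentially without overlapping, which is typical of an N+1 pattern but not of parallel calls.

```python
from collections import defaultdict


def latency_share(spans, root_duration_ms):
    """Return (service, operation, call_count, share_of_root), largest first."""
    totals = defaultdict(lambda: [0, 0])  # (service, name) -> [count, total_ms]
    for s in spans:
        entry = totals[(s["service"], s["name"])]
        entry[0] += 1
        entry[1] += s["duration_ms"]
    rows = [
        (service, name, count, total_ms / root_duration_ms)
        for (service, name), (count, total_ms) in totals.items()
    ]
    return sorted(rows, key=lambda row: row[3], reverse=True)


# Hypothetical 500 ms checkout: 40 near-identical queries plus one payment call.
spans = [
    {"service": "inventory", "name": "SELECT stock_item", "duration_ms": 10}
    for _ in range(40)
]
spans.append({"service": "payment", "name": "payment.charge", "duration_ms": 60})

report = latency_share(spans, root_duration_ms=500)
for service, name, count, share in report:
    print(f"{service:10} {name:20} x{count:<3} {share:.0%}")
```

Forty calls to the same query inside one request is the signature of an N+1 problem. No single call is slow enough to stand out in a log, but the aggregate is obvious in the trace.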

Observability isn't a dashboard. It's the ability to answer questions about your system that you didn't know you'd need to ask. Distributed tracing is what makes that possible in a microservice world.
