Telemetry and OpenTelemetry

Instrument Elixir applications with Telemetry and OpenTelemetry for metrics, traces, and production diagnostics. Covers event design, handlers, and trace propagation.

Observability is not optional in production systems. Without it, debugging becomes guesswork.

In Elixir, observability usually starts with Telemetry events and extends to OpenTelemetry for distributed tracing.

Telemetry Basics

Each Telemetry event has three parts:

an event name (a list of atoms),
measurements (a map of numeric values),
metadata (a map of context about the event).

:telemetry.execute(
  [:my_app, :checkout, :completed],
  %{duration_ms: 128},
  %{user_id: user.id, cart_size: 4}
)

A handler subscribes and processes these events:

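defmodule MyApp.CheckoutMetrics do
  require Logger

  # Telemetry recommends a captured module function over an anonymous
  # function, which works but incurs a performance penalty.
  def handle_event(_event, measurements, metadata, _config) do
    Logger.info("checkout completed in #{measurements.duration_ms}ms for user #{metadata.user_id}")
  end
end

:telemetry.attach(
  "checkout-metrics",
  [:my_app, :checkout, :completed],
  &MyApp.CheckoutMetrics.handle_event/4,
  nil
)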
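
One way to confirm that an event actually reaches a handler is to attach a probe that forwards events to the current process. The following is a minimal sketch, written as a standalone script; the `CheckoutProbe` module and the "checkout-probe" handler ID are hypothetical names chosen for illustration, and the script assumes the `:telemetry` package can be fetched with `Mix.install`.

```elixir
Mix.install([{:telemetry, "~> 1.0"}])

defmodule CheckoutProbe do
  # Hypothetical handler: forwards each event to the pid passed as config.
  def handle_event(event, measurements, metadata, pid) do
    send(pid, {:telemetry_event, event, measurements, metadata})
  end
end

:ok =
  :telemetry.attach(
    "checkout-probe",
    [:my_app, :checkout, :completed],
    &CheckoutProbe.handle_event/4,
    self()
  )

:telemetry.execute(
  [:my_app, :checkout, :completed],
  %{duration_ms: 128},
  %{user_id: 42, cart_size: 4}
)

# Handlers run synchronously in the process that calls execute/3, so the
# forwarded message is already in the mailbox at this point.
receive do
  {:telemetry_event, _event, measurements, metadata} ->
    IO.inspect({measurements, metadata}, label: "received")
after
  0 -> IO.puts("no event received")
end

# Detach by handler ID once the probe is no longer needed.
:telemetry.detach("checkout-probe")
```

Because handlers run inline with the code that emits the event, slow work inside a handler slows down the caller. A handler that raises is detached by Telemetry, so keep handlers small and defensive.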

Designing Useful Events

Prefer stable naming and semantic consistency.

Good patterns:

[:my_app, :http, :request, :stop]
[:my_app, :repo, :query, :stop]
[:my_app, :job, :run, :exception]

Include metadata that answers operational questions quickly (tenant, endpoint, job name, retry count).
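
The `:start`, `:stop`, and `:exception` suffixes above match what `:telemetry.span/3` emits, so one option is to let it produce the whole family of events. This is a sketch under assumptions: `MyApp.Jobs`, `MyApp.JobLogger`, and the "job-logger" handler ID are hypothetical, and the metadata keys simply mirror the examples in the text.

```elixir
Mix.install([{:telemetry, "~> 1.0"}])

defmodule MyApp.Jobs do
  # Hypothetical job runner. :telemetry.span/3 emits
  # [:my_app, :job, :run, :start] before the function runs, then
  # [..., :stop] on success or [..., :exception] if it raises.
  def run(job_name, retry_count, fun) do
    metadata = %{job: job_name, retry_count: retry_count}

    :telemetry.span([:my_app, :job, :run], metadata, fn ->
      result = fun.()
      # The second tuple element becomes the metadata of the :stop event.
      {result, metadata}
    end)
  end
end

defmodule MyApp.JobLogger do
  def handle_event([:my_app, :job, :run, :stop], %{duration: duration}, metadata, _config) do
    # :duration is reported in native time units; convert before displaying.
    ms = System.convert_time_unit(duration, :native, :millisecond)
    IO.puts("job=#{metadata.job} retry=#{metadata.retry_count} duration_ms=#{ms}")
  end
end

:ok =
  :telemetry.attach(
    "job-logger",
    [:my_app, :job, :run, :stop],
    &MyApp.JobLogger.handle_event/4,
    nil
  )

# span/3 returns whatever the wrapped function returned.
:sent = MyApp.Jobs.run("send_receipt", 0, fn -> :sent end)
```

Using the same metadata keys on the start and stop events keeps dashboards and log queries consistent across the event family.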

From Telemetry to Metrics

Metrics backends aggregate events into dashboards and alerts.

Examples:

request latency percentiles,
error rate by endpoint,
queue depth and worker utilization,
DB query durations.

Avoid metric spam. Track signals you can act on.
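
In Elixir, a common way to describe which events become metrics is the `telemetry_metrics` package. The sketch below assumes that package and reuses the event names from earlier sections; the `MyApp.Metrics` module, the metric names, and the `:endpoint` and `:job` metadata keys are illustrative assumptions, not a prescribed layout.

```elixir
Mix.install([{:telemetry_metrics, "~> 1.0"}])

defmodule MyApp.Metrics do
  import Telemetry.Metrics

  # Metric definitions only describe what to aggregate. Nothing is collected
  # until this list is handed to a reporter for your metrics backend.
  def metrics do
    [
      # Request latency, converted from native time units to milliseconds
      # and broken down by a low-cardinality tag.
      summary("my_app.http.request.stop.duration",
        unit: {:native, :millisecond},
        tags: [:endpoint]
      ),

      # Failure count per job name. :event_name is given explicitly because
      # the metric name does not follow the event name.
      counter("my_app.job.failures.count",
        event_name: [:my_app, :job, :run, :exception],
        tags: [:job]
      )
    ]
  end
end

IO.inspect(MyApp.Metrics.metrics(), label: "metric definitions")
```

Keeping the list short is deliberate: each entry should correspond to a dashboard panel or an alert someone will actually look at.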

OpenTelemetry Tracing

OpenTelemetry adds request-level traces and spans so you can follow work across boundaries.

Typical flow:

inbound request creates/continues a trace,
internal spans capture DB calls, jobs, external API calls,
exporter sends data to a backend (e.g., Honeycomb, Tempo, Datadog).

This makes high-latency paths and cascade failures visible.
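
A manual span around an external call might look like the sketch below. It assumes only the `opentelemetry_api` package, which on its own acts as a no-op: spans are recorded and exported only once the OpenTelemetry SDK and an exporter are configured in the application. `MyApp.Payments`, the span name, and the attribute keys are hypothetical.

```elixir
Mix.install([{:opentelemetry_api, "~> 1.0"}])

defmodule MyApp.Payments do
  require OpenTelemetry.Tracer, as: Tracer

  # Hypothetical wrapper around an external payment API call. The span
  # starts when the block is entered and ends when it exits.
  def charge(order_id, amount_cents, http_call) do
    Tracer.with_span "payments.charge" do
      # Attributes make the span searchable in the tracing backend.
      Tracer.set_attributes(%{"order.id" => order_id, "amount.cents" => amount_cents})

      # The block's value is returned from with_span.
      http_call.()
    end
  end
end

# The HTTP call is passed in as a function so the sketch stays self-contained.
result = MyApp.Payments.charge(1001, 4_999, fn -> {:ok, :charged} end)
IO.inspect(result, label: "charge result")
```

Wrapping the call site rather than the HTTP client keeps the span name tied to the business operation, which tends to be easier to search for than a raw URL.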

Propagation Matters

In distributed systems, trace context must be carried across every boundary:

HTTP headers between services,
message metadata in queues/pubsub,
background task boundaries.

If context is dropped, traces become fragmented and hard to debug.
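
The boundaries listed above can be sketched with the OpenTelemetry API as follows. This is an illustration under assumptions: `MyApp.TracePropagation` and its function names are hypothetical, and with only `opentelemetry_api` installed the propagator is a no-op, so real headers appear only once the SDK is configured with a propagator such as W3C Trace Context.

```elixir
Mix.install([{:opentelemetry_api, "~> 1.0"}])

defmodule MyApp.TracePropagation do
  # Outgoing HTTP call or queue message: write the current trace context
  # into a list of {name, value} header tuples.
  def outgoing_headers(headers \\ []) do
    :otel_propagator_text_map.inject(headers)
  end

  # Incoming side: read trace context from headers and attach it to the
  # current process, so spans started afterwards join the caller's trace.
  def continue_from(headers) do
    :otel_propagator_text_map.extract(headers)
    :ok
  end

  # Process boundary: the current context belongs to the calling process
  # and is not inherited by new processes, so capture it and re-attach it
  # inside the task.
  def async(fun) do
    ctx = OpenTelemetry.Ctx.get_current()

    Task.async(fn ->
      OpenTelemetry.Ctx.attach(ctx)
      fun.()
    end)
  end
end

headers = MyApp.TracePropagation.outgoing_headers([{"content-type", "application/json"}])
IO.inspect(headers, label: "outgoing headers")

task = MyApp.TracePropagation.async(fn -> :work_done end)
IO.inspect(Task.await(task), label: "task result")
```

The same capture-and-attach pattern applies to job queues: store the injected headers in the job arguments when enqueuing, then extract them at the start of the worker.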

# Python ecosystem
# OpenTelemetry SDK + framework middleware instrumentation.
# Similar concepts: spans, context propagation, exporters.

// Node.js ecosystem
// OpenTelemetry auto/manual instrumentation with context propagation.
// Similar challenge: keeping trace context across async boundaries.

# Elixir ecosystem
# Telemetry for events + OpenTelemetry for traces.
# Strong fit with BEAM process boundaries when propagation is explicit.

Common Mistakes

Instrumenting everything with no clear signal strategy.
Putting high-cardinality values (user IDs, request IDs, raw URLs) directly into metric labels.
Forgetting trace context propagation in background jobs.
Alerting on raw noise instead of service-level indicators.

Exercise

Instrument a Phoenix Endpoint End-to-End

Instrument one endpoint in your app:

Emit Telemetry events for request start/stop and business operation success/failure.
Add a custom measurement for domain latency (for example, checkout processing time).
Attach a handler that logs structured data for failures.
Create an OpenTelemetry span around an external API call.
Verify trace continuity across the request and background task boundary.

FAQ and Troubleshooting

I added events, but nothing appears in dashboards. Why?

Usually the handlers or exporters were never attached in the running environment. Confirm the startup wiring, check that event names match exactly (a handler attached to a misspelled event name is never called), and make sure your telemetry backend credentials and config are loaded at runtime.
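
A quick way to check the wiring from a running node is to list the handlers attached to an event. A minimal sketch, assuming the `:telemetry` package; `MyApp.TelemetryDebug` is a hypothetical helper name.

```elixir
Mix.install([{:telemetry, "~> 1.0"}])

defmodule MyApp.TelemetryDebug do
  # Hypothetical helper: returns the IDs of all handlers attached to events
  # that start with the given prefix.
  def handler_ids(event_prefix) do
    event_prefix
    |> :telemetry.list_handlers()
    |> Enum.map(& &1.id)
  end
end

# An empty list means nothing is subscribed, so events under this prefix
# are emitted but never processed.
IO.inspect(
  MyApp.TelemetryDebug.handler_ids([:my_app, :checkout]),
  label: "attached handlers"
)
```

Running the same call in a remote shell on the production node shows whether the handlers were attached at startup there, not just in development.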

Why are my traces split into separate fragments?

Trace context was likely not propagated between boundaries (HTTP client calls, jobs, pubsub messages). Add explicit context injection/extraction at every boundary.

What should I instrument first in a new service?

Start with request latency, error rate, and dependency timings (DB/external API). Add business-domain events next. Resist broad instrumentation until these core signals are stable and useful.
