Production applications need three categories of telemetry: logs (discrete events), traces (request flows across spans of work), and metrics (numeric measurements over time). The Rust ecosystem handles all three through the tracing crate for instrumentation and OpenTelemetry for export to a self-hosted Grafana stack.
The web server section introduces tracing basics: initialising a subscriber, adding TraceLayer, and controlling log levels with RUST_LOG. The error handling section covers logging errors with tracing::error! and #[instrument]. This section builds on that foundation, covering production subscriber configuration, OpenTelemetry export, metrics collection, and the self-hosted observability stack that receives it all.
Dependencies
tracing and tracing-subscriber are already workspace dependencies from the web server setup. Add the OpenTelemetry crates and the metrics stack:
[workspace.dependencies]
# Already present from web server section
tracing = "0.1"
tracing-subscriber = { version = "0.3", features = ["env-filter", "json", "registry"] }
# OpenTelemetry
opentelemetry = "0.31"
opentelemetry_sdk = { version = "0.31", features = ["rt-tokio"] }
opentelemetry-otlp = { version = "0.31", features = ["grpc-tonic"] }
tracing-opentelemetry = "0.32"
opentelemetry-appender-tracing = "0.31"
# Metrics
metrics = "0.24"
metrics-exporter-prometheus = "0.18"
The grpc-tonic feature on opentelemetry-otlp enables gRPC transport via Tonic. As of 0.31 it is not enabled by default, so it must be requested explicitly; the default transport is HTTP/protobuf (the http-proto feature). Either works; gRPC is the more established choice for OTLP.
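If you would rather avoid the Tonic dependency tree entirely, the transport can be swapped in the dependency declaration. A sketch, assuming the 0.31 feature names (check the opentelemetry-otlp docs for your version):

```toml
# HTTP/protobuf transport instead of gRPC (feature names as of 0.31)
opentelemetry-otlp = { version = "0.31", default-features = false, features = [
    "http-proto",     # OTLP over HTTP with protobuf payloads
    "reqwest-client", # async HTTP client
    "trace",
    "logs",
] }
```

Note that with HTTP transport the default collector endpoint changes to port 4318 rather than the gRPC port 4317.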
Add json and registry features to tracing-subscriber if they are not already present. json enables structured JSON log output and registry provides the composable multi-layer subscriber.
Structured logging with tracing
tracing goes beyond traditional logging. Where log::info!("processing user {}", user_id) produces a flat string, tracing attaches structured key-value fields to both events and spans.
use tracing::{info, warn, info_span};
// Structured fields on an event
info!(user_id = 42, action = "login", "user authenticated");
// Variable name becomes field name automatically
let user_id = 42;
info!(user_id, "user authenticated");
// Debug vs Display formatting
info!(?some_struct, "debug format"); // uses Debug
info!(%some_value, "display format"); // uses Display
Spans represent a unit of work with a duration. Events occur within spans. When you nest spans, child spans carry their parent’s context, building a tree that traces the full path of a request through your application.
let span = info_span!("process_order", order_id = 1234);
let _guard = span.enter();
// Everything logged here is associated with the process_order span
info!("validating payment");
info!("updating inventory");
The #[instrument] attribute macro (covered in error handling) is the most common way to create spans. It wraps a function in a span named after the function, recording arguments as fields:
#[tracing::instrument(skip(pool))]
async fn create_order(pool: &PgPool, user_id: i64, items: &[Item]) -> Result<Order, AppError> {
info!("creating order");
// ...
}
This structured data is what makes observability work. Log aggregation systems like Loki can filter and group by field values. Trace backends like Tempo use spans to reconstruct request flows. None of this works with unstructured string logs.
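As a concrete illustration, once structured logs land in Loki you can filter on those fields directly with LogQL. A sketch (the service_name label assumes Loki's OTLP ingestion defaults; field names match the events above):

```logql
{service_name="my-app"} | json | user_id = "42" | action = "login"
```

The equivalent query against unstructured logs would be a brittle regex over the message string.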
Production subscriber configuration
The simple tracing_subscriber::fmt().init() from the web server section works for development. Production needs a multi-layer subscriber that sends telemetry to both stdout and OpenTelemetry.
tracing-subscriber’s architecture is built on composable layers stacked on a Registry:
use tracing_subscriber::{layer::SubscriberExt, util::SubscriberInitExt, EnvFilter};
pub fn init_telemetry() {
// Filter: controls which spans and events are processed
let env_filter = EnvFilter::try_from_default_env()
.unwrap_or_else(|_| EnvFilter::new("info,tower_http=debug"));
// Layer 1: format and print to stdout
let fmt_layer = tracing_subscriber::fmt::layer()
.json()
.with_target(true)
.with_thread_ids(false)
.with_file(true)
.with_line_number(true);
tracing_subscriber::registry()
.with(env_filter)
.with(fmt_layer)
.init();
}
Each layer handles one concern. The EnvFilter controls what gets processed (respects RUST_LOG). The fmt::layer() handles output formatting. Calling .json() switches from human-readable to structured JSON output, which log aggregation tools parse far more reliably than plain text.
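For reference, a single event rendered by the JSON formatter looks roughly like this (the exact field names and nesting depend on the tracing-subscriber version; this is illustrative output, not a guaranteed schema):

```json
{
  "timestamp": "2025-01-15T10:30:00.123456Z",
  "level": "INFO",
  "target": "my_app::orders",
  "filename": "src/orders.rs",
  "line_number": 42,
  "fields": { "message": "validating payment" },
  "span": { "name": "process_order", "order_id": 1234 }
}
```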
When OpenTelemetry is configured (next section), two additional layers join the stack: one that exports spans as traces, and one that exports events as log records.
OpenTelemetry export
OpenTelemetry (OTel) is a vendor-neutral standard for telemetry data. The Rust application exports traces and logs via the OTLP protocol to an OpenTelemetry Collector, which forwards them to the storage backends (Tempo for traces, Loki for logs).
Setting up the trace exporter
tracing-opentelemetry provides a layer that converts tracing spans into OpenTelemetry spans and exports them via OTLP:
use opentelemetry::trace::TracerProvider;
use opentelemetry_otlp::SpanExporter;
use opentelemetry_sdk::{
trace::SdkTracerProvider,
Resource,
};
use tracing_opentelemetry::OpenTelemetryLayer;
fn init_tracer_provider() -> SdkTracerProvider {
let exporter = SpanExporter::builder()
.with_tonic()
.build()
.expect("failed to create OTLP span exporter");
SdkTracerProvider::builder()
.with_batch_exporter(exporter)
.with_resource(
Resource::builder()
.with_service_name("my-app")
.build(),
)
.build()
}
with_tonic() sends spans over gRPC to the collector. The exporter reads OTEL_EXPORTER_OTLP_ENDPOINT for the collector address (defaults to http://localhost:4317). with_batch_exporter batches spans before sending, reducing network overhead.
The Resource identifies this application in the observability stack. service.name is the minimum; it appears as the service label in Grafana.
Setting up the logs exporter
opentelemetry-appender-tracing bridges tracing events to OpenTelemetry log records. This sends your application’s log output through the same OTLP pipeline as traces, and automatically attaches the active trace ID and span ID to each log record. That attachment is what enables clicking from a log line in Loki directly to the corresponding trace in Tempo.
use opentelemetry_appender_tracing::layer::OpenTelemetryTracingBridge;
use opentelemetry_otlp::LogExporter;
use opentelemetry_sdk::logs::SdkLoggerProvider;
fn init_logger_provider() -> SdkLoggerProvider {
let exporter = LogExporter::builder()
.with_tonic()
.build()
.expect("failed to create OTLP log exporter");
SdkLoggerProvider::builder()
.with_batch_exporter(exporter)
.build()
}
Combining everything
Wire up the full subscriber with all four layers:
use opentelemetry::trace::TracerProvider;
use tracing_subscriber::{layer::SubscriberExt, util::SubscriberInitExt, EnvFilter};
pub struct TelemetryGuard {
tracer_provider: SdkTracerProvider,
logger_provider: SdkLoggerProvider,
}
impl Drop for TelemetryGuard {
fn drop(&mut self) {
if let Err(e) = self.tracer_provider.shutdown() {
eprintln!("failed to shutdown tracer provider: {e}");
}
if let Err(e) = self.logger_provider.shutdown() {
eprintln!("failed to shutdown logger provider: {e}");
}
}
}
pub fn init_telemetry() -> TelemetryGuard {
let tracer_provider = init_tracer_provider();
let logger_provider = init_logger_provider();
let env_filter = EnvFilter::try_from_default_env()
.unwrap_or_else(|_| EnvFilter::new("info,tower_http=debug"));
// Stdout: JSON for production, human-readable for development
let fmt_layer = tracing_subscriber::fmt::layer()
.json()
.with_target(true);
// Traces: tracing spans → OTel spans → OTLP → Tempo
let tracer = tracer_provider.tracer("my-app");
let otel_trace_layer = tracing_opentelemetry::layer().with_tracer(tracer);
// Logs: tracing events → OTel log records → OTLP → Loki
let otel_logs_layer = OpenTelemetryTracingBridge::new(&logger_provider);
tracing_subscriber::registry()
.with(env_filter)
.with(fmt_layer)
.with(otel_trace_layer)
.with(otel_logs_layer)
.init();
TelemetryGuard {
tracer_provider,
logger_provider,
}
}
The TelemetryGuard ensures providers flush pending telemetry on shutdown. Hold it in main:
#[tokio::main]
async fn main() {
let _telemetry = init_telemetry();
// ... build app, start server
}
When _telemetry drops (at the end of main or on graceful shutdown), the providers flush any buffered spans and log records to the collector. Without this, the last few seconds of telemetry are lost on shutdown.
Development vs production
In development, you may not want to run the full observability stack. Make OpenTelemetry export conditional:
pub fn init_telemetry() -> Option<TelemetryGuard> {
let env_filter = EnvFilter::try_from_default_env()
.unwrap_or_else(|_| EnvFilter::new("info,tower_http=debug"));
let otel_enabled = std::env::var("OTEL_EXPORTER_OTLP_ENDPOINT").is_ok();
if otel_enabled {
let tracer_provider = init_tracer_provider();
let logger_provider = init_logger_provider();
let tracer = tracer_provider.tracer("my-app");
let otel_trace_layer = tracing_opentelemetry::layer().with_tracer(tracer);
let otel_logs_layer = OpenTelemetryTracingBridge::new(&logger_provider);
tracing_subscriber::registry()
.with(env_filter)
.with(tracing_subscriber::fmt::layer().json().with_target(true))
.with(otel_trace_layer)
.with(otel_logs_layer)
.init();
Some(TelemetryGuard { tracer_provider, logger_provider })
} else {
tracing_subscriber::registry()
.with(env_filter)
.with(tracing_subscriber::fmt::layer().with_target(true))
.init();
None
}
}
When OTEL_EXPORTER_OTLP_ENDPOINT is not set, the subscriber falls back to human-readable stdout logging with no OTel export. Set the variable to enable export:
OTEL_EXPORTER_OTLP_ENDPOINT=http://localhost:4317
The observability stack
The receiving end is a self-hosted Grafana stack: Loki for logs, Tempo for traces, Prometheus for metrics, and Grafana for dashboards. An OpenTelemetry Collector sits between the application and these backends, receiving OTLP data and routing it to the correct destination.
┌──────────┐    OTLP     ┌───────────────┐
│ Rust App │────────────▶│  OTel         │──── Loki (logs)
│          │  gRPC/HTTP  │  Collector    │──── Tempo (traces)
└──────────┘             └───────────────┘──── Prometheus (metrics)
                                                    │
                                                    ▼
                                          Grafana (dashboards)
Local development
For development, Grafana provides an all-in-one Docker image that bundles the OTel Collector, Grafana, Loki, Tempo, and Prometheus in a single container. Add it to your Docker Compose file alongside the other backing services:
services:
lgtm:
image: grafana/otel-lgtm:latest
ports:
- "3000:3000" # Grafana UI
- "4317:4317" # OTLP gRPC receiver
- "4318:4318" # OTLP HTTP receiver
Start the container and set the OTLP endpoint in your .env:
OTEL_EXPORTER_OTLP_ENDPOINT=http://localhost:4317
Open http://localhost:3000 to access Grafana. All data sources (Loki, Tempo, Prometheus) are pre-configured. No additional setup is needed.
Production
In production, run each component as a separate container: Grafana, Loki, Tempo, Prometheus, and the OpenTelemetry Collector. The Grafana documentation covers configuration for each component. The OTel Collector documentation covers the YAML pipeline configuration for routing OTLP data to each backend.
The key configuration is the collector’s pipeline, which routes each signal type to its destination:
# otel-collector.yaml (abbreviated)
receivers:
otlp:
protocols:
grpc:
endpoint: "0.0.0.0:4317"
http:
endpoint: "0.0.0.0:4318"
exporters:
otlphttp/loki:
endpoint: http://loki:3100/otlp
otlp/tempo:
endpoint: tempo:4317
tls:
insecure: true
prometheusremotewrite:
endpoint: http://prometheus:9090/api/v1/write
service:
pipelines:
logs:
receivers: [otlp]
exporters: [otlphttp/loki]
traces:
receivers: [otlp]
exporters: [otlp/tempo]
metrics:
receivers: [otlp]
exporters: [prometheusremotewrite]
Loki 3.x accepts OTLP logs natively at its /otlp endpoint. Tempo accepts OTLP traces over gRPC on port 4317. Prometheus 3.x accepts remote writes with the --web.enable-remote-write-receiver flag. Use the otel/opentelemetry-collector-contrib Docker image, which includes all the required exporters.
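As one concrete example, enabling the Prometheus remote-write receiver in a Compose deployment is a matter of passing the flag (image tag and config path are placeholders for your deployment):

```yaml
prometheus:
  image: prom/prometheus:latest
  command:
    - "--config.file=/etc/prometheus/prometheus.yml"
    - "--web.enable-remote-write-receiver"
```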
Metrics with Prometheus
The metrics crate provides a facade for recording application metrics. metrics-exporter-prometheus exposes those metrics in Prometheus text format at a /metrics endpoint that Prometheus scrapes.
Setup
use metrics_exporter_prometheus::{PrometheusBuilder, PrometheusHandle};
pub fn init_metrics() -> PrometheusHandle {
PrometheusBuilder::new()
.install_recorder()
.expect("failed to install Prometheus recorder")
}
install_recorder() registers the Prometheus exporter as the global metrics recorder and returns a handle. Use the handle to render the metrics output in an Axum handler:
use axum::{routing::get, Router, Extension};
async fn metrics_handler(
Extension(handle): Extension<PrometheusHandle>,
) -> String {
handle.render()
}
let metrics_handle = init_metrics();
let app = Router::new()
.route("/", get(index))
.route("/metrics", get(metrics_handler))
.layer(Extension(metrics_handle));
Prometheus scrapes http://your-app:3000/metrics on a configured interval (typically 15 seconds) and stores the time series data.
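The corresponding scrape job on the Prometheus side might look like this (job name and target are placeholders for your deployment):

```yaml
# prometheus.yml
scrape_configs:
  - job_name: "my-app"
    scrape_interval: 15s
    static_configs:
      - targets: ["my-app:3000"]
```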
Recording metrics
The metrics crate provides three metric types:
use metrics::{counter, gauge, histogram};
// Counter: monotonically increasing (requests served, errors encountered)
counter!("http_requests_total", "method" => "GET", "path" => "/users").increment(1);
// Gauge: value that goes up and down (active connections, queue depth)
gauge!("active_connections").set(42.0);
// Histogram: distribution of values (request latency, response size)
histogram!("http_request_duration_seconds").record(0.035);
Labels (the "method" => "GET" pairs) create separate time series for each label combination. Use labels to break down metrics by dimensions you need to filter or group by in dashboards.
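To make the label/time-series relationship concrete: the calls above surface in the /metrics output as one line per label combination (values here are illustrative):

```text
# TYPE http_requests_total counter
http_requests_total{method="GET",path="/users"} 1
# TYPE active_connections gauge
active_connections 42
```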
Request metrics middleware
Record HTTP request count and duration for every request with Tower middleware:
use axum::{extract::{MatchedPath, Request}, middleware::{self, Next}, response::Response};
use metrics::{counter, histogram};
use std::time::Instant;
async fn track_metrics(
matched_path: Option<MatchedPath>,
request: Request,
next: Next,
) -> Response {
let method = request.method().to_string();
let path = matched_path
.map(|p| p.as_str().to_string())
.unwrap_or_else(|| "unknown".to_string());
let start = Instant::now();
let response = next.run(request).await;
let duration = start.elapsed().as_secs_f64();
let status = response.status().as_u16().to_string();
counter!("http_requests_total", "method" => method.clone(), "path" => path.clone(), "status" => status).increment(1);
histogram!("http_request_duration_seconds", "method" => method, "path" => path).record(duration);
response
}
let app = Router::new()
.route("/", get(index))
.route_layer(middleware::from_fn(track_metrics))
.route("/metrics", get(metrics_handler));
Use MatchedPath rather than the raw URI for the path label. Raw URIs with path parameters (e.g., /users/42, /users/73) create unbounded label cardinality, which bloats Prometheus storage. MatchedPath returns the route template (/users/:id), keeping cardinality bounded.
The /metrics endpoint is outside the route_layer scope so it does not record metrics about metrics scraping.
What to measure
Start with RED metrics for HTTP services:
- Rate: http_requests_total (counter, by method/path/status)
- Errors: filter http_requests_total where status is 5xx
- Duration: http_request_duration_seconds (histogram, by method/path)
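In Grafana dashboards, the RED metrics translate into PromQL along these lines (the p95 query assumes the histogram is exported with buckets — with metrics-exporter-prometheus that means configuring them via set_buckets_for_metric, otherwise histograms render as summaries):

```promql
# Rate: requests per second, by route
sum(rate(http_requests_total[5m])) by (path)

# Errors: 5xx responses as a fraction of all requests
sum(rate(http_requests_total{status=~"5.."}[5m]))
  / sum(rate(http_requests_total[5m]))

# Duration: p95 latency per route
histogram_quantile(0.95,
  sum(rate(http_request_duration_seconds_bucket[5m])) by (le, path))
```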
Add application-specific metrics as you identify monitoring needs:
- db_query_duration_seconds (histogram) for slow query detection
- background_jobs_total (counter, by job type and outcome)
- active_sessions (gauge) for capacity planning
- email_send_total (counter, by outcome: success/failure)
Resist adding metrics for everything up front. Start with RED, observe your application in production, and add metrics when you find yourself asking a question that the existing telemetry cannot answer.
Correlating requests across services
The value of the observability stack multiplies when signals are connected. A log line links to its trace. A metric spike links to the traces that caused it.
Trace IDs in logs
The opentelemetry-appender-tracing bridge automatically attaches the active trace ID and span ID to every log record exported via OTLP. When these logs land in Loki, Grafana can extract the trace ID and create a clickable link to the corresponding trace in Tempo.
Configure this in Grafana’s Loki data source settings by adding a derived field that matches the trace ID and links to Tempo. In the grafana/otel-lgtm development image, this correlation is pre-configured.
Propagation across services
If your application calls other services (via reqwest or similar), propagate the trace context so spans across services form a single trace. Inject the W3C traceparent header into outgoing requests:
use opentelemetry::global;
use opentelemetry::propagation::Injector;
use reqwest::header::HeaderMap;
struct HeaderInjector<'a>(&'a mut HeaderMap);
impl<'a> Injector for HeaderInjector<'a> {
fn set(&mut self, key: &str, value: String) {
if let Ok(header_name) = key.parse() {
if let Ok(header_value) = value.parse() {
self.0.insert(header_name, header_value);
}
}
}
}
pub async fn call_other_service(client: &reqwest::Client, url: &str) -> reqwest::Result<String> {
let mut headers = HeaderMap::new();
global::get_text_map_propagator(|propagator| {
propagator.inject(&mut HeaderInjector(&mut headers));
});
client
.get(url)
.headers(headers)
.send()
.await?
.text()
.await
}
Register the W3C propagator at startup, before initialising the subscriber:
use opentelemetry::global;
use opentelemetry_sdk::propagation::TraceContextPropagator;
global::set_text_map_propagator(TraceContextPropagator::new());
For a single-service application (which most projects in this guide will be), propagation is not needed. Add it when you split into multiple services and want end-to-end traces.
Environment variables
The OpenTelemetry SDK reads standard environment variables. Set these in production:
# Collector endpoint
OTEL_EXPORTER_OTLP_ENDPOINT=http://otel-collector:4317
# Service identification
OTEL_SERVICE_NAME=my-app
OTEL_RESOURCE_ATTRIBUTES=deployment.environment=production
# Log level filtering
RUST_LOG=info,tower_http=debug
# Prometheus scrape (configure in prometheus.yml, not as an env var)
OTEL_SERVICE_NAME overrides the service.name set in code. OTEL_RESOURCE_ATTRIBUTES adds arbitrary key-value pairs to the OTel resource (useful for environment, version, or region labels).
Gotchas
Shutdown order matters. The TelemetryGuard must outlive the Axum server. If the guard drops before in-flight requests complete, those requests’ spans and logs are lost. Structure main so the guard is declared before the server starts and drops after shutdown completes.
EnvFilter is applied once. The filter determines which spans and events reach any layer. If you filter at info level, the OTel layers will not receive debug spans either. For production, info is typically appropriate. Avoid trace or debug in production unless you are actively debugging, as the volume of OTel data grows rapidly.
gRPC vs HTTP for OTLP. The grpc-tonic feature adds Tonic as a dependency, which pulls in prost, hyper, and h2. If binary size or compile time is a concern, use the http-proto feature instead, which sends OTLP over HTTP/protobuf using reqwest. Both are functionally equivalent.
Prometheus label cardinality. Every unique combination of label values creates a separate time series in Prometheus. High-cardinality labels (user IDs, request IDs, raw URLs) cause storage bloat and query slowness. Keep labels to bounded sets: HTTP methods, route templates, status code classes, job types.
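The multiplication is easy to underestimate, since each label contributes a multiplicative factor. A back-of-the-envelope sketch (the label counts are made up for illustration):

```rust
// Each label multiplies the number of time series by its count of
// distinct values; total series is the product across all labels.
fn series_count(label_value_counts: &[usize]) -> usize {
    label_value_counts.iter().product()
}

fn main() {
    // Bounded labels: 7 HTTP methods × 30 route templates × 5 status classes
    let bounded = series_count(&[7, 30, 5]);
    assert_eq!(bounded, 1050);

    // Swap route templates for raw per-user URLs (say 100 000 users) and a
    // single metric suddenly needs millions of series.
    let unbounded = series_count(&[7, 100_000, 5]);
    assert_eq!(unbounded, 3_500_000);

    println!("bounded: {bounded} series, unbounded: {unbounded} series");
}
```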
The metrics crate is a facade. Like log for logging, metrics defines the recording API but not the backend. If you forget to call PrometheusBuilder::new().install_recorder(), metric calls silently do nothing. Initialise the recorder early in startup.
OpenTelemetry crate versions move together. The opentelemetry, opentelemetry_sdk, and opentelemetry-otlp crates share a version number and must match. tracing-opentelemetry is one minor version ahead (0.32.x works with opentelemetry 0.31.x). Check compatibility when upgrading.