Web applications run work outside the request/response cycle constantly: processing a payment after checkout, sending confirmation emails, generating reports, importing data. When any of these steps fails partway through, you need the operation to resume from where it left off, not start over or silently disappear.
Restate is a durable execution engine. It records every step of a handler in a persistent journal. If a step fails or the process crashes, Restate replays from the journal, skipping completed steps and retrying from the point of failure. You get automatic retries, exactly-once side effects, and workflows that survive process restarts, without writing retry logic or state machines yourself.
This section covers integrating Restate with an Axum application as separate workspace binaries, defining durable workflows, triggering them from HTTP handlers, and reporting progress back to the browser via Valkey pub/sub and SSE.
When to use Restate
The design principle for this stack is durable by default: any work that must not be silently lost goes through Restate.
Use Restate for:
- Multi-step operations involving external services (payment + inventory + email)
- Any side effect that must happen exactly once (sending a notification, charging a card, calling a third-party API)
- Work that outlives an HTTP request timeout (report generation, data imports, file processing)
- Operations that need automatic retries with persistence across process restarts
- Coordinating work across multiple services with transactional guarantees
The bar for skipping Restate is high. A tokio::spawn that fires off a quick in-memory computation with no external effects is fine. But the moment the spawned task calls an external API, sends an email, or does anything the user expects to complete reliably, route it through Restate. The cost is one HTTP hop to the Restate server. What you get back is durability, observability, and automatic retry logic without writing any of it yourself.
How Restate works
Restate runs as a separate server process that sits between your application and your service handlers:
Axum app ──HTTP──▶ Restate Server (port 8080) ──HTTP──▶ Worker (port 9080)
                          │
                          ▼
                   Journal + State
                (durable log + RocksDB)
Your Axum application sends requests to the Restate server’s ingress on port 8080. Restate forwards them to your worker process, which runs the actual handler logic using the Restate SDK on port 9080. Every operation the handler performs is recorded in Restate’s journal. If the handler crashes, Restate replays the journal against a new handler invocation: completed steps return their stored results without re-executing, and execution resumes from the failed step.
The Restate server is a single binary with no external dependencies. It stores its journal and state in an embedded RocksDB instance backed by a durable replicated log. The admin API on port 9070 handles service registration and provides a built-in UI for inspecting invocations.
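The replay mechanism can be illustrated with a toy, std-only model (this is not Restate's actual journal format, just the idea): each named step's result is stored the first time it runs, and on replay the stored result is returned without re-executing the step.

```rust
use std::collections::HashMap;

// Toy model of journal replay: completed steps return their recorded
// results; only steps with no journal entry actually execute.
struct Journal {
    entries: HashMap<String, String>,
    executions: u32, // counts real (non-replayed) executions
}

impl Journal {
    fn new() -> Self {
        Self { entries: HashMap::new(), executions: 0 }
    }

    fn run(&mut self, name: &str, step: impl FnOnce() -> String) -> String {
        if let Some(stored) = self.entries.get(name) {
            return stored.clone(); // replay: skip the side effect entirely
        }
        self.executions += 1;
        let result = step();
        self.entries.insert(name.to_string(), result.clone());
        result
    }
}

fn main() {
    let mut journal = Journal::new();
    journal.run("charge_payment", || "pay_1".into());
    // Simulate crash-and-replay: the step is not re-executed.
    let replayed = journal.run("charge_payment", || "pay_2".into());
    println!("{replayed}, executions: {}", journal.executions);
}
```

The real journal also records state changes, timers, and inter-service calls, but the skip-completed-steps behaviour is the core of durable execution.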
Service types
The Restate Rust SDK provides three types of handlers, each defined as a trait with a proc macro.
Services (#[restate_sdk::service]) are stateless handlers. Multiple invocations run concurrently. Use these for independent operations like sending emails or calling external APIs.
Virtual objects (#[restate_sdk::object]) are stateful entities identified by a string key. Each object has isolated key/value state stored durably by Restate. Only one exclusive handler runs at a time per key, which guarantees state consistency without locks. Handlers marked #[shared] run concurrently with read-only state access.
Workflows (#[restate_sdk::workflow]) are a specialised form of virtual object. The run handler executes exactly once per workflow ID. Additional #[shared] handlers can query the workflow’s state or signal it through durable promises. Workflow state is retained for 24 hours after completion by default.
| Type | State | Concurrency per key | Use case |
|---|---|---|---|
| Service | None | Concurrent | Stateless operations: send email, call API, transform data |
| Virtual Object | Per-key K/V | One exclusive handler at a time | Mutable state: counter, order tracker, rate limiter |
| Workflow | Per-workflow K/V | run exclusive; #[shared] concurrent | Multi-step processes: order fulfilment, onboarding, data pipeline |
Durable execution primitives
Journaled side effects
ctx.run() executes a closure and persists the result in Restate’s journal. On replay, the stored result is returned without re-executing the closure. Wrap every non-deterministic operation (HTTP calls, database writes, random number generation) in ctx.run().
let payment_id: String = ctx
.run(|| charge_payment(order.clone()))
.name("charge_payment")
.await?;
The .name() call labels the operation in the Restate UI for observability. It is optional but worth adding.
If the closure fails, Restate retries it with exponential backoff. The default retry policy retries indefinitely. Override it for operations that should fail fast:
use restate_sdk::prelude::*;
use std::time::Duration;
let result = ctx
.run(|| call_flaky_service())
.retry_policy(
RunRetryPolicy::default()
.initial_delay(Duration::from_millis(100))
.exponentiation_factor(2.0)
.max_attempts(5),
)
.name("flaky_service")
.await?;
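Under these settings the retry delays grow geometrically. A std-only sketch of the schedule such a policy would produce, assuming the factor is applied once per completed attempt and no maximum-delay cap is set (an assumption about the semantics, not a statement of SDK internals):

```rust
use std::time::Duration;

// Delays between attempts: initial_delay * factor^(attempt - 1).
// Five attempts means four delays separating them; the first attempt
// runs immediately.
fn backoff_schedule(initial: Duration, factor: f64, max_attempts: u32) -> Vec<Duration> {
    (0..max_attempts.saturating_sub(1))
        .map(|i| initial.mul_f64(factor.powi(i as i32)))
        .collect()
}

fn main() {
    let delays = backoff_schedule(Duration::from_millis(100), 2.0, 5);
    println!("{delays:?}");
}
```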
Two constraints on ctx.run() closures: you cannot use the Restate context (ctx) inside the closure (no state access, no nested run calls, no service calls), and the run call must be immediately awaited before making other context calls.
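The ownership rule behind these constraints can be shown without any Restate types: a journaled step's closure must be `Send + 'static`, so it cannot borrow from the handler's scope. A std-only sketch (`journaled_step` is a stand-in, not an SDK function):

```rust
// Stand-in for a journaled step: the closure must own everything it uses.
fn journaled_step<T, F>(step: F) -> T
where
    F: FnOnce() -> T + Send + 'static,
{
    step()
}

fn main() {
    let order_id = String::from("order-123");
    let id = order_id.clone(); // clone before the closure
    let receipt = journaled_step(move || format!("charged:{id}"));
    // `order_id` is still usable here: the closure owned only the clone.
    println!("{receipt} for {order_id}");
}
```

Borrowing `order_id` directly inside the `move` closure would either fail the `'static` bound or leave the original unusable afterwards; cloning first sidesteps both.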
Terminal errors
Return a TerminalError from a ctx.run() closure to signal a permanent failure that should not be retried. A declined credit card or an invalid request is terminal; a network timeout is not.
use restate_sdk::prelude::*;
async fn charge_payment(order: Order) -> Result<String, HandlerError> {
    // `payment_client` stands in for your payment provider's API client
    let resp = payment_client.charge(&order).await;
match resp {
Ok(charge) => Ok(charge.id),
Err(e) if e.is_retryable() => Err(e.into()),
Err(e) => Err(TerminalError::new(format!("Payment permanently failed: {e}")).into()),
}
}
Durable state
Virtual objects and workflows have access to key/value state that is persisted by Restate. State changes are journaled alongside execution and survive crashes.
ctx.set("status", "processing".to_string());
let status: Option<String> = ctx.get("status").await?;
ctx.clear("status");
Durable timers
ctx.sleep() suspends the handler for a duration. Restate persists the timer. If the process crashes during the sleep, Restate resumes the handler on another invocation when the timer fires.
ctx.sleep(Duration::from_secs(60)).await?;
Workspace layout
The Axum web application and the Restate worker run as separate processes. Shared domain types live in a common crate. This follows the project’s workspace-with-multiple-crates pattern.
your-project/
├── Cargo.toml              (workspace root)
└── crates/
    ├── web/                (Axum web application)
    │   └── Cargo.toml
    ├── worker/             (Restate service worker)
    │   └── Cargo.toml
    └── shared/             (shared domain types)
        └── Cargo.toml
Worker dependencies
# crates/worker/Cargo.toml
[dependencies]
restate-sdk = "0.9"
tokio = { version = "1", features = ["full"] }
tracing = "0.1"
tracing-subscriber = "0.3"
serde = { version = "1", features = ["derive"] }
serde_json = "1"
reqwest = { version = "0.13", features = ["json"] }
redis = { version = "1.0", features = ["tokio-comp"] }
anyhow = "1"
shared = { path = "../shared" }
reqwest is for registering the worker with the Restate server on startup and for side effects that call external APIs. The redis crate is for publishing progress events to Valkey; the SSE infrastructure in the web application picks them up.
Shared types
Define domain types in the shared crate so both the web application and the worker use identical structs:
// crates/shared/src/lib.rs
use serde::{Deserialize, Serialize};
#[derive(Debug, Clone, Serialize, Deserialize)]
pub struct Order {
pub id: String,
pub customer_email: String,
pub items: Vec<OrderItem>,
pub total_cents: u64,
}
#[derive(Debug, Clone, Serialize, Deserialize)]
pub struct OrderItem {
pub product_id: String,
pub quantity: u32,
pub price_cents: u64,
}
#[derive(Debug, Clone, Serialize, Deserialize)]
pub struct FulfilmentResult {
pub order_id: String,
pub payment_id: String,
pub status: String,
}
#[derive(Debug, Clone, Serialize, Deserialize)]
pub struct Progress {
pub percent: u32,
pub message: String,
}
Defining the workflow
An order fulfilment workflow that processes payment, reserves inventory, sends a confirmation email, and reports progress at each stage. The workflow uses ctx.run() for each side effect, ctx.set() to track progress in durable state, and publishes to Valkey for real-time SSE updates.
// crates/worker/src/fulfilment.rs
use restate_sdk::prelude::*;
use shared::{FulfilmentResult, Order, Progress};
#[restate_sdk::workflow]
pub trait OrderFulfilment {
async fn run(order: Json<Order>) -> Result<Json<FulfilmentResult>, HandlerError>;
#[shared]
async fn get_progress() -> Result<Json<Progress>, HandlerError>;
}
pub struct OrderFulfilmentImpl {
pub valkey: redis::Client,
}
impl OrderFulfilment for OrderFulfilmentImpl {
async fn run(
&self,
ctx: WorkflowContext<'_>,
Json(order): Json<Order>,
) -> Result<Json<FulfilmentResult>, HandlerError> {
let order_id = ctx.key().to_string();
// Step 1: Process payment
        report_progress(&ctx, &self.valkey, &order_id, 0, "Processing payment...").await;
let payment_id: String = ctx
.run(|| charge_payment(order.clone()))
.name("charge_payment")
.await?;
// Step 2: Reserve inventory
        report_progress(&ctx, &self.valkey, &order_id, 33, "Reserving inventory...").await;
ctx.run(|| reserve_inventory(order.clone()))
.name("reserve_inventory")
.await?;
// Step 3: Send confirmation email
        report_progress(&ctx, &self.valkey, &order_id, 66, "Sending confirmation...").await;
ctx.run(|| send_confirmation(order.clone(), payment_id.clone()))
.name("send_confirmation")
.await?;
// Step 4: Complete
        report_progress(&ctx, &self.valkey, &order_id, 100, "Order fulfilled.").await;
Ok(Json(FulfilmentResult {
order_id,
payment_id,
status: "fulfilled".to_string(),
}))
}
async fn get_progress(
&self,
ctx: SharedWorkflowContext<'_>,
) -> Result<Json<Progress>, HandlerError> {
        let progress = ctx
            .get::<Json<Progress>>("progress")
            .await?
            .map(|Json(p)| p)
            .unwrap_or(Progress {
                percent: 0,
                message: "Waiting to start...".to_string(),
            });
Ok(Json(progress))
}
}
Each ctx.run() call wraps a side effect function that takes ownership of the data it needs. This is the standard pattern: clone the data before the closure so the closure owns its inputs. The side effect functions are regular async functions that call external services:
async fn charge_payment(order: Order) -> Result<String, anyhow::Error> {
// Call payment provider API (see HTTP Client section for typed client patterns)
let client = reqwest::Client::new();
let resp: serde_json::Value = client
.post("https://payments.example.com/v1/charges")
.json(&serde_json::json!({
"amount": order.total_cents,
"currency": "gbp",
}))
.send()
.await?
.error_for_status()?
.json()
.await?;
Ok(resp["id"].as_str().unwrap_or_default().to_string())
}
async fn reserve_inventory(order: Order) -> Result<(), anyhow::Error> {
// Call inventory service
Ok(())
}
async fn send_confirmation(order: Order, payment_id: String) -> Result<(), anyhow::Error> {
// Send email via Lettre (see Email section)
Ok(())
}
Progress reporting
The report_progress function does two things: it updates durable state (queryable via the get_progress handler) and publishes an event to Valkey for real-time SSE delivery. The durable state is the authoritative source; the Valkey publish is best-effort for pushing updates to the browser.
async fn report_progress(
ctx: &WorkflowContext<'_>,
valkey: &redis::Client,
order_id: &str,
percent: u32,
message: &str,
) {
// Durable state, queryable via get_progress
    ctx.set(
        "progress",
        Json(Progress {
            percent,
            message: message.to_string(),
        }),
    );
// Real-time push to SSE clients via Valkey (best-effort, limited retries)
let valkey = valkey.clone();
let channel = format!("order:{order_id}");
let event_type = if percent >= 100 { "complete" } else { "progress" };
let html = format!(
r#"<div class="progress-bar" style="width: {percent}%">{percent}%</div>
<p>{message}</p>"#
);
let payload = serde_json::json!({
"event_type": event_type,
"data": html,
})
.to_string();
// Fire-and-forget: ignore Valkey publish failures
let _ = ctx
.run(|| async move {
let mut conn = valkey.get_multiplexed_async_connection().await?;
redis::cmd("PUBLISH")
.arg(&channel)
.arg(&payload)
.query_async::<()>(&mut conn)
.await?;
            Ok::<(), HandlerError>(())
})
.retry_policy(RunRetryPolicy::default().max_attempts(2))
.await;
}
The retry policy limits Valkey publish retries to 2 attempts. Progress events are informational; if Valkey is temporarily unreachable, the workflow should continue. The let _ = discards the result so a failed publish never fails the workflow.
The Valkey event payload matches the format established in the Server-Sent Events section: a JSON object with event_type and data fields. The SSE handler parses this format and delivers it to the browser. When percent >= 100, the event type switches to "complete", which triggers sse-close="complete" in the browser and closes the SSE connection.
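The two conventions the SSE side depends on — the channel name and the complete-at-100 rule — are small enough to state as plain functions. A std-only restatement of the logic in report_progress above:

```rust
// The channel the workflow publishes progress events to.
fn progress_channel(order_id: &str) -> String {
    format!("order:{order_id}")
}

// The event-type rule the browser's sse-close="complete" attribute relies on.
fn progress_event_type(percent: u32) -> &'static str {
    if percent >= 100 { "complete" } else { "progress" }
}

fn main() {
    println!("{} -> {}", progress_channel("order-123"), progress_event_type(100));
}
```

Keeping both conventions in one place matters more than it looks: the worker and the SSE handler live in different crates, and a silent mismatch in the channel name means progress events are published into the void.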
Running the worker
The worker binary starts the Restate SDK HTTP server and registers itself with the Restate server on startup.
// crates/worker/src/main.rs
use restate_sdk::prelude::*;
use std::time::Duration;
mod fulfilment;
use fulfilment::OrderFulfilmentImpl;
#[tokio::main]
async fn main() {
tracing_subscriber::fmt::init();
let valkey_url =
std::env::var("VALKEY_URL").unwrap_or_else(|_| "redis://127.0.0.1:6379".to_string());
let valkey = redis::Client::open(valkey_url).expect("invalid VALKEY_URL");
let worker_addr = std::env::var("WORKER_ADDR")
.unwrap_or_else(|_| "0.0.0.0:9080".to_string());
let restate_admin_url = std::env::var("RESTATE_ADMIN_URL")
.unwrap_or_else(|_| "http://127.0.0.1:9070".to_string());
let worker_url = std::env::var("WORKER_URL")
.unwrap_or_else(|_| "http://127.0.0.1:9080".to_string());
let endpoint = Endpoint::builder()
.bind(OrderFulfilmentImpl { valkey }.serve())
.build();
// Start the SDK HTTP server in a background task
let addr = worker_addr.parse().expect("invalid WORKER_ADDR");
tokio::spawn(async move {
HttpServer::new(endpoint).listen_and_serve(addr).await;
});
// Wait for the server to bind
tokio::time::sleep(Duration::from_millis(500)).await;
// Register with the Restate server
register_deployment(&restate_admin_url, &worker_url).await;
tracing::info!("Worker running on {worker_addr}");
// Block until shutdown signal
tokio::signal::ctrl_c().await.unwrap();
}
async fn register_deployment(admin_url: &str, worker_url: &str) {
let client = reqwest::Client::new();
match client
.post(format!("{admin_url}/deployments"))
.json(&serde_json::json!({
"uri": worker_url,
"force": true,
}))
.send()
.await
{
Ok(resp) if resp.status().is_success() => {
tracing::info!("Registered deployment at {worker_url}");
}
Ok(resp) => {
let status = resp.status();
let body = resp.text().await.unwrap_or_default();
tracing::warn!("Restate registration returned {status}: {body}");
}
Err(e) => {
tracing::warn!("Failed to register with Restate: {e}");
}
}
}
The "force": true field in the registration body tells Restate to update the deployment if it already exists. This handles the common development case where you restart the worker after code changes.
Registration calls the Restate admin API on port 9070. Restate performs service discovery automatically: it queries your worker’s endpoint, finds all bound services and their handlers, and registers them. After registration, the services are callable through the Restate ingress on port 8080.
Triggering workflows from Axum handlers
The Axum web application triggers Restate workflows by sending HTTP requests to the Restate ingress. This is a plain reqwest call; no Restate SDK is needed in the web application.
The Restate team is developing a standalone typed client that will be generated from the same trait declarations that define the service. This will replace the raw HTTP calls shown below with compile-time checked method calls. Until then, the ingress HTTP API is the integration point.
URL patterns
| Service type | URL pattern |
|---|---|
| Service | POST /ServiceName/handlerName |
| Virtual Object | POST /ObjectName/{key}/handlerName |
| Workflow (run) | POST /WorkflowName/{workflowId}/run |
| Workflow (query/signal) | POST /WorkflowName/{workflowId}/handlerName |
Append /send to any URL for fire-and-forget invocation. Restate accepts the request immediately and returns an invocation ID. The handler runs in the background.
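Because these patterns are plain string templates, a small helper keeps the format in one place on the Axum side. The function names here are hypothetical (not part of any SDK); the path shapes follow the table above:

```rust
// Build an ingress URL for a workflow handler; `send` appends the
// fire-and-forget suffix.
fn workflow_url(base: &str, workflow: &str, id: &str, handler: &str, send: bool) -> String {
    let suffix = if send { "/send" } else { "" };
    format!("{base}/{workflow}/{id}/{handler}{suffix}")
}

// Build an ingress URL for a stateless service handler.
fn service_url(base: &str, service: &str, handler: &str) -> String {
    format!("{base}/{service}/{handler}")
}

fn main() {
    let url = workflow_url("http://127.0.0.1:8080", "OrderFulfilment", "order-123", "run", true);
    println!("{url}");
}
```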
Starting a workflow
An Axum handler that creates an order, triggers the fulfilment workflow, and returns a confirmation page with live progress:
use axum::{extract::State, response::Html, Form};
async fn create_order(
State(state): State<AppState>,
Form(input): Form<OrderInput>,
) -> Result<Html<String>, AppError> {
// Insert order into the database
let order = insert_order(&state.db, &input).await?;
// Trigger the fulfilment workflow (fire-and-forget)
let resp = state
.http
.post(format!(
"{}/OrderFulfilment/{}/run/send",
state.restate_ingress_url, order.id,
))
.json(&order)
.send()
.await
.map_err(|e| {
tracing::error!(error = ?e, "failed to trigger fulfilment workflow");
AppError::BadGateway("could not start order processing".into())
})?;
if !resp.status().is_success() {
tracing::error!(status = %resp.status(), "Restate rejected workflow");
return Err(AppError::BadGateway("could not start order processing".into()));
}
Ok(Html(render_order_progress(&order)))
}
The /send suffix is critical. Without it, the HTTP call blocks until the entire workflow completes, which could take seconds or minutes. With /send, Restate returns immediately and the workflow runs in the background.
The confirmation page with SSE progress
The rendered page connects to the SSE endpoint established in the Server-Sent Events section. The SSE handler subscribes to the Valkey channel order:{id}, which the workflow publishes progress events to.
use maud::{html, Markup};
fn render_order_progress(order: &Order) -> String {
html! {
h2 { "Order " (order.id) " placed" }
div hx-ext="sse"
sse-connect=(format!("/events/orders/{}/progress", order.id))
sse-close="complete" {
div sse-swap="progress" hx-swap="innerHTML" {
div .progress-bar style="width: 0%" { "0%" }
p { "Starting fulfilment..." }
}
}
}
.into_string()
}
The sse-close="complete" attribute closes the SSE connection when the workflow publishes its final event with event_type: "complete". The full SSE wiring (the /events/orders/{id}/progress endpoint, Valkey subscriber, event delivery) is covered in the Server-Sent Events section.
Synchronous invocation
For cases where you need the workflow result before responding (rare, but sometimes necessary for short-running workflows):
let result: FulfilmentResult = state
.http
.post(format!(
"{}/OrderFulfilment/{}/run",
state.restate_ingress_url, order.id,
))
.json(&order)
.send()
.await?
.error_for_status()?
.json()
.await?;
Without /send, the call blocks until run completes and returns the workflow’s result directly.
Idempotency
Workflows are inherently idempotent: the run handler executes exactly once per workflow ID. Calling /OrderFulfilment/order-123/run twice with the same ID attaches the second call to the existing execution rather than starting a new one. Use the order ID (or another natural identifier) as the workflow ID to get this deduplication for free.
For services (which are not keyed), add an Idempotency-Key header to prevent duplicate processing:
state
.http
.post(format!("{}/EmailSender/send_confirmation", state.restate_ingress_url))
.header("idempotency-key", &order.id)
.json(&email_details)
.send()
.await?;
Restate caches the response for 24 hours, returning the cached result for duplicate calls.
Application state
Add the Restate ingress URL to your Axum application state:
#[derive(Clone)]
pub struct AppState {
pub db: sqlx::PgPool,
pub http: reqwest::Client,
pub valkey: redis::Client,
pub restate_ingress_url: String,
}
Read the URL from an environment variable at startup:
let restate_ingress_url = std::env::var("RESTATE_INGRESS_URL")
    .unwrap_or_else(|_| "http://127.0.0.1:8080".to_string());
Running Restate in development
Restate and Valkey run as Docker containers. The web application and worker run on the host, consistent with the project’s approach of Docker for backing services only.
Add Restate to your Docker Compose file alongside PostgreSQL and Valkey:
# docker-compose.yml
services:
postgres:
image: postgres:17
ports:
- "5432:5432"
environment:
POSTGRES_DB: app
POSTGRES_USER: app
POSTGRES_PASSWORD: app
valkey:
image: valkey/valkey:8
ports:
- "6379:6379"
restate:
image: docker.restate.dev/restatedev/restate:1.6
ports:
- "8080:8080" # ingress (client requests)
- "9070:9070" # admin API + UI
      - "9071:9071" # internal (cluster communication)
Development workflow
- Start the backing services:
docker compose up -d
- Run the worker (registers itself with Restate on startup):
cargo run -p worker
- Run the web application:
cargo run -p web
- Open the Restate UI at http://localhost:9070 to inspect registered services, active invocations, and their journal entries.
When the worker registers with Restate, the Restate server (running in Docker) needs to reach the worker (running on the host). On macOS and Windows, Docker Desktop maps host.docker.internal to the host, so set WORKER_URL=http://host.docker.internal:9080 when the worker registers. On Linux, either map host.docker.internal to the host gateway with an extra_hosts entry on the Restate container, or run the container with host networking.
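On Linux, Docker's special host-gateway value can supply the host.docker.internal mapping without host networking. A sketch of the compose addition, assuming the Restate service from the compose file above:

```yaml
services:
  restate:
    image: docker.restate.dev/restatedev/restate:1.6
    extra_hosts:
      # Resolves host.docker.internal to the host's gateway IP, so
      # WORKER_URL=http://host.docker.internal:9080 works on Linux too.
      - "host.docker.internal:host-gateway"
```

With this in place the same WORKER_URL value works across macOS, Windows, and Linux, which keeps the registration code platform-agnostic.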
Re-registration on code changes
The worker registers with "force": true, so restarting the worker after code changes updates the deployment automatically. If you add or remove handlers, the updated service definitions are picked up on the next registration.
Deploying Restate in production
Restate is a single binary with no external dependencies. It stores its own state in an embedded data directory. Deploy it alongside your application, not as a managed service.
Single-server deployment
On a VPS running Docker Compose, add Restate as another service:
services:
restate:
image: docker.restate.dev/restatedev/restate:1.6
restart: unless-stopped
volumes:
- restate_data:/target/restate-data
ports:
- "8080:8080"
# Admin API should not be publicly exposed
# Access via Tailscale or SSH tunnel
volumes:
restate_data:
Persist the Restate data directory (/target/restate-data inside the container) to a volume. This preserves the journal and state across container restarts.
The admin API (port 9070) should not be exposed to the public internet. Access it through your private network (Tailscale, VPN, or SSH tunnel) for service registration and the UI.
The worker in production
Build the worker as a separate Docker image from the same workspace:
FROM rust:1.85 AS builder
WORKDIR /app
COPY . .
RUN cargo build --release -p worker
FROM debian:bookworm-slim
COPY --from=builder /app/target/release/worker /usr/local/bin/
CMD ["worker"]
The worker registers with Restate on startup, so deploying a new version is: build the image, restart the container, and the new handlers are registered automatically.
Scaling
Restate handles hundreds of concurrent workflow invocations on a single server. Each invocation consumes minimal resources when suspended (waiting on a timer, sleeping, or paused between steps).
If you outgrow a single Restate server, Restate supports clustered deployment with partitioned state and replicated logs. This is a significant operational step. For most applications covered by this guide (content sites, CRUD apps, internal tools), a single Restate instance is sufficient.
Gotchas
Everything non-deterministic must be in ctx.run(). HTTP calls, database queries, reading the current time, generating random numbers. If it produces different results on different executions, wrap it. Forgetting this causes journal replay to diverge, which Restate detects and flags as an error.
Side effect functions must own their data. The closure passed to ctx.run() must be Send + 'static. Clone values before passing them into the closure rather than borrowing from the handler’s scope.
The worker is not your web server. The Restate SDK’s HttpServer serves the Restate protocol, not HTTP for browsers. Keep the Axum web application and the Restate worker as separate binaries. They share types through the workspace, not a runtime.
Registration must happen after code changes. When you add, remove, or rename handlers, the worker must re-register with Restate so it knows the updated service definitions. The auto-registration pattern shown above handles this. If you skip registration after a change, Restate routes requests to the old handler definitions and invocations fail.
Workflow IDs are unique. Calling run on a workflow ID that has already completed or is currently running attaches to the existing execution. If you need to re-run a workflow for the same entity, use a new ID (e.g., order-123-retry-1 or include a timestamp).
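One hypothetical convention for those re-run IDs (the naming scheme is ours, not a Restate feature): derive a fresh workflow ID from the natural key plus an attempt counter.

```rust
// Attempt 0 is the original run under the natural key; later attempts
// get a distinct workflow ID so Restate starts a fresh execution.
fn rerun_workflow_id(base: &str, attempt: u32) -> String {
    if attempt == 0 {
        base.to_string()
    } else {
        format!("{base}-retry-{attempt}")
    }
}

fn main() {
    println!("{}", rerun_workflow_id("order-123", 1));
}
```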
Valkey progress events are fire-and-forget. If the browser is not connected when a progress event is published, that event is lost. The get_progress shared handler provides a durable fallback: the browser can poll it on reconnection or use it as the initial state before the SSE connection is established.
Test with restate-sdk-testcontainers. The restate-sdk-testcontainers crate spins up a Restate server in Docker for integration tests, similar to how testcontainers works for PostgreSQL. Use it to test workflows end-to-end without a persistent Restate instance.
Pin the Restate server version. Use a specific image tag (e.g., restate:1.6) rather than latest. The Restate SDK and server have version compatibility ranges. The Rust SDK v0.9 supports Restate Server 1.3 through 1.6.