AI and LLM Integration

LLM features in a web application (content generation, summarisation, classification, conversational interfaces) are fundamentally HTTP calls to an inference API. The challenge is not the call itself but everything around it: provider abstraction, structured tool use, streaming partial responses to the browser, and surviving failures in calls that are expensive, slow, and rate-limited.

Rig is a Rust library for building LLM-powered applications. It provides a unified interface across providers (Anthropic, OpenAI, Ollama, Gemini, and others), typed tool definitions, streaming support, and an agent abstraction that handles multi-step tool-calling loops. This section covers integrating Rig with Axum handlers, defining tools, streaming responses to the browser via SSE, and making AI workflows durable with Restate.

Dependencies

[dependencies]
rig-core = "0.31"
tokio = { version = "1", features = ["full"] }
serde = { version = "1", features = ["derive"] }
serde_json = "1"
futures-util = "0.3"

Rig includes all providers by default. No feature flags are needed to enable Anthropic, OpenAI, or Ollama support.

Provider setup

Anthropic

Anthropic’s Claude models are the primary provider for the examples in this section. Create a client from the ANTHROPIC_API_KEY environment variable:

use rig::providers::anthropic;

let client = anthropic::Client::from_env();
let agent = client.agent("claude-sonnet-4-20250514")
    .preamble("You are a helpful assistant.")
    .build();

Client::from_env() reads ANTHROPIC_API_KEY from the environment. The model string matches Anthropic’s model ID format. Add the API key to your .env file for local development:

# .env
ANTHROPIC_API_KEY=sk-ant-...

OpenAI

OpenAI is a drop-in alternative. The agent code is identical apart from the client and model name:

use rig::providers::openai;

let client = openai::Client::from_env(); // reads OPENAI_API_KEY
let agent = client.agent("gpt-4o")
    .preamble("You are a helpful assistant.")
    .build();

This is the core value of Rig’s provider abstraction: your application code uses the Prompt, Chat, and StreamingPrompt traits. Swapping providers means changing two lines, not rewriting your handlers.

Ollama for local inference

Ollama runs open-weight models locally. It fits the self-hosted ethos of this stack and is useful for development without burning API credits, for privacy-sensitive workloads, and for running smaller models where latency to a cloud API is unnecessary overhead.

Add Ollama to your Docker Compose alongside other backing services:

# docker-compose.yml
services:
  ollama:
    image: ollama/ollama:latest
    ports:
      - "11434:11434"
    volumes:
      - ollama_data:/root/.ollama

volumes:
  ollama_data:

Pull a model after the container starts:

docker exec -it ollama ollama pull llama3.2

Create a Rig client pointing at the local instance:

use rig::providers::ollama;

let client = ollama::Client::from_env(); // reads OLLAMA_API_BASE_URL, default http://localhost:11434
let agent = client.agent("llama3.2")
    .preamble("You are a helpful assistant.")
    .build();

OLLAMA_API_BASE_URL defaults to http://localhost:11434. No API key is required.

Basic completions in Axum handlers

The simplest integration: an Axum handler that sends a prompt to the LLM and returns the response as HTML.

use axum::{extract::State, response::Html, Form};
use rig::completion::Prompt;
use serde::Deserialize;

#[derive(Deserialize)]
struct SummariseInput {
    text: String,
}

async fn summarise(
    State(state): State<AppState>,
    Form(input): Form<SummariseInput>,
) -> Result<Html<String>, AppError> {
    let prompt = format!(
        "Summarise the following text in 2-3 sentences:\n\n{}",
        input.text
    );

    let summary = state.agent.prompt(&prompt).await.map_err(|e| {
        tracing::error!(error = ?e, "LLM completion failed");
        AppError::BadGateway("AI service unavailable".into())
    })?;

    Ok(Html(format!("<div class=\"summary\">{summary}</div>")))
}

The Prompt trait’s .prompt() method sends a one-shot request and returns the full response as a String. The agent is stored in application state, shared across requests:

use rig::providers::anthropic;

#[derive(Clone)]
pub struct AppState {
    pub db: sqlx::PgPool,
    pub http: reqwest::Client,
    pub agent: rig::agent::Agent<rig::providers::anthropic::completion::CompletionModel>,
}

Building the agent at startup:

let anthropic = anthropic::Client::from_env();

let agent = anthropic
    .agent("claude-sonnet-4-20250514")
    .preamble("You are an assistant that summarises text concisely.")
    .temperature(0.3)
    .build();

let state = AppState {
    db: pool,
    http: reqwest::Client::new(),
    agent,
};

Lower temperature values (0.0 to 0.3) produce more deterministic output, which is appropriate for summarisation, classification, and extraction. Higher values (0.7 to 1.0) produce more creative output for generation tasks.

Chat with history

For multi-turn conversations, the Chat trait accepts a message and a history vector:

use rig::completion::{Chat, Message};

async fn chat(
    State(state): State<AppState>,
    Form(input): Form<ChatInput>,
) -> Result<Html<String>, AppError> {
    // Load chat history from session or database
    let history: Vec<Message> = load_chat_history(&state.db, input.session_id).await?;

    let response = state.agent.chat(&input.message, history).await.map_err(|e| {
        tracing::error!(error = ?e, "chat completion failed");
        AppError::BadGateway("AI service unavailable".into())
    })?;

    // Persist the new exchange
    save_chat_messages(&state.db, input.session_id, &input.message, &response).await?;

    Ok(Html(format!("<div class=\"message assistant\">{response}</div>")))
}

Store chat history in PostgreSQL rather than in-memory. Sessions expire, servers restart, and users expect conversations to persist.

Tool use

LLMs generate text. Tools let them take actions: query a database, call an API, perform calculations, look up current information. The model decides which tool to call and with what arguments, your code executes the tool, and the result feeds back into the model’s next response.

Rig defines tools through the Tool trait. Each tool is a Rust struct with typed arguments, typed output, and a JSON schema that tells the model what the tool does and what parameters it accepts.

Defining a tool

A tool that searches for products in a database:

use rig::tool::{Tool, ToolDyn};
use rig::completion::ToolDefinition;
use serde::{Deserialize, Serialize};
use serde_json::json;

#[derive(Debug, Deserialize)]
struct ProductSearchArgs {
    query: String,
    max_results: Option<u32>,
}

#[derive(Debug, thiserror::Error)]
#[error("product search failed: {0}")]
struct ProductSearchError(String);

#[derive(Serialize, Deserialize)]
struct ProductSearch {
    db: sqlx::PgPool,
}

impl Tool for ProductSearch {
    const NAME: &'static str = "search_products";

    type Error = ProductSearchError;
    type Args = ProductSearchArgs;
    type Output = String;

    async fn definition(&self, _prompt: String) -> ToolDefinition {
        ToolDefinition {
            name: "search_products".to_string(),
            description: "Search the product catalogue by name or description. Returns matching products with prices.".to_string(),
            parameters: json!({
                "type": "object",
                "properties": {
                    "query": {
                        "type": "string",
                        "description": "Search query for product name or description"
                    },
                    "max_results": {
                        "type": "integer",
                        "description": "Maximum number of results to return (default 5)"
                    }
                },
                "required": ["query"]
            }),
        }
    }

    async fn call(&self, args: Self::Args) -> Result<Self::Output, Self::Error> {
        let max = args.max_results.unwrap_or(5) as i64;

        let products = sqlx::query_as!(
            Product,
            r#"
            SELECT id, name, description, price_cents
            FROM products
            WHERE to_tsvector('english', name || ' ' || description) @@ plainto_tsquery('english', $1)
            ORDER BY ts_rank(to_tsvector('english', name || ' ' || description), plainto_tsquery('english', $1)) DESC
            LIMIT $2
            "#,
            args.query,
            max,
        )
        .fetch_all(&self.db)
        .await
        .map_err(|e| ProductSearchError(e.to_string()))?;

        // Return results as a formatted string the model can reason about
        let formatted = products
            .iter()
            .map(|p| format!("- {} ({}): {}", p.name, format_price(p.price_cents), p.description))
            .collect::<Vec<_>>()
            .join("\n");

        if formatted.is_empty() {
            Ok("No products found matching the search query.".to_string())
        } else {
            Ok(formatted)
        }
    }
}

Key points about the Tool trait:

NAME: a static string identifier the model uses to invoke the tool.
Args: a deserializable struct. Rig parses the model’s JSON arguments into this type automatically.
Output: a serialisable type returned to the model. Strings work well because the model consumes the result as text.
definition(): returns a JSON Schema that describes the tool’s purpose and parameters. The model uses this to decide when and how to call the tool.
call(): the actual implementation. This is regular Rust code, so it can query databases, call APIs, read files, or do anything else.

Wiring tools into an agent

Build an agent with tools attached:

use rig::tool::ToolDyn;

let product_search = ProductSearch { db: pool.clone() };

let tools: Vec<Box<dyn ToolDyn>> = vec![Box::new(product_search)];

let agent = anthropic
    .agent("claude-sonnet-4-20250514")
    .preamble(
        "You are a shopping assistant. Use the search_products tool to find products \
         that match what the customer is looking for. Provide helpful recommendations \
         based on the search results."
    )
    .tools(tools)
    .max_tokens(1024)
    .build();

When the user prompts this agent, the model can decide to call search_products with appropriate arguments. Rig handles the loop automatically: it sends the prompt, receives a tool call, executes the tool, sends the result back to the model, and returns the final text response. A single .prompt() call can involve multiple round trips between your code and the model.

// The agent calls search_products internally, then responds with recommendations
let response = agent.prompt("I need a waterproof jacket for hiking").await?;

Multiple tools

Agents can use multiple tools. Define each tool separately and pass them all to the builder:

let tools: Vec<Box<dyn ToolDyn>> = vec![
    Box::new(ProductSearch { db: pool.clone() }),
    Box::new(OrderLookup { db: pool.clone() }),
    Box::new(InventoryCheck { http: http_client.clone() }),
];

let agent = anthropic
    .agent("claude-sonnet-4-20250514")
    .preamble("You are a customer service agent. You can search products, look up orders, and check inventory.")
    .tools(tools)
    .build();

The model chooses which tools to call based on the user’s query and the tool descriptions. Good tool descriptions are critical: the model relies on the description field in ToolDefinition to understand when each tool is appropriate.

Streaming LLM responses via SSE

LLM responses arrive token by token. Streaming them to the browser as they generate gives the user immediate feedback instead of a blank screen followed by a wall of text. Rig’s StreamingPrompt trait produces a stream of chunks that you can convert into Axum SSE events.

use axum::response::sse::{Event, KeepAlive, Sse};
use futures_util::{Stream, StreamExt};
use rig::streaming::StreamingPrompt;
use std::convert::Infallible;

async fn stream_response(
    State(state): State<AppState>,
    Form(input): Form<PromptInput>,
) -> Sse<impl Stream<Item = Result<Event, Infallible>>> {
    let prompt = input.prompt.clone();
    let agent = state.agent.clone();

    let stream = async_stream::stream! {
        match agent.stream_prompt(&prompt).await {
            Ok(mut completion_stream) => {
                let mut stream = completion_stream.stream().await.unwrap();
                while let Some(chunk) = stream.next().await {
                    match chunk {
                        Ok(rig::streaming::StreamedAssistantContent::Text(text)) => {
                            let html = format!("<span>{}</span>", text.text);
                            yield Ok(Event::default().event("chunk").data(html));
                        }
                        Ok(_) => {} // tool calls, usage data
                        Err(e) => {
                            tracing::error!(error = ?e, "stream error");
                            yield Ok(
                                Event::default()
                                    .event("error")
                                    .data("<span class=\"error\">Generation failed</span>"),
                            );
                            break;
                        }
                    }
                }
                // Signal completion
                yield Ok(Event::default().event("done").data("<span class=\"done\"></span>"));
            }
            Err(e) => {
                tracing::error!(error = ?e, "failed to start stream");
                yield Ok(
                    Event::default()
                        .event("error")
                        .data("<span class=\"error\">AI service unavailable</span>"),
                );
            }
        }
    };

    Sse::new(stream).keep_alive(KeepAlive::default())
}

The handler returns Sse<impl Stream>, which Axum sends as Content-Type: text/event-stream. Each text chunk from the model becomes an SSE event with an HTML fragment as its data.

On the browser side, htmx’s SSE extension consumes the events and swaps them into the page. The full SSE-to-htmx wiring (event subscription, sse-swap, connection lifecycle) is covered in the Server-Sent Events section. The relevant htmx markup:

<div hx-ext="sse"
     sse-connect="/ai/stream"
     sse-close="done">
    <div id="response" sse-swap="chunk" hx-swap="beforeend">
    </div>
</div>

sse-swap="chunk" appends each chunk event’s data to the target div. sse-close="done" closes the SSE connection when the stream completes.

Escaping HTML in streamed output

LLM output may contain characters that break HTML (<, >, &). If you render the output as raw HTML, you must escape it. Maud’s PreEscaped type handles this, but since streaming bypasses Maud’s template rendering, escape manually:

fn escape_html(s: &str) -> String {
    s.replace('&', "&amp;")
        .replace('<', "&lt;")
        .replace('>', "&gt;")
        .replace('"', "&quot;")
}

// In the stream loop:
let html = format!("<span>{}</span>", escape_html(&text.text));

If you want the model to produce HTML (e.g., for formatted responses), sanitise the output instead of escaping it. Use a library like ammonia to strip dangerous tags while preserving safe formatting.

Durable AI workflows with Restate

LLM calls are expensive, slow (seconds, not milliseconds), and rate-limited. A crashed process that loses a partially complete AI workflow wastes money and time. Wrapping AI calls in Restate gives you automatic retries, exactly-once execution, and crash recovery for every step.

The pattern: each LLM call goes inside a ctx.run() closure. If the process crashes after the call completes but before the next step starts, Restate replays from the journal and skips the completed call, returning the stored result without re-invoking the model.

A content generation workflow

A workflow that generates a product description, translates it, and stores the results. Each step is independently durable.

// crates/worker/src/content_gen.rs
use restate_sdk::prelude::*;
use rig::completion::Prompt;
use rig::providers::anthropic;
use serde::{Deserialize, Serialize};

#[derive(Debug, Clone, Serialize, Deserialize)]
pub struct ContentRequest {
    pub product_id: String,
    pub product_name: String,
    pub product_details: String,
    pub target_languages: Vec<String>,
}

#[derive(Debug, Clone, Serialize, Deserialize)]
pub struct ContentResult {
    pub product_id: String,
    pub description: String,
    pub translations: Vec<Translation>,
}

#[derive(Debug, Clone, Serialize, Deserialize)]
pub struct Translation {
    pub language: String,
    pub text: String,
}

#[restate_sdk::workflow]
pub trait ContentGeneration {
    async fn run(request: Json<ContentRequest>) -> Result<Json<ContentResult>, HandlerError>;
    #[shared]
    async fn get_status() -> Result<String, HandlerError>;
}

pub struct ContentGenerationImpl;

impl ContentGeneration for ContentGenerationImpl {
    async fn run(
        &self,
        ctx: WorkflowContext<'_>,
        Json(request): Json<ContentRequest>,
    ) -> Result<Json<ContentResult>, HandlerError> {
        ctx.set("status", "Generating description...".to_string());

        // Step 1: Generate the product description
        let description: String = ctx
            .run(|| generate_description(request.clone()))
            .name("generate_description")
            .await?;

        // Step 2: Translate into each target language
        let mut translations = Vec::new();
        for language in &request.target_languages {
            ctx.set("status", format!("Translating to {language}..."));

            let translation: String = ctx
                .run(|| translate_text(description.clone(), language.clone()))
                .name(&format!("translate_{language}"))
                .await?;

            translations.push(Translation {
                language: language.clone(),
                text: translation,
            });
        }

        ctx.set("status", "Complete".to_string());

        Ok(Json(ContentResult {
            product_id: request.product_id,
            description,
            translations,
        }))
    }

    async fn get_status(
        &self,
        ctx: SharedWorkflowContext<'_>,
    ) -> Result<String, HandlerError> {
        Ok(ctx
            .get::<String>("status")
            .await?
            .unwrap_or_else(|| "Waiting to start...".to_string()))
    }
}

async fn generate_description(request: ContentRequest) -> Result<String, anyhow::Error> {
    let client = anthropic::Client::from_env();
    let agent = client
        .agent("claude-sonnet-4-20250514")
        .preamble(
            "You are a copywriter. Write a compelling product description \
             in 2-3 paragraphs. Be specific and highlight key features."
        )
        .temperature(0.7)
        .build();

    let prompt = format!(
        "Write a product description for: {}\n\nDetails: {}",
        request.product_name, request.product_details
    );

    Ok(agent.prompt(&prompt).await?)
}

async fn translate_text(text: String, language: String) -> Result<String, anyhow::Error> {
    let client = anthropic::Client::from_env();
    let agent = client
        .agent("claude-sonnet-4-20250514")
        .preamble(&format!(
            "You are a translator. Translate the following text into {language}. \
             Preserve the tone and style of the original."
        ))
        .temperature(0.3)
        .build();

    Ok(agent.prompt(&text).await?)
}

Each ctx.run() call wraps one LLM invocation. The side effect functions create their own Rig clients because Restate closures must be Send + 'static, which means they cannot borrow the handler’s context. Creating an Anthropic client is cheap (it is just an HTTP client with credentials), so this overhead is negligible compared to the LLM call itself.

If the worker crashes after generating the description but before the translations, Restate restarts the workflow and replays from the journal. The description step returns its stored result without calling the model again, and execution resumes with the first translation.

Triggering the workflow from Axum

Fire-and-forget from an Axum handler, with the workflow running in the background:

async fn generate_content(
    State(state): State<AppState>,
    Form(input): Form<ContentInput>,
) -> Result<Html<String>, AppError> {
    let request = ContentRequest {
        product_id: input.product_id.clone(),
        product_name: input.product_name,
        product_details: input.product_details,
        target_languages: vec!["fr".into(), "de".into(), "es".into()],
    };

    state
        .http
        .post(format!(
            "{}/ContentGeneration/{}/run/send",
            state.restate_ingress_url, input.product_id,
        ))
        .json(&request)
        .send()
        .await
        .map_err(|e| {
            tracing::error!(error = ?e, "failed to trigger content generation");
            AppError::BadGateway("could not start content generation".into())
        })?;

    Ok(Html(render_generation_progress(&input.product_id)))
}

The /send suffix makes the call fire-and-forget. The Restate workflow runs durably in the background. The rendered page can use SSE to display progress updates, following the same pattern shown in the Background Jobs section.

When to use Restate for AI calls

Wrap LLM calls in Restate when:

The call is part of a multi-step workflow where earlier steps are expensive to repeat
The result will be stored (database write, file creation) and losing it means re-running the model
You are chaining multiple model calls where later calls depend on earlier results
The operation is user-initiated and the user expects it to complete even if the server restarts

Skip Restate for:

Single low-latency completions served directly in the HTTP response (the basic handler pattern above)
Streaming responses where the user sees output in real time and can retry if it fails
Development and experimentation where durability adds friction

Prompt management

Hardcoded prompt strings work for simple cases. As your application grows, prompts need structure.

Preamble as configuration

Store system prompts in configuration rather than code. This lets you adjust model behaviour without redeploying:

#[derive(Clone)]
pub struct AiConfig {
    pub model: String,
    pub summarise_preamble: String,
    pub chat_preamble: String,
    pub temperature: f64,
}

impl AiConfig {
    pub fn from_env() -> Self {
        Self {
            model: std::env::var("AI_MODEL")
                .unwrap_or_else(|_| "claude-sonnet-4-20250514".to_string()),
            summarise_preamble: std::env::var("AI_SUMMARISE_PREAMBLE")
                .unwrap_or_else(|_| "You summarise text concisely in 2-3 sentences.".to_string()),
            chat_preamble: std::env::var("AI_CHAT_PREAMBLE")
                .unwrap_or_else(|_| "You are a helpful assistant.".to_string()),
            temperature: std::env::var("AI_TEMPERATURE")
                .ok()
                .and_then(|v| v.parse().ok())
                .unwrap_or(0.3),
        }
    }
}

Build agents from the configuration at startup:

let ai_config = AiConfig::from_env();
let anthropic = anthropic::Client::from_env();

let summarise_agent = anthropic
    .agent(&ai_config.model)
    .preamble(&ai_config.summarise_preamble)
    .temperature(ai_config.temperature)
    .build();

Prompt templates

For prompts that combine fixed instructions with dynamic data, format strings are sufficient:

let prompt = format!(
    "Classify the following support ticket into one of these categories: \
     billing, technical, account, other.\n\n\
     Respond with only the category name.\n\n\
     Ticket: {ticket_text}"
);

For more complex templates with conditional sections, build the prompt string with standard Rust string manipulation. There is no need for a dedicated templating engine for prompts. Rig’s .context() method on the agent builder is another option for injecting dynamic context alongside the preamble:

let agent = anthropic
    .agent("claude-sonnet-4-20250514")
    .preamble("You answer questions about the user's order history.")
    .context(&format!("Customer name: {}\nAccount since: {}", name, since))
    .build();

Context documents are sent alongside the preamble in every request, giving the model additional information without modifying the system prompt.

Retrieval-augmented generation

Retrieval-augmented generation (RAG) grounds an LLM’s answers in your data. Instead of relying on the model’s training data alone, you retrieve relevant documents from your database and include them in the prompt as context. The model answers based on what you provided, reducing hallucination and keeping responses current.

The pattern has three steps: embed the user’s query into a vector, search your database for similar documents, and inject the results into the prompt alongside the question.

The retrieval step

The Semantic Search section covers pgvector setup, embedding generation with Ollama, and similarity queries with SQLx. The functions below come directly from that section:

generate_embeddings() converts text into vectors via Ollama’s /api/embed endpoint
semantic_search() finds the most similar documents by cosine distance

If your application needs better retrieval quality, swap semantic_search() for the hybrid_search() function from the same section, which combines vector similarity with full-text search using Reciprocal Rank Fusion.

Building a RAG handler

Retrieve context, format it, and pass it to the agent in a single Axum handler:

use axum::{extract::State, response::Html, Form};
use rig::completion::Prompt;
use serde::Deserialize;

#[derive(Deserialize)]
struct AskInput {
    question: String,
}

async fn ask_with_context(
    State(state): State<AppState>,
    Form(input): Form<AskInput>,
) -> Result<Html<String>, AppError> {
    // Step 1: Embed the question
    let embeddings = generate_embeddings(
        &state.http,
        &state.config.ollama_url,
        &[&input.question],
    )
    .await
    .map_err(|e| {
        tracing::error!(error = ?e, "embedding generation failed");
        AppError::BadGateway("embedding service unavailable".into())
    })?;

    let query_embedding = embeddings
        .into_iter()
        .next()
        .ok_or_else(|| AppError::Internal("no embedding returned".into()))?;

    // Step 2: Retrieve relevant documents
    let documents = semantic_search(&state.db, query_embedding, 5).await?;

    // Step 3: Format context and build the prompt
    let context = documents
        .iter()
        .map(|doc| format!("## {}\n{}", doc.title, doc.content))
        .collect::<Vec<_>>()
        .join("\n\n");

    let prompt = format!(
        "Answer the question using only the provided documents. \
         If the documents do not contain enough information, say so.\n\n\
         {context}\n\n\
         Question: {}",
        input.question
    );

    let answer = state.agent.prompt(&prompt).await.map_err(|e| {
        tracing::error!(error = ?e, "RAG completion failed");
        AppError::BadGateway("AI service unavailable".into())
    })?;

    Ok(Html(format!(
        "<div class=\"answer\">{}</div>",
        escape_html(&answer)
    )))
}

The agent used here is the same one built at startup and stored in AppState, as shown in the basic completions section. The only difference is that the prompt now includes retrieved documents as context.

Context window management

Retrieved documents consume input tokens. A pgvector query returning five documents of 500 words each adds roughly 3,000 to 4,000 tokens to the prompt. Monitor this budget:

let max_context_chars = 8_000;
let context = if context.len() > max_context_chars {
    let truncated = &context[..max_context_chars];
    truncated
        .rfind("\n\n")
        .map(|pos| &truncated[..pos])
        .unwrap_or(truncated)
        .to_string()
} else {
    context
};

For large document sets, retrieve more candidates than you need (e.g., 10 to 20) and include only those that fit within your token budget. The similarity score from semantic_search() helps here: set a minimum threshold (e.g., 0.7) and discard documents below it.

Alternative: Rig’s dynamic_context

Rig provides a built-in RAG mechanism through .dynamic_context() on the agent builder. Combined with the rig-postgres companion crate, which implements VectorStoreIndex for pgvector, you can wire retrieval directly into the agent:

// Using rig-postgres (requires its own table schema)
let vector_store = PostgresVectorStore::default(embedding_model, pool);
let index = vector_store.index(embedding_model);

let agent = anthropic
    .agent("claude-sonnet-4-20250514")
    .preamble("Answer questions using the provided context.")
    .dynamic_context(5, index)
    .build();

With .dynamic_context(5, index), Rig automatically retrieves the top 5 similar documents before every prompt and injects them as context. This is convenient but less flexible: you cannot use hybrid search, you cannot filter results by similarity threshold, and rig-postgres requires its own table schema (id uuid, document jsonb, embedded_text text, embedding vector(N)) that differs from the typed columns established in the Semantic Search section. The manual approach gives you full control over retrieval and context formatting.

Agentic retrieval

Standard RAG retrieves context on every query regardless of whether the query needs it. “What is 2 + 2?” triggers a vector search that returns irrelevant results and wastes tokens. Agentic retrieval inverts this: the LLM decides when to search, what to search for, and whether to search again with a refined query.

This is a direct application of the Tool trait covered in the tool use section. Define a tool that wraps semantic search, attach it to an agent, and let the model decide when retrieval is appropriate.

A search tool

use rig::tool::Tool;
use rig::completion::ToolDefinition;
use serde::{Deserialize, Serialize};
use serde_json::json;

#[derive(Debug, Deserialize)]
struct SearchArgs {
    query: String,
    max_results: Option<u32>,
}

#[derive(Debug, thiserror::Error)]
#[error("knowledge base search failed: {0}")]
struct SearchError(String);

#[derive(Serialize, Deserialize)]
struct KnowledgeBaseSearch {
    db: sqlx::PgPool,
    http: reqwest::Client,
    ollama_url: String,
}

impl Tool for KnowledgeBaseSearch {
    const NAME: &'static str = "search_knowledge_base";

    type Error = SearchError;
    type Args = SearchArgs;
    type Output = String;

    async fn definition(&self, _prompt: String) -> ToolDefinition {
        ToolDefinition {
            name: "search_knowledge_base".to_string(),
            description: "Search the knowledge base for documents relevant to a query. \
                Use this when you need factual information to answer a question. \
                You can call this tool multiple times with different or refined \
                queries if the initial results are insufficient."
                .to_string(),
            parameters: json!({
                "type": "object",
                "properties": {
                    "query": {
                        "type": "string",
                        "description": "Natural language search query describing what information you need"
                    },
                    "max_results": {
                        "type": "integer",
                        "description": "Maximum number of documents to return (default 5)"
                    }
                },
                "required": ["query"]
            }),
        }
    }

    async fn call(&self, args: Self::Args) -> Result<Self::Output, Self::Error> {
        let limit = args.max_results.unwrap_or(5) as i64;

        let embeddings =
            generate_embeddings(&self.http, &self.ollama_url, &[&args.query])
                .await
                .map_err(|e| SearchError(e.to_string()))?;

        let query_embedding = embeddings
            .into_iter()
            .next()
            .ok_or_else(|| SearchError("no embedding returned".into()))?;

        let results = semantic_search(&self.db, query_embedding, limit)
            .await
            .map_err(|e| SearchError(e.to_string()))?;

        if results.is_empty() {
            return Ok("No relevant documents found.".to_string());
        }

        let formatted = results
            .iter()
            .map(|doc| {
                format!(
                    "## {} (relevance: {:.0}%)\n{}",
                    doc.title,
                    doc.similarity * 100.0,
                    doc.content
                )
            })
            .collect::<Vec<_>>()
            .join("\n\n");

        Ok(formatted)
    }
}

The tool wraps the same generate_embeddings() and semantic_search() functions from the Semantic Search section. The model receives the formatted results as text and reasons about them.

Two details in the tool definition matter for multi-turn retrieval:

The description explicitly tells the model it can call the tool multiple times with different queries. Without this, models tend to search once and work with whatever comes back.
Including the relevance percentage in the output helps the model judge whether the results are useful or whether a refined search is warranted.

Building the agent

use rig::tool::ToolDyn;

let search_tool = KnowledgeBaseSearch {
    db: pool.clone(),
    http: reqwest::Client::new(),
    ollama_url: config.ollama_url.clone(),
};

let tools: Vec<Box<dyn ToolDyn>> = vec![Box::new(search_tool)];

let agent = anthropic
    .agent("claude-sonnet-4-20250514")
    .preamble(
        "You are a knowledge assistant. You have access to a search tool that \
         queries the knowledge base. Use it when you need factual information to \
         answer a question. If your first search does not return relevant results, \
         try rephrasing the query or searching for related terms. When you have \
         enough information, answer the question directly. If you cannot find the \
         answer after searching, say so."
    )
    .tools(tools)
    .max_tokens(1024)
    .build();

The preamble instructs the model to search selectively and refine when needed. A single .prompt() call can trigger multiple search rounds: the model calls the tool, reads the results, decides they are too broad, calls the tool again with a more specific query, and synthesises an answer from the combined results. Rig manages this loop automatically, as described in the tool use section.

Wiring into an Axum handler

async fn ask_agent(
    State(state): State<AppState>,
    Form(input): Form<AskInput>,
) -> Result<Html<String>, AppError> {
    let answer = state.rag_agent.prompt(&input.question).await.map_err(|e| {
        tracing::error!(error = ?e, "agentic retrieval failed");
        AppError::BadGateway("AI service unavailable".into())
    })?;

    Ok(Html(format!(
        "<div class=\"answer\">{}</div>",
        escape_html(&answer)
    )))
}

The handler is simpler than the manual RAG handler because the agent manages retrieval internally. The trade-off is less control: you cannot inspect or filter the retrieved documents before they reach the model, and each query may trigger zero, one, or several search tool calls depending on the model’s judgement.

RAG vs agentic retrieval

Use standard RAG when:

Every query needs context from the knowledge base (e.g., a documentation Q&A system)
You want deterministic retrieval: same query always retrieves the same documents
You need to control exactly which documents the model sees

Use agentic retrieval when:

Queries vary widely and not all require retrieval (e.g., a general assistant that sometimes needs to look things up)
The model benefits from refining its search strategy based on initial results
You want the agent to combine multiple searches to answer complex questions

Both approaches can be made durable with Restate using the same patterns shown in the durable AI workflows section.

Gotchas

LLM calls are slow. A typical completion takes 1 to 10 seconds. Do not call them synchronously in a request that the user is waiting on unless you are streaming the response. For non-streaming use cases, trigger a Restate workflow and show progress.

Token limits are real. Each model has input and output token limits. If your prompt plus context exceeds the input limit, the API returns an error. Track prompt sizes, especially when injecting user-provided content or database results. Use .max_tokens() on the agent builder to cap output length.

Rate limits vary by provider. Anthropic, OpenAI, and other providers enforce rate limits on tokens per minute and requests per minute. Handle 429 Too Many Requests errors gracefully. Restate’s retry logic helps here: if a rate limit error is retryable, the journaled side effect retries automatically with backoff.

Model output is not safe HTML. Never insert raw LLM output into an HTML page without escaping or sanitising. Models can produce arbitrary text, including strings that look like HTML tags or script injections. Escape by default, sanitise only when you explicitly want formatted output.

Tool definitions need good descriptions. The model decides whether to call a tool based on the description field in ToolDefinition and the parameter descriptions. Vague descriptions lead to the model not calling tools when it should, or calling them with wrong arguments. Write descriptions as if explaining the tool to a colleague who will use it without seeing the implementation.

Rig creates HTTP clients internally. Each Rig provider client manages its own HTTP connection pool. This is separate from the reqwest::Client you use for other external API calls. Do not try to share a single reqwest::Client across both Rig and your own HTTP calls.

Ollama model availability. Ollama models must be pulled before use. If the model is not available locally, the API call fails. Pull models as part of your development setup, not at application startup. For production Ollama deployments, pre-bake models into the container image or volume.

Provider-specific features. Some features (vision, extended thinking, tool use with streaming) vary by provider and model. Test your specific provider/model combination. Rig’s unified interface covers the common surface, but edge cases may behave differently across providers.

Retrieved context is a prompt injection surface. In RAG, retrieved documents become part of the prompt. If your documents contain adversarial text (e.g., “Ignore previous instructions and…”), the model may follow it. This is a fundamental limitation of injecting external content into prompts. Sanitise stored content if it originates from untrusted sources, and do not treat model output from RAG queries as trusted.