Streaming

Real per-token streaming via LanguageModel::stream_chunks (RFC 0001).

Forge's primary generation API is LanguageModel::stream_chunks, which yields discrete events as the upstream provider produces them. Text deltas reach your code byte-for-byte as they arrive on the wire: no buffering, no re-tokenization, no added delay.

This page documents the API design from RFC 0001.

The trait

use std::pin::Pin;
use futures_util::Stream;

pub trait LanguageModel: Send + Sync {
    /// Primary API: yields chunks as the upstream provider produces them.
    async fn stream_chunks(
        &self,
        messages: &[ModelMessage],
        tools: &[ToolDefinition],
        options: &GenerateOptions,
    ) -> ForgeResult<ChunkStream<'static>>;

    // Legacy, deprecated: buffers the full response before returning.
    #[deprecated]
    async fn stream(&self, ...) -> ForgeResult<Vec<StreamChunk>>;
}

pub type ChunkStream<'a> =
    Pin<Box<dyn Stream<Item = ForgeResult<StreamChunk>> + Send + 'a>>;

stream_chunks is the primary API. The legacy stream method is deprecated and will be removed in v0.3.0.
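
Migrating off the legacy method is mechanical: instead of awaiting a fully collected Vec and iterating it, await the stream and poll it. A minimal before/after sketch, assuming stream took the same arguments as stream_chunks and with handle standing in for your existing per-chunk logic:

use futures_util::StreamExt;

// Before (deprecated): everything is buffered before the loop starts.
// let chunks = model.stream(&messages, &[], &options).await?;
// for chunk in chunks { handle(chunk); }

// After: chunks are handled as the provider produces them.
let mut chunks = model.stream_chunks(&messages, &[], &options).await?;
while let Some(chunk) = chunks.next().await {
    handle(chunk?);
}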

StreamChunk events

pub enum StreamChunk {
    /// Incremental text emitted by the model.
    TextDelta {
        text: String,
    },
    /// A tool call has started; input_partial may carry the first bytes of its input.
    ToolCallStart {
        id: String,
        name: String,
        input_partial: String,
    },
    /// Incremental JSON arguments for an in-flight tool call.
    ToolCallDelta {
        id: String,
        name: String,
        arguments_delta: String,
    },
    /// The tool call identified by id is complete.
    ToolCallEnd {
        id: String,
    },
    /// Incremental reasoning text, where the provider exposes it.
    Thinking {
        delta: String,
    },
    /// Terminal event: token usage and the reason generation stopped.
    Done {
        usage: Usage,
        finish_reason: FinishReason,
    },
}

Most consumers care about TextDelta and Done:

use std::io::{self, Write};
use futures_util::StreamExt;

let mut stream = model.stream_chunks(&messages, &[], &options).await?;
while let Some(item) = stream.next().await {
    match item? {
        StreamChunk::TextDelta { text } => {
            print!("{text}");
            io::stdout().flush().ok(); // make each delta visible immediately
        }
        StreamChunk::Done { usage, .. } => {
            println!("\n[{} prompt + {} completion tokens]",
                usage.prompt_tokens, usage.completion_tokens);
            break;
        }
        _ => {}
    }
}
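
Tool-call events follow a start, delta, end lifecycle keyed by id. Here is a minimal sketch of reassembling a call's arguments; it assumes the arguments_delta fragments concatenate into the call's complete JSON input:

use std::collections::HashMap;
use futures_util::StreamExt;

// Accumulate tool-call arguments keyed by call id.
let mut calls: HashMap<String, String> = HashMap::new();
let mut stream = model.stream_chunks(&messages, &tools, &options).await?;
while let Some(item) = stream.next().await {
    match item? {
        StreamChunk::ToolCallStart { id, input_partial, .. } => {
            // The first fragment of the input may arrive with the start event.
            calls.insert(id, input_partial);
        }
        StreamChunk::ToolCallDelta { id, arguments_delta, .. } => {
            calls.entry(id).or_default().push_str(&arguments_delta);
        }
        StreamChunk::ToolCallEnd { id } => {
            if let Some(args) = calls.get(&id) {
                println!("tool call {id} finished with arguments {args}");
            }
        }
        StreamChunk::Done { .. } => break,
        _ => {}
    }
}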

Provider implementations

Forge providers split into two families:

Native streaming (per-token)

These providers parse the upstream SSE stream incrementally and emit StreamChunks as bytes arrive:

  • Anthropic: forge-provider-anthropic::SseDecoder + ChunkEmitter
  • OpenAI: forge-provider-openai::OpenAiSseDecoder
  • Google: forge-provider-google::GoogleSseDecoder

Each decoder is byte-resilient — it handles SSE events that arrive split across arbitrary chunk boundaries. The Anthropic decoder includes a property-based test that splits the upstream wire at every byte position and verifies parity with the buffered decode.
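
The buffering idea behind byte-resilience is small: accumulate wire bytes and only emit an event once its terminating blank line has arrived. Below is a simplified sketch of that idea and of the exhaustive split test, not Forge's actual decoder (SseSplitter is a hypothetical name, and real SSE also allows \r\n line endings and event:/multi-line data: fields):

/// Hypothetical illustration of the buffering idea, not Forge's decoder.
struct SseSplitter {
    buf: Vec<u8>,
}

impl SseSplitter {
    fn new() -> Self {
        Self { buf: Vec::new() }
    }

    /// Feed an arbitrary slice of wire bytes; return every SSE event
    /// (a "\n\n"-terminated block) completed by this chunk.
    fn feed(&mut self, chunk: &[u8]) -> Vec<String> {
        self.buf.extend_from_slice(chunk);
        let mut events = Vec::new();
        // Scan for the blank line that terminates an SSE event.
        while let Some(pos) = self.buf.windows(2).position(|w| w == b"\n\n") {
            let event: Vec<u8> = self.buf.drain(..pos + 2).collect();
            events.push(String::from_utf8_lossy(&event[..pos]).into_owned());
        }
        events
    }
}

#[test]
fn split_at_every_byte_matches_buffered_decode() {
    let wire = b"data: {\"text\":\"hi\"}\n\ndata: {\"text\":\" there\"}\n\n";
    let buffered = SseSplitter::new().feed(wire);
    // Split the wire at every byte position and verify parity.
    for split in 0..=wire.len() {
        let mut decoder = SseSplitter::new();
        let mut events = decoder.feed(&wire[..split]);
        events.extend(decoder.feed(&wire[split..]));
        assert_eq!(events, buffered);
    }
}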

Buffered shims (uniform surface)

These providers wait for the full response and then chunk it locally to keep the trait surface uniform:

  • forge-provider-litellm
  • forge-provider-foundry
  • forge-provider-codex (Codex CLI bridge)
  • forge-provider-claude-code (Claude Code CLI bridge)

These produce the same StreamChunk events — the only difference is that deltas arrive in larger batches. Native streaming for these is tracked in follow-up issues.
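
Until then, a buffered shim only has to wrap the completed response in a ready-made stream. A sketch of the pattern under assumed names (BufferedProvider, complete_text, and the response fields are illustrative stand-ins, not Forge APIs):

use futures_util::stream;

// Sketch of a buffered shim: fetch the whole response, then replay it
// through the same StreamChunk surface. `complete_text` is a stand-in
// for the provider's non-streaming request.
async fn stream_chunks_buffered(
    provider: &BufferedProvider,
    messages: &[ModelMessage],
) -> ForgeResult<ChunkStream<'static>> {
    let response = provider.complete_text(messages).await?;
    let chunks = vec![
        Ok(StreamChunk::TextDelta { text: response.text }),
        Ok(StreamChunk::Done {
            usage: response.usage,
            finish_reason: response.finish_reason,
        }),
    ];
    // stream::iter yields the buffered items immediately; consumers see
    // the same event sequence, just in one large batch.
    Ok(Box::pin(stream::iter(chunks)))
}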

End-to-end terminal streaming

The harness-sdk-forge bridge consumes stream_chunks natively and yields harness_sdk::ResponseChunk events into a Codex-style TUI. Text reaches the terminal as the upstream wire produces it:

use std::sync::Arc;
use harness_sdk_forge::prelude::*;

let adapter: Arc<dyn AgentAdapter> = Arc::new(ForgeAdapter::new(Arc::new(agent)));
AgentApp::builder().adapter(adapter).build()?.run().await?;

See Harness integration.

Convenience helpers

forge-generate exposes higher-level streaming on top of stream_chunks:

use futures_util::StreamExt;
use forge::generate::{stream_text, stream_text_chunks, TextStreamResult};

// Yields just text deltas (StreamChunk::TextDelta unwrapped to String).
let mut text_stream = stream_text_chunks(&model, prompt, &options).await?;
while let Some(delta) = text_stream.next().await {
    print!("{}", delta?);
}

// Buffered convenience that fully consumes the stream.
let result: TextStreamResult = stream_text(&model, prompt, &options).await?;
println!("{}", result.text);

Why this matters

Before RFC 0001, the trait method stream returned a fully collected Vec<StreamChunk>: the name implied streaming, but the surface buffered the entire response. Consumers built terminal UIs that appeared to stream by re-emitting the buffered chunks at fixed intervals, hiding the actual upstream wire-level latency.

stream_chunks returns a real Stream. Every consumer up the chain — the agent's tool loop, forge-generate::stream_text_chunks, the harness adapter, your terminal widget — now reflects the upstream producer's pace.

Next