Durable Streams: The Infrastructure Challenge for Long-Running Agents

Building systems that can execute for hours, days, or weeks without losing state

Derek Parham

Contributor · Oct 20, 2018

There's a clip of Dario Amodei saying agents will be able to work on tasks for "weeks or months at a time." The capability discourse has accepted this: long-horizon agents are coming.

What's less discussed is the infrastructure required to make this real.

Current AI deployments are request-response. User sends prompt, system generates response, connection closes. This works for sub-minute interactions. It fundamentally doesn't work for tasks that span hours, days, or weeks.

This post is about the infrastructure we're building to support genuinely long-running agent work. It's technical, not marketing copy. If you're thinking about how to deploy agents that work on extended timescales, this is for you.


Why Long-Running Is Hard

Challenge 1: State Persistence

A standard LLM call is stateless. Everything needed to generate the response is in the prompt. Once the response is generated, nothing persists.

Long-running tasks require persistent state:

  • Progress through the overall task
  • Partial results accumulated so far
  • Resources acquired (file handles, API connections, database sessions)
  • Context built up during execution
  • Decisions made and their rationale

This state must survive:

  • Server restarts
  • Infrastructure failures
  • Service upgrades and deployments
  • Load rebalancing across nodes
  • Network partitions

Standard serverless architectures don't support this. Container orchestration systems assume workloads can be killed and restarted. Long-running agent state doesn't fit the ephemeral compute paradigm.

Challenge 2: Resumability

When something fails (and something always fails), the system needs to resume from a consistent state.

This is harder than it sounds. The agent might have:

  • Sent an email that can't be unsent
  • Made an API call with side effects
  • Updated state in external systems
  • Acquired locks or resources that must be released

Simple checkpointing doesn't work if external side effects aren't idempotent. You need:

  • Clear boundaries between "planning" and "executing" phases
  • Confirmation that side effects completed successfully before checkpointing
  • Rollback or compensation logic for partial failures
  • Resource cleanup on resume

Challenge 3: Resource Management

Long-running tasks consume resources over extended periods:

  • Memory for accumulated state
  • Connections to external services
  • Rate limits on APIs
  • Compute for periodic processing

These resources must be managed across the entire task duration. Standard request-based patterns (acquire at start, release at end) don't scale to hour-long or day-long tasks.

Challenge 4: Human-in-the-Loop at Scale

Long-running tasks inevitably encounter situations requiring human input:

  • Ambiguous instructions that need clarification
  • Decisions that exceed agent autonomy
  • Errors that require human judgment to resolve
  • Approval gates for high-stakes actions

The system must:

  • Pause gracefully when human input is needed
  • Preserve full context while waiting (could be hours or days)
  • Resume correctly when input arrives
  • Handle multiple outstanding requests for human input

The Durable Streams Architecture

We've built an infrastructure pattern called durable streams to address these challenges.

Core Concept: Streams, Not Requests

Instead of request-response, we model agent execution as a stream of events:

[Task Created]
    → [Context Loaded]
    → [Plan Generated]
    → [Step 1 Started]
    → [Step 1 Tool Call]
    → [Step 1 Tool Response]
    → [Step 1 Completed]
    → [Step 2 Started]
    → ...
    → [Human Input Requested]
    → [Human Input Received]
    → ...
    → [Task Completed]

Every state change is an event. Events are durably persisted before being processed. The stream is the source of truth for task state.
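
To make this concrete, here is a minimal sketch of what an event record might look like. The field names are illustrative, not our production schema:

from dataclasses import dataclass, field
from datetime import datetime, timezone

@dataclass(frozen=True)
class Event:
    stream_id: str    # which task's stream this event belongs to
    sequence: int     # position in the stream's total order
    type: str         # e.g. "task_created", "step_started", "checkpoint"
    payload: dict     # event-specific data
    timestamp: datetime = field(default_factory=lambda: datetime.now(timezone.utc))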

Event Sourcing

The stream is an append-only event log. Task state is reconstructed by replaying events.

This provides:

  • Complete history: Every state transition is recorded
  • Point-in-time recovery: Can reconstruct state at any moment
  • Auditability: Full trace of what happened and when
  • Debugging: Can replay execution to diagnose issues

The stream persists independently of compute instances. Workers can crash, restart, or be replaced—the stream maintains ground truth.
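
Reconstruction is a fold over the log. The sketch below builds on the hypothetical Event shape above; a real apply_event would dispatch over every event type we emit, not just the two illustrative cases:

def replay(events: list) -> dict:
    """Reconstruct task state by folding events in sequence order."""
    state = {"status": "pending", "steps": [], "results": {}}
    for event in sorted(events, key=lambda e: e.sequence):
        state = apply_event(state, event)
    return state

def apply_event(state: dict, event) -> dict:
    # One pure transition per event type; two illustrative cases.
    if event.type == "step_completed":
        state["steps"].append(event.payload["step_id"])
        state["results"].update(event.payload.get("result", {}))
    elif event.type == "task_completed":
        state["status"] = "completed"
    return state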

Checkpointing and Resume

We implement checkpointing through snapshot events:

[Checkpoint: {
    progress: "step_7_of_12",
    accumulated_results: {...},
    pending_actions: [],
    context_state: {...}
}]

On resume:

  1. Find latest checkpoint event
  2. Reconstruct state from checkpoint
  3. Verify no pending actions with external side effects
  4. Resume execution from checkpoint

Checkpoints are taken at safe points—moments when the system is in a consistent state with no outstanding external operations.
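
In code, resume might look like this sketch, reusing replay and apply_event from above and assuming the checkpoint payload shape shown earlier:

def resume(events: list) -> dict:
    """Resume from the latest checkpoint, replaying only later events."""
    checkpoints = [e for e in events if e.type == "checkpoint"]
    if not checkpoints:
        return replay(events)  # no checkpoint yet: replay from the start
    latest = max(checkpoints, key=lambda e: e.sequence)
    # A safe-point checkpoint has no outstanding external operations.
    assert not latest.payload["pending_actions"], "checkpoint not at a safe point"
    state = dict(latest.payload["context_state"])
    tail = [e for e in events if e.sequence > latest.sequence]
    for event in sorted(tail, key=lambda e: e.sequence):
        state = apply_event(state, event)
    return state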

Action Primitives

External side effects are wrapped in action primitives with explicit lifecycle:

[Action Started: send_email, id: abc123]
[Action Executing: abc123]
[Action Completed: abc123, result: {...}]

or

[Action Started: send_email, id: abc123]
[Action Executing: abc123]
[Action Failed: abc123, error: {...}]
[Action Compensated: abc123]  # if compensation possible

On resume, incomplete actions are either:

  • Completed if idempotent (can safely retry)
  • Compensated if reversible (can roll back)
  • Flagged for human review if neither
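
A minimal sketch of the lifecycle wrapper, where emit is a hypothetical helper that durably appends an event to the stream before returning:

import uuid

def run_action(emit, name: str, fn, compensate=None):
    """Wrap an external side effect in explicit lifecycle events."""
    action_id = uuid.uuid4().hex
    emit("action_started", {"name": name, "id": action_id})
    emit("action_executing", {"id": action_id})
    try:
        result = fn()
        emit("action_completed", {"id": action_id, "result": result})
        return result
    except Exception as err:
        emit("action_failed", {"id": action_id, "error": str(err)})
        if compensate is not None:
            compensate()  # roll back the side effect if reversible
            emit("action_compensated", {"id": action_id})
        raise

Because action_started is durable before the side effect runs, a crash mid-action leaves an unmatched action_started in the log, which is exactly the signal the resume path uses to classify incomplete actions.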

Human Input Protocol

Human-in-the-loop is modeled as a special event type:

[Human Input Requested: {
    question: "Should I proceed with sending to all 500 recipients?",
    context: {...},
    options: ["proceed", "modify", "cancel"],
    timeout: "24h",
    escalation: "auto-cancel"
}]

The stream pauses. The system emits notifications. When input arrives:

[Human Input Received: {
    request_id: xyz789,
    response: "proceed",
    responder: "user@company.com",
    timestamp: ...
}]

Execution resumes with the human decision incorporated.
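
A sketch of the pause side, again with the hypothetical emit helper. Raising StreamPaused unwinds the worker so it can release the stream while waiting:

import uuid

class StreamPaused(Exception):
    """Parks the stream; it resumes when a human_input_received
    event with the matching request_id arrives."""
    def __init__(self, request_id: str):
        super().__init__(request_id)
        self.request_id = request_id

def request_human_input(emit, question: str, options: list, timeout: str = "24h"):
    request_id = uuid.uuid4().hex
    emit("human_input_requested", {
        "request_id": request_id,
        "question": question,
        "options": options,
        "timeout": timeout,
        "escalation": "auto-cancel",
    })
    raise StreamPaused(request_id)  # worker releases the stream here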


Implementation Details

Worker Architecture

Workers are stateless processors that consume from the stream:

  1. Worker claims a stream
  2. Worker replays events to reconstruct state (or loads from checkpoint)
  3. Worker processes next event, generating new events
  4. Worker writes new events to stream
  5. Repeat until stream pauses (human input) or completes

Workers can be scaled horizontally. Each stream is processed by one worker at a time (exclusive lock), but multiple streams can be processed in parallel.
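
The loop itself is small. Here is a sketch, where store is a hypothetical interface over the durable log and process_next stands in for the agent step that turns current state into new events:

import time

def worker_loop(store):
    """Stateless worker: claim a stream, rebuild state, process, release."""
    while True:
        stream_id = store.claim_stream()  # exclusive lock on one stream
        if stream_id is None:
            time.sleep(1)  # nothing runnable right now
            continue
        try:
            events = store.read(stream_id)
            state = resume(events)  # checkpoint-aware replay from above
            for new_event in process_next(state):
                store.append(stream_id, new_event)  # durable before acting
        except StreamPaused:
            pass  # parked awaiting human input
        finally:
            store.release(stream_id)  # another worker can take over later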

Storage Layer

Events are stored in a durable, ordered log. We use:

  • Write-ahead logging for durability
  • Partitioning by stream for parallelism
  • Compaction to archive old streams while maintaining active ones
  • Replication for fault tolerance

The storage layer guarantees:

  • Events are never lost once acknowledged
  • Order within a stream is preserved
  • Reads see writes that completed before the read started
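
The production layer is a partitioned, replicated log, but the core durability handshake can be sketched with one file per stream: never acknowledge an append until it has reached disk.

import json
import os

def append_event(path: str, event: dict) -> None:
    """Append one event and fsync before acknowledging, so an
    acknowledged event survives a crash. One file per stream makes
    per-stream ordering trivial."""
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_APPEND, 0o600)
    try:
        os.write(fd, (json.dumps(event) + "\n").encode("utf-8"))
        os.fsync(fd)  # durability point: only now may we acknowledge
    finally:
        os.close(fd)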

Context Management

Long-running tasks accumulate significant context. We manage this through:

  • Hierarchical storage: Recent/active context in fast storage, historical context in slower storage
  • Lazy loading: Context is loaded as needed, not all at startup
  • Compaction: Periodic distillation of historical context into summaries
  • Eviction policies: Clear rules for what context can be dropped
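
As a sketch of the compaction step, where summarize stands in for whatever model call produces the distillation:

def compact_context(history: list, summarize, keep_recent: int = 50) -> list:
    """Distill older context into a summary, keeping recent items verbatim."""
    if len(history) <= keep_recent:
        return history  # nothing old enough to compact
    older, recent = history[:-keep_recent], history[-keep_recent:]
    summary = {"type": "context_summary", "content": summarize(older)}
    return [summary] + recent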

Resource Lifecycle

External resources (connections, handles, locks) are tracked explicitly:

[Resource Acquired: db_connection, id: conn123]
...
[Resource Released: conn123]

On resume, the system:

  1. Identifies resources that were acquired but not released
  2. Attempts cleanup/release
  3. Reacquires resources needed for continuation
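
A sketch of steps 1 and 2, scanning the event log for unmatched acquisitions; release_fn is a hypothetical cleanup callback:

def reconcile_resources(events: list, release_fn) -> None:
    """Release anything acquired but never released before resuming."""
    open_resources = {}
    for event in sorted(events, key=lambda e: e.sequence):
        if event.type == "resource_acquired":
            open_resources[event.payload["id"]] = event.payload["name"]
        elif event.type == "resource_released":
            open_resources.pop(event.payload["id"], None)
    for resource_id, name in open_resources.items():
        release_fn(name, resource_id)  # best-effort cleanup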

Operational Considerations

Monitoring

Long-running tasks need different monitoring than request-response:

  • Progress tracking: Where is the task in its overall plan?
  • Resource utilization: Connections, memory, compute over time
  • Human input latency: How long are tasks waiting for human response?
  • Completion rate: What percentage of started tasks finish successfully?
  • Failure patterns: Where do tasks tend to fail?

Cost Management

Extended execution accumulates costs:

  • Compute time (even if intermittent)
  • Storage for event streams
  • API calls to external services
  • Human attention for input requests

We provide cost tracking per stream and per task type, with alerting for runaway costs.

Security

Long-running tasks holding credentials raise security concerns:

  • Credential rotation during execution
  • Scope limiting for long-lived sessions
  • Audit logging for all privileged actions
  • Automatic revocation on task completion or timeout

Where This Matters

Durable streams aren't needed for quick interactions. They're infrastructure for:

  • Research tasks: Literature review, data analysis, report writing spanning hours
  • Process automation: Workflows involving multiple systems and approval gates
  • Monitoring and response: Ongoing surveillance with conditional action
  • Project execution: Multi-step work with human collaboration

The pattern enables agents that work at human timescales—not just response times, but project durations. This is the infrastructure required for Dario's vision of month-long agent tasks.

We're not there yet on capability. But when capability arrives, we'll be ready on infrastructure.


This is part of a series on agent infrastructure at Context. Learn more at context.inc
