Durable Streams: The Infrastructure Challenge for Long-Running Agents

Building systems that can execute for hours, days, or weeks without losing state

Derek Parham

Contributor · Oct 20, 2018

There's a clip of Dario Amodei saying agents will be able to work on tasks for "weeks or months at a time." The capability discourse has accepted this: long-horizon agents are coming.

What's less discussed is the infrastructure required to make this real.

Current AI deployments are request-response. User sends prompt, system generates response, connection closes. This works for sub-minute interactions. It fundamentally doesn't work for tasks that span hours, days, or weeks.

This post is about the infrastructure we're building to support genuinely long-running agent work. It's technical, not marketing copy. If you're thinking about how to deploy agents that work on extended timescales, this is for you.


Why Long-Running Is Hard

Challenge 1: State Persistence

A standard LLM call is stateless. Everything needed to generate the response is in the prompt. Once the response is generated, nothing persists.

Long-running tasks require persistent state:

  • Progress through the overall task
  • Partial results accumulated so far
  • Resources acquired (file handles, API connections, database sessions)
  • Context built up during execution
  • Decisions made and their rationale

This state must survive:

  • Server restarts
  • Infrastructure failures
  • Service upgrades and deployments
  • Load rebalancing across nodes
  • Network partitions

Standard serverless architectures don't support this. Container orchestration systems assume workloads can be killed and restarted. Long-running agent state doesn't fit the ephemeral compute paradigm.

Challenge 2: Resumability

When something fails (and something always fails), the system needs to resume from a consistent state.

This is harder than it sounds. The agent might have:

  • Sent an email that can't be unsent
  • Made an API call with side effects
  • Updated state in external systems
  • Acquired locks or resources that must be released

Simple checkpointing doesn't work if external side effects aren't idempotent. You need:

  • Clear boundaries between "planning" and "executing" phases
  • Confirmation that side effects completed successfully before checkpointing
  • Rollback or compensation logic for partial failures
  • Resource cleanup on resume

Challenge 3: Resource Management

Long-running tasks consume resources over extended periods:

  • Memory for accumulated state
  • Connections to external services
  • Rate limits on APIs
  • Compute for periodic processing

These resources must be managed across the entire task duration. Standard request-based patterns (acquire at start, release at end) don't scale to hour-long or day-long tasks.

Challenge 4: Human-in-the-Loop at Scale

Long-running tasks inevitably encounter situations requiring human input:

  • Ambiguous instructions that need clarification
  • Decisions that exceed agent autonomy
  • Errors that require human judgment to resolve
  • Approval gates for high-stakes actions

The system must:

  • Pause gracefully when human input is needed
  • Preserve full context while waiting (could be hours or days)
  • Resume correctly when input arrives
  • Handle multiple outstanding requests for human input

The Durable Streams Architecture

We've built an infrastructure pattern called durable streams to address these challenges.

Core Concept: Streams, Not Requests

Instead of request-response, we model agent execution as a stream of events:

[Task Created]
    → [Context Loaded]
    → [Plan Generated]
    → [Step 1 Started]
    → [Step 1 Tool Call]
    → [Step 1 Tool Response]
    → [Step 1 Completed]
    → [Step 2 Started]
    → ...
    → [Human Input Requested]
    → [Human Input Received]
    → ...
    → [Task Completed]

Every state change is an event. Events are durably persisted before being processed. The stream is the source of truth for task state.
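
To make this concrete, here is a minimal sketch of what an event record might look like. The field names are illustrative, not our production schema:

from dataclasses import dataclass, field
from datetime import datetime, timezone

@dataclass(frozen=True)
class Event:
    stream_id: str    # which task's stream this event belongs to
    sequence: int     # position in the stream's total order
    type: str         # e.g. "task_created", "step_started", "checkpoint"
    payload: dict     # event-specific data
    timestamp: datetime = field(default_factory=lambda: datetime.now(timezone.utc))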

Event Sourcing

The stream is an append-only event log. Task state is reconstructed by replaying events.

This provides:

  • Complete history: Every state transition is recorded
  • Point-in-time recovery: Can reconstruct state at any moment
  • Auditability: Full trace of what happened and when
  • Debugging: Can replay execution to diagnose issues

The stream persists independently of compute instances. Workers can crash, restart, or be replaced—the stream maintains ground truth.
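
Reconstruction is a fold over the log. The sketch below builds on the hypothetical Event shape above; a real apply_event would dispatch over every event type we emit, not just the two illustrative cases:

def replay(events: list) -> dict:
    """Reconstruct task state by folding events in sequence order."""
    state = {"status": "pending", "steps": [], "results": {}}
    for event in sorted(events, key=lambda e: e.sequence):
        state = apply_event(state, event)
    return state

def apply_event(state: dict, event) -> dict:
    # One pure transition per event type; two illustrative cases.
    if event.type == "step_completed":
        state["steps"].append(event.payload["step_id"])
        state["results"].update(event.payload.get("result", {}))
    elif event.type == "task_completed":
        state["status"] = "completed"
    return state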

Checkpointing and Resume

We implement checkpointing through snapshot events:

[Checkpoint: {
    progress: "step_7_of_12",
    accumulated_results: {...},
    pending_actions: [],
    context_state: {...}
}]

On resume:

  1. Find latest checkpoint event
  2. Reconstruct state from checkpoint
  3. Verify no pending actions with external side effects
  4. Resume execution from checkpoint

Checkpoints are taken at safe points—moments when the system is in a consistent state with no outstanding external operations.
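
In code, resume might look like this sketch, reusing replay and apply_event from above and assuming the checkpoint payload shape shown earlier:

def resume(events: list) -> dict:
    """Resume from the latest checkpoint, replaying only later events."""
    checkpoints = [e for e in events if e.type == "checkpoint"]
    if not checkpoints:
        return replay(events)  # no checkpoint yet: replay from the start
    latest = max(checkpoints, key=lambda e: e.sequence)
    # A safe-point checkpoint has no outstanding external operations.
    assert not latest.payload["pending_actions"], "checkpoint not at a safe point"
    state = dict(latest.payload["context_state"])
    tail = [e for e in events if e.sequence > latest.sequence]
    for event in sorted(tail, key=lambda e: e.sequence):
        state = apply_event(state, event)
    return state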

Action Primitives

External side effects are wrapped in action primitives with explicit lifecycle:

[Action Started: send_email, id: abc123]
[Action Executing: abc123]
[Action Completed: abc123, result: {...}]

or

[Action Started: send_email, id: abc123]
[Action Executing: abc123]
[Action Failed: abc123, error: {...}]
[Action Compensated: abc123]  # if compensation possible

On resume, incomplete actions are either:

  • Completed if idempotent (can safely retry)
  • Compensated if reversible (can roll back)
  • Flagged for human review if neither
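
A minimal sketch of the lifecycle wrapper, where emit is a hypothetical helper that durably appends an event to the stream before returning:

import uuid

def run_action(emit, name: str, fn, compensate=None):
    """Wrap an external side effect in explicit lifecycle events."""
    action_id = uuid.uuid4().hex
    emit("action_started", {"name": name, "id": action_id})
    emit("action_executing", {"id": action_id})
    try:
        result = fn()
        emit("action_completed", {"id": action_id, "result": result})
        return result
    except Exception as err:
        emit("action_failed", {"id": action_id, "error": str(err)})
        if compensate is not None:
            compensate()  # roll back the side effect if reversible
            emit("action_compensated", {"id": action_id})
        raise

Because action_started is durable before the side effect runs, a crash mid-action leaves an unmatched action_started in the log, which is exactly the signal the resume path uses to classify incomplete actions.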

Human Input Protocol

Human-in-the-loop is modeled as a special event type:

[Human Input Requested: {
    question: "Should I proceed with sending to all 500 recipients?",
    context: {...},
    options: ["proceed", "modify", "cancel"],
    timeout: "24h",
    escalation: "auto-cancel"
}]

The stream pauses. The system emits notifications. When input arrives:

[Human Input Received: {
    request_id: xyz789,
    response: "proceed",
    responder: "user@company.com",
    timestamp: ...
}]

Execution resumes with the human decision incorporated.
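
A sketch of the pause side, again with the hypothetical emit helper. Raising StreamPaused unwinds the worker so it can release the stream while waiting:

import uuid

class StreamPaused(Exception):
    """Parks the stream; it resumes when a human_input_received
    event with the matching request_id arrives."""
    def __init__(self, request_id: str):
        super().__init__(request_id)
        self.request_id = request_id

def request_human_input(emit, question: str, options: list, timeout: str = "24h"):
    request_id = uuid.uuid4().hex
    emit("human_input_requested", {
        "request_id": request_id,
        "question": question,
        "options": options,
        "timeout": timeout,
        "escalation": "auto-cancel",
    })
    raise StreamPaused(request_id)  # worker releases the stream here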


Implementation Details

Worker Architecture

Workers are stateless processors that consume from the stream:

  1. Worker claims a stream
  2. Worker replays events to reconstruct state (or loads from checkpoint)
  3. Worker processes next event, generating new events
  4. Worker writes new events to stream
  5. Repeat until stream pauses (human input) or completes

Workers can be scaled horizontally. Each stream is processed by one worker at a time (exclusive lock), but multiple streams can be processed in parallel.
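
The loop itself is small. Here is a sketch, where store is a hypothetical interface over the durable log and process_next stands in for the agent step that turns current state into new events:

import time

def worker_loop(store):
    """Stateless worker: claim a stream, rebuild state, process, release."""
    while True:
        stream_id = store.claim_stream()  # exclusive lock on one stream
        if stream_id is None:
            time.sleep(1)  # nothing runnable right now
            continue
        try:
            events = store.read(stream_id)
            state = resume(events)  # checkpoint-aware replay from above
            for new_event in process_next(state):
                store.append(stream_id, new_event)  # durable before acting
        except StreamPaused:
            pass  # parked awaiting human input
        finally:
            store.release(stream_id)  # another worker can take over later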

Storage Layer

Events are stored in a durable, ordered log. We use:

  • Write-ahead logging for durability
  • Partitioning by stream for parallelism
  • Compaction to archive old streams while maintaining active ones
  • Replication for fault tolerance

The storage layer guarantees:

  • Events are never lost once acknowledged
  • Order within a stream is preserved
  • Reads see writes that completed before the read started
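
The production layer is a partitioned, replicated log, but the core durability handshake can be sketched with one file per stream: never acknowledge an append until it has reached disk.

import json
import os

def append_event(path: str, event: dict) -> None:
    """Append one event and fsync before acknowledging, so an
    acknowledged event survives a crash. One file per stream makes
    per-stream ordering trivial."""
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_APPEND, 0o600)
    try:
        os.write(fd, (json.dumps(event) + "\n").encode("utf-8"))
        os.fsync(fd)  # durability point: only now may we acknowledge
    finally:
        os.close(fd)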

Context Management

Long-running tasks accumulate significant context. We manage this through:

  • Hierarchical storage: Recent/active context in fast storage, historical context in slower storage
  • Lazy loading: Context is loaded as needed, not all at startup
  • Compaction: Periodic distillation of historical context into summaries
  • Eviction policies: Clear rules for what context can be dropped
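
As a sketch of the compaction step, where summarize stands in for whatever model call produces the distillation:

def compact_context(history: list, summarize, keep_recent: int = 50) -> list:
    """Distill older context into a summary, keeping recent items verbatim."""
    if len(history) <= keep_recent:
        return history  # nothing old enough to compact
    older, recent = history[:-keep_recent], history[-keep_recent:]
    summary = {"type": "context_summary", "content": summarize(older)}
    return [summary] + recent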

Resource Lifecycle

External resources (connections, handles, locks) are tracked explicitly:

[Resource Acquired: db_connection, id: conn123]
...
[Resource Released: conn123]

On resume, the system:

  1. Identifies resources that were acquired but not released
  2. Attempts cleanup/release
  3. Reacquires resources needed for continuation
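
A sketch of steps 1 and 2, scanning the event log for unmatched acquisitions; release_fn is a hypothetical cleanup callback:

def reconcile_resources(events: list, release_fn) -> None:
    """Release anything acquired but never released before resuming."""
    open_resources = {}
    for event in sorted(events, key=lambda e: e.sequence):
        if event.type == "resource_acquired":
            open_resources[event.payload["id"]] = event.payload["name"]
        elif event.type == "resource_released":
            open_resources.pop(event.payload["id"], None)
    for resource_id, name in open_resources.items():
        release_fn(name, resource_id)  # best-effort cleanup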

Operational Considerations

Monitoring

Long-running tasks need different monitoring than request-response:

  • Progress tracking: Where is the task in its overall plan?
  • Resource utilization: Connections, memory, compute over time
  • Human input latency: How long are tasks waiting for human response?
  • Completion rate: What percentage of started tasks finish successfully?
  • Failure patterns: Where do tasks tend to fail?

Cost Management

Extended execution accumulates costs:

  • Compute time (even if intermittent)
  • Storage for event streams
  • API calls to external services
  • Human attention for input requests

We provide cost tracking per stream and per task type, with alerting for runaway costs.

Security

Long-running tasks holding credentials raise security concerns:

  • Credential rotation during execution
  • Scope limiting for long-lived sessions
  • Audit logging for all privileged actions
  • Automatic revocation on task completion or timeout

Where This Matters

Durable streams aren't needed for quick interactions. They're infrastructure for:

  • Research tasks: Literature review, data analysis, report writing spanning hours
  • Process automation: Workflows involving multiple systems and approval gates
  • Monitoring and response: Ongoing surveillance with conditional action
  • Project execution: Multi-step work with human collaboration

The pattern enables agents that work at human timescales—not just response times, but project durations. This is the infrastructure required for Dario's vision of month-long agent tasks.

We're not there yet on capability. But when capability arrives, we'll be ready on infrastructure.


This is part of a series on agent infrastructure at Context. Learn more at context.inc
