Durable Streams: The Infrastructure Challenge for Long-Running Agents
Building systems that can execute for hours, days, or weeks without losing state
By Derek Parham
There's a clip of Dario Amodei saying agents will be able to work on tasks for "weeks or months at a time." The capability discourse has accepted this: long-horizon agents are coming.
What's less discussed is the infrastructure required to make this real.
Current AI deployments are request-response. User sends prompt, system generates response, connection closes. This works for sub-minute interactions. It fundamentally doesn't work for tasks that span hours, days, or weeks.
This post is about the infrastructure we're building to support genuinely long-running agent work. It's technical, not markety. If you're thinking about how to deploy agents that work on extended timescales, this is for you.
Why Long-Running Is Hard
Challenge 1: State Persistence
A standard LLM call is stateless. Everything needed to generate the response is in the prompt. Once the response is generated, nothing persists.
Long-running tasks require persistent state:
- Progress through the overall task
- Partial results accumulated so far
- Resources acquired (file handles, API connections, database sessions)
- Context built up during execution
- Decisions made and their rationale
This state must survive:
- Server restarts
- Infrastructure failures
- Service upgrades and deployments
- Load rebalancing across nodes
- Network partitions
Standard serverless architectures don't support this. Container orchestration systems assume workloads can be killed and restarted. Long-running agent state doesn't fit the ephemeral compute paradigm.
Challenge 2: Resumability
When something fails (and something always fails), the system needs to resume from a consistent state.
This is harder than it sounds. The agent might have:
- Sent an email that can't be unsent
- Made an API call with side effects
- Updated state in external systems
- Acquired locks or resources that must be released
Simple checkpointing doesn't work if external side effects aren't idempotent. You need:
- Clear boundaries between "planning" and "executing" phases
- Confirmation that side effects completed successfully before checkpointing
- Rollback or compensation logic for partial failures
- Resource cleanup on resume
Challenge 3: Resource Management
Long-running tasks consume resources over extended periods:
- Memory for accumulated state
- Connections to external services
- Rate limits on APIs
- Compute for periodic processing
These resources must be managed across the entire task duration. Standard request-based patterns (acquire at start, release at end) don't scale to hour-long or day-long tasks.
Challenge 4: Human-in-the-Loop at Scale
Long-running tasks inevitably encounter situations requiring human input:
- Ambiguous instructions that need clarification
- Decisions that exceed agent autonomy
- Errors that require human judgment to resolve
- Approval gates for high-stakes actions
The system must:
- Pause gracefully when human input is needed
- Preserve full context while waiting (could be hours or days)
- Resume correctly when input arrives
- Handle multiple outstanding requests for human input
The Durable Streams Architecture
We've built an infrastructure pattern called durable streams to address these challenges.
Core Concept: Streams, Not Requests
Instead of request-response, we model agent execution as a stream of events:
[Task Created]
→ [Context Loaded]
→ [Plan Generated]
→ [Step 1 Started]
→ [Step 1 Tool Call]
→ [Step 1 Tool Response]
→ [Step 1 Completed]
→ [Step 2 Started]
→ ...
→ [Human Input Requested]
→ [Human Input Received]
→ ...
→ [Task Completed]
Every state change is an event. Events are durably persisted before being processed. The stream is the source of truth for task state.
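Concretely, an event can be a small immutable record. Here's a minimal sketch in Python; the field names, and the in-memory list standing in for the durable log, are illustrative rather than our actual schema.

import time
import uuid
from dataclasses import dataclass, field

@dataclass
class Event:
    """One entry in a task's durable stream."""
    stream_id: str                 # which task this event belongs to
    type: str                      # e.g. "task_created", "step_started"
    payload: dict = field(default_factory=dict)
    event_id: str = field(default_factory=lambda: str(uuid.uuid4()))
    timestamp: float = field(default_factory=time.time)

def append(log: list, event: Event) -> Event:
    """Commit an event. In production the write would be durably
    persisted (fsync, replication) before being acknowledged."""
    log.append(event)
    return event

stream: list = []
append(stream, Event("task-42", "task_created", {"goal": "quarterly report"}))
append(stream, Event("task-42", "plan_generated", {"steps": 12}))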
Event Sourcing
The stream is an append-only event log. Task state is reconstructed by replaying events.
This provides:
- Complete history: Every state transition is recorded
- Point-in-time recovery: Can reconstruct state at any moment
- Auditability: Full trace of what happened and when
- Debugging: Can replay execution to diagnose issues
The stream persists independently of compute instances. Workers can crash, restart, or be replaced—the stream maintains ground truth.
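In code, replay is just a fold over the log. A sketch reusing the Event type above; the state shape and event types are illustrative:

def apply_event(state: dict, e: Event) -> dict:
    """One state transition; replay is a fold of these over the log."""
    if e.type == "plan_generated":
        state["plan_steps"] = e.payload["steps"]
    elif e.type == "step_completed":
        state["completed"].append(e.payload["step"])
    elif e.type == "task_completed":
        state["status"] = "done"
    return state

def replay(events: list) -> dict:
    state = {"status": "running", "completed": []}
    for e in events:
        state = apply_event(state, e)
    return state

state_now = replay(stream)          # full reconstruction
state_then = replay(stream[:1])     # point-in-time recovery: any prefix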
Checkpointing and Resume
We implement checkpointing through snapshot events:
[Checkpoint: {
    progress: "step_7_of_12",
    accumulated_results: {...},
    pending_actions: [],
    context_state: {...}
}]
On resume:
- Find latest checkpoint event
- Reconstruct state from checkpoint
- Verify no pending actions with external side effects
- Resume execution from checkpoint
Checkpoints are taken at safe points—moments when the system is in a consistent state with no outstanding external operations.
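Continuing the sketch from above, resume logic finds the latest snapshot and applies only the events recorded after it; the payload fields match the checkpoint event shown earlier:

def resume_state(events: list) -> dict:
    """Rebuild state from the latest checkpoint plus later events."""
    last = None
    for i, e in enumerate(events):
        if e.type == "checkpoint":
            last = i
    if last is None:
        return replay(events)              # no checkpoint: full replay
    snapshot = events[last].payload
    # A safe checkpoint has no outstanding external operations.
    assert not snapshot.get("pending_actions"), "unsafe checkpoint"
    state = dict(snapshot["context_state"])
    for e in events[last + 1:]:
        state = apply_event(state, e)
    return state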
Action Primitives
External side effects are wrapped in action primitives with explicit lifecycle:
[Action Started: send_email, id: abc123]
[Action Executing: abc123]
[Action Completed: abc123, result: {...}]
or
[Action Started: send_email, id: abc123]
[Action Executing: abc123]
[Action Failed: abc123, error: {...}]
[Action Compensated: abc123] # if compensation possible
On resume, each incomplete action is handled one of three ways (a code sketch follows the list):
- Completed if idempotent (can safely retry)
- Compensated if reversible (can roll back)
- Flagged for human review if neither
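A sketch of the wrapper and of resume-time classification, reusing Event and append from earlier; the idempotent and reversible flags are illustrative stand-ins for a tool's declared contract:

def run_action(log, stream_id, name, fn, idempotent=False, compensate=None):
    """Wrap one external side effect in explicit lifecycle events."""
    action_id = str(uuid.uuid4())
    append(log, Event(stream_id, "action_started",
                      {"action": name, "id": action_id,
                       "idempotent": idempotent,
                       "reversible": compensate is not None}))
    try:
        result = fn()                      # the side effect itself
        append(log, Event(stream_id, "action_completed",
                          {"id": action_id, "result": result}))
        return result
    except Exception as err:
        append(log, Event(stream_id, "action_failed",
                          {"id": action_id, "error": str(err)}))
        if compensate is not None:
            compensate()                   # roll back if we can
            append(log, Event(stream_id, "action_compensated",
                              {"id": action_id}))
        raise

def recovery_plan(events) -> dict:
    """Map each non-terminal action to retry / compensate / human review."""
    started, finished = {}, set()
    for e in events:
        if e.type == "action_started":
            started[e.payload["id"]] = e.payload
        elif e.type in ("action_completed", "action_failed"):
            finished.add(e.payload["id"])
    def decide(a):
        if a["idempotent"]:
            return "retry"
        if a["reversible"]:
            return "compensate"
        return "flag_for_human_review"
    return {i: decide(a) for i, a in started.items() if i not in finished}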
Human Input Protocol
Human-in-the-loop is modeled as a special event type:
[Human Input Requested: {
    question: "Should I proceed with sending to all 500 recipients?",
    context: {...},
    options: ["proceed", "modify", "cancel"],
    timeout: "24h",
    escalation: "auto-cancel"
}]
The stream pauses. The system emits notifications. When input arrives:
[Human Input Received: {
    request_id: xyz789,
    response: "proceed",
    responder: "user@company.com",
    timestamp: ...
}]
Execution resumes with the human decision incorporated.
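Since both sides of the exchange are ordinary events, pausing doesn't require a long-lived process: the worker writes the request, releases the stream, and any worker can pick it back up once the response lands. A sketch of both sides, continuing the code above:

def request_human_input(log, stream_id, question, options):
    """Emit the request; the worker then releases the stream and exits."""
    e = Event(stream_id, "human_input_requested",
              {"question": question, "options": options,
               "timeout": "24h", "escalation": "auto-cancel"})
    append(log, e)
    return e.event_id

def receive_human_input(log, stream_id, request_id, response, responder):
    """Called by the notification/UI layer when the human answers."""
    append(log, Event(stream_id, "human_input_received",
                      {"request_id": request_id, "response": response,
                       "responder": responder}))

def awaiting_input(events) -> bool:
    """True while any request lacks a response; the stream stays paused."""
    answered = {e.payload["request_id"] for e in events
                if e.type == "human_input_received"}
    return any(e.type == "human_input_requested" and e.event_id not in answered
               for e in events)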
Implementation Details
Worker Architecture
Workers are stateless processors that consume from the stream:
- Worker claims a stream
- Worker replays events to reconstruct state (or loads from checkpoint)
- Worker processes next event, generating new events
- Worker writes new events to stream
- Repeat until stream pauses (human input) or completes
Workers can be scaled horizontally. Each stream is processed by one worker at a time (exclusive lock), but multiple streams can be processed in parallel.
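A sketch of that loop, with an in-memory lease table standing in for the real lock service (a production lease would live in shared storage and expire on its own if the worker crashed); step is whatever function advances the task, e.g. a model call plus tool dispatch:

import threading
import time

class StreamRegistry:
    """In-memory stand-in for the durable log plus lock service."""
    def __init__(self):
        self.streams = {}              # stream_id -> list of events
        self.leases = {}               # stream_id -> lease expiry time
        self._mu = threading.Lock()

    def claim(self, ttl: float = 30.0):
        """Grant an exclusive, expiring lease on one unclaimed stream."""
        now = time.time()
        with self._mu:
            for sid in self.streams:
                if self.leases.get(sid, 0.0) < now:
                    self.leases[sid] = now + ttl
                    return sid
        return None

    def release(self, sid):
        with self._mu:
            self.leases.pop(sid, None)

def worker_loop(reg: StreamRegistry, step):
    """step(state, events) returns new events, or [] to pause/finish."""
    while True:
        sid = reg.claim()
        if sid is None:
            time.sleep(0.5)            # nothing runnable right now
            continue
        try:
            events = reg.streams[sid]
            for e in step(resume_state(events), events):
                append(events, e)
        finally:
            reg.release(sid)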
Storage Layer
Events are stored in a durable, ordered log. We use:
- Write-ahead logging for durability
- Partitioning by stream for parallelism
- Compaction to archive old streams while maintaining active ones
- Replication for fault tolerance
The storage layer guarantees:
- Events are never lost once acknowledged
- Order within a stream is preserved
- Reads see writes that completed before the read started
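To make the first guarantee concrete: an append is acknowledged only after the bytes are durable. A minimal single-node sketch, with one file per stream standing in for partitioning (replication and compaction omitted):

import json
import os
from dataclasses import asdict

class DurableLog:
    """Append-only file log for one stream."""
    def __init__(self, path: str):
        self.path = path
        self.f = open(path, "ab")

    def append(self, event: Event) -> None:
        self.f.write((json.dumps(asdict(event)) + "\n").encode())
        self.f.flush()
        os.fsync(self.f.fileno())      # durable before we acknowledge

    def read(self) -> list:
        with open(self.path, "rb") as f:
            return [Event(**json.loads(line)) for line in f]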
Context Management
Long-running tasks accumulate significant context. We manage this through:
- Hierarchical storage: Recent/active context in fast storage, historical context in slower storage
- Lazy loading: Context is loaded as needed, not all at startup
- Compaction: Periodic distillation of historical context into summaries
- Eviction policies: Clear rules for what context can be dropped
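One way those four pieces can fit together, as a sketch; the summarize callable is a stand-in for a model-backed distillation step:

class ContextStore:
    """Two-tier context: live entries in memory, distilled history in cold."""
    def __init__(self, summarize, hot_limit: int = 100):
        self.hot = {}                  # fast storage: active context
        self.cold = {}                 # slow storage: summaries, history
        self.summarize = summarize
        self.hot_limit = hot_limit

    def put(self, key, value):
        self.hot[key] = value
        if len(self.hot) > self.hot_limit:
            self._compact()

    def get(self, key):
        # Lazy loading: cold storage is consulted only on demand.
        return self.hot.get(key, self.cold.get(key))

    def _compact(self):
        # Distill the oldest half of hot context into one summary entry;
        # the eviction policy here is simple age order.
        oldest = list(self.hot)[: self.hot_limit // 2]
        chunk = {k: self.hot.pop(k) for k in oldest}
        self.cold["summary/" + oldest[0]] = self.summarize(chunk)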
Resource Lifecycle
External resources (connections, handles, locks) are tracked explicitly:
[Resource Acquired: db_connection, id: conn123]
...
[Resource Released: conn123]
On resume, the system:
- Identifies resources that were acquired but not released
- Attempts cleanup/release
- Reacquires resources needed for continuation
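Because acquisition and release are themselves events, the set of resources needing cleanup falls straight out of replay; a sketch:

def open_resources(events) -> dict:
    """Resources acquired but never released, derived from the stream."""
    held = {}
    for e in events:
        if e.type == "resource_acquired":
            held[e.payload["id"]] = e.payload["resource"]
        elif e.type == "resource_released":
            held.pop(e.payload["id"], None)
    return held                        # resume logic cleans these up first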
Operational Considerations
Monitoring
Long-running tasks need different monitoring than request-response systems:
- Progress tracking: Where is the task in its overall plan?
- Resource utilization: Connections, memory, compute over time
- Human input latency: How long are tasks waiting for human response?
- Completion rate: What percentage of started tasks finish successfully?
- Failure patterns: Where do tasks tend to fail?
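A useful property of the event-sourced design is that most of these metrics can be derived from the stream itself rather than a separate telemetry path. For example, human-input latency, as a sketch:

def human_input_waits(events) -> list:
    """Seconds each human-input request spent waiting, from the log."""
    asked, waits = {}, []
    for e in events:
        if e.type == "human_input_requested":
            asked[e.event_id] = e.timestamp
        elif e.type == "human_input_received":
            t0 = asked.pop(e.payload["request_id"], None)
            if t0 is not None:
                waits.append(e.timestamp - t0)
    return waits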
Cost Management
Extended execution accumulates costs:
- Compute time (even if intermittent)
- Storage for event streams
- API calls to external services
- Human attention for input requests
We provide cost tracking per stream and per task type, with alerting for runaway costs.
Security
Long-running tasks hold credentials for extended periods, which raises security concerns. We mitigate with:
- Credential rotation during execution
- Scope limiting for long-lived sessions
- Audit logging for all privileged actions
- Automatic revocation on task completion or timeout
Where This Matters
Durable streams aren't needed for quick interactions. They're infrastructure for:
- Research tasks: Literature review, data analysis, report writing spanning hours
- Process automation: Workflows involving multiple systems and approval gates
- Monitoring and response: Ongoing surveillance with conditional action
- Project execution: Multi-step work with human collaboration
The pattern enables agents that work at human timescales—not just response times, but project durations. This is the infrastructure required for Dario's vision of month-long agent tasks.
We're not there yet on capability. But when capability arrives, we'll be ready on infrastructure.
This is part of a series on agent infrastructure at Context. Learn more at context.inc