6 Jun 2026

Durable agents on ephemeral compute

The Context team

We built the agent's computer to be disposable. It is a sealed, ephemeral sandbox that gets destroyed when the session ends, and that we are happy to lose at any moment. That is exactly what you want for security. It is a problem for work, because real agent work is not short.

An agent might run for hours, across dozens of steps, holding a long thread of reasoning and a pile of intermediate state. So there is a contradiction to resolve: long, valuable work running on compute we treat as throwaway. The resolution is one sentence, and the rest is engineering. Keep nothing durable in the sandbox.

Why the compute has to be disposable

The sandbox is ephemeral on purpose. A fresh, empty machine every session means a compromise has no foothold and leaves nothing behind, which is the security story from an earlier post. But ephemeral cuts both ways. A sandbox can vanish at any moment, and not only when an attacker is involved: a node gets recycled, the autoscaler reclaims capacity, a process crashes, a spot instance is taken back.

If losing a sandbox meant losing the work, you could never run anything important inside one. So the rule runs the other way. The sandbox holds only the live, in-flight computation. Every piece of state that has to survive the sandbox lives somewhere else, by design, and the somewhere else is built to be durable in a way the sandbox deliberately is not.

State lives outside the sandbox, in three places

Durable state is split across three systems, each handling a different shape of state.

The workflow skeleton lives in a durable workflow engine. It owns the multi-step plan: each discrete step is recorded, and if a step fails or a subsystem goes down, it retries, times out, or replays, with in-flight steps resuming after the failure. The workflow always knows what has been done and what comes next, independent of whichever sandbox happens to be running right now.
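
A minimal sketch of that behavior, not the engine we run. The names WorkflowHistory and run_workflow are hypothetical, and a Python dict stands in for the engine's durable store. It shows two things from the paragraph above: a failing step is retried, and a run started again against the same history skips the steps already recorded.

```python
from dataclasses import dataclass, field
from typing import Any, Callable


@dataclass
class WorkflowHistory:
    """Stand-in for the engine's durable store. It outlives any sandbox."""
    completed: dict[str, Any] = field(default_factory=dict)


def run_workflow(
    steps: list[tuple[str, Callable[[], Any]]],
    history: WorkflowHistory,
    max_attempts: int = 3,
) -> dict[str, Any]:
    """Run steps in order, skipping any step already recorded as completed."""
    for name, step in steps:
        if name in history.completed:
            continue  # finished before the failure; do not run it again
        last_error = None
        for _ in range(max_attempts):
            try:
                result = step()
            except Exception as exc:  # a real engine would classify errors
                last_error = exc
                continue
            history.completed[name] = result  # record before moving on
            break
        else:
            # Every attempt failed; surface the last error to the caller.
            raise RuntimeError(f"step {name!r} failed {max_attempts} times") from last_error
    return history.completed
```

Calling run_workflow a second time with the same history is the sketch's version of a fresh sandbox picking up the plan: completed steps are skipped and execution continues with the first unrecorded one.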

The live session lives in a durable streaming layer. Between the discrete steps is everything happening moment to moment: the model's token stream, the agent's working memory, coordination between sub-processes, the user's interaction in flight. That runs through an actor cluster on top of an append-only log. Every event is written to a durable stream. The workflow engine handles the steps; the actor cluster handles everything in between them.

The outputs live in the Drive. Files the agent has finished and published go to durable object storage, under the workspace's permissions. The sandbox's own disk is scratch space. The Drive is the record. Nothing the agent produces is considered real until it is written there.

  1. Session starts
    A sandbox is provisioned, a pod or a microVM.
  2. Agent runs
    Every event is appended to a durable stream. Each workflow step is recorded. Finished outputs are written to the Drive.
  3. Sandbox dies
    A node recycle, a crash, or an autoscaler decision. The in-memory state is gone.
  4. Recovery
    Pending steps are re-dispatched, the session stream is replayed, and a fresh sandbox is provisioned.
  5. Work resumes
    The agent continues against the same Drive scope, with no lost tokens.

Figure: A session survives the loss of its sandbox. Durable state lives outside the compute, so a new sandbox resumes where the old one stopped.

Why two systems and not one

The split between a workflow engine and a streaming log is not incidental. It is the design. The two systems hold different shapes of state, and each would be the wrong tool for the other's job.

A workflow engine is built to durably order discrete business steps, the kind that are relatively rare, can fail, and need retries, timeouts, and exactly-once handling. That fits the agent's plan, where a step might be running a query, calling a connector, or waiting for an approval. It would be a poor fit for a stream of tokens arriving dozens of times a second.

The streaming log is the opposite. It is built for high-frequency, append-only events that need to replay cheaply and in order. That fits the token stream and the working memory, and it is a poor fit for orchestrating a business process. Match each kind of state to the system shaped for it, and recovery becomes a property of the systems rather than something the agent has to implement itself.

What happens when a sandbox dies mid-task

Here is the payoff. A node gets recycled while an agent is forty minutes into a job. The sandbox is gone, and so is everything in its memory. The recovery is unremarkable, which is the point: the workflow engine re-dispatches the steps that were pending, the actor cluster replays the session state from the durable stream, a fresh sandbox is provisioned in seconds, and the work continues against the same Drive scope.

From the user's side, if they have the tab open, the model connection drops and reconnects, the stream replays, and they see no lost tokens and no orphaned tool calls. The agent did not restart from the top. It resumed from where it was. The difference between those two words is the entire post.
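
One way to get "no lost tokens" on reconnect is for the client to remember the offset of the last event it rendered and ask for everything after it. The sketch below assumes that design. DurableStream, append, read_from, and Client are hypothetical names, and a Python list stands in for the append-only log.

```python
class DurableStream:
    """In-memory stand-in for an append-only log with stable offsets."""

    def __init__(self) -> None:
        self._events: list[str] = []

    def append(self, event: str) -> int:
        """Append an event and return its offset."""
        self._events.append(event)
        return len(self._events) - 1

    def read_from(self, offset: int) -> list[tuple[int, str]]:
        """Return (offset, event) pairs starting at the given offset."""
        return list(enumerate(self._events[offset:], start=offset))


class Client:
    """A viewer that tracks the next offset it has not rendered yet."""

    def __init__(self) -> None:
        self.next_offset = 0
        self.rendered: list[str] = []

    def catch_up(self, stream: DurableStream) -> None:
        """Render every event the client has not seen, exactly once."""
        for offset, event in stream.read_from(self.next_offset):
            self.rendered.append(event)
            self.next_offset = offset + 1
```

Because the client resumes from its own cursor rather than from the start or from "now", tokens written while it was disconnected are neither skipped nor shown twice.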

To make it concrete, take a three-hour build job. An hour in, a node is recycled for an OS patch: the workflow re-dispatches the step that was running, the stream replays, a new sandbox picks it up, and the job continues. Two hours in, the model provider has a brief outage: the in-flight step times out and retries against a healthy endpoint without losing the prior work. The job finishes on time, on its second sandbox, and the user watching the whole time saw a short reconnect and nothing else. No single failure was special. They were all the same failure, handled the same way.

No lost tokens is the hard part

It is worth being precise about why resume is harder than retry. A naive system retries the failed step from scratch. For an agent, that means re-running an expensive multi-turn reasoning process, and possibly repeating side effects that already happened, which is worse than slow.

Resume means picking up mid-stream: the token that was being generated, the tool call that was in flight, the working memory exactly as it stood. That only works if session state is recorded at a fine grain, as an append-only stream of events, so that replaying the stream reconstructs the precise state and the agent continues the same thought instead of starting a new one. The log is the source of truth. The sandbox is just its current reader, and readers are replaceable.
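
Replay can be sketched as a fold over the event stream. The event kinds below (token, memory_set, tool_started, tool_finished) are illustrative assumptions, not our actual schema. The point is that the same events always rebuild the same state, including which tool calls were still in flight.

```python
from dataclasses import dataclass, field


@dataclass
class SessionState:
    text: str = ""
    memory: dict[str, str] = field(default_factory=dict)
    in_flight_tools: set[str] = field(default_factory=set)


def apply_event(state: SessionState, event: dict) -> SessionState:
    """Apply one event in place. Unknown kinds are ignored in this sketch."""
    kind = event["kind"]
    if kind == "token":
        state.text += event["value"]
    elif kind == "memory_set":
        state.memory[event["key"]] = event["value"]
    elif kind == "tool_started":
        state.in_flight_tools.add(event["call_id"])
    elif kind == "tool_finished":
        state.in_flight_tools.discard(event["call_id"])
    return state


def replay(events: list[dict]) -> SessionState:
    """Rebuild session state from scratch by folding over the log."""
    state = SessionState()
    for event in events:
        apply_event(state, event)
    return state
```

A new sandbox that calls replay on the stream ends up holding the partial text, the working memory, and the set of unfinished tool calls that the old sandbox held when it died.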

Resuming safely also means not doing things twice. A step that already sent an email or wrote a row must not repeat that when it is re-dispatched. The workflow engine's recorded history is what makes steps replay-safe: it knows which effects already happened, so a replay reconstructs state without re-firing the side effects. Resume is not just fast. It is correct.
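
A common way to make a re-dispatched step safe is to key each side effect and record its result, so a second run of the step finds the record instead of firing the effect again. The sketch below assumes that approach. EffectLedger, once, and notify_step are hypothetical names, and the dict stands in for the workflow engine's recorded history.

```python
from typing import Any, Callable


class EffectLedger:
    """Stand-in for a durable history of side effects, keyed by effect id."""

    def __init__(self) -> None:
        self._results: dict[str, Any] = {}

    def once(self, effect_id: str, effect: Callable[[], Any]) -> Any:
        """Run the effect the first time; return the recorded result after."""
        if effect_id not in self._results:
            self._results[effect_id] = effect()
        return self._results[effect_id]


def notify_step(ledger: EffectLedger, outbox: list[str], crash_after_send: bool) -> str:
    """A step that sends one email, then may crash before it completes."""

    def send() -> str:
        outbox.append("build finished")  # the side effect itself
        return "sent"

    receipt = ledger.once("step-7/send-email", send)
    if crash_after_send:
        # Simulates losing the sandbox after the effect but before the
        # step as a whole was recorded as complete.
        raise RuntimeError("sandbox lost before the step completed")
    return receipt
```

The sketch records the result in the same call that performs the effect. A real system also has to cover a crash between performing an effect and recording it, typically with an idempotency key that the receiving service honors.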

Swarms survive too

A single agent is the simple case. An agent can also spawn sub-agents, to parallelize work or to isolate context across separate workers. A sub-agent runs as a sub-process inside the parent's sandbox, shares its working volume, and inherits a defined subset of the parent's grants. It can never exceed the parent's permissions.

The durability story holds for the swarm without changes. The coordination between parent and sub-agents is itself a series of events on the same durable stream, not state trapped in one process's memory. So if the sandbox running a swarm dies, the coordination replays along with everything else, and the work resumes as a swarm. Every sub-agent action is attributed to both the sub-process and the parent session in the audit log, so a recovered swarm is exactly as traceable as it was before the failure.
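
The permission rule and the attribution rule are both small enough to sketch. Here grants are modeled as plain strings and an audit entry as a dict. spawn_grants and audit_entry are hypothetical helpers, not our API, and rejecting an over-broad request (rather than silently trimming it) is an assumption of the sketch.

```python
def spawn_grants(parent: frozenset[str], requested: frozenset[str]) -> frozenset[str]:
    """Give a sub-agent only grants the parent holds; reject anything else."""
    extra = requested - parent
    if extra:
        raise PermissionError(
            f"sub-agent requested grants the parent lacks: {sorted(extra)}"
        )
    return requested


def audit_entry(session_id: str, subprocess_id: str, action: str) -> dict[str, str]:
    """Attribute an action to both the sub-process and the parent session."""
    return {"session": session_id, "subprocess": subprocess_id, "action": action}
```

Since each audit entry carries both identifiers, entries written before and after a recovery can be joined on the session even though the sub-process was recreated in a new sandbox.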

Provisioning fast enough to hide

The recovery story depends on a new sandbox showing up quickly, and most of the time it does. The sandbox node pool autoscales. A brand-new node takes sixty to ninety seconds to come up, but a sandbox pod scheduled onto existing capacity starts in under ten seconds. Because the pool keeps headroom, a recycled sandbox usually returns on that faster path, before the user's reconnect even finishes.

When the pool hits its ceiling, new work queues rather than failing, and capacity utilization, queue depth, and provisioning latency are exposed as metrics. The operator sees pressure building before it becomes a wait anyone notices.
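
The queue-rather-than-fail behavior can be sketched as a bounded pool with a waiting line. SandboxPool and its metric names are illustrative assumptions. The sketch reports utilization and queue depth only and leaves provisioning latency out, since that needs real timing.

```python
from collections import deque


class SandboxPool:
    """Bounded pool: requests beyond capacity wait in FIFO order."""

    def __init__(self, capacity: int) -> None:
        self.capacity = capacity
        self.running: set[str] = set()
        self.queue: deque[str] = deque()

    def request(self, session_id: str) -> bool:
        """Start a sandbox if there is room; otherwise queue. Never fails."""
        if len(self.running) < self.capacity:
            self.running.add(session_id)
            return True
        self.queue.append(session_id)
        return False

    def release(self, session_id: str) -> None:
        """Free a slot and hand it to the longest-waiting request, if any."""
        self.running.discard(session_id)
        if self.queue and len(self.running) < self.capacity:
            self.running.add(self.queue.popleft())

    def metrics(self) -> dict[str, float]:
        """Numbers an operator would watch for pressure building."""
        return {
            "utilization": len(self.running) / self.capacity,
            "queue_depth": float(len(self.queue)),
        }
```

Utilization pinned at 1.0 with a growing queue depth is the early signal the paragraph describes: pressure is visible in the metrics before any user notices a wait.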

The same design gives you availability for free

Once durable state lives in replicated systems, ordinary failures stop being events worth handling specially. The workflow engine keeps durable workflow state and resumes in-flight workflows after a subsystem failure. The streaming log replays across pod restarts and node failures, so session state survives them. The relational store runs across availability zones with automatic failover.

None of that is special-case recovery code living inside the agent. It falls out of one decision: keep durable state off the disposable layer. The agent does not have to know a node died. It reads the log and keeps going, and the machinery underneath quietly hands it a new machine.

Disposable compute, durable work

The two properties look like they should trade off against each other. A machine you can destroy at any moment, and work that survives for hours across machines. They do not trade off, because they describe different things. The compute is disposable. The state is durable. The line between them is drawn so that nothing you care about ever lives on the part you are willing to lose. Throw the sandbox away as often as you like. The work was never in it.