Network drops and server crashes abort long-running agent tasks, wasting tokens and risking duplicate or partial writes to production databases. Cellaflow adds low-overhead execution journaling and fault-tolerant process recovery, so a probabilistic agent workflow can resume from its last completed step instead of starting over.
```python
from cellaflow.sdk import workflow, step, NonPersistableZone
from cellaflow.sdk.tools import clone_repo, send_report


@workflow(name="code_audit_agent", version="1.0.0")
async def run_audit_agent(ctx, repo_url: str):
    # Step 1: Durable Execution of Repository Clone
    files = await ctx.step("clone_repo", clone_repo, repo_url)

    # Step 2: Non-Persistable Zone (Suspends Checkpoint Commits)
    async with NonPersistableZone(ctx):
        analysis = ""
        async for chunk in ctx.llm.stream("gpt-4", prompt=f"Audit: {files}"):
            analysis += chunk

    # Step 3: Notification with Idempotency Key (Exactly-Once)
    report_status = await ctx.step(
        "send_report",
        send_report,
        to="devs@org.com",
        body=analysis,
        idempotency_key=f"report-{ctx.session_id}"
    )
    return report_status
```

Autonomous agents execute long-running, non-deterministic workflows. Running them on stateless application servers or serverless containers introduces critical operational and financial liabilities.
Kubernetes or AWS ECS terminates your node 8 minutes into a 10-minute workflow. Because all variables, agent stack context, and memory live only in volatile RAM, the entire run is lost.
To complete the failed job, your agent must start over from Step 1. You re-run costly LLM reasoning prompts and stream the same completions again, which can double or triple your model API bill.
Retrying the agent forces tools to run a second time. This can result in duplicate database mutations, repeated Stripe credit card charges, or redundant Slack and email dispatches.
When the replacement container boots, the SDK re-initializes and re-binds to the active session ID, then queries the Engine Core over gRPC to retrieve that session's recorded state.
Cellaflow runs **Replay Recovery** by reading completed steps from the immutable **Cognitive Graph** log. It skips re-executing those steps, returns their cached outputs, and resumes at the first uncompleted step.
Deterministic idempotency keys ensure that a tool's side effects are applied only once, even when a step is retried after a crash. Non-Persistable Zones (NPZ) suspend checkpointing while a response is streaming, so a crash mid-stream cannot persist partial state.
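The recovery behaviour described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not Cellaflow's engine: `Journal`, `ReplayContext`, `Outbox`, and `fake_clone_repo` are hypothetical names, the journal is an in-memory list rather than a durable log, and the crash is simulated by an exception raised after the side effect but before the step is journaled.

```python
import asyncio
import json


class Journal:
    """Append-only list of completed steps (hypothetical stand-in for a durable log)."""

    def __init__(self):
        self.records = []

    def lookup(self, name):
        for record in self.records:
            if record["step"] == name:
                return record
        return None

    def append(self, name, output):
        # Round-trip through JSON to mimic serialising the output to disk.
        self.records.append(json.loads(json.dumps({"step": name, "output": output})))


class ReplayContext:
    """Runs a step once; later attempts return the journaled output instead."""

    def __init__(self, journal):
        self.journal = journal
        self.executed = []  # steps that actually ran in this process

    async def step(self, name, fn, *args):
        cached = self.journal.lookup(name)
        if cached is not None:
            return cached["output"]  # replay: skip re-execution
        output = await fn(*args)
        self.journal.append(name, output)  # commit only after success
        self.executed.append(name)
        return output


class Outbox:
    """Side-effecting tool that deduplicates on an idempotency key."""

    def __init__(self):
        self.seen = {}
        self.deliveries = 0

    async def send(self, idempotency_key, body):
        if idempotency_key not in self.seen:
            self.seen[idempotency_key] = body
            self.deliveries += 1
        return {"key": idempotency_key, "delivered": True}


async def fake_clone_repo(url):
    return ["a.py", "b.py"]  # placeholder result, no network access


async def run(ctx, outbox, crash_after_send=False):
    files = await ctx.step("clone_repo", fake_clone_repo, "https://example.invalid/repo.git")

    async def send():
        result = await outbox.send("report-session-1", f"{len(files)} files audited")
        if crash_after_send:
            raise RuntimeError("simulated crash before the step was journaled")
        return result

    return await ctx.step("send_report", send)


async def demo():
    journal, outbox = Journal(), Outbox()
    first = ReplayContext(journal)
    try:
        await run(first, outbox, crash_after_send=True)
    except RuntimeError:
        pass  # the "container" died mid-step
    second = ReplayContext(journal)  # fresh process, same journal
    result = await run(second, outbox)
    return first.executed, second.executed, outbox.deliveries, result
```

Under these assumptions, the second attempt executes only `send_report`, because `clone_repo` is served from the journal. The outbox records a single delivery because the retried call reuses the same idempotency key.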
Observe how Cellaflow journals steps to its local RocksDB store, handles a sudden mid-turn server recycle, and restores execution context in less than 250ms with zero extra LLM tokens.
Step 1: Durable Tool Call ($0.01 API fee)
Step 2: Durable Tool Call ($0.01 API fee)
Step 3: LLM Completion Prompt ($0.25 API fee)
Step 4: Durable Tool Call ($0.05 API fee)
Step 5: Side-effecting Tool (Idempotent)
See the immediate business case. Estimate how much LLM API budget you are throwing away on redundant steps due to infrastructure recycles and how Cellaflow's middleware stops the bleed.
Enable background log compaction (5x to 40x compression ratios) to optimize context inputs and save an extra ~45% in prompt token costs.
Directly thrown away on re-running completed steps.
Saved via asynchronous observation compaction.
Annual savings from Durable Replay + CCS Memory optimization.
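The estimate can be approximated with simple arithmetic. The sketch below assumes the five demo steps listed earlier run in order with the fees shown there, and that the final side-effecting step has no API fee (none is listed). The function names and the run volume in the usage note are hypothetical, and amounts are kept in integer cents to avoid floating-point rounding.

```python
# Per-step API fees in cents, taken from the demo steps listed earlier.
# The final side-effecting step lists no fee, so 0 is assumed here.
DEMO_STEP_FEES_CENTS = [1, 1, 25, 5, 0]


def wasted_cents(step_fees_cents, interrupted_step):
    """Fees paid a second time when a run restarts from the first step.

    `interrupted_step` is the zero-based index of the step that was in
    flight when the process died. Every step before it had already
    completed, so re-running those steps is pure waste. The in-flight
    step has to be re-run with or without replay, so it is not counted.
    """
    if not 0 <= interrupted_step < len(step_fees_cents):
        raise ValueError("interrupted_step is out of range")
    return sum(step_fees_cents[:interrupted_step])


def monthly_waste_cents(step_fees_cents, interrupted_step, interrupted_runs):
    """Total redundant spend for a given number of interrupted runs."""
    return wasted_cents(step_fees_cents, interrupted_step) * interrupted_runs
```

For example, with a hypothetical 500 interrupted runs per month, all interrupted during the final step, `monthly_waste_cents(DEMO_STEP_FEES_CENTS, 4, 500)` evaluates to 16000 cents, or $160, of fees spent on steps that had already completed.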
Cellaflow does away with bloated state machine libraries, replacing them with a low-level, high-throughput Rust engine core built for state safety.
Automatically serializes stack-frame variables and conversation histories to disk after every successful state transition.
Maintains an immutable, versioned, append-only ledger that records all non-deterministic events (LLM outputs, tool calls) for time-travel debugging.
Suspends checkpointing during streaming LLM completions or unresolved tool executions so that partial state is never serialized. On a crash, execution rolls back to the last committed checkpoint.
Uses an isolated key-value database embedded directly inside the single-process core daemon, avoiding the network round-trips of an external relational database.
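The checkpoint-suspension behaviour described above can be illustrated with a small context manager. This is a sketch under stated assumptions rather than the SDK's implementation: `CheckpointStore`, `non_persistable_zone`, and `stream_analysis` are hypothetical names, the store is an in-memory dict instead of an on-disk database, and a synchronous `with` block is used for brevity where the SDK example uses `async with`.

```python
from contextlib import contextmanager


class CheckpointStore:
    """Holds the last committed snapshot (hypothetical in-memory stand-in)."""

    def __init__(self):
        self.committed = {}
        self.suspended = False

    def checkpoint(self, state):
        """Persist a snapshot unless a non-persistable zone is active."""
        if self.suspended:
            return False  # commit skipped: state may be half-written
        self.committed = dict(state)  # copy, so later mutation cannot leak in
        return True


@contextmanager
def non_persistable_zone(store):
    """Suspend checkpoint commits for the duration of the block."""
    store.suspended = True
    try:
        yield
    finally:
        store.suspended = False  # always re-enable, even after a crash


def stream_analysis(store, state, chunks, crash_at=None):
    """Accumulate streamed chunks; optionally simulate a crash mid-stream."""
    with non_persistable_zone(store):
        state["analysis"] = ""
        for index, chunk in enumerate(chunks):
            if index == crash_at:
                raise RuntimeError("simulated crash mid-stream")
            state["analysis"] += chunk
            store.checkpoint(state)  # ignored while the zone is active
    store.checkpoint(state)  # commit once the stream has completed
```

If the stream is interrupted, `store.committed` still holds the snapshot taken before the zone was entered, so recovery starts from consistent state. If the stream completes, the full analysis is committed in one step.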
Cellaflow maintains a dedicated crate, `cellaflow-proto`, whose build script uses `tonic-build` to compile the Protocol Buffer definitions in `proto/cellaflow/v1/` during `cargo build`. SDK clients and backend systems share the same generated contracts without duplicating definitions.
The Tonic gRPC server rejects unencrypted HTTP/2 connections, so all pipeline traffic is encrypted in transit.
Bearer token metadata is validated at the gRPC interceptor layer before a request reaches the engine.
Each active execution is locked to the version it started on, preventing schema drift.
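As an illustration of the bearer-token check mentioned above, the following sketch validates gRPC-style request metadata, modelled here as a sequence of `(key, value)` string pairs. It is not the engine's interceptor, which the text says runs inside the Tonic server. The function names and the way the expected token is supplied are assumptions made for the example.

```python
import hmac

BEARER_PREFIX = "Bearer "


def extract_bearer_token(metadata):
    """Return the token from an `authorization` entry, or None if absent."""
    for key, value in metadata:
        if key.lower() == "authorization" and value.startswith(BEARER_PREFIX):
            return value[len(BEARER_PREFIX):]
    return None


def is_authorized(metadata, expected_token):
    """Check the presented token before the request reaches any handler."""
    token = extract_bearer_token(metadata)
    if not token:
        return False
    # Constant-time comparison avoids leaking how many characters matched.
    return hmac.compare_digest(token.encode(), expected_token.encode())
```

A request carrying `[("authorization", "Bearer s3cret")]` passes when the expected token is `"s3cret"`. A request with no `authorization` entry, or with a different token, is rejected before any engine logic runs.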
```protobuf
syntax = "proto3";
package cellaflow.v1;

service EngineService {
  // Initiates a stateful session with registry version pinning
  rpc StartSession(StartSessionRequest) returns (StartSessionResponse);
  // Commits a step output to the RocksDB Cognitive Graph
  rpc CommitStep(CommitStepRequest) returns (CommitStepResponse);
  // Recovers active session state log on failover
  rpc ReplayRecovery(ReplayRequest) returns (stream ReplayEvent);
}
```

The Cellaflow stateful runtime engine core is under active development. We are partnering with engineering teams building complex multi-agent frameworks, cognitive pipelines, and containerized agent platforms.
Book a direct technical architecture review with our founding engineers to design your stateful runtime parameters and request early builds.
The stateful engine core and compiler boundary are currently in a private repository. Code and SDK definitions will be released under an open-source license once we reach beta.