Lambda Durable Functions vs Step Functions: When the Orchestrator Belongs in Your Code
Lambda durable functions put the orchestrator in your TypeScript. Reach for them when you own the whole workflow; keep Step Functions for cross-team state machines.
Problem
A multi-step workflow on AWS Lambda has long forced a split that nobody likes: the orchestration logic lives in an Amazon States Language (ASL) state machine, while the business logic lives in the functions it calls. A six-step order saga becomes a JSON state definition in one deploy unit and TypeScript in another, and a one-line retry-policy change means editing a state machine, not the code it coordinates. With Lambda durable functions, now generally available, that split is no longer mandatory: my position is that when you own the whole workflow end-to-end, you should write the orchestrator as ordinary TypeScript inside a durable Lambda, and keep Step Functions for visual, cross-team, or borrowed orchestration. The price you pay for that is the determinism tax.
The reason the split is worth removing is that the two halves drift. Reviewers read the workflow by jumping between ASL JSON, a Lambda, and an SQS queue. Tests duplicate the state-machine logic in code because the state machine cannot be unit-tested as code. A durable function collapses that into a single narrative you can read top to bottom. The trade is real, though: the model only holds if you accept the determinism constraint that governs replay, and that constraint is where most of the work moves.
How durable functions work
A durable function is a regular Lambda that runs a long-lived, checkpointed durable execution. AWS states it can run for up to one year. Under the hood it uses a checkpoint-and-replay mechanism: when the function resumes after an interruption or a pause, your code runs from the beginning but skips completed operations, replaying their stored results instead of re-executing them.
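The replay behavior is easier to trust once you have seen it run. What follows is a toy model of checkpoint-and-replay, not the AWS SDK: `makeToyContext`, `toyHandler`, and `runUntilDone` are names invented for this sketch, the checkpoint store is an in-memory map, and everything is synchronous to keep it short. It only aims to show that the handler restarts from the top while completed steps are skipped.

```typescript
// Toy model of checkpoint-and-replay. This is NOT the AWS SDK:
// every name here is hypothetical and exists only for illustration.
type CheckpointStore = Map<string, unknown>;

function makeToyContext(store: CheckpointStore) {
  return {
    // Run fn the first time; on replay, return the stored result instead.
    step<T>(name: string, fn: () => T): T {
      if (store.has(name)) {
        return store.get(name) as T;
      }
      const result = fn();
      store.set(name, result);
      return result;
    },
  };
}

const executed: string[] = []; // which step bodies actually ran
let interruptOnce = true; // simulate a single interruption between steps

function toyHandler(ctx: ReturnType<typeof makeToyContext>): string {
  const base = ctx.step("load-order", () => {
    executed.push("load-order");
    return 21;
  });
  if (interruptOnce) {
    interruptOnce = false;
    throw new Error("simulated interruption");
  }
  const total = ctx.step("double-it", () => {
    executed.push("double-it");
    return base * 2;
  });
  return `total=${total}`;
}

// The "runtime": re-invoke the handler from the top until it completes,
// keeping the checkpoint store alive across invocations.
function runUntilDone(): { output: string; invocations: number } {
  const store: CheckpointStore = new Map();
  let invocations = 0;
  for (;;) {
    invocations += 1;
    try {
      return { output: toyHandler(makeToyContext(store)), invocations };
    } catch {
      // Interrupted: loop around and replay from the beginning.
    }
  }
}

const replayResult = runUntilDone();
console.log(replayResult, executed);
```

The handler is invoked twice, but each step body runs only once: the second invocation reads "load-order" from the store and continues to "double-it".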
Two durable operations carry the model:
- Steps add automatic retries and checkpointing to your business logic. Each context.step() call creates a checkpoint before and after execution; once a step completes, replay reuses its stored result instead of running it again. Side effects (SDK calls, writes, sends) belong here.
- Waits pause execution for a duration. The function terminates and suspends with no compute charge, then resumes via replay. This is the primitive for human-in-the-loop approvals, polling, and timers. A callback primitive covers external approvals: the execution suspends until an outside system posts a result.
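To make the wait behavior concrete, here is a toy sketch of suspend-and-resume under a simulated clock. It is an assumption-laden illustration rather than the SDK's implementation: `SuspendSignal`, `makeWaitContext`, `approvalFlow`, and `runWithTimers` are hypothetical, and the real service decides when and how to re-invoke your function.

```typescript
// Toy model of a durable wait. Hypothetical names; not the AWS SDK.
// The clock is simulated so the example finishes instantly.
class SuspendSignal {
  readonly wakeAt: number;
  constructor(wakeAt: number) {
    this.wakeAt = wakeAt;
  }
}

interface ToyState {
  results: Map<string, unknown>; // completed step results
  timers: Map<string, number>; // wait name -> wake-up time in ms
}

let simulatedNow = 0; // simulated clock, in milliseconds

function makeWaitContext(state: ToyState) {
  return {
    step<T>(name: string, fn: () => T): T {
      if (state.results.has(name)) return state.results.get(name) as T;
      const value = fn();
      state.results.set(name, value);
      return value;
    },
    wait(name: string, ms: number): void {
      // Record the wake-up time once; replays reuse the recorded value.
      const wakeAt = state.timers.get(name) ?? simulatedNow + ms;
      state.timers.set(name, wakeAt);
      if (simulatedNow < wakeAt) throw new SuspendSignal(wakeAt);
    },
  };
}

const ran: string[] = [];

function approvalFlow(ctx: ReturnType<typeof makeWaitContext>): string {
  ctx.step("send-request", () => {
    ran.push("send-request");
    return true;
  });
  ctx.wait("cool-off", 24 * 60 * 60 * 1000); // 24 hours
  return ctx.step("finalize", () => {
    ran.push("finalize");
    return "done";
  });
}

function runWithTimers(): { output: string; invocations: number } {
  const state: ToyState = { results: new Map(), timers: new Map() };
  let invocations = 0;
  for (;;) {
    invocations += 1;
    try {
      return { output: approvalFlow(makeWaitContext(state)), invocations };
    } catch (err) {
      if (!(err instanceof SuspendSignal)) throw err;
      simulatedNow = err.wakeAt; // no handler code runs until the timer fires
    }
  }
}

const waitResult = runWithTimers();
console.log(waitResult, ran, simulatedNow);
```

In this model the 24 simulated hours pass while no handler code is executing, which is the property the no-compute-charge claim rests on.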
In TypeScript you wrap the handler with withDurableExecution, which hands your function a DurableContext exposing step() and wait(). Python takes a different shape worth noting for contrast: it uses @durable_execution and @durable_step decorators, and the Python SDK is synchronous, with no await. The TypeScript path is the one this post builds on.
On cost, AWS is explicit about the wait case: for on-demand functions, a wait suspends without incurring compute charges, so a workflow that waits hours or days pays only for actual processing time, not idle waiting. Waiting is not entirely free, though. Each checkpoint carries a data-write charge, and the retained state incurs a retention charge for as long as the execution holds it. The execution role needs lambda:CheckpointDurableExecution and lambda:GetDurableExecutionState.
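As a rough sketch of what those two permissions look like in a policy document, expressed here as a TypeScript object you could serialize. The action names come from the paragraph above; the `"*"` resource is a placeholder assumption, not a recommendation, and the constant name is hypothetical.

```typescript
// Sketch of an IAM policy statement carrying the two durable-execution actions.
const durableExecutionPolicy = {
  Version: "2012-10-17",
  Statement: [
    {
      Effect: "Allow",
      Action: [
        "lambda:CheckpointDurableExecution",
        "lambda:GetDurableExecutionState",
      ],
      // Assumption: "*" is a placeholder. Scope this to your own function.
      Resource: "*",
    },
  ],
};

console.log(JSON.stringify(durableExecutionPolicy, null, 2));
```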
Note: Durable functions reached general availability in December 2025 and the feature is still expanding across regions and runtimes. Confirm runtime versions and region availability against the current AWS pages before you commit to it.
The determinism tax
Replay re-runs the handler from the top. Anything that produces a different value on a second run breaks the illusion that execution resumed where it left off. AWS puts the rule plainly: when a function resumes, Lambda runs your code from the beginning, completed steps do not re-execute, and “This is why your code must be deterministic.”
That single constraint generates the rules you actually code against:
- Every side effect lives inside a step. A direct SDK call in the handler body re-runs on every replay. That means a double charge and a double email. Wrapping it in context.step() checkpoints the result so the replay reads the record instead of re-firing the call.
- Every non-deterministic value lives inside a step. Date.now(), Math.random(), and crypto.randomUUID() drift across replays. The documented fix is not a special helper; it is to wrap the call in a step so the value is computed once, checkpointed, and replayed from the record. Treat the result like any other step output.
- Step identity is part of the contract. AWS best practices warn: “Don’t rename steps or change their behavior in ways that break replay.” A rename mid-flight orphans the checkpoint. Pin executions to a version or alias and make sure new code can still read state that old code checkpointed.
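The first two rules can be sketched with a toy replay model. Nothing here is the AWS SDK: `toyStep` is a hypothetical stand-in for a checkpointing step, and `freshValue` is a counter standing in for Date.now() or crypto.randomUUID() so the output is predictable.

```typescript
// Toy replay model (hypothetical; not the AWS SDK).
type Store = Map<string, unknown>;

function toyStep<T>(store: Store, name: string, fn: () => T): T {
  if (store.has(name)) return store.get(name) as T; // replay path
  const value = fn();
  store.set(name, value); // checkpoint
  return value;
}

// Stand-in for Date.now() or crypto.randomUUID(): a new value on every call.
let counter = 0;
const freshValue = (): string => `id-${++counter}`;

// Unsafe: the value is produced in the handler body, so every replay
// of the handler sees a different one.
function unsafeHandler(_store: Store): string {
  return freshValue();
}

// Safe: the value is produced inside a step, so replays read the checkpoint.
function safeHandler(store: Store): string {
  return toyStep(store, "make-id", freshValue);
}

// Invoke each handler twice against the same store to simulate a replay.
const unsafeStore: Store = new Map();
const unsafeRuns = [unsafeHandler(unsafeStore), unsafeHandler(unsafeStore)];

const safeStore: Store = new Map();
const safeRuns = [safeHandler(safeStore), safeHandler(safeStore)];

console.log({ unsafeRuns, safeRuns });
```

The unsafe handler yields two different values across its two runs; the safe handler yields the same value twice because the second run never calls `freshValue`.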
In practice the handler body becomes pure orchestration: control flow over step results, nothing else. This is the same discipline that Azure Durable Functions and Temporal impose on orchestrator code, so if you have written replay-safe orchestrators before, the model maps directly.
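The step-identity rule is the least intuitive of the three, so here is a toy sketch of a rename landing mid-flight. It assumes checkpoints are keyed purely by step name, which is a simplification for illustration; check the AWS best-practices page for how the real service identifies steps. `namedStep`, `handlerV1`, and `handlerV2` are hypothetical.

```typescript
// Toy model: checkpoints keyed by step name (an assumption of this sketch).
type NamedStore = Map<string, unknown>;

const sideEffects: string[] = [];

function namedStep<T>(store: NamedStore, name: string, fn: () => T): T {
  if (store.has(name)) return store.get(name) as T;
  const value = fn();
  store.set(name, value);
  return value;
}

// Version 1: the code that was deployed when the execution started.
function handlerV1(store: NamedStore): string {
  return namedStep(store, "charge-payment", () => {
    sideEffects.push("charged");
    return "charge-from-v1";
  });
}

// Version 2: same behavior, but the step has been renamed.
function handlerV2(store: NamedStore): string {
  return namedStep(store, "charge-card", () => {
    sideEffects.push("charged");
    return "charge-from-v2";
  });
}

const inFlight: NamedStore = new Map();
handlerV1(inFlight); // first invocation under the old code
const replayed = handlerV2(inFlight); // replay after the rename shipped

console.log({ sideEffects, replayed, keys: [...inFlight.keys()] });
```

In this model the replay finds no checkpoint under the new name, so the charge runs a second time and the old checkpoint is left orphaned in the store.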
Saga in code vs ASL
The clearest way to see the trade is to write the same order saga both ways. First the durable function, where the orchestrator is ordinary control flow:
import { DurableContext, withDurableExecution } from "@aws/durable-execution-sdk-js";

export const handler = withDurableExecution(
  async (event: OrderEvent, context: DurableContext) => {
    const payment = await context.step("charge-payment", () =>
      chargePayment(event.orderId, event.amount)
    );
    try {
      const shipment = await context.step("reserve-inventory", () =>
        reserveInventory(event.orderId, event.items)
      );
      await context.step("confirm-order", () => confirmOrder(event.orderId));
      return { status: "confirmed", shipment };
    } catch (err) {
      // compensating transaction: saga rollback in an ordinary catch block
      await context.step("refund-payment", () =>
        refundPayment(payment.chargeId)
      );
      throw err;
    }
  }
);
The rollback is a plain catch. Retries are a step option. The whole saga reads top to bottom as one function, and you can unit-test it by stubbing the step bodies. The same saga as ASL is a data structure instead:
{
  "Comment": "Order saga (illustrative ASL)",
  "StartAt": "ChargePayment",
  "States": {
    "ChargePayment": { "Type": "Task", "Resource": "arn:...:chargePayment", "Next": "ReserveInventory" },
    "ReserveInventory": {
      "Type": "Task", "Resource": "arn:...:reserveInventory",
      "Catch": [{ "ErrorEquals": ["States.ALL"], "Next": "RefundPayment" }],
      "Next": "ConfirmOrder"
    },
    "ConfirmOrder": { "Type": "Task", "Resource": "arn:...:confirmOrder", "End": true },
    "RefundPayment": { "Type": "Task", "Resource": "arn:...:refundPayment", "Next": "Failed" },
    "Failed": { "Type": "Fail" }
  }
}
The compensating path is a Catch rule. The story in the console is strong: anyone can open the graph and watch the saga move state to state. The story in code review and unit testing is weaker, because the logic is JSON the test suite cannot exercise directly. This is the whole trade in one pair of files. The durable function optimizes for the engineer who owns and changes the workflow; the ASL optimizes for the reader who needs to see it.
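The unit-testing claim for the durable version can be sketched without the SDK at all. The example below is one possible arrangement under stated assumptions: `StepContext` is a hypothetical minimal interface covering only the step() call the saga uses, `SagaDeps` is an invented dependency bundle so the step bodies can be stubbed, and `fakeContext` runs each body immediately with no checkpointing. It exercises the rollback path when the inventory reservation fails.

```typescript
// Hypothetical minimal shape of the context the saga needs.
interface StepContext {
  step<T>(name: string, fn: () => Promise<T>): Promise<T>;
}

// Hypothetical dependency bundle so a test can stub every step body.
interface SagaDeps {
  chargePayment(orderId: string, amount: number): Promise<{ chargeId: string }>;
  reserveInventory(orderId: string): Promise<{ reservationId: string }>;
  confirmOrder(orderId: string): Promise<void>;
  refundPayment(chargeId: string): Promise<void>;
}

// Same control flow as the saga above, with dependencies injected.
async function orderSaga(ctx: StepContext, deps: SagaDeps, orderId: string, amount: number) {
  const payment = await ctx.step("charge-payment", () => deps.chargePayment(orderId, amount));
  try {
    const shipment = await ctx.step("reserve-inventory", () => deps.reserveInventory(orderId));
    await ctx.step("confirm-order", () => deps.confirmOrder(orderId));
    return { status: "confirmed" as const, shipment };
  } catch (err) {
    await ctx.step("refund-payment", () => deps.refundPayment(payment.chargeId));
    throw err;
  }
}

// Fake context: records step names and runs each body immediately.
function fakeContext(): StepContext & { names: string[] } {
  const names: string[] = [];
  return {
    names,
    step<T>(name: string, fn: () => Promise<T>): Promise<T> {
      names.push(name);
      return fn();
    },
  };
}

// Scenario: the reservation fails, so the saga must refund and rethrow.
async function runFailureScenario(): Promise<{ names: string[]; refunded: string[]; error: string }> {
  const ctx = fakeContext();
  const refunded: string[] = [];
  const deps: SagaDeps = {
    chargePayment: async () => ({ chargeId: "ch_test" }),
    reserveInventory: async () => {
      throw new Error("out of stock");
    },
    confirmOrder: async () => {},
    refundPayment: async (chargeId) => {
      refunded.push(chargeId);
    },
  };
  let error = "";
  try {
    await orderSaga(ctx, deps, "order-1", 100);
  } catch (err) {
    error = err instanceof Error ? err.message : String(err);
  }
  return { names: ctx.names, refunded, error };
}

runFailureScenario().then((outcome) => console.log(outcome));
```

A fake like this checks the orchestration logic only. It says nothing about replay safety, which still needs the determinism rules from earlier.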
The durable function is the shape I would reach for first on a workflow I own. The honest caveat for my own code: a saga I run today leans on a Date.now() for an idempotency window and a direct SDK write before the first task, and both would have to move inside steps before replay would be safe. The determinism tax is not abstract; it lands exactly where you were sloppy about side effects.
Recommended default
For a workflow you own end-to-end, with one team, one bounded context, and logic that changes alongside your code, write the orchestrator as a durable Lambda in TypeScript. You get one deploy unit, ordinary control flow, code you can test, and a workflow a reviewer can read as a single narrative. Step Functions then becomes the override for specific cases, not the starting point.
The trade behind that default is worth stating directly. Durable functions buy you orchestrator-in-code at the cost of the determinism tax and the newness of the feature: it is younger and less proven than Step Functions, and the Java SDK only reached general availability in April 2026, so JS/TS and Python are the mature paths today. A risk-averse team with a working Step Functions machine has a legitimate reason to leave it where it is, even when the durable shape would read cleaner. Newer means fewer of its failure modes have been discovered and documented, and that is a real input to the decision, not a flaw to wave away.
The decision tree below roots at the durable-function default; each branch names an override toward Step Functions.
When Step Functions still wins
AWS’s own decision page draws the line, and it matches this thesis. It says to use Step Functions when you need a visual workflow representation for cross-team visibility, when you are orchestrating multiple AWS services and want native integrations without custom SDK code, when you require zero-maintenance infrastructure with no patching or runtime updates, and when non-technical stakeholders need to understand and validate workflow logic. AWS frames durable functions as “optimized for application development within Lambda” and Step Functions as “built for workflow orchestration across AWS services.” The override cases follow from that framing:
- Non-engineers are part of the audience. When ops, compliance, or product read the workflow graph in the console, code is not their interface. AWS lists “non-technical stakeholders need to understand and validate workflow logic” as a Step Functions trigger, and a durable function cannot offer that view.
- The workflow is service glue, not logic. Step Functions advertises native integrations across 220+ AWS services and 16,000 APIs without custom SDK code. When the workflow is mostly service-to-service wiring rather than branching logic, those direct integrations win over hand-written SDK calls inside steps.
- You want runtime-agnostic, zero-maintenance infrastructure. AWS frames Step Functions as fully managed and runtime agnostic, with no patching or runtime updates. A durable function runs inside the Lambda environment and inherits Lambda’s runtime lifecycle.
- The state machine is a cross-team contract. When a workflow spans teams, the state machine is the shared, visual artifact each side agrees on. AWS frames this as cross-team visibility, and that visibility is the point.
The strongest evidence here is AWS endorsing the same default. Its migration guidance reads: “Begin with durable functions for Lambda-centric workflows. Add Step Functions when you need multi-service orchestration or visual workflow design.” For existing users it adds: keep Step Functions for established cross-service workflows, and consider durable functions for new Lambda application logic that needs reliability. That is the vendor recommending durable-first for workflows you own, which is exactly the position this post takes.
Common pitfalls
- Side effect in the handler body instead of a step. It re-runs on every replay: double charge, double email. Wrap every effect in a step.
- Non-deterministic value in the handler body. Date.now(), randomness, and UUIDs drift across replays and corrupt the resumed state. Produce them inside a context.step() so the value is checkpointed once and replayed.
- Renaming a step or changing its behavior between deploys. A rename mid-flight orphans the checkpoint. Pin executions to a version or alias and keep step identity stable.
- Treating a wait like a busy-loop. Use a durable wait, which suspends without compute charge, for long pauses, not a polling loop that bills CPU the whole time.
- Reaching for a durable function when a single short Lambda suffices. Orchestration overhead on a one-shot function is waste; durable functions earn their keep on multi-step, long-lived, or crash-sensitive flows.
- Porting an ASL machine into code and silently dropping the visual. Stakeholders who depended on the console graph lose their interface. Migration is a communication change, not only a code change.
For the broader saga model these workflows implement, see the saga pattern for distributed transactions. For the incumbent’s full model, the Step Functions deep dive covers ASL, Standard vs Express, and direct integrations. The question of how much logic belongs in one Lambda connects to Lambda function granularity, and for agentic multi-step flows specifically, Bedrock AgentCore in production covers a related orchestration surface.
Closing
Default to a durable Lambda when you own the workflow end-to-end: one team, one bounded context, logic that moves with your code. The orchestrator becomes ordinary TypeScript in one deploy unit, and you pay for it with the determinism tax, every side effect and every non-deterministic value pushed inside a step. Reach for Step Functions when the boundary flips: a non-engineer needs the visual, the workflow is mostly cross-service glue, or the state machine is a contract shared across teams. The next concrete step is to pick one workflow you own, audit it for Date.now() and direct SDK calls in what would become the handler body, and see how much of it the determinism tax would actually touch.
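For the audit itself, a crude text scan is enough to get a first list. The sketch below is a starting point under obvious assumptions: it is a regex scan rather than a parser, it flags every occurrence including ones already inside a step, and it does not attempt to detect direct SDK calls, which depend on the clients you use. `auditSource` and the sample handler are hypothetical.

```typescript
// Crude scan for non-deterministic calls. Treat hits as a review list,
// not a verdict: the scan cannot tell whether a call sits inside a step.
const NONDETERMINISTIC_CALLS: ReadonlyArray<{ label: string; pattern: RegExp }> = [
  { label: "Date.now()", pattern: /\bDate\.now\s*\(/ },
  { label: "Math.random()", pattern: /\bMath\.random\s*\(/ },
  { label: "crypto.randomUUID()", pattern: /\brandomUUID\s*\(/ },
];

interface Finding {
  line: number; // 1-based line number in the scanned source
  label: string;
  text: string;
}

function auditSource(source: string): Finding[] {
  const found: Finding[] = [];
  source.split("\n").forEach((text, index) => {
    for (const { label, pattern } of NONDETERMINISTIC_CALLS) {
      if (pattern.test(text)) {
        found.push({ line: index + 1, label, text: text.trim() });
      }
    }
  });
  return found;
}

// Hypothetical handler body to scan.
const sampleSource = [
  "export async function handler(event) {",
  "  const startedAt = Date.now();",
  "  const key = crypto.randomUUID();",
  "  return { startedAt, key };",
  "}",
].join("\n");

const auditFindings = auditSource(sampleSource);
console.log(auditFindings);
```

Pointing the same function at the contents of a real handler file gives a quick sense of how many lines the determinism tax would touch.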
References
- Lambda durable functions (AWS Lambda Developer Guide) - Primary source for the mechanism: checkpoint and replay, steps vs waits, the one-year maximum, and the use-case list. Verify currency before publishing; the feature is under seven months old.
- Durable functions or Step Functions (AWS Lambda Developer Guide) - The official decision page: when to use each, the feature comparison table, and the migration guidance quoted in the boundary section.
- Creating Lambda durable functions, getting started (AWS Lambda Developer Guide) - Verified API surface and the “your code must be deterministic” statement.
- TypeScript SDK reference (AWS Durable Execution SDK) - Confirms the @aws/durable-execution-sdk-js import, the withDurableExecution wrapper, and the context.step() and context.wait() signatures.
- Best practices for Lambda durable functions (AWS Lambda Developer Guide) - Source for the step-identity rule: do not rename steps or change behavior in ways that break replay; use versions and aliases.
- AWS Durable Execution SDK Developer Guide - SDK landing page listing callbacks, parallel and map operations, child contexts, and the open-source SDK repositories.
- Build multi-step applications and AI workflows with AWS Lambda durable functions (AWS News Blog) - Launch post with the full end-to-end order-processing example and the two-primitive framing.
- AWS Lambda announces durable functions (What’s New) - General availability announcement from December 2025 with the initial region and runtime support.
- AWS Lambda Durable Execution SDK for Java GA (What’s New) - Grounds the maturity note: JS/TS and Python are mature paths, the Java SDK is newest.
- aws-durable-execution-sdk-js (GitHub) - The open-source JS/TS SDK; authoritative source for the import path and current version.
- AWS Step Functions Developer Guide - The incumbent orchestrator: ASL, Standard vs Express, and direct service integrations.
- Amazon States Language specification - The ASL reference for the saga-as-data side of the comparison.
- AWS Lambda Pricing - Grounds the pricing claims: no compute charge during a wait on on-demand functions, plus per-checkpoint data-write and state-retention charges.