Human-in-the-loop done right: when to pause, when to auto-resume

Most human-in-the-loop designs fail because the wait is expensive

The naive way to add a human approval step to a workflow is to call out to a webhook and block. The worker thread holds. The connection holds. If the human takes a day, the worker holds for a day. At any meaningful scale this collapses — workers consumed by waiting, not working — and engineers respond by removing approval steps to keep throughput up. The human is removed from the loop because the loop could not afford the human.

Durable execution flips this. The workflow yields, the engine persists the wait, no resources are held, and the workflow resumes when the approval arrives, whether that takes minutes, hours, or days. A paused workflow costs close to nothing: it is a persisted record, not a held worker, connection, or block of memory. Humans stop being a bottleneck because the system was designed to wait without paying for the wait.
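To make the yield-and-resume idea concrete, here is a minimal sketch, not our engine's implementation. It assumes a hypothetical InMemoryStore standing in for a durable database, and hypothetical pause_for_approval and on_approval functions. The point it illustrates is that pausing is a write followed by a return, so nothing stays blocked while the human decides.

```python
import json
from dataclasses import dataclass, asdict
from typing import Dict, Optional


@dataclass
class PausedWorkflow:
    workflow_id: str
    step: str    # name of the step waiting on a human
    state: dict  # checkpointed inputs and outputs so far


class InMemoryStore:
    """Hypothetical stand-in for a durable store (a database table in practice)."""

    def __init__(self) -> None:
        self._rows: Dict[str, str] = {}

    def save(self, wf: PausedWorkflow) -> None:
        # Serialize so the wait survives outside the worker's memory.
        self._rows[wf.workflow_id] = json.dumps(asdict(wf))

    def load(self, workflow_id: str) -> Optional[PausedWorkflow]:
        # Remove the row on load so a wait can only be resumed once.
        raw = self._rows.pop(workflow_id, None)
        return PausedWorkflow(**json.loads(raw)) if raw else None


def pause_for_approval(store: InMemoryStore, workflow_id: str, state: dict) -> None:
    """Persist the wait and return immediately; no thread blocks here."""
    store.save(PausedWorkflow(workflow_id, "await_approval", state))


def on_approval(store: InMemoryStore, workflow_id: str, decision: str) -> dict:
    """Called when the human acts, possibly days later, on any worker."""
    wf = store.load(workflow_id)
    if wf is None:
        raise KeyError(f"no paused workflow {workflow_id!r}")
    # Resume from the checkpointed state with the decision attached.
    return {**wf.state, "decision": decision}
```

In this sketch, whichever worker receives the approval calls on_approval and picks up from the stored state. The worker that called pause_for_approval was free the moment that call returned.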

Pause when the decision requires judgment, auto-resume when it does not

The design question is not 'should there be a human in this workflow' but 'which steps actually need human judgment.' Cost-threshold approvals, exception cases, regulatory holds, customer-impacting changes — yes. Routine retries, expected variance within tolerance, known-pattern exceptions — no. The workflow should pause for the first set, auto-resume for the second, and the decision should be encoded explicitly in the workflow rather than left to runtime.

The anti-pattern we see is workflows that pause on every exception by default, flood the human approval queue, and degrade into rubber-stamp approvals that defeat the point. The right pattern is conservative auto-resume with an explicit policy for what reaches a human, plus a fast lane to override the policy when it misses a case.
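One way to encode that policy explicitly is sketched below. The field names, the 10,000 cost threshold, and the 5% tolerance are illustrative assumptions, not values from any real deployment. Sending unrecognized exceptions to a human is also an assumption about what "conservative" means here.

```python
from dataclasses import dataclass
from enum import Enum


class Disposition(Enum):
    PAUSE = "pause"
    AUTO_RESUME = "auto_resume"


@dataclass(frozen=True)
class StepContext:
    amount: float             # cost of the action under review
    variance_pct: float       # deviation from the expected value, in percent
    known_pattern: bool       # exception matches a known, pre-approved pattern
    regulatory_hold: bool
    customer_impacting: bool


@dataclass(frozen=True)
class PausePolicy:
    cost_threshold: float = 10_000.0     # hypothetical threshold
    variance_tolerance_pct: float = 5.0  # hypothetical tolerance

    def decide(self, ctx: StepContext, force_review: bool = False) -> Disposition:
        # Fast lane: an operator can force a human review when the policy misses.
        if force_review:
            return Disposition.PAUSE
        # Judgment calls always reach a human.
        if ctx.regulatory_hold or ctx.customer_impacting:
            return Disposition.PAUSE
        if ctx.amount >= self.cost_threshold:
            return Disposition.PAUSE
        # Expected variance and known-pattern exceptions resume on their own.
        if abs(ctx.variance_pct) <= self.variance_tolerance_pct or ctx.known_pattern:
            return Disposition.AUTO_RESUME
        # Assumption: anything unrecognized is an exception case and pauses.
        return Disposition.PAUSE
```

Because the policy is a plain value the workflow consults, the thresholds can be reviewed and versioned like any other configuration. That is the sense in which the decision is not left to runtime.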

Workflows with HITL steps: ~60% (average per tenant)
Median approval latency: 4–18 min (desk-time approvals)
Auto-resume rate: 78% (expected-variance branches)
Approval rubber-stamp rate: < 5% (with the right policy)

The approval payload has to make the decision easy

When a human gets pinged for approval, what they see decides the quality and the speed of the decision. A bad approval payload is 'Workflow 4f7a-c2 needs approval. Approve / Reject.' A good payload is the inputs the workflow saw, the outputs computed so far, the recommended decision with the reasoning, the impact if approved, the impact if rejected, and the SLA on the decision.

Our Workflow Automation engine treats the approval UI as a first-class artifact. Every HITL step declares the schema of the payload presented to the approver, the actions available, and the resulting workflow path per action. Approvers stop asking 'what is this' because the system already tells them.
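As an illustration of what such a declaration might contain, the sketch below models the payload as a typed record. The class and field names are hypothetical and are not the engine's actual schema. The fields follow the list above: inputs, outputs so far, recommendation with reasoning, impact either way, SLA, and a workflow path per action.

```python
from dataclasses import dataclass
from typing import Any, Dict


@dataclass(frozen=True)
class ApprovalPayload:
    workflow_id: str
    inputs: Dict[str, Any]          # what the workflow saw
    outputs_so_far: Dict[str, Any]  # what it has computed
    recommendation: str             # the suggested action
    reasoning: str                  # why that action is suggested
    impact_if_approved: str
    impact_if_rejected: str
    sla_minutes: int                # how long the approver has to decide
    actions: Dict[str, str]         # action name -> next workflow step

    def next_step(self, action: str) -> str:
        """Resolve the workflow path for the action the approver picked."""
        if action not in self.actions:
            raise ValueError(f"action {action!r} is not offered by this step")
        return self.actions[action]


# Hypothetical example of a payload an approver would see.
payload = ApprovalPayload(
    workflow_id="4f7a-c2",
    inputs={"vendor": "Acme", "amount": 18_500},
    outputs_so_far={"budget_remaining": 42_000},
    recommendation="approve",
    reasoning="Amount is within remaining budget; vendor is on the approved list.",
    impact_if_approved="Purchase order is issued today.",
    impact_if_rejected="Request returns to the requester for revision.",
    sla_minutes=240,
    actions={"approve": "issue_purchase_order", "reject": "notify_requester"},
)
```

Declaring the action-to-path mapping alongside the payload means the approver can only choose actions the workflow knows how to continue from.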

Routing approvals: who decides matters as much as what they see

Approval routing has to respect organizational structure: amount thresholds, role requirements, dual approval for sensitive actions, escalation if the primary approver is unavailable, delegation if they are out of office. None of this can live in the workflow author's head. It has to live in a routing rule the workflow consults at the moment of pause.

We deploy routing as a separate concern from workflow logic. The workflow declares 'this needs approval at this risk tier'; the routing engine decides which person, with which fallback. Org charts change, vacation calendars change, role assignments change — none of these require changing workflow code. The routing layer absorbs them.
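The separation can be sketched as a lookup the workflow calls with nothing but a risk tier. Everything here is an assumption for illustration: the RoutingRule shape, the tier names, and the resolve_approvers function. The sketch covers fallback, delegation, and dual approval, and leaves amount thresholds and role checks out for brevity.

```python
from dataclasses import dataclass
from typing import Dict, List, Set


@dataclass(frozen=True)
class RoutingRule:
    approvers: List[str]         # ordered: primary first, then fallbacks
    required_approvals: int = 1  # 2 models dual approval


def resolve_approvers(
    rules: Dict[str, RoutingRule],
    risk_tier: str,
    out_of_office: Set[str],
    delegates: Dict[str, str],
) -> List[str]:
    """Pick the people who should decide, given who is available right now."""
    rule = rules[risk_tier]
    chosen: List[str] = []
    for person in rule.approvers:
        # An absent approver is replaced by their delegate, if one is set.
        candidate = delegates.get(person, person) if person in out_of_office else person
        # Skip people who are still unavailable or already selected.
        if candidate in out_of_office or candidate in chosen:
            continue
        chosen.append(candidate)
        if len(chosen) == rule.required_approvals:
            return chosen
    raise LookupError(f"not enough available approvers for tier {risk_tier!r}")


# Hypothetical rules: one approver for low risk, two for high risk.
rules = {
    "low": RoutingRule(approvers=["alice", "bob"]),
    "high": RoutingRule(approvers=["carol", "dave", "erin"], required_approvals=2),
}
```

A vacation or a reorganization changes out_of_office, delegates, or rules. The workflow's call with its risk tier stays the same.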

Audit the decision, not just the action

Every approval produces an audit row: who approved, when, against what input, with what action. The compliance team reads this; the auditor reads this; the post-mortem reads this when something goes wrong. The audit row has to capture not just the decision but the input the decider saw at decision time, because the input may have changed by the time the audit happens.

Durable execution makes this clean. Every step's input is checkpointed; the approval step's input is the approver's view; the action is the approver's response. Reconstructing what the approver knew at the time of decision is a matter of replaying the trace, not reconstructing context from logs.
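A minimal sketch of such an audit row follows, assuming a hypothetical record_decision helper and a JSON-serializable input. It copies the approver's view at decision time and stores a digest of it, so later changes to the live data cannot alter what the row says the approver saw.

```python
import copy
import hashlib
import json
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Optional


@dataclass(frozen=True)
class AuditRow:
    workflow_id: str
    approver: str
    action: str
    decided_at: str       # ISO 8601 timestamp, UTC
    input_snapshot: dict  # exactly what the approver was shown
    input_digest: str     # SHA-256 of the canonical snapshot


def record_decision(
    workflow_id: str,
    approver: str,
    action: str,
    presented_input: dict,
    now: Optional[datetime] = None,
) -> AuditRow:
    # Deep-copy so later mutation of the live input cannot leak into the audit row.
    snapshot = copy.deepcopy(presented_input)
    # Sorted keys give a stable serialization, so equal inputs hash equally.
    canonical = json.dumps(snapshot, sort_keys=True)
    digest = hashlib.sha256(canonical.encode("utf-8")).hexdigest()
    decided_at = (now or datetime.now(timezone.utc)).isoformat()
    return AuditRow(workflow_id, approver, action, decided_at, snapshot, digest)
```

In an engine that already checkpoints step inputs, the snapshot would come from the checkpoint instead of a copy made here. The digest is an optional extra for tamper checks.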

SLAs and escalation prevent silent stalls

Pauses without SLAs become silent stalls. A workflow that has been waiting on approval for three days might be stalled by a vacation, a missed notification, a routing miss, or a forgotten queue. Without an SLA and an escalation, the workflow author finds out when the customer complains. With both, the system escalates after the configured timeout (to the next approver in the routing rule, to a manager, or to a recovery branch), and the workflow either resumes or fails loudly.

Setting these timeouts is policy, not engineering. The right escalation cadence depends on what the workflow does and who pays the cost of delay. The point is that the timeout exists. Workflows without timeouts on human steps are workflows waiting to surprise someone.
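The escalation mechanics can be sketched in a few lines. The EscalationPolicy shape and the four-hour timeout in the example are assumptions chosen for illustration. The sketch computes who owns the decision at a given moment and returns None once the chain is exhausted, which is the caller's signal to take the recovery branch or fail loudly.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import List, Optional


@dataclass(frozen=True)
class EscalationPolicy:
    chain: List[str]    # ordered: primary approver first, then escalation targets
    timeout: timedelta  # how long each level gets before escalating


def current_assignee(
    policy: EscalationPolicy, paused_at: datetime, now: datetime
) -> Optional[str]:
    """Return who should hold the approval at `now`, or None if the chain ran out."""
    if now < paused_at:
        raise ValueError("now is earlier than paused_at")
    # Floor division of two timedeltas yields the number of whole timeouts elapsed.
    level = (now - paused_at) // policy.timeout
    if level < len(policy.chain):
        return policy.chain[level]
    return None  # chain exhausted: recovery branch or loud failure


# Hypothetical policy: four hours per level, then escalate to the manager.
policy = EscalationPolicy(chain=["approver", "manager"], timeout=timedelta(hours=4))
paused = datetime(2024, 1, 1, 9, 0, tzinfo=timezone.utc)
```

In a durable engine this check would typically run from a persisted timer, not from polling. The policy data it reads is the part the business owns.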

Compensation paths handle rejected approvals cleanly

When a human rejects, the workflow does not just stop — it runs the rejection branch. That might be a different downstream path, a notification to the requester, a saga compensation that unwinds prior steps, or a hold for further review. The rejection is a first-class workflow outcome with its own logic, not an exception.

Engines that treat rejection as a generic failure force every workflow author to reinvent the rejection path. Engines that treat rejection as a typed outcome let the author declare the path explicitly and audit it cleanly. The second pattern is what we deploy.
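A sketch of the typed-outcome approach, using hypothetical names throughout: the decision is an enum value instead of a raised error, and the rejection branch is ordinary workflow logic. Here it runs saga-style compensation in reverse order and then notifies the requester.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Callable, List


class Outcome(Enum):
    APPROVED = "approved"
    REJECTED = "rejected"


@dataclass
class CompletedStep:
    name: str
    compensate: Callable[[], str]  # undoes the step; returns a log entry


def handle_decision(outcome: Outcome, completed: List[CompletedStep], log: List[str]) -> str:
    """Route the workflow based on a typed decision, not an exception."""
    if outcome is Outcome.APPROVED:
        log.append("continue:next_step")
        return "continued"
    # Rejection branch: unwind prior steps, most recent first.
    for step in reversed(completed):
        log.append(step.compensate())
    log.append("notify:requester")
    return "rejected"


# Hypothetical steps that ran before the approval was requested.
steps = [
    CompletedStep("reserve_funds", lambda: "undo:reserve_funds"),
    CompletedStep("create_po", lambda: "undo:create_po"),
]
```

Because both outcomes return normally, the audit trail records a rejection as a decision that was made, not as a stack trace to be interpreted.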

The first time a HITL approval timed out and the workflow auto-escalated to the SVP without anyone telling her about it, she approved within 12 minutes and the customer never knew there was a delay. That is the moment the operations team stopped treating approval as a risk and started treating it as a feature.

— Director of Operations, financial services client

Frequently asked

What is a human-in-the-loop workflow step?

A workflow step that pauses execution to wait for a human decision — an approval, a verification, a judgment call — before resuming the rest of the workflow. The decision is captured as a typed input to the next step, audited explicitly, and routed to the right person via a routing rule. Done correctly, the workflow yields cheaply during the wait so resources are not held while a human takes minutes, hours, or days to act.

When should a workflow pause for human approval versus auto-resume?

Pause when the decision requires judgment: cost thresholds, exception cases, regulatory holds, customer-impacting changes. Auto-resume when the variance is within expected tolerance, the exception matches a known pattern, or the routine retry policy applies. The decision should be encoded explicitly in the workflow rather than left to runtime, and the policy should be conservative enough that human approval queues do not flood and degrade into rubber-stamping.

How does durable execution change the cost of human-in-the-loop?

Durable execution lets the workflow yield without holding workers, connections, or memory. The engine persists the wait, releases resources, and resumes when the human acts. The cost of a paused workflow approaches zero, which means human approval is not a throughput bottleneck. Naive workflow engines that block a worker thread during human waits collapse at scale, which is why teams remove approval steps to maintain throughput.

What does a good approval payload look like?

Inputs the workflow saw, outputs computed so far, the recommended decision with reasoning, impact if approved, impact if rejected, and an SLA. The approver should not have to ask what the workflow is or why it paused. Every HITL step declares the schema of the payload, the actions available, and the workflow path per action. Approvers spend their time deciding, not investigating.

How do you prevent silent stalls on human approval steps?

Every HITL step has an SLA and an escalation path. If the primary approver does not act within the timeout, the system escalates — to the next approver in the routing rule, to a manager, or to a recovery branch — and the workflow either resumes or fails loudly. Workflows without timeouts on human steps are workflows waiting to surprise someone. Setting the timeout is policy, not engineering.

How is approval routing decoupled from workflow logic?

The workflow declares the risk tier of the decision; the routing engine resolves the actual person based on org structure, amount thresholds, dual-approval rules, vacation calendars, and delegation. Org changes, role assignments, and out-of-office settings update routing without changing workflow code. Without this decoupling, every HR change becomes a workflow change, which is unsustainable.