Automatic Retries
kontinue can automatically retry failed Executions. When an Execution fails
and has retries available, it enters the Backoff phase before being re-executed. This provides
resilience against transient failures without manual intervention.
Retries are configured per-Execution and control both the number of retry attempts and the behavior of child resources when retrying.
Configuration
Configure retries in the Execution spec:
apiVersion: kontinue.cloud/v1alpha1
kind: Execution
metadata:
  name: my-execution
spec:
  function: my-function
  retry:
    retries: 3          # Number of retry attempts (default: 0)
    backoff: 5s         # Wait time between retries (default: 5s)
    childPolicy: None   # How to handle children on retry
| Field | Description | Default |
|---|---|---|
| retries | Number of retry attempts. If 0, the Execution is not retried on failure. | 0 |
| backoff | How long to wait before retrying. | 5s |
| childPolicy | How child resources are handled on retry. | None |
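To make the interaction between retries and backoff concrete, the following sketch computes the worst-case number of runs and the total time spent waiting. It assumes the backoff is a fixed wait applied before each retry, as the table suggests; the RetryConfig type and both helper functions are hypothetical and not part of the kontinue API.

```go
package main

import (
	"fmt"
	"time"
)

// RetryConfig mirrors the three documented fields. The type is hypothetical
// and exists only for this illustration.
type RetryConfig struct {
	Retries     int
	Backoff     time.Duration
	ChildPolicy string
}

// MaxRuns returns the initial run plus every configured retry.
func MaxRuns(c RetryConfig) int {
	if c.Retries < 0 {
		return 1
	}
	return c.Retries + 1
}

// MaxBackoffWait returns the total backoff time if every retry is used,
// assuming a fixed wait before each retry.
func MaxBackoffWait(c RetryConfig) time.Duration {
	if c.Retries <= 0 {
		return 0
	}
	return time.Duration(c.Retries) * c.Backoff
}

func main() {
	c := RetryConfig{Retries: 3, Backoff: 5 * time.Second, ChildPolicy: "None"}
	fmt.Println(MaxRuns(c), MaxBackoffWait(c)) // prints "4 15s"
}
```

With the values from the example spec, an Execution that keeps failing runs at most four times and spends at most 15 seconds in backoff, not counting the time each run takes.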
Configuring via SDK
When spawning child Executions from within a function, use ExecuteOptions:
result, err := kontinue.Execute[MyResult](ktx, "child-function", &ChildArgs{}, &kontinue.ExecuteOptions{
    Retry: &kontinue.RetryOptions{
        Retries:     2,
        ChildPolicy: kontinuev1alpha1.ChildPolicyRetryFailed,
    },
})
When spawning an Execution externally via the client library, pass retry options in SpawnOptions:
exec, err := client.Spawn(ctx, "my-function", &MyArgs{}, &client.SpawnOptions{
    Retry: &kontinuev1alpha1.ExecutionRetry{
        Retries:     3,
        Backoff:     &metav1.Duration{Duration: 10 * time.Second},
        ChildPolicy: kontinuev1alpha1.ChildPolicyRetryFailed,
    },
})
Function Defaults
Set default retry configuration for all Executions of a function during registration:
worker.RegisterFunction(w, "deploy-cluster", DeployCluster, &function.Options{
    Description: "Deploy a cluster with automatic retries",
    Defaults: &function.ExecutionDefaults{
        Retry: &function.ExecutionRetry{
            Retries:     3,
            Backoff:     &metav1.Duration{Duration: 30 * time.Second},
            ChildPolicy: kontinuev1alpha1.ChildPolicyRetryFailed,
        },
    },
})
These defaults apply to all Executions of the function unless overridden at creation time. The defaults are also published to the Function CRD and visible in the UI.
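The precedence described above can be sketched as a small resolution function. This is an illustration only: it assumes that a retry block supplied at creation time replaces the function's default as a whole, which this page does not confirm (kontinue may instead merge field by field). The Retry type and Resolve function are hypothetical names.

```go
package main

import (
	"fmt"
	"time"
)

// Retry is a hypothetical, simplified stand-in for the retry settings.
type Retry struct {
	Retries     int
	Backoff     time.Duration
	ChildPolicy string
}

// Resolve picks the retry settings for a new Execution.
// Assumption: an explicit override wins outright over the function default,
// and the documented field defaults apply when neither is set.
func Resolve(functionDefault, override *Retry) Retry {
	if override != nil {
		return *override
	}
	if functionDefault != nil {
		return *functionDefault
	}
	return Retry{Retries: 0, Backoff: 5 * time.Second, ChildPolicy: "None"}
}

func main() {
	def := &Retry{Retries: 3, Backoff: 30 * time.Second, ChildPolicy: "RetryFailed"}

	fmt.Println(Resolve(def, nil))                // function default applies
	fmt.Println(Resolve(def, &Retry{Retries: 1})) // creation-time override applies
	fmt.Println(Resolve(nil, nil))                // documented field defaults apply
}
```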
Backoff Phase
When an Execution fails and has retries remaining, it transitions to the Backoff phase.
During this phase, the Execution waits for the configured backoff duration before being
picked up by a worker again.
Executing → Failed → Backoff → Executing → Completed
                       ↑           ↓
                       └── Failed ─┘ (if retries remain)
The status.attempts array records each failed attempt, including the phase, message,
worker name, and timestamps.
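The phase transitions can be modeled with a short simulation. The sketch below is a simplified model, not the controller's implementation: the Phase constants, the Attempt type, and the Run function are hypothetical, and the real status.attempts entries carry more fields (worker name, timestamps) than shown here.

```go
package main

import "fmt"

type Phase string

const (
	Executing Phase = "Executing"
	Backoff   Phase = "Backoff"
	Failed    Phase = "Failed"
	Completed Phase = "Completed"
)

// Attempt is a simplified stand-in for an entry in status.attempts.
type Attempt struct {
	Phase   Phase
	Message string
}

// Run simulates an Execution whose first `failures` runs fail.
// It returns the phases visited and the failed attempts recorded.
func Run(retries, failures int) ([]Phase, []Attempt) {
	var trace []Phase
	var attempts []Attempt
	for run := 0; ; run++ {
		trace = append(trace, Executing)
		if run >= failures {
			trace = append(trace, Completed)
			return trace, attempts
		}
		trace = append(trace, Failed)
		attempts = append(attempts, Attempt{
			Phase:   Failed,
			Message: fmt.Sprintf("run %d failed", run+1),
		})
		if run >= retries {
			// Retries exhausted: the Execution stays Failed.
			return trace, attempts
		}
		trace = append(trace, Backoff)
	}
}

func main() {
	trace, attempts := Run(3, 1)
	fmt.Println(trace)         // prints "[Executing Failed Backoff Executing Completed]"
	fmt.Println(len(attempts)) // prints "1"
}
```

Calling Run(1, 5) instead models an Execution that exhausts its single retry: it ends in Failed with two recorded attempts.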
Child Policies
When an Execution with child resources (sub-executions, Jobs, Suspensions) is retried,
the childPolicy controls what happens to those children:
| Policy | Behavior |
|---|---|
| None | Children are memoized and reused. Failed children remain failed. |
| RetryFailed | Retry failed child Executions and Jobs, plus all Suspensions regardless of state. |
| RetryAll | Retry all child resources, even successful ones. |
Example: RetryFailed Policy
spec:
  function: deploy-workflow
  retry:
    retries: 3
    backoff: 10s
    childPolicy: RetryFailed
With RetryFailed, if the parent Execution is retried:
- Completed children are reused (memoized)
- Failed children are retried
- All Suspensions are retried (they may have been waiting for external input)
This is useful when a child failure was transient and you want to retry just the failed parts.
Example: None Policy (Default)
With None, children are fully memoized. If a child failed, it remains failed on retry,
causing the parent to fail again immediately. This is useful when you want to preserve
the exact state and manually investigate failures.
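The three policies described in this section reduce to a per-child decision: reuse the memoized result or run the child again. The following sketch encodes that decision as read from the table above. The types and the ShouldRetry function are hypothetical and do not reflect how kontinue implements the policies internally.

```go
package main

import "fmt"

type ChildPolicy string

const (
	PolicyNone        ChildPolicy = "None"
	PolicyRetryFailed ChildPolicy = "RetryFailed"
	PolicyRetryAll    ChildPolicy = "RetryAll"
)

type ChildKind string

const (
	KindExecution  ChildKind = "Execution"
	KindJob        ChildKind = "Job"
	KindSuspension ChildKind = "Suspension"
)

// Child is a simplified view of a child resource of a retried Execution.
type Child struct {
	Kind   ChildKind
	Failed bool
}

// ShouldRetry reports whether a child is re-run (true) or reused from its
// memoized result (false) when the parent is retried.
func ShouldRetry(p ChildPolicy, c Child) bool {
	switch p {
	case PolicyRetryAll:
		return true
	case PolicyRetryFailed:
		// Suspensions are always retried under this policy.
		return c.Failed || c.Kind == KindSuspension
	default: // PolicyNone
		return false
	}
}

func main() {
	children := []Child{
		{Kind: KindExecution, Failed: false},
		{Kind: KindJob, Failed: true},
		{Kind: KindSuspension, Failed: false},
	}
	for _, p := range []ChildPolicy{PolicyNone, PolicyRetryFailed, PolicyRetryAll} {
		for _, c := range children {
			fmt.Printf("%-12s %-11s failed=%-5t retry=%t\n", p, c.Kind, c.Failed, ShouldRetry(p, c))
		}
	}
}
```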
Manual Retry
The CLI and UI can be used to manually retry Executions that have ended in a terminal Failed or Canceled phase.
Manual retry works by incrementing the available number of retries on the Execution,
which causes it to transition from its terminal state back to Pending.
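The mechanism can be pictured with a minimal model: a manual retry adds one to the retry budget, which makes the Execution eligible to run again. The Execution type and ManualRetry function below are hypothetical, and rejecting non-terminal Executions is an assumption made for this sketch rather than documented behavior.

```go
package main

import "fmt"

// Execution is a simplified, hypothetical model of the relevant state.
type Execution struct {
	Phase   string
	Retries int // retry budget from spec.retry.retries
}

// ManualRetry grants one more retry and moves a terminal Execution back to
// Pending. Assumption: only Failed or Canceled Executions are accepted.
func ManualRetry(e *Execution) error {
	if e.Phase != "Failed" && e.Phase != "Canceled" {
		return fmt.Errorf("cannot retry execution in phase %q", e.Phase)
	}
	e.Retries++
	e.Phase = "Pending"
	return nil
}

func main() {
	e := &Execution{Phase: "Failed", Retries: 3}
	if err := ManualRetry(e); err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(e.Phase, e.Retries) // prints "Pending 4"
}
```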
Using the CLI
# Retry with default policy (RetryFailed)
kontinue retry my-execution
# Retry but keep memoized children (None policy)
kontinue retry my-execution --retry-none
# Retry all children, even successful ones
kontinue retry my-execution --retry-all
Using the API
err := client.RetryExecution(ctx, execution, &client.RetryOptions{
    ChildPolicy: kontinuev1alpha1.ChildPolicyRetryFailed,
})
Manual retry is useful for:
- Retrying after fixing an external dependency
- Re-running a workflow after a transient infrastructure issue
- Continuing a workflow that exhausted its automatic retries