Automatic Retries

kontinue can automatically retry failed Executions. When an Execution fails and has retries available, it enters the Backoff phase before being re-executed. This provides resilience against transient failures without manual intervention.

Retries are configured per-Execution and control both the number of retry attempts and the behavior of child resources when retrying.

Configuration

Configure retries in the Execution spec:

apiVersion: kontinue.cloud/v1alpha1
kind: Execution
metadata:
  name: my-execution
spec:
  function: my-function
  retry:
    retries: 3          # Number of retry attempts (default: 0)
    backoff: 5s         # Wait time between retries (default: 5s)
    childPolicy: None   # How to handle children on retry

Field	Description	Default
`retries`	Number of retry attempts. If 0, the Execution is not retried on failure.	`0`
`backoff`	How long to wait before retrying.	`5s`
`childPolicy`	How child resources are handled on retry.	`None`

Configuring via SDK

When spawning child executions from within a function, use ExecuteOptions:

result, err := kontinue.Execute[MyResult](ktx, "child-function", &ChildArgs{}, &kontinue.ExecuteOptions{
    Retry: &kontinue.RetryOptions{
        Retries:     2,
        ChildPolicy: kontinuev1alpha1.ChildPolicyRetryFailed,
    },
})

When spawning an Execution externally via the client library, pass retry options in SpawnOptions:

exec, err := client.Spawn(ctx, "my-function", &MyArgs{}, &client.SpawnOptions{
    Retry: &kontinuev1alpha1.ExecutionRetry{
        Retries:     3,
        Backoff:     &metav1.Duration{Duration: 10 * time.Second},
        ChildPolicy: kontinuev1alpha1.ChildPolicyRetryFailed,
    },
})

Function Defaults

Set default retry configuration for all executions of a function during registration:

worker.RegisterFunction(w, "deploy-cluster", DeployCluster, &function.Options{
    Description: "Deploy a cluster with automatic retries",
    Defaults: &function.ExecutionDefaults{
        Retry: &function.ExecutionRetry{
            Retries:     3,
            Backoff:     &metav1.Duration{Duration: 30 * time.Second},
            ChildPolicy: kontinuev1alpha1.ChildPolicyRetryFailed,
        },
    },
})

These defaults apply to all Executions of the function unless overridden at creation time. The defaults are also published to the Function CRD and visible in the UI.

Backoff Phase

When an Execution fails and has retries remaining, it transitions to the Backoff phase. During this phase, the Execution waits for the configured backoff duration before being picked up by a worker again.

Executing → Failed → Backoff → Executing → Completed
                        ↑          ↓
                        └── Failed ─┘ (if retries remain)

The status.attempts array records each failed attempt, including the phase, message, worker name, and timestamps.

Child Policies

When an Execution with child resources (sub-executions, Jobs, Suspensions) is retried, the childPolicy controls what happens to those children:

Policy	Behavior
`None`	Children are memoized and reused. Failed children remain failed.
`RetryFailed`	Retry any failed child Executions, Jobs, and all Suspensions.
`RetryAll`	Retry all child resources, even successful ones.

Example: RetryFailed Policy

spec:
  function: deploy-workflow
  retry:
    retries: 3
    backoff: 10s
    childPolicy: RetryFailed

With RetryFailed, if the parent Execution is retried:

Completed children are reused (memoized)
Failed children are retried
All Suspensions are retried (they may have been waiting for external input)

This is useful when a child failure was transient and you want to retry just the failed parts.

Example: None Policy (Default)

With None, children are fully memoized. If a child failed, it remains failed on retry, causing the parent to fail again immediately. This is useful when you want to preserve the exact state and manually investigate failures.

Manual Retry

The CLI and UI can be used to manually retry completed (Failed or Canceled) Executions. Manual retry works by incrementing the available number of retries on the Execution, which causes it to transition from its terminal state back to Pending.

Using the CLI

# Retry with default policy (RetryFailed)
kontinue retry my-execution

# Retry but keep memoized children (None policy)
kontinue retry my-execution --retry-none

# Retry all children, even successful ones
kontinue retry my-execution --retry-all

Using the API

err := client.RetryExecution(ctx, execution, &client.RetryOptions{
    ChildPolicy: kontinuev1alpha1.ChildPolicyRetryFailed,
})

Manual retry is useful for:

Retrying after fixing an external dependency
Re-running a workflow after a transient infrastructure issue
Continuing a workflow that exhausted its automatic retries