Automatic Retries

kontinue can automatically retry failed Executions. When an Execution fails and has retries available, it enters the Backoff phase before being re-executed. This provides resilience against transient failures without manual intervention.

Retries are configured per-Execution and control both the number of retry attempts and the behavior of child resources when retrying.

Configuration

Configure retries in the Execution spec:

apiVersion: kontinue.cloud/v1alpha1
kind: Execution
metadata:
  name: my-execution
spec:
  function: my-function
  retry:
    retries: 3          # Number of retry attempts (default: 0)
    backoff: 5s         # Wait time between retries (default: 5s)
    childPolicy: None   # How to handle children on retry

FieldDescriptionDefault
retriesNumber of retry attempts. If 0, the Execution is not retried on failure.0
backoffHow long to wait before retrying.5s
childPolicyHow child resources are handled on retry.None

Configuring via SDK

When spawning child executions from within a function, use ExecuteOptions:

result, err := kontinue.Execute[MyResult](ktx, "child-function", &ChildArgs{}, &kontinue.ExecuteOptions{
    Retry: &kontinue.RetryOptions{
        Retries:     2,
        ChildPolicy: kontinuev1alpha1.ChildPolicyRetryFailed,
    },
})

When spawning an Execution externally via the client library, pass retry options in SpawnOptions:

exec, err := client.Spawn(ctx, "my-function", &MyArgs{}, &client.SpawnOptions{
    Retry: &kontinuev1alpha1.ExecutionRetry{
        Retries:     3,
        Backoff:     &metav1.Duration{Duration: 10 * time.Second},
        ChildPolicy: kontinuev1alpha1.ChildPolicyRetryFailed,
    },
})

Function Defaults

Set default retry configuration for all executions of a function during registration:

worker.RegisterFunction(w, "deploy-cluster", DeployCluster, &function.Options{
    Description: "Deploy a cluster with automatic retries",
    Defaults: &function.ExecutionDefaults{
        Retry: &function.ExecutionRetry{
            Retries:     3,
            Backoff:     &metav1.Duration{Duration: 30 * time.Second},
            ChildPolicy: kontinuev1alpha1.ChildPolicyRetryFailed,
        },
    },
})

These defaults apply to all Executions of the function unless overridden at creation time. The defaults are also published to the Function CRD and visible in the UI.

Backoff Phase

When an Execution fails and has retries remaining, it transitions to the Backoff phase. During this phase, the Execution waits for the configured backoff duration before being picked up by a worker again.

Executing → Failed → Backoff → Executing → Completed
                        ↑          ↓
                        └── Failed ─┘ (if retries remain)

The status.attempts array records each failed attempt, including the phase, message, worker name, and timestamps.

Child Policies

When an Execution with child resources (sub-executions, Jobs, Suspensions) is retried, the childPolicy controls what happens to those children:

PolicyBehavior
NoneChildren are memoized and reused. Failed children remain failed.
RetryFailedRetry any failed child Executions, Jobs, and all Suspensions.
RetryAllRetry all child resources, even successful ones.

Example: RetryFailed Policy

spec:
  function: deploy-workflow
  retry:
    retries: 3
    backoff: 10s
    childPolicy: RetryFailed

With RetryFailed, if the parent Execution is retried:

  • Completed children are reused (memoized)
  • Failed children are retried
  • All Suspensions are retried (they may have been waiting for external input)

This is useful when a child failure was transient and you want to retry just the failed parts.

Example: None Policy (Default)

With None, children are fully memoized. If a child failed, it remains failed on retry, causing the parent to fail again immediately. This is useful when you want to preserve the exact state and manually investigate failures.

Manual Retry

The CLI and UI can be used to manually retry completed (Failed or Canceled) Executions. Manual retry works by incrementing the available number of retries on the Execution, which causes it to transition from its terminal state back to Pending.

Using the CLI

# Retry with default policy (RetryFailed)
kontinue retry my-execution

# Retry but keep memoized children (None policy)
kontinue retry my-execution --retry-none

# Retry all children, even successful ones
kontinue retry my-execution --retry-all

Using the API

err := client.RetryExecution(ctx, execution, &client.RetryOptions{
    ChildPolicy: kontinuev1alpha1.ChildPolicyRetryFailed,
})

Manual retry is useful for:

  • Retrying after fixing an external dependency
  • Re-running a workflow after a transient infrastructure issue
  • Continuing a workflow that exhausted its automatic retries