Performance

Performance depends on the number of API operations that your workflows perform. Every new Execution (or child Execution), Suspend, or Job adds load on the API server. Typical clusters can handle tens to hundreds of operations per second, and larger clusters potentially thousands, but kontinue is not designed for extreme scale such as millions of concurrent Executions.
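As a back-of-envelope sketch of what this means for capacity planning (all numbers below are illustrative assumptions, not measured values), you can divide your cluster's API operation budget by the operations a single workflow run creates:

```shell
# Illustrative estimate: a hypothetical workflow that creates 1 Execution,
# 3 child Executions, 2 Suspends, and 1 Job issues ~7 API write operations.
OPS_PER_WORKFLOW=$((1 + 3 + 2 + 1))
CLUSTER_OPS_PER_SEC=100   # assumed API operation budget for a typical cluster

# Rough ceiling on workflow starts per second (ignores status updates, retries):
echo $((CLUSTER_OPS_PER_SEC / OPS_PER_WORKFLOW))   # prints 14
```

Real workflows also generate status updates and watch traffic, so treat the result as an upper bound and verify it with a benchmark.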

kontinue's performance depends heavily on the performance of the backing Kubernetes cluster (and its etcd cluster), so it is always important to benchmark your own workload to find bottlenecks.

Benchmarks

Synthetic benchmarks show that overhead for trivial Executions is on the order of 1-5ms.

Cluster Size                                                Throughput
Toy cluster (2 vCPU, co-located control plane and worker)   ~10 Executions/sec (500+ executions/minute)
Larger cluster (32 vCPU)                                    50-100 Executions/sec

Benchmarking Your Workload

The kontinue CLI provides a subcommand for running simple benchmarks of your own workloads and reporting throughput and latency characteristics:

kontinue bench <function> [flags]

Flags

Flag                Default   Description
--qps               10        Maximum queries per second
--max-concurrency   10        Maximum concurrent executions
--duration          30s       Duration to run the benchmark
-p, --parameters    (none)    Parameters as key=value pairs (repeatable)
--client-qps        100       Kubernetes client QPS limit
--client-burst      200       Kubernetes client burst limit

Example

# Benchmark a function at 50 QPS for 1 minute
kontinue bench my-function --qps 50 --duration 1m

# Benchmark with parameters
kontinue bench deploy -p env=staging -p version=1.2.3 --qps 20
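At higher target rates, the Kubernetes client limits from the flags table can themselves become the bottleneck. A combined invocation might look like the following (the function name and values are placeholders):

```shell
# Benchmark at higher load: raise concurrency and the Kubernetes client
# rate limits together so the benchmark client doesn't throttle itself.
kontinue bench my-function \
  --qps 200 --max-concurrency 100 \
  --client-qps 400 --client-burst 800 \
  --duration 2m
```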

This allows you to measure real performance with your specific cluster configuration and execution patterns.

Scaling

By default, all Executions run on a single leader replica. For most workloads, vertical scaling (increasing the CPU and memory of the leader pod) is the simplest and most efficient way to handle more load.
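Vertical scaling amounts to raising the resource requests and limits on the leader's Deployment. A sketch using standard kubectl (the namespace `kontinue-system` and deployment name `kontinue-controller` are assumptions; substitute the names from your install):

```shell
# Raise CPU/memory for the leader pod. Names below are assumptions,
# not kontinue's actual install defaults; adjust to your cluster.
kubectl -n kontinue-system set resources deployment/kontinue-controller \
  --requests=cpu=2,memory=2Gi \
  --limits=cpu=4,memory=4Gi
```

This triggers a rolling restart of the pod, so expect a brief leadership handoff.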

Horizontal Scaling with Sharding

If a single replica is CPU- or memory-constrained while running Execution logic, you can enable sharding to distribute Executions across multiple replicas. Sharding scales out the work of running Executions, but adds control plane overhead:

  • Additional API calls: Each replica renews a Lease object periodically, and the leader must watch and assign shard labels to new Executions.
  • Assignment latency: New Executions wait for the leader to assign a shard label before they can be processed. This is typically fast (sub-second) but adds a step that doesn’t exist in single-leader mode.
  • Reassignment on failure: When a replica dies, the leader must detect the expired lease and reassign Executions, introducing a delay proportional to the lease duration.
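To make the last point concrete: in the worst case a replica dies immediately after renewing its Lease, so undetected time approaches the full lease duration. A minimal sketch, assuming an illustrative 15-second lease:

```shell
# Illustrative only: the lease duration below is an assumed setting, not
# kontinue's documented default.
LEASE_DURATION=15                      # seconds
DETECTION_WORST_CASE=$LEASE_DURATION   # replica dies right after a renewal
echo "${DETECTION_WORST_CASE}s"        # prints 15s
```

Shorter leases reduce this reassignment delay but increase the renewal traffic described in the first bullet, so the lease duration is a trade-off between failover speed and API load.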

Sharding will not improve performance if the bottleneck is the Kubernetes API server or etcd, since all replicas share the same API server. It is only beneficial when the Execution logic itself is the bottleneck.
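Adding replicas is standard Kubernetes scaling; enabling sharding itself is a kontinue configuration step not covered here. A sketch, with the same assumed names as above:

```shell
# Add replicas to shard Executions across (names and replica count are
# illustrative; sharding must also be enabled in kontinue's configuration).
kubectl -n kontinue-system scale deployment/kontinue-controller --replicas=3
```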

Offloading Work to Jobs

For heavy computation like builds, tests, or CLI tools, consider offloading the work to Jobs instead of running it directly in the Execution. Jobs run as separate Kubernetes pods with their own resource limits and can be scheduled independently by the cluster. This keeps the worker lightweight — it only orchestrates the workflow — while the actual compute runs elsewhere with dedicated resources.

This is often a better fit than sharding when the bottleneck is a small number of expensive operations rather than a large number of concurrent Executions.