Retrying Failed or Errored Steps
You can specify a retryStrategy that will dictate how failed or errored steps are retried:
# This example demonstrates the use of retry back offs
apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: retry-backoff-
spec:
  entrypoint: retry-backoff
  templates:
  - name: retry-backoff
    retryStrategy:
      limit: 10
      retryPolicy: "Always"
      backoff:
        duration: "1"      # Must be a string. Default unit is seconds. Could also be a Duration, e.g.: "2m", "6h", "1d"
        factor: 2
        maxDuration: "1m"  # Must be a string. Default unit is seconds. Could also be a Duration, e.g.: "2m", "6h", "1d"
      affinity:
        nodeAntiAffinity: {}
    container:
      image: python:alpine3.6
      command: ["python", "-c"]
      # fail with a 66% probability
      args: ["import random; import sys; exit_code = random.choice([0, 1, 1]); sys.exit(exit_code)"]
- limit is the maximum number of times the container will be retried.
- retryPolicy specifies if a container will be retried on failure, error, both, or only transient errors (e.g. i/o or TLS handshake timeout). "Always" retries on both errors and failures. Also available: "OnFailure" (default), "OnError", and "OnTransientError" (available after v3.0.0-rc2).
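- backoff configures an exponential back-off: duration is the initial delay, factor is the multiplier applied to that delay after each failed retry, and maxDuration is the maximum amount of time allowed for retrying.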
- nodeAntiAffinity prevents running steps on the same host. The current implementation allows only an empty nodeAntiAffinity (i.e. nodeAntiAffinity: {}), and by default it uses the label kubernetes.io/hostname as the selector.
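To make the fields above concrete, here is a minimal template fragment that combines a small limit, the "OnFailure" policy, and a back-off. It is a sketch, not taken from the original docs: the template name retry-on-failure and the specific values are hypothetical, and the delay progression in the comments assumes the wait is simply multiplied by factor after each failed attempt.

```yaml
# Illustrative fragment only; it belongs under spec.templates of a Workflow.
# The template name and all values are hypothetical.
- name: retry-on-failure
  retryStrategy:
    limit: 3                  # retry at most 3 times
    retryPolicy: "OnFailure"  # retry failed steps only (the documented default)
    backoff:
      duration: "10"          # initial delay; default unit is seconds
      factor: 2               # assumed progression: about 10s, 20s, 40s
      maxDuration: "5m"       # upper bound on the time allowed for retrying
  container:
    image: python:alpine3.6
    command: ["python", "-c"]
    # fail with a 66% probability
    args: ["import random; import sys; sys.exit(random.choice([0, 1, 1]))"]
```

With a limit this small, the step could still end up failed after the last retry, so a limit and back-off should be sized to how long the underlying fault is expected to last.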
Providing an empty retryStrategy (i.e. retryStrategy: {}) will cause a container to retry until completion.
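As a sketch of the empty form, the following workflow reuses the container from the example above but sets no limit, policy, or back-off. The names (retry-until-complete) are hypothetical and chosen only for illustration.

```yaml
# Illustrative workflow; generateName and the template name are hypothetical.
apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: retry-until-complete-
spec:
  entrypoint: retry-until-complete
  templates:
  - name: retry-until-complete
    retryStrategy: {}   # empty strategy: no limit, retryPolicy, or backoff set
    container:
      image: python:alpine3.6
      command: ["python", "-c"]
      # fail with a 66% probability
      args: ["import random; import sys; sys.exit(random.choice([0, 1, 1]))"]
```

Because nothing bounds the retries here, this form is probably best kept for steps that are expected to succeed eventually; otherwise setting an explicit limit is the safer choice.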