This guide describes how to configure your Spinnaker on AWS deployment to be more resilient and to support Disaster Recovery (DR). Spinnaker does not function in multi-master mode, which means that active-active is not supported at this time. Instead, this guide describes how to achieve an active-passive Spinnaker setup. This results in two instances of Spinnaker deployed into two regions that can fail independently.
- Assumptions and prerequisites
- What is a passive Spinnaker
- Storage Considerations
- Kubernetes guidelines
- DNS considerations
- Setting up a Passive Spinnaker
- Performing Disaster Recovery
- Restoration time
- Other resources
Assumptions and prerequisites
- The passive Spinnaker will have the same permissions as the active Spinnaker
- The active Spinnaker is configured to use AWS Aurora and S3 for persistent storage
- Your Secret engine/store has been configured for Disaster Recovery (DR)
- All other services integrated with Spinnaker, such as your Continuous Integration (CI) system, are configured for DR
What is a passive Spinnaker
A passive Spinnaker means that the deployment:
- Is not reachable by its known endpoints while passive (external and internal)
- Does not schedule pipelines
- Cannot have pipelines triggered by CI jobs
Storage Considerations
|NOTE: It is important that the storage being used is replicated across regions, since it contains all the application and pipeline definitions.|
Armory recommends using a relational database for Orca and Clouddriver. For Orca, a relational database helps maintain integrity. For Clouddriver, it reduces the time to recovery. Even though any MySQL version 5.7+ database can be used, Armory recommends using AWS Aurora for the following reasons:
- More performant than RDS MySQL
- Better high availability than RDS MySQL
- Less downtime for patching and maintenance
- Support for cross-region replication
Note the following guidelines about Spinnaker storage and caching:
- S3 buckets should be set up with cross-region replication turned on. See Replication.
- The MySQL database should be set up with cross-region replication turned on. See the following page for AWS Aurora: Replicating Amazon Aurora MySQL DB Clusters Across AWS Regions
- Redis - Each service should be configured to use its own Redis. With Spinnaker services configured to use a relational database or S3 as a permanent backing store, Redis is only used for caching. For disaster recovery purposes, Redis no longer needs to be recoverable. Losing Redis has the following effects:
  - Gate - Users will need to log in again
  - Fiat - Will need to sync user permissions and warm up
  - Orca - Will lose pending executions
  - Rosco - Will lose bake logs
  - Igor - Will lose last executed Jenkins job cursor
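The S3 guideline above can be sketched as a replication rule. The following is a minimal sketch, assuming the dict is later passed as the `ReplicationConfiguration` argument of an S3 replication API call and that versioning is already enabled on both buckets. The role ARN, bucket ARN, and rule ID are hypothetical placeholders, not values from this guide.

```python
import json


def build_replication_config(role_arn, destination_bucket_arn, rule_id="spinnaker-dr"):
    """Return a dict shaped like an S3 replication configuration.

    All arguments are caller-supplied; nothing here is looked up from AWS.
    """
    return {
        "Role": role_arn,  # IAM role S3 assumes to replicate objects
        "Rules": [
            {
                "ID": rule_id,
                "Priority": 1,
                "Filter": {},  # empty filter: the rule applies to every object
                "Status": "Enabled",
                "DeleteMarkerReplication": {"Status": "Disabled"},
                "Destination": {"Bucket": destination_bucket_arn},
            }
        ],
    }


if __name__ == "__main__":
    # Hypothetical ARNs, for illustration only.
    config = build_replication_config(
        role_arn="arn:aws:iam::111111111111:role/hypothetical-replication-role",
        destination_bucket_arn="arn:aws:s3:::hypothetical-spinnaker-dr-bucket",
    )
    print(json.dumps(config, indent=2))
```

The function only builds the structure, so it can be reviewed or stored in version control before anything is applied to a bucket.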
Kubernetes guidelines
Keep the following guidelines in mind when configuring Kubernetes.
- The Kubernetes control plane should be configured to use multiple availability zones in order to handle availability zone failure. The control plane of an EKS cluster runs across multiple availability zones by default.
The following guidelines are meant for EKS workers:
- The Kubernetes cluster should be able to support the Spinnaker load. Use the same instance type and configure the same number of worker nodes as the primary.
- There needs to be at least one node in each availability zone the cluster is using.
- The autoscaling group has to have a proper termination policy. Use one or all of the following policies: OldestLaunchConfiguration, OldestLaunchTemplate, OldestInstance. This allows the underlying worker AMIs to be rotated more easily.
- Ideally, the pods of each Spinnaker service that runs more than one replica should be spread out among the various workers. This means that pod affinity/anti-affinity should be configured. With this configuration, Spinnaker can handle availability zone failures better.
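The anti-affinity guideline above can be sketched as a pod spec fragment. This is a minimal sketch, assuming Spinnaker pods carry a `cluster: spin-<service>` label (an assumption about how your installation labels pods; check the labels on your own pods) and that nodes carry the standard `topology.kubernetes.io/zone` label. It uses a preferred (soft) rule so pods can still schedule when a zone is unavailable.

```python
import json


def zone_anti_affinity(service, weight=100):
    """Build an `affinity` fragment that prefers spreading a service's pods across zones.

    `service` is a short service name such as "gate"; the `cluster` label
    value derived from it is an assumption, not something this guide defines.
    """
    if not 1 <= weight <= 100:
        raise ValueError("weight must be between 1 and 100")
    return {
        "podAntiAffinity": {
            "preferredDuringSchedulingIgnoredDuringExecution": [
                {
                    "weight": weight,
                    "podAffinityTerm": {
                        # Match other pods of the same service (assumed label).
                        "labelSelector": {
                            "matchLabels": {"cluster": f"spin-{service}"}
                        },
                        # Treat each availability zone as one topology domain.
                        "topologyKey": "topology.kubernetes.io/zone",
                    },
                }
            ]
        }
    }


if __name__ == "__main__":
    print(json.dumps(zone_anti_affinity("gate"), indent=2))
```

Using `kubernetes.io/hostname` as the topology key instead would spread pods across individual workers rather than across zones.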
DNS considerations
A good way to handle failover is to set up DNS entries as a CNAME for each Spinnaker installation.
- Active Spinnaker accessible through
- Passive Spinnaker accessible through
- Add DNS entries for spinnaker.acme.com (and likewise for the api subdomain) with a CNAME pointing to the active Spinnaker and a small TTL (1 to 5 minutes).
In this setup, point your CNAME to us-east when a disaster event happens.
|Note: Armory does not recommend setting up DNS with a backup IP address when manual steps are required for failover.|
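The CNAME switch described above can be sketched as a DNS change request. This is a minimal sketch, assuming the records are hosted in Amazon Route 53 (the guide does not name a DNS provider) and that the dict is later submitted as a change batch. The target hostname in the example is a hypothetical placeholder; only spinnaker.acme.com comes from this guide.

```python
import json


def build_failover_change_batch(record_names, target, ttl_seconds=60):
    """Build a change batch that repoints each CNAME in `record_names` at `target`.

    The TTL check mirrors the 1 to 5 minute range recommended above.
    """
    if not 60 <= ttl_seconds <= 300:
        raise ValueError("TTL should stay between 60 and 300 seconds")
    changes = [
        {
            "Action": "UPSERT",  # create the record or overwrite the existing one
            "ResourceRecordSet": {
                "Name": name,
                "Type": "CNAME",
                "TTL": ttl_seconds,
                "ResourceRecords": [{"Value": target}],
            },
        }
        for name in record_names
    ]
    return {"Comment": "Spinnaker DR failover", "Changes": changes}


if __name__ == "__main__":
    # "passive-spinnaker.example.com" is a hypothetical hostname.
    batch = build_failover_change_batch(
        ["spinnaker.acme.com"], "passive-spinnaker.example.com", ttl_seconds=60
    )
    print(json.dumps(batch, indent=2))
```

Keeping the batch as plain data makes it easy to prepare and review ahead of a DR exercise, so that the failover itself is a single submission.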
Setting up a Passive Spinnaker
To make a passive version of Spinnaker using Halyard, use the same Halyard configuration file as the current active installation for your starting point. Then, modify it to deactivate certain services before deployment.
To keep the configurations in sync, set up automation to create a passive Spinnaker configuration every time a configuration is changed for the active Spinnaker. An easy way to do this is to use Kustomize Overlays.
Halyard configuration modifications
Make sure you set replicas for all Spinnaker services to 0.
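Setting every replica count to 0 can be automated as part of generating the passive configuration. The following is a minimal sketch, assuming the Halyard configuration has already been loaded into a dict and that replica counts live under `deploymentEnvironment.customSizing` with `spin-<service>` keys. The key layout and the service list are assumptions; verify them against your own Halyard configuration file.

```python
import copy

# Assumed list of Spinnaker services to scale; adjust to match your installation.
SPINNAKER_SERVICES = [
    "clouddriver", "deck", "echo", "fiat", "front50",
    "gate", "igor", "kayenta", "orca", "rosco",
]


def set_replicas(halconfig, replicas, services=SPINNAKER_SERVICES):
    """Return a copy of `halconfig` with `replicas` set for every listed service.

    The input dict is left untouched so the active configuration stays intact.
    """
    result = copy.deepcopy(halconfig)
    for deployment in result.get("deploymentConfigurations", []):
        env = deployment.setdefault("deploymentEnvironment", {})
        sizing = env.get("customSizing") or {}
        for service in services:
            key = f"spin-{service}"  # assumed key format
            entry = sizing.get(key) or {}
            entry["replicas"] = replicas  # keep any existing sizing fields
            sizing[key] = entry
        env["customSizing"] = sizing
    return result


if __name__ == "__main__":
    active = {
        "deploymentConfigurations": [
            {"name": "default", "deploymentEnvironment": {"customSizing": {}}}
        ]
    }
    passive = set_replicas(active, 0)
    print(passive["deploymentConfigurations"][0]["deploymentEnvironment"]["customSizing"])
```

The same function can produce the activation configuration during a failover by passing the replica count that the active Spinnaker used instead of 0.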
Once you’re done configuring Halyard for the passive Spinnaker, run hal deploy apply to deploy.
|Note: Armory recommends performing a DR exercise run to make sure the passive Spinnaker is set up correctly. Ideally, the DR exercise should include both failing over to the DR region and failing back to the primary region.|
Performing Disaster Recovery
If the active Spinnaker is failing, the following actions need to be taken:
Activating the passive Spinnaker
Perform the following tasks when you make the passive Spinnaker into the active Spinnaker:
- Use the same version of Halyard to deploy the passive Spinnaker installation that was used to deploy the active Spinnaker.
- AWS Aurora
  - Promote another cluster in the global database to have read/write capability.
  - Update the Halyard configuration to point to the promoted database if the database endpoint and/or the database credentials have changed.
- Create the Redis clusters.
- Activate the passive instance.
  - Set the replicas to more than 0. Ideally, this should be set to the same number of replicas that the active Spinnaker used.
- Change the DNS CNAME if it is not already pointing to the passive Spinnaker installation.
- If the failed Spinnaker is still accessible, deactivate it.
Restoration time
Restoration time depends on how long it takes to restore the database, restore the Spinnaker services, and update DNS. The following services will also take some time to restore since Redis needs time to warm up the cache: