22 September 2016

Blue-Green Deployments with AWS CloudFormation AutoScalingRollingUpdate Policy


At tado° we do blue-green deployments for our web and application servers by simply updating AWS CloudFormation stacks. Our architecture is a classic one: EC2 instances in an Auto Scaling Group (ASG) behind an Elastic Load Balancer (ELB).


We follow the immutable server pattern and never modify existing instances. Instead, we use the AutoScalingRollingUpdate policy to replace old instances with new ones. CloudFormation can also roll back to the previous state if something goes wrong. If the old instances keep running, the rollback is very quick and causes no user-visible downtime.

The problem is that the AutoScalingRollingUpdate policy was not invented for blue-green deployments but for replacing instances in small batches. Configuring the policy to properly do a blue-green deployment in one batch was trickier than expected. CloudFormation still lacks a small piece of functionality, so I needed a workaround.

The AutoScalingRollingUpdate policy determines how instances get replaced when the LaunchConfiguration changes. It has two important properties:
  • MaxBatchSize controls how many instances can be created or terminated at the same time. A blue-green deployment is a special rolling update done in one batch. First, create all new instances, wait until the application is warmed up and healthy, and then terminate all old instances. We set MaxBatchSize to equal the MaxSize of the ASG to create and terminate all instances in one big batch.  
  • MinInstancesInService controls how many instances must keep running during the update. If you omit this parameter, it defaults to zero, which means CloudFormation will terminate all old instances while starting new ones in parallel. That is clearly not what you want for zero-downtime blue-green deployments. Set it to the current number of instances in service.
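
The two properties above can be sketched as a template fragment. The following Python snippet is only an illustration: the helper name `rolling_update_policy` and the numbers are assumptions, and other AutoScalingRollingUpdate properties are left out on purpose.

```python
import json


def rolling_update_policy(max_size, min_in_service):
    """Build an UpdatePolicy fragment that replaces all instances in one batch."""
    return {
        "AutoScalingRollingUpdate": {
            # One batch as large as the whole group.
            "MaxBatchSize": max_size,
            # Number of instances that must stay in service during the update.
            "MinInstancesInService": min_in_service,
        }
    }


# Hypothetical values: an ASG with MaxSize 10 that currently runs 4 instances.
policy = rolling_update_policy(max_size=10, min_in_service=4)
print(json.dumps(policy, indent=2))
```

The resulting fragment belongs in the UpdatePolicy attribute of the AutoScalingGroup resource, next to its Properties.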

One of the benefits of Auto Scaling is that it adapts the number of running instances to the current load. E.g. during the night two instances might be sufficient but at prime time you need ten. The number of needed instances is reflected in the DesiredCapacity property of the ASG and changes over time.

[Figure: Auto Scaling instances over time]

So it is vital to set MinInstancesInService to the current DesiredCapacity of the ASG before starting the update. But when we deployed a service during prime time, we saw that instances were terminated until we had only two left when we needed ten! Not good!

I made two mistakes. The first was hard-coding the DesiredCapacity of the ASG in our CloudFormation template, so the current value got overwritten on each deployment and the ASG instantly adjusted its capacity. I fixed that by not specifying the DesiredCapacity in the template at all. The second was naively setting MinInstancesInService to the MinSize of the ASG. Bummer!

I tried to get the current value with the intrinsic function Fn::GetAtt, which gives you access to attributes of stack resources. But when I tried it out, CloudFormation complained that it did not know this attribute. It turned out that no AutoScalingGroup attributes are exposed inside a CloudFormation template! As a workaround, I used the CloudFormation API to query the name of the ASG resource and was then able to use the Auto Scaling API to get the current DesiredCapacity. I also introduced an external parameter and assigned it to the MinInstancesInService property so I could pass this value to the stack before applying the update. Luckily, we are using AutoStacker24 to update our stacks, so this workaround could be done with a few lines of Ruby code. But it would be even better if AWS exposed ASG attributes to the Fn::GetAtt function out of the box.
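
The lookup described above could look roughly like this. This is a Python sketch, not the Ruby code we run with AutoStacker24. It assumes boto3-style clients are passed in, and the function names, the logical ID "ASG" and the parameter name "MinInstancesInService" are hypothetical.

```python
def current_desired_capacity(cfn, autoscaling, stack_name, logical_id="ASG"):
    """Resolve the ASG name via CloudFormation, then read its DesiredCapacity."""
    detail = cfn.describe_stack_resource(
        StackName=stack_name, LogicalResourceId=logical_id
    )["StackResourceDetail"]
    # For an AutoScalingGroup resource the physical ID is the group name.
    asg_name = detail["PhysicalResourceId"]
    groups = autoscaling.describe_auto_scaling_groups(
        AutoScalingGroupNames=[asg_name]
    )["AutoScalingGroups"]
    if not groups:
        raise LookupError("Auto Scaling group not found: " + asg_name)
    return groups[0]["DesiredCapacity"]


def min_in_service_parameter(desired, name="MinInstancesInService"):
    """Format the value as a stack parameter (CloudFormation expects strings)."""
    return {"ParameterKey": name, "ParameterValue": str(desired)}
```

With boto3, the clients would come from `boto3.client("cloudformation")` and `boto3.client("autoscaling")`, and the returned parameter would go into the `Parameters` list of the stack update call.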

If you are still reading: there was another interesting issue with the TerminationPolicies property. Several times we experienced the strange behaviour that on rollback (e.g. because of failing health checks) CloudFormation first created new instances with the old LaunchConfiguration while terminating the still-running old instances. To fix this, I configured the OldestLaunchConfiguration and OldestInstance policies in exactly that order. In addition to keeping the existing instances in case of rollbacks, this also allows us to do instance recycling. Some services get slower and slower after a few days of work, so an easy way to fix this is to scale up manually, wait until the new instances are in service, and then scale down again. This will automatically terminate the oldest instances.
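
To make the effect of the policy order concrete, here is a small Python model of the instance recycling case. It is a simplification under stated assumptions: the `Instance` class, its fields and `pick_for_termination` are hypothetical, and real Auto Scaling also balances across Availability Zones before it applies the policies, which this sketch ignores.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Instance:
    instance_id: str
    launch_config_rank: int  # hypothetical: 0 = oldest LaunchConfiguration
    launched_at: int         # launch time as epoch seconds


def pick_for_termination(instances, count):
    """Apply OldestLaunchConfiguration first, then OldestInstance as tie-breaker."""
    ordered = sorted(
        instances,
        key=lambda i: (i.launch_config_rank, i.launched_at),
    )
    return [i.instance_id for i in ordered[:count]]


# Instance recycling: scale up from two to four, then scale in by two again.
# All instances share the same LaunchConfiguration, so only the age decides.
fleet = [
    Instance("i-old-1", 0, 1000),
    Instance("i-old-2", 0, 1100),
    Instance("i-new-1", 0, 9000),
    Instance("i-new-2", 0, 9100),
]
terminated = pick_for_termination(fleet, 2)
print(terminated)
```

In this model the two instances that were launched first are the ones selected when the group scales in.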

Learnings

  • Don't specify DesiredCapacity on the AutoScalingGroup in the template; it should be determined by the scaling policies of the ASG.
  • Set MaxBatchSize to equal the MaxSize of your ASG to have only one batch.
  • Set MinInstancesInService to the current number of running instances. Ideally, this should be MinInstancesInService: "@ASG.DesiredCapacity" or {"Fn::GetAtt": ["ASG", "DesiredCapacity"]} but CloudFormation doesn't support it (yet). Use API calls or scripting to work around this.
  • Use TerminationPolicies to make rollbacks keep the existing instances:
    • OldestLaunchConfiguration
    • OldestInstance
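
The checklist above can also be verified mechanically before a deployment. The following is a minimal sketch, assuming the template has been parsed into a Python dict. `check_blue_green` and its messages are hypothetical and not part of CloudFormation or AutoStacker24, and the example resource is trimmed to the relevant keys.

```python
def check_blue_green(asg_resource):
    """Return a list of findings for an AutoScalingGroup resource dict."""
    findings = []
    props = asg_resource.get("Properties", {})
    rolling = asg_resource.get("UpdatePolicy", {}).get("AutoScalingRollingUpdate", {})

    if "DesiredCapacity" in props:
        findings.append("DesiredCapacity is set in the template")
    max_size = props.get("MaxSize")
    # Compare as strings so that 10 and "10" count as equal.
    if max_size is None or str(rolling.get("MaxBatchSize")) != str(max_size):
        findings.append("MaxBatchSize does not equal MaxSize")
    if "MinInstancesInService" not in rolling:
        findings.append("MinInstancesInService is missing (defaults to 0)")
    if props.get("TerminationPolicies") != ["OldestLaunchConfiguration", "OldestInstance"]:
        findings.append("TerminationPolicies are not in the expected order")
    return findings


# Trimmed example resource; a real one needs more properties.
asg = {
    "Type": "AWS::AutoScaling::AutoScalingGroup",
    "Properties": {
        "MinSize": 2,
        "MaxSize": 10,
        "TerminationPolicies": ["OldestLaunchConfiguration", "OldestInstance"],
    },
    "UpdatePolicy": {
        "AutoScalingRollingUpdate": {
            "MaxBatchSize": 10,
            # Filled from an external stack parameter before each update.
            "MinInstancesInService": {"Ref": "MinInstancesInService"},
        }
    },
}
findings = check_blue_green(asg)
print(findings)
```

An empty list means the resource follows all four points. Each violated point adds one message.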