Infrastructure as code using AWS CloudFormation and Chef: CloudFormation

This is the second post in a 3-part series.

Introduction: High level set up
Cloudformation: Setting up Cloudformation
Chef: Automating using Chef

This post discusses how to set up a standalone EC2 instance and which CloudFormation resources are needed to make it work. Some resources are more involved and have many configuration options, so we limit this discussion to the pieces needed to get an instance or autoscaling group up and running and able to execute Chef.

We will first discuss standing up a standalone instance and then modify it to work as an auto scaling group (ASG).

Resources for standalone EC2 instance

InstanceSecurityGroup

We need to assign a security group to our instance to open up ports so it can be accessed from outside. In our case we opened port 22 (SSH) and port 80 (application). This one is pretty straightforward:

"InstanceSecurityGroup": {
  "Type": "AWS::EC2::SecurityGroup",
  "Properties": {
    "GroupDescription": "Enable SSH access and HTTP",
    "SecurityGroupIngress": [{
      "IpProtocol": "tcp",
      "FromPort": "22",
      "ToPort": "22",
      "CidrIp": "0.0.0.0/0"
    },
    {
      "IpProtocol": "tcp",
      "FromPort": "80",
      "ToPort": "80",
      "CidrIp": "0.0.0.0/0"
    }]
  }
}

You can see this specifies the following:

Port ranges: The port range that we are opening up (FromPort to ToPort). 22 to 22 means only port 22 is open.
Protocol: Since both SSH and HTTP run over TCP, we need tcp for both.
CidrIp: This is the range of IPs that a request to the instance can originate from. HTTP needs to be open to the internet if you are hosting a website. It is a best practice to limit SSH to a specific IP range if you know the IPs you are logging in from.
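
As a minimal sketch of the SSH advice above, the port 22 rule can take its range from a template parameter instead of 0.0.0.0/0. The parameter name SSHLocation and the default 203.0.113.0/24 (a range reserved for documentation) are assumptions for illustration, so substitute the range you actually log in from.

```json
{
  "Parameters": {
    "SSHLocation": {
      "Description": "Hypothetical parameter: CIDR range allowed to SSH to the instance",
      "Type": "String",
      "Default": "203.0.113.0/24"
    }
  },
  "Resources": {
    "InstanceSecurityGroup": {
      "Type": "AWS::EC2::SecurityGroup",
      "Properties": {
        "GroupDescription": "Enable SSH from a known range and HTTP from anywhere",
        "SecurityGroupIngress": [{
          "IpProtocol": "tcp",
          "FromPort": "22",
          "ToPort": "22",
          "CidrIp": { "Ref": "SSHLocation" }
        },
        {
          "IpProtocol": "tcp",
          "FromPort": "80",
          "ToPort": "80",
          "CidrIp": "0.0.0.0/0"
        }]
      }
    }
  }
}
```

Keeping the range in a parameter means it can be changed with a stack update rather than by editing the template.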

WebServerInstance

This has a few moving parts. The high-level structure looks like this:

    "WebServerInstance": {
      "Type": "AWS::EC2::Instance",
      "Metadata": {
        "Comment": "Install chef",
        "AWS::CloudFormation::Init": {
    
          "configSets": {
            "All": ["setupDefault"]
          },
    
          "setupDefault": {
            "packages": {
              "yum": {
                "git": []
              },
              "rpm": {
                "chefdk": "https://packages.chef.io/stable/el/6/chefdk-1.0.3-1.el6.x86_64.rpm"
              }
            },
            "files": {
              "/tmp/install.sh": {
                "source": "https://www.chef.io/chef/install.sh",
                "mode": "000400",
                "owner": "root",
                "group": "root"
              },

              "/etc/cfn/cfn-hup.conf": {
                "content": {
                  "Fn::Join": ["", [
                    "[main]\n",
                    "stack=", {
                      "Ref": "AWS::StackId"
                    }, "\n",
                    "region=", {
                      "Ref": "AWS::Region"
                    }, "\n"
                  ]]
                },
                "mode": "000400",
                "owner": "root",
                "group": "root"
              },

              "/etc/cfn/hooks.d/cfn-auto-reloader.conf": {
                "content": {
                  "Fn::Join": ["", [
                    "[cfn-auto-reloader-hook]\n",
                    "triggers=post.update\n",
                    "path=Resources.WebServerInstance.Metadata.AWS::CloudFormation::Init\n",
                    "action=/opt/aws/bin/cfn-init -v ",
                    " --stack ", {
                      "Ref": "AWS::StackName"
                    },
                    " --resource WebServerInstance ",
                    " --region ", {
                      "Ref": "AWS::Region"
                    }, "\n",
                    "runas=root\n"
                  ]]
                }
              }
            },
    
            "commands": {
              "install_chef": {
                "command": "bash /tmp/install.sh"
              },
              "clone_git": {
                "command": "sudo -u ec2-user bash -c 'cd ;git clone  '"
              }
            },
    
            "services": {
              "sysvinit": {
                "cfn-hup": {
                  "enabled": "true",
                  "ensureRunning": "true",
                  "files": ["/etc/cfn/cfn-hup.conf",
                  "/etc/cfn/hooks.d/cfn-auto-reloader.conf"]
                }
              }
            }
          }
        }
      },
      "Properties": {...}
    },

Metadata: This contains the information about initializing the instance.
configSets: This section lists the config sets that can be executed for the instance. In this example, we have one configSet, All, which runs a single config, setupDefault.
packages: This lists the packages that CloudFormation should install. We limit this to the bare bones: Chef and git (required for checking out the Chef configs). Alternatively, the Chef configs can live in S3 with access controlled through IAM, which is a better solution because you do not have to manually manage keys and git checkouts to download them.
files: This section describes the files that need to be created. Note: a file can be fetched from a remote URL or have its content inline.
commands: This section contains the commands to be executed. Note: the commands are processed in alphabetical order by name, so here clone_git runs before install_chef.
services: This section tells CloudFormation how to manage services.
cfn-init: This is a helper script that reads the metadata and does the actual processing of packages, files, etc.
cfn-init executes these blocks in the following order:
packages, groups, users, sources, files, commands, services. A different order can be implemented by splitting the sections into multiple configs and listing them in a configSet.
cfn-hup: This is another helper script that detects changes in the metadata, so that the instance updates when the stack definition is updated in CloudFormation.
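
To illustrate the ordering point, here is a sketch of the same metadata split into two configs so that Chef is installed before the repository is cloned, regardless of how the commands are named. The config names installChef and cloneRepo are hypothetical, and the clone command is replaced by a placeholder echo because the repository URL is not part of this post.

```json
{
  "AWS::CloudFormation::Init": {
    "configSets": {
      "All": ["installChef", "cloneRepo"]
    },
    "installChef": {
      "files": {
        "/tmp/install.sh": {
          "source": "https://www.chef.io/chef/install.sh",
          "mode": "000400",
          "owner": "root",
          "group": "root"
        }
      },
      "commands": {
        "install_chef": {
          "command": "bash /tmp/install.sh"
        }
      }
    },
    "cloneRepo": {
      "packages": {
        "yum": {
          "git": []
        }
      },
      "commands": {
        "clone_git": {
          "command": "echo 'placeholder: clone the Chef repository here'"
        }
      }
    }
  }
}
```

cfn-init processes the configs in the order they are listed in the configSet, and within each config it still follows the packages, files, commands order described above.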

Autoscaling Group

Now that we have a standalone instance, we can add other components to our stack so we have a scalable web application. The group can expand when the application gets more traffic and contract when traffic slows down. For such a web-scale application, we need the following resources:

Autoscaling Group: Multiple Servers that can scale up and down
Scaling up policy
Scaling down policy
Alarm that triggers scale up policy
Alarm that triggers scale down policy
Load Balancer: a server that distributes traffic to the autoscaling group.

Load Balancer

This is where the DNS of the website points. The load balancer takes each incoming request and forwards it to one of the servers in the autoscaling group.

    "ElasticLoadBalancer": {
      "Type": "AWS::ElasticLoadBalancing::LoadBalancer",
      "Properties": {
        "AvailabilityZones": {
          "Fn::GetAZs": ""
        },
        "CrossZone": "true",
        "Listeners": [{
          "LoadBalancerPort": "80",
          "InstancePort": "xxxx",
          "Protocol": "HTTP"
        }],
        "HealthCheck": {
          "Target": "HTTP:xxxx/health-check",
          "HealthyThreshold": "3",
          "UnhealthyThreshold": "5",
          "Interval": "30",
          "Timeout": "5"
        }
      }
    },

This is a multi-AZ (Availability Zone) load balancer.
Listeners: This declares which port the load balancer listens on (80) and which port the web application is running on.
HealthCheck – Target: The load balancer hits the Target URL on each instance registered with it to make sure it is healthy. If it repeatedly fails to get a 200 response, it marks the instance as unhealthy and stops all traffic to it. This is useful in case something goes wrong and the application stops for some reason. We implemented a special endpoint for this.
HealthCheck – HealthyThreshold: Number of consecutive times the health check endpoint must return 200 before the instance is marked healthy.
HealthCheck – UnhealthyThreshold: Number of consecutive times the health check must fail (return anything but 200) before the instance is marked unhealthy.
HealthCheck – Interval: Interval between health checks, in seconds. If an instance goes bad right after it is marked in service, it can take about 150 seconds in this case (5 failed checks, 30 seconds apart) for the load balancer to stop sending it traffic.
HealthCheck – Timeout: This is how long the load balancer waits for a response, in seconds. If the instance does not respond in the specified time, the health check is marked as failed.
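
As a rough guide, the time needed to take a failing instance out of service is about UnhealthyThreshold times Interval, which is 5 x 30 = 150 seconds with the values shown earlier. The fragment below is a sketch of a faster-reacting health check, roughly 2 x 10 = 20 seconds. The numbers are illustrative assumptions, not a recommendation, and the xxxx port placeholder is kept from the original.

```json
{
  "HealthCheck": {
    "Target": "HTTP:xxxx/health-check",
    "HealthyThreshold": "3",
    "UnhealthyThreshold": "2",
    "Interval": "10",
    "Timeout": "5"
  }
}
```

A shorter window reacts faster but is also more likely to pull a healthy instance that is briefly slow, so Timeout should stay below Interval and comfortably above the endpoint's normal response time.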

Note: Take special care when updating health checks. If the health check does not work because it is set up incorrectly or has unreasonable timeouts, you can end up taking the whole application down.

Autoscaling group

This is a group of servers that can grow and shrink, based on specified grow/shrink logic, to account for increasing or decreasing traffic. Instead of website owners watching the metrics and adding servers manually, this takes care of it automatically, which is great for increased uptime and a better customer experience.

    "WebServerGroup": {
      "Type": "AWS::AutoScaling::AutoScalingGroup",
      "Properties": {
        "AvailabilityZones": {
          "Fn::GetAZs": ""
        },
        "LaunchConfigurationName": {
          "Ref": "LaunchConfig"
        },
        "MinSize": "3",
        "MaxSize": "10",
        "LoadBalancerNames": [{
          "Ref": "ElasticLoadBalancer"
        }],
        "NotificationConfigurations": [{
          "TopicARN": {
            "Ref": "NotificationTopic"
          },
          "NotificationTypes": ["autoscaling:EC2_INSTANCE_LAUNCH",
            "autoscaling:EC2_INSTANCE_LAUNCH_ERROR",
            "autoscaling:EC2_INSTANCE_TERMINATE",
            "autoscaling:EC2_INSTANCE_TERMINATE_ERROR"
          ]
        }]
      },
      "CreationPolicy": {
        "ResourceSignal": {
          "Timeout": "PT15M",
          "Count": "1"
        }
      },
      "UpdatePolicy": {
        "AutoScalingRollingUpdate": {
          "MinInstancesInService": "1",
          "MaxBatchSize": "1",
          "PauseTime": "PT15M",
          "WaitOnResourceSignals": "true"
        }
      }
    },

    "LaunchConfig" : {
      "Type" : "AWS::AutoScaling::LaunchConfiguration",
      "Metadata" : {...}
    },

The autoscaling group has a min, max and desired count. Desired always lies between min and max. It can be changed by an authorized user or an autoscaling policy.
LaunchConfigurationName – This is the configuration for each instance in the autoscaling group. Its Metadata is the same as in WebServerInstance discussed earlier, with the cfn-hup hook and cfn-init pointing at the LaunchConfig resource.
NotificationConfigurations – This is a hook for SNS notifications, in case you want to subscribe to notifications when any scaling happens.
UpdatePolicy : AutoScalingRollingUpdate – This handles the update policy for the group. This configuration means update one instance at a time, keep at least one instance in service while performing the update, and wait up to 15 minutes before running the update on the next batch (the next instance, since the batch size is 1).
UpdatePolicy : WaitOnResourceSignals – This means that the autoscaling group must wait for a signal from new instances for up to the pause time (15 minutes here). If no signal arrives in that time, the update does not complete.
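
The group references two resources whose details are not shown: LaunchConfig and NotificationTopic. The fragment below is a sketch of what they might look like, including the cfn-signal call that sends the signal CreationPolicy and WaitOnResourceSignals wait for. The parameters ImageId, InstanceType and OperatorEMail are hypothetical, the Metadata block described earlier is omitted for brevity, and the image is assumed to ship the cfn helper scripts under /opt/aws/bin.

```json
{
  "Parameters": {
    "ImageId": {
      "Description": "Hypothetical parameter: AMI that includes the cfn helper scripts",
      "Type": "AWS::EC2::Image::Id"
    },
    "InstanceType": {
      "Description": "Hypothetical parameter: instance size for the web servers",
      "Type": "String",
      "Default": "t2.micro"
    },
    "OperatorEMail": {
      "Description": "Hypothetical parameter: address that receives scaling notifications",
      "Type": "String"
    }
  },
  "Resources": {
    "NotificationTopic": {
      "Type": "AWS::SNS::Topic",
      "Properties": {
        "Subscription": [{
          "Endpoint": { "Ref": "OperatorEMail" },
          "Protocol": "email"
        }]
      }
    },
    "LaunchConfig": {
      "Type": "AWS::AutoScaling::LaunchConfiguration",
      "Properties": {
        "ImageId": { "Ref": "ImageId" },
        "InstanceType": { "Ref": "InstanceType" },
        "SecurityGroups": [{ "Ref": "InstanceSecurityGroup" }],
        "UserData": {
          "Fn::Base64": {
            "Fn::Join": ["", [
              "#!/bin/bash\n",
              "/opt/aws/bin/cfn-init -v",
              " --stack ", { "Ref": "AWS::StackName" },
              " --resource LaunchConfig",
              " --configsets All",
              " --region ", { "Ref": "AWS::Region" }, "\n",
              "/opt/aws/bin/cfn-signal -e $?",
              " --stack ", { "Ref": "AWS::StackName" },
              " --resource WebServerGroup",
              " --region ", { "Ref": "AWS::Region" }, "\n"
            ]]
          }
        }
      }
    }
  }
}
```

Note that cfn-init reads its metadata from LaunchConfig, while cfn-signal reports the exit status of cfn-init to WebServerGroup, because the autoscaling group is the resource that carries the CreationPolicy and UpdatePolicy.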

Policies

These policies specify how the auto scaling group will expand or shrink when an alarm triggers.

    "WebServerScaleUpPolicy" : {
      "Type" : "AWS::AutoScaling::ScalingPolicy",
      "Properties" : {
        "AdjustmentType" : "ChangeInCapacity",
        "AutoScalingGroupName" : { "Ref" : "WebServerGroup" },
        "Cooldown" : "60",
        "ScalingAdjustment" : "1"
      }
    },
    
    "WebServerScaleDownPolicy" : {
      "Type" : "AWS::AutoScaling::ScalingPolicy",
      "Properties" : {
        "AdjustmentType" : "ChangeInCapacity",
        "AutoScalingGroupName" : { "Ref" : "WebServerGroup" },
        "Cooldown" : "60",
        "ScalingAdjustment" : "-1"
      }
    },

WebServerScaleUpPolicy – Increase the size of the autoscaling group by 1 instance and wait 60 seconds before running this policy again.
WebServerScaleDownPolicy – Decrease the size of the autoscaling group by 1 instance and wait 60 seconds before running this policy again.
These can also be configured as a percentage of the autoscaling group. For instance, adding 1 instance when there are 3 increases capacity by 33%, but adding 1 when there are 5 increases capacity by 20%. For consistency's sake, it can be best to specify a percentage increase if the group can scale from very low to very high.
Adding and removing instances may need some tuning. For instance, if the limits are tight, like scale down < 50% and scale up > 55%, it is entirely possible that the group adds an instance, which brings load below 50%, then removes an instance, which takes load up to 60%, which triggers another add, and the group keeps scaling all the time. This is not good, so the advice here is to spread the thresholds out a bit, maybe by about 20%.
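
A percentage-based version of the scale-up policy could look like the sketch below. It uses the PercentChangeInCapacity adjustment type, and the value of 25 is an assumption chosen only for illustration.

```json
{
  "WebServerScaleUpPolicy": {
    "Type": "AWS::AutoScaling::ScalingPolicy",
    "Properties": {
      "AdjustmentType": "PercentChangeInCapacity",
      "AutoScalingGroupName": { "Ref": "WebServerGroup" },
      "Cooldown": "60",
      "ScalingAdjustment": "25"
    }
  }
}
```

With this type, ScalingAdjustment is read as a percentage of the current group size rather than a number of instances, so the step grows as the group grows.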

CloudWatch Alarms

These alarms monitor the bottleneck metrics of the autoscaling group; for most applications that is CPU utilization. When an alarm triggers, it activates the policy it is configured with, which causes the autoscaling group to grow or shrink based on the amount of traffic.

    "CPUAlarmHigh": {
      "Type": "AWS::CloudWatch::Alarm",
      "Properties": {
        "AlarmDescription": "Scale-up if CPU > 60% for 10 minutes",
        "MetricName": "CPUUtilization",
        "Namespace": "AWS/EC2",
        "Statistic": "Average",
        "Period": "300",
        "EvaluationPeriods": "2",
        "Threshold": "60",
        "AlarmActions": [{
          "Ref": "WebServerScaleUpPolicy"
        }],
        "Dimensions": [{
          "Name": "AutoScalingGroupName",
          "Value": {
            "Ref": "WebServerGroup"
          }
        }],
        "ComparisonOperator": "GreaterThanThreshold"
      }
    },
    "CPUAlarmLow": {
      "Type": "AWS::CloudWatch::Alarm",
      "Properties": {
        "AlarmDescription": "Scale-down if CPU < 50% for 10 minutes",
        "MetricName": "CPUUtilization",
        "Namespace": "AWS/EC2",
        "Statistic": "Average",
        "Period": "300",
        "EvaluationPeriods": "2",
        "Threshold": "50",
        "AlarmActions": [{
          "Ref": "WebServerScaleDownPolicy"
        }],
        "Dimensions": [{
          "Name": "AutoScalingGroupName",
          "Value": {
            "Ref": "WebServerGroup"
          }
        }],
        "ComparisonOperator": "LessThanThreshold"
      }
    },

CPUAlarmHigh – This triggers WebServerScaleUpPolicy if the average CPU utilization of the autoscaling group is greater than 60% for 2 consecutive 300-second periods, or 10 minutes.
CPUAlarmLow – This triggers WebServerScaleDownPolicy if the average CPU utilization of the autoscaling group is less than 50% for 2 consecutive 300-second periods, or 10 minutes.
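
Following the tuning advice in the Policies section, one way to widen the gap between the two alarms is to lower the scale-down threshold. The sketch below sets it to 40, which leaves 20 points between scale down and scale up. The value is an assumption for illustration and should be tuned against real traffic.

```json
{
  "CPUAlarmLow": {
    "Type": "AWS::CloudWatch::Alarm",
    "Properties": {
      "AlarmDescription": "Scale-down if CPU < 40% for 10 minutes",
      "MetricName": "CPUUtilization",
      "Namespace": "AWS/EC2",
      "Statistic": "Average",
      "Period": "300",
      "EvaluationPeriods": "2",
      "Threshold": "40",
      "AlarmActions": [{
        "Ref": "WebServerScaleDownPolicy"
      }],
      "Dimensions": [{
        "Name": "AutoScalingGroupName",
        "Value": {
          "Ref": "WebServerGroup"
        }
      }],
      "ComparisonOperator": "LessThanThreshold"
    }
  }
}
```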

Official AWS Cloudformation documentation: https://aws.amazon.com/documentation/cloudformation/