
Auto Scaling Lessons Learned

joesondow edited this page Nov 29, 2012 · 1 revision

Scale up early

In the requests-per-second (RPS) example, the load test showed that queuing began once RPS hit 25. To avoid excessive queuing, the auto scaling policy was set up to increase capacity when RPS exceeded 20. The RPS headroom (20 versus 25) serves two purposes. First, it provisions for unexpected RPS bursts that are too small or too brief to trigger an auto scaling event. Second, the buffer provides a safety net against a "capacity spiral", where the system falls behind and must constantly add capacity to catch up.
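The headroom rule can be sketched as a simple threshold check. The 20 and 25 RPS figures come from the load test above; the function itself is only illustrative, not an actual policy definition:

```python
# Hypothetical sketch of the headroom idea: alarm at 20 RPS per
# instance, well below the 25 RPS point where the load test showed
# queuing began. Thresholds are from the article; code is illustrative.

QUEUE_POINT_RPS = 25   # load test: queuing starts here
SCALE_UP_RPS = 20      # alarm threshold, leaving ~20% headroom

def should_scale_up(rps_per_instance: float) -> bool:
    """Trigger a scale-up before queuing actually begins."""
    return rps_per_instance > SCALE_UP_RPS
```

The gap between the two constants is the burst absorber: traffic can sit anywhere between 20 and 25 RPS while new capacity is still coming online.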

Scale down slowly

The time required for a metric to stay past its threshold before scaling down should be greater than the time required to scale up. In the previous RPS example, the scale-up alarm fires if RPS exceeds 20 for 5 minutes, but the scale-down alarm fires only if RPS drops below 10 for 20 minutes. Note the 4x time difference. The reason for a slow scale down is to avoid scaling down on a false-positive event. For example, a middle tier service may begin to scale down while an edge service has a full or partial production outage. When the edge service comes back online, the middle tier may not have the capacity to meet the restored demand.
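The asymmetric windows can be sketched as follows; the thresholds and durations are from the example above, and one sample per minute is an assumption:

```python
# Sketch of asymmetric alarm windows (values from the article):
# scale up if RPS > 20 for 5 minutes, scale down only if RPS < 10
# for 20 minutes. One RPS sample per minute is an assumption.

SCALE_UP_RPS, SCALE_UP_MINUTES = 20, 5
SCALE_DOWN_RPS, SCALE_DOWN_MINUTES = 10, 20

def alarm_state(samples):
    """Return 'up', 'down', or 'hold' for a minute-by-minute RPS series."""
    if len(samples) >= SCALE_UP_MINUTES and all(
            s > SCALE_UP_RPS for s in samples[-SCALE_UP_MINUTES:]):
        return "up"
    if len(samples) >= SCALE_DOWN_MINUTES and all(
            s < SCALE_DOWN_RPS for s in samples[-SCALE_DOWN_MINUTES:]):
        return "down"
    return "hold"
```

With these numbers, a 15-minute upstream outage never satisfies the 20-minute scale-down window, so the group keeps its capacity until traffic returns.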

Availability Zone capacity

When NOT to use Percent Based Autoscaling

For smaller auto scaling groups, percentage-based auto scaling may leave a particular availability zone (AZ) under-provisioned. If a scaling action adds fewer instances than there are AZs (for example, 10% of a group smaller than 30 instances spread across 3 AZs adds fewer than 3 instances), one or more AZs may not receive enough capacity to handle the load. If an AZ is severely under-provisioned, the result can be a decrease in throughput and/or an increase in latency. Note that this can also occur with a larger farm during periods of low traffic (such as morning hours), when the ASG size is small.
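The arithmetic behind the 3-AZ example can be sketched as below. The rounding rule used here (fractions between 0 and 1 round up to 1, larger fractional results round down) is an assumption for illustration, not a statement of the exact AWS behavior:

```python
# Why small groups plus percent-based scaling can starve an AZ.
# Rounding rule below is an illustrative assumption, not AWS's exact rule.

def instances_added(group_size: int, percent: float) -> int:
    raw = group_size * percent / 100
    return 1 if 0 < raw < 1 else int(raw)

# A 9-instance group spread over 3 AZs gains only 1 instance from a
# +10% policy, so two of the three AZs receive no additional capacity.
small_group_gain = instances_added(9, 10)    # 1 instance
large_group_gain = instances_added(30, 10)   # 3 instances, one per AZ
```

The same group that behaves well at 30 instances during peak can hit the single-instance case overnight, which is why the low-traffic caveat above matters.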

Avoid up/down alarms with small variance

Alarms with a small variance between the up and down thresholds may result in "capacity thrashing". For example, scaling on load average with triggers of 2 (down) and 3 (up) may result in unexpected scaling. The problem is exacerbated if the alarm period and occurrence time are small. A brief CPU spike from something as routine as a log rotation may be enough to trigger an alarm, producing a false positive.
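A sketch of the thrash pattern, using the load-average triggers of 2 and 3 from the example; the sample values are invented to represent a short log-rotation spike:

```python
# "Capacity thrashing" when up/down triggers are close together:
# scale up above load average 3, scale down below 2. A brief CPU
# spike (e.g. log rotation) crosses the narrow band in one step.
# Thresholds are from the article; the readings are invented.

UP_AT, DOWN_AT = 3.0, 2.0

def decision(load_avg: float) -> str:
    if load_avg > UP_AT:
        return "scale up"
    if load_avg < DOWN_AT:
        return "scale down"
    return "hold"

readings = [2.5, 3.4, 1.8]   # steady, spike, post-spike dip
decisions = [decision(r) for r in readings]
```

Three consecutive readings produce a scale-up immediately followed by a scale-down: capacity is paid for, launched, and discarded without any real change in demand.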

Symmetric percentages

When TO use Percent Based Autoscaling

To avoid "capacity thrashing", create auto scaling policies with symmetric percentages. For example, the RPS example scaled up by 10% and also scaled down by 10%. If the two percentages are unequal, too much capacity may be added and then quickly removed, causing "capacity thrashing".
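The effect of unequal percentages can be sketched numerically. The +10%/-10% pair is from the example above; the +50%/-10% pair is an invented asymmetric counter-example:

```python
# Symmetric (+10%/-10%) vs asymmetric (+50%/-10%) policies.
# The asymmetric pair overshoots on one scale-up and then needs
# several scale-downs to shed the excess. The 50% figure is an
# illustrative assumption, not from the article.

def step(capacity: int, percent: float) -> int:
    """Apply one percent-based scaling adjustment."""
    return max(1, round(capacity * (1 + percent / 100)))

symmetric = step(step(100, 10), -10)    # 100 -> 110 -> 99
asymmetric = step(step(100, 50), -10)   # 100 -> 150 -> 135
```

After one up/down cycle the symmetric policy is back near its starting capacity, while the asymmetric policy is still carrying 35% extra and will keep oscillating as further scale-downs fire.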

Symmetric periods

CloudWatch aggregates a metric over the alarm period, so two different periods, such as 300 and 600 seconds, can produce different aggregate results. This may lead to unexpected scaling behavior, especially when combined with a small variance between the up and down alarm thresholds (see above). Example of the different numbers reported for the same metric at different periods:

[awsprod@awsprod100] mon-get-stats _SystemLoadAverage  --period 300 --dimensions "AutoScalingGroupName=api" --headers --namespace "NFLX" --statistics "Average"
Time                 Average             Unit
2011-12-14 22:25:00  10.513188405797104  None
2011-12-14 22:30:00  14.119027777777777  None
2011-12-14 22:35:00  17.88585365853659   None
2011-12-14 22:40:00  12.57720930232558   None
2011-12-14 22:45:00  10.395000000000005  None
2011-12-14 22:50:00  14.785624999999996  None
2011-12-14 22:55:00  12.755767195767197  None
2011-12-14 23:00:00  2.8151906158357756  None
2011-12-14 23:05:00  2.398559999999998   None
2011-12-14 23:10:00  2.053230403800475   None
2011-12-14 23:15:00  2.7254260089686104  None
2011-12-14 23:20:00  3.3501363636363615  None

[awsprod@awsprod100] mon-get-stats _SystemLoadAverage  --period 600 --dimensions "AutoScalingGroupName=api" --headers --namespace "NFLX" --statistics "Average"
Time                 Average             Unit
2011-12-14 22:25:00  12.35446808510638   None
2011-12-14 22:35:00  15.168333333333345  None
2011-12-14 22:45:00  12.26004424778762   None
2011-12-14 22:55:00  6.3600377358490565  None
2011-12-14 23:05:00  2.2159170854271353  None
2011-12-14 23:15:00  3.0356659142212195  None
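One way to see why the two periods disagree: the Average statistic weights every sample equally, so a 600-second average is the sample-count-weighted mean of its two 300-second buckets, not their simple mean. A sketch using the 22:55 and 23:00 bucket averages from the output above, with per-bucket sample counts invented for illustration:

```python
# Why period 300 and period 600 report different numbers for the
# same metric. Bucket averages are from the transcript above; the
# sample counts are invented to illustrate the weighting.

b1_avg, b1_n = 12.7558, 189   # 22:55 bucket (assumed sample count)
b2_avg, b2_n = 2.8152, 341    # 23:00 bucket (assumed sample count)

weighted = (b1_avg * b1_n + b2_avg * b2_n) / (b1_n + b2_n)
simple = (b1_avg + b2_avg) / 2

# With these counts, weighted ~= 6.36 (matching the period-600 row at
# 22:55) while the simple mean is ~7.79. An alarm threshold of 10 is
# crossed at period 300 (12.76 > 10) but not at period 600.
```

This is why the alarm period should match between the scale-up and scale-down policies: a threshold that fires reliably at one period may never fire at another.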