Amazon's Elastic Load Balancer (ELB) gives you only one latency metric, aggregated over all requests, and only with the usual min, max and average statistics. The value of this information is very poor. For example, when our first service went live in the Cloud, we monitored only the average latency, which was a very good and stable around 7ms. Later I found out that half of our traffic consists of OPTION requests (due to a CORS configuration error) which were handled in less than one millisecond, but users actually using the functionality of our service experienced a latency between 100ms and 800ms. Problem was that those users were only a few percent of the traffic, so the really important data was covered by noise and invisible inside the average. What we needed were url specific metrics and percentiles, especially the 99th ones.
Lambda to the rescue
As CloudWatch doesn't give us more detailed metrics we were on our own. Luckily, we can instruct the ELB to write its access logs every 5 minutes to an S3 bucket. All the raw data we need is already there. Tools like Graylog or the ELK stack could analyse them, but it takes some time to set up a pipeline that digests those logs continuously and produce the desired metrics. But AWS has a new service in its portfolio that helped us to get the desired data even faster: Lambda.
I've created a repository with a simplified version of the ELB percentile to CloudWatch lambda to give you a quick start if you want to do similar stuff. You need to adjust the bucket names and group-by regex to your conditions. The complete logic lives in lambda.js, there are some local tests in test.js. The zip_lambda.sh creates the upload package.
Actually, I spent most of the time fighting with cross account access policies and setting up the correct lambda invocation and execution roles, because we are using a multiple account setup where logs and CloudWatch metrics are living in different accounts. Another problem was the AWS CLI, I could not automate the lambda upload process and had to do it manually. The zip_lamda.sh creates the necessary command, but it never worked for me. When you create the lambda function, make the timeout big enough to be prepared for sunday evening traffic spikes ;-)