12 February 2017

How to gunzip AWS Loadbalancer logs with nodejs


The new Application Loadbalancer (ALB) from AWS stores it's access logs in a slightly different format to S3 than the classic ELB.
  1. logs are written every 5 minutes now
  2. some fields are added at the beginning of each line and some at the end
  3. logs are written as gzipped streams with the ending .gz
I am parsing the access logs with a custom lambda function and forward them to an Elasticsearch cluster. The new fields in front of the existing ones broke my parsing, but that was easily fixed. My lambda function uses nodejs as a runtime, so `zlib.gunzipSync(log)` was the easiest way to gunzip the log files. Everything looked good, no errors all unit tests passing, but after I switched to ALB on production I noticed a large drop in the number of requests written to Elasticsearch.

I downloaded some log files and noticed strange values, like negative compression ratios, when testing the file with gunzip:
$ gunzip -lv fk0e6u3y.log.gz 
method  crc     date  time           compressed        uncompressed  ratio uncompressed_name
defla d26f8a35 Dec  7 10:45               18278                3052 -498.3% fk0e6u3y.log
My first thought was 'ahh corrupt log files'. But when I actually gunzip the file I got no errors and a reasonable number of lines (requests). Then I wrote a little nodejs app that uses zlib to gunzip the file. Only the first few lines were unzipped, but again without any error. That was mysterious. After a lot of googling, I finally found out that the gzip file format allows several concatenated chunks of compressed data. But gunzip shows only the data for the first chunk, thus the weird output and negative compression ratio. And nodejs had an issue too, it extracts only the first chunk and silently ignores the rest. As AWS supports only nodejs v.4.3.2 no fix was available.

But there is an easy workaround. Lambda runs on Amazon Linux and nodejs can start arbitrary processes on the host. So reading the data from S3 and then passing the data to an external gzip process did the trick:

execSync('gzip -d -c', {input: data.Body}).toString('ascii')

Here the data is passed from data.Body (a nodejs buffer) to STDIN of the gzip process. STDOUT is converted to an ascii string. That's nearly as easy as the original zlib version and it is working like a charm on production now. It might break in the future if AWS disallows starting external processes or removes gzip from the host.