The new Application Loadbalancer (ALB) from AWS stores it's access logs in a slightly different format to S3 than the classic ELB.
- logs are written every 5 minutes now
- some fields are added at the beginning of each line and some at the end
- logs are written as gzipped streams with the ending .gz
I downloaded some log files and noticed strange values, like negative compression ratios, when testing the file with gunzip:
$ gunzip -lv fk0e6u3y.log.gz
method crc date time compressed uncompressed ratio uncompressed_name
defla d26f8a35 Dec 7 10:45 18278 3052 -498.3% fk0e6u3y.log
My first thought was 'ahh corrupt log files'. But when I actually gunzip the file I got no errors and a reasonable number of lines (requests). Then I wrote a little nodejs app that uses zlib to gunzip the file. Only the first few lines were unzipped, but again without any error. That was mysterious. After a lot of googling, I finally found out that the gzip file format allows several concatenated chunks of compressed data. But gunzip shows only the data for the first chunk, thus the weird output and negative compression ratio. And nodejs had an issue too, it extracts only the first chunk and silently ignores the rest. As AWS supports only nodejs v.4.3.2 no fix was available.But there is an easy workaround. Lambda runs on Amazon Linux and nodejs can start arbitrary processes on the host. So reading the data from S3 and then passing the data to an external gzip process did the trick:
execSync('gzip -d -c', {input: data.Body}).toString('ascii')
Here the data is passed from data.Body (a nodejs buffer) to STDIN of the gzip process. STDOUT is converted to an ascii string. That's nearly as easy as the original zlib version and it is working like a charm on production now. It might break in the future if AWS disallows starting external processes or removes gzip from the host.