Processing Amazon S3 Logs
I saw a blog post on reddit today about processing Amazon S3 logs. It went into a lot of detail about how to set up logging (although I believe I just used jets3t to set mine up), but said nothing about how to process the logs with a log processing tool.
I wrote a log merge tool a long time ago for collecting large numbers of logfiles from various servers, sorting them, and sending them through a normal web processing tool that accepts CLF. Five or so months ago, I needed a way to process my S3 logs, so I modified my tool to accept them as input while producing the same output.
Since one of the primary design goals is to be able to handle a large number of logs, I most commonly take the file list on stdin. Here is a pseudo-shell example of how I process my S3 logs:
# s3sync the logs into a directory called l
find l -type f | logmerge | webalizer -o report
The above currently processes 4,440 files (I'm not a very heavy S3 user). It does so without hitting my 256 descriptor ulimit, and without needing to have its input sorted in any particular way.
I'm fairly satisfied with the results and the performance. The current tip is written in C++ (sort of) and uses Boost's regex to grok the input. There's an older revision in the history that is pure C and uses PCRE, but it's no longer maintained.
I wouldn't consider this a finished project, but it does what I need quite well. I'd welcome any input (or patches!) from anyone who might get more use out of it.