21 February 2015

Think in Events not Logs - Logging Strategy

Events

Logging is the act of recording events. An event describes something that has happened. Therefore each event has
  • a timestamp (when it happened) 
  • and a description (what happened). 
Logs were invented by developers for doing ex post analysis of problems (aka debugging). Their usefulness was then quickly leveraged by operating personnel for monitoring system health. But today, human beings are only a minority of the audience for logs. Computer systems that constantly process billions of log entries in search of new insights (Big Data) or potential problems (Monitoring) are now a software tool category of their own.

A descriptive text that is written by a developer with only debugging in mind tends to be redundant, unstructured, ambiguous and incomplete. Computer systems have difficulty parsing and interpreting such descriptions. The event description must reveal its data in a structured way that can be parsed easily by computers but is still human readable. A list of properties in a JSON-like format (key -> value) is a reasonable way to go. This is sometimes called structured logging.

 {"timestamp":"2015-02-21 11:23:51.479 +01",
  "name": "address could not be saved", 
  "sessionId":"4711", "dbStatus":"42"} 

Log Levels

Log Levels are an anti-pattern that originates in debugging and storage limitations. A log level (e.g. verbose, warning, error) typically controls whether an event is logged at all and how it should be interpreted. Given cheap storage and computers that process log files, don't waste time categorizing events at development time, configuring them at runtime, and then missing important information at problem time. The decision whether an event is valuable and how to react to it should be deferred to operating time, the last responsible moment.
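
What deferring that decision could look like in practice is sketched below in Python: every event is recorded, and a rule applied at query time decides which ones deserve attention. The rule needs_attention and the file name events.log are hypothetical examples, not part of this article:

  import json

  # Applied at operating time: which events require a reaction is decided
  # when the logs are queried, not when they are written.
  def needs_attention(event):
      return event.get("dbStatus") not in (None, "0")

  def read_events(path):
      # One JSON object per line, as in the example above.
      with open(path) as f:
          for line in f:
              yield json.loads(line)

  alerts = [e for e in read_events("events.log") if needs_attention(e)]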

Log Management

That said, a logging strategy cannot stop here. Focusing on a single application is not enough. Today's computer systems are highly distributed and run in cloud-like environments where instances are volatile and storage is cheap. Log files have to be moved from machines to persistent storage quickly, before a machine gets terminated. The many different logs must be aggregated into one format that can be easily accessed and analyzed, e.g. by extracting the structured data and indexing it. Once you have achieved this, you can finally analyze your logged events, and if the whole pipeline is fast enough, it will give you better monitoring than just observing CPU, RAM and IO metrics.
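
A toy Python sketch of the aggregation step might look like the following; the host names, file names and the simple in-memory index are illustrative assumptions only, standing in for a real log shipping and indexing pipeline:

  import json
  from collections import defaultdict

  # Parse structured event lines shipped from many hosts and index them
  # by property value so they can be queried quickly.
  index = defaultdict(list)

  def ingest(path, host):
      with open(path) as f:
          for line in f:
              event = json.loads(line)
              event["host"] = host                        # remember where it came from
              for key, value in event.items():
                  index[(key, str(value))].append(event)  # simple inverted index

  # File names are placeholders for logs collected from two hosts.
  ingest("web-1-events.log", "web-1")
  ingest("web-2-events.log", "web-2")

  failed_saves = index[("name", "address could not be saved")]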

The Future

From a future perspective it will look strange to think in log files that are collected and moved to a central storage for later processing. Parsing a myriad of ad hoc formats and trying to normalize them for correlation will no longer be an issue once we use only a few standardized formats. Log rotation and high-latency batch processing will seem obscure because we will think in events and process them in real time. The accidental complexity of handling log files disappears when events go directly into event streams and never touch the local disk. You can see a glimpse of the future in systems like Kinesis or Riemann that are built for events, not logs.
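
As an illustration of events going straight into a stream, here is a sketch in Python that assumes boto3 is installed and a Kinesis stream named "events" already exists; neither assumption is prescribed by this article:

  import json
  import boto3
  from datetime import datetime, timezone

  # Sketch only: the stream name "events" and the partition key choice are assumptions.
  kinesis = boto3.client("kinesis")

  def emit(name, **properties):
      event = {"timestamp": datetime.now(timezone.utc).isoformat(), "name": name}
      event.update(properties)
      # The event goes straight into the stream and never touches the local disk.
      kinesis.put_record(StreamName="events",
                         Data=json.dumps(event).encode("utf-8"),
                         PartitionKey=event.get("sessionId", "none"))

  emit("address could not be saved", sessionId="4711", dbStatus="42")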