21 Februar 2015

Think in Events not Logs - Logging Strategy


Logging is the act of recording events. An event describes something that has happened. Therefore each event has
  • a timestamp (when it happened) 
  • and a description (what happened). 
Logs were invented by developers for doing ex post analysis of problems (aka debugging). Then their usefulness were quickly leveraged by operating personnel for monitoring system health. But today, human beings are only a minority of the logs audience. Computers systems that are constantly processing billions of log entries in the search for new insights (Big Data) or potential problems (Monitoring) are now a software tool category of their own.

A descriptive text that is written by a developer with only debugging in mind tends to be redundant, unstructured, ambiguous and incomplete. Computer systems have difficulties to parse and interpret such descriptions. The event description must reveal its data in a structured way that can be parsed easily by computers but is still human readable. A list of properties in a json-like format (key->value) is a reasonable way to go. This is sometimes described as structured Logging.

 {"timestamp":"2015-02-21 11:23:51.479 +01",
  "name": "address could not be saved", 
  "sessionId":"4711", "dbStatus":"42"} 

Log Levels

Log Levels are an anti pattern that originates in debugging and storage limitations. A log level (eg. verbose, warning, error) typically controls if an event is logged at all and how this event should be interpreted. Given cheap storage and computers processing log files, don't waste time with categorizing events during development time, configuring them at runtime, and then missing important information at problem time. The decision if an event is valuable and how to react should be deferred to operating time, the last responsible moment.

Log Management

That said, a logging strategy cannot stop here. Focussing on a single application is not enough. Todays computer systems are highly distributed systems running in cloud like environments where instances are volatile and storage is cheap. Log files have to be moved from machines to a persistent storage fast before a machine gets terminated. The many different logs must be aggregated into one format that can be easily accessed and analyzed, e.g. by extracting structured data and indexing it. If you have achieved this you can finally analyze your logged events and if the whole pipeline is fast enough, it will give you a better monitoring than just observing, CPU, RAM and IO metrics. 

The Future

From a future perspective it will look strange to think in log files, that are collected and moved to a central storage  for later processing. Parsing a myriad of ad hoc formats and trying to normalize them to do correlation will be no longer an issue as we are using only a few standardized formats. Log rotation, and high latency batch processing will be obscure because we will think in events and process them in real time. The accidental complexity of handling log files when events directly go into event streams and never touch the local disk. You can see a glimpse of the future in systems like Kinesis or Riemann that are build for events and not logs.

28 Januar 2015

GO TeamCity GO!

(tldr; Why TeamCity is still a time and money saving tool that can also be used for Continuous Delivery)

Today Continous Integration is a widely adopted practice and Continuous Delivery is the new state of the art. Many companies are now using (or 'abusing'?) their existing CI infrastructure to build pipelines for Continuous Delivery.

It turns out that this exercise works very well with tools like Jenkins, Hudson or Teamcity. The concept of having 'Builds' that are triggered by 'Commits' and producing 'Artifacts' can be easily extended to create delivery pipelines. Add the ability to chain 'Builds' and extend the trigger mechanism to fire when all preceding builds have been successfully finished. A 'Build' can be a classic compile build, a complex acceptance test, or the deployment to machines. It is just some kind of 'Job'. The possibility of 'Scheduled' triggers transforms your CI system to be a generic automation system. The traceability of 'Jobs' with logs and build numbers, some basic metrics and a nice Web-GUI lured many sysadmins exposed to CI systems to use them for automating all kind of stuff.

That is the way AutoScout went with TeamCity during the last years and with good success. Today, AutoScout has more than 1000 active build configurations executed by 51 agents. Many of those 'builds' aren't actually building stuff, but are

  • deploying apps to servers
  • updating elasticsearch indices
  • running database schema migrations
  • orchestrating fleets of test agents
  • and other crazy stuff you won't believe ;-)

We are not always happy with TeamCity. It tends to be a clicky-clicky tool producing clickops, because it has a nice GUI but a hard to use REST-API. And if you use Builds  for orchestrating other machines to do the real stuff (e.g. deployments) then you do not make good use of your agents resources. The agent is then blocked by idle waiting, wasting resources that cost real money. Over the years, TeamCity accumulated so many features that you need experience to do things the right way without shooting yourself in the foot.

A few months ago AutoScout decided to change its tech stack from DataCenter/Windows/.NET to Cloud/Linux/JVM. As this change is quite big and risky we have hired a lot of expertise from ThoughtWorks, the company that is also a well known proponent of Continuous Delivery. Their experts suggested to abandon TeamCity and give GO a try because GO is open source and built with Continuous Delivery in mind.

Now, after some hands on experiences with GO, I believe that you shouldn't replace a working and mature CI System like TeamCity with GO. My main argument is that GO is just another CI-Tool with no fundamental difference to TeamCity, Jenkins or whatsoever. It shares the same disadvantages
(clicky-clicky, server-agent model) but lacks many productivity features that TeamCity provides in a mature and battle proven way. Indeed I miss so many things in GO that I liked in TeamCity that I start a rant right now ;-)
GO features are mostly a small  subset of TeamCity features

Displaying Logs

Investigating build logs is one very important use case whether you need to find the reason why a build fails or you want to know what a build makes slow. GO is a complete disappointed in this point. It shows you only a raw black&white dump of stdout and stderr mixed together. No indicators or any kind of help that could support you in analyzing. Maybe because only can start shell scripts. TeamCity, in contrast, knows what it executes (e.g. msbuild, rake, test runners, ...). It uses this knowledge to enhance logs with hints that makes them easy to analyze. First of all, errors are extracted and shown in a prominent place, so most often you don't need to go to the build log at all. If you drill down to the log, the log is presented in a hierarchical view, where blocks containing errors are expanded (and colored in red) while the rest of the log is collapsed. TeamCity also shows you for each line its ingestion time and the accumulated durations of tasks and subtasks. That makes it very easy to see where the build spent time.  GO can't even display a timestamp, so if your build takes 20 minutes, you have no clue which task caused this delay. TeamCity just tells you which task needed how long, so you need immediately the reason for the performance degradation.
GO displaying a build log

TeamCity displaying a build log

Red Builds aka Failed Jobs

Red builds or failed jobs need to be investigated. In GO you can do silly things like wearing hats or hope that somebody takes care, in TeamCity you can start investigating a build an then the rest of the users know that you are taking care.

Speaking or red builds, TeamCity has many ways to mark a build as failed even if the job returned a success, e.g.

  • the number of tests suddenly dropped significantly
  • the number of ignored tests increased significantly
  • the build took significantly longer than the last ones
  • the build times out
  • ...
GO can't do any of those things.

Living community 

Just try to google for some build related problems: Lots of answers for TeamCity, none for GO (which might be related to its name ;-) I wasn't able to find a GO community, again that may be the name, but visit the official GO community site or count the available community plugins. (most of the are written by ThoughtWorks employees anyway). TeamCity has a living community and in case googling doesn't solve your problem, Jetbrains has a fast and helpful support. 

Native AWS support 

TeamCity can operate a fleet of agents on Amazon EC2 instances. Depending on the current load (pending jobs) it can create, start or stop agents. This allows you to use very fast C4 instance types, which will make your builds faster, but in a very cost effective way, because those instances will live only for those few hours per day in which they are actually needed. GO again has nothing to offer in this area, you have to build scaling up and down by your own script or just pay 24/7 for agents you don't use.

Miscellaneous TC features where GO is inferior

  • Auditing/Permissions: If you have more than one team TeamCity allows you manage permissions and it has a good auditing feature that shows you which user changed what and when. GO can show you the diff between two versions of the global config.xml file.
  • Custom build numbers in TeamCity, incrementing only numbers in GO.
  • Sophisticated environment variable setting in TC, predefined ones in GO 
  • Jobs can interact with Teamcity, e.g. update status data, nothing in GO
  • Jobs can have arbitrary names and descriptions in TC, GO doesn't allow whitespace in names, leading_to_strange_conventions.
  • all thinkable variants of triggers and dependency chains between jobs allows you to build sophisticated pipelines in TC
  • job queues are visible in TC, making it easy to debug non starting and pending builds. In GO you have no hint when you job will start or to manipulate the queue when something went wrong. I have seen jobs hanging in GO forever because agents requirements were missing.
  • Good history of deployments/builds from the past. You see which commits are in with deployment, and you can rollback to an older version directly from this history view.

There are still more points where TeamCity outperforms GO (backups, recovery, auto updating agents, ...). I hope I have listed enough to show you mature TeamCity is compared against GO. If you have a problem there is a very high chance that TeamCity has a solution. With GO you will find yourself running into limitations or living with crude workarounds which make your teams less productive. 


If you have a working and mature CI System (Jenkins, TeamCity, ...) don't switch to GO just because you want to do Continuous Delivery. GO has the same disadvantages but lacks many important productivity features. Either evolve your existing CI System to also deliver software or try out a radical different approach like hubot deploy.

05 Februar 2014

Remove big files from Git repositories permanently

Everything is possible with Git but I can't remember all the command line options. So here is my aid to memory to make a repository slim again by removing unwanted files. I'm using git version 1.8.4 on Windows.
First you need to rewrite history:
git filter-branch --index-filter "git rm -r --cached --ignore-unmatch *.gem" --tag-name-filter cat -- --all
Note the –r and the use of wildcards inside the index-filter command. With the other options this means that all *.gem files in all commits and tags are found and removed. This command prints all objects its deleted. If it doesn't print anything useful you have made an error!
Now delete the backup created by git filter-branch:
rd /q /s ".git/refs/original"
Some magic to get rid of orphaned objects inside the git repository:
git reflog expire --expire=now --all
git gc --prune=now
Verify that all files are really gone with git log -- *.gem and then repack your repository.
git gc --prune=now --aggressive
Finally, push your shrinked repository to the origin.
git push origin --force
The next time you clone the repository you clone the repository you get the shrinked version.
Update: But as soon as you do a git pull (--rebase) all the unneeded and painfully removed objects are downloaded again to your hard disk. The only way to prevent this is by deleting the repository on GitHub and replacing it with the shrinked one (without changing names or urls). Astonishingly, existing clones continued to work with the replaced repository.
Update2: On GitHub itself is now a nice article explaining the process of cleaning/shrinking repositories, including a link to a tool called BFG Repo Cleaner that is specialized for this task.

24 Februar 2013

The Unix Way

Neulich meinte mein Kollege +Christian Deger beim Mittagessen "do it the Unix way", was mich sofort an ein altes Buch von Edward Yourdon erinnerte, der zum Thema "Technische Fehler bei Wiederverwendbarkeit" etwas ähnliches sagt:

"Aufnahme von Softwarekomponenten, die zu groß sind und/oder zu viele Seiteneffekte haben. Das war eines der häufigsten Probleme mit wiederverwendbaren Bibliotheken in den 60'er Jahren: Es gab eine Tendenz zur Entwicklung von 'Universalkomponenten', die alles für alle unter allen Bedingungen bewirken sollten. Das Ergebnis war in der Regel, daß die Komponente für niemand besonders gut war, oft sogar verheerende und vollkommen unakzeptable Seiteneffekte hatte. Ein besseres Beispiel - meiner Meinung nachgerade das ideale Beispiel - ist die UNIX-Bibliothek, deren Komponenten winzig (oft kleiner als 10 Codezeilen) sind und keine Seiteneffekte aufweisen."
- Edward Yourdon, 1993, "Die westliche Programmierkunst am Scheideweg"
Ich habe dieses Buch heute nochmal kurz überflogen und finde es immer noch sehr erfrischend. Dass das Thema der über-abstrakten, super- generischen Komponenten schon bald 50 Jahre alt ist, zeigt doch, das diese Tendenz genetisch vorbestimmt sein muss. Jeder gute Entwickler ist in seiner Anfangszeit in diese Falle gelaufen :-)

20 November 2012

Understand and Prevent Deadlocks

Can you explain a typical C# deadlock in a few words? Do you know the simple rules that help you to write deadlock free code? Yes? Then stop reading and do something more useful.

If several threads have read/write access to the same data it is often often necessary to limit access to only on thread. This can be done with C# lock statement. Only one thread can execute code that is protected by a lock statement and a lock object. It is important to understand that not the lock statement protects the code, but the object given as an argument to the lock statement. If you don't know how the lock statement works, please read the msdn documentation before continuing. Using a lock statement is better than directly using a Mutex or EventWaitHandle because it protects you from stale locks that can occur if you forget to release your lock when an exception happens.

A deadlock can occur only if you use more than one lock object and the locks are acquired by each thread in a different order. Look at the following sequence diagram:

There are two threads A and B and two resources X and Y. Each resource is protected by a lock object.
Thread A acquires a lock for Resource X and continues. Then Thread B acquires a lock for Y and continues. Now Thread A tries to acquire a lock for Y. But Y is already locked by Thread B. This means Thread A is blocked now and waits until Y is released. Meanwhile Thread B continues and now needs a lock for X. But X is already locked by Thread B. Now Thread A is waiting for Thread B and Thread B is waiting for Thread A both threads will wait forever. Deadlock!

The corresponding code could look like this.

public class Deadlock
    static readonly object X = new object();
    static readonly object Y = new object();
    public void ThreadA()
                // do something

    public void ThreadB()
                // do something

Normally nobody will write code as above with obvious deadlocks. But look at the following code, which is deadlock free:

public class Deadlock
    static readonly object X = new object();
    static readonly object Y = new object();
    static object _resourceX;
    static object _resourceY;

    public object ResourceX
        get { lock (X) return _resourceX; }

    public object ResourceY
            lock (Y)
                return _resourceY ?? (_resourceY = "Y");

    public void ThreadA()

    public void ThreadB()
            _resourceY = "TEST";

But after re-factoring the getter for ResourceX to this

get { lock (X) return _resourceX ?? ResourceY; }

you have the same deadlock as in the first code sample!

Deadlock prevention rules

  1. Don't use static fields. Without static fields there is no need for locks.
  2. Don't reinvent the wheel. Use thread safe data structures from System.Collections.Concurrent or System.Threading.Interlocked before pouring lock statements over your code.
  3. A lock statement must be short in code and time. The lock should last nanoseconds not milliseconds.
  4. Don't call external code inside a lock block. Try to move this code outside the lock block. Only the manipulation of known private data should be protected. You don't know if external code contains locks now or in future (think of refactoring).

If you are following these rules you have a good chance to never introduce a deadlock in your whole career.

14 November 2012

How to find deadlocks in an ASP.NET application almost automatically

This is a quick how-to for finding deadlocks in an IIS/ASP.NET application running on a production server with .NET4 or .NET 4.5.

A deadlock bug inside your ASP.NET application is very ugly. And if it manifests only on some random production server of your web farm, maybe you feel like doom is immediately ahead. But with some simple tools you can catch and analyze those bugs.

These are the tools you need:
  • ProcDump from SysInternals   
  • WinDbg from Microsoft (Available as part of the Windows SDK (here is even more info)) 
  • sos.dll (part of the .NET framework)
  • sosext from STEVE'S TECHSPOT (Copy it into your WinDbg binaries folder)
ProcDump will be needed on the server where the deadlock occurs. All the other tools are only needed on your developer machine. Because WinDbg doesn't need any installation you can also prepare an emergency USB stick (or file share) with all the necessary tools.

If you think a deadlock occurred do the following:
  1. Connect to the Server
  2. Open IIS Manager 
  3. Open Worker Processes 
  4. Select the application pool that is suspected to be deadlocked
  5. Verify that you indeed have a deadlock, see the screenshot below
  6. Notice the <Process-ID> (see screenshot)
  7. Create a dump with procdump <Process-ID> -ma
    There are other tools, like Task Manager or Process Explorer, that could dump but only ProcDump is smart enough to create 32bit dumps for 32bit processes on a 64bit OS.
  8. Copy the dump and any available .pdb (symbol) files to your developer machine. 
  9. Depending on the bitness of your dump start either WinDbg (X86) or WinDbg (X64)
  10. Init the symbol path (File->Symbol File Path ...)
  11. File->Open Crash Dump
  12. Enter the following commands in the WinDbg Command Prompt and wait
  13. .loadby sos clr
  14. !load sosex
  15. !dlk
You should now see something like this:

0:000> .loadby sos clr
0:000> !load sosex
This dump has no SOSEX heap index.
The heap index makes searching for references and roots much faster.
To create a heap index, run !bhi
0:000> !dlk
Examining SyncBlocks...
Scanning for ReaderWriterLock instances...
Scanning for holders of ReaderWriterLock locks...
Scanning for ReaderWriterLockSlim instances...
Scanning for holders of ReaderWriterLockSlim locks...
Examining CriticalSections...
Scanning for threads waiting on SyncBlocks...
*** WARNING: Unable to verify checksum for mscorlib.ni.dll
Scanning for threads waiting on ReaderWriterLock locks...
Scanning for threads waiting on ReaderWriterLocksSlim locks...
*** WARNING: Unable to verify checksum for System.Web.Mvc.ni.dll
*** ERROR: Module load completed but symbols could not be loaded for System.Web.Mvc.ni.dll
Scanning for threads waiting on CriticalSections...
*** WARNING: Unable to verify checksum for System.Web.ni.dll
CLR thread 0x5 holds the lock on SyncBlock 0126fa70 OBJ:103a5878[System.Object]
...and is waiting for the lock on SyncBlock 0126fb0c OBJ:103a58d0[System.Object]
CLR thread 0xa holds the lock on SyncBlock 0126fb0c OBJ:103a58d0[System.Object]
...and is waiting for the lock on SyncBlock 0126fa70 OBJ:103a5878[System.Object]
CLR Thread 0x5 is waiting at System.Threading.Monitor.Enter(System.Object, Boolean ByRef)(+0x17 Native)
CLR Thread 0xa is waiting at System.Threading.Monitor.Enter(System.Object, Boolean ByRef)(+0x17 Native)

1 deadlock detected.

Now you know that the managed threads 0x5 and 0xa are waiting on each other. With the !threads command you get a list of all threads. The Id Column (in decimal) is the managed thread id. To the left the WinDbg number is written. With  ~[5]e!clrstack command you can see the stacktrace of CLR thread 0x5. Or just use ~e*!clrstack to see all stacktraces. With this information you should immediately see the reason for the deadlock and start fixing the problem..

Deadlocked Requests visible in IIS Worker Process

Automate the Deadlock Detection

If you are smart, create a little script that automates step 2 to 7. We use this powershell script for checking every minute for a deadlock situation:

param($elapsedTimeThreshold, $requestCountThreshold)
Import-Module WebAd*
$i = 1
$appPools = Get-Item IIS:\AppPools\*
while ($i -le 5) {
ForEach ($appPool in $appPools){
 $count = ($appPool | Get-WebRequest | ? { $_.timeElapsed -gt $elapsedTimeThreshold }).count
 if($count -gt $requestCountThreshold){
$id = dir IIS:\AppPools\$($appPool.Name)\WorkerProcesses\ | Select-Object -expand processId
$filename = "id_" +$id +".dmp"
$options = "-ma"
$allArgs = @($options,$id, $filename)
procdump.exe $allArgs
Start-Sleep -s 60

07 November 2012

Cooperative Thread Abort in .NET

Did you know that .NET uses a cooperative thread abort mechanism?

As someone coming from a C++ background I always thought that killing a thread is bad behavior and should be prevented at all costs. When you terminate a Win32 thread it can interrupt the thread in any state at every machine instruction so it may leave corrupted data structures behind.
A .NET thread cannot be terminated. Instead you can abort it. This is not just a naming issue, it is indeed a different behavior. If you call Abort() on a thread it will throw a ThreadAbortException in the aborted thread. All active catch and finally blocks will be executed before the thread gets terminated eventually. In theory, this allows the thread to do a proper cleanup. In reality this works only if every line of code is programmed in a way that it can handle a ThreadAbortException
in a proper way. And the first time you call 3rd-party code not under your control you are doomed.

Too make the situation more complex there are situations where throwing the ThreadAbortException is delayed. In versions prior to .NET 4.5 this was poorly documented but in the brand new documentation of the Thread.Abort method is very explicit about this. A thread cannot be aborted inside a catch or finally block or a static constructor (and probably not the initialization of static fields also) or any other constrained execution region.

Why is this important to you?

Well, if you are working in an ASP.NET/IIS environment the framework itself will call Abort() on threads which are executing too long. In this way the IIS can heal itself if some requests hit bad blocking code like endless loops, deadlocks or waiting on external requests. But if you were unlucky enough to implement your blocking code inside static constructors, catch or finally blocks your requests will hang forever in your worker process. It will look like the httpRuntime executionTimeout is not working and only a iisreset will cure the situation.

Download the CooperativeThreadAbortDemo sample application.