Tuesday, June 15, 2004

Exceptions, Tunable Logging and...



Code CompleteI didn't have time to fully explain what I meant by "tunable logging" yesterday. Nor how it integrates with a well thought-out system that handles exceptions gracefully.

Consider an application that you've shipped to a valued customer. The customer calls up one day and says, "Yo, the application isn't working right... all of a sudden the network sharing locks up and no one can open any of the files. We have to have everyone shut down the application and restart it to get it back up!"

What do you do? You could ask for all of their data (could be proprietary, in which case you're out of luck), try to set up a similar network configuration, then simulate their load and activity, etc. Trust me, this is very, very difficult to coordinate and achieve for many classes of application.

Or... if you've employed tunable logging, you can simply have certain key users at the customer site crank up their "log verbosity level" and start generating log-files. Then, when the problem re-occurs, the customer ships you the log-files... which help you diagnose the problem. You quickly turn around a fix, save the day, and the customer remains ecstatic about your product.

So how does tunable logging work? Consider a global "context" object familiar to those who develop servlets, ISAPI extensions, LDAP clients, ... well, just about any significant piece of code. A global context generally consists of settings, methods and such that can be used by the application and its slaves (e.g., multiple threads, forks, and so on). One of the settings in the global context might be:

INT m_nLogVerbosity;


(or its equivalent property). Your application should have a back-door that allows setting this property to between 0 and 9, inclusive. 0 would be the production default, which basically means that only noteworthy errors and events are logged.

If you're writing a web application, your authentication module might set this as a session variable on a per-user basis.

In either case, your application code indirectly uses the verbosity setting. Consider a snippet of code from last time:



if ((rc = tableCredits.open()) != OK) {
TunableLog("Credits open failed..."); // 1
break;
}
bUnwindTableCreditsOpen = TRUE;
TunableLog("Credits open succeeded...", 5); // 2

...

TunableLog("Integral transaction succeeded...", 3); // 3


Let's take each numbered invocation to the tunable log:

1) If the credits table fails to open, this is a critical fault. It gets logged unconditionally (the default argument value is 0, meaning log it no matter the current verbosity level).

2) If the credits table was opened successfully, we will log that fact if the global context's verbosity setting is 5 or higher. Normally, we wouldn't care about this event. But it (and related information like handle info, current user, etc.) might come in handy if we're diagnosing a problem.

3) If the entire transaction succeeded, we'll log it if the verbosity is 3 or higher.

So, in short, we have a logging system that we can "tune" based upon our needs... even after we ship it. Our backdoor allows us to have customers -- in the often-times nasty real world -- turn up the verbosity, capture events of interest, and ship us the logs that summarize critical sequence of events. Result: happy customers. More revenue. You get to buy a Maserati Spyder. What's not to like?

I can hear the whiners now. "What about perrrrrformance? This'll suck the life out of my application! Waaah! Waaaaaah!"

One word: bulls**t. If it's designed correctly, it won't suck the life out of your application. Obviously I don't recommend embedding logging in a highly vetted subsystem like a high-performance sorting routine. But you would put logging in at higher levels (e.g., execution of business rules) that are often delayed due to disk I/O, network activity, database locks, etc.

Consider the overhead implicit in this over-simplified C-style macro:

#define TunableLog(s, v)
if (v >= pCtxt->m_nLogVerbosity) {
Log(s);
}

Hmmm. A single integer comparison. That's the entire run-time overhead. I think we can spare the time, given the advantages with which it provides us.

Back to exceptions for a moment. Consider an integrated, tunable logging system that uses exceptions and not return-codes to handle our snippet. It doesn't appear intuitive to me. But that's just me. I'd be interested in seeing some alternative approaches that use exceptions and tunable logging.

2 comments:

Anonymous said...

I use a similar "tunable" logging system in almost all of my software. However, I took it one step further and used bit-masking to provide the ability to steamline the TYPE of logging as well as the LEVEL of logging. For example, you can use the upper 28 bits of a DWORD (32 bit UINT) to represent a catagory of logging. A "category" of logging might represent a specific subsystem or code module. Each significant code block could get its own bit (and accompanying bitmask).

For instance, you can assign your main program block/loop bit 16 (1000000000000000b or 0x8000). Then assign some of the sub components other bit values. In Doug's example, you would give your database-related logic a bit of 15 (0100000000000000b or 0x4000).

The least significant 2 or three bits would then be used to determine level og logging. As doug suggests, critical errors/events would be assigned to value 0, whereas program logic/trace info would be assigned to value 111b (7) or higher. If you needed more than 8 levels, you could use the last 4 bits.

Now, if you wanted to get all events messages only for a specific sub system (such as database logic), you could specify the following "logging value" during program operation: "subsystem_bitmask | logging_level". You could even track multiple subsystems by OR'ing the values together: "subsystem1_bitmask | subsystem2_bitmask | logging level."

Really, the only thing it doesn't let you do is assign a unique logging level to each subsystem being monitored.

Using the bitmask system provides significantly more functionality, with only a minor increase in code complexity-- instead of passing your tunable log function an integer value, you need to pass it a set of OR'd and shifted bits.

As far as performance goes, all modern PC's are going to use specialized/optimized operations for performing the shift operations necessary to "decode" the logging bitmask in your tunable logging function and will cause virtually no slow down in performance. Also, you should be aware that most users are going to be technical idiots, so you can't really expect them to go into the registry, OR bitmasked valued together, and then update the logging verbosity. I highly recommend providing a web page or some other client interface that allows a user to just select check boxes of the subsystem(s) they want to monitor, and a slider for selecting the logging level. If you can't do that, then at a minimum you need to provide an appendix in your user documentation that lists all possible combinations, descriptions, and DECIMAL value that could be sent to the application to obtaine the desired logging.

I don't know how to do those fancy HTML code blocks in Doug's blogger, but here is some example code:

#define SUBSYS1 0x8000 // Basic program trace
#define SUBSYS2 0x4000 // Database logic
#define SUBSYS3 0x2000 //Other special cases
#define LOGLEVEL0 0x0 // Critical Events Only
#define LOGLEVEL1 0x1 // Critical Events+Warnings
#define LOGLEVEL2 0x2 // Full Trace

if ((rc = tableCredits.open()) != OK) {
TunableLog("Credits open failed...", SUBSYS2, LOGLEVEL0);
break;
}
bUnwindTableCreditsOpen = TRUE;
TunableLog("Credits open succeeded...", SUBSYS2, LOGLEVEL1);

TunableLog("About to perform data validation...", SUBSYS1, LOGLEVEL2);
if( (rc = do_some_data_validation_here()) < 0 ) {
TunableLog("Validation failed...", SUBSYS1, LOGLEVEL0);
} else if( rc > 0 ) {
TunableLog("Validation succeeded but with some warnings...", SUBSYS1, LOGLEVEL1);
} else {
TunableLog("Validation successful...", SUBSYS1, LOGLEVEL2);
}

...

directorblue said...

Yes, good comments. I've done something similar inside BadBlue's Gnutella-based peer-to-peer support. Given the persnickety memory management issues involved with it (i.e., monitoring bunches of threads, socket-based networking, punching stuff into collections, managing caches...), I had almost an identical setup for logging. A memory-management subsystem and a "general" subsystem.

Good explanation.