One of the most prominent issues facing medium to large IT organizations is the subject of monitoring. Monitoring has been around a long time and has come in many forms. Even today, many datacenters keep a schedule on which someone physically walks up and down the rows looking for blinking red lights. Monitoring has come a long way from manually looking for blinking lights, but the wisdom behind the diligence of that scheduled walk hasn't kept pace with the technology.
Years ago the idea of using simple pings took hold, and network monitoring was born. Since that time innumerable applications have been written to address the problem. It is only within the last few years that I have learned how essential it is to be in control of how you monitor your infrastructure. Since I've been a Microsoft administrator for so long, it made sense to use a Microsoft product to monitor my systems. About a year ago I set up and installed Microsoft System Center Operations Manager 2007, and since then my appreciation of monitoring has been tempered, and my expectations of what to look for in a monitoring system have been raised incredibly high.
It is essential, of course, to have trust in your systems: trust that you have designed, managed, and maintained them well enough that you can sleep at night. With demands for 24×7 uptime on practically every aspect of an IT ecosystem, you need to know when a system is behaving aberrantly. Is it too much to ask that when you go to bed at night, you will be awakened by a phone call only in a real emergency?
And how do you decide what is an emergency? And therefore, how do you decide what you need to monitor? My current preoccupation with monitoring systems comes from several miscalculated attempts by well-meaning staff. "Build it and they will come" is not a best practice for monitoring IT systems. It is misguided at best to buy an over-priced, over-complicated solution and then identify what you need to monitor by discovering what that shiny new monitoring system doesn't address. But without experiencing first-hand what a system can or can't deliver, how can you ever decide what data you're missing? If your parents bought you a bike as a small child, did they spend a thousand dollars on a 26-speed, Tour de France-worthy machine? Most likely you received the Toys "R" Us special with training wheels.
Do you need a ping test? Then dozens of free tools can meet your needs. What if your needs are more complicated? What if you have no way of knowing what your needs will be a week from now, two months out, or two years down the road?
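To be fair, the simple end of that spectrum really is simple. A basic reachability check fits in a few lines of script; the sketch below just shells out to the system ping, and the hostnames are placeholders, not anything from my environment.

```python
import platform
import subprocess

# Hosts to check -- placeholder names, substitute your own.
HOSTS = ["mailserver.example.com", "dc01.example.com", "192.168.1.1"]

# The "ping once" flag differs between Windows and Unix-like systems.
COUNT_FLAG = "-n" if platform.system() == "Windows" else "-c"

def is_alive(host: str) -> bool:
    """Send a single ping and report whether the host answered."""
    result = subprocess.run(
        ["ping", COUNT_FLAG, "1", host],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return result.returncode == 0

if __name__ == "__main__":
    for host in HOSTS:
        print(f"{host}: {'up' if is_alive(host) else 'DOWN'}")
```

If that really is all you need, you're done. The trouble starts when it isn't.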
Consider a real-life scenario like this:
It’s 2 am. Something happened that you won’t find out about until the morning when your users are calling. You rush into work only to find your VP and other angry management figures waving hands in the air. “E-mail is down! No one can work like this! Fix it or I will fix you!”
What happened here? What happens next? For several minutes, or maybe several hours, someone has to dig through event logs and hardware logs, run tests, sweat, Google a dozen things, browse TechNet, and possibly even call product support. All of that may only identify what the problem was; it doesn't necessarily fix the issue. Once identified, the issue can hopefully be resolved quickly, but if it is a catastrophic failure it could be days or weeks before a resolution is found. After the issue is resolved, many companies now ask for an RCA, either formally or informally. An RCA, for those who don't know, is a "root-cause analysis": a blanket term covering the what, the where, and the why of a major problem. RCAs usually involve angry management and very timid-looking IT professionals, and people tend to get fired soon after them, which is how an RCA can lead to another acronym, an RGE: a "Resume Generating Event."
Enter a reasonably robust and functional monitoring system.
It’s 2 am. Something has happened and it was logged on a system. The vigilant monitoring system detects the log entry, matches a rule in its database, and issues an alert. You are woken up by a text message and the sound of ultimate doom coming from your phone. You have 4 hours to research and fix this issue before the CEO wakes up and grabs his BlackBerry to check his email.
We have closed the hours-long gap between a problem occurring and anyone knowing about it, and we have achieved the true goal of a monitoring system: early warning. Think of your monitoring system as a plane's radar; the whole point is to detect threats before they become critical events.
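Stripped to its essentials, that early-warning loop is just: watch for a new entry, match it against a rule, and wake somebody up. The sketch below is a toy version of the idea, not how OpsMgr itself is built; the rule table, event IDs, and notification stub are all made up for illustration.

```python
from dataclasses import dataclass

@dataclass
class LogEntry:
    source: str
    event_id: int
    message: str

# Made-up rule table mapping (event source, event ID) to a severity.
RULES = {
    ("ExampleMailService", 1000): "critical",
    ("ExampleBackupAgent", 2000): "warning",
}

def notify(severity: str, entry: LogEntry) -> None:
    """Stand-in for whatever wakes you at 2 am: SMS gateway, pager, e-mail."""
    print(f"[{severity.upper()}] {entry.source} {entry.event_id}: {entry.message}")

def evaluate(entry: LogEntry) -> None:
    """Match a freshly logged entry against the rules and alert on a hit."""
    severity = RULES.get((entry.source, entry.event_id))
    if severity is not None:
        notify(severity, entry)

# Simulate the 2 am event from the story above.
evaluate(LogEntry("ExampleMailService", 1000, "The mail store went offline."))
```

The value of a real product is everything wrapped around that loop: agents collecting the entries, a maintained rule set, and escalation paths, not the loop itself.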
Since the days of the ping test, monitoring systems have evolved into much more complicated and much more intelligent applications. Monitoring systems can now warehouse events to provide long-term statistics on the health and performance of your systems. That warehouse can generate reports and inform strategy for your IT systems, letting you move from a reactionary IT environment to a smart ecosystem that predicts when and where systems will need to grow before end users and management ever notice.
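As a toy illustration of the warehousing idea (not how any particular product stores its data), the snippet below drops a few made-up alerts into SQLite and asks the kind of question a warehouse makes easy: which servers alert the most?

```python
import sqlite3

# In-memory database for the example; a real warehouse is a server product.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE alerts (raised_on TEXT, source TEXT, severity TEXT)")

# A few made-up rows standing in for months of collected events.
db.executemany(
    "INSERT INTO alerts VALUES (?, ?, ?)",
    [
        ("2009-03-01", "EXCH01", "critical"),
        ("2009-03-01", "BES01", "warning"),
        ("2009-03-02", "EXCH01", "critical"),
    ],
)

# The trending question: which systems generate the most alerts over time?
for source, count in db.execute(
    "SELECT source, COUNT(*) FROM alerts GROUP BY source ORDER BY COUNT(*) DESC"
):
    print(source, count)
```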
In addition to long-term trending and reporting, some of these systems are better classified as management systems, not just monitoring systems. The software I am using, Operations Manager 2007 (or OpsMgr, as I call it), not only monitors logs, web pages, and services, but is also designed to perform the first few steps of troubleshooting an issue. If a service stops unexpectedly during the night, OpsMgr will detect the failure and issue a command to the server to restart it. It performs administrative recovery tasks so I may not have to.
The best example of this, and one reason my monitoring expectations are so high, is how OpsMgr has saved me from the humiliating experience of telling the CIO why something broke. I wrote a post several months ago about a BlackBerry error where users were unable to send messages or perform lookups from their handhelds. The error generated an event 20482 in the application log. Due to the site architecture of the Active Directory forest this BES was in, the event would occur whenever the Exchange-specific domain controllers were knocked offline: the BES could no longer perform global catalog lookups, which resulted in the error. Since this usually happened at 6 am (the patch window for the DCs), I didn't want to be up every time it occurred. So OpsMgr now stops and restarts the critical BlackBerry services whenever that 20482 event occurs. Problem solved by a monitoring system.
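In OpsMgr this is configured as a rule paired with a recovery task rather than a script, but the recovery itself boils down to something like the sketch below. The service names and the event-source check here are assumptions on my part; verify them against your own BES server before automating anything.

```python
import subprocess

# Service names on a typical BES box -- assumptions; check services.msc on
# your own server before wiring anything up to restart them automatically.
BES_SERVICES = [
    "BlackBerry Dispatcher",
    "BlackBerry Router",
]

def restart_service(name: str) -> None:
    """Stop and start a Windows service using net.exe."""
    subprocess.run(["net", "stop", name], check=False)
    subprocess.run(["net", "start", name], check=False)

def on_event(source: str, event_id: int) -> None:
    """Called by whatever is watching the Application log (OpsMgr, a task
    attached to the event, etc.) whenever a new entry is written."""
    # The 20482 event ID comes from the incident above; the source check
    # is a guess, since the exact source name varies by BES component.
    if event_id == 20482 and source.startswith("BlackBerry"):
        for service in BES_SERVICES:
            restart_service(service)

# Example: simulate the 6 am global catalog failure described above.
on_event("BlackBerry Dispatcher", 20482)
```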
With that direct experience behind me, I challenge any IT group looking for a monitoring solution to dig a little deeper. If the point of a monitoring system is to make your life less stressful, how can it also make your life easier?