Monday, August 21, 2006

So what is Network Monitoring???

Network Monitoring

The term "network monitoring" leads one to believe that we are only monitoring the network. In today's complex IT environments it is essential to monitor many things besides the network. A server running out of disk space or a Windows service stopping can have drastic consequences for the organization.

Several years ago I started monitoring Ethernet interfaces on routers with a tool called MRTG (www.mrtg.org). It was so easy to configure and provided so much information that I began using it to measure other things, such as Windows server statistics for CPU, memory, and disk space. MRTG is still a fine tool, but it requires a lot of manual scripting and was fairly hard to configure for things it wasn't built for. I began looking for other tool sets that were more easily managed and could send notifications when a measurement crossed a certain threshold, and that is when I discovered Nagios (www.nagios.org). Nagios is a wonderful tool for notifications and is fairly easy to configure for monitoring almost anything. I have combined it with Cacti (www.cacti.net) to provide graphing. Together these two tools can provide a very robust look at your environment, and they are FREE. If you don't have the skill set in house to install and configure these tools, there are a number of organizations that can assist for a reasonable fee. Once up and running, these tools require minimal maintenance, and your staff can easily be trained.

Now, what do we need to monitor? I think most organizations miss the boat here. A server or network administrator typically sets up these tools and only "points" them at their own devices. When I speak about network monitoring, I am talking about a comprehensive view of your environment. I start with connectivity: any monitoring tool has to be able to "ping" the devices in your environment. I typically start with the network interfaces, then the switches, and finally the servers. Identifying these devices and configuring an up/down monitor will let you know, at a very basic level, what is working in your organization. Next I work on the notifications: who needs to know when something is down? Once again, the administrators who typically set these things up don't go far enough, and not everyone who needs this information is included. Make sure to include your Service Desk. They are on the front lines, and when something goes down, a phone call from a user who cannot work is not far behind.

Okay, we know what is in our environment and whether it's up or down. Now let's start monitoring how well these devices are performing. For network devices, let's look at things like bandwidth utilization (how much traffic is going in and out of the device). Where possible, you want to identify the ports on your Ethernet switch infrastructure that are connected to your servers. This is often overlooked when setting up a network tool, but it is valuable information and needs to be monitored. Once again we set thresholds and notifications. For example, 80% utilization on an interface may be a warning, and 90% utilization may require a critical alert.
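To make the threshold idea concrete, here is a minimal Python sketch of how a measured traffic rate could be turned into a warning or critical status. The function name, the status labels, and the idea of passing in the link speed are my own assumptions for illustration; this is not how Nagios or Cacti implement their checks. The 80% and 90% defaults are the example values mentioned above.

```python
def classify_utilization(bits_per_second, link_speed_bps,
                         warn_pct=80.0, crit_pct=90.0):
    """Return (status, percent) for one direction of an interface.

    bits_per_second : measured traffic rate (hypothetical input)
    link_speed_bps  : nominal speed of the interface, e.g. 100_000_000
    warn_pct/crit_pct : example thresholds, adjust to your environment
    """
    if link_speed_bps <= 0:
        raise ValueError("link speed must be positive")

    percent = 100.0 * bits_per_second / link_speed_bps

    # Check the critical threshold first so it takes priority.
    if percent >= crit_pct:
        status = "CRITICAL"
    elif percent >= warn_pct:
        status = "WARNING"
    else:
        status = "OK"
    return status, percent


if __name__ == "__main__":
    # A 100 Mbit/s port carrying 85 Mbit/s of traffic.
    print(classify_utilization(85_000_000, 100_000_000))
```

In practice the traffic rate would come from your monitoring tool's interface counters. The point is simply that the warning and critical levels are plain numbers you choose and review.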

After configuring the network utilization, I start on server statistics. Each server in your environment should have, at a minimum, CPU, memory, and disk space monitored. We set up thresholds and notifications on these systems as well. Many other measurements are available depending on the type of server and the function that it performs.
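As a rough illustration of a disk space check, the sketch below uses Python's standard `shutil.disk_usage` call. The function names and the 80%/90% thresholds are assumptions I picked for the example. A real deployment would more likely use the check plugins that come with your monitoring tool.

```python
import shutil


def usage_status(used, total, warn_pct=80.0, crit_pct=90.0):
    """Classify a used/total pair against example thresholds."""
    if total <= 0:
        raise ValueError("total must be positive")
    percent = 100.0 * used / total
    if percent >= crit_pct:
        return "CRITICAL"
    if percent >= warn_pct:
        return "WARNING"
    return "OK"


def check_disk(path, warn_pct=80.0, crit_pct=90.0):
    """Return (status, percent_used) for the filesystem holding `path`."""
    usage = shutil.disk_usage(path)  # named tuple: total, used, free
    percent = 100.0 * usage.used / usage.total
    return usage_status(usage.used, usage.total, warn_pct, crit_pct), percent


if __name__ == "__main__":
    status, percent = check_disk(".")
    print(f"{status}: {percent:.1f}% used")
```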

Once the basic server monitoring is in place, begin looking at things like application performance. This can be one of the most difficult things to measure, but some creativity can get you through it. I have measured nightly processing by checking for the creation or modification of key or flag files at certain times. If these files are not there by the prescribed time, I know that "production" is running behind, and I send out the appropriate notification.
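The flag file check can be sketched in a few lines of Python. This is only an illustration under my own assumptions: the function name and the example path are hypothetical, and the check is meant to be run by a scheduler at the prescribed time. It reports whether the file exists and was created or modified since the start of the processing window.

```python
import os
from datetime import datetime, timedelta


def flag_file_is_fresh(path, not_before):
    """Return True if `path` exists and was modified at or after `not_before`.

    A missing or unreadable file counts as "not fresh", which is the
    signal that nightly processing is running behind.
    """
    try:
        modified = datetime.fromtimestamp(os.path.getmtime(path))
    except OSError:
        return False
    return modified >= not_before


if __name__ == "__main__":
    # Hypothetical example: run at 06:00 and expect the flag file
    # to have been written some time in the last 8 hours.
    window_start = datetime.now() - timedelta(hours=8)
    if not flag_file_is_fresh("/data/nightly/complete.flag", window_start):
        print("Production is running behind: send notification")
```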

To maintain an accurate monitoring system, you need to make this part of the installation and de-installation of anything in your environment. Make it part of your change management process. These monitoring configurations typically take only a few minutes to put in place.

With these very basic measurements in hand, trends can be observed that lend themselves to predicting things like a maxed-out CPU or disk space about to run out. This trend analysis is invaluable in budgetary planning. In many cases I have been able to predict months in advance when a system was going to run out of resources. This advance notice allows for proper planning and corrective action.
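One simple way to do this kind of prediction is to fit a straight line through past measurements and see when it reaches capacity. The Python sketch below does that with an ordinary least-squares fit. The function name and the sample numbers are made up for illustration, and a straight line is an assumption that will not suit every workload.

```python
def days_until_full(samples, capacity=100.0):
    """Estimate days from the last sample until usage reaches `capacity`.

    samples : list of (day_number, used) pairs, oldest first
    Returns None when usage is flat or shrinking.
    """
    n = len(samples)
    if n < 2:
        raise ValueError("need at least two samples")

    mean_x = sum(x for x, _ in samples) / n
    mean_y = sum(y for _, y in samples) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in samples)
    if sxx == 0:
        raise ValueError("samples must span more than one day")

    # Least-squares slope and intercept of used = slope * day + intercept.
    slope = sum((x - mean_x) * (y - mean_y) for x, y in samples) / sxx
    if slope <= 0:
        return None
    intercept = mean_y - slope * mean_x

    full_day = (capacity - intercept) / slope
    return max(0.0, full_day - samples[-1][0])


if __name__ == "__main__":
    # Made-up data: disk goes from 50% to 60% used over 20 days.
    history = [(0, 50.0), (10, 55.0), (20, 60.0)]
    print(days_until_full(history))
```

With growth of half a percent per day, the made-up disk above would fill roughly 80 days after the last sample, which is the kind of lead time that makes budget conversations much easier.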

These are some very basic steps in building a network monitoring solution. For detailed information, follow the links on my Resources page. If you have more questions, please send me an email and I will respond as quickly as possible, typically within 24 hours.


Saturday, August 19, 2006

Richard Skinner's groundbreaking book "IT is about the Strategy" is a comprehensive guide to developing winning IT strategies for the SMB.

Most young organizations suffer from the same breakage points as they grow. Skinner shows how these milestones and their accompanying issues are a normal part of business growth, and how you can overcome each of these challenges.

I have had the pleasure of working with Richard a number of times over the years, and I know these techniques work. His book outlines common-sense techniques that will take you out of firefighting and chaos. If you are looking for practical help with your IT issues, please check out this book at www.itisaboutstrategy.com

Monday, August 14, 2006

The Daily Status Meeting

In most businesses we face IT issues on a daily basis, and dealing with them in a timely and effective manner is critical. I have found it extremely useful to meet with the IT staff each morning and review any issues that are currently open or were just recently closed in our tracking system.

We limit this meeting to just the critical, service-impacting events. Having this meeting early in the morning gives us a chance to get the necessary resources deployed to deal with the problem effectively. Including all of the IT disciplines is essential. Make sure that there is a good discussion of each issue, and review how well process is being followed in updating tickets and communicating with the users. This is a perfect opportunity to drive the internal IT processes and get everyone on the same page.

Many of the things I do today have been gleaned from the morning meeting. Remember, we are always on the journey to better IT management; we never actually arrive. Maintaining effective processes, including incident and problem management, is a continuous effort. When IT staff are not consistently directed, they will fall back into old habits of poor documentation and poor communication.

I also use this meeting to review any changes that went into production the previous day. This is the perfect time to check status on these items, and it only takes a few minutes. A well-run morning status meeting will typically take 15 to 30 minutes, certainly a small price to pay to know what is going on in your IT environment.

This may seem overly simplistic, but you would be amazed at how many organizations do not do this simple task.

Sunday, August 06, 2006

Are there babies dying?

With the recent turmoil in the Middle East, there is indeed a situation where babies are dying. It is tragic to watch the news and see so many innocent children caught up in this violence. Despite varying political views, no one wants to see young children suffer. We see rescue workers on both sides struggling to give aid to these victims as they place their own lives at risk. This is no doubt an emergency situation. The point of this post is to contrast these "real life" emergencies with the IT emergencies that we deal with.

On countless occasions I have been drawn into panic situations on the IT front, where it appeared as if certain doom was imminent if the server wasn't back up right now. Managers are screaming, the phone rings off the hook, vacations are called off, and the world is on the edge of destruction, right? A few years ago I worked with a gentleman who, in the middle of one of these IT catastrophes, asked "Are there babies dying?". This really hit home with me and helped me put things in the proper perspective. Large revenue-impacting events are definitely important and should be responded to appropriately, but they are not on the same scale as the life and death events we see around us. Organizations find themselves in a panic-first mentality, which is counterproductive, stressful, and not cost effective.


In dealing with an IT emergency, some well-thought-out procedures can handle the most critical events imaginable. The organization I work for had several incidents with hurricanes last year, which required the shutdown of offices, the rerouting of business, and the human task of getting money to people while the office was out of commission. We were able to do this efficiently, without undue stress, and with some compassion. All of this was possible because we took time to plan. When the emergency arose, we knew what to do.

Here are some really basic tips on disaster planning that we often take for granted.

1. Maintain an up-to-date contact list, with roles and responsibilities. It is amazing how hard phone numbers are to find during a crisis.

2. Have a communication plan. Know who communicates with clients and who will manage internal communication. Let your staff do their jobs and have someone else manage communication. Know what you will do if all Internet and telephone communication is shut off.

3. Prepare an escalation process, and make sure everyone knows how to use it.

4. Keep up to date inventories of equipment and software.

5. Backups, backups, backups. Enough said.

6. Document the chain of events. Use your incident management system if available to keep track of all activities surrounding the emergency.

7. At the end of the event, do a postmortem. Review the timeline of events and look for things that could have been done better. Be critical of yourself; this will pay huge dividends the next time an event occurs. I have found that sharing the weaknesses in our response, along with a solid plan to correct them next time, adds credibility with clients and auditors. Don't be afraid of the truth!

In conclusion, I would like to say that cool heads and basic management techniques solve problems faster and more effectively than panic and finger-pointing. Remember, the next time a server crashes, ask "Are there babies dying?".