Monday, August 21, 2006

So what is Network Monitoring???

Network Monitoring

The term network monitoring leads one to believe that we are only monitoring the network. In today's complex IT environments it is essential to monitor many things besides the network: a server running out of disk space or a Windows service stopping can have drastic consequences for the organization.

Several years ago I started monitoring Ethernet interfaces on routers with a tool called MRTG (www.mrtg.org). It was so easy to configure and provided so much information that I began using it to measure other things, like Windows server statistics for CPU, memory, and disk space. MRTG is still a fine tool, but it requires a lot of manual scripting and is fairly hard to configure for things it wasn't built for. I began looking for other tool sets that were more easily managed and could send notifications when a measurement crossed a certain threshold, and that is when I discovered Nagios (www.nagios.org). Nagios is a wonderful tool for notifications and is fairly easy to configure for monitoring almost anything. I have combined it with Cacti (www.cacti.net) to provide graphing. Together these two tools can provide a very robust look at your environment, and they are FREE. If you don't have the skill set in house to install and configure these tools, there are a number of organizations that can assist for a reasonable fee. Once up and running, these tools require minimal maintenance and your staff can easily be trained.

Now, what do we need to monitor? I think most organizations miss the boat here: a server or network administrator typically sets up these tools and only "points" them at their devices. When I speak about network monitoring I am talking about a comprehensive view of your environment. I start with connectivity; any monitoring tool has to be able to "ping" the devices in your environment. I typically start with the network interfaces, then the switches, and finally the servers. Identifying these devices and configuring an up/down monitor will let you know, at a very basic level, what is working in your organization. Next I work on the notifications: who needs to know when something is down? Once again, the administrators who typically set these things up don't go far enough, and not everyone who needs this information is included. Make sure to include your Service Desk. They are on the front lines, and when something goes down a phone call from a non-functional user is not far behind.
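
To make the up/down idea concrete, here is a minimal Python sketch, not how Nagios does it internally. It assumes the system "ping" command is available; the function names is_up and check_all and the example device names are my own inventions for illustration.

```python
import platform
import subprocess


def is_up(host, run=subprocess.run, timeout_s=5):
    """Send one ping and report whether the host answered.

    Assumption: the OS "ping" command exists and exits with 0 on a reply.
    Windows ping can exit with 0 in some "unreachable" cases, so treat this
    as a rough check only.
    """
    count_flag = "-n" if platform.system() == "Windows" else "-c"
    try:
        result = run(
            ["ping", count_flag, "1", host],
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            timeout=timeout_s,
        )
    except (subprocess.TimeoutExpired, OSError):
        # No reply in time, or ping could not be started at all.
        return False
    return result.returncode == 0


def check_all(devices, probe=is_up):
    """devices is a list of (name, host) pairs; returns {name: "UP" or "DOWN"}."""
    return {name: ("UP" if probe(host) else "DOWN") for name, host in devices}


if __name__ == "__main__":
    # Hypothetical inventory, listed in the order described above.
    inventory = [("core-router", "192.0.2.1"), ("switch-1", "192.0.2.2")]
    for name, state in check_all(inventory).items():
        print(name, state)
```

The probe argument is there so the list logic can be exercised without touching the network, and so a different check (a TCP connect, for example) could be swapped in later.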

Okay, we know what is in our environment and whether it's up or down. Now let's start monitoring how well these devices are performing. For network devices, let's look at things like bandwidth utilization (how much traffic is going in and out of the device). Where possible, you want to identify the ports on your Ethernet switch infrastructure that are connected to your servers. This is often overlooked when setting up a network tool, but it is valuable information and needs to be monitored. Once again we set thresholds and notifications. For example, 80% utilization on an interface may be a warning and 90% utilization may require a critical alert.
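
The arithmetic behind those thresholds is simple enough to sketch in Python. This assumes you already have two readings of an interface's byte counter (for example from SNMP) taken a known number of seconds apart. The function names are hypothetical, counter wrap-around is not handled, and treating a value exactly at a threshold as crossing it is my assumption.

```python
def utilization_percent(bytes_before, bytes_after, interval_s, link_bps):
    """Percent of link capacity used between two byte-counter samples."""
    if interval_s <= 0 or link_bps <= 0:
        raise ValueError("interval and link speed must be positive")
    bits_sent = (bytes_after - bytes_before) * 8
    return 100.0 * bits_sent / (interval_s * link_bps)


def classify(percent, warning=80.0, critical=90.0):
    """Map a utilization percentage to an alert level."""
    if percent >= critical:
        return "CRITICAL"
    if percent >= warning:
        return "WARNING"
    return "OK"


if __name__ == "__main__":
    # A 100 Mbit/s port sampled five minutes (300 seconds) apart.
    pct = utilization_percent(0, 3_187_500_000, 300, 100_000_000)
    print(round(pct, 1), classify(pct))
```

With the sample numbers shown, the port moved 25.5 billion bits out of a possible 30 billion in the interval, which works out to 85% and lands in the warning band.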

After configuring the network utilization, I start on server statistics. Each server in your environment should have, at a minimum, CPU, memory, and disk space monitored. We set up thresholds and notifications on these systems as well. Many other measurements are available depending on the type of server and the function that it performs.
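
Of the three basics, disk space is the easiest to show using only the Python standard library (CPU and memory usually need an agent or a third-party module). This is a sketch under assumptions: check_disk is a made-up name, and the 80 and 90 percent defaults are borrowed from the interface example rather than recommended values for disks.

```python
import shutil


def check_disk(path, warning=80.0, critical=90.0, usage=shutil.disk_usage):
    """Return (status, percent_used) for the filesystem that holds path."""
    stats = usage(path)
    # Guard against odd filesystems that report a total size of zero.
    percent_used = 100.0 * stats.used / stats.total if stats.total else 0.0
    if percent_used >= critical:
        return "CRITICAL", percent_used
    if percent_used >= warning:
        return "WARNING", percent_used
    return "OK", percent_used


if __name__ == "__main__":
    status, pct = check_disk(".")
    print(f"{status}: {pct:.1f}% used")
```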

Once the basic server monitoring is in place, begin looking at things like application performance. This can be one of the most difficult things to measure, but some creativity can get you through this. I have measured nightly processing by checking for the creation or modification of key or flag files at certain times. I know that if these files are not there by the prescribed time, "production" is running behind, and I send out the appropriate notification.
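
The flag file check can be sketched in a few lines of Python. Everything specific here is an assumption for illustration: the function name job_is_late, the example path, the 6:00 deadline, and the 12 hour window used to decide whether a file belongs to last night's run or is left over from an earlier one.

```python
import os
from datetime import datetime, time, timedelta


def job_is_late(flag_path, deadline, window=timedelta(hours=12), now=None):
    """True when today's deadline has passed without a fresh flag file.

    A file counts as fresh if it was created or modified no earlier than
    `window` before the deadline.
    """
    now = now or datetime.now()
    due = datetime.combine(now.date(), deadline)
    if now < due:
        return False  # too early to judge
    try:
        modified = datetime.fromtimestamp(os.path.getmtime(flag_path))
    except OSError:
        return True  # the file is missing (or unreadable)
    return modified < due - window


if __name__ == "__main__":
    # Hypothetical flag file written by the nightly batch run.
    if job_is_late("/var/run/nightly/complete.flag", time(6, 0)):
        print("Production is running behind, notify the on-call contact")
```

Run from a scheduler shortly after the deadline, a check like this either stays silent or hands off to whatever notification method you already use.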

To maintain an accurate monitoring system, you need to make updating it part of the installation and de-installation of anything in your environment. Make this part of your change management process; these monitoring configurations typically take only a few minutes to put in place.

With these very basic measurements in hand, trends can be observed that lend themselves to predicting things like a maxed-out CPU or disk space about to run out. This trend analysis is invaluable in budgetary planning. In many cases I have been able to predict months in advance when a system was going to run out of resources. This advance notice allows for proper planning and corrective action.
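
One simple way to turn those measurements into a prediction is to fit a straight line through them and see where it reaches capacity. The Python sketch below uses an ordinary least-squares fit. It assumes growth is roughly linear, which is a simplification, and the function name and sample numbers are made up for illustration.

```python
def days_until_full(samples, capacity):
    """Estimate days from the last sample until usage reaches capacity.

    samples is a list of (day_number, amount_used) pairs in time order.
    Returns None when the fitted trend is flat or shrinking.
    """
    n = len(samples)
    if n < 2:
        raise ValueError("need at least two samples")
    mean_x = sum(x for x, _ in samples) / n
    mean_y = sum(y for _, y in samples) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in samples)
    if sxx == 0:
        raise ValueError("samples must span more than one day")
    slope = sum((x - mean_x) * (y - mean_y) for x, y in samples) / sxx
    if slope <= 0:
        return None
    intercept = mean_y - slope * mean_x
    full_day = (capacity - intercept) / slope
    # Never report a negative number if the line says it is already full.
    return max(0.0, full_day - samples[-1][0])


if __name__ == "__main__":
    # Disk usage in GB, sampled every ten days on a 100 GB volume.
    history = [(0, 50), (10, 55), (20, 60), (30, 65)]
    print(days_until_full(history, 100))
```

In the sample data usage grows by half a gigabyte a day, so the line reaches 100 GB on day 100, which is 70 days after the last reading.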

These are some very basic steps in building a network monitoring solution. For detailed information, follow the links on my Resources page. If you have more questions, please send me an email and I will respond as quickly as possible, typically within 24 hours.

