Wednesday, July 26, 2006

My server just crashed, now what???

The phone rings, and on the other end is an angry user complaining that the system is down again. This is every IT person's nightmare, not to mention the user's.

The IT person then scurries around making phone calls and sending out frantic pings, hoping that the problem isn't in their area. System admins blame the network; the network team says there isn't a problem on their end and sends the problem back to the system admins, who in turn blame the programmers. What do you do? You reboot the box. When the box comes up, the sys admins say they can't find anything in the logs, and it must just be one of those things. This happens once or twice every 60 to 90 days. Does this sound familiar?

A system reported as down or crashed can mean many different things to IT, but to the user the system they depend on is down, end of story. Sometimes systems crash or lock up for no readily apparent reason. Most of the time, however, there are warning signs, and a mature IT staff will be looking for them.

This isn't an article on system administration; my point here is about monitoring what goes on in your environment. I shudder every time I hear of a system locking up because of a lack of disk space. Ninety percent of the time this problem can be avoided with some simple monitoring and alerting. Monitoring will also help with other issues, such as shortages of memory or CPU, heavy network traffic, or other errors. There is no excuse for organizations not doing network monitoring, because there are free tools available. Nagios, for example, can be configured on a fairly low-end machine and can track virtually anything that needs to be monitored. Nagios allows alerts to be generated and escalated, and it has some rudimentary reporting that can show uptime trends.


Having data available will help pinpoint problems much more quickly. In a server-down scenario, a quick scan will tell you what the network is doing, how the server was performing, and what capacity issues, if any, exist. With proper thresholds set, most problems can be caught before the user ever experiences them.
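To make the idea of thresholds concrete, here is a minimal sketch in Python of a disk space check. It is not the check that ships with Nagios, just an illustration of the same idea: measure something, compare it against a warning level and a critical level, and report a status. The function names and the 80% and 90% thresholds are my own assumptions for the example; the 0/1/2 status codes follow the convention Nagios plugins use for OK, WARNING, and CRITICAL.

```python
import shutil

# Status codes that Nagios plugins conventionally return.
OK, WARNING, CRITICAL = 0, 1, 2
LABELS = {OK: "OK", WARNING: "WARNING", CRITICAL: "CRITICAL"}


def classify(percent_used, warn=80.0, crit=90.0):
    """Map a disk usage percentage to a status code.

    The 80/90 defaults are illustrative assumptions, not recommendations.
    """
    if percent_used >= crit:
        return CRITICAL
    if percent_used >= warn:
        return WARNING
    return OK


def check_disk(path="/", warn=80.0, crit=90.0):
    """Return (status code, message) for the filesystem containing path."""
    usage = shutil.disk_usage(path)
    percent_used = 100.0 * usage.used / usage.total
    status = classify(percent_used, warn, crit)
    message = f"DISK {LABELS[status]} - {path} is {percent_used:.1f}% full"
    return status, message


if __name__ == "__main__":
    status, message = check_disk("/")
    print(message)
```

A real plugin would also hand the status code back to the monitoring system as its exit code, so that a WARNING can trigger an alert well before the disk actually fills up.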

Most small businesses don't have good monitoring in place. If there is monitoring, it is done by different groups for their own particular needs, and there is no comprehensive approach that gives everyone access to this vital information. Why is this? Poor management. It is a management responsibility to make sure this information is available and regularly reported on.

Don't let your systems go unmonitored. Check out www.nagios.org; this tool could save you thousands of dollars in lost production time. There are many other useful tools available, but Nagios is the one I am most familiar with, and in my opinion it offers the most versatility. If you need help configuring Nagios, shoot me an email and I will help get you to a resource who can solve your problem.
