by Mike Lubanski
Introduction
With the massive proliferation of Windows NT throughout the data center, more critical business applications are being moved onto the Windows NT platform. Such a move mandates fine-tuning and monitoring the operating system to provide the most seamless and reliable platform on which to host the applications. While each application has its own monitoring requirements (i.e., those services and processes that determine the health of the application), the operating system is a common layer that must be monitored regardless of which application runs on top of it. Monitoring the operating system layer alongside the applications is therefore essential to ensuring the highest reliability and performance for those applications.
This document details the critical monitoring requirements for the Windows NT operating system. These requirements are generic enough to be used on any server running the NT operating system, yet flexible enough to be customized for the applications that reside on the OS. For example, a Windows NT file and print server with no other applications running on it needs to monitor only the core operating system requirements. An Exchange server, on the other hand, requires both operating system and application monitoring to properly manage the health of the server.
Prerequisites and Assumptions
Several prerequisites and assumptions should be met in order to proceed with the monitoring.
- The SNMP service must be installed and configured on each server. The SNMP trap destination must be configured to point to a network management console where traps can be viewed, sorted, or acted upon by the support organization.
- To track any disk performance monitors, be sure that disk monitoring is enabled by executing “diskperf -y” on the target machine.
- This document covers “what” to monitor, but does not describe “how” or “by what means.” If a more sophisticated solution is desired, a monitoring or management tool (of your choice) will be needed to monitor and alert on operating system events. Such a tool allows events to be coordinated and correlated so that the support organization can be more proactive and responsive. Simple monitoring without automation can be achieved with built-in operating system tools (such as Windows NT Performance Monitor counters).
- Some events state “Need Baseline.” This indicates that a baseline of normal activity is necessary to determine what is above or below normal, or what would be considered an error. For example, CPU utilization should be baselined to determine the normal utilization of the CPU. From this number, you can determine what is abnormal.
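The “Need Baseline” idea can be sketched in a few lines of Python. This is only an illustration: the document does not prescribe how a baseline is computed, so the use of a mean and standard deviation, the `tolerance` of 3.0, the function names, and the sample readings are all assumptions.

```python
from statistics import mean, stdev


def build_baseline(samples):
    """Return (mean, standard deviation) of readings taken during normal activity."""
    if len(samples) < 2:
        raise ValueError("need at least two samples to compute a baseline")
    return mean(samples), stdev(samples)


def is_abnormal(value, baseline, tolerance=3.0):
    """Flag a reading that lies more than `tolerance` standard deviations
    from the baseline mean. The default of 3.0 is an illustrative assumption."""
    average, spread = baseline
    return abs(value - average) > tolerance * spread


# Hypothetical CPU utilization readings (percent) collected during normal operation.
normal_cpu = [22, 25, 19, 30, 27, 24, 21, 26]
baseline = build_baseline(normal_cpu)

print(is_abnormal(28, baseline))  # within the normal range
print(is_abnormal(85, baseline))  # far above the normal range
```

In practice the baseline window (a day, a week, business hours only) matters as much as the formula, and should be chosen to match the server's workload.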
Key
The key to the table is as follows:
- Problem Description: This column defines the problem or error.
- Method of Detection: This column defines how the problem or error can be detected. In most cases, the method of detection is an entry in the Windows NT Event Log or an SNMP trap generated by the machine.
- Recommended Action: This column defines what to do when the problem or error occurs. To turn “monitoring” efforts into “management” efforts, the recommended actions should be automated to run when the problem occurs.
- Monitoring Interval: This column defines how often the monitoring sample should be taken. Most monitors need to scan the system every 30 or 60 minutes; others, such as a health check of a service, may need to scan every 5 minutes.
- Severity: This column defines the priority of the problem:
  1 = High priority, notify immediately
  2 = Medium priority, notify within 1 hour
  3 = Low priority, notify within 24 hours
- Threshold: This column defines the threshold to be monitored. For events, the threshold is always one (a single occurrence). Other monitors, such as CPU, have a specific value.
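As a minimal sketch, the severity definitions can be encoded so that a monitoring tool can work out when a notification is due. The names `NOTIFY_WITHIN` and `notify_deadline` and the sample timestamp are hypothetical; only the three notification windows come from the key.

```python
from datetime import datetime, timedelta

# Notification windows taken from the severity definitions in the key.
NOTIFY_WITHIN = {
    1: timedelta(0),         # high priority: notify immediately
    2: timedelta(hours=1),   # medium priority: notify within 1 hour
    3: timedelta(hours=24),  # low priority: notify within 24 hours
}


def notify_deadline(severity, detected_at):
    """Return the latest time by which the support organization must be notified."""
    if severity not in NOTIFY_WITHIN:
        raise ValueError(f"unknown severity: {severity!r}")
    return detected_at + NOTIFY_WITHIN[severity]


# Hypothetical detection time for illustration.
detected = datetime(2000, 1, 3, 9, 30)
for level in (1, 2, 3):
    print(level, notify_deadline(level, detected))
```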
Windows NT
This section will detail what to monitor within the operating system to adequately manage the health of the server.
Problem Description | Method of Detection | Recommended Action | Monitoring Interval | Severity | Threshold |
Runaway process | Monitor the CPU utilization of the server. | Verify which service or process is consuming CPU. Examine the top 20 processes consuming CPU. Check all of the server services. Restart services that are down in the proper order. | 10 Mins | 1 | >80% sustained over 30 mins |
Paging too high | Monitor the paging frequency of the operating system (page file usage) | Excessive paging calls for investigating what is consuming all of the memory. Consider adding more memory. | 30 Mins | 2; 3 | 70% of page file used; 90% of page file used |
Paging too high. Runaway process | Monitor the pages in and out per second | High paging is indicative of hard disk or memory problems. Investigate the cause of paging. | 15 Mins between 8am-8pm | 1 | 40 pages per second over 15 mins |
CPU Queue Length too high | Monitor the overall queue length of the CPU | The CPU queue length values must be correlated with CPU utilization. A sustained queue length is only indicative of a problem if the CPU utilization is also consistently high. Use Performance Monitor to identify CPU bottlenecks and rectify as necessary. Determine what process is consuming CPU time. Investigate response time for users: if response time is acceptable, continue tracking CPU queue length; if response time is very poor, investigate the problem. | 10 Mins | 2 | >5 sustained over 30 mins |
Available bytes in physical memory | Monitor the amount of physical memory bytes available. | Examine results together with the page file usage results. Find the process consuming the RAM. Monitor RAM usage over a period of time and consider adding more RAM. | 30 Mins | 2; 1 | 10 MB; 5 MB |
Memory leak | Monitor the number of pool (non)-paged bytes | Compare the pool non-paged bytes with activity on the server; the running process may have a memory leak. Monitor the process closely. | 30 Mins | 1 | Need Baseline |
Registry limit reached | Monitor the registry size limit and registry quota. (Object: System, Counter: % Registry Quota In Use) | Identify whether this is a quota problem or a memory consumption problem. | Every 4 hours | 2 | Need Baseline |
CPU performance degradation | Monitor the number of threads | Verify number of threads with CPU utilization and determine which process is consuming CPU resources. | Every 30 Mins | 2 | Need Baseline |
Running low on logical disk space | Monitor the amount of free logical disk space on all logical drives | Delete any temp files that may exist on the server. Move files to another server. Upgrade the hard drives to larger sizes. Rearrange the logical grouping of drives to give more space to the logical drive with the least amount of free space. | Every 1 hour | 2 | Need Baseline (typically 80% full is a good warning point) |
Poor performance on physical disk | Monitor the Average Disk Queue Length and % Disk Time | A high queue length could indicate that the physical disk is not reading or writing fast enough to keep up with the requests. A high % Disk Time indicates that the disk is spinning and in use more often than it should be, which can wear the disk out more quickly than usual. | Every 30 min. | 2 | Queue length > 3; % Disk Time > 50% (Need Baseline) |
Poor network performance | Monitor the Network Interface performance monitor counters Current Bandwidth and Outbound Queue Length | Poor network performance can be caused by many problems that are unrelated to the server. Check the network performance of other servers to determine if there is a network-wide problem. If not, perform a closer diagnostic of this particular server. | Every 1 hour | 3 | Need Baseline |
Poor performance when trying to use the server interactively (slow keyboard and mouse response) | Monitor the Objects object and Processes counter to determine how many processes are running simultaneously. | A large number of processes running on the same machine may indicate that too many applications are on the server. Consider moving an application to another server. | Every 60 minutes | 3 | > 50 |
Poor performance of server | Monitor the Server object and the bytes total/sec counter. Monitor the Server Work Queue object and the queue length counter. | A high bytes total/sec indicates a server is too busy to adequately service its requests. Consider upgrading hardware or reassigning the server to a less-taxing role. A high server work queue length indicates a lag in the amount of time that a server can process its requests. | Every 30 minutes | 2 | Need Baseline |
Failed operating system service | Monitor the service control manager for any failures in the following services: Computer Browser, Event Log, Netlogon, Server, Workstation | Attempt to restart the service. If unsuccessful, reboot the server. If unsuccessful, contact MS Tech support or reinstall the OS. | Every 5 minutes | 1 | 1 |
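Several thresholds in the table are “sustained” values, such as CPU utilization above 80% over 30 minutes sampled every 10 minutes. A minimal sketch of such a check follows. It assumes each sample represents one full monitoring interval, so a 30-minute window at a 10-minute interval means the three most recent samples; the function name and the sample readings are hypothetical.

```python
def is_sustained(samples, threshold, interval_minutes, window_minutes):
    """Return True if every sample in the most recent window exceeds the threshold.

    samples: readings ordered oldest to newest, one per monitoring interval.
    Assumption: each sample stands for one whole interval, so the window
    covers window_minutes // interval_minutes consecutive samples.
    """
    needed = max(1, window_minutes // interval_minutes)
    if len(samples) < needed:
        return False  # not enough history yet to call the condition sustained
    return all(value > threshold for value in samples[-needed:])


# Hypothetical CPU utilization readings (percent), one every 10 minutes.
cpu = [45, 92, 60, 85, 88, 91]

print(is_sustained(cpu, 80, 10, 30))      # last three readings: 85, 88, 91
print(is_sustained(cpu[:4], 80, 10, 30))  # last three readings: 92, 60, 85
```

Requiring every sample in the window to exceed the threshold keeps a single spike from raising a severity 1 alert, which matches the intent of the “sustained” wording.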
Hardware & Network Management
This section will detail what to monitor and manage in the low-level hardware and network interface card.
Hardware errors | Monitor the internal temperature of server | Check hardware for errors | 10 min. | 1 | Need Baseline |
Hardware errors | Monitor any critical IDE or SCSI disk failures | Check hardware for errors | 10 min. | 1 | Need Baseline |
Hardware errors | Monitor NIC failures | Check hardware for errors | 10 min. | 1 | Need Baseline |
Hardware errors | Monitor any fan failures | Check hardware for errors | 10 min. | 1 | Need Baseline |
Hardware errors | Monitor any correctable memory errors | Check hardware for errors | 10 min. | 1 | Need Baseline |
Network utilization high | Monitor the total bytes/second processed by the network interface card. | Check and/or tune performance of the NIC. | 10 min. | 1 | Need Baseline |
ICMP errors | Monitor the receipt time for ICMP packets | Check and/or tune performance of the NIC. | 10 min. | 3 | Need Baseline |
ICMP errors | Monitor the level of unreachable destinations. | Check and/or tune performance of the NIC. | 15 min. | 3 | Need Baseline |
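The bytes/second figure for the network interface can be derived from two readings of a cumulative byte counter. The sketch below assumes the readings come from a 32-bit counter that wraps to zero (as SNMP Counter32 values do) and that the link speed is known; the function names, readings, and the 10 Mbit/s link speed are illustrative assumptions, not values from this document.

```python
COUNTER32_MODULUS = 2 ** 32  # a 32-bit counter wraps back to zero at this value


def bytes_per_second(previous, current, elapsed_seconds, modulus=COUNTER32_MODULUS):
    """Average byte rate between two readings of a cumulative counter.

    The modulo handles a single wrap of the counter between readings.
    """
    if elapsed_seconds <= 0:
        raise ValueError("elapsed_seconds must be positive")
    delta = (current - previous) % modulus
    return delta / elapsed_seconds


def utilization_percent(rate_bytes_per_second, bandwidth_bits_per_second):
    """Express a byte rate as a percentage of the link bandwidth (simplified)."""
    return rate_bytes_per_second * 8 / bandwidth_bits_per_second * 100


# Hypothetical readings taken 10 minutes (600 seconds) apart; the counter wrapped.
rate = bytes_per_second(previous=4_294_900_000, current=532_704, elapsed_seconds=600)
print(rate)
print(utilization_percent(rate, 10_000_000))
```

Rates computed this way can be fed into the baseline for the “Network utilization high” monitor, since the table gives no fixed threshold for it.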