9.1.2.1 General Troubleshooting Procedures

Troubleshooting takes a large portion of network administrators’ and support personnel’s time. Using efficient troubleshooting techniques shortens overall troubleshooting time when working in a production environment. There are three major stages to the troubleshooting process:

Stage 1. Gather symptoms - Troubleshooting begins with gathering and documenting symptoms from the network, end systems, and users. In addition, the network administrator determines which network components have been affected and how the functionality of the network has changed compared to the baseline. Symptoms may appear in many different forms, including alerts from the network management system, console messages, and user complaints. While gathering symptoms, it is important that the network administrator ask questions and investigate the issue in order to localize the problem to a smaller range of possibilities. For example, is the problem restricted to a single device, a group of devices, or an entire subnet or network of devices?
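The localization question above (single device, group of devices, or entire subnet) can be sketched in a few lines of Python. This is only an illustration: the function name `classify_scope`, the scope labels, and the assumption that every subnet uses a /24 prefix are hypothetical and not part of any standard tool.

```python
import ipaddress


def classify_scope(affected_hosts, prefix_length=24):
    """Return a rough scope label for a list of affected IPv4 addresses.

    Assumption: all subnets share the same prefix length (default /24).
    """
    # Parse and de-duplicate the reported addresses.
    hosts = {ipaddress.ip_address(h) for h in affected_hosts}
    if not hosts:
        return "no affected hosts reported"
    if len(hosts) == 1:
        return "single device"

    # Map each host to the subnet that contains it.
    subnets = {
        ipaddress.ip_network(f"{h}/{prefix_length}", strict=False)
        for h in hosts
    }
    if len(subnets) == 1:
        return "group of devices in one subnet"
    return "multiple subnets"
```

For example, reports from 192.168.1.10 and 192.168.1.20 would be labeled as a group of devices in one subnet, which suggests looking at shared components such as the access switch or default gateway before examining individual hosts.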

Stage 2. Isolate the problem - Isolating is the process of eliminating variables until a single problem, or a set of related problems, has been identified as the cause. To do this, the network administrator examines the characteristics of the problems at the logical layers of the network so that the most likely cause can be selected. At this stage, the network administrator may gather and document more symptoms, depending on the characteristics that are identified.

Stage 3. Implement corrective action - Having identified the cause of the problem, the network administrator works to correct the problem by implementing, testing, and documenting possible solutions. After finding the problem and determining a solution, the network administrator may need to decide whether the solution can be implemented immediately or must be postponed. This depends on the impact of the changes on the users and the network. The severity of the problem should be weighed against the impact of the solution. For example, if a critical server or router must be offline for a significant amount of time, it may be better to wait until the end of the workday to implement the fix. Sometimes, a workaround can be created until the actual problem is resolved. Decisions such as these are typically governed by the network’s change control procedures.

If the corrective action creates another problem or does not solve the problem, the attempted solution is documented, the changes are removed, and the network administrator returns to gathering symptoms and isolating the issue.

These stages are not mutually exclusive. At any point in the process, it may be necessary to return to previous stages. For instance, the network administrator may need to gather more symptoms while isolating a problem. Additionally, an attempt to correct a problem could create another problem. In this instance, remove the changes and begin troubleshooting again.
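The way the three stages feed back into one another can be sketched as a simple loop. This is a minimal illustration, not a prescribed procedure: the function `troubleshoot`, its callback parameters, and the `max_attempts` limit are all hypothetical names introduced here. Each callback stands in for work the network administrator would perform.

```python
def troubleshoot(gather_symptoms, isolate_problem, apply_fix, verify,
                 rollback, max_attempts=3):
    """Run the gather / isolate / correct cycle and return (resolved, log).

    All callbacks are supplied by the caller (hypothetical interface):
      gather_symptoms() -> list of symptom descriptions
      isolate_problem(symptoms) -> suspected cause, or None if unclear
      apply_fix(cause), rollback(cause) -> make or undo the change
      verify() -> True if the network behaves as expected again
    """
    log = []                                  # documentation of every attempt
    symptoms = gather_symptoms()              # Stage 1
    for attempt in range(1, max_attempts + 1):
        cause = isolate_problem(symptoms)     # Stage 2
        if cause is None:
            # Not enough information yet: return to Stage 1.
            log.append((attempt, None, "more symptoms needed"))
            symptoms = symptoms + gather_symptoms()
            continue

        apply_fix(cause)                      # Stage 3
        if verify():
            log.append((attempt, cause, "resolved"))
            return True, log

        # The fix did not help or caused a new problem:
        # document it, remove the change, and start over.
        rollback(cause)
        log.append((attempt, cause, "rolled back"))
        symptoms = symptoms + gather_symptoms()
    return False, log
```

The log is kept for failed attempts as well as successful ones, reflecting the point that an attempted solution is documented even when it has to be removed.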

A troubleshooting policy, including change control procedures, should be established for each stage. A policy provides a consistent manner in which to perform each stage. Part of the policy should include documenting every important piece of information.

Note: Communicate to the users and anyone involved in the troubleshooting process that the problem has been resolved. Other IT team members should be informed of the solution. Appropriate documentation of the cause and the fix will assist other support technicians in preventing and solving similar problems in the future.
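One possible way to keep the documentation consistent is to record each incident in a fixed structure. The sketch below is an assumption about what such a record might contain; the class name `TroubleshootingRecord` and its fields are hypothetical and should be adapted to the organization's own troubleshooting policy.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class TroubleshootingRecord:
    """Hypothetical record of one troubleshooting case."""
    summary: str
    symptoms: List[str] = field(default_factory=list)
    affected_components: List[str] = field(default_factory=list)
    attempted_fixes: List[str] = field(default_factory=list)
    cause: Optional[str] = None
    resolution: Optional[str] = None

    def is_resolved(self) -> bool:
        # A case counts as resolved only when both cause and fix are documented.
        return self.cause is not None and self.resolution is not None

    def to_text(self) -> str:
        # Plain-text report that can be shared with users and other IT staff.
        lines = [
            f"Summary: {self.summary}",
            f"Symptoms: {'; '.join(self.symptoms) or 'none recorded'}",
            f"Affected components: {', '.join(self.affected_components) or 'none recorded'}",
            f"Attempted fixes: {'; '.join(self.attempted_fixes) or 'none recorded'}",
            f"Cause: {self.cause or 'not yet identified'}",
            f"Resolution: {self.resolution or 'open'}",
        ]
        return "\n".join(lines)
```

A record that lists attempted fixes alongside the final cause and resolution gives other support technicians a starting point when similar symptoms appear again.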

Figure: Troubleshooting Process (Troubleshooting with a Systematic Approach)