Introduction to troubleshooting

Build 1501 on 14/Nov/2017  This topic last edited on: 14/Nov/2013, at 11:12

If something doesn't work, or stops working, the way to fix it is first to understand what the problem is and then to find out how to resolve it.

Error messages that are displayed on screen, or logged in a file, often help you to pinpoint the problem. Collecting them is therefore an important preparation step for successful troubleshooting.
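As an illustration of collecting error messages from a log, here is a minimal Python sketch. The severity keywords (ERROR, FATAL, CRITICAL) and the sample log text are assumptions made up for this example; a real log may use a different format, so the pattern would need adjusting.

```python
import re

# Assumption: error lines contain one of these severity words.
# Adjust the pattern to whatever the real log format uses.
ERROR_PATTERN = re.compile(r"\b(ERROR|FATAL|CRITICAL)\b")


def collect_error_lines(log_text):
    """Return (line_number, line) pairs for lines that look like errors."""
    hits = []
    for number, line in enumerate(log_text.splitlines(), start=1):
        if ERROR_PATTERN.search(line):
            hits.append((number, line.strip()))
    return hits


# Made-up sample log, for illustration only.
sample_log = """\
2017-11-14 11:10:02 INFO  service started
2017-11-14 11:10:05 ERROR cannot open index folder
2017-11-14 11:10:06 INFO  retrying
2017-11-14 11:10:09 FATAL giving up
"""

for number, line in collect_error_lines(sample_log):
    print(f"line {number}: {line}")
```

Keeping the line numbers alongside the messages makes it easier to go back to the log and read what happened just before each error.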

About troubleshooting

Troubleshooting is a form of problem solving: a logical, systematic search for the source of a problem so that it can be solved. Troubleshooting is needed to maintain complex systems where the symptoms of a problem can have many possible causes. Determining the most likely cause is a process of elimination, in which potential causes are ruled out one by one. Finally, troubleshooting requires confirmation that the solution restores the product or process to its working state.

Troubleshooting is usually applied to something that has suddenly stopped working, so the initial focus is often on recent changes to the system or to the environment in which it exists (for example, a server that "was working before it was rebooted"). However, there is a well-known principle that correlation does not imply causation. For example, the failure of a device shortly after it has been plugged into a different outlet doesn't necessarily mean that the two events are related; the failure could be a coincidence. Troubleshooting therefore demands critical thinking rather than magical thinking.

A basic principle in troubleshooting is to start with the simplest and most probable causes. This is illustrated by the old saying "When you see hoof prints, look for horses, not zebras". The saying should not be taken as an affront; rather, it serves as a reminder to always check the simple things first before calling for help.

Troubleshooting can also take the form of a systematic checklist, procedure, flowchart or table that is prepared before a problem occurs. Developing such procedures in advance allows sufficient thought about which steps to take and how to arrange them into the most efficient process. Troubleshooting tables can be computerized to make them more efficient for users.

Efficient, methodical troubleshooting starts with a clear understanding of the expected behavior of the system and of the symptoms being observed. From there, the troubleshooter forms hypotheses about potential causes and devises tests (or consults a standardized checklist of tests) to eliminate these prospective causes.
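The process of forming hypotheses and eliminating them by testing can be sketched in Python as follows. This is only an illustration under stated assumptions: the Hypothesis class, the likelihood and cost figures, and the state dictionary that stands in for the real system are all hypothetical. Ranking by likelihood divided by cost is one simple way to put probable and easily tested causes first; it is not the only possible ordering.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Hypothesis:
    description: str
    likelihood: float              # rough guess between 0.0 and 1.0, not a measurement
    cost: float                    # rough effort to test, arbitrary units, must be > 0
    is_cause: Callable[[], bool]   # test: returns True when this cause is confirmed


def find_cause(hypotheses: List[Hypothesis]) -> Optional[Hypothesis]:
    """Test hypotheses in order of likelihood per unit of cost; return the first confirmed one."""
    ordered = sorted(hypotheses, key=lambda h: h.likelihood / h.cost, reverse=True)
    for hypothesis in ordered:
        if hypothesis.is_cause():
            return hypothesis
    return None  # every hypothesis was eliminated


# Hypothetical stand-in for the real system: the full-text search returns an empty listing.
state = {"folder_has_content": False, "index_is_current": True, "service_running": True}

hypotheses = [
    Hypothesis("search service is not running", 0.2, 2.0, lambda: not state["service_running"]),
    Hypothesis("specified folder is empty", 0.5, 1.0, lambda: not state["folder_has_content"]),
    Hypothesis("index is out of date", 0.3, 5.0, lambda: not state["index_is_current"]),
]

cause = find_cause(hypotheses)
print(cause.description if cause else "no hypothesis confirmed")
```

If no hypothesis is confirmed, the list of candidate causes was incomplete and new hypotheses are needed.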

Troubleshooters commonly use two strategies. The first is to check for frequently encountered or easily tested conditions (for example, if the full-text search produces an empty listing, check whether there is any content at all in the specified folder(s)).

The second is to "bisect" the system (for example, in a network printing system, check whether the job reached the server to determine whether the problem lies in the subsystems "towards" the user's end or "towards" the device).

This latter technique can be particularly efficient in systems with long chains of serialized dependencies or interactions among their components. It is simply the application of a binary search across the range of dependencies and is often referred to as "half-splitting".[3]
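Half-splitting can be sketched in Python as a binary search over an ordered chain of stages. The stage names and the output_ok probe below are hypothetical, and the sketch assumes a single break point: every stage before it produces good output and every stage from it onward does not.

```python
def first_failing_stage(stages, output_ok):
    """Return the index of the first stage whose output is bad, or None if all are good.

    output_ok(i) must return True when the output of stage i is still good.
    Assumes that once the output is bad, it stays bad for all later stages.
    """
    low, high = 0, len(stages)      # the first bad stage lies in [low, high)
    while low < high:
        middle = (low + high) // 2
        if output_ok(middle):
            low = middle + 1        # fault is further downstream
        else:
            high = middle           # fault is at this stage or upstream
    return low if low < len(stages) else None


# Hypothetical network printing chain, ordered from the user's end to the device.
stages = ["application", "print spooler", "network", "print server", "printer driver", "device"]

# Stand-in probe: pretend the chain is broken from index 4 onward and record each check.
probed = []

def output_ok(index):
    probed.append(stages[index])
    return index < 4

fault = first_failing_stage(stages, output_ok)
print("first failing stage:", stages[fault] if fault is not None else "none")
print("stages probed:", probed)
```

Each probe halves the remaining range, so the number of checks grows with the logarithm of the chain length rather than with the length itself.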

Reproducing symptoms

One of the core principles of troubleshooting is that reproducible problems can be reliably isolated and resolved. Often, considerable effort and emphasis in troubleshooting is placed on reproducibility, that is, on finding a procedure that reliably induces the symptom.

Once this is done, systematic strategies can be employed to isolate the cause or causes of the problem; the resolution generally involves repairing or replacing the components that are at fault.
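One way to judge whether a candidate procedure reliably induces the symptom is to run it repeatedly and measure how often the symptom appears. The Python sketch below assumes the procedure can be wrapped in a function that returns True when the symptom occurred; flaky_step is a made-up stand-in that shows the symptom on every third run.

```python
import itertools


def reproduction_rate(procedure, attempts=20):
    """Run procedure repeatedly and return the fraction of runs that showed the symptom."""
    if attempts <= 0:
        raise ValueError("attempts must be positive")
    occurrences = sum(1 for _ in range(attempts) if procedure())
    return occurrences / attempts


# Made-up stand-in for a real procedure: the symptom appears on every third run.
run_counter = itertools.count(1)

def flaky_step():
    return next(run_counter) % 3 == 0

rate = reproduction_rate(flaky_step, attempts=30)
print(f"symptom reproduced in {rate:.0%} of runs")
```

A rate close to 100% suggests the procedure is a good basis for isolating the cause; a low rate suggests that some condition needed to trigger the symptom has not yet been identified.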