The lessons for troubleshooting are obvious: following the flowchart does not guarantee that you will solve your issue, but you will certainly know more about what you face. But is that the extent of this flowchart’s usefulness? In my opinion, no. The concepts underlying the chart also provide insight into network design, implementation, and documentation.
It is hard to know what changed if your configuration is so horribly complicated that it would take you a week to figure it out. Keep your design simple. Before adding to your design, ask yourself whether you really need whatever it is you are adding. This does not mean you should ditch features with reckless abandon; rather, it means ruthlessly pursuing the simplest solution that satisfies all requirements. One requirement that is often left unspoken, and thus frequently forgotten, is scalability. The end result is a solution that is easy to understand and easy to implement, which in turn makes it easy to troubleshoot.
Locking down a system to prevent change does no good unless the system is designed so that it rarely needs to change. Realistically, you cannot eliminate all changes in your network, but minimizing them is certainly something to strive for. The less things change, the lower the chance of something going wrong.
So far we have touched on the first aspect of the flowchart, which deals with changes in a network, but the second part is equally important. Barring change, it is almost always the cable (this includes the power cable). Failures are going to happen. Assume the worst and plan accordingly. A good goal to strive for is to be able to unplug any one cable or device and have the service continue running. Going much beyond this quickly gets too expensive. Please note that “any device” includes the power supply to the building you are in, so think about geographic redundancy if it is practical. More than one engineer has had their sweet, super-redundant setup taken out by a faulty A/C unit in the colo.
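The "unplug any one cable or device" goal can be checked on paper before anything is racked. The following is a minimal sketch, assuming the network can be described as a list of cables between named devices; the device names and the function names `reachable` and `single_points_of_failure` are made up for illustration. It removes each device and each cable in turn and reports the ones whose loss alone cuts the client off from the service.

```python
from collections import deque

def reachable(links, src, dst, dead_node=None, dead_link=None):
    """Breadth-first search from src to dst, skipping one failed device or cable.

    dead_link is the index of the failed cable in `links`, so two parallel
    cables between the same pair of devices are treated as separate cables.
    """
    if dead_node in (src, dst):
        return False
    adjacency = {}
    for i, (a, b) in enumerate(links):
        if i == dead_link or dead_node in (a, b):
            continue
        adjacency.setdefault(a, set()).add(b)
        adjacency.setdefault(b, set()).add(a)
    seen, queue = {src}, deque([src])
    while queue:
        node = queue.popleft()
        if node == dst:
            return True
        for nxt in adjacency.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return False

def single_points_of_failure(links, src, dst):
    """List every intermediate device or cable whose loss alone disconnects src from dst."""
    devices = {name for link in links for name in link} - {src, dst}
    failures = [d for d in sorted(devices)
                if not reachable(links, src, dst, dead_node=d)]
    failures += [links[i] for i in range(len(links))
                 if not reachable(links, src, dst, dead_link=i)]
    return failures

# Hypothetical topology: two routers, but only one switch in front of the server.
links = [
    ("uplink", "rtr1"), ("uplink", "rtr2"),
    ("rtr1", "sw1"), ("rtr2", "sw1"),
    ("sw1", "server"),
]
print(single_points_of_failure(links, "uplink", "server"))
# prints ['sw1', ('sw1', 'server')]
```

In this example the routers are redundant, but the single switch and the single cable to the server are not. Power feeds and cooling can be modeled the same way by treating them as devices that everything in the room depends on.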
Redundancy is good, but you need to know about failures so that you can restore redundancy when one occurs. Your design should include both network monitoring and configuration monitoring/management. The latter tends to get less attention, although it is equally important. A particularly good example of configuration monitoring/management is RANCID by Shrubbery Networks, which both backs up configurations and sends email alerts when a change is detected.
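To make the idea of configuration monitoring concrete, here is a minimal sketch of the "compare against the last backup and alert on a difference" loop. It is not how RANCID itself is implemented; it only illustrates the concept using the Python standard library. The function names, the sample configuration, and the recipient address are assumptions made for the example, and actually fetching the configuration from a device and sending the mail are left out.

```python
import difflib
from email.message import EmailMessage

def config_diff(old_text, new_text, device):
    """Return a unified diff of two config snapshots, or '' if they match."""
    diff = difflib.unified_diff(
        old_text.splitlines(),
        new_text.splitlines(),
        fromfile=f"{device} (last backup)",
        tofile=f"{device} (current)",
        lineterm="",
    )
    return "\n".join(diff)

def build_alert(device, diff_text, recipient):
    """Wrap a diff in an email message; sending it is left to the caller."""
    msg = EmailMessage()
    msg["Subject"] = f"Config change detected on {device}"
    msg["To"] = recipient
    msg.set_content(diff_text)
    return msg

# Hypothetical snapshots of the same switch, taken a day apart.
last_backup = "hostname sw1\ninterface Gi0/1\n description uplink\n"
current = "hostname sw1\ninterface Gi0/1\n description core uplink\n"

changes = config_diff(last_backup, current, "sw1")
if changes:
    alert = build_alert("sw1", changes, "noc@example.com")
    print(alert["Subject"])
    print(changes)
```

Keeping every snapshot, rather than only the latest one, is what turns this from an alarm into a record of what changed and when.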
Now that you have a fantastic network design, it should be accompanied by an equally fantastic implementation. One of the key aspects of implementation is keeping things organized. This means labeling wires, devices, etc., and using proper wire management.
This one is simple. Once you decide on doing something a certain way, don’t change it. I don’t care how cool the solution is; if it isn’t consistent, it will be difficult, if not impossible, to troubleshoot.
I won’t spend too much time here, because the next section is dedicated to it. Document your solution as you implement it. You will not remember all of the specific details after you are done.
I find that it is very easy to create too little documentation and very difficult to create too much. When building out a solution, a good practice is to write step-by-step instructions that are simple enough for a nontechnical person to replicate. Good documentation doesn’t stop at step-by-step instructions, though. Diagrams and photos are also very good ideas.
I mentioned configuration backups earlier, but it is worth reiterating the importance of having complete backups of the configuration necessary to replicate the solution. Sometimes it can be easier to start from scratch if you cannot locate the source of the problem, although if you have followed the suggestions in the rest of the article, you will probably not reach that point.