What is a network loop? Wikipedia puts it this way: a loop occurs in a computer network when there is more than one path between two endpoints. So what does this mean, and why does it matter to me?
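The definition is easier to see if you treat the network as a graph, with devices as points and cables as lines between them. Two paths between the same endpoints means the graph contains a cycle. Here is a minimal Python sketch of one way to spot that, using a union-find check. The device names are made up for illustration, and this is not how any particular switch or vendor detects loops.

```python
def find_loop_link(links):
    """Return the first link that closes a loop, or None if the topology is loop-free.

    links: iterable of (a, b) pairs, each an undirected cable between two devices.
    If both ends of a new link are already connected, adding it creates a
    second path between them, which is exactly a loop.
    """
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving keeps lookups fast
            x = parent[x]
        return x

    for a, b in links:
        root_a, root_b = find(a), find(b)
        if root_a == root_b:
            return (a, b)  # a and b were already connected: this link is a second path
        parent[root_a] = root_b  # merge the two connected groups
    return None


# Hypothetical chain: switch -> client1 -> client2 -> client3
chain = [("switch", "client1"), ("client1", "client2"), ("client2", "client3")]
print(find_loop_link(chain))  # None

# A redundant link from client3 back to the switch gives two paths between them
print(find_loop_link(chain + [("client3", "switch")]))  # ('client3', 'switch')
```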
Loop of frustration
A few of you have asked where I have been the last few weeks. Two things happened. First, my job had me pulling 60-hour weeks, and with the commute I was away from home up to 75 hours a week.
Second, and this is a little background on why I am writing on this topic: my computer sits behind a professional-grade spam filter, so all my traffic to the net goes through it. In an attempt to place ads in the footer of each of my web pages, I placed some JS code incorrectly. Now, I'm not a JS master, but I do know when I mess things up, and I did, big time. I am not sure how, but my code created a flood of network traffic, and now I am blacklisted. FANTASTIC. I have no idea how to get back in, so I use a proxy to get to the internet. Update: removing the code and re-optimizing the web page let me get back through.
The impact of a loop
At my main job, we experience network loops all the time (unplanned, unfortunately). For those of you who don't know, I work on systems made up of multiple nodes, each with multiple clients chained together in series. At the end of each client string there is a redundant loop that only activates when a network connection is lost, to keep connectivity continuous. This is where most network loops happen, though not always. Every now and then, a client goes AWOL and opens both its normal and its redundant port. Now, I know what you're thinking: KMP, why don't you just close the port on the last client? Well, it is not that easy, and it is not always the last client in the chain that is causing the loop. Closing the loop at the last client could take all the other clients offline, which stops the network loop but leaves you unable to find the root cause.
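Why one bad port is such a big deal can be shown with a toy flood simulation. Ethernet frames carry no hop count, and a switch floods a broadcast out of every port except the one it arrived on, so once the redundant link closes the chain into a ring, a single broadcast never dies out. With more cross-connections the copies multiply every hop. This is a simplified model under stated assumptions: the device names are hypothetical, and it ignores timing, MAC learning, and spanning tree; it only counts frame copies in flight.

```python
def build(links):
    """Turn a list of (a, b) cables into an adjacency dict."""
    adjacency = {}
    for a, b in links:
        adjacency.setdefault(a, []).append(b)
        adjacency.setdefault(b, []).append(a)
    return adjacency


def flood(adjacency, origin, ticks):
    """Flood one broadcast frame from `origin` and count copies in flight per tick.

    Each device forwards every copy out of all its links except the one it
    arrived on, and (as with Ethernet) no hop limit ever discards a frame.
    Assumes at most one cable between any two devices, so "the device it came
    from" identifies the ingress port.
    """
    in_flight = [(origin, None)]  # (device holding the copy, device it came from)
    history = []
    for _ in range(ticks):
        next_flight = []
        for device, came_from in in_flight:
            for neighbour in adjacency[device]:
                if neighbour != came_from:
                    next_flight.append((neighbour, device))
        in_flight = next_flight
        history.append(len(in_flight))
    return history


chain = [("switch", "client1"), ("client1", "client2"), ("client2", "client3")]
ring = chain + [("client3", "switch")]  # redundant link closes the chain
meshed = [("a", "b"), ("a", "c"), ("a", "d"), ("b", "c"), ("b", "d"), ("c", "d")]

print(flood(build(chain), "switch", 5))  # [1, 1, 1, 0, 0]
print(flood(build(ring), "switch", 5))   # [2, 2, 2, 2, 2]
print(flood(build(meshed), "a", 4))      # [3, 6, 12, 24]
```

In the plain chain the frame falls off the far end; in the ring it circulates forever; in the fully meshed case it doubles every tick, which is the "no data can flow at all" situation.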
A network loop can cripple a network to the point where no data flows at all; think of a DDoS attack. So how do I resolve this? Without going into too much detail, it is done one client at a time. Our system lets me connect to each client individually through a separate maintenance port, which I highly recommend for anybody building large networks with redundant links. This lets me look at each client's network ports and see whether it is in a redundant condition. Once the client in question has been identified, it is removed from the network and diagnosed on its own node.
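The one-by-one sweep can be sketched roughly as follows. Everything here is hypothetical: `PortStatus`, `read_status`, and the idea that "redundant condition" means the redundant port is open while the primary is still up are assumptions made for illustration, not how the author's system actually reports port state.

```python
from dataclasses import dataclass


@dataclass
class PortStatus:
    """Hypothetical port state as read over a client's maintenance port."""
    primary_up: bool      # normal (forward) port is linked and passing traffic
    redundant_open: bool  # redundant port is open/forwarding


def in_redundant_condition(status):
    """True if the redundant port is open even though the primary is still up.

    Assumption: the redundant link should only open after the normal link is
    lost, so both being active at once is what closes the loop.
    """
    return status.primary_up and status.redundant_open


def sweep(clients, read_status):
    """Check clients one at a time and return the ones that look like the culprit.

    read_status is a stand-in for whatever query the maintenance port offers.
    """
    suspects = []
    for client in clients:
        if in_redundant_condition(read_status(client)):
            suspects.append(client)
    return suspects


# Fake readings: client3 has legitimately failed over, client2 has not.
readings = {
    "client1": PortStatus(primary_up=True, redundant_open=False),
    "client2": PortStatus(primary_up=True, redundant_open=True),
    "client3": PortStatus(primary_up=False, redundant_open=True),
}
print(sweep(["client1", "client2", "client3"], readings.get))  # ['client2']
```

Going through a separate maintenance port matters here: the sweep still works while the main network is flooded, and it points at the actual culprit instead of whichever client happens to be last in the chain.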
So what can go wrong?
The majority of the time, it is user error. In one case, an engineer put a switch on the network and connected it to different nodes (this was devastating and VERY hard to identify). Another situation is when a port on a client fails: the client opens its redundant port, then the failed port comes back online and traffic keeps flowing through both. The first scenario is very hard to predict, but the second is "easy". What we can do is have the kernel drop traffic coming from the failed port's MAC address. On reboot, the failed unit tests its forward and redundant ports to make sure no flooded traffic occurs. If all is clean, it leaves the correct ports open. If not… well, you get it.
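Here is a rough model of the two safeguards for the second scenario. In a real deployment the MAC drop would be a kernel-level rule (on Linux, for example, a bridge-level filter such as ebtables or nftables); this sketch only models the decision logic. `MacDropFilter`, `ports_to_open`, and the MAC address are hypothetical, and treating "neither port is clean" as "stay off the network" is an assumption about what "well, you get it" means.

```python
def normalize_mac(mac):
    """Normalise a MAC string to lower case with ':' separators."""
    return mac.strip().lower().replace("-", ":")


class MacDropFilter:
    """Toy stand-in for a kernel rule that drops frames from blocked source MACs."""

    def __init__(self):
        self.blocked = set()

    def block(self, mac):
        self.blocked.add(normalize_mac(mac))

    def unblock(self, mac):
        self.blocked.discard(normalize_mac(mac))

    def accept(self, src_mac):
        """Return False (drop) if the frame's source MAC is blocked."""
        return normalize_mac(src_mac) not in self.blocked


def ports_to_open(primary_clean, redundant_clean):
    """Hypothetical reboot self-test: pick which ports to leave open.

    Never returns both ports, because opening both is what closes the loop.
    """
    if primary_clean:
        return {"primary"}    # normal path is fine; redundant stays closed
    if redundant_clean:
        return {"redundant"}  # fail over to the redundant path only
    return set()              # neither port is clean: stay off the network


kernel_filter = MacDropFilter()
kernel_filter.block("AA-BB-CC-DD-EE-01")          # hypothetical MAC of the failed port
print(kernel_filter.accept("aa:bb:cc:dd:ee:01"))  # False
print(kernel_filter.accept("aa:bb:cc:dd:ee:02"))  # True
print(ports_to_open(primary_clean=True, redundant_clean=True))  # {'primary'}
```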