Friday, December 18, 2009

Troubleshoot Broadcast Storm With Network Sniffer

An analyst must determine what constitutes a broadcast storm for the site being studied, based on its network architecture, protocols, and node count. This requires the analyst to be quite familiar with the topology and with the types of protocols and applications deployed. As a general benchmark, a broadcast sequence from a single device or a group of devices that exceeds 500 frames per second, whether it occurs in a rapid burst or on an intermittent cycle, is a storm event. At the very least, a sequence that reaches 500 frames per second should be investigated when it involves just a few devices and a specific protocol operation.
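The 500 frames-per-second benchmark can be illustrated with a small sliding-window check. This is only a sketch, not how any particular analyzer works: it assumes frames have already been exported as `(timestamp_seconds, source_mac, destination_mac)` tuples sorted by time, and the function name `find_storm_start` and the one-second window are hypothetical choices.

```python
from collections import deque

BROADCAST_MAC = "ff:ff:ff:ff:ff:ff"


def find_storm_start(frames, threshold_fps=500, window_s=1.0):
    """Return the timestamp of the oldest broadcast frame in the first
    window that holds more than threshold_fps broadcast frames, or None.

    frames: iterable of (timestamp_seconds, src_mac, dst_mac), sorted by time.
    """
    window = deque()  # timestamps of broadcast frames inside the window
    for ts, _src, dst in frames:
        if dst.lower() != BROADCAST_MAC:
            continue  # only broadcast frames count toward the threshold
        window.append(ts)
        # Drop frames that are a full window or more older than this one.
        while ts - window[0] >= window_s:
            window.popleft()
        if len(window) > threshold_fps:
            return window[0]
    return None
```

The threshold is a parameter because, as noted above, the right value depends on the site's topology, protocols, and node count.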
After the threshold has been set on the network sniffer, a data-trace capture should be started. Once the capture is running and a broadcast storm event has been reported, either through an Expert system notification or on the statistics screen, the time of the storm and the devices related to it should be carefully noted. The addresses should be recorded in a log along with the time of the storm and the frame-per-second count. Most protocol analyzers provide this information before the capture is even stopped. As soon as the broadcast storm occurs, the analyzer should be stopped immediately so that the relevant data-trace information is still within the memory buffer of the protocol analyzer. The data trace should then be saved to a disk drive or printed to a file so that the information can be reviewed.
The saved data-trace capture should then be opened and the absolute storm time noted from the Expert system or the statistics screen. Many protocol analyzers offer an absolute time display that can be turned on for the data trace. With it enabled, an analyst can search the trace for the frames captured at the absolute time of the storm. This may immediately isolate and identify the cause of the broadcast storm.
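The log entry and the absolute-time search described above can be sketched as follows. This assumes the trace has been exported with one absolute timestamp per frame, sorted in capture order; `StormLogEntry` and `first_frame_at_or_after` are hypothetical names used only for illustration.

```python
from bisect import bisect_left
from dataclasses import dataclass
from datetime import datetime


@dataclass
class StormLogEntry:
    """What the analyst notes when the storm is reported."""
    noted_at: datetime        # absolute time of the storm
    addresses: list           # device addresses related to the storm
    frames_per_second: int    # frame-per-second count reported


def first_frame_at_or_after(frame_times, noted_at):
    """Index of the first frame whose absolute time is >= noted_at.

    frame_times must be sorted; returns len(frame_times) if no frame matches.
    """
    return bisect_left(frame_times, noted_at)
```

Starting the review at the returned index places the analyst at the first frame captured at or after the logged storm time.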
Certain network sniffers offer hotkey filtering that jumps directly to the storm event within the data-trace analysis results. Either way, whether by absolute time or by hotkey filtering, the broadcast storm should be located within the data-trace capture.
Other metrics can be turned on in a protocol analysis display view when examining a broadcast storm, such as relative time and packet size. After the start of the storm has been located, the key devices starting and invoking the storm should be logged. Sometimes only one or two devices cause a cyclical broadcast storm occurrence throughout an internetwork, resulting in a broadcast storm event across many different network areas. The devices communicating at the time closest to the start of the storm inside the data-trace analysis results may be the devices causing the event.
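One way to shortlist the devices closest to the start of the storm is to count broadcast frames per source address in a short interval beginning at the storm start. The sketch below assumes frames exported as `(timestamp_seconds, source_mac, destination_mac)` tuples; the function name and the 100 ms default interval are hypothetical, and the ranking only suggests candidates for closer review.

```python
from collections import Counter

BROADCAST_MAC = "ff:ff:ff:ff:ff:ff"


def top_broadcast_sources(frames, storm_start, span_s=0.1, limit=3):
    """Rank source addresses by broadcast frames sent in
    [storm_start, storm_start + span_s)."""
    counts = Counter(
        src
        for ts, src, dst in frames
        if dst.lower() == BROADCAST_MAC and storm_start <= ts < storm_start + span_s
    )
    # List of (source_mac, frame_count), most active first.
    return counts.most_common(limit)
```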
After the storm has been located, the Relative Time field should be zeroed out and the storm should be closely reviewed by examining all packets or frames involved in it. Even if 500 or 1,000 frames are involved, all of them should be closely examined by paging through the trace. After the end of the storm has been located, the time between the start and the end of the storm should be measured with relative time: zero out the relative time at the first frame of the storm, then read the cumulative relative time at the last frame of the sequence. This provides a clear picture of which devices participated in the storm and what processes they ran, the packet sizes generated during the storm, and the location of the storm source. The first several packets of the broadcast storm should be investigated for the physical, network, and transport layer addressing that may relate to the storm occurrence. This helps an analyst understand the sequence of the storm event.
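The relative-time measurement amounts to subtracting the timestamp of the first storm frame from every other timestamp. A minimal sketch, assuming frames exported as `(timestamp_seconds, frame_size_bytes)` tuples and hypothetical helper names:

```python
def relative_times(timestamps, zero_index):
    """Re-express timestamps relative to the frame at zero_index,
    which is what zeroing out the Relative Time field does."""
    base = timestamps[zero_index]
    return [ts - base for ts in timestamps]


def summarize_storm(frames, start_index, end_index):
    """Duration, frame count, and packet-size range for the storm.

    frames: list of (timestamp_seconds, frame_size_bytes);
    start_index and end_index are inclusive positions in that list.
    """
    storm = frames[start_index:end_index + 1]
    sizes = [size for _ts, size in storm]
    return {
        "duration_s": storm[-1][0] - storm[0][0],  # cumulative relative time
        "frame_count": len(storm),
        "min_size": min(sizes),
        "max_size": max(sizes),
    }
```

The cumulative relative time at the last storm frame equals `duration_s`, so either view gives the length of the storm.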
Locating and measuring broadcast storms is an extremely important process in network baselining and should be part of both proactive and reactive analysis. In proactive baselining, an analyst must configure the proper broadcast storm thresholds on the protocol analyzer so that storm events show up during the network baseline session. In a troubleshooting (reactive) event, it is important to know whether users are also reporting failure occurrences or site network failures, because these may coincide with the time of the storm. If they do, isolating and identifying the broadcast storm may make it possible to isolate the devices causing the storm or the protocol operations involved. It may then be possible to stop the storm from recurring, which should improve network performance.
