Hi Guys,
OK, a rundown on the network:
- Two WatchGuard XTM510 firewalls in an active/standby cluster.
- Seven Cisco SG300-28P Small Business switches (two core, five access)
- Seven VLANs.
- The VLAN purposes are as follows:
VLAN10 (management VLAN): All IT equipment - access points, firewalls, switches, wireless controllers (2504), etc.
VLAN20: AV sources - Apple TVs, DVD players, etc.
VLAN30: Kaleidescape AV source (Kaleidescape recommend this goes on its own VLAN, which is why it isn't on VLAN20)
VLAN40: Automated room/building control hardware - touchscreens, controllers, processors to control the touchscreens, local room boxes that distribute HDMI and audio from the master digital media switcher.
VLAN50: Staff VLAN
VLAN60: Building management systems
VLAN80: VoIP/Cameras
TOPOLOGY
The two core switches each have one link to each WatchGuard. Every access switch has one link to each core switch.
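To make the amount of redundancy concrete, here is a minimal Python sketch of the switch-to-switch links described above. The switch names are made up for illustration, and it assumes there is no direct core-to-core link and that the firewalls do not bridge traffic between their interfaces (neither is confirmed above). It counts how many links spanning tree has to keep blocked for the layer 2 topology to stay loop-free.

```python
from itertools import product

# Hypothetical names; the real hostnames are not given in the post.
CORES = ["core1", "core2"]
ACCESS = [f"access{i}" for i in range(1, 6)]


def build_links(cores, access):
    """Every access switch has one link to each core switch."""
    return [(a, c) for a, c in product(access, cores)]


def redundant_link_count(nodes, links):
    """Links beyond a spanning tree, assuming the graph is connected.

    A loop-free tree over N switches uses N - 1 links, so every
    extra link needs one blocked (alternate) port under RSTP.
    """
    return len(links) - (len(nodes) - 1)


links = build_links(CORES, ACCESS)
nodes = CORES + ACCESS
print(len(links), "switch links,", redundant_link_count(nodes, links), "must be blocked")
```

Under those assumptions that works out to 10 switch links with 4 of them blocked, so that is the number of blocked ports I would expect to find when mapping out the spanning tree.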
THE PROBLEM
We are experiencing some packet loss on the network. Now, I don't know if this is expected in redundant networks, but I haven't seen it before. The packet loss is around 0.25-0.4% over an hour, depending on where on the network you are hard-wired into a switch port. It doesn't seem to be affecting anything, but we are seeing things drop off the network momentarily from our monitoring station.
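To put those percentages in perspective, here is a quick back-of-the-envelope calculation in Python. It assumes one probe per second for an hour, which is an illustrative figure rather than the real polling rate of our monitoring station.

```python
def lost_probes(loss_percent, probes_per_second=1.0, duration_s=3600):
    """Expected number of lost probes for a given loss percentage."""
    total = probes_per_second * duration_s
    return total * loss_percent / 100.0


for pct in (0.25, 0.4):
    print(f"{pct}% loss over an hour at 1 probe/s is about {lost_probes(pct):.1f} lost probes")
```

At that assumed rate it comes to roughly 9 to 14 lost probes an hour, so the loss is small but not nothing.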
My first thought was that it was something to do with spanning tree topology changes caused by a misconfiguration, but no spanning tree settings have been changed apart from making the core switches the root bridges. I've mapped out the spanning tree topology and everything looks okay; there are no blocked ports where there shouldn't be and no designated ports where there shouldn't be.
My next move was onto the WatchGuards. I pulled up the logs and found an error message pertaining to "received packet with source address as own address on interface eth2.20, eth2.30, eth2.40....". This implies, to me, that there was a loop, but having mapped out the spanning tree topology, there are no loops. I contacted WatchGuard support, who said that this message is caused by a bug in their software. "Okay, so what are you going to do about it?" I said. Their response was anything but conclusive and concise. Anyway, I disconnected one of the WatchGuards and still got the packet loss on the network, but without the error log message above. So I assumed that the WatchGuards perhaps weren't playing a role in this problem.
I have stripped the network down to a single link from each of the switches to one core switch and the problem goes away, so I'm thinking it's got to be RSTP or some kind of broadcast storm.
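One thing that might help narrow it down is checking whether the lost packets come in clusters. Below is a rough Python sketch that groups lost probes into bursts; the input format, the 2-second gap and the sample data are all made up for illustration, not taken from our monitoring station. Losses bunched into bursts of a few seconds would fit some kind of reconvergence event, while isolated single drops would be harder to pin on spanning tree.

```python
def loss_bursts(results, max_gap_s=2.0):
    """Group lost probes into bursts.

    results: list of (timestamp_seconds, ok) tuples, sorted by time.
    Returns a list of (first_loss_ts, last_loss_ts, lost_count).
    Lost probes no more than max_gap_s apart count as one burst.
    """
    bursts = []
    for ts, ok in results:
        if ok:
            continue
        if bursts and ts - bursts[-1][1] <= max_gap_s:
            # Close enough to the previous loss: extend that burst.
            start, _, count = bursts[-1]
            bursts[-1] = (start, ts, count + 1)
        else:
            # Too far from the previous loss (or first loss): new burst.
            bursts.append((ts, ts, 1))
    return bursts


# Made-up sample: one probe per second, losses at t=3, 4, 5 and t=9.
sample = [(t, t not in (3, 4, 5, 9)) for t in range(12)]
print(loss_bursts(sample))
```

With the sample data this reports one three-probe burst and one isolated drop. The timestamps of any real bursts could then be compared against the time of the last topology change reported by the switches.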
So my question to you guys is...do you know what is going on or what my next step should be?
