October 21st, 2010
Another interesting issue. I had a server that, when pinged from other hosts on the same network, returned duplicate ping responses:
64 bytes from 10.1.0.1: icmp_seq=1640 ttl=255 time=0.074 ms
64 bytes from 10.1.0.1: icmp_seq=1640 ttl=255 time=0.079 ms (DUP!)
64 bytes from 10.1.0.1: icmp_seq=1641 ttl=255 time=0.081 ms
64 bytes from 10.1.0.1: icmp_seq=1641 ttl=255 time=0.087 ms (DUP!)
64 bytes from 10.1.0.1: icmp_seq=1642 ttl=255 time=0.079 ms
64 bytes from 10.1.0.1: icmp_seq=1642 ttl=255 time=0.082 ms (DUP!)
Something not quite right there. The setup was:
 network-switch
    |      |
  igb0   igb1
     \__ __/
        \/
      aggr0
        |--- aggr0vlan1
        |--- aggr0vlan2
        |--- aggr0vlan3
        ...
In the diagram above, we have a Cisco 2960G with a Port-Channel set up across two switch ports, which are attached to two network interfaces (igb0 and igb1) on the server. This is an LACP Ethernet aggregation, used for providing extra bandwidth and/or redundancy to a server.
On OpenSolaris b134 (and on OpenIndiana b147), the aggr0 interface was created with:
dladm create-aggr -l igb0 -l igb1 -P L4 -L active aggr0
I then had a collection of VNICs, provisioned off the aggregation, for example:
dladm create-vnic -l aggr0 -v 1 aggr0vlan1
dladm create-vnic -l aggr0 -v 2 aggr0vlan2
dladm create-vnic -l aggr0 -v 3 aggr0vlan3
...
I then stuck an IP address on each of these vnics using the usual:
ifconfig aggr0vlan1 plumb 10.1.0.1/16 up
ifconfig aggr0vlan2 plumb 10.2.0.1/16 up
ifconfig aggr0vlan3 plumb 10.3.0.1/16 up
...
There were no other interfaces configured, nothing else fancy at all.
When this was all set up, I got the duplicate ICMP ping packets. Very odd. I used snoop to track things down: the sending host sent one ICMP packet and received two back. When I snooped aggr0 on the server configured here, it was receiving two ICMP packets for each one sent, hence sending two replies.
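For anyone wanting to check for the same symptom from the sending side, here is a minimal sketch that counts the replies ping flags as duplicates. It assumes a ping whose output marks duplicates with "(DUP!)", as in the output above; the function name count_dups is my own invention for illustration.

```shell
# count_dups (hypothetical helper): read ping output on stdin and
# print how many replies were flagged as duplicates.
count_dups() {
    # grep -c prints the number of matching lines. It exits non-zero
    # when there are no matches (while still printing 0), so '|| true'
    # keeps that case from looking like a failure.
    grep -c '(DUP!)' || true
}

# Example usage, assuming a ping that accepts -c <count>:
#   ping -c 20 10.1.0.1 | count_dups
```

Anything other than 0 means the sender saw more than one reply for at least one request.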
The strangest thing was, this issue would only occur at boot time! If I deleted all the VNICs, then set them up from scratch, no duplicate packets. If I rebooted the box, they came back. Weird!
So I simplified things down to one VNIC, aggr0vlan1. I rebooted, no duplicate packets. So I configured a second, rebooted. Then the duplicate packets were back.
Looks like a bug in Crossbow or the igb driver to me. I tested on OpenIndiana b147, which has a kernel six months newer than OpenSolaris b134, but this didn’t fix the issue.
Then this morning, on the bus on the way to work (everything above took place yesterday), I remembered that Crossbow has two internal constructs for doing VLAN-tagged virtual interfaces: a “vnic” interface with VLAN tagging enabled (what I was using above), and a “vlan” interface. The syntax for the two commands is virtually identical ("dladm create-vnic -l link0 -v 1 link0vlan1" vs "dladm create-vlan -l link0 -v 1 link0vlan1"), and as far as I’m aware they should be logically the same, but I remember one of the Oracle engineers mentioning at a previous LOSUG that the implementations are different inside the kernel.
So instead of creating a vnic, I tried again with a vlan. BOOM! Fixed! No duplicate packets!
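As a rough sketch of what the switch looks like for the setup described here, the helper below only prints the commands for each VLAN ID rather than running them, so they can be reviewed first. The function name print_vlan_migration is made up for this example, and the link names follow the aggr0vlanN pattern used above; the IP addresses still need to be plumbed again afterwards in the usual way.

```shell
# print_vlan_migration LINK VID... (hypothetical helper):
# print, but do not run, the commands that replace each VLAN-tagged
# VNIC on LINK with a VLAN link of the same name.
print_vlan_migration() {
    link=$1
    shift
    for vid in "$@"; do
        name="${link}vlan${vid}"
        # Take the IP interface down before touching the datalink.
        echo "ifconfig ${name} unplumb"
        # Remove the VNIC that was created with 'create-vnic -v'.
        echo "dladm delete-vnic ${name}"
        # Recreate it as a VLAN link with the same VLAN ID and name.
        echo "dladm create-vlan -l ${link} -v ${vid} ${name}"
    done
}

# Dry run for the three VLANs used in this post.
print_vlan_migration aggr0 1 2 3
```

Once the printed commands look right, they can be pasted into a root shell or piped to sh.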
Very weird indeed. Glad I was able to work around the issue, but it did consume a fair whack of time. Perhaps someone with a bit more knowledge of Crossbow might be able to shed some light on this.