Archive for October, 2010
Weird OpenSolaris/Crossbow issue with Aggregations and VLANs/VNICs
Update: Seems this one is already known about in defect 15870 and bug 6950788. Should have googled/looked myself.
Another interesting issue: I had a server which, when pinged from other hosts on the same network, returned duplicate ping responses:
64 bytes from 10.1.0.1: icmp_seq=1640 ttl=255 time=0.074 ms
64 bytes from 10.1.0.1: icmp_seq=1640 ttl=255 time=0.079 ms (DUP!)
64 bytes from 10.1.0.1: icmp_seq=1641 ttl=255 time=0.081 ms
64 bytes from 10.1.0.1: icmp_seq=1641 ttl=255 time=0.087 ms (DUP!)
64 bytes from 10.1.0.1: icmp_seq=1642 ttl=255 time=0.079 ms
64 bytes from 10.1.0.1: icmp_seq=1642 ttl=255 time=0.082 ms (DUP!)
Something not quite right there. The setup was:
   network-switch
     |      |
   igb0    igb1
      \__ __/
         \/
       aggr0
         |--- aggr0vlan1
         |--- aggr0vlan2
         |--- aggr0vlan3
         ...
In the diagram above, we have a Cisco 2960G with a Port-Channel set up across two switch ports, which are attached to two network interfaces (igb0 and igb1) on the server. This is an LACP Ethernet aggregation, used to provide extra bandwidth and/or redundancy to a server.
On OpenSolaris b134 (and on OpenIndiana b147), the aggr0 interface was created with:
dladm create-aggr -l igb0 -l igb1 -P L4 -L active aggr0
I then had a collection of VNICs, provisioned off the aggregation, for example:
dladm create-vnic -l aggr0 -v 1 aggr0vlan1
dladm create-vnic -l aggr0 -v 2 aggr0vlan2
dladm create-vnic -l aggr0 -v 3 aggr0vlan3
...
I then stuck an IP address on each of these VNICs using the usual:
ifconfig aggr0vlan1 plumb 10.1.0.1/16 up
ifconfig aggr0vlan2 plumb 10.2.0.1/16 up
ifconfig aggr0vlan3 plumb 10.3.0.1/16 up
...
There were no other interfaces configured, nothing else fancy at all.
When this was all set up, I got the duplicate ICMP ping packets. Very odd. I used snoop to track things down: the sending host sent one ICMP packet and received two back. Snooping aggr0 on the server showed it was receiving two ICMP packets, and hence sending two replies.
The strangest thing was, this issue would only occur at boot time! If I deleted all the VNICs, then set them up from scratch, no duplicate packets. If I rebooted the box, they came back. Weird!
So I simplified things down to one VNIC, aggr0vlan1. I rebooted, no duplicate packets. So I configured a second, rebooted. Then the duplicate packets were back.
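Checking for duplicates after every reboot gets tedious by eye, so here is a minimal sketch of a checker, assuming ping output in the format shown earlier. The function name and the sample text are my own illustration, not part of any tool used above.

```python
import re
from collections import Counter

# Matches the icmp_seq field in a typical ping reply line.
SEQ_RE = re.compile(r"icmp_seq=(\d+)")


def duplicate_sequences(ping_output):
    """Return the sorted icmp_seq numbers that were answered more than once."""
    counts = Counter(int(m.group(1)) for m in SEQ_RE.finditer(ping_output))
    return sorted(seq for seq, n in counts.items() if n > 1)


# Sample input copied from the ping output format seen in this post.
sample = """\
64 bytes from 10.1.0.1: icmp_seq=1640 ttl=255 time=0.074 ms
64 bytes from 10.1.0.1: icmp_seq=1640 ttl=255 time=0.079 ms (DUP!)
64 bytes from 10.1.0.1: icmp_seq=1641 ttl=255 time=0.081 ms
"""

print(duplicate_sequences(sample))  # prints [1640]
```

It counts replies per sequence number rather than looking for the "(DUP!)" marker, so it should also work with ping implementations that do not flag duplicates themselves.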
Looks like a bug in Crossbow or the igb driver to me. I tested on OpenIndiana b147, whose kernel is six months newer than the one in OpenSolaris b134, but this didn’t fix the issue.
Then on the bus on the way to work this morning (all of the above took place yesterday), I remembered that Crossbow has two internal constructs for VLAN-tagged virtual interfaces: a “vnic” interface with VLAN tagging enabled (what I was using above), and a “vlan” interface. The syntax for the two commands is virtually identical ("dladm create-vnic -l link0 -v 1 link0vlan1" vs "dladm create-vlan -l link0 -v 1 link0vlan1"), and as far as I’m aware they should be logically the same, but I remember one of the Oracle engineers mentioning at a previous LOSUG that the implementations are different inside the kernel.
So instead of creating a vnic, I tried again with a vlan. BOOM! Fixed! No duplicate packets!
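For the setup described in this post, the workaround amounts to one "dladm create-vlan" per VLAN ID on the aggregation. A small sketch that only prints those command lines (it does not run them) is below. The helper name is hypothetical, and the link naming simply follows the aggr0vlanN pattern used above.

```python
def vlan_commands(link, vlan_ids):
    """Build one 'dladm create-vlan' command line per VLAN ID (nothing is executed)."""
    return [
        f"dladm create-vlan -l {link} -v {vid} {link}vlan{vid}"
        for vid in vlan_ids
    ]


# Print the commands for the three VLANs used in this post.
for cmd in vlan_commands("aggr0", [1, 2, 3]):
    print(cmd)
```

Any existing VNICs using the same link names would need removing first, before the printed commands are run.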
Very weird indeed. Glad I was able to work around the issue, but it did consume a fair whack of time. Perhaps someone with a bit more knowledge of Crossbow might be able to shed some light on this.
October 21st, 2010
“Could not stat /dev/sda1” when installing Citrix XenServer
If you get this error when installing Citrix XenServer…
Could not stat /dev/sda1 --- No such file or directory
… then I have a solution for you! Basically it’s caused by a race condition: the installer creates the partition table, but then immediately attempts to create a filesystem on /dev/sda1 before the kernel has caught up with the partition table change.
You can fix the issue by:
1. Hit "Alt-F2" to get a console
2. Use "ps -ef" to get the pid of the installer, and "kill -9" it
3. Type "vi /etc/inittab"
4. Change the line under "# Start the installer on the console" to read:
tty1::respawn:/opt/xensource/installer/preinit
(This causes the installer to be respawned upon death – useful for debugging if things go wrong without requiring a reboot)
5. Type "vi /opt/xensource/installer/backend.py"
6. Near the top add "from time import sleep" under one of the import statements.
7. Near line 565, under the "def createDom0DiskFilesystems(…" bit, add a new line with "sleep(10)" in it:
def createDom0DiskFilesystems(disk, primary_partnum):
    sleep(10)
    rc, err = util.runCmd2....
Be sure to match the indentation of the lines below, as Python uses indentation as a part of its syntax.
8. Type "kill -HUP 1" to reload the inittab. The installer should respawn on tty1; simply press "Alt-F1" to get to it. Perform the install.
This simple fix works by adding a sleep statement before the bit that creates a filesystem. Yay!
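A fixed sleep is the quick fix. A variation, which I have not tried in the installer itself, is to poll until the device node appears and give up after a timeout. The sketch below illustrates that idea only. The wait_for_path helper is hypothetical and not part of the XenServer installer. It uses time.time() so that it should also run on older Python versions.

```python
import os
import time


def wait_for_path(path, timeout=10.0, interval=0.5):
    """Poll until path exists; return True if it appeared, False on timeout."""
    deadline = time.time() + timeout
    while True:
        if os.path.exists(path):
            return True
        if time.time() >= deadline:
            return False
        # Not there yet: wait a little before checking again.
        time.sleep(interval)
```

In place of the sleep(10) line, one could call wait_for_path("/dev/sda1"). This returns as soon as the kernel has created the node and still waits no longer than the timeout.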
October 20th, 2010
bash: fork: Not enough space
Quite a lot of our clients email asking if they have run out of disk space when they get this error:
bash: fork: Not enough space
Not so – the "space" referred to here means memory (RAM). So in basic terms, it means your server has run out of free memory and can’t start new programs.
Resolving this requires stopping running services or restarting your server. Unfortunately, given you have no free memory to launch new programs, doing this might be quite hard. But thankfully if you get this error in a Solaris Zone with capped memory usage, you can fix the issue from the Global Zone. (If you’re an EveryCity managed hosting customer, we can do this for you).
How do you stop it from happening? Well you need to identify what is gobbling all your memory. If it’s an Apache web server, chances are you just received an influx of visitors, as Apache (by default) has to spawn a new process for each connection, and each process uses RAM. The fix is to buy more memory, or optimise Apache (to use less RAM per process), or optimise your site so that requests take less time (so fewer Apache processes are needed to handle the same throughput).
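To put rough numbers on the Apache case, here is a back-of-the-envelope sketch. All figures are made-up examples, and the function name is my own. Measure the real per-process memory use on your own server before relying on the result.

```python
def max_apache_processes(memory_cap_mb, reserved_mb, per_process_mb):
    """Estimate how many Apache processes fit in the memory left after other services."""
    available = memory_cap_mb - reserved_mb
    if available <= 0 or per_process_mb <= 0:
        return 0
    return available // per_process_mb


# Hypothetical figures: 2048 MB cap, 512 MB reserved for everything else,
# 30 MB per Apache process.
print(max_apache_processes(2048, 512, 30))  # prints 51
```

With the prefork MPM, a figure like this gives a sensible upper bound for Apache's MaxClients setting, so that a traffic spike queues connections rather than exhausting memory.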
October 12th, 2010
