Posts filed under 'General'
Compiling Qt with WebKit on Solaris 10
Getting Qt to build on Solaris 10 is a PITA, and getting it to build with WebKit enabled is even worse. But fret not: after some Googling, the necessary patches can be found.
You can find our build recipe for it over here:
http://hg.openindiana.org/users/aszeszo/s10-userland/file/c473cd11bbd3/components/qt4
We’re using GCC 4.4 to build Qt. Although it is not the officially supported compiler on Solaris platforms (they specify Sun Studio), it works just fine.
Add comment November 25th, 2011
Fixing “No active dataset” on zone attach
When moving zones between OpenIndiana (and OpenSolaris) hosts, you can often end up with the following dreaded error:
# zoneadm -z zonename attach -U
Log File: /var/tmp/zonename.attach_log.B8aWed
ERROR: no active dataset.
Result: Attach Failed.
This can happen for a variety of reasons, such as not detaching the zone before moving it, and not transferring the ZFS properties with the zone. But personally I blame the half-arsed zone attach scripts that could do with some work.
To get around it, here is a super-quick/dirty script that should allow the zone to attach:
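#!/bin/bash
# Usage: pass the zone's ZFS filesystem (the parent of the zone's ROOT dataset)
zfsfs=$1
root=${zfsfs}/ROOT
zbe=${root}/zbe
# Reset zoned and mountpoint to their inherited values on all three datasets
for i in "$zbe" "$root" "$zfsfs" ; do
    for j in zoned mountpoint ; do
        zfs inherit "$j" "$i"
    done
done
# Set the properties the zone attach scripts look for
zfs set mountpoint=legacy "$root"
zfs set zoned=on "$root"
zfs set canmount=noauto "$zbe"
zfs set org.opensolaris.libbe:active=on "$zbe"
# Tie the zone boot environment to the global zone's current boot environment
rbe=$(zfs list -H -o name /)
uuid=$(zfs get -H -o value org.opensolaris.libbe:uuid "$rbe")
zfs set org.opensolaris.libbe:parentbe="$uuid" "$zbe"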
The script takes one argument: the ZFS filesystem the zone lives in (the parent of “ROOT” for the zone). Ignore any errors about "dataset is used in a non-global zone". Once it has run, manually mount the dataset and attach the zone with:
mount -F zfs dataset/ROOT/zbe /zones/zonename/root
zoneadm -z zonename attach
This guide is pretty rough but should hopefully set people in roughly the right direction.
1 comment October 31st, 2011
Adjusting drive timeouts with mdb on Solaris or OpenIndiana
Update: These timeouts don’t work nearly as well as one would hope. Unfortunately the sd timeouts get passed to the driver, which in the case of mpt/mpt_sas appears to do very little with them. I have raised this as an issue within the Illumos community and the debate was quite polarising; the kernel developers deny there is a problem or disagree on how to solve it, despite lots of people complaining of the same symptoms. Unfortunately I think it’s a difficult problem to solve due to the wide variety of hardware types that ZFS/Illumos is deployed on.
Our way of coping with dodgy drives is to preempt their failure via trigger-happy SMART/iostat monitoring scripts that "zpool offline" bad drives before they fail.

Yesterday we suffered our first disk failure in our shiny new NFS cluster that has been operating flawlessly for 3 months. The NFS cluster we have is quite nice - it consists of a pair of NFS servers (96GB of RAM, Dual Intel E5620 CPUs) dual-attached to a set of LSI SAS 6Gbps JBOD arrays, with lots of Seagate Constellation ES 2TB enterprise SAS drives. For good measure there’s 1.5TB of SSD cache (6×256GB SSDs) acting as a read cache (L2ARC), and a ZeusRAM SSD acting as the write cache (ZIL). It runs a custom build of OpenIndiana.
Ordinarily a disk failure would result in at most a few minutes of stall while the OS waits for the drive to recover, and gives up. However, this drive decided simply to run glacially slowly, so it didn’t get removed in a timely fashion. In fact, it didn’t get removed at all, resulting in all IO to the SAN being stuck, causing a rather severe outage. 45 minutes in total.
When things became unresponsive, we logged in, and "iostat -xn" showed a 100% busy time on one of the disks, while the others did nothing. We attempted to "zpool offline baddisk". Nothing much happened, presumably because the OS thought the drive was fine and was waiting on some queued IO finishing, or something along those lines. We had no immediate way of yanking the disk out, so we decided to fail over the cluster from the primary NFS node to the secondary. This consists of powering off the primary node and letting the cluster software import the ZFS zpool and bring NFS services online.
When the secondary NFS node started importing the zpool, iostat once again showed a 100% busy time on the bad disk. Crap. Andrzej had the bright idea of deleting the disk entries from /dev, and sure enough this prompted ZFS to think the drive had disappeared, and the pool finally imported.
So immediately the question springs to mind, why did the OS not take this bad disk out of service? We consulted with our upstream vendor (contacted the folks over at Illumos) and all became clear.
The answer lies in the defaults in the Solaris SCSI subsystem. The default timeout for IO is 60 seconds with 5 retries (or 3 retries if it’s fibre channel/eSAS). For a storage array like ours, that is 60 seconds × 3 retries, a 3 minute timeout for a single IO - or in other words, a very long time. Since the disk was accepting a trickle of IO, this timeout was never really reached.
Thankfully the timeouts can be adjusted, and Garrett D’Amore, the founder of Illumos and one of the lead developers who works at Nexenta, strongly suggested tuning the timeout to 5 seconds, with 3 retries.
Setting the timeout value is quite easy - it’s the system wide tunable sd_io_time. Keep in mind this will affect all disks. Edit /etc/system and drop in:
set sd:sd_io_time=5
If you have desktop SATA drives you’ll probably want a higher timeout, especially if you don’t have TLER (Time limited error recovery) on them, which limits error recovery to around 7 seconds.
The number of retries is set in /kernel/drv/sd.conf via sd-config-list, which allows settings to be applied per disk type. To get 3 retries, the variable would be "retries-timeout:3". The format of this file is a bit weird; here is an example for two disks:
sd-config-list = "STEC    ZeusRAM         ", "throttle-max:32, disksort:false, cache-nonvolatile:true",
"SEAGATE ST32000444SS    ", "retries-timeout:3";
The bit where you define the disk type is a fixed length field, consisting of 8 characters for the vendor, and 16 characters for the product. So you have to pad the field out to the correct length with spaces.
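Counting spaces by hand is error-prone, so here is a rough sketch of the padding rule using printf. The helper name pad_inquiry is made up for this example and is not part of Solaris; it simply pads (and truncates) the vendor to 8 characters and the product to 16.

```shell
#!/bin/bash
# pad_inquiry VENDOR PRODUCT
# Hypothetical helper: builds the 24 character disk type string that
# sd-config-list expects (8 chars of vendor + 16 chars of product).
pad_inquiry() {
    # %-8.8s left-justifies, pads with spaces to 8 and truncates at 8;
    # %-16.16s does the same for a 16 character field.
    printf '%-8.8s%-16.16s' "$1" "$2"
}

# Print an entry for the Seagate drive used above, ready to paste.
printf '"%s", "retries-timeout:3";\n' "$(pad_inquiry SEAGATE ST32000444SS)"
```

Check the vendor and product strings against what your own disks report before pasting the result into sd.conf.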
Once these are set, reboot to activate. You can check the values are set by doing:
## Print system wide sd_io_time timeout value:
# echo "sd_io_time::print" | mdb -k
0x3c
## Print per-disk timeout and retry values:
# echo "::walk sd_state | ::grep '.!=0' | ::sd_state" | mdb -k | egrep "^un|un_retry_count|un_cmd_timeout"
un: ffffff093239d9c0
un_retry_count = 0x3
un_cmd_timeout = 0x5
un: ffffff093239d380
un_retry_count = 0x3
un_cmd_timeout = 0x5
...
The return values are in hexadecimal, so for example 0x3c is 60 seconds.
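If you’d rather not convert hex in your head, the shell’s printf will do it for you. A small sketch (hex_to_dec is just a name made up for this example):

```shell
#!/bin/bash
# hex_to_dec VALUE
# Convert a hexadecimal value as printed by mdb (e.g. 0x3c) to decimal.
hex_to_dec() {
    printf '%d\n' "$1"
}

hex_to_dec 0x3c   # the default sd_io_time: prints 60
hex_to_dec 0x5    # the tuned un_cmd_timeout: prints 5
```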
Adjusting values without rebooting
We have a number of storage servers in production, some of which we really didn’t want to reboot just to change the timeout value. After discussions with some of the Illumos kernel developers, we worked out how to set the property at runtime using the modular Solaris debugger, mdb, which allows editing kernel values on a live system.
The system wide sd_io_time is used to populate a per-disk timeout value which is also stored in the same structure as the per-disk retry count. So changing the values is pretty similar.
First, we want to obtain the memory addresses of the settings we wish to edit:
# echo "::walk sd_state | ::grep '.!=0' | ::print -a struct sd_lun un_cmd_timeout" | mdb -k > /tmp/un_cmd_timeouts
# cat /tmp/un_cmd_timeouts
ffffff0d347a3a7c un_cmd_timeout = 0x3c
ffffff0d247983bc un_cmd_timeout = 0x3c
ffffff0d3429d3fc un_cmd_timeout = 0x3c
ffffff0d55daf37c un_cmd_timeout = 0x3c
...
Now that we have the addresses in /tmp/un_cmd_timeouts, we can set the value using mdb -kw:
# for i in `cat /tmp/un_cmd_timeouts | awk '{print $1}'` ; do echo ${i}/W 0x5 | mdb -kw ; done
We can then check the value was set by re-running:
# echo "::walk sd_state | ::grep '.!=0' | ::print -a struct sd_lun un_cmd_timeout" | mdb -k
Now we can do the same for un_retry_count:
# echo "::walk sd_state | ::grep '.!=0' | ::print -a struct sd_lun un_retry_count" | mdb -k > /tmp/un_retry_count
# for i in `cat /tmp/un_retry_count | awk '{print $1}'` ; do echo ${i}/W 0x3 | mdb -kw ; done
Hey presto, we just adjusted boot time kernel parameters on the fly :-)
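Since mdb -kw writes straight into a running kernel, it can be worth previewing exactly what will be written before letting the loop loose. The following is only a sketch: preview_mdb_writes is a made-up helper that reads the saved address file and prints the "address/W value" lines without running mdb at all.

```shell
#!/bin/bash
# preview_mdb_writes FILE VALUE
# Hypothetical dry-run helper: prints the lines that the loops above
# would pipe into "mdb -kw", but does not touch the kernel.
preview_mdb_writes() {
    # Only use lines whose first field looks like a hex address, so stray
    # output such as "..." or error text never becomes a write command.
    awk -v val="$2" '$1 ~ /^[0-9a-f]+$/ { print $1 "/W " val }' "$1"
}

# Example (uses the file saved earlier):
# preview_mdb_writes /tmp/un_cmd_timeouts 0x5
```

If the printed lines look sane, run the real loop from above.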
If you need to know which disk is which, you can assume the output from mdb is ordered, and do:
echo "::walk sd_state | ::grep '.!=0' | ::print struct sd_lun un_sd | ::print struct scsi_device sd_dev | ::devinfo -q" | mdb -k
This returns the sd instance id, which can be seen from "iostat -E". StackOverflow has some answers for mapping from sd to device name should you need to.
Concluding Remarks
With these values in place, our timeout is reduced from upwards of 3 minutes, to a mere 15 seconds. This is far more likely to cause the OS to offline dodgy disks like the one we were experiencing issues with.
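The arithmetic behind those figures is simply timeout multiplied by retries, as reckoned earlier in this post. A tiny sketch (worst_case_secs is a made-up name) in case you want to try other combinations:

```shell
#!/bin/bash
# worst_case_secs TIMEOUT RETRIES
# Rough worst case wait for a single IO, using the same simple
# "timeout times retries" reckoning as the text above.
worst_case_secs() {
    echo $(( $1 * $2 ))
}

worst_case_secs 60 3   # defaults on SAS: prints 180 (3 minutes)
worst_case_secs 5 3    # tuned values: prints 15
```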
There has been some recent discussion on the Illumos mailing lists regarding the default sd_io_time value, suggesting that the default should be lowered to 8 seconds. This has caused a bit of a furore, as people using Solaris with fibre channel disk arrays require higher timeouts, say 180 seconds. So there are people on both sides of the fence. But one thing is for sure - it’s a setting more people should know about.
5 comments May 14th, 2011
Autoconf, Automake and Libtoolized version of bzip2
Autoconf, Automake and libtool are 3 utilities designed to simultaneously help and hinder those of us that have to compile software. They together produce the familiar “./configure ; make ; make install” procedure most of us have used time and time again.
Although these tools are universally hated for being overly complex, slow and hard to use, thankfully most projects use them, because the alternative (usually some shitty Makefile that only works on Linux) is far far far worse.
BZip2 is one of those very simple system utilities we all require, where the author only ships a Makefile. Thankfully a helpful SuSE developer has autoconfized it. So grab a copy of those files into your bzip2-1.0.6 folder, and run autogen.sh
If only someone would do this for libxvid and ffmpeg…
Add comment March 28th, 2011
Lame, nasm, and text relocations (textrels)
Well, this took some debugging.
I’ve filed it all in a NASM bug report. To cut a long story short, if you compile LAME with NASM 2.09, you’ll end up with TEXTRELs in the resultant libmp3lame.so.
What is a TEXTREL you may ask? Something bad! It stops the code being fully PIC (position independent), which stops the shared object being loaded into memory once and mapped multiple times. But worse, it causes Solaris ld to explode when linking:
gcc -shared -Wl,-h -Wl,libmp3lame.so.0 -o .libs/libmp3lame.so.0.0.0 .libs/VbrTag.o .libs/bitstream.o .libs/encoder.o .libs/fft.o .libs/gain_analysis.o .libs/id3tag.o .libs/lame.o .libs/newmdct.o .libs/presets.o .libs/psymodel.o .libs/quantize.o .libs/quantize_pvt.o .libs/reservoir.o .libs/set_get.o .libs/tables.o .libs/takehiro.o .libs/util.o .libs/vbrquantize.o .libs/version.o .libs/mpglib_interface.o -Wl,-z -Wl,allextract ../libmp3lame/i386/.libs/liblameasmroutines.a ../libmp3lame/vector/.libs/liblamevectorroutines.a ../mpglib/.libs/libmpgdecoder.a -Wl,-z -Wl,defaultextract -lm -lsocket -lnsl -lc -maccumulate-outgoing-args
Text relocation remains referenced
against symbol offset in file
0x6e ../libmp3lame/i386/.libs/liblameasmroutines.a(choose_table.o)
0x75 ../libmp3lame/i386/.libs/liblameasmroutines.a(choose_table.o)
0x9a ../libmp3lame/i386/.libs/liblameasmroutines.a(choose_table.o)
0xa1 ../libmp3lame/i386/.libs/liblameasmroutines.a(choose_table.o)
0xa8 ../libmp3lame/i386/.libs/liblameasmroutines.a(choose_table.o)
0x12b ../libmp3lame/i386/.libs/liblameasmroutines.a(choose_table.o)
0x133 ../libmp3lame/i386/.libs/liblameasmroutines.a(choose_table.o)
0x1a0 ../libmp3lame/i386/.libs/liblameasmroutines.a(choose_table.o)
0x1aa ../libmp3lame/i386/.libs/liblameasmroutines.a(choose_table.o)
0x1b4 ../libmp3lame/i386/.libs/liblameasmroutines.a(choose_table.o)
0x1c2 ../libmp3lame/i386/.libs/liblameasmroutines.a(choose_table.o)
0x24c ../libmp3lame/i386/.libs/liblameasmroutines.a(choose_table.o)
0x25d ../libmp3lame/i386/.libs/liblameasmroutines.a(choose_table.o)
0x39 ../libmp3lame/i386/.libs/liblameasmroutines.a(fft3dn.o)
0x56 ../libmp3lame/i386/.libs/liblameasmroutines.a(fft3dn.o)
0x128 ../libmp3lame/i386/.libs/liblameasmroutines.a(fft3dn.o)
0x142 ../libmp3lame/i386/.libs/liblameasmroutines.a(fft3dn.o)
0x26e ../libmp3lame/i386/.libs/liblameasmroutines.a(fft3dn.o)
0x2b9 ../libmp3lame/i386/.libs/liblameasmroutines.a(fft3dn.o)
0x2d6 ../libmp3lame/i386/.libs/liblameasmroutines.a(fft3dn.o)
0x398 ../libmp3lame/i386/.libs/liblameasmroutines.a(fft3dn.o)
0x4ce ../libmp3lame/i386/.libs/liblameasmroutines.a(fft3dn.o)
0x2c ../libmp3lame/i386/.libs/liblameasmroutines.a(fftsse.o)
0x7a ../libmp3lame/i386/.libs/liblameasmroutines.a(fftsse.o)
0x88 ../libmp3lame/i386/.libs/liblameasmroutines.a(fftsse.o)
0xc4 ../libmp3lame/i386/.libs/liblameasmroutines.a(fftsse.o)
0xd9 ../libmp3lame/i386/.libs/liblameasmroutines.a(fftsse.o)
0xe7 ../libmp3lame/i386/.libs/liblameasmroutines.a(fftsse.o)
0x1d0 ../libmp3lame/i386/.libs/liblameasmroutines.a(fftsse.o)
0x1e4 ../libmp3lame/i386/.libs/liblameasmroutines.a(fftsse.o)
0x20b ../libmp3lame/i386/.libs/liblameasmroutines.a(fftsse.o)
0x219 ../libmp3lame/i386/.libs/liblameasmroutines.a(fftsse.o)
t1l 0x189 ../libmp3lame/i386/.libs/liblameasmroutines.a(choose_table.o)
largetbl 0xde ../libmp3lame/i386/.libs/liblameasmroutines.a(choose_table.o)
largetbl 0x105 ../libmp3lame/i386/.libs/liblameasmroutines.a(choose_table.o)
largetbl 0x10f ../libmp3lame/i386/.libs/liblameasmroutines.a(choose_table.o)
table23 0x245 ../libmp3lame/i386/.libs/liblameasmroutines.a(choose_table.o)
table56 0x256 ../libmp3lame/i386/.libs/liblameasmroutines.a(choose_table.o)
ld: fatal: relocations remain against allocatable but non-writable sections
collect2: ld returned 1 exit status
The way to fix the problem is to use NASM 2.08 or earlier, or wait until the bug gets fixed (although they might point their finger at LAME). I’m going to try yasm instead of nasm and see if that works, as an alternative.
If you don’t care about TEXTRELs, on Linux you don’t have to do anything (GNU ld allows them by default), but on Solaris you can tell the Solaris linker to allow impure text segments by adding "-mimpure-text -lrt" to your LDFLAGS. Or, you can use the GNU linker. This is quite hard, but I wrote a blog post about it.
Add comment March 26th, 2011
Using the GNU ld Linker on Solaris
On Solaris, GCC by default is compiled with the option --with-ld=/usr/ccs/bin/ld, telling it to use the Solaris linker. Unfortunately GCC uses this value above all else, meaning it will ignore LD= environment variables to set an alternative linker, such as /usr/sfw/bin/gld.
Although tools like libtool/autoconf will pick up your LD= environment variable, and detect which options the linker supports (and whether it’s GNU ld or not), libtool unfortunately still calls gcc for the linking stage, which then ignores LD=. This makes it near-impossible to use GNU ld without actually doing a nasty hack, like “mv /usr/ccs/bin/ld /usr/ccs/bin/ld.off ; ln -s /usr/sfw/bin/gld /usr/ccs/bin/ld”. Yuck!
However, today when trying to get lame to compile using nasm (which generates objects that refuse to link with Solaris LD), I found Solaris LD accepts a very useful environment variable. The variable is LD_ALTEXEC.
Solaris LD will actually re-exec the value of LD_ALTEXEC, meaning that if you set LD_ALTEXEC to /usr/sfw/bin/gld, when /usr/ccs/bin/ld gets called, it immediately instead calls /usr/sfw/bin/gld with the arguments passed on. Thus, you can use whatever linker you wish. Hurrah! :-)
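In practice that boils down to something like the following sketch. It assumes GNU ld is installed at /usr/sfw/bin/gld as above; the build commands in the comment are placeholders for whatever you are actually compiling.

```shell
#!/bin/bash
# Tell the Solaris linker to hand over to GNU ld.
# Assumption: GNU ld lives at /usr/sfw/bin/gld; adjust for your system.
LD_ALTEXEC=/usr/sfw/bin/gld
export LD_ALTEXEC

# Then build as usual from the same shell. gcc still invokes
# /usr/ccs/bin/ld, which re-execs the linker named in LD_ALTEXEC.
# Placeholder build step:
# ./configure && make
```

Because it is an ordinary environment variable, it only affects builds started from that shell, which is a lot less scary than renaming the system linker.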
Add comment March 25th, 2011
Building IPS / pkg5 on Solaris 10
IPS/pkg5 is the native package manager on OpenSolaris, and thus by extension on OpenIndiana (the OpenSolaris fork I started last year). Over the past 6 months I’ve become familiar with IPS, and I can honestly say I’ve fallen in love with it. It’s very powerful, useful and fairly easy to use (if you forgive its obscure error messages).
It was developed from scratch to be cross-platform, allowing Sun to deliver packages to other systems such as Linux, AIX, and Solaris 10. I decided it might be a good idea for us to roll it out on our Solaris 10 cloud for use with managing software. Our clients love the power of Solaris 10, but they sure do hate the lack of native package management, so IPS could really be a big win for us.
But boy, is getting it working on Solaris 10 no easy task. IPS itself is written mostly in Python, however the dependency list is huge, and some of the packages are a real pain to compile. The IPS build system also makes a few assumptions that aren’t correct on Solaris 10 which complicated things.
Whilst other guides, such as this one here, bypass a lot of the problems by using OpenCSW/Blastwave packages, I wanted a little self-contained "/opt/pkg" directory with its own Python install and any dependencies. The whole point of my deployment of IPS is to get away from OpenCSW/Blastwave and friends, which introduce a whole other stack of software you have to keep up to date.
While I don’t have time to go into the build process in detail, I can offer some hints to help out.
I found I had to build the following packages (the ordering here is completely incorrect, sorry):
gettext expat rarian intltool python2.6 setuptools swig pyOpenSSL gnome-doc-utils libxml2 libxml2-python libxslt
You’ll want to skip building the gui tools, update manager and the brand stuff, so in pkg-gate/src/Makefile, change the SUBDIRS variable as such:
#SUBDIRS=web gui um po util/misc brand
SUBDIRS=web util/misc
Also remember to set PYTHON= to your new python.
I had to patch M2Crypto - it uses SWIG to generate Python bindings, and assumes ENGINE_load_openssl is present in the OpenSSL library. When running pkg I was getting:
ImportError: ld.so.1: python2.6: fatal: relocation error: file /opt/pkg/python26/lib/python2.6/site-packages/M2Crypto/__m2crypto.so: symbol ENGINE_load_openssl: referenced symbol not found
This is because the Solaris 10 OpenSSL install is missing the ENGINE_load_openssl function - it has been yanked out for crypto export reasons (that now probably don’t apply as OpenSolaris contains it). I removed references to it, and managed to coerce it to work. The patches for M2Crypto are here:
# pwd
/root/pkg-gate/src/patch/M2Crypto
# cat pkg-gate_m2c.patch
--- SWIG/_engine.i.orig 2011-01-22 23:32:17.583271086 +0000
+++ SWIG/_engine.i 2011-01-22 23:32:50.478960838 +0000
@@ -26,9 +26,6 @@
%rename(engine_load_dynamic) ENGINE_load_dynamic;
extern void ENGINE_load_dynamic(void);
-%rename(engine_load_openssl) ENGINE_load_openssl;
-extern void ENGINE_load_openssl(void);
-
%rename(engine_cleanup) ENGINE_cleanup;
extern void ENGINE_cleanup(void);
# cat setup.patch
--- setup.py.orig 2011-01-22 23:49:21.466821165 +0000
+++ setup.py 2011-01-22 23:49:32.286055614 +0000
@@ -40,7 +40,7 @@
self.openssl = 'c:\\pkg'
else:
self.libraries = ['ssl', 'crypto']
- self.openssl = '/usr'
+ self.openssl = '/usr/sfw'
def finalize_options(self):
Lastly some tips - the NetBSD pkgsrc system contains useful patches for getting some of the above dependencies to compile on Solaris 10. I can’t remember which ones I used but it did come in handy. And don’t forget about your CFLAGS/LDFLAGS/PATH. I also found I had to temporarily rename Solaris patch to patch.off and symlink gpatch to get pkg5 to auto-patch M2Crypto as it assumes GNU flags. You may also need to add -lintl and -lsocket at some point during the dependency build process to your LDFLAGS (I can’t remember where).
I’m delighted to have pkg5 working on Solaris 10 now. I’ll report back at a later date how I’m getting on. For those that want to cheat, I have a tar’d version you can stick at /opt/pkg here. It’s a strange layout - forgive me. And keep in mind, I haven’t tried it much yet.
1 comment January 23rd, 2011
Obtaining the serial number for disks on LSI RAID cards via CentOS Linux
This is just a quick reminder for myself basically. To get the serial number of the disks of a CentOS system, you can do:
yum install lsscsi sg3_utils
modprobe sg
/usr/bin/lsscsi -g
smartctl -a /dev/sg0
Unfortunately I couldn’t find a way to see the serial number via lsiutil, however lsiutil is still very useful.
Add comment November 17th, 2010
Weird OpenSolaris/Crossbow issue with Aggregations and VLANs/VNICs
Update: Seems this one is already known about in defect 15870 and bug 6950788. Should have googled/looked myself.
Another interesting issue. In this issue, I had a server, and when I pinged it from other hosts on the same network, I was getting duplicate ping responses:
64 bytes from 10.1.0.1: icmp_seq=1640 ttl=255 time=0.074 ms
64 bytes from 10.1.0.1: icmp_seq=1640 ttl=255 time=0.079 ms (DUP!)
64 bytes from 10.1.0.1: icmp_seq=1641 ttl=255 time=0.081 ms
64 bytes from 10.1.0.1: icmp_seq=1641 ttl=255 time=0.087 ms (DUP!)
64 bytes from 10.1.0.1: icmp_seq=1642 ttl=255 time=0.079 ms
64 bytes from 10.1.0.1: icmp_seq=1642 ttl=255 time=0.082 ms (DUP!)
Something not quite right there. The setup was:
network-switch
   |      |
  igb0   igb1
    \__  __/
       \/
     aggr0
       |--- aggr0vlan1
       |--- aggr0vlan2
       |--- aggr0vlan3
       ...
In the diagram above, we have a Cisco 2960G with a Port-Channel set up across 2 nic ports, which are attached to two network interfaces on the server. This is an LACP ethernet aggregation, used for providing extra bandwidth and/or redundancy to a server.
On OpenSolaris b134 (and on OpenIndiana b147), the aggr0 interface was created with:
dladm create-aggr -l igb0 -l igb1 -P L4 -L active aggr0
I then had a collection of VNICs, provisioned off the aggregation, for example:
dladm create-vnic -l aggr0 -v 1 aggr0vlan1
dladm create-vnic -l aggr0 -v 2 aggr0vlan2
dladm create-vnic -l aggr0 -v 3 aggr0vlan3
...
I then stuck an IP address on each of these vnics using the usual:
ifconfig aggr0vlan1 plumb 10.1.0.1/16 up
ifconfig aggr0vlan2 plumb 10.2.0.1/16 up
ifconfig aggr0vlan3 plumb 10.3.0.1/16 up
...
There were no other interfaces configured, nothing else fancy at all.
When this was all set up, I got the duplicate ICMP ping packets. Very odd. I used snoop to track things down, and on the ICMP sender, it sent 1 ICMP packet, and received two back. When I snooped the server configured here, on aggr0, it was receiving two ICMP packets, hence sending two replies.
The strangest thing was, this issue would only occur at boot time! If I deleted all the VNICs, then set them up from scratch, no duplicate packets. If I rebooted the box, they came back. Weird!
So I simplified things down to one VNIC, aggr0vlan1. I rebooted, no duplicate packets. So I configured a second, rebooted. Then the duplicate packets were back.
Looks like a bug in Crossbow or the IGB driver to me. I tested on OpenIndiana b147 which had a 6 months newer kernel than OpenSolaris b134, but this didn’t fix the issue.
Then this morning, on the bus on the way to work (all of the above took place yesterday), I remembered that Crossbow has two internal constructs for doing vlan tagged virtual interfaces - a “vnic” interface with vlan tagging enabled (what I was using above), and a “vlan” interface. The syntax for the two commands is virtually identical ("dladm create-vnic -l link0 -v 1 link0vlan1" vs "dladm create-vlan -l link0 -v 1 link0vlan1"), and as far as I’m aware they should be logically the same, but I remember from a previous LOSUG one of the Oracle engineers mentioning that the implementations are different inside the kernel.
So instead of creating a vnic, I tried again with a vlan. BOOM! Fixed! No duplicate packets!
Very weird indeed. Glad I was able to work around the issue, but it did consume a fair whack of time. Perhaps someone with a bit more knowledge of Crossbow might be able to shed some light on this.
Add comment October 21st, 2010
“Could not stat /dev/sda1” when installing Citrix XenServer
If you get this error when installing Citrix XenServer…
Could not stat /dev/sda1 --- No such file or directory
… then I have a solution for you! Basically it’s caused by a race condition - the installer creates the partition table, but then immediately attempts to create a filesystem on /dev/sda1 before the kernel has caught up with the partition table change.
You can fix the issue by:
1. Hit "Alt-F2" to get a console
2. Use "ps -ef" to get the pid of the installer, and "kill -9" it
3. Type "vi /etc/inittab"
4. Change the line under "# Start the installer on the console" to read:
tty1::respawn:/opt/xensource/installer/preinit
(This causes the installer to be respawned upon death - useful for debugging if things go wrong without requiring a reboot)
5. Type "vi /opt/xensource/installer/backend.py"
6. Near the top add "from time import sleep" under one of the import statements.
7. Near line 565, under the "def createDom0DiskFilesystems(…" bit, add a new line with "sleep(10)" in it:
def createDom0DiskFilesystems(disk, primary_partnum):
    sleep(10)
    rc, err = util.runCmd2....
Be sure to match the indentation of the lines below, as Python uses indentation as a part of its syntax.
8. Type "kill -HUP 1" to reload the inittab. The installer should respawn on tty1 - simply press “Alt-F1” to get to it. Perform the install.
This simple fix works by adding a sleep statement before the bit that creates a filesystem. Yay!
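A fixed sleep is the crudest cure for a race like this; polling until the device node actually appears is a bit more robust. The installer itself is Python, so the following is only a shell sketch of the idea, handy for checking from the Alt-F2 console. The name wait_for_path is made up, and I haven’t tried wiring this into the installer.

```shell
#!/bin/bash
# wait_for_path PATH [TIMEOUT_SECS]
# Hypothetical helper: returns 0 as soon as PATH exists, or 1 if it has
# not appeared after TIMEOUT_SECS seconds (default 10).
wait_for_path() {
    path=$1
    remaining=${2:-10}
    while [ ! -e "$path" ]; do
        if [ "$remaining" -le 0 ]; then
            return 1
        fi
        sleep 1
        remaining=$((remaining - 1))
    done
    return 0
}

# Example: wait up to 10 seconds for the new partition to show up.
# wait_for_path /dev/sda1 10 && echo "partition is there"
```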
2 comments October 20th, 2010
