Archive for March, 2010

Solaris iSCSI Initiator & Reboots

We use Solaris Zones, with each zone stored on its own zpool. Each zpool lives on a SAN and is accessed via iSCSI. We’ve been doing this since Solaris 10 update 6, but Solaris 10 update 8 introduced an interesting issue.

When we asked an S10u8 box to reboot, it sat there for 10 minutes shutting down. Why? Because it was trying to stop the iSCSI initiator whilst there were live iSCSI filesystems in use. Duh! Stupid Solaris.

So I compared the iSCSI initiator manifests from S10u7 and S10u8, and it has changed in a few places. It used to depend on svc:/network/physical and svc:/system/metainit, and now it depends on svc:/network/service and svc:/network/loopback. However, the biggest change was the stop timeout: it was upped from 5 seconds to 600 seconds. Yes, 10 minutes.
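If you want to make the same comparison yourself, here is a rough sketch that pulls the stop timeout out of a manifest file. It assumes the manifest declares its stop method as an exec_method element with a timeout_seconds attribute; the function name stop_timeout and the two file names in the usage comment are made up for illustration.

```shell
# Print the timeout_seconds value of the 'stop' exec_method in an SMF manifest.
# Sketch only: assumes attributes are quoted and the element is not commented out.
stop_timeout() {
    # Flatten the file, put each element on its own line, then pick out
    # the exec_method whose name is "stop" and extract its timeout.
    tr '\n' ' ' < "$1" |
        tr '<' '\n' |
        grep '^exec_method' |
        grep "name=[\"']stop[\"']" |
        sed -n "s/.*timeout_seconds=[\"']\([0-9]*\)[\"'].*/\1/p"
}

# Usage, with hypothetical saved copies of the two manifests:
#   stop_timeout s10u7-iscsi-initiator.xml
#   stop_timeout s10u8-iscsi-initiator.xml
```

On a live box, svcprop -p stop/timeout_seconds followed by the initiator's FMRI should report the value SMF is actually using, which is a useful cross-check against the manifest.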

This highlighted an interesting problem: on earlier releases, Solaris had always tried to stop the iSCSI initiator with live filesystems on it during a reboot. It simply gave up after 5 seconds, and the box came down anyway.

Rather than hack the timeout value back to 5 seconds, I decided to investigate and see if I could add a dependency to fix this properly. My plan was to make the svc:/system/filesystem/local service depend on the iSCSI initiator service. The theory here was that filesystem/local mounts and unmounts the ZFS filesystems, so if it depends on the initiator, the initiator won’t be stopped before the ZFS filesystems are unmounted.

Unfortunately this didn’t work. Somewhere in the enormous SMF dependency tree, I ended up with a cycle, and upon boot services wouldn’t come up. At this point, I gave up and set the timeout back to 5 seconds.
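For reference, one way to put the timeout back without editing the manifest by hand is to set the property with svccfg and refresh the service. This is a minimal sketch rather than a record of exactly what I ran: the helper name set_stop_timeout and the DRY_RUN switch are made up, it assumes the stop method is defined at the service level, and the initiator's FMRI is passed in because it can differ between releases (check with svcs -a | grep iscsi).

```shell
# Set the stop-method timeout of an SMF service and refresh it.
# With DRY_RUN set to anything non-empty, the commands are printed, not run.
set_stop_timeout() {
    fmri=$1
    seconds=$2
    run=${DRY_RUN:+echo}
    # timeout_seconds is a 'count' property in the 'stop' property group.
    $run svccfg -s "$fmri" setprop stop/timeout_seconds = count: "$seconds"
    # The running service only picks up the change after a refresh.
    $run svcadm refresh "$fmri"
}

# Example (the FMRI here is an assumption, substitute your own):
#   DRY_RUN=1 set_stop_timeout svc:/network/iscsi_initiator 5
```

Doing a dry run first makes it easy to eyeball the exact commands before touching the repository.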

If I can find the time, I’ll try and reproduce this issue on OpenSolaris, then file it on defects.opensolaris.org. After it’s been accepted, I’ll escalate it against our Solaris 10 premium support contract, and see if Sun will actually fix something for us.

3 comments March 23rd, 2010

Upgrading OpenSolaris snv_12* to snv132+

Just a quick post. If you’re upgrading an OpenSolaris host on the dev branch and get this error:

# beadm create snv134
# beadm mount snv134 /mnt
# pkg -R /mnt install entire@0.5.11-0.134
Creating Plan
pkg: Cannot remove 'pkg://opensolaris.org/SUNWgnome-a11y-libs-python24@0.5.11,5.11-0.127:20091111T055042Z' due to the following packages that depend on it:
  pkg://opensolaris.org/SUNWgnome-a11y-reader@0.5.11,5.11-0.127:20091111T055202Z

Then do this to resolve:

# beadm umount snv134
# beadm destroy snv134
# pkg uninstall SUNWgnome-a11y-reader
PHASE                                        ACTIONS
Removal Phase                                346/346
# beadm create snv134
# beadm mount snv134 /mnt

You might then get this new error:

# pkg -R /mnt install entire@0.5.11-0.134
Creating Plan
pkg: Cannot remove 'pkg://opensolaris.org/SUNWipkg-gui-l10n@0.5.11,5.11-0.127:20091111T075414Z' due to the following packages that depend on it:
  pkg://opensolaris.org/SUNWipkg-gui@0.5.11,5.11-0.127:20091111T075333Z

Which is easily fixed with:

# beadm umount snv134
# beadm destroy snv134
# pkg uninstall SUNWipkg-gui
PHASE                                        ACTIONS
Removal Phase                                251/251
# beadm create snv134
# beadm mount snv134 /mnt
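Both rounds follow the same pattern: the package to uninstall is the one named on the indented line of the error. If you hit more of these, a small filter can pull that name out of the pkg output for you. This is only a sketch that assumes the error keeps the layout shown in this post; blocking_pkgs is a made-up name.

```shell
# Read pkg output on stdin and print the short name of each package
# listed on an indented "pkg://publisher/name@version" line.
blocking_pkgs() {
    sed -n 's|^[[:space:]]*pkg://[^/]*/\([^@]*\)@.*|\1|p'
}

# Usage:
#   pkg -R /mnt install entire@0.5.11-0.134 2>&1 | blocking_pkgs
```

The name it prints is what you would pass to pkg uninstall before recreating the boot environment.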

Then it should all work nicely:

# pkg -R /mnt install entire@0.5.11-0.134
DOWNLOAD                                  PKGS       FILES    XFER (MB)
Completed                              1523/1523 120934/120934 1152.0/1152.0

PHASE                                        ACTIONS
Removal Phase                            144368/144368
Install Phase                            174332/174332
Update Phase                               1592/1592

(You can also ignore errors like this:)

Removal Phase                            1156/144368
driver (tl) clone permission update failed with return code 252
command run was: /usr/sbin/update_drv -b /mnt -d -m ticlts 0666 root sys clone
command output was:
------------------------------------------------------------
No entry found for driver (clone) in file (/mnt/etc/minor_perm).
------------------------------------------------------------

Afterwards, though, you might get errors related to /dev/ptmx when logging in via SSH. Log in via the console or ILOM and run “chmod 777 /dev/ptmx” to fix it.

Add comment March 22nd, 2010

Installing the memcached Ruby Gem on Solaris

Gosh this one was quite hard. I was getting errors such as:

rlibmemcached_wrap.c:2074: error: syntax error before ‘bool’
rlibmemcached_wrap.c: In function ‘SWIG_AsVal_bool’:
rlibmemcached_wrap.c:2076: error: ‘obj’ undeclared (first use in this function)
rlibmemcached_wrap.c:2076: error: (Each undeclared identifier is reported only once
rlibmemcached_wrap.c:2076: error: for each function it appears in.)
rlibmemcached_wrap.c:2077: error: ‘val’ undeclared (first use in this function)
rlibmemcached_wrap.c:2077: error: ‘true’ undeclared (first use in this function)
rlibmemcached_wrap.c:2080: error: ‘false’ undeclared (first use in this function)

So to solve this I basically followed these helpful instructions from Nick Sellen:

Nick Sellen says (January 27, 2010):

I had trouble installing it on my Solaris 10 with 32bit / gcc compiled ruby but managed it with a few modifications to extconf.rb:

1. added "--disable-64bit" to the libmemcached configure arguments
2. added "-std=gnu99" to CFLAGS (the rlibmemcached_wrap.c compilation was failing without that)
3. added an extra -R path for ext/lib - not sure if this was needed actually
4. recreated the rlibmemcached_wrap.c with swig (it removed a bunch of methods, not sure if this will bite me later)
5. added three extra libraries "-lnsl -lsocket -lposix4" to resolve a "symbol getaddrinfo: referenced symbol not found" relocation error with rlibmemcached.so (might only need libsocket)

You might also want to view the extconf.rb modifications directly.

The swig step basically involves downloading, compiling and installing swig to somewhere like /opt/swig, then doing “export SWIG=true” in your shell.
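As a rough sketch of that last part: the helper below checks that a swig binary exists under the install prefix, puts it on PATH and sets SWIG=true. Putting swig on PATH is my assumption about how the gem build finds it, and use_swig is a made-up name.

```shell
# Make a swig installed under PREFIX (default /opt/swig) visible to the
# gem build in the current shell, and set SWIG=true.
use_swig() {
    prefix=${1:-/opt/swig}
    if [ ! -x "$prefix/bin/swig" ]; then
        echo "no swig binary under $prefix/bin" >&2
        return 1
    fi
    PATH="$prefix/bin:$PATH"
    SWIG=true
    export PATH SWIG
}

# Usage, in the shell where you then run the gem install:
#   use_swig /opt/swig
```

Because it changes the environment of the current shell, it has to be defined and called there rather than run as a separate script.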

3 comments March 3rd, 2010