Alasdair on Everything

Posts filed under 'General'

The case for RAIDZ2

We have an old x4500 knocking around which is getting on for 3 years old now. At the beginning of last month, we did a scrub, and to our horror discovered checksum errors on almost all the drives:

  pool: pool01
 state: DEGRADED
status: One or more devices has experienced an unrecoverable error.  An
        attempt was made to correct the error.  Applications are unaffected.
action: Determine if the device needs to be replaced, and clear the errors
        using 'zpool clear' or replace the device with 'zpool replace'.
   see: http://www.sun.com/msg/ZFS-8000-9P
 scrub: scrub completed after 23h0m with 0 errors on Wed Mar  3 12:55:36 2010
config:

        NAME         STATE     READ WRITE CKSUM
        pool01       DEGRADED     0     0     0
          raidz1-0   ONLINE       0     0     0
            c11t3d0  ONLINE       0     0     4  2.50K repaired
            c10t3d0  ONLINE       0     0     0
            c13t3d0  ONLINE       0     0     4  1.50K repaired
            c7t1d0   ONLINE       0     0     0
            c8t3d0   ONLINE       0     0     5  1K repaired
            c7t3d0   ONLINE       0     0     4  2K repaired
            c10t2d0  ONLINE       0     0     3  1K repaired
            c13t2d0  ONLINE       0     0     2  1K repaired
            c11t6d0  ONLINE       0     0     3  1K repaired
            c8t2d0   ONLINE       0     0    16  7K repaired
            c7t2d0   ONLINE       0     0     4  2.50K repaired
          raidz1-1   DEGRADED     0     0     0
            c11t7d0  ONLINE       0     0     6  64K repaired
            c10t7d0  DEGRADED     0     0    58  too many errors
            c13t7d0  ONLINE       0     0     4  3.50K repaired
            c12t7d0  ONLINE       0     0     3  7K repaired
            c8t7d0   ONLINE       0     0     2  4.50K repaired
            c7t7d0   ONLINE       0     0     4  11.5K repaired
            c10t6d0  ONLINE       0     0     4  11K repaired
            c13t6d0  ONLINE       0     0     8  86K repaired
            c12t6d0  ONLINE       0     0     0
            c8t6d0   ONLINE       0     0     2  1K repaired
            c7t6d0   ONLINE       0     0     2  2.50K repaired
          raidz1-2   DEGRADED     0     0     0
            c11t5d0  ONLINE       0     0     1  9K repaired
            c10t5d0  ONLINE       0     0     1  13K repaired
            c13t5d0  ONLINE       0     0     2  1.50K repaired
            c12t5d0  ONLINE       0     0     1  1K repaired
            c8t5d0   DEGRADED     0     0   135  too many errors
            c7t5d0   ONLINE       0     0     2  1.50K repaired
            c10t4d0  ONLINE       0     0     8  44K repaired
            c13t4d0  ONLINE       0     0     3  5K repaired
            c12t4d0  ONLINE       0     0     3  2K repaired
            c8t4d0   ONLINE       0     0     2  6.50K repaired
            c7t4d0   ONLINE       0     0     2  13.5K repaired

errors: No known data errors

Thankfully it’s not used for production, so this didn’t bother us a huge amount. ZFS repaired the data errors without issue (hurrah for ZFS!), and we have been replacing the worst affected disks. We’re now doing weekly scrubs to keep the data “fresh” and stop it rotting away.
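
If you want to pick out the affected disks from output like the above without squinting, something along these lines might do. It is only a sketch: the function name cksum_errors is made up for illustration, and it assumes the cXtYdZ device naming and column layout shown in the status output above.

```shell
# List devices whose CKSUM column is non-zero in `zpool status` output.
# Reads the status text on stdin and prints "device count", one per line.
# Assumes cXtYdZ-style device names and the NAME/STATE/READ/WRITE/CKSUM layout.
cksum_errors() {
    awk '$1 ~ /^c[0-9]+t[0-9]+d[0-9]+$/ && $5 + 0 > 0 { print $1, $5 }'
}
```

Usage would be something like "zpool status pool01 | cksum_errors". A weekly scrub can be driven from root's crontab with an entry such as "0 3 * * 0 /usr/sbin/zpool scrub pool01" (the schedule here is just an example).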

However, one interesting issue cropped up. We’re using RAIDZ1, which only stores enough parity for one disk to be out of service. Since ZFS uses the parity data to reconstruct blocks with checksum errors, if you’re one disk down and hit a block with a checksum error, you’re in trouble - ZFS can’t repair it and your data is corrupted.

So when you replace a failed disk in a RAIDZ1 set, you had better hope you don’t encounter any checksum errors on the other disks during the resilver process. Because ZFS has to read in all the data from the other disks to resilver the new disk, you’re at a high risk of encountering checksum errors, especially in our situation where the disks are wearing out.

And this is precisely what happened next. We replaced a failed disk, and during the resilver, ZFS encountered checksum errors on the other disks it couldn’t repair, and we started to lose data:

  pool: pool01
 state: DEGRADED
status: One or more devices has experienced an error resulting in data
        corruption.  Applications may be affected.
action: Restore the file in question if possible.  Otherwise restore the
        entire pool from backup.
   see: http://www.sun.com/msg/ZFS-8000-8A
 scrub: resilver completed after 15h47m with 219 errors on Sat Apr 10 16:14:59 2010
config:

        NAME         STATE     READ WRITE CKSUM
        pool01       DEGRADED     0     0   331
          raidz1-0   ONLINE       0     0     0
            c11t3d0  ONLINE       0     0     0
            c10t3d0  ONLINE       0     0     0
            c13t3d0  ONLINE       0     0     0
            c8t5d0   ONLINE       0     0     0
            c8t3d0   ONLINE       0     0     0
            c7t3d0   ONLINE       0     0     0
            c10t2d0  ONLINE       0     0     0
            c13t2d0  ONLINE       0     0     0
            c11t6d0  ONLINE       0     0     0
            c8t2d0   ONLINE       0     0     0
            c7t2d0   ONLINE       0     0     0
          raidz1-1   ONLINE       0     0     0
            c11t7d0  ONLINE       0     0     0
            c11t2d0  ONLINE       0     0     0
            c13t7d0  ONLINE       0     0     0
            c12t7d0  ONLINE       0     0     0
            c8t7d0   ONLINE       0     0     1
            c7t7d0   ONLINE       0     0     0
            c10t6d0  ONLINE       0     0     0
            c13t6d0  ONLINE       0     0     0
            c12t6d0  ONLINE       0     0     0
            c8t6d0   ONLINE       0     0     0
            c7t6d0   ONLINE       0     0     0
          raidz1-2   DEGRADED     0     0   888
            c11t5d0  DEGRADED     0     0     0  too many errors
            c10t5d0  DEGRADED     0     0     0  too many errors
            c13t5d0  DEGRADED     0     0     0  too many errors
            c12t5d0  ONLINE       0     0     0  401G resilvered
            c12t3d0  DEGRADED     0     0     0  too many errors
            c7t5d0   DEGRADED     0     0     0  too many errors
            c10t4d0  DEGRADED     0     0     0  too many errors
            c13t4d0  DEGRADED     0     0     0  too many errors
            c12t4d0  DEGRADED     0     0     0  too many errors
            c8t4d0   DEGRADED     0     0     0  too many errors
            c7t4d0   DEGRADED     0     0     0  too many errors

errors: 219 data errors, use '-v' for a list

Ouch! 219 data errors.

Thankfully ZFS knows precisely which files are affected, and you can just delete/replace/restore the affected files/snapshots and it keeps on running.

However, after this, I’m sold on RAIDZ2. I don’t think I’ll be using RAIDZ1 again - the risk of losing data when you’re replacing a failed disk is just too high.
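
For what it’s worth, declaring a RAIDZ2 vdev is just a matter of saying raidz2 instead of raidz1 at pool creation time. Below is a rough sketch that only prints the command rather than running it; the helper name raidz2_create_cmd and the pool/disk names in the usage line are hypothetical, and it builds a single vdev only.

```shell
# Print (not run) a zpool command that would create one RAIDZ2 vdev.
# $1 is the pool name; the remaining arguments are the member disks.
raidz2_create_cmd() {
    pool=$1
    shift
    # -n asks zpool for a dry run: show the layout without creating anything.
    echo "zpool create -n $pool raidz2 $*"
}
```

For example, "raidz2_create_cmd tank c1t0d0 c1t1d0 c1t2d0 c1t3d0" prints a dry-run command you can eyeball before dropping the -n and running it for real.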

April 10th, 2010

Upgrading OpenSolaris snv_12* to snv132+

Just a quick post. If you’re upgrading an OpenSolaris host on the dev branch and get this error:

# beadm create snv134
# beadm mount snv134 /mnt
# pkg -R /mnt install entire@0.5.11-0.134
Creating Plan -pkg: Cannot remove 'pkg://opensolaris.org/SUNWgnome-a11y-libs-python24@0.5.11,5.11-0.127:20091111T055042Z' due to the following packages that depend on it:
  pkg://opensolaris.org/SUNWgnome-a11y-reader@0.5.11,5.11-0.127:20091111T055202Z

Then do this to resolve:

# beadm umount snv134
# beadm destroy snv134
# pkg uninstall SUNWgnome-a11y-reader
PHASE                                        ACTIONS
Removal Phase                                346/346
# beadm create snv134
# beadm mount snv134 /mnt

You might then get this new error:

# pkg -R /mnt install entire@0.5.11-0.134
Creating Plan \pkg: Cannot remove 'pkg://opensolaris.org/SUNWipkg-gui-l10n@0.5.11,5.11-0.127:20091111T075414Z' due to the following packages that depend on it:
  pkg://opensolaris.org/SUNWipkg-gui@0.5.11,5.11-0.127:20091111T075333Z

Which is easily fixed with:

# beadm umount snv134
# beadm destroy snv134
# pkg uninstall SUNWipkg-gui
PHASE                                        ACTIONS
Removal Phase                                251/251
# beadm create snv134
# beadm mount snv134 /mnt

Then it should all work nicely:

# pkg -R /mnt install entire@0.5.11-0.134
DOWNLOAD                                  PKGS       FILES    XFER (MB)
Completed                              1523/1523 120934/120934 1152.0/1152.0

PHASE                                        ACTIONS
Removal Phase                            144368/144368
Install Phase                            174332/174332
Update Phase                               1592/1592

(You can also ignore errors like this:)

Removal Phase                            1156/144368
driver (tl) clone permission update failed with return code 252
command run was: /usr/sbin/update_drv -b /mnt -d -m ticlts 0666 root sys clone
command output was:
------------------------------------------------------------
No entry found for driver (clone) in file (/mnt/etc/minor_perm).
------------------------------------------------------------

Afterwards, though, you might get errors related to /dev/ptmx when logging in via SSH. Log in via the console/ILOM and run “chmod 777 /dev/ptmx” to fix it.
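
If you hit more than a couple of these dependency errors, it can be handy to pull the blocking package name straight out of the pkg output. A minimal sketch, assuming the error format shown above (an indented pkg:// FMRI per dependent package); the function name blocking_packages is made up for illustration.

```shell
# Read `pkg install` error output on stdin and print the short names of the
# packages listed as dependents (the indented pkg:// lines).
blocking_packages() {
    sed -n 's|^[[:space:]]*pkg://[^/]*/\([^@]*\)@.*$|\1|p'
}
```

Something like "pkg -R /mnt install entire@0.5.11-0.134 2>&1 | blocking_packages" should then print the name to hand to "pkg uninstall".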

March 22nd, 2010

Installing the memcached Ruby Gem on Solaris

Gosh this one was quite hard. I was getting errors such as:

rlibmemcached_wrap.c:2074: error: syntax error before ‘bool’
rlibmemcached_wrap.c: In function ‘SWIG_AsVal_bool’:
rlibmemcached_wrap.c:2076: error: ‘obj’ undeclared (first use in this function)
rlibmemcached_wrap.c:2076: error: (Each undeclared identifier is reported only once
rlibmemcached_wrap.c:2076: error: for each function it appears in.)
rlibmemcached_wrap.c:2077: error: ‘val’ undeclared (first use in this function)
rlibmemcached_wrap.c:2077: error: ‘true’ undeclared (first use in this function)
rlibmemcached_wrap.c:2080: error: ‘false’ undeclared (first use in this function)

So to solve this I basically followed these helpful instructions from Nick Sellen:

Nick Sellen says (January 27, 2010):

I had trouble installing it on my Solaris 10 with 32bit / gcc compiled ruby but managed it with a few modifications to extconf.rb:

1. added "--disable-64bit" to the libmemcached configure arguments
2. added "-std=gnu99" to CFLAGS (the rlibmemcached_wrap.c compilation was failing without that)
3. added an extra -R path for ext/lib - not sure if this was needed actually
4. recreated the rlibmemcached_wrap.c with swig (it removed a bunch of methods, not sure if this will bite me later)
5. added three extra libraries "-lnsl -lsocket -lposix4" to resolve a "symbol getaddrinfo: referenced symbol not found" relocation error with rlibmemcached.so (might only need libsocket)

You might also want to view the extconf.rb modifications directly.

The swig step basically involves downloading, compiling and installing swig to somewhere like /opt/swig, then doing “export SWIG=true” in your shell.

3 comments March 3rd, 2010

VLC on Solaris 10

Some helpful chap has compiled up VLC for Solaris 10. Useful!

January 8th, 2010

Installing OpenSolaris/Solaris on a Fasthosts Dedicated Server

EDIT: Turns out that my server had dodgy wiring with the Eric card. Fasthosts fixed this and then I was able to get into the BIOS to change the boot order, rendering the below post rather unnecessary.

I was recently tasked with installing OpenSolaris on a Fasthosts Dedicated Server. Fasthosts Dedicated Servers are cheap and cheerful. I would never put anything important on them, because if the shit hits the fan, you’re on your own. But they are incredibly cheap, so for unimportant bits n pieces, they can make sense.

Unfortunately they only come pre-installed with Windows Server, CentOS or Ubuntu. Being a Solaris advocate, the first thing I wanted to do was kablam them with OpenSolaris.

The boxes rather usefully come with Raritan ERIC remote management cards. These remote management cards provide you with:

  • Keyboard, Video and Mouse remote access
  • Remote power management
  • Virtual CD-Rom

So, installing OpenSolaris should be a piece of cake, right? Sadly, not quite. Fasthosts have either locked down the cards/servers so you can’t go into the BIOS or alter the boot order, or the ERIC KVM cards are deficient in that regard. Regardless of whether I chose PS/2 or USB for the keyboard emulation, pressing F2 or F12 on the BIOS boot screen yielded nothing useful.

Further, I had issues getting the Virtual CD Drive to mount. Rather unfortunately it can only access ISO images via Windows File Sharing. I set up a Samba Server, but the Eric card kept saying "Error accessing image". It turns out your ISO image has to be in a sub-folder, and the path uses backslashes. So I finally got a CD mounted in the end.

Once I had the ISO image mounted, I needed to get the server to boot it. Since we can’t change the boot order, I finally got around it by nuking the MBR of the hard drive. There are actually two hard drives in the Fasthosts box I ordered, so I ran:

# dd if=/dev/zero of=/dev/sda bs=1M count=100
# dd if=/dev/zero of=/dev/sdb bs=1M count=100

I probably only had to do the first 512 bytes, but more doesn’t hurt when you’re wiping the box anyway. Upon rebooting, sure enough, it started booting the OpenSolaris install CD. Magic!
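
For the record, the 512-byte version would look roughly like this. It’s a sketch I’d try on a scratch image file first, not something to paste blindly at a real disk; the function name wipe_mbr is just for illustration.

```shell
# Zero only the first 512-byte sector (the MBR) of the given target.
# conv=notrunc leaves the rest of a regular file (and its length) untouched;
# on a real block device it makes no difference.
wipe_mbr() {
    dd if=/dev/zero of="$1" bs=512 count=1 conv=notrunc 2>/dev/null
}
```

On the box above that would be "wipe_mbr /dev/sda" and "wipe_mbr /dev/sdb". With no boot sector on either drive, the BIOS should fall through to the next boot device, just as it did with the 100MB wipe.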

2 comments December 29th, 2009

Nagios 3.2.0 coredumps when started via SMF on Solaris 10

This one was quite interesting. If you compile your own nagios-3.2.0 from source on Solaris 10, and start it manually, it runs just fine. If you run it via SMF with a service manifest, the process continually dumps core, so you get messages such as:

[ Oct 16 19:24:48 Enabled. ]
[ Oct 16 19:24:48 Executing start method ("/opt/nagios/bin/nagios /opt/nagios/etc/nagios.cfg &") ]
[ Oct 16 19:24:48 Method "start" exited with status 0 ]
[ Oct 16 19:24:49 Stopping because process dumped core. ]
[ Oct 16 19:24:49 Executing stop method (:kill) ]
Successfully shutdown... (PID=29180)
[ Oct 16 19:24:49 Executing start method ("/opt/nagios/bin/nagios /opt/nagios/etc/nagios.cfg &") ]
[ Oct 16 19:24:49 Method "start" exited with status 0 ]
[ Oct 16 19:24:50 Stopping because process dumped core. ]
[ Oct 16 19:24:50 Executing stop method (:kill) ]
Successfully shutdown... (PID=29232)
[ Oct 16 19:24:51 Executing start method ("/opt/nagios/bin/nagios /opt/nagios/etc/nagios.cfg &") ]
[ Oct 16 19:24:51 Method "start" exited with status 0 ]
[ Oct 16 19:24:52 Stopping because process dumped core. ]
[ Oct 16 19:24:52 Executing stop method (:kill) ]
Successfully shutdown... (PID=29246)
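
A quick way to see how often a service has fallen over is to count the core dump messages in its SMF log. A small sketch, assuming the log wording shown above; the function name count_core_dumps is made up, and the log path you pass in depends on how your service is named.

```shell
# Count how many times SMF recorded a core dump in the given service log file.
count_core_dumps() {
    grep -c 'Stopping because process dumped core' "$1"
}
```

Point it at the service's log file under /var/svc/log and it prints a single number.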

So, why does nagios crash when started via SMF? Well, I decided to enable core dumps via coreadm, to find out why. We do this with:

# mkdir /cores
# coreadm -g /cores/core.%f.%p -i /cores/core.%f.%p -e global -e global-setid -e log -e process -e proc-setid
# coreadm
     global core file pattern: /cores/core.%f.%p
     global core file content: all
       init core file pattern: /cores/core.%f.%p
       init core file content: all
            global core dumps: enabled
       per-process core dumps: enabled
      global setid core dumps: enabled
 per-process setid core dumps: enabled
     global core dump logging: enabled

We can then check the core dump with:

# gdb /opt/nagios/bin/nagios /cores/core.nagios.23536
GNU gdb 6.6
Copyright (C) 2006 Free Software Foundation, Inc.
*snip*
Core was generated by `/opt/nagios/bin/nagios /opt/nagios/etc/nagios.cfg'.
Program terminated with signal 11, Segmentation fault.
#0  0xfed3590c in strlen () from /lib/libc.so.1
(gdb) bt
#0  0xfed3590c in strlen () from /lib/libc.so.1
#1  0xfed8eda6 in _ndoprnt () from /lib/libc.so.1
#2  0xfed9192d in fprintf () from /lib/libc.so.1
#3  0x08067c42 in run_async_host_check_3x ()
#4  0x08066f69 in run_scheduled_host_check_3x ()
#5  0x080658d0 in perform_scheduled_host_check ()
#6  0x0807c0e8 in handle_timed_event ()
#7  0x0807bd8c in event_execution_loop ()
#8  0x0805ecaa in main ()
(gdb) quit

Interesting - it’s crashing when the nagios function run_async_host_check_3x calls fprintf. Looks like a null pointer to me. Let’s get the actual line number by installing a nagios binary which has not been stripped of debugging symbols. Thankfully the Nagios Makefile already has a target for this:

# cd /opt/src/nagios-3.2.0
# gmake install-unstripped
cd ./base && gmake install-unstripped
gmake[1]: Entering directory `/opt/src/build/nagios/files/nagios-3.2.0/base'
gmake install-basic
gmake[2]: Entering directory `/opt/src/build/nagios/files/nagios-3.2.0/base'
/opt/sfw/bin/install -c -m 775 -o nagios -g nagios -d /opt/nagios/bin
/opt/sfw/bin/install -c -m 774 -o nagios -g nagios nagios /opt/nagios/bin
*snip*

Now we re-run nagios via SMF, then gdb the latest coredump:

# gdb /opt/nagios/bin/nagios /globalcore/core.nagios.29248
GNU gdb 6.6
Copyright (C) 2006 Free Software Foundation, Inc.
*snip*
Core was generated by `/opt/nagios/bin/nagios /opt/nagios/etc/nagios.cfg'.
Program terminated with signal 11, Segmentation fault.
#0  0xfed3590c in strlen () from /lib/libc.so.1
(gdb) bt
#0  0xfed3590c in strlen () from /lib/libc.so.1
#1  0xfed8eda6 in _ndoprnt () from /lib/libc.so.1
#2  0xfed9192d in fprintf () from /lib/libc.so.1
#3  0x08067c42 in run_async_host_check_3x (hst=0x8139b78, check_options=0, latency=0.048000000000000001,
    scheduled_check=1, reschedule_check=1, time_is_valid=0x8047b40, preferred_time=0x8047b48)
    at /opt/src/build/nagios/files/nagios-3.2.0/base/checks.c:3134
#4  0x08066f69 in run_scheduled_host_check_3x (hst=0x8139b78, check_options=0, latency=0.048000000000000001)
    at /opt/src/build/nagios/files/nagios-3.2.0/base/checks.c:2791
#5  0x080658d0 in perform_scheduled_host_check (hst=0x8139b78, check_options=0, latency=0.048000000000000001)
    at /opt/src/build/nagios/files/nagios-3.2.0/base/checks.c:2108
#6  0x0807c0e8 in handle_timed_event (event=0x8133010) at /opt/src/build/nagios/files/nagios-3.2.0/base/events.c:1261
#7  0x0807bd8c in event_execution_loop () at /opt/src/build/nagios/files/nagios-3.2.0/base/events.c:1132
#8  0x0805ecaa in main (argc=134510324, argv=0x8139b78) at nagios.c:849
(gdb) quit

A hah! Now we have a line number. The line in question, line 3134 of checks.c, reads:

fprintf(check_result_info.output_file_fp,"output=%s\n",checkresult_dbuf.buf);

So checkresult_dbuf.buf must be null. I googled, and found someone talking about it on the nagios-devel mailing list. It seems the fix they committed (checking whether checkresult_dbuf.buf is null) has since been reverted or overwritten, as the check is no longer in place in nagios 3.2.0. Not to worry, here’s a patch:

--- base/checks.c.orig  2009-10-16 19:28:42.082321083 +0100
+++ base/checks.c       2009-10-16 19:29:02.197305557 +0100
@@ -3131,7 +3131,7 @@
                                fprintf(check_result_info.output_file_fp,"early_timeout=%d\n",check_result_info.early_timeout);
                                fprintf(check_result_info.output_file_fp,"exited_ok=%d\n",check_result_info.exited_ok);
                                fprintf(check_result_info.output_file_fp,"return_code=%d\n",check_result_info.return_code);
-                               fprintf(check_result_info.output_file_fp,"output=%s\n",checkresult_dbuf.buf);
+                               fprintf(check_result_info.output_file_fp,"output=%s\n",(checkresult_dbuf.buf==NULL)?"(null)":checkresult_dbuf.buf);

                                /* close the temp file */
                                fclose(check_result_info.output_file_fp);

Apply this and you should be all set!

2 comments October 16th, 2009

Compiling Kannel 1.4.3 on Solaris 10

Kannel doesn’t appear to compile with Sun Studio, and I couldn’t be bothered to work out why. Building with the default Solaris gcc 3.4.3 gets further, but fails with:

gcc -std=gnu99 -D_REENTRANT=1 -I. -Igw -g -O2 -D_FILE_OFFSET_BITS=64 -D_LARGE_FILES= -I/usr/include/libxml2 -o wmlscript/wsstream_data.o -c wmlscript/wsstream_data.c
wmlscript/wslexer.c: In function `read_float_from_exp':
wmlscript/wslexer.c:1037: error: syntax error before '||' token
gmake-3.81: *** [wmlscript/wslexer.o] Error 1
gmake-3.81: *** Waiting for unfinished jobs....
No postbuild script
Error: build failed

The issue is that wmlscript/wslexer.c uses "HUGE_VAL", which has a broken definition on Solaris (or at least gcc doesn’t like it).

The solution is to force it to use GCC’s built in HUGE_VAL definition, which you can do with the following patch:

--- wmlscript/wslexer.c.orig    2009-09-18 12:51:52.218499508 +0100
+++ wmlscript/wslexer.c 2009-09-18 12:54:09.811272795 +0100
@@ -1034,7 +1034,7 @@

     /* Check that the generated floating point number fits to
        `float32'. */
-    if (*result == HUGE_VAL || *result == -HUGE_VAL
+    if (*result == __builtin_huge_val() || *result == -__builtin_huge_val()
         || ws_ieee754_encode_single(*result, buf) != WS_IEEE754_OK)
         ws_src_error(compiler, 0, "floating point literal too large");

Happy compiling!

September 18th, 2009

Installing Windows Server 2008 on Citrix XenServer

During the install of Windows Server 2008, the installer might throw a screen up at you insisting you provide it with drivers so it can install Windows.

This screen can’t be bypassed (that I could see), and giving Windows Server 2008 the Citrix XenServer xe-tools.iso image is no good, as the drivers are contained within a .exe. Extracting the drivers on another computer and making your own ISO is no good either - Windows won’t accept those drivers.

Usefully the installer doesn’t even tell you what hardware it wants drivers for. However on a hunch I removed the Network Adapter within Citrix XenServer, and sure enough, after a restart, the installer didn’t ask for any drivers and the install completed successfully.

I’ve had to fight with this drivers screen when installing Windows 7 on my Dell laptop before, and it’s not fun. It just doesn’t provide enough useful information for you to find the drivers it wants to install. Stupid Microsoft. Stupid Windows.

At least it accepts a CD or USB Key for the drivers, which is a vast improvement over the NT/2000/XP/2003 days where you’d need to blow the dust off your 3.5″ floppy drive…

July 29th, 2009

Windows Server 2003

Windows Server 2003 is now over 6 years old. Yet we’re still asked by clients for new Windows Server 2003 installations, despite Windows Server 2008 coming out last year. I find this quite interesting, because Windows Server 2008 is a great product, and IIS 7.0 offers many significant advantages over IIS 6.0 (such as native URL rewriting).

I’d say the biggest driver of this is that people fear the unknown - Server 2008 is somewhat new and people just don’t have the time to try it out. However, the situation in the Windows ecosphere is significantly different to what we encounter in the Linux & Unix world. For example, nobody would dare consider installing a Linux distribution that’s 6 years old.

CentOS first came out with version 2 in May 2004, and Debian 3.0 “Woody” came out in 2002 (there wasn’t another release until 2005). Ubuntu didn’t even come out until late 2004. Distributions of that era shipped with the Linux 2.4 kernel and Apache 1.3 by default. Nobody in their right mind would run any of them today.

So why, then, do people continue to install Windows Server 2003? For the following reasons:

  • Windows Server 2003 was a very strong release
  • Windows Server 2003 meets most people’s requirements
  • .NET 2.0, .NET 3.0 and .NET 3.5 all run fine on Windows Server 2003
  • Microsoft have released a FastCGI module for IIS 6.0, and there are numerous URL Rewrite options for Server 2003
  • People are wary of new Microsoft releases (Take Vista for example)

That’s not to say I approve of installing Windows Server 2003. It goes out of general support in 2010, which is but one year away. Windows Server 2008 is a great product with many fantastic new features built in. But I have a nagging feeling Windows Server 2003 will be with us for a long time to come. It’s just too simple, too clean and too elegant to disappear.

June 16th, 2009

Killing a Solaris 10 Zone stuck in the shutting_down state.

So, you have a Solaris 10 Zone. You’ve run “zoneadm -z zonename shutdown”. It hasn’t quite shut down, and is stuck in the shutting_down state. What can you do to fix it?

Well, sometimes some processes don’t die in a timely fashion. Check what processes are running with the following command:

# ps -fz zonename

If any processes other than zsched are running, kill -9 them. The zone should hopefully shut down.
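
To save picking the PIDs out by hand, a filter along these lines can help. It’s a sketch only: the function name zone_pids_to_kill is made up, and it assumes you feed it "PID COMMAND" pairs such as those printed by ps with -o pid= -o comm=.

```shell
# Read "PID COMMAND" lines on stdin and print the PIDs of every process
# except zsched, which belongs to the zone itself and can't be killed this way.
zone_pids_to_kill() {
    awk '$2 != "zsched" { print $1 }'
}
```

Usage would be something like "ps -z zonename -o pid= -o comm= | zone_pids_to_kill | xargs kill -9". If zsched is the only thing left, the filter prints nothing and there is nothing for kill to do.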

If it doesn’t, and you’re left with zsched as the only remaining process, then potentially you’ve hit a bug, such as bug 6272846 - "User orders zone death; NFS client thumbs nose". This bug has been outstanding since May 2005, so don’t expect a fix any time soon.

Thankfully there are a few more things you can try to kill the damn zone off. Give some of the following a go:

# zoneadm -z zonename unmount -f
# zoneadm -z zonename reboot -- -s
# pkill -9 -z zonename

The above combo should hopefully deliver a fatal blow to your Zone. If not, bitch at Sun. Hopefully they’ll sort their lives out.

June 11th, 2009
