Alasdair on Everything


Nagios 3.2.0 coredumps when started via SMF on Solaris 10

This one was quite interesting. If you compile your own nagios-3.2.0 from source on Solaris 10, and start it manually, it runs just fine. If you run it via SMF with a service manifest, the process continually dumps core, so you get messages such as:

[ Oct 16 19:24:48 Enabled. ]
[ Oct 16 19:24:48 Executing start method ("/opt/nagios/bin/nagios /opt/nagios/etc/nagios.cfg &") ]
[ Oct 16 19:24:48 Method "start" exited with status 0 ]
[ Oct 16 19:24:49 Stopping because process dumped core. ]
[ Oct 16 19:24:49 Executing stop method (:kill) ]
Successfully shutdown... (PID=29180)
[ Oct 16 19:24:49 Executing start method ("/opt/nagios/bin/nagios /opt/nagios/etc/nagios.cfg &") ]
[ Oct 16 19:24:49 Method "start" exited with status 0 ]
[ Oct 16 19:24:50 Stopping because process dumped core. ]
[ Oct 16 19:24:50 Executing stop method (:kill) ]
Successfully shutdown... (PID=29232)
[ Oct 16 19:24:51 Executing start method ("/opt/nagios/bin/nagios /opt/nagios/etc/nagios.cfg &") ]
[ Oct 16 19:24:51 Method "start" exited with status 0 ]
[ Oct 16 19:24:52 Stopping because process dumped core. ]
[ Oct 16 19:24:52 Executing stop method (:kill) ]
Successfully shutdown... (PID=29246)

So, why does nagios crash when started via SMF? Well, I decided to enable core dumps via coreadm, to find out why. We do this with:

# mkdir /cores
# coreadm -g /cores/core.%f.%p -i /cores/core.%f.%p -e global -e global-setid -e log -e process -e proc-setid
# coreadm
     global core file pattern: /cores/core.%f.%p
     global core file content: all
       init core file pattern: /cores/core.%f.%p
       init core file content: all
            global core dumps: enabled
       per-process core dumps: enabled
      global setid core dumps: enabled
 per-process setid core dumps: enabled
     global core dump logging: enabled

We can then check the core dump with:

# gdb /opt/nagios/bin/nagios /cores/core.nagios.23536
GNU gdb 6.6
Copyright (C) 2006 Free Software Foundation, Inc.
*snip*
Core was generated by `/opt/nagios/bin/nagios /opt/nagios/etc/nagios.cfg'.
Program terminated with signal 11, Segmentation fault.
#0  0xfed3590c in strlen () from /lib/libc.so.1
(gdb) bt
#0  0xfed3590c in strlen () from /lib/libc.so.1
#1  0xfed8eda6 in _ndoprnt () from /lib/libc.so.1
#2  0xfed9192d in fprintf () from /lib/libc.so.1
#3  0x08067c42 in run_async_host_check_3x ()
#4  0x08066f69 in run_scheduled_host_check_3x ()
#5  0x080658d0 in perform_scheduled_host_check ()
#6  0x0807c0e8 in handle_timed_event ()
#7  0x0807bd8c in event_execution_loop ()
#8  0x0805ecaa in main ()
(gdb) quit

Interesting - it’s crashing when the nagios function run_async_host_check_3x calls fprintf. Looks like a null pointer to me. Let’s get the actual line number by installing a nagios binary which has not been stripped of debugging symbols. Thankfully the Nagios Makefile already has a target for this:

# cd /opt/src/nagios-3.2.0
# gmake install-unstripped
cd ./base && gmake install-unstripped
gmake[1]: Entering directory `/opt/src/build/nagios/files/nagios-3.2.0/base'
gmake install-basic
gmake[2]: Entering directory `/opt/src/build/nagios/files/nagios-3.2.0/base'
/opt/sfw/bin/install -c -m 775 -o nagios -g nagios -d /opt/nagios/bin
/opt/sfw/bin/install -c -m 774 -o nagios -g nagios nagios /opt/nagios/bin
*snip*

Now we re-run nagios via SMF, then gdb the latest coredump:

# gdb /opt/nagios/bin/nagios /globalcore/core.nagios.29248
GNU gdb 6.6
Copyright (C) 2006 Free Software Foundation, Inc.
*snip*
Core was generated by `/opt/nagios/bin/nagios /opt/nagios/etc/nagios.cfg'.
Program terminated with signal 11, Segmentation fault.
#0  0xfed3590c in strlen () from /lib/libc.so.1
(gdb) bt
#0  0xfed3590c in strlen () from /lib/libc.so.1
#1  0xfed8eda6 in _ndoprnt () from /lib/libc.so.1
#2  0xfed9192d in fprintf () from /lib/libc.so.1
#3  0x08067c42 in run_async_host_check_3x (hst=0x8139b78, check_options=0, latency=0.048000000000000001,
    scheduled_check=1, reschedule_check=1, time_is_valid=0x8047b40, preferred_time=0x8047b48)
    at /opt/src/build/nagios/files/nagios-3.2.0/base/checks.c:3134
#4  0x08066f69 in run_scheduled_host_check_3x (hst=0x8139b78, check_options=0, latency=0.048000000000000001)
    at /opt/src/build/nagios/files/nagios-3.2.0/base/checks.c:2791
#5  0x080658d0 in perform_scheduled_host_check (hst=0x8139b78, check_options=0, latency=0.048000000000000001)
    at /opt/src/build/nagios/files/nagios-3.2.0/base/checks.c:2108
#6  0x0807c0e8 in handle_timed_event (event=0x8133010) at /opt/src/build/nagios/files/nagios-3.2.0/base/events.c:1261
#7  0x0807bd8c in event_execution_loop () at /opt/src/build/nagios/files/nagios-3.2.0/base/events.c:1132
#8  0x0805ecaa in main (argc=134510324, argv=0x8139b78) at nagios.c:849
(gdb) quit

Aha! Now we have a line number. The line in question, line 3134 of checks.c, reads:

fprintf(check_result_info.output_file_fp,"output=%s\n",checkresult_dbuf.buf);

So this checkresult_dbuf.buf must be null. I googled, and found someone talking about it on the nagios-devel mailing list. It seems the fix they committed (checking whether checkresult_dbuf.buf is null) has since been reverted or overwritten, as the check is no longer in place in nagios 3.2.0. Not to worry, here’s a patch:

--- base/checks.c.orig  2009-10-16 19:28:42.082321083 +0100
+++ base/checks.c       2009-10-16 19:29:02.197305557 +0100
@@ -3131,7 +3131,7 @@
                                fprintf(check_result_info.output_file_fp,"early_timeout=%d\n",check_result_info.early_timeout);
                                fprintf(check_result_info.output_file_fp,"exited_ok=%d\n",check_result_info.exited_ok);
                                fprintf(check_result_info.output_file_fp,"return_code=%d\n",check_result_info.return_code);
-                               fprintf(check_result_info.output_file_fp,"output=%s\n",checkresult_dbuf.buf);
+                               fprintf(check_result_info.output_file_fp,"output=%s\n",(checkresult_dbuf.buf==NULL)?"(null)":checkresult_dbuf.buf);

                                /* close the temp file */
                                fclose(check_result_info.output_file_fp);

Apply this and you should be all set!
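
For anyone wondering why a null buffer is fatal here: the C standard leaves passing NULL for a %s argument undefined. glibc happens to print "(null)", while the backtrace above suggests Solaris libc hands the pointer straight to strlen. Below is a minimal standalone sketch of the same guard the patch applies. The function names output_or_placeholder and write_output_line are made up for illustration and are not part of Nagios:

```c
#include <stdio.h>

/* Hypothetical helper (not a Nagios function): returns a printable
 * stand-in when no output buffer was produced. */
const char *output_or_placeholder(const char *buf)
{
    return (buf == NULL) ? "(null)" : buf;
}

/* Writes one "output=..." line, guarding the %s argument the same way
 * the patch does. Returns whatever fprintf returns. */
int write_output_line(FILE *fp, const char *buf)
{
    return fprintf(fp, "output=%s\n", output_or_placeholder(buf));
}
```

Calling write_output_line(stdout, NULL) writes a placeholder line instead of crashing, whichever libc you happen to be linked against.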

October 16th, 2009

Compiling Kannel 1.4.3 on Solaris 10

Kannel doesn’t appear to compile with Sun Studio, and I couldn’t be bothered to work out why. The default Solaris gcc 3.4.3 gets further, but the build fails with:

gcc -std=gnu99 -D_REENTRANT=1 -I. -Igw -g -O2 -D_FILE_OFFSET_BITS=64 -D_LARGE_FILES= -I/usr/include/libxml2 -o wmlscript/wsstream_data.o -c wmlscript/wsstream_data.c
wmlscript/wslexer.c: In function `read_float_from_exp':
wmlscript/wslexer.c:1037: error: syntax error before '||' token
gmake-3.81: *** [wmlscript/wslexer.o] Error 1
gmake-3.81: *** Waiting for unfinished jobs....
No postbuild script
Error: build failed

The issue is that wmlscript/wslexer.c uses "HUGE_VAL", which has a broken definition on Solaris (or at least gcc doesn’t like it).

The solution is to force it to use GCC’s built-in HUGE_VAL definition, which you can do with the following patch:

--- wmlscript/wslexer.c.orig    2009-09-18 12:51:52.218499508 +0100
+++ wmlscript/wslexer.c 2009-09-18 12:54:09.811272795 +0100
@@ -1034,7 +1034,7 @@

     /* Check that the generated floating point number fits to
        `float32'. */
-    if (*result == HUGE_VAL || *result == -HUGE_VAL
+    if (*result == __builtin_huge_val() || *result == -__builtin_huge_val()
         || ws_ieee754_encode_single(*result, buf) != WS_IEEE754_OK)
         ws_src_error(compiler, 0, "floating point literal too large");

Happy compiling!
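
If you would rather not depend on a GCC builtin, the same kind of range check can be written with the C99 isinf/isnan macros and FLT_MAX. This is only a sketch of the idea, not Kannel's code: literal_fits_float32 is a hypothetical name, and I haven't checked whether the Solaris 10 headers behave any better for isinf than they do for HUGE_VAL under gcc 3.4.3.

```c
#include <float.h>
#include <math.h>
#include <stdlib.h>

/* Hypothetical helper (not from Kannel): returns 1 if the text parses
 * to a finite value inside the float32 range, 0 otherwise. */
int literal_fits_float32(const char *text)
{
    double value = strtod(text, NULL);

    /* strtod returns +/-HUGE_VAL on overflow; isinf catches that
     * without naming the HUGE_VAL macro directly. */
    if (isinf(value) || isnan(value))
        return 0;

    return value <= FLT_MAX && value >= -FLT_MAX;
}
```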

September 18th, 2009

Installing Windows Server 2008 on Citrix XenServer

During the install of Windows Server 2008, the installer might throw a screen up at you insisting you provide it with drivers so it can install Windows.

This screen can’t be bypassed (that I could see), and giving Windows Server 2008 the Citrix XenServer xe-tools.iso image is no good, as the drivers are contained within a .exe. Extracting the drivers on another computer and making your own ISO is no good either - Windows won’t accept those drivers.

Unhelpfully, the installer doesn’t even tell you what hardware it wants drivers for. However, on a hunch I removed the VM’s Network Adapter within Citrix XenServer, and sure enough, after a restart, the installer didn’t ask for any drivers and the install completed successfully.

I’ve had to fight with this drivers screen when installing Windows 7 on my Dell laptop before, and it’s not fun. It just doesn’t provide enough useful information for you to find the drivers it wants to install. Stupid Microsoft. Stupid Windows.

At least it accepts a CD or USB Key for the drivers, which is a vast improvement over the NT/2000/XP/2003 days where you’d need to blow the dust off your 3.5″ floppy drive…

July 29th, 2009

Windows Server 2003

Windows Server 2003 is now over 6 years old. Yet we’re still asked by clients for new Windows Server 2003 installations, despite Windows Server 2008 coming out last year. I find this quite interesting, because Windows Server 2008 is a great product, and IIS 7.0 offers many significant advantages over IIS 6.0 (such as native URL rewriting).

I’d say the biggest driver of this is that people fear the unknown - Server 2008 is somewhat new and people just don’t have the time to try it out. However, the situation in the Windows ecosystem is significantly different from what we encounter in the Linux & Unix world. For example, nobody would dare consider installing a Linux distribution that’s 6 years old.

CentOS first came out with version 2 in May 2004, Debian 3.0 “Woody” came out in 2002 (there wasn’t another release until 2005). Ubuntu didn’t even come out until late 2004. All shipped with the Linux 2.4 kernel, and Apache 1.3, by default. Nobody in their right mind would run any of these distributions today.

So why, then, do people continue to install Windows Server 2003? For the following reasons:

  • Windows Server 2003 was a very strong release
  • Windows Server 2003 meets most people’s requirements
  • .NET 2.0, .NET 3.0 and .NET 3.5 all run fine on Windows Server 2003
  • Microsoft have released a FastCGI module for IIS 6.0, and there are numerous URL Rewrite options for Server 2003
  • People are wary of new Microsoft releases (Take Vista for example)

That’s not to say I approve of installing Windows Server 2003. It goes out of mainstream support in 2010, which is but one year away. Windows Server 2008 is a great product with many fantastic new features built in. But I have a nagging feeling Windows Server 2003 will be with us for a long time to come. It’s just too simple, too clean and too elegant to disappear.

June 16th, 2009

Killing a Solaris 10 Zone stuck in the shutting_down state.

So, you have a Solaris 10 Zone. You’ve run “zoneadm -z zonename shutdown”. It hasn’t quite shut down, and is stuck in the shutting_down state. What can you do to fix it?

Well, sometimes some processes don’t die in a timely fashion. Check what processes are running with the following command:

# ps -fz zonename

If any processes other than zsched are running, kill -9 them. The zone should hopefully shut down.

If it doesn’t, and you’re left with zsched as the only remaining process, then potentially you’ve hit a bug, such as bug 6272846 - "User orders zone death; NFS client thumbs nose". This bug has been outstanding since May 2005, so don’t expect a fix any time soon.

Thankfully there are a few more things you can try to kill the damn zone off. Give some of the following a go:

# zoneadm -z zonename unmount -f
# zoneadm -z zonename reboot -- -s
# pkill -9 -z zonename

The above combo should hopefully deliver a fatal blow to your Zone. If not, bitch at Sun. Hopefully they’ll sort their lives out.

June 11th, 2009

64bit Varnish on Solaris

When running a 64bit varnish on Solaris, you may encounter an error similar to:

# /opt/ec/sbin/amd64/varnishd -d
Compiled VCL program failed to load:
  ld.so.1: varnishd: fatal: ./vcl.ORk8t3RP.so: wrong ELF class: ELFCLASS32
VCL compilation failed

The problem is fairly self-explanatory: your 64bit Varnish is failing to pass -m64 to the compiler when it compiles the VCL program. The fix is very straightforward. Simply override the cc_command parameter so that it includes -m64:

# /opt/ec/sbin/amd64/varnishd -d -p cc_command='cc -Kpic -G -m64 -o %o %s'
storage_file: filename: ./varnish.NxaavR (unlinked) size 26135 MB.
Creating new SHMFILE
New Pid 22203

Debugging mode, enter "start" to start child

Et voilà, fixed. Enjoy!
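
If you ever need to confirm which class a shared object really is, the answer sits in the fifth byte of the ELF header: after the four magic bytes (0x7f 'E' 'L' 'F'), a value of 1 means ELFCLASS32 and 2 means ELFCLASS64. Here is a small sketch that reads it. The function names are my own and have nothing to do with Varnish; the file(1) command will tell you the same thing.

```c
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/* Returns 32 or 64 for a valid ELF identification block, 0 otherwise.
 * Byte 4 of the header is EI_CLASS: 1 = ELFCLASS32, 2 = ELFCLASS64. */
int elf_class_bits(const unsigned char *header, size_t len)
{
    static const unsigned char magic[4] = { 0x7f, 'E', 'L', 'F' };

    if (header == NULL || len < 5 || memcmp(header, magic, 4) != 0)
        return 0;
    if (header[4] == 1)
        return 32;
    if (header[4] == 2)
        return 64;
    return 0;
}

/* Reads the first five bytes of a file and reports its ELF class.
 * Returns 0 if the file cannot be read or is not ELF. */
int elf_class_of_file(const char *path)
{
    unsigned char header[5];
    size_t got;
    FILE *fp = fopen(path, "rb");

    if (fp == NULL)
        return 0;
    got = fread(header, 1, sizeof header, fp);
    fclose(fp);
    return elf_class_bits(header, got);
}
```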

May 31st, 2009

Text relocation remains against symbol, libx264

Just a very quick post regarding libx264.

If you are getting errors such as:

# gcc -shared -o libx264.so.67 common/mc.o common/predict.o common/pixel.o common/macroblock.o common/frame.o common/dct.o common/cpu.o common/cabac.o common/common.o common/mdate.o common/set.o common/quant.o common/vlc.o encoder/analyse.o encoder/me.o encoder/ratecontrol.o encoder/set.o encoder/macroblock.o encoder/cabac.o encoder/cavlc.o encoder/encoder.o extras/getopt.o  -Wl,-h,libx264.so.67 -lm -lpthread -s
Text relocation remains                         referenced
    against symbol                  offset      in file
                           0x6be       common/mc.o
                           0x6d5       common/mc.o
                           0xbbe       common/mc.o
                           0xbc5       common/mc.o
...
__udivdi3                           0x3809      common/set.o
__udivdi3                           0x3875      common/set.o
__udivdi3                           0x10cf      encoder/macroblock.o
__divdi3                            0x17865     encoder/analyse.o
__divdi3                            0x1e9       encoder/set.o
ld: fatal: relocations remain against allocatable but non-writable sections
collect2: ld returned 1 exit status

then simply add “-mimpure-text -lrt” to your LDFLAGS.

A quick note to self, “gcc -shared” is better than “gcc -G”. The former tells gcc to build a shared object, which tells the linker (I suppose). The latter just tells the linker. Swapping a -shared for -G can fix the above issue, but creates other issues. Or something along those lines - I’m a bit hazy on this one.

This issue came about because I was getting errors when running a 64bit amd64 ffmpeg linked against libx264:

ld.so.1: ffmpeg: fatal: relocation error: R_AMD64_PC32: file /opt/ec/lib/amd64/libx264.so.67: symbol main: value 0x280018fc805 does not fit

The problem here was that I’d compiled libx264 with gcc -G instead of gcc -shared. However using -shared generated the “Text relocation remains against symbol” errors, which needed the “-mimpure-text -lrt” fix.

May 19th, 2009

FFMpeg 64bit x86_64 / amd64 on Solaris 10

I wanted to post this before I move onto my next problem, so excuse the brevity. When compiling ffmpeg on Solaris 10 for 64bit, you may encounter this particular block of errors, which come out of the assembly found inside libavcodec/cabac.h:

/var/tmp//ccC7vvHU.s:8035: Error: `-1(%ebx)' is not a valid 64 bit base/index expression
/var/tmp//ccC7vvHU.s:8038: Error: `ff_h264_norm_shift(%ecx)' is not a valid 64 bit base/index expression
/var/tmp//ccC7vvHU.s:8060: Error: `ff_h264_lps_range(%eax,%esi,2)' is not a valid 64 bit base/index expression
/var/tmp//ccC7vvHU.s:8070: Error: `ff_h264_norm_shift(%esi)' is not a valid 64 bit base/index expression
/var/tmp//ccC7vvHU.s:8072: Error: `ff_h264_mlps_state+128(%eax)' is not a valid 64 bit base/index expression
/var/tmp//ccC7vvHU.s:8084: Error: `-1(%ebx)' is not a valid 64 bit base/index expression
/var/tmp//ccC7vvHU.s:8087: Error: `ff_h264_norm_shift(%ecx)' is not a valid 64 bit base/index expression
/var/tmp//ccC7vvHU.s:8107: Error: `ff_h264_lps_range(%eax,%esi,2)' is not a valid 64 bit base/index expression
/var/tmp//ccC7vvHU.s:8117: Error: `ff_h264_norm_shift(%esi)' is not a valid 64 bit base/index expression
...
/var/tmp//ccC7vvHU.s:32827: Error: `ff_h264_norm_shift(%esi)' is not a valid 64 bit base/index expression

The issue is that FFMpeg has failed to detect "BROKEN_RELOCATIONS". Simply set this in your config.h like so:

export CFLAGS=-m64
export LDFLAGS=-m64
./configure --prefix=/tmp/ffmpeg --arch=x86_64 --cpu=nocona --disable-encoder=nellymoser
echo '#define BROKEN_RELOCATIONS 1' >> config.h
gmake-3.81

I’ve missed a lot of detail out here, such as all the patches we use to get ffmpeg building on Solaris, but I’ll hopefully find more time tomorrow to post a blog entry about it.

Some interesting tidbits of information: FFMpeg with --enable-shared is 3 times slower, so I wouldn’t advise enabling this flag unless you absolutely need it. And the 64bit ffmpeg binary is twice as fast at transcoding wmv to flv as a 32bit one (in the basic not-very-fancy test I used). So it is worth investing time in compiling up a 64bit version.

Here’s a performance comparison going from a 32bit ffmpeg with --enable-shared and --disable-mmx on gcc 3.4, to a 64bit ffmpeg with --disable-shared and --enable-mmx on gcc 4.4:

# time /opt/ec/bin/ffmpeg -y -i ~/test2.wmv ~/test2.flv
FFmpeg version 0.5, Copyright (c) 2000-2009 Fabrice Bellard, et al.
  configuration: --prefix=/opt/ec --enable-shared --enable-nonfree --enable-gpl --enable-libamr-nb --enable-libamr-wb --enable-libdirac --enable-libfaac --enable-libfaad --enable-libmp3lame --enable-libopenjpeg --enable-libschroedinger --enable-libtheora --enable-libvorbis --enable-libx264 --enable-libxvid --disable-encoder=nellymoser --disable-mmx --enable-avfilter --disable-debug --enable-swscale --enable-postproc --enable-pthreads
  libavutil     49.15. 0 / 49.15. 0
  libavcodec    52.20. 0 / 52.20. 0
  libavformat   52.31. 0 / 52.31. 0
  libavdevice   52. 1. 0 / 52. 1. 0
  libavfilter    0. 4. 0 /  0. 4. 0
  libswscale     0. 7. 1 /  0. 7. 1
  libpostproc   51. 2. 0 / 51. 2. 0
  built on May  1 2009 14:25:24, gcc: 3.4.3 (csl-sol210-3_4-branch+sol_rpath)

Seems stream 1 codec frame rate differs from container frame rate: 1000.00 (1000/1) -> 25.00 (25/1)
Input #0, asf, from '/export/home/alasdair/test2.wmv':
  Duration: 00:11:04.65, start: 3.000000, bitrate: 172 kb/s
    Stream #0.0: Audio: wmav2, 44100 Hz, mono, s16, 32 kb/s
    Stream #0.1: Video: wmv1, yuv420p, 320x240, 180 kb/s, 25 tbr, 1k tbn, 1k tbc
Output #0, flv, to '/export/home/alasdair/test2.flv':
    Stream #0.0: Video: flv, yuv420p, 320x240, q=2-31, 200 kb/s, 90k tbn, 25 tbc
    Stream #0.1: Audio: libmp3lame, 44100 Hz, mono, s16, 64 kb/s
Stream mapping:
  Stream #0.1 -> #0.0
  Stream #0.0 -> #0.1
Press [q] to stop encoding
frame=16637 fps=263 q=12.6 Lsize=   22485kB time=665.48 bitrate= 276.8kbits/s
video:16626kB audio:5200kB global headers:0kB muxing overhead 3.016455%

real    1m3.285s
user    1m0.043s
sys     0m0.471s
# time ./ffmpeg-64 -y -i ~/test2.wmv ~/test2.flv
FFmpeg version 0.5, Copyright (c) 2000-2009 Fabrice Bellard, et al.
  configuration: --prefix=/tmp/ffmpeg --arch=x86_64 --cpu=nocona --disable-encoder=nellymoser
  libavutil     49.15. 0 / 49.15. 0
  libavcodec    52.20. 0 / 52.20. 0
  libavformat   52.31. 0 / 52.31. 0
  libavdevice   52. 1. 0 / 52. 1. 0
  built on May  3 2009 04:42:52, gcc: 4.4.0

Seems stream 1 codec frame rate differs from container frame rate: 1000.00 (1000/1) -> 25.00 (25/1)
Input #0, asf, from '/export/home/alasdair/test2.wmv':
  Duration: 00:11:04.65, start: 3.000000, bitrate: 172 kb/s
    Stream #0.0: Audio: wmav2, 44100 Hz, mono, s16, 32 kb/s
    Stream #0.1: Video: wmv1, yuv420p, 320x240, 180 kb/s, 25 tbr, 1k tbn, 1k tbc
Output #0, flv, to '/export/home/alasdair/test2.flv':
    Stream #0.0: Video: flv, yuv420p, 320x240, q=2-31, 200 kb/s, 90k tbn, 25 tbc
    Stream #0.1: Audio: adpcm_swf, 44100 Hz, mono, s16, 64 kb/s
Stream mapping:
  Stream #0.1 -> #0.0
  Stream #0.0 -> #0.1
Press [q] to stop encoding
frame=16637 fps=1799 q=12.0 Lsize=   31483kB time=665.48 bitrate= 387.6kbits/s
video:16624kB audio:14375kB global headers:0kB muxing overhead 1.561981%

real    0m9.356s
user    0m8.937s
sys     0m0.291s

The encode time has gone from 63 seconds to just 9 seconds! That’s a *huge* speedup. I’m rather impressed.
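
To put a number on it, the "real" lines from the two runs can be converted to seconds and divided. The sketch below does just that, with function names of my own invention. It works out at a bit under 7x, though bear in mind the two runs differ in more than word size: shared vs static, MMX, gcc version, and (going by the logs) the audio encoder too.

```c
#include <stdio.h>

/* Converts a `time`-style duration such as "1m3.285s" into seconds.
 * Returns -1.0 if the text does not match that shape. */
double duration_seconds(const char *text)
{
    int minutes = 0;
    double seconds = 0.0;

    if (sscanf(text, "%dm%lfs", &minutes, &seconds) != 2)
        return -1.0;
    return minutes * 60.0 + seconds;
}

/* Returns how many times faster `after` is than `before`,
 * or 0.0 if either duration could not be parsed. */
double speedup(const char *before, const char *after)
{
    double b = duration_seconds(before);
    double a = duration_seconds(after);

    if (b <= 0.0 || a <= 0.0)
        return 0.0;
    return b / a;
}
```

speedup("1m3.285s", "0m9.356s") is the comparison for the two runs above.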

I’m also guessing that compiling all the ffmpeg dependencies as shared object .so files will have the same slowdown as ffmpeg with --enable-shared, so I’m going to try building all the dependencies as static libraries instead.

It’s now approaching 5am so I should really go off to bed, but I’m glad I finally got this knocked on the head.

May 3rd, 2009

What is Solaris? Why should I be using it?

Solaris is Sun Microsystems’ flagship Unix-based operating system. It is free to obtain and use, and Sun opened the Solaris source code under an Open Source license in 2005.

It is robust, highly scalable and incredibly powerful. It is actively maintained by Sun, with new features introduced on a regular basis. Paid commercial support is available. It fully supports Intel & AMD CPUs, ensuring it runs on the vast majority of commodity hardware such as Dell & HP.

Solaris contains many killer features. If you’re currently using Linux, FreeBSD, Mac OS X, or another Unix variant, you should consider checking it out.

We are a big Solaris user here at EveryCity, providing Managed Solaris Hosting to our customers via our dedicated and cloud based hosting platform, alongside our Windows and Linux hosting offerings.

Killer Features

Solaris has a huge number of features, too many to mention. We will cover some of the key features that we make heavy use of, many of which we severely miss when we use other operating systems.

ZFS Filesystem

ZFS is a revolutionary filesystem that throws history to the wind, doing away with the traditional link between files, filesystems and partitions. ZFS utilises a notion of “pooled storage”, where you allocate disks to a pool. You can then dynamically create filesystems on the fly, which all share the pool of storage. You can, for example, give each user their own ZFS filesystem.

ZFS filesystems support compression, encryption and quotas. The ZFS filesystem is atomic - transactions are either fully committed or not committed at all. There is no fsck/chkdsk tool, because data on disk is always consistent. This means that after an unexpected power loss, the system boots without needing to perform a lengthy disk check.

Storage Pools can be created with RAID levels, with ZFS supporting RAID0, RAID1, RAID10, RAID-Z (similar to RAID 5) and RAID-Z2 (similar to RAID 6). ZFS stores a checksum for every data block on disk, so it will detect silent data corruption and report on it. If you are using a redundant layout (RAID1, RAID10, RAID-Z or RAID-Z2), ZFS will recover from the corruption by using the mirror copy or the parity data. This ensures ZFS can, for example, recover from the scenario where a disk in a RAID array is silently corrupting data. As far as we know, no commercial RAID card on the market can currently recover from this particular failure mode.
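
To illustrate the idea (and only the idea), here is a toy sketch of a checksummed read from a two-way mirror. It is in no way how ZFS is implemented: the checksum is a simple Fletcher-style sum we made up for the example, and struct mirror, toy_checksum and mirror_read are hypothetical names. The point is that the expected checksum is kept apart from the data, so a read can tell which copy is good and repair the other.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define BLOCK_SIZE 16

/* Toy Fletcher-style checksum over a block (illustrative only). */
uint32_t toy_checksum(const unsigned char *data, size_t len)
{
    uint32_t a = 0, b = 0;
    size_t i;

    for (i = 0; i < len; i++) {
        a = (a + data[i]) % 65535;
        b = (b + a) % 65535;
    }
    return (b << 16) | a;
}

/* A two-way mirror: two copies of one block, plus the checksum the
 * block is expected to have, stored separately from the data. */
struct mirror {
    unsigned char copy[2][BLOCK_SIZE];
    uint32_t expected;
};

/* Reads the block into `out`. Returns 0 on success, or -1 if neither
 * copy matches the expected checksum. When a good copy is found it is
 * written over the other copy, repairing any silent corruption. */
int mirror_read(struct mirror *m, unsigned char *out)
{
    int i;

    for (i = 0; i < 2; i++) {
        if (toy_checksum(m->copy[i], BLOCK_SIZE) == m->expected) {
            memcpy(out, m->copy[i], BLOCK_SIZE);
            memcpy(m->copy[1 - i], m->copy[i], BLOCK_SIZE);
            return 0;
        }
    }
    return -1;
}
```

A plain RAID1 controller without checksums has no way of knowing which of two differing copies is the correct one, which is why this failure mode is so nasty.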

ZFS supports snapshots, which are virtually free and incredibly easy to do. Filesystem snapshots are immediately available, via a hidden “.zfs” directory, that let you view the filesystem at the time of the snapshot. Snapshots are read only, and can be "cloned" to produce full read/write filesystems. They are great for producing backups.

ZFS snapshots are incredibly powerful, and very useful. A great example is the Solaris liveupgrade utility, which is used to upgrade Solaris to a new release. liveupgrade will snapshot the root ZFS filesystem before performing the upgrade. If the upgrade fails, you can roll back to the snapshot, saving you from having to restore from backups. We use ZFS snapshots internally on our backup server to store incremental daily backups.

ZFS contains many more features that we haven’t even begun to touch on, such as send/receive, separate log devices, the ARC and L2ARC caching systems, and many many more. We love ZFS, and now simply can’t imagine life without it.

Zones / Containers

Solaris Zones (or "Containers" as you may hear them being referred to) are similar to FreeBSD Jails. They are virtualised Solaris installations that you can SSH into, install applications inside, and use just like a real physical server. Solaris Zones differ from other virtualisation technologies in that they all run on top of a single kernel, with little or no overhead. Zones cannot, for example, run Linux or Windows inside, because they are not a hardware virtualisation solution - they are a logical virtualisation technology that groups system processes/users/resources into discrete units.

Zones are very powerful, and Solaris provides a full management framework for creating, starting, stopping, cloning and modifying them. Since there is little or no overhead, Zones are incredibly fast, and it is entirely feasible to run enterprise applications such as Oracle inside a zone. Zones are cheap to create, using very little memory. As an example, one might create a database Zone for MySQL, a production Zone with Apache for a live website, and a staging Zone also with Apache for development.

Zones support resource controls, such as CPU, memory, number of processes, etc, providing a great method of partitioning up a system. Resources are pooled, in that you set a memory cap, rather than allocating a specific quantity of RAM, allowing you to overcommit.

We use Solaris Zones for our cloud computing environment, and cannot sing their praises enough. We can deploy a brand new fully working Zone, including installing all necessary key applications such as Apache, PHP, MySQL, etc, within 1 to 2 minutes.

Service Management Framework

Sun has done away with the old System V init scripts that users of Linux may be familiar with. Instead, in Solaris, we have the Service Management Framework, part of Sun’s "Predictive self healing" strategy. SMF manages services via a set of command line tools, "svcs", "svcadm" and "svccfg". SMF can detect and restart failed services, tracks dependencies, starts services in parallel, and stores its configuration in XML based manifests (which you don’t need to touch, unless you’re creating a new service).

SMF is incredibly powerful, and makes managing services incredibly easy.

Versions of Solaris

Solaris comes in several versions. Solaris 10 is Sun’s commercial, stable operating system, available free of charge. New releases come out roughly every 6 months, often introducing major new features whilst maintaining backwards compatibility and stability. We utilise Solaris 10.

OpenSolaris is the open source edition of Solaris. It contains radical new features and bleeding edge technologies. It is relatively new, and evolving quickly with releases every 6 to 12 months. It provides a great way to try out new features, and is very stable, with many companies using it in a production environment. Commercial support is also available from Sun.

Solaris also has development editions, such as Solaris Express Community Edition, and you can also obtain the OpenSolaris codebase and build this yourself.

Conclusion

I hope you found the above interesting. Solaris is not without its flaws, and Sun are working hard to address them. For example, Solaris 10 currently lacks an integrated centralised package management system. But this feature has been developed, is present in OpenSolaris, and should hopefully arrive in Solaris 11.

We wouldn’t have made such a heavy investment in time and energy to use Solaris if we didn’t strongly believe in its technological benefits, and we urge others to play with it and give it a go. Perhaps you’ll fall in love with it as much as we have.

March 6th, 2009

Compiling MySQL-python on Solaris 10

This one can be a bit of a nightmare on Solaris, due to typical Solaris complications. There are two key things to watch out for.

The first thing to watch out for is that MySQL-python includes both pyconfig.h and my_config.h, both of which on Solaris may include SIZEOF_ definitions. If you’re using Sun Web Stack 1.4, the my_config.h file is for 64bit, but we’ll no doubt be compiling as 32bit. Our recommendation is to compile up your own mysql client library and link against this (see my previous post about compiling things against the Sun Web Stack 1.4 MySQL).

The second thing to watch out for is that Python and MySQL both record the compiler options they were compiled with. For example, -fPIC, -Wall, etc. MySQL-python blindly passes these to whatever compiler you’re using. These may be the wrong arguments, generating all sorts of warnings and/or errors. Ones like these:

# MySQL compiled with Sun Studio, Python compiled with gcc, we're compiling with Sun Studio:
building '_mysql' extension
creating build/temp.solaris-2.10-i86pc-2.5
creating build/temp.solaris-2.10-i86pc-2.5/src
/opt/SUNWspro/bin/cc -OPT:Olimit=0 -DNDEBUG -O -Kpic -Dversion_info=(1, 3, 0, 'final', 0) -D__version__=1.3.0 -I/opt/webstack/mysql/include/mysql -I/opt/python2.5/include/python2.5 -c src/mysqlmod.c -o build/temp.solaris-2.10-i86pc-2.5/src/mysqlmod.o -xarch=386 -xchip=pentium -xspace -xildoff -xc99=all -xnorunpath -m32 -DBIG_TABLES -DHAVE_RWLOCK_T
cc: Warning: illegal option -OPT:Olimit=0
"/usr/include/sys/feature_tests.h", line 332: #error: "Compiler or options invalid for pre-UNIX 03 X/Open applications  and pre-2001 POSIX applications"
cc: acomp failed for src/mysqlmod.c
error: command '/opt/SUNWspro/bin/cc' failed with exit status 2

# MySQL compiled with gcc, Python compiled with Sun Studio, we're compiling with gcc:
building '_mysql' extension
creating build/temp.solaris-2.10-i86pc-2.5
creating build/temp.solaris-2.10-i86pc-2.5/src
gcc -OPT:Olimit=0 -DNDEBUG -O -Kpic -Dversion_info=(1, 3, 0, 'final', 0) -D__version__=1.3.0 -I/opt/ec/mysql_client/include/mysql -I/opt/python2.5/include/python2.5 -c src/mysqlmod.c -o build/temp.solaris-2.10-i86pc-2.5/src/mysqlmod.o -DHAVE_RWLOCK_T
gcc: unrecognized option `-Kpic'
cc1: error: invalid option argument `-OPT:Olimit=0'

# etc, there are quite a few combinations here

Our solution was to find out which compiler your Python was built with (Sun Studio or gcc) by looking in the “lib/python*/config/Makefile” file (relative to where python is installed), compile up your own MySQL client library using the same compiler, then compile MySQL-python with that same compiler.

Don’t forget to set the LDFLAGS environment variable with -L and -R paths to your MySQL client library path and Python library path, so that MySQL-python can find the libraries when it compiles.

March 6th, 2009