Alasdair on Everything

Installing the memcached Ruby Gem on Solaris

Gosh this one was quite hard. I was getting errors such as:

rlibmemcached_wrap.c:2074: error: syntax error before ‘bool’
rlibmemcached_wrap.c: In function ‘SWIG_AsVal_bool’:
rlibmemcached_wrap.c:2076: error: ‘obj’ undeclared (first use in this function)
rlibmemcached_wrap.c:2076: error: (Each undeclared identifier is reported only once
rlibmemcached_wrap.c:2076: error: for each function it appears in.)
rlibmemcached_wrap.c:2077: error: ‘val’ undeclared (first use in this function)
rlibmemcached_wrap.c:2077: error: ‘true’ undeclared (first use in this function)
rlibmemcached_wrap.c:2080: error: ‘false’ undeclared (first use in this function)

So to solve this I basically followed these helpful instructions from Nick Sellen:

Nick Sellen says (January 27, 2010):

I had trouble installing it on my Solaris 10 with 32bit / gcc compiled ruby but managed it with a few modifications to extconf.rb:

1. added "--disable-64bit" to the libmemcached configure arguments
2. added "-std=gnu99" to CFLAGS (the rlibmemcached_wrap.c compilation was failing without that)
3. added an extra -R path for ext/lib - not sure if this was needed actually
4. recreated the rlibmemcached_wrap.c with swig (it removed a bunch of methods, not sure if this will bite me later)
5. added three extra libraries "-lnsl -lsocket -lposix4" to resolve a "symbol getaddrinfo: referenced symbol not found" relocation error with rlibmemcached.so (might only need libsocket)

You might also want to view the extconf.rb modifications directly.

The swig step basically involves downloading, compiling and installing swig to somewhere like /opt/swig, then doing “export SWIG=true” in your shell.

3 comments March 3rd, 2010

VLC on Solaris 10

Some helpful chap has compiled up VLC for Solaris 10. Useful!

January 8th, 2010

Installing OpenSolaris/Solaris on a Fasthosts Dedicated Server

EDIT: It turns out that my server had dodgy wiring to the ERIC card. Fasthosts fixed this, and I was then able to get into the BIOS to change the boot order, rendering the post below rather unnecessary.

I was recently tasked with installing OpenSolaris on a Fasthosts Dedicated Server. Fasthosts Dedicated Servers are cheap and cheerful. I would never put anything important on them, because if the shit hits the fan, you’re on your own. But they are incredibly cheap, so for unimportant bits and pieces, they can make sense.

Unfortunately they only come pre-installed with Windows Server, CentOS or Ubuntu. Being a Solaris advocate, the first thing I wanted to do was kablam them with OpenSolaris.

The boxes rather usefully come with Raritan ERIC remote management cards. These remote management cards provide you with:

  • Keyboard, Video and Mouse remote access
  • Remote power management
  • Virtual CD-Rom

So, installing OpenSolaris should be a piece of cake, right? Sadly, not quite. Either Fasthosts have locked down the cards/servers so you can’t get into the BIOS to alter the boot order, or the ERIC KVM cards are deficient in that regard. Regardless of whether I chose PS/2 or USB for the keyboard emulation, pressing F2 or F12 on the BIOS boot screen yielded nothing useful.

Further, I had issues getting the Virtual CD drive to mount. Rather unfortunately, it can only access ISO images via Windows File Sharing. I set up a Samba server, but the ERIC card kept saying "Error accessing image". It turns out your ISO image has to be in a sub-folder, and the path has to use backslashes. With that sorted, I finally got a CD mounted.

Once I had the ISO image mounted, I needed to get the server to boot it. Since I couldn’t change the boot order, I got around it by nuking the MBR of the hard drive. There are actually two hard drives in the Fasthosts box I ordered, so I ran:

# dd if=/dev/zero of=/dev/sda bs=1M count=100
# dd if=/dev/zero of=/dev/sdb bs=1M count=100

I probably only had to do the first 512 bytes, but more doesn’t hurt when you’re wiping the box anyway. Upon rebooting, sure enough, it started booting the OpenSolaris install CD. Magic!

2 comments December 29th, 2009

Nagios 3.2.0 coredumps when started via SMF on Solaris 10

This one was quite interesting. If you compile your own nagios-3.2.0 from source on Solaris 10, and start it manually, it runs just fine. If you run it via SMF with a service manifest, the process continually dumps core, so you get messages such as:

[ Oct 16 19:24:48 Enabled. ]
[ Oct 16 19:24:48 Executing start method ("/opt/nagios/bin/nagios /opt/nagios/etc/nagios.cfg &") ]
[ Oct 16 19:24:48 Method "start" exited with status 0 ]
[ Oct 16 19:24:49 Stopping because process dumped core. ]
[ Oct 16 19:24:49 Executing stop method (:kill) ]
Successfully shutdown... (PID=29180)
[ Oct 16 19:24:49 Executing start method ("/opt/nagios/bin/nagios /opt/nagios/etc/nagios.cfg &") ]
[ Oct 16 19:24:49 Method "start" exited with status 0 ]
[ Oct 16 19:24:50 Stopping because process dumped core. ]
[ Oct 16 19:24:50 Executing stop method (:kill) ]
Successfully shutdown... (PID=29232)
[ Oct 16 19:24:51 Executing start method ("/opt/nagios/bin/nagios /opt/nagios/etc/nagios.cfg &") ]
[ Oct 16 19:24:51 Method "start" exited with status 0 ]
[ Oct 16 19:24:52 Stopping because process dumped core. ]
[ Oct 16 19:24:52 Executing stop method (:kill) ]
Successfully shutdown... (PID=29246)

So, why does nagios crash when started via SMF? Well, I decided to enable core dumps via coreadm, to find out why. We do this with:

# mkdir /cores
# coreadm -g /cores/core.%f.%p -i /cores/core.%f.%p -e global -e global-setid -e log -e process -e proc-setid
# coreadm
     global core file pattern: /cores/core.%f.%p
     global core file content: all
       init core file pattern: /cores/core.%f.%p
       init core file content: all
            global core dumps: enabled
       per-process core dumps: enabled
      global setid core dumps: enabled
 per-process setid core dumps: enabled
     global core dump logging: enabled

We can then check the core dump with:

# gdb /opt/nagios/bin/nagios /cores/core.nagios.23536
GNU gdb 6.6
Copyright (C) 2006 Free Software Foundation, Inc.
*snip*
Core was generated by `/opt/nagios/bin/nagios /opt/nagios/etc/nagios.cfg'.
Program terminated with signal 11, Segmentation fault.
#0  0xfed3590c in strlen () from /lib/libc.so.1
(gdb) bt
#0  0xfed3590c in strlen () from /lib/libc.so.1
#1  0xfed8eda6 in _ndoprnt () from /lib/libc.so.1
#2  0xfed9192d in fprintf () from /lib/libc.so.1
#3  0x08067c42 in run_async_host_check_3x ()
#4  0x08066f69 in run_scheduled_host_check_3x ()
#5  0x080658d0 in perform_scheduled_host_check ()
#6  0x0807c0e8 in handle_timed_event ()
#7  0x0807bd8c in event_execution_loop ()
#8  0x0805ecaa in main ()
(gdb) quit

Interesting - it’s crashing when the nagios function run_async_host_check_3x calls fprintf. Looks like a null pointer to me. Let’s get the actual line number by installing a nagios binary which has not been stripped of debugging symbols. Thankfully the Nagios Makefile already has a target for this:

# cd /opt/src/nagios-3.2.0
# gmake install-unstripped
cd ./base && gmake install-unstripped
gmake[1]: Entering directory `/opt/src/build/nagios/files/nagios-3.2.0/base'
gmake install-basic
gmake[2]: Entering directory `/opt/src/build/nagios/files/nagios-3.2.0/base'
/opt/sfw/bin/install -c -m 775 -o nagios -g nagios -d /opt/nagios/bin
/opt/sfw/bin/install -c -m 774 -o nagios -g nagios nagios /opt/nagios/bin
*snip*

Now we re-run nagios via SMF, then gdb the latest coredump:

# gdb /opt/nagios/bin/nagios /globalcore/core.nagios.29248
GNU gdb 6.6
Copyright (C) 2006 Free Software Foundation, Inc.
*snip*
Core was generated by `/opt/nagios/bin/nagios /opt/nagios/etc/nagios.cfg'.
Program terminated with signal 11, Segmentation fault.
#0  0xfed3590c in strlen () from /lib/libc.so.1
(gdb) bt
#0  0xfed3590c in strlen () from /lib/libc.so.1
#1  0xfed8eda6 in _ndoprnt () from /lib/libc.so.1
#2  0xfed9192d in fprintf () from /lib/libc.so.1
#3  0x08067c42 in run_async_host_check_3x (hst=0x8139b78, check_options=0, latency=0.048000000000000001,
    scheduled_check=1, reschedule_check=1, time_is_valid=0x8047b40, preferred_time=0x8047b48)
    at /opt/src/build/nagios/files/nagios-3.2.0/base/checks.c:3134
#4  0x08066f69 in run_scheduled_host_check_3x (hst=0x8139b78, check_options=0, latency=0.048000000000000001)
    at /opt/src/build/nagios/files/nagios-3.2.0/base/checks.c:2791
#5  0x080658d0 in perform_scheduled_host_check (hst=0x8139b78, check_options=0, latency=0.048000000000000001)
    at /opt/src/build/nagios/files/nagios-3.2.0/base/checks.c:2108
#6  0x0807c0e8 in handle_timed_event (event=0x8133010) at /opt/src/build/nagios/files/nagios-3.2.0/base/events.c:1261
#7  0x0807bd8c in event_execution_loop () at /opt/src/build/nagios/files/nagios-3.2.0/base/events.c:1132
#8  0x0805ecaa in main (argc=134510324, argv=0x8139b78) at nagios.c:849
(gdb) quit

Aha! Now we have a line number. The line in question, line 3134 of checks.c, reads:

fprintf(check_result_info.output_file_fp,"output=%s\n",checkresult_dbuf.buf);

So checkresult_dbuf.buf must be null. I googled, and found someone talking about it on the nagios-devel mailing list. It seems the fix they committed (checking whether checkresult_dbuf.buf is null) has since been reverted or overwritten, as the check is no longer in place in nagios 3.2.0. Not to worry, here’s a patch:

--- base/checks.c.orig  2009-10-16 19:28:42.082321083 +0100
+++ base/checks.c       2009-10-16 19:29:02.197305557 +0100
@@ -3131,7 +3131,7 @@
                                fprintf(check_result_info.output_file_fp,"early_timeout=%d\n",check_result_info.early_timeout);
                                fprintf(check_result_info.output_file_fp,"exited_ok=%d\n",check_result_info.exited_ok);
                                fprintf(check_result_info.output_file_fp,"return_code=%d\n",check_result_info.return_code);
-                               fprintf(check_result_info.output_file_fp,"output=%s\n",checkresult_dbuf.buf);
+                               fprintf(check_result_info.output_file_fp,"output=%s\n",(checkresult_dbuf.buf==NULL)?"(null)":checkresult_dbuf.buf);

                                /* close the temp file */
                                fclose(check_result_info.output_file_fp);

Apply this and you should be all set!

2 comments October 16th, 2009

Compiling Kannel 1.4.3 on Solaris 10

Kannel doesn’t appear to compile with Sun Studio, and I couldn’t be bothered to work out why. Building with the default Solaris gcc 3.4.3 gets further, but fails with:

gcc -std=gnu99 -D_REENTRANT=1 -I. -Igw -g -O2 -D_FILE_OFFSET_BITS=64 -D_LARGE_FILES= -I/usr/include/libxml2 -o wmlscript/wsstream_data.o -c wmlscript/wsstream_data.c
wmlscript/wslexer.c: In function `read_float_from_exp':
wmlscript/wslexer.c:1037: error: syntax error before '||' token
gmake-3.81: *** [wmlscript/wslexer.o] Error 1
gmake-3.81: *** Waiting for unfinished jobs....
No postbuild script
Error: build failed

The issue is that wmlscript/wslexer.c uses "HUGE_VAL", which has a broken definition on Solaris (or at least gcc doesn’t like it).

The solution is to force it to use GCC’s built-in HUGE_VAL definition, which you can do with the following patch:

--- wmlscript/wslexer.c.orig    2009-09-18 12:51:52.218499508 +0100
+++ wmlscript/wslexer.c 2009-09-18 12:54:09.811272795 +0100
@@ -1034,7 +1034,7 @@

     /* Check that the generated floating point number fits to
        `float32'. */
-    if (*result == HUGE_VAL || *result == -HUGE_VAL
+    if (*result == __builtin_huge_val() || *result == -__builtin_huge_val()
         || ws_ieee754_encode_single(*result, buf) != WS_IEEE754_OK)
         ws_src_error(compiler, 0, "floating point literal too large");

Happy compiling!

September 18th, 2009

Installing Windows Server 2008 on Citrix XenServer

During the install of Windows Server 2008, the installer might throw a screen up at you insisting you provide it with drivers so it can install Windows.

This screen can’t be bypassed (as far as I could see), and giving Windows Server 2008 the Citrix XenServer xe-tools.iso image is no good, as the drivers are contained within a .exe. Extracting the drivers on another computer and making your own ISO is no good either - Windows won’t accept those drivers.

Usefully the installer doesn’t even tell you what hardware it wants drivers for. However on a hunch I removed the Network Adapter within Citrix XenServer, and sure enough, after a restart, the installer didn’t ask for any drivers and the install completed successfully.

I’ve had to fight with this drivers screen when installing Windows 7 on my Dell laptop before, and it’s not fun. It just doesn’t provide enough useful information for you to find the drivers it wants to install. Stupid Microsoft. Stupid Windows.

At least it accepts a CD or USB Key for the drivers, which is a vast improvement over the NT/2000/XP/2003 days where you’d need to blow the dust off your 3.5″ floppy drive…

July 29th, 2009

Windows Server 2003

Windows Server 2003 is now over 6 years old. Yet we’re still asked by clients for new Windows Server 2003 installations, despite Windows Server 2008 coming out last year. I find this quite interesting, because Windows Server 2008 is a great product, and IIS 7.0 offers many significant advantages over IIS 6.0 (such as native URL rewriting).

I’d say the biggest driver of this is that people fear the unknown - Server 2008 is somewhat new and people just don’t have the time to try it out. However, the situation in the Windows ecosphere is significantly different to what we encounter in the Linux & Unix world. For example, nobody would dare consider installing a Linux distribution that’s 6 years old.

CentOS first came out with version 2 in May 2004, and Debian 3.0 “Woody” came out in 2002 (there wasn’t another release until 2005). Ubuntu didn’t even come out until late 2004. All shipped with the Linux 2.4 kernel and Apache 1.3 by default. Nobody in their right mind would run any of these distributions today.

So why, then, do people continue to install Windows Server 2003? For the following reasons:

  • Windows Server 2003 was a very strong release
  • Windows Server 2003 meets most people’s requirements
  • .NET 2.0, .NET 3.0 and .NET 3.5 all run fine on Windows Server 2003
  • Microsoft have released a FastCGI module for IIS 6.0, and there are numerous URL Rewrite options for Server 2003
  • People are wary of new Microsoft releases (Take Vista for example)

That’s not to say I approve of installing Windows Server 2003. It goes out of general support in 2010, which is but one year away. Windows Server 2008 is a great product with many fantastic new features built in. But I have a nagging feeling Windows Server 2003 will be with us for a long time to come. It’s just too simple, too clean and too elegant to disappear.

June 16th, 2009

Killing a Solaris 10 Zone stuck in the shutting_down state.

So, you have a Solaris 10 Zone. You’ve run “zoneadm -z zonename shutdown”. It hasn’t quite shut down, and is stuck in the shutting_down state. What can you do to fix it?

Well, sometimes some processes don’t die in a timely fashion. Check what processes are running with the following command:

# ps -fz zonename

If any processes other than zsched are running, kill -9 them. The zone should hopefully shut down.

If it doesn’t, and you’re left with zsched as the only remaining process, then potentially you’ve hit a bug, such as bug 6272846 - "User orders zone death; NFS client thumbs nose". This bug has been outstanding since May 2005, so don’t expect a fix any time soon.

Thankfully there are a few more things you can try to kill the damn zone off. Give some of the following a go:

# zoneadm -z zonename unmount -f
# zoneadm -z zonename reboot -- -s
# pkill -9 -z zonename

The above combo should hopefully deliver a fatal blow to your Zone. If not, bitch at Sun. Hopefully they’ll sort their lives out.

June 11th, 2009

64bit Varnish on Solaris

When running a 64bit varnish on Solaris, you may encounter an error similar to:

# /opt/ec/sbin/amd64/varnishd -d
Compiled VCL program failed to load:
  ld.so.1: varnishd: fatal: ./vcl.ORk8t3RP.so: wrong ELF class: ELFCLASS32
VCL compilation failed

The problem is fairly self-explanatory: your 64bit Varnish is failing to pass -m64 to the compiler when it compiles the VCL program, so the resulting shared object is 32bit. The fix is very straightforward. Simply pass a cc_command that includes -m64:

# /opt/ec/sbin/amd64/varnishd -d -p cc_command='cc -Kpic -G -m64 -o %o %s'
storage_file: filename: ./varnish.NxaavR (unlinked) size 26135 MB.
Creating new SHMFILE
New Pid 22203

Debugging mode, enter "start" to start child

Et voilà, fixed. Enjoy!

May 31st, 2009

Text relocation remains against symbol, libx264

Just a very quick post regarding libx264.

If you are getting errors such as:

# gcc -shared -o libx264.so.67 common/mc.o common/predict.o common/pixel.o common/macroblock.o common/frame.o common/dct.o common/cpu.o common/cabac.o common/common.o common/mdate.o common/set.o common/quant.o common/vlc.o encoder/analyse.o encoder/me.o encoder/ratecontrol.o encoder/set.o encoder/macroblock.o encoder/cabac.o encoder/cavlc.o encoder/encoder.o extras/getopt.o  -Wl,-h,libx264.so.67 -lm -lpthread -s
Text relocation remains                         referenced
    against symbol                  offset      in file
                           0x6be       common/mc.o
                           0x6d5       common/mc.o
                           0xbbe       common/mc.o
                           0xbc5       common/mc.o
...
__udivdi3                           0x3809      common/set.o
__udivdi3                           0x3875      common/set.o
__udivdi3                           0x10cf      encoder/macroblock.o
__divdi3                            0x17865     encoder/analyse.o
__divdi3                            0x1e9       encoder/set.o
ld: fatal: relocations remain against allocatable but non-writable sections
collect2: ld returned 1 exit status

then simply add “-mimpure-text -lrt” to your LDFLAGS.

A quick note to self: “gcc -shared” is better than “gcc -G”. The former tells gcc to build a shared object, and gcc in turn tells the linker (I suppose). The latter just tells the linker. Swapping a -shared for -G can fix the above issue, but creates other issues. Or something along those lines - I’m a bit hazy on this one.

This issue came about because I was getting errors when running a 64bit amd64 ffmpeg linked against libx264:

ld.so.1: ffmpeg: fatal: relocation error: R_AMD64_PC32: file /opt/ec/lib/amd64/libx264.so.67: symbol main: value 0x280018fc805 does not fit

The problem here was that I’d compiled libx264 with gcc -G instead of gcc -shared. However using -shared generated the “Text relocation remains against symbol” errors, which needed the “-mimpure-text -lrt” fix.

1 comment May 19th, 2009
