Nagios 3.2.0 coredumps when started via SMF on Solaris 10
October 16th, 2009
This one was quite interesting. If you compile your own nagios-3.2.0 from source on Solaris 10, and start it manually, it runs just fine. If you run it via SMF with a service manifest, the process continually dumps core, so you get messages such as:
[ Oct 16 19:24:48 Enabled. ]
[ Oct 16 19:24:48 Executing start method ("/opt/nagios/bin/nagios /opt/nagios/etc/nagios.cfg &") ]
[ Oct 16 19:24:48 Method "start" exited with status 0 ]
[ Oct 16 19:24:49 Stopping because process dumped core. ]
[ Oct 16 19:24:49 Executing stop method (:kill) ]
Successfully shutdown... (PID=29180)
[ Oct 16 19:24:49 Executing start method ("/opt/nagios/bin/nagios /opt/nagios/etc/nagios.cfg &") ]
[ Oct 16 19:24:49 Method "start" exited with status 0 ]
[ Oct 16 19:24:50 Stopping because process dumped core. ]
[ Oct 16 19:24:50 Executing stop method (:kill) ]
Successfully shutdown... (PID=29232)
[ Oct 16 19:24:51 Executing start method ("/opt/nagios/bin/nagios /opt/nagios/etc/nagios.cfg &") ]
[ Oct 16 19:24:51 Method "start" exited with status 0 ]
[ Oct 16 19:24:52 Stopping because process dumped core. ]
[ Oct 16 19:24:52 Executing stop method (:kill) ]
Successfully shutdown... (PID=29246)
So, why does Nagios crash when started via SMF? To find out, I enabled core dumps with coreadm:
# mkdir /cores
# coreadm -g /cores/core.%f.%p -i /cores/core.%f.%p -e global -e global-setid -e log -e process -e proc-setid
# coreadm
global core file pattern: /cores/core.%f.%p
global core file content: all
init core file pattern: /cores/core.%f.%p
init core file content: all
global core dumps: enabled
per-process core dumps: enabled
global setid core dumps: enabled
per-process setid core dumps: enabled
global core dump logging: enabled
We can then check the core dump with:
# gdb /opt/nagios/bin/nagios /cores/core.nagios.23536
GNU gdb 6.6
Copyright (C) 2006 Free Software Foundation, Inc.
*snip*
Core was generated by `/opt/nagios/bin/nagios /opt/nagios/etc/nagios.cfg'.
Program terminated with signal 11, Segmentation fault.
#0 0xfed3590c in strlen () from /lib/libc.so.1
(gdb) bt
#0 0xfed3590c in strlen () from /lib/libc.so.1
#1 0xfed8eda6 in _ndoprnt () from /lib/libc.so.1
#2 0xfed9192d in fprintf () from /lib/libc.so.1
#3 0x08067c42 in run_async_host_check_3x ()
#4 0x08066f69 in run_scheduled_host_check_3x ()
#5 0x080658d0 in perform_scheduled_host_check ()
#6 0x0807c0e8 in handle_timed_event ()
#7 0x0807bd8c in event_execution_loop ()
#8 0x0805ecaa in main ()
(gdb) quit
Interesting: it's crashing when the Nagios function run_async_host_check_3x calls fprintf. The fault is inside strlen, which looks like a NULL pointer being passed for a %s argument. Let's get the actual line number by installing a Nagios binary that has not been stripped of debugging symbols. Thankfully the Nagios Makefile already has a target for this:
# cd /opt/src/nagios-3.2.0
# gmake install-unstripped
cd ./base && gmake install-unstripped
gmake[1]: Entering directory `/opt/src/build/nagios/files/nagios-3.2.0/base'
gmake install-basic
gmake[2]: Entering directory `/opt/src/build/nagios/files/nagios-3.2.0/base'
/opt/sfw/bin/install -c -m 775 -o nagios -g nagios -d /opt/nagios/bin
/opt/sfw/bin/install -c -m 774 -o nagios -g nagios nagios /opt/nagios/bin
*snip*
Now we re-run nagios via SMF, then gdb the latest coredump:
# gdb /opt/nagios/bin/nagios /globalcore/core.nagios.29248
GNU gdb 6.6
Copyright (C) 2006 Free Software Foundation, Inc.
*snip*
Core was generated by `/opt/nagios/bin/nagios /opt/nagios/etc/nagios.cfg'.
Program terminated with signal 11, Segmentation fault.
#0 0xfed3590c in strlen () from /lib/libc.so.1
(gdb) bt
#0 0xfed3590c in strlen () from /lib/libc.so.1
#1 0xfed8eda6 in _ndoprnt () from /lib/libc.so.1
#2 0xfed9192d in fprintf () from /lib/libc.so.1
#3 0x08067c42 in run_async_host_check_3x (hst=0x8139b78, check_options=0, latency=0.048000000000000001,
scheduled_check=1, reschedule_check=1, time_is_valid=0x8047b40, preferred_time=0x8047b48)
at /opt/src/build/nagios/files/nagios-3.2.0/base/checks.c:3134
#4 0x08066f69 in run_scheduled_host_check_3x (hst=0x8139b78, check_options=0, latency=0.048000000000000001)
at /opt/src/build/nagios/files/nagios-3.2.0/base/checks.c:2791
#5 0x080658d0 in perform_scheduled_host_check (hst=0x8139b78, check_options=0, latency=0.048000000000000001)
at /opt/src/build/nagios/files/nagios-3.2.0/base/checks.c:2108
#6 0x0807c0e8 in handle_timed_event (event=0x8133010) at /opt/src/build/nagios/files/nagios-3.2.0/base/events.c:1261
#7 0x0807bd8c in event_execution_loop () at /opt/src/build/nagios/files/nagios-3.2.0/base/events.c:1132
#8 0x0805ecaa in main (argc=134510324, argv=0x8139b78) at nagios.c:849
(gdb) quit
Aha! Now we have a line number. The line in question, line 3134 of checks.c, reads:
fprintf(check_result_info.output_file_fp,"output=%s\n",checkresult_dbuf.buf);
So checkresult_dbuf.buf must be NULL. I googled and found someone discussing it on the nagios-devel mailing list. It seems the fix they committed (checking whether checkresult_dbuf.buf is NULL) has since been reverted or overwritten, as the check is no longer in place in Nagios 3.2.0. Not to worry, here's a patch:
--- base/checks.c.orig 2009-10-16 19:28:42.082321083 +0100
+++ base/checks.c 2009-10-16 19:29:02.197305557 +0100
@@ -3131,7 +3131,7 @@
fprintf(check_result_info.output_file_fp,"early_timeout=%d\n",check_result_info.early_timeout);
fprintf(check_result_info.output_file_fp,"exited_ok=%d\n",check_result_info.exited_ok);
fprintf(check_result_info.output_file_fp,"return_code=%d\n",check_result_info.return_code);
- fprintf(check_result_info.output_file_fp,"output=%s\n",checkresult_dbuf.buf);
+ fprintf(check_result_info.output_file_fp,"output=%s\n",(checkresult_dbuf.buf==NULL)?"(null)":checkresult_dbuf.buf);
/* close the temp file */
fclose(check_result_info.output_file_fp);
Apply this and you should be all set!
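If you'd rather not repeat that ternary at every call site, the same guard can be pulled into a small helper. The sketch below is only an illustration of the pattern used in the patch: safe_str and write_output_line are hypothetical names of my own, not functions that exist in Nagios. Passing NULL for a %s argument is undefined behaviour in C; some libc implementations happen to print "(null)", but the backtrace above shows this one dereferencing the pointer in strlen.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical helper (not part of Nagios): return a printable
 * placeholder when s is NULL, otherwise return s unchanged. */
static const char *safe_str(const char *s)
{
    return (s == NULL) ? "(null)" : s;
}

/* Hypothetical wrapper: write an "output=..." line without ever
 * handing a NULL pointer to the %s conversion.
 * Returns whatever fprintf returns: the number of characters
 * written, or a negative value on error. */
static int write_output_line(FILE *fp, const char *buf)
{
    assert(fp != NULL); /* the caller must supply an open stream */
    return fprintf(fp, "output=%s\n", safe_str(buf));
}
```

Calling write_output_line(fp, checkresult_dbuf.buf) would then behave like the patched line above, writing "output=(null)" when the buffer was never allocated.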

2 Comments
1. Peter Eriksson | February 3rd, 2010 at 11:46 am
I’ve found a couple of other problems with Nagios on Solaris 10 too…
When you have a large setup with many hosts and many services the default “rlimit” on “descriptors” (256) is too low, which causes file leaks in /var/nagios/spool/checkcommands
(when I found it we had accumulated about 600K files there :-)
Raising the limit to “unlimited” causes “check_dns” to coredump instead… Setting it to 1024 seems to work better though.
2. Alasdair | February 3rd, 2010 at 12:13 pm
Hi Peter,
Thanks for that additional info – useful to know :)
Cheers,
Alasdair