Alasdair on Everything

Posts filed under 'Solaris'

vasprintf and asprintf on Solaris 10

Update: Martin in the comments suggested using the vasprintf definition in the OpenSolaris source.


If you get link errors such as this on Solaris 10, it’s because its libc lacks asprintf and vasprintf (thankfully they were added to OpenSolaris, so there is no problem on OpenIndiana):

Undefined                       first referenced
 symbol                             in file
asprintf                            ../../bin/gcc/libgpac.so

There’s quite a nice implementation (no idea how safe it is to use, but at least the program now compiles!) over at Stack Overflow - http://stackoverflow.com/questions/4899221/substitute-or-workaround-for-asprintf-on-aix. Take the latter one, by Jonathan Leffler.

You can drop it in like so:

#if (defined (__SVR4) && defined (__sun))
#include <stdarg.h>
#include <stdio.h>
#include <stdlib.h>

int vasprintf(char **ret, const char *format, va_list args)
{
        va_list copy;
        va_copy(copy, args);

        /* Make sure it is determinate, despite manuals indicating otherwise */
        *ret = 0;

        /* First pass: measure the formatted length without writing anything */
        int count = vsnprintf(NULL, 0, format, args);
        if (count >= 0) {
                char* buffer = malloc(count + 1);
                if (buffer != NULL) {
                        /* Second pass: format into the buffer using the copy */
                        count = vsnprintf(buffer, count + 1, format, copy);
                        if (count < 0)
                                free(buffer);
                        else
                                *ret = buffer;
                } else {
                        count = -1;  /* allocation failed, report an error */
                }
        }
        va_end(copy);  // Each va_start() or va_copy() needs a va_end()

        return count;
}

int asprintf(char **strp, const char *fmt, ...)
{
        int size;
        va_list args;
        va_start(args, fmt);
        size = vasprintf(strp, fmt, args);
        va_end(args);
        return size;
}
#endif

Since the code I’m compiling will not be facing the internet and will only be run by trusted users, I’m not too worried about how robust this code is against buffer overflows. If you are concerned about that, you might want to take a look at gnulib, which has a nice, properly portable version, although it’s a lot bigger.

2 comments July 19th, 2011

Solaris 10 for free, or on Non-Sun hardware, is dead

Update, 28th July: I was perhaps a bit premature in declaring this. HP is now selling Solaris Support once again, and I believe you can still get Solaris support on some Dell models. IBM however are no longer offering it.

So the title should more accurately read: Solaris 10 for free for production use, is dead


I think quite a lot of us have been living in denial about this, even after Oracle altered the Solaris 10 license to make the free download a 90 day trial. People sort of shrugged and said “Well, you can still buy a support license from HP and Dell”. Even after Oracle cancelled the HP deal, people were still hopeful. “Perhaps this was a negotiating tactic!” I heard people cry on IRC.

Well, the fact is, the truth should have been obvious as far back as February. On February 23rd this year, Dan Roberts, Director of Solaris Product Management at Oracle told the OpenSolaris Governing Board:

Q - PT - What about support on third-party hardware?

A - DR - At this point Oracle is very focused on places where they can make revenue and margin. Unfortunately for us, we have not seen a good uptake on those standalone subscriptions. Has seen more emails on the topic than the total number of systems sold. Hard to make a case. At this point, there are no plans to support non-Sun systems. We will continue to honor existing contracts for the term of that contract. Over time, we hope to move folks over to Sun hardware.

Q - PT - What about regular Solaris?

A - DR - Same answer as above.

Q - PT - Will the ability to download and run it without support continue?

A - DR - Look at the licenses carefully. Production deployments will require a support agreement which is sold on Sun systems only.

In plain English, Oracle has no intention of providing support for OpenSolaris, nor for Solaris on non-Sun hardware. Nor will you be allowed to run Solaris 10 on a production system without a support contract.

This relegates the OpenSolaris distribution to a useless toy not fit for production, and means if you want to use Solaris 10 you have to buy a Sun server from Oracle and buy a support contract.

This effectively makes Solaris 10 unviable for a large number of users. While Oracle’s Sun servers are beautiful pieces of engineering, they are vastly overpriced, and you can get equivalent Dell kit for half the price.

Dan did say some fairly positive things, stating:

* Oracle is increasing investment in Solaris and Oracle considers OpenSolaris a part of Solaris.
* Will continue to support the community.
* Will continue to contribute to the source base.
* Plan to continue OpenSolaris releases.
* Solaris releases will continue.
* What will Oracle do to support OpenSolaris as a distribution? We will continue to support Solaris offerings and we will continue to include OpenSolaris. The form will change. We will no longer offer independent support offerings for Solaris or OpenSolaris. They will be part of Systems Support Offerings that include Sun hardware.

If the community wants to continue to be able to run some form of Solaris on their non-Sun hardware (or on their Sun hardware but without a support contract), the community is going to have to step up and do something.

I have very strong reason to believe the community is about to do just that. I can’t provide details just yet, but something big may be coming RealSoonNow[tm]. Stay tuned.

4 comments July 17th, 2010

OpenSolaris - July Update

Well, the OpenSolaris Governing Board has given Oracle an ultimatum: Make contact by August 16th, or they resign and hand control of the community back to Oracle.

To quote the OGB’s forum post:

"Without the Oracle part of the partnership at the table, there is effectively nothing for the OGB - or development community - to do. The flagship OpenSolaris distro is absent, the IPS repositories are stagnant, the build instructions no longer work for the sources that exist, even the architectural reviews of community-developed components are being held behind Oracle’s closed doors. It is as if the spirit of open, collaborative development centered around the Solaris operating system has died."

Nobody really knows what Oracle are up to, but their decision not to even talk to the OpenSolaris Governing Board strongly suggests Oracle are uninterested in the health of the community. My personal opinion, based on what I’ve read and observed, is that Larry wants Solaris for Oracle’s enterprise systems at the top end, and doesn’t give two shits about OpenSolaris or the community.

As such, the best the community can hope for is that Oracle will continue to provide the source code to OpenSolaris. Worst case, this disappears. I don’t even want to contemplate this, as it essentially means we’ll have to formulate a “Solaris Exit Plan”. Effectively this means NetApp and Ubuntu.

Anyway, let’s assume for now Oracle will continue to provide the OpenSolaris source code to the community. If they do, then I have some opinions on what the community should do.

Here is what I posted to the OpenSolaris discuss and ogb mailing list:


IMHO, the Oracle/Sun provided OpenSolaris reference distribution (henceforth referred to as Indiana to avoid confusion) has done the community a disservice, in the sense that it has prevented the community from producing something itself.

All the other OpenSolaris-based distributions, such as Schillix and Nexenta, cater for particular niches, but what’s needed is a community-produced version of Indiana: one with the same (or at least similar) goals and an identical or similar architecture, including aspects such as IPS, Automated Installer, Zones, etc.

As long as Oracle/Sun continue to release their own distribution, the community has no real reason to do so. Well, perhaps now is the time for this to happen. Perhaps what is needed is an agreement with Oracle along the lines of:

1. Oracle agrees to continue to provide the source code for OpenSolaris (nevada), along with constituent parts (such as IPS/pkg). Oracle continue to provide bug and security fixed updates to the closed source binaries.

2. OpenSolaris 2010.xx is never released, but becomes Solaris Next.

3. The community steps up and produces its own version of Indiana, tracking Solaris Next as best it can in a binary and package compatible way.

4. The community maintains its own source code repository that developers can commit to, and Oracle takes community improvements that they want.

This frees Oracle from their obligation to the community, and allows them to maintain their secrecy and radio silence. But it forges an even stronger community that can stand on its own legs.

Obviously the issue the community has is that we’ve never had the ability to produce the distribution itself. We don’t have the ability to build all the packages that go into the IPS repo, nor produce the Live CD, nor do we have an installer. And of course, finding people to do the actual work would present a significant challenge.

The good news is that there is a community out there. There are the community members who have been involved with the OpenSolaris derived distributions. There are ex Sun/Oracle staff who have moved to other companies, such as Nexenta. There are projects such as OSUnix who are trying to produce their own OS from the OpenSolaris codebase by replacing the closed binaries/code (such as the internationalised bits of libc).

Not to mention, there’s Blastwave and OpenCSW who are already building large amounts of software for Solaris/OpenSolaris, and if one/both decided to contribute, we have a huge source of software packages for the community based distro.

If the fragmented OpenSolaris community rallied round and came together, I’m quite confident a community based distribution could thrive. Indeed, if Solaris Next does become an “Oracle Hardware Only” OS, then an entire company providing support for the community based distribution would definitely have legs, and this could potentially afford to pay staff to work on building the distribution full time. Solaris is run by a very large number of people on Dell/HP/etc kit and these users would no doubt be eager to jump onto such a distribution.

I’m going to be talking about my thoughts on this at the London OpenSolaris Users Group later this month, if anyone is in London and wants to come along. And of course I’d appreciate people’s comments here on this thread.

Alasdair

1 comment July 15th, 2010

Broadcom NICs dropping out on Solaris 10

Update 2010-07-01: Sun got back to one of the blog commenters regarding the issue with Broadcom NICs dropping out on HP servers and stated the issue relates to the HP supplied Broadcom drivers, and Sun recommended using these. So HP people may be seeing a different issue. Please see this blog comment for details. Many thanks for passing this information on, Daniel!


BREAKING NEWS - 2010-06-25 11:30 BST (GMT+1): I’ve just spoken with a chap called mui on #opensolaris on irc.freenode.net who reports that this issue relates to “C States”. Disabling “C States” in the BIOS (It’s in “Processor Settings” on Dell boxes) supposedly will work-around the issue. C States support was added in Solaris 10 update 8, so this is probably why our Solaris 10 update 7 boxes are unaffected.

Supposedly Sun/Oracle have a patch internally they can supply to you for Solaris 10 if you have a support contract. If you’re on OpenSolaris, Mui has made this package available that works with snv_134. DISCLAIMER: Please test this prior to putting it into production as it’s provided with no warranty. Alternatively you might be able to grab the latest 6.0.1 BNX driver from the on-closed-bins.i386.tar.bz2 package on the OpenSolaris website.

Here’s the rest of the (now somewhat out of date) post…


We’ve encountered this bug quite a few times and up until I found these bug reports, we weren’t sure what was causing the issue:

S10 bnx NICs randomly hang/drop out of the network

The symptoms are basically that the server loses network connectivity - traffic just stalls. Because this keeps happening on production boxes we have to reboot pretty damn quickly, so we haven’t had an opportunity to diagnose the issue in detail. We tried a number of fixes to no avail, and I was at my wits’ end until I encountered the above bug report.

Our servers are Dell R410 machines and we’ve seen this happening on Dell R710 machines as well, with Solaris 10 update 8. We’re running with the latest Solaris 10 patches and the latest Broadcom drivers from the Broadcom website (5.2.2). I believe we’ve seen this issue with the stock drivers shipped with Solaris 10 update 8 as well.

From the bug reports, the issue seems related to the firmware running on the cards - version 5* is affected, version 4* isn’t. I believe the Firmware is tied to the Dell BIOS running on the machine. Here’s the output from one of our affected boxes:

#  prtdiag | head -n 2
System Configuration: Dell Inc. PowerEdge R410
BIOS Configuration: Dell Inc. 1.3.9 04/07/2010

# grep -i BCM /var/adm/mes*
/var/adm/messages:Jun 12 03:21:38 bnx: [ID 995108 kern.info] NOTICE:
bnx0: BCM5709 device with F/W Ver500000b is initialized.
/var/adm/messages:Jun 12 03:21:38 bnx: [ID 995108 kern.info] NOTICE:
bnx1: BCM5709 device with F/W Ver500000b is initialized.

Here is the output from a machine that’s not affected:

# prtdiag | head -n 2
System Configuration: Dell Inc. PowerEdge R410
BIOS Configuration: Dell Inc. 1.1.5 07/29/2009

#  grep BCM /var/adm/messages*
/var/adm/messages.2:May 27 15:11:43 bnx: [ID 995108 kern.info] NOTICE:
bnx1: BCM5709 device with F/W Ver4060004 is initialized.
/var/adm/messages.2:May 27 15:11:43 bnx: [ID 995108 kern.info] NOTICE:
bnx0: BCM5709 device with F/W Ver4060004 is initialized.
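
With more than a couple of boxes to check, the firmware version can be pulled out of the log and the 5* ones flagged automatically. This is only a rough sketch: bnx_fw_check is a made-up helper name, it assumes bash or ksh, and it assumes each NOTICE sits on a single line in the real log. The "suspect" verdict simply mirrors the 5* versus 4* split from the bug reports.

```shell
# Hypothetical helper (not part of Solaris): reads syslog lines on stdin and
# prints "<nic> <firmware> <verdict>" for every bnx initialisation notice.
bnx_fw_check() {
    # Keep only lines naming a bnxN device and its F/W version string.
    sed -n 's|.*\(bnx[0-9][0-9]*\): .*F/W Ver\([0-9a-fA-F]*\).*|\1 \2|p' |
    while read nic fw; do
        case "$fw" in
            5*) echo "$nic $fw suspect" ;;   # 5* firmware, as named in the bug reports
            *)  echo "$nic $fw ok" ;;
        esac
    done
}
```

Usage would be along the lines of: cat /var/adm/messages* | bnx_fw_check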

My understanding is that the fix is to downgrade the BIOS of the machine to a previous release that uses a 4* Broadcom Firmware release. We haven’t yet tested this but should be able to later this week. So far it doesn’t look like Sun/Oracle have released a publicly available patch to address the issue.

Update: 2010-06-25 - Upgrading/Downgrading the system BIOS makes no difference to the Broadcom FW (duh! silly me). I’ve written an updated post with more information here: http://blogs.everycity.co.uk/alasdair/2010/06/update-to-broadcom-nic-dropping-out-on-solaris-issue/

13 comments June 14th, 2010

Solaris iSCSI Initiator & Reboots

We use Solaris Zones, with each zone stored on its own zpool. The zpool is stored on a SAN, and accessed via iSCSI. We’ve been doing this since Solaris 10 update 6, and Solaris 10 update 8 introduced an interesting issue that we’ve now run into.

When we asked a S10u8 box to reboot, it sat there for 10 minutes shutting down. Why? Because it was trying to stop the iSCSI initiator whilst there were live iSCSI filesystems in use. Duh! Stupid Solaris.

So I compared the iSCSI manifest from S10u7 to S10u8 and they’ve changed it in a few places. It used to depend on svc:/network/physical and svc:/system/metainit, and now it depends on svc:/network/service and svc:/network/loopback. However, the biggest change was the timeout value: it was upped from 5 seconds to 600 seconds. Yes, 10 minutes.

So this highlighted an interesting problem - when rebooting boxes previously, Solaris would always try to stop the iSCSI initiator with live filesystems on it, and give up after 5 seconds and the box would come down.

Rather than hack the timeout value back to 5 seconds, I decided to investigate and see if I could add a dependency to fix this properly. I decided to make the svc:/system/filesystem/local service depend on the iSCSI initiator service. The theory here was that filesystem/local mounts and unmounts the ZFS filesystems, so if it depends on the initiator, the initiator won’t be stopped before the ZFS filesystems are unmounted.

Unfortunately this didn’t work. Somewhere in the enormous SMF dependency tree, I ended up with a cycle, and upon boot services wouldn’t come up. At this point, I gave up and set the timeout back to 5 seconds.

If I can find the time, I’ll try and reproduce this issue on OpenSolaris, then file it on defects.opensolaris.org. After it’s been accepted, I’ll escalate it against our Solaris 10 premium support contract, and see if Sun will actually fix something for us.

3 comments March 23rd, 2010

Making Solaris SMF ignore core dumps in child processes

I can never ever remember how to do this and googling for it always takes ages, so I thought I’d jot it down here.

When Solaris SMF starts a process, it tracks that process and all its children. If any of those children dumps core, SMF treats it as a failure of the whole service and restarts it, or drops it into maintenance if it keeps happening. Not terribly useful if you’re launching buggy software like FFmpeg.

The solution? Simple! Slap this in your SMF Manifest under the exec stop method:

        <property_group name='startd' type='framework'>
                 <!-- sub-process core dumps shouldn't restart
                         session -->
                 <propval name='ignore_error' type='astring'
                         value='core,signal' />
        </property_group>
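
If the service is already imported and you would rather not touch the manifest, the same property can be set with svccfg. A minimal sketch, assuming bash or ksh: smf_ignore_core_cmds is a hypothetical helper that only prints the commands so they can be reviewed before being run, and the FMRI is whatever service you pass in. Note that addpg will complain if the startd property group already exists, in which case that line can be skipped.

```shell
# Hypothetical helper: print the svccfg/svcadm commands that set
# startd/ignore_error on an existing service. Nothing is executed here.
smf_ignore_core_cmds() {
    fmri=$1
    # Create the framework property group that svc.startd reads.
    echo "svccfg -s $fmri addpg startd framework"
    # Same value as the manifest fragment above.
    echo "svccfg -s $fmri setprop startd/ignore_error = astring: core,signal"
    # Make the running service pick up the change.
    echo "svcadm refresh $fmri"
}
```

Once the printed commands look right, they can be piped to sh.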

1 comment January 19th, 2010

Enabling 64bit MySQL on Solaris Sun Web Stack 1.4

Sun Web Stack 1.4 includes both a 32bit and 64bit MySQL, with the standard bin/mysqld and bin/amd64/mysqld binaries.

By default, the SMF service sun-mysql50 runs in 32bit mode. To enable 64bit mode, simply:

# svccfg -s sun-mysql50:default
svc:/application/database/sun-mysql50:default> listprop
sun-mysql50                        application
sun-mysql50/action_authorization   astring  solaris.smf.manage.sun-mysql/default
sun-mysql50/bin                    astring  /opt/webstack/mysql/5.0/bin
sun-mysql50/data                   astring  /var/opt/webstack/mysql/5.0/data
sun-mysql50/value_authorization    astring  solaris.smf.value.sun-mysql/default
sun-mysql50/enable_64bit           boolean  true
method_context                     framework
method_context/group               astring  mysql
method_context/limit_privileges    astring  :default
method_context/privileges          astring  :default
method_context/project             astring  :default
method_context/resource_pool       astring  :default
method_context/supp_groups         astring  :default
method_context/use_profile         boolean  false
method_context/user                astring  mysql
method_context/working_directory   astring  /var/opt/webstack/mysql
general                            framework
general/enabled                    boolean  true
restarter                          framework    NONPERSISTENT
restarter/logfile                  astring  /var/svc/log/application-database-sun-mysql50:default.log
restarter/contract                 count    105
restarter/start_pid                count    606
restarter/start_method_timestamp   time     1233237617.117424000
restarter/start_method_waitstatus  integer  0
restarter/auxiliary_state          astring  none
restarter/next_state               astring  none
restarter/state                    astring  online
restarter/state_timestamp          time     1233237617.119195000
svc:/application/database/sun-mysql50:default> setprop sun-mysql50/enable_64bit=true
svc:/application/database/sun-mysql50:default> exit
# svcadm refresh sun-mysql50
# svcadm disable sun-mysql50
# svcadm enable sun-mysql50
# ps -ef | grep mysql
   mysql   649   490   0 14:00:06 ?           0:00 /bin/sh /opt/webstack/mysql/5.0/bin/64/mysqld_safe --user=mysql --datadir=/var/
   mysql   747   649   0 14:00:06 ?           0:22 /opt/webstack/mysql/5.0/bin/64/mysqld --basedir=/opt/webstack/mysql/5.0 --datad

And as we can see from the process list, the 64 bit binary has been launched instead of the 32 bit one.

Add comment January 29th, 2009

Compiling Python 2.6 on Solaris 10

Sorry for not posting so much lately. Work has been busier than ever - it’s quite incredible. Just a quick post on compiling Python 2.6, which was giving me a few problems.

Dependencies

I’d recommend throwing on ncurses and readline from the Solaris 10 companion CD, the packages are SFWncur and SFWrline. The full dependency list is:

P SFWncur
P SFWrline
P SUNWbzip
P SUNWcry
P SUNWcsl
P SUNWcslr
P SUNWcsr
P SUNWgccruntime
P SUNWlibms
P SUNWlibmsr
P SUNWopenssl-libraries
P SUNWzlib

Compiling

The _ctypes module fails to compile with Sun Studio 12. Rather than fix this, I simply used gcc instead. Python also seemed to be missing _ssl, so I popped in the appropriate library paths. Thus:

export "LDFLAGS=-L/opt/sfw/lib -R/opt/sfw/lib -L/usr/sfw/lib -R/usr/sfw/lib"
export "CPPFLAGS=-I/usr/sfw/include -I/opt/sfw/include -I/opt/sfw/include/ncurses"
export "CFLAGS=-I/opt/sfw/include"
export "LIBS=-lncurses"
export CC=gcc CXX=g++
./configure --prefix=/opt/python26 --enable-shared --disable-ipv6 --with-threads --with-libs="-lncurses" --with-wctype-functions
gmake
gmake install

Not all the modules will compile, but the ones that were missing were not of importance (sqlite, bsddb, etc).
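
To see which optional modules actually made it into the build, a quick import loop does the job. This is a sketch assuming bash or ksh; check_modules is a made-up helper, and the interpreter path in the usage line is an assumption based on the --prefix used above.

```shell
# Hypothetical helper: try to import each named module with the given
# interpreter and report whether it is present.
check_modules() {
    py=$1
    shift
    for mod in "$@"; do
        if "$py" -c "import $mod" >/dev/null 2>&1; then
            echo "$mod ok"
        else
            echo "$mod MISSING"
        fi
    done
}
```

For example: check_modules /opt/python26/bin/python _ssl readline curses ctypes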

1 comment January 27th, 2009

Solaris 10: Swap Space, /tmp and SMF

fork: Not enough space

Solaris 10 by default places /tmp on swap. This is good for speed, but not so good on a general purpose box where some applications may fill up /tmp. If you fill /tmp, you essentially reduce the amount of available swap to 0. That spells trouble: once the box also runs out of physical RAM, new processes can no longer start. You get lovely fork() errors on the shell, and interesting messages in dmesg:

# ps -ef
-bash: fork: Not enough space
# free
-bash: fork: Not enough space
# prstat
-bash: fork: Not enough space
...
# dmesg
...
Dec  7 02:56:27 w01.someserver.everycity.co.uk genunix: [ID 470503 kern.warning] WARNING: Sorry, no swap space to grow stack for pid 8193 (munin-node)
Dec  7 02:56:51 w01.someserver.everycity.co.uk tmpfs: [ID 518458 kern.warning] WARNING: /tmp: File system full, swap space limit exceeded
Dec  7 02:56:57 w01.someserver.everycity.co.uk genunix: [ID 470503 kern.warning] WARNING: Sorry, no swap space to grow stack for pid 8223 (exim)
Dec  7 02:57:26 w01.someserver.everycity.co.uk genunix: [ID 470503 kern.warning] WARNING: Sorry, no swap space to grow stack for pid 563 (httpd)
...

The easiest way to fix this is to immediately disable any services that eat ram using svcadm disable, and clear out /tmp. You can then either move /tmp to a physical partition by editing /etc/vfstab, increase the amount of swap, or my favourite, limit the amount of swap /tmp can use by adding a mount option to /etc/vfstab:

# grep /tmp /etc/vfstab
swap    -       /tmp    tmpfs   -       yes     size=2048m

Unfortunately with this you have to reboot the box, which wasn’t an option with the machine I was running on. So I added a bunch more swap for the time being.
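
To catch this before it bites, the one-line summary from swap -s is easy to parse. The following is a sketch assuming bash or ksh and the usual "... = NNNk used, NNNk available" summary format; the helper names and the 512 MB default threshold are made up for illustration.

```shell
# Hypothetical helpers: both read `swap -s` output on stdin.

# Print the "available" figure in KB.
swap_avail_kb() {
    sed -n 's/.*, \([0-9][0-9]*\)k available.*/\1/p'
}

# Succeed (exit 0) when available swap is below the threshold in KB.
# The default of 524288 KB (512 MB) is an arbitrary example value.
swap_low() {
    avail=$(swap_avail_kb)
    [ -n "$avail" ] && [ "$avail" -lt "${1:-524288}" ]
}
```

Something like: swap -s | swap_low 524288 && echo "swap is getting low" could then be dropped into cron or a monitoring check.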

SMF Unhappy after running out of swap space

However I encountered a rather bizarre issue, which can only be described as a bug. Services I had stopped using svcadm disable, wouldn’t re-enable with svcadm enable:

# svcs http
STATE          STIME    FMRI
disabled       23:26:00 svc:/network/http:apache22-csk
# svcadm -v enable http
svc:/network/http:apache22-csk enabled.
# svcs http
STATE          STIME    FMRI
disabled       23:26:00 svc:/network/http:apache22-csk

What’s going on here? The log in /var/svc/log didn’t report the enable command either. After investigating, I came to the conclusion that SMF must have broken when the box ran out of memory. SMF is managed by two processes, svc.startd and svc.configd, and thankfully you can restart them. Simply kill them both:

# ps -ef | grep svc
    root 7     1   0 Dec 01 ?           0:01 /lib/svc/bin/svc.startd
    root 9     1   0 Dec 01 ?           0:00 /lib/svc/bin/svc.configd
# pkill -9 svc.configd
# pkill -9 svc.startd
# ps -ef | grep svc
    root 12803     1   0 23:47:07 ?           0:01 /lib/svc/bin/svc.configd
    root 12841     1   0 23:47:09 ?           0:00 /lib/svc/bin/svc.startd

Then enabling the service actually works this time:

# svcs http
STATE          STIME    FMRI
disabled       23:26:00 svc:/network/http:apache22-csk
# svcadm -v enable http
svc:/network/http:apache22-csk enabled.
# svcs http
STATE          STIME    FMRI
online         23:49:00 svc:/network/http:apache22-csk

Problem solved! However I dislike it when things silently break in this way. You have to wonder, if SMF broke, what else may be having issues?

Add comment December 8th, 2008

Sun x4500 Thumper: Mapping logical drives to physical

The Sun x4500 has 48 disk slots, numbered 0 to 47. However on Solaris, drives are named according to their controller/target location. I was wondering how you work out how to go from the logical naming, to the physical one.

Well, the answer lies on the x4500 Tools & Drivers CD. On it is a nifty package named "SUNWhd-1.07.pkg", which plonks a utility called "hd" at "/opt/SUNWhd/hd/bin/hd". Running it spits out the serial numbers of the disks, their temperatures, and at the end, some ASCII art depicting the layout:

---------------------SunFireX4500------Rear----------------------------

36:   37:   38:   39:   40:   41:   42:   43:   44:   45:   46:   47:
c4t3  c4t7  c3t3  c3t7  c6t3  c6t7  c5t3  c5t7  c1t3  c1t7  c0t3  c0t7
^++   ^++   ^++   ^++   ^++   ^++   ^++   ^++   ^++   ^++   ^++   ^++
24:   25:   26:   27:   28:   29:   30:   31:   32:   33:   34:   35:
c4t2  c4t6  c3t2  c3t6  c6t2  c6t6  c5t2  c5t6  c1t2  c1t6  c0t2  c0t6
^++   ^++   ^++   ^++   ^++   ^++   ^++   ^++   ^++   ^++   ^++   ^++
12:   13:   14:   15:   16:   17:   18:   19:   20:   21:   22:   23:
c4t1  c4t5  c3t1  c3t5  c6t1  c6t5  c5t1  c5t5  c1t1  c1t5  c0t1  c0t5
^++   ^++   ^++   ^++   ^++   ^++   ^++   ^++   ^++   ^++   ^++   ^++
 0:    1:    2:    3:    4:    5:    6:    7:    8:    9:   10:   11:
c4t0  c4t4  c3t0  c3t4  c6t0  c6t4  c5t0  c5t4  c1t0  c1t4  c0t0  c0t4
^b+   ^b+   ^++   ^++   ^++   ^++   ^++   ^++   ^++   ^++   ^++   ^++
-------*-----------*-SunFireX4500--*---Front-----*-----------*----------

Rather funky, and useful!
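
The layout in the diagram is regular enough to compute: slots run left to right in rows of 12, each pair of columns shares a controller, and the target is the row number plus 4 for the right-hand column of each pair. A rough sketch of that arithmetic, assuming bash or ksh; slot_to_disk is a made-up helper, and the controller order (c4, c3, c6, c5, c1, c0) is read straight off the diagram above, so it may differ on a box where Solaris enumerated the controllers differently. The output of hd remains the authority.

```shell
# Hypothetical helper: map an x4500 physical slot (0-47) to the cXtY name
# shown in the diagram above.
slot_to_disk() {
    slot=$1
    case "$slot" in
        # Reject empty input, non-digits and leading zeros (octal in arithmetic).
        ''|*[!0-9]*|0?*) echo "usage: slot_to_disk <0-47>" >&2; return 1 ;;
    esac
    if [ "$slot" -gt 47 ]; then
        echo "slot out of range: $slot" >&2
        return 1
    fi
    row=$((slot / 12))
    col=$((slot % 12))
    # Controllers per column pair, left to right, as drawn in the diagram.
    set -- 4 3 6 5 1 0
    shift $((col / 2))
    ctrl=$1
    # Left column of a pair holds targets 0-3, right column holds 4-7.
    target=$((row + 4 * (col % 2)))
    echo "c${ctrl}t${target}"
}
```

For example, slot_to_disk 13 gives c4t5, matching the second row of the diagram.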

Add comment November 16th, 2008
