Alasdair on Everything

Archive for June, 2010

UK OpenSolaris IPS Mirror

This is just a quick post to let people know that we run a UK OpenSolaris IPS mirror, including the full /dev branch.

To use it, simply run:

# Add the mirror repository
pkg set-authority -m http://pkg.osol.mirror.everycity.co.uk opensolaris.org

It’s also worth mentioning that when using the mirror, pkg search/install still requests the metadata from the origin server, which by default is in North America. Sun/Oracle run a European origin in the Czech Republic, which is faster from the UK. You can point your publisher at it by doing:

# Set new European origin
pkg set-publisher -g http://pkg-eu-2.opensolaris.org/dev opensolaris.org

# Remove USA origin
pkg set-publisher -G http://pkg.opensolaris.org/dev opensolaris.org

What IPS mirrors won’t do for you

As I mentioned, unfortunately a mirror doesn’t mirror the metadata, so when you do a search or request an install, pkg still connects to the origin server. This kind of defeats the purpose of running a mirror, as the metadata operations are often the slowest bit, and is yet another stupid limitation of IPS as a system.

If you like to live life dangerously, there’s an abusive method of creating your own package authority/origin by tearing the packages out of the repo, but it hammers the origin server for each package request and is frowned upon. Don’t let that stop you - perhaps the IPS gods will get the message that they need to implement this feature for people who want to deploy IPS mirrors in closed networks.

I’m not going to run our own origin until there’s an officially supported method of doing this, as I’m not sure of the consequences. Sucking down all the packages this way resets all the timestamps, and I’m not sure if this might cause problems later down the line. Hopefully an official method of creating your own origin will spring up soon.

Fellow LoSUG presenter Andrew Watkins wrote up rather good details of how to create your own origin server using the unsupported method over here. I believe he had pretty good success with it, and presented this method at his LoSUG talk on the Automated Installer. His article is a tidied-up version of a post by Christopher Kampmeier. I’ve posted a comment on his article asking for feedback on this method.

Running your own mirror

Running your own mirror isn’t too hard, with instructions here. As of the date of this blog post, the whole mirror is 56GB on a compression=on ZFS dataset. Dedupe might reduce this significantly, but I’m not that brave.

If you create a mirror and your IPS Repo reports "0 packages", it’s because you’ve started a full repo server and not a mirror. Double check you did this bit:

# svccfg -s pkg/server
> add mirror
> select mirror
> addpg pkg application
> addpg start method
> setprop pkg/mirror = boolean: true
> setprop pkg/inst_root = astring: "/export/pkg"
> setprop pkg/threads = count: 50
> exit
# svcadm refresh pkg/server:mirror
# svcadm enable pkg/server:mirror
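To confirm the instance really did come up in mirror mode, the property can be read back with svcprop. The following is only a sketch: `mirror_mode` is a made-up helper name, not part of IPS or SMF, and it assumes the `pkg/server:mirror` instance and `pkg/mirror` property created above.

```shell
# Interpret the value of the pkg/mirror property of a pkg/server instance.
# mirror_mode is a hypothetical helper name used for illustration only.
mirror_mode() {
    case "$1" in
        true)  echo "mirror" ;;
        false) echo "full repository" ;;
        *)     echo "unknown" ;;   # property missing or unreadable
    esac
}

# Query the instance created above; skipped on hosts without SMF's svcprop.
if command -v svcprop >/dev/null 2>&1; then
    mirror_mode "$(svcprop -p pkg/mirror pkg/server:mirror 2>/dev/null)"
fi
```

Anything other than "mirror" here would be consistent with the "0 packages" symptom described above.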

Enjoy!

June 26th, 2010

More Solaris Broadcom Driver Information

Update 2010-07-01: Sun got back to one of the blog commenters regarding the issue with Broadcom NICs dropping out on HP servers and stated the issue relates to the HP supplied Broadcom drivers, and Sun recommended using these. So HP people may be seeing a different issue. Please see this blog comment for details. Many thanks for passing this information on, Daniel!


As previously mentioned, we’ve been having a nightmare with Broadcom NICs suddenly dropping out or hanging. All network traffic stops, despite the interfaces being up and showing no signs of any issues. This started affecting us after rolling out an upgrade to Solaris 10 update 8, but it also affects recent OpenSolaris builds. We’ve seen it on Dell R410 and R710 servers, and we’ve heard about people on HP servers having the same issue.

We thankfully found a workaround for it, which basically consists of disabling C-States in the BIOS. This is a power saving feature and support for it was added into Solaris 10 update 8, which is where we’re seeing the issue.

However, prior to finding this workaround, I contacted Broadcom via the “Submit a support request” feature on their website. Nobody got back to me, and we were getting rather desperate, so I was rather naughty and dropped one of their kernel driver engineers a direct email. I won’t say who, as he probably doesn’t want others mailing him directly.

The chap replied promptly, which was very impressive. He was very polite and explained that he couldn’t really help customers directly, as the OEM suppliers get upset, but he did offer some hints/tips. He mentioned that MSI-X was causing issues on Linux and suggested disabling it if we’re using v5.2.3 drivers or later. We’re not, we’re on 5.2.2 and 5.2.2 is the newest release available on the Broadcom website, so that was quite interesting.

He attached the release notes for the 6.0.1 driver which isn’t publicly available yet. Here is a snippet of the contents:

               Broadcom NetXtreme II Gigabit Ethernet Driver
                      For Solaris 10 for i386 platform

              Copyright (c) 2000-2010 Broadcom Corporation
                         All rights reserved.

Version 6.0.1 (21 May, 2010)
============================

    Fixes
    -----
        1) Problem : default MTU now set to 1500, fixed jumboframe
                     and vlan issues.
           Cause :   buffer sizes weren't being allocated properly
                     to account for MAC header overhead w/ vlan tags
           Change :  allocations are now correct
        2) Problem : when MSIX interrupt allocation failed driver
                     fails to attach
           Cause :   code didn't exist to revert down to Fixed
           Change :  driver now reverts to Fixed when MSIX interrupt
                     allocation fails

Version 5.2.3 (23 March, 2010)
==============================

    Enhancements
    ------------
        1) Change  : Reworked interrupt code to no longer use deprecated
                     Solaris interrupt APIs.
        2) Change  : Added support for MSI-X interrupts. MSI-X is now used
                     by default and can be turned off via "disable_msix"
                     inside bnx.conf.  When MSI-X is disabled then Fixed
                     level interrupts are used.
        2) Change  : Added a new "statistics" group to kstat which contains
                     driver version and interrupt information.

Version 5.2.2 (14 December, 2009)
=================================

    Fixes
    -----
        1) Problem : Kernel Panic in the send routine:
                     assertion failed: umpacket->mp == NULL,
           Cause   : The umpacket->mp was not scrubbed properly because
                     the umpacket never went through the
                     bnx_xmit_ring_reclaim() function.
           Change  : After recycling the packet in the TX routine,
                     the packet is now reclaimed before it is being used.

The 6.x driver for Solaris 10 should hopefully be available later this year. The one that’s in OpenSolaris unfortunately can’t be used with Solaris 10 due to network stack differences.

But the interesting thing is that there *is* a newer 5.2.3 driver out there that came out in March this year. So I had a google, and it looks like this driver has been supplied to OEMs but still isn’t available from Broadcom directly. So I downloaded an IBM driver ISO image that contains this newer driver, and it installs fine. We’re going to be using this in conjunction with disabling C-States and I’ll report back on how that combination is going.

After discovering the C-States workaround for the NIC dropouts I mailed the Broadcom guy again to let him know, and stated we’d be disabling C-States to see if it fixes the issue. He replied with:

Please let me know if this works for you so that I can pass it on to our
Solaris developers.  

I checked with them to see if this was a known issue and they replied that
they had been trying to duplicate the problem but had not been successful
to date.  When performance testing we often disable certain CPU features
in order to maximize Ethernet throughput so it may be that the system BIOS
settings are the key difference here.

So this is very encouraging - hopefully this tip will enable the Broadcom Solaris engineers to reproduce the issue and fix it.

One final thing - to keep all our servers identical, in addition to flashing the system BIOS, DRAC firmware and LSI/SAS6i firmware, we’ve now started upgrading the firmware on all the Broadcom NICs too.

This is easier said than done. My method involved producing a 2.88MB DOS boot image with the appropriate files, taken from various places. I nabbed the latest Dell Broadcom NIC firmware Linux package to get the firmware files. I then pinched the DOS uxdiag.exe tool from the Broadcom diagnostics ISO to do the upgrades. I then produced a .bat file which runs:

uxdiag -c 1 -t abcd -F -fbc bc09x50b.bin
uxdiag -c 2 -t abcd -F -fbc bc09x50b.bin

uxdiag -c 1 -t abcd -F -fncsi ncsifw_x.205
uxdiag -c 2 -t abcd -F -fncsi ncsifw_x.205

uxdiag -c 1 -t abcd -F -fib_ipv4n6 ib6btv41.06
uxdiag -c 2 -t abcd -F -fib_ipv4n6 ib6btv41.06

uxdiag -c 1 -t abcd -F -fmba bxmba508.nic
uxdiag -c 2 -t abcd -F -fmba bxmba508.nic

uxdiag -c 1 -t abcd -mfw 0
uxdiag -c 2 -t abcd -mfw 0
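The batch file repeats the same five steps for each card, so for boxes with a different number of ports it could be generated rather than edited by hand. This is a rough sketch only: `gen_fw_bat` is a made-up helper name, the uxdiag flags and firmware file names are copied unchanged from the commands above, and the CRLF line endings are an assumption about what DOS expects.

```shell
# Print the uxdiag command lines for every card index given as an argument.
# gen_fw_bat is a hypothetical helper; flags and file names are copied as-is
# from the batch file above, and nothing here talks to the hardware.
gen_fw_bat() {
    for step in "-F -fbc bc09x50b.bin" \
                "-F -fncsi ncsifw_x.205" \
                "-F -fib_ipv4n6 ib6btv41.06" \
                "-F -fmba bxmba508.nic" \
                "-mfw 0"; do
        for card in "$@"; do
            # CRLF line endings, assuming the file will be run under DOS.
            printf 'uxdiag -c %s -t abcd %s\r\n' "$card" "$step"
        done
    done
}

# Reproduce the two-card listing above; redirect to a .bat file to use it.
gen_fw_bat 1 2
```

Running it with `1 2 3 4` would produce the same steps for a four-port box, assuming uxdiag numbers the cards consecutively.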

What a lot of faffing about. You’d think Dell would make this stuff easier to do. Anyway, if you’re interested, please feel free to download my Broadcom DOS Firmware update disk image.

14 comments June 26th, 2010

Update to Broadcom NIC Dropping out on Solaris issue

Update 2010-07-01: Sun got back to one of the blog commenters regarding the issue with Broadcom NICs dropping out on HP servers and stated the issue relates to the HP supplied Broadcom drivers, and Sun recommended using these. So HP people may be seeing a different issue. Please see this blog comment for details. Many thanks for passing this information on, Daniel!


BREAKING NEWS - 2010-06-25 11:30 BST (GMT+1): I’ve just spoken with a chap called mui on #opensolaris on irc.freenode.net who reports that this issue relates to “C States”. Disabling “C States” in the BIOS (it’s in “Processor Settings” on Dell boxes) supposedly will work around the issue. C States support was added in Solaris 10 update 8, so this is probably why our Solaris 10 update 7 boxes are unaffected.

Supposedly Sun/Oracle have a patch internally they can supply to you for Solaris 10 if you have a support contract. If you’re on OpenSolaris, mui has made this package available that works with snv_134. DISCLAIMER: Please test this prior to putting it into production as it’s provided with no warranty. Alternatively you might be able to grab the latest 6.0.1 BNX driver from the on-closed-bins.i386.tar.bz2 package on the OpenSolaris website.

Here’s the rest of the (now somewhat out of date) post…


Right, I have an update on the Broadcom NIC issue.

It seems the BIOS was a bit of a red herring: the Broadcom FW is completely independent of the system BIOS, and downgrading the BIOS doesn’t change the Broadcom FW version. Pretty obvious really - I have no idea where I read that the two were linked.

Anyway, I did find a Broadcom firmware tool called lnxfwnx2 which Dell distributes in the Broadcom firmware update packages. It’s a Linux tool and it lets you save out/restore firmware from Broadcom NICs.

Unfortunately I couldn’t find 4.x.x firmware releases for the card, only 5.x.x releases. It’s highly frustrating that Broadcom don’t provide these things directly.

However we have two Dell R410 boxes running Solaris 10 update 7 which have been running for over 200 days and never had any network issues at all. They have the 4.6.4 firmware on them. I am planning on taking one of these out of service, saving out the Broadcom firmware with the tool, and then loading this firmware onto the new misbehaving Dells.

I’ll also copy across the same BRCMbnx driver package from the boxes that haven’t had any issues as well. I’m also planning on putting the same Dell System BIOS on the new machines as the working ones. This way the Broadcom FWs will match, the System BIOS will match, and the Drivers will match. The only difference will be Solaris 10 update 7 vs Solaris 10 update 8.

We can then see if the new boxes behave themselves…

The Dell package is here: ftp://ftp.us.dell.com/network/NETW_FRMW_LX_R259547.BIN

I couldn’t get it to run on Ubuntu/Debian based distros, but it runs fine on the CentOS 5 32bit live CD:

http://mirror.sov.uk.goscomb.net/centos/5/isos/i386/CentOS-5.5-i386-LiveCD-Release2.iso

Once you’ve booted the LiveCD, configure the network, then do:

# wget ftp://ftp.us.dell.com/network/NETW_FRMW_LX_R259547.BIN
# chmod 755 NETW_FRMW_LX_R259547.BIN
# ./NETW_FRMW_LX_R259547.BIN --extract r259
# cd r259
# ./lnxfwnx2

It’s an interactive tool and you can type “help” to get a list of commands.

Here’s an example of saving/restoring:

0> dumpnvram nic-fw-backup.bin
0> restorenvram new-nic-frmw.bin

I got these instructions from here. It’s also possible that saving the NVRAM saves all options, including the MAC address, so double check this when restoring the NVRAM on a different machine.

It also looks like the DOS based diagnostics ISO from Broadcom’s website has a similar tool called uxdiag.exe which can program the firmware and turn various features of the card on/off, such as WOL (Wake on LAN), the ‘mba’ (MultiBoot Agent) and the ‘management firmware’ (still don’t know what this does). You can get the ISO from:

http://www.broadcom.com/support/ethernet_nic/driver-sla.php?driver=NX2-diag

The boot menu gives the option “Install FreeDOS to harddisk” which is the option you want - you can opt later on not to do this but to run FreeDOS from the CD. A bit confusing. The uxdiag tool has a manual here.

I also spotted this thread on forums.sun.com which suggests a lot of HP people are having the same issue, irrespective of the FW version. So it remains to be seen what the root cause actually is.

June 25th, 2010

Broadcom NICs dropping out on Solaris 10

Update 2010-07-01: Sun got back to one of the blog commenters regarding the issue with Broadcom NICs dropping out on HP servers and stated the issue relates to the HP supplied Broadcom drivers, and Sun recommended using these. So HP people may be seeing a different issue. Please see this blog comment for details. Many thanks for passing this information on, Daniel!


BREAKING NEWS - 2010-06-25 11:30 BST (GMT+1): I’ve just spoken with a chap called mui on #opensolaris on irc.freenode.net who reports that this issue relates to “C States”. Disabling “C States” in the BIOS (it’s in “Processor Settings” on Dell boxes) supposedly will work around the issue. C States support was added in Solaris 10 update 8, so this is probably why our Solaris 10 update 7 boxes are unaffected.

Supposedly Sun/Oracle have a patch internally they can supply to you for Solaris 10 if you have a support contract. If you’re on OpenSolaris, mui has made this package available that works with snv_134. DISCLAIMER: Please test this prior to putting it into production as it’s provided with no warranty. Alternatively you might be able to grab the latest 6.0.1 BNX driver from the on-closed-bins.i386.tar.bz2 package on the OpenSolaris website.

Here’s the rest of the (now somewhat out of date) post…


We’ve encountered this bug quite a few times and up until I found these bug reports, we weren’t sure what was causing the issue:

S10 bnx NICs randomly hang/drop out of the network

The symptoms are basically that the server loses network connectivity - traffic just stalls. Because this keeps happening on production boxes we have to reboot pretty damn quickly, so we haven’t had an opportunity to diagnose the issue in detail. We tried a number of fixes to no avail, and I was at my wits’ end until I encountered the above bug report.

Our servers are Dell R410 machines and we’ve seen this happening on Dell R710 machines as well, with Solaris 10 update 8. We’re running with the latest Solaris 10 patches and the latest Broadcom drivers from the Broadcom website (5.2.2). I believe we’ve seen this issue with the stock drivers shipped with Solaris 10 update 8 as well.

From the bug reports, the issue seems related to the firmware running on the cards - version 5* is affected, version 4* isn’t. I believe the Firmware is tied to the Dell BIOS running on the machine. Here’s the output from one of our affected boxes:

# prtdiag | head -n 2
System Configuration: Dell Inc. PowerEdge R410
BIOS Configuration: Dell Inc. 1.3.9 04/07/2010

# grep -i BCM /var/adm/mes*
/var/adm/messages:Jun 12 03:21:38 bnx: [ID 995108 kern.info] NOTICE:
bnx0: BCM5709 device with F/W Ver500000b is initialized.
/var/adm/messages:Jun 12 03:21:38 bnx: [ID 995108 kern.info] NOTICE:
bnx1: BCM5709 device with F/W Ver500000b is initialized.

Here is the output from a machine that’s not affected:

# prtdiag | head -n 2
System Configuration: Dell Inc. PowerEdge R410
BIOS Configuration: Dell Inc. 1.1.5 07/29/2009

# grep BCM /var/adm/messages*
/var/adm/messages.2:May 27 15:11:43 bnx: [ID 995108 kern.info] NOTICE:
bnx1: BCM5709 device with F/W Ver4060004 is initialized.
/var/adm/messages.2:May 27 15:11:43 bnx: [ID 995108 kern.info] NOTICE:
bnx0: BCM5709 device with F/W Ver4060004 is initialized.
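When checking more than a couple of machines, the firmware version can be pulled out of these log lines and sorted by major version. A minimal sketch, assuming the message format shown above: `classify_bnx_fw` is a made-up helper name, and the 4.x/5.x split reflects only what the bug reports suggested, not a confirmed root cause.

```shell
# Read bnx init messages on stdin and print "<device> <fw version> <family>".
# classify_bnx_fw is a hypothetical helper; it assumes the version follows
# "F/W Ver" and that its first digit is the major version.
classify_bnx_fw() {
    sed -n 's|.*\(bnx[0-9][0-9]*\): .*F/W Ver\([0-9A-Za-z]*\).*|\1 \2|p' |
    while read -r dev ver; do
        case "$ver" in
            4*) echo "$dev $ver 4.x" ;;    # family the bug reports call unaffected
            5*) echo "$dev $ver 5.x" ;;    # family the bug reports call affected
            *)  echo "$dev $ver other" ;;
        esac
    done
}

# Sample input taken from the two machines above.
classify_bnx_fw <<'EOF'
bnx0: BCM5709 device with F/W Ver500000b is initialized.
bnx1: BCM5709 device with F/W Ver4060004 is initialized.
EOF
```

On a real box the input would come from something like `grep -h 'F/W Ver' /var/adm/messages*` piped into the function.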

My understanding is that the fix is to downgrade the BIOS of the machine to a previous release that uses a 4* Broadcom Firmware release. We haven’t yet tested this but should be able to later this week. So far it doesn’t look like Sun/Oracle have released a publicly available patch to address the issue.

Update: 2010-06-25 - Upgrading/Downgrading the system BIOS makes no difference to the Broadcom FW (duh! silly me). I’ve written an updated post with more information here: http://blogs.everycity.co.uk/alasdair/2010/06/update-to-broadcom-nic-dropping-out-on-solaris-issue/

13 comments June 14th, 2010