Upgrading CheckMK 2.1 => 2.3

Hi Davide,

In general, as already discussed in "Different version for Master & Slave - #3 by erik", you can only upgrade from one minor version to the next minor version (for example 2.1 to 2.2, then 2.2 to 2.3), and you have to upgrade all remote sites before you can upgrade the master.
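
To make the "one step at a time" rule concrete, here is a minimal Bash sketch that checks whether a target version is on the same release line or exactly one line ahead. The function name and the sample version strings are made up for illustration, and it only handles steps within the same first number (2.x); it is not part of any Checkmk tooling.

```bash
#!/usr/bin/env bash
# Hypothetical helper: returns 0 if TARGET is on the same release line as
# CURRENT or exactly one line ahead (2.1 -> 2.2), otherwise 1.
# Assumes version strings such as "2.1.0p42" or "2.1.0p42.cee".
is_allowed_upgrade() {
    local current=$1 target=$2
    local cur_major cur_minor tgt_major tgt_minor
    # Split on dots; everything after the second field is ignored.
    IFS=. read -r cur_major cur_minor _ <<< "$current"
    IFS=. read -r tgt_major tgt_minor _ <<< "$target"
    if (( cur_major == tgt_major )); then
        (( tgt_minor == cur_minor || tgt_minor == cur_minor + 1 ))
        return
    fi
    return 1
}

is_allowed_upgrade 2.1.0p42 2.2.0p10 && echo "2.1 -> 2.2: ok"
is_allowed_upgrade 2.1.0p42 2.3.0p1 || echo "2.1 -> 2.3: go via 2.2 first"
```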

We run a three-tier environment with test, preprod and prod, which is advisable in bigger environments.

On test I run two sites running real checks and one site in simulation mode, which is a 1:1 copy of a real site connected to the master. Unfortunately simulation mode is not as good as it was up to 1.6, but it helps to do some basic checks.

I always start testing upgrades with the simulated site. First take a screenshot of the dashboard to document the current number of alerts; we need that later for comparison. Then do a manual backup and run the upgrade. If there are issues during the upgrade, I may have to restore the site from the backup and fix the problem on the master. If there are issues with the configuration, you can always run cmk-update-config with -v or even --debug, which also helps with troubleshooting.
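
As a rough sketch of that manual sequence, something like the following could be used on the test site. The function names, the site name and the target version are placeholders; the omd calls mirror the ones in the script further down. With DRY_RUN=1 (the default here) the commands are only printed, so nothing is changed until you switch it off.

```bash
#!/usr/bin/env bash
# Sketch of a manual test upgrade of one site.
# DRY_RUN=1 only prints the commands; set DRY_RUN=0 to really execute them.
DRY_RUN=${DRY_RUN:-1}

run() {
    if [ "$DRY_RUN" = 1 ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

manual_upgrade() {
    local site=$1 target=$2
    local archive="/tmp/Backup_${site}.tgz"   # placeholder backup location
    run omd backup "$site" "$archive" || return 1
    run omd stop "$site" || return 1
    run omd -V "$target" update "$site" || return 1
    # If the update reported configuration problems, repeat the config
    # update verbosely as the site user (use --debug for even more detail).
    run su - "$site" -c 'cmk-update-config -v'
}

# Example call with placeholder values:
# manual_upgrade mysite 2.2.0p10.cee
```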

If the simulation site upgrade runs without issues, I check for any UNKNOWN or STALE alerts, or even crashes. Also compare the dashboard figures so that you have the same number of services, hosts and alerts as before. With the last upgrade we had the issue that SNMP-based checks were gone; this comparison is how you detect that. Also test things like activate changes, LDAP auth and whatever else is configured.
If the simulation site is OK, we proceed with the other connected remote sites and do the same tests.
In the next step we upgrade the PreProd remote site and then the first sites in the prod environment. If all runs well, we continue upgrading all remote sites.
While that runs, we back up the master from the test environment and restore it in another VM to test the upgrade. You may be able to skip that, but we have a couple of people working even in the test environment and I don't want to disturb them too much. After upgrading the master in the test environment, we check whether there are any compatibility issues with the old agent. Most such problems should already be visible while upgrading the simulation site, because it uses cached agent output from the old agent.
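
The before/after comparison of the dashboard figures can also be done with two small text files instead of screenshots. The sketch below assumes each file holds "name count" lines; the file format, the counter names and the function name are my own invention, and how you fill the files (typed from the dashboard, or from a Livestatus query) is up to you.

```bash
#!/usr/bin/env bash
# Compare two snapshots of "name count" lines taken before and after an
# upgrade. Prints every counter that changed and returns 1 if any did.
compare_counts() {
    local before=$1 after=$2 changed=0 name old new
    while read -r name old; do
        # Look up the same counter in the "after" snapshot.
        new=$(awk -v n="$name" '$1 == n { print $2 }' "$after")
        if [ "$old" != "$new" ]; then
            echo "${name}: before=${old} after=${new:-missing}"
            changed=1
        fi
    done < "$before"
    return "$changed"
}

# Example with made-up numbers: 200 services vanished after the upgrade.
before=$(mktemp); after=$(mktemp)
printf 'hosts 120\nservices 4300\nunknown 3\n' > "$before"
printf 'hosts 120\nservices 4100\nunknown 3\n' > "$after"
compare_counts "$before" "$after" || echo "Counts differ, investigate before continuing"
rm -f "$before" "$after"
```

A drop in the services count like the one in the example is the kind of thing that would have shown the missing SNMP checks right away.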

Once the upgrade of the master in the test environment is done, we also upgrade PreProd and then the Prod environment.
Before we bake new agents, we disable automatic agent updates. Afterwards, we re-enable automatic agent updates region by region to see if we have any issues.

!!! Always do a backup on CLI before you do any upgrade. !!!

To automate the remote sites we run the following script:

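#!/usr/bin/env bash
# The site code is taken from characters 3-5 of the host name, upper-cased.
HOSTNAME=$(hostname)
SITECODE=${HOSTNAME:2:3}
SITECODE=${SITECODE^^}
VERSION=2.1.0p42
# Version the site currently runs, compared below against ${VERSION}.cee
OVERSION=$(/usr/bin/omd version -b "${SITECODE}")

if [ "${OVERSION}" = "${VERSION}.cee" ]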
then
    echo -e "!!!!!! Site ${SITECODE} is already on version ${VERSION} !!!!!"
    exit 1
fi
echo "Updating site ${SITECODE} from version ${OVERSION} to checkmk version:"
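# -F matches the version literally (dots are not regex wildcards); the match is printed on purpose
if /usr/bin/omd versions -b | grep -F -- "${VERSION}.cee"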
then
    echo -e "################# Backup site ${SITECODE} to tmp ################"
    if /usr/bin/omd backup ${SITECODE} /tmp/Backup_${SITECODE}.tgz
    then
        echo -e "+ Backup successful"
        backup=1
    else
        echo -e "! Could not make a backup of site ${SITECODE}"
        echo -e "! Start site ${SITECODE} with version ${OVERSION}"
        /usr/bin/omd start ${SITECODE}
        exit 1
    fi

    echo -e "################# Stopping site ${SITECODE} in OMD ################"
    # Sometimes the site cannot be stopped at the first attempt, so we try three times
    stopcounter=0
    until /usr/bin/omd stop "${SITECODE}"
    do
        echo -e "! Failed stopping ${SITECODE} in OMD"
        ((stopcounter++))
        if (( stopcounter >= 3 ))
        then
            echo -e "! Failed stopping ${SITECODE} in OMD after ${stopcounter} attempts"
            /usr/bin/omd start "${SITECODE}"
            exit 1
        fi
    done
    echo -e "+ Successfully stopped ${SITECODE} in OMD"

    echo -e "\n################# Updating site ${SITECODE} to ${VERSION} ################\n"
    if /usr/bin/omd -f -V "${VERSION}.cee" update --conflict=install "${SITECODE}"
    then
        echo -e "+ Successfully updated site ${SITECODE} to version ${VERSION}"
        update=1
    else
        echo -e "! Failed to update site ${SITECODE} to version ${VERSION}"
        update=0
    fi

    echo -e "\n################# Rebuilding checkmk config ################\n"
    if /usr/bin/su - "${SITECODE}" -c 'cmk -U'
    then
        echo -e "+ Successfully ran cmk -U"
        cmk=1
    else
        echo -e "! Failed running cmk -U"
        cmk=0
    fi
    # Any failed step (value 0) makes the product 0 and triggers a restore
    success=$((backup * update * cmk))
    if [ "${success}" -eq 0 ]
    then
        echo -e "############ Restoring site ${SITECODE} after failed update ################"
        /usr/bin/omd restore --reuse --kill "/tmp/Backup_${SITECODE}.tgz"
    fi

    echo -e "################# Starting site after update ################"
    if /usr/bin/omd start "${SITECODE}"
    then
        echo -e "+ Successfully started site ${SITECODE} with version $(/usr/bin/omd version -b "${SITECODE}")\n"
        sitestart=1
    else
        echo -e "! Failed to start site ${SITECODE} with version $(/usr/bin/omd version -b "${SITECODE}")\n"
        sitestart=0
    fi
    success=$((backup * update * cmk * sitestart))
    summary="Backup Site:\t${backup}\nUpdate Site:\t${update}\nRebuild config:\t${cmk}\nStart site:\t${sitestart}\n"
    if [ "${success}" -gt 0 ]
    then
        echo -e "################# Update completed without errors ################"
        #rm -f /tmp/Backup_${SITECODE}.tgz
        echo -e "${summary}"
        exit 0
    else
        echo -e "################# Update completed with errors ###############"
        echo -e "${summary}"
        exit 2
    fi
fi
echo -e "############ Version ${VERSION} not available in omd #############"
exit 1
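
The script depends on our host naming scheme, so before reusing it you may want to check what site code it would derive on your hosts. A small sketch of just that step follows; the function name and the sample host name are made up, and the upper-casing with ^^ needs Bash 4 or newer.

```bash
#!/usr/bin/env bash
# Derive the site code the same way as the script above:
# take three characters starting at offset 2 of the host name, upper-cased.
site_code_from_hostname() {
    local host=$1
    local code=${host:2:3}
    echo "${code^^}"
}

site_code_from_hostname "xxfra01"   # prints FRA
```

If your host names do not carry the site name at that position, set SITECODE by hand instead.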