Updating to 2.5 wipes everything

Hi y’all,

Long-time Checkmk enjoyer here, but I’m scratching my head over this one. I did a few weird things and had weird issues, so I’m going back to my last functional Checkmk backup, which was on version 2.4.0p24cre.

This backup was taken on OpenSUSE 15.5. Since then, I updated to 15.6… and to 16.0. Now, I can restore this backup just fine, everything works as usual. I can update to 2.4.0p27cre as well, nothing out of the ordinary. However, when I try to update to 2.5.0-community, some weird things happen. During the update, I get prompts to delete supposedly obsolete directories. The problem is… it sees everything as obsolete and basically tries to wipe everything! The whole process is on Pastebin (the paste begins with “Non empty obsolete directory var/pnp4nagios/log”).

Now, after going through that (and pressing keep every time), the update completes without any error. However, it seems like it did delete a bunch of stuff anyways without prompting me, since trying to start the site from there doesn’t work at all:

OMD[home]:~$ omd start
Starting agent-receiver...OK
Starting mkeventd (builtin: syslog-udp,syslog-tcp,snmptrap)...OK
Starting rrdcached...OK
Starting redis...
*** FATAL CONFIG FILE ERROR (Redis 8.4.0) ***
Reading the configuration file, at line 15
>>> 'dir "/omd/sites/home/var/redis"'
No such file or directory
failed
Starting npcd...touch: cannot touch '/omd/sites/home/tmp/pnp4nagios/run/npcd.pid': No such file or directory
chown: cannot access '/omd/sites/home/tmp/pnp4nagios/run/npcd.pid': No such file or directory
An Error occured while reading your config on line 197
Message was: "Could not open pidfile '/omd/sites/home/tmp/pnp4nagios/run/npcd.pid': No such file or directory"
OK
Starting automation-helper...OK
Starting ui-job-scheduler...OK
ERROR: tmp directory is not ready. Use "omd start" to prepare it
Starting apache...AH00526: Syntax error on line 20 of /omd/sites/home/etc/apache/apache.conf:
DocumentRoot '/omd/sites/home/var/www' is not a directory, or is not readable
..........failed
ls: cannot access '/omd/sites/home/etc/xinetd.d/': No such file or directory
Starting crontab...OK

A bunch of things are missing, and they match what was reported as vanished during the update process.
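For anyone who wants to check the same thing on their own site, here is a minimal sketch that tests for the directories named in the errors above. The function name `missing_site_dirs` is made up for illustration, and the list is only what this log complains about, not a complete list of what a site needs. The `tmp/` path is left out on purpose, since the log itself says `omd start` is what prepares it.

```shell
#!/bin/sh
# Hypothetical helper (not part of omd): print every directory from the
# startup errors above that is missing under the given site root.
missing_site_dirs() {
    site_root=$1
    # Relative paths taken from the redis, apache and xinetd.d errors in the log.
    for rel in var/redis var/www etc/xinetd.d; do
        [ -d "$site_root/$rel" ] || printf '%s\n' "$rel"
    done
}

# Site root as it appears in the log; adjust for a different site name.
missing_site_dirs /omd/sites/home
```

No output means all three directories are present.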

So, tl;dr: updating to 2.5 breaks everything. Site currently running fine on 2.4.0p27 in the meantime and ready to test further updates.

Thanks for reporting.

Just to be sure: the update was done as in the steps below, and then you saw this problem. Correct?

  • Checkmk 2.4.0p27.cre for SLES 16.0 was installed on Opensuse 16
  • Then the 2.4.0p27.cre backup was restored.
  • Checkmk 2.5.0.community for SLES 16.0 was installed on Opensuse 16
  • Update was performed from 2.4.0p27.cre to 2.5.0.community

That would be correct except the backup is from 2.4.0p24. When I wanted first to update to 2.5, I noticed SLES15 SP5 support was dropped, which prompted me to update.

So, starting from no sites, I have 2.4.0p24, 2.4.0p27 and 2.5.0 installed, all SLES 16 community versions. I restore the backup of my site, which was taken on 2.4.0p24 for SLES 15 SP5, onto the 2.4.0p24 for SLES 16 installation. I update to 2.4.0p27, then to 2.5, which is when it breaks.

Apologies for that, dnLL.

Both dev and I tried to reproduce this. We can’t. That’s also why it didn’t appear in any of our tests.
Will set up a SLES16 system now to see if this is SLES specific.

I will try tomorrow to uninstall all 3 versions of MK, delete everything, reinstall all 3 then restore the backup. I don’t know, maybe something went wrong elsewhere while dealing with the packages.

If I’m still able to reproduce then, I can upload the backup for testing purposes, it’s just my homelab, nothing critical, but the update process wiping stuff is somewhat scary.

OpenSUSE and SLES are binary compatible AFAIK. With that being said, it’s not a brand new 16.0 installation, it’s an upgraded one, which comes with its own sets of variables and challenges.

I tested this in two different ways. First, I started with a fresh SLES 16.0 system and installed the required Checkmk versions. Second, I took an existing site from SLES 15, backed it up, restored it on SLES 16, and then upgraded step by step.

In both cases, everything worked fine. I was able to upgrade from 2.4.0p24 → 2.4.0p27 → 2.5.0 without any issues.

Interestingly, I can’t reproduce the problem on other distributions like Ubuntu 24 either.

Ah, you were faster, Sudhir :)

I tried as well on SLES 16 now, but everything looks as expected.

@dnLL we need your help to figure this out. If you find anything that points us to the root cause, let us know.

All right so here is what I just did this morning (root user):

zypper rm check-mk-community-2.5.0-sles16.0-38.x86_64
zypper in -y https://download.checkmk.com/checkmk/2.5.0/check-mk-community-2.5.0-sles16.0-38.x86_64.rpm
omd stop && omd update home

The update worked without any issue, it detected one obsolete directory (var/check_mk/persisted) and checkmk started correctly on 2.5.

So, it will be hard to reproduce. What I think could have happened… I installed Checkmk 2.5 for SLES 15.6 while on OpenSUSE 15.6. Then I updated the OS to SUSE 16.0 and installed Checkmk 2.5 for SLES 16. The package manager detected that installation as an upgrade, although it was technically the same version (check-mk-community-2.5.0-sles15.6-38.x86_64.rpm to check-mk-community-2.5.0-sles16.0-38.x86_64.rpm). The site was already on 2.5.0 at that time. And this is when things started to get weird, and when I decided to pull back my Checkmk backup. I don’t have the time to test this further this morning.
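To spot that kind of mismatch, a rough sketch: pull the distro tag out of the installed Checkmk package names and compare it with the OS release. The function name `pkg_distro_tag` is made up, and the name pattern is only assumed from the two package file names above, so other editions or distros may need a different expression.

```shell
#!/bin/sh
# Hypothetical helper: read package names on stdin and print the SLES tag
# of each check-mk package, e.g. "sles16.0". Other lines are ignored.
pkg_distro_tag() {
    sed -n 's/^check-mk-.*-\(sles[0-9.]*\)-[0-9]*\..*$/\1/p'
}

# List the tags of whatever is installed, if rpm is available.
# Compare the result by eye with VERSION_ID in /etc/os-release.
if command -v rpm >/dev/null 2>&1; then
    rpm -qa 'check-mk-*' | pkg_distro_tag
fi
```

A tag that doesn’t match the running OS would be consistent with the scenario above, though that remains a guess until someone reproduces it.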

With that being said, going forward, I wonder if fail-safes could be added to prevent what happened during that broken update (see the Pastebin log from my first post), since that update was pretty destructive data-wise. I don’t know enough about the update mechanism and what makes directories show up as obsolete or vanished, but someone who doesn’t take a backup before updating could be in for some surprises under a specific set of circumstances.

I think this is the problem: the already upgraded site was rolled back with an older backup.