Instability in 2.0.0p5. Cause unknown. After a reboot it seems stable, but it is still not running smoothly

First question: is this a single instance, and if so, what does the load on the system look like?
If you have problems fetching data, it must be a problem with the Livestatus communication.
Next question: were there any error messages during the upgrade process?
What does ~/var/log/web.log look like? Any entries there?
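
To rule out the Livestatus side, a quick check along these lines may help. It is only a minimal sketch: it assumes the site's Livestatus UNIX socket is at ~/tmp/run/live (the usual place in an OMD site, but please verify on your system) and that it is run as the site user. The helper names build_query and query_livestatus are made up for this example.

```python
import os
import socket


def build_query(table, columns):
    """Build a Livestatus GET query for one table and a list of columns."""
    lines = ["GET " + table, "Columns: " + " ".join(columns), "OutputFormat: json"]
    # Livestatus expects the query to be terminated by an empty line.
    return "\n".join(lines) + "\n\n"


def query_livestatus(socket_path, query, timeout=10.0):
    """Send one query to the Livestatus UNIX socket and return the raw reply."""
    with socket.socket(socket.AF_UNIX, socket.SOCK_STREAM) as sock:
        sock.settimeout(timeout)
        sock.connect(socket_path)
        sock.sendall(query.encode("utf-8"))
        sock.shutdown(socket.SHUT_WR)  # tell the server the query is complete
        chunks = []
        while True:
            data = sock.recv(65536)
            if not data:
                break
            chunks.append(data)
    return b"".join(chunks).decode("utf-8")


def load_per_cpu():
    """Return the 1-minute load average divided by the number of CPUs."""
    return os.getloadavg()[0] / (os.cpu_count() or 1)


# Example call (assumed socket path, run as the site user):
# path = os.path.expanduser("~/tmp/run/live")
# print(load_per_cpu())
# print(query_livestatus(path, build_query("status", ["program_version", "num_hosts", "num_services"])))
```

If the query hangs or times out while the load per CPU is low, that would point towards the core or the Livestatus connection rather than the hardware.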

It would be nice to get some information, as Andreas said.
The title (“Wow…”) and just a screenshot are not helpful. Not for you, not for the community.

In most cases the problems are related to local extensions that have to be adjusted to work with version 2.0.
You could also have a look at ~/var/log/apache/error.log (because of the internal server error).
Would also be interesting to know if there are any hints in ~/var/log/update.log


After a reboot the problem has disappeared (for now).

This is a single instance, running SNMP checks and regular agent checks (Linux & Windows).

There are additional local file checks on a single Windows server.
It is running with 4 CPUs and 8 GB of memory.
The service checks seem to fluctuate between 50% and 100%.


Current Settings:

This server started as a 2.0 server and was migrated step by step to 2.0.0p5.
No issues were found during any of these steps.
In parallel I had a 1.6p19 server running (RAW edition), which worked flawlessly for months at a time.
I killed the 1.6p19 server after the second server, running 2.0.0p2 (CFE), seemed to be running smoothly.

root@monitorv2:/var/log/apache2# more error.log.1
[Thu Jun 10 00:00:25.713246 2021] [mpm_event:notice] [pid 1666:tid 140312453168192] AH00489: Apache/2.4.41 (Ubuntu) configured -- resuming normal operations
[Thu Jun 10 00:00:25.713265 2021] [core:notice] [pid 1666:tid 140312453168192] AH00094: Command line: '/usr/sbin/apache2'
[Thu Jun 10 14:47:56.613381 2021] [mpm_event:notice] [pid 1666:tid 140312453168192] AH00491: caught SIGTERM, shutting down
[Thu Jun 10 14:48:34.033339 2021] [mpm_event:notice] [pid 1625:tid 140429804031040] AH00489: Apache/2.4.41 (Ubuntu) configured -- resuming normal operations
[Thu Jun 10 14:48:34.039910 2021] [core:notice] [pid 1625:tid 140429804031040] AH00094: Command line: '/usr/sbin/apache2'
[Thu Jun 10 14:48:35.478187 2021] [proxy:error] [pid 1626:tid 140429651601152] (111)Connection refused: AH00957: HTTP: attempt to connect to 127.0.0.1:5000 (127.0.0.1) failed
[Thu Jun 10 14:48:35.478262 2021] [proxy_http:error] [pid 1626:tid 140429651601152] [client 192.168.100.185:31280] AH01114: HTTP: failed to make connection to backend: 127.0.0.1, referer: http://192.168.100.124/aim/check_mk/index.py?start_url=%2Faim%2Fcheck_mk%2Fwato.py%3Fmode%3Dchangelog
[Thu Jun 10 14:48:36.469895 2021] [proxy:error] [pid 1627:tid 140429592852224] (111)Connection refused: AH00957: HTTP: attempt to connect to 127.0.0.1:5000 (127.0.0.1) failed
[Thu Jun 10 14:48:36.469963 2021] [proxy_http:error] [pid 1627:tid 140429592852224] [client 192.168.100.185:26670] AH01114: HTTP: failed to make connection to backend: 127.0.0.1, referer: http://192.168.100.124/aim/check_mk/index.py?start_url=%2Faim%2Fcheck_mk%2Fwato.py%3Fmode%3Dchangelog
[Thu Jun 10 22:08:51.207519 2021] [proxy_http:error] [pid 1626:tid 140428561073920] (20014)Internal error (specific information not available): [client 192.168.100.185:1047] AH01102: error reading status line from remote server 127.0.0.1:5000, referer: http://192.168.100.124/aim/check_mk/dashboard.py
[Thu Jun 10 22:21:50.086209 2021] [mpm_event:notice] [pid 1625:tid 140429804031040] AH00491: caught SIGTERM, shutting down
[Thu Jun 10 22:47:43.670575 2021] [mpm_event:notice] [pid 1193:tid 140419253382208] AH00489: Apache/2.4.41 (Ubuntu) configured -- resuming normal operations
[Thu Jun 10 22:47:43.675907 2021] [core:notice] [pid 1193:tid 140419253382208] AH00094: Command line: '/usr/sbin/apache2'
[Fri Jun 11 00:00:13.252281 2021] [mpm_event:notice] [pid 1193:tid 140419253382208] AH00493: SIGUSR1 received. Doing graceful restart
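
For a long error log it can help to tally the error-level entries instead of reading them one by one. The following is a minimal sketch, assuming the "[timestamp] [module:level] ..." layout visible in the excerpt above and reading the log file directly (not a wrapped terminal copy). The function name tally_errors is made up for this example.

```python
import re
from collections import Counter

# "[timestamp] [module:level]" at the start of a line, as in the excerpt above.
LINE_RE = re.compile(r"^\[[^\]]+\] \[(?P<module>[\w.]*):(?P<level>\w+)\]")
# Apache message codes look like "AH00957".
CODE_RE = re.compile(r"\bAH\d{5}\b")


def tally_errors(lines, level="error"):
    """Count log entries of the given level, keyed by (module, AH code)."""
    counts = Counter()
    for line in lines:
        match = LINE_RE.match(line)
        if not match or match.group("level") != level:
            continue  # skip other levels and lines that do not match the layout
        code = CODE_RE.search(line)
        counts[(match.group("module"), code.group(0) if code else "-")] += 1
    return counts


# Example call (the path is only an example):
# with open("/var/log/apache2/error.log.1", encoding="utf-8", errors="replace") as f:
#     for key, count in tally_errors(f).most_common():
#         print(count, *key)
```

On the excerpt above this would group the entries into the "connection refused" errors (AH00957, AH01114) around the restart at 14:48 and the single AH01102 at 22:08.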

Update.log (p4-p5) :
2021-05-30 13:34:10 - Updating site ‘aim’ from version 2.0.0p4.cfe to 2.0.0p5.cfe…

Creating temporary filesystem /omd/sites/aim/tmp…OK
Executing update-pre-hooks script “01_mkp-disable-outdated”…OK
Executing update-pre-hooks script “02_cmk-update-config”…
-| Initializing application…
-| Loading GUI plugins…
-| Updating Checkmk configuration…
-| ATTENTION: Some steps may take a long time depending on your installation, e.g. during major upgrades.
-| 1/15 Migrate deprecated network topology dashlet…
-| 2/15 Update global settings…
-| 3/15 Rewriting WATO tags…
-| 4/15 Rewriting WATO hosts and folders…
-| 5/15 Rewriting WATO rulesets…
-| Replacing ruleset non_inline_snmp_hosts with snmp_backend_hosts
-| 6/15 Rewriting autochecks…
-| 7/15 Cleanup version specific caches…
-| 8/15 Migrating fs_used name…
-| 9/15 Migrate pagetype topics…
-| 10/15 Migrate LDAP connections…
-| 11/15 Rewrite BI Configuration…
-| Skipping conversion of bi.mk (already done)
-| 12/15 Set version specific user attributes…
-| 13/15 Rewriting inventory data…
-| Skipping py2 inventory data update (already done)
-| 14/15 Migrate audit log…
-| No audit log present. Skipping.
-| 15/15 Rename discovered host label files…
-| Done
OK
Updating core configuration…
Finished update.


This is the global Apache log. The relevant one should be the Apache log inside your site:
~/var/log/apache
There is also web.log, which is important if there are problems inside the web interface:
~/var/log/web.log

It would be nice to have the log content as preformatted text, as then there are no extra line breaks when a line is too long for the window.

There is no apache.log.

OMD[aim]:~/var/log$ ls *.log -sla
4 -rw-rw---- 1 aim aim 3888 Jun 11 08:21 alerts.log
2736 -rw-rw---- 1 aim aim 2797224 Jun 11 16:45 cmc.log
4 -rw-r--r-- 1 aim aim 4051 Jun 11 08:21 dcd.log
0 -rw-rw---- 1 aim aim 0 Jun 11 16:05 diskspace.log
4 -rw-r--r-- 1 aim aim 3971 Jun 11 08:21 liveproxyd.log
8 -rw-r--r-- 1 aim aim 5583 Jun 11 08:21 mkeventd.log
1656 -rw-r--r-- 1 aim aim 1691400 Jun 11 16:45 mknotifyd.log
1920 -rw-rw---- 1 aim aim 1958393 Jun 11 16:45 notify.log
44 -rw-r--r-- 1 aim aim 37423 Jun 11 08:21 redis-server.log
0 -rw-r----- 1 aim aim 0 Jun 9 00:00 rrdcached.log
0 -rw-r----- 1 aim aim 0 Jun 9 00:00 update.log
100 -rw-rw---- 1 aim aim 95904 Jun 11 15:58 web.log
OMD[aim]:~/var/log$

web.log.txt (93.7 KB)

There is a folder named apache with some logs inside.
In your web.log you can see many error messages from yesterday, each time the system does something with the automation calls.
The errors look very strange and do not always come from the same source.
Today there are no error messages.
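
To see at a glance on which days the errors occurred, something like the following could be used. This is only a rough sketch under two assumptions that should be checked against the real file: that every entry in web.log starts with a "YYYY-MM-DD" date, and that error entries contain one of the marker words below. The function name errors_per_day and the marker list are made up for this example.

```python
import re
from collections import Counter

# Assumption: each log entry starts with a "YYYY-MM-DD" date. Continuation
# lines (for example the body of a traceback) do not and are ignored.
DATE_RE = re.compile(r"^(\d{4}-\d{2}-\d{2}) ")


def errors_per_day(lines, markers=("error", "traceback", "exception")):
    """Count dated log entries per day that contain one of the marker words."""
    counts = Counter()
    for line in lines:
        match = DATE_RE.match(line)
        if match and any(marker in line.lower() for marker in markers):
            counts[match.group(1)] += 1
    return counts


# Example call (run as the site user; the path is taken from the thread):
# import os
# with open(os.path.expanduser("~/var/log/web.log"), encoding="utf-8", errors="replace") as f:
#     for day, count in sorted(errors_per_day(f).items()):
#         print(day, count)
```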

In /opt/omd/sites/aim/var/log/apache I found the following:
error_log.1.txt (945.0 KB)

~/var/log/apache is a directory, in which you should find access_log and error_log.

All the API errors look strange. Is there something running against BI on the API?
There are too many WSGI errors. I would reinstall the Checkmk package, as I think there was an error at installation time with this version on your server.

But it could also be a problem with Apache. I’m no specialist in Apache WSGI problems.

OK, but before I reinstall: can I restore the site backup afterwards, or is that useless?

The normal site backup should work if it is the same version.
I mean reinstall the deb package, not the server.
