Rrdcached not running after upgrade to check-mk-enterprise

Hi,

After the upgrade to check-mk-enterprise p11, rrdcached is not running on two slaves, so I currently don’t have any graphs. We have a distributed monitoring setup with 8 slaves and a master. When I try to restart rrdcached manually, I get the following:

~$ omd start rrdcached
Temporary filesystem already mounted
~$ omd restart rrdcached 
Temporary filesystem already mounted

In ~/tmp there is no rrdcached.pid

~$ ll tmp/
total 0
drwxr-xr-x  4 site site 100 Oct  7 11:31 apache/
drwxr-xr-x 10 site site 220 Oct  6 11:24 check_mk/
-rw-r--r--  1 site site   0 Oct  4 11:23 initialized
drwxr-xr-x  3 site site  60 Oct  4 11:23 liveproxyd/
drwxr-xr-x  2 site site  40 Oct  4 11:23 lock/
drwxr-xr-x  4 site site  80 Oct  4 11:23 nagios/
drwxrwxr-x  4 site site  80 Oct  4 11:23 nagvis/
drwxr-xr-x  5 site site 100 Oct  4 11:23 php/
drwxr-xr-x  5 site site 100 Oct  4 11:23 pnp4nagios/
drwxr-xr-x  2 site site  40 Oct  4 11:23 rrdcached/
drwxr-xr-x  4 site site 320 Oct  7 11:52 run/

On all check_mk hosts I get this:

~$ omd status rrdcached
-----------------------
Overall state:  unused

And rrdcached is not listed at all:

~$ omd status 
mkeventd:       running
liveproxyd:     running
mknotifyd:      running
cmc:            running
apache:         running
dcd:            running
redis:          running
stunnel:        running
xinetd:         running
crontab:        running
-----------------------
Overall state:  running

Any idea how I can start it again?

Thanks

Hi,
Do you find any messages about the rrdcached service in the log files?

Regards,
Christian

Hi Christian,

I see a lot of:

var/log/cmc.log.1:2021-10-06 00:00:48 [4] [rrdcached thread] [rrdcached at "/omd/sites/site/tmp/run/rrdcached.sock"] cannot connect: No such file or directory
var/log/cmc.log:2021-10-07 12:16:43 [4] [client 5] Error flushing RRD: Unable to connect to rrdcached: No such file or directory
var/pnp4nagios/log/perfdata.log:2021-03-25 16:01:31 [2980802] [0] RRDs::update ERROR rrdcached@unix:/omd/sites/site/tmp/run/rrdcached.sock: illegal attempt to update using time 1616688084.000000 when last update time is 1616688084.000000 (minimum one second step)

It already started on 2021-10-04:

2021-10-04 11:22:39 [5] [rrdcached thread] started
2021-10-04 11:22:39 [5] [core 1046] -----------------------------------------------------------------
2021-10-04 11:22:39 [5] [core 1046] Check_MK Micro Core started with PID 1046
2021-10-04 11:22:39 [5] [core 1046] version 2.0.0p11 compiled Thu, 16 Sep 2021 12:17:02 +0000 on debian-10
2021-10-04 11:22:39 [5] [core 1046] built with g++-10 (GCC) 10.1.0, using RE2 regex engine
2021-10-04 11:22:39 [5] [core 1046] loaded configuration 408 (0xea8390) from 2021-10-04 11:22:39 with 80 hosts and 1861 services in 9.08754 ms
2021-10-04 11:22:39 [5] [core 1046] loaded saved program state with 80 hosts, 1861 services, 0 comments, and 5 downtimes in 7.31639 ms
2021-10-04 11:22:39 [5] [main] [livestatus manager] starting
2021-10-04 11:22:39 [5] [main] [livestatus manager] listening on /omd/sites/site/tmp/run/live
2021-10-04 11:22:39 [5] [main] [livestatus manager] created 20 Livestatus threads with stack size 4194304 in 1.39756 ms
2021-10-04 11:22:39 [5] [core 1046] [livestatus local] Successfully created new command pipe at "/omd/sites/site/tmp/run/nagios.cmd".
2021-10-04 11:22:39 [5] [core 1046] [livestatus local] Successfully opened command pipe at "/omd/sites/site/tmp/run/nagios.cmd".
2021-10-04 11:22:39 [5] [main] [RRD helper 1070] started, commandline: /omd/sites/site/bin/cmk --create-rrd --keepalive
2021-10-04 11:22:39 [5] [carbon thread] [carbon connection pool] started
2021-10-04 11:22:39 [5] [core 1046] building state history cache for the time period from 2019-10-05 11:22:39 to 2021-10-04 11:22:39 (730 days)
2021-10-04 11:22:39 [5] [alert helper 1072] started, commandline: /omd/sites/site/bin/cmk --handle-alerts --keepalive
2021-10-04 11:22:39 [5] [generic pool] [helper 1073] started, commandline: /omd/sites/site/lib/cmc/checkhelper
2021-10-04 11:22:39 [5] [generic pool] [helper 1074] started, commandline: /omd/sites/site/lib/cmc/checkhelper
2021-10-04 11:22:39 [5] [generic pool] [helper 1075] started, commandline: /omd/sites/site/lib/cmc/checkhelper
2021-10-04 11:22:39 [5] [generic pool] [helper 1076] started, commandline: /omd/sites/site/lib/cmc/checkhelper
2021-10-04 11:22:39 [5] [generic pool] [helper 1077] started, commandline: /omd/sites/site/lib/cmc/checkhelper
2021-10-04 11:22:39 [5] [generic pool] started 5 helpers in 5.71226 ms
2021-10-04 11:22:39 [5] [checker pool] [helper 1078] started, commandline: /omd/sites/site/bin/cmk --checker
2021-10-04 11:22:39 [5] [checker pool] [helper 1079] started, commandline: /omd/sites/site/bin/cmk --checker
2021-10-04 11:22:39 [5] [checker pool] [helper 1080] started, commandline: /omd/sites/site/bin/cmk --checker
2021-10-04 11:22:39 [5] [checker pool] [helper 1081] started, commandline: /omd/sites/site/bin/cmk --checker
2021-10-04 11:22:39 [5] [checker pool] started 4 helpers in 7.5122 ms
2021-10-04 11:22:39 [5] [real-time pool] [helper 1082] started, commandline: /omd/sites/site/bin/cmk --keepalive --real-time-checks
2021-10-04 11:22:39 [5] [real-time pool] started 1 helper in 1.49708 ms
2021-10-04 11:22:39 [5] [fetcher pool] [helper 1083] started, commandline: /omd/sites/site/bin/fetcher
2021-10-04 11:22:39 [5] [fetcher pool] [helper 1084] started, commandline: /omd/sites/site/bin/fetcher
2021-10-04 11:22:39 [5] [fetcher pool] [helper 1085] started, commandline: /omd/sites/site/bin/fetcher
2021-10-04 11:22:39 [5] [fetcher pool] [helper 1086] started, commandline: /omd/sites/site/bin/fetcher
2021-10-04 11:22:39 [5] [fetcher pool] [helper 1087] started, commandline: /omd/sites/site/bin/fetcher
2021-10-04 11:22:39 [5] [fetcher pool] [helper 1088] started, commandline: /omd/sites/site/bin/fetcher
2021-10-04 11:22:39 [5] [core 1046] finalized 1941 history caches in 1.61908 ms
2021-10-04 11:22:39 [5] [core 1046] ,-Cache for statehist------------------------------------------------------------.
2021-10-04 11:22:39 [5] [core 1046] |                                                                                |
2021-10-04 11:22:39 [5] [core 1046] |         parsed                speed                    cached                  |
2021-10-04 11:22:39 [5] [core 1046] |         -----------           -------------            ----------------------  |
2021-10-04 11:22:39 [5] [core 1046] |      1  Logfiles       17.36  Logfiles/s         1941  hosts/services          |
2021-10-04 11:22:39 [5] [core 1046] |  0.001  GB of data    12.141  MB/s               3896  host/service events     |
2021-10-04 11:22:39 [5] [core 1046] |  0.006  Mio messages   0.104  Mio messages/s        1  core starts/stops       |
2021-10-04 11:22:39 [5] [core 1046] |    0.2  days of history                          2.01  entries per host/serv.  |
2021-10-04 11:22:39 [5] [core 1046] |                                               3895.00  entries per day         |
2021-10-04 11:22:39 [5] [core 1046] |                                                  7842  strings                 |
2021-10-04 11:22:39 [5] [core 1046] |                                                  2192  unique strings (28.0%)  |
2021-10-04 11:22:39 [5] [core 1046] |  00:00 parsing time                                                            |
2021-10-04 11:22:39 [5] [core 1046] '--------------------------------------------------------------------------------'
2021-10-04 11:22:39 [5] [fetcher pool] [helper 1089] started, commandline: /omd/sites/site/bin/fetcher
2021-10-04 11:22:39 [5] [fetcher pool] [helper 1090] started, commandline: /omd/sites/site/bin/fetcher
2021-10-04 11:22:39 [5] [fetcher pool] [helper 1091] started, commandline: /omd/sites/site/bin/fetcher
2021-10-04 11:22:39 [5] [fetcher pool] [helper 1092] started, commandline: /omd/sites/site/bin/fetcher
2021-10-04 11:22:39 [5] [fetcher pool] [helper 1093] started, commandline: /omd/sites/site/bin/fetcher
2021-10-04 11:22:39 [5] [fetcher pool] [helper 1094] started, commandline: /omd/sites/site/bin/fetcher
2021-10-04 11:22:39 [5] [fetcher pool] [helper 1095] started, commandline: /omd/sites/site/bin/fetcher
2021-10-04 11:22:39 [5] [fetcher pool] started 13 helpers in 89.3714 ms
2021-10-04 11:22:39 [5] [notification helper 1096] started, commandline: /omd/sites/site/bin/cmk --notify --keepalive
2021-10-04 11:22:39 [5] [icmpsender 1097] started, commandline: /omd/sites/site/lib/cmc/icmpsender 8 0 1000
2021-10-04 11:22:39 [5] [icmpreceiver 1098] started, commandline: /omd/sites/site/lib/cmc/icmpreceiver
2021-10-04 11:23:32 [4] [rrdcached thread] [rrdcached at "/omd/sites/site/tmp/run/rrdcached.sock"] cannot connect: No such file or directory
2021-10-04 11:23:54 [4] [rrdcached thread] [rrdcached at "/omd/sites/site/tmp/run/rrdcached.sock"] cannot connect: No such file or directory
2021-10-04 11:24:14 [4] [rrdcached thread] [rrdcached at "/omd/sites/site/tmp/run/rrdcached.sock"] cannot connect: No such file or directory
2021-10-04 11:24:38 [4] [rrdcached thread] [rrdcached at "/omd/sites/site/tmp/run/rrdcached.sock"] cannot connect: No such file or directory
2021-10-04 11:25:11 [4] [rrdcached thread] [rrdcached at "/omd/sites/site/tmp/run/rrdcached.sock"] cannot connect: No such file or directory
2021-10-04 11:25:35 [4] [rrdcached thread] [rrdcached at "/omd/sites/site/tmp/run/rrdcached.sock"] cannot connect: No such file or directory
2021-10-04 11:25:51 [4] [rrdcached thread] [rrdcached at "/omd/sites/site/tmp/run/rrdcached.sock"] cannot connect: No such file or directory
2021-10-04 11:26:07 [4] [rrdcached thread] [rrdcached at "/omd/sites/site/tmp/run/rrdcached.sock"] cannot connect: No such file or directory
2021-10-04 11:26:49 [4] [rrdcached thread] [rrdcached at "/omd/sites/site/tmp/run/rrdcached.sock"] cannot connect: No such file or directory
2021-10-04 11:27:37 [4] [rrdcached thread] [rrdcached at "/omd/sites/site/tmp/run/rrdcached.sock"] cannot connect: No such file or directory
2021-10-04 11:27:49 [4] [client 3] error: client connection terminated: timeout
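
To see on which day the connection errors began without scrolling through the whole log, the matching lines can be counted per date. This is only a minimal sketch: the function name is made up for illustration, and it assumes the log lines start with the date as in the excerpt above.

```shell
# Count rrdcached "cannot connect" messages per day in the given CMC log files.
# Usage (as site user, file names as in the post above):
#   rrdcached_errors_per_day ~/var/log/cmc.log ~/var/log/cmc.log.1
rrdcached_errors_per_day() {
    # -h suppresses file name prefixes so the date stays in the first column
    grep -h 'rrdcached.*cannot connect' "$@" \
        | awk '{ print $1 }' \
        | sort \
        | uniq -c
}
```

Each output line is a count followed by a date; the earliest date shows when the socket first went missing.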

Maybe a stupid question, but have you checked whether a file system is full?
If that’s not the case, maybe someone else can help. Unfortunately, I am not yet such an expert.

There is more than enough disk space. Thanks, Christian.

Can you restart only rrdcached?
Before testing this, I would clean up the old cache/spool files containing RRD data.

I get the above output on all check-mk hosts, not only on those where the daemon is not running. And the daemon doesn’t even show up in the omd status output, as posted above.

Which files exactly should I clean up?

I just created a new site on a fresh Linux installation, and rrdcached is present there. So I assume it has to do with the upgrade from the RAW to the Enterprise edition. As I said, rrdcached is not listed as running on any of the current check-mk hosts, but the rrdcached socket is missing on only 4 of them, which is pretty strange!
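
For comparing the hosts, a small check for the socket can be run on each of them. This is a sketch only: the function name is hypothetical, and the socket location is taken from the cmc.log lines above (tmp/run/rrdcached.sock below the site directory).

```shell
# Report whether the rrdcached socket exists below a site directory.
# $1: site root, e.g. /omd/sites/site (defaults to $HOME for a site user)
check_rrdcached_socket() {
    site_root="${1:-$HOME}"
    sock="$site_root/tmp/run/rrdcached.sock"
    if [ -S "$sock" ]; then
        echo "socket present: $sock"
    elif [ -e "$sock" ]; then
        echo "exists but is not a socket: $sock"
    else
        echo "socket missing: $sock"
        return 1
    fi
}
```

Run as the site user without arguments, it checks the site's own home directory.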

rrdcached must run in both RAW and Enterprise. It is needed to store the performance data in all editions of CMK.
On the command line I would inspect the “~/etc/init.d/” folder. Is the rrdcached file missing there as well?
If it is there, you need to check the file “~/etc/omd/site.conf”.
The entry “CONFIG_PNP4NAGIOS='on'” should exist.

That’s all I would check.
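
The two checks above can be sketched as a small shell function. The function name is made up for illustration; the paths are the ones mentioned in this post, relative to the site directory.

```shell
# Check the two things suggested above for a site directory:
#   1. an rrdcached init script under etc/init.d/
#   2. the entry CONFIG_PNP4NAGIOS='on' in etc/omd/site.conf
# $1: site root (defaults to $HOME for a site user)
check_rrdcached_setup() {
    site_root="${1:-$HOME}"
    status=0
    if [ -e "$site_root/etc/init.d/rrdcached" ]; then
        echo "init script: found"
    else
        echo "init script: missing"
        status=1
    fi
    if grep -q "^CONFIG_PNP4NAGIOS='on'" "$site_root/etc/omd/site.conf" 2>/dev/null; then
        echo "CONFIG_PNP4NAGIOS: on"
    else
        echo "CONFIG_PNP4NAGIOS: not set to on"
        status=1
    fi
    return "$status"
}
```

A non-zero return status means at least one of the two checks failed.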

Setting CONFIG_PNP4NAGIOS='on' did it. Thanks, Andreas.

Hi Andreas, I still have an issue with one slave. There are no graphs at all. rrdcached is present and running. Journals in ~/var/rrdcached/ are created, but they are empty. After I converted the RRDs, they are present in ~/var/check_mk/rrd/. CONFIG_PNP4NAGIOS='on' is correctly set.

~$ cat ~/etc/omd/site.conf
# Managed by Puppet. DO NOT EDIT!
#

CONFIG_ADMIN_MAIL=''
CONFIG_APACHE_MODE='own'
CONFIG_APACHE_TCP_ADDR='127.0.0.1'
CONFIG_APACHE_TCP_PORT='5000'
CONFIG_AUTOSTART='on'
CONFIG_CORE='cmc'
CONFIG_DOKUWIKI_AUTH='off'
CONFIG_LIVEPROXYD='on'
CONFIG_LIVESTATUS_TCP='on'
CONFIG_LIVESTATUS_TCP_ONLY_FROM='192.168.1.61'
CONFIG_LIVESTATUS_TCP_PORT='6557'
CONFIG_LIVESTATUS_TCP_TLS='on'
CONFIG_MKEVENTD='on'
CONFIG_MKEVENTD_SNMPTRAP='off'
CONFIG_MKEVENTD_SYSLOG='off'
CONFIG_MKEVENTD_SYSLOG_TCP='off'
CONFIG_MULTISITE_AUTHORISATION='on'
CONFIG_MULTISITE_COOKIE_AUTH='on'
CONFIG_NAGIOS_THEME='dark'
CONFIG_NSCA='off'
CONFIG_NSCA_TCP_PORT='5667'
CONFIG_PNP4NAGIOS='on'
CONFIG_TMPFS='on'

Furthermore, when a new host is added to this slave, no RRDs are created; the host doesn’t get a folder in ~/var/check_mk/rrd/.
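
To find out which hosts are affected, the RRD directory can be compared against a list of host names. A rough sketch, assuming one folder per host directly below var/check_mk/rrd/ as described above; the function name and the host names in the usage line are placeholders.

```shell
# Print every given host name that has no directory under var/check_mk/rrd/.
# $1: site root; remaining arguments: host names to check
# Usage: hosts_without_rrd_dir "$HOME" host1 host2
hosts_without_rrd_dir() {
    rrd_base="$1/var/check_mk/rrd"
    shift
    for host in "$@"; do
        [ -d "$rrd_base/$host" ] || echo "$host"
    done
}
```

No output means every listed host already has its RRD folder.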

The graphs on hosts of this specific slave look like this: (screenshot not included)

Unfortunately, I couldn’t fix it and I don’t know the cause, but to get the site back I did the following:

  • create a backup without RRDs (-N option)
  • create a new site
  • restore the backup to the new site
  • reinventory all hosts in the new site
  • stop and rename the original site
  • stop the new site and rename it to the original site’s name
  • start the new/old site
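
The steps above could look roughly like the following. This is a dry run that only prints commands and executes nothing. Apart from -N, which is mentioned above, the exact commands and flags are assumptions (in particular --reuse for restoring into the freshly created site and cmk -II for the reinventory) and should be checked against the omd help output before use. Site names and the archive path are placeholders.

```shell
# Dry run only: prints one possible command sequence for the steps above.
# $1: original site name, $2: temporary name for the new site, $3: archive path
print_site_rebuild_plan() {
    old_site="$1"
    new_site="$2"
    archive="$3"
    cat <<EOF
omd backup -N $old_site $archive
omd create $new_site
omd restore --reuse $new_site $archive
su - $new_site -c 'cmk -II'
omd stop $old_site
omd mv $old_site ${old_site}_old
omd stop $new_site
omd mv $new_site $old_site
omd start $old_site
EOF
}
```

For example, print_site_rebuild_plan site site_new /tmp/site.tar.gz prints the nine commands for review; they would be run as root, one at a time, only after each has been verified.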