Check_mk was not displaying any host statistics or data

Hi All,

We have Checkmk RAW 1.4.0p8 installed in our environment, and we are monitoring around 900 hosts.

We have seen an issue where Check_mk stops displaying any data for a few minutes and then recovers.

I am trying to understand what caused this issue.

I can see the following entries in nagios.log:

[1589791625] Caught SIGTERM, shutting down…
[1589791625] Successfully shutdown… (PID=5683)
[1589791625] npcdmod: If you don’t like me, I will go out! Bye.
[1589791625] Event broker module ‘/omd/sites/checkmk/lib/npcdmod.o’ deinitialized successfully.
[1589791625] livestatus: deinitializing
[1589791625] livestatus: waiting for main to terminate…
[1589791626] livestatus: waiting for client threads to terminate…
[1589791626] livestatus: could not join thread main
[1589791626] livestatus: main thread + 20 client threads have finished
[1589791626] Event broker module ‘/omd/sites/checkmk/lib/mk-livestatus/livestatus.o’ deinitialized successfully.
[1589791636] Nagios 3.5.0 starting… (PID=13291)
[1589791636] Local time is Mon May 18 09:47:16 BST 2020
[1589791636] LOG VERSION: 2.0
[1589791636] npcdmod: Copyright © 2008-2009 Hendrik Baecker (andurin@process-zero.de) - http://www.pnp4nagios.org
[1589791636] npcdmod: /omd/sites/checkmk/etc/pnp4nagios/npcd.cfg initialized
[1589791636] npcdmod: spool_dir = ‘/omd/sites/checkmk/var/pnp4nagios/spool/’.
[1589791636] npcdmod: perfdata file ‘/omd/sites/checkmk/var/pnp4nagios/perfdata.dump’.
[1589791636] npcdmod: Ready to run to have some fun!
[1589791636] Event broker module ‘/omd/sites/checkmk/lib/npcdmod.o’ initialized successfully.
[1589791636] livestatus: setting number of client threads to 20
[1589791636] livestatus: fl_socket_path=[/omd/sites/checkmk/tmp/run/live], fl_mkeventd_socket_path=[/omd/sites/checkmk/tmp/run/mkeventd/status]
[1589791636] livestatus: Livestatus 1.4.0p8 by Mathias Kettner. Socket: ‘/omd/sites/checkmk/tmp/run/live’
[1589791636] livestatus: Please visit us at http://mathias-kettner.de/
[1589791636] livestatus: running on OMD site checkmk, cool.
[1589791636] livestatus: opened UNIX socket at /omd/sites/checkmk/tmp/run/live
[1589791636] livestatus: your event_broker_options are sufficient for livestatus…
[1589791636] livestatus: finished initialization, further log messages go to /omd/sites/checkmk/var/nagios/livestatus.log
[1589791636] Event broker module ‘/omd/sites/checkmk/lib/mk-livestatus/livestatus.o’ initialized successfully.
[1589791636] Finished daemonizing… (New PID=13292)
[1589791637] livestatus: TIMEPERIOD TRANSITION: 24X7;-1;1
[1589791637] livestatus: TIMEPERIOD TRANSITION: DevOps_HAProxy_Ignore;-1;1
[1589791637] livestatus: TIMEPERIOD TRANSITION: Exclude3am;-1;1
[1589791637] livestatus: TIMEPERIOD TRANSITION: ExcludeMidnight;-1;1
[1589791637] livestatus: TIMEPERIOD TRANSITION: dnp_ignoretimes;-1;1
[1589791637] livestatus: TIMEPERIOD TRANSITION: vasf_ignore;-1;1
[1589791637] livestatus: logging initial states
[1589791637] livestatus: starting main thread and 20 client threads

And this is my Icinga livestatus.log:

2020-05-18 09:47:03 [client 8] Unknown dynamic column ‘rrddata’
2020-05-18 09:47:06 [client 19] error: Client connection terminated while request still incomplete
2020-05-18 09:47:06 [main] socket thread has terminated
2020-05-18 09:47:06 [client 9] error: Client connection terminated while request still incomplete
2020-05-18 09:47:06 [main] flushing log file index
2020-05-18 11:38:52 [client 19] error: Client connection terminated while request still incomplete
2020-05-18 11:38:53 [main] socket thread has terminated
2020-05-18 11:38:53 [client 10] error: Client connection terminated while request still incomplete
2020-05-18 11:38:53 [main] flushing log file index
2020-05-18 11:41:50 [client 17] Unknown dynamic column ‘rrddata’
2020-05-18 11:41:50 [client 16] Unknown dynamic column ‘rrddata’
2020-05-18 11:41:51 [client 1] Unknown dynamic column ‘rrddata’
2020-05-18 11:41:52 [client 19] Unknown dynamic column ‘rrddata’
2020-05-18 11:41:54 [client 6] Unknown dynamic column ‘rrddata’
2020-05-18 11:41:54 [client 7] Unknown dynamic column ‘rrddata’
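
The two logs use different time formats (epoch seconds in nagios.log, local time in livestatus.log), which makes them awkward to line up by eye. The following is a minimal Python sketch, not an official Checkmk tool, that pairs each "Caught SIGTERM" line with the next "Nagios ... starting" line and reports the gap. The function name `restart_gaps` and the sample list are illustrative assumptions; the sample lines are copied from the excerpt above.

```python
import re
from datetime import datetime, timezone

# nagios.log lines look like "[<epoch seconds>] <message>"
LINE_RE = re.compile(r"^\[(\d+)\] (.*)$")


def restart_gaps(lines):
    """Return (stop_epoch, start_epoch, gap_seconds) for each stop/start pair."""
    gaps = []
    stop = None
    for line in lines:
        match = LINE_RE.match(line.strip())
        if not match:
            continue
        epoch, message = int(match.group(1)), match.group(2)
        if message.startswith("Caught SIGTERM"):
            stop = epoch
        elif stop is not None and message.startswith("Nagios") and "starting" in message:
            gaps.append((stop, epoch, epoch - stop))
            stop = None
    return gaps


# Three lines taken from the nagios.log excerpt in this post.
sample = [
    "[1589791625] Caught SIGTERM, shutting down…",
    "[1589791626] livestatus: main thread + 20 client threads have finished",
    "[1589791636] Nagios 3.5.0 starting… (PID=13291)",
]

for stop, start, gap in restart_gaps(sample):
    stop_utc = datetime.fromtimestamp(stop, tz=timezone.utc)
    print(f"core stopped {stop_utc:%Y-%m-%d %H:%M:%S} UTC, back after {gap}s")
# prints: core stopped 2020-05-18 08:47:05 UTC, back after 11s
```

08:47:05 UTC is 09:47:05 BST, one second before the "socket thread has terminated" entry at 09:47:06 in livestatus.log, so both logs appear to describe the same core restart. The 11 seconds only cover the time until the process starts again; they do not show how long livestatus then needed before it answered queries.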

900 hosts on a RAW edition. What do the CPU load and utilization look like on your monitoring server?
It is possible that an activation takes too long. Nagios reschedules all services and needs some time to spin up.
To judge whether your monitoring server is suffering under too heavy a workload, the number of CPU cores would also be good to know.
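
One rough way to answer the CPU question is to relate the load average to the number of cores. The Python sketch below uses only the standard library; the helper name `load_per_core` is an assumption made for illustration, and the common rule of thumb that a sustained value near or above 1.0 per core suggests CPU saturation is a heuristic, not a Checkmk threshold.

```python
import os


def load_per_core(load_avg, cores):
    """Normalize a load average by the number of CPU cores."""
    return load_avg / cores


cores = os.cpu_count() or 1             # cpu_count() may return None
one, five, fifteen = os.getloadavg()    # 1, 5 and 15 minute load (Unix only)

print(f"cores: {cores}")
for label, value in (("1 min", one), ("5 min", five), ("15 min", fifteen)):
    print(f"{label:>6}: load {value:.2f}, per core {load_per_core(value, cores):.2f}")
```

Run it on the monitoring server itself, ideally while a configuration activation is in progress, since that is when the outage was observed.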

Thanks for looking @andreas-doehler

Here is the load on our server:

Cpu0 : 23.6%us, 15.2%sy, 0.0%ni, 47.1%id, 13.8%wa, 0.0%hi, 0.3%si, 0.0%st
Cpu1 : 25.0%us, 11.0%sy, 0.0%ni, 62.7%id, 0.7%wa, 0.0%hi, 0.7%si, 0.0%st
Cpu2 : 18.0%us, 10.7%sy, 0.0%ni, 55.0%id, 16.0%wa, 0.0%hi, 0.3%si, 0.0%st
Cpu3 : 26.0%us, 13.2%sy, 0.0%ni, 52.0%id, 8.4%wa, 0.0%hi, 0.3%si, 0.0%st
Cpu4 : 30.0%us, 12.1%sy, 0.0%ni, 57.6%id, 0.0%wa, 0.0%hi, 0.3%si, 0.0%st
Cpu5 : 17.9%us, 11.0%sy, 0.0%ni, 70.8%id, 0.0%wa, 0.0%hi, 0.3%si, 0.0%st
Cpu6 : 24.2%us, 11.7%sy, 0.0%ni, 60.7%id, 3.4%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu7 : 22.7%us, 12.4%sy, 0.0%ni, 64.5%id, 0.0%wa, 0.0%hi, 0.3%si, 0.0%st
Mem: 16431516k total, 15154716k used, 1276800k free, 222644k buffers
Swap: 8380412k total, 0k used, 8380412k free, 3866884k cached

00:00:05 runq-sz plist-sz ldavg-1 ldavg-5 ldavg-15
00:10:02 12 496 0.37 0.74 0.85
00:20:02 8 480 0.41 0.69 0.80
00:30:02 24 516 0.48 0.59 0.71
00:40:02 11 484 0.94 0.80 0.77
00:50:01 12 567 0.62 0.70 0.74
01:00:02 14 556 0.95 0.73 0.73
01:10:02 9 484 0.88 0.98 0.87
01:20:02 22 585 1.91 1.51 1.12
01:30:01 16 565 1.29 1.34 1.26
01:40:01 37 550 1.23 1.17 1.20
01:50:01 8 541 0.93 1.11 1.17
02:00:01 5 541 1.29 1.03 1.08
02:10:03 30 589 0.63 0.84 0.98
02:20:01 7 530 0.85 0.75 0.85
02:30:02 7 510 0.71 0.80 0.84
02:40:02 50 580 0.64 0.74 0.81
02:50:01 11 545 0.61 0.66 0.74
03:00:01 3 535 0.80 0.68 0.71
03:10:02 9 547 0.62 0.71 0.75
03:20:01 12 539 0.75 0.80 0.78
03:30:02 9 521 0.62 0.75 0.79
03:40:01 28 529 0.68 0.80 0.80
03:50:02 48 645 0.71 0.88 0.88
04:00:01 8 551 0.56 0.70 0.79
04:10:02 3 557 0.55 0.60 0.71
04:20:02 10 561 0.96 0.76 0.72
04:30:02 67 595 0.38 0.54 0.63
04:40:03 15 598 0.87 0.71 0.66
04:50:03 49 584 0.64 0.67 0.68
05:00:02 14 598 0.67 0.69 0.69
05:10:03 51 575 0.23 0.53 0.63
05:20:03 29 630 0.55 0.68 0.68
05:30:02 8 597 1.16 0.87 0.75
05:40:02 15 593 0.48 0.57 0.64
05:50:02 10 551 0.67 0.63 0.66
06:00:02 7 563 0.47 0.62 0.65
Average: 19 555 0.75 0.79 0.81

Can anyone help me understand what could have caused this issue?

The problem is that after a restart or config change, the Nagios core inside the CRE needs some time before it is available again for livestatus queries.
The time needed depends on several things. One is the overall load on the system, which looks OK. Another is how big your status file is and how big the log file of the current day is. Both are reloaded and processed/cached inside the core.

There are not many things you can do to troubleshoot this or to find out where the problem lies.
If you have enough resources on your machine, make a clone of your site.
Stop both sites. Inside the clone, remove all Nagios core log files and also the retention.dat. Start the cloned site and check how long it takes before livestatus is available for queries. Then make a config change inside the cloned site, apply the changes, and measure again how long it takes. Compare the results with your production site. Without the log files there should be a difference.
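
The "measure how long it takes" step can be sketched in Python as below. This is an illustration under assumptions, not part of Checkmk: the function names, the polling interval, and the choice of `GET status` with the `program_start` column as the probe query are mine. It polls the livestatus UNIX socket until a query is answered and returns the elapsed seconds.

```python
import socket
import time


def livestatus_query(socket_path, query, timeout=5.0):
    """Send one livestatus query over the UNIX socket and return the raw reply."""
    with socket.socket(socket.AF_UNIX, socket.SOCK_STREAM) as sock:
        sock.settimeout(timeout)
        sock.connect(socket_path)
        sock.sendall(query.encode("utf-8"))
        sock.shutdown(socket.SHUT_WR)  # tell the server the request is complete
        chunks = []
        while True:
            data = sock.recv(4096)
            if not data:
                break
            chunks.append(data)
    return b"".join(chunks).decode("utf-8")


def seconds_until_available(socket_path, max_wait=300.0, interval=0.5):
    """Poll until a query is answered; return elapsed seconds, or None on timeout."""
    query = "GET status\nColumns: program_start\n\n"
    started = time.monotonic()
    while time.monotonic() - started < max_wait:
        try:
            if livestatus_query(socket_path, query).strip():
                return time.monotonic() - started
        except OSError:
            pass  # socket missing, refused, or timed out: core not ready yet
        time.sleep(interval)
    return None
```

Start it right after triggering the restart or activation, for example with `seconds_until_available("/omd/sites/checkmk/tmp/run/live")`. That path is the one shown in the nagios.log excerpt; the cloned site will have its own socket path under its own site directory. Running the same measurement on the clone and on the production site gives the two numbers to compare.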

Thanks @andreas-doehler