Check chrony status stale

Hi,
since updating to CEE 1.6.0p11, the chrony time check keeps going into stale state.
I have double-checked that no ntpd is installed; the hosts run chrony only, on CentOS 7.x.

Any ideas why I’m getting this error on most of my hosts?

Can you take a look at your agent output on your system?
Is the agent also updated to the current version?


Hi Andreas,

agent is updated to 1.6.0p11 latest.

The output is

<<<chrony:cached(1586946575,30)>>>
Reference ID : 59EA404D (89.234.64.77)
Stratum : 3
Ref time (UTC) : Wed Apr 15 10:27:12 2020
System time : 0.000278031 seconds slow of NTP time
Last offset : -0.000086032 seconds
RMS offset : 0.000207073 seconds
Frequency : 26.163 ppm fast
Residual freq : -0.005 ppm
Skew : 0.126 ppm
Root delay : 0.021480082 seconds
Root dispersion : 0.005031839 seconds
Update interval : 1043.9 seconds
Leap status : Normal
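The `cached(...)` marker in the section header above carries the timestamp at which the agent produced the data and the number of seconds it stays valid. As a rough sketch (the parser below is illustrative and not part of Checkmk), the header can be decomposed like this:

```python
import re

def parse_cached_header(line):
    """Split a Checkmk-style cached section header such as
    '<<<chrony:cached(1586946575,30)>>>' into
    (section_name, created_unix_timestamp, max_age_seconds).
    Illustrative helper only, not Checkmk code."""
    m = re.match(r"<<<(\w+):cached\((\d+),(\d+)\)>>>", line)
    if not m:
        return None
    name, created, max_age = m.groups()
    return name, int(created), int(max_age)

# Header quoted from the agent output in this thread:
print(parse_cached_header("<<<chrony:cached(1586946575,30)>>>"))
# -> ('chrony', 1586946575, 30)
```

So this chrony section was produced at Unix time 1586946575 and is considered fresh for only 30 seconds.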

The same host from the GUI:

The next step is then

cmk --debug -vv yourhostname

You should see a chrony NTP check or some error message.

Hi,

no error…
NTP Time OK - Stratum: 3, Offset: 0.4600 ms, Reference ID: 59EA404D (89.234.64.77)

That’s the same output I can see in the GUI, but the check is in stale state… 🙁

Strange behaviour… after issuing the command on the CLI, the check recovers from stale state… I guess after some time it goes back to stale…

Do a “cmk --debug -vvII yourhost” and then “cmk -R”.
If the check still stays stale after that, I cannot help any further.

cmk --debug -vvII gives me:

  • FETCHING DATA
    [agent] Not using cache (Don’t try it)
    [agent] Execute data source
    [agent] Connecting via TCP to :6556 (25.0s timeout)
    [agent] Reading data from agent
    [agent] Write data to cache file /omd/sites/clmlnz/tmp/check_mk/cache/
    Try aquire lock on /omd/sites/clmlnz/tmp/check_mk/cache/
    Got lock on /omd/sites/clmlnz/tmp/check_mk/cache/
    Releasing lock on /omd/sites/clmlnz/tmp/check_mk/cache/
    Released lock on /omd/sites/clmlnz/tmp/check_mk/cache/
    Loading autochecks from /omd/sites/clmlnz/var/check_mk/autochecks/
    [agent] Using persisted section ‘lnx_packages’
    [agent] Using persisted section ‘lnx_cpuinfo’
    [agent] Using persisted section ‘lnx_ip_r’
    [agent] Using persisted section ‘lnx_uname’
    [agent] Using persisted section ‘dmidecode’
    [agent] Using persisted section ‘lnx_distro’
    [piggyback] No persisted sections loaded
    [piggyback] Execute data source
    No piggyback files for ‘’. Skip processing.
    No piggyback files for ‘’. Skip processing.
  • EXECUTING DISCOVERY PLUGINS (60)
    Trying discovery with: jolokia_generic.string, kernel, jolokia_metrics.in_memory, cifsmounts, lnx_if, postfix_mailq_status, jolokia_metrics.bea_threads, jolokia_metrics.threads, jolokia_metrics.writer, tcp_conn_stats, jolokia_metrics.off_heap, jolokia_jvm_threading, systemd_units, ps, uptime, df_netapp, jolokia_metrics.bea_requests, postfix_mailq, check_mk.only_from, cpu.threads, diskstat, jolokia_jvm_threading.pool, ps_lnx, jolokia_metrics.serv_req, jolokia_generic, jolokia_metrics.tp, df_netscaler, systemd_units.services, local, cpu.loads, jolokia_metrics.app_state, jolokia_metrics.bea_sess, jolokia_metrics.perm_gen, df_netapp32, md, df, jolokia_metrics.uptime, jolokia_metrics.bea_queue, jolokia_metrics.requests, mem.used, job, jolokia_metrics.cache_hits, mem.vmalloc, mem.win, systemd_units.services_summary, jolokia_metrics.on_disk, kernel.util, ps.perf, vbox_guest, jolokia_generic.rate, chrony, nfsmounts, jolokia_metrics.gc, check_mk.agent_update, mounts, df_zos, jolokia_info, jolokia_metrics.mem, mem.linux, jolokia_metrics.app_sess
    systemd_units does not support discovery. Skipping it.
    ps_lnx does not support discovery. Skipping it.
    ps.perf does not support discovery. Skipping it.
    Try aquire lock on /omd/sites/clmlnz/var/check_mk/autochecks/.mk
    Got lock on /omd/sites/clmlnz/var/check_mk/autochecks/.mk
    Releasing lock on /omd/sites/clmlnz/var/check_mk/autochecks/.mk
    Released lock on /omd/sites/clmlnz/var/check_mk/autochecks/.mk
    1 chrony
    1 cpu.loads
    1 cpu.threads
    4 df
    1 diskstat
    2 jolokia_info
    2 jolokia_jvm_threading
    2 jolokia_jvm_threading.pool
    4 jolokia_metrics.app_sess
    4 jolokia_metrics.app_state
    4 jolokia_metrics.gc
    2 jolokia_metrics.mem
    4 jolokia_metrics.requests
    2 jolokia_metrics.uptime
    3 kernel
    1 kernel.util
    1 lnx_if
    1 mem.linux
    4 mounts
    1 postfix_mailq
    1 postfix_mailq_status
    1 systemd_units.services_summary
    1 tcp_conn_stats
    1 uptime
    SUCCESS - Found 49 services, no host labels

chrony is there…
Hm… I have seen that a few more services are in stale state because of caching… Any hints on that? Can I configure the caching behaviour?

I guess it has been a problem since we upgraded to CEE… hm…

Could this still be a problem?
https://checkmk.com/check_mk-werks.php?werk_id=8261

Related to the discovery service…

I don’t think so, as this is a very old issue.

So the service now stays stale? Is it not refreshed if you do a “cmk yourhost” on the command line?

No, still stale. They go away after doing that, but after a few minutes they are stale again. Most of them are NTP (~70 hosts), some of them are Azure related.

All of them seem to be cache related… ~90–120 services (of 11k total) are in stale state.
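To see why a service shows up as stale in the first place: Checkmk computes a staleness ratio from the age of the last check result and the configured check interval, and flags the service once the ratio passes a threshold (1.5 by default, configurable; treated here as an assumption). A minimal sketch:

```python
def staleness(now_s, last_result_s, check_interval_s):
    """Staleness ratio: seconds since the last real check result,
    divided by the configured check interval. A service is flagged
    stale once this exceeds a threshold (assumed 1.5 here)."""
    return (now_s - last_result_s) / check_interval_s

STALE_THRESHOLD = 1.5  # assumed default, configurable in Checkmk

# With a 10-minute interval, one missed result already trips the
# threshold at the next poll (20 min elapsed / 10 min interval = 2.0):
ratio = staleness(now_s=1200, last_result_s=0, check_interval_s=600)
print(ratio, ratio > STALE_THRESHOLD)
```

This is why checks whose data keeps arriving too old never refresh their last result and drift into stale state.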

On the host with the NTP problem, is there also nothing written in the “Check_MK” service about a missing agent section?

No… the section is there.

I don’t know what else this could be. Version p11 is OK; the stale problem existed only up to p10, and only with active checks 🙂

  • the data is sent from the agent
  • if you do a “cmk hostname” on the command line, the service is refreshed

The only point left:

  • is there a rule for “Normal check interval for service checks” active that affects the NTP service?

Look at the “Parameters for this service” and there at the “Monitoring Configuration” section.
Compare this section with the “Parameters for this service” of the “Check_MK” service.
Is this all the same?

Hi Andreas!
For the Timeservice:

Service “Check_MK”

But both settings are “old” ones from the RAW edition. I had a similar problem
with AWS checks… settings from the RAW edition did not work well with CEE. After
reverting to the defaults, AWS works fine now.

Regards
Günther

Hm… I did a test: I reset the “normal check interval” back to the default (1 minute). The stale status seems to be gone; I will post an update tomorrow on whether it is still better.

The problem here is that NTP is a check that is cached on the agent side, but only for 30 seconds.
With a query interval of 5 minutes, the cached data is already invalid again at every poll and is not processed at all.
Off the top of my head, I don’t know of a setting to change this either.
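This cache/interval mismatch can be sketched numerically, using the timestamp and max-age from the agent header quoted earlier in the thread (the helper is illustrative, not Checkmk code):

```python
def cache_is_valid(now_s, created_ts, max_age_s):
    """A cached agent section is usable only while its age is below
    the max-age stated in the section header. Illustrative sketch."""
    return now_s - created_ts < max_age_s

created = 1586946575   # timestamp from <<<chrony:cached(1586946575,30)>>>

# Shortly after the agent refreshed the section, the cache is fine:
print(cache_is_valid(created + 10, created, 30))    # True

# But at the next 5-minute (300 s) poll it has long expired, so the
# section is discarded and the check produces no fresh result:
print(cache_is_valid(created + 300, created, 30))   # False
```

With the check interval back at one minute, enough polls land close to an agent refresh that the service keeps getting results and no longer goes stale.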

Just leaving the normal check interval at one minute is fine 🙂
I really only change it in exceptional cases.

Hi Andreas,

I can follow that now. It was never a problem in the RAW edition; only now with CEE
are some things different. The load on the server with 12k services is of course
much higher now, but it does actually seem to work.

Thx for your help!

Günther

This topic was automatically closed 30 days after the last reply. New replies are no longer allowed.