Check chrony status stale

Hi,
since updating to CEE 1.6.0p11, the chrony time check keeps going into stale state.
I have double-checked that no ntpd is installed; the hosts run chrony only, on CentOS 7.x.

Any ideas why I’m getting this error on most of my hosts?

Can you take a look at your agent output on your system?
Is the agent also updated to the current version?


Hi Andreas,

agent is updated to 1.6.0p11 latest.

The output is

<<<chrony:cached(1586946575,30)>>>
Reference ID : 59EA404D (89.234.64.77)
Stratum : 3
Ref time (UTC) : Wed Apr 15 10:27:12 2020
System time : 0.000278031 seconds slow of NTP time
Last offset : -0.000086032 seconds
RMS offset : 0.000207073 seconds
Frequency : 26.163 ppm fast
Residual freq : -0.005 ppm
Skew : 0.126 ppm
Root delay : 0.021480082 seconds
Root dispersion : 0.005031839 seconds
Update interval : 1043.9 seconds
Leap status : Normal
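The `cached(...)` marker in the section header above carries the timestamp at which the agent produced the data and the number of seconds it stays valid. As a rough sketch (the parser below is illustrative and not part of Checkmk), the header can be decomposed like this:

```python
import re

def parse_cached_header(line):
    """Split a Checkmk-style cached section header such as
    '<<<chrony:cached(1586946575,30)>>>' into
    (section_name, created_unix_timestamp, max_age_seconds).
    Illustrative helper only, not Checkmk code."""
    m = re.match(r"<<<(\w+):cached\((\d+),(\d+)\)>>>", line)
    if not m:
        return None
    name, created, max_age = m.groups()
    return name, int(created), int(max_age)

# Header quoted from the agent output in this thread:
print(parse_cached_header("<<<chrony:cached(1586946575,30)>>>"))
# -> ('chrony', 1586946575, 30)
```

So this chrony section was produced at Unix time 1586946575 and is considered fresh for only 30 seconds.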

The same host from the GUI:

The next step is then

cmk --debug -vv yourhostname

You should see a chrony NTP check or some error message.

Hi,

no error…
NTP Time OK - Stratum: 3, Offset: 0.4600 ms, Reference ID: 59EA404D (89.234.64.77)

That’s the same output I can see in the GUI, but the check is in stale state… 🙁

Strange behaviour… after issuing the command on the CLI, the check recovers from stale state… I guess after some time it goes back to stale…

Do a “cmk --debug -vvII yourhost” and then “cmk -R”.
If the check still stays stale after that, I cannot help any further.

cmk --debug -vvII gives me:

  • FETCHING DATA
    [agent] Not using cache (Don’t try it)
    [agent] Execute data source
    [agent] Connecting via TCP to :6556 (25.0s timeout)
    [agent] Reading data from agent
    [agent] Write data to cache file /omd/sites/clmlnz/tmp/check_mk/cache/
    Try aquire lock on /omd/sites/clmlnz/tmp/check_mk/cache/
    Got lock on /omd/sites/clmlnz/tmp/check_mk/cache/
    Releasing lock on /omd/sites/clmlnz/tmp/check_mk/cache/
    Released lock on /omd/sites/clmlnz/tmp/check_mk/cache/
    Loading autochecks from /omd/sites/clmlnz/var/check_mk/autochecks/
    [agent] Using persisted section ‘lnx_packages’
    [agent] Using persisted section ‘lnx_cpuinfo’
    [agent] Using persisted section ‘lnx_ip_r’
    [agent] Using persisted section ‘lnx_uname’
    [agent] Using persisted section ‘dmidecode’
    [agent] Using persisted section ‘lnx_distro’
    [piggyback] No persisted sections loaded
    [piggyback] Execute data source
    No piggyback files for ‘’. Skip processing.
    No piggyback files for ‘’. Skip processing.
  • EXECUTING DISCOVERY PLUGINS (60)
    Trying discovery with: jolokia_generic.string, kernel, jolokia_metrics.in_memory, cifsmounts, lnx_if, postfix_mailq_status, jolokia_metrics.bea_threads, jolokia_metrics.threads, jolokia_metrics.writer, tcp_conn_stats, jolokia_metrics.off_heap, jolokia_jvm_threading, systemd_units, ps, uptime, df_netapp, jolokia_metrics.bea_requests, postfix_mailq, check_mk.only_from, cpu.threads, diskstat, jolokia_jvm_threading.pool, ps_lnx, jolokia_metrics.serv_req, jolokia_generic, jolokia_metrics.tp, df_netscaler, systemd_units.services, local, cpu.loads, jolokia_metrics.app_state, jolokia_metrics.bea_sess, jolokia_metrics.perm_gen, df_netapp32, md, df, jolokia_metrics.uptime, jolokia_metrics.bea_queue, jolokia_metrics.requests, mem.used, job, jolokia_metrics.cache_hits, mem.vmalloc, mem.win, systemd_units.services_summary, jolokia_metrics.on_disk, kernel.util, ps.perf, vbox_guest, jolokia_generic.rate, chrony, nfsmounts, jolokia_metrics.gc, check_mk.agent_update, mounts, df_zos, jolokia_info, jolokia_metrics.mem, mem.linux, jolokia_metrics.app_sess
    systemd_units does not support discovery. Skipping it.
    ps_lnx does not support discovery. Skipping it.
    ps.perf does not support discovery. Skipping it.
    Try aquire lock on /omd/sites/clmlnz/var/check_mk/autochecks/.mk
    Got lock on /omd/sites/clmlnz/var/check_mk/autochecks/.mk
    Releasing lock on /omd/sites/clmlnz/var/check_mk/autochecks/.mk
    Released lock on /omd/sites/clmlnz/var/check_mk/autochecks/.mk
    1 chrony
    1 cpu.loads
    1 cpu.threads
    4 df
    1 diskstat
    2 jolokia_info
    2 jolokia_jvm_threading
    2 jolokia_jvm_threading.pool
    4 jolokia_metrics.app_sess
    4 jolokia_metrics.app_state
    4 jolokia_metrics.gc
    2 jolokia_metrics.mem
    4 jolokia_metrics.requests
    2 jolokia_metrics.uptime
    3 kernel
    1 kernel.util
    1 lnx_if
    1 mem.linux
    4 mounts
    1 postfix_mailq
    1 postfix_mailq_status
    1 systemd_units.services_summary
    1 tcp_conn_stats
    1 uptime
    SUCCESS - Found 49 services, no host labels

chrony is there…
Hm… I have seen that a few more services are in stale state because of caching… Any hints on that? Can I configure the caching behaviour?

I guess it has been a problem since we upgraded to CEE… hm…

Could this still be a problem?
https://checkmk.com/check_mk-werks.php?werk_id=8261

Related to the discovery service…

I don’t think so, as this is a very old issue.

So the service now stays stale? Is it not refreshed if you do a “cmk yourhost” on the command line?

No, still stale. They go away after doing that, but after a few minutes they are stale again. Most of them are NTP (~70 hosts), some of them are Azure related.

All of them seem to be cache related… ~90–120 services (of 11k total) are in stale state.
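To see why a service shows up as stale in the first place: Checkmk computes a staleness ratio from the age of the last check result and the configured check interval, and flags the service once the ratio passes a threshold (1.5 by default, configurable; treated here as an assumption). A minimal sketch:

```python
def staleness(now_s, last_result_s, check_interval_s):
    """Staleness ratio: seconds since the last real check result,
    divided by the configured check interval. A service is flagged
    stale once this exceeds a threshold (assumed 1.5 here)."""
    return (now_s - last_result_s) / check_interval_s

STALE_THRESHOLD = 1.5  # assumed default, configurable in Checkmk

# With a 10-minute interval, one missed result already trips the
# threshold at the next poll (20 min elapsed / 10 min interval = 2.0):
ratio = staleness(now_s=1200, last_result_s=0, check_interval_s=600)
print(ratio, ratio > STALE_THRESHOLD)
```

This is why checks whose data keeps arriving too old never refresh their last result and drift into stale state.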

On the host with the NTP problem, is there also nothing written in the “Check_MK” service about a missing agent section?

No… the section is there.

I don’t know what else this could be. Version p11 is OK; the stale problem existed only up to p10, and only with active checks 🙂

  • the data is sent from the agent
  • if you do a “cmk hostname” on the command line, the service is refreshed

The only point left:

  • is there a rule for “Normal check interval for service checks” active that affects the NTP service?

Look at the “Parameters for this service” and there at the “Monitoring Configuration” section.
Compare this section with the “Parameters for this service” of the “Check_MK” service.
Is this all the same?

Hi Andreas!
For the Timeservice:

Service “Check_MK”

But both settings are “old” ones from the RAW edition. I had a similar problem
with AWS checks… settings from the RAW edition did not work well with CEE. After
reverting to the defaults, AWS works fine now.

Regards
Günther

Hm… I did a test: I reset the “normal check interval” back to the default (1 minute). The stale status seems to be gone; I will post an update tomorrow on whether it is still better.

The problem here is that NTP is a check that is cached on the agent side, but only for 30 seconds.
With a query interval of 5 minutes, the cached data is already invalid again at every poll and is not processed at all.
Off the top of my head, I don’t know of a setting to change this either.
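This cache/interval mismatch can be sketched numerically, using the timestamp and max-age from the agent header quoted earlier in the thread (the helper is illustrative, not Checkmk code):

```python
def cache_is_valid(now_s, created_ts, max_age_s):
    """A cached agent section is usable only while its age is below
    the max-age stated in the section header. Illustrative sketch."""
    return now_s - created_ts < max_age_s

created = 1586946575   # timestamp from <<<chrony:cached(1586946575,30)>>>

# Shortly after the agent refreshed the section, the cache is fine:
print(cache_is_valid(created + 10, created, 30))    # True

# But at the next 5-minute (300 s) poll it has long expired, so the
# section is discarded and the check produces no fresh result:
print(cache_is_valid(created + 300, created, 30))   # False
```

With the check interval back at one minute, enough polls land close to an agent refresh that the service keeps getting results and no longer goes stale.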

Just leaving the normal check interval at one minute is fine 🙂
I really only change it in exceptional cases.

Hi Andreas,

I can follow that now. It was never a problem in the RAW edition; only now with CEE
are some things different. The load on the server with 12k services is of course
much higher now, but it does actually seem to work.

Thx for your help!

Günther

This topic was automatically closed 30 days after the last reply. New replies are no longer allowed.