Does NTP Time check look at the wrong information from Chrony?

CMK version: 2.2
OS version: Ubuntu 18, 20, 22

Hi!

I have been investigating the NTP Time check because, from time to time, it reports on one server or another that the time has not been synchronized, which has not been true in any case so far.

We have Chrony, and chrony is configured with multiple NTP servers.

I have read some forum threads about Chrony; some mention lowering the time between syncs, the update interval, the Chrony cache and other things, but I can’t really find a root cause for this issue, or maybe I am misreading something.

The NTP Time check uses chrony tracking.
This returns the time for when the primary source (usually the first server) was last synced, but not when Chrony was actually last synced. So it could go any amount of time before chrony syncs against the first configured source again.
The command also returns information for how well chrony is in sync since the last sync, which can be calculated provided the system time, offset and root delay/dispersion. The value here will increase if chrony stays out of sync.
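
As a rough sketch of that calculation: the chrony documentation gives the bound clock_error <= |system_time| + root_dispersion + 0.5 * root_delay. The Python below parses `chronyc tracking` text and applies that formula. The helper names (`parse_tracking`, `seconds`, `clock_error_bound`) are made up for illustration, and the sample is an excerpt of real tracking output.

```python
def parse_tracking(text):
    """Turn 'Key : value' lines from `chronyc tracking` into a dict."""
    fields = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if sep:  # skip lines without a colon, e.g. the chronyc prompt
            fields[key.strip()] = value.strip()
    return fields


def seconds(value):
    """Leading number of e.g. '0.000036014 seconds slow of NTP time'."""
    return float(value.split()[0])


def clock_error_bound(fields):
    """Upper bound on the clock error, in seconds."""
    return (abs(seconds(fields["System time"]))
            + seconds(fields["Root dispersion"])
            + 0.5 * seconds(fields["Root delay"]))


SAMPLE = """\
System time     : 0.000036014 seconds slow of NTP time
Root delay      : 0.006446205 seconds
Root dispersion : 0.000552796 seconds
"""

bound = clock_error_bound(parse_tracking(SAMPLE))
print(f"clock error bound: {bound * 1000:.3f} ms")  # roughly 3.8 ms here
```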

Comparing the time that the check returns, it seems to be the reference time from chrony tracking, which from what I understand has no relevance for checking how much time has gone by since the last sync or how well the time is in sync.

Going through the Chrony docs, the reference time does not really seem to be intended for this purpose either, so I am not sure how well the check is implemented in this respect.

chronyc ntpdata returns tracking for all sources and when they were last updated, which could be later than the check describes it for the primary source.

chronyc sources returns a short list of the same with a few details.

From what I am seeing, looking at chronyc tracking this way does not really help much; using ntpdata or sources would.
Is there another way to implement this or to make the NTP Time check look at this in a better way?
Or should I approach this in another way entirely?

Thank you!

That’s wrong. If you look at the output of “chronyc tracking”, the first line shows the IP of the server that is actually synced.
Compare the output to “chronyc sources” and you will see that tracking always shows the active time source.

chronyc sources

MS Name/IP address         Stratum Poll Reach LastRx Last sample
===============================================================================
^+ 10.50.0.24                    4   6    77     5   -383us[ -383us] +/-   20ms
^* 10.50.0.31                    4   6    77     6    +82us[ +324us] +/-   20ms

chronyc tracking

Reference ID    : 0A32001F (10.50.0.31)
Stratum         : 5
Ref time (UTC)  : Mon Apr 29 12:57:00 2024
System time     : 0.000178816 seconds fast of NTP time
Last offset     : +0.000241912 seconds
RMS offset      : 0.000080535 seconds
Frequency       : 12.373 ppm slow
Residual freq   : +0.228 ppm

The problem with “sources” and “ntpdata” is that you need to find the synced server by yourself. That is not the case with “tracking”.
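
To illustrate what finding the synced server yourself involves with “sources”, here is a minimal Python sketch. It assumes the standard human-readable layout, where the second character of each row is the state flag and `*` marks the currently selected source. `selected_source` is a hypothetical helper, not part of any existing check.

```python
def selected_source(sources_output):
    """Return the name/IP of the row flagged '*' (selected source), or None."""
    for line in sources_output.splitlines():
        # Rows look like '^* 10.50.0.31 ...': mode char, then state char.
        if len(line) >= 2 and line[1] == "*":
            return line.split()[1]
    return None


SAMPLE = """\
MS Name/IP address         Stratum Poll Reach LastRx Last sample
===============================================================================
^+ 10.50.0.24                    4   6    77     5   -383us[ -383us] +/-   20ms
^* 10.50.0.31                    4   6    77     6    +82us[ +324us] +/-   20ms
"""

print(selected_source(SAMPLE))
```

With “tracking” the same address is already available in the Reference ID line, so no scanning is needed.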

I do not completely agree.

I have here the output from tracking and ntpdata, where tracking shows older information than ntpdata.
The commands were run one after another.
tracking shows the first NTP server, while in ntpdata the most recently synced server is the second NTP server.
Usually chrony syncs against the first server every second or third sync, but in some cases it can take over an hour before that server is synced, while the second and third may both have been synced a few minutes ago.

tracking:

chronyc> tracking
Reference ID    : 5BD3AB42 ([ntp server 1])
Stratum         : 2
Ref time (UTC)  : Mon Apr 29 13:12:57 2024
System time     : 0.000036014 seconds slow of NTP time
Last offset     : -0.000039573 seconds
RMS offset      : 0.000087197 seconds
Frequency       : 3.511 ppm slow
Residual freq   : -0.000 ppm
Skew            : 0.011 ppm
Root delay      : 0.006446205 seconds
Root dispersion : 0.000552796 seconds
Update interval : 2082.7 seconds
Leap status     : Normal

ntpdata:

chronyc> ntpdata

Remote address  : [ntp server 1] (5BD3AB42)
Remote port     : 123
Local address   : [host 1] (0ADC2210)
Leap status     : Normal
Version         : 4
Mode            : Server
Stratum         : 1
Poll interval   : 10 (1024 seconds)
Precision       : -20 (0.000000954 seconds)
Root delay      : 0.000000 seconds
Root dispersion : 0.000137 seconds
Reference ID    : 4D525300 (MRS)
Reference time  : Mon Apr 29 13:12:56 2024
Offset          : +0.001776087 seconds
Peer delay      : 0.009865415 seconds
Peer dispersion : 0.000000979 seconds
Response time   : 0.000130150 seconds
Jitter asymmetry: -0.50
NTP tests       : 111 111 1111
Interleaved     : No
Authenticated   : No
TX timestamping : Daemon
RX timestamping : Kernel
Total TX        : 1247
Total RX        : 1246
Total valid RX  : 1246

Remote address  : [ntp server 2] (5BD3AB32)
Remote port     : 123
Local address   : [host 1] (0ADC2210)
Leap status     : Normal
Version         : 4
Mode            : Server
Stratum         : 1
Poll interval   : 10 (1024 seconds)
Precision       : -20 (0.000000954 seconds)
Root delay      : 0.000000 seconds
Root dispersion : 0.000198 seconds
Reference ID    : 4D525300 (MRS)
Reference time  : Mon Apr 29 13:16:06 2024
Offset          : +0.002857164 seconds
Peer delay      : 0.017961286 seconds
Peer dispersion : 0.000000979 seconds
Response time   : 0.000117386 seconds
Jitter asymmetry: -0.50
NTP tests       : 111 111 1101
Interleaved     : No
Authenticated   : No
TX timestamping : Daemon
RX timestamping : Kernel
Total TX        : 1106
Total RX        : 1105
Total valid RX  : 1105

Remote address  : [ntp server 3] (5BD3AB26)
Remote port     : 123
Local address   : [host 1] (0ADC2210)
Leap status     : Normal
Version         : 4
Mode            : Server
Stratum         : 1
Poll interval   : 10 (1024 seconds)
Precision       : -19 (0.000001907 seconds)
Root delay      : 0.000000 seconds
Root dispersion : 0.000153 seconds
Reference ID    : 4D525300 (MRS)
Reference time  : Mon Apr 29 12:59:37 2024
Offset          : +0.000102238 seconds
Peer delay      : 0.008521945 seconds
Peer dispersion : 0.000001932 seconds
Response time   : 0.000138057 seconds
Jitter asymmetry: -0.50
NTP tests       : 111 111 1111
Interleaved     : No
Authenticated   : No
TX timestamping : Daemon
RX timestamping : Kernel
Total TX        : 1216
Total RX        : 1216
Total valid RX  : 1216

It is not synced against the second server - you will see this with “chronyc sources”.
The reference time is only the time when this server was last checked.

Ref time
This is the time (UTC) at which the last measurement from the reference source was processed.

After such a measurement chrony decides if this is the preferred source of truth or not. The source shown with tracking is the best time source from the perspective of chrony.
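
A minimal Python sketch of what a check based on this value would compute, assuming it simply measures the age of Ref time and compares it to a limit. The 30-minute limit and the "current" time are assumptions chosen for the example, and `ref_time_age` is a made-up helper name.

```python
from datetime import datetime, timezone


def ref_time_age(ref_time, now):
    """Seconds between the tracking 'Ref time (UTC)' value and `now`.

    `now` must be a timezone-aware UTC datetime. The format string relies
    on English day/month names (Python's default C locale).
    """
    parsed = datetime.strptime(ref_time, "%a %b %d %H:%M:%S %Y")
    return (now - parsed.replace(tzinfo=timezone.utc)).total_seconds()


LIMIT = 30 * 60  # assumed threshold in seconds

# Assumed "current" time for the example; a real check would use
# datetime.now(timezone.utc).
now = datetime(2024, 4, 29, 13, 50, 0, tzinfo=timezone.utc)
age = ref_time_age("Mon Apr 29 13:12:57 2024", now)
print(age, age > LIMIT)
```

The age only says when the selected source last delivered a measurement; measurements from other sources in the meantime do not change it.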

Thank you for clearing things up a bit.

So what we are saying is that even though Chrony has synced to a source, or maybe more correctly polled a source, it has deemed that source less accurate and will thus display the one it deems most accurate, even though that sync could have been over an hour ago?
So the alarm for Chrony not being synced is more accurately described as “not synced against the most accurate time source for the last 30 minutes”?
All while it has still been checking the time against other sources and is not really out of sync.

This still gives me the feeling, albeit a corrected one, that we are not checking for what we want.
From my perspective I need to know that the server has its time synced, meaning that I need an alarm for when the time is out of sync or not synced at all, not for how long it has been since it synced against its preferred truth, since this can go unchecked for an hour.

Yes

Yes again.

Then you can define the quality of time parameters and use a very high value for the phases without synchronization.


Keep in mind that a default chrony on Red Hat has a sync interval of something like 64 seconds and not 30 minutes. The default rule makes sense for such a default configuration. If your chrony is deployed with its own config for the sync interval, then you also need to adjust the parameters.
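
For reference, chrony expresses poll intervals as powers of two in seconds, so the Poll value 6 in the sources output above is 64 seconds, and the `Poll interval : 10 (1024 seconds)` in the ntpdata output is 1024 seconds. A tiny Python sketch (the helper name and the 30-minute window are assumptions for illustration) showing how many polls of one source fit into such a window:

```python
def poll_seconds(exponent):
    """Poll interval in seconds for a chrony poll exponent (log2 seconds)."""
    return 2 ** exponent


WINDOW = 30 * 60  # assumed check window in seconds

for exponent in (6, 10):
    interval = poll_seconds(exponent)
    print(f"poll {exponent}: {interval} s, {WINDOW // interval} polls per window")
```

At poll 10 only a single poll of a given source fits into 30 minutes, so one skipped or rejected measurement is enough to exceed the window.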

But wouldn’t setting “Phases without sync” to something big also more or less disable the check for whether Chrony syncs at all?
With the exception of the sanity check on the quality of time.
If the solution for checking how long ago Chrony was synchronized is to just raise the time, we could just disable this part of the check entirely since it won’t have any effect, which feels a bit backwards.
Chrony is still in sync and has checked other sources; it just has not checked the source it considers to be the best.
I do want to know when it cannot sync, because then I can fix that before it goes out of sync, if that makes sense. So checking when the last sync happened seems valid, but not if it only returns a value that does not indicate whether things are good or bad.
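
One possible way to express that kind of check, purely as a sketch: take the LastRx column of `chronyc sources` for every source and look at the freshest one. This assumes the human-readable layout, where LastRx is in seconds unless suffixed with m, h, d or y, and `-` means no sample yet. The helper names are hypothetical. Note that it only shows whether chrony still receives samples from any source, not whether the clock was adjusted from them.

```python
UNITS = {"m": 60, "h": 3600, "d": 86400, "y": 365 * 86400}


def last_rx_seconds(token):
    """Convert a LastRx token such as '5', '17m' or '2h' to seconds."""
    if token == "-":
        return None  # no sample received yet
    if token[-1] in UNITS:
        return int(token[:-1]) * UNITS[token[-1]]
    return int(token)


def freshest_sample_age(sources_output):
    """Smallest LastRx (seconds) over all sources, or None if there is none."""
    ages = []
    for line in sources_output.splitlines():
        parts = line.split()
        # Data rows start with a mode character: ^ server, = peer, # refclock.
        if len(parts) >= 6 and parts[0][0] in "^=#":
            age = last_rx_seconds(parts[5])
            if age is not None:
                ages.append(age)
    return min(ages) if ages else None


SAMPLE = """\
MS Name/IP address         Stratum Poll Reach LastRx Last sample
===============================================================================
^+ 10.50.0.24                    4   6    77     5   -383us[ -383us] +/-   20ms
^* 10.50.0.31                    4   6    77     6    +82us[ +324us] +/-   20ms
"""

print(freshest_sample_age(SAMPLE))
```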

What is the reasoning for looking at the time since the last sync was performed against the preferred time source specifically?

Keep in mind that a default chrony on Red Hat has a sync interval of something like 64 seconds and not 30 minutes. The default rule makes sense for such a default configuration. If your chrony is deployed with its own config for the sync interval, then you also need to adjust the parameters.

The sync interval in the Chrony conf has not been changed so it is default.