Long runtime of Agent-Check, when agent is running as Service

Hello Forum,

Troubleshooting-Template with Versions etc. find below

since a few days the Agent-Checks on one of my SQL-Servers are failing since it is hitting the timeout of 60 secs. I’ve investigated this issue and can see in the task-manager, that the process ‘cscript.exe /Nologo “C:\ProgramData\checkmk\agent\plugins\mssql.vbs”’ runs very long (more than a minute) and consumes the whole time 100% of CPU (one Core).
I’ve tried to run the command-line manually and got a result within less than a second.

Then I stopped the check_mk-Agent-service and started the agent manually via command-line

.\check_mk_agent.exe adhoc -showall

and did on the check-mk-server an manual netcat-request:

pfleck@mycompany-s09:~$ time echo|nc mycompany-s20.intranet.mycompany.de 6556
<<<check_mk>>>
Version: 2.0.0p24
BuildDate: May 11 2022
AgentOS: windows
Hostname: mycompany-s20
Architecture: 64bit
WorkingDirectory: C:\Program Files (x86)\checkmk\service
ConfigFile: C:\Program Files (x86)\checkmk\service\check_mk.yml
LocalConfigFile: C:\ProgramData\checkmk\agent\check_mk.user.yml
AgentDirectory: C:\Program Files (x86)\checkmk\service
PluginsDirectory: C:\ProgramData\checkmk\agent\plugins
StateDirectory: C:\ProgramData\checkmk\agent\state
ConfigDirectory: C:\ProgramData\checkmk\agent\config
TempDirectory: C:\ProgramData\checkmk\agent\tmp
LogDirectory: C:\ProgramData\checkmk\agent\log
SpoolDirectory: C:\ProgramData\checkmk\agent\spool
LocalDirectory: C:\ProgramData\checkmk\agent\local
OnlyFrom:
[...]
real    0m1,973s
user    0m0,005s
sys     0m0,004s

everything fine!
If I stop the adhoc-mode and restart the service and do the same, the time is increasing extremly:

pfleck@mycompany-s09:~$ time echo|nc mycompany-s20.intranet.mycompany.de 6556
<<<check_mk>>>
[...]
real    1m15,714s
user    0m0,001s
sys     0m0,006s

I’ve also investigated what changed at the time, beginning with this issue: About 30min prior I baked new agents and I the auto-deploy installed it on the server. The strange thing is here: I have four servers deployed with the same agent and running the mssql-script, but only one shows this strange behavior.

Does anyone have any tipps to fix this issue or for further investigation?

thank you in advance
Philipp

CMK version:
2.0.0p24 (CEE)

OS version:
Debian 11 on Monitoring-Server, Windows 2019 (1809) on

Error message:
[agent] MKTimeout(‘Fetcher for host “mycompany-s20.intranet.mycompany.de” timed out after 60 seconds’)CRIT, [piggyback] Valid sources: mycompany-vcs02.intranet.mycompany.de, Missing monitoring data for check plugins: check_mk_agent_update, df, logwatch, mem_win, services_summary, systemtime, uptime, windows_tasks, windows_updates, winperf_if, winperf_phydisk, winperf_processor_util, wmi_cpuloadWARN, execution time 60.0 sec

Output of “cmk --debug -vvn hostname”:

OMD[mycompany_monitor]:~$ cmk --debug -vvn mycompany-s20.intranet.mycompany.de
Checkmk version 2.0.0p24
Try license usage history update.
Trying to acquire lock on /omd/sites/mycompany_monitor/var/check_mk/license_usage/next_run
Got lock on /omd/sites/mycompany_monitor/var/check_mk/license_usage/next_run
Trying to acquire lock on /omd/sites/mycompany_monitor/var/check_mk/license_usage/history.json
Got lock on /omd/sites/mycompany_monitor/var/check_mk/license_usage/history.json
Next run time has not been reached yet. Abort.
Releasing lock on /omd/sites/mycompany_monitor/var/check_mk/license_usage/history.json
Released lock on /omd/sites/mycompany_monitor/var/check_mk/license_usage/history.json
Releasing lock on /omd/sites/mycompany_monitor/var/check_mk/license_usage/next_run
Released lock on /omd/sites/mycompany_monitor/var/check_mk/license_usage/next_run
Loading autochecks from /omd/sites/mycompany_monitor/var/check_mk/autochecks/mycompany-s20.intranet.mycompany.de.mk
+ FETCHING DATA
  Source: SourceType.HOST/FetcherType.TCP
[cpu_tracking] Start [7fd4862eecd0]
[TCPFetcher] Fetch with cache settings: DefaultAgentFileCache(base_path=PosixPath('/omd/sites/mycompany_monitor/tmp/check_mk/cache/mycompany-s20.intranet.mycompany.de'), max_age=MaxAge(checking=0, discovery=120, inventory=120), disabled=False, use_outdated=False, simulation=False)
Not using cache (Too old. Age is 823 sec, allowed is 0 sec)
[TCPFetcher] Execute data source
Connecting via TCP to 192.168.7.10:6556 (5.0s timeout)
Reading data from agent
Output is not encrypted
Write data to cache file /omd/sites/mycompany_monitor/tmp/check_mk/cache/mycompany-s20.intranet.mycompany.de
Trying to acquire lock on /omd/sites/mycompany_monitor/tmp/check_mk/cache/mycompany-s20.intranet.mycompany.de
Got lock on /omd/sites/mycompany_monitor/tmp/check_mk/cache/mycompany-s20.intranet.mycompany.de
Releasing lock on /omd/sites/mycompany_monitor/tmp/check_mk/cache/mycompany-s20.intranet.mycompany.de
Released lock on /omd/sites/mycompany_monitor/tmp/check_mk/cache/mycompany-s20.intranet.mycompany.de
Closing TCP connection to 192.168.7.10:6556
[cpu_tracking] Stop [7fd4862eecd0 - Snapshot(process=posix.times_result(user=0.0, system=0.0, children_user=0.0, children_system=0.0, elapsed=76.4699999988079))]
  Source: SourceType.HOST/FetcherType.PIGGYBACK
[cpu_tracking] Start [7fd4862fb3d0]
[PiggybackFetcher] Fetch with cache settings: NoCache(base_path=PosixPath('/omd/sites/mycompany_monitor/tmp/check_mk/data_source_cache/piggyback/mycompany-s20.intranet.mycompany.de'), max_age=MaxAge(checking=0, discovery=120, inventory=120), disabled=False, use_outdated=False, simulation=False)
[PiggybackFetcher] Execute data source
Piggyback file '/omd/sites/mycompany_monitor/tmp/check_mk/piggyback/mycompany-s20.intranet.mycompany.de/mycompany-vcs02.intranet.mycompany.de': mycompany-vcs02.intranet.mycompany.de
No piggyback files for '192.168.7.10'. Skip processing.
[cpu_tracking] Stop [7fd4862fb3d0 - Snapshot(process=posix.times_result(user=0.0, system=0.0, children_user=0.0, children_system=0.0, elapsed=0.0))]
[cpu_tracking] Start [7fd4862eef40]
+ PARSE FETCHER RESULTS
  Source: SourceType.HOST/FetcherType.TCP
No persisted sections loaded
  -> Add sections: ['check_mk', 'df', 'dotnet_clrmemory', 'fileinfo', 'logwatch', 'mem', 'ps', 'services', 'systemtime', 'uptime', 'windows_tasks', 'windows_updates', 'winperf_if', 'winperf_phydisk', 'winperf_processor', 'wmi_cpuload']
  Source: SourceType.HOST/FetcherType.PIGGYBACK
No persisted sections loaded
  -> Add sections: ['esx_vsphere_vm', 'labels']
Received no piggyback data
Loading item states
Trying to acquire lock on /omd/sites/mycompany_monitor/tmp/check_mk/counters/mycompany-s20.intranet.mycompany.de
Got lock on /omd/sites/mycompany_monitor/tmp/check_mk/counters/mycompany-s20.intranet.mycompany.de
Releasing lock on /omd/sites/mycompany_monitor/tmp/check_mk/counters/mycompany-s20.intranet.mycompany.de
Released lock on /omd/sites/mycompany_monitor/tmp/check_mk/counters/mycompany-s20.intranet.mycompany.de
CPU utilization      Total CPU: 8.13%
Check_MK Agent       No errors, Last update check: 2022-06-03 11:15:52, Last agent update: 2022-05-31 07:45:30, Update URL: http://mycompany-s25.intranet.mycompany.de/mycompany_monitor/check_mk, Agent configuration: c408d0d7
Disk IO SUMMARY      Read: 11.2 kB/s, Write: 61.9 kB/s, Latency: 253 microseconds
ESX CPU              demand is 0.329 Ghz, 4 virtual CPUs
ESX Datastores       Stored on mycompany-storage_vol0001 (2.79 TB/25.4% free)
ESX Guest Tools      VMware Tools are installed and the version is current
ESX Heartbeat        Heartbeat status is green
ESX Hostsystem       Running on mycompany-esx04.intranet.mycompany.de
ESX Memory           Host: 21.34 GB, Guest: 655.00 MB, Ballooned: 0.00 B, Private: 21.23 GB, Shared: 696.00 MB
ESX Mounted Devices  HA functionality guaranteed
ESX Name             mycompany-S20 (SQL)
ESX Snapshots        Count: 0
Filesystem C:/       41.72% used (37.30 of 89.40 GB), trend: +882.61 MB / 24 hours
Filesystem D:/       12.09% used (4.84 of 39.98 GB), trend: 0.00 B / 24 hours
Filesystem E:/       0.66% used (816.06 MB of 119.98 GB), trend: -2.92 MB / 24 hours
Filesystem F:/       7.86% used (7.86 of 99.98 GB), trend: -373.14 kB / 24 hours
Filesystem G:/       21.92% used (6.57 of 29.98 GB), trend: 0.00 B / 24 hours
Interface 1          [Ethernet-Adapter für vmxnet3], (Connected), Speed: 10 GBit/s, In: 1.34 kB/s (<0.01%), Out: 3.15 kB/s (<0.01%)
Log Application      No error messages
Log HardwareEvents   No error messages
Log Internet Explorer No error messages
Log Key Management Service No error messages
Log Security         No error messages
Log System           122 CRIT, 3 WARN messages (Last worst: "Jun 03 09:31:24 49152.7031 Service_Control_Manager Der Dienst "Check MK Service" wurde unerwartet beendet. Dies ist bereits 1 Mal vorgekommen. Folgende Korrekturmaßnahmen werden in 2000 Millisekunden durchgeführt: Neustart des Diensts.")
Log Windows PowerShell No error messages
Memory               RAM: 10.84% - 3.47 GB of 32.00 GB, Commit charge: 9.22% - 3.39 GB of 36.75 GB
Processor Queue      15 min load: 0.00 at 4 logical cores (0.00 per core)
Service Summary      Autostart services: 60, Stopped services: 2
System Time          Offset: -404 ms
System Updates       0 important updates, 0 optional updates
Task \mycompany\Auto-Update für Virenscanner The task exited successfully (0x00000000), Last run time: 03.06.2022 09:49:46, Next run time: 03.06.2022 12:49:46
Task \User_Feed_Synchronization-{98C7095D-9731-40DD-99DF-4E7E8467904B} The task exited successfully (0x00000000), Last run time: 03.06.2022 08:52:50, Next run time: 03.06.2022 15:06:05
Uptime               Up since Jun 03 2022 09:04:18, Uptime: 2 hours 21 minutes
Piggyback file '/omd/sites/mycompany_monitor/tmp/check_mk/piggyback/mycompany-s20.intranet.mycompany.de/mycompany-vcs02.intranet.mycompany.de': mycompany-vcs02.intranet.mycompany.de
No piggyback files for '192.168.7.10'. Skip processing.
[cpu_tracking] Stop [7fd4862eef40 - Snapshot(process=posix.times_result(user=0.06999999999999984, system=0.0, children_user=0.0, children_system=0.0, elapsed=0.07000000029802322))]
[agent] Version: 2.0.0p24, OS: windows, [piggyback] Valid sources: mycompany-vcs02.intranet.mycompany.de, Missing monitoring data for check plugins: mssql_backup, mssql_blocked_sessions, mssql_connections, mssql_counters_cache_hits, mssql_counters_file_sizes, mssql_counters_locks, mssql_counters_locks_per_batch, mssql_counters_pageactivity, mssql_counters_sqlstats, mssql_counters_transactions, mssql_databases, mssql_datafiles, mssql_instance, mssql_jobs, mssql_tablespaces, mssql_transactionlogs(!), execution time 76.5 sec | execution_time=76.540 user_time=0.070 system_time=0.000 children_user_time=0.000 children_system_time=0.000 cmk_time_agent=76.470

I’ve tested some other things:
I’ve started the Adhoc-Mode via psexec under the LocalSYSTEM-Account to investigate if it is a permission issue or something that belongs to a user-specific behavior and it seems so: If I start the Adhoc-Mode with an Administrator-User, everything is fine. If I start it with the SYSTEM-Account, the mssql.vbs-Plugin fails:

Received [128] bytes from 'msexch'
Received [128] bytes from 'skype'
perf:  In [2053] milliseconds process 'powershell.exe -NoLogo -NoProfile -ExecutionPolicy Bypass -File "C:\ProgramData\checkmk\agent\plugins\windows_tasks.ps1"' pid:[5284] SUCCEDED - generated [425] bytes of data in [2] blocks
perf:  In [75242] milliseconds process 'cscript.exe //Nologo "C:\ProgramData\checkmk\agent\plugins\mssql.vbs"' pid:[868] FAILED - generated [193384] bytes of data in [48] blocks
Sync Plugin stopped 'C:\ProgramData\checkmk\agent\plugins\mssql.vbs' Stopped: false Failed: true
Provider 'plugins' is about to be started, id '426334223182000' port [mail:\\.\mailslot\Global\WinAgentTest_0]
Sending data 'plugins' id is [426334223182000] length [733]
perf: Section 'plugins' took [1] milliseconds
Received [861] bytes from 'plugins'
perf: Answer is ready in [75614] milliseconds
Send [33145] bytes of data

I’ve tried to start the mssql.vbs manually with the SYSTEM-Account and it hangs in the mssql_datafiles-Section. It seems to me, that the script hangs in an endless-loop, since there is no reaction for several minutes and the CPU is in 100%. The last output of mssql.vbs was:

MSSQL_MSSQLSERVER|XDS|2022-06-07|19:49:24|D
MSSQL_MSSQLSERVER|XDS|2022-06-07|19:50:48|L
<<<mssql_transactionlogs:sep(124)>>>
MSSQLSERVER|master|mastlog|E:\MSSQL15.MSSQLSERVER\MSSQL\DATA\mastlog.ldf|0|2|1|1
MSSQLSERVER|tempdb|-|-|-|-|-|-|ERROR: Die SELECT-Berechtigung wurde für das database_files-Objekt, mssqlsystemresource-Datenbank, sys-Schema, verweigert. (SQLState: 42000/NativeError: 229).
MSSQLSERVER|model|-|-|-|-|-|-|ERROR: Die SELECT-Berechtigung wurde für das database_files-Objekt, mssqlsystemresource-Datenbank, sys-Schema, verweigert. (SQLState: 42000/NativeError: 229).
MSSQLSERVER|msdb|MSDBLog|E:\MSSQL15.MSSQLSERVER\MSSQL\DATA\MSDBLog.ldf|2097152|19|1|0
MSSQLSERVER|LeitXDS|MSDBLog|E:\MSSQL15.MSSQLSERVER\MSSQL\DATA\MSDBLog.ldf|2097152|19|1|0
MSSQLSERVER|XDS_Rights|MSDBLog|E:\MSSQL15.MSSQLSERVER\MSSQL\DATA\MSDBLog.ldf|2097152|19|1|0
MSSQLSERVER|XDS|MSDBLog|E:\MSSQL15.MSSQLSERVER\MSSQL\DATA\MSDBLog.ldf|2097152|19|1|0
<<<mssql_datafiles:sep(124)>>>
MSSQLSERVER|master|master|E:\MSSQL15.MSSQLSERVER\MSSQL\DATA\master.mdf|0|5|3|1

I guess it is a permission issue since the access on the transaction-logs of the databases “tempdb” and “model” failed previously and “tempdb” is the next database after “master” which was the last successfull output.
I’ll investigate the permissions of the SYSTEM-Account (I’ve thought this account has ALL permissions) and try to fix it. I’ll send you an update.

Nevertheless I think, that the mssql.vbs-Script should handle this case and not end up in an endless-loop.

best regards
Philipp

I’ve investigated the permissions and found out that the SYSTEM-Account didn’t has SELECT-Permissions on two databases. This caused and endless-Loop of the mssql.vbs-Script.

Best regards
Philipp

Hey @pfleck
We send you a fix for this via email. Can you please check and verify?
Martin

This topic was automatically closed 365 days after the last reply. New replies are no longer allowed. Contact an admin if you think this should be re-opened.