Hello Forum,
Troubleshooting-Template with Versions etc. find below
since a few days the Agent-Checks on one of my SQL-Servers are failing since it is hitting the timeout of 60 secs. I’ve investigated this issue and can see in the task-manager, that the process ‘cscript.exe /Nologo “C:\ProgramData\checkmk\agent\plugins\mssql.vbs”’ runs very long (more than a minute) and consumes the whole time 100% of CPU (one Core).
I’ve tried to run the command-line manually and got a result within less than a second.
Then I stopped the check_mk-Agent-service and started the agent manually via command-line
.\check_mk_agent.exe adhoc -showall
and did on the check-mk-server an manual netcat-request:
pfleck@mycompany-s09:~$ time echo|nc mycompany-s20.intranet.mycompany.de 6556
<<<check_mk>>>
Version: 2.0.0p24
BuildDate: May 11 2022
AgentOS: windows
Hostname: mycompany-s20
Architecture: 64bit
WorkingDirectory: C:\Program Files (x86)\checkmk\service
ConfigFile: C:\Program Files (x86)\checkmk\service\check_mk.yml
LocalConfigFile: C:\ProgramData\checkmk\agent\check_mk.user.yml
AgentDirectory: C:\Program Files (x86)\checkmk\service
PluginsDirectory: C:\ProgramData\checkmk\agent\plugins
StateDirectory: C:\ProgramData\checkmk\agent\state
ConfigDirectory: C:\ProgramData\checkmk\agent\config
TempDirectory: C:\ProgramData\checkmk\agent\tmp
LogDirectory: C:\ProgramData\checkmk\agent\log
SpoolDirectory: C:\ProgramData\checkmk\agent\spool
LocalDirectory: C:\ProgramData\checkmk\agent\local
OnlyFrom:
[...]
real 0m1,973s
user 0m0,005s
sys 0m0,004s
everything fine!
If I stop the adhoc-mode and restart the service and do the same, the time is increasing extremly:
pfleck@mycompany-s09:~$ time echo|nc mycompany-s20.intranet.mycompany.de 6556
<<<check_mk>>>
[...]
real 1m15,714s
user 0m0,001s
sys 0m0,006s
I’ve also investigated what changed at the time, beginning with this issue: About 30min prior I baked new agents and I the auto-deploy installed it on the server. The strange thing is here: I have four servers deployed with the same agent and running the mssql-script, but only one shows this strange behavior.
Does anyone have any tipps to fix this issue or for further investigation?
thank you in advance
Philipp
CMK version:
2.0.0p24 (CEE)
OS version:
Debian 11 on Monitoring-Server, Windows 2019 (1809) on
Error message:
[agent] MKTimeout(‘Fetcher for host “mycompany-s20.intranet.mycompany.de” timed out after 60 seconds’)CRIT, [piggyback] Valid sources: mycompany-vcs02.intranet.mycompany.de, Missing monitoring data for check plugins: check_mk_agent_update, df, logwatch, mem_win, services_summary, systemtime, uptime, windows_tasks, windows_updates, winperf_if, winperf_phydisk, winperf_processor_util, wmi_cpuloadWARN, execution time 60.0 sec
Output of “cmk --debug -vvn hostname”:
OMD[mycompany_monitor]:~$ cmk --debug -vvn mycompany-s20.intranet.mycompany.de
Checkmk version 2.0.0p24
Try license usage history update.
Trying to acquire lock on /omd/sites/mycompany_monitor/var/check_mk/license_usage/next_run
Got lock on /omd/sites/mycompany_monitor/var/check_mk/license_usage/next_run
Trying to acquire lock on /omd/sites/mycompany_monitor/var/check_mk/license_usage/history.json
Got lock on /omd/sites/mycompany_monitor/var/check_mk/license_usage/history.json
Next run time has not been reached yet. Abort.
Releasing lock on /omd/sites/mycompany_monitor/var/check_mk/license_usage/history.json
Released lock on /omd/sites/mycompany_monitor/var/check_mk/license_usage/history.json
Releasing lock on /omd/sites/mycompany_monitor/var/check_mk/license_usage/next_run
Released lock on /omd/sites/mycompany_monitor/var/check_mk/license_usage/next_run
Loading autochecks from /omd/sites/mycompany_monitor/var/check_mk/autochecks/mycompany-s20.intranet.mycompany.de.mk
+ FETCHING DATA
Source: SourceType.HOST/FetcherType.TCP
[cpu_tracking] Start [7fd4862eecd0]
[TCPFetcher] Fetch with cache settings: DefaultAgentFileCache(base_path=PosixPath('/omd/sites/mycompany_monitor/tmp/check_mk/cache/mycompany-s20.intranet.mycompany.de'), max_age=MaxAge(checking=0, discovery=120, inventory=120), disabled=False, use_outdated=False, simulation=False)
Not using cache (Too old. Age is 823 sec, allowed is 0 sec)
[TCPFetcher] Execute data source
Connecting via TCP to 192.168.7.10:6556 (5.0s timeout)
Reading data from agent
Output is not encrypted
Write data to cache file /omd/sites/mycompany_monitor/tmp/check_mk/cache/mycompany-s20.intranet.mycompany.de
Trying to acquire lock on /omd/sites/mycompany_monitor/tmp/check_mk/cache/mycompany-s20.intranet.mycompany.de
Got lock on /omd/sites/mycompany_monitor/tmp/check_mk/cache/mycompany-s20.intranet.mycompany.de
Releasing lock on /omd/sites/mycompany_monitor/tmp/check_mk/cache/mycompany-s20.intranet.mycompany.de
Released lock on /omd/sites/mycompany_monitor/tmp/check_mk/cache/mycompany-s20.intranet.mycompany.de
Closing TCP connection to 192.168.7.10:6556
[cpu_tracking] Stop [7fd4862eecd0 - Snapshot(process=posix.times_result(user=0.0, system=0.0, children_user=0.0, children_system=0.0, elapsed=76.4699999988079))]
Source: SourceType.HOST/FetcherType.PIGGYBACK
[cpu_tracking] Start [7fd4862fb3d0]
[PiggybackFetcher] Fetch with cache settings: NoCache(base_path=PosixPath('/omd/sites/mycompany_monitor/tmp/check_mk/data_source_cache/piggyback/mycompany-s20.intranet.mycompany.de'), max_age=MaxAge(checking=0, discovery=120, inventory=120), disabled=False, use_outdated=False, simulation=False)
[PiggybackFetcher] Execute data source
Piggyback file '/omd/sites/mycompany_monitor/tmp/check_mk/piggyback/mycompany-s20.intranet.mycompany.de/mycompany-vcs02.intranet.mycompany.de': mycompany-vcs02.intranet.mycompany.de
No piggyback files for '192.168.7.10'. Skip processing.
[cpu_tracking] Stop [7fd4862fb3d0 - Snapshot(process=posix.times_result(user=0.0, system=0.0, children_user=0.0, children_system=0.0, elapsed=0.0))]
[cpu_tracking] Start [7fd4862eef40]
+ PARSE FETCHER RESULTS
Source: SourceType.HOST/FetcherType.TCP
No persisted sections loaded
-> Add sections: ['check_mk', 'df', 'dotnet_clrmemory', 'fileinfo', 'logwatch', 'mem', 'ps', 'services', 'systemtime', 'uptime', 'windows_tasks', 'windows_updates', 'winperf_if', 'winperf_phydisk', 'winperf_processor', 'wmi_cpuload']
Source: SourceType.HOST/FetcherType.PIGGYBACK
No persisted sections loaded
-> Add sections: ['esx_vsphere_vm', 'labels']
Received no piggyback data
Loading item states
Trying to acquire lock on /omd/sites/mycompany_monitor/tmp/check_mk/counters/mycompany-s20.intranet.mycompany.de
Got lock on /omd/sites/mycompany_monitor/tmp/check_mk/counters/mycompany-s20.intranet.mycompany.de
Releasing lock on /omd/sites/mycompany_monitor/tmp/check_mk/counters/mycompany-s20.intranet.mycompany.de
Released lock on /omd/sites/mycompany_monitor/tmp/check_mk/counters/mycompany-s20.intranet.mycompany.de
CPU utilization Total CPU: 8.13%
Check_MK Agent No errors, Last update check: 2022-06-03 11:15:52, Last agent update: 2022-05-31 07:45:30, Update URL: http://mycompany-s25.intranet.mycompany.de/mycompany_monitor/check_mk, Agent configuration: c408d0d7
Disk IO SUMMARY Read: 11.2 kB/s, Write: 61.9 kB/s, Latency: 253 microseconds
ESX CPU demand is 0.329 Ghz, 4 virtual CPUs
ESX Datastores Stored on mycompany-storage_vol0001 (2.79 TB/25.4% free)
ESX Guest Tools VMware Tools are installed and the version is current
ESX Heartbeat Heartbeat status is green
ESX Hostsystem Running on mycompany-esx04.intranet.mycompany.de
ESX Memory Host: 21.34 GB, Guest: 655.00 MB, Ballooned: 0.00 B, Private: 21.23 GB, Shared: 696.00 MB
ESX Mounted Devices HA functionality guaranteed
ESX Name mycompany-S20 (SQL)
ESX Snapshots Count: 0
Filesystem C:/ 41.72% used (37.30 of 89.40 GB), trend: +882.61 MB / 24 hours
Filesystem D:/ 12.09% used (4.84 of 39.98 GB), trend: 0.00 B / 24 hours
Filesystem E:/ 0.66% used (816.06 MB of 119.98 GB), trend: -2.92 MB / 24 hours
Filesystem F:/ 7.86% used (7.86 of 99.98 GB), trend: -373.14 kB / 24 hours
Filesystem G:/ 21.92% used (6.57 of 29.98 GB), trend: 0.00 B / 24 hours
Interface 1 [Ethernet-Adapter für vmxnet3], (Connected), Speed: 10 GBit/s, In: 1.34 kB/s (<0.01%), Out: 3.15 kB/s (<0.01%)
Log Application No error messages
Log HardwareEvents No error messages
Log Internet Explorer No error messages
Log Key Management Service No error messages
Log Security No error messages
Log System 122 CRIT, 3 WARN messages (Last worst: "Jun 03 09:31:24 49152.7031 Service_Control_Manager Der Dienst "Check MK Service" wurde unerwartet beendet. Dies ist bereits 1 Mal vorgekommen. Folgende Korrekturmaßnahmen werden in 2000 Millisekunden durchgeführt: Neustart des Diensts.")
Log Windows PowerShell No error messages
Memory RAM: 10.84% - 3.47 GB of 32.00 GB, Commit charge: 9.22% - 3.39 GB of 36.75 GB
Processor Queue 15 min load: 0.00 at 4 logical cores (0.00 per core)
Service Summary Autostart services: 60, Stopped services: 2
System Time Offset: -404 ms
System Updates 0 important updates, 0 optional updates
Task \mycompany\Auto-Update für Virenscanner The task exited successfully (0x00000000), Last run time: 03.06.2022 09:49:46, Next run time: 03.06.2022 12:49:46
Task \User_Feed_Synchronization-{98C7095D-9731-40DD-99DF-4E7E8467904B} The task exited successfully (0x00000000), Last run time: 03.06.2022 08:52:50, Next run time: 03.06.2022 15:06:05
Uptime Up since Jun 03 2022 09:04:18, Uptime: 2 hours 21 minutes
Piggyback file '/omd/sites/mycompany_monitor/tmp/check_mk/piggyback/mycompany-s20.intranet.mycompany.de/mycompany-vcs02.intranet.mycompany.de': mycompany-vcs02.intranet.mycompany.de
No piggyback files for '192.168.7.10'. Skip processing.
[cpu_tracking] Stop [7fd4862eef40 - Snapshot(process=posix.times_result(user=0.06999999999999984, system=0.0, children_user=0.0, children_system=0.0, elapsed=0.07000000029802322))]
[agent] Version: 2.0.0p24, OS: windows, [piggyback] Valid sources: mycompany-vcs02.intranet.mycompany.de, Missing monitoring data for check plugins: mssql_backup, mssql_blocked_sessions, mssql_connections, mssql_counters_cache_hits, mssql_counters_file_sizes, mssql_counters_locks, mssql_counters_locks_per_batch, mssql_counters_pageactivity, mssql_counters_sqlstats, mssql_counters_transactions, mssql_databases, mssql_datafiles, mssql_instance, mssql_jobs, mssql_tablespaces, mssql_transactionlogs(!), execution time 76.5 sec | execution_time=76.540 user_time=0.070 system_time=0.000 children_user_time=0.000 children_system_time=0.000 cmk_time_agent=76.470