I was able to solve the problem by simply deleting the faulty RRDs of the affected hosts under ~/var/pnp4nagios/perfdata (I deleted the complete directory of each host).
The RRDs get recreated, et voilà: Nagios doesn't crash any more (and WATO shows me graphs again).
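For anyone wanting to try the same workaround, here is a minimal sketch. It is demonstrated on a throwaway directory so it can be run safely anywhere; in a real site the path would be $OMD_ROOT/var/pnp4nagios/perfdata/&lt;hostname&gt;, the host name "myhost" is just an example, and you would restart the core (e.g. omd restart nagios) afterwards:

```shell
# Stand-in for var/pnp4nagios/perfdata (a real site would use $OMD_ROOT)
PERFDATA=$(mktemp -d)
mkdir -p "$PERFDATA/myhost"
touch "$PERFDATA/myhost/_HOST__rta.rrd"   # fake faulty RRD for the demo

# Delete the whole directory of the affected host; pnp4nagios recreates
# the RRDs the next time perfdata for that host arrives.
rm -rf "$PERFDATA/myhost"

ls "$PERFDATA"   # directory is now empty
```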
So, is the problem solved, or does the core still crash when you perform the same action, e.g. Analyze recent notifications → replay this notification, or anything else?
Thanks for providing this information. I don't see anything strange. Can you also check the log files below, especially around the time when the core crashes?
Here's the only error I could find in $OMD_ROOT/var/pnp4nagios/log/perfdata.log:
2025-07-07 06:21:33 [809913] [0] RRDs::update ERROR rrdcached@unix:/omd/sites/*****/tmp/run/rrdcached.sock: illegal attempt to update using time 1751862069.000000 when last update time is 1751862081.000000 (minimum one second step)
Every time the service crashes, it seems to be because of such an error (an attempt to update a graph with a timestamp earlier than the last recorded update).
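The rule rrdtool enforces here can be illustrated with a small sketch. The timestamps are taken from the error message above; the variable names and the comparison itself are mine, not rrdtool's code:

```shell
# rrdtool accepts an update only if its timestamp is strictly greater than
# the time of the last stored update ("minimum one second step").
last_update=1751862081   # last update time from the error above
new_time=1751862069      # update that arrived ~12 seconds in the past
if [ "$new_time" -le "$last_update" ]; then
  verdict=rejected       # rrdcached logs "illegal attempt to update using time ..."
else
  verdict=accepted
fi
echo "$verdict"
```

So any perfdata entry that reaches rrdcached out of order, even by one second, is rejected with exactly this error.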
I've managed to stop the crashes for now by deleting all RRD files in var/pnp4nagios/perfdata, but it really isn't the best experience so far.
I have the same issue. I can't find any reason for the constant Nagios crashes.
I did a clean install from scratch with version 2.4.0p5 and everything was fine. I was adding some hosts and suddenly it started to crash. I upgraded to the latest version, 2.4.0p7, and applied the new version's config, but the issue persists.
After setting the debug level to -1, I can't find any useful logs anywhere.
tail -f /opt/omd/sites/monitor/var/pnp4nagios/log/perfdata.log
2025-07-08 18:16:36 [6021] [0] RRDs::update /omd/sites/monitor/var/pnp4nagios/perfdata/nas03.licorbeirao.com/_HOST__rta.rrd 1751994972:0.301
2025-07-08 18:16:36 [6021] [0] RRDs::update ERROR rrdcached@unix:/omd/sites/monitor/tmp/run/rrdcached.sock: illegal attempt to update using time 1751994972.000000 when last update time is 1751994991.000000 (minimum one second step)
I have the same log entry as you, but this is not the reason for the crashing.
It's unbelievable that this issue has been around since 2.2, from what I understand from forums online.
In my case, the Nagios service started to crash when I added a "Host check command" set to "TCP Connect". Mind you, the check worked, but somehow it crashed Nagios afterwards. So I set it to "Always assume host to be up", which is not ideal, but better than the crashes.
I'm having this problem as well, since the weekend. I am running Checkmk Raw Edition 2.5.0-2025.07.16 via Docker. I think this setup was upgraded from version 2.4.something, but that was a couple of weeks ago. I don't know what would have triggered the behaviour change; possibly the Docker daemon was restarted. I only have one agent installed; the other VMs are all monitored via ESX only. The VM with the agent was restarted in this period.
The following don’t report anything obviously suspicious:
cmk -U -vvv
cmk --debug -vvR
The only suspicious line in the Docker logs is this:
monitoring syslogd: /dev/xconsole: No such file or directory
When I restart either Nagios or the whole Checkmk container, I see warning emails arrive from each of my dozen monitored VMs (due to missing services), and then the OK message arrives from each of them in turn. This takes about two minutes, and at that point the nagios service goes down.