Hi guys,
I work for an organization that hosts a number of websites for partner and daughter organizations. Whenever we do code updates, sysadmins need to restart web and middle tier services. We inevitably forget to disable alerts/notifications in Nagios, and admins get a slew of pages as we bounce services on each node in each cluster. The service restarting has gotten really bad in the last few months. We bounce services nearly daily and Nagios is screaming with pages, so we need to get some automation in place.
I have a couple of ideas to handle this…
1 - I would love a snap-in that has a list of URLs with an enable/disable button like Master Control. The underlying Python script would (I assume) send commands to nagios.cmd to disable the services pertaining to the URL.
2 - The next step would be to have Nagios connect to the web/app node and bounce services. The problem here is that we monitor threads, and we only restart services once all the old threads are dead. We monitor threads by doing netstat -an. The app tier is ColdFusion MX9.
3 - My manager wants to work it from the other direction. He wants our standard script to connect to Nagios and disable/enable the service in line with the stop/start script. That way we don’t have to remember to flip over to Nagios to disable/enable services for a given URL.
What do you guys think is the best way to handle this?
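For what it's worth, ideas 1 and 3 both come down to writing a line to the Nagios external command file. Here is a rough sketch of that, assuming external commands are enabled and the command file is at /usr/local/nagios/var/rw/nagios.cmd (a common default, but check command_file in nagios.cfg). DISABLE_SVC_NOTIFICATIONS and ENABLE_SVC_NOTIFICATIONS are standard Nagios external commands; the function names and the host/service values are made up for illustration.

```python
import time
from contextlib import contextmanager

# Assumed path; the real one is whatever command_file is set to in nagios.cfg.
COMMAND_FILE = "/usr/local/nagios/var/rw/nagios.cmd"


def format_command(name, host, service, now=None):
    """Build one external command line: [epoch] NAME;host;service"""
    stamp = int(time.time() if now is None else now)
    return "[%d] %s;%s;%s\n" % (stamp, name, host, service)


def send_command(line, command_file=COMMAND_FILE):
    """Write a command line to the Nagios command file (a named pipe)."""
    with open(command_file, "w") as pipe:
        pipe.write(line)


@contextmanager
def notifications_disabled(host, service, send=send_command):
    """Silence one service for the duration of a restart, then re-enable it."""
    send(format_command("DISABLE_SVC_NOTIFICATIONS", host, service))
    try:
        yield
    finally:
        # Runs even if the restart fails, so alerts are never left off.
        send(format_command("ENABLE_SVC_NOTIFICATIONS", host, service))
```

A stop/start script running on the Nagios box could then wrap the bounce in `with notifications_disabled("web01", "HTTP"):` (hypothetical names). From a remote node, the same lines would have to reach the command file some other way, for example over ssh.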
Gents,
Scott suggested an NFS/SMB file share to act as a drop box. Whenever you want to issue a command to Nagios, you place a file named after the host in the share, with the service and command contained in the file. A script running continuously from init could scan the directory and take the needed action.
The last one is pretty easy – we do something pretty similar for backups.
Another method, which is what I now use: my Nagios server has an SMB mount shared to the internal world. Every minute or so it reads the directory for filenames that match monitored machines. If the file is empty, it puts the whole host in maintenance; if it contains service names, it only does the listed services. If the file disappears from the directory, it takes the host out of maintenance.
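The scanning side of that could look roughly like the sketch below. It is not the actual script; it assumes "maintenance" means turning notifications off (scheduled downtime would be another reasonable reading) and that the files hold one service name per line. The function names are made up. The command names are standard Nagios external commands.

```python
import os


def read_requests(share_dir, monitored_hosts):
    """Map each host with a flag file to its listed services ([] = whole host)."""
    requests = {}
    for name in os.listdir(share_dir):
        if name not in monitored_hosts:
            continue  # ignore files that do not name a monitored machine
        with open(os.path.join(share_dir, name)) as handle:
            requests[name] = [line.strip() for line in handle if line.strip()]
    return requests


def plan_commands(previous, current):
    """Compare two scans and return the Nagios command bodies to send."""
    commands = []
    for host in sorted(set(previous) | set(current)):
        old, new = previous.get(host), current.get(host)
        if old == new:
            continue  # nothing changed for this host since the last scan
        if old is not None:
            commands += _toggle("ENABLE", host, old)   # file removed or changed
        if new is not None:
            commands += _toggle("DISABLE", host, new)  # file added or changed
    return commands


def _toggle(verb, host, services):
    """Empty service list means the whole host; otherwise only those services."""
    if not services:
        return ["%s_HOST_NOTIFICATIONS;%s" % (verb, host),
                "%s_HOST_SVC_NOTIFICATIONS;%s" % (verb, host)]
    return ["%s_SVC_NOTIFICATIONS;%s;%s" % (verb, host, svc) for svc in services]
```

A loop run from init or cron would call read_requests() every minute, prefix each returned command with "[epoch] ", write it to the Nagios command file, and keep the current scan as the next "previous".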
This makes the API really simple for other boxes – just write a file named after the host to the share, and erase it when done. Since I have so many people who maintain different groups of machines and services, this works pretty well for me.
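From the other boxes, the write-then-erase step might be wrapped like this. Just a sketch: /mnt/nagios-maint is a made-up mount point, the function name is hypothetical, and it assumes the machine's hostname matches the host name Nagios knows it by.

```python
import os
import socket
from contextlib import contextmanager

SHARE_DIR = "/mnt/nagios-maint"  # hypothetical mount point for the share


@contextmanager
def maintenance(share_dir=SHARE_DIR, host=None, services=()):
    """Create the flag file for this host and remove it when the block exits."""
    host = host or socket.gethostname()
    flag = os.path.join(share_dir, host)
    with open(flag, "w") as handle:
        # No services means an empty file, i.e. the whole host.
        handle.write("".join(svc + "\n" for svc in services))
    try:
        yield flag
    finally:
        os.remove(flag)  # erased even if the restart itself fails
```

A restart script would do its work inside `with maintenance(services=["HTTP"]):`. Because the server only polls every minute or so, the script should probably wait a poll interval after creating the file before it bounces anything.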
Thanks,
/Chris C