Check_MK upgrading (1.5.0p7.cee -> current stable 1.6.0p*)

sandytoshev · August 19, 2020, 2:24pm

Hi all,

The company I work for is using Check_MK for monitoring (the paid version). Everything is working fine, but it is time for an upgrade. I have read the documentation and the upgrade process is pretty straightforward.
However, I got the feeling that it is way too easy - OK some custom plugins/edited files may need to be taken care of, but in general, the procedure is no big deal. I have read the following: “https://checkmk.com/cms_update.html”.

Is the above all? Any advice on what else I can read up on or get familiar with?
One major question comes to mind - do the Check_MK agents that are installed on the monitored hosts have to be updated as well?

Thank you all in advance.

martin.schwarz · August 19, 2020, 2:57pm

Some quick notes:

In a distributed setup (with config push), first update all slaves, then the master.
After upgrading, check the list of “unacknowledged incompatible werks” in the release notes (red numbered button beside the version number at the top of the sidebar).
Agents can be older but should not be newer than the monitoring server. You might miss some features when using an older agent though, so upgrading agents afterwards is highly recommended.
Always have a working backup.
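
The "agents can be older but should not be newer" rule can be checked mechanically before an upgrade. The following is a minimal sketch, not part of Check_MK: the function names are hypothetical, and it assumes stable version strings of the form seen in this thread (for example 1.5.0p7.cee or 1.6.0p16). Beta or innovation releases are not handled.

```python
import re

# Assumption: stable releases look like "1.6.0p16", optionally followed by
# an edition suffix such as ".cee". Anything after the patch level is ignored.
_VERSION_RE = re.compile(r"^(\d+)\.(\d+)\.(\d+)(?:p(\d+))?")


def parse_version(text):
    """Turn '1.6.0p16' into the comparable tuple (1, 6, 0, 16)."""
    match = _VERSION_RE.match(text)
    if match is None:
        raise ValueError("unrecognised version string: %r" % text)
    major, minor, sub, patch = match.groups()
    # A missing patch level is treated as patch 0.
    return (int(major), int(minor), int(sub), int(patch or 0))


def agent_is_acceptable(agent_version, server_version):
    """True if the agent is the same version as the server or older."""
    return parse_version(agent_version) <= parse_version(server_version)


if __name__ == "__main__":
    # Old agent against an upgraded server: fine.
    print(agent_is_acceptable("1.5.0p7", "1.6.0p16"))
    # New agent against a server that is still on the old version: not fine.
    print(agent_is_acceptable("1.6.0p16", "1.5.0p7.cee"))
```

Comparing tuples rather than raw strings avoids the trap where "1.6.0p9" sorts after "1.6.0p16" alphabetically.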

sandytoshev · August 20, 2020, 5:53am

Thank you for the fast response! Not having to upgrade 200+ agents is a very nice thing. About your other points - yeah - they will be taken care of (especially the backup).
BR.

louis · August 20, 2020, 7:05am

If you’re worried about having to upgrade 200+ agents manually (yes you really don’t want to do that; I agree), consider letting some tool like puppet handle the upgrade of the agents. Yes, that will be a one time annoying job, but once it’s done upgrading the agent on however many servers will be a piece of cake.

Louis

sandytoshev · August 20, 2020, 7:15am

Hi again,
Something else came to mind while I was re-reading the upgrade guide. Nothing is mentioned about the Check_MK config (I mean the config we have done via the web - WATO - adding checks, hosts, devices). After the upgrade, we will not have to add the hosts and the devices again, right? I do understand that there may be some “incompatible werks”, that we will have to work on. But the expectation is that the checks will be there, the users/mail groups, etc will be there also? The main functionality will stay mainly unaffected and functional? And then it is up to us, to start building up and using the new functionalities and optimizations that the new version is providing.

Thanks.

P.S.
@louis
Yeah - already started reading on how to do it with puppet.

louis · August 20, 2020, 7:20am

Well, it should all work out of the box. Now I must admit I’ve never done an upgrade for the commercial version, only the raw edition, but for me that worked flawlessly.

Maybe, if your CheckMK server is running as a virtual machine, you could clone it into an isolated environment and try the upgrade there, before going into production?

sandytoshev · August 25, 2020, 9:08am

Hi again,

Following up.
I found a server in our environment that was installed back in the day together with the production one - using the same Check_MK version and having only one host to monitor (it also has the mail groups and so on, and is connected to our AD). So I updated this VM first.
The update went pretty fast and easy. Came out with 127 “incompatible werks”. I am reading them up currently, but wanted to ask something in general first.

Am I supposed to expect the same number of werks on my production server? Keeping in mind that I have a lot more hosts there.
Is it safe to state the following: “Bug fixes that are compatible may be ignored”? Example:
1.6.0p15 2020-07-17 11:40:46 Bug fix Trivial change replace error message with no services discovered when licensing information is not found
I do not see what I can do about it, really. My guess is that this kind of werk is informational, kind of like a Readme.

The “Incompatible - TODO” werks are, of course, a whole different story and they should be addressed.

Thank you in advance.

martin.schwarz · August 25, 2020, 9:24am

You might want to read https://checkmk.com/check_mk-werks.php about what werks are and how they are organized/classified.

martin.schwarz · August 25, 2020, 9:33am

ad 1: werks are just the individual changes that make up the new version. So they are the same no matter how many or what kind of hosts you monitor. That’s why you as the monitoring admin should inspect the incompatible werks and check if they affect your environment. For example, if a werk changed the naming of a service on, say, Fortigate firewalls, then you should do a service rediscovery on your Fortigate systems, and perhaps adjust rulesets etc. If you are using only Checkpoint, then this does not affect you. In both cases, you can then acknowledge the incompatible werk in question to indicate you have checked it.
ad 2: yes, if the werk is compatible, then no action is required. You still might want to have a quick glance at the werk subjects so you know which bugs are fixed etc.

sandytoshev · August 25, 2020, 10:02am

Thank you for the fast response - covers my thoughts completely. Already glancing over the bug fixes - just to make sure that I understand them.
Then will dive deep into the TODO things.

sandytoshev · September 9, 2020, 8:10am

Following up - I guess you guys may be interested.
So I went to upgrade our production server last night (reverted it back to the old version, but will do the procedure again soon enough).
So pretty much everything worked out of the box, apart from two major things.

ESXi host multipath is renamed (not gone or anything - renamed). If the monitoring service with 1.5 is called something like “Multipath L20 physical”, with Check_MK 1.6 the same thing is discovered with a WWN number. So this is some manual work (there are 41 ESXi hosts), but it is OK.
A lot (and I mean a lot) of the Oracle checks are gone… the discovery cannot find them. Checks like ASM groups are getting UNKNOWN status, Inventory jobs are getting CRIT status…

So what I suggested is to add an Oracle host (srv1, let’s say) to the already updated test Check_MK server. Then ask the Oracle people to check if what they “see” there is enough for them.
Then, as a follow-up, I am thinking of updating the Check_MK agent on srv1 to see what checks will appear/disappear…

Basically - Oracle is fancy and makes problems

P.S. I guess there is no problem for a certain VM to be monitored from two Check_MK servers? I understand this becomes relevant when/if I update the agent on srv1.

sandytoshev · September 17, 2020, 7:31am

Hi again,

So the Check_MK discovery gets into UNKNOWN state every time it is run and tells me to “submit a crash report”. What I see in the logs is the following:

ValueError (invalid literal for int() with base 10: 'ST_RAC_CIMB/')

Traceback
  File "/omd/sites/NWTSBCK/lib/python/cmk_base/decorator.py", line 58, in wrapped_check_func
    status, infotexts, long_infotexts, perfdata = check_func(hostname, *args, **kwargs)
  File "/omd/sites/NWTSBCK/lib/python/cmk_base/discovery.py", line 422, in check_discovery
    on_error="raise")
  File "/omd/sites/NWTSBCK/lib/python/cmk_base/discovery.py", line 1057, in _get_host_services
    return _get_node_services(host_config, ipaddress, sources, multi_host_sections, on_error)
  File "/omd/sites/NWTSBCK/lib/python/cmk_base/discovery.py", line 1065, in _get_node_services
    multi_host_sections, on_error)
  File "/omd/sites/NWTSBCK/lib/python/cmk_base/discovery.py", line 1098, in _get_discovered_services
    multi_host_sections, on_error)
  File "/omd/sites/NWTSBCK/lib/python/cmk_base/discovery.py", line 834, in _discover_services
    check_plugin_name, on_error):
  File "/omd/sites/NWTSBCK/lib/python/cmk_base/data_sources/host_sections.py", line 299, in _update_with_parse_function
    return parse_function(section_content)
  File "/omd/sites/NWTSBCK/share/check_mk/checks/oracle_asm_diskgroup", line 157, in parse_oracle_asm_diskgroup
    "fg_disks": int(fg_disks),

As far as I understand this is the reason why I cannot see the Oracle ASM disks in the monitoring.
Any idea how to fix this?
Thank you in advance.

P.S. A little bit below in the logs, I see that it finds the ASM disks (and their actual size, etc) … but something’s wrong and it cannot show them:
" [None,
u'MOUNTED',
u'EXTERN',
u'N',
u'512',
u'4096',
u'4194304',
u'2289640',
u'162292',
u'0',
u'162292',
u'0',
u'N',
u'ST_RAC_DATA/'],"
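
The error message itself says that the value handed to int() for fg_disks was the disk group name, so the field positions the parser assumed do not line up with the data it received. The snippet below is a small diagnostic sketch, not the Check_MK parser: the helper name is hypothetical, and it only reports which positions of a row cannot be converted to an integer, using the row quoted in the P.S. above (without the leading None).

```python
def non_numeric_positions(fields):
    """Return the indexes of all fields that int() would reject."""
    positions = []
    for index, value in enumerate(fields):
        try:
            int(value)
        except ValueError:
            positions.append(index)
    return positions


# The row from the log excerpt, minus the leading None entry.
row = ["MOUNTED", "EXTERN", "N", "512", "4096", "4194304", "2289640",
       "162292", "0", "162292", "0", "N", "ST_RAC_DATA/"]

if __name__ == "__main__":
    # Shows which columns hold text; any parser that reads one of these
    # columns as a number fails the same way as in the traceback.
    print(non_numeric_positions(row))
    try:
        int("ST_RAC_CIMB/")
    except ValueError as exc:
        print(exc)
```

If a numeric field such as fg_disks is read from one of the reported text positions, the line layout and the parser's expectations differ, which points at a version mismatch between whatever produces the section and the check on the server.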

andreas-doehler · September 17, 2020, 9:02am

Is the Oracle plugin also the newer version from 1.6?

sandytoshev · September 17, 2020, 9:52am

Nope… ohhh. I understand - OK. Thanks
Actually by Oracle plugin - you mean that I should update the Check_mk-agent version, right?

martin.schwarz · September 17, 2020, 10:58am

Both the agent and the agent plugins

sandytoshev · September 25, 2020, 7:37am

Hi again all,

First, I want to apologise for the loooong post.
So, I have updated the agent and the plugins, but the discovery still breaks with the following. I have all the checks working, except the ASM disk-related ones - the used/free space ones.
"Starting job…

FETCHING DATA
[agent] Execute data source
[piggyback] Execute data source
No piggyback files for 'Name_of_the_VM'. Skip processing.
No piggyback files for 'IP_OF_THE_VM'. Skip processing.
WARNING: Exception while parsing agent section 'oracle_asm_diskgroup': ValueError("invalid literal for int() with base 10: 'ST_RAC_CIMB/'",)
File "/omd/sites/SITE_NAME/lib/python/cmk_base/data_sources/host_sections.py", line 299, in _update_with_parse_function
return parse_function(section_content)
File "/omd/sites/SITE_NAME/share/check_mk/checks/oracle_asm_diskgroup", line 157, in parse_oracle_asm_diskgroup
"fg_disks": int(fg_disks),

systemd_units does not support discovery. Skipping it.
ps_lnx does not support discovery. Skipping it.
ps.perf does not support discovery. Skipping it.
Completed."

Maybe I did something wrong with the update, so I am writing down how I updated the agent and the plugins:
0. Deleted the VM from monitoring.

Agent update:
Went to the Check_mk server GUI -> WATO -> Monitoring agents -> agent files -> Packaged agents category -> downloaded the check-mk-agent-1.6.0p16-1.noarch.rpm
Copied the file via SCP to the monitored VM and ran:
rpm -U check-mk-agent-1.6.0p16-1.noarch.rpm
systemctl reload xinetd.service

checked the update with:

check_mk_agent | head
<<<check_mk>>>
Version: 1.6.0p16
AgentOS: linux

Plugins update. I have only 3 plugins for that server (at least that is what is in “/usr/lib/check_mk_agent/plugins”) - mk_oracle, mk_oracle_asm & mk_oracle_crs.
So I went again to:
Check_mk server GUI -> WATO -> Monitoring agents -> agent files -> LINUX/UNIX AGENTS - PLUGINS category. Opened the above-mentioned plugins in a browser and copy-pasted the text into the files with the same names in /usr/lib/check_mk_agent/plugins. The text is copied via notepad++ and into VIM so I guess no special characters or something is inserted.
Added the host to the monitoring again. The error from the start of my post appears when I do a full scan. I believe that the agent is working, because there are a LOT of checks discovered. The machine is pretty much monitored as it should be - just no info about the ASM free/used space.
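
Copy-pasting plugin text through a browser and an editor can silently change line endings or prepend a byte order mark, so it may be worth comparing the pasted file with the server's copy byte for byte. The following is a minimal sketch under that assumption. The function names are hypothetical, and nothing here is a Check_MK tool.

```python
import hashlib

UTF8_BOM = b"\xef\xbb\xbf"


def sha256_hex(data):
    """Hex digest of the raw file content."""
    return hashlib.sha256(data).hexdigest()


def copy_problems(original, copied):
    """Describe how a pasted copy differs from the original bytes."""
    problems = []
    if b"\r\n" in copied and b"\r\n" not in original:
        problems.append("CRLF line endings introduced")
    if copied.startswith(UTF8_BOM) and not original.startswith(UTF8_BOM):
        problems.append("UTF-8 byte order mark introduced")
    # Fall back to a plain checksum comparison for any other difference.
    if not problems and sha256_hex(original) != sha256_hex(copied):
        problems.append("content differs")
    return problems


if __name__ == "__main__":
    server_copy = b"#!/bin/bash\necho '<<<example>>>'\n"
    pasted_copy = server_copy.replace(b"\n", b"\r\n")
    print(copy_problems(server_copy, pasted_copy))
```

In practice both inputs would be read with open(path, "rb").read(), once from the file offered by the server and once from the file under /usr/lib/check_mk_agent/plugins. Copying the plugin files with SCP, as was done for the RPM, avoids the question entirely.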

I will be grateful if someone can give some insight.
Thank you in advance.

andreas-doehler · September 25, 2020, 8:00am

First you should have a look at the agent output.
Search for the section header “<<<oracle_asm_diskgroup…”
This should be visible two times. The first occurrence is empty, which is correct.

The agent output processed can be found in ~/tmp/check_mk/cache/hostname

sandytoshev · September 29, 2020, 1:00pm

Thanks for the answer,

I have checked the agent output and indeed, the 1st occurrence is just a line (empty). Then, I see the following:
<<<oracle_asm_diskgroup>>>
MOUNTED EXTERN N 512 4096 4194304 51196 7668 0 7668 0 N ST_RAC_CIMB/
MOUNTED EXTERN N 512 4096 4194304 163836 77428 0 77428 0 N ST_RAC_CON/
MOUNTED EXTERN N 512 4096 4194304 10236 9900 0 9900 0 Y ST_RAC_CW/
MOUNTED EXTERN N 512 4096 4194304 262140 251720 0 251720 0 N ST_RAC_LOGS/
MOUNTED EXTERN N 512 4096 4194304 2289640 108868 0 108868 0 N ST_RAC_DATA/

<<<oracle_asm_diskgroup>>>
MOUNTED EXTERN N 512 512 4096 4194304 51196 7668 0 7668 0 N ST_RAC_CIMB/
MOUNTED EXTERN N 512 512 4096 4194304 163836 77428 0 77428 0 N ST_RAC_CON/
MOUNTED EXTERN N 512 512 4096 4194304 10236 9900 0 9900 0 Y ST_RAC_CW/
MOUNTED EXTERN N 512 512 4096 4194304 2289640 108868 0 108868 0 N ST_RAC_DATA/
MOUNTED EXTERN N 512 512 4096 4194304 262140 251720 0 251720 0 N ST_RAC_LOGS/

I guess the agent sees the disks, but cannot parse them for some reason?
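
One thing that stands out in the output above is that the two sections do not have the same number of columns. A short sketch for checking this on a saved copy of the agent output follows. The helper names are hypothetical, the sample is an excerpt of the lines posted above, and the handling of options after a colon in the section header is an assumption about the header format.

```python
def section_lines(agent_output, section_name):
    """Return one list of data lines per occurrence of the section."""
    occurrences = []
    current = None
    for line in agent_output.splitlines():
        stripped = line.strip()
        if stripped.startswith("<<<") and stripped.endswith(">>>"):
            # Assumption: options may follow the name after a colon.
            name = stripped[3:-3].split(":", 1)[0]
            if name == section_name:
                current = []
                occurrences.append(current)
            else:
                current = None
        elif current is not None and stripped:
            current.append(stripped)
    return occurrences


def field_counts(lines):
    """Distinct numbers of whitespace-separated fields in the lines."""
    return sorted({len(line.split()) for line in lines})


SAMPLE = """<<<oracle_asm_diskgroup>>>
MOUNTED EXTERN N 512 4096 4194304 51196 7668 0 7668 0 N ST_RAC_CIMB/
MOUNTED EXTERN N 512 4096 4194304 163836 77428 0 77428 0 N ST_RAC_CON/

<<<oracle_asm_diskgroup>>>
MOUNTED EXTERN N 512 512 4096 4194304 51196 7668 0 7668 0 N ST_RAC_CIMB/
MOUNTED EXTERN N 512 512 4096 4194304 163836 77428 0 77428 0 N ST_RAC_CON/
"""

if __name__ == "__main__":
    for lines in section_lines(SAMPLE, "oracle_asm_diskgroup"):
        print(field_counts(lines))
```

On the posted data the first block has 13 fields per line and the second has 14. One possibility, which the thread does not confirm, is that the two blocks are produced by different plugins or plugin versions, and that the server-side check expects neither layout.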
