Local checks in cluster

Hi,

We have a 5-node cluster running MapR. Node1 and Node2 both run the REST API. The local check on node1 queries the REST API on node1, and the local check on node2 queries the REST API on node2.
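For reference, a local check prints one line per service in the form `<state> <service name> <perfdata> <text>`, and the perfdata field is where performance data goes. Below is a minimal sketch assuming Python 3 on the nodes. The service name `mapr_fs_space`, the metric name `fs_used_percent`, the 80/90 % thresholds and the hard-coded usage value are hypothetical placeholders for whatever your REST query returns.

```python
#!/usr/bin/env python3
"""Sketch of a Checkmk local check for MapR filesystem usage.

Hypothetical: service name, metric name, thresholds and the usage value.
"""


def fs_state(used_pct: float, warn: float = 80.0, crit: float = 90.0) -> int:
    """Map a usage percentage to 0 (OK), 1 (WARN) or 2 (CRIT)."""
    if used_pct >= crit:
        return 2
    if used_pct >= warn:
        return 1
    return 0


def local_check_line(state: int, service: str, perfdata: dict, text: str) -> str:
    """Build one local check line: <state> <service> <perfdata> <text>."""
    if state not in (0, 1, 2, 3):
        raise ValueError("state must be 0, 1, 2 or 3")
    if " " in service:
        # The line is space-separated, so keep the service name free of spaces.
        raise ValueError("service name must not contain spaces")
    # Several metrics are joined with "|"; "-" means "no performance data".
    metrics = "|".join(f"{name}={value}" for name, value in perfdata.items()) or "-"
    return f"{state} {service} {metrics} {text}"


if __name__ == "__main__":
    used = 62.8  # placeholder; in practice read from the MapR REST API on this node
    print(local_check_line(
        fs_state(used),
        "mapr_fs_space",
        {"fs_used_percent": f"{used};80;90;0;100"},
        f"{used:.2f}% space used",
    ))
```

With the placeholder value, this prints `0 mapr_fs_space fs_used_percent=62.8;80;90;0;100 62.80% space used`. The perfdata value follows the usual `value;warn;crit;min;max` layout.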

In Check_MK I then have a cluster host consisting of the real hosts Node1 and Node2.

The problem is that the output is now doubled, with an automatically added prefix naming the node the check ran on. Is it possible to hide that prefix and output only one text? Or is that not possible with local checks, so that I need to turn them into MRPE checks? Or do my scripts need to be adapted somehow?

I would like to avoid rewriting the scripts for MRPE, as I don't know how to pass performance data from an MRPE-style script.
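For MRPE, the script follows the Nagios plugin conventions: the exit code (0 OK, 1 WARN, 2 CRIT, 3 UNKNOWN) sets the service state, and performance data is appended to the output line after a `|` as `label=value[unit];warn;crit;min;max`. A minimal sketch, assuming Python 3; the metric name `fs_used_percent`, the thresholds and the usage value are hypothetical placeholders.

```python
#!/usr/bin/env python3
"""Sketch of an MRPE (Nagios-style) plugin for MapR filesystem usage.

Hypothetical: metric name, thresholds and the usage value.
"""
import sys

LABELS = ("OK", "WARNING", "CRITICAL", "UNKNOWN")


def nagios_output(state: int, text: str, perfdata: dict) -> str:
    """Return '<LABEL> - <text> | <perfdata>' the way Nagios plugins print it."""
    line = f"{LABELS[state]} - {text}"
    # Nagios perfdata: space-separated label=value[unit];warn;crit;min;max
    perf = " ".join(f"{label}={value}" for label, value in perfdata.items())
    if perf:
        line += f" | {perf}"
    return line


def main() -> int:
    used = 62.8  # placeholder; in practice read from the MapR REST API
    state = 2 if used >= 90 else 1 if used >= 80 else 0
    print(nagios_output(state, f"{used:.2f}% space used",
                        {"fs_used_percent": f"{used}%;80;90;0;100"}))
    return state  # MRPE takes the service state from the exit code


if __name__ == "__main__":
    sys.exit(main())
```

The script would then be registered in the agent's `mrpe.cfg` with a line such as `MapR_FS_space /usr/local/bin/mapr_fs_mrpe.py` (service name first, then the command; the path here is hypothetical).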

Has nobody tried local checks in a cluster?
Is it not possible to have the output printed only once, like with MRPE?
In the picture, the "test_plugin" check is an MRPE script and the "testing" check is a local script.
[Screenshot: check_mk_local_cluster]

Your example shows that MRPE is not cluster-aware and just prints the first result.
The local check prints all results. That is the correct behavior.

What you need is a real check that can handle conditions such as how many of the cluster's nodes are running, and so on.

I thought that if I put the MRPE script on two servers, created a virtual cluster in Check_MK and assigned the service under "Clustered services for overlapping clusters", that would be it. So I was wrong, and some of our checks have been running with misleading cluster semantics for quite a long time. That's not good.

No idea how to do that. I can't find any documentation or examples for it.

Hi @marbaa

As I understand the section "overlapping clusters" in the official documentation:

You would only need this if you are, e.g., running a service on "both nodes" that is shared by, let's say, two VIPs. If this is your scenario, then you haven't done anything wrong
(again: if I understand the aforementioned section correctly).

I realized that for many (if not most) of my clusters, I don't need "overlapping" ones after all, but that may differ on your end. I believe what Andreas meant, though, isn't something that you will find in the documentation.

I think he meant that your script will need some built-in logic that "understands", e.g., on which node a particular service is "active", so that your check output reflects the correct state of your cluster.

Thomas

Thanks @openmindz. I have read that section now, and I'm even more confused than before :smiley: Probably I don't understand the docs correctly and don't understand what to use for my need.

MapR is basically its own NFS filesystem implementation spread over a few identical servers. The REST API service runs on two of the servers. Each server has its own IP, so each instance of the service also has its own IP. Information about the filesystem is available from either the first or the second instance. The output from both is the same; they run on two hosts just for redundancy.

Can you recommend what type of check I should use for that, to avoid duplicate output like in my first post?

In this case it would be best to use the BI function to create virtual objects and then use these objects for notifications as the cluster status.

I'll try to describe it.

Create a BI rule to aggregate the REST services. The aggregation function should be "best" in this case since, as you said, the service must run once but can run twice.
Create a BI rule for the other services where the aggregation function is something like "over 60% is OK and over 40% is WARN".
Now you have two aggregation rules. Put these two rules inside an aggregation named "MapR" and you have an overall status. You can test whether the overall status is what you want by manually modifying the node states inside the aggregation.
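The logic of these two rules can be illustrated in a few lines of Python. This is only a sketch of the idea, not Checkmk's BI implementation or configuration: the function names are made up, UNKNOWN is left out to keep the state ordering simple, and combining the two rules with "worst" at the top level of the "MapR" aggregation is an assumption.

```python
"""Illustration of the two BI aggregation ideas (not Checkmk's BI code)."""

OK, WARN, CRIT = 0, 1, 2  # UNKNOWN (3) left out to keep the ordering simple


def best(states):
    """'Best state': the aggregate is as good as its best child."""
    return min(states)


def worst(states):
    """'Worst state': the aggregate is as bad as its worst child."""
    return max(states)


def ok_share(states, ok_above=60.0, warn_above=40.0):
    """OK if more than ok_above % of children are OK, WARN if more than warn_above %, else CRIT."""
    share = 100.0 * sum(1 for s in states if s == OK) / len(states)
    if share > ok_above:
        return OK
    if share > warn_above:
        return WARN
    return CRIT


if __name__ == "__main__":
    rest = best([CRIT, OK])                    # one of the two REST API instances is down
    others = ok_share([OK, OK, OK, OK, CRIT])  # 4 of 5 node services OK = 80 %
    print("MapR overall state:", worst([rest, others]))
```

Here the outage of one REST instance does not degrade the overall state, because the "best" rule only needs one healthy instance; this prints `MapR overall state: 0`.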

The complete BI documentation is in the article "Reduce complexity with Business Intelligence".
Once your BI aggregation works the way you want, it can be brought back into the monitoring as a service with the BI special agent.

Thanks Andreas. I played a little with BI, but to check its state you have to navigate away from the host. Second, for historical reasons our company users use Thruk to fetch data from Check_MK.

I think I will stay with local checks and the doubled output text, as there is no easy solution (if any at all) for what I want.

Hi @marbaa

While BI is extremely powerful, and I believe Andreas has given you a good example to use as a starting point, it is a "science unto itself": it has some complexity of its own to set up in a useful way. Not impossible, but as far as I understand you, you currently need something simpler.

I have never used MapR, so I don't have any experience I could share, but maybe the following approach works for you:

  • Create a cluster object and assign a cluster IP. If a cluster IP isn't applicable for your cluster, re-read the note under the example in Section 2.1 of the previously mentioned article in the official documentation and adjust this with a different "Host check command".

  • Assign the nodes and the clustered services (e.g. the file system you share via NFS, or whichever other services you deem critical), as described in Section 2.2.

  • Run a service discovery on the cluster and the nodes so that the "clustered" services are "moved" to your cluster object, as mentioned in Section 2.3, to avoid "duplicates", as you put it.

And now to your local check:

Directly at the end of Section 2.3 there is a "Tip" that may be of interest to you, which I'll partially quote: "Use the ruleset Settings for local checks to influence the result by choosing between Worst state and Best state."

I've never used this ruleset, but it does sound like it could be applicable for you.

HTH,
Thomas

Hi @openmindz,

thanks for the steps. I had already followed them, but I always see the output from both checks with the prefix "On node".

I think it is not possible to achieve what I want with plain local checks. A local check will always add "On node" to the output.

Probably, to get a single output I will need to write a check plugin with the parameter node_info: True, and then an agent plugin that sends the output text, in my case the filesystem usage.

That's what I'm going to try next; I will post the results afterwards.
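The agent-plugin half of that plan is small: an executable in the agent's plugins directory (on Linux typically /usr/lib/check_mk_agent/plugins) prints a section header `<<<name>>>` followed by its data lines. A minimal sketch, assuming Python 3; the section name `mapr_fs` and its columns are hypothetical, and the matching check plugin, which would have to merge the rows arriving from both nodes, is not shown.

```python
#!/usr/bin/env python3
"""Sketch of an agent plugin that emits a custom section.

Hypothetical: the section name 'mapr_fs' and its column layout.
"""


def render_section(name: str, rows) -> str:
    """Return an agent section: a <<<name>>> header followed by one line per row."""
    lines = [f"<<<{name}>>>"]
    lines += [" ".join(str(field) for field in row) for row in rows]
    return "\n".join(lines)


if __name__ == "__main__":
    # Placeholder values: volume, size in GB, used in GB (from the REST API in practice)
    print(render_section("mapr_fs", [("mapr_fs", 1000, 628)]))
```

Keeping the agent side this dumb leaves all the deduplication and state logic in the check plugin on the Checkmk server.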

That tip doesn't do anything useful :worried:

Hi @marbaa

OK, now I get what you mean and what your initially posted problem is. As far as I understand, I have to say that your "issue" is more of a "cosmetic" nature. I believe that this can easily be overcome by good internal documentation and communication regarding this particular check, but... I may be wrong. I know: people can be weird, and not be able to "tolerate" this "duplicated output". Seriously though: is that such a big deal? :man_facepalming:

In any case, I did some very brief research about MapR, and what I found is perhaps not news to you: that thing apparently has its own CLI (maprcli), which one could probably (ab)use to write a check that does what you want:

https://docs.datafabric.hpe.com/62/ReferenceGuide/maprcli-REST-API-Syntax.html

There is a myriad of commands, such as node list or dashboard info, which do look useful... maybe others, too. I'm sure that someone who works with it, such as yourself, will have some experience with it and will be able to find the right combination of commands to achieve what you want: it shouldn't be too hard to do.

In summary

What you need to code into your script, whether you use it as a local check or via MRPE, is some sort of condition that, in case both your nodes are active, generates the output you posted above on only one of them... which is what Andreas already said above. This shouldn't lead to those "pesky" duplicates anymore.

I wanted to add that you will not find any documentation on how to do that, because it strictly depends on the application you want to write it for (in your case: that MapR thingie...). Without wanting to sound rude, you will need to come up with the logic yourself. Again, the existing CLI toolbox should provide plenty of stuff to play around with.

Oh, and by the way, regarding the tip I quoted earlier (the Settings for local checks ruleset): it does do something useful, but... perhaps not for your case, so I'm sorry if I misled you with that advice. With all that said, I sincerely hope that you will come up with a satisfying solution for your issue, and please do post your results: I'm sure they will provide helpful advice for similar issues others have... :slight_smile:

Thomas

Sorry, it took me a while to look at this issue again; I've been super busy with work.

@openmindz Yup, I know about maprcli; actually, getting the info is faster and easier with Python and REST than with maprcli, whether from bash or from Python.
But that is not the point; it doesn't matter how I get the data. What matters is how I output it to the Check_MK portal.

I have been thinking about the logic in my plugins that you and Andreas mentioned. I'm trying, but I can't understand how that logic would help.

A local check (or any check) must produce output; if it doesn't, Check_MK will either drop the service (e.g. if the script has no execute rights) or make it UNKNOWN. Do you agree?

So if I compare the output from both scripts and produce a final echo, that will not help, because the local check must still produce output on both nodes.

I don't know. I think I will give up on this so as not to waste any more of your time.

Hi @marbaa

Same here, my friend: so much to do, so little time. Although I'm at home more than ever, the time I really have for myself, or for stuff I like (e.g. CMK), is less than ever before... I imagine this isn't news to you... and I'm also sure that this is true for many people all over the world, especially us "IT monkeys"... Anyway, I digress...

Yes, that's correct. Here's an idea for a different approach:

  • maprcli (but I'm sure whichever method you use would work just as well) has the subcommand node list. As far as I can see in the documentation, it returns the total number of nodes, an "id" and also the "health" of each node. Valuable information to build upon...

  • You could run node list (or similar) on every node and then, e.g., determine whether all of them are healthy. After verifying that, you could write a condition that executes your check only on the first healthy node. On the rest of the nodes, you would then only output something like: "Node healthy, actual check result gathered on node <NODENAME>".
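The "first healthy node" condition described above could look roughly like this as a local check. A sketch only, assuming Python 3: the node names, how the health list is obtained (node list or the REST API), the service name and the placeholder measurement are all hypothetical. Sorting the names makes every node pick the same acting node, provided all nodes see the same health data at the time they run.

```python
#!/usr/bin/env python3
"""Sketch: run the real check only on the first healthy node.

Hypothetical: node names, how health is obtained, and the service name.
"""
import socket

SERVICE = "mapr_fs_space"


def acting_node(nodes):
    """Return the first healthy node name in sorted order, or None."""
    for name, healthy in sorted(nodes):
        if healthy:
            return name
    return None


def local_check_for(this_node, nodes, run_check):
    """Return the local check line this node should print."""
    primary = acting_node(nodes)
    if primary is None:
        return f"2 {SERVICE} - No healthy MapR node found"
    if this_node == primary:
        return run_check()  # the real measurement, e.g. the REST query
    return f"0 {SERVICE} - Node healthy, actual check result gathered on node {primary}"


if __name__ == "__main__":
    # Placeholder health data; in practice from node list or the REST API
    nodes = [("node1", True), ("node2", True)]
    print(local_check_for(socket.gethostname(), nodes,
                          lambda: f"0 {SERVICE} used_percent=62.8 62.80% space used"))
```

Note that socket.gethostname() may return a name that differs from what MapR reports (e.g. an FQDN), so the comparison would need to use whatever naming MapR actually uses. Both nodes still print a line, so Check_MK keeps seeing the service on both, but only one of them carries the real result.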

So an "ASCII mock-up" of the screenshot in your initial post would then look something like this:

mapr_fs_space - On node NODE1: 62,80% space used etc. On node NODE2: Node healthy, please check output of NODE1.

How does that sound to you?

Take care,
Thomas

Hi @openmindz,

thanks for another possible solution. But you don't have to go through the MapR documentation on how to collect data from MapR :slight_smile: the way the data is collected is not relevant.

What matters is the behavior of Check_MK itself and how it displays output when a service is clustered. I can't say for sure, but I think that when we were using Check_MK 1.4 at our company, the text "On node" was not shown for local checks. Maybe it was introduced in version 1.6.

Yesterday I spun up my virtual test environment, installed Check_MK 2.0 and ran the test script there, just out of curiosity, and you know what? :slight_smile:

It shows the output the way I want; no "On node" text is automatically added to the output. Both tests used the check_mk_agent from 1.6.

This is 1.6

And this is 2.0

In 2.0, the output is taken from the host listed first in the cluster settings. When I put 'client2' first, the output is shown from 'client2'.

After removing the execute permission from the test script on 'client1':
1.6

2.0

What is new in 2.0 is that the Check_MK service shows the output from both nodes. It shows both outputs even if I specify the Check_MK service as a clustered service.

When I find time, maybe I will try to test it on Check_MK 1.4.


Hi @marbaa

Cool, thanks for sharing, that is indeed very interesting.
