Best practices for monitoring external devices

Elaak · May 17, 2022, 10:11am

Hello,
I have multiple Checkmk systems running, all controlled from one of those instances. Attached to each of these Checkmk systems are a number of external devices, which can only be reached on a specific port. The response from each device has to be cleaned and parsed to get the desired metrics.

Because these devices are external, my first attempt was to create active checks: a Bash script that calls a Python script, which handles querying the device and parsing the data.

Here is a simple version of the bash script:

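#!/bin/bash
# Arguments: $1 = device IP, $2 = device port, $3 = threshold
# Capture stdout and stderr of the Python script; it also requires --limit.
# If the query itself fails, report UNKNOWN (exit code 3).
var=$(python3 /tmp/my_test_script.py -i "$1" -p "$2" -l "$3" 2>&1) || {
  echo "Query failed: $var"
  exit 3
}
if [[ $((var+0)) -lt $3 ]]; then
  echo "All good"
  exit 0
else
  echo "It is bad"
  exit 1
fi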

and here the python script:


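import argparse
import socket
import sys

def main(ip, port, limit):
  # limit is accepted but unused here; the bash wrapper does the comparison
  server_address = (ip, port)
  message = b"secret"  # sendall() needs bytes, not str
  try:
    # the with block closes the socket even if an error occurs
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
      sock.settimeout(1)
      sock.connect(server_address)
      sock.sendall(message)
      data = sock.recv(30)
  except OSError as exc:
    # covers refused connections, unreachable hosts and timeouts
    print(exc, file=sys.stderr)
    sys.exit(1)
  # print the reading only after a successful exchange
  print(data.decode(errors="replace").strip())
  sys.exit(0)

def parse_arguments():
  parser = argparse.ArgumentParser()
  req_arg_group = parser.add_argument_group("Required arguments")
  req_arg_group.add_argument("--ip", "-i", required=True, help="Specify host IP")
  # connect() needs the port as an integer
  req_arg_group.add_argument("--port", "-p", required=True, type=int, help="Specify host port")
  req_arg_group.add_argument("--limit", "-l", required=True, help="Specify threshold")
  return parser.parse_args()

if __name__=="__main__":
  args = parse_arguments()
  main(args.ip, args.port, args.limit)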

I then add an active check in Checkmk with the parameters for the bash script (IP and port of the device, and the threshold). As a first attempt it works well, although the messages and exit codes still have to be improved.
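
One possible way to tighten up the exit codes is to do the whole check in a single Python script. The following is only a sketch under assumptions: the device is assumed to reply with a plain number, higher readings are assumed to be worse, and the names `evaluate`, `query_device` and `run_check` as well as the separate warning and critical levels are made up for illustration. The exit codes 0 to 3 are the usual Nagios plugin codes.

```python
import socket

# Nagios-compatible plugin exit codes
OK, WARN, CRIT, UNKNOWN = 0, 1, 2, 3


def evaluate(value, warn, crit):
    """Map a numeric reading to an exit code; higher readings are worse."""
    if value >= crit:
        return CRIT
    if value >= warn:
        return WARN
    return OK


def query_device(ip, port, message=b"secret", timeout=1.0):
    """Send the request and return the raw reply as stripped text."""
    with socket.create_connection((ip, port), timeout=timeout) as sock:
        sock.sendall(message)
        return sock.recv(30).decode(errors="replace").strip()


def run_check(ip, port, warn, crit, query=query_device):
    """Return (exit_code, summary); connection and parse errors become UNKNOWN."""
    try:
        # Assumption: the reply is a bare number such as "12"
        value = float(query(ip, port))
    except (OSError, ValueError) as exc:
        return UNKNOWN, f"UNKNOWN - no usable reading: {exc}"
    code = evaluate(value, warn, crit)
    label = ("OK", "WARN", "CRIT")[code]
    return code, f"{label} - reading is {value:g}"
```

A command-line wrapper would then print the summary and pass the code to `sys.exit()`. The `query` parameter exists only so the logic can be tried out without a real device.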

This is a very cumbersome method, so I want to ask if I am missing an obvious way to monitor these systems. No answer or suggestion is too obvious, as I am still very new to this.

Many thanks!

mschlenker · May 19, 2022, 11:52am

This is exactly how active checks are meant to be written. These active checks can also print out additional info using “classical” Nagios syntax (lots of semicolons).
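
As a rough illustration of that syntax: a plugin prints a summary, then a pipe character, then performance data items of the form `label=value;warn;crit;min;max`. The helper names `perfdata` and `plugin_output` below are hypothetical, and this is just a sketch of the formatting, not something shipped with Checkmk.

```python
def perfdata(label, value, warn=None, crit=None, minimum=None, maximum=None, unit=""):
    """Build one performance data item: label=value[unit];warn;crit;min;max."""
    fields = [f"{value}{unit}"]
    fields += ["" if x is None else str(x) for x in (warn, crit, minimum, maximum)]
    # Empty fields at the end can be left out
    while fields[-1] == "":
        fields.pop()
    # Labels containing spaces have to be wrapped in single quotes
    if " " in label:
        label = f"'{label}'"
    return f"{label}={';'.join(fields)}"


def plugin_output(summary, items):
    """Join the summary and the performance data with a pipe."""
    return f"{summary} | {' '.join(items)}" if items else summary
```

For example, `plugin_output("WARN - CO2 is 895 ppm", [perfdata("co2ppm", 895, 800, 1000)])` yields `WARN - CO2 is 895 ppm | co2ppm=895;800;1000`.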

However, there are other ways to implement this kind of check. One is a data source program, which I like for its extensibility and easy debugging. It lets you generate a full agent output:

This is a minimal useful output of a data source script, taken from an IoT example. It implements an agent that just announces its operating system, a check section co2ampel_plugin that Checkmk ignores as long as no plugin is present on the server side, and a local section (P means that Checkmk computes the state from the thresholds, which are 800 and 1000 here; see local checks).

<<<check_mk>>>
AgentOS: arduino
<<<co2ampel_plugin>>>
co2 895
temp 20.70
<<<local:sep(0)>>>
P "CO2 level (ppm)" co2ppm=895;800;1000 CO2/ventilation control.
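
A data source program that produces this kind of output could look roughly like the sketch below. It only reproduces the `check_mk` and `local` sections from the example above; the function names `local_check_line` and `agent_output` are made up, and the readings would in practice come from querying the device.

```python
def local_check_line(service, metric, value, warn, crit, text):
    """Build one local check line; "P" lets Checkmk compute the state from warn/crit."""
    return f'P "{service}" {metric}={value};{warn};{crit} {text}'


def agent_output(agent_os, local_lines):
    """Assemble a minimal agent output with a check_mk and a local section."""
    lines = ["<<<check_mk>>>", f"AgentOS: {agent_os}", "<<<local:sep(0)>>>"]
    lines.extend(local_lines)
    return "\n".join(lines) + "\n"


# Example values taken from the output shown above
output = agent_output(
    "arduino",
    [local_check_line("CO2 level (ppm)", "co2ppm", 895, 800, 1000, "CO2/ventilation control.")],
)
```

The program would write `output` to stdout, for example with `sys.stdout.write(output)`, and Checkmk would read it like the output of a regular agent.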

(We are currently working on improving the article on data source programs with better examples, so do not hesitate to ask again if something isn’t clear.)

Elaak · May 31, 2022, 3:44pm

Thank you @mschlenker,

I will definitely have a look at data source programs. If I understand correctly, I can define a special agent similar to the file ~/share/check_mk/agents/special/agent_netapp and the script it imports, located at ~/lib/check_mk/special_agents/agent_netapp.py?

Say I am creating an agent called agent_abc123. I would then create similar scripts for my case and place them in ~/local/share/check_mk/agents/special and ~/local/lib/check_mk/special_agents/?

Would the import of the Python script be the same as for agent_netapp? That is, in the file ~/local/share/check_mk/agents/special/agent_abc123 I would write from cmk.special_agents.agent_abc123 import main, which should import and run ~/local/lib/check_mk/special_agents/agent_abc123.py?