Custom Plugin Check Speed

TL;DR: How would you handle 64 dynamically discovered websites from a list, plus subsequent login/content checks with response times, all in one plugin?

Looking for guidance on optimizing the performance of a special agent plugin, or maybe a “You’re thinking about how it works wrong, try this instead.”

The goal is to pull a dynamic list of customer portals to discover new customers and decommission old ones. The check is a simulation of the login experience from remote locations. In this case I am using distributed nodes in several different AWS or Azure locations and a ‘fake host’ on which the discovered services are shown.

The special agent plugin I created fetches a dynamic list of customer names, websites and tokens from a secured URL. I currently have the script running through the URL checks as part of the initial information gathering and returning check values, data and response time in the section output, such as this:

~/local/share/check_mk/agents/special/agent_customer_portals output

<<<customer_portals>>>
0 Customer_1 response_time=0.684 URL:https://portal.customer1.net/ Request succeeded. Logoff found in site response.
2 Customer_2 response_time=0.078 URL:https://myportal.customer2.com/ HTTP Error: 403
<<<>>>

This was very slow to populate initially until I got a little smarter and included multi-core processing. This improved performance but wasn’t helpful once I placed it onto the distributed nodes (AWS X-large). Switching to multi-threading instead yielded better performance, but still frequent timeouts.
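
For reference, this is roughly the shape of the threaded approach described above (a simplified, untested sketch: the customer list, token handling and output format are illustrative stand-ins, not the actual plugin code):

#!/usr/bin/env python3
# Sketch: check each portal concurrently with a thread pool and print one
# status line per customer in Checkmk agent section format.
# fetch_customers() and the field layout are illustrative placeholders.
import concurrent.futures
import time

import requests


def fetch_customers():
    # Placeholder for the real call that pulls customer names, URLs and
    # tokens from the secured URL.
    return [
        {"name": "Customer_1", "url": "https://portal.customer1.net/"},
        {"name": "Customer_2", "url": "https://myportal.customer2.com/"},
    ]


def check_portal(customer):
    start = time.monotonic()
    try:
        resp = requests.get(customer["url"], timeout=10)
        elapsed = time.monotonic() - start
        if not resp.ok:
            return (f'2 {customer["name"]} response_time={elapsed:.3f} '
                    f'URL:{customer["url"]} HTTP Error: {resp.status_code}')
        if "Logoff" in resp.text:
            return (f'0 {customer["name"]} response_time={elapsed:.3f} '
                    f'URL:{customer["url"]} Request succeeded. Logoff found in site response.')
        return (f'1 {customer["name"]} response_time={elapsed:.3f} '
                f'URL:{customer["url"]} Request succeeded but Logoff not found.')
    except requests.RequestException as exc:
        elapsed = time.monotonic() - start
        return (f'2 {customer["name"]} response_time={elapsed:.3f} '
                f'URL:{customer["url"]} Request failed: {exc}')


def main():
    customers = fetch_customers()
    print("<<<customer_portals>>>")
    # Threads fit well here because the work is I/O-bound (waiting on HTTP),
    # so the GIL is not the bottleneck.
    with concurrent.futures.ThreadPoolExecutor(max_workers=16) as pool:
        for line in pool.map(check_portal, customers):
            print(line)
    print("<<<>>>")


if __name__ == "__main__":
    main()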

Are there better ways to accomplish this task?
I am currently considering moving the checks out of the special agent and instead returning just the customer, portal URL and token. The check would then take place in the check plugin file ‘check_customer_portal’. I don’t know if this will be more efficient, though, so I thought I’d ask for opinions.
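
For comparison, with the current section format the consuming check plugin could look roughly like the sketch below (untested; written against the Checkmk 2.x agent_based v1 API, with the field layout assumed from the example output above; such a file would typically sit under the site’s local/lib/check_mk/base/plugins/agent_based/ directory). Moving the HTTP request itself into the check side would change the parse and check functions accordingly:

# Rough sketch of a check plugin for the <<<customer_portals>>> section.
# Untested; field parsing assumes the example output shown above.
from .agent_based_api.v1 import Metric, Result, Service, State, register


def parse_customer_portals(string_table):
    # Each line: <state> <customer> response_time=<secs> URL:<url> <message...>
    parsed = {}
    for line in string_table:
        state, customer, rt, url = line[0], line[1], line[2], line[3]
        parsed[customer] = {
            "state": int(state),
            "response_time": float(rt.split("=", 1)[1]),
            "url": url.split(":", 1)[1],
            "message": " ".join(line[4:]),
        }
    return parsed


register.agent_section(
    name="customer_portals",
    parse_function=parse_customer_portals,
)


def discover_customer_portals(section):
    # One service per discovered customer.
    for customer in section:
        yield Service(item=customer)


def check_customer_portals(item, section):
    data = section.get(item)
    if data is None:
        return
    yield Metric("response_time", data["response_time"])
    yield Result(state=State(data["state"]), summary=f'{data["url"]} {data["message"]}')


register.check_plugin(
    name="customer_portals",
    service_name="Portal %s",
    discovery_function=discover_customer_portals,
    check_function=check_customer_portals,
)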

Sincerely,
Scotsie

Hi @scotsie

This sounds like a use case for Robotmk / the Robot Framework.

Take a look here: checkmk conference #7: Tech Session - E2E Monitoring - YouTube
End-to-End Monitoring with Checkmk: Configuring Robotmk | Checkmk

Cheers
Elias

Hi @scotsie,

Disclaimer: I am the Robotmk developer and a little bit biased. :slight_smile:

There’s nothing to be said against doing it with a special agent. It just means a lot of the implementation lives under the “Checkmk” hood.

As Elias writes, another approach would be to use Robot Framework and Robotmk for this.
I could imagine an approach where Robot Framework dynamically creates a test case for each customer using the DataDriver library.
But first things first:

  • The DataDriver library by default reads the test data from XLSX or CSV, but it also supports writing custom DataReader classes. This is where you could put the code that fetches the customers from the URL (see the sketch after this list). Should be straightforward.
  • At the very beginning of each test suite execution (done by Robotmk), the DataDriver library creates the test cases from the data using a test case template (it’s completely up to you what you do inside the test). Depending on how/whether the customer data changes, the test cases change as well.
  • In Checkmk you set the discovery level in the Robotmk rule to level 2, i.e. you don’t get one service for the whole test suite (which is the default), but one per (automatically generated) customer portal test.
  • With the automatic discovery rule in Checkmk you can ensure that new customer portal services are automatically added to the monitoring, and vanished ones disappear.
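
To sketch the first point (untested and purely illustrative: the URL, auth header and JSON field names are placeholders, and the DataDriver class names follow its documentation, so please verify them against your installed version):

# custom_reader.py - sketch of a DataDriver custom reader that fetches the
# customer list from a secured URL instead of reading a CSV file.
# The URL, auth header and JSON field names are placeholders.
import requests

from DataDriver.AbstractReaderClass import AbstractReaderClass
from DataDriver.ReaderConfig import TestCaseData


class customer_portal_reader(AbstractReaderClass):

    def get_data_from_source(self):
        response = requests.get(
            "https://example.internal/customer-list",         # placeholder URL
            headers={"Authorization": "Bearer <api-token>"},  # placeholder auth
            timeout=30,
        )
        response.raise_for_status()
        test_data = []
        for customer in response.json():
            arguments = {
                "${custname}": customer["name"],
                "${custurl}": customer["url"],
                "${static}": customer["static_token"],
            }
            # One generated test case per customer: name, arguments, tags.
            test_data.append(
                TestCaseData(f'{customer["name"]} Portal - {customer["url"]}', arguments, ["portal"])
            )
        return test_data

In the suite you would then point the DataDriver library at this reader class instead of a CSV file (check the DataDriver documentation for the exact import syntax in your version).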

I want to emphasize again: this approach is not necessarily the “better” one.
But in the end it could be more maintenance-friendly than a special agent including its associated check.

I hope this helps you; drop me a PM if you need further assistance!

Regards,
Simon


@elias.voelker, thank you for the reminder and reference.
I looked at Robotmk early on when just getting started with Checkmk, but at the time it felt like overkill for what I thought was going to be a pretty straightforward approach. Not to mention I was trying to force myself to become proficient with both Python and Checkmk.

That being said, I will revisit it, especially with the advice @simonm posted shortly after yours.


@simonm, thank you for the disclosure and information. Now that you and @elias.voelker mention it, I recall looking at Robotmk when I first started working with Checkmk. At the time I didn’t have this particular requirement and filed it away as more of a development-level testing suite.

Now that I’ve spent a little more time with Checkmk and have written a few simple plugins, I will definitely revisit the setup and documentation, especially if it supports dynamic test case creation. These portals are all pretty much the same with the exception of branding and the unique token.

Let me take some time to review and test but I may take you up on the DM offer if I run into any hurdles.

Sincerely,
Scotsie


After quite a bit of reading, fighting with various version/library conflicts and testing, I got nodejs, playwright, robotframework, robotframework-browser, robotframework-datadriver and robotmk working.

Posting the .robot file for my very simple test case as a follow-up and for feedback. @simonm, if you have any thoughts or ‘best practice’ suggestions, I’d appreciate the additional feedback. Not having had exposure to automated testing before, and trying to stick exclusively to Playwright, made the research a bit more interesting.

*** Settings ***
Documentation   Dynamically generated list of customer data is used to
...             launch customer specific portal sites using a security
...             token and confirm visible 'Account Information:' header.
Library         RPA.Browser.Playwright
Library         DataDriver    ${CURDIR}/customer_list.csv    dialect=unix
Library         ${CURDIR}/${APPLICATION}

Suite Setup     Test Suite Setup
Suite Teardown  Close Browser
Test Template   Customer Portal Test

*** Variables ***
${APPLICATION}    customer-portals-list.py

*** Test Cases ***
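# DataDriver generates one test case per CSV row from this name pattern,
# passing the arguments to the Test Template keyword (Customer Portal Test).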
${custname} Portal - ${custurl}    ${custurl}    ${static}

*** Keywords ***
Test Suite Setup
    New Browser    browser=chromium    headless=True    args=["--ignore-certificate-errors"]
    Set Browser Timeout    30s
    customer_list_csv

Customer Portal Test
    [Documentation]    Template for Customer Portal testing
    [Arguments]        ${custurl}    ${static}
    Open Browser To Login Page    ${custurl}
    Submit Static Token    ${custurl}    ${static}
    Verify content loads

Open Browser To Login Page
    [Arguments]    ${custurl}
    New Page    ${custurl}
    Get Element States    xpath=//*[@id="btnSubmit"]    validate    visible

Submit Static Token
    [Arguments]    ${custurl}    ${token}
    Go To    ${custurl}/portal/default.aspx?tokentype=static&token=${token}

Verify content loads
    Get Element States    //*[contains(text(),"Account Information:")]    validate    visible

Incidentally, I suppress the SSL warnings after discovering some customer sites were not SSL secured and not finding a way to 1) accept and bypass this in the test case and 2) report on it (perhaps to warn in Checkmk). Reportedly our account managers are working to resolve this, and I will also be adding an automatic SSL check via Checkmk now.


Hi Scott,

congratulations, that’s great news!
And thanks for sharing your example with other Checkmk users! :ok_hand:

I will gladly go into a few points on it:

  • You use RPA.Browser.Playwright instead of robotframework-browser - if that was intentional: perfect. If not: stick with it :slight_smile:
    The package “rpaframework” is from Robocorp, who publish a lot of libraries under their name to give their customers a stable base of libraries for the Robocorp Cloud. I also tend more and more not to take the latest hot sh… from GitHub/PyPI, but to use these extensively tested libraries. Well chosen.
  • What does the CSV data structure behind it look like? (Just out of interest and to share it for the other users - if possible).
  • Is customer_list_csv a keyword from your own library?
  • Currently you set the flag to ignore certificate errors when creating the browser instance. This then applies to all tests equally. If you want to decide this per test, you can use “New Context” and set the flag “ignoreHTTPSErrors” there. Since contexts and pages have TEST scope, they will be closed again at the end of the test. (“New Context” is called behind the scenes when you don’t use it yourself but call “New Page” directly.)
  • These days I personally recommend CSS instead of XPath for selectors. (Example: #btnSubmit would do exactly the same.) XPath selectors also work without “xpath=”, by the way, by simply letting them start with //.

Feel free to ask :slight_smile:

Regards,
Simon


Absolutely, I chose it based on your remarks in the #9 conference I attended virtually, so you can give yourself credit for that :smiley:. Most of my test case research was spent sifting through examples that referred to Selenium or didn’t differentiate. As Playwright continues to mature I imagine more articles will follow.

Certainly.
Example customer_list.csv

${custname},${custurl},${dynamic},${static}
Customer Alpha,https://portal.alphasite.net/,dynamictokendataalpha,statictokendataalpha
Customer Beta,https://phone.betacorp.org/,dynamictokendatabeta,statictokendatabeta

Actually, it’s a method inside the customer-portals-list.py file. I tried to get DataDriver to work with a list-of-dictionaries response from the script and failed, so due to time limitations I stuck with the CSV approach and hope to circle back to it for refactoring.

Unfortunately, I didn’t know ahead of time that the sites were not using SSL. I will likely be using a similar process to create Checkmk SSL checks for each customer automatically as a byproduct.
If there’s a way to identify SSL failures through Robot Framework but continue testing, I’d love to see an example. My thought would be to WARN on SSL failure, but I don’t know if that’s currently possible.

I’ll check into that. I am not very familiar with websites in general and went with what I read from a couple of other tutorials and with the help of a Firefox browser extension, xPath 2 & Robot framework commands. I am, however, realistic enough to know I’ve got a lot to learn as I continue to make more test cases.
Now that I have it working, I’m sure the flood gates will open for all sorts of testing.

Sincerely,
Scotsie

@simonm, it appears that my test case approach is still suffering from the original issue of taking a very long time to run through the 64 test cases generated from my CSV with the DataDriver library.

Looking through the documentation, I see references to robotframework-pabot for parallel execution, but I don’t believe it’s a viable solution with Robotmk at this time, unless I’m missing an option or parameter.

I played with the browser timeout in the .robot file’s template test case but found this ran into issues with Robotmk overall timing out and tests reporting ‘stale’. The remote node is in AWS (X-large), which I believe should be enough resources. I’ve modified the Robotmk timeout and check interval to 15 and 10 minutes respectively and still see it timing out periodically. The timing of the individual sites seems to indicate no issues with them, but the test cases themselves have random ‘spikes’ in how long they take, as if launching the testing browser is problematic.

Am I trying to do too much, or is there another approach I should consider?
Maybe generating individual test case .robot files for each customer using a cron-driven Python script? Would this net any different result?

Appreciate your advice and feel free to point me at any guides or documentation as well.

Sincerely,
Scotsie

Hi Scotsie,

I’m not quite sure I’ve fully grasped the problem.

Is it that the tests just take a long time because they are all run serially (which is why you asked about pabot), or is the main problem that the browser times out? Both are related, but they are different problems.

The total execution time of the individual tests is probably higher than the execution interval of Robotmk, so there is an overlap - Robotmk can’t (yet) handle parallel runs and then doesn’t return any results.

In any case, you are right, pabot is not currently supported by Robotmk. Maybe we can take something like this into consideration later.

Maybe you can give me some more information on these questions; then I can help you.

Regards,
Simon

@simonm, I apologize for the delayed response. Another project and tasks pushed this project off a bit.

The problem seemed to be a mixture of both, but this is from observation and not hard data. If there’s a way to document or provide hard metrics, I’d be happy to gather that data.

Extending the “Result cache time” and “Runner execution interval” under the agent plugin settings allowed the tests to run successfully the majority of the time. However, browser timeouts would periodically occur with individual sites, usually several at a time, causing Robotmk to exceed the cache timeout and report problems with the browser timeouts and with any sites tested after those.

My problem is that I’m not confident the browser timeouts are reporting actual failures, since all of these sites are being served by the same IIS webserver. The site developer “doesn’t see any issues”, and manual tests at the exact time the Robot Framework test is reporting timeouts succeed. Again, not very scientific, but the burden of proof is on me.

All of these checks are being run on AWS or Azure VMs with 2 CPUs and 8 GB of RAM. That seems like it should be enough based on watching performance. The checks are fairly simple at this stage: the DataDriver works, and each test is a new browser, run test and tear down.

I guess I’m mentally stuck on which direction I should pursue, or what might point me in the right direction.

Thank you for your time and advice.
Sincerely,
Scotsie

Hi Scotsie,

Did you already try reducing the timeout of the Browser library? By default it’s 10s, and that is often too much.
Hm, 2 CPUs seems a rather low value. Just out of curiosity, could you try giving the machine 4 CPUs?
For flaky tests, I have already played the last joker and set up a second, identical machine to run the same tests.
Then I aggregated the results with the BI module. This has the advantage that you have (kind of) clustered the services and made them more available. And secondly, you would perhaps see whether errors occur on both hosts at the same time.
Let me know if any of these tips helped.
Regards,
Simon

@simonm, thanks for the follow up.

I have played around with the timeouts and found that 10s did help reduce the frequency of occurrences.

The CPU count was just an arbitrary choice. I bumped the core count from 2 to 4 on the instances. I didn’t notice any huge gains in performance.

The checks are being run from 4 remote hosts, each in a different ‘zone’ in AWS and Azure to simulate the customer experience. The struggle was with periodic timeouts that were not consistent across all hosts.
All that being said, the project took a turn (due to SSL vendor changes) and the customer sites are being migrated to new URLs and servers. I’ve been told to put this on hold for now until the migrations are complete. For now I’m going to consider this topic closed.

I appreciate your time and efforts on this. I will create a new topic if, when the time comes, I have further questions.

Sincerely,
Scotsie

Hi Scotsie,

I am sorry to hear that the project is on hold now (though for external reasons).
Feel free to ask here, I am happy to help!
Best regards,
Simon
