Provisioning tags via REST API / Ansible - error

CEE, 2.3.0p25: distributed setup
OS version: rocky 8.10

Error message:
Good day everyone,
I just switched from the raw to the CEE version in trial mode, but the error I experience happens on both versions.
I provision a lot of tags (and my entire config, to be honest) through Ansible with the versions below:

> collections:
>   # Install a collection from Ansible Galaxy.
>   - name: checkmk.general
>     version: ">=5.4.0,<6.0.0"
>   - name: community.mysql
>     version: ">=3.12.0,<4.0.0"
>   - name: community.general
>     version: ">=10.2.0,<11.0.0"

> requires-python = ">=3.12"
> dependencies = [
>     "ansible-core>=2.18.1",
>     "pymysql>=1.1.1",
> ]

All tag tasks in my Ansible code look exactly the same (I have about 15 tags).
The values come from an SQL query, and I insert them as below:

> - name: "Create function tag."
>   checkmk.general.tag_group:
>     server_url: "{{ server_url }}"
>     site: "{{ site }}"
>     automation_user: "{{ automation_user }}"
>     automation_secret: "{{ automation_secret }}"
>     name: "function"
>     title: "Function"
>     topic: "my_topic"
>     tags: |-
>       {% set result = [] %}
>       {% set _ = result.append({'id': 'empty' | lower, 'title': ""}) %}
>       {% for function in tags_function_result.query_result[0] %}
>       {%   set _ = result.append({'id': function.hostgroup | lower, 'title': function.hostgroup}) %}
>       {% endfor %}
>       {{ result }}
>     state: "present"
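
For readers less used to inline Jinja: the `tags:` template above builds a plain list of `id`/`title` dictionaries. Here is a minimal Python sketch of the same logic, assuming each row of `tags_function_result.query_result[0]` is a mapping with a `hostgroup` key. The function name `build_function_tags` and the sample rows are illustrative only and do not come from the playbook.

```python
def build_function_tags(rows):
    """Mirror the Jinja loop: one placeholder entry, then one entry per row."""
    # The template starts with a placeholder tag whose title is empty.
    result = [{"id": "empty", "title": ""}]
    for row in rows:
        # IDs are lower-cased; titles keep the original spelling.
        result.append({"id": row["hostgroup"].lower(), "title": row["hostgroup"]})
    return result


if __name__ == "__main__":
    sample_rows = [{"hostgroup": "Web"}, {"hostgroup": "DB"}]  # hypothetical data
    print(build_function_tags(sample_rows))
```

Note that lower-casing only the ID means two rows that differ just in case (for example `Web` and `web`) would end up with the same ID.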

Now, only for the function tag, I very often get this error:

...
> TASK [Create status tag.] ***************************************************************************************************************************************************************
> ok: [localhost]
> 
> TASK [Create function tag.] *************************************************************************************************************************************************************
> fatal: [localhost]: FAILED! => {"changed": false, "msg": "401 - The user is not authorized to do this request Details: b'{\"title\": \"Updating this host tag group \\\\\"function\\\\\" requires additional authorization\", \"status\": 401, \"detail\": \"The host tag group you intend to edit is used by other instances. You must authorize Checkmk to update the relevant instances using the repair parameter\"}'"}

How I solve this at the moment:

OMD[monitoring]:~/etc/check_mk$ truncate -s 0 conf.d/wato/tags.mk
OMD[monitoring]:~/etc/check_mk$ truncate -s 0 multisite.d/wato/tags.mk

I then re-run my ansible-playbook to provision the tags (always with one and the same user):

> TASK [Create function tag.] *************************************************************************************************************************************************************
> changed: [localhost]
> 
> TASK [Create hostgroup tag.] ************************************************************************************************************************************************************
> fatal: [localhost]: FAILED! => {"changed": false, "msg": "400 - Bad request: Parameter or validation failure Details: b'{\"title\": \"Bad Request\", \"status\": 400, \"detail\": \"These fields have problems: id\", \"fields\": {\"id\": [\"The specified tag group id is already in use: \\'hostgroup\\'\"]}}'"}

If I re-run it again, the latest error, which happened on the hostgroup tag, now occurs on another tag I want to create.
function and hostgroup show OK, so the sync is OK.

Can someone explain why:
1: Checkmk all of a sudden lacks sufficient rights to update the function tag, and always that same tag?
2: a tag group ID is suddenly already in use? And where? All the others, like function (after the truncate), are also in use.
3: I can insert an empty ID with an empty name via the REST API but not in the GUI.

This URL gave me some insights but not a solution:
https://forum.checkmk.com/t/how-to-repair-tag-groups-you-can-not-override-the-builtin-tag-group-agent/32030/9

And a big thanks up front to whoever has some good feedback or ideas to solve the issue.

@robin.gierse is the author of that ansible role, so he might be the right one to ask such things.

Truncating those files does not sound right to me; it even sounds very wrong. I hope you can restart the site and the files come back. Otherwise, ping me for the defaults, or create a new site and copy them yourself.

omd su sitename
echo -e "GET hosts\nColumns: name address groups tags" | lq

That should answer the "where" question.

To answer your initial question: I believe you have to activate changes. To be honest, I don’t know how to do that with Ansible.

Hi Rene,
I’ll await some feedback from Robin, thanks for tagging him.
Truncating solves the issue at the moment, but I do agree it is not the best solution.
So at this moment there is no need to recreate the site.
After the truncate, I reprovision all my tags again, so no data is lost.

On a side note…
What I also did in the past, before the truncate, was remove the tag from the UI and reprovision it. Unfortunately that broke my setup, given that the function tag is on every host.

The query you gave me does indeed list all hosts where this tag applies (which is every host). So the "already in use" error is logical and at the same time not logical at all (certainly given the fact that, if I reprovision the tags again, it works flawlessly).

I do have the feeling I’m running into a race condition (with the last "in use" error), but since we are in trial mode, I’m not going to open an official support ticket yet and will await Robin’s (or others’) feedback.

Activating changes via Ansible, the easiest way (since I’m part of the only team doing the changes):

  - name: "Start activation on all sites."
    checkmk.general.activation:
      server_url: "{{ server_url }}"
      site: "{{ site }}"
      automation_user: "{{ automation_user }}"
      automation_secret: "{{ automation_secret }}"
      redirect: true
      force_foreign_changes: true
    run_once: true
    tags: activate

thanks

If one thing is for sure, it is this: Do not do this. Modifying .mk files manually will almost certainly create issues down the road.

Again, not a solution. A bad idea at best.

We have Integration Tests, which go through the cycle you do: Create and update. And as you can see there, it is healthy across the board.

So my assumption is: your data is somehow broken. Whatever you query in the tags: section probably creates a broken data structure. Try to add a few tags manually and see what happens.
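
One way to test that assumption before the list ever reaches Checkmk is to check it for obvious problems. The sketch below is only an illustration: what counts as "broken" here (duplicate IDs, empty IDs, empty titles) is an assumption drawn from the errors and questions in this thread, not a list of rules taken from Checkmk itself, and `find_tag_problems` is a hypothetical helper name.

```python
from collections import Counter


def find_tag_problems(tags):
    """Return human-readable problems found in a list of {'id', 'title'} dicts."""
    problems = []

    # Treat a missing or None value the same as an empty string.
    def text(tag, key):
        return str(tag.get(key) or "").strip()

    # Duplicate IDs can appear when titles differ only in case and IDs are lower-cased.
    for tag_id, count in Counter(text(tag, "id") for tag in tags).items():
        if tag_id and count > 1:
            problems.append(f"duplicate id: {tag_id!r} appears {count} times")

    for index, tag in enumerate(tags):
        if not text(tag, "id"):
            problems.append(f"entry {index} has an empty id")
        if not text(tag, "title"):
            problems.append(f"entry {index} has an empty title")
    return problems


if __name__ == "__main__":
    candidate = [
        {"id": "empty", "title": ""},   # the placeholder entry from the playbook
        {"id": "web", "title": "Web"},
        {"id": "web", "title": "WEB"},  # hypothetical case-only duplicate
    ]
    for problem in find_tag_problems(candidate):
        print(problem)
```

Feeding the same list into a small manual test, as suggested above, would show whether any of these findings actually matter to the API.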

Is all your scripting done only against the master site in your distributed setup?
The error sounds like you try to update tags on a slave site.

That was my idea too. Some kind of lock? But my Ansible has been configured with localhost, so I doubt it.
Meanwhile, I added some tags via the UI yesterday, and this seems to work (if I update the empty one).

Now, there is no cron running to update stuff (hosts, tags, host groups, …), but after a while it just breaks. I do the faulty action, and I end up here clueless. :slight_smile:
Is there some logging I can enable on those files, maybe, to give me more info?
Thank you very much so far. We’ll get there. I just know it.

Grtz

Maybe a wild guess:
Is there a link between the host groups which I assign to hosts based on rules (based on these tags), and does that create a lock?

Meanwhile: it worked flawlessly until yesterday, but it broke again last night/this morning.
So I went to the REST API UI to take a dump of all tag groups… that looks normal to me.

Another approach: trying a new user in the Ansible config… unfortunately, no luck it seems.
So something is, or became, faulty purely on that function tag, in a config somewhere.
Let’s try Swagger to put some records…
Some proxy stuff blocks my way of working, so no need to go into detail here.

As a last resort, I tried the "repair: true" parameter on my 2 largest tags (function and hostgroup, which consists of env+function), and all of a sudden it works.

We have no clue what the repair parameter does in this case, but it does the trick.

Yes, this is the case. You do not want to delete something, that is used somewhere.

It removes the tag group from hosts and rules.

Ah, F me. That explains a lot.

I do agree that I do not want to delete something, but tags, and thus the function tag, are a very living thing in our DB. If hosts get trashed or rebranded, functions might change. That also leaves the existing/remaining hosts with faulty function values that don’t exist anymore.
Host group mappings use those tags too.

So I do think the host groups somehow cause a lock on those function tags I want to update.
I’ll need to rework it. It will be a challenge; no clue how yet.

If possible, let’s keep this topic open. I hope I can get some more insights and a more proper way to provision the tags/host groups together with linking them to the hosts.

A big thank you! And if more insights/info is available, I’d love to know.

BR!

Good morning all.
A lot of time has passed between my last post and this one, but I’ve found a solution to my problem. Not ideal, but I don’t end up in the unauthorized situation anymore.

What I did: I changed the way my list of tag items is created by using a set_fact.
Secondly, and I’m not very proud of it, I used the uri module to contact the REST API directly instead of using the checkmk.general.tag_group module.
The big difference I noticed is the "repair=true" parameter.
Contacting the API directly overwrites the existing values with the new ones (PUT), whereas the Ansible module indeed removes this tag from all hosts (meaning I need to update every host over and over again).

Long story short, from my experience, the repair=true parameter in combination with the uri module fixes the issue. It does not remove the tag from the hosts (like it did with the Checkmk module), and I don’t get into that locked state anymore.
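
The actual uri task was not posted, so here is only a rough Python sketch of what such a direct PUT with repair set to true could look like. Several things are assumptions to verify against your own site's REST API documentation: the URL layout (`/<site>/check_mk/api/1.0/objects/host_tag_group/<name>`), the `Bearer <user> <secret>` authorization header, the `If-Match` header, and the key names expected inside each tag entry, which may differ between Checkmk versions. The function name and all example values are hypothetical.

```python
import json
import urllib.parse
import urllib.request


def build_tag_group_update(server_url, site, user, secret, name, title, topic, tags, etag="*"):
    """Build (but do not send) a PUT request that updates one host tag group."""
    # Assumed URL layout; check the REST API documentation shipped with your site.
    url = (
        f"{server_url.rstrip('/')}/{site}/check_mk/api/1.0"
        f"/objects/host_tag_group/{urllib.parse.quote(name, safe='')}"
    )
    # 'repair' is the parameter named in the 401 error message above.
    # 'tags' is passed through unchanged; its per-entry key names are version dependent.
    payload = {"title": title, "topic": topic, "tags": tags, "repair": True}
    headers = {
        "Authorization": f"Bearer {user} {secret}",  # assumed automation-user format
        "Content-Type": "application/json",
        "Accept": "application/json",
        "If-Match": etag,  # assumption: "*" accepts whatever version is currently stored
    }
    body = json.dumps(payload).encode("utf-8")
    return urllib.request.Request(url, data=body, headers=headers, method="PUT")


if __name__ == "__main__":
    request = build_tag_group_update(
        "https://monitoring.example.com", "monitoring", "automation", "secret",
        "function", "Function", "my_topic", [],
    )
    print(request.get_method(), request.full_url)
    print(request.data.decode("utf-8"))
```

The request object is only built here; passing it to `urllib.request.urlopen` would send it. Since a reply earlier in this thread states that repair removes the tag group from hosts and rules, it is worth trying this on a test site first.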

many thanks for all hints and guidance…
