Linux inventory plugin shows only one card of a type, but the server has eight

Hello,
I am using the official mk_inventory.linux plugin from my local Checkmk server.
On one host the inventory shows only one GPU of type X, but I know that the host has 8 identical GPUs of that type.

This is the part of the plugin that handles GPUs:

section_lnx_video() {
    # Collect VGAs if they are present
    vgas="$(lspci | grep VGA | cut -d" " -f 1)"
    [ -n "$vgas" ] || return
    echo "<<<lnx_video:sep(58)>>>"
    printf '%s\n' "$vgas" | while IFS= read -r vga; do
        lspci -v -s "$vga"
    done
}

Run on the host itself, this code outputs all 8 cards without a problem. Since the cards are all of the same type, I believe the Checkmk view collapses them into one row.

Can I somehow add an ID for each card here so that Checkmk does not collapse them in the view?

I forgot to mention that I am running Checkmk Raw Edition 2.3.0p30.

There is already some sort of (PCI) ID. The problem is the way checkmk/cmk/plugins/collection/agent_based/inventory_lnx_video.py (Checkmk/checkmk on GitHub, commit 46f1ebdbae91ddb1f98817e0690a066f26e0ec1b) parses the line.

Example:

00:08.0 VGA compatible controller: Microsoft Corporation Hyper-V virtual VGA (prog-if 00 [VGA controller])

This string is split at “:” (that is what sep(58) in the section header means), and the text after the last “:” is used as the key for the inventory dictionary. In the example this is “Microsoft Corporation Hyper-V virtual VGA (prog-if 00 [VGA controller])”.

So if you have 8 VGA cards with the same string in this field, each card overwrites the previous one and you only get the values of the last one in your inventory.
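To make the collision concrete, here is a minimal Python sketch. It is not the actual Checkmk parser: `parse_video_headers` is a hypothetical stand-in that only mimics the behaviour described above, namely splitting each line at ":" and using the device description after the last ":" as the dictionary key.

```python
def parse_video_headers(lines):
    """Simplified illustration (NOT the real Checkmk code): key each card
    by the text that follows the last ':' of its header line."""
    cards = {}
    for line in lines:
        fields = line.split(":")  # mimics the sep(58) splitting
        # Only header lines carry the device class in the second-to-last field.
        if len(fields) < 2 or "VGA compatible controller" not in fields[-2]:
            continue
        name = fields[-1].strip()
        # The slot is stored as a value, so it cannot keep the keys apart.
        cards[name] = {"slot": line.split()[0]}
    return cards


sample = [
    "01:00.0 VGA compatible controller: NVIDIA Corporation GA102GL [RTX A5000] (rev a1) (prog-if 00 [VGA controller])",
    "\tSubsystem: NVIDIA Corporation GA102GL [RTX A5000]",
    "25:00.0 VGA compatible controller: NVIDIA Corporation GA102GL [RTX A5000] (rev a1) (prog-if 00 [VGA controller])",
    "\tSubsystem: NVIDIA Corporation GA102GL [RTX A5000]",
]

cards = parse_video_headers(sample)
print(len(cards))  # prints 1: two cards, one dictionary entry
for name, data in cards.items():
    print(data["slot"], name)  # only the last card (25:00.0) survives
```

Both header lines produce the same key, so the second assignment silently replaces the first. Any scheme that makes the key unique per card, for example including the slot address, would avoid this.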

Because 2.3 will be out of active maintenance in a month, I experimented with 2.4, which has the same issue.
We will see what happens with my PR #861, which adds the slot ID and should fix this issue.

Maybe you could add your lnx_video section output to the PR to have more test data.

Thank you for the fast PR and the proposed solution.

Here are the first two cards as output by the official plugin:

<<<lnx_video:sep(58)>>>
01:00.0 VGA compatible controller: NVIDIA Corporation GA102GL [RTX A5000] (rev a1) (prog-if 00 [VGA controller])
	Subsystem: NVIDIA Corporation GA102GL [RTX A5000]
	... more here ...

25:00.0 VGA compatible controller: NVIDIA Corporation GA102GL [RTX A5000] (rev a1) (prog-if 00 [VGA controller])
	Subsystem: NVIDIA Corporation GA102GL [RTX A5000]
	... more here ...

I tried modifying the plugin itself and changed the part above to this (we are also only interested in NVIDIA cards, not integrated ones):

    # Collect NVIDIA VGAs only
    vgas="$(lspci | grep VGA | grep -i NVIDIA | cut -d' ' -f1)"
    [ -n "$vgas" ] || return

    echo "<<<lnx_video:sep(58)>>>"

    printf '%s\n' "$vgas" | while IFS= read -r vga; do
        # Replace ':' with '.' for PCI ID
        pci="$(echo "$vga" | tr ':' '.')"
        # Capture the lspci output
        lspci -v -s "$vga" | awk -v pci="$pci" 'NR==1 {print $0 " " pci; next} {print}'
    done

This is the output:

<<<lnx_video:sep(58)>>>
01:00.0 VGA compatible controller: NVIDIA Corporation GA102GL [RTX A5000] (rev a1) (prog-if 00 [VGA controller]) 01.00.0
	Subsystem: NVIDIA Corporation GA102GL [RTX A5000]
	... more here ...

25:00.0 VGA compatible controller: NVIDIA Corporation GA102GL [RTX A5000] (rev a1) (prog-if 00 [VGA controller]) 25.00.0
	Subsystem: NVIDIA Corporation GA102GL [RTX A5000]
	... more here ...

Thanks to this (PCI) ID at the end of the line, I now see all 8 cards in the inventory view.

Now I am not sure whether I should wait for the official patch or just distribute this changed file to every host that we have. How long would we have to wait for the official patch?
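Before distributing a modified plugin, the section output can be checked for header lines that would still end up with the same key. The following is only a sketch under one assumption taken from the explanation earlier in this thread: the key is the text after the last ":" of each "VGA compatible controller" line. `duplicate_keys` is a hypothetical helper name, not part of Checkmk.

```python
from collections import Counter


def duplicate_keys(section_text):
    """Return the keys that occur more than once (assumed key: text after
    the last ':' of each unindented 'VGA compatible controller' line)."""
    keys = []
    for line in section_text.splitlines():
        # Detail lines from 'lspci -v' are indented; skip them and blank lines.
        if line[:1].isspace() or "VGA compatible controller:" not in line:
            continue
        keys.append(line.rsplit(":", 1)[-1].strip())
    return sorted(key for key, count in Counter(keys).items() if count > 1)


header = ("{slot} VGA compatible controller: NVIDIA Corporation GA102GL "
          "[RTX A5000] (rev a1) (prog-if 00 [VGA controller])")
detail = "\tSubsystem: NVIDIA Corporation GA102GL [RTX A5000]"

# Output of the official plugin: identical text after the last ':'.
original = "\n".join([header.format(slot="01:00.0"), detail, "",
                      header.format(slot="25:00.0"), detail])

# Output of the modified plugin: PCI ID appended with ':' replaced by '.'.
modified = "\n".join([header.format(slot="01:00.0") + " 01.00.0", detail, "",
                      header.format(slot="25:00.0") + " 25.00.0", detail])

print(duplicate_keys(original))  # one duplicated key
print(duplicate_keys(modified))  # prints []
```

Under the same assumption, the sketch also suggests why replacing ":" with "." matters: an appended "01:00.0" would add further ":"-separated fields, and the text after the last ":" would no longer be the description.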

Considering that 2.3 will be out of active maintenance in a month, I will also update to 2.4 in the meantime.

I have no idea when anything will happen with my PR. Maybe someone from Checkmk can comment on this issue and on whether my proposed change is good or bad.

And do you know whether I have to reinstall the agents on the monitored hosts after a major version upgrade? I cannot find any information about it, and I have not experienced any bugs with the 2.3.0p30 agent on the monitored hosts.

@dandon223 I would recommend updating the agents after Checkmk upgrades. But you would need to redo your modification for the inventory until there is an upstream change for it.

@Sara or @martin.hirschvogel Who is the right person to contact when the PR CI tests on GitHub fail? Some of them seem to be flaky, and it would be great to be able to retrigger just an individual test if necessary. For the formatting test, though, I'm not sure whether the test itself has a problem. In my repo it passes, but it cannot download one artefact:

WARNING: Download from https://artifacts.lan.tribe29.com/repository/upstream-archives/github.com/llvm/llvm-project/releases/download/llvmorg-19.1.7/LLVM-19.1.7-Linux-X64.tar.xz failed: class java.io.IOException Connect timed out

Meanwhile, in the checkmk repo the download seems to work, but the test fails because it runs out of disk space:

ERROR: /home/runner/.cache/bazel/_bazel_runner/aed53f964069daa6ab471b6b9883c077/external/bazel_tools/tools/build_defs/repo/http.bzl:139:45: An error occurred during the fetch of repository 'llvm_linux_x86_64+':
   Traceback (most recent call last):
	File "/home/runner/.cache/bazel/_bazel_runner/aed53f964069daa6ab471b6b9883c077/external/bazel_tools/tools/build_defs/repo/http.bzl", line 139, column 45, in _http_archive_impl
		download_info = ctx.download_and_extract(
Error in download_and_extract: java.io.IOException: Error extracting /home/runner/.cache/bazel/_bazel_runner/aed53f964069daa6ab471b6b9883c077/external/llvm_linux_x86_64+/temp4487391751063356141/LLVM-19.1.7-Linux-X64.tar.xz to /home/runner/.cache/bazel/_bazel_runner/aed53f964069daa6ab471b6b9883c077/external/llvm_linux_x86_64+/temp4487391751063356141: write (No space left on device)
ERROR: no such package '@@llvm_linux_x86_64+//': java.io.IOException: Error extracting /home/runner/.cache/bazel/_bazel_runner/aed53f964069daa6ab471b6b9883c077/external/llvm_linux_x86_64+/temp4487391751063356141/LLVM-19.1.7-Linux-X64.tar.xz to /home/runner/.cache/bazel/_bazel_runner/aed53f964069daa6ab471b6b9883c077/external/llvm_linux_x86_64+/temp4487391751063356141: write (No space left on device)
ERROR: /home/runner/work/checkmk/checkmk/bazel/tools/format/BUILD:56:16: //bazel/tools/format:format_C++_with_clang-format.check depends on @@llvm_linux_x86_64+//:bin/clang-format in repository @@llvm_linux_x86_64+ which failed to fetch. no such package '@@llvm_linux_x86_64+//': java.io.IOException: Error extracting /home/runner/.cache/bazel/_bazel_runner/aed53f964069daa6ab471b6b9883c077/external/llvm_linux_x86_64+/temp4487391751063356141/LLVM-19.1.7-Linux-X64.tar.xz to /home/runner/.cache/bazel/_bazel_runner/aed53f964069daa6ab471b6b9883c077/external/llvm_linux_x86_64+/temp4487391751063356141: write (No space left on device)

Hi @mayrstefan ,

There is some work being done on the CI tests, which might lead to issues. I will try to get more information and let you know.

Hi @mayrstefan ,

I was told there were some improvements made and that your PR was merged. Is it ok now or are there still issues?

Hi Sara,

Yes, my workaround for the tests running in GitHub Actions was merged last week. Now that the tests have a chance to finish successfully, I have updated all of my other PRs. The PR for this specific issue has now got a “tracked” label, and I am waiting for feedback.

Thank you


My proposal got accepted: Werk #18889: Fix missing graphic cards in HW/SW inventory of Linux host
