Service Ceph Pool reports wrong usage percentage

Hey all,

We are using Checkmk Raw Edition 2.1.0b7 and Ceph version 17.2.1 (Quincy).

There seems to be an error in how the Ceph plugin calculates pool usage, resulting in premature warnings.

This issue was mentioned before, but that thread is stale now.

In our example of pool hdd_ec, Checkmk reports 80.67% used (2.35 of 2.91 PB), but Ceph reports 73.57%.

[root@osd-1 ~]# ceph df
--- RAW STORAGE ---
CLASS     SIZE    AVAIL     USED  RAW USED  %RAW USED
hdd    3.5 PiB  1.1 PiB  2.4 PiB   2.4 PiB      67.51
ssd     42 TiB   42 TiB  324 GiB   324 GiB       0.75
TOTAL  3.5 PiB  1.2 PiB  2.4 PiB   2.4 PiB      66.74

--- POOLS ---
POOL                                    ID   PGS   STORED  OBJECTS     USED  %USED  MAX AVAIL
...
hdd_ec                                   3  2048  1.6 PiB  607.52M  2.3 PiB  73.57    575 TiB
...

I think the plugin mixes values that include replication/EC overhead (POOL USED) with values that represent only the actual data (POOL MAX AVAIL), like this:

POOL USED / (POOL USED + POOL MAX AVAIL)
100 * 2.3 PiB / (2.3 PiB + 575 TiB) ≈ 80%

But it should rather be calculated like this:

POOL STORED / (POOL STORED + POOL MAX AVAIL)
100 * 1.6 PiB / (1.6 PiB + 575 TiB) ≈ 73.57%
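
To make the two calculations concrete, here is a minimal Python sketch using the rounded numbers from the ceph df output above; the exact percentages differ slightly because ceph works with unrounded byte counts:

PiB = 2 ** 50
TiB = 2 ** 40

pool_used   = 2.3 * PiB  # USED: includes replication/EC overhead
pool_stored = 1.6 * PiB  # STORED: the actual user data
max_avail   = 575 * TiB  # MAX AVAIL: net space left, after replication/EC

# What the plugin appears to do: mixes raw USED with net MAX AVAIL
print(100 * pool_used / (pool_used + max_avail))      # ~80.4%

# What `ceph df` itself reports as %USED
print(100 * pool_stored / (pool_stored + max_avail))  # ~74.0%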

Sidenote: I think on an optimally balanced cluster, the latter result would also match the raw usage of the pool:

POOL USED / (POOL USED + RAW STORAGE AVAIL)
100 * 2.3 PiB / (2.3 PiB + 1.1 PiB) ≈ 67.65%

But MAX AVAIL accounts not only for replication/EC, but also for unequal data distribution across the OSDs.

So I guess we have to improve the distribution in our cluster.
But I’m wondering if the discrepancy between these two values would be worth monitoring as well; a rough sketch follows below.
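
As a hedged sketch (a hypothetical helper, not part of any existing plugin), that discrepancy could be derived per pool from the same ceph df values:

def imbalance_points(stored, max_avail, used_raw, raw_avail):
    """Gap between the pool's MAX AVAIL-based usage and the usage the pool
    would show on a perfectly balanced cluster, in percentage points."""
    pool_pct = 100 * stored / (stored + max_avail)          # ~74.0 (ceph: 73.57)
    balanced_pct = 100 * used_raw / (used_raw + raw_avail)  # ~67.6 (ceph: 67.51)
    return pool_pct - balanced_pct

# hdd_ec, with the rounded values from `ceph df` above:
PiB, TiB = 2 ** 50, 2 ** 40
print(imbalance_points(1.6 * PiB, 575 * TiB, 2.3 * PiB, 1.1 * PiB))  # ~6.4 points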

I tend to use Checkmk only for the actual status monitoring, because it produces so many false alerts at every corner (space usage, as you mentioned; PG warnings while the balancer moves data around; and MGRs cannot be monitored by Checkmk at all).

You either end up writing your own integration, or you move the monitoring somewhere else.

Agreed, remapping, backfilling, scrubbing etc. are normal, regular operations and shouldn’t result in warnings. And for pool usage, we will probably rely on other stats instead.

Sorry to reopen this old thread, but I still see the same problem!

Running Checkmk Raw Edition 2.2.0p7 in Docker, same problem.

{
  "name": "cephfs_data_ec",
  "id": 4,
  "stats": {
    "stored": 1150003665966840,
    "stored_data": 1150003665960960,
    ..
    "kb_used": 1684575626849,
    "bytes_used": 1725005441892981,
    "data_bytes_used": 1725005441884160,
    ..
    "percent_used": 0.8744609355926514,
    "max_avail": 165096395374592,
    ..
    "stored_raw": 1725005431832576,
    "avail_raw": 247644587614516
  }
},

I would expect Checkmk to calculate 100 * 1150003665966840 / (1150003665966840 + 165096395374592) = 87.4%. However, it returns 91.2%.

It seems to use the bytes_used counter that old clusters used, instead of the stored counter.
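
With the stats from the JSON above, a quick Python check reproduces both numbers (91.3 vs. the reported 91.2% is just rounding/sampling):

stats = {
    "stored": 1150003665966840,
    "bytes_used": 1725005441892981,
    "max_avail": 165096395374592,
}

# stored-based: matches ceph's own percent_used of 0.8744...
print(100 * stats["stored"] / (stats["stored"] + stats["max_avail"]))          # 87.4

# bytes_used-based: what checkmk apparently computes
print(100 * stats["bytes_used"] / (stats["bytes_used"] + stats["max_avail"]))  # ~91.3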

Could you guys pretty please fix this? :wink:


Hi, I have the same problem with Checkmk Raw Edition 2.2.0p9 and a Ceph Quincy cluster.

ceph df
POOL                            ID   PGS   STORED  OBJECTS     USED  %USED  MAX AVAIL
vmdata1                          0  2048   38 TiB   10.07M  114 TiB  68.52     17 TiB
"Ceph Pool vmdata1",
  {},
  [
   "themes\/facelift\/images\/icon_service_graph.svg"
  ],
  "WARN",
  "Used: 86.72% - 114 TiB of 132 TiB (warn\/crit at 80.00%\/90.00% used)WARN, trend per 1 day 0 hours: +2.78 TiB, trend per 1 day 0 hours: +2.11%, Time left until disk full: 6 days 6 hours",
  "Used: 86.72% - 114 TiB of 132 TiB (warn\/crit at 80.00%\/90.00% used)WARN\ntrend per 1 day 0 hours: +2.78 TiB\ntrend per 1 day 0 hours: +2.11%\nTime left until disk full: 6 days 6 hours",
  "86.72%",

Almost a year later (or two since the start of this topic): we have Checkmk 2.3, but it still doesn’t work.

Raw agent output:

<<<ceph_df_json:sep(0)>>>

{"version":"ceph version 18.2.2 (e9fe820e7fffd1b7cde143a9f77653b73fcec748) reef (stable)"}

{"stats":{"total_bytes":11042323169280,"total_avail_bytes":5122803105792,"total_used_bytes":5919520063488,"total_used_raw_bytes":5919520063488,"total_used_raw_ratio":0.53607559204101562,"num_osds":23,"num_per_pool_osds":23,"num_per_pool_omap_osds":23},"stats_by_class":{"ssd":{"total_bytes":11042323169280,"total_avail_bytes":5122803105792,"total_used_bytes":5919520063488,"total_used_raw_bytes":5919520063488,"total_used_raw_ratio":0.53607559204101562}},"pools":[{"name":".mgr","id":1,"stats":{"stored":189858544,"stored_data":189858544,"stored_omap":0,"objects":46,"kb_used":556236,"bytes_used":569585664,"data_bytes_used":569585664,"omap_bytes_used":0,"percent_used":0.00014548329636454582,"max_avail":1304852561920,"quota_objects":0,"quota_bytes":0,"dirty":0,"rd":87012,"rd_bytes":223686656,"wr":177897,"wr_bytes":4241326080,"compress_bytes_used":0,"compress_under_bytes":0,"stored_raw":569575616,"avail_raw":3914557726651}},{"name":"pveclus1-ceph","id":4,"stats":{"stored":2036728219124,"stored_data":2036727939072,"stored_omap":280052,"objects":501278,"kb_used":5968735709,"bytes_used":6111985365468,"data_bytes_used":6111984525312,"omap_bytes_used":840156,"percent_used":0.60958051681518555,"max_avail":1304852561920,"quota_objects":0,"quota_bytes":0,"dirty":0,"rd":13812319885,"rd_bytes":697578409407488,"wr":27691420285,"wr_bytes":478009717050368,"compress_bytes_used":0,"compress_under_bytes":0,"stored_raw":6110184472576,"avail_raw":3914557726651}}]}

Let’s pass this through jq to make it readable:

{
  "stats": {
    "total_bytes": 11042323169280,
    "total_avail_bytes": 5122803105792,
    "total_used_bytes": 5919520063488,
    "total_used_raw_bytes": 5919520063488,
    "total_used_raw_ratio": 0.5360755920410156,
    "num_osds": 23,
    "num_per_pool_osds": 23,
    "num_per_pool_omap_osds": 23
  },
  "stats_by_class": {
    "ssd": {
      "total_bytes": 11042323169280,
      "total_avail_bytes": 5122803105792,
      "total_used_bytes": 5919520063488,
      "total_used_raw_bytes": 5919520063488,
      "total_used_raw_ratio": 0.5360755920410156
    }
  },
  "pools": [
    {
      "name": ".mgr",
      "id": 1,
      "stats": {
        "stored": 189858544,
        "stored_data": 189858544,
        "stored_omap": 0,
        "objects": 46,
        "kb_used": 556236,
        "bytes_used": 569585664,
        "data_bytes_used": 569585664,
        "omap_bytes_used": 0,
        "percent_used": 0.00014548329636454582,
        "max_avail": 1304852561920,
        "quota_objects": 0,
        "quota_bytes": 0,
        "dirty": 0,
        "rd": 87012,
        "rd_bytes": 223686656,
        "wr": 177897,
        "wr_bytes": 4241326080,
        "compress_bytes_used": 0,
        "compress_under_bytes": 0,
        "stored_raw": 569575616,
        "avail_raw": 3914557726651
      }
    },
    {
      "name": "pveclus1-ceph",
      "id": 4,
      "stats": {
        "stored": 2036728219124,
        "stored_data": 2036727939072,
        "stored_omap": 280052,
        "objects": 501278,
        "kb_used": 5968735709,
        "bytes_used": 6111985365468,
        "data_bytes_used": 6111984525312,
        "omap_bytes_used": 840156,
        "percent_used": 0.6095805168151855,
        "max_avail": 1304852561920,
        "quota_objects": 0,
        "quota_bytes": 0,
        "dirty": 0,
        "rd": 13812319885,
        "rd_bytes": 697578409407488,
        "wr": 27691420285,
        "wr_bytes": 478009717050368,
        "compress_bytes_used": 0,
        "compress_under_bytes": 0,
        "stored_raw": 6110184472576,
        "avail_raw": 3914557726651
      }
    }
  ]
}

Note the "percent_used": 0.609 for pool pveclus1-ceph.

Now let’s look in cmk at the service performance data for this pool:

fs_used=5828862.716593;5658108.873274;6365372.482433;0;7072636.091593 fs_free=1243773.375;;;0; fs_used_percent=82.414289;80;90;0;100 fs_size=7072636.091593;;;0; growth=7366.472984;;;; trend=133044.25791;;;;

Now it’s suddenly 82%, i.e. again the bytes_used-based value rather than ceph’s percent_used of ~61%.
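
We can check this against the pveclus1-ceph stats above with a few lines of Python; the small deviations from the perf data just come from the agent dump and the perf data being sampled at slightly different moments:

stats = {
    "stored": 2036728219124,
    "bytes_used": 6111985365468,
    "max_avail": 1304852561920,
}

MiB = 2 ** 20
print(stats["bytes_used"] / MiB)  # ~5828843.5, within ~20 MiB of fs_used

# bytes_used-based usage: ~82.4%, matching fs_used_percent=82.414289
print(100 * stats["bytes_used"] / (stats["bytes_used"] + stats["max_avail"]))

# stored-based usage: ~61.0%, matching ceph's percent_used of 0.6095...
print(100 * stats["stored"] / (stats["stored"] + stats["max_avail"]))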

Great work, Checkmk.
