Inodes of /tmp Filesystem on checkmk site

Hi,

we noticed that the inode usage of the /tmp filesystem on one of our slave sites is getting over 90%.
“lsof +f -- /tmp” shows more than 60730 entries which belong to the site user:

python 52887 slave4 2542u REG 253,9 4096 45505 /tmp/ffi4FgrU8 (deleted)
python 52887 slave4 2543u REG 253,9 4096 45543 /tmp/ffiQBmMhf (deleted)

These files are no longer visible in the /tmp filesystem; maybe they are not closed correctly by the site.
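For anyone seeing the same symptom, the deleted-but-still-open files can be counted per process roughly like this (just a sketch; the lsof column layout can differ slightly between versions):

# open files on /tmp whose link count is 0, i.e. already deleted
lsof +L1 +f -- /tmp

# aggregate per command/PID to see which process holds most of them
lsof +L1 +f -- /tmp | awk 'NR>1 {print $1, $2}' | sort | uniq -c | sort -rn | head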

Restarting the site might solve the issue temporarily, but what would be a permanent solution? Is this a bug?

We are using Checkmk 1.6.0p21 CEE on RHEL 7.7.
The site has already been running for two months.

Best Regards
Thomas

The following process is responsible for the open (deleted) files:
prod_sl+ 3067 26118 5 Jun26 ? 1-10:57:23 python /omd/sites/prod_slave4/bin/cmk --keepalive --keepalive-fd=3

We have now doubled the /tmp filesystem from 1 GB to 2 GB, so of course there is space again. But we can already see that the number of used inodes continues to increase. Any ideas?
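In case someone wants to watch the same thing, the growth is easy to follow with df (a sketch):

# inode usage of /tmp (IUsed / IFree / IUse%)
df -i /tmp

# re-check every 5 minutes to see how fast it grows
watch -n 300 df -i /tmp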

Hi,
did you have the same behaviour on your other distributed sites too?
Cheers,
Christian

Hi Christian,

no, we do not have this behaviour on the other sites. They have only about 150 inodes used in the /tmp filesystem.
But the other sites do not have as many hosts and services.
The affected site has about 327 hosts and 36594 services and is for our AIX servers.

Best Regards
Thomas

Hello @andreas-doehler ,
I hope it’s OK if I address you directly. Do you have any idea what could be the reason for the inode usage? After we increased the size of the /tmp filesystem, the inode usage is now stable at 48%. But this behaviour is strange, isn’t it?

best regards
Christian

Keepalive processes with explicit file descriptors I only know from mkeventd.
Inside my 1.6 installations I have no cmk processes with this command line.

Example here

OMD[cmk]:~$ ps aux | grep keepalive
cmk       2812  2.4  0.1 206788 71352 ?        S    15:07   0:01 python /omd/sites/cmk/bin/cmk --create-rrd --keepalive
cmk       2813  2.5  0.1  97496 69416 ?        S    15:07   0:01 python /omd/sites/cmk/bin/cmk --handle-alerts --keepalive
cmk       2814  2.5  0.1  97508 69356 ?        S    15:07   0:01 python /omd/sites/cmk/bin/cmk --notify --keepalive
cmk       2820  2.5  0.1  93828 65384 ?        S    15:07   0:01 python /omd/sites/cmk/bin/cmk --keepalive
cmk       2858  2.4  0.1  97972 69912 ?        S    15:07   0:01 python /omd/sites/cmk/bin/cmk --keepalive --real-time-checks

If I remember correctly, almost no process should write directly into the /tmp folder.


@T.Schmitz Which filesystem are you using on your /tmp? ext4? xfs? Other?

Hello @rawiriblundell
we are using ext4:

/dev/mapper/local_vg-tmp_lv on /tmp type ext4 (rw,nosuid,nodev,relatime,seclabel,data=ordered)

As mentioned above, it seems that the following process is responsible for the inode usage:
cmk --keepalive --keepalive-fd=3

One strange thing: this process with “--keepalive-fd=3” only exists on this slave site. None of the other sites has such a process, only the keepalive process without the fd=3 option.
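To double-check that it really is this process, its deleted-but-open descriptors can be counted directly in /proc (a sketch; 3067 is just the PID from the ps output above and has to be replaced with the current one):

PID=3067    # PID of the cmk --keepalive --keepalive-fd=3 process
ls -l /proc/$PID/fd | grep deleted | head    # sample of the leaked /tmp/ffi* files
ls -l /proc/$PID/fd | grep -c deleted        # how many of them are still open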

On this site we have 350 AIX nodes, which send their errlog via logwatch.aix to the Event Console.

regards
Christian

If we park any need for a root cause solution to one side…

… sometimes, and speaking as a sysadmin, you just have systems that do this kind of thing. A permanent workaround is to just reformat the filesystem:

  • again to ext4 but with a custom-specified number of inodes. IIRC it’s a 32-bit number, so mkfs.ext4 -N 4294967295 ... to get the maximum number of inodes (see the sketch after this list). Your 65.54k inodes are now 4 billion. Or:
  • switch to xfs. It’s just better ™. Its default inode behaviour is far more stable than ext4’s, so you generally shouldn’t need to customise its inode count
  • very rarely, an xfs filesystem might have inode exhaustion. It is rare, but not unheard of. You can create an xfs filesystem with 64 bit inode numbering. Good luck exhausting that.
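
A minimal sketch of those options, assuming the same LV as in the mount output above (everything in /tmp is lost, and the site should be stopped first so nothing keeps files open there):

omd stop prod_slave4              # release the open files, then unmount
umount /tmp

# ext4 again, but with an explicit, much larger inode count
# (e.g. 1048576 instead of the default of roughly one inode per 16 KiB)
mkfs.ext4 -N 1048576 /dev/mapper/local_vg-tmp_lv

# ...or switch to xfs, which allocates inodes dynamically
# (remember to change the filesystem type in /etc/fstab as well)
mkfs.xfs -f /dev/mapper/local_vg-tmp_lv

mount /tmp
df -i /tmp                        # verify the new inode count
omd start prod_slave4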

You’re in a fortunate position that this is just /tmp, so you don’t have a need to migrate any data.

