Troubleshooting OMD start problem? (Redis: [Errno 9] Bad file descriptor)

CMK version: 2.0.0p17.cre
OS version: Ubuntu 20.04

Error message: [Errno 9] Bad file descriptor

This error occurs when I try to omd start my site. I’ve narrowed it down to a problem during the start of redis:

# omd start mysite redis
Temporary filesystem already mounted
Starting redis...ERROR: Failed to launch building of search index
[Errno 9] Bad file descriptor
OK

Whatever this problem is, it seems to be causing my entire CheckMK installation to malfunction, so the “OK” at the end of the output (and the 0 return code from OMD) seems, er, misleading to say the least. I do get a Redis server process after running this, so Redis itself is starting, but something else evidently is not (again, rather misleading OMD output).

I’m trying to figure out what in the “start redis” process could be causing the issue. I’d like to see a detailed list of what OMD is up to here, but I can’t get any verbose or debug output (the expected -v or --debug command-line switches don’t work).

I have of course googled the error message, which it seems has happened in a number of places before. However, none seem to relate to this, so I’m no nearer a solution.

Any ideas would be much appreciated!

Any hints in ~/var/log/redis-server.log?

Good thought - but sadly, no - Redis itself seems to be fine:

1299946:M 15 Dec 2021 09:16:33.754 # Server initialized
1299946:M 15 Dec 2021 09:16:33.754 # WARNING overcommit_memory is set to 0! Background save may fail under low memory condition. To fix this issue add 'vm.overcommit_memory = 1' to /etc/sysctl.conf and then reboot or run the command 'sysctl vm.overcommit_memory=1' for this to take effect.
1299946:M 15 Dec 2021 09:16:33.754 * The server is now ready to accept connections at /omd/sites/mysite/tmp/run/redis

I assume some other process is started as well as the Redis server, but OMD doesn’t break that process out into a separate task, nor does it say what it is :frowning:

Some additional background: This problem occurred when I tried to move the /opt/omd/sites directory to an NFS mount. Moving it back to local disk seems to work just fine - I assume somewhere we’re trying to save a “special” file, which I guess my NFS mounts don’t allow for - but tracking down what’s what is proving difficult.

Hi @coofercat,

Checkmk needs some streams and pipes to run, and these are created during the start process. As far as I know this isn’t supported by NFS, because streams and pipes aren’t files in the usual sense (they are special file types created by mknod()).
I also wouldn’t recommend running Checkmk on NFS, because you can run into problems with stale NFS handles and with the massive I/O Checkmk can produce by storing the performance values in the RRD files.
Additionally, Checkmk could have problems with file permissions on NFS, because every site runs as a dedicated user. If you map your NFS export with just one user’s permissions, you will get problems too.
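
You can check this directly with a quick test on the NFS mount itself (the path is just an example, adjust it to your setup):

# try to create a named pipe on the NFS mount and use it once
mkfifo /mnt/nfs/fifo-test && echo "mkfifo ok"
test -p /mnt/nfs/fifo-test && echo "it is a FIFO"
# write into the pipe in the background, then read it back
( echo hello > /mnt/nfs/fifo-test & ) && cat /mnt/nfs/fifo-test
rm -f /mnt/nfs/fifo-test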

Boiling down what @tosch pointed out in detail: Do not put the /omd folder hierarchy on NFS!

If you want to push backups to NFS, that is perfectly fine, but keep the application local.
I am unaware of any use case where it is a good idea to run an application from NFS at all.
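
For the backups that could be as simple as this (the mount point is just an example):

# as root: write a backup of the whole site to the NFS mount
omd backup mysite /mnt/nfs/backups/mysite.tar.gz
# and to bring it back later, e.g. on a rebuilt machine:
omd restore /mnt/nfs/backups/mysite.tar.gz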

Take a look at SAP: it’s totally standard to have the kernel components running from NFS :slight_smile:

:fearful:

On second thought, that is not too bad, if it really only is the kernel.

Thanks all - that is indeed a shame.

As for NFS performance and the like, IMHO, this is mostly a solved problem. Cloud providers supply incredibly capable (and resilient) NFS services these days, which in my experience prove to be very reliable over the long term.

My use-case is to have a VM running in Amazon. It’s in an auto-scaling group (ASG) of 1, which means that if Amazon (or I) ever terminate the instance, the ASG will recreate it. I haven’t done it yet, but the server image will soon have Ansible self-apply on it, so when it boots up for the first time, it’ll self-install CheckMK onto the image.

As it stands, destroying the VM means I’ll lose my CheckMK setup - all the software will auto-restore quite happily, but the replacement server won’t have any of the config of the previous one. I’ll need to do a restore from backup - just because a VM failed. Whilst VMs do tend to be pretty reliable in most cloud providers these days, the vendors remind us that they do fail, and that we should architect our applications to be resilient to such failures.

As an example, take another product entirely… Sonatype Nexus (a package repository server). That works in much the same way as CheckMK (i.e. it writes files to disk with package metadata, server config, etc.). With that product, it’s entirely possible to run the “data folder” on NFS. Therefore, if I terminate the VM, it is recreated and 10 minutes later I have a fully working Nexus server - exactly the same as the one before it.

To try to justify this approach a little further: I’ll bet I can patch my Nexus server, or recover from (say) a suspected security incident faster than you can. All I have to do is terminate my server, the cloud magically starts another one, and it starts up with all latest patches applied and none of the “dirt” from its previous existence (I’m not suggesting this is all I’d do in these circumstances, but it sure is a very powerful way to operate servers).

An alternative would be for me to use Ansible to entirely apply all the config into CheckMK - sadly, that’s not really possible either (or rather: it’s so much work that it makes very little sense to try it!). I have previously tried with an older CheckMK version, and whilst it works, it’s pretty hard to work with.

We are where we are though - if it’s not possible, it’s not possible. As I say, that is indeed a shame.

If you consider putting some of the folders inside the site (like ~site/etc, ~site/var, …) on NFS you should be fine, but the whole structure won’t work, I guess. This is due to the limitations of the NFS file system.
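
Roughly something like this - completely untested, and the paths are just an example:

omd stop mysite
# move one data directory onto the NFS mount and link it back into the site
mv /omd/sites/mysite/var /mnt/nfs/mysite-var
ln -s /mnt/nfs/mysite-var /omd/sites/mysite/var
omd start mysite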

Okay, so NFS isn’t possible (or is tricky). Is there a way to debug/troubleshoot OMD though? It seems it’s doing something non-obvious which perhaps I can isolate and then figure out a way forward from there?

I had a look at some of the OMD code - it appears to have a “global settings” capability with some sort of “verbose” setting in it. I can’t see how to switch it into verbose mode though. Any ideas?

You can use the -v or --verbose option, but I guess it’s not as verbose as you are hoping for.

I assume the omd start process breaks after creating the FIFO pipe: it tries to test it and can’t find it, because it couldn’t be created on NFS.

EDIT:
You could also enable bash debugging to see which commands are executed. I am not sure if it works with the underlying Python that the omd script runs, but just give it a shot with set -vx. To disable it for your session again, use set +vx.
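
I.e. something like this in the site user’s shell:

set -vx            # -v echoes input lines, -x traces the expanded commands
omd start mysite redis
set +vx            # switch the tracing off again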

Thanks for the suggestions - am I missing something? omd doesn’t have a -v (or --verbose) option (or at least, not on mine). The help text isn’t very helpful, and while these two combinations are at least accepted, neither shows any more detail than without the -v option:

omd -v start mysite redis
omd start mysite redis -v

Sadly, the -x setting only applies to Bash, and only to the current shell - it doesn’t follow down into other programs, so no, it doesn’t tell me what’s going on in the OMD Python program.

I suspect you’re right about the FIFO pipe - it’d be nice to get some actual output showing the pipe failing to be created, or more details about when it’s failing to be used. My impressions of OMD are that it wasn’t designed with this sort of thing in mind.
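
One idea I might still try, just to get some low-level output out of the start: tracing the file-related syscalls (the output path is just an example):

# as root: trace file-related syscalls of omd start and all its child processes
strace -f -e trace=file -o /tmp/omd-redis.trace omd start mysite redis
# then look for the pipe creation and any failing calls
grep -E 'mknod|EBADF|ENOENT' /tmp/omd-redis.trace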

It’s not printed in the usage page, but in the source code I found the options (at least on my old version :slight_smile:).

You could perhaps modify the omd script to make it a bit more verbose at the point where the pipe is created or checked.

Just to document this, @tosch explains in Docker-Swarm & NFS4-Volume gives [Errno 9] Bad file descriptor - #4 by tosch:

Liveproxyd tries to create the file ~site/tmp/run/live, which needs to be a FIFO pipe, and fails, I assume.

This tells me what I was hoping to get OMD to tell me. I’m at least now able to tackle the underlying problem (thanks @tosch), but I still see problems in OMD here. I’m not really sure “go wade through thousands of lines of Python code” is a particularly viable solution, although it is one I may resort to in future.

I’ve had a bit of a hack about, and through an excessive use of symlinks and other messing about, I was able to make /omd/sites/checkmk/tmp and /omd/sites/checkmk/tmp/run into either symlinks to a real filesystem somewhere else, or a tmpfs mount. With the rest of /omd/sites/checkmk on an NFS4 mount, it still doesn’t work though. The ‘live’ FIFO pipe seems to be created successfully in this cranky setup.
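
For reference, the tmpfs variant looked roughly like this (treat it as a sketch rather than a recipe; the site here is called ‘checkmk’):

# mount a local tmpfs over the site's tmp directory, then start the site
mount -t tmpfs -o mode=0755,uid=$(id -u checkmk),gid=$(id -g checkmk) tmpfs /omd/sites/checkmk/tmp
omd start checkmk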

I have tried to ‘debug’ the issue. The shell debugging doesn’t penetrate into the Python code, and the Python code is, well, thousands of lines of uncommented, untestable mess. Good luck to anyone trying to figure out any part of that!

I conclude, therefore, that whilst the live FIFO is one problem, there are evidently others which I’m not aware of. I also conclude that OMD and CheckMK are not compatible with NFS4, which reduces my ability to deploy CheckMK in a production environment with (easy) resilience built in. This is a real shame.

Hi @coofercat, I understand your frustration.

But I have to say: Despite the setup you want to implement being supported by other solutions, we do not support it. There are countless deployments of Checkmk that work flawlessly in supported environments and people are quite creative with their deployments.

I do respect your motivation to actually debug your error code-wise, I really do. But I cannot help but ask myself whether it is worthwhile for you. There are plenty of examples of how to roll out Checkmk, even in huge deployments and with resilience in mind. Why not store backups on NFS as recommended and pull them from there in case of failure?

I do not intend to patronize you, I just want to offer some perspective here.

we do not support it

Understood - a shame though, but understood.

But I cannot help but ask myself if it is worthwhile for you.

Well, it would have been if it had worked - even if it left me unsupported, it would at least give a template for OMD to (maybe) grow into one day. But ultimately, I’ve learned that much of CheckMK’s internals are pretty much impenetrable for people such as me.

Why not store backups on NFS as recommended and pull them from there in case of failure?

Indeed. Actually, I’ve elected to put them into S3, but the end result is the same. I still don’t really have the option of a completely self-building server, but I can get close, albeit with a few hours of lost data and some manual work.
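
Roughly, the plan looks like this (bucket name and paths are made up):

# on the running server: push a site backup to S3 every so often
omd backup mysite /tmp/mysite.tar.gz
aws s3 cp /tmp/mysite.tar.gz s3://my-backup-bucket/checkmk/mysite.tar.gz

# on a freshly rebuilt server: pull the latest backup and restore it
aws s3 cp s3://my-backup-bucket/checkmk/mysite.tar.gz /tmp/mysite.tar.gz
omd restore /tmp/mysite.tar.gz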

Thanks for your thoughts here, it’s really helpful to hear it from someone who’s evidently working with CheckMK in lots of different scenarios. I’m ready to be told “you’re doing it wrong” when I am :wink:

OMD is a layer of complexity which is largely unnecessary, yet constrains installations to particular ways of working which may not suit everyone. IMHO, other ways to work are better than those OMD allows, but I’m happy to accept not everyone will agree with me there.

I’ve used (and still recommend) CheckMK - it’s a good mixture of features and flexibility. However, it’s been built in a particular way and a particular style which, it seems, doesn’t suit the more “cloudy” ways of working well. I just hope that over time the chains can be loosened so it works well for more use-cases. I’m just sorry I can’t log a well-defined ticket asking for the changes that would help me.

I appreciate your open-mindedness and your feedback @coofercat, I really do!

All possible disclaimers apply here, but: I think restoring a backup will become possible through the REST API someday. Do not quote me on that, but maybe just a little spark to keep your hopes lit. :wink:

OMD has quite a history and still works really well in most scenarios, but of course that is subject to opinion and, depending on the use case, might not feel like a good approach. We are very aware of the cloud-native generation, and we are working tirelessly to improve Checkmk and to make it fit for every possible use case.

I am glad you are still with us, despite not loving the solution. Not everyone has to love it, we can just be friends, that’s perfectly fine. :wink:
