Docker Swarm & NFSv4 volume gives [Errno 9] Bad file descriptor

Hi
I am trying to deploy the 2.0.0p9 Docker image on my Docker Swarm setup with NFS-backed volumes.
I was able to download the Docker image from the customer portal and push it to my registry.

docker-compose.yml:

version: '3.7'

services:
  basesite:
    image: registry.[XXX]:5000/checkmk/check-mk-enterprise:2.0.0p9
    networks:
      - traefik-public
      - internal
    volumes:
      - type: volume
        source: basesite_data
        target: /omd/sites
        volume:
          nocopy: true
    environment: 
      CMK_SITE_ID: "cmk"
      CMK_LIVESTATUS_TCP: "on"
      CMK_PASSWORD: "initialpassword"
      MAIL_RELAY_HOST: "XXXX"
    deploy:
      labels:
        - traefik.enable=true
        - traefik.docker.network=traefik-public
        - traefik.constraint-label=traefik-public
        - traefik.http.routers.checkmkbasesite.rule=Host(`[Hostname]`)
        - traefik.http.routers.checkmkbasesite.entrypoints=websecure
        - traefik.http.routers.checkmkbasesite.tls=true
        - traefik.http.routers.checkmkbasesite.tls.options=default
        - traefik.http.routers.checkmkbasesite.service=checkmkbasesite
        - traefik.http.services.checkmkbasesite.loadbalancer.server.port=5000


networks:
  traefik-public:
    external: true
  internal:
    attachable: true

volumes:
  basesite_data:
    driver: local
    driver_opts:
      type: nfs
      o: nfsvers=4,addr=[NFS-server address],rw
      device: ":/15k1/check_mk/basesite/data"

Everything works nicely as long as the volume is not attached to the service, but then of course nothing is persisted.
As soon as the volume is added, I get a lot of “[Errno 9] Bad file descriptor” errors.

log:

2021-09-08T09:04:27.477115169Z ### PREPARE POSTFIX (Hostname: 19d7bab954cc, Relay host: XXXXX)
2021-09-08T09:04:27.585072742Z ### STARTING MAIL SERVICES
2021-09-08T09:04:29.688910700Z Starting Postfix Mail Transport Agent: postfix.
2021-09-08T09:04:29.693624860Z ### CREATING SITE 'cmk'
2021-09-08T09:04:40.443822450Z Adding /opt/omd/sites/cmk/tmp to /etc/fstab.
2021-09-08T09:04:40.443856850Z Going to set TMPFS to off.
2021-09-08T09:04:41.970110806Z Error in plugin file /omd/sites/cmk/share/check_mk/checks/3par_capacity: Cannot write configuration file "/omd/sites/cmk/tmp/check_mk/check_includes/builtin/3par_capacity": [Errno 9] Bad file descriptor
2021-09-08T09:04:41.976635689Z Error in plugin file /omd/sites/cmk/share/check_mk/checks/3par_cpgs: Cannot write configuration file "/omd/sites/cmk/tmp/check_mk/check_includes/builtin/3par_cpgs": [Errno 9] Bad file descriptor
.......
2021-09-08T09:04:50.301163901Z Configuration Error: [Errno 9] Bad file descriptor
2021-09-08T09:04:53.748263728Z Preparing tmp directory /omd/sites/cmk/tmp...Updating core configuration...
2021-09-08T09:04:53.748297028Z Executing post-create script "01_create-sample-config.py"...OK
2021-09-08T09:04:53.786314414Z Adding /opt/omd/sites/cmk/tmp to /etc/fstab.
2021-09-08T09:04:53.786351714Z Going to set TMPFS to off.
2021-09-08T09:04:53.786356314Z Created new site cmk with version 2.0.0p9.cee.
2021-09-08T09:04:53.786362314Z   The site can be started with omd start cmk.
2021-09-08T09:04:53.786376514Z   The default web UI is available at http://19d7bab954cc/cmk/
2021-09-08T09:04:53.786404815Z   The admin user for the web applications is cmkadmin with password: initialpassword
2021-09-08T09:04:53.786407515Z   For command line administration of the site, log in with 'omd su cmk'.
2021-09-08T09:04:53.786435615Z   After logging in, you can change the password for cmkadmin with 'htpasswd etc/htpasswd cmkadmin'.
2021-09-08T09:04:55.801569765Z ### STARTING XINETD
2021-09-08T09:04:55.820969413Z Starting internet superserver: xinetd.
2021-09-08T09:04:55.821312517Z ### STARTING SITE
2021-09-08T09:04:56.496151043Z Preparing tmp directory /omd/sites/cmk/tmp...Starting mkeventd...OK
2021-09-08T09:04:56.808608737Z Starting liveproxyd...OK
2021-09-08T09:04:57.564430899Z Traceback (most recent call last):
2021-09-08T09:04:57.564566900Z   File "/omd/sites/cmk/lib/python3/cmk/cee/mknotifyd/main.py", line 282, in main
2021-09-08T09:04:57.564574201Z     with store.lock_checkmk_configuration():
2021-09-08T09:04:57.564577501Z   File "/omd/sites/cmk/lib/python3.8/contextlib.py", line 113, in __enter__
2021-09-08T09:04:57.564580601Z     return next(self.gen)
2021-09-08T09:04:57.564583901Z   File "/omd/sites/cmk/lib/python3/cmk/utils/store.py", line 58, in lock_checkmk_configuration
2021-09-08T09:04:57.564587001Z     aquire_lock(path)
2021-09-08T09:04:57.564589701Z   File "/omd/sites/cmk/lib/python3/cmk/utils/store.py", line 478, in aquire_lock
2021-09-08T09:04:57.564592401Z     fcntl.flock(fd, flags)
2021-09-08T09:04:57.564595001Z OSError: [Errno 9] Bad file descriptor

Filesystem in NFS after this:

nonroot@vsrv-swarmnfs01:/nfs/15k1/check_mk/basesite/data$ ll
drwxr-xr-x 7 nonroot nonroot 4096 Sep  8 09:04 cmk/
nonroot@vsrv-swarmnfs01:/nfs/15k1/check_mk/basesite/data/cmk$ ll
-rw-r--r--  1 nonroot nonroot 1091 Sep  8 09:04 .bashrc
-rw-r--r--  1 nonroot nonroot  809 Sep  8 09:04 .j4p
-rw-r--r--  1 nonroot nonroot   56 Sep  8 09:04 .modulebuildrc
-rw-r--r--  1 nonroot nonroot 2052 Sep  8 09:04 .profile
drwxr-xr-x  3 nonroot nonroot 4096 Sep  8 09:04 .version_meta/
lrwxrwxrwx  1 nonroot nonroot   11 Sep  8 09:04 bin -> version/bin
drwxr-xr-x 22 nonroot nonroot 4096 Sep  8 09:04 etc/
lrwxrwxrwx  1 nonroot nonroot   15 Sep  8 09:04 include -> version/include
lrwxrwxrwx  1 nonroot nonroot   11 Sep  8 09:04 lib -> version/lib
drwxr-xr-x  5 nonroot nonroot 4096 Sep  8 09:04 local/
lrwxrwxrwx  1 nonroot nonroot   13 Sep  8 09:04 share -> version/share
drwxr-xr-x 11 nonroot nonroot 4096 Sep  8 09:16 tmp/
drwxr-xr-x 14 nonroot nonroot 4096 Sep  8 09:05 var/
lrwxrwxrwx  1 nonroot nonroot   26 Sep  8 09:04 version -> ../../versions/2.0.0p9.cee

NFS Settings: /etc/exports

/nfs/15k1 10.1.10.0/24(rw,sync,no_subtree_check,no_root_squash)

I have been trying things out for days now and have gotten nowhere with this issue. Any help would be much appreciated.

Regards, Sintbert

I can’t help you.
But I have the same error here.

I’m having the same issue with Kubernetes. As soon as I enable an NFS persistent volume, I get exactly the same error messages.
Did you find a solution for this?

Hi @Sintbert, @Frosch and @reindan, welcome to the checkmk community.

This problem sounds very similar to the following post, where I described that NFS is not a good idea to run checkmk on. In your cases the problem could be that FIFO pipes cannot be created on NFS, because that is not supported there.

Liveproxyd tries to create the file ~site/tmp/run/live, which needs to be a FIFO pipe, and I assume this fails.
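
A quick way to check whether that is really the problem is to try exactly those two operations on a directory that lives on the NFS share: creating a FIFO and acquiring an exclusive flock. A minimal sketch; the test directory is just a placeholder and has to point into the mounted share (e.g. the volume backing /omd/sites):

# Minimal check: can FIFOs and flock() be used on the NFS mount?
# TEST_DIR is an assumption and must point into the mounted NFS share.
import fcntl
import os
import sys

TEST_DIR = sys.argv[1] if len(sys.argv) > 1 else "/mnt/nfs-test"  # placeholder path

# 1. Create a FIFO, like liveproxyd does for ~site/tmp/run/live.
fifo_path = os.path.join(TEST_DIR, "fifo-test")
try:
    os.mkfifo(fifo_path)
    print("mkfifo: OK")
    os.unlink(fifo_path)
except OSError as exc:
    print(f"mkfifo failed: {exc}")

# 2. Acquire an exclusive flock, like cmk.utils.store does when locking the configuration.
lock_path = os.path.join(TEST_DIR, "lock-test")
fd = os.open(lock_path, os.O_RDWR | os.O_CREAT)
try:
    fcntl.flock(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
    print("flock: OK")
    fcntl.flock(fd, fcntl.LOCK_UN)
except OSError as exc:
    print(f"flock failed: {exc}")
finally:
    os.close(fd)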


I also stumbled upon this issue and was able to get my test environment working using NFSv3 with the parameter local_lock=all.

Here is an example of a working persistent volume for Kubernetes:

apiVersion: v1
kind: PersistentVolume
metadata:
  name: checkmk-data-0
  namespace: checkmk
spec:
  capacity:
    storage: 16Gi
  volumeMode: Filesystem
  accessModes:
    - ReadWriteOnce
  persistentVolumeReclaimPolicy: Retain
  storageClassName: fast
  mountOptions:
    - hard
    - vers=3
    - nfsvers=3
    - local_lock=all
  nfs:
    path: /exporteddirectory/checkmk
    server: testfile.example.com
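
The same mount options should carry over to the Docker Swarm setup from the first post via the o option string of the local NFS volume driver. Here is an untested sketch using the Docker SDK for Python ([NFS-server address] is a placeholder, as in the first post); the same type/o/device keys can also be put under driver_opts in the compose file:

# Untested sketch: create the Swarm volume with NFSv3 and local_lock=all,
# mirroring the mountOptions from the Kubernetes example above.
# [NFS-server address] is a placeholder for the real NFS server.
import docker

client = docker.from_env()
client.volumes.create(
    name="basesite_data",
    driver="local",
    driver_opts={
        "type": "nfs",
        "o": "nfsvers=3,addr=[NFS-server address],rw,local_lock=all",
        "device": ":/15k1/check_mk/basesite/data",
    },
)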

Although it is working, using NFSv3 doesn’t feel right nowadays.
So I would really appreciate a solution with NFSv4.

It seems I have similar problems (although without containers); I’ll see whether an NFSv3 mount is possible.

I see two problems:

  1. OMD doesn’t tell you what’s going on; it just gives you an opaque error without explaining what the problem is (it says it’s starting Redis, but it seems it’s actually starting LiveProxy). Thankfully @tosch explains:

Liveproxyd tries to create the file ~site/tmp/run/live, which needs to be a FIFO pipe, and I assume this fails.

  2. OMD/Checkmk should probably work more “in tune” with the underlying system, rather than building a very complicated tree of directories and symlinks in /opt/omd. If, for example, it used the system /var/run directory for that FIFO pipe, this wouldn’t be an issue at all (either on real servers or in containers).

I can’t realistically see either of these problems being solved any time soon. (1) requires a lot of OMD rework (to my untrained eye, that’s going to be a big job), and (2) requires much the same, if not more.

A possible way forward would be to make the ~site/tmp/run directory a symlink to (say) /var/run/checkmk or something similar (that directory would need to be created, which could be done with a change to the systemd unit). The pipe would then always be created on a real filesystem and would correctly be cleaned up across restarts etc. Multi-site systems might need more thought.
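
Purely as an illustration of that idea (completely untested; the paths are assumptions based on the default layout of a single site named cmk):

# Untested illustration: move the site's run directory off NFS onto a real filesystem.
import os
import shutil

site_run = "/omd/sites/cmk/tmp/run"   # assumed location; currently lives on the NFS-backed volume
local_run = "/var/run/checkmk/cmk"    # would need to be recreated on every boot,
                                      # e.g. via the systemd unit mentioned above

os.makedirs(local_run, exist_ok=True)
if os.path.isdir(site_run) and not os.path.islink(site_run):
    shutil.rmtree(site_run)           # drop the directory that sits on NFS
if not os.path.lexists(site_run):
    os.symlink(local_run, site_run)   # FIFO pipes now end up on a local filesystem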

This assumes this is the only problem that’s caused by the file layout and use of NFS though.

Hi, I assumed it was something more complicated… I worked around it by using local storage on the Swarm host and pinning the container to that host via a node label, which destroys all the benefits of the Swarm setup…

version: '3.7'

services:
  basesite:
    image: registry.[domain]:5000/checkmk/check-mk-enterprise:2.0.0p15
    networks:
      - traefik-public
      - internal
    environment: 
      CMK_SITE_ID: "cmk"
      CMK_LIVESTATUS_TCP: "on"
      CMK_PASSWORD: "initialpassword"
      MAIL_RELAY_HOST: "[Mailhost]"
    volumes:
      - basesite_sites:/omd/sites
    deploy:
      placement:
        constraints:
          - node.labels.checkmk_basesite == true
      labels:
        - traefik.enable=true
        - traefik.docker.network=traefik-public
        - traefik.constraint-label=traefik-public
        - traefik.http.routers.checkmk_basesite.rule=Host(`checkmk.[domain]`)
        - traefik.http.routers.checkmk_basesite.entrypoints=websecure
        - traefik.http.routers.checkmk_basesite.tls=true
        - traefik.http.routers.checkmk_basesite.tls.options=default
        - traefik.http.routers.checkmk_basesite.service=checkmk_basesite
        - traefik.http.services.checkmk_basesite.loadbalancer.server.port=5000
