[Check_mk (english)] Distributed WATO

The site in question is running Ubuntu 14.04.3 LTS.

It may be useful to know that the other remote site (mentioned earlier in this thread, running the same configuration) is on Ubuntu 12.04.4 LTS. Could a different Apache version or packaging by the distribution be related to the problem?
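
For comparison (only a rough sketch, assuming the stock Ubuntu packages are in use), the packaged Apache version on each host can be checked with:

$ apache2 -v                       # 14.04 ships Apache 2.4.x, 12.04 ships 2.2.x
$ dpkg -l 'apache2*' | grep '^ii'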

Regards,

Paul

···

On 23/09/15 15:11, Marcel Schulte wrote:

Hi Paul,

Just a thought flying by... Which OS do you use on the remote/slave sites? Is SELinux enabled?

I read something similar last week, see this thread: http://lists.mathias-kettner.de/pipermail/omd-users/2015-September/001328.html

Regards,
Marcel

Paul Bongers <Paul.Bongers@osudio.com> wrote on Wed., 23 Sep. 2015 at 15:00:

    Hi list,

    Hoping to see any difference, I tried switching apache mode to shared.
    I still get the same errors, but this time I see a python
    traceback in the apache error log (with debug enabled).

    Hopefully this is helpfull on solving the issue.

    From the error log:

    [Wed Sep 23 12:55:41.293558 2015] [:error] [pid 11025:tid
    140673877477120] [client 192.168.99.220:65432] python_handler:
    Dispatch() returned non-integer.
    [Wed Sep 23 12:55:41.293622 2015] [mpm_event:debug] [pid 11025:tid
    140673877477120] event.c(992): (103)Software caused connection
    abort: [client 192.168.99.220:65432] AH00470: network write
    failure in core output filter
    Traceback (most recent call last):
      File
    "/omd/versions/1.2.6p10.cre/lib/python/mod_python/importer.py",
    line 1934, in ReportError
        req.write(text)

    Regards, Paul

    On 09/09/15 14:02, Paul Bongers wrote:

    Hi Marcel,

    All three sites are running Check_MK version 1.2.6p10 and have
    (default) omd version 1.2.6p10.cre.

    $ cmk --version |head -n1
    This is check_mk version 1.2.6p10

    # omd version
    OMD - Open Monitoring Distribution Version 1.2.6p10.cre

    I currently suspect that it has something to do with the Apache
    configuration on the remote site.

    This is what I found in the apache error log:
    [Wed Sep 09 11:23:02.438919 2015] [proxy_http:error] [pid
    32174:tid 139895238506240] (104)Connection reset by peer: [client
    192.168.99.220:10850] AH01095: prefetch request body failed to
    127.0.0.1:5000 (127.0.0.1) from 192.168.99.220 ()

    The timestamp of this entry matches the timestamp I found in the
    access log when the master site is trying to push the configuration.

    With Apache's log level increased to debug, I'm seeing this in
    the logs:
    [Wed Sep 09 11:29:21.597416 2015] [authz_core:debug] [pid
    10337:tid 139895447553792] mod_authz_core.c(828): [client
    192.168.99.220:63828] AH01628: authorization result: granted
    (no directives)
    [Wed Sep 09 11:29:21.597473 2015] [proxy:debug] [pid 10337:tid
    139895447553792] mod_proxy.c(1104): [client 192.168.99.220:63828]
    AH01143: Running scheme http handler (attempt 0)
    [Wed Sep 09 11:29:21.597480 2015] [proxy:debug] [pid 10337:tid
    139895447553792] proxy_util.c(2020): AH00942: HTTP: has acquired
    connection for (127.0.0.1)
    [Wed Sep 09 11:29:21.597484 2015] [proxy:debug] [pid 10337:tid
    139895447553792] proxy_util.c(2072): [client 192.168.99.220:63828]
    AH00944: connecting
    http://127.0.0.1:5000/<site_id>/check_mk/automation.py?command=push-snapshot&secret=%3BO%3FX3JG%3E6CC1%3DSHAMJHI%3FX%3A%40N8B0J%3E0U&siteid=<site_id>&mode=slave&restart=yes&debug=
    to 127.0.0.1:5000
    [Wed Sep 09 11:29:21.597556 2015] [proxy:debug] [pid 10337:tid
    139895447553792] proxy_util.c(2206): [client 192.168.99.220:63828]
    AH00947: connected
    /<site_id>/check_mk/automation.py?command=push-snapshot&secret=%3BO%3FX3JG%3E6CC1%3DSHAMJHI%3FX%3A%40N8B0J%3E0U&siteid=<site_id>&mode=slave&restart=yes&debug=
    to 127.0.0.1:5000
    [Wed Sep 09 11:29:21.597605 2015] [proxy:debug] [pid 10337:tid
    139895447553792] proxy_util.c(2610): AH00962: HTTP: connection
    complete to 127.0.0.1:5000 (127.0.0.1)
    [Wed Sep 09 11:29:21.633003 2015] [proxy_http:error] [pid
    10337:tid 139895447553792] (104)Connection reset by peer: [client
    192.168.99.220:63828] AH01095: prefetch request body failed to
    127.0.0.1:5000 (127.0.0.1) from 192.168.99.220 ()
    [Wed Sep 09 11:29:21.633019 2015] [proxy:debug] [pid 10337:tid
    139895447553792] proxy_util.c(2035): AH00943: HTTP: has released
    connection for (127.0.0.1)
    [Wed Sep 09 11:29:21.633095 2015] [mpm_event:debug] [pid
    10337:tid 139895447553792] event.c(992): (32)Broken pipe: [client
    192.168.99.220:63828] AH00470: network write failure in core
    output filter

    A web search turned up several hits suggesting that mod_proxy
    throws this error when the file upload (POST data) is too big.
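
    If such a limit were the culprit, a rough check (only a sketch;
    the paths assume the stock Ubuntu Apache plus the OMD site
    Apache) would be to grep both configurations, since the pushed
    snapshot is only about 72 KB:

    $ grep -Ri 'LimitRequestBody' /etc/apache2/ /omd/sites/<site_id>/etc/apache/

    LimitRequestBody defaults to 0 (unlimited), so a non-zero hit
    would be a strong hint; no hits would point elsewhere.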

    Regards, Paul

    On 09/09/15 12:19, Marcel Schulte wrote:

    Hi Paul,

    As already said, I have no remote sites... But I read about
    version differences causing problems. What versions are your
    master and slave sites at?

    * master site
    * working local slave
    * faulting remote slave

    Regards,
    Marcel

    Paul Bongers <Paul.Bongers@osudio.com> wrote on Wed., 9 Sep. 2015
    at 12:09:

        To be able to find more information on what's going wrong, I
        added a bit of code to wato.py so that the command used to
        push changes to the remote site was displayed in the error.
        Then I ran the command from the shell, adding some verbosity:

        OMD[main]:~$ curl -vv -b /dev/null -L -w " %{http_code}\n"
        -s -S -F
        snapshot=@/omd/sites/main/tmp/check_mk/sync-<site_id>.tar.gz
        "http://<remote_host>/<site_id>/check_mk/automation.py?command=push-snapshot&secret=%3BO%3FX3JG%3E6CC1%3DSHAMJHI%3FX%3A%40N8B0J%3E0U&siteid=<site_id>&mode=slave&restart=yes&debug="2>&1
        * Hostname was NOT found in DNS cache
        * Trying <ip>...
        * Connected to <remote_host> (<ip>) port 80 (#0)
        > POST
        /<site_id>/check_mk/automation.py?command=push-snapshot&secret=%3BO%3FX3JG%3E6CC1%3DSHAMJHI%3FX%3A%40N8B0J%3E0U&siteid=<site_id>&mode=slave&restart=yes&debug=
        HTTP/1.1
        > User-Agent: curl/7.35.0
        > Host: <remote_host>
        > Accept: */*
        > Content-Length: 72203
        > Expect: 100-continue
        > Content-Type: multipart/form-data;
        boundary=------------------------66e0b55bb4881b35
        >
        < HTTP/1.1 100 Continue
        * Recv failure: Connection reset by peer
        * Closing connection 0

         100
        curl: (56) Recv failure: Connection reset by peer

        Note that a local slave that is configured exactly the same
        way is updated just fine.
        What is going wrong here?

        Regards, Paul

        On 08/09/15 14:06, Paul Bongers wrote:

        I've opened up port 6557 on the firewall, but I still get
        an error when applying changes.
        The error message is: Error: HTTP Error - 100: curl: (56)
        Recv failure: Connection reset by peer

        Also, the remote shows up as dead in WATO, as long as I
        have Livestatus TCP disabled.
        Changing the connection to 'Connect via TCP' instead of
        'Use Livestatus Proxy-Daemon' doesn't change anything.

        For testing purposes I added another slave, that resides on
        the same network as the master. This slave has the same
        configuration as the remote one and is configured the same
        way on the master server. The local slave works just fine.

        Therefore, I get the impression that some other port(s)
        still need(s) to be opened.

        What am I missing here?
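
        For reference, a quick connectivity check from the master
        toward the remote host (just a sketch; the ports are the
        defaults seen in this thread) could be:

        $ nc -zv <remote_host> 80      # WATO pushes the snapshot over HTTP
        $ nc -zv <remote_host> 6557    # Livestatus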

        Configuration of the slave site:

        $ omd config show
        ADMIN_MAIL:
        APACHE_MODE: own
        APACHE_TCP_ADDR: 127.0.0.1
        APACHE_TCP_PORT: 5000
        AUTOSTART: on
        CORE: nagios
        CRONTAB: on
        DEFAULT_GUI: check_mk
        DOKUWIKI_AUTH: off
        LIVEPROXYD: off
        LIVESTATUS_TCP: on
        LIVESTATUS_TCP_PORT: 6557
        MKEVENTD: off
        MKNOTIFYD: on
        MULTISITE_AUTHORISATION: on
        MULTISITE_COOKIE_AUTH: on
        NAGIOS_THEME: classicui
        NAGVIS_URLS: auto
        NSCA: on
        NSCA_TCP_PORT: 5667
        PNP4NAGIOS: on
        TMPFS: on

        Slave configuration on the master site (retrieved from
        $OMD_HOME/etc/check_mk/liveproxyd.mk):

        sites = \
        {'site_name': {'cache': True,
                       'channel_timeout': 3.0,
                       'channels': 5,
                       'connect_retry': 4.0,
                       'heartbeat': (5, 2.0),
                       'query_timeout': 120.0,
                       'socket': ('remote_host_name', 6557)}}

        Regards,

        Paul

        On 08/09/15 08:57, Marcel Schulte wrote:

        Hi Paul,

        You have to enable Livestatus via TCP (the port defaults
        to 6557) on the remote site and allow firewall access to
        that port; see the sketch below.
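
        A rough sketch (assuming the default port; <master_ip> is a
        placeholder for the master's address, and the omd commands
        are run as the site user on the remote host):

        $ omd stop
        $ omd config set LIVESTATUS_TCP on
        $ omd config set LIVESTATUS_TCP_PORT 6557
        $ omd start

        # and on the remote host's firewall (as root), e.g. with iptables:
        iptables -A INPUT -p tcp -s <master_ip> --dport 6557 -j ACCEPT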

        HTH,
        Marcel

        Paul Bongers <Paul.Bongers@osudio.com> wrote on Tue., 8 Sep.
        2015 at 08:50:

            Hi list,

            I'm trying to set up distributed WATO on a new server.
            However, I'm running into trouble as the remote site
            is running on a machine behind a restricted firewall.

            What ports should be opened up to make this possible?

            Both sites are running OMD 1.2.6p10.
            I'm planning to use liveproxyd for accessing
            livestatus data.

            --

            Met vriendelijke groet / Best regards,

            Paul Bongers

            Application Engineer
