After Update to Checkmk 2.3.0p42: The site is currently locked by another activation process

Hey @annbe :waving_hand:

This is a known regression in 2.3.0 affecting distributed setups with parallel activations — you’re not alone: Bug: activation often hangs since 2.3.0

Quick cleanup when it’s stuck (as @andreas-doehler suggested :crossed_fingers:):

ls -lah ~/var/check_mk/background_jobs/ | grep activate
rm -R ~/var/check_mk/background_jobs/activate-changes-scheduler-<UUID>/
rm -R ~/tmp/check_mk/wato/activation/<old-UUID>/

Full procedure: https://checkmk.atlassian.net/wiki/spaces/KB/pages/9470533/

Diagnose the root cause (picking up @mimimi hints :crossed_fingers:):

# Watch what happens during activation on central:
tail -f ~/var/log/apache/access_log

# Check if ~/local is unexpectedly large (slow sync = timeout):
du -sh ~/local

# Test config generation speed on each remote:
time cmk -U

Workaround until this is fixed in a patch: :innocent:
Since master + one slave works fine, a simple sequential REST API activation script would bridge the gap:

#!/bin/bash
# Sequential site activation via Checkmk REST API
# Usage: ./activate_sites.sh

HOST_NAME="<your-central-site>"
SITE_NAME="<your-site-name>"
USERNAME="automation"
PASSWORD="<automation-secret>"
API_URL="https://$HOST_NAME/$SITE_NAME/check_mk/api/1.0"

SITES=("slave1" "slave2" "slave3")

for SITE in "${SITES[@]}"; do
  echo ">>> Activating site: $SITE ..."

  RESPONSE=$(curl -s -o /tmp/cmk_response.json -w "%{http_code}" \
    --request POST \
    --header "Authorization: Bearer $USERNAME $PASSWORD" \
    --header "Accept: application/json" \
    --header "Content-Type: application/json" \
    --data "{\"sites\": [\"$SITE\"], \"force_foreign_changes\": false}" \
    "$API_URL/domain-types/activation_run/actions/activate-changes/invoke")

  echo "HTTP Status: $RESPONSE"
  cat /tmp/cmk_response.json | python3 -m json.tool 2>/dev/null

  if [[ "$RESPONSE" -ge 400 ]]; then
    echo "!!! ERROR activating $SITE (HTTP $RESPONSE) — aborting."
    exit 1
  fi

  echo "--- Done with $SITE. Waiting 30s before next site..."
  sleep 30
done

echo "=== All sites activated successfully."

Adjust the sleep value depending on how long your sync takes. Not pretty, but it gets you through until there’s a proper fix. :slightly_smiling_face:

Hope that helps — would be great if you could share what access.log and du -sh ~/local show, so we can narrow down whether it’s a sync size or a true concurrency bug!

Perhaps also interesting for you @ mbunkus

Greetz Bernd

1 Like

more detailed script:

activate_sites.sh
#!/bin/bash
# =============================================================================
# Checkmk Sequential Site Activation Script
# =============================================================================
# Purpose:
#   Workaround for a known bug in Checkmk 2.3.0+ where activating multiple
#   remote sites simultaneously causes "site is currently locked by another
#   activation process" errors in distributed monitoring setups.
#
#   Instead of triggering all sites at once (which causes race conditions),
#   this script activates each site one by one via the REST API, with a
#   configurable wait period in between.
#
# Reference:
#   https://forum.checkmk.com/t/bug-activation-often-hangs-since-2-3-0/46795
#
# Requirements:
#   - curl
#   - python3 (for JSON pretty-printing)
#   - An automation user with sufficient privileges in Checkmk
#
# Usage:
#   chmod +x activate_sites.sh
#   ./activate_sites.sh
# =============================================================================

# -----------------------------------------------------------------------------
# CONFIGURATION
# Edit these variables to match your environment
# -----------------------------------------------------------------------------

# Hostname or IP of the Checkmk central server
HOST_NAME="<your-central-site>"

# Name of the Checkmk site (the OMD site name, e.g. "monitoring")
SITE_NAME="<your-site-name>"

# Credentials of the Checkmk automation user
# Find/create this user in Setup > Users > Automation users
USERNAME="automation"
PASSWORD="<automation-secret>"

# List of remote site names to activate sequentially
# These must match the site IDs shown in Setup > Distributed Monitoring
SITES=("slave1" "slave2" "slave3")

# Wait time in seconds between each site activation.
# Increase this if your sites have large configs or slow sync (check with: time cmk -U)
WAIT_SECONDS=30

# -----------------------------------------------------------------------------
# INTERNAL VARIABLES — do not edit below this line
# -----------------------------------------------------------------------------

# Construct the base REST API URL from host and site name
API_URL="https://$HOST_NAME/$SITE_NAME/check_mk/api/1.0"

# Temporary file to store the raw API response body for parsing
TMP_RESPONSE="/tmp/cmk_activation_response.json"

# -----------------------------------------------------------------------------
# HELPER FUNCTION: print a timestamped log line
# -----------------------------------------------------------------------------
log() {
  echo "[$(date '+%Y-%m-%d %H:%M:%S')] $1"
}

# -----------------------------------------------------------------------------
# MAIN LOOP: iterate over each site and activate it individually
# -----------------------------------------------------------------------------
log "Starting sequential activation for ${#SITES[@]} site(s)..."

for SITE in "${SITES[@]}"; do

  log ">>> Triggering activation for site: '$SITE'"

  # Send POST request to the activate-changes endpoint.
  #
  # Key points:
  #   -s              : silent mode (suppresses curl progress output)
  #   -o $TMP_RESPONSE: write response body to a temp file (separate from status code)
  #   -w "%{http_code}": output only the HTTP status code to stdout
  #
  # Headers:
  #   Authorization   : Bearer auth is the standard method for Checkmk REST API.
  #                     Do NOT use -u (Basic Auth) — that requires special Apache config.
  #                     See: https://docs.checkmk.com/latest/en/rest_api.html
  #   Accept          : Tell the API we expect a JSON response
  #   Content-Type    : We are sending a JSON request body
  #
  # Payload:
  #   sites                : Activate only this specific remote site
  #   force_foreign_changes: Set to false — do not activate changes made by other users.
  #                          Set to true only if you intentionally want to push foreign changes.
  HTTP_STATUS=$(curl -s \
    -o "$TMP_RESPONSE" \
    -w "%{http_code}" \
    --request POST \
    --header "Authorization: Bearer $USERNAME $PASSWORD" \
    --header "Accept: application/json" \
    --header "Content-Type: application/json" \
    --data "{\"sites\": [\"$SITE\"], \"force_foreign_changes\": false}" \
    "$API_URL/domain-types/activation_run/actions/activate-changes/invoke")

  # Log the raw HTTP status code returned by the API
  log "HTTP Status: $HTTP_STATUS"

  # Pretty-print the response body for readability.
  # If python3 is not available or the response is not valid JSON, this fails silently.
  log "API Response:"
  python3 -m json.tool "$TMP_RESPONSE" 2>/dev/null || cat "$TMP_RESPONSE"
  echo ""

  # Check if the HTTP status indicates an error (4xx or 5xx).
  # Common error codes from this endpoint:
  #   423 — site is locked by another activation process (the issue we're working around)
  #   422 — no pending changes to activate
  #   401 — authentication failed (check USERNAME / PASSWORD)
  #   404 — site not found (check SITE name)
  if [[ "$HTTP_STATUS" -ge 400 ]]; then
    log "!!! ERROR: Activation of '$SITE' failed with HTTP $HTTP_STATUS. Aborting."
    log "    Check the response above for details."
    # Clean up temp file before exiting
    rm -f "$TMP_RESPONSE"
    exit 1
  fi

  log "<<< Site '$SITE' activation triggered successfully."

  # Only wait if there are more sites left to process
  # (no need to wait after the last site)
  if [[ "$SITE" != "${SITES[-1]}" ]]; then
    log "    Waiting ${WAIT_SECONDS}s before activating next site..."
    sleep "$WAIT_SECONDS"
  fi

done

# -----------------------------------------------------------------------------
# CLEANUP & SUMMARY
# -----------------------------------------------------------------------------

# Remove temporary response file
rm -f "$TMP_RESPONSE"

log "=== All ${#SITES[@]} site(s) activated successfully."

Hello @mimimi ,

Hello @BH2005 ,

Thank you for your responses. :slight_smile:

I just tested what mimimi wrote and executed it, and here are the results. Unfortunately, I can’t assess whether this is significant or too lengthy. You’ll have to let me know. :wink:

Master:

cmk -U 
real    0m2.739s
user    0m2.477s
sys     0m0.254s

~/local --> 896K

Site 1:

cmk -U 
real    0m3.465s
user    0m3.067s
sys     0m0.250s

~/local --> 872K

Site 2:

cmk -U 
real    0m3.392s
user    0m3.077s
sys     0m0.293s

~/local --> 872K

Site 3:

cmk -U 
real    0m2.782s
user    0m2.514s
sys     0m0.257s

~/local --> 468K

I also performed an activation for all sites again and only saw the following call repeatedly appearing in the access logs.

Master:

Site:

One more question regarding the script. Does it already exist, and do I just need to make some adjustments, or do I need to create it from scratch? If so, where should I place it? I didn’t quite understand that (maybe I overlooked it).

Best regards,
Annett

on the master server create as site user the script:
nano activate_sites.sh

take a closer look to the detailed script there are more comments for the usage.

… the timing looks fine for me by the way …

1 Like

@annbe did you restart your Checkmk servers at any point in time?

Hello @robin.gierse ,

Yes, we have also restarted Checkmk. We have already upgraded to the newer patch version 43, but the behavior remains the same. For now, the workaround of activating the changes individually is still working. We hope that a fix for the issue will be available soon.

To everyone:
We also have an update to Checkmk 2.4 scheduled. Is it known whether this issue exists there as well or if it has been resolved?