Oracle RAC Monitoring: Sporadic message "Login into Database failed"

I am facing an issue with Oracle monitoring: a particular Oracle instance and its checks vanish with the message “Login into Database failed”, and after some time all these UNKNOWN checks come back in the inventory, typically on the 3rd or 4th re-discovery.

What can I check here?

Are you using mk_oracle or mk_oracle.ps1?

What is the Oracle version of your monitored database? Can you reproduce your observation on various hosts with an Oracle database?

I am using mk_oracle. The version is 12.2.0.1.0. Specifically, the ORA Instance service is flapping: it shows “Unknown = Item not found in monitoring data” and after some time it is OK again.

Has there been any maintenance for the database recently?

Maybe you should try the standard debugging / diagnostic routines provided for mk_oracle as described in docs.checkmk.com. Are there any informative hints in the log file when running with option -l?

Yes, I already tried this. Monitoring works fine except for the Oracle Instance service check, which flaps sporadically between UNKNOWN and OK.

Is the sporadic UNKNOWN completely random, or is there a certain interval?

Can you please show us which sections of the plugin you have defined as synchronous and which as asynchronous?
If you check, for example, the Oracle jobs on every call of the agent and don’t cache the result, your queries against the database may be too slow to return an answer within a monitoring cycle. This could cause a complete instance to be shown as stale or unknown.
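
One rough way to see whether a plugin run fits into a monitoring cycle is to time it by hand. The following is only a sketch: `run_timed` is a hypothetical helper, not part of Checkmk, and both the plugin path and the 60-second budget in the usage comment are assumptions that need to be adapted to your installation and check interval.

```shell
# Hypothetical helper: run a command and report whether it finished
# within a time budget (whole seconds, so the resolution is coarse).
# $1 = budget in seconds, remaining arguments = command to run
run_timed() {
    budget=$1
    shift
    start=$(date +%s)
    "$@" >/dev/null 2>&1          # output is discarded, only the duration matters
    elapsed=$(( $(date +%s) - start ))
    if [ "$elapsed" -gt "$budget" ]; then
        echo "SLOW: took ${elapsed}s (budget ${budget}s)"
        return 1
    fi
    echo "OK: took ${elapsed}s (budget ${budget}s)"
}

# Example call (path and budget are assumptions):
# run_timed 60 /usr/lib/check_mk_agent/plugins/mk_oracle
```

If the run regularly comes close to or exceeds the check interval, moving expensive sections to the asynchronous list is the usual next step.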

It's random and doesn’t happen at a particular time of day.

Hi @tosch
I installed the mk_oracle plugin and the mk_oracle.cfg via puppet. I have not defined anything.

By default the jobs section runs synchronously. Can you check the mk_oracle.cfg and post the values of SYNC_SECTIONS and ASYNC_SECTIONS?
If your instance changes to unknown/stale, can you save your cached data from ~site/tmp/check_mk/cache/<host>? This could help analyse the problem.
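
Saving the cached data could be done with a small helper like the one below. This is a minimal sketch: `snapshot_cache` is a hypothetical function, and the destination directory and host name in the usage comment are placeholders. It keeps a timestamped copy so that several UNKNOWN episodes can be compared later.

```shell
# Hypothetical helper: keep a timestamped copy of a host's cached agent data.
# $1 = cache directory (e.g. ~site/tmp/check_mk/cache, as mentioned above)
# $2 = host name as known to the monitoring site
# $3 = destination directory for the copies (placeholder, choose your own)
snapshot_cache() {
    cache_dir=$1
    host=$2
    dest=$3
    if [ ! -e "$cache_dir/$host" ]; then
        echo "no cache entry for $host in $cache_dir" >&2
        return 1
    fi
    mkdir -p "$dest" || return 1
    # -a preserves timestamps, which helps to see how old the data was
    cp -a "$cache_dir/$host" "$dest/$host.$(date +%Y%m%d-%H%M%S)"
}

# Example call as site user (host name and target directory are placeholders):
# snapshot_cache "$OMD_ROOT/tmp/check_mk/cache" myOracleHost "$HOME/oracle-debug"
```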

In mk_oracle.cfg, I have only:

DBUSER='check_mk:xxxxxxxx'

Which sections should I define?

I forgot to mention an important detail: I am not running mk_oracle directly on the database host. Instead, I run it on a separate host and piggyback the results to Checkmk.

This is what my mk_oracle.cfg looks like:

REMOTE_ORACLE_HOME="/usr/lib/oracle/18.5/client64"
TNS_ADMIN="/usr/lib/oracle/18.5/client64/lib/network/admin"
REMOTE_INSTANCE_1='check_mk:mypassword::myRemoteHost:1521:myOracleHost:MYINST3:11.2:MYINST3'

Could it also be a problem that I have client 18.5 but want to monitor 11.2?

You said your instance is running 12.2.0.1? Setting a fixed version in your config file could be a problem. In my opinion you shouldn’t set this version; let the plugin detect the version directly.
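
To see which value ends up in the version position, the REMOTE_INSTANCE string can be split at its colons. The sketch below is an illustration only: `parse_remote_instance` and the `ri_*` variable names are hypothetical, and the field order (user, password, role, host, port, piggyback host, SID, version, TNS alias) is an assumption inferred from the example value in this thread, so please compare it with the documentation. It also assumes the password contains no colon.

```shell
# Hypothetical helper: split a REMOTE_INSTANCE value into its colon-separated fields.
# The field names below are an assumption, not taken from the plugin itself.
parse_remote_instance() {
    # IFS is changed only for this one read call; empty fields stay empty
    IFS=':' read -r ri_user ri_password ri_role ri_host ri_port \
        ri_piggyback ri_sid ri_version ri_alias <<EOF
$1
EOF
}

parse_remote_instance 'check_mk:mypassword::myRemoteHost:1521:myOracleHost:MYINST3:11.2:MYINST3'
echo "SID: '$ri_sid', version field: '$ri_version'"
```

An empty version field (two consecutive colons after the SID) would then show up as an empty string here.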

Thanks @tosch for your help so far.
So, I should do this:

REMOTE_INSTANCE_1='check_mk:mypassword::myRemoteHost:1521:myOracleHost:MYINST3::MYINST3'

And I did that. So far, I don’t see any UNKNOWN. I will keep monitoring it. Do I still need the SYNC/ASYNC sections, given that I am calling mk_oracle from this 3rd server, which in turn polls everything from the RAC?

The sync and async sections have default values. We just found out that the statements for some checks are resource-heavy and can put unnecessary stress on the database, as I said before with the Oracle jobs. These sections also aren’t that important, at least for us, so we decided to cache them for 10 minutes and reduce the stress on the database.
Our sync and async sections are defined like this:

# Sections to run in foreground and wait for the result
SYNC_SECTIONS='instance dataguard_stats logswitches longactivesessions performance processes recovery_area recovery_status sessions undostat'

# Sections to run in the background, at a slower interval cached
ASYNC_SECTIONS='locks resumable rman tablespaces ts_quotas'

# Sections to run in foreground for ASM
SYNC_ASM_SECTIONS='asm_diskgroup instance processes'

# Sections to run in the background for ASM
ASYNC_ASM_SECTIONS=''

# Cache time (i.e. check interval) for async sections
CACHE_MAXAGE=600

Thanks @tosch
Since I use the mk_oracle plugin on a separate server which fetches all the information from the Oracle RAC remotely, should I still use these sections?

I think so, because with the remote function you get additional latency for your check results. Please check the documentation first: the section names may differ for remote Oracle instances, or you may need some additional configuration parameters.

I checked it already and the names are the same. I also tried putting all the sections under ASYNC_SECTIONS, but no luck so far.

Could this also be because I use the Oracle 18.5 client on my 3rd server while my Oracle RAC version is 11.2?