10.2.04 Data guard logical standbys and dbms_scheduler

Learned an interesting one today.  I noticed that the nightly stats job was not running on any of our data guard logical standbys.  It would not run despite running a complete remove/reset the scheduled maintenance window, resetting the parameters for the job, etc.  A fellow DBA tipped me off to this issue which applies to logical standbys (enterprise ed, any platform).  Apparently dbms_scheduler just doesn’t run on logicals in without a fix.

To resolve it you can set an event as follows:

stop log apply
alter system set event = ‘10793 trace name context forever, level 2’ scope=spfile;
restart the database and restart log apply

I have yet to be able to implement this so YMMV.  We will probably put it in place over the next few weeks.

ORA-01309: invalid session in dataguard logical standby

We ran across a fun one today with dataguard and a logical standby.  Second time we have seen it with the same fix. After making changes such as moving redo logs around or increasing TEMP (with log apply off) the redo/arch transport will correctly start and begin shipping data from the primary to the logical standby.  However, the log miner is not working and gives the error ORA-01309: invalid session in the alert log.  Stopping and restarting log apply do not work.  The error does not always happen with all configuration changes, only sometimes.  The error looks like this and happens immediately after the log apply setup:

LOGSTDBY status: ORA-16111: log mining and apply setting up
LOGSTDBY status: ORA-01309: invalid session

This metalink article gives us a clue.  Although it refers to Streams, it seems likely this is the same memory bug causing this issue, since the error is the same and the recommended fix – a database restart on the logical standby – works every time.

Another clue is that we do indeed see a parallel query process hung up, as the article says we will.  This is alerted on in the alert log (at least it’s easy to get the pid when it’s in the alert log) and we have to kill -9 it at the OS level to get the database to shutdown, even with shutdown immediate.  It looks like this in the alert log:

Active call for process 6244 user 'oracle' program 'oracle@dvpsor03 (P000)'
SHUTDOWN: waiting for active calls to complete.

The bug they list as the possible root cause does not sound like our culprit, but that doesn’t negate the similarities.

Once we got the database to go down cleanly, and restarted and turned on log apply, everything worked like a charm.

3/11/2010 edit: Experienced a 2nd set of symptoms with the same fix and likely the same cause today. A logical standby threw this error in the alert log:

LSP0: can’t recover from rollback of multi-chunk txn, aborting..

Research showed this means it encountered a deadlock in log apply. (Interesting note: a second logical standby hanging off the same primary did not encounter the same problem.) When this happens, it makes a “note to self” that it needs to do X then Y next time, then restarts log apply. However, the parallel process, P000, did not shut down properly. When it came back up, it was just reading the same block forever (over 2 hours). While P000 was hung, vmstat showed that memory was being slowly but steadily consumed. I had to “alter database abort logical standby apply” to get apply to stop. Then when I shut the database down, it hung. I killed P000 at the OS level, and shutdown completed immediately. Database and log apply both started up fine and the logical standby caught up.

04/30/2010 edit: we have now confirmed that if you do not shut down the database, but you instead shut down log apply, then wait, then kill -9 the P000 process, then start log apply, you can get away with not doing a database restart and things continue on fine.  We’ve used this at least 3 times now and it is consistent.  With a logical standby often users are accessing the database for reporting even if log apply is temporarily stopped, so it is handy that this works.