Oracle GoldenGate v11 and RMAN-08137

This is worth a post, despite my dearth of time to post lately … just had an interesting one and found the perfect blog article to resolve my issue, so definitely need to send a shoutout to the author, Andy Black.

In a nutshell, v11 of GG now uses the same sort of “delete only when applied” method that logicals do.  v10 does not.  If you create an extract and don’t delete it, it thinks the archive logs are still needed  and won’t delete them when you do an rman backup.  I did have the extract enabled for a bit, I think that is also needed to create this situation, not just an unused extract.

This of course can provide no end of confusion if you have a primary database with both dataguard and goldengate running, and you don’t know GG v11 made this change.  I was so confused as to why my logicals weren’t telling the primary that they were long done with these logs.  Turns out GG was the holdup.

Andy provides more than enough info to fix the problem, and I also left a comment there about a specific situation and how to fix it (quick ref – make sure to dblogin before you delete an extract, or it will leave a record in db_capture and you’ll end up with RMAN-08137 issues).

Gotta love Oracle / DBA work, learn something new every day.

UPDATE 12/14: we had the same problem but in a restored non-prod copy of the database. Even after installing goldengate and trying to use ggsci to remove the extract, it wouldn’t budge from dba_capture. So I put in an oracle SR. They came back with this oracle article (1351352.1) and this DBMS to remove things manually from dba_capture – DBMS_CAPTURE_ADM.DROP_CAPTURE.  I used the following:

select 'exec DBMS_CAPTURE_ADM.DROP_CAPTURE ('''||capture_name||''');' from dba_capture;

then run whichever bits of the output need to be removed.

10.2.04 Data guard logical standbys and dbms_scheduler

Learned an interesting one today.  I noticed that the nightly stats job was not running on any of our data guard logical standbys.  It would not run despite running a complete remove/reset the scheduled maintenance window, resetting the parameters for the job, etc.  A fellow DBA tipped me off to this issue which applies to 10.2.0.4 logical standbys (enterprise ed, any platform).  Apparently dbms_scheduler just doesn’t run on logicals in 10.2.0.4 without a fix.

To resolve it you can set an event as follows:

stop log apply
alter system set event = ‘10793 trace name context forever, level 2’ scope=spfile;
restart the database and restart log apply

I have yet to be able to implement this so YMMV.  We will probably put it in place over the next few weeks.

Data Guard standby redo logs / resizing the primary’s redo logs

Data Guard standby redo logs are worth a few comments.  The misconception that there have to be the exact same number of standby redo logs as there are primary redo logs seems to come up often.  In fact, the size has to be the same, but the standby should always have at least one more standby redo log group than the primary does.  As well, if you are using threads/RAC on the primary, then you need to have one extra standby redo log group for each primary thread.

For example, a RAC database with three threads and 3 log groups in each thread would have 9 redo logs.  A physical standby hanging off this primary database would then need to have a minimum of 12 standby redo logs.  A logical standby would as well, but could be given even more standby redo logs to help with realtime log apply.

If you experience a large load/throughput on your primary database, and you do not have enough standby redo logs, you may run into this error:

RFS[89]: No standby redo logfiles of size 203806 blocks available

This is not to be confused with the error received when there is no standby redo log that matches the primary redo log’s size.

This section of the data guard documentation explains how to configure standby redo logs.

To see your standby redo logs, you use a special v$ view, v$standby_log.  The only thing about this view to note is that “UNASSIGNED” means the same thing as “INACTIVE” does in the v$log view of normal redo logs.

Recently due to log file switch contention, I had to increase the size and number of the primary’s redo logs, which then meant doing the same to the standby logs.  Since they are 10g logical standbys, they are open and also have their own redo logs.  Those logs did not have to be modified.

To make primary redo log changes:
-- prod and standbys: set STANDBY_FILE_MANAGEMENT to MANUAL so file changes
-- are not transferred to standbys
 alter system set STANDBY_FILE_MANAGEMENT=MANUAL scope=both;
-- standbys: stop log apply
 ALTER DATABASE STOP LOGICAL STANDBY APPLY;
-- add new redo groups/logs - first a set of new ones,
-- then log switch until the database
-- is using the new ones, then drop the old groups and
-- replace them with the new size/naming
-- and the old group number
alter database add logfile group 4 ('/u03/oradata/MYDB/redo04.log') size 512M;
 -- repeat until the new logs are active/current and the old ones are not
 alter system switch logfile;
 -- then:
 alter system checkpoint;
 select * from v$log;
 -- can't drop a logfile group until it is archived
 alter system archive log group X;
 -- drop the old groups
 alter database drop logfile group 1;
-- on the file system:
## clean up the old files
 mv /u03/oradata/MYDB/redo101.log /u03/oradata/MYDB/old
alter database add logfile group 1 ('/u03/oradata/MYDB/redo01.log') size 512M;

---------------------------------------------------------------------
-- logical standbys - run on each - run on standby db unless noted
---------------------------------------------------------------------

-- stop log apply again to be sure
ALTER DATABASE STOP LOGICAL STANDBY APPLY;
ALTER DATABASE ADD STANDBY LOGFILE GROUP 101 
('/u03/oradata/MYDB/stbyredo101.log') SIZE 512M;
-- repeat log switches on PRODUCTION as necessary for standby redo
-- logs to become "unassigned"
alter system switch logfile;
-- remove one standby redo log at a time, then replace it with the
-- renamed log
ALTER DATABASE DROP STANDBY LOGFILE GROUP 10;

-- on the file system:
## clean up the old files
mv /u03/oradata/MYDB/redo10.log /u03/oradata/MYDB/old
-- rename the online redo files on the standby and add more logs to match prod
-- online redo in standbys is also overtaxed by reporting activity and log apply
alter database add logfile group 4 ('/u03/oradata/MYDB/redo04.log') size 512M;
-- issue this in the STANDBY db as often as needed to free up the online redo files
alter system switch logfile;
alter database drop logfile group 1;

-- on the file system:
alter database add logfile group 1
('/u03/oradata/MYDB/redo01.log') size 512M;
## clean up the old files
mv /u03/oradata/MYDB/redo101.log /u03/oradata/MYDB/old
-- in Production: set STANDBY_FILE_MANAGEMENT back to AUTO
alter system set STANDBY_FILE_MANAGEMENT=AUTO scope=both;

-- may want to restart the standbys here due to possible bug.
-- start log apply in standbys
ALTER DATABASE START LOGICAL STANDBY APPLY IMMEDIATE;

One minor tricky thing — while the add group syntax uses parentheses, the drop logfile member syntax does not:

ALTER DATABASE DROP LOGFILE MEMBER '/u03/oradata/MYDB/redo10.log';  -- does not work
                                                                       with ()

ORA-01309: invalid session in dataguard logical standby

We ran across a fun one today with dataguard and a logical standby.  Second time we have seen it with the same fix. After making changes such as moving redo logs around or increasing TEMP (with log apply off) the redo/arch transport will correctly start and begin shipping data from the primary to the logical standby.  However, the log miner is not working and gives the error ORA-01309: invalid session in the alert log.  Stopping and restarting log apply do not work.  The error does not always happen with all configuration changes, only sometimes.  The error looks like this and happens immediately after the log apply setup:

LOGSTDBY status: ORA-16111: log mining and apply setting up
LOGSTDBY status: ORA-01309: invalid session

This metalink article gives us a clue.  Although it refers to Streams, it seems likely this is the same memory bug causing this issue, since the error is the same and the recommended fix – a database restart on the logical standby – works every time.

Another clue is that we do indeed see a parallel query process hung up, as the article says we will.  This is alerted on in the alert log (at least it’s easy to get the pid when it’s in the alert log) and we have to kill -9 it at the OS level to get the database to shutdown, even with shutdown immediate.  It looks like this in the alert log:

Active call for process 6244 user 'oracle' program 'oracle@dvpsor03 (P000)'
SHUTDOWN: waiting for active calls to complete.

The bug they list as the possible root cause does not sound like our culprit, but that doesn’t negate the similarities.

Once we got the database to go down cleanly, and restarted and turned on log apply, everything worked like a charm.

3/11/2010 edit: Experienced a 2nd set of symptoms with the same fix and likely the same cause today. A logical standby threw this error in the alert log:

LSP0: can’t recover from rollback of multi-chunk txn, aborting..

Research showed this means it encountered a deadlock in log apply. (Interesting note: a second logical standby hanging off the same primary did not encounter the same problem.) When this happens, it makes a “note to self” that it needs to do X then Y next time, then restarts log apply. However, the parallel process, P000, did not shut down properly. When it came back up, it was just reading the same block forever (over 2 hours). While P000 was hung, vmstat showed that memory was being slowly but steadily consumed. I had to “alter database abort logical standby apply” to get apply to stop. Then when I shut the database down, it hung. I killed P000 at the OS level, and shutdown completed immediately. Database and log apply both started up fine and the logical standby caught up.

04/30/2010 edit: we have now confirmed that if you do not shut down the database, but you instead shut down log apply, then wait, then kill -9 the P000 process, then start log apply, you can get away with not doing a database restart and things continue on fine.  We’ve used this at least 3 times now and it is consistent.  With a logical standby often users are accessing the database for reporting even if log apply is temporarily stopped, so it is handy that this works.