REDO and UNDO – the ‘why’ behind them

Fantastic article by Jonathan Lewis on why Oracle couldn’t just use REDO to cover the purposes of both UNDO and REDO.


See all external directories in Oracle 10g

Took me a while to find this – I knew it had to be there but couldn’t find the view to query – so I thought I’d post this info.  Every data pump article I found listed how to create a directory, but not how to make sure there wasn’t already one set up for the directory you wanted.  Finally found a good article on logical directories.  To see all directories in the database, query the dictionary view ‘all_directories’.

select * from all_directories;

My Oracle Support article has a new article up on the ongoing issues with My Oracle Support, the replacement for Metalink.  I know I have spent 30+ minute writing up a ticket only to lose it when one out of many Draft saves failed, as well as gotten stuck in “you haven’t filled out required fields” loops that were impossible to exit.  I did try to submit a ticket on one of my issues, when there were clearly error messages from a botched site fix or upgrade showing on my page, but despite the errors clearly showing, they kept telling me to try new browsers and the like so I eventually gave up.  So their attempts to do technical support on their support site has also been spotty.  Now I just avoid submitting tickets from Firefox on my Mac now; IE8 on PC seems to be more stable.

I’m all for forward progress, but I’m still bemused by the architectural choice to make the site Flash based.  The basic actions of a support site are simple – find articles, submit and track tickets.  These activities can be enhanced by Flash but by no means require it, and the KISS principle is always best with web sites.  Oracle of course wants to create a support portal that does more than the basics, and acts more like a heads-up display, receiving info from your systems, etc etc.  But user uptake of that sort of feature was misjudged or worse, paternalistically viewed by the design team.  A support portal is a bad place to get fancy, because 80% of the time when I hit that site I’m already stressed, possibly panicked, and every second may count.  Adding features that I have to wait for to load or that make it harder to do the basics quickly and in a stable fashion is a bad idea.

Overall, Oracle seems to have lost sight of what a support portal is for … any site overhaul should start with the question, “What are our key services and how can we maintain and enhance them without degrading their user experience?”  That should be your guiding principle.  Then start tacking on new stuff.

ORA-01309: invalid session in dataguard logical standby

We ran across a fun one today with dataguard and a logical standby.  Second time we have seen it with the same fix. After making changes such as moving redo logs around or increasing TEMP (with log apply off) the redo/arch transport will correctly start and begin shipping data from the primary to the logical standby.  However, the log miner is not working and gives the error ORA-01309: invalid session in the alert log.  Stopping and restarting log apply do not work.  The error does not always happen with all configuration changes, only sometimes.  The error looks like this and happens immediately after the log apply setup:

LOGSTDBY status: ORA-16111: log mining and apply setting up
LOGSTDBY status: ORA-01309: invalid session

This metalink article gives us a clue.  Although it refers to Streams, it seems likely this is the same memory bug causing this issue, since the error is the same and the recommended fix – a database restart on the logical standby – works every time.

Another clue is that we do indeed see a parallel query process hung up, as the article says we will.  This is alerted on in the alert log (at least it’s easy to get the pid when it’s in the alert log) and we have to kill -9 it at the OS level to get the database to shutdown, even with shutdown immediate.  It looks like this in the alert log:

Active call for process 6244 user 'oracle' program 'oracle@dvpsor03 (P000)'
SHUTDOWN: waiting for active calls to complete.

The bug they list as the possible root cause does not sound like our culprit, but that doesn’t negate the similarities.

Once we got the database to go down cleanly, and restarted and turned on log apply, everything worked like a charm.

3/11/2010 edit: Experienced a 2nd set of symptoms with the same fix and likely the same cause today. A logical standby threw this error in the alert log:

LSP0: can’t recover from rollback of multi-chunk txn, aborting..

Research showed this means it encountered a deadlock in log apply. (Interesting note: a second logical standby hanging off the same primary did not encounter the same problem.) When this happens, it makes a “note to self” that it needs to do X then Y next time, then restarts log apply. However, the parallel process, P000, did not shut down properly. When it came back up, it was just reading the same block forever (over 2 hours). While P000 was hung, vmstat showed that memory was being slowly but steadily consumed. I had to “alter database abort logical standby apply” to get apply to stop. Then when I shut the database down, it hung. I killed P000 at the OS level, and shutdown completed immediately. Database and log apply both started up fine and the logical standby caught up.

04/30/2010 edit: we have now confirmed that if you do not shut down the database, but you instead shut down log apply, then wait, then kill -9 the P000 process, then start log apply, you can get away with not doing a database restart and things continue on fine.  We’ve used this at least 3 times now and it is consistent.  With a logical standby often users are accessing the database for reporting even if log apply is temporarily stopped, so it is handy that this works.