Feeds:
Posts
Comments

Nasty one. This article sums it up quite well.  Also see this Oracle doc (also referenced in Jeremy Schneider’s blog post): Adaptive Switching Between Log Write Methods can Cause ‘log file sync’ Waits (Doc ID 1462942.1).

The Oracle doc’s list of symptoms were a perfect match for us, except we’re on 11.2.0.3.7.  Our Oracle support rep told us it is fixed in 11.2.0.4.

  • On Solaris, using Oracle 11.2.0.3.1
  • The log writer process on the database is blocking most of the other sessions and taking an average wait time of 10ms to commit.
  • On checking the hardware, the I/O wait time is very small from the hardware side
  • This means that all the time is on CPU processing the request.
  • Database is OLTP-based and has high commits/second
  • The same issue did not occur in 10g [Tyler's note: or 11.2.0.2].
  • The problem seems to ‘go away’ temporarily after the database is restarted but ultimately returns after a few days.
  • AWR may show ‘log file sync’ waits as the highest wait

Verified issue in two databases upgraded from 11.2.0.2 to 11.2.0.3 apprx two months before the issue occurred. It can start “suddenly” as in one night during/after a large data load, log file sync jumped from 3ms to 11ms and stayed there, consistently, until we restarted the database days later. Perfect example, high log file sync, low log file parallel write, pwrites() in dtruss lined up perfectly with log file parallel writes, and no change to log file parallel write time when log file sync spiked up to its ‘new normal’. A database restart fixes it, but as this Oracle doc that the above article quotes (and our Oracle engineer gave us as well – 1462942.1) says, it comes back. The bullet points at the beginning of that doc were a 100% match for our experience. We had the Solaris kernel folks engaged as well, they were convinced it was memory fragmentation, then Oracle finally suggested this after days of round and round about non-matching ‘fixes’. Our tech said it is fixed in 11.2.0.4. The doc says 11.2.0.3.1, we’re on 11.2.0.3.7, Solaris 10.

We also found that while you occasionally see this switch back and forth in your LGWR trace file, and that does not mean you have the issue (necessarily), when it jumps up suddenly like it did for us, it’s like a switch was flipped and then it is “stuck.”  After it jumped up, it no longer switched from polling to post/wait occasionally, until we restarted the database.  Before that, it would happen once in a while.

Since I had a considerable amount of ‘fun’ figuring this out, I thought I’d share.  I do some operational graphing (mount size over time, backup duration over time) in a bottle/mako/bootstrap/highcharts based portal, and develop locally before pushing it out with git.  Locally I use virtualenv with virtualenvwrapper (highly recommended). Connecting to Oracle from python is relatively simple using cx_Oracle, however if you’re still using the 32bit InstantClient and you’re using 64bit python, you won’t be able to use cx_Oracle.  The easy answer is to update your instantclient install to the 64bit version that came out January 2013 (huge thread by folks who were waiting for it).  There is a workaround (namely, using lipo to remove the 64bit version from the python-universal you’ll find in your virtualenv’s python binaries folder), but that way is far more painful.  Most of the info out there in various forums still tells you that using 64bit python and cx_Oracle is impossible, but with the latest instantclient that is no longer the case.

64bit client instructions:

  1. Download 11.2.0.3 64bit instantclient.  You need BASIC + sdk.  NOTE: Do not get the lite instantclient, or you won’t be able to connect to databases using certain character sets.  It will fail with a really descriptive error (not) … “cx_Oracle.DatabaseError: ORA-28547: connection to server failed, probable Oracle Net admin error.”  Sqlplus from your machine will fail the same way.
  2. Put it in a dir (as root). I put mine here:
    /usr/local/lib/instantclient64-11203
  3. Set paths:
    export ORACLE_HOME=/usr/local/lib/instantclient64-11203
    export DYLD_LIBRARY_PATH=/usr/local/lib/instantclient64-11203

    … and add $ORACLE_HOME to your path.

  4. Create these two softlinks in the instantclient dir:
    lrwxr-xr-x   1 root         admin    36B Jul 24 17:17 lib -> /usr/local/lib/instantclient64-11203
    lrwxr-xr-x   1 root         admin    20B Jul 24 17:18 libclntsh.dylib -> libclntsh.dylib.11.1
  5. Now you should be able to install cx_Oracle.  Make sure you’re in the correct virtualenv (if you have virtualenvwrapper, type “workon” and the name of your virtualenv).  Then check whether you’re using 32 or 64bit python right now:
    python -c 'import struct;print( 8 * struct.calcsize("P"))'

    64   <– shows we’re using 64bit python

  6. Do pip uninstall cx_Oracle first if you had installed previously when in 32bit mode … or just to be sure.  Then:
    pip install cx_Oracle
  7. Test:
    import cx_Oracle
    try:
     con = cx_Oracle.connect('username/blah@dvcadb09.capella.edu/NOTHERE')
    finally:
     try:
       con.close()
     except:
       pass
  8.  Good error (it got to the listener):(uops-dbas-sysportal)uops-dbas-sysportal> python test.pyTraceback (most recent call last):

    File “test.py”, line 2, in <module>
    con = cx_Oracle.connect(‘username/blah@myserver/NOTHERE’)
    cx_Oracle.DatabaseError: ORA-12514: TNS:listener does not currently know of service requested in connect descriptor

  9. If you plan to use tnsnames.ora make sure to also add:
    export TNS_ADMIN=/my/path/to/tnsnames.ora

If you can’t make this work, you can run everything in 32bit mode; many people have explained that process well, try this: using lipo to remove the 64bit version from the python binary.  Here’s roughly what it looks like:

(uops-dbas-sysportal)bin> python -c 'import struct;print( 8 * struct.calcsize("P"))'
64
(uops-dbas-sysportal)bin> echo $WORKON_HOME
/Users/username/.virtualenvs
(uops-dbas-sysportal)bin> pwd
/Users/username/.virtualenvs/uops-dbas-sysportal/bin
(uops-dbas-sysportal)bin> pwd
/Users/username/.virtualenvs/uops-dbas-sysportal/bin
(uops-dbas-sysportal)bin> file python
python-universal: Mach-O universal binary with 2 architectures
python-universal (for architecture x86_64):Mach-O 64-bit executable x86_64
python-universal (for architecture i386):Mach-O executable i386
(uops-dbas-sysportal)bin> cp python python-universal
(uops-dbas-sysportal)bin> lipo python-universal -thin i386 -output python
(uops-dbas-sysportal)bin> file python
python: Mach-O executable i386
(uops-dbas-sysportal)bin> python -c 'import struct;print( 8 * struct.calcsize("P"))'
32
(uops-dbas-sysportal)bin> deactivate
bin> python -c 'import struct;print( 8 * struct.calcsize("P"))'
64

This one is worth sharing: http://richardfoote.wordpress.com/2013/05/08/important-clustering-factor-calculation-improvement-fix-you/

HOTSOS!

So, yeah, kinda let this die for a while. Hope to post some recent work soon. Just a quick note that I’m at HOTSOS13 and looking forward to a few days of stuffing my brain! I’m on twitter as @tylervanvierzen if you happen to be there too.

Mining listener logs

Arup Nanda’s examples for mining listener logs were incredibly useful today.  Definitely worth a look; it doesn’t take long to implement the basics from the first PDF listed.

We used this extensively in troubleshooting a connection storm-y application while the app admins tuned the web server config to fix the storms.

I assisted with a GoldenGate 11.1 -> 11.2 upgrade today. It was pretty hairy — a lot of small but important changes.  One thing I wanted to toss out here, because it cost me a few hours, is that there is a deprecation/error handling change missing from the GoldenGate 11.2 documentation. In previous versions, whether it paid attention to it or not, for REPLICAT we were able to have the EXTTRAIL command listed in both the initial setup (ie. using an obey file) AND in the PRM file. As of 11.2, if you have it listed in the PRM file, it gives you an error that is really difficult to understand (at least when you are under the mistaken impression that it HAS to be in the replicat’s prm file … clearly it does not).  (Also submitted this to Oracle Support as a suggested metalink note, since this OGG error is not searchable on their site.)

The error message is:

2012-05-18 09:52:32 ERROR OGG-00360 Oracle GoldenGate Delivery for Oracle, rohcsd41.prm: EXT TRAIL/FILE cannot be used with REPLICAT GROUP.

and the fix is:

Remove or comment out any EXTTRAIL commands in the PRM file. EXTTRAIL should be set with an earlier configuration command or in the obey file, not in the PRM file. (I’m assuming this has always been the case and in 11.2 they just decided to start throwing an error about this.)

Example from the .prm file:

– **** THIS IS DEPRECATED IN 11.2 PRM FILES FOR REPLICAT PRM FILES!!! ****
–EXTTRAIL ./dirdat/myserver14/HCS/rohcsd41/e1

Figured this out by trial and error, actually while I was compiling a convincing “you have a bug” log for Oracle Support prior to opening an SR.  :)  The odds of hitting this are small, but it’s worth putting a post out there just in case.  Certainly drove us nuts for a while.

GoldenGate DDL include/exclude statements look something like this:

 DDL INCLUDE MAPPED &
 EXCLUDE ALL INSTR 'TRIGGER' &
 EXCLUDE ALL INSTR 'GRANT' &
 EXCLUDE ALL OPTYPE 'ALTER' &
 OBJTYPE 'SESSION'
;

My goal was to exclude all trigger operations.  I tried both OPTYPE/OBJTYPE (for ‘create’ and ‘trigger’) and INSTR in a one-table extract/pump/replicat setup, doing the DDL excludes in the replicat file.  It complete ignored whatever I did.  Finally, I removed a line leftover from testing, which was commented out as follows:

--DDL INCLUDE ALL &   <--- this line
 DDL INCLUDE MAPPED &
 EXCLUDE ALL INSTR 'TRIGGER' &
 EXCLUDE ALL INSTR 'GRANT' &
 EXCLUDE ALL OPTYPE 'ALTER' &
 OBJTYPE 'SESSION'
;

Well it turns out that in this particular formatting case, GoldenGate ignores your comment “–” and parses the “INCLUDE ALL” and then ignores the excludes.  After a bunch of testing to try and find something that would actually work to exclude TRIGGER operations, removing this comment:

--DDL INCLUDE ALL &

was the ‘magic’ fix.  Now triggers operations are correctly excluded when using INSTR. However, it still does not work to do the following:

DDL INCLUDE MAPPED &
 EXCLUDE ALL OPTYPE 'CREATE' & 
             OBJTYPE 'TRIGGER' 
;

CREATE TRIGGER DDL was still replicated with this configuration in the replicat.  But at least INSTR works.

Follow

Get every new post delivered to your Inbox.