High log file sync with low log file parallel write / performant disk IO

Nasty one. This article sums it up quite well.  Also see this Oracle doc (also referenced in Jeremy Schneider’s blog post): Adaptive Switching Between Log Write Methods can Cause ‘log file sync’ Waits (Doc ID 1462942.1).

The Oracle doc’s list of symptoms were a perfect match for us, except we’re on 11.2.0.3.7.  Our Oracle support rep told us it is fixed in 11.2.0.4.

  • On Solaris, using Oracle 11.2.0.3.1
  • The log writer process on the database is blocking most of the other sessions and taking an average wait time of 10ms to commit.
  • On checking the hardware, the I/O wait time is very small from the hardware side
  • This means that all the time is on CPU processing the request.
  • Database is OLTP-based and has high commits/second
  • The same issue did not occur in 10g [Tyler’s note: or 11.2.0.2].
  • The problem seems to ‘go away’ temporarily after the database is restarted but ultimately returns after a few days.
  • AWR may show ‘log file sync’ waits as the highest wait

Verified issue in two databases upgraded from 11.2.0.2 to 11.2.0.3 apprx two months before the issue occurred. It can start “suddenly” as in one night during/after a large data load, log file sync jumped from 3ms to 11ms and stayed there, consistently, until we restarted the database days later. Perfect example, high log file sync, low log file parallel write, pwrites() in dtruss lined up perfectly with log file parallel writes, and no change to log file parallel write time when log file sync spiked up to its ‘new normal’. A database restart fixes it, but as this Oracle doc that the above article quotes (and our Oracle engineer gave us as well – 1462942.1) says, it comes back. The bullet points at the beginning of that doc were a 100% match for our experience. We had the Solaris kernel folks engaged as well, they were convinced it was memory fragmentation, then Oracle finally suggested this after days of round and round about non-matching ‘fixes’. Our tech said it is fixed in 11.2.0.4. The doc says 11.2.0.3.1, we’re on 11.2.0.3.7, Solaris 10.

We also found that while you occasionally see this switch back and forth in your LGWR trace file, and that does not mean you have the issue (necessarily), when it jumps up suddenly like it did for us, it’s like a switch was flipped and then it is “stuck.”  After it jumped up, it no longer switched from polling to post/wait occasionally, until we restarted the database.  Before that, it would happen once in a while.

Advertisements

DBMS_SPM.EVOLVE_SQL_PLAN_BASELINE runs and evolves successfully, but errors when returning output (ORA-06502)

We recently have upgraded many of our databases to 11g.  One of our tools to make sure performance did not degrade was to implement SQL Plan Baselines for our top queries, as identified a few different ways via AWR.  That’s a long story, but one interesting short note is that after you have Baselines, you might want to evaluate them periodically to see if there is a better plan available.  Oracle provides DBMS_SPM.EVOLVE_SQL_PLAN_BASELINE so you can do this.

I chose to wrap my evolve in a shell script and call it from cron.  The script worked perfectly for just over a month, then one day all my log had in it was an error:

DECLARE
*
ERROR at line 1:
ORA-06502: PL/SQL: numeric or value error
ORA-06512: at line 13

I tried increasing my SET LONG statement – no dice.  Also, it was clear from the baselines themselves and the total run time before the error that the evolve was actually working, it was just the log/text output at the end that was failing.  In fact, the DBMS_SPM.EVOLVE_SQL_PLAN_BASELINE wasn’t failing at all.  It was actually DBMS_OUTPUT.PUT_LINE that was failing.

Finally I read that DBMS_OUTPUT.PUT_LINE has a 32767 character limit for a single line.   Since I could see that my output had grown slowly but steadily each day, and on the day it started failing it had grown beyond this limit, this was the answer.  This is not exactly a bug in the DBMS_SPM package, however since this output should be expected to be long, one wonders why a better output method wasn’t developed.  Oh well – on to the solution.

Many  smart people have posted about how to chunk up a clob to get around this limit and print it to the screen/sqlplus.  I happened to like and use this one (modified to use put_line instead of writing to a file).

My working evolve script — leaving out the surrounding shell stuff which you’d want to customize for your environment anyway — looks like this:

set serveroutput on
SET LONG 1000000
DECLARE
  results  CLOB;
  r_len    NUMBER;
  putlimit constant binary_integer := 32767;
  r_buffer VARCHAR2(32767);
  r_pos    PLS_INTEGER := 1;
BEGIN
  results := DBMS_SPM.EVOLVE_SQL_PLAN_BASELINE (
   time_limit   => 180,
   verify       => 'YES',
   commit       => 'YES'
    );
  dbms_output.put_line('finished with the evolve here, no errors so far.');
  r_len := dbms_lob.getlength(results);
  dbms_output.put_line('length of results var: '||r_len);
  while r_pos < r_len loop
    r_buffer := dbms_lob.substr(results, putlimit, r_pos);
    exit when r_buffer is null;
    DBMS_OUTPUT.PUT_LINE(r_buffer);
    r_pos := r_pos + LEAST(LENGTH(r_buffer)+1,putlimit);
  end loop;
END;
/