High log file sync with low log file parallel write / performant disk IO

Nasty one. This article sums it up quite well.  Also see this Oracle doc (also referenced in Jeremy Schneider’s blog post): Adaptive Switching Between Log Write Methods can Cause ‘log file sync’ Waits (Doc ID 1462942.1).

The Oracle doc’s list of symptoms were a perfect match for us, except we’re on 11.2.0.3.7.  Our Oracle support rep told us it is fixed in 11.2.0.4.

  • On Solaris, using Oracle 11.2.0.3.1
  • The log writer process on the database is blocking most of the other sessions and taking an average wait time of 10ms to commit.
  • On checking the hardware, the I/O wait time is very small from the hardware side
  • This means that all the time is on CPU processing the request.
  • Database is OLTP-based and has high commits/second
  • The same issue did not occur in 10g [Tyler’s note: or 11.2.0.2].
  • The problem seems to ‘go away’ temporarily after the database is restarted but ultimately returns after a few days.
  • AWR may show ‘log file sync’ waits as the highest wait

Verified issue in two databases upgraded from 11.2.0.2 to 11.2.0.3 apprx two months before the issue occurred. It can start “suddenly” as in one night during/after a large data load, log file sync jumped from 3ms to 11ms and stayed there, consistently, until we restarted the database days later. Perfect example, high log file sync, low log file parallel write, pwrites() in dtruss lined up perfectly with log file parallel writes, and no change to log file parallel write time when log file sync spiked up to its ‘new normal’. A database restart fixes it, but as this Oracle doc that the above article quotes (and our Oracle engineer gave us as well – 1462942.1) says, it comes back. The bullet points at the beginning of that doc were a 100% match for our experience. We had the Solaris kernel folks engaged as well, they were convinced it was memory fragmentation, then Oracle finally suggested this after days of round and round about non-matching ‘fixes’. Our tech said it is fixed in 11.2.0.4. The doc says 11.2.0.3.1, we’re on 11.2.0.3.7, Solaris 10.

We also found that while you occasionally see this switch back and forth in your LGWR trace file, and that does not mean you have the issue (necessarily), when it jumps up suddenly like it did for us, it’s like a switch was flipped and then it is “stuck.”  After it jumped up, it no longer switched from polling to post/wait occasionally, until we restarted the database.  Before that, it would happen once in a while.

Advertisements