2015-05-27 EDIT: Some great discussion in the comments, and based on that I added some info at the end. So if you’re doing the TL;DR thing, don’t skip the very end.
2015-06-03 EDIT: looks like earlier in May Oracle published a brief doc based on my ticket: High “Latch Free” Waits with Contention on ‘Result Cache: RC Latch’ when RESULT_CACHE_MODE = MANUAL on Oracle 12c (Doc ID 2002089.1) (had discussed with engineer).
Recently migrated Peoplesoft Campus Solutions database from 126.96.36.199 Solaris non-RAC non-ASM to 188.8.131.52 Linux ASM RAC (yes really, all at once). This is a mixed workload database, OLTP and Peoplesoft batch jobs. Seeing a lot of “latch free” waits during job runs and more than you would expect even during low times when no jobs are running. This is odd, usually there is something after “latch free: ???” since 10g and also we have never seen latching issues before on 11g with the same workload. It is also clear that it is having a significant performance impact because certain jobs are taking many times longer to run when these waits are high. The jobs (like many delivered peoplesoft jobs) cause a fair amount of blocking which is acceptable to the business side when they last 11 minutes but not when they last an hour — so much blocking occurs with the job having latch free waits at the root that eventually there are multiple-participant deadlocks and the job fails. And of course regular users are being blocked for a long time so web usage is heavily impacted.
AWR showed that the latch free wait was actually 100% “Result Cache: RC Latch.”
> select name, gets, misses, sleeps, immediate_gets, wait_time/1000 wait_time_ms
where name like 'Result Cache%'
NAME GETS MISSES SLEEPS IMMEDIATE_GETS WAIT_TIME_MS
------------------------------ ---------- ---------- ---------- -------------- ------------
Result Cache: RC Latch 97101269 1106885 995211 0 128375216
Result Cache: SO Latch 124798 15 0 0 0
Result Cache: MB Latch 93 0 0 0 0
That’s ugly. Ok, so we’re having problems latching on the result cache. The question is, why are we using the result cache at all?
> show parameter result
NAME TYPE VALUE
------------------------------------ --------------------------------- ------------------------------
client_result_cache_lag big integer 3000
client_result_cache_size big integer 0
result_cache_max_result integer 5
result_cache_max_size big integer 330304K
result_cache_mode string MANUAL
result_cache_remote_expiration integer 0
result_cache_mode is set to MANUAL so (according to 12c docs) only the results from queries that use the RESULT_CACHE hint should be stored in the result cache.
I verified that none of the user or job queries were using the result_cache hint. A search of gv$sql looking for ‘%result_cache%’ in sql_text was illuminating — 400K (fluctuates) sql statements that contained the result_cache hint. They turned out to all be from dynamic sampling/statistics, as can be determined by the presence of the DS_SVC hint — here’s an example:
SELECT /* DS_SVC */ /*+ dynamic_sampling(0) no_sql_tune no_monitoring optimizer_features_enable(default) no_parallel result_cache(snapshot=3600) */ SUM(C1) FROM (SELECT /*+ qb_name("innerQuery") NO_INDEX_FFS( "PS_SSR_TSRSLT_STAT") */ 1 AS C1 FROM "PS_SSR_TSRSLT_STAT" SAMPLE BLOCK(1.37542, 8) SEED(1
) "PS_SSR_TSRSLT_STAT" WHERE ("PS_SSR_TSRSLT_STAT"."REPORT_REQUEST_NBR"='003125267') AND ("PS_SSR_TSRSLT_STAT"."STRM"='1137') AND ("PS_SSR_TSRSLT_STAT"."REQUEST_SEQ_NBR"=1) AND ("PS_SSR_TSRSLT_STAT"."REPORT_NUMBER"=1)) innerQuery
There were ZERO queries in gv$sql that used the result_cache hint and were NOT from dynamic sampling.
OK, why are we using dynamic sampling so much, I thought things were at the defaults. I know 12c has a new level of 11 or AUTO which mean “the optimizer will use dynamic statistics to verify cardinality estimates for *all* SQL operators, and it will determine an internal time limit to spend verifying the estimates.” (http://docs.oracle.com/database/121/REFRN/refrn10140.htm#REFRN10140) But I did not set my database to 11/AUTO.
> show parameter optimizer_dynamic_sampling
NAME TYPE VALUE
------------------------------------ --------------------------------- ------------------------------
optimizer_dynamic_sampling integer 2
optimizer_dynamic_sampling is set to 2. From 12c docs: 2 means “Use dynamic statistics if at least one table in the statement has no statistics. This is the default setting.” So, I picked out lots of tables from the dynamic sampling sql statements and made sure they had stats, re-ran to be sure, and then watched for them to be dynamically sampled again (should not be). Dynamic sampling statements for these tables continued to show up in gv$sql with “LAST_LOAD_TIME” equal to the present time, so dynamic sampling was continuing to run AND store its sampling statements in the result cache.
At this point I requested a full rolling restart of the Peoplesoft application tier to make sure we did not have some persistent “dirty” sessions where something/someone had issued a session-level dynamic sampling hint at level . This would recreate all the sessions. This did not get rid of the issue.
Next step, I decided to test turning off dynamic sampling to prove the theory beyond a doubt. Before I did this, I ran a query that looked at that last 60 seconds of gv$sql LAST_LOAD_TIMEs. There were hundreds of results. I had also run this query on and off for the past hour and it always had lots of results, clearly there was a ton of dynamic sampling going on. Then I ran this:
alter system set optimizer_dynamic_sampling=0 scope=both sid='*';
After a few minutes, the query consistently returned zero rows for the past 60 seconds. This stayed the same until I turned dynamic sampling back on at level 2:
alter system set optimizer_dynamic_sampling=2 scope=both sid='*';
Instantly there were over 1000 dynamic sampling queries in gv$sql with a last_load_time in the past 60 seconds.
This doesn’t make sense based on what the docs say about level 2, but ok, let’s assume *for now that this is a good thing and take a look at the result cache / tune it.
> SET SERVEROUTPUT ON
> EXECUTE DBMS_RESULT_CACHE.MEMORY_REPORT
R e s u l t C a c h e M e m o r y R e p o r t
Block Size = 1K bytes
Maximum Cache Size = 330304K bytes (330304 blocks)
Maximum Result Size = 16515K bytes (16515 blocks)
Total Memory = 339190216 bytes [2.055% of the Shared Pool]
... Fixed Memory = 108416 bytes [0.001% of the Shared Pool]
... Dynamic Memory = 339081800 bytes [2.054% of the Shared Pool]
....... Overhead = 850504 bytes
....... Cache Memory = 330304K bytes (330304 blocks)
........... Unused Memory = 0 blocks
........... Used Memory = 330304 blocks
............... Dependencies = 2 blocks (2 count)
............... Results = 330302 blocks
................... SQL = 39452 blocks (39452 count)
................... Invalid = 290850 blocks (290850 count)
Ok, that’s completely full / has used its max allowed memory. Perhaps it needs more (and I did find significant CANCELLED requests where the shared pool wanted to GROW and the buffer cache said NO from the time period where these latch free waits were the worst). However, before we assume that let’s see what we can see about the efficiency of the use of the result cache.
> select trunc(to_date(creation_timestamp, 'MM/DD/YYYY HH24:MI:SS'), 'MI') MIN, count(*)
where creation_timestamp > sysdate-1/24
group by trunc(to_date(creation_timestamp, 'MM/DD/YYYY HH24:MI:SS'), 'MI')
order by trunc(to_date(creation_timestamp, 'MM/DD/YYYY HH24:MI:SS'), 'MI')
04/11/2015 15:17:00 212
04/11/2015 15:18:00 188
04/11/2015 15:19:00 213
04/11/2015 15:20:00 168
04/11/2015 15:21:00 109
04/11/2015 15:22:00 199
04/11/2015 15:23:00 194
04/11/2015 15:24:00 153
04/11/2015 15:25:00 149
That’s quite a bit of churn for something that’s meant to be reused. Let’s zoom out:
> select trunc(to_date(creation_timestamp, 'MM/DD/YYYY HH24:MI:SS'), 'HH24') HOUR, count(*)
group by trunc(to_date(creation_timestamp, 'MM/DD/YYYY HH24:MI:SS'), 'HH24')
order by trunc(to_date(creation_timestamp, 'MM/DD/YYYY HH24:MI:SS'), 'HH24')
03/27/2015 03:00:00 2
03/28/2015 03:00:00 2
03/30/2015 03:00:00 2
04/10/2015 15:00:00 26881
04/10/2015 16:00:00 88926
04/10/2015 17:00:00 23122
04/10/2015 18:00:00 15636
04/10/2015 19:00:00 20075
04/10/2015 20:00:00 9660
04/10/2015 21:00:00 6549
04/10/2015 22:00:00 14950
04/10/2015 23:00:00 18235
04/11/2015 00:00:00 6855
04/11/2015 01:00:00 17711
04/11/2015 02:00:00 3656
04/11/2015 03:00:00 48318
04/11/2015 04:00:00 54292
04/11/2015 05:00:00 124391
04/11/2015 06:00:00 76643
04/11/2015 07:00:00 81864
04/11/2015 08:00:00 97948
04/11/2015 09:00:00 56916
04/11/2015 10:00:00 22606
04/11/2015 11:00:00 31104
04/11/2015 12:00:00 61384
04/11/2015 13:00:00 21226
04/11/2015 14:00:00 31671
04/11/2015 15:00:00 30287
In a word, wow. With that kind of creation rate, no wonder we are getting waits, since the first run of result cached queries are often slower than without since they have to write to the result cache. We are probably burning through the result cache’s LRU really fast, making it very inefficient — but let’s prove that:
> select count(*) from gv$result_cache_objects;
Result cache is specific to node. I did verify that the objects were evenly split across nodes (about 300K each).
> SELECT scan_count, count(*)
group by scan_count
ORDER BY 2,1
Ouch. So of nearly a million objects in the result cache across the cluster, 725K were never used (scanned). I ran this over time a few times to make sure I wasn’t seeing a burst of adds and then reuse after that. It didn’t fluctuate. (The evenness of the per-minute object creation rate backs this up.) Of the other 270K or so, 161K were scanned once. Even though we have creation timestamps going back to 15:00 yesterday, we haven’t seen much use of the cached results.
This view also lets us see the LRU position (LRU_NUMBER) of any object in the result cache, let’s see how spread out it is:
> SELECT lru_number, count(*)
group by lru_number
ORDER BY 2
So almost every object is at the start of the LRU. This backs up the creation timestamp info above showing that they are cycling through very quickly.
At this point I went back to the docs because I don’t want to have to entirely turn off dynamic sampling, and it is not working the way I would expect (set to level 2 but acting like it’s on the new 12c AUTO). That’s when I found this text (http://docs.oracle.com/database/121/REFRN/refrn10140.htm#REFRN10140):
When this parameter is set to 11, the optimizer will use dynamic statistics to verify cardinality estimates for all SQL operators, and it will determine an internal time limit to spend verifying the estimates.
There are cases where the optimizer will automatically decide to use 11, for example:
The query will run in parallel.
The query was executed before and its history is available (from the cursor cache, Automatic Workload Repository, or the SQL Management Base).
So what this basically says is that even if you set dynamic sampling to the value you want, the Oracle optimizer *can* choose to ignore it if it has ever been run before. Given the lack of reuse of the objects in the result cache, it seems like this should not apply (probably has not seen these queries before) but this is close enough to what is happening for it to be plausible that it is ignoring my level 2 setting and using dynamic sampling. Whether that is an optimizer bug or by design remains to be seen.
The 12c doc on OPTIMIZER_DYNAMIC_SAMPLING linked over to the Oracle Database SQL Tuning Guide so I went to see what it had to say on OPTIMIZER_DYNAMIC_SAMPLING (http://docs.oracle.com/database/121/TGSQL/tgsql_astat.htm#TGSQL455). This was illuminating, missed it before when re-reading the level definitions:
13.1.3 Disabling Dynamic Statistics
In general, the best practice is not to incur the cost of dynamic statistics for queries whose compile times must be as fast as possible, for example, unrepeated OLTP queries.
… and goes on to tell you how to turn off dynamic statistics entirely: ALTER SESSION SET OPTIMIZER_DYNAMIC_SAMPLING=0;
So we don’t have a size problem. We have a revolving door LRU problem, as well as the wrong type of query being put into the result cache. Things are put in and pushed out so fast they’re never used, and additionally we are putting in dynamic sampling queries about queries with literals that will almost never be repeated. Oracle’s tuning guide basically says if you have a lot of OLTP / small queries running then turn dynamic statistics off entirely. This is unfortunate in a mixed-workload database like Peoplesoft, because our jobs can benefit from being able to sample tables on the fly. Especially with Peoplesoft where jobs are constantly deleting and reloading “temporary” (real) tables and statistics are often incorrect mid-job.
I opened an SR and worked with Oracle Support on this issue. They confirmed my tests and said that use of dynamic statistics in spite of the OPTIMIZER_DYNAMIC_SAMPLING setting is expected behavior in 12c. They pointed me to two unpublished bug fixes released in 184.108.40.206 (you can find the bug number in the release notes but there’s no info on them, 8823383 and 12914055) which enabled dynamic statistics in a number of cases. The example they gave was if the query runs in parallel and there table is sufficiently large, based on a threshold value, but that is not the only case.
When I asked for clarification on the cases that might trigger the use of dynamic statistics, the explanation was that dynamic statistics is now part of AS (Adaptic Statistics) which is comprised of Adaptive Dynamic Statistics (ADS), Automatic Reoptimization, and SQL Plan Directives, and they all interact. One example was that if an adaptive plan is used, dynamic statistics might also be used. ADS might also be triggered when Automatic Reoptimization is used. It makes sense that these other functions would want to use dynamic statistics, it is just confusing to see Oracle appear to ignore your setting.
Suggested fixes for this issue were to hint dynamic sampling off, or to turn it off entirely. Setting OPTIMIZER_DYNAMIC_SAMPLING=0 does still disable ADS entirely. You can also turn the result cache off. But there is a much more targeted solution that Oracle Support provided — you can set “_optimizer_ads_use_result_cache” = FALSE which will stop only dynamic statistics from using the result cache! That’s exactly what I was looking for.
alter system set "_optimizer_ads_use_result_cache" = FALSE;
With this issue, that amounts to a magic bullet. I’ll update as we roll it out if there are any issues, but I think at least for Peoplesoft databases and any other applications that use a lot of literals and have an at least partial OLTP workload, when upgrading to 220.127.116.11 or 12c this parameter is worth serious consideration.
Note: I did try turning on AUTO/11 for a short time under the same workload I was monitoring while researching, and it ran ADS on far more queries than with ADS set to 2. So while the optimizer is using ADS on more than you would expect with ADS set to 2, it is not the same as setting it to AUTO/11.
2015-05-27 EDIT: There’s some great back and forth in the comments that I strongly encourage you to read. Christian Antognini, an expert in all the new or updated Adaptive features in 12c (among other things), had some significant insights into what is and is not happening with dynamic statistics. Thanks Chris! Here’s a summary that I wrote based on his comments (but do read through the comments):
There are dual paths by which dynamic statistics (DS) is triggered. On its own / by itself — DS behaves as expected and documented. But because of SQL Plan Directives (SPD), DS can be kicked off at any time if a SPD exists. Effectively “dynamic sampling/statistics is running more often in 12c” is a correct statement but the important part is that it is not triggered by DS, but by SPD. 12c dynamic statistics/sampling on its own does not trigger any more often than it would have in 11g for a given value of OPTIMIZER_DYNAMIC_SAMPLING.