Comparing Common Queries
Between Test and Production
The developers complained that their test database was so much
slower than production that they could not use it to really test whether their
batch processes would run fast enough when migrated to production. They did not
give me any particular queries to check. Instead they said that the system was
generally too slow. So, I went through a process to find SQL
statements that they had run in test and that normally run in production
and compare their run times. I thought that I would document the process that I
went through here.
First I found the top 100 queries by elapsed time on both the
test and production databases using this query:
column FORCE_MATCHING_SIGNATURE format 99999999999999999999
select FORCE_MATCHING_SIGNATURE from
(select
FORCE_MATCHING_SIGNATURE,
sum(ELAPSED_TIME_DELTA) total_elapsed
from DBA_HIST_SQLSTAT
where 
FORCE_MATCHING_SIGNATURE is not null and
FORCE_MATCHING_SIGNATURE <>0
group by FORCE_MATCHING_SIGNATURE
order by total_elapsed desc)
where rownum < 101;
The output looked like this:
FORCE_MATCHING_SIGNATURE
------------------------
      944718698451269965
     4634961225655610267
    15939251529124125793
    15437049687902878835
     2879196232471320459
    12776764566159396624
    14067042856362022182
...
Then I found the signatures that were in common between the two
lists.
insert into test_sigs values (944718698451269965);
insert into test_sigs values (4634961225655610267);
insert into test_sigs values (15939251529124125793);
...
insert into prod_sigs values (3898230136794347827);
insert into prod_sigs values (944718698451269965);
insert into prod_sigs values (11160330134321800286);
...
select * from test_sigs
intersect
select * from prod_sigs;
This led to 32 values of FORCE_MATCHING_SIGNATURE which
represented queries that ran on both test and production, except for the
possible difference in constants.
Next I looked at the overall performance of these 32 queries in
test and production using this query:
create table common_sigs
(FORCE_MATCHING_SIGNATURE number);
insert into common_sigs values (575231776450247964);
insert into common_sigs values (944718698451269965);
insert into common_sigs values (1037345866341698119);
...
select 
sum(executions_delta) total_executions,
sum(ELAPSED_TIME_DELTA)/(sum(executions_delta)*1000),
sum(CPU_TIME_DELTA)/(sum(executions_delta)*1000),
sum(IOWAIT_DELTA)/(sum(executions_delta)*1000),
sum(CLWAIT_DELTA)/(sum(executions_delta)*1000),
sum(APWAIT_DELTA)/(sum(executions_delta)*1000),
sum(CCWAIT_DELTA)/(sum(executions_delta)*1000),
sum(BUFFER_GETS_DELTA)/sum(executions_delta),
sum(DISK_READS_DELTA)/sum(executions_delta),
sum(ROWS_PROCESSED_DELTA)/sum(executions_delta)
from DBA_HIST_SQLSTAT ss,common_sigs cs
where 
ss.FORCE_MATCHING_SIGNATURE = cs.FORCE_MATCHING_SIGNATURE;
Here is part of the output:
TOTAL_EXECUTIONS Elapsed Average ms CPU Average ms IO Average ms
---------------- ------------------ -------------- -------------
         5595295         366.185529      241.92785    59.8682797
          430763         1273.75822     364.258421    1479.83294
The top line is production and the bottom is test.
This result supported the development team’s assertion that test
was slower than production. The 32 queries averaged about 3.5 times longer run
times in test than in production. Also, the time spent on I/O was about 25
times worse. I am not sure why the I/O time exceeded the elapsed time on test.
I guess it has something to do with how Oracle measures I/O time. But
clearly on average these 32 queries are much slower on test and I/O time
probably caused most of the run time difference.
After noticing this big difference between test and production I
decided to get these same sorts of performance metrics for each signature
to see if certain ones were worse than others. The query looked like this:
select 
ss.FORCE_MATCHING_SIGNATURE,
sum(executions_delta) total_executions,
sum(ELAPSED_TIME_DELTA)/(sum(executions_delta)*1000),
sum(CPU_TIME_DELTA)/(sum(executions_delta)*1000),
sum(IOWAIT_DELTA)/(sum(executions_delta)*1000),
sum(CLWAIT_DELTA)/(sum(executions_delta)*1000),
sum(APWAIT_DELTA)/(sum(executions_delta)*1000),
sum(CCWAIT_DELTA)/(sum(executions_delta)*1000),
sum(BUFFER_GETS_DELTA)/sum(executions_delta),
sum(DISK_READS_DELTA)/sum(executions_delta),
sum(ROWS_PROCESSED_DELTA)/sum(executions_delta)
from DBA_HIST_SQLSTAT ss,common_sigs cs
where ss.FORCE_MATCHING_SIGNATURE = cs.FORCE_MATCHING_SIGNATURE
having 
sum(executions_delta) > 0
group by
ss.FORCE_MATCHING_SIGNATURE
order by
ss.FORCE_MATCHING_SIGNATURE;
I put together the outputs from running this query on test
and production and lined the result up like this:
FORCE_MATCHING_SIGNATURE   
PROD Average ms    TEST Average ms
------------------------ ------------------ ------------------
     
575231776450247964        
20268.6719         16659.4585
     
944718698451269965        
727534.558          3456111.6 *
    
1037345866341698119        
6640.87641         8859.53518
     1080231657361448615         3611.37698         4823.62857
    
2879196232471320459        
95723.5569         739287.601 *
    
2895012443099075884        
687272.949         724081.946
    
3371400666194280661        
1532797.66         761762.181
     4156520416999188213         109238.997         213658.722
    
4634693999459450255         
4923.8897         4720.16455
    
5447362809447709021        
2875.37308          2659.5754
    
5698160695928381586        
17139.6304         16559.1932
    
6260911340920427003        
290069.674         421058.874 *
    
7412302135920006997        
20039.0452         18951.6357
    
7723300319489155163        
18045.9756         19573.4784
    
9153380962342466451        
1661586.53         1530076.01
     9196714121881881832         5.48003488         5.13169472
    
9347242065129163091        
4360835.92         4581093.93
   
11140980711532357629        
3042320.88         5048356.99
   
11160330134321800286        
6868746.78         6160556.38
    12212345436143033196          5189.7972         5031.30811
   
12776764566159396624        
139150.231         614207.784  *
   
12936428121692179551        
3563.64537         3436.59365
   
13637202277555795727         
7360.0632         6410.02772
    14067042856362022182         859.732015         771.041714
   
14256464986207527479        
51.4042938         48.9237251
   
14707568089762185958        
627.586095          414.14762
   
15001584593434987669        
1287629.02         1122151.35
    15437049687902878835         96014.9782         996974.876  *
   
16425440090840528197        
48013.8912         50799.6184
   
16778386062441486289        
29459.0089         26845.8327
   
17620933630628481201        
51199.0511         111785.525  *
    18410003796880256802         581563.611         602866.609
I put an asterisk (*) beside the six queries that were much
worse on test than production. I decided to focus on these six to get to the
bottom of the reason between the difference. Note that many of the 32 queries
ran about the same on test as prod so it really isn’t the case that everything
was slow on test.
Now that I had identified the 6 queries I wanted to look at what
they were spending their time on including both CPU and wait events. I used the
following query to use ASH to get a profile of the time spent by these queries
on both databases:
select 
case SESSION_STATE
when 'WAITING' then event
else SESSION_STATE
end TIME_CATEGORY,
(count(*)*10) seconds
from DBA_HIST_ACTIVE_SESS_HISTORY
where 
FORCE_MATCHING_SIGNATURE in
('944718698451269965',
'2879196232471320459',
'6260911340920427003',
'12776764566159396624',
'15437049687902878835',
'17620933630628481201')
group by SESSION_STATE,EVENT
order by seconds desc;
The profile looked like this in test:
TIME_CATEGORY           
SECONDS
------------------------ -------
db file parallel read    
207450
ON CPU                   
141010
db file sequential read   
62990
direct path read          
36980
direct path read temp     
29240
direct path write temp     23110
The profile looked like this in production:
TIME_CATEGORY           
SECONDS
------------------------ -------
ON CPU                   
433260
PX qref latch             
64200
db file parallel read     
35730
db file sequential read   
14360
direct path read          
12750
direct path write temp    
12000
So, I/O waits dominate the time on test but not production.
Since db file parallel read and db file sequential read were the top I/O waits
for these 6 queries I used ash to see which of the 6 spent the most time on
these waits.
db file parallel read:
select
  2  sql_id,
  3  (count(*)*10) seconds
  4  from DBA_HIST_ACTIVE_SESS_HISTORY
  5  where
  6  FORCE_MATCHING_SIGNATURE in
  7  ('944718698451269965',
  8  '2879196232471320459',
  9  '6260911340920427003',
 10  '12776764566159396624',
 11  '15437049687902878835',
 12  '17620933630628481201') and
 13  event='db file parallel read'
 14  group by sql_id
 15  order by seconds desc;
SQL_ID           SECONDS
------------- ----------
ak2wk2sjwnd34     159020
95b6t1sp7y40y      37030
brkfcwv1mqsas      11370
7rdc79drfp28a         30
db file sequential read:
select
  2  sql_id,
  3  (count(*)*10) seconds
  4  from DBA_HIST_ACTIVE_SESS_HISTORY
  5  where
  6  FORCE_MATCHING_SIGNATURE in
  7  ('944718698451269965',
  8  '2879196232471320459',
  9  '6260911340920427003',
 10  '12776764566159396624',
 11  '15437049687902878835',
 12  '17620933630628481201') and
 13  event='db file sequential read'
 14  group by sql_id
 15  order by seconds desc;
SQL_ID           SECONDS
------------- ----------
95b6t1sp7y40y      26840
ak2wk2sjwnd34      22550
6h0km9j5bp69t      13300
brkfcwv1mqsas        170
7rdc79drfp28a        130
Two queries stood out at the top waiters on these two
events: 95b6t1sp7y40y and ak2wk2sjwnd34. Then I just ran my normal
sqlstat query for both sql_ids for both test and production to find out
when they last ran. Here is what the query looks like for ak2wk2sjwnd34:
select ss.sql_id,
ss.plan_hash_value,
sn.END_INTERVAL_TIME,
ss.executions_delta,
ELAPSED_TIME_DELTA/(executions_delta*1000) "Elapsed Average
ms",
CPU_TIME_DELTA/(executions_delta*1000) "CPU Average
ms",
IOWAIT_DELTA/(executions_delta*1000) "IO Average ms",
CLWAIT_DELTA/(executions_delta*1000) "Cluster Average
ms",
APWAIT_DELTA/(executions_delta*1000) "Application Average
ms",
CCWAIT_DELTA/(executions_delta*1000) "Concurrency Average
ms",
BUFFER_GETS_DELTA/executions_delta "Average buffer
gets",
DISK_READS_DELTA/executions_delta "Average disk
reads",
ROWS_PROCESSED_DELTA/executions_delta "Average rows
processed"
from DBA_HIST_SQLSTAT ss,DBA_HIST_SNAPSHOT sn
where ss.sql_id = 'ak2wk2sjwnd34'
and ss.snap_id=sn.snap_id
and executions_delta > 0
and ss.INSTANCE_NUMBER=sn.INSTANCE_NUMBER
order by ss.snap_id,ss.sql_id;
I found two time periods where both of these queries were
recently run on both test and production and got an AWR report for each
time period to compare them.
Here are a couple of pieces of the AWR report for the test
database:
Here are similar pieces for the production database:
What really stood out to me was that the wait events were so
different. In production the db file parallel read waits averaged around 1
millisecond and the db file sequential reads averaged under 1 ms. On test they
were 26 and 5 milliseconds, respectively. The elapsed times for
sql_ids 95b6t1sp7y40y and ak2wk2sjwnd34 were considerably longer in
test.
This is as far as my investigation went. I know that the
slowdown is most pronounced on the two queries and I know that their I/O waits
correspond to the two wait events. I am still trying to find a way to bring the
I/O times down on our test database so that it more closely matches production.
But at least I have a more narrow focus with the two top queries and the two
wait events.





 
