oracleops-support: Database Hangs Because SMON is taking 100% CPU doing transaction recovery

Database Hangs Because SMON is taking 100% CPU doing transaction recovery

In this Post,
  Goal
  Solution
  References

--------------------------------------------------------------------------------


This document is being delivered to you via Oracle Support's Rapid Visibility (RaV) Rapid Visibility (RaV) process, and therefore has not been subject to an independent technical review. 



Applies to: Oracle Server - Enterprise Edition - Version: 8.1.7 to 10.2
Information in this document applies to any platform.
GoalAfter killing a large running transaction, either by killing the process or aborting the database, the database may appear to hang 
and/or the SMON background process is taking all the available cpu. Basically SMON is in the progress of rolling back this terminated 
large transaction, which can be time consuming. As a result of this users often try to restart the database, but because a shutdown 
immediate will hang, the database is often aborted. Please note that stopping the database will actually make things worse and will not 
decrease the amount of work SMON needs to perfom to complete the rollback. 

The message 'Waiting for smon to disable tx recovery' will posted in the alert log as well. 




SolutionFirst we need to determine the progress SMON is making. After killing the shadow process or aborting the database the v$transaction entry is lost, so you cannot estimate by examining v$transaction.used_ublk how the rollback procedure proceeds. 

The following queries are available to monitor the progress of the transaction recovery 

  set linesize 100 
  alter session set NLS_DATE_FORMAT='DD-MON-YYYY HH24:MI:SS'; 
  select usn, state, undoblockstotal "Total", undoblocksdone "Done", undoblockstotal-undoblocksdone "ToDo", 
             decode(cputime,0,'unknown',sysdate+(((undoblockstotal-undoblocksdone) / (undoblocksdone / cputime)) / 86400)) 
              "Estimated time to complete" 
   from v$fast_start_transactions; 

Run the above query several times in a row, this will give you a good idea on how SMON is progressing. 

- In some versions the cputime does not work (always 0), hence the estimated completion time will not be displayed 
- In some cases the v$fast_start_transactions view will not work. If this is the case then you can query the internal data dictionary 
view x$ktuxe, the 'ktuxesiz' column represents the remaining number of undo blocks required for rollback: 

   select ktuxeusn, to_char(sysdate,'DD-MON-YYYY HH24:MI:SS') "Time", ktuxesiz, ktuxesta 
   from x$ktuxe 
   where ktuxecfl = 'DEAD'; 



The following outlines a few action that can be taken in order to speed up SMON recovery of the large transaction. 

1. There are cases where parallel transaction recovery is not as fast as serial transaction recovery, because the pq slaves are 
interfering with each other. To check the Parallel Recovery processes and there state query: 

          select * from v$fast_start_servers; 

    Column STATE shows the state of the server being IDLE or RECOVERING, if only 1 process is in 
    state RECOVERING while the other processes are in state IDLE, then you should disable 
    Parallel Transaction Recovery. How to do this is outlined in the following note: 

         Note 238507.1: How to Disable Parallel Transaction Recovery When Parallel Txn Re very is Active 

    If all the processes are in state RECOVERING, then you can benefit from adding more processes: 

         alter system set fast_start_parallel_rollback = high; 

    This will create parallel servers as much as 4 times the number of CPUs. 


2. Increase the parameter '_cleanup_rollback_entries' to 400. The default is 100. 

This parameter cannot be changed dynamically, so in order to change this the database will need to be restarted. As outlined above 
this is not advisable in the current situation, but there is basically no alternative. 
If the database is is going to be restarted, then you best also specify 

  fast_start_parallel_rollback=false 


3. As a last resort, if the above has been verified, we can disable SMON from performing the recovery at this time. 
This will allow the database to operate as normal, and you can choose to let the recovery take place at a later time when it is more 
convenient. Because the clean-up of the large transaction is actually put on hold it is important, and I cannot stress this enough, to 
disable this event again to let SMON finish its work. 

  init.ora: 
    event="10513 trace name context forever, level 2" 


Subject:  How to Disable Parallel Transaction Recovery When Parallel Txn Recovery is Active 
  Doc ID:  Note:238507.1 Type:  PROBLEM 
  Last Revision Date:  24-APR-2004 Status:  PUBLISHED 


Symptom(s)
~~~~~~~~~~

Parallel Transaction Recovery is taking too long.

You can use V$TRANSACTION USED_UBLK to estimate how long the rollback is going 
to take but there is no formula for this.  If you shutdown the database after 
rollback has started, it will begin where it left off.  

You can also look at V$FAST_START_TRANSACTIONS for UNDOBLOCKSDONE 
versus UNDOBLOCKSTOTAL.



Change(s)
~~~~~~~~~~

A large transaction got killed or rolled back.


Cause
~~~~~~~

There are cases where parallel transaction recovery is not as fast as serial 
transaction recovery, because the pq slaves are interfering with each other.  
This depends mainly on the type of changes that need to be made during rollback 
and usually may happen when rolling back INDEX Updates in parallel.   


Fix
~~~~

Dynamically switch from parallel recovery to serial.  If you are in a clustered 
environment you will need to do this on all instances at the same time:

1. Find SMON's Oracle PID:

Example:

SQL> select pid, program from v$process where program like '%SMON%';

       PID PROGRAM
---------- ------------------------------------------------
         6 oracle@stsun7 (SMON)

2. Disable SMON transaction cleanup:

SVRMGR> oradebug setorapid 
SVRMGR> oradebug event 10513 trace name context forever, level 2

3. Kill the PQ slaves that are doing parallel transaction recovery. 
You can check V$FAST_START_SERVERS to find these.

4. Turn off fast_start_parallel_rollback:

 alter system set fast_start_parallel_rollback=false;

If SMON is recovering, this command might hang, if it does just control-C out of it.  You may need to try this many times to get this to complete (between SMON cycles).

5. Re-enable SMON txn recovery:

SVRMGR> oradebug setorapid 
SVRMGR> oradebug event 10513 trace name context off


References
~~~~~~~~~~~
Note 144332.1 
Parallel Rollback may hang database, Parallel query servers get 100% cpu
oracleops-support

Wikipedia

Database Hangs Because SMON is taking 100% CPU doing transaction recovery

No comments:

Post a Comment