您在这里:首页 > 学员专区 > 技术文章
Oracle视频
Oracle
CUUG课程

一次RMAN备份报错的诊断过程(一)

 

今天检查数据库中的备份输出脚本时,发现RMAN备份出现了错误。

这一篇主要描述问题的现象。

 

 

错误信息如下:

bash-3.00$ more /data/backup/backup_tradedb_090523.out

Script. /data/backup/backup_tradedb.sh

==== started on Sat May 23 23:00:00 CST 2009 ====

 

RMAN: /opt/oracle/product/10.2/database/bin/rman

ORACLE_SID: tradedb1

ORACLE_HOME: /opt/oracle/product/10.2/database

Recovery Manager: Release 10.2.0.3.0 - Production on Sat May 23 23:00:01 2009

Copyright (c) 1982, 2005, Oracle.  All rights reserved.

RMAN-06900: WARNING: unable to generate V$RMAN_STATUS or V$RMAN_OUTPUT row

RMAN-06901: WARNING: disabling update of the V$RMAN_STATUS and V$RMAN_OUTPUT rows

ORACLE error from target database:

ORA-03113: end-of-file on communication channel

RMAN-00571: ===========================================================

RMAN-00569: =============== ERROR MESSAGE STACK FOLLOWS ===============

RMAN-00571: ===========================================================

RMAN-00554: initialization of internal recovery manager package failed

RMAN-06003: ORACLE error from target database:

ORA-03114: not connected to ORACLE

Script. /data/backup/backup_tradedb.sh

==== ended in error on Sat May 23 23:22:34 CST 2009 ====

这个错误信息以前还真的没有碰到过,出现了ORA-3113错误,还有两个告警信息,初步怀疑可能与网络不稳定有关,尝试重新启动后台备份,如果可以顺利运行,说明这个问题可能是瞬时网络不稳定造成的。

bash-3.00$ more /data/backup/backup_tradedb_090525.out

Script. bash

==== started on Mon May 25 14:34:14 CST 2009 ====

 

RMAN: /opt/oracle/product/10.2/database/bin/rman

ORACLE_SID: tradedb1

ORACLE_HOME: /opt/oracle/product/10.2/database

恢复管理器: Release 10.2.0.3.0 - Production on星期一5月25 14:34:15 2009

Copyright (c) 1982, 2005, Oracle.  All rights reserved.

RMAN-06900:警告:无法生成V$RMAN_STATUS或V$RMAN_OUTPUT行

RMAN-06901:警告:禁止更新V$RMAN_STATUS和V$RMAN_OUTPUT行

来自目标数据库的ORACLE错误:

RMAN-00571: ===========================================================

RMAN-00569: =============== ERROR MESSAGE STACK FOLLOWS ===============

RMAN-00571: ===========================================================

RMAN-00554:内部恢复管理器程序包初始化失败

RMAN-06003:来自目标数据库的ORACLE错误:

ORA-03114:未连接到ORALCE

Script. bash

==== ended in error on Mon May 25 14:56:54 CST 2009 ====

在后台运行备份后,很快输出脚本中就显示了错误信息,和前一次错误信息略有区别,这次没有ORA-3113错误了。看来问题是可以重现的,尝试直接RMAN连接:

bash-3.00$ rman target /

恢复管理器: Release 10.2.0.3.0 - Production on星期一5月25 15:02:51 2009

Copyright (c) 1982, 2005, Oracle.  All rights reserved.

RMAN连接数据库的操作被挂起。在等待RMAN相应的时候,查询了一下alert文件,检查备份启动时刻是否有对应的错误信息,结果果然有对应的错误:

Sat May 23 17:16:43 2009

GES: Potential blocker (pid=17895) on resource WF-00000046-00000000;

 enqueue info in file /opt/oracle/admin/tradedb/bdump/tradedb1_lmd0_7210.trc and DIAG trace file

Sat May 23 21:28:36 2009

GES: Potential blocker (pid=15807) on resource TM-00002391-00000000;

 enqueue info in file /opt/oracle/admin/tradedb/bdump/tradedb1_lmd0_7210.trc and DIAG trace file

Sat May 23 23:15:04 2009

GES: Potential blocker (pid=21881) on resource CF-00000000-00000000;

 enqueue info in file /opt/oracle/admin/tradedb/bdump/tradedb1_lmd0_7210.trc and DIAG trace file

Killing enqueue blocker (pid=21881) on resource CF-00000000-00000000

 by killing session 275.54937

Sat May 23 23:15:58 2009

Killing enqueue blocker (pid=21881) on resource CF-00000000-00000000

 by killing session 275.54937

Sat May 23 23:15:58 2009

Killing enqueue blocker (pid=21881) on resource CF-00000000-00000000

 by killing session 275.54937

Sat May 23 23:17:14 2009

Killing enqueue blocker (pid=21881) on resource CF-00000000-00000000

 by killing session 275.54937

Sat May 23 23:22:34 2009

Killing enqueue blocker (pid=21881) on resource CF-00000000-00000000

 by terminating the process

看这个信息,问题似乎是由于全局资源被占用导致的。而且刚才后台运行的备份脚本也有对应的错误信息:

Mon May 25 15:17:58 2009

GES: Potential blocker (pid=8568) on resource CF-00000000-00000000;

 enqueue info in file /opt/oracle/admin/tradedb/bdump/tradedb1_lmd0_7210.trc and DIAG trace file

Mon May 25 15:17:58 2009

Killing enqueue blocker (pid=8568) on resource CF-00000000-00000000

 by killing session 87.1697

Mon May 25 15:17:58 2009

Killing enqueue blocker (pid=8568) on resource CF-00000000-00000000

 by killing session 87.1697

Mon May 25 15:21:57 2009

Killing enqueue blocker (pid=8568) on resource CF-00000000-00000000

 by killing session 87.1697

Mon May 25 15:21:58 2009

Killing enqueue blocker (pid=8568) on resource CF-00000000-00000000

 by killing session 87.1697

Mon May 25 15:25:28 2009

Killing enqueue blocker (pid=8568) on resource CF-00000000-00000000

 by terminating the process

很快刚才RMAN操作也报错退出了:

bash-3.00$ rman target /

恢复管理器: Release 10.2.0.3.0 - Production on星期一5月25 15:02:51 2009

Copyright (c) 1982, 2005, Oracle.  All rights reserved.

RMAN-06900:警告:无法生成V$RMAN_STATUS或V$RMAN_OUTPUT行

RMAN-06901:警告:禁止更新V$RMAN_STATUS和V$RMAN_OUTPUT行

来自目标数据库的ORACLE错误:

RMAN-00571: ===========================================================

RMAN-00569: =============== ERROR MESSAGE STACK FOLLOWS ===============

RMAN-00571: ===========================================================

RMAN-00554:内部恢复管理器程序包初始化失败

RMAN-06003:来自目标数据库的ORACLE错误:

ORA-03114:未连接到ORALCE

对比rman的错误信息,以及Oracle的alert文件中的错误信息,很可能正是由于无法获取到共享的资源,导致无法更新V$RMAN_STATUS和V$RMAN_OUTPUT信息,从而导致RMAN连接报错。

检查trace文件/opt/oracle/admin/tradedb/bdump/tradedb1_lmd0_7210.trc的信息,在RMAN报错的时间点发现了错误信息:

*** 2009-05-25 15:17:58.130

DUMP LOCAL BLOCKER/HOLDER: block level 3 res [0x0][0x0],[CF]

----------resource 0x5763846d0----------------------

resname       : [0x0][0x0],[CF]

Local node    : 0

dir_node      : 1

master_node   : 1

hv idx        : 95

hv last r.inc : 2

current inc   : 4

hv status     : 0

hv master     : 1

open options  : dd

Held mode     : KJUSERPW

Cvt mode      : KJUSERNL

Next Cvt mode : KJUSERPR

msg_seq       : bdddbdcf

res_seq       : 5

grant_bits    : KJUSERNL KJUSERCR KJUSERPW

grant mode    : KJUSERNL  KJUSERCR  KJUSERCW  KJUSERPR  KJUSERPW  KJUSEREX

count         : 2         1         0         0         1         0

val_state     : KJUSERVS_VALUE

valblk        : 0x0009c399000000000038000000000000 .8

access_node   : 0

vbreq_state   : 0

state         : x8

resp          : 5763846d0

On Scan_q?    : N

Total accesses: 4017991

Imm.  accesses: 3763043

Granted_locks : 2

Cvting_locks  : 2

value_block:  0009 c3 99 00 00 00 00 00 38 00 00 00 00 00 00

GRANTED_Q :

lp 8b6213238 gl KJUSERPW rp 5763846d0 [0x0][0x0],[CF]

  master 1 gl owner 8b60b8da0 possible pid 8568 xid 1006-0066-000016ACbast 0 rseq 5 mseq 0 history 0x49a51495

  open opt KJUSERDEADLOCK  KJUSERIDLK

lp 737246d38 gl KJUSERCR rp 5763846d0 [0x0][0x0],[CF]

  master 1 gl owner737115f20 possible pid 7228 xid 0001-0013-00000004 bast 0 rseq 5 mseq 0 history 0x9a5

  open opt KJUSERDEADLOCK  KJUSERIDLK

CONVERT_Q:

lp 737336690 gl KJUSERNL rl KJUSERPR rp 5763846d0 [0x0][0x0],[CF]

  master 1 gl owner 73710ce50 possible pid 7654 xid 1001-001D-00000019 bast 0 rseq 5 mseq 0 history 0x49a5149a

  convert opt KJUSERGETVALUE KJUSERNODEADLOCKWAIT

lp 8b62134d8 gl KJUSERNL rl KJUSERPR rp 5763846d0 [0x0][0x0],[CF]

  master 1 gl owner 8b61173d0 possible pid 7652 xid 1001-0012-00000002 bast 0 rseq 5 mseq 0 history 0x49a5149a

  convert opt KJUSERGETVALUE KJUSERNODEADLOCKWAIT

----------enqueue 0x8b6213238------------------------

lock version     : 21790201

Owner node       : 0

grant_level      : KJUSERPW

req_level        : KJUSERPW

bast_level       : KJUSERNL

notify_func      : 0

resp             : 5763846d0

procp            : 58e58acd8

pid              : 8568

proc version     : 1264

oprocp           : 0

opid             : 0

group lock owner : 8b60b8da0

possible pid     : 8568

xid              : 1006-0066-000016AC

dd_time          : 0.0 secs

dd_count         : 0

timeout          : 0.0 secs

On_timer_q?      : N

On_dd_q?         : N

lock_state       : GRANTED

Open Options     : KJUSERDEADLOCK  KJUSERIDLK

Convert options  : KJUSERGETVALUE KJUSERNODEADLOCKWAIT

History          : 0x49a51495

Msg_Seq          : 0x0

res_seq          : 5

valblk           : 0x0009c399000000000038000000000000 .8

Potential blocker (pid=8568) on resource CF-00000000-00000000

DUMP LOCAL BLOCKER: initiate state dump for TIMEOUT

  possible owner[102.8568]

Submitting asynchronized dump request [28]

KCC: controlfile sequence number from SGA =  639897

Killing enqueue blocker (pid=8568) on resource CF-00000000-00000000

 by killing session 87.1697

Kill session sent to session 87.1697

*** 2009-05-25 15:18:02.165

Setting 3-way CR grants to 0 global-lru off? 0

*** 2009-05-25 15:19:04.032

Setting 3-way CR grants to 1 global-lru off? 0

*** 2009-05-25 15:21:57.782

Killing enqueue blocker (pid=8568) on resource CF-00000000-00000000

 by killing session 87.1697

Kill session sent to session 87.1697

这个信息进一步证实问题确实是由于资源被占用所引发的。既然是资源占用问题,检查一下系统当前的锁信息:

SQL> SELECT SID, TYPE, ID1, ID2, LMODE, REQUEST, CTIME, BLOCK

  2  FROM V$LOCK

  3  WHERE SID IN (SELECT SID FROM V$SESSION WHERE USERNAME IS NOT NULL);

       SID TY        ID1        ID2      LMODE    REQUEST      CTIME      BLOCK

---------- -- ---------- ---------- ---------- ---------- ---------- ----------

       118 TO     196404          1          3          0     144940          2

       118 JQ          0          4          6          0     144943          2

       102 JQ          0         74          6          0     141340          2

       289 TO  -40016215          1          3          0     141337          2

       289 JQ          0         27          6          0     141343          2

       102 TM      70434          0          3          0     141319          2

       118 TM     196404          0          3          0     144940          2

       118 TX     262181      87690          6          0     144940          2

       289 TX     458770      86777          6          0     141337          2

已选择9行。

当前持有锁的会话都和JOB有关,且这个锁的持有时间已经不短了,检查一下当前系统运行的JOB:

SQL> SELECT SID, JOB, LAST_DATE, THIS_DATE

  2  FROM DBA_JOBS_RUNNING;

       SID        JOB LAST_DATE           THIS_DATE

---------- ---------- ------------------- -------------------

       118          4 2009-05-10 00:00:05 2009-05-24 00:00:02

       102         74 2009-05-23 01:00:00 2009-05-24 01:00:04

       289         27 2009-05-17 01:00:03 2009-05-24 01:00:04

可以看到,这3个JOB的运行时间都不正常,最长的一个居然已经运行了将近两天。

另外在检查操作系统的时候还发现一个问题,后台存在大量的racgmain进程:

bash-3.00$ ps -ef|grep racgmain     

  oracle 24085     1   0   May 23 ?           0:00 /opt/oracle/product/10.2/crs/bin/racgmain check

  oracle 20734     1   0   May 23 ?           0:00 /opt/oracle/product/10.2/crs/bin/racgmain check

  oracle 19021     1   0   May 23 ?           0:00 /opt/oracle/product/10.2/database/bin/racgmain check

  oracle 19035     1   0   May 23 ?           0:00 /opt/oracle/product/10.2/database/bin/racgmain check

  oracle 20207     1   0   May 23 ?           0:00 /opt/oracle/product/10.2/database/bin/racgmain check

.

.

.

  oracle 18643     1   0   May 23 ?           0:00 /opt/oracle/product/10.2/database/bin/racgmain check

  oracle 18420     1   0   May 23 ?           0:00 /opt/oracle/product/10.2/database/bin/racgmain check

统计一下,发现居然有四百多个进程:

bash-3.00$ ps -ef|grep -c "racgmain check"

449

看来整个问题还是很复杂的,而且要解决这个问题,还需要找到一个方向来入手解决这个问题。

 

 

(以上内容摘于网络,如有侵权,请告之,将第一时间删除)

相关文章 [上一篇] Oracle10203RAC环境添加新节点(五)
010-88589926(88587026)
CUUG热门培训课程
Oracle DBA就业培训
CUUG名师
网络课程
技术沙龙
最新动态

总机:(010)-88589926,88589826,88587026 QQ讨论群:243729577 182441349 邮箱:cuug_bj@cuug.com
通信地址:北京市海淀区紫竹院路98号北京化工大学科技园609室(CUUG)邮政编码:100089 
中国UNIX用户协会 Copyright 2010  ALL Rights Reserved 北京神脑资讯技术有限公司
京ICP备11008061号  京公网安备110108006275号