Diagnosing an 11.2 RAC Auto-Start Failure

 


A customer's 11.2 RAC for Linux x86-64 environment ran into trouble after a power outage: the RAC stack would not start automatically.

 

 

After logging in to the database server, the clusterware processes appeared to still be running.
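A quick way to confirm that at the OS level is to look for the Grid Infrastructure daemons and ask CRS for a status summary (a sketch using standard 11.2 clusterware tools; this is not output from the customer's system):

$ ps -ef | egrep 'ohasd|ocssd|crsd|evmd' | grep -v grep    # are the grid daemons running?
$ crsctl check crs                                         # CRS / CSS / EVM health summary

Since the processes looked alive, a manual database startup was attempted: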

$ sqlplus / as sysdba

SQL*Plus: Release 11.2.0.2.0 Production on Sun May 29 20:17:16 2011

Copyright (c) 1982, 2010, Oracle. All rights reserved.

Connected to an idle instance.

SQL> startup

ORA-01078: failure in processing system parameters

ORA-01565: error in identifying file '+DG_DATA/SPSP/spfileSPSP.ora'

ORA-17503: ksfdopn:2 Failed to open file +DG_DATA/SPSP/spfileSPSP.ora

ORA-15056: additional error message

ORA-17503: ksfdopn:2 Failed to open file +DG_DATA/spsp/spfilespsp.ora

ORA-15001: diskgroup "DG_DATA" does not exist or is not mounted

ORA-06512: at line 4
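Since the spfile lives in the +DG_DATA disk group, the obvious next question is whether that disk group is mounted. Querying the local ASM instance directly would show this (a sketch, assuming the +ASM2 instance is still reachable from the grid home environment):

$ export ORACLE_SID=+ASM2           # grid infrastructure environment assumed
$ sqlplus / as sysasm
SQL> select name, state from v$asm_diskgroup;

Any group reported as DISMOUNTED here would explain the ORA-15001 above.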

The problem evidently lay with the ASM disk groups, so the ASM instance's alert log was checked next:

NOTE: client SPSP2:SPSP registered, osid 14319, mbr 0x1

Sun May 29 16:58:31 2011

WARNING: Write Failed. group:1 disk:0 AU:12 offset:856064 size:4096

NOTE: unable to write any mirror side for diskgroup DG_ARCH

NOTE: cache initiating offline of disk 0 group DG_ARCH

NOTE: process 13597 initiating offline of disk 0.3915929597 (ARCH1) with mask 0x7e in group 1

Sun May 29 16:58:31 2011

WARNING: Disk ARCH1 in mode 0x7f is now being offlined

WARNING: Disk ARCH1 in mode 0x7f is now being taken offline

NOTE: initiating PST update: grp = 1, dsk = 0/0xe96857fd, mode = 0x15

GMON updating disk modes for group 1 at 21 for pid 15, osid 13597

ERROR: too many offline disks in PST (grp 1)

Sun May 29 16:58:31 2011

NOTE: cache dismounting (not clean) group 1/0x9C08A705 (DG_ARCH)

WARNING: Disk ARCH1 in mode 0x7f offline aborted

WARNING: Offline for disk ARCH1 in mode 0x7f failed.

Sun May 29 16:58:31 2011

NOTE: halting all I/Os to diskgroup 1 (DG_ARCH)

NOTE: unable to offline disks after getting write error for diskgroup DG_ARCH

NOTE: disk 0 had IO error

NOTE: LGWR doing non-clean dismount of group 1 (DG_ARCH)

NOTE: LGWR sync ABA=237.2255 last written ABA 237.2256

Sun May 29 16:58:32 2011

ERROR: ORA-15130 in COD recovery for diskgroup 1/0x9c08a705 (DG_ARCH)

ERROR: ORA-15130 thrown in RBAL for group number 1

Errors in file /oracle/gridbase/diag/asm/+asm/+ASM2/trace/+ASM2_rbal_13603.trc:

ORA-15130: diskgroup "DG_ARCH" is being dismounted

WARNING: Read Failed. group:1 disk:0 AU:1 offset:4096 size:4096

WARNING: Read Failed. group:1 disk:0 AU:1 offset:0 size:4096

ERROR: no PST quorum in group: required 1, found 0

ERROR: Could not heartbeat PST for grp 1. Force dismounting the disk group.

WARNING: Write Failed. group:2 disk:2 AU:1 offset:1044480 size:4096

WARNING: Write Failed. group:2 disk:1 AU:1 offset:1044480 size:4096

WARNING: Write Failed. group:2 disk:0 AU:1 offset:1044480 size:4096

WARNING: disk 0.3915929598 (VD1) not responding to heart beat

WARNING: disk 1.3915929599 (VD2) not responding to heart beat

ERROR: too many offline disks in PST (grp 2)

WARNING: disk 2.3915929600 (VD3) not responding to heart beat

ERROR: too many offline disks in PST (grp 2)

Sun May 29 16:58:32 2011

Errors in file /oracle/gridbase/diag/asm/+asm/+ASM2/trace/+ASM2_b001_14488.trc:

ORA-15130: diskgroup "" is being dismounted

WARNING: Write Failed. group:3 disk:0 AU:1 offset:1044480 size:4096

WARNING: disk 0.3915929601 (DATA1) not responding to heart beat

ERROR: too many offline disks in PST (grp 3)

WARNING: Write Failed. group:4 disk:0 AU:1 offset:1044480 size:4096

WARNING: disk 0.3915929603 (FLSH1) not responding to heart beat

ERROR: too many offline disks in PST (grp 4)

WARNING: Write Failed. group:5 disk:0 AU:1 offset:1044480 size:4096

WARNING: disk 0.3915929604 (REDO1) not responding to heart beat

ERROR: too many offline disks in PST (grp 5)

Sun May 29 16:58:32 2011

NOTE: process 14490 initiating offline of disk 0.3915929598 (VD1) with mask 0x7e in group 2

NOTE: process 14490 initiating offline of disk 1.3915929599 (VD2) with mask 0x7e in group 2

NOTE: process 14490 initiating offline of disk 2.3915929600 (VD3) with mask 0x7e in group 2

NOTE: checking PST: grp = 2

.

.

.

freeing rdom 4

WARNING: dirty detached from domain 4

NOTE: cache dismounted group 4/0x9C28A708 (DG_FLSH)

SQL> alter diskgroup DG_FLSH dismount force /* ASM SERVER */

NOTE: cache deleting context for group DG_FLSH 4/0x9c28a708

GMON dismounting group 4 at 30 for pid 38, osid 14494

NOTE: Disk in mode 0x8 marked for de-assignment

SUCCESS: diskgroup DG_FLSH was dismounted

SUCCESS: alter diskgroup DG_FLSH dismount force /* ASM SERVER */

ERROR: PST-initiated MANDATORY DISMOUNT of group DG_FLSH

NOTE: diskgroup resource ora.DG_FLSH.dg is offline

List of instances:

2

Dirty detach reconfiguration started (new ddet inc 1, cluster inc 2)

Global Resource Directory partially frozen for dirty detach

* dirty detach - domain 5 invalid = TRUE

19 GCS resources traversed, 0 cancelled

Dirty Detach Reconfiguration complete

freeing rdom 5

WARNING: dirty detached from domain 5

NOTE: cache dismounted group 5/0x9C38A709 (DG_REDO)

SQL> alter diskgroup DG_REDO dismount force /* ASM SERVER */

NOTE: cache deleting context for group DG_REDO 5/0x9c38a709

GMON dismounting group 5 at 31 for pid 39, osid 14496

NOTE: Disk in mode 0x8 marked for de-assignment

SUCCESS: diskgroup DG_REDO was dismounted

SUCCESS: alter diskgroup DG_REDO dismount force /* ASM SERVER */

ERROR: PST-initiated MANDATORY DISMOUNT of group DG_REDO

NOTE: diskgroup resource ora.DG_REDO.dg is offline

Sun May 29 16:58:36 2011

Errors in file /oracle/gridbase/diag/asm/+asm/+ASM2/trace/+ASM2_ora_13637.trc:

ORA-15078: ASM diskgroup was forcibly dismounted

NOTE: client exited [13627]

Sun May 29 16:58:40 2011

Shutting down instance (abort)

License high water mark = 15

USER (ospid: 13624): terminating the instance

Instance terminated by USER, pid = 13624

Sun May 29 16:58:41 2011

Instance shutdown complete

According to these messages, ASM hit I/O errors while trying to write to the disk groups.
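Before digging further, it is worth checking from the OS whether the ASM disks are even visible and readable. The ORCL:VD1-style names seen later suggest ASMLib is in use, so a quick sanity check might look like this (a sketch; the dd target path is the standard ASMLib device location and is an assumption, not taken from the article):

# run as root
# oracleasm listdisks                                  # are all expected disk labels still visible?
# oracleasm querydisk VD1                              # does the label still map to a valid device?
# dd if=/dev/oracleasm/disks/VD1 of=/dev/null bs=4096 count=1   # can the disk be read at all?

A read failure here would point at the storage or multipath layer rather than at ASM itself.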

Next, the clusterware alert log was examined:

2011-05-29 00:00:19.569

[cssd(13270)]CRS-1649:An I/O error occured for voting file: ORCL:VD3; details at (:CSSNM00059:) in /oracle/product/11g/grid/log/oracle-01/cssd/ocssd.log.

2011-05-29 00:00:19.570

[cssd(13270)]CRS-1649:An I/O error occured for voting file: ORCL:VD2; details at (:CSSNM00059:) in /oracle/product/11g/grid/log/oracle-01/cssd/ocssd.log.

2011-05-29 00:00:19.678

[cssd(13270)]CRS-1649:An I/O error occured for voting file: ORCL:VD1; details at (:CSSNM00059:) in /oracle/product/11g/grid/log/oracle-01/cssd/ocssd.log.

2011-05-29 00:00:22.584

[cssd(13270)]CRS-1649:An I/O error occured for voting file: ORCL:VD3; details at (:CSSNM00060:) in /oracle/product/11g/grid/log/oracle-01/cssd/ocssd.log.

2011-05-29 00:00:22.584

[cssd(13270)]CRS-1649:An I/O error occured for voting file: ORCL:VD2; details at (:CSSNM00060:) in /oracle/product/11g/grid/log/oracle-01/cssd/ocssd.log.

2011-05-29 00:00:22.684

[cssd(13270)]CRS-1649:An I/O error occured for voting file: ORCL:VD1; details at (:CSSNM00060:) in /oracle/product/11g/grid/log/oracle-01/cssd/ocssd.log.

2011-05-29 00:00:24.818

[/oracle/product/11g/grid/bin/orarootagent.bin(13818)]CRS-5822:Agent '/oracle/product/11g/grid/bin/orarootagent_root' disconnected from server. Details at (:CRSAGF00117:) {0:1:4} in /oracle/product/11g/grid/log/oracle-01/agent/crsd/orarootagent_root/orarootagent_root.log.

2011-05-29 00:00:24.818

[/oracle/product/11g/grid/bin/oraagent.bin(13815)]CRS-5822:Agent '/oracle/product/11g/grid/bin/oraagent_grid' disconnected from server. Details at (:CRSAGF00117:) {0:2:8} in /oracle/product/11g/grid/log/oracle-01/agent/crsd/oraagent_grid/oraagent_grid.log.

2011-05-29 00:00:25.244

[ohasd(12737)]CRS-2765:Resource 'ora.crsd' has failed on server 'oracle-01'.

2011-05-29 00:00:26.626

[crsd(14838)]CRS-1013:The OCR location in an ASM disk group is inaccessible. Details in /oracle/product/11g/grid/log/oracle-01/crsd/crsd.log.

2011-05-29 00:00:27.278

[ohasd(12737)]CRS-2765:Resource 'ora.crsd' has failed on server 'oracle-01'.

2011-05-29 00:00:28.620

[crsd(14855)]CRS-1013:The OCR location in an ASM disk group is inaccessible. Details in /oracle/product/11g/grid/log/oracle-01/crsd/crsd.log.

2011-05-29 00:00:29.306

[ohasd(12737)]CRS-2765:Resource 'ora.crsd' has failed on server 'oracle-01'.

2011-05-29 00:00:30.664

[crsd(14868)]CRS-1013:The OCR location in an ASM disk group is inaccessible. Details in /oracle/product/11g/grid/log/oracle-01/crsd/crsd.log.

2011-05-29 00:00:31.333

[ohasd(12737)]CRS-2765:Resource 'ora.crsd' has failed on server 'oracle-01'.

2011-05-29 00:00:32.692

[crsd(14881)]CRS-1013:The OCR location in an ASM disk group is inaccessible. Details in /oracle/product/11g/grid/log/oracle-01/crsd/crsd.log.

2011-05-29 00:00:33.367

[ohasd(12737)]CRS-2765:Resource 'ora.crsd' has failed on server 'oracle-01'.

.

.

.

2011-05-29 20:14:32.155

[crsd(14562)]CRS-1013:The OCR location in an ASM disk group is inaccessible. Details in /oracle/product/11g/grid/log/oracle-01/crsd/crsd.log.

2011-05-29 20:14:32.822

[ohasd(12719)]CRS-2765:Resource 'ora.crsd' has failed on server 'oracle-01'.

2011-05-29 20:14:34.195

[crsd(14575)]CRS-1013:The OCR location in an ASM disk group is inaccessible. Details in /oracle/product/11g/grid/log/oracle-01/crsd/crsd.log.

2011-05-29 20:14:34.849

[ohasd(12719)]CRS-2765:Resource 'ora.crsd' has failed on server 'oracle-01'.

2011-05-29 20:14:34.849

[ohasd(12719)]CRS-2771:Maximum restart attempts reached for resource 'ora.crsd'; will not restart.

2011-05-29 20:15:43.700

[cssd(13251)]CRS-1615:No I/O has completed after 50% of the maximum interval. Voting file ORCL:VD2 will be considered not functional in 99960 milliseconds

2011-05-29 20:15:43.701

[cssd(13251)]CRS-1615:No I/O has completed after 50% of the maximum interval. Voting file ORCL:VD3 will be considered not functional in 99960 milliseconds

2011-05-29 20:15:44.702

[cssd(13251)]CRS-1615:No I/O has completed after 50% of the maximum interval. Voting file ORCL:VD1 will be considered not functional in 99100 milliseconds

2011-05-29 20:16:33.800

[cssd(13251)]CRS-1614:No I/O has completed after 75% of the maximum interval. Voting file ORCL:VD2 will be considered not functional in 49870 milliseconds

2011-05-29 20:16:33.801

[cssd(13251)]CRS-1614:No I/O has completed after 75% of the maximum interval. Voting file ORCL:VD3 will be considered not functional in 49870 milliseconds

2011-05-29 20:16:34.802

[cssd(13251)]CRS-1614:No I/O has completed after 75% of the maximum interval. Voting file ORCL:VD1 will be considered not functional in 49000 milliseconds

2011-05-29 20:17:03.861

[cssd(13251)]CRS-1613:No I/O has completed after 90% of the maximum interval. Voting file ORCL:VD1 will be considered not functional in 19950 milliseconds

2011-05-29 20:17:03.861

[cssd(13251)]CRS-1613:No I/O has completed after 90% of the maximum interval. Voting file ORCL:VD2 will be considered not functional in 19810 milliseconds

2011-05-29 20:17:03.861

[cssd(13251)]CRS-1613:No I/O has completed after 90% of the maximum interval. Voting file ORCL:VD3 will be considered not functional in 19810 milliseconds

2011-05-29 20:17:23.900

[cssd(13251)]CRS-1604:CSSD voting file is offline: ORCL:VD1; details at (:CSSNM00058:) in /oracle/product/11g/grid/log/oracle-01/cssd/ocssd.log.

2011-05-29 20:17:23.901

[cssd(13251)]CRS-1604:CSSD voting file is offline: ORCL:VD2; details at (:CSSNM00058:) in /oracle/product/11g/grid/log/oracle-01/cssd/ocssd.log.

2011-05-29 20:17:23.901

[cssd(13251)]CRS-1604:CSSD voting file is offline: ORCL:VD3; details at (:CSSNM00058:) in /oracle/product/11g/grid/log/oracle-01/cssd/ocssd.log.

2011-05-29 20:17:23.901

[cssd(13251)]CRS-1606:The number of voting files available, 0, is less than the minimum number of voting files required, 2, resulting in CSSD termination to ensure data integrity; details at (:CSSNM00018:) in /oracle/product/11g/grid/log/oracle-01/cssd/ocssd.log

2011-05-29 20:17:23.901

[cssd(13251)]CRS-1656:The CSS daemon is terminating due to a fatal error; Details at (:CSSSC00012:) in /oracle/product/11g/grid/log/oracle-01/cssd/ocssd.log

2011-05-29 20:17:23.952

[cssd(13251)]CRS-1652:Starting clean up of CRSD resources.

2011-05-29 20:17:24.133

[cssd(13251)]CRS-1653:The clean up of the CRSD resources failed.

These messages show that all the disk groups initially reached MOUNT state, but errors then occurred when Oracle tried to write to the disk group holding the voting files. CRSD, which depends on the OCR stored in ASM, failed as a result, and the other disk groups were subsequently taken OFFLINE by similar errors.

Moreover, because the voting disks could not be accessed, the node was soon evicted from the cluster, which caused the host to reboot.
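At this point the state of the voting files and the OCR can be checked directly with the standard clusterware tools (a sketch of 11.2 commands; no output from the customer's system is reproduced here):

$ crsctl query css votedisk        # list the voting files and the disk group that holds them
# ocrcheck                         # run as root: verify OCR integrity and its registered location

Both will fail or report the files as unavailable for as long as the disk groups that hold them cannot be mounted.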

Finally, the cssd/ocssd.log was examined:

2011-05-29 05:46:20.919: [ CSSD][1108568384]clssgmHandleGrockRcfgUpdate: grock(DAALL_DB), updateseq(5), status(0), sendresp(1)

2011-05-29 05:46:20.919: [ SKGFD][1076803904]ERROR: -15(asmlib ASM:/opt/oracle/extapi/64/asm/orcl/1/libasm.so op ioerror error I/O Error)

2011-05-29 05:46:20.919: [ CSSD][1108568384]clssgmTestSetLastGrockUpdate: grock(GR+GCR1), updateseq(4) msgseq(5), lastupdt<0xb585af0>, ignoreseq(0)

2011-05-29 05:46:20.919: [ CSSD][1076803904](:CSSNM00059:)clssnmvWriteBlocks: write failed at offset 17 of ORCL:VD1

2011-05-29 05:46:20.919: [ CSSD][1105414464]clssgmDeadProc: proc 0xb23bd10

2011-05-29 05:46:20.919: [ CSSD][1105414464]clssgmDestroyProc: cleaning up proc(0xb23bd10) con(0x4d0b) skgpid 14228 ospid 14228 with 0 clients, refcount 0

2011-05-29 05:46:20.919: [ CSSD][1105414464]clssgmDiscEndpcl: gipcDestroy 0x4d0b

2011-05-29 05:46:20.920: [ CSSD][1076803904]clssnmvDiskAvailabilityChange: voting file ORCL:VD1 now offline

2011-05-29 05:46:20.920: [ SKGFD][1097120064]Lib :ASM:/opt/oracle/extapi/64/asm/orcl/1/libasm.so: closing handle 0x2aaab01aa5b0 for disk :ORCL:VD1:

2011-05-29 05:46:20.919: [ CSSD][1108568384]clssgmRemoveMember: grock GR+GCR1, member number 3 (0x2aaab03886d0) node number 1 state 0x0 grock type 2

2011-05-29 05:46:20.920: [ CSSD][1108568384]clssgmQueueGrockEvent: groupName(GR+GCR1) count(3) master(0) event(1), incarn 5, mbrc 3, to member 1, events 0x280, state 0x0

2011-05-29 05:46:20.920: [ CSSD][1108568384]clssgmRPCDone: rpc 0x2aaaaaec7060 (RPC#98) state 6, flags 0x100

2011-05-29 05:46:20.920: [ CSSD][1108568384]clssgmDelMemCmpl: rpc 0x2aaaaaec7060, ret 0, client (nil) member 0x2aaab03886d0

2011-05-29 05:46:20.920: [ CSSD][1108568384]clssgmFreeRPCIndex: freeing rpc 98

2011-05-29 05:46:20.920: [ CSSD][1108568384]clssgmHandleGrockRcfgUpdate: grock(GR+GCR1), updateseq(5), status(0), sendresp(1)

2011-05-29 05:46:20.921: [ CSSD][1105414464]clssgmDeadProc: proc 0x2aaab03997f0

2011-05-29 05:46:20.921: [ CSSD][1105414464]clssgmFenceClient: fencing client (0x2aaab0399960), member 0 in group DGSPSP0, no share, death fence 1, SAGE fence 0

.

.

.

2011-05-29 20:44:54.778: [ CRSMAIN][1088713120] Checking the OCR device

2011-05-29 20:44:54.779: [ CRSMAIN][1088713120] Sync-up with OCR

2011-05-29 20:44:54.779: [ CRSMAIN][1088713120] Connecting to the CSS Daemon

2011-05-29 20:44:54.779: [ CRSMAIN][1088713120] Getting local node number

2011-05-29 20:44:54.779: [ CRSMAIN][1105021248] Policy Engine is not initialized yet!

2011-05-29 20:44:54.779: [ CRSMAIN][1088713120] Initializing OCR

[ CLWAL][1088713120]clsw_Initialize: OLR initlevel [70000]

2011-05-29 20:44:55.099: [ OCRASM][1088713120]proprasmo: Error in open/create file in dg [DG_CRS]

[ OCRASM][1088713120]SLOS : SLOS: cat=8, pn=kgfoOpenFile01, dep=15056, loc=kgfokge

2011-05-29 20:44:55.099: [ OCRASM][1088713120]ASM Error Stack :

2011-05-29 20:44:55.126: [ OCRASM][1088713120]proprasmo: kgfoCheckMount returned [6]

2011-05-29 20:44:55.126: [ OCRASM][1088713120]proprasmo: The ASM disk group DG_CRS is not found or not mounted

2011-05-29 20:44:55.127: [ OCRRAW][1088713120]proprioo: Failed to open [+DG_CRS]. Returned proprasmo() with [26]. Marking location as UNAVAILABLE.

2011-05-29 20:44:55.127: [ OCRRAW][1088713120]proprioo: No OCR/OLR devices are usable

2011-05-29 20:44:55.127: [ OCRASM][1088713120]proprasmcl: asmhandle is NULL

2011-05-29 20:44:55.127: [ GIPC][1088713120] gipcCheckInitialization: possible incompatible non-threaded init from [prom.c : 687], original from [clsss.c : 5325]

2011-05-29 20:44:55.129: [ default][1088713120]clsvactversion:4: Retrieving Active Version from local storage.

2011-05-29 20:44:55.130: [ CSSCLNT][1088713120]clssgsgrppubdata: group (ocr_Oracle-cluster) not found

2011-05-29 20:44:55.130: [ OCRRAW][1088713120]proprio_repairconf: Failed to retrieve the group public data. CSS ret code [20]

2011-05-29 20:44:55.133: [ OCRRAW][1088713120]proprioo: Failed to auto repair the OCR configuration.

2011-05-29 20:44:55.133: [ OCRRAW][1088713120]proprinit: Could not open raw device

2011-05-29 20:44:55.133: [ OCRASM][1088713120]proprasmcl: asmhandle is NULL

2011-05-29 20:44:55.134: [ OCRAPI][1088713120]a_init:16!: Backend init unsuccessful : [26]

2011-05-29 20:44:55.134: [ CRSOCR][1088713120] OCR context init failure. Error: PROC-26: Error while accessing the physical storage

2011-05-29 20:44:55.134: [ CRSOCR][1088713120][PANIC] OCR Context is NULL(File: caaocr.cpp, line: 145)

2011-05-29 20:44:55.134: [ CRSD][1088713120][PANIC] CRSD Exiting. OCR Failed

2011-05-29 20:44:55.134: [ CRSD][1088713120] Done.

Judging from these messages, the problem is directly related to the storage layer: either the disks themselves are damaged or the multipath configuration is broken. The clusterware failed while reading from and writing to the ASM disk groups that hold the voting files and the OCR, so the fault most likely lies in the raw devices that make up those disk groups.
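Checking the storage layer from the OS side might involve something like the following (a sketch; it assumes Linux device-mapper multipath, which is not confirmed in the article):

# run as root
# multipath -ll                    # show multipath topology and the state of each path
# dmesg | grep -i 'i/o error'      # kernel-level I/O errors on the underlying LUNs

Failed paths in multipath -ll, or SCSI errors in dmesg, would confirm that the problem sits below ASM.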

In the end, the customer resolved the problem by recreating the OCR and the voting disks.
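The exact steps are not given in the article. On 11.2, once the storage problem is fixed and the CRS disk group has been recreated, a commonly used sequence for rebuilding the OCR and the voting files looks roughly like this (a sketch only; the cluster name placeholder and the backup file name are assumptions, and a real recovery should follow Oracle Support guidance):

# run as root, on one node only
# crsctl stop crs -f                         # stop whatever is still running
# crsctl start crs -excl -nocrs              # start the stack in exclusive mode, without CRSD

# recreate the DG_CRS disk group from the ASM instance, then restore the OCR
# from an automatic backup kept under the grid home
# ocrconfig -restore /oracle/product/11g/grid/cdata/<cluster_name>/backup00.ocr

# place the voting files into the rebuilt disk group
# crsctl replace votedisk +DG_CRS

# restart the clusterware normally
# crsctl stop crs -f
# crsctl start crs

After that, the ASM instance and the database instances should come back through the normal resource startup.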

 

 

