一个客户的11.2 RAC for Linux X86-64环境停电后出现了故障,RAC环境无法自动启动。
登录数据库服务器后,发现CLUSTER进程存在,于是尝试手工启动数据库:
$ sqlplus / as sysdba
SQL*Plus: Release 11.2.0.2.0 Production on Sun May 29 20:17:16 2011
Copyright (c) 1982, 2010, Oracle. All rights reserved.
Connected to an idle instance.
SQL> startup
ORA-01078: failure in processing system parameters
ORA-01565: error in identifying file '+DG_DATA/SPSP/spfileSPSP.ora'
ORA-17503: ksfdopn:2 Failed to open file +DG_DATA/SPSP/spfileSPSP.ora
ORA-15056: additional error message
ORA-17503: ksfdopn:2 Failed to open file +DG_DATA/spsp/spfilespsp.ora
ORA-15001: diskgroup "DG_DATA" does not exist or is not mounted
ORA-06512: at line 4
看来问题出在ASM磁盘组上,检查ASM磁盘组的告警日志:
NOTE: client SPSP2:SPSP registered, osid 14319, mbr 0x1
Sun May 29 16:58:31 2011
WARNING: Write Failed. group:1 disk:0 AU:12 offset:856064 size:4096
NOTE: unable to write any mirror side for diskgroup DG_ARCH
NOTE: cache initiating offline of disk 0 group DG_ARCH
NOTE: process 13597 initiating offline of disk 0.3915929597 (ARCH1) with mask 0x7e in group 1
Sun May 29 16:58:31 2011
WARNING: Disk ARCH1 in mode 0x7f is now being offlined
WARNING: Disk ARCH1 in mode 0x7f is now being taken offline
NOTE: initiating PST update: grp = 1, dsk = 0/0xe96857fd, mode = 0x15
GMON updating disk modes for group 1 at 21 for pid 15, osid 13597
ERROR: too many offline disks in PST (grp 1)
Sun May 29 16:58:31 2011
NOTE: cache dismounting (not clean) group 1/0x9C08A705 (DG_ARCH)
WARNING: Disk ARCH1 in mode 0x7f offline aborted
WARNING: Offline for disk ARCH1 in mode 0x7f failed.
Sun May 29 16:58:31 2011
NOTE: halting all I/Os to diskgroup 1 (DG_ARCH)
NOTE: unable to offline disks after getting write error for diskgroup DG_ARCH
NOTE: disk 0 had IO error
NOTE: LGWR doing non-clean dismount of group 1 (DG_ARCH)
NOTE: LGWR sync ABA=237.2255 last written ABA 237.2256
Sun May 29 16:58:32 2011
ERROR: ORA-15130 in COD recovery for diskgroup 1/0x9c08a705 (DG_ARCH)
ERROR: ORA-15130 thrown in RBAL for group number 1
Errors in file /oracle/gridbase/diag/asm/+asm/+ASM2/trace/+ASM2_rbal_13603.trc:
ORA-15130: diskgroup "DG_ARCH" is being dismounted
WARNING: Read Failed. group:1 disk:0 AU:1 offset:4096 size:4096
WARNING: Read Failed. group:1 disk:0 AU:1 offset:0 size:4096
ERROR: no PST quorum in group: required 1, found 0
ERROR: Could not heartbeat PST for grp 1. Force dismounting the disk group.
WARNING: Write Failed. group:2 disk:2 AU:1 offset:1044480 size:4096
WARNING: Write Failed. group:2 disk:1 AU:1 offset:1044480 size:4096
WARNING: Write Failed. group:2 disk:0 AU:1 offset:1044480 size:4096
WARNING: disk 0.3915929598 (VD1) not responding to heart beat
WARNING: disk 1.3915929599 (VD2) not responding to heart beat
ERROR: too many offline disks in PST (grp 2)
WARNING: disk 2.3915929600 (VD3) not responding to heart beat
ERROR: too many offline disks in PST (grp 2)
Sun May 29 16:58:32 2011
Errors in file /oracle/gridbase/diag/asm/+asm/+ASM2/trace/+ASM2_b001_14488.trc:
ORA-15130: diskgroup "" is being dismounted
WARNING: Write Failed. group:3 disk:0 AU:1 offset:1044480 size:4096
WARNING: disk 0.3915929601 (DATA1) not responding to heart beat
ERROR: too many offline disks in PST (grp 3)
WARNING: Write Failed. group:4 disk:0 AU:1 offset:1044480 size:4096
WARNING: disk 0.3915929603 (FLSH1) not responding to heart beat
ERROR: too many offline disks in PST (grp 4)
WARNING: Write Failed. group:5 disk:0 AU:1 offset:1044480 size:4096
WARNING: disk 0.3915929604 (REDO1) not responding to heart beat
ERROR: too many offline disks in PST (grp 5)
Sun May 29 16:58:32 2011
NOTE: process 14490 initiating offline of disk 0.3915929598 (VD1) with mask 0x7e in group 2
NOTE: process 14490 initiating offline of disk 1.3915929599 (VD2) with mask 0x7e in group 2
NOTE: process 14490 initiating offline of disk 2.3915929600 (VD3) with mask 0x7e in group 2
NOTE: checking PST: grp = 2
.
.
.
freeing rdom 4
WARNING: dirty detached from domain 4
NOTE: cache dismounted group 4/0x9C28A708 (DG_FLSH)
SQL> alter diskgroup DG_FLSH dismount force /* ASM SERVER */
NOTE: cache deleting context for group DG_FLSH 4/0x9c28a708
GMON dismounting group 4 at 30 for pid 38, osid 14494
NOTE: Disk in mode 0x8 marked for de-assignment
SUCCESS: diskgroup DG_FLSH was dismounted
SUCCESS: alter diskgroup DG_FLSH dismount force /* ASM SERVER */
ERROR: PST-initiated MANDATORY DISMOUNT of group DG_FLSH
NOTE: diskgroup resource ora.DG_FLSH.dg is offline
List of instances:
2
Dirty detach reconfiguration started (new ddet inc 1, cluster inc 2)
Global Resource Directory partially frozen for dirty detach
* dirty detach - domain 5 invalid = TRUE
19 GCS resources traversed, 0 cancelled
Dirty Detach Reconfiguration complete
freeing rdom 5
WARNING: dirty detached from domain 5
NOTE: cache dismounted group 5/0x9C38A709 (DG_REDO)
SQL> alter diskgroup DG_REDO dismount force /* ASM SERVER */
NOTE: cache deleting context for group DG_REDO 5/0x9c38a709
GMON dismounting group 5 at 31 for pid 39, osid 14496
NOTE: Disk in mode 0x8 marked for de-assignment
SUCCESS: diskgroup DG_REDO was dismounted
SUCCESS: alter diskgroup DG_REDO dismount force /* ASM SERVER */
ERROR: PST-initiated MANDATORY DISMOUNT of group DG_REDO
NOTE: diskgroup resource ora.DG_REDO.dg is offline
Sun May 29 16:58:36 2011
Errors in file /oracle/gridbase/diag/asm/+asm/+ASM2/trace/+ASM2_ora_13637.trc:
ORA-15078: ASM diskgroup was forcibly dismounted
NOTE: client exited [13627]
Sun May 29 16:58:40 2011
Shutting down instance (abort)
License high water mark = 15
USER (ospid: 13624): terminating the instance
Instance terminated by USER, pid = 13624
Sun May 29 16:58:41 2011
Instance shutdown complete
根据错误信息,ASM在尝试写磁盘组的时候出现了IO错误。
进一步检查CLUSTER的日志信息:
2011-05-29 00:00:19.569
[cssd(13270)]CRS-1649:An I/O error occured for voting file: ORCL:VD3; details at (:CSSNM00059:) in /oracle/product/11g/grid/log/oracle-01/cssd/ocssd.log.
2011-05-29 00:00:19.570
[cssd(13270)]CRS-1649:An I/O error occured for voting file: ORCL:VD2; details at (:CSSNM00059:) in /oracle/product/11g/grid/log/oracle-01/cssd/ocssd.log.
2011-05-29 00:00:19.678
[cssd(13270)]CRS-1649:An I/O error occured for voting file: ORCL:VD1; details at (:CSSNM00059:) in /oracle/product/11g/grid/log/oracle-01/cssd/ocssd.log.
2011-05-29 00:00:22.584
[cssd(13270)]CRS-1649:An I/O error occured for voting file: ORCL:VD3; details at (:CSSNM00060:) in /oracle/product/11g/grid/log/oracle-01/cssd/ocssd.log.
2011-05-29 00:00:22.584
[cssd(13270)]CRS-1649:An I/O error occured for voting file: ORCL:VD2; details at (:CSSNM00060:) in /oracle/product/11g/grid/log/oracle-01/cssd/ocssd.log.
2011-05-29 00:00:22.684
[cssd(13270)]CRS-1649:An I/O error occured for voting file: ORCL:VD1; details at (:CSSNM00060:) in /oracle/product/11g/grid/log/oracle-01/cssd/ocssd.log.
2011-05-29 00:00:24.818
[/oracle/product/11g/grid/bin/orarootagent.bin(13818)]CRS-5822:Agent '/oracle/product/11g/grid/bin/orarootagent_root' disconnected from server. Details at (:CRSAGF00117:) {0:1:4} in /oracle/product/11g/grid/log/oracle-01/agent/crsd/orarootagent_root/orarootagent_root.log.
2011-05-29 00:00:24.818
[/oracle/product/11g/grid/bin/oraagent.bin(13815)]CRS-5822:Agent '/oracle/product/11g/grid/bin/oraagent_grid' disconnected from server. Details at (:CRSAGF00117:) {0:2:8} in /oracle/product/11g/grid/log/oracle-01/agent/crsd/oraagent_grid/oraagent_grid.log.
2011-05-29 00:00:25.244
[ohasd(12737)]CRS-2765:Resource 'ora.crsd' has failed on server 'oracle-01'.
2011-05-29 00:00:26.626
[crsd(14838)]CRS-1013:The OCR location in an ASM disk group is inaccessible. Details in /oracle/product/11g/grid/log/oracle-01/crsd/crsd.log.
2011-05-29 00:00:27.278
[ohasd(12737)]CRS-2765:Resource 'ora.crsd' has failed on server 'oracle-01'.
2011-05-29 00:00:28.620
[crsd(14855)]CRS-1013:The OCR location in an ASM disk group is inaccessible. Details in /oracle/product/11g/grid/log/oracle-01/crsd/crsd.log.
2011-05-29 00:00:29.306
[ohasd(12737)]CRS-2765:Resource 'ora.crsd' has failed on server 'oracle-01'.
2011-05-29 00:00:30.664
[crsd(14868)]CRS-1013:The OCR location in an ASM disk group is inaccessible. Details in /oracle/product/11g/grid/log/oracle-01/crsd/crsd.log.
2011-05-29 00:00:31.333
[ohasd(12737)]CRS-2765:Resource 'ora.crsd' has failed on server 'oracle-01'.
2011-05-29 00:00:32.692
[crsd(14881)]CRS-1013:The OCR location in an ASM disk group is inaccessible. Details in /oracle/product/11g/grid/log/oracle-01/crsd/crsd.log.
2011-05-29 00:00:33.367
[ohasd(12737)]CRS-2765:Resource 'ora.crsd' has failed on server 'oracle-01'.
.
.
.
2011-05-29 20:14:32.155
[crsd(14562)]CRS-1013:The OCR location in an ASM disk group is inaccessible. Details in /oracle/product/11g/grid/log/oracle-01/crsd/crsd.log.
2011-05-29 20:14:32.822
[ohasd(12719)]CRS-2765:Resource 'ora.crsd' has failed on server 'oracle-01'.
2011-05-29 20:14:34.195
[crsd(14575)]CRS-1013:The OCR location in an ASM disk group is inaccessible. Details in /oracle/product/11g/grid/log/oracle-01/crsd/crsd.log.
2011-05-29 20:14:34.849
[ohasd(12719)]CRS-2765:Resource 'ora.crsd' has failed on server 'oracle-01'.
2011-05-29 20:14:34.849
[ohasd(12719)]CRS-2771:Maximum restart attempts reached for resource 'ora.crsd'; will not restart.
2011-05-29 20:15:43.700
[cssd(13251)]CRS-1615:No I/O has completed after 50% of the maximum interval. Voting file ORCL:VD2 will be considered not functional in 99960 milliseconds
2011-05-29 20:15:43.701
[cssd(13251)]CRS-1615:No I/O has completed after 50% of the maximum interval. Voting file ORCL:VD3 will be considered not functional in 99960 milliseconds
2011-05-29 20:15:44.702
[cssd(13251)]CRS-1615:No I/O has completed after 50% of the maximum interval. Voting file ORCL:VD1 will be considered not functional in 99100 milliseconds
2011-05-29 20:16:33.800
[cssd(13251)]CRS-1614:No I/O has completed after 75% of the maximum interval. Voting file ORCL:VD2 will be considered not functional in 49870 milliseconds
2011-05-29 20:16:33.801
[cssd(13251)]CRS-1614:No I/O has completed after 75% of the maximum interval. Voting file ORCL:VD3 will be considered not functional in 49870 milliseconds
2011-05-29 20:16:34.802
[cssd(13251)]CRS-1614:No I/O has completed after 75% of the maximum interval. Voting file ORCL:VD1 will be considered not functional in 49000 milliseconds
2011-05-29 20:17:03.861
[cssd(13251)]CRS-1613:No I/O has completed after 90% of the maximum interval. Voting file ORCL:VD1 will be considered not functional in 19950 milliseconds
2011-05-29 20:17:03.861
[cssd(13251)]CRS-1613:No I/O has completed after 90% of the maximum interval. Voting file ORCL:VD2 will be considered not functional in 19810 milliseconds
2011-05-29 20:17:03.861
[cssd(13251)]CRS-1613:No I/O has completed after 90% of the maximum interval. Voting file ORCL:VD3 will be considered not functional in 19810 milliseconds
2011-05-29 20:17:23.900
[cssd(13251)]CRS-1604:CSSD voting file is offline: ORCL:VD1; details at (:CSSNM00058:) in /oracle/product/11g/grid/log/oracle-01/cssd/ocssd.log.
2011-05-29 20:17:23.901
[cssd(13251)]CRS-1604:CSSD voting file is offline: ORCL:VD2; details at (:CSSNM00058:) in /oracle/product/11g/grid/log/oracle-01/cssd/ocssd.log.
2011-05-29 20:17:23.901
[cssd(13251)]CRS-1604:CSSD voting file is offline: ORCL:VD3; details at (:CSSNM00058:) in /oracle/product/11g/grid/log/oracle-01/cssd/ocssd.log.
2011-05-29 20:17:23.901
[cssd(13251)]CRS-1606:The number of voting files available, 0, is less than the minimum number of voting files required, 2, resulting in CSSD termination to ensure data integrity; details at (:CSSNM00018:) in /oracle/product/11g/grid/log/oracle-01/cssd/ocssd.log
2011-05-29 20:17:23.901
[cssd(13251)]CRS-1656:The CSS daemon is terminating due to a fatal error; Details at (:CSSSC00012:) in /oracle/product/11g/grid/log/oracle-01/cssd/ocssd.log
2011-05-29 20:17:23.952
[cssd(13251)]CRS-1652:Starting clean up of CRSD resources.
2011-05-29 20:17:24.133
[cssd(13251)]CRS-1653:The clean up of the CRSD resources failed.
根据错误信息,开始的时候,所有的磁盘组都可以加载到MOUNT状态,但是随后Oracle在尝试写VOT所在磁盘组时,出现了错误,导致OCR进程退出,而其他的磁盘组也因为错误陆续被OFFLINE。
而且由于无法获取VOT磁盘,很快节点被踢出CLUSTER环境,导致主机重启。
最后检查了cssd/ocssd.log信息:
2011-05-29 05:46:20.919: [ CSSD][1108568384]clssgmHandleGrockRcfgUpdate: grock(DAALL_DB), updateseq(5), status(0), sendresp(1)
2011-05-29 05:46:20.919: [ SKGFD][1076803904]ERROR: -15(asmlib ASM:/opt/oracle/extapi/64/asm/orcl/1/libasm.so op ioerror error I/O Error)
2011-05-29 05:46:20.919: [ CSSD][1108568384]clssgmTestSetLastGrockUpdate: grock(GR+GCR1), updateseq(4) msgseq(5), lastupdt<0xb585af0>, ignoreseq(0)
2011-05-29 05:46:20.919: [ CSSD][1076803904](:CSSNM00059:)clssnmvWriteBlocks: write failed at offset 17 of ORCL:VD1
2011-05-29 05:46:20.919: [ CSSD][1105414464]clssgmDeadProc: proc 0xb23bd10
2011-05-29 05:46:20.919: [ CSSD][1105414464]clssgmDestroyProc: cleaning up proc(0xb23bd10) con(0x4d0b) skgpid 14228 ospid 14228 with 0 clients, refcount 0
2011-05-29 05:46:20.919: [ CSSD][1105414464]clssgmDiscEndpcl: gipcDestroy 0x4d0b
2011-05-29 05:46:20.920: [ CSSD][1076803904]clssnmvDiskAvailabilityChange: voting file ORCL:VD1 now offline
2011-05-29 05:46:20.920: [ SKGFD][1097120064]Lib :ASM:/opt/oracle/extapi/64/asm/orcl/1/libasm.so: closing handle 0x2aaab01aa5b0 for disk :ORCL:VD1:
2011-05-29 05:46:20.919: [ CSSD][1108568384]clssgmRemoveMember: grock GR+GCR1, member number 3 (0x2aaab03886d0) node number 1 state 0x0 grock type 2
2011-05-29 05:46:20.920: [ CSSD][1108568384]clssgmQueueGrockEvent: groupName(GR+GCR1) count(3) master(0) event(1), incarn 5, mbrc 3, to member 1, events 0x280, state 0x0
2011-05-29 05:46:20.920: [ CSSD][1108568384]clssgmRPCDone: rpc 0x2aaaaaec7060 (RPC#98) state 6, flags 0x100
2011-05-29 05:46:20.920: [ CSSD][1108568384]clssgmDelMemCmpl: rpc 0x2aaaaaec7060, ret 0, client (nil) member 0x2aaab03886d0
2011-05-29 05:46:20.920: [ CSSD][1108568384]clssgmFreeRPCIndex: freeing rpc 98
2011-05-29 05:46:20.920: [ CSSD][1108568384]clssgmHandleGrockRcfgUpdate: grock(GR+GCR1), updateseq(5), status(0), sendresp(1)
2011-05-29 05:46:20.921: [ CSSD][1105414464]clssgmDeadProc: proc 0x2aaab03997f0
2011-05-29 05:46:20.921: [ CSSD][1105414464]clssgmFenceClient: fencing client (0x2aaab0399960), member 0 in group DGSPSP0, no share, death fence 1, SAGE fence 0
.
.
.
2011-05-29 20:44:54.778: [ CRSMAIN][1088713120] Checking the OCR device
2011-05-29 20:44:54.779: [ CRSMAIN][1088713120] Sync-up with OCR
2011-05-29 20:44:54.779: [ CRSMAIN][1088713120] Connecting to the CSS Daemon
2011-05-29 20:44:54.779: [ CRSMAIN][1088713120] Getting local node number
2011-05-29 20:44:54.779: [ CRSMAIN][1105021248] Policy Engine is not initialized yet!
2011-05-29 20:44:54.779: [ CRSMAIN][1088713120] Initializing OCR
[ CLWAL][1088713120]clsw_Initialize: OLR initlevel [70000]
2011-05-29 20:44:55.099: [ OCRASM][1088713120]proprasmo: Error in open/create file in dg [DG_CRS]
[ OCRASM][1088713120]SLOS : SLOS: cat=8, pn=kgfoOpenFile01, dep=15056, loc=kgfokge
2011-05-29 20:44:55.099: [ OCRASM][1088713120]ASM Error Stack :
2011-05-29 20:44:55.126: [ OCRASM][1088713120]proprasmo: kgfoCheckMount returned [6]
2011-05-29 20:44:55.126: [ OCRASM][1088713120]proprasmo: The ASM disk group DG_CRS is not found or not mounted
2011-05-29 20:44:55.127: [ OCRRAW][1088713120]proprioo: Failed to open [+DG_CRS]. Returned proprasmo() with [26]. Marking location as UNAVAILABLE.
2011-05-29 20:44:55.127: [ OCRRAW][1088713120]proprioo: No OCR/OLR devices are usable
2011-05-29 20:44:55.127: [ OCRASM][1088713120]proprasmcl: asmhandle is NULL
2011-05-29 20:44:55.127: [ GIPC][1088713120] gipcCheckInitialization: possible incompatible non-threaded init from [prom.c : 687], original from [clsss.c : 5325]
2011-05-29 20:44:55.129: [ default][1088713120]clsvactversion:4: Retrieving Active Version from local storage.
2011-05-29 20:44:55.130: [ CSSCLNT][1088713120]clssgsgrppubdata: group (ocr_Oracle-cluster) not found
2011-05-29 20:44:55.130: [ OCRRAW][1088713120]proprio_repairconf: Failed to retrieve the group public data. CSS ret code [20]
2011-05-29 20:44:55.133: [ OCRRAW][1088713120]proprioo: Failed to auto repair the OCR configuration.
2011-05-29 20:44:55.133: [ OCRRAW][1088713120]proprinit: Could not open raw device
2011-05-29 20:44:55.133: [ OCRASM][1088713120]proprasmcl: asmhandle is NULL
2011-05-29 20:44:55.134: [ OCRAPI][1088713120]a_init:16!: Backend init unsuccessful : [26]
2011-05-29 20:44:55.134: [ CRSOCR][1088713120] OCR context init failure. Error: PROC-26: Error while accessing the physical storage
2011-05-29 20:44:55.134: [ CRSOCR][1088713120][PANIC] OCR Context is NULL(File: caaocr.cpp, line: 145)
2011-05-29 20:44:55.134: [ CRSD][1088713120][PANIC] CRSD Exiting. OCR Failed
2011-05-29 20:44:55.134: [ CRSD][1088713120] Done.
基于错误信息的判断,问题和存储层直接相关,如果不是磁盘损坏,就是多路径设置异常,导致问题的原因是cluster在读写VOT和OCR的ASM是出现了错误,问题应该发生在组成这个ASM磁盘组的裸设备上。
最终客户通过重建OCR和VOT解决了这个错误。
(以上内容摘于网络,如有侵权,请告之,将第一时间删除)
总机:(010)-88589926,88589826,88587026 QQ讨论群:243729577 182441349 邮箱:cuug_bj@cuug.com
通信地址:北京市海淀区紫竹院路98号北京化工大学科技园609室(CUUG)邮政编码:100089
中国UNIX用户协会 Copyright 2010 ALL Rights Reserved 北京神脑资讯技术有限公司
京ICP备11008061号 京公网安备110108006275号