一个客户的11.2的RAC环境,出现了一个节点DOWN掉的问题。
这篇继续描述定位问题、解决问题的过程。
在检查了ASM和数据库的告警日志后,发现错误是由于CLUSTER问题导致,而分析了CLUSTER上的告警信息,进一步发现问题似乎和时间同步服务有关。
检查otcssd日志信息发现:
2010-05-10 15:21:25.785: [ CTSS][2571]ctssslave_swm: The magnitude [31532764620334] of the offset [-31532764620334 usec] is larger than [86400000000 usec] sec which is the CTSS limit.
2010-05-10 15:21:25.785: [ CTSS][2571]ctssslave_swm: The magnitude of the systime diff is larger than max adjtime limit. Offset [-31532764620334] usec will be changed to max adjtime limit [+/- 131071].
2010-05-10 15:21:25.785: [ CTSS][2571]ctssslave_swm15: The CTSS master is ahead this node. The local time offset [131071 usec] is being adjusted. Sync method [2]
2010-05-10 15:21:25.786: [ CTSS][2571]ctssslave_swm17: LT [1273476085sec 785752usec], MT [1305008850sec 406086usec], Delta [1412usec]
2010-05-10 15:21:25.786: [ CTSS][2571]ctssslave_swm19: The offset is [-131071 usec] and sync interval set to [4]
2010-05-10 15:21:25.786: [ CTSS][2571]ctssslave_swm: Received from master (mode [0xcc] nodenum [1] hostname [jzdbnc] )
2010-05-10 15:21:25.786: [ CTSS][2571]ctsselect_msm: Sync interval returned in [4]
2010-05-10 15:21:25.786: [ CTSS][2057]ctssslave_msg_handler4_3: slave_sync_with_master finished sync process. Exiting clsctssslave_msg_handler
2010-05-10 15:21:26.886: [ CTSS][2571]ctsselect_msm: CTSS mode is [0xc4]
2010-05-10 15:21:26.886: [ CTSS][2571]ctssslave_swm1_2: Ready to initiate new time sync process.
2010-05-10 15:21:26.886: [ CTSS][2571]ctssslave_swm2_1: Waiting for time sync message from master. sync_state[2].
2010-05-10 15:21:26.887: [ CTSS][2057]ctsscomm_recv_cb2: Receive incoming message event. Msgtype [2].
2010-05-10 15:21:26.887: [ CTSS][2057]ctssslave_msg_handler4_1: Waiting for slave_sync_with_master to finish sync process. sync_state[3].
2010-05-10 15:21:26.887: [ CTSS][2571]ctssslave_swm2_3: Received time sync message from master.
2010-05-10 15:21:26.887: [ CTSS][2571]ctssslave_swm: The magnitude [31532764520368] of the offset [-31532764520368 usec] is larger than [86400000000 usec] sec which is the CTSS limit.
2010-05-10 15:21:26.887: [ CTSS][2571]ctssslave_swm: The magnitude of the systime diff is larger than max adjtime limit. Offset [-31532764520368] usec will be changed to max adjtime limit [+/- 131071].
2010-05-10 15:21:26.887: [ CTSS][2571]ctssslave_swm15: The CTSS master is ahead this node. The local time offset [131071 usec] is being adjusted. Sync method [2]
2010-05-10 15:21:26.888: [ CTSS][2571]ctssslave_swm17: LT [1273476086sec 887774usec], MT [1305008851sec 408142usec], Delta [1430usec]
2010-05-10 15:21:26.888: [ CTSS][2571]ctssslave_swm19: The offset is [-131071 usec] and sync interval set to [4]
2010-05-10 15:21:26.889: [ CTSS][2571]ctssslave_swm: Received from master (mode [0xcc] nodenum [1] hostname [jzdbnc] )
2010-05-10 15:21:26.889: [ CTSS][2571]ctsselect_msm: Sync interval returned in [4]
2010-05-10 15:21:26.889: [ CTSS][2057]ctssslave_msg_handler4_3: slave_sync_with_master finished sync process. Exiting clsctssslave_msg_handler
2010-05-10 15:21:27.989: [ CTSS][2571]ctsselect_msm: CTSS mode is [0xc4]
2010-05-10 15:21:27.989: [ CTSS][2571]ctssslave_swm1_2: Ready to initiate new time sync process.
2010-05-10 15:21:27.989: [ CTSS][2571]ctssslave_swm2_1: Waiting for time sync message from master. sync_state[2].
2010-05-10 15:21:27.990: [ CTSS][2057]ctsscomm_recv_cb2: Receive incoming message event. Msgtype [2].
2010-05-10 15:21:27.990: [ CTSS][2057]ctssslave_msg_handler4_1: Waiting for slave_sync_with_master to finish sync process. sync_state[3].
2010-05-10 15:21:27.990: [ CTSS][2571]ctssslave_swm2_3: Received time sync message from master.
2010-05-10 15:21:27.990: [ CTSS][2571]ctssslave_swm: The magnitude [31532764419342] of the offset [-31532764419342 usec] is larger than [86400000000 usec] sec which is the CTSS limit.
2010-05-10 15:21:27.990: [ CTSS][2571]ctssslave_swm: The magnitude of the systime diff is larger than max adjtime limit. Offset [-31532764419342] usec will be changed to max adjtime limit [+/- 131071].
2010-05-10 15:21:27.990: [ CTSS][2571]ctssslave_swm15: The CTSS master is ahead this node. The local time offset [131071 usec] is being adjusted. Sync method [2]
2010-05-10 15:21:27.990: [ CTSS][2571]ctssslave_swm17: LT [1273476087sec 990725usec], MT [1305008852sec 410067usec], Delta [1347usec]
2010-05-10 15:21:27.991: [ CTSS][2571]ctssslave_swm19: The offset is [-131071 usec] and sync interval set to [4]
2010-05-10 15:21:27.991: [ CTSS][2571]ctssslave_swm: Received from master (mode [0xcc] nodenum [1] hostname [jzdbnc] )
2010-05-10 15:21:27.991: [ CTSS][2571]ctsselect_msm: Sync interval returned in [4]
2010-05-10 15:21:27.991: [ CTSS][2057]ctssslave_msg_handler4_3: slave_sync_with_master finished sync process. Exiting clsctssslave_msg_handler
2010-05-10 15:21:29.062: [ CTSS][515]ctss_checkcb: clsdm requested check alive. checkcb_data{mode[0xc4], offset[-1467993347 ms]}, length=[8].
2010-05-10 15:21:29.091: [ CTSS][2571]ctsselect_msm: CTSS mode is [0xc4]
2010-05-10 15:21:29.091: [ CTSS][2571]ctssslave_swm1_2: Ready to initiate new time sync process.
2010-05-10 15:21:29.091: [ CTSS][2571]ctssslave_swm2_1: Waiting for time sync message from master. sync_state[2].
2010-05-10 15:21:29.092: [ CTSS][2057]ctsscomm_recv_cb2: Receive incoming message event. Msgtype [2].
2010-05-10 15:21:29.092: [ CTSS][2057]ctssslave_msg_handler4_1: Waiting for slave_sync_with_master to finish sync process. sync_state[3].
2010-05-10 15:21:29.092: [ CTSS][2571]ctssslave_swm2_3: Received time sync message from master.
2010-05-10 15:21:29.092: [ CTSS][2571]ctssslave_swm: The magnitude [31532764319434] of the offset [-31532764319434 usec] is larger than [86400000000 usec] sec which is the CTSS limit.
2010-05-10 15:21:29.092: [ CTSS][2571]ctssslave_swm: The magnitude of the systime diff is larger than max adjtime limit. Offset [-31532764319434] usec will be changed to max adjtime limit [+/- 131071].
2010-05-10 15:21:29.092: [ CTSS][2571]ctssslave_swm15: The CTSS master is ahead this node. The local time offset [131071 usec] is being adjusted. Sync method [2]
2010-05-10 15:21:29.093: [ CTSS][2571]ctssslave_swm17: LT [1273476089sec 92763usec], MT [1305008853sec 412197usec], Delta [1454usec]
2010-05-10 15:21:29.093: [ CTSS][2571]ctssslave_swm19: The offset is [-131071 usec] and sync interval set to [4]
2010-05-10 15:21:29.093: [ CTSS][2571]ctssslave_swm: Received from master (mode [0xcc] nodenum [1] hostname [jzdbnc] )
2010-05-10 15:21:29.093: [ CTSS][2571]ctsselect_msm: Sync interval returned in [4]
2010-05-10 15:21:29.093: [ CTSS][2057]ctssslave_msg_handler4_3: slave_sync_with_master finished sync process. Exiting clsctssslave_msg_handler
2010-05-10 15:21:30.193: [ CTSS][2571]ctsselect_msm: CTSS mode is [0xc4]
2010-05-10 15:21:30.193: [ CTSS][2571]ctssslave_swm1_2: Ready to initiate new time sync process.
2010-05-10 15:21:30.193: [ CTSS][2571]ctssslave_swm2_1: Waiting for time sync message from master. sync_state[2].
2010-05-10 15:21:30.194: [ CTSS][2057]ctsscomm_recv_cb2: Receive incoming message event. Msgtype [2].
2010-05-10 15:21:30.194: [ CTSS][2057]ctssslave_msg_handler4_1: Waiting for slave_sync_with_master to finish sync process. sync_state[3].
2010-05-10 15:21:30.194: [ CTSS][2571]ctssslave_swm2_3: Received time sync message from master.
2010-05-10 15:21:30.194: [ CTSS][2571]ctssslave_swm: The magnitude [31532764219409] of the offset [-31532764219409 usec] is larger than [86400000000 usec] sec which is the CTSS limit.
2010-05-10 15:21:30.194: [ CTSS][2571]ctssslave_swm: The magnitude of the systime diff is larger than max adjtime limit. Offset [-31532764219409] usec will be changed to max adjtime limit [+/- 131071].
2010-05-10 15:21:30.194: [ CTSS][2571]ctssslave_swm15: The CTSS master is ahead this node. The local time offset [131071 usec] is being adjusted. Sync method [2]
2010-05-10 15:21:30.195: [ CTSS][2571]ctssslave_swm17: LT [1273476090sec 194781usec], MT [1305008854sec 414190usec], Delta [1382usec]
2010-05-10 15:21:30.195: [ CTSS][2571]ctssslave_swm19: The offset is [-131071 usec] and sync interval set to [4]
2010-05-10 15:21:30.195: [ CTSS][2571]ctssslave_swm: Received from master (mode [0xcc] nodenum [1] hostname [jzdbnc] )
2010-05-10 15:21:30.195: [ CTSS][2571]ctsselect_msm: Sync interval returned in [4]
2010-05-10 15:21:30.195: [ CTSS][2057]ctssslave_msg_handler4_3: slave_sync_with_master finished sync process. Exiting clsctssslave_msg_handler
2010-05-10 15:21:31.295: [ CTSS][2571]ctsselect_msm: CTSS mode is [0xc4]
2010-05-10 15:21:31.295: [ CTSS][2571]ctssslave_swm1_2: Ready to initiate new time sync process.
2010-05-10 15:21:31.295: [ CTSS][2571]ctssslave_swm2_1: Waiting for time sync message from master. sync_state[2].
2010-05-10 15:21:31.296: [ CTSS][2057]ctsscomm_recv_cb2: Receive incoming message event. Msgtype [2].
2010-05-10 15:21:31.296: [ CTSS][2057]ctssslave_msg_handler4_1: Waiting for slave_sync_with_master to finish sync process. sync_state[3].
2010-05-10 15:21:31.296: [ CTSS][2571]ctssslave_swm2_3: Received time sync message from master.
2010-05-10 15:21:31.296: [ CTSS][2571]ctssslave_swm: The magnitude [31532764119367] of the offset [-31532764119367 usec] is larger than [86400000000 usec] sec which is the CTSS limit.
2010-05-10 15:21:31.296: [ CTSS][2571]ctssslave_swm: The magnitude of the systime diff is larger than max adjtime limit. Offset [-31532764119367] usec will be changed to max adjtime limit [+/- 131071].
2010-05-10 15:21:31.297: [ CTSS][2571]ctssslave_swm15: The CTSS master is ahead this node. The local time offset [131071 usec] is being adjusted. Sync method [2]
2010-05-10 15:21:31.297: [ CTSS][2571]ctssslave_swm17: LT [1273476091sec 296807usec], MT [1305008855sec 416174usec], Delta [1431usec]
2010-05-10 15:21:31.297: [ CTSS][2571]ctssslave_swm19: The offset is [-131071 usec] and sync interval set to [4]
2010-05-10 15:21:31.297: [ CTSS][2571]ctssslave_swm: Received from master (mode [0xcc] nodenum [1] hostname [jzdbnc] )
2010-05-10 15:21:31.297: [ CTSS][2571]ctsselect_msm: Sync interval returned in [4]
2010-05-10 15:21:31.297: [ CTSS][2057]ctssslave_msg_handler4_3: slave_sync_with_master finished sync process. Exiting clsctssslave_msg_handler
2010-05-10 15:21:32.397: [ CTSS][2571]ctsselect_msm: CTSS mode is [0xc4]
2010-05-10 15:21:32.397: [ CTSS][2571]ctssslave_swm1_2: Ready to initiate new time sync process.
2010-05-10 15:21:32.397: [ CTSS][2571]ctssslave_swm2_1: Waiting for time sync message from master. sync_state[2].
2010-05-10 15:21:59.396: [ CTSS][515]ctss_checkcb: clsdm requested check alive. checkcb_data{mode[0xc4], offset[-1467993047 ms]}, length=[8].
2010-05-10 15:22:02.684: [ CTSS][2057]ctsscomm_recv_cb2: Receive incoming message event. Msgtype [2].
2010-05-10 15:22:02.684: [ CTSS][2057]ctssslave_msg_handler4_1: Waiting for slave_sync_with_master to finish sync process. sync_state[3].
2010-05-10 15:22:02.684: [ CTSS][2571]ctssslave_swm2_3: Received time sync message from master.
2010-05-10 15:22:02.684: [ CTSS][2571]ctssslave_swm: sendtime{sec[1273476092], usec[397374]}, receivetime{sec[1273476122], usec[684693]}.
2010-05-10 15:22:02.684: [ CTSS][2571]ctssslave_swm: The RTT of sync msg [30287319] is too large for time sync to be accurate. Recommends retry. Returns [17].
2010-05-10 15:22:02.684: [ CTSS][2571]ctssslave_swm: Received from master (mode [0xcc] nodenum [1] hostname [jzdbnc] )
2010-05-10 15:22:02.684: [ CTSS][2571]ctsselect_msm: Failed in sync_with_master [17]
2010-05-10 15:22:02.684: [ CTSS][2057]ctssslave_msg_handler4_3: slave_sync_with_master finished sync process. Exiting clsctssslave_msg_handler
2010-05-10 15:22:02.685: [ CTSS][2571]ctsselect_msm: Sync interval returned in [1]
2010-05-10 15:22:04.592: [GIPCHAUP][1029] gipchaUpperProcessDisconnect: processing DISCONNECT for hendp 111411950 [000000000000025a] { gipchaEndpoint : port 'CTSSGROUP_2/811b-0c78-d761-0a0d', peer 'jzdbnc:9ead-4b66-b4c7-458c', srcCid 00000000-0000025a, dstCid 00000000-0000028a, numSend 0, maxSend 100, usrFlags 0x4000, flags 0x204 }
2010-05-10 15:22:04.848: [ CSSCLNT][1543]clsssRecvMsg: got a disconnect from the server while waiting for message type 1
2010-05-10 15:22:04.848: [ CSSCLNT][1543]clssgsGroupGetStatus: communications failed (0/3/-1)
2010-05-10 15:22:04.848: [ CSSCLNT][1543]clssgsGroupGetStatus: returning 8
2010-05-10 15:22:04.848: [ CRSCCL][1543]Daemon exiting due to error in clssgsgrpstat retval = 8
2010-05-10 15:22:05.636: [ CTSS][1]Oracle Database CTSS Release 11.2.0.2.0 Production Copyright 2006, 2007 Oracle. All rights reserved.
2010-05-10 15:22:05.637: [ CTSS][1]ctss_scls_init: SCLs Context is 11098c370
[ clsdmt][515]Listening to (ADDRESS=(PROTOCOL=ipc)(KEY=jzdbiufoDBG_CTSSD))
2010-05-10 15:22:05.664: [ clsdmt][515]PID for the Process [9765310], connkey 11
2010-05-10 15:22:05.664: [ clsdmt][515]Creating PID [9765310] file for home /oracle/app/11.2.0/grid host jzdbiufo bin ctss to /oracle/app/11.2.0/grid/ctss/init/
2010-05-10 15:22:05.664: [ clsdmt][515]Writing PID [9765310] to the file [/oracle/app/11.2.0/grid/ctss/init/jzdbiufo.pid]
2010-05-10 15:22:06.505: [ CTSS][515]ctss_checkcb: clsdm requested check alive. checkcb_data{mode[0x0], offset[0 ms]}, length=[8].
2010-05-10 15:22:06.512: [ CTSS][1]ctss_css_init: clsssinit() failed with [3].
2010-05-10 15:22:06.512: [ CTSS][1](:ctss_css_init1:): CTSS daemon exiting. CSS unresponsive [3]
[ CTSS][1]SLOS : SLOS: cat=1311719766, pn=skgxnqtsz, dep=0, loc=SKGXN not av
2010-05-10 15:22:06.512: [ CTSS][1]ctss_main: CSS init failed [7]
2010-05-10 15:22:06.512: [ CTSS][1]ctss_main: CTSS init failed [7]
2010-05-10 15:22:06.512: [ CTSS][1]ctss_main: CTSS daemon aborting [7].
2010-05-10 15:22:06.512: [ CTSS][1]CTSS daemon aborting
2010-05-10 15:22:07.510: [ CTSS][515]ctss_checkcb: clsdm requested check alive. checkcb_data{mode[0x0], offset[0 ms]}, length=[8].
2010-05-10 15:22:08.511: [ CTSS][515]ctss_checkcb: clsdm requested check alive. checkcb_data{mode[0x0], offset[0 ms]}, length=[8].
2010-05-10 15:22:09.513: [ CTSS][515]ctss_checkcb: clsdm requested check alive. checkcb_data{mode[0x0], offset[0 ms]}, length=[8].
在错误信息中,出现了多次告警信息:
[CTSS][2571]ctssslave_swm: The magnitude [31532764119367] of the offset [-31532764119367 usec] is larger than [86400000000 usec] sec which is the CTSS limit.
显然Oracle的时间同步后台一直在调整两个服务器的时间,不过令人意外的是,本来只相差几秒的时间,为什么调整后变成了1个小时左右,而且这里的数值如此巨大。
简单分析一下不难发现,Oracle的时间调整限制是1天:
SQL> set numw 15
SQL> select 24 * 60 * 60 * 1000000 from dual;
24*60*60*1000000
----------------
86400000000
那么那个大的出奇的时间范围是什么呢:
SQL> select 31532764119367 / 1000000 / 60 / 60 / 24 from dual;
31532764119367/1000000/60/60/24
-------------------------------
364.96254767786
居然是365天,也就是1年的时间,这是巧合吗,检查两个节点上的时间发现:
$ hostname
jzdbiufo
$ date
Mon May 10 16:31:33 GMT+08:00 2010
另一个节点:
# hostname
jzdbnc
# date
Tue May 10 15:38:04 GMT+08:00 2011
二者相差不是1个小时,而是差1个小时1年的时间差。
显然是操作人员在手工调整时间的时候将一个节点上的日期错误的深入为2010年,导致两个节点的时间差达到1年之久,这么大的时间差,远远超过Oracle时间同步服务允许的最大限制,因此Oracle的时间同步只能在后台选择一个可接受的时间间隔,大概是0.13秒,来尝试不断的修改系统时间,以期达到不同两个节点时间的目的。
在CLUSTER出于不断的同步时间的过程中,客户又在两个节点上进行了大数据量的导入工作,由此导致了短时间内两个节点间的通信出现了超时,导致一个节点被踢出了CLUSTER,而由于时间同步的问题,导致了节点重启后试图再次加入到CLUSTER中报错。
找到问题,解决就很容易了,关闭CLUSTER后,将错误的节点上的时间调整到当前时间,再次启动CLUSTER和DB,问题解决。
(以上内容摘于网络,如有侵权,请告之,将第一时间删除)
总机:(010)-88589926,88589826,88587026 QQ讨论群:243729577 182441349 邮箱:cuug_bj@cuug.com
通信地址:北京市海淀区紫竹院路98号北京化工大学科技园609室(CUUG)邮政编码:100089
中国UNIX用户协会 Copyright 2010 ALL Rights Reserved 北京神脑资讯技术有限公司
京ICP备11008061号 京公网安备110108006275号