Toasters,
The following is a description of a problem that occurred over three nights last week. Although vendors are involved in trying to determine the cause and suggest 'recommended practise' changes, there may be someone out there who has experienced a similar issue.
What has been observed is:
- Oracle is unable to write to a redo log file and its mirror on drive E: or F:, with the logged error: "O/S-Error: (OS 64) The specified network name is no longer available"
- The Oracle Log Writer terminates the Oracle instance.
- The service "OracleService<SID>" terminates.
On one occasion this occurred on all seven servers at the same time, and on other nights on a smaller number of servers. Backups across the network were taking place at the time, so we are led to believe that this is load related.
There are seven Windows-based servers running SAP/Oracle. Five are Win2K/SP3, one is Win2K/SP4, and one is NT/SP6a. Oracle on NT is 7.2.4; Oracle on Win2K is 8.1.7. The SAP releases in use are 1 x 3.1i (NT), 3 x 4.6c and 3 x 3.0b.
These servers use data drives mapped off an F820 with 3TB on 3 x DS-14 shelves. Eight volumes are used, with all CIFS shares in qtrees. Logs and executables are kept on mapped drives, not local disk. The systems are attached to a Foundry (Bigiron400) via 1GB fibre. The F820 connects to the Foundry via trunked dual 1GB fibre (dual single-port cards).

The Filer appears to check out OK, with only uptime messages in the logs at this time and no link loss or any other indication of a problem. Netdiag -v reports only a small average packet size for a couple of hosts and a small number of retransmissions. Ifstat looks good. The Foundry appears to check out OK, with no indication of packet loss or similar.
There are a number of error messages in the Oracle logs. I have listed a couple of these below. What is interesting is that at no time have any of these machines indicated redirector problems in the event logs at the time of the event, the only event being the termination of the Oracle service.
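To line up the failure times across the servers against the backup window, the alert logs could be scanned for the O/S error lines. The following is only a rough Python sketch, assuming the alert-log layout matches the excerpts listed below (a timestamp line followed by the error text). The function name `os_error_times` and the regular expressions are my own illustration, not anything supplied by Oracle.

```python
import re
from datetime import datetime

# Timestamp lines in the excerpts look like "Fri Feb 13 20:07:19 2004".
TS_RE = re.compile(
    r"^(?:Mon|Tue|Wed|Thu|Fri|Sat|Sun) \w{3} [ \d]\d \d\d:\d\d:\d\d \d{4}"
)
# Error lines look like "O/S-Error: (OS 64) The specified network name ...".
OS_ERR_RE = re.compile(r"O/S-Error: \(OS (\d+)\)")


def os_error_times(lines):
    """Yield (timestamp, os_error_code) for every O/S-Error found.

    Each error is paired with the most recent timestamp seen in the log.
    Hypothetical helper; assumes English day and month names.
    """
    current = None
    for line in lines:
        line = line.strip()
        ts = TS_RE.match(line)
        if ts:
            current = datetime.strptime(ts.group(0), "%a %b %d %H:%M:%S %Y")
        err = OS_ERR_RE.search(line)
        if err and current is not None:
            yield current, int(err.group(1))
```

Running this over the alert log from each server and sorting the results by timestamp should show whether the OS 64 errors start within the same few seconds on every host.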
Some points:
- iSCSI may be looked at in the future. It is not an option at the moment.
- The automatic hourly snapshots on two volumes are being removed. All future snapshots will be scheduled.
- At first glance this looks like some sort of redirector issue where Oracle is sensitive to mapped drive loss. In one log Oracle places an entry in a log on a drive that it is reporting the loss of.
- If this is the case, then why does Win2K not report drive loss? Would Oracle not be responding to an SMB indication of drive loss?
- If the redirector is being hit and is not itself at fault, then what infrastructure-related issue might cause this?
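One way to check whether Windows itself sees the mapped drive drop, independently of Oracle, would be to run a small write probe against the same share during the backup window. This is only a sketch under my own assumptions: the function name `probe_write`, the probe file name and the 512-byte payload are hypothetical, and it has not been run on these servers.

```python
import os
import time


def probe_write(directory, name="probe.tmp"):
    """Write, flush and remove a small file in `directory`.

    Returns (ok, error_code, seconds_taken). Hypothetical helper for
    illustration; the file name and payload size are arbitrary choices.
    """
    path = os.path.join(directory, name)
    start = time.time()
    try:
        with open(path, "wb") as f:
            f.write(b"x" * 512)
            f.flush()
            os.fsync(f.fileno())  # push the write through to the server
        os.remove(path)
        return True, None, time.time() - start
    except OSError as exc:
        # On Windows, exc.winerror holds the OS error code (64 would be the
        # "network name is no longer available" case); elsewhere use errno.
        code = getattr(exc, "winerror", None) or exc.errno
        return False, code, time.time() - start
```

Calling this every few seconds against a directory on E: or F: and logging the result with a timestamp would show whether a plain file write also fails with OS 64 at the moment Oracle does, or merely slows down.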
Any and all comment appreciated.
Thanks,
Neil Stichbury
Technical Support
Gen-i Limited
New Zealand
Fri Feb 13 20:07:19 2004
KCF: write/open error block=0x404c online=1
file=8 E:\ORACLE\BWP\SAPDATA2\ROLL_1\ROLL.DATA1
error=27070 txt: 'OSD-04016: Error queuing an asynchronous I/O request.
O/S-Error: (OS 64) The specified network name is no longer available.'
Fri Feb 13 20:07:19 2004
Errors in file E:\oracle\BWP/saptrace/background\bwpLGWR.TRC:
ORA-00345: Message 345 not found; product=RDBMS; facility=ORA ; arguments: [199] [55]
ORA-00312: Message 312 not found; product=RDBMS; facility=ORA ; arguments: [2] [1] [E:\ORACLE\BWP\ORIGLOGB\LOG_G12M1.DBF]
ORA-27070: Message 27070 not found; product=RDBMS; facility=ORA
OSD-04016: Error queuing an asynchronous I/O request.
Thu Feb 12 20:05:52 2004
LGWR: terminating instance due to error 340
Thu Feb 12 20:05:52 2004
KCF: write/open error block=0xdda6 online=1
file=2 E:\ORACLE\D46\SAPDATA1\ROLL_1\ROLL.DATA1
error=27070 txt: 'OSD-04016: Error queuing an asynchronous I/O request.
O/S-Error: (OS 64) The specified network name is no longer available.'