Hi Alan & Art,
We had a similar problem with a customer and it turned out to be the combination of platform, Oracle version and DOT (Data ONTAP) version.
Customer was on an RS6000 with AIX 4.3.3 and Oracle 8.0.6 with DOT 6.1. Basically the problem did not appear during initial testing on DOT 5.x; however, in the live environment using DOT 6.x they had log corruptions. During testing, NetApp replicated the issue, which was resolved only by going to Oracle 8.1.7. We tried many things (AIX NFS patches, changed Oracle and NetApp parameters) without any success ...
I suggest you have a chat with NetApp as there are known issues with some version combinations ....
FYI the following is an extract of correspondence between NetApp and end-user ....
We have two other cases that are very similar to yours in that they involve corrupted Oracle redo logs with AIX. Over the past several weeks, we have created a test environment in our labs with an RS 6000 and AIX 4.3.3.0.
We're using an 840 filer with Ontap 6.1. We have a database with all of the files (data, control, online redo logs, and archived redo logs) on the filer. Oracle 7.3.4.5 is installed on the AIX box (I realize that you are running Oracle 8.0.6, but I'll address that later).
Our methodology was to create the database with just one table and one tablespace, then launch a script that does millions of inserts. The purpose was to generate a lot of redo activity. With the basic configuration (Oracle 7.3.4.5, Ontap 6.1, and AIX 4.3.3.0), we were able to reproduce the problem that another customer was having with the corrupted redo logs. In an attempt to rectify the problem, we used an iterative technique to try to isolate the cause of the corruption.
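For anyone wanting to reproduce that kind of load, a trivial shell loop feeding sqlplus churns redo the same way. This is only a sketch of the idea, not NetApp's actual script; the table, login, and row count below are invented:

    #!/bin/sh
    # Illustrative redo-load generator. Assumes a table created as:
    #   create table loadtest (n number);
    # scott/tiger is a placeholder login.
    (
      echo "set autocommit on"
      i=0
      while [ $i -lt 1000000 ]; do
        echo "insert into loadtest values ($i);"
        i=`expr $i + 1`
      done
    ) | sqlplus -s scott/tiger

With autocommit on, every insert commits, so the online redo logs cycle continuously and the archiver stays busy.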
We lowered the NFS packet size (rsize and wsize) incrementally from 32k to 16k to 8k to 4k. This prolonged the period of time before the corruption, but did not eliminate it altogether. We changed numerous parameters on the Filer to no avail. We tried UDP and TCP. Each time, the script failed with an archival error, which stalled the database. We had to perform a shutdown abort. Upon startup, there were errors reporting corruption in the online redo logs, which is very similar to your case.
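If you want to try the same knobs, rsize/wsize and the transport are ordinary NFS mount options on AIX. A sketch only; the filer name and paths are placeholders, and the exact option set you need may differ:

    # Remount the Oracle filesystem with 8k NFS transfers over TCP
    # (filer1 and the paths are examples, not our actual layout).
    mount -o rw,hard,intr,rsize=8192,wsize=8192,proto=tcp,vers=3 \
        filer1:/vol/oradata /oradata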
We've had conference calls with Oracle and IBM. Oracle dismissed this problem, because the corruption does not occur with both sets of redo logs on local disk. IBM, however, has acknowledged this is a problem. They are having development engineers analyze the AIX trace file we provided. IBM also told us that just last week they had released two new NFS-related patches for AIX. We have received these patches and applied them to our test system. Currently, we are running our tests with the most recent AIX NFS-specific patches installed. We will also be conducting similar tests with Oracle 8. Most likely, this will be 8.1.7.
Hope it helps, Dave
The latest results from our testing hold some promise. Using Oracle 8.1.7 with online and archived redo logs on the filer, we ran our load test for 17 hours with no corruptions. We will be continuing this test for the next two weeks to make sure that 8.1.7 is a viable solution. We have a TAR open with Oracle that we hope will identify the root cause of the problem. In parallel to the ongoing 8.1.7 test, we will be conducting 8.0.5 and 8.0.6 tests on separate equipment.

-----Original Message-----
From: Art Hebert [mailto:art@arzoon.com]
Sent: Friday, 14 February 2003 3:01 PM
To: 'Alan McLachlan'; Art Hebert
Cc: toasters@mathworks.com
Subject: RE: Checking netapp for bad blocks?
Alan -- the F760 is clustered, with redo logs on both filers and archive logs on the second filer. The primary db is fine, but the archive logs are getting corrupted someplace over on the standby database. Like you say, it's one of the checks we are looking at.
art
-----Original Message-----
From: Alan McLachlan [mailto:amclachlan@asi.com.au]
Sent: Thursday, February 13, 2003 10:18 PM
To: Art Hebert
Cc: toasters@mathworks.com
Subject: RE: Checking netapp for bad blocks?
You can run wack (I think it's "wafl_check" now, look at the list) in maintenance mode from the floppy boot menu ("ctrl-c" on a disk reboot from a serial console).
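For the record, the rough sequence is below. I'm going from memory and the menu wording differs between Data ONTAP releases, so treat it as a sketch and check with NetApp support before running it; the check keeps the filer out of service for the duration:

    # From the serial console, hit Ctrl-C during boot to reach the
    # special boot menu, then at the menu prompt (roughly):
    WAFL_check vol0
    # It grinds through the volume reporting/repairing inconsistencies;
    # expect a long outage on a big volume.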
However, unless your NVRAM card has failed - which you would know about instantly - the likelihood that this will do you any good at all is remote in the extreme. Your DBA doesn't understand the always-consistent filesystem on NetApp (WAFL), so he's trying to put a tick in a box on his troubleshooting checklist.
It is far more likely that what he has is application-level corruption within the files, which fsck on a UFS volume wouldn't have helped him with anyway.
Are the transaction logs on the filer or on local disk on the database server?
-----Original Message-----
From: Art Hebert [mailto:art@arzoon.com]
Sent: Friday, 14 February 2003 3:41 AM
To: 'Toasters@Mathworks.Com'
Subject: Checking netapp for bad blocks?
Our oracle database has experienced two problems lately with file corruption on our standby database. The corruption isn't on the primary database and our Oracle DBA is asking if I can do something similar to fsck on the netapp.
Any thoughts on the commands to run and what to check for would be appreciated.
art
-----Original Message-----
From: Stephane Bentebba [mailto:stephane.bentebba@fps.fr]
Sent: Thursday, February 13, 2003 5:49 AM
To: Jonathan
Cc: Jordan Share; 'Toasters@Mathworks.Com'
Subject: Re: Adding a drive shelf to my existing F740
Jonathan wrote:
2 - there is an extra trick to save a spare disk: don't keep two spare disks, only one, a 36G, since the Filer can reconstruct a broken 18G disk onto a 36G. BUT you then have to order a 36G from Netapp after that failure. If you have a hardware contract with Netapp, take care: when an 18G fails, they will send you an 18G, not a 36G. You would have to check for that (look for a special deal?)
To add to your trick #2: don't worry about getting an 18GB back from Netapp and leaving your 36 acting like an 18. Simply add the 18 in the place of the failed 18 and let the filer take it as a spare. Then do a disk fail on the 36 acting as an 18. The filer will now look for a spare 18 and find the new spare to rebuild the raid group. Then do a disk unfail on the 36 acting as an 18, and Ontap will zero the 36 and put it back as a 36 spare. So you don't need to work out a special deal with Netapp. I have done this many times and with great success.
Of course! I knew it indeed, how dumb of me :) Note that you have to switch modes in order to use disk unfail: use "rc_toggle_basic" or "priv set advanced" / "priv set admin".
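Putting Jonathan's swap together, the console sequence would look roughly like this. The disk ID is invented for illustration; use whatever "sysconfig -r" shows on your filer:

    priv set advanced     # enables disk unfail (or use rc_toggle_basic)
    # (the replacement 18G has already been inserted and become a spare)
    disk fail 8a.3        # 8a.3 = the 36G currently standing in as an 18G
    # ...wait for reconstruction onto the new 18G spare to finish...
    disk unfail 8a.3      # the 36G is zeroed and returns as a 36G spare
    priv set admin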
----- Original Message -----
From: "Stephane Bentebba" stephane.bentebba@fps.fr
To: "Jordan Share" iso9@jwiz.org
Cc: "'Toasters@Mathworks.Com'" toasters@mathworks.com
Sent: Thursday, February 13, 2003 6:12 AM
Subject: Re: Adding a drive shelf to my existing F740
First: to be sure we speak the same language (I'm French), the raidsize includes the parity disk. A raid group sized 14 can contain up to 13 data disks and one parity disk. The maximum raidsize is 28.
- the more disks you have in a raid group, the more time the filer needs to reconstruct a disk (and the longer the CPU is busy with it).
- the larger the disk, the more time it consumes as well. For example, an F760 with a CPU load average of 40% in production can go up to 50% for 6 hours to reconstruct a 72G disk.
1 - taking this into account, I won't advise you to move your raidsize to 28, but if you want to save as many disks as possible, you can turn the raidsize to 28 and add those 36G disks to the same raidgroup as the other (18G) ones.
2 - there is an extra trick to save a spare disk: don't keep two spare disks, only one, a 36G, since the Filer can reconstruct a broken 18G disk onto a 36G. BUT you then have to order a 36G from Netapp after that failure. If you have a hardware contract with Netapp, take care: when an 18G fails, they will send you an 18G, not a 36G. You would have to check for that (look for a special deal?)
Those two tricks apart, I can't see a better way than what you had planned.
Jordan Share wrote:
After reading further in my manuals, I've got some questions about the RAIDgroup size.
When I initially created the volume, I made the raidsize 14, so that it would use 1 parity disk for the 14 disks we had at that time.
Looking at it now, I kind of think that I want a raidsize of 13 for the current volume (since one disk is a hot spare).
So, I issued:

    vol options vol0 raidsize 13

That worked ok, because now:

    vol status vol0
      Volume  State   Status   Options
      vol0    online  normal   root, raidsize=13
        raid group 0: normal
Basically, what I'm trying to avoid here is having the first 36gig disk added to the initial raidgroup. I believe changing the raidsize to 13 will have fixed that, since now that raidgroup is full, and additional disks will go into their own raidgroup.
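If I've understood the mechanics right (my reading, so please correct me), adding the new shelf later should then be just a vol add, with the new disks landing in their own raid group because group 0 is full; the disk count here is only an example:

    vol add vol0 3        # rg0 is full, so the 3 new disks form raid group 1
    vol status -r vol0    # verify the resulting raid group layout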
I welcome (and request :) all comments on this post.
Thanks very much, Jordan Share
-----Original Message-----
From: owner-toasters@mathworks.com [mailto:owner-toasters@mathworks.com] On Behalf Of devnull@adc.idt.com
Sent: Wednesday, February 12, 2003 11:28 AM
To: Jordan Share
Cc: 'Toasters@Mathworks.Com'
Subject: Re: Adding a drive shelf to my existing F740
As I understand it, I would then have 2 hot spares (18 and 36), and two drives for parity.
Yup.
You would need the separate spare to accommodate the difference in drive sizes. The parity drive is a function of RAID 4.
We did the same about a year back on our 740 and it worked well.
Are there any "gotchas" (or blatant ignorance on my part) in
this scenario? It was mostly smooth. Though you might need to upgrade your ONTAP
version
to 6.1.1R2 atleast
/dev/null
devnull@adc.idt.com