Hi Alan & Art,
We had a similar problem with a customer and it turned out to be the
combination of platform, Oracle version and DOT version.
Customer was on RS6000 with AIX 4.3.3 and Oracle 8.0.6 with DOT 6.1.
Basically the problem did not appear during initial testing on DOT 5.x.
However, in the live environment on DOT 6.x they had log corruptions. During
testing, NetApp replicated the issue, and it was resolved only by going to
Oracle 8.1.7. We tried many things (AIX NFS patches, changing Oracle and
NetApp parameters) without any success.
I suggest you have a chat with NetApp as there are known issues with some
version combinations ....
FYI, the following is an extract of the correspondence between NetApp and
the end-user.
We have two other cases that are very similar to yours in that they involve
corrupted Oracle redo logs with AIX. Over the past several weeks, we have
created a test environment in our labs with an RS 6000 and AIX 4.3.3.0.
We're using an 840 filer with Ontap 6.1. We have a database with all of the
files (data, control, online redo logs, and archived redo logs) on the
filer. Oracle 7.3.4.5 is installed on the AIX box (I realize that you are
running Oracle 8.0.6, but I'll address that later).
Our methodology was such that we created the database with just one table
and one tablespace. We then launched a script that does millions of
inserts. The purpose was to generate a lot of redo activity. With the
basic configuration (Oracle 7.3.4.5, Ontap 6.1, and AIX 4.3.3.0), we were
able to reproduce the problem that another customer was having with the
corrupted redo logs. In an attempt to rectify the problem, we used an
iterative technique to try to isolate the cause of the corruption.
We lowered the nfs packet size (rsize and wsize) incrementally from 32k to
16k to 8k to 4k. This prolonged the period of time before the corruption,
but did not eliminate it altogether. We changed numerous parameters on the
Filer to no avail. We tried udp and tcp. Each time, the script failed with
an archival error, which stalled the database. We had to perform a shutdown
abort. Upon startup, there were errors reporting corruptions in the online
redo logs, which is very similar to your case.
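For anyone wanting to repeat that kind of run, the rsize/wsize and udp/tcp
changes are just NFS mount options on the AIX client. A mount of roughly the
shape below is what gets varied; the filer name, export path, mount point and
exact option list are placeholders, not the actual test configuration:
    mount -o rw,bg,hard,intr,proto=tcp,rsize=8192,wsize=8192 \
        filer1:/vol/oradata /oradata
Stepping rsize/wsize down to 16384, 8192 and then 4096 (and using proto=udp
for the UDP runs) gives the iteration described above.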
We've had conference calls with Oracle and IBM. Oracle dismissed this
problem, because the corruption does not occur with both sets of redo logs
on local disk. IBM, however, has acknowledged that this is a problem. They are
having development engineers analyze the AIX trace file we provided. IBM
also told us that just last week they had released two new NFS-related
patches for AIX. We have received these patches and applied them to our
test system. Currently, we are running our tests with the most recent
AIX NFS-specific patches installed. We will also be conducting similar tests
with Oracle 8. Most likely, this will be 8.1.7.
Hope it helps
Dave
> The latest results from our testing hold some promise. Using Oracle
> 8.1.7 with online and archived redo logs on the filer, we ran our load
> test for 17 hours with no corruptions. We will be continuing this test
> for the next two weeks to make sure that 8.1.7 is a viable solution. We
> have a TAR open with Oracle that we hope will identify the root cause of
> the problem. In parallel to the ongoing 8.1.7 test, we will be conducting
> 8.0.5 and 8.0.6 tests on separate equipment.
> -----Original Message-----
> From: Art Hebert [mailto:art@arzoon.com]
> Sent: Friday, 14 February 2003 3:01 PM
> To: 'Alan McLachlan'; Art Hebert
> Cc: toasters(a)mathworks.com
> Subject: RE: Checking netapp for bad blocks?
>
>
>
>
> Alan -- the F760 is clustered, with redo logs on both filers and
> archive logs on the second filer. The primary db is fine. But the
> archive logs are getting corrupted someplace over on the standby
> database. Like you say, it's one of the checks we are looking at.
>
> art
>
>
> -----Original Message-----
> From: Alan McLachlan [mailto:amclachlan@asi.com.au]
> Sent: Thursday, February 13, 2003 10:18 PM
> To: Art Hebert
> Cc: toasters(a)mathworks.com
> Subject: RE: Checking netapp for bad blocks?
>
>
> You can run wack (I think it's "wafl_check" now, look at the list) in
> maintenance mode from the floppy boot menu ("ctrl-c" on a
> disk reboot from a
> serial console).
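> For what it's worth, the rough shape of the procedure is sketched below;
> the menu wording differs between Ontap releases and the volume name is only
> an example, so treat it as a reminder of the steps rather than a runbook:
>
>     reboot the filer from the serial console
>     press ctrl-c when prompted to get the special boot menu
>     choose the maintenance / WAFL_check entry (or boot into maintenance
>       mode and run the check from there, depending on the release)
>     WAFL_check vol0     (then review and commit the changes it reports)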
>
> However, unless your NVRAM card has failed - which you would
> know about
> instantly - the likelihood that this will do you any good at
> all is remote
> in the extreme. Your DBA doesn't understand the
> always-consistent fs on
> NetApp (WAFL) so he's trying to put a tick in a box on his
> troubleshooting
> checklist.
>
> It is far more likely that what he has is application-level corruption
> within the files, which fsck on a UFS volume wouldn't have
> helped him with
> anyway.
>
> Are the transaction logs on the filer or on local disk on the database
> server?
>
>
>
> -----Original Message-----
> From: Art Hebert [mailto:art@arzoon.com]
> Sent: Friday, 14 February 2003 3:41 AM
> To: 'Toasters(a)Mathworks.Com'
> Subject: Checking netapp for bad blocks?
>
>
>
> Our oracle database has experienced two problems lately with file
> corruption on our standby database. The corruption isn't on
> the primary
> database and our Oracle DBA is asking if I can do something
> similar to fsck
> on the netapp.
>
> Any thoughts on the commands to run and what to check for would be
> appreciated.
>
> art
>
> -----Original Message-----
> From: Stephane Bentebba [mailto:stephane.bentebba@fps.fr]
> Sent: Thursday, February 13, 2003 5:49 AM
> To: Jonathan
> Cc: Jordan Share; 'Toasters(a)Mathworks.Com'
> Subject: Re: Adding a drive shelf to my existing F740
>
>
> Jonathan wrote:
>
> >>2 - there is an extra trick to save a spare disk: don't keep two spare
> >>disks but only one, a 36G, since the filer can reconstruct a broken 18G
> >>disk onto a 36G. BUT you then have to order a 36G from Netapp after that
> >>failure. If you have a hardware contract with Netapp, take care because
> >>when an 18G fails they will send you an 18G, not a 36G. You would have to
> >>check for that (look for a special deal?)
> >>
> >
> >To add to your trick #2: don't worry about getting an 18GB back from Netapp
> >and leaving your 36 acting like an 18. Simply add the 18 in the place of the
> >failed 18 and let the filer take it as a spare. Then do a disk fail on the
> >36 acting as an 18. The filer will now look for a spare 18 and find the new
> >spare to rebuild the raid group. Then do a disk unfail on the 36 acting as
> >an 18, and Ontap will zero the 36 and put it back as a 36 spare. So you
> >don't need to work out a special deal with Netapp. I have done this many
> >times with great success.
> >
> >
> Of course! I knew that; how dumb of me :)
> You have to switch modes in order to use disk unfail: use
> "rc_toggle_basic" or "priv set advanced / admin".
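> Put together, the swap looks roughly like this at the console. The disk
> name 8a.3 below is invented; check "sysconfig -r" for the real ID of the
> 36G that is standing in for an 18G before failing anything:
>
>     priv set advanced
>     disk fail 8a.3       (the 36G currently acting as an 18G)
>     ... wait for the raid group to rebuild onto the new 18G spare ...
>     disk unfail 8a.3     (Ontap zeroes it and brings it back as a 36G spare)
>     priv set admin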
>
> >
> >----- Original Message -----
> >From: "Stephane Bentebba" <stephane.bentebba(a)fps.fr>
> >To: "Jordan Share" <iso9(a)jwiz.org>
> >Cc: "'Toasters(a)Mathworks.Com'" <toasters(a)mathworks.com>
> >Sent: Thursday, February 13, 2003 6:12 AM
> >Subject: Re: Adding a drive shelf to my existing F740
> >
> >
> >
> >
> >>First: to be sure we speak the same language (I'm French),
> >>the raidsize includes the parity disk.
> >>A raid group sized 14 can contain up to 13 data disks and one parity
> >>disk.
> >>The maximum size of a raid group is 28.
> >>
> >>- the more disks you have in a raid group, the more time the filer needs
> >>to reconstruct a disk (and the longer the CPU is busy with that).
> >>- the larger the disk, the more time it consumes as well.
> >>For example, an F760 with a CPU load average of 40% in production can go
> >>up to 50% for 6 hours to reconstruct a 72G disk.
> >>
> >>1 - taking this into account, I wouldn't advise you to move your raidsize
> >>to 28, but if you want to save as many disks as possible, you can turn
> >>the raidsize to 28 and add those 36G disks to the same raidgroup as the
> >>other (18G) ones (see the rough command sketch below).
> >>2 - there is an extra trick to save a spare disk: don't keep two spare
> >>disks but only one, a 36G, since the filer can reconstruct a broken 18G
> >>disk onto a 36G. BUT you then have to order a 36G from Netapp after that
> >>failure. If you have a hardware contract with Netapp, take care because
> >>when an 18G fails they will send you an 18G, not a 36G. You would have to
> >>check for that (look for a special deal?)
> >>
> >>These two tricks apart, I can't see a better way than what you had planned.
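> >>For trick 1, the raidsize change plus adding the new disks comes down to
> >>roughly the two commands below (volume name and disk count are only an
> >>example, and the 7@36 form, which asks for seven 36G disks, is worth
> >>double-checking against your Ontap release):
> >>
> >>    vol options vol0 raidsize 28
> >>    vol add vol0 7@36
> >>
> >>after which the new 36G disks join the existing raidgroup instead of
> >>starting a new one.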
> >>
> >>Jordan Share wrote:
> >>
> >>
> >>
> >>>After reading further in my manuals, I've got some questions about the
> >>>RAIDgroup size.
> >>>
> >>>When I initially created the volume, I made the raidsize 14, so that it
> >>>would use 1 parity disk for the 14 disks we had at that time.
> >>>
> >>>Looking at it now, I kind of think that I want a raidsize of 13 for the
> >>>current volume (since one disk is a hot spare).
> >>>
> >>>So, I issued:
> >>>vol options vol0 raidsize 13
> >>>
> >>>That worked ok, because now:
> >>>
> >>>
> >>>
> >>>
> >>>>vol status vol0
> >>>>
> >>>>
> >>>>
> >>>>
> >>> Volume State Status Options
> >>> vol0 online normal root, raidsize=13
> >>> raid group 0: normal
> >>>
> >>>Basically, what I'm trying to avoid here is having the first 36gig disk
> >>>added to the initial raidgroup. I believe changing the raidsize to 13 will
> >>>have fixed that, since now that raidgroup is full, and additional disks
> >>>will go into their own raidgroup.
> >>>
> >>>I welcome (and request. :) all comments on this post.
> >>>
> >>>Thanks very much,
> >>>Jordan Share
> >>>
> >>>
> >>>
> >>>
> >>>
> >>>>-----Original Message-----
> >>>>From: owner-toasters(a)mathworks.com
> >>>>[mailto:owner-toasters@mathworks.com] On Behalf Of devnull(a)adc.idt.com
> >>>>Sent: Wednesday, February 12, 2003 11:28 AM
> >>>>To: Jordan Share
> >>>>Cc: 'Toasters(a)Mathworks.Com'
> >>>>Subject: Re: Adding a drive shelf to my existing F740
> >>>>
> >>>>
> >>>>
> >>>>
> >>>>
> >>>>
> >>>>>As I understand it, I would then have 2 hot spares (18 and 36), and
> >>>>>two drives for parity.
> >>>>>
> >>>>>
> >>>>>
> >>>>>
> >>>>Yup.
> >>>>
> >>>>You would need the separate spare to accommodate the difference in
> >>>>drive sizes. The parity drive is a function of RAID 4.
> >>>>
> >>>>We did the same about a year back on our 740 and it worked well.
> >>>>
> >>>>
> >>>>
> >>>>
> >>>>
> >>>>>Are there any "gotchas" (or blatant ignorance on my part) in
> >>>>>this scenario?
> >>>>
> >>>>It was mostly smooth. Though you might need to upgrade your ONTAP
> >>>>version to 6.1.1R2 at least.
> >>>>
> >>>>
> >>>>
> >>>>/dev/null
> >>>>
> >>>>devnull(a)adc.idt.com
> >>>>
> >>>>
> >>>>
> >>>>
> >>>>
> >>>>
> >>>
> >>>
> >>>
> >>>
> >>>
> >>
> >>
> >>
> >
> >
> >
> >
>
>
>