Hi Alan & Art,
We had a similar problem with a customer and it turned out to be the combination of platform, Oracle version and DOT (Data ONTAP) version.
Customer was on an RS6000 with AIX 4.3.3 and Oracle 8.0.6 with DOT 6.1. Basically the problem did not appear during initial testing on DOT 5.x; however, in the live environment using DOT 6.x they had log corruptions. During testing, NetApp replicated the issue, which was resolved only by going to Oracle 8.1.7. We tried many things (AIX NFS patches, changed Oracle and NetApp parameters) without any success ...
I suggest you have a chat with NetApp as there are known issues with some version combinations ....
FYI the following is an extract of correspondence between NetApp and end-user ....
We have two other cases that are very similar to yours in that they involve corrupted Oracle redo logs with AIX. Over the past several weeks, we have created a test environment in our labs with an RS 6000 and AIX 4.3.3.0.
We're using an 840 filer with Ontap 6.1. We have a database with all of the files (data, control, online redo logs, and archived redo logs) on the filer. Oracle 7.3.4.5 is installed on the AIX box (I realize that you are running Oracle 8.0.6, but I'll address that later).
Our methodology was to create the database with just one table and one tablespace, then launch a script that does millions of inserts. The purpose was to generate a lot of redo activity. With the basic configuration (Oracle 7.3.4.5, Ontap 6.1, and AIX 4.3.3.0), we were able to reproduce the problem that another customer was having with the corrupted redo logs. In an attempt to rectify the problem, we used an iterative technique to try to isolate the cause of the corruption.
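For anyone wanting to reproduce that kind of load, a trivial shell loop feeding sqlplus churns redo the same way. This is only a sketch of the idea, not NetApp's actual script; the table, login, and row count below are invented:

    #!/bin/sh
    # Illustrative redo-load generator. Assumes a table created as:
    #   create table loadtest (n number);
    # scott/tiger is a placeholder login.
    (
      echo "set autocommit on"
      i=0
      while [ $i -lt 1000000 ]; do
        echo "insert into loadtest values ($i);"
        i=`expr $i + 1`
      done
    ) | sqlplus -s scott/tiger

With autocommit on, every insert commits, so the online redo logs cycle continuously and the archiver stays busy.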
We lowered the NFS packet size (rsize and wsize) incrementally from 32k to 16k to 8k to 4k. This prolonged the period of time before the corruption, but did not eliminate it altogether. We changed numerous parameters on the Filer to no avail. We tried UDP and TCP. Each time, the script failed with an archival error, which stalled the database. We had to perform a shutdown abort. Upon startup, there were errors reporting corruption in the online redo logs, which is very similar to your case.
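If you want to try the same knobs, rsize/wsize and the transport are ordinary NFS mount options on AIX. A sketch only; the filer name and paths are placeholders, and the exact option set you need may differ:

    # Remount the Oracle filesystem with 8k NFS transfers over TCP
    # (filer1 and the paths are examples, not our actual layout).
    mount -o rw,hard,intr,rsize=8192,wsize=8192,proto=tcp,vers=3 \
        filer1:/vol/oradata /oradata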
We've had conference calls with Oracle and IBM. Oracle dismissed this problem, because the corruption does not occur with both sets of redo logs on local disk. IBM, however, has acknowledged this is a problem. They are having development engineers analyze the AIX trace file we provided. IBM also told us that just last week they had released two new NFS-related patches for AIX. We have received these patches and applied them to our test system. Currently, we are running our tests with the most recent AIX NFS-specific patches installed. We will also be conducting similar tests with Oracle 8. Most likely, this will be 8.1.7.
Hope it helps, Dave
The latest results from our testing hold some promise. Using Oracle 8.1.7 with online and archived redo logs on the filer, we ran our load test for 17 hours with no corruptions. We will be continuing this test for the next two weeks to make sure that 8.1.7 is a viable solution. We have a TAR open with Oracle that we hope will identify the root cause of the problem. In parallel to the ongoing 8.1.7 test, we will be conducting 8.0.5 and 8.0.6 tests on separate equipment.

-----Original Message-----
From: Art Hebert [mailto:art@arzoon.com]
Sent: Friday, 14 February 2003 3:01 PM
To: 'Alan McLachlan'; Art Hebert
Cc: toasters@mathworks.com
Subject: RE: Checking netapp for bad blocks?
Alan -- the F760 is clustered, with redo logs on both filers and archive logs on the second filer. The primary db is fine, but the archive logs are getting corrupted someplace over on the standby database. Like you say, it's one of the checks we are looking at.
art
-----Original Message-----
From: Alan McLachlan [mailto:amclachlan@asi.com.au]
Sent: Thursday, February 13, 2003 10:18 PM
To: Art Hebert
Cc: toasters@mathworks.com
Subject: RE: Checking netapp for bad blocks?
You can run wack (I think it's "wafl_check" now, look at the list) in maintenance mode from the floppy boot menu ("ctrl-c" on a disk reboot from a serial console).
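For the record, the rough sequence is below. I'm going from memory and the menu wording differs between Data ONTAP releases, so treat it as a sketch and check with NetApp support before running it; the check keeps the filer out of service for the duration:

    # From the serial console, hit Ctrl-C during boot to reach the
    # special boot menu, then at the menu prompt (roughly):
    WAFL_check vol0
    # It grinds through the volume reporting/repairing inconsistencies;
    # expect a long outage on a big volume.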
However, unless your NVRAM card has failed - which you would know about instantly - the likelihood that this will do you any good at all is remote in the extreme. Your DBA doesn't understand the always-consistent filesystem on NetApp (WAFL), so he's trying to put a tick in a box on his troubleshooting checklist.
It is far more likely that what he has is application-level corruption within the files, which fsck on a UFS volume wouldn't have helped him with anyway.
Are the transaction logs on the filer or on local disk on the database server?
-----Original Message-----
From: Art Hebert [mailto:art@arzoon.com]
Sent: Friday, 14 February 2003 3:41 AM
To: 'Toasters@Mathworks.Com'
Subject: Checking netapp for bad blocks?
Our oracle database has experienced two problems lately with file corruption on our standby database. The corruption isn't on the primary database and our Oracle DBA is asking if I can do something similar to fsck on the netapp.
Any thoughts on the commands to run and what to check for would be appreciated.
art
-----Original Message-----
From: Stephane Bentebba [mailto:stephane.bentebba@fps.fr]
Sent: Thursday, February 13, 2003 5:49 AM
To: Jonathan
Cc: Jordan Share; 'Toasters@Mathworks.Com'
Subject: Re: Adding a drive shelf to my existing F740
Jonathan wrote:
2 - there is an extra trick to save a spare disk: don't keep two spare disks, only one, a 36G, since the Filer can reconstruct a broken 18G disk onto a 36G. BUT you then have to order a 36G from Netapp after that failure. If you have a hardware contract with Netapp, take care: when an 18G fails, they will send you an 18G, not a 36G. You would have to check for that (look for a special deal?)
To add to your trick #2: don't worry about getting an 18GB back from Netapp and leaving your 36 acting like an 18. Simply add the 18 in the place of the failed 18 and let the filer take it as a spare. Then do a disk fail on the 36 acting as an 18. The filer will now look for a spare 18 and find the new spare to rebuild the raid group. Then do a disk unfail on the 36 acting as an 18, and Ontap will zero the 36 and put it back as a 36 spare. So you don't need to work out a special deal with Netapp. I have done this many times and with great success.
Of course! I knew it indeed, how dumb of me :) Note that you have to switch modes in order to use disk unfail: use "rc_toggle_basic" or "priv set advanced" / "priv set admin".
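Putting Jonathan's swap together, the console sequence would look roughly like this. The disk ID is invented for illustration; use whatever "sysconfig -r" shows on your filer:

    priv set advanced     # enables disk unfail (or use rc_toggle_basic)
    # (the replacement 18G has already been inserted and become a spare)
    disk fail 8a.3        # 8a.3 = the 36G currently standing in as an 18G
    # ...wait for reconstruction onto the new 18G spare to finish...
    disk unfail 8a.3      # the 36G is zeroed and returns as a 36G spare
    priv set admin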
----- Original Message -----
From: "Stephane Bentebba" stephane.bentebba@fps.fr
To: "Jordan Share" iso9@jwiz.org
Cc: "'Toasters@Mathworks.Com'" toasters@mathworks.com
Sent: Thursday, February 13, 2003 6:12 AM
Subject: Re: Adding a drive shelf to my existing F740
First: to be sure we speak the same language (I'm French), the raidsize includes the parity disk. A raid group sized 14 can contain up to 13 data disks and one parity disk. The maximum raidsize is 28.
- the more disks you have in a raid group, the more time the filer needs to reconstruct a disk (and the longer the CPU is busy with it).
- the larger the disk, the more time it consumes as well. For example, an F760 with a CPU load average of 40% in production can go up to 50% for 6 hours to reconstruct a 72G disk.
1 - taking this into account, I won't advise you to move your raidsize to 28, but if you want to save as many disks as possible, you can turn the raidsize to 28 and add those 36G disks to the same raidgroup as the other (18G) ones.
2 - there is an extra trick to save a spare disk: don't keep two spare disks, only one, a 36G, since the Filer can reconstruct a broken 18G disk onto a 36G. BUT you then have to order a 36G from Netapp after that failure. If you have a hardware contract with Netapp, take care: when an 18G fails, they will send you an 18G, not a 36G. You would have to check for that (look for a special deal?)
Those two tricks apart, I can't see a better way than what you had planned.
Jordan Share wrote:
After reading further in my manuals, I've got some questions about the RAIDgroup size.
When I initially created the volume, I made the raidsize 14, so that it would use 1 parity disk for the 14 disks we had at that time.
Looking at it now, I kind of think that I want a raidsize of 13 for the current volume (since one disk is a hot spare).
So, I issued:

    vol options vol0 raidsize 13

That worked ok, because now:

    vol status vol0
      Volume  State   Status   Options
      vol0    online  normal   root, raidsize=13
        raid group 0: normal
Basically, what I'm trying to avoid here is having the first 36gig disk added to the initial raidgroup. I believe changing the raidsize to 13 will have fixed that, since now that raidgroup is full, and additional disks will go into their own raidgroup.
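If I've understood the mechanics right (my reading, so please correct me), adding the new shelf later should then be just a vol add, with the new disks landing in their own raid group because group 0 is full; the disk count here is only an example:

    vol add vol0 3        # rg0 is full, so the 3 new disks form raid group 1
    vol status -r vol0    # verify the resulting raid group layout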
I welcome (and request :) all comments on this post.
Thanks very much, Jordan Share
-----Original Message-----
From: owner-toasters@mathworks.com [mailto:owner-toasters@mathworks.com] On Behalf Of devnull@adc.idt.com
Sent: Wednesday, February 12, 2003 11:28 AM
To: Jordan Share
Cc: 'Toasters@Mathworks.Com'
Subject: Re: Adding a drive shelf to my existing F740
As I understand it, I would then have 2 hot spares (18 and 36), and two drives for parity.
Yup.
You would need the separate spare to accommodate the difference in drive sizes. The parity drive is a function of RAID 4.
We did the same about a year back on our 740 and it worked well.
Are there any "gotchas" (or blatant ignorance on my part) in
this scenario? It was mostly smooth. Though you might need to upgrade your ONTAP
version
to 6.1.1R2 atleast
/dev/null
devnull@adc.idt.com