toasters August 2011

toasters@lists.teaparty.net

101 participants
33 discussions

Recover deleted snapshot
by Carl Howell 26 Aug '11

26 Aug '11

Does anyone know if it's possible to recover a deleted snapshot? Thanks, Carl Howell Systems Engineer University of West Florida

3 3

Sources of unaligned IO other that Vmware? - pw.over_limit persists
by Fletcher Cocquyt 26 Aug '11

26 Aug '11

I originally tried posting this back on April 11 - now that the list is "fixed" I want to try again - thanks: Hi, we¹ve aligned all our Vmware vmdk¹s according to the Netapp best practices while tracking the pw.over_limit counter see: http://www.vmadmin.info/2010/07/quantifying-vmdk-misalignment.html Counters that indicate improper alignment ( ref: ftp://service.boulder.ibm.com/storage/isv/NS3593-0.pdf) ³There are various ways of determining if you do not have proper alignment. Using perfstat counters, under the wafl_susp section, ³wp.partial_writes³, ³pw.over_limit³, and ³pw.async_read,³ are indicators of improper alignment. The ³wp.partial write³ is the block counter of unaligned I/O. If more than a small number of partial writes happen, then IBM® System StorageTM N series with WAFL® (write anywhere file layout) will launch a background read. These are counted in ³pw.async_read³; ³pw.over_limit³ is the block counter of the writes waiting on disk reads.² -- So the pw.over_limit counter is still recording an 5 minute average of 14 with 7-10 peaks in the 50-100 range at certain times of the day. If I look at the clients talking to the Netapp those times its mostly Oracle RAC servers with storage for data and voting disks on NFS. This leads me to the question: What if any are the other possible sources for unaligned IO on Netapp? All references I find are vmware vmdk but are there others like Oracle which may be doing block IO over NFS? Many thanks -- Fletcher Cocquyt Principal Engineer Information Resources and Technology (IRT) Stanford University School of Medicine http://vmadmin.info

4 4

out of order/random delivery
by Kyle Oliver 26 Aug '11

26 Aug '11

Anyone else get messages from this mailing-list out of order? I find that I often see replies to a message and then get the original message hours later. The timestamps on the messages are correct, but they get to my inbox in a strange order? For the record, it happened when the list was hosted at Mathworks as well. I guess it just seems more pronounced here with the recent volume of messages. -Kyle

2 1

Anyone see this copy problem before...
by tmac 26 Aug '11

26 Aug '11

while doing a cp, I see the following: cp: skipping file `/a/b/c/d/e/file', as it was replaced while being copied. The file has a date of Sep 11, 2010, the ctime is May 19th, 13:53. The atime should not matter. (which is today). Some notes about the path: /a is an automount base /b is the actual automount /c/d/e is the rest of the path RHEL5.6 autofs: 5.0.1-0.rc2.143.el5_6.2 kernel: 2.6.18-238.12.1.el5 nfs-utils: 1.0.9-50.el5 --tmac

3 3

Mixed San question
by Alon Zeltser 25 Aug '11

25 Aug '11

Hello toasters I'm trying to implement fas2040 in a new mixed fabric environment I have few IBM blade centers each contain 2 cisco mds 9100 switch All the blade centers mds switches connected to brocade 300 switch The brocade is configured with interopmod 3 The brocade is the principal switch with domain ID 100 while the other mds are connected with E-port downstream with domain ID from 90 to 110 without 100 I have two issue 1. I cannot create aliases / zones in the brocade the button is grated out and say this is disable in mcdata open fabric interopabilty Mode I can live with this issue and configure everything with cisco tools But even then alias or zones are not shown on the brocade even tough they work 2. The real issue is when I configure one blade server on each cage (one port on each mds ) everything is work fine but the minute I try to add a second blade server on the same switch the first blade disconnect from it's group and in "fcp show initiator" I can no longer see the first server wwn until I disconnect the second server Any ideas ? Sent from my iPhone

2 1

snapmirroring Exchange
by Jon Hill 25 Aug '11

25 Aug '11

Is it possible for a snapmirror update command (one not mediated by SnapDrive) to temporarily halt I/O on the source disk? More to the point, is it possible for a snapmirror update command to harm the source server if that server is an active Exchange 2003 box? We do own SnapDrive for Exchange, but installing it is not an option at this time. This snapmirror solution is being suggested as an emergency solution to replicate our Exchange database for the next few days, while our operations team tries to figure out why DoubleTake has failed for four days in a row. They understand that without SnapDrive's VSS goodness there is a good chance of data loss on the target end, but they'd like to give this a shot on the theory that it's better than nothing. But it's only better than nothing if it doesn't interfere with the production (source) server, hence my inquiry. If there is any chance of an I/O suspension that could take the production server offline, then obviously this is a nonstarter. Many thanks. This e-mail, including any attachment(s), is intended solely for use by the named addressee(s). If you are not the intended recipient, you are not authorized to disclose, copy, distribute or retain this message, without written authorization from Jennison Associates LLC ("Jennison"). This e-mail may contain proprietary, confidential or privileged information. If you have received this message in error, please notify the sender immediately. If you have requested Jennison to e-mail any account information to you or your designee, you are deemed to have consented to electronic delivery of information from Jennison. E-mail messages are not secure and may contain computer viruses or other defects, may not be accurately replicated on other systems, or may be intercepted, deleted or interfered with without the knowledge of the sender or the intended recipient.

5 4

two aix hosts, one works (rwx) one does not (ro)
by Chaim Rieger 25 Aug '11

25 Aug '11

two aix (same level) hosts, both can mount /vol/drv33 however the host go2l30s3 can not touch the mounted vol (ie no wx permissions) (even explicitly added the ip of this host to exports) the host iocuso01 can do rwx storage is gonfs01.domain.com (155.13.224.102) exports is /vol/drv33 -sec=sys,rw,root=iocuso01.domain.com:go2uso01.domain.com:155.13.20.54:go2l30p8.domain.com networking for both hosts is as follows go2l30s3:/> ifconfig -a en4: flags=1e080863,80<UP,BROADCAST,NOTRAILERS,RUNNING,SIMPLEX,MULTICAST,GROUPRT,64BIT,CHECKSUM_OFFLOAD(ACTIVE),CHAIN> inet 155.13.20.54 netmask 0xffffff00 broadcast 155.13.20.255 tcp_sendspace 262144 tcp_recvspace 262144 rfc1323 1 en5: flags=1e080863,80<UP,BROADCAST,NOTRAILERS,RUNNING,SIMPLEX,MULTICAST,GROUPRT,64BIT,CHECKSUM_OFFLOAD(ACTIVE),CHAIN> inet 155.13.17.41 netmask 0xffffff00 broadcast 155.13.17.255 tcp_sendspace 262144 tcp_recvspace 262144 rfc1323 1 lo0: flags=e08084b<UP,BROADCAST,LOOPBACK,RUNNING,SIMPLEX,MULTICAST,GROUPRT,64BIT> inet 127.0.0.1 netmask 0xff000000 broadcast 127.255.255.255 inet6 ::1/0 tcp_sendspace 131072 tcp_recvspace 131072 rfc1323 1 go2l30s3:/> iocuso01:/> ifconfig -a en5: flags=5e080863,c0<UP,BROADCAST,NOTRAILERS,RUNNING,SIMPLEX,MULTICAST,GROUPRT,64BIT,CHECKSUM_OFFLOAD(ACTIVE),PSEG,LARGESEND,CHAIN> inet 163.236.88.185 netmask 0xffffffc0 broadcast 163.236.88.191 tcp_sendspace 131072 tcp_recvspace 65536 rfc1323 0 lo0: flags=e08084b,c0<UP,BROADCAST,LOOPBACK,RUNNING,SIMPLEX,MULTICAST,GROUPRT,64BIT,LARGESEND,CHAIN> inet 127.0.0.1 netmask 0xff000000 broadcast 127.255.255.255 inet6 ::1%1/0 tcp_sendspace 131072 tcp_recvspace 131072 rfc1323 1 iocuso01:/>

3 2

Test
by Marcus Reuter 25 Aug '11

25 Aug '11

1-2-3 test -- Empfehlen Sie GMX DSL Ihren Freunden und Bekannten und wir belohnen Sie mit bis zu 50,- Euro! https://freundschaftswerbung.gmx.de

1 0

test sending
by Luca Domenella 25 Aug '11

25 Aug '11

test Luca Domenella T: +39 0698962316 E: <mailto:luca.domenella@bwinparty.com> luca.domenella(a)bwinparty.com bwin Italia Via Adolfo Ravà, 124 00142 Roma (RM) <http://www.bwinparty.com/> www.bwinparty.com This email and any attachments are confidential, and may be legally privileged and protected by copyright. If you are not the intended recipient dissemination or copying of this email is prohibited. If you have received this in error, please notify the sender by replying by email and then delete the email completely from your system. Any views or opinions are solely those of the sender. This communication is not intended to form a binding contract unless expressly indicated to the contrary and properly authorised. Any actions taken on the basis of this email are at the recipient's own risk.

1 0

crashes due to corrupt volume (now resolved by upgrade)
by William Yardley 25 Aug '11

25 Aug '11

Along the lines of posting stuff we tried to post to the list in the past, here's some information about a situation we had recently, where our NetApp was crashing over and over, resulting in 4 or so consecutive sleepless nights for me. The situation was resolved in the short term by upgrading Data OnTAP from 7.1.1 (built 6/2006) to 7.3.3. Sadly, this was also the impetus to speed up migrating the data to our new EMC based infrastructure (not my decision). Posting here just in case it's helpful to anyone. Interestingly, the errors seemed to continue after the upgrade, however something dumped core, but the system didn't crash; note also that the debugging information is more specific about which volume the problem is on than in the log messages from before the upgrade: Wed Jul 27 19:34:51 PDT [raid.data.ws.blkErr:error]: WAFL sanity check detected bad data on volume acs, Disk /aggr0/plex0/rg1/0a.41 Shelf 2 Bay 9 [NETAPP X274_S10K7146F10 NA07] S/N [3KS0R58B000075418D6B], block 3765221, inode number -45, snapid 0, file block 690030, level 1. Wed Jul 27 19:34:51 PDT [raid.data.lw.blkErr:CRITICAL]: Bad data detected on Disk /aggr0/plex0/rg1/0a.41 Shelf 2 Bay 9 [NETAPP X274_S10K7146F10 NA07] S/N [3KS0R58B000075418D6B], block #3765221 Wed Jul 27 19:34:51 PDT [wafl.inconsistent.indirect:error]: Bad indirect block in vol 'acs', snap 13, fbn 690030, level 1, fileid -1. Wed Jul 27 19:34:51 PDT [wafl.inconsistent.vol:error]: WAFL inconsistent: volume acs is inconsistent. Note: Any new Snapshot copies might contain this inconsistency. Wed Jul 27 19:34:51 PDT [wafl.raid.incons.buf:error]: WAFL inconsistent: bad block 143002981 (vvbn:97941303 fbn:690030 level:1) in inode (fileid: 4294967251 snapid:0 file_type:1 disk_flags:0x2) in volume acs. Wed Jul 27 19:34:51 PDT [coredump.micro.completed:info]: Microcore (/etc/crash/micro-core.101186401.2011-07-28.02_34_51) generation completed Sorry about the unfinished sentence at the beginning. I was running on very little sleep at the time. ----- Forwarded message from William Yardley <toasters(a)veggiechinese.net> ----- From: William Yardley <toasters(a)veggiechinese.net> To: toasters(a)mathworks.com Subject: crashes due to corrupt volume? I have a FAS3020 which has been crashing repeatedly. It has a single aggregate (aggr0) with Doing WAFL_check fixed a few errors, but after that ran clean. wafliron from the maintenance menu also came back clean. However, we kept seeing errors like this, even after replacing the disks which were supposedly having problems (leading me to believe it's something on the volume that's corrupt rather than a physical problem). We did have a disk failure, and that disk already rebuilt to the spare. The problem Mon Jul 25 02:19:44 PDT [raid.data.ws.blkErr:error]: WAFL sanity check detected bad data on volume aggr0, Disk /aggr0/plex0/rg1/0a.41 Shelf 2 Bay 9 [NETAPP X274_SCHT6146F10 NA08] S/N [3HY8FV2S00007418EU6V], block 3765221, inode number -45, snapid 0, file block 690030, level 1. Mon Jul 25 02:40:10 PDT [raid.data.lw.blkErr:CRITICAL]: Bad data detected on Disk /aggr0/plex0/rg1/0a.41 Shelf 2 Bay 9 [NETAPP X274_SCHT6146F10 NA08] S/N [3HY8FV2S00007418EU6V], block #3765221 Mon Jul 25 02:40:10 PDT [raid.rg.readerr.repair.data:notice]: Fixing bad data on Disk /aggr0/plex0/rg1/0a.41 Shelf 2 Bay 9 [NETAPP X274_SCHT6146F10 NA08] S/N [3HY8FV2S00007418EU6V], block #3765221 [and later, note a few different SNs after we replaced the two drives which the errors were showing up on -- we did this by doing a fail -i on the drives, and then reconstructing one at a time.] Mon Jul 25 21:44:45 PDT [raid.data.ws.blkErr:error]: WAFL sanity check detected bad data on volume aggr0, Disk /aggr0/plex0/rg1/0a.40 Shelf 2 Bay 8 [NETAPP X274_SCHT6146F10 NA08] S/N [3HY8FRM000007451DZZ2], block 3765221, inode number -45, snapid 0, file block 690030, level 1. Tue Jul 26 04:15:29 PDT [raid.data.lw.blkErr:CRITICAL]: Bad data detected on Disk /aggr0/plex0/rg1/0c.41 Shelf 2 Bay 9 [NETAPP X274_S10K7146F10 NA07] S/N [3KS0R58B000075418D6B], block #3765221 Tue Jul 26 04:15:29 PDT [raid.rg.readerr.repair.data:notice]: Fixing bad data on Disk /aggr0/plex0/rg1/0c.41 Shelf 2 Bay 9 [NETAPP X274_S10K7146F10 NA07] S/N [3KS0R58B000075418D6B], block #3765221 also a bunch of these: Tue Jul 26 04:16:30 PDT [wafl.dir.inconsistent:error]: wafl_nfs_lookup: inconsistent directory entry {x20 13 14495183 105181080 -520015383} <C7DA730C028D> in {x20 13 2781660 870430057 -520015383}. WAFL may be inconsistent. Call NetApp Support. Tue Jul 26 04:16:30 PDT [wafl.dir.inconsistent:error]: wafl_nfs_lookup: inconsistent directory entry {x20 13 14495184 105181084 -520015383} <C7DDF30B0808> in {x20 13 2781660 870430057 -520015383}. WAFL may be inconsistent. Call NetApp Support. The parity inconsistency files in etc/crash/ show something like: Checksum mismatch on disk /aggr0/plex0/rg1/0a.41 (S/N 3HY8FV2S00007418EU6V), block #3765221 (sector #0x1cb9f28). Checksum mismatch found on Mon Jul 25 02:30:00 PDT 2011 Expected block: [...] The system would often crash, and then crash again on reboot -- booting without /etc/rc and *then* enabling services seemed to help, though the system would usually crash again within hours. After looking at a (very) old post on this list, I tried disabling quotas on the active volumes; this seems to have possibly helped, but I'm nervous that the system will crash again. Short of restoring to a snapshot from before the problem started, is there anything else (online or offline) I can do to fix any problems that might be on the volume, or manually remove the problematic block(s)? Some other info, in case it helps: files*> aggr status -v Aggr State Status Options aggr0 online raid_dp, aggr root, diskroot, nosnap=off, raidtype=raid_dp, raidsize=16, snapmirrored=off, resyncsnaptime=60, fs_size_fixed=off, snapshot_autodelete=on, lost_write_protect=on Volumes: m_hosts, luns, acs, root, tmp, software, home, legacy_mail, x_tmt, backups, dropbox, netflow2, xen_backup Plex /aggr0/plex0: online, normal, active RAID group /aggr0/plex0/rg0: normal RAID group /aggr0/plex0/rg1: normal files*> aggr status -r Aggregate aggr0 (online, raid_dp) (block checksums) Plex /aggr0/plex0 (online, normal, active) RAID group /aggr0/plex0/rg0 (normal) RAID Disk Device HA SHELF BAY CHAN Pool Type RPM Used (MB/blks) Phys (MB/blks) --------- ------ ------------- ---- ---- ---- ----- -------------- -------------- dparity 0a.22 0a 1 6 FC:A - FCAL 10000 136000/278528000 137104/280790184 parity 0a.32 0a 2 0 FC:A - FCAL 10000 136000/278528000 137104/280790184 data 0a.23 0a 1 7 FC:A - FCAL 10000 136000/278528000 139072/284820800 data 0a.39 0a 2 7 FC:A - FCAL 10000 136000/278528000 137104/280790184 data 0c.35 0c 2 3 FC:B - FCAL 10000 136000/278528000 137104/280790184 data 0c.26 0c 1 10 FC:B - FCAL 10000 136000/278528000 137104/280790184 data 0a.45 0a 2 13 FC:A - FCAL 10000 136000/278528000 137104/280790184 data 0c.18 0c 1 2 FC:B - FCAL 10000 136000/278528000 137104/280790184 data 0c.28 0c 1 12 FC:B - FCAL 10000 136000/278528000 137104/280790184 data 0c.37 0c 2 5 FC:B - FCAL 10000 136000/278528000 137104/280790184 data 0a.24 0a 1 8 FC:A - FCAL 10000 136000/278528000 137104/280790184 data 0a.43 0a 2 11 FC:A - FCAL 10000 136000/278528000 137104/280790184 data 0c.29 0c 1 13 FC:B - FCAL 10000 136000/278528000 137485/281570072 RAID group /aggr0/plex0/rg1 (normal) RAID Disk Device HA SHELF BAY CHAN Pool Type RPM Used (MB/blks) Phys (MB/blks) --------- ------ ------------- ---- ---- ---- ----- -------------- -------------- dparity 0a.17 0a 1 1 FC:A - FCAL 10000 136000/278528000 137422/281442144 parity 0a.33 0a 2 1 FC:A - FCAL 10000 136000/278528000 139072/284820800 data 0c.41 0c 2 9 FC:B - FCAL 10000 136000/278528000 137104/280790184 data 0a.38 0a 2 6 FC:A - FCAL 10000 136000/278528000 137104/280790184 data 0a.36 0a 2 4 FC:A - FCAL 10000 136000/278528000 137104/280790184 data 0c.27 0c 1 11 FC:B - FCAL 10000 136000/278528000 137104/280790184 data 0a.21 0a 1 5 FC:A - FCAL 10000 136000/278528000 137104/280790184 data 0a.34 0a 2 2 FC:A - FCAL 10000 136000/278528000 137104/280790184 data 0c.19 0c 1 3 FC:B - FCAL 10000 136000/278528000 137104/280790184 data 0a.20 0a 1 4 FC:A - FCAL 10000 136000/278528000 137104/280790184 data 0c.44 0c 2 12 FC:B - FCAL 10000 136000/278528000 137422/281442144 data 0c.16 0c 1 0 FC:B - FCAL 10000 136000/278528000 137104/280790184 data 0a.42 0a 2 10 FC:A - FCAL 10000 136000/278528000 137104/280790184 Spare disks RAID Disk Device HA SHELF BAY CHAN Pool Type RPM Used (MB/blks) Phys (MB/blks) --------- ------ ------------- ---- ---- ---- ----- -------------- -------------- Spare disks for block or zoned checksum traditional volumes or aggregates spare 0c.25 0c 1 9 FC:B - FCAL 10000 136000/278528000 137104/280790184 spare 0c.40 0c 2 8 FC:B - FCAL 10000 136000/278528000 137104/280790184 ----- End forwarded message -----

1 0

2025

2024

2023

2022

2021

2020

2019

2018

2017

2016

2015

2014

2013

2012

2011

2010

2009

2008

2007

2006

2005

2004

2003

2002

2001

2000

1999

1998

1997

toasters August 2011