toasters March 2010

toasters@lists.teaparty.net

30 participants
25 discussions

scsi.cmd.notReadyCondition:notice Device returns not yet ready Drive spinning up
by Linux Admin 02 Mar '10

02 Mar '10

Folks, What to make out of these error conditions in the logs? It happens for a variety of differnet drives, not just one or 2 FROM syslog translator: Description The not ready check condition event is issued when an I/O request receives a check condition where the device reports that it is not ready to process the I/O request. The I/O request will be delayed to wait for the device to become ready and then tried again. This is enough to trip and crash application eventhough the retry is successful Sat Feb 23 09:47:36 MST [netapp01: scsi.cmd.notReadyCondition:notice]: Disk device 2a.41: Device returns not yet ready: CDB 0x2f:1186f800:0400: Sense Data SCSI:not ready - Drive spinning up (0x2 - 0x4 0x1 0x2)(9006). Sat Feb 23 09:47:36 MST [netapp01: scsi.cmd.notReadyCondition:notice]: Disk device 2a.41: Device returns not yet ready: CDB 0x28:1404c338:0058: Sense Data SCSI:not ready - Drive spinning up (0x2 - 0x4 0x1 0x2)(9021). Sat Feb 23 09:47:36 MST [netapp01: scsi.cmd.notReadyCondition:notice]: Disk device 2a.41: Device returns not yet ready: CDB 0x28:044cc690:0028: Sense Data SCSI:not ready - Drive spinning up (0x2 - 0x4 0x1 0x2)(9033). Sat Feb 23 09:47:36 MST [netapp01: scsi.cmd.notReadyCondition:notice]: Disk device 2a.41: Device returns not yet ready: CDB 0x28:0e5fe178:00c0: Sense Data SCSI:not ready - Drive spinning up (0x2 - 0x4 0x1 0x2)(9020). Sat Feb 23 09:47:36 MST [netapp01: scsi.cmd.notReadyCondition:notice]: Disk device 2a.41: Device returns not yet ready: CDB 0x28:1404c3d8:0088: Sense Data SCSI:not ready - Drive spinning up (0x2 - 0x4 0x1 0x2)(9021). Sat Feb 23 09:47:36 MST [netapp01: scsi.cmd.notReadyCondition:notice]: Disk device 2a.41: Device returns not yet ready: CDB 0x28:0ca00ab8:0018: Sense Data SCSI:not ready - Drive spinning up (0x2 - 0x4 0x1 0x2)(8830). Sat Feb 23 09:47:36 MST [netapp01: scsi.cmd.notReadyCondition:notice]: Disk device 2a.41: Device returns not yet ready: CDB 0x28:14fc5db8:0020: Sense Data SCSI:not ready - Drive spinning up (0x2 - 0x4 0x1 0x2)(8982). Sat Feb 23 09:47:36 MST [netapp01: scsi.cmd.notReadyCondition:notice]: Disk device 2a.41: Device returns not yet ready: CDB 0x28:0ca00ba0:0008: Sense Data SCSI:not ready - Drive spinning up (0x2 - 0x4 0x1 0x2)(8830). Sat Feb 23 09:48:35 MST [netapp01: scsi.cmd.retrySuccess:info]: Disk device 2a.41: request successful after retry #3: cdb 0x2f:1186f800:0400 (68017). Sat Feb 23 09:48:35 MST [netapp01: scsi.cmd.retrySuccess:info]: Disk device 2a.41: request successful after retry #3: cdb 0x28:1404c338:0058 (67890). Sat Feb 23 09:48:35 MST [netapp01: scsi.cmd.retrySuccess:info]: Disk device 2a.41: request successful after retry #3: cdb 0x28:1404c3d8:0088 (67891). Sat Feb 23 09:48:35 MST [netapp01: scsi.cmd.retrySuccess:info]: Disk device 2a.41: request successful after retry #3: cdb 0x28:0ca00ba0:0008 (67706). Sat Feb 23 09:48:35 MST [netapp01: scsi.cmd.retrySuccess:info]: Disk device 2a.41: request successful after retry #3: cdb 0x28:0e5fe178:00c0 (67901). Sat Feb 23 09:48:35 MST [netapp01: scsi.cmd.retrySuccess:info]: Disk device 2a.41: request successful after retry #3: cdb 0x28:0ca00ab8:0018 (67717). Sat Feb 23 09:48:35 MST [netapp01: scsi.cmd.retrySuccess:info]: Disk device 2a.41: request successful after retry #3: cdb 0x28:14fc5db8:0020 (67877). Sat Feb 23 09:48:35 MST [netapp01: scsi.cmd.retrySuccess:info]: Disk device 2a.41: request successful after retry #3: cdb 0x28:044cc690:0028 (67936).

2 2

Slow aggregate/shelf, hot disk
by Adam McDougall 01 Mar '10

01 Mar '10

For a long time we've known backing up our largest volume (3.5T) was slow. More recently I've been investigating why and it seems like a problem with only that shelf or possibly aggregate. Basically it is several times slower than any other shelf/aggregate we have, it seems bottlenecked whether I am reading/writing from nfs, ndmp, reallocate scans, etc, that shelf is always slower. I will probably have a support case opened tomorrow with netapp but I feel like checking with the list to see what else I can find out on my own. When doing NDMP backups I get only around 230Mbit/sec as opposed to 800+ on others. The performance drops distinctly on the hour probably for snapshots (see pic). Details below. 0c.25 seems like a hot disk but the activity on that aggr also seems too high since the network bandwidth is fairly small. A 'reallocate measure' on the two large volumes on aggregate hanksata0 both return a score of '1'. I guess my two main questions are, how do I figure out what is causing the activity on hanksata0 (especially the hot disk which is sometimes at 100%) and if its not just activity but an actual problem, how could I further debug the slow performance to find out what items are at fault? I used ndmpcopy to copy a fast volume with large files from another filer to a new volume on hanksata0 and hanksata1. The volume on hanksata0 is slow but the one on hanksata1 is not. Both of those aggregates are on the same loop with hanksata1 terminating it. Sun Feb 28 20:14:20 EST [hank: wafl.scan.start:info]: Starting WAFL layout measurement on volume scratchtest. Sun Feb 28 20:19:01 EST [hank: wafl.reallocate.check.value:info]: Allocation measurement check on '/vol/scratchtest' is 2. ^^^ almost 5 minutes! Sun Feb 28 20:13:38 EST [hank: wafl.scan.start:info]: Starting WAFL layout measurement on volume scratchtest2. Sun Feb 28 20:14:12 EST [hank: wafl.reallocate.check.value:info]: Allocation measurement check on '/vol/scratchtest2' is 1. ^^^ less than 1 min When I write to scratchtest, you can see the network bandwidth jump up for a few seconds then it stalls for about twice as long, presumably so the filer can catch up writing, then it repeats. Speed averages around 30-40MB/sec if that. I even tried using the spare sata disk from both of these shelves to make a new volume, copied scratchtest to it (which took 26 minutes for around 40G), and reads were equally slow as the existing scratchtest, although I'm not sure if thats because a single disk is too slow to prove anything, or theres a shelf problem. hanksata0 6120662048 6041632124 79029924 99% hanksata0/.snapshot 322140104 14465904 307674200 4% hanksata1 8162374688 2191140992 5971233696 27% hanksata1/.snapshot 429598664 39636812 389961852 9% hanksata0 and 1 are both ds14mk2 AT but hanksata0 has X268_HGEMI aka X268A-R5 (750m x 14) and hanksata1 has disks X269_HGEMI aka X269A-R5 (1T x 14). hanksata0 has been around since we got the filer say around 2 years ago, hanksata1 was added within the last half year. Both shelves have always had 11 data disks, 2 parity, 1 spare, the aggregates were never grown. volumes on hanksata0 besides root (all created over a year ago): volume 1 (research): NO dedupe (too big) 10 million inodes, approx 3.5T, 108G in snapshots endures random user read/write but usually fairly light traffic. Populated initially with rsync then opened to user access via NFS. Sun Feb 28 21:38:11 EST [hank: wafl.reallocate.check.value:info]: Allocation measurement check on '/vol/research' is 1. volume 2 (reinstallbackups): dedupe enabled 6.6 million files, approx 1.6T, 862G in snapshots volume created over a year ago and has several dozen gigs of windows PC backups written or read multiple times per week using CIFS but otherwise COMPLETELY idle. Older data is generally deleted to snapshots after some weeks and the snapshots expire after a few weeks. Only accessed via CIFS. Mon Mar 1 12:15:58 EST [hank: wafl.reallocate.check.value:info]: Allocation measurement check on '/vol/reinstallbackups' is 1. hanksata1 only has one volume besides the small test ones I made, it runs plenty fast. dedupe enabled 4.3 million files, approx 1.6T, 12G in snapshots created a few months ago on an otherwise unused new aggregate with initial rsync, then daily rsyncs from another fileserver that is not very active disk ut% xfers ureads--chain-usecs writes--chain-usecs cpreads-chain-usecs greads--chain-usecs gwrites-chain-usecs /hanksata0/plex0/rg0: 0c.16 7 5.69 0.94 1.00 55269 3.22 3.02 2439 1.52 2.71 579 0.00 .... . 0.00 .... . 0c.17 9 6.34 0.94 1.00 74308 3.84 2.86 2228 1.56 2.93 873 0.00 .... . 0.00 .... . 0c.18 63 121.00 118.86 1.01 30249 1.38 3.26 3516 0.76 5.43 2684 0.00 .... . 0.00 .... . 0c.19 60 117.74 116.69 1.00 30546 0.40 3.73 5049 0.65 5.56 2840 0.00 .... . 0.00 .... . 0c.20 60 120.82 119.66 1.02 29156 0.43 5.33 5469 0.72 4.80 3583 0.00 .... . 0.00 .... . 0c.21 60 119.37 118.25 1.02 29654 0.36 4.60 5870 0.76 5.76 3140 0.00 .... . 0.00 .... . 0c.22 62 124.87 123.32 1.02 29423 0.62 5.65 5677 0.94 3.58 2710 0.00 .... . 0.00 .... . 0c.23 62 119.48 118.35 1.03 30494 0.36 4.00 6875 0.76 5.14 3417 0.00 .... . 0.00 .... . 0c.24 61 119.08 117.96 1.02 29981 0.47 6.92 3289 0.65 3.94 2930 0.00 .... . 0.00 .... . 0c.25 93 118.17 116.72 1.03 45454 0.58 4.00 17719 0.87 4.63 11658 0.00 .... . 0.00 .... . 0c.26 61 121.40 120.27 1.04 29271 0.43 7.75 3097 0.69 5.21 2131 0.00 .... . 0.00 .... . 0c.27 59 115.75 114.81 1.03 29820 0.43 5.50 4530 0.51 6.00 3321 0.00 .... . 0.00 .... . 0c.28 63 125.53 124.15 1.01 30302 0.65 6.94 3808 0.72 3.40 5191 0.00 .... . 0.00 .... . Both sata shelves are on controller 0c attached to two 3040. Both sata shelves are on controller 0c attached to two 3040. Raid-DP in 13-disk raid groups so we have 2 parity and one spare per shelf. Active-Active single path HA. Latest firmwares/code as of beginning of the year. 7.3.2. no VMs, no snapmirror, nothing fancy that I can think of. wafl scan status only shows 'active bitmap rearrangement' or 'container block reclamation'. Thanks for thoughts and input!

1 0

Re: NetApp scripting question
by David McWilliams 01 Mar '10

01 Mar '10

This is what I did and it worked great, all qtrees set up in no time, with all appropriate perms. Sláinte, David "Build a man a fire he'll be warm for a day, set a man on fire and he'll be warm for the rest of his life" - Terry Pratchett Checkout my photos - http://www.panoramio.com/user/1113507 On Mon, Mar 1, 2010 at 1:23 PM, <Marek.Stopka(a)tieto.com> wrote: > Or you can put all commandis into a single file and go like > > toaster> source /home/createallneededcifs.txt > > :-) > > I like this one more then directly touching configuration file... > ________________________________________ > From: owner-toasters(a)mathworks.com [owner-toasters(a)mathworks.com] On > Behalf Of Kevin Davis [kdavis(a)mathworks.com] > Sent: Monday, March 01, 2010 5:16 PM > To: David McWilliams > Cc: NetApp list > Subject: Re: NetApp scripting question > > > I am putting in a new filer to replace three Windows servers and in the > process I will have to create > > approx 110 shares. I have heard that there is a way of scripting the > process. does anyone have any > > information that they could provide for me on this process? > > It's easy, but like any scripting task, you just need to know what you want > ahead of time. If these shares are of the most basic sort (share name, > path, > everyone/full control access), then all you need to do is dump all the > commands > into a file and execute them on the filer or via SSH. If you can handle a > reboot, just append all of the new shares into > <root>/etc/cifsconfig_share.cfg. > The syntax is straightforward, and the shares will be created upon next > boot. I > suspect the same would happen if you just restarted CIFS, but I haven't > tried > it. Take a look at it - it's simple - and you can just as easily set > different > access to each share if you have that information too. > > > *------------------------------------------*-----------------------* > | Kevin Davis (UNIX/Storage Sysadmin) | Natick, Massachusetts | > | 508.647.7660 | 01760-2098 | > | mailto:kevin.davis@mathworks.com *-----------------------* > | http://www.mathworks.com | | > *------------------------------------------*-----------------------* > >

1 0

NetApp scripting question
by David McWilliams 01 Mar '10

01 Mar '10

I am putting in a new filer to replace three Windows servers and in the process I will have to create approx 110 shares. I have heard that there is a way of scripting the process. does anyone have any information that they could provide for me on this process? Sláinte, David "Build a man a fire he'll be warm for a day, set a man on fire and he'll be warm for the rest of his life" - Terry Pratchett Checkout my photos - http://www.panoramio.com/user/1113507

10 9

Has anyone seen this ndmpcopy error message
by Michael Homa 01 Mar '10

01 Mar '10

Hi: I'm hoping someone has seen this message. I've been working with netapp tech support but they, so far, have not been able to duplicate the error in their lab. I'll start by apologizing for the length of the email. Brevity can result in key details being left out; details that might otherwise explain the situation and lead to the solution. On the other hand, too many details can annoy the crap out of people. We have two netapps, both model FAS270. Netapp1, the source for ndmpcopy, runs 7.0.4; netapp2, the destination, netapp2, runs 7.2.7. We use ndmpcopy to copy volumes from netapp1 to netapp2. Intermittently (though it seems to occur more frequently now), the ndmpcopy fails to run and gives the following message: Body error NDMP_ILLEGAL_STATE_ERR in reply message NdmpMessageDataStartRecover from destination Feb 25 06:00:05 CST [ndmpc:233]: Failed to start restore on destination There is no pattern regarding on which volume or the size of the volume the ndmpcopy operation will fail. It can occur on one of our smaller volumes (1g) or on our largest (765g). The error will occur when the ndmpcopy operation is initiated from the console or when it is started from a perl script on the administrative host for the destination filer. Though the two netapps are physically located in separate buildings that are miles apart, they are on the same subnet and no other network issues regarding them have occurred. There is only one instance of the error message in the knowledgebase and it concerns a SAN library with BakBone NetVault with NDMP plugin. A review of the ndmpcopy spec at ndmp.org suggests one of the following: 1) Illegal state. A data operation is not currently in progress. 2) Tape record size cannot be set in the current state. 3) A data operation is already in progress. Only one data operation is allowed to be executing at a time. 4) Illegal state. NDMP server not in the halted state 5) Message cannot be processed in current state I checked the state of the ndmpd server on the destination, netapp2, and there are no other sessions. All the others are things I have no control over. Some of the things I've tried: o Toggled the ndmpd daemon off and on for both source and destination o Set the ndmpd debug level to 70 o Set the debug option on the ndmpcopy command o Set the ndmpd level to three (had been four) Regarding the detailed ndmpdlog, there is no explanation as to what causes the error (only the error): Feb 25 06:00:05 CST [ndmpd:73]: Restore type: dump Feb 25 06:00:05 CST [ndmpd:73]: Error code: NDMP_ILLEGAL_STATE_ERR <--- Feb 25 06:00:05 CST [ndmpd:73]: Key: EXTRACT, Value: N Feb 25 06:00:05 CST [ndmpd:73]: Key: EXCLUDE_PATH, Value: /etc Feb 25 06:00:05 CST [ndmpd:73]: Key: FILESYSTEM, Value: /vol/vol0 Feb 25 06:00:05 CST [ndmpd:73]: Key: BASE_DATE, Value: 9856860996 Feb 25 06:00:05 CST [ndmpd:73]: Key: UPDATE, Value: Y Feb 25 06:00:05 CST [ndmpd:73]: Key: LEVEL, Value: 2 <--- Feb 25 06:00:05 CST [ndmpd:73]: Key: DMP_NAME, Value: ndmpcopy:/vol/vol0//vol/vol0/128.248.155.101128.248.155.102 As you can see, this failure occurred on a level two operation; it has occurred on the other two levels as well. I believe, though I can say for certain, that the failure rate increased after upgrading ONTAP on the destination from 7.2.2 to 7.2.7. Please do not suggest we license snapmirror. We are a university in Illinois, a state whose financial situation is only slightly better than that of California. If I can't solve the problem, then *** yes *** I'll try vol copy. Michael Homa Operating Systems Support and Database Group Academic Computing and Communication Center University of Illinois at Chicago email: mhoma(a)uic.edu

2 1

2025

2024

2023

2022

2021

2020

2019

2018

2017

2016

2015

2014

2013

2012

2011

2010

2009

2008

2007

2006

2005

2004

2003

2002

2001

2000

1999

1998

1997

toasters March 2010