I have a 740 with ~600 GB of data. It is running 5.3.6R2, has 512 MB of RAM, and is connected via OC-3 (155 Mb) ATM.
Periodically I see the good old
"NFS server suchandsuch not responding" on various mid-range Solaris boxes,
and users complain about bad performance. General network performance is good for file transfers and so on. There is an ONTAP bug (12898) for 5.3.6R2 relating to these symptoms, but in a TCP NFS environment; most of my Solaris NFS mounts are v3 over UDP. I am going to do an upgrade anyway, but I was wondering if anyone else has seen this type of behavior. Also, see below for the output of my sysstat; this goes on most of the day. It looks to me as if my filer is getting hammered. What do you guys think?
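(If anyone wants to double-check what a given Solaris client is actually using, nfsstat -m on the client reports the vers= and proto= options for each NFS mount. The host and path below are just placeholders, and the output is trimmed:

    solaris$ nfsstat -m
    /mnt/data from filer:/vol/vol0/data
     Flags: vers=3,proto=udp,hard,intr,rsize=32768,wsize=32768,retrans=5,...
)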
thanks
roger
 CPU    NFS   CIFS   HTTP     Net kB/s     Disk kB/s    Tape kB/s  Cache
                               in    out    read  write  read write   age
 51%   4074      1      0    2854   3059    2040    908     0     0     1
100%   4146      0      0    4803   5471    6737  10593     0     0     1
 67%   3228      1      0    4511   5770    3899   3587     0     0     1
 81%   1894      0      0    3431   3363    8564  10631     0     0     1
 71%   2308      0      0    5999   6482    3671   2843     0     0     1
 57%   2108      1      0    2645   3359    4482   7641     0     0     1
 65%   2882      1      0    5364   8072    3736      0     0     0     1
 99%   2466      0      0    8339   7281    5808   6465     0     0     1
 96%   2264      2      0    2869   3556   10404  12876     0     0     1
 84%   4252      0      0    4003   2174    4090  12652     0     0     1
 70%   3063      0      0    2939   4694    4820   5616     0     0     1
 67%   3098      1      0    3097   4024    2878  10606     0     0     1
 66%   4958      0      0    5352   2693    2137      0     0     0     1
Roger,
You are correct, your filer is getting hammered. Look at the disk I/O first. Normally, your filer will flush writes every 10 seconds (unless NVRAM fills up). You're writing all the time, so you know people are trying to write more than 16 MB in a 10-second period (ONTAP splits the 32 MB of NVRAM into two halves; when one half fills up, it makes the other half available to users while it flushes the first half to disk).
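As a rough sanity check against your sysstat output above (only a back-of-the-envelope estimate, since not all of the inbound network traffic will be write data): the Net in column runs around 3,000-8,000 kB/s. Even at a sustained 5 MB/s of incoming writes,

    16 MB (one half of NVRAM) / 5 MB/s ≈ 3 seconds to fill a half

which is well under the 10-second CP timer, so the filer ends up doing checkpoint after checkpoint instead of one every 10 seconds.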
Cache age of 1 means your read cache is being refreshed with new data every minute. Ideally, this number would be 60, meaning everything is in your read cache and you're not reading from disk that often.
Try going into rc_toggle_basic mode and running the statit command. Run statit -b to begin collection. Wait 30 seconds or more and then run statit -e. This will give you details on what the CPU is doing. It also mentions CP, or checkpoint, which is the operation of flushing NVRAM to disk.
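For reference, the whole sequence at the filer console looks roughly like this ("filer>" is just a stand-in for your console prompt):

    filer> rc_toggle_basic
    filer> statit -b
    ... wait 30 seconds or more while the load you care about is happening ...
    filer> statit -e

The statit -e output is what contains the CPU and CP details (and the per-disk numbers mentioned further down).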
The other command you can use to gather meaningful data is wafl_susp. Run wafl_susp -z to reset the stats. Wait 30 seconds or more and then run wafl_susp -w (you may want to do this via rsh and save the output to a file).
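A minimal sketch of capturing that from a Unix host, assuming rsh access to the filer is already set up (the hostname, sampling window, and output path are just placeholders):

    rsh filer wafl_susp -z
    sleep 300    # or whatever window you want to sample; 30 seconds or more
    rsh filer wafl_susp -w > /tmp/wafl_susp.out
    grep cp_from_cp /tmp/wafl_susp.out

The last line just pulls out the cp_from_cp field discussed below.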
In the output, there is a field called "cp_from_cp". If this counter is non-zero at all, your NVRAM is overflowing and denying write requests to users. cp_from_cp means you're in the middle of checkpointing the first half of NVRAM and yet the second half has already filled up and needs to checkpoint as well. This is VERY bad.
Recommendations: Upgrade the head. The F8XX filers have 128 MB of NVRAM and come with 1 GB to 3 GB of RAM. This will fix your problem (if you have the budget). The other option would be to find another filer in your environment and move some of the data off the busy one.
The output from statit will tell you which physical disks are most utilized. You may be able to use this to narrow down which volume is the culprit (assuming you have many volumes).
Hope this helps.
/Brian/
A 740 running 6.1.1 keeps logging
sysconfig: SCSI Adapter card (PN X201X) in slot 1 must be in one of these slots: 6,2.
and indeed we do have a DLT7000 hooked up to a SCSI adapter in slot 1.
But checking the 6.1.1 configuration guide, "SCSI for tape" shows slot 6 as the first choice ("X1") and slot 1 as the second choice ("X2") and doesn't list slot 2 as a valid choice at all. What am I missing?
We've been chasing some NFS timeout issues here as well. The sysstat 1 output on the filer doesn't look very bad; however, I did the wafl_susp -z followed by a wafl_susp -w after a few minutes (approx. 5). The cp_from_cp value at that time was 1, and it continued to increase over time. The recommendation below was to wait 30 seconds before checking the value, but does that matter? Does a value other than 0 in this parameter mean the filer is denying write requests to users?
Kelvin Edwards
System Admin
Jefferson Lab
Brian Long wrote:
The other command you can use to gather meaningful data is wafl_susp. Run wafl_susp -z to reset the stats. Wait 30 seconds or more and then run wafl_susp -w (you may want to do this via rsh and save the output to a file).
In the output, there is a field called "cp_from_cp". If this counter is non-zero at all, your NVRAM is overflowing and denying write requests to users. cp_from_cp means you're in the middle of checkpointing the first half of NVRAM and yet the second half has already filled up and needs to checkpoint as well. This is VERY bad.