I have been having some performance issues on my F880 the past couple of days. Specifically, an NDMP backup that normally takes 2 hours to complete was stuck in the mapping phase for 12 hours before I killed it.
What I am seeing is a high disk read, but without a corresponding network out. Here is a sysstat:
CPU   NFS  CIFS  HTTP    Net kB/s      Disk kB/s    Tape kB/s  Cache
                          in     out    read  write  read write   age
62%  1456     6     0    794   16954   46326     0     0    0     4s
56%  1418     0     0    597   13689   36506     0     0    0     5s
64%  1737     3     0    867   17637   43572     0     0    0     5s
53%  1135    18     0    572   14823   42905  1155     0    0     4s
73%  2067    20     0   1282   37835   69880  6347     0    0     4s
69%  2031    24     0   1267   38868   67632     0     0    0     4s
74%  1921     3     0   1027   26644   50881  7778     0    0     5s
59%  1459     0     0    774   18843   41313     0     0    0     6s
60%  1212     2     0    665   16391   38766     0     0    0     7s
54%  1028    19     0    509   11873   33960     0     0    0     7s
61%   844   302     0    571   13765   39032     0     0    0     7s
72%  1164   598     0   1870   16750   39968     0     0    0     5s
71%   948   640     0   8961    9788   29552     0     0    0     5s
79%  1297   603     0  24157    5831   26034     0     0    0     4s
87%  2130   465     0  23581    9794   31269     0     0    0     2s
I don't know why the disks are reading so much without it going out the network.
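[Editor's note: the imbalance is easy to quantify from a saved sysstat log with a short awk filter. This is just a sketch: the column positions assume the default 11-column sysstat layout above, and the heredoc stands in for a captured log file.]

```shell
# Flag intervals where disk reads far outpace network out.
# Columns per the default sysstat layout: $6 = Net out kB/s,
# $7 = Disk read kB/s, $11 = Cache age. The heredoc stands in
# for a saved sysstat log.
awk 'NF == 11 && $6 > 0 && $7 / $6 > 2 {
    printf "read=%d out=%d ratio=%.1f age=%s\n", $7, $6, $7 / $6, $11
}' <<'EOF'
62% 1456 6 0 794 16954 46326 0 0 0 4s
71% 948 640 0 8961 9788 29552 0 0 0 5s
87% 2130 465 0 23581 9794 31269 0 0 0 2s
EOF
```

A ratio consistently near 3 kB read per kB sent, as in the samples above, points at reads that never make it back to clients.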
This is an F880 running Data ONTAP 7.1. It is fairly full -- about 92% (1.2 TB out of 1.3 TB).
There aren't any stuck NDMP sessions:
ndmpd status
ndmpd ON. No ndmpd sessions active.
I haven't really looked at what the clients may be doing, given the low network-out numbers -- but that's where I'm going next.
Does anyone have any suggestions on what to look for?
Thanks,
Paul
Did your number of files increase dramatically? Is there a reconstruction going on? A WAFL scan, or an upgrade? OK, that last one might be silly, but you never know... :)
Can you capture a perfstat, or at least a few 30-second samples of statit? Perfstat is usually overkill, but here's a quick shell script I run to grab what I need.
#!/bin/sh
#
if [ -z "$1" ]; then
        echo " "
        echo "I need a filer target"
        echo "An example syntax"
        echo "    get-stats.sh filer01.msg.dcn"
        echo " "
        exit 0
fi

FILER=$1
#
while true
do
        DATAFILE="$FILER`date | awk '{print "_data_" $2 $3 }'`"
        echo "" >> $DATAFILE
        date >> $DATAFILE
        echo "------------------------------" >> $DATAFILE
        rsh $FILER 'priv set -q diag; statit -b' 2>/dev/null
        echo "Starting statit sample" >> $DATAFILE
        rsh $FILER 'priv set -q diag; nfsstat -z' 2>/dev/null
        echo "Zeroing nfsstat" >> $DATAFILE
        rsh $FILER 'priv set -q diag; nfs_hist -z' 2>/dev/null
        echo "Zeroing nfs_hist" >> $DATAFILE
        rsh $FILER 'priv set -q diag; wafl_susp -z' 2>/dev/null
        echo "Zeroing wafl_susp" >> $DATAFILE
        rsh $FILER 'sysstat -xs -c 30 1' >> $DATAFILE

        # And we wait...

        rsh $FILER 'priv set -q diag; statit -en' >> $DATAFILE 2>/dev/null
        rsh $FILER 'priv set -q diag; nfsstat -d' >> $DATAFILE
        rsh $FILER 'priv set -q diag; nfs_hist' >> $DATAFILE
        rsh $FILER 'priv set -q diag; wafl_susp -w' >> $DATAFILE

        echo " ** " >> $DATAFILE
done
If you don't allow rsh, you can enable passphrase ssh and replace rsh with ssh (I think it's built into ONTAP 7 for free now), or just run the commands above in sequence and save the output to a text file. A few samples of each, about 30 seconds apart, should do it -- but only during the problem; it's not helpful if it isn't happening at the moment.
I usually run the script for about 5 to 10 minutes, then Ctrl+C out of it.
Right now I mostly go through the output by hand.
-Blake
On 2/14/07, Paul Letta letta@jlab.org wrote:
[quoted message snipped]
Look at your cache age. It's very low, meaning that you are not serving data out of memory, only directly from disk. Notice that as the cache age goes down, your network-out numbers go down as well.
When your filer gets full, it isn't able to find contiguous space to write data to (so much for "write anywhere"), so it starts breaking writes up and you lose locality, which defeats the benefits of NetApp's read-ahead caching mechanism -- hence the low cache ages and poor performance. That's probably what's happening.
You should think about a wafl scan reallocate when system utilization goes down, or about getting a bunch of data off the box.
Glenn (the other one)
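[Editor's note: a sketch of what the reallocation suggestion might look like on the console in Data ONTAP 7.x. The volume name is a placeholder; measure first, and run any full pass off-peak -- check your release's reallocate documentation before relying on these exact flags.]

```
filer> reallocate measure /vol/vol1    # report how fragmented the volume is
filer> reallocate start -f /vol/vol1   # one-time full reallocation pass
filer> reallocate status               # check progress
```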
-----Original Message-----
From: owner-toasters@mathworks.com [mailto:owner-toasters@mathworks.com] On Behalf Of Paul Letta
Sent: Wednesday, February 14, 2007 11:10 AM
To: toasters@mathworks.com
Subject: High disk reads but low network out
[quoted message snipped]
You might want to look into turning "minra on" (which enables minimum read-ahead -- it still reads ahead, just not as much*) for the volumes with the heavy load, if the theory is a highly random workload.
-Blake
* I know, I know -- the adaptive read-ahead changes in 6.5 and 7.0 make this option mostly moot, but in some situations, like a highly random workload over small files, it's probably a good idea, since you'll likely never get a good read-ahead hit rate from that workload...
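[Editor's note: for reference, minra is a per-volume option; a sketch, with the volume name as a placeholder:]

```
filer> vol options vol1 minra on   # minimal read-ahead on this volume
filer> vol options vol1            # verify the option took
```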
On 2/14/07, Glenn Dekhayser gdekhayser@voyantinc.com wrote:
[quoted messages snipped]
Hi,
As for the minra option, I must admit that the best setting for newer ONTAP releases is "minra off". We had databases with a high random load, and with "minra off" we saw better response times than with "minra on".
It seems that ONTAP manages the amount of read-ahead according to the structure of the data requests (random or sequential).
But, as always, you have to say "it depends" :D
Regards
Jochen
-----Original Message-----
From: owner-toasters@mathworks.com [mailto:owner-toasters@mathworks.com] On Behalf Of Blake Golliher
Sent: Wednesday, February 14, 2007 7:51 PM
To: Glenn Dekhayser
Cc: letta@jlab.org; toasters@mathworks.com
Subject: Re: High disk reads but low network out
[quoted messages snipped]