I have been having some performance issues on my F880 the past couple of days. Specifically, an NDMP backup that normally takes 2 hours to complete was stuck in the mapping phase for 12 hours before I killed it.
What I am seeing is a high disk read, but without a corresponding network out. Here is a sysstat:
CPU   NFS  CIFS  HTTP    Net kB/s      Disk kB/s    Tape kB/s  Cache
                          in     out    read  write  read write   age
62%  1456     6     0    794   16954   46326     0     0    0     4s
56%  1418     0     0    597   13689   36506     0     0    0     5s
64%  1737     3     0    867   17637   43572     0     0    0     5s
53%  1135    18     0    572   14823   42905  1155     0    0     4s
73%  2067    20     0   1282   37835   69880  6347     0    0     4s
69%  2031    24     0   1267   38868   67632     0     0    0     4s
74%  1921     3     0   1027   26644   50881  7778     0    0     5s
59%  1459     0     0    774   18843   41313     0     0    0     6s
60%  1212     2     0    665   16391   38766     0     0    0     7s
54%  1028    19     0    509   11873   33960     0     0    0     7s
61%   844   302     0    571   13765   39032     0     0    0     7s
72%  1164   598     0   1870   16750   39968     0     0    0     5s
71%   948   640     0   8961    9788   29552     0     0    0     5s
79%  1297   603     0  24157    5831   26034     0     0    0     4s
87%  2130   465     0  23581    9794   31269     0     0    0     2s
I don't know why the disks are reading so much without it going out the network.
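[Editor's note: the imbalance is easy to quantify from a saved sysstat log with a short awk filter. This is just a sketch: the column positions assume the default 11-column sysstat layout above, and the heredoc stands in for a captured log file.]

```shell
# Flag intervals where disk reads far outpace network out.
# Columns per the default sysstat layout: $6 = Net out kB/s,
# $7 = Disk read kB/s, $11 = Cache age. The heredoc stands in
# for a saved sysstat log.
awk 'NF == 11 && $6 > 0 && $7 / $6 > 2 {
    printf "read=%d out=%d ratio=%.1f age=%s\n", $7, $6, $7 / $6, $11
}' <<'EOF'
62% 1456 6 0 794 16954 46326 0 0 0 4s
71% 948 640 0 8961 9788 29552 0 0 0 5s
87% 2130 465 0 23581 9794 31269 0 0 0 2s
EOF
```

A ratio consistently near 3 kB read per kB sent, as in the samples above, points at reads that never make it back to clients.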
This is an F880 running Data ONTAP 7.1. It is fairly full -- about 92% (1.2 TB out of 1.3 TB).
There aren't any stuck NDMP sessions:
ndmpd status
ndmpd ON. No ndmpd sessions active.
I haven't really looked at what the clients may be doing, given the low network-out numbers -- but that's where I'm going next.
Does anyone have any suggestions on what to look for?
Thanks,
Paul
Did your number of files increase dramatically? Is there a reconstruction going on? A WAFL scan, or an upgrade? OK, that last one might be silly, but you never know... :)
Can you capture a perfstat, or at least a few 30-second samples of statit? Perfstat is usually overkill, but here's a quick shell script I run to grab what I need.
#!/bin/sh
#
if [ -z "$1" ]; then
        echo " "
        echo "I need a filer target"
        echo "An example syntax"
        echo "    get-stats.sh filer01.msg.dcn"
        echo " "
        exit 0
fi

FILER=$1
#
while true
do
        DATAFILE="$FILER`date | awk '{print "_data_" $2 $3 }'`"
        echo "" >> $DATAFILE
        date >> $DATAFILE
        echo "------------------------------" >> $DATAFILE
        rsh $FILER 'priv set -q diag; statit -b' 2>/dev/null
        echo "Starting statit sample" >> $DATAFILE
        rsh $FILER 'priv set -q diag; nfsstat -z' 2>/dev/null
        echo "Zeroing nfsstat" >> $DATAFILE
        rsh $FILER 'priv set -q diag; nfs_hist -z' 2>/dev/null
        echo "Zeroing nfs_hist" >> $DATAFILE
        rsh $FILER 'priv set -q diag; wafl_susp -z' 2>/dev/null
        echo "Zeroing wafl_susp" >> $DATAFILE
        rsh $FILER 'sysstat -xs -c 30 1' >> $DATAFILE

        # And we wait...

        rsh $FILER 'priv set -q diag; statit -en' >> $DATAFILE 2>/dev/null
        rsh $FILER 'priv set -q diag; nfsstat -d' >> $DATAFILE
        rsh $FILER 'priv set -q diag; nfs_hist' >> $DATAFILE
        rsh $FILER 'priv set -q diag; wafl_susp -w' >> $DATAFILE

        echo " ** " >> $DATAFILE
done
If you don't allow rsh, you can enable passphrase ssh and replace rsh with ssh (I think it's built into ONTAP 7 for free now), or just run the commands above in sequence and save the output to a text file. A few samples of each, about 30 seconds apart, should do it -- but only during the problem; it's not helpful if it isn't happening at the moment.
I usually run the script for about 5 to 10 minutes, then Ctrl+C out of it.
Right now I mostly go through the output by hand.
-Blake
On 2/14/07, Paul Letta letta@jlab.org wrote:
[quoted message snipped]
Look at your cache age. It's very low, meaning that you are not serving data out of memory, only directly from disk. Notice that as the cache age goes down, your network-out numbers go down as well.
When your filer gets full, it isn't able to find contiguous space to write data to (so much for "write anywhere"), so it starts breaking writes up and you lose locality, which defeats the benefits of NetApp's read-ahead caching mechanism -- hence the low cache ages and poor performance. That's probably what's happening.
You should think about a wafl scan reallocate when system utilization goes down, or about getting a bunch of data off the box.
Glenn (the other one)
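[Editor's note: a sketch of what the reallocation suggestion might look like on the console in Data ONTAP 7.x. The volume name is a placeholder; measure first, and run any full pass off-peak -- check your release's reallocate documentation before relying on these exact flags.]

```
filer> reallocate measure /vol/vol1    # report how fragmented the volume is
filer> reallocate start -f /vol/vol1   # one-time full reallocation pass
filer> reallocate status               # check progress
```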
-----Original Message-----
From: owner-toasters@mathworks.com [mailto:owner-toasters@mathworks.com] On Behalf Of Paul Letta
Sent: Wednesday, February 14, 2007 11:10 AM
To: toasters@mathworks.com
Subject: High disk reads but low network out
[quoted message snipped]
You might want to look into turning "minra on" (which enables minimum read-ahead -- it still reads ahead, just not as much*) for the volumes with the heavy load, if the theory is a highly random workload.
-Blake
* I know, I know -- the adaptive read-ahead changes in 6.5 and 7.0 make this option mostly moot, but in some situations, like a highly random workload over small files, it's probably a good idea, since you'll likely never get a good read-ahead hit rate from that workload...
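[Editor's note: for reference, minra is a per-volume option; a sketch, with the volume name as a placeholder:]

```
filer> vol options vol1 minra on   # minimal read-ahead on this volume
filer> vol options vol1            # verify the option took
```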
On 2/14/07, Glenn Dekhayser gdekhayser@voyantinc.com wrote:
[quoted messages snipped]
Hi,
As for the minra option, I must admit that the best setting for newer ONTAP releases is "minra off". We had databases with a high random load, and with "minra off" we saw better response times than with "minra on".
It seems that ONTAP manages the amount of read-ahead according to the structure of the data requests (random or sequential).
But, as always, you have to say "it depends" :D
Regards
Jochen
-----Original Message-----
From: owner-toasters@mathworks.com [mailto:owner-toasters@mathworks.com] On Behalf Of Blake Golliher
Sent: Wednesday, February 14, 2007 7:51 PM
To: Glenn Dekhayser
Cc: letta@jlab.org; toasters@mathworks.com
Subject: Re: High disk reads but low network out
[quoted messages snipped]