We have a number of f230's here that seem to have either stopped doing their raid scrubbing or aren't bothering to report it any more.
While the /etc/rc file doesn't explicitly ask to turn on scrubbing, doing an 'rsh netapp1 options' returns this:
raidtimeout 24
raid.reconstruct_speed 4
raid.scrub.enable on
which seems pretty typical. Only 2 out of our 7 operating netapps are actually logging this:
Sun Oct 18 01:00:00 EDT [raid_scrub_admin]: Beginning disk scrubbing...
Sun Oct 18 02:47:16 EDT [raid_scrub_admin]: Scrub found 0 parity inconsistencies
Sun Oct 18 02:47:16 EDT [raid_scrub_admin]: Scrub found 0 media errors
Sun Oct 18 02:47:16 EDT [raid_scrub_admin]: Disk scrubbing finished...
Of the two that are logging, one is an f210 running ONTAP 4.2a, the other is an f210 running ONTAP 5.1.2
The ones not logging are 4 x f230's, 1 x f220.
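[Editorial sketch] Checking each filer's messages for the scrub lines quoted above could be automated roughly as follows. This is only a sketch: the message wording is copied from the excerpt above and may differ between ONTAP releases, and the names summarize_scrub and SAMPLE are made up for illustration.

```python
import re

# Matches the [raid_scrub_admin] lines quoted above. The wording comes from
# that excerpt only; other ONTAP releases may phrase these differently.
SCRUB_LINE = re.compile(r"\[raid_scrub_admin\]: (?P<msg>.+)$")
FOUND = re.compile(
    r"Scrub found (?P<count>\d+) (?P<kind>parity inconsistencies|media errors)"
)


def summarize_scrub(log_text):
    """Summarize the scrub messages found in a chunk of log text."""
    summary = {
        "started": False,
        "finished": False,
        "parity inconsistencies": None,
        "media errors": None,
    }
    for line in log_text.splitlines():
        match = SCRUB_LINE.search(line)
        if not match:
            continue  # not a scrub message
        msg = match.group("msg")
        if msg.startswith("Beginning disk scrubbing"):
            summary["started"] = True
        elif msg.startswith("Disk scrubbing finished"):
            summary["finished"] = True
        else:
            found = FOUND.match(msg)
            if found:
                summary[found.group("kind")] = int(found.group("count"))
    return summary


# The four log lines from the message above, used as sample input.
SAMPLE = """\
Sun Oct 18 01:00:00 EDT [raid_scrub_admin]: Beginning disk scrubbing...
Sun Oct 18 02:47:16 EDT [raid_scrub_admin]: Scrub found 0 parity inconsistencies
Sun Oct 18 02:47:16 EDT [raid_scrub_admin]: Scrub found 0 media errors
Sun Oct 18 02:47:16 EDT [raid_scrub_admin]: Disk scrubbing finished...
"""

if __name__ == "__main__":
    print(summarize_scrub(SAMPLE))
```

A filer whose summary comes back with "started" still False after a weekend would be one of the non-logging boxes described here.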
On a related note, we happen to know that one of our hot spares on one of the f210's is showing a read error. It was discovered during the install. I'm not entirely sure if I have logs of the discovery. But along those lines, are any tests or scrubs done on hot spares, or are they assumed to be healthy?
---------------------------------------------------------------- Dave Cole (DC1110) | dacole@netcom.ca Systems Administrator |* dacole@rik.net * | office/~dacole/ Netcom Canada |* www.rik.net/~dacole/ * 905 King Street West, Toronto, M6K 3G9 | phone - 416.341.5801 Toronto, Ontario, Canada, Earth, Sol | fax - 416.341.5725
Dave Cole dacole@netcom.ca writes:
We have a number of f230's here that seem to have either stopped doing their raid scrubbing or aren't bothering to report it any more.
One thing that can cause scrubs to be missed sometimes is setting the system time (via rdate) near the top of the hour. If it's not that, I suggest mailing support@netapp.com about it.
- Dan
NetApp told me that a "Busy" box will skip Raid Scrub.
I've got a F520 that does the same thing.
I added an rsh to my crontab to force it:
05 03 * * 0 /usr/bin/rsh <HostName> disk scrub start
Larry Rosenman
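[Editorial sketch] For sites with several filers, the crontab line above could be wrapped in a small script that cron runs instead. This is a sketch under assumptions: the helper names scrub_command and start_scrubs are invented here, the filer name netapp1 is borrowed from earlier in the thread, and the only filer command used is the "disk scrub start" shown above.

```python
import subprocess


def scrub_command(filer, rsh="/usr/bin/rsh"):
    """Build the argv list for the 'disk scrub start' command shown above."""
    return [rsh, filer, "disk", "scrub", "start"]


def start_scrubs(filers, run=subprocess.run):
    """Issue the scrub command to each filer; return {filer: rsh exit status}.

    The exit status is that of the local rsh client, so it only shows
    whether the command could be sent, not whether the scrub completed.
    """
    results = {}
    for filer in filers:
        completed = run(scrub_command(filer), check=False)
        results[filer] = completed.returncode
    return results


if __name__ == "__main__":
    # Hypothetical host list; replace with your own filer names.
    print(start_scrubs(["netapp1"]))
```

The run parameter exists only so the function can be exercised without a real rsh binary.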
Hmm... That's interesting. Is there a CPU usage threshold before the filer becomes "Busy"? Is it 60%, 70% or what?
Peter Tran
-----Original Message----- From: Larry Rosenman-CyberRamp System Administration [mailto:ler@cyberramp.net] Sent: Tuesday, October 27, 1998 11:22 AM To: quinlan@transmeta.com; Dave Cole Cc: toasters@mathworks.com Subject: RE: raid scrubbing and hot spares
NetApp told me that a "Busy" box will skip Raid Scrub.
I've got a F520 that does the same thing.
I added a rsh to my crontab to force it:
05 03 * * 0 /usr/bin/rsh <HostName> disk scrub start
Larry Rosenman -- Larry Rosenman, Sr. System Administrator, CyberRamp Internet Services E-Mail: ler@cyberramp.net, http://www.cyberramp.net Voice: (214) 343-3333/(817) 461-8484 (Metro)/Fax: (214) 343-3727 Technical Support: (214) 340-2020/(817) 226-2020 (Metro) U.S. Mail: 11350 Hillguard Rd, Dallas, TX 75243-8311
-----Original Message----- From: Daniel Quinlan [mailto:quinlan@transmeta.com] Sent: Tuesday, October 27, 1998 11:10 AM To: Dave Cole Cc: toasters@mathworks.com Subject: Re: raid scrubbing and hot spares
Dave Cole dacole@netcom.ca writes:
We have a number of f230's here that seem to have either stopped doing their raid scrubbing or aren't bothering to report it any more.
One thing that can cause scrubs to be missed sometimes is setting the system time (via rdate) near the top of the hour. If it's not that, I suggest mailing support@netapp.com about it.
- Dan
They were fairly nebulous. My filer is not CPU bound, but has lots of traffic.
I wish I had a better answer.
Netapp: Any clue(s)?
LER
-- Larry Rosenman, Sr. System Administrator, CyberRamp Internet Services E-Mail: ler@cyberramp.net, http://www.cyberramp.net Voice: (214) 343-3333/(817) 461-8484 (Metro)/Fax: (214) 343-3727 Technical Support: (214) 340-2020/(817) 226-2020 (Metro) U.S. Mail: 11350 Hillguard Rd, Dallas, TX 75243-8311
-----Original Message----- From: ptran [mailto:ptran@broadcom.com] Sent: Tuesday, October 27, 1998 3:49 PM To: Larry Rosenman-CyberRamp System Administration; quinlan@transmeta.com; Dave Cole Cc: toasters@mathworks.com Subject: RE: raid scrubbing and hot spares
Hmm... That's interesting. Is there a CPU usage threshold before the filer becomes "Busy"? Is it 60%, 70% or what?
Peter Tran
Hmm... That's interesting. Is there a CPU usage threshold before the filer becomes "Busy"? Is it 60%, 70% or what?
I have a firm and definitive answer:
It depends.
Seriously, there are four main resources that a filer can run out of, and sysstat(1) reports on all four. They are:
- CPU
- network bandwidth
- disk bandwidth
- caching memory
A filer can perform surprisingly well at 90%+ CPU utilization if the other resources aren't bottlenecked. Our kernel does a pretty good job of avoiding queuing delays and really does seem to operate pretty well even at full utilization.
On the other hand, the performance might be unacceptably poor at 40% CPU utilization if the cache age is low and the disk bandwidth has reached its limit.
Of course, with the CPU, it's easy to see how much is left, because it is reported as a percentage. There's no simple solution for figuring out how much disk bandwidth is left, because the disk subsystem can be bottlenecked on raw disk bandwidth, on disk head seeks, or even on the SCSI or Fibre Channel connection. And of course, the bottleneck depends in part on the access patterns. Sequential reads from large files are relatively easy, because the blocks will be near each other on disk and seeking will be minimized. Random reads from files throughout the file system will generate lots of seeks that will really slow things down.
The original question was in the context of a disk scrub. This generates both disk I/O, which can contribute to a disk bottleneck, and CPU to look at the data, which can contribute to a CPU bottleneck.
Where the bottleneck will actually turn out to be depends on lots of things: How many disks? How many SCSIs (or Fibre Channels)? What's the disk access pattern from users?
I know that I haven't answered your question, but hopefully I've shed some light on why it's a hard question. :-)
Dave
Dave Hitz wrote:
Seriously, there are four main resources that a filer can run out of, and sysstat(1) reports on all four. They are:
- CPU
- network bandwidth
- disk bandwidth
- caching memory
Therefore, I want MRTG and HP OpenView to pay attention to those 4 things. However, from my look at the MIB, it doesn't really have those last two, and it is slightly unclear what the best variables are for the first two.
I'd really be able to better plan my purchases for the next year if I could have MRTG generate graphs of those four statistics for every Filer that I own. I could show a pretty picture to management and with certainty state, "...and therefore we can safely add more disk" or "...and therefore, you can't add more disk to this box; buy another Filer."
Am I mis-reading the MIB or does the MIB fall short?
--tal
The mib for the future release currently in process contains the following:
cpuBusyTimePerCent OBJECT-TYPE
    SYNTAX INTEGER (0..100)
    ACCESS read-only
    STATUS mandatory
    DESCRIPTION "The percent of time that the CPU has been doing useful work since the last boot."
    ::= { cpu 3 }

miscNetRcvdKB OBJECT-TYPE
    SYNTAX INTEGER
    ACCESS read-only
    STATUS mandatory
    DESCRIPTION "The total number of KBytes received on all the network interfaces, since the last boot."
    ::= { misc 2 }

miscNetSentKB OBJECT-TYPE
    SYNTAX INTEGER
    ACCESS read-only
    STATUS mandatory
    DESCRIPTION "The total number of KBytes transmitted on all the network interfaces, since the last boot."
    ::= { misc 3 }

dfEntry OBJECT-TYPE
    SYNTAX DfEntry
    ACCESS not-accessible
    STATUS mandatory
    DESCRIPTION "Provide a report of the available disk space on the referenced file system."
    INDEX { dfIndex }
    ::= { dfTable 1 }

DfEntry ::= SEQUENCE {
    dfIndex INTEGER,
    dfFileSys DisplayString,
    dfKBytesTotal INTEGER,
    dfKBytesUsed INTEGER,
    dfKBytesAvail INTEGER,
    dfPerCentKBytesCapacity INTEGER,
    dfInodesUsed INTEGER,
    dfInodesFree INTEGER,
    dfPerCentInodeCapacity INTEGER,
    dfMountedOn DisplayString,
    dfMaxFilesAvail INTEGER,
    dfMaxFilesUsed INTEGER,
    dfMaxFilesPossible INTEGER
}
...
Do these take care of the first three? I'll take a look at adding an entry for cache memory utilization.
Alan
--------------------------------------------------------------- Alan G. Yoder agy@netapp.com Network Appliance, Inc. 2770 San Tomas Expressway voice 408-367-3031 Santa Clara, CA 95051 fax 408-367-3451 ---------------------------------------------------------------
agy wrote:
The mib for the future release currently in process contains the following:
cpuBusyTimePerCent OBJECT-TYPE
    SYNTAX INTEGER (0..100)
    ACCESS read-only
    STATUS mandatory
    DESCRIPTION "The percent of time that the CPU has been doing useful work since the last boot."

It isn't very interesting to know the average CPU busy% for the last 6 months. We don't reboot our filers and proxies very often. It is much more useful to have an immediate measurement of the CPU busy% value as reported by sysstat. That would show patterns and peaks. One of our filers works for a news system, and there I'm interested in seeing just how busy it gets around 9pm when everyone is reading news. The average value that I get now is really quite useless.
The same argument goes for all the other "since the last boot" things. An immediate measurement, no matter how heavily influenced by the moment, still says a lot more.
Regards Elena
---------------------------------------------------------------- Elena Samsonova e-mail: E.Samsonova@wxs.nl World Access / Planet Internet phone: +31 33 45 40 417 PO Box 2529, 3800 GB Amersfoort fax: +31 33 45 40 401 The Netherlands ----------------------------------------------------------------
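[Editorial sketch] A poller can partly recover interval figures from "since the last boot" values by differencing two samples, which is roughly what MRTG does with counters. The sketch below assumes the poller also has an uptime in seconds for each sample (for example from the standard sysUpTime object); the function names are invented for illustration and are not part of any NetApp tool.

```python
def counter_rate(count1, time1, count2, time2):
    """Average rate between two samples of a since-boot counter.

    Works for totals such as miscNetRcvdKB: (KB2 - KB1) / elapsed seconds.
    """
    if time2 <= time1:
        raise ValueError("second sample must be later than the first")
    return (count2 - count1) / (time2 - time1)


def interval_busy_percent(pct1, uptime1, pct2, uptime2):
    """Busy % between two polls, from since-boot averages and uptimes.

    Busy seconds since boot are pct * uptime / 100, so the interval figure
    is the difference in busy seconds over the difference in uptime.
    """
    if uptime2 <= uptime1:
        raise ValueError("second sample must be later than the first")
    busy1 = pct1 * uptime1 / 100.0
    busy2 = pct2 * uptime2 / 100.0
    return 100.0 * (busy2 - busy1) / (uptime2 - uptime1)


if __name__ == "__main__":
    # Made-up numbers: 20% busy at 1000 s uptime, 35% busy at 1300 s uptime.
    print(interval_busy_percent(20, 1000, 35, 1300))
    # Made-up numbers: 3000 KB received over a 300 s polling interval.
    print(counter_rate(1000, 0, 4000, 300))
```

This works well for the KByte totals, but it supports the objection above for CPU: because cpuBusyTimePerCent is an integer from 0 to 100, one percentage point after six months of uptime represents more than a day of CPU time, so a five-minute interval cannot be recovered from it.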
Have you looked at the MRTG tools available on NOW? They can be used to generate graphs of usage (including CPU) over a selectable period.
The tools are at http://now.netapp.com/download/tools/filer-mrtg/
MRTG is at http://ee-staff.ethz.ch/~oetiker/webtools/mrtg/mrtg.html
Andrew
At 10:47 04/11/98 +0100, Elena Samsonova wrote:
agy wrote:
The mib for the future release currently in process contains the following:
cpuBusyTimePerCent OBJECT-TYPE
    SYNTAX INTEGER (0..100)
    ACCESS read-only
    STATUS mandatory
    DESCRIPTION "The percent of time that the CPU has been doing useful work since the last boot."

It isn't very interesting to know the average CPU busy% for the last 6 months. We don't reboot our filers and proxies very often. It is much more useful to have an immediate measurement of the CPU busy% value as reported by sysstat. That would show patterns and peaks. One of our filers works for a news system, and there I'm interested in seeing just how busy it gets around 9pm when everyone is reading news. The average value that I get now is really quite useless.
The same argument goes for all the other "since the last boot" things. An immediate measurement, no matter how heavily influenced by the moment, still says a lot more.
Regards Elena
Elena Samsonova e-mail: E.Samsonova@wxs.nl World Access / Planet Internet phone: +31 33 45 40 417 PO Box 2529, 3800 GB Amersfoort fax: +31 33 45 40 401 The Netherlands
I've downloaded this toolset myself and have got it operational. It is a bit tricky to set up, but it is now invaluable as a source of information and a great help with load balancing and capacity planning.
Raymond Brennan, Systems & QA Engineer, HDL Design Division, Mentor Graphics Ltd., Rivergate, Newbury Business Park, London Road, Newbury, Berkshire RG14 2QB, UK.
Tel : +44-1635-811-411 Fax : +44-1635-810-108
URLs: http://www.mentorg.com http://www.renoir.com
E-Mail : mailto:ray_brennan@mentorg.com
-----Original Message----- From: Andrew Bond [mailto:andrewb@netapp.com] Sent: Wednesday, November 04, 1998 10:46 AM To: Elena Samsonova; toasters@mathworks.com Cc: agy Subject: Re: SMTP MIB (was Re: raid scrubbing and hot spares)
Have you looked at the MRTG tools available on NOW? They can be used to generate graphs of usage (including CPU) over a selectable period.
The tools are at http://now.netapp.com/download/tools/filer-mrtg/
MRTG is at http://ee-staff.ethz.ch/~oetiker/webtools/mrtg/mrtg.html
Andrew
Andrew Bond wrote:
Have you looked at the MRTG tools available on NOW? They can be used to generate graphs of usage (including CPU) over a selectable period.
The tools are at http://now.netapp.com/download/tools/filer-mrtg/
MRTG is at http://ee-staff.ethz.ch/~oetiker/webtools/mrtg/mrtg.html
Yes, I know that, we use those tools and a lot more. However, there's a problem here. For example, if we look at a NetCache Appliance and use this tool to calculate CPU idle%, we do get a curve looking plausible enough, but it is very misleading. Since MRTG only queries the server every 5 minutes, the formula calculates an *average* CPU idle% for the last 5 minutes. It comes to 25%. However, in reality the proxy server has very frequent peaks to 0% idle and sometimes remains there for 5-15 seconds in a row. So looking at the graph I conclude that the proxy server still has a lot of unused CPU capacity, whereas in reality it isn't true since the CPU gets rather long periods of saturation.
As I said before, averages do not really say that much. I would much prefer to have immediate measurements on my MRTG graph.
And by the way, the monitor pages on the NetCache Appliance also display averages since the last boot. I suggest replacing them all with immediate values, or at least with averages over short periods of time (the last 5 minutes) for things like hit rate and such.
Regards Elena
---------------------------------------------------------------- Elena Samsonova e-mail: E.Samsonova@wxs.nl World Access / Planet Internet phone: +31 33 45 40 417 PO Box 2529, 3800 GB Amersfoort fax: +31 33 45 40 401 The Netherlands ----------------------------------------------------------------
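[Editorial sketch] The averaging effect described above is easy to reproduce with numbers. The samples below are invented to match the figures in the message (a 5-minute average of 25% idle, with stretches of 15 seconds at 0% idle); they are not measurements from any real proxy or filer.

```python
def mean(values):
    """Arithmetic mean of a non-empty list of numbers."""
    return sum(values) / len(values)


def longest_run_at(values, target=0):
    """Length of the longest run of consecutive samples equal to target."""
    longest = current = 0
    for value in values:
        if value == target:
            current += 1
            longest = max(longest, current)
        else:
            current = 0
    return longest


# 300 one-second idle% samples (one 5-minute polling interval), made up:
# in every 30 seconds, 15 seconds fully saturated and 15 seconds at 50% idle.
samples = [0 if second % 30 < 15 else 50 for second in range(300)]

if __name__ == "__main__":
    print("5-minute average idle%:", mean(samples))
    print("longest saturated stretch (s):", longest_run_at(samples))
```

A graph built from one point per interval would show only the mean, while the second figure is the one that reveals the saturation.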
On Tue, 27 Oct 1998, Dave Cole wrote:
On a related note, we happen to know that one of our hot spares on one of the f210's is showing a read error. It was discovered during the install. I'm not entirely sure if I have logs of the discovery. But along that line, are there any tests or scrubs done or are they assumed to be healthy?
Anyone have any thoughts on this?