One of my inn 1.4unoff4 news reader servers started throttling itself just today with "Interrupted system call writing article file" (happened twice in the past 24 hours). The spool is on an F230 running 4.0.3, 256MB of read cache and 4MB of write cache. The news server is an Ultra 170, 512MB of RAM, ~250 to 300 readers around peak times. The two are on an FDDI ring.
The F230 hovers around 65% CPU usage, so I don't think that's the problem, but the Ultra is reporting 900 to 1200 packets per second both in and out of its FDDI interface. Half of its time is spent in the kernel, according to top(1). The mounts are NFSv3 over UDP. Would dropping back down to NFSv2 help any? I'm trying to determine if this is a network congestion problem, or an OS limitation (on either the Netapp or the Sun).
One of my inn 1.4unoff4 news reader servers started throttling
itself just today with "Interrupted system call writing article file" (happened twice in the past 24 hours).
That's strange. I have no idea what would be causing that. That's usually the result of a signal coming in (e.g., SIGINT generated by hitting ^C) while waiting for the write to complete, but I don't know why you'd be getting such a signal.
BTW, you say one of your *reader* servers is having this problem while *writing*. You do only have one machine doing the writing, don't you? INN has no mechanisms to synchronize multiple machines writing to the news database.
The spool is on an F230 running 4.0.3, 256MB of read cache and 4MB of write cache. The news server is an Ultra 170, 512MB of RAM, ~250 to 300 readers around peak times. The two are on an FDDI ring.
4MB of NVRAM might not be enough with a machine as fast as an Ultra driving it. Give Tech Support a call and they can work with you to determine if you're exhausting this resource.
Do you know what article is being written, and if so, is it a large one (perhaps to one of the alt.binaries groups)? That would stress NVRAM a bit harder, though at worst that should only lead to a slow response by the filer to the write request.
... but the Ultra is reporting 900 to 1200 packets per second both in and out of its FDDI interface. Half of its time is spent in the kernel, according to top(1).
Lots of kernel time isn't surprising for a netnews server.
The mounts are NFSv3 over UDP. Would dropping back down to NFSv2 help any?
Definitely. One of our customers saw active file renumbering drop from 12-14 hours to under 30 minutes just by switching from v3 to v2. This is because NFSv3 clients seem to use READDIRPLUS whenever they can, instead of READDIR followed by a GETATTR on each file returned by the READDIR. That's good for 'ls -l' but awful for netnews, since it doesn't need the info GETATTR would return and it's expensive to get that stuff.
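If you want to confirm that's what's happening on your Ultra, the stock Solaris client counters will show it (the exact output layout varies a bit by release, but roughly):

    % nfsstat -c
      (under the "Version 3:" section, compare the readdirplus count
       against readdir -- a readdirplus count that dominates is the
       signature, and it should drop to essentially zero once the
       mounts are v2)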
I'm trying to determine if this is a network congestion problem, or an OS limitation (on either the Netapp or the Sun).
It sounds like something weird happening on the Sun, possibly exacerbated by slow filer responses due to NVRAM starvation. At the loads you're talking about, netnews doesn't stress a network all that much, so unless there's a *lot* of other stuff happening on your net, I wouldn't be inclined to suspect network congestion unless all other plausible avenues had been explored.
--
Karl Swartz - Technical Marketing Engineer, Network Appliance
kls@netapp.com (W)   kls@chicago.com (H)
On Wed, 30 Jul 1997, Karl Swartz wrote:
That's strange. I have no idea what would be causing that. That's usually the result of a signal coming in (e.g., SIGINT generated by hitting ^C) while waiting for the write to complete, but I don't know why you'd be getting such a signal.
These are the actual syslog messages recorded by Solaris. It looks like a problem with network congestion on the Ultra's interface, and possible overruns on the interface or OS network buffers. This is on FDDI though, and the filer's sysstat only reports about 30Mbps outgoing at peak times.
Jul 31 07:20:35 unix: NFS lookup failed for server netapp-1: error 5 (RPC: Timed out)
Jul 31 07:20:35 unix: NFS read failed for server netapp-1: error 5 (RPC: Timed out)
Jul 31 07:20:35 unix: NFS read failed for server netapp-1: error 5 (RPC: Timed out)
Jul 31 07:20:35 unix: NFS access failed for server netapp-1: error 5 (RPC: Timed out)
Jul 31 07:20:39 unix: NFS write failed for server netapp-1: error 27 (RPC: Received disconnect from remote)
Jul 31 07:20:39 unix: NFS write error on host netapp-1: I/O error.
Jul 31 07:20:39 unix: NFS write failed for server netapp-1: error 27 (RPC: Received disconnect from remote)
Jul 31 07:20:39 unix: NFS write error on host netapp-1: I/O error.
Jul 31 07:20:39 unix: NFS write failed for server netapp-1: error 27 (RPC: Received disconnect from remote)
Jul 31 07:20:39 unix: NFS write error on host netapp-1: I/O error.
Jul 31 07:20:45 unix: NFS read failed for server netapp-1: error 27 (RPC: Received disconnect from remote)
Jul 31 07:20:45 unix: NFS access failed for server netapp-1: error 27 (RPC: Received disconnect from remote)
Jul 31 07:20:45 unix: NFS lookup failed for server netapp-1: error 27 (RPC: Received disconnect from remote)
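For what it's worth, the standard way to tell dropped packets from a slow server (nothing Netapp-specific, just the stock Solaris tool) is the client RPC counters:

    % nfsstat -rc      # client RPC layer: calls, badcalls, retrans, badxid, timeout

If timeout climbs while badxid stays near zero, requests or replies are getting dropped somewhere along the way; if badxid climbs along with timeout, the filer is receiving the requests but answering slowly enough that the client retransmits anyway.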
BTW, you say one of your *reader* servers is having this problem while *writing*. You do only have one machine doing the writing, don't you? INN has no mechanisms to synchronize multiple machines writing to the news database.
The reader server receives a feed from the feeder machine (which has its own spool on a different filer). The incoming feed is just a trickle compared to the requests made by news readers.
4MB of NVRAM might not be enough with a machine as fast as an Ultra driving it. Give Tech Support a call and they can work with you to determine if you're exhausting this resource.
No, I don't think it would be NVRAM. There isn't much writing going on at all. This is during off-peak (157 readers on right now); peak is about double this:
 CPU   NFS  CIFS  HTTP   Net kb/s    Disk kb/s   Tape kb/s  Cache
                          in   out   read write  read write   age
 28%   422     0     0    80  1307   1168     0     0     0     2
 21%   346     0     0    72  1551   1588     0     0     0     2
 34%   451     0     0    96  1699   1812     0     0     0     2
 36%   347     0     0    72  1150   1276   224     0     0     2
 39%   368     0     0    79  1443   1640  2376     0     0     2
 14%   277     0     0    55  1183   1232     0     0     0     2
 35%   424     0     0    93  1756   1796     0     0     0     2
 25%   327     0     0    58  1414   1660     0     0     0     2
Do you know what article is being written, and if so, is it a large one (perhaps to one of the alt.binaries groups)? That would stress NVRAM a bit harder, though at worst that should only lead to a slow response by the filer to the write request.
Individual articles larger than 512K are dropped at the feeder. I suppose one of our readers could attempt to post large messages, but they are all coming in over analog and ISDN, so their bitrate is negligible. But yeah, I don't think that should cause NFS timeouts.
The mounts are NFSv3 over UDP. Would dropping back down to NFSv2 help any?
Definitely. One of our customers saw active file renumbering drop from 12-14 hours to under 30 minutes just by switching from v3 to v2.
I'll give that a try then (and maybe just turn off nfsv3 on the Netapp entirely). minra and no_atime_update are enabled already.
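Forcing v2 on the Solaris side should just be a matter of vers=2 in the mount options; something like this /etc/vfstab line (export path made up for illustration) plus a remount:

    netapp-1:/news  -  /news  nfs  -  yes  hard,intr,vers=2,proto=udp,rsize=8192,wsize=8192

And if I do disable v3 on the filer entirely, I believe it's just an nfs option on the console, assuming 4.0.3 supports it:

    netapp-1> options nfs.v3.enable off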
It sounds like something weird happening on the Sun, possibly exacerbated by slow filer responses due to NVRAM starvation. At the loads you're talking about, netnews doesn't stress a network all that much, so unless there's a *lot* of other stuff happening on your net, I wouldn't be inclined to suspect network congestion unless all other plausible avenues had been explored.
The FDDI hub is peaking around 75% "load" (I'll have to check the docs on the Cisco 1400 to see what exactly that means). There was the obvious (and expected) increase when we moved the spools off fiber-channel Sparc Storage Arrays to Netapps. I think the congestion might be at the host/interface itself: Solaris simply can't keep up with 50 or 60Mbps aggregate bandwidth out of its FDDI interface. I've had instances where an Ultra on that FDDI ring will just disappear off the network for a couple of minutes, and then magically reappear.
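One thing worth checking is whether the FDDI interface itself is reporting drops; the standard counters should be enough to tell:

    % netstat -i
    Name  Mtu   Net/Dest  Address  Ipkts  Ierrs  Opkts  Oerrs  Collis  Queue

Non-zero Ierrs/Oerrs (or a Queue that keeps growing) on the FDDI interface would point at the Ultra itself rather than the ring.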
I'm hoping that most of these problems will go away with a private NFS backbone (a good idea in any case). Right now, a reader request for a news article will generate traffic equal to twice the article size. The reader spool filer is reporting 30Mbps during peak, and the feeder spool filer (which also provides Web services) pumps out 10Mbps. I don't know how FDDI's performance degrades as it nears capacity, but 80 Mbps is probably straining things.
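Back-of-the-envelope, assuming both copies of each transfer cross the same ring, that's roughly where the 80Mbps comes from:

    2 x 30 Mbps   reader spool filer -> Ultra, then Ultra -> readers
    2 x 10 Mbps   feeder spool filer / Web traffic, same doubling
    -----------
     ~80 Mbps     on a nominally 100Mbps shared FDDI ring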
I think this has come up before, but I don't recall the status... anyway, we have a filer running 4.1c with three disk shelves full of 4GB disks. Is it possible to add a shelf of 9GB disks to the current 4GB-disk shelves? (Separate shelves, of course -- I'm not trying to mix 4GB and 9GB disks in the same shelf.)
If not currently possible, will it be possible, and when?