On Wed, 30 Jul 1997, Karl Swartz wrote:
That's strange. I have no idea what would be causing it. That error is usually the result of a signal coming in (e.g., a SIGINT generated by hitting ^C) while waiting for the write to complete, but I don't know why you'd be getting such a signal.
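If you want to confirm that a signal really is arriving, attaching truss to the process that logged the error (innd, or an nnrpd if it was a post) should show it; Solaris truss reports each signal it delivers and any write() that comes back with Err#4 (EINTR). The pid file path below is just a typical INN layout, so adjust it for yours:

    truss -p `cat /var/news/run/innd.pid`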
These are the actual syslog messages recorded by Solaris. It looks like network congestion at the Ultra's interface, possibly with overruns in the interface or the OS network buffers. This is on FDDI, though, and the filer's sysstat only reports about 30 Mbps outgoing at peak times.
Jul 31 07:20:35 unix: NFS lookup failed for server netapp-1: error 5 (RPC: Timed out)
Jul 31 07:20:35 unix: NFS read failed for server netapp-1: error 5 (RPC: Timed out)
Jul 31 07:20:35 unix: NFS read failed for server netapp-1: error 5 (RPC: Timed out)
Jul 31 07:20:35 unix: NFS access failed for server netapp-1: error 5 (RPC: Timed out)
Jul 31 07:20:39 unix: NFS write failed for server netapp-1: error 27 (RPC: Received disconnect from remote)
Jul 31 07:20:39 unix: NFS write error on host netapp-1: I/O error.
Jul 31 07:20:39 unix: NFS write failed for server netapp-1: error 27 (RPC: Received disconnect from remote)
Jul 31 07:20:39 unix: NFS write error on host netapp-1: I/O error.
Jul 31 07:20:39 unix: NFS write failed for server netapp-1: error 27 (RPC: Received disconnect from remote)
Jul 31 07:20:39 unix: NFS write error on host netapp-1: I/O error.
Jul 31 07:20:45 unix: NFS read failed for server netapp-1: error 27 (RPC: Received disconnect from remote)
Jul 31 07:20:45 unix: NFS access failed for server netapp-1: error 27 (RPC: Received disconnect from remote)
Jul 31 07:20:45 unix: NFS lookup failed for server netapp-1: error 27 (RPC: Received disconnect from remote)
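I haven't yet looked at the interface counters on the Ultra while this is happening; netstat -i should make it obvious whether the host itself is dropping packets (steadily growing Ierrs/Oerrs, or a persistent output Queue) rather than the filer falling behind. The FDDI interface will show up under whatever name its driver uses on this box:

    netstat -i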
BTW, you say one of your *reader* servers is having this problem while *writing*. You do only have one machine doing the writing, don't you? INN has no mechanism to synchronize multiple machines writing to the news database.
The reader server receives a feed from the feeder machine (which has its own spool on a different filer). The incoming feed is just a trickle compared to the requests made by news readers.
4MB of NVRAM might not be enough with a machine as fast as an Ultra driving it. Give Tech Support a call and they can work with you to determine if you're exhausting this resource.
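One rough check you can make yourself before calling: run sysstat with a short interval on the filer console while the timeouts are happening. If NVRAM is the bottleneck, you'd expect to see disk writes pegged in long back-to-back bursts rather than the occasional spike. I'm going from memory on the output details, so treat this as a hint rather than a diagnosis:

    netapp-1> sysstat 1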
No, I don't think it would be NVRAM. There isn't much writing going on at all. This is during off-peak (157 readers on right now); peak is about double this:
 CPU   NFS  CIFS  HTTP   Net kb/s   Disk kb/s   Tape kb/s  Cache
                         in    out  read  write read write   age
 28%   422     0     0   80   1307  1168      0    0     0     2
 21%   346     0     0   72   1551  1588      0    0     0     2
 34%   451     0     0   96   1699  1812      0    0     0     2
 36%   347     0     0   72   1150  1276    224    0     0     2
 39%   368     0     0   79   1443  1640   2376    0     0     2
 14%   277     0     0   55   1183  1232      0    0     0     2
 35%   424     0     0   93   1756  1796      0    0     0     2
 25%   327     0     0   58   1414  1660      0    0     0     2
Do you know what article is being written, and if so, is it a large one (perhaps to one of the alt.binaries groups)? That would stress NVRAM a bit harder, though at worst that should only lead to a slow response by the filer to the write request.
Individual articles larger than 512K are dropped at the feeder. I suppose one of our readers could attempt to post large messages, but they are all coming in over analog and ISDN, so their bitrate is negligible. But yeah, I don't think that should cause NFS timeouts.
The mounts are NFSv3 over UDP. Would dropping back down to NFSv2 help any?
Definitely. One of our customers saw active file renumbering drop from 12-14 hours to under 30 minutes just by switching from v3 to v2.
I'll give that a try, then (and maybe just turn off NFSv3 on the Netapp entirely). The minra and no_atime_update options are already enabled on the filer.
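For the record, this is roughly what I plan to try on the Solaris side; vers= and proto= are standard mount_nfs options, and 8K is the NFSv2 transfer-size ceiling anyway, but the path is just an example, not our real export:

    mount -o vers=2,proto=udp,rsize=8192,wsize=8192 netapp-1:/vol/vol0/news /news

And if I do disable v3 filer-wide, something like this on the console (I'll double-check the exact option name against our ONTAP release first):

    netapp-1> options nfs.v3.enable off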
It sounds like something weird happening on the Sun, possibly exacerbated by slow filer responses due to NVRAM starvation. At the loads you're talking about, netnews doesn't stress a network all that much, so unless there's a *lot* of other stuff happening on your net, I wouldn't be inclined to suspect network congestion unless all other plausible avenues had been explored.
The FDDI hub is peaking around 75% "load" (I'll have to check the docs on the Cisco 1400 to see what exactly that means). There was the obvious (and expected) increase when we moved the spools off fibre-channel Sparc Storage Arrays to Netapps. I think the congestion might be at the host/interface itself: Solaris simply can't keep up with 50 or 60 Mbps of aggregate bandwidth out of its FDDI interface. I've had instances where an Ultra on that FDDI ring will just disappear off the network for a couple of minutes and then magically reappear.
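One thing I still need to do is watch the kernel network statistics while the box is wedged; on this Solaris release the undocumented netstat -k dumps the per-driver kstats, and a steadily climbing nocanput or norcvbuf count would confirm that the host is dropping packets at the interface or in the streams buffers. I don't know offhand which counters the FDDI driver exposes, so the names may differ:

    netstat -k | more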
I'm hoping that most of these problems will go away with a private NFS backbone (a good idea in any case). Right now, a reader request for a news article will generate traffic equal to twice the article size. The reader spool filer is reporting 30Mbps during peak, and the feeder spool filer (which also provides Web services) pumps out 10Mbps. I don't know how FDDI's performance degrades as it nears capacity, but 80 Mbps is probably straining things.
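Back of the envelope, assuming essentially all of the filer traffic gets re-sent to readers over the same ring:

    filers -> news hosts (NFS):     30 + 10 = ~40 Mbps
    news hosts -> readers (NNTP):             ~40 Mbps
    total on the shared ring:                 ~80 Mbps of a nominal 100

so a private NFS backbone should roughly halve what the public ring has to carry.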