Webmail farm - toasters

4 Mar 2002


      Hi all.
We are currently experiencing some heavy load on a filer serving as storage to
a webmail farm:
FILER> sysstat 1
 [...]
 63%   5583      0      0    1047  4996    3524    16       0     0       3
 70%   6002      0      0     999  6005    3836     0       0     0       3
 65%   5738      0      0    1067  5829    2671     0       0     0       3
 68%   5881      0      0     972  6195    3424    16       0     0       3
 83%   7174      0      0    1363  7401    5477     0       0     0       3
 88%   7951      0      0    1609  8026    3984     0       0     0       3
 91%   8041      0      0    1387  8357    7076    16       0     0       3
 87%   7732      0      0    1369  8508    4601     0       0     0       3
 87%   7258      0      0    1196  7554    6006   681       0     0       3
100%   6290      0      0    1039  6406    8108  5108       0     0       3
 95%   6953      0      0    1381  6488    7536  2783       0     0       3
 88%   8205      0      0    1427  8375    5456     0       0     0       3
 73%   6115      0      0     993  6408    5051    16       0     0       3
 79%   7046      0      0    1138  7779    2629     0       0     0       3
 83%   6851      0      0    1181  7212    8240     0       0     0       3
 86%   7888      0      0    1417  8185    5305    16       0     0       3
 79%   7435      0      0    1217  7646    1676     0       0     0       3
 50%   4001      0      0     664  4293    2490     0       0     0       3
 48%   4253      0      0     711  3939    1564    16       0     0       3
 46%   4115      0      0     681  4066    1265     0       0     0       3
 [...]
The farm consists of 6 frontends, 2xPentiumIII - 800Mhz, 1Gb ram, 100Mb Fast Ethernet,
running apache and an altered c-client, doing direct maildir access (no imap, direct
filesystem access). The frontends go to about 250 concurrent sessions. There are nearly
1 million (1000000) maildir stored in here.
FILER> version   
NetApp Release 6.0.1R2: Fri Feb 9 01:12:44 PST 2001
FILER> ifconfig -a
e0: flags=848043<UP,BROADCAST,RUNNING,MULTICAST> mtu 1500
        inet 192.168.1.103 netmask 0xffffff00 broadcast 192.168.1.255
        partner inet 192.168.1.104 (not in use)
        ether 00:a0:98:00:9f:0a (100tx-fd-up)
e2a: flags=8042<BROADCAST,RUNNING,MULTICAST> mtu 1500
        ether 00:20:fc:1e:63:d4 (auto-unknown-cfg_down)
e2b: flags=8042<BROADCAST,RUNNING,MULTICAST> mtu 1500
        ether 00:20:fc:1e:63:d5 (auto-unknown-cfg_down)
e2c: flags=8042<BROADCAST,RUNNING,MULTICAST> mtu 1500
        ether 00:20:fc:1e:63:d6 (auto-unknown-cfg_down)
e2d: flags=8042<BROADCAST,RUNNING,MULTICAST> mtu 1500
        ether 00:20:fc:1e:63:d7 (auto-unknown-cfg_down)
e7: flags=8042<BROADCAST,RUNNING,MULTICAST> mtu 1500
        ether 00:03:47:22:85:5e (auto-1000sx-fd-down) flowcontrol full
lo: flags=948049<UP,LOOPBACK,RUNNING,MULTICAST,TCPCKSUM> mtu 4056
        inet 127.0.0.1 netmask 0xff000000 broadcast 127.0.0.1
        ether 00:00:00:00:00:00 (Shared memory)
The filer volume webmail:
FILER> df
Filesystem              kbytes       used      avail capacity  Mounted on
/vol/webmail/        406736736  383407488   23329248    94%    /vol/webmail/
/vol/webmail/.snapshot  101684180          0  101684180     0%    /vol/webmail/.snapshot
is mounted in each of the 6 frontends.
NFSstat gives me:
FILER> nfsstat
Server rpc:
TCP:
calls      badcalls   nullrecv   badlen     xdrcall    
0          0          0          0          0
UDP:
calls      badcalls   nullrecv   badlen     xdrcall    
350325996760          0          0          0
Server nfs:
calls      badcalls
393275655620
Server nfs V2: (25634001711 calls)
null       getattr    setattr    root       lookup     readlink   read       
0 0%       2872301897 11%41557694 0%0 0%       5182588465 20%124663 0%  16772257357 65%
wrcache    write      create     remove     rename     link       symlink    
0 0%       604329011 2%7689267 0% 19040366 0%12918825 0%11991858 0%26947 0%   
mkdir      rmdir      readdir    statfs     
167656 0%  827086 0%  108179901 0%718 0%
Server nfs V3: (13693563851 calls)
null       getattr    setattr    lookup     access     readlink   read       
0 0%       5446218 0% 43552013 0%2360996181 17%28976674 0%97560 0%   10689673307 78%
write      create     mkdir      symlink    mknod      remove     rmdir      
410587211 3%5298393 0% 161919 0%  4 0%       0 0%       15451799 0%82070 0%   
rename     link       readdir    readdir+   fsstat     fsinfo     pathconf   
12977697 0%9969929 0% 110291020 1%0 0%       928 0%     928 0%     0 0%       
commit     
0 0%
The getattr seems way too big and this may point to a bad caching on the frontends. But
could this bring the CPU to 100% most of the time? Could this be a wafl issue related
with the low available space on the volume?
I'll first increase the nfs client cache to try to lower the getattr's. But I fear this
won't help much, further optimization, from the bottom up, is needed.
Any ideas to help optimize the performance in this scenario? Any ideas are welcome.
If you need any further info (I wanted to send a filestats but is taking an eternity...)
please ask.
TIA.
-- 
Jose Celestino japc@co.sapo.pt SysAdmin::SAPO.pt http://www.sapo.pt
---------------------------------------------------------------------
main(){printf("%xu%xk%x!\n",15,12,237);}