Hi all.
We are currently experiencing heavy load on a filer that serves as storage for a webmail farm:
FILER> sysstat 1
[...]
 63%  5583  0  0  1047  4996  3524    16  0  0  3
 70%  6002  0  0   999  6005  3836     0  0  0  3
 65%  5738  0  0  1067  5829  2671     0  0  0  3
 68%  5881  0  0   972  6195  3424    16  0  0  3
 83%  7174  0  0  1363  7401  5477     0  0  0  3
 88%  7951  0  0  1609  8026  3984     0  0  0  3
 91%  8041  0  0  1387  8357  7076    16  0  0  3
 87%  7732  0  0  1369  8508  4601     0  0  0  3
 87%  7258  0  0  1196  7554  6006   681  0  0  3
100%  6290  0  0  1039  6406  8108  5108  0  0  3
 95%  6953  0  0  1381  6488  7536  2783  0  0  3
 88%  8205  0  0  1427  8375  5456     0  0  0  3
 73%  6115  0  0   993  6408  5051    16  0  0  3
 79%  7046  0  0  1138  7779  2629     0  0  0  3
 83%  6851  0  0  1181  7212  8240     0  0  0  3
 86%  7888  0  0  1417  8185  5305    16  0  0  3
 79%  7435  0  0  1217  7646  1676     0  0  0  3
 50%  4001  0  0   664  4293  2490     0  0  0  3
 48%  4253  0  0   711  3939  1564    16  0  0  3
 46%  4115  0  0   681  4066  1265     0  0  0  3
[...]
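To make the samples easier to compare, here is a rough Python sketch that summarizes them. The column names are an assumption on my part, since the header row is not included in the excerpt: I am taking the fields to be CPU, NFS ops/s, CIFS, HTTP, net in/out (kB/s), disk read/write (kB/s), tape read/write and cache age. The names `COLUMNS`, `RAW` and `parse` are only illustrative.

```python
# Samples copied from the `sysstat 1` excerpt above.
# ASSUMPTION: the column order below; the header row is elided in the post.
COLUMNS = ["cpu", "nfs", "cifs", "http", "net_in", "net_out",
           "disk_read", "disk_write", "tape_read", "tape_write", "cache_age"]

RAW = """
63% 5583 0 0 1047 4996 3524 16 0 0 3
70% 6002 0 0 999 6005 3836 0 0 0 3
65% 5738 0 0 1067 5829 2671 0 0 0 3
68% 5881 0 0 972 6195 3424 16 0 0 3
83% 7174 0 0 1363 7401 5477 0 0 0 3
88% 7951 0 0 1609 8026 3984 0 0 0 3
91% 8041 0 0 1387 8357 7076 16 0 0 3
87% 7732 0 0 1369 8508 4601 0 0 0 3
87% 7258 0 0 1196 7554 6006 681 0 0 3
100% 6290 0 0 1039 6406 8108 5108 0 0 3
95% 6953 0 0 1381 6488 7536 2783 0 0 3
88% 8205 0 0 1427 8375 5456 0 0 0 3
73% 6115 0 0 993 6408 5051 16 0 0 3
79% 7046 0 0 1138 7779 2629 0 0 0 3
83% 6851 0 0 1181 7212 8240 0 0 0 3
86% 7888 0 0 1417 8185 5305 16 0 0 3
79% 7435 0 0 1217 7646 1676 0 0 0 3
50% 4001 0 0 664 4293 2490 0 0 0 3
48% 4253 0 0 711 3939 1564 16 0 0 3
46% 4115 0 0 681 4066 1265 0 0 0 3
"""


def parse(raw):
    """Turn each sample line into a dict keyed by the assumed column names."""
    rows = []
    for line in raw.strip().splitlines():
        values = [int(field.rstrip("%")) for field in line.split()]
        rows.append(dict(zip(COLUMNS, values)))
    return rows


rows = parse(RAW)
mean_cpu = sum(r["cpu"] for r in rows) / len(rows)
busiest = max(rows, key=lambda r: r["cpu"])          # sample with highest CPU
peak_nfs = max(r["nfs"] for r in rows)               # highest NFS ops/s seen
biggest_write = max(rows, key=lambda r: r["disk_write"])

print(f"samples:            {len(rows)}")
print(f"mean CPU:           {mean_cpu:.2f}%")
print(f"peak NFS ops/s:     {peak_nfs}")
print(f"busiest sample:     {busiest}")
print(f"largest disk write: {biggest_write['disk_write']} kB/s")
```

If the column assumption holds, the one sample at 100% CPU is also the one with the largest disk write burst, while its NFS op rate is below the peak seen in the excerpt.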
The farm consists of 6 frontends, each a dual Pentium III 800 MHz with 1 GB RAM and 100 Mb Fast Ethernet, running Apache and a modified c-client that accesses the maildirs directly through the filesystem (no IMAP). The frontends reach about 250 concurrent sessions. Nearly 1 million (1,000,000) maildirs are stored on the filer.
FILER> version
NetApp Release 6.0.1R2: Fri Feb 9 01:12:44 PST 2001
FILER> ifconfig -a
e0: flags=848043<UP,BROADCAST,RUNNING,MULTICAST> mtu 1500
        inet 192.168.1.103 netmask 0xffffff00 broadcast 192.168.1.255
        partner inet 192.168.1.104 (not in use)
        ether 00:a0:98:00:9f:0a (100tx-fd-up)
e2a: flags=8042<BROADCAST,RUNNING,MULTICAST> mtu 1500
        ether 00:20:fc:1e:63:d4 (auto-unknown-cfg_down)
e2b: flags=8042<BROADCAST,RUNNING,MULTICAST> mtu 1500
        ether 00:20:fc:1e:63:d5 (auto-unknown-cfg_down)
e2c: flags=8042<BROADCAST,RUNNING,MULTICAST> mtu 1500
        ether 00:20:fc:1e:63:d6 (auto-unknown-cfg_down)
e2d: flags=8042<BROADCAST,RUNNING,MULTICAST> mtu 1500
        ether 00:20:fc:1e:63:d7 (auto-unknown-cfg_down)
e7: flags=8042<BROADCAST,RUNNING,MULTICAST> mtu 1500
        ether 00:03:47:22:85:5e (auto-1000sx-fd-down) flowcontrol full
lo: flags=948049<UP,LOOPBACK,RUNNING,MULTICAST,TCPCKSUM> mtu 4056
        inet 127.0.0.1 netmask 0xff000000 broadcast 127.0.0.1
        ether 00:00:00:00:00:00 (Shared memory)
The filer volume webmail:
FILER> df
Filesystem                 kbytes       used      avail  capacity  Mounted on
/vol/webmail/           406736736  383407488   23329248       94%  /vol/webmail/
/vol/webmail/.snapshot  101684180          0  101684180        0%  /vol/webmail/.snapshot
is mounted on each of the 6 frontends.
NFSstat gives me:
FILER> nfsstat
Server rpc:
TCP:
calls            badcalls         nullrecv         badlen           xdrcall
0                0                0                0                0

UDP:
calls            badcalls         nullrecv         badlen           xdrcall
35032599676      0                0                0                0

Server nfs:
calls            badcalls
39327565562      0

Server nfs V2: (25634001711 calls)
null             getattr          setattr          root             lookup           readlink         read
0 0%             2872301897 11%   41557694 0%      0 0%             5182588465 20%   124663 0%        16772257357 65%
wrcache          write            create           remove           rename           link             symlink
0 0%             604329011 2%     7689267 0%       19040366 0%      12918825 0%      11991858 0%      26947 0%
mkdir            rmdir            readdir          statfs
167656 0%        827086 0%        108179901 0%     718 0%

Server nfs V3: (13693563851 calls)
null             getattr          setattr          lookup           access           readlink         read
0 0%             5446218 0%       43552013 0%      2360996181 17%   28976674 0%      97560 0%         10689673307 78%
write            create           mkdir            symlink          mknod            remove           rmdir
410587211 3%     5298393 0%       161919 0%        4 0%             0 0%             15451799 0%      82070 0%
rename           link             readdir          readdir+         fsstat           fsinfo           pathconf
12977697 0%      9969929 0%       110291020 1%     0 0%             928 0%           928 0%           0 0%
commit
0 0%
The getattr count seems far too high (11% of the V2 calls), which may point to poor attribute caching on the frontends. But could this alone drive the CPU to 100% most of the time? Or could this be a WAFL issue related to the low available space on the volume?
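For reference, the combined V2 + V3 operation mix can be tallied from the nfsstat counters above. This is just a rough sketch: the dictionary and function names (`V2`, `V3`, `combined_share`) are made up for illustration, zero-count procedures are left out, and the counters are cumulative totals rather than a picture of the current load.

```python
# Per-procedure call counts copied from the nfsstat output in this post.
# Procedures with a count of zero are omitted.
V2 = {
    "getattr": 2872301897, "setattr": 41557694, "lookup": 5182588465,
    "readlink": 124663, "read": 16772257357, "write": 604329011,
    "create": 7689267, "remove": 19040366, "rename": 12918825,
    "link": 11991858, "symlink": 26947, "mkdir": 167656,
    "rmdir": 827086, "readdir": 108179901, "statfs": 718,
}
V3 = {
    "getattr": 5446218, "setattr": 43552013, "lookup": 2360996181,
    "access": 28976674, "readlink": 97560, "read": 10689673307,
    "write": 410587211, "create": 5298393, "mkdir": 161919,
    "symlink": 4, "remove": 15451799, "rmdir": 82070,
    "rename": 12977697, "link": 9969929, "readdir": 110291020,
    "fsstat": 928, "fsinfo": 928,
}


def combined_share(op):
    """Fraction of all NFS calls (V2 + V3) made up by one procedure."""
    total = sum(V2.values()) + sum(V3.values())
    return (V2.get(op, 0) + V3.get(op, 0)) / total


# Sanity check: the per-procedure counts add up to the totals nfsstat reports.
print("V2 total:", sum(V2.values()))
print("V3 total:", sum(V3.values()))

for op in ("read", "lookup", "getattr", "write"):
    print(f"{op:8s} {combined_share(op):6.1%}")
```

Taken over both protocol versions, read comes to roughly 70% of all calls, lookup to about 19%, and getattr to about 7%.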
I'll first increase the NFS client cache to try to lower the number of getattr calls. But I fear this won't help much; further optimization, from the bottom up, is needed.
Any ideas on how to optimize performance in this scenario are welcome.
If you need any further info, please ask. (I wanted to send filestats output, but it is taking an eternity...)
TIA.