Ah, in the meantime I was able to get the output of filestats; it may be of some help...
FILER> filestats volume webmail snapshot japc
VOL=webmail SNAPSHOT=japc
INODES=17653504 COUNTED_INODES=9942433 TOTAL_BYTES=373740074688 TOTAL_KB=383386132

FILE SIZE    CUMULATIVE COUNT    CUMULATIVE TOTAL KB
      1K              1410803                3019516
     10K              7413094               34777324
    100K              9373165               95819528
      1M              9888040              272893780
     10M              9942389              380408808
    100M              9942431              381046720
      1G              9942432              381185320
     MAX              9942433              383386132

AGE(ATIME)   CUMULATIVE COUNT    CUMULATIVE TOTAL KB
       0                    0                      0
     30D              3675160              168977132
     60D              4879663              208385196
     90D              6522408              239232636
    120D              7459500              268557600
     MAX              9942433              383386132

UID          COUNT       TOTAL KB
#64010       9915276     380263412
#0           27157       3122720

GID          COUNT       TOTAL KB
#64010       9891377     380150296
#0           27089       2372920
#1003        67          749800
#65534       23900       113116
Thus spake Jose Celestino, on Mon, Mar 04, 2002 at 05:41:29PM +0000:
Hi all.
We are currently experiencing heavy load on a filer serving as storage for a webmail farm:
FILER> sysstat 1
 CPU     NFS    CIFS    HTTP     Net kB/s    Disk kB/s     Tape kB/s  Cache
                                  in   out   read  write  read write    age
[...]
 63%    5583       0       0    1047  4996   3524     16     0     0      3
 70%    6002       0       0     999  6005   3836      0     0     0      3
 65%    5738       0       0    1067  5829   2671      0     0     0      3
 68%    5881       0       0     972  6195   3424     16     0     0      3
 83%    7174       0       0    1363  7401   5477      0     0     0      3
 88%    7951       0       0    1609  8026   3984      0     0     0      3
 91%    8041       0       0    1387  8357   7076     16     0     0      3
 87%    7732       0       0    1369  8508   4601      0     0     0      3
 87%    7258       0       0    1196  7554   6006    681     0     0      3
100%    6290       0       0    1039  6406   8108   5108     0     0      3
 95%    6953       0       0    1381  6488   7536   2783     0     0      3
 88%    8205       0       0    1427  8375   5456      0     0     0      3
 73%    6115       0       0     993  6408   5051     16     0     0      3
 79%    7046       0       0    1138  7779   2629      0     0     0      3
 83%    6851       0       0    1181  7212   8240      0     0     0      3
 86%    7888       0       0    1417  8185   5305     16     0     0      3
 79%    7435       0       0    1217  7646   1676      0     0     0      3
 50%    4001       0       0     664  4293   2490      0     0     0      3
 48%    4253       0       0     711  3939   1564     16     0     0      3
 46%    4115       0       0     681  4066   1265      0     0     0      3
[...]
The farm consists of 6 frontends (2 x Pentium III 800 MHz, 1 GB RAM, 100 Mbit Fast Ethernet each), running Apache and an altered c-client that does direct maildir access (no IMAP, direct filesystem access). The frontends reach about 250 concurrent sessions. There are nearly 1 million (1,000,000) maildirs stored here.
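For context, a minimal sketch of what such a maildir scan looks like (this is not the actual c-client code, just my assumption about the access pattern; over NFS, the opendir/readdir becomes READDIR RPCs and every stat() that misses the client's attribute cache becomes a GETATTR):

#include <stdio.h>
#include <dirent.h>
#include <sys/stat.h>

/* Hypothetical maildir scan, roughly one per session;
 * not the real c-client code. */
long scan_maildir(const char *dir)
{
    char path[1024];
    struct dirent *de;
    struct stat st;
    long bytes = 0;
    DIR *d = opendir(dir);            /* READDIR traffic */

    if (d == NULL)
        return -1;
    while ((de = readdir(d)) != NULL) {
        if (de->d_name[0] == '.')     /* skip ".", ".." and dotfiles */
            continue;
        snprintf(path, sizeof(path), "%s/%s", dir, de->d_name);
        if (stat(path, &st) == 0)     /* one GETATTR per message file */
            bytes += st.st_size;
    }
    closedir(d);
    return bytes;
}

With ~250 sessions per frontend rescanning maildirs like this, I suspect this pattern accounts for much of the getattr/lookup traffic shown below.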
FILER> version
NetApp Release 6.0.1R2: Fri Feb 9 01:12:44 PST 2001
FILER> ifconfig -a
e0: flags=848043<UP,BROADCAST,RUNNING,MULTICAST> mtu 1500
        inet 192.168.1.103 netmask 0xffffff00 broadcast 192.168.1.255
        partner inet 192.168.1.104 (not in use)
        ether 00:a0:98:00:9f:0a (100tx-fd-up)
e2a: flags=8042<BROADCAST,RUNNING,MULTICAST> mtu 1500
        ether 00:20:fc:1e:63:d4 (auto-unknown-cfg_down)
e2b: flags=8042<BROADCAST,RUNNING,MULTICAST> mtu 1500
        ether 00:20:fc:1e:63:d5 (auto-unknown-cfg_down)
e2c: flags=8042<BROADCAST,RUNNING,MULTICAST> mtu 1500
        ether 00:20:fc:1e:63:d6 (auto-unknown-cfg_down)
e2d: flags=8042<BROADCAST,RUNNING,MULTICAST> mtu 1500
        ether 00:20:fc:1e:63:d7 (auto-unknown-cfg_down)
e7: flags=8042<BROADCAST,RUNNING,MULTICAST> mtu 1500
        ether 00:03:47:22:85:5e (auto-1000sx-fd-down) flowcontrol full
lo: flags=948049<UP,LOOPBACK,RUNNING,MULTICAST,TCPCKSUM> mtu 4056
        inet 127.0.0.1 netmask 0xff000000 broadcast 127.0.0.1
        ether 00:00:00:00:00:00 (Shared memory)
The filer volume webmail:
FILER> df
Filesystem                 kbytes       used      avail  capacity  Mounted on
/vol/webmail/           406736736  383407488   23329248       94%  /vol/webmail/
/vol/webmail/.snapshot  101684180          0  101684180        0%  /vol/webmail/.snapshot
is mounted on each of the 6 frontends.
nfsstat gives me:
FILER> nfsstat
Server rpc:
TCP:
calls        badcalls     nullrecv     badlen       xdrcall
0            0            0            0            0
UDP:
calls        badcalls     nullrecv     badlen       xdrcall
35032599676  0            0            0            0

Server nfs:
calls        badcalls
39327565562  0
Server nfs V2: (25634001711 calls)
null          getattr            setattr          root           lookup             readlink       read
0        0%   2872301897   11%   41557694    0%   0         0%   5182588465   20%   124663    0%   16772257357  65%
wrcache       write              create           remove         rename             link           symlink
0        0%   604329011     2%   7689267     0%   19040366  0%   12918825      0%   11991858  0%   26947         0%
mkdir         rmdir              readdir          statfs
167656   0%   827086        0%   108179901   0%   718       0%

Server nfs V3: (13693563851 calls)
null          getattr            setattr          lookup             access         readlink       read
0        0%   5446218       0%   43552013    0%   2360996181   17%   28976674  0%   97560     0%   10689673307  78%
write         create             mkdir            symlink            mknod          remove         rmdir
410587211 3%  5298393       0%   161919      0%   4             0%   0         0%   15451799  0%   82070         0%
rename        link               readdir          readdir+           fsstat         fsinfo         pathconf
12977697 0%   9969929       0%   110291020   1%   0             0%   928       0%   928       0%   0             0%
commit
0        0%
The getattr count seems way too high and may point to poor attribute caching on the frontends (getattrs plus lookups add up to roughly a quarter of the 39 billion calls above). But could this alone drive the CPU to 100% most of the time? Could this be a WAFL issue related to the low free space on the volume?
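One knob I may also look at: the df above shows the snapshot reserve (about 100 GB) completely unused, so part of it could be returned to the active file system to ease the free-space pressure. Something like this, where the percentage is just illustrative:

FILER> snap reserve webmail 10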
I'll first increase the NFS attribute cache timeouts on the clients to try to lower the getattrs. But I fear this won't help much; further optimization, from the bottom up, is needed.
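On the frontends that would be something along these lines (Linux-style mount options shown only as an illustration, the exact syntax depends on the client OS, and the mountpoint is made up; actimeo=60 holds cached attributes for 60 seconds instead of the default few seconds):

mount -o vers=3,udp,actimeo=60 filer:/vol/webmail /webmail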
Any ideas to help optimize performance in this scenario are welcome.
If you need any further info (I wanted to send filestats output, but it is taking an eternity...), please ask.
TIA.
-- Jose Celestino japc@co.sapo.pt SysAdmin::SAPO.pt http://www.sapo.pt
main(){printf("%xu%xk%x!\n",15,12,237);}