Hi, we have over 20 webservers running a web application and serving all the data off a Netapp NFS mount. Netapp CPU normally sits at around 20-40% and NFS response is good.
Today we hit what appears to have been a bug of some sort. All of a sudden, with no apparent increase in client connections, the F760 CPU went to 80-90% and the load on all the webservers rose sharply (from approx 0.5-1.5 up to 15-20). The site response went down the drain (20+ seconds for a page that normally takes under 1 second).
It appeared to be caused by the application doing readdirs (along with other operations: read/write/getattr) on a specific directory, which at that time held about 70,000 files. We fixed the problem by disabling the readdirs within the application and also reducing the number of files in that directory down to about 45,000.
We don't know exactly which fix (stopping readdirs or removing the files) did the trick, but after that Netapp CPU dropped back to normal and the webservers were happy and the site responsive again.
It appeared that the combination of all the operations on the large directory was causing the NFS clients to hang and the Netapp CPU to max out. At the same time, though, other NFS operations were still performing at a reasonable speed (i.e. the whole Netapp was not locked up).
Any ideas on a bug or limitation on either the Linux or the Netapp side with regards to large (70,000+ files) directories?
Info: Netapp F760C ONTAP 6.1.2R3
Linux 2.4.20 (Redhat) Mount options: rw,hard,intr,vers=3,proto=tcp,rsize=32768,wsize=32768
Cheers, Chris
We already ran into a similar problem:
over CIFS, in a large hierarchical subdirectory structure with over 100,000 files in a single directory, with an application running a kind of find over those files (i.e. looking for a specific string inside the files).
The filer started blocking the creation of new files over CIFS; there is a limit on the maximum number of files in a directory (around 130,000). You can adjust this, but the customer also modified their application to create fewer files per directory.
The customer also reported that processing took longer the more files there were in the directory.
I think the biggest impact in your case was the attribute reads. I draw a parallel between your getattr traffic and the find function in the application above: it is CPU-consuming, since half of the work has to be done on the filer side when there are that many files (close to 100,000), which is why you saw the CPU go high and experienced the longer response times.
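For illustration only, here is a minimal Python sketch of that kind of scan (the mount path, function name and search string are made up); the point is that every file costs the filer a lookup and attribute fetch plus the reads of its contents, so the per-file work dominates once the directory is that big:

    import os

    def grep_directory(directory, needle):
        """Return names of files in `directory` whose contents contain `needle`."""
        matches = []
        for name in os.listdir(directory):        # one big directory listing (READDIR traffic)
            path = os.path.join(directory, name)
            if not os.path.isfile(path):           # stat() -> an attribute fetch per entry over the wire
                continue
            with open(path, "rb") as f:            # open -> lookup/access, then READ RPCs for the data
                if needle in f.read():
                    matches.append(name)
        return matches

    # e.g. grep_directory("/mnt/netapp/bigdir", b"some string")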
The F8xx filers are roughly twice as powerful (CPU and memory) as the F760, so upgrading your filer might resolve the problem for good (even though your CPU is not pegged at 100%, only 80-90%). Upgrading ONTAP to a newer release could perhaps help too, but I can't say for sure; 6.1.2R3 is a stable version in my opinion, though perhaps not optimized for this kind of workload.
bye
Chris Miles wrote:
> It appeared to be caused by the application doing readdirs (along with other operations: read/write/getattr) on a specific directory, which at that time held about 70,000 files. We fixed the problem by disabling the readdirs within the application and also reducing the number of files in that directory down to about 45,000.
The problem is likely two-fold -- large (erm, huge) directories are going to be painful for the filer in the first place, since a 'simple' request explodes into such a large answer. Secondly, I don't know if the same holds true for Linux, but Solaris will bypass the DNLC for directories over a given size -- so what's already a nasty request gets amplified because the client isn't caching it.
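A rough way to see both effects (a sketch with a placeholder path, not a proper benchmark) is to time the same listing twice; if the second pass is no faster, the client is going back to the filer for the whole directory every time:

    import os
    import time

    def time_listing(directory):
        start = time.monotonic()
        entries = os.listdir(directory)
        return len(entries), time.monotonic() - start

    path = "/mnt/netapp/bigdir"          # placeholder for the 70,000-file directory
    for attempt in (1, 2):
        count, elapsed = time_listing(path)
        print("pass %d: %d entries in %.2fs" % (attempt, count, elapsed))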
> We don't know exactly which fix (stopping readdirs or removing the files) did the trick, but after that Netapp CPU dropped back to normal and the webservers were happy and the site responsive again.
I believe it's READDIR+ that you disabled (disabling READDIR altogether would be ... interesting). That would definitely help in this situation: READDIR+ effectively does a GETATTR for every file at the same time, so if the attribute data isn't being used, it's that much more work being done for nothing.
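To illustrate from the client side (a sketch only, with a placeholder path, not your application code): asking for names alone is much lighter than touching each entry's attributes, which drags in attribute data for all 70,000 files:

    import os

    directory = "/mnt/netapp/bigdir"     # placeholder path

    # Names only: the client can satisfy this with plain READDIR calls.
    names = os.listdir(directory)

    # Names plus attributes: every entry.stat() means attribute data for every
    # file in the directory (READDIR+ or a GETATTR each), so only do this if
    # the attributes are actually used.
    sizes = {entry.name: entry.stat().st_size for entry in os.scandir(directory)}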
Avoid large directories like this -- they will always be a problem on any platform. Even something as ugly as Apache's mod_rewrite hashing files out across multiple directory levels will boost performance in the end.
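Something along these lines does the hashing at the application level (a sketch only; the hash, the two-level depth and the base path are arbitrary choices), spreading the same files over 256x256 subdirectories so no single directory ever gets huge:

    import hashlib
    import os

    def hashed_path(base, filename):
        """Map a flat filename onto base/xx/yy/filename using its hash."""
        digest = hashlib.md5(filename.encode("utf-8")).hexdigest()
        return os.path.join(base, digest[0:2], digest[2:4], filename)

    def store(base, filename, data):
        path = hashed_path(base, filename)
        os.makedirs(os.path.dirname(path), exist_ok=True)
        with open(path, "wb") as f:
            f.write(data)

    # e.g. store("/mnt/netapp/hashed", "session_12345.dat", b"...")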
You should also take a look at your config to see if you can identify what was constantly doing the lookups (e.g. Apache's mod_speling will do this). Even if it's not destroying things the way it was here, it will impact performance, so you'll probably want to address it anyway.