Has anyone else seen the following effect?
I made a casual reference to the problem back in May, but it's just
jumped up and bit us rather hard. [Actually, it was before Christmas,
but I didn't get round to writing it up then!]
NFS server: F740 running ONTAP 5.3.7R1
NFS client: UltraSPARC (E220R) running Solaris 8 (well patched)
[but the problem has been around since 5.3.5 & Solaris 2.6, at least:
probably much longer]
In some circumstances, the Solaris kernel can acquire the notion that
it has pending NFS writes (apparently in the "buffer cache") to be
performed on a file in a snapshot. When such a write is rejected by
the filer with EROFS, Solaris logs the error... and tries it again
after 30 seconds. And again. And again ...
On one recent occurrence of this, as the result of a genuine user "error",
the messages came so thick and fast that we had to reboot the client
machine. The following, by contrast, was a controlled experiment!
Test file:
    $ ls -li test/test
    3715488 -rw-r--r-- 1 cet1 cet1 21167 Dec 23 15:50 test/test
A snapshot was taken at 16:00. Then the command
    echo x >>test/.snapshot/hourly.0/test
which one might reasonably expect to fail, completed quietly. But then
the kernel messages started:
  Dec 23 18:00:43 draco.cus.cam.ac.uk nfs: [ID 808668 kern.notice]
    NFS write error on host puppis-intracus: Read-only file system.
  Dec 23 18:00:43 draco.cus.cam.ac.uk nfs: [ID 702911 kern.notice]
    (file handle: e1b43900 13b1f03 20000b00 38b1a0 a8268f00 fb380000 40000000 c7741500)
  Dec 23 18:00:43 draco.cus.cam.ac.uk nfs: [ID 808668 kern.notice]
    NFS write error on host puppis-intracus: Read-only file system.
  Dec 23 18:00:43 draco.cus.cam.ac.uk nfs: [ID 702911 kern.notice]
    (file handle: e1b43900 13b1f03 20000b00 38b1a0 a8268f00 fb380000 40000000 c7741500)
  Dec 23 18:01:07 draco.cus.cam.ac.uk nfs: [ID 808668 kern.notice]
    NFS write error on host puppis-intracus: Read-only file system.
  Dec 23 18:01:07 draco.cus.cam.ac.uk nfs: [ID 702911 kern.notice]
    (file handle: e1b43900 13b1f03 20000b00 38b1a0 a8268f00 fb380000 40000000 c7741500)
  Dec 23 18:01:37 draco.cus.cam.ac.uk nfs: [ID 808668 kern.notice]
    NFS write error on host puppis-intracus: Read-only file system.
  Dec 23 18:01:37 draco.cus.cam.ac.uk nfs: [ID 702911 kern.notice]
    (file handle: e1b43900 13b1f03 20000b00 38b1a0 a8268f00 fb380000 40000000 c7741500)
  ...
continuing, apparently indefinitely. I was quite expecting to have to reboot
this client machine as well, but coming back after Christmas I find that the
messages eventually stopped:
  ...
  Dec 25 15:59:37 draco.cus.cam.ac.uk nfs: [ID 808668 kern.notice]
    NFS write error on host puppis-intracus: Read-only file system.
  Dec 25 15:59:37 draco.cus.cam.ac.uk nfs: [ID 702911 kern.notice]
    (file handle: e1b43900 13b1f03 20000b00 38b1a0 a8268f00 fb380000 40000000 c7741500)
  Dec 25 16:00:07 draco.cus.cam.ac.uk nfs: [ID 626546 kern.notice]
    NFS write error on host puppis-intracus: Stale NFS file handle.
  Dec 25 16:00:07 draco.cus.cam.ac.uk nfs: [ID 702911 kern.notice]
    (file handle: e1b43900 13b1f03 20000b00 38b1a0 a8268f00 fb380000 40000000 c7741500)
The snapshot had been deleted, and the change from an EROFS to an ESTALE
seems to have finally persuaded the Solaris kernel to stop retrying!
[The file handle contents agree perfectly with the expected value for
the snapshot of test/test, by the way, based on the description that
Guy Harris gave last May.]
I suppose this has to be seen primarily as a Solaris problem, but I
wonder what attributes of the NetApp filer are confusing it.
Q1. Why doesn't the open for writing of "test/.snapshot/hourly.0/test"
fail? Is Solaris being confused by parts of a single NFS filing
system being read-only, but not all of it? Or by the duplicate
inode numbers of "test/test" (which would certainly have been
cached) and "test/.snapshot/hourly.0/test"?
Q2. Why does Solaris go on trying the write for so long? [And why
keep on logging it so persistently? :-( ]
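
One way to narrow Q1 down would be to see which system call, if any,
actually reports the error. The following Python sketch is only an
illustration: `probe_append` is a made-up helper that mirrors the
`echo x >>file` test (minus O_CREAT, so it never creates a file) and
records the errno name for each of open, write, fsync and close. What a
given client returns at each step is exactly the open question, so no
expected result for the snapshot case is shown.

```python
import errno
import os


def probe_append(path, data=b"x\n"):
    """Try open/write/fsync/close on `path` in turn.

    Returns a list of (step, errno name) pairs, with None in place of the
    name for a step that succeeded. Hypothetical helper for illustration.
    """
    def name(exc):
        return errno.errorcode.get(exc.errno, str(exc.errno))

    results = []
    try:
        # Same access mode as the shell's ">>" redirection, without O_CREAT.
        fd = os.open(path, os.O_WRONLY | os.O_APPEND)
        results.append(("open", None))
    except OSError as exc:
        results.append(("open", name(exc)))
        return results

    steps = (
        ("write", lambda: os.write(fd, data)),
        ("fsync", lambda: os.fsync(fd)),   # force the data out to the server
        ("close", lambda: os.close(fd)),   # a deferred write error may show up here
    )
    for step, call in steps:
        try:
            call()
            results.append((step, None))
        except OSError as exc:
            results.append((step, name(exc)))
    return results


if __name__ == "__main__":
    import sys
    for target in sys.argv[1:]:
        print(target, probe_append(target))
```

Running it against both "test/test" and "test/.snapshot/hourly.0/test"
would show whether the client rejects the open itself or defers the
failure to a later step.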
I'll have to open a call with Sun about this in the new year, and would
like to understand the problem better myself by then.
Chris Thompson               University of Cambridge Computing Service,
Email: cet1(a)ucs.cam.ac.uk  New Museums Site, Cambridge CB2 3QG,
Phone: +44 1223 334715       United Kingdom.