Hello fellow NetApp Admins.
I have a bit of an odd one that I'm trying to troubleshoot - and whilst I'm
not sure it's specifically filer related, it's NFS related (and is
happening on a filer mount).
What happens is this - there's a process that updates a file and relies on
'rename()' being atomic: a journal is updated, then a reference pointer
(file) is newly created and renamed over the old one.
The expectation is that this file will always be there - because "rename()"
is defined as an atomic operation.
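To make the pattern concrete, here is a minimal Python sketch of the
write-then-rename update described above. It is an illustration only, not
the actual process: the function name publish_pointer, the temp-file naming
and the fsync before the rename are all assumptions.

```python
import os
import tempfile


def publish_pointer(directory, name, payload):
    """Replace directory/name with a new file containing payload.

    Hypothetical helper: the new content is written to a temporary file in
    the same directory (so the rename never crosses filesystems) and then
    renamed over the old pointer file.
    """
    fd, tmp_path = tempfile.mkstemp(dir=directory, prefix="." + name + ".")
    try:
        with os.fdopen(fd, "w") as f:
            f.write(payload)
            f.flush()
            os.fsync(f.fileno())  # assumption: push data out before the rename
        # The step the readers rely on: the old name should always resolve
        # to either the old file or the new one.
        os.rename(tmp_path, os.path.join(directory, name))
    except BaseException:
        # Remove the temp file if anything failed before the rename landed.
        try:
            os.unlink(tmp_path)
        except FileNotFoundError:
            pass
        raise
```

On a local POSIX filesystem a reader opening the pointer file concurrently
should never get ENOENT with this pattern; the question is whether the same
holds for a second NFS client.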
But that's not quite what I'm getting - I have one NFS client doing its
(atomic) rename, and another client (a different NFS host) reading the file
and - occasionally - reporting 'no such file or directory'.
This is causing an operation to fail, which in turn means that someone has
to intervene in the process. This operation (and multiple extremely similar
ones) runs at 5-minute intervals, and every few days (once a week maybe?) it
fails for this reason, which our developers think should be impossible.
Given how rarely it fails, it looks like a pretty narrow race condition.
So what I'm trying to figure out is first off:
- Could this be a NetApp bug? We've moved from 7-Mode to cDOT, and it
didn't happen before. On the flip side though, I have no guarantee that it
'never happened before', because we weren't looking for a race condition.
(Moving to new tin and improving performance does increase the likelihood
of hitting a race condition, after all.)
- Could this be a kernel bug? We're all on kernel
2.6.32-504.12.2.el6.x86_64, and whilst we're deploying CentOS 7, none of
the hosts involved are on it yet. (But that's potentially also just
coincidence, as there are quite a few hosts, and they're all on the same
kernel version.)
- Is it actually impossible for a file A renamed over file B to generate
ENOENT on a different client? Specifically, in RFC 3530 we have: "The
RENAME operation must be atomic to the client." So the client doing the
rename sees an atomic operation - but the expectation is that a separate
client will also perceive an 'atomic' change: once its cache is refreshed,
the 'new' directory has the new file, and at no point does it see 'no such
file or directory', because the name always resolves to either the old file
or the newly renamed one. Is this actually a valid thing to think?
This is a bit of a complicated one, and has me clutching at straws a bit -
I can't reliably reproduce it. A basic fast-spinning loop script on
multiple clients doing read-write-rename didn't hit it. I've got pcaps
running in the hope of catching it 'in flight', but haven't managed to
capture a failure yet. Any suggestions would be gratefully received.
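For anyone who wants to try reproducing it, a spinning test of this kind
can be sketched roughly as below. This is not the exact script I ran; the
function names, the ".new" temp-file suffix and the file contents are
illustrative assumptions. The idea is to run writer_loop on one NFS client
and reader_loop on another, against the same path on the filer mount.

```python
import errno
import os


def writer_loop(path, iterations):
    """Repeatedly create a new file and rename it over path."""
    tmp = path + ".new"  # assumption: temp file sits next to the target
    for i in range(iterations):
        with open(tmp, "w") as f:
            f.write("generation %d\n" % i)
        os.rename(tmp, path)


def reader_loop(path, iterations):
    """Repeatedly open and read path; return how many opens hit ENOENT."""
    misses = 0
    for _ in range(iterations):
        try:
            with open(path) as f:
                f.read()
        except OSError as e:
            # Only 'no such file or directory' is counted; anything else
            # (ESTALE, for example) is re-raised so it is not hidden.
            if e.errno != errno.ENOENT:
                raise
            misses += 1
    return misses
```

If the rename is atomic from the reader's point of view, reader_loop should
return 0 whenever the file existed before the test started.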