Hello fellow NetApp Admins.
I have a bit of an odd one that I'm trying to troubleshoot - and whilst I'm not sure it's specifically filer related, it's NFS related (and is happening on a filer mount).
What happens is this - there's a process that updates a file, and relies on 'rename()' being atomic- a journal is updated, and then reference pointer (file) is newly created, and renamed over an old one.
The expectation is that this file will always be there - because "rename()" is defined as an atomic operation.
But that's not quite what I'm getting - I have one nfs client doing it's (atomic) rename. And another client (different NFS host) reading it, and - occasionally - reporting 'no such file or directory'.
This is causing an operation to fail, which in turn means that someone has to intervene in the process. This operation (and multiple extremely similar ones) happen at 5m intervals, and every few days (once a week maybe?) it fails for this reason, and our developers think that should be impossible. But as such - it looks like a pretty narrow race condition.
So what I'm trying to figure out is first off:
- Could this be a NetApp bug? We've moved from 7 mode to CDOT, and it didn't happen before. On the flip side though - I have no guarantee that it 'never happened before' because we weren't catching a race condition. (moving to new tin and improving performance does increase race condition likelihood after all)
- Could this be a kernel bug? We're all on kernel 2.6.32-504.12.2.el6.x86_64 - and whilst we're deploying Centos 7, all the hosts involved aren't yet. (But that's potentially also just coincidence, as there's quite a few hosts, and they're all the same kernel versions).
- Is it actually impossible for a file A renamed over file B to generate ENOENT on a different client? Specifically, in RFC3530 We have: " The RENAME operation must be atomic to the client.". So the client doing the rename sees an atomic operation - but the expectation is that a separate client will also perceive an 'atomic' change - once the cache is refreshed, the 'new' directory has the new files, and at no point was there 'no such file or directory' because it was either the old one, or the newly renamed one. Is this actually a valid thing to think?
This is a bit of a complicated one, and has me clutching at straws a bit - I can't reliably reproduce it - a basic fast spinning loop script on multiple client to read-write-rename didn't hit it. I've got pcaps running hoping to catch it 'in flight' - but haven't yet managed to catch it happening.
But any suggestions would be gratefully received.