Enable the option (temporarily) 'cifs.trace_dc_connection'. The output (via screen\messages file) will help.
It may not be an issue with complete connectivity drop, but the DC is definitely rejecting the RPC request to look up group membership (SamrGetAliasMembership).
Glenn
-----Original Message-----
From: owner-toasters@mathworks.com [mailto:owner-toasters@mathworks.com] On Behalf Of Borzenkov, Andrey
Sent: Tuesday, November 28, 2006 4:40 AM
To: Simon Vallet; toasters@mathworks.com
Subject: RE: Intermittent "Permission denied" on NTFS qtree
Hi,
You apparently have some intermittent connectivity problems with the DC. As for the error messages, they indicate that the domain controller sometimes does not respond, or that connectivity with it was briefly interrupted, which supports the connectivity theory.
Regards
Andrey Borzenkov Senior system engineer IT Product Services Fujitsu Siemens Computers Russian Federation
Telephone: +7(495)737-2723 Email: mailto:Andrey.Borzenkov@fujitsu-siemens.com Internet: http://www.fujitsu-siemens.com
-----Original Message-----
From: owner-toasters@mathworks.com [mailto:owner-toasters@mathworks.com] On Behalf Of Simon Vallet
Sent: Monday, November 27, 2006 6:09 PM
To: toasters@mathworks.com
Subject: Intermittent "Permission denied" on NTFS qtree
Hi,
we've been trying to solve an intermittent "Permission denied" access problem on an NTFS qtree for some time now, to no avail, so maybe some of you will have some advice.
We have a CIFS- and NFS-exported NTFS qtree on a FAS3020 cluster running Data ONTAP 7.1.1. The PDC is a win2k3 server, and Unix users are managed via NIS.
From time to time, legitimate users are refused access to certain files with a "Permission denied" message, especially when trying to access the qtree via NFS. These users *do* have permission to access the files in question as per NTFS ACLs -- in fact, when they retry a few minutes later, they are granted access.
We've been seeing this problem for a bit more than a week now, and still have no clue what the cause could be... the SID cache seems to be a good suspect, but it's hard to tell; we've also thought of buggy win2k3 patches...
In parallel, we're seeing the following message in the logs:
CIFSRPC SamrGetAliasMembership: Exception rpc_s_unknown_reject caught.
Has anybody experienced a similar problem ?
Simon
Hi,
On Tue, 28 Nov 2006 07:41:58 -0500 "Glenn Walker" ggwalker@mindspring.com wrote:
Enable the option (temporarily) 'cifs.trace_dc_connection'. The output (via screen\messages file) will help.
It may not be an issue with complete connectivity drop, but the DC is definitely rejecting the RPC request to look up group membership (SamrGetAliasMembership).
Apparently, there are some connectivity problems, but they seem quite random -- a trace of network traffic between the filer and the PDC reveals some unexpected TCP resets issued by the DC:
[...]
filer -> DC [FIN,ACK]
DC -> filer [ACK]
DC -> filer [RST,ACK]
[...]
This shouldn't be a problem, since the filer requested a FIN anyway, but the time coincidence is troubling...
Enabling cifs.trace_dc_connection and cifs.trace_login yields some more information:
AUTH: notice- The context has expired.
AUTH: notice- No error.
AUTH: notice- Unexpected GSSAPI security context error.
AUTH: notice- The context has expired.
AUTH: notice- No error.
CIFSRPC SamrGetAliasMembership: Exception rpc_s_unknown_reject caught.
AUTH: Error looking up domain groups during login from 192.168.x.x:RPC_NT_CALL_FAILED (0xc002001b).
and ten seconds later:
AUTH: TraceLDAPServer- Starting AD LDAP server address discovery for domain.tld
AUTH: TraceLDAPServer- Found 2 AD LDAP server addresses using generic DNS query.
AUTH: TraceLDAPServer- AD LDAP server address discovery for domain.tld complete. 2 unique addresses found.
AUTH: notice- Unexpected GSSAPI security context error.
[...]
This goes on for ten minutes, then the filer tries to locate a DC again, and then everything works fine again:
AUTH: TraceDC- Starting DC address discovery for domain.
AUTH: TraceDC- Filer is not a member of a site.
AUTH: TraceDC- Found 2 addresses using generic DNS query.
AUTH: TraceDC- Starting WINS queries.
AUTH: TraceDC- Found 2 BDC addresses through WINS.
AUTH: TraceDC- Found 1 PDC addresses through WINS.
AUTH: TraceDC- DC address discovery for PC complete. 2 unique addresses found.
I'm not really sure what *should* happen, but this definitely does *not* look good... I understand that a security context expires sometimes, but I wonder why it takes so long to renegotiate.
Simon
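When a messages file is pasted as one long run, the trace output above is easier to read once it is split back into individual messages. A minimal Python sketch, assuming every message starts with either "AUTH: " or "CIFSRPC " (true of the excerpts in this thread, but not necessarily of all filer output); the function names and the category labels are made up for illustration:

```python
import re
from collections import Counter

# Assumption: each console message begins with one of these two prefixes,
# as in the excerpts quoted in this thread.
MESSAGE_START = re.compile(r"(?=AUTH: |CIFSRPC )")

# Coarse label: "AUTH: notice", "AUTH: Trace<Something>", plain "AUTH", or "CIFSRPC".
CATEGORY = re.compile(r"^(AUTH: (?:Trace\w+|notice)|AUTH|CIFSRPC)")


def split_messages(flat):
    """Split a flattened run of console output into one string per message."""
    return [part.strip() for part in MESSAGE_START.split(flat) if part.strip()]


def category(message):
    """Return a coarse category for a message, or 'other' if no prefix matches."""
    match = CATEGORY.match(message)
    return match.group(1) if match else "other"


if __name__ == "__main__":
    flat = (
        "AUTH: notice- The context has expired. AUTH: notice- No error. "
        "AUTH: notice- Unexpected GSSAPI security context error. "
        "AUTH: notice- The context has expired. AUTH: notice- No error. "
        "CIFSRPC SamrGetAliasMembership: Exception rpc_s_unknown_reject caught. "
        "AUTH: Error looking up domain groups during login from "
        "192.168.x.x:RPC_NT_CALL_FAILED (0xc002001b)."
    )
    messages = split_messages(flat)
    for line in messages:
        print(line)
    print(Counter(category(m) for m in messages))
```

Counting by category makes it easier to see how many context-expiry notices precede each SamrGetAliasMembership rejection.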
Just a guess here, but is the time on the filer and PDC synchronized?
I have noticed that even though filers use ntp, they do not correct the time continuously. Instead they periodically reset the local clock according to the ntp server's clock. By default this happens once an hour. Between hourly time resets, the filer uses some internal heuristics to try to keep its local time correct, but sometimes it doesn't do a very good job and the time on the filer may gradually drift forward or backward. We had a filer that would drift forward between 5 and 10 seconds over the course of an hour. Our workaround was this:
options timed.sched 10m
which resets the time every 10 minutes instead of every hour. Now the filer keeps much better time.
Steve Losen scl@virginia.edu phone: 434-924-0640
University of Virginia ITC Unix Support
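The effect of shortening the resync interval can be estimated with simple arithmetic. A rough Python sketch, assuming the drift accumulates linearly between resets (a simplification; the filer's actual behaviour between resets is not documented in this thread) and using a hypothetical helper name:

```python
def max_drift_seconds(drift_per_hour_s, resync_interval_min):
    """Worst-case clock error just before a resync, assuming linear drift."""
    return drift_per_hour_s * resync_interval_min / 60.0


if __name__ == "__main__":
    # Drift rates of 5 to 10 seconds per hour, as reported above.
    for rate in (5.0, 10.0):
        hourly = max_drift_seconds(rate, 60)   # default: resync once an hour
        ten_min = max_drift_seconds(rate, 10)  # with 'options timed.sched 10m'
        print(f"{rate:4.1f} s/h drift: {hourly:5.2f} s at 60 min, {ten_min:5.2f} s at 10 min")
```

Under that assumption, resetting every 10 minutes cuts the worst-case error to one sixth of the hourly figure.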
This sort of reminds me of bug 147265 which I encountered a while back, but you said you're running 7.1.1, and the Bugs Online says it's fixed in that release.
I would expect that ONTAP should acquire a new token if getting disconnected in such a manner...
If you can take a packet trace between the filers and DC that might help support escalate the case quicker? I suppose you've already done that.
On Wed, 29 Nov 2006 08:31:11 -0800 (PST) Myles Uyema netapp@uyema.net wrote:
This sort of reminds me of bug 147265 which I encountered a while back, but you said you're running 7.1.1, and the Bugs Online says it's fixed in that release.
OK, after some more investigation it appears that we're hitting some form of #203311 and/or #147265, and it seems the former is not marked as fixed in the 7.1 base release.
I would expect that ONTAP should acquire a new token if getting disconnected in such a manner...
It does, but only 5 minutes later. When the expiry takes place in the middle of the night, this is no problem. When users are constantly using the filer, this is a (light) problem.
If you can take a packet trace between the filers and DC that might help support escalate the case quicker? I suppose you've already done that.
I did send a trace to NetApp -- let's see what they'll say.
I'll post a summary once this is resolved -- thanks to all of you for your suggestions
Simon