More logs would be helpful - by default, the filer will try to 'improve' DC connectivity by searching every 4 hours. The last part of the logs that you posted may very well be that, or more likely it is trying to regain connectivity after an error (security context expiring is not something that is seen frequently - this is abnormal!).
The biggest problems are the context expiring, the GSSAPI security context error (result of the security context expiring, no doubt), and the RPC rejection.
I'm assuming that the RPC rejection and the GSSAPI errors are a direct result of the context expiration.
Glenn
-----Original Message----- From: Simon Vallet [mailto:svallet@genoscope.cns.fr] Sent: Wednesday, November 29, 2006 6:10 AM To: Glenn Walker Cc: Andrey.Borzenkov@fujitsu-siemens.com; toasters@mathworks.com Subject: Re: Intermittent "Permission denied" on NTFS qtree
Hi,
On Tue, 28 Nov 2006 07:41:58 -0500 "Glenn Walker" ggwalker@mindspring.com wrote:
Enable the option (temporarily) 'cifs.trace_dc_connection'. The
output (via screen\messages file) will help.
It may not be an issue with complete connectivity drop, but the DC is
definitely rejecting the RPC request
to look up group membership (SamrGetAliasMembership).
Apparently, there are some connectivity problems, but it seems they are quite random -- a trace of network traffic between the filer and the PDC reveals some unexpected TCP resets issued byt the DC :
[...] filer -> DC [FIN,ACK] DC->filer [ACK] DC->filer [RST,ACK] [...]
this shouldn't be a problem, since the filer requested a FIN anyway, but the time coincidence is troubling...
Enabling cifs.trace_dc_connection and cifs.trace_login yields some more information:
AUTH: notice- The context has expired. AUTH: notice- No error. AUTH: notice- Unexpected GSSAPI security context error. AUTH: notice- The context has expired. AUTH: notice- No error. CIFSRPC SamrGetAliasMembership: Exception rpc_s_unknown_reject caught. AUTH: Error looking up domain groups during login from 192.168.x.x:RPC_NT_CALL_FAILED (0xc002001b).
and ten seconds later: AUTH: TraceLDAPServer- Starting AD LDAP server address discovery for domain.tld AUTH: TraceLDAPServer- Found 2 AD LDAP server addresses using generic DNS query. AUTH: TraceLDAPServer- AD LDAP server address discovery for domain.tld complete. 2 unique addresses found. AUTH: notice- Unexpected GSSAPI security context error. [...]
This goes on for ten minutes, then the filer tries to locate a DC again, and then everything works fine again
AUTH: TraceDC- Starting DC address discovery for domain. AUTH: TraceDC- Filer is not a member of a site. AUTH: TraceDC- Found 2 addresses using generic DNS query. AUTH: TraceDC- Starting WINS queries. AUTH: TraceDC- Found 2 BDC addresses through WINS. AUTH: TraceDC- Found 1 PDC addresses through WINS. AUTH: TraceDC- DC address discovery for PC complete. 2 unique addresses found.
I'm not really sure of what *should* happen, but this definitely does *not* look good... I understand that a security context expires sometimes, but I wonder why it takes so long to re-negociate
Simon
As a wild guess - have you checked that clock is synchronized between DCs and filer? Is it possible that clock drifts too far between time daemon scheduled updates?
-andrey
________________________________
From: Glenn Walker [mailto:ggwalker@mindspring.com] Sent: Wed 11/29/2006 2:08 PM To: Simon Vallet Cc: Borzenkov, Andrey; toasters@mathworks.com Subject: RE: Intermittent "Permission denied" on NTFS qtree
More logs would be helpful - by default, the filer will try to 'improve' DC connectivity by searching every 4 hours. The last part of the logs that you posted may very well be that, or more likely it is trying to regain connectivity after an error (security context expiring is not something that is seen frequently - this is abnormal!).
The biggest problems are the context expiring, the GSSAPI security context error (result of the security context expiring, no doubt), and the RPC rejection.
I'm assuming that the RPC rejection and the GSSAPI errors are a direct result of the context expiration.
Glenn
-----Original Message----- From: Simon Vallet [mailto:svallet@genoscope.cns.fr] Sent: Wednesday, November 29, 2006 6:10 AM To: Glenn Walker Cc: Andrey.Borzenkov@fujitsu-siemens.com; toasters@mathworks.com Subject: Re: Intermittent "Permission denied" on NTFS qtree
Hi,
On Tue, 28 Nov 2006 07:41:58 -0500 "Glenn Walker" ggwalker@mindspring.com wrote:
Enable the option (temporarily) 'cifs.trace_dc_connection'. The
output (via screen\messages file) will help.
It may not be an issue with complete connectivity drop, but the DC is
definitely rejecting the RPC request
to look up group membership (SamrGetAliasMembership).
Apparently, there are some connectivity problems, but it seems they are quite random -- a trace of network traffic between the filer and the PDC reveals some unexpected TCP resets issued byt the DC :
[...] filer -> DC [FIN,ACK] DC->filer [ACK] DC->filer [RST,ACK] [...]
this shouldn't be a problem, since the filer requested a FIN anyway, but the time coincidence is troubling...
Enabling cifs.trace_dc_connection and cifs.trace_login yields some more information:
AUTH: notice- The context has expired. AUTH: notice- No error. AUTH: notice- Unexpected GSSAPI security context error. AUTH: notice- The context has expired. AUTH: notice- No error. CIFSRPC SamrGetAliasMembership: Exception rpc_s_unknown_reject caught. AUTH: Error looking up domain groups during login from 192.168.x.x:RPC_NT_CALL_FAILED (0xc002001b).
and ten seconds later: AUTH: TraceLDAPServer- Starting AD LDAP server address discovery for domain.tld AUTH: TraceLDAPServer- Found 2 AD LDAP server addresses using generic DNS query. AUTH: TraceLDAPServer- AD LDAP server address discovery for domain.tld complete. 2 unique addresses found. AUTH: notice- Unexpected GSSAPI security context error. [...]
This goes on for ten minutes, then the filer tries to locate a DC again, and then everything works fine again
AUTH: TraceDC- Starting DC address discovery for domain. AUTH: TraceDC- Filer is not a member of a site. AUTH: TraceDC- Found 2 addresses using generic DNS query. AUTH: TraceDC- Starting WINS queries. AUTH: TraceDC- Found 2 BDC addresses through WINS. AUTH: TraceDC- Found 1 PDC addresses through WINS. AUTH: TraceDC- DC address discovery for PC complete. 2 unique addresses found.
I'm not really sure of what *should* happen, but this definitely does *not* look good... I understand that a security context expires sometimes, but I wonder why it takes so long to re-negociate
Simon
On Wed, 29 Nov 2006 14:14:25 +0100 "Borzenkov, Andrey" Andrey.Borzenkov@fujitsu-siemens.com wrote:
As a wild guess - have you checked that clock is synchronized between DCs and filer? Is it possible that clock drifts too far between time daemon scheduled updates?
Kerberos seems indeed sensible to clock synchronization problems, but both machines synchronize on the same NTP server, and the drift betwenn the hourly filer re-synchronization is not sensible.
Simon
On Wed, 29 Nov 2006 08:08:22 -0500 "Glenn Walker" ggwalker@mindspring.com wrote:
More logs would be helpful - by default, the filer will try to 'improve' DC connectivity by searching every 4 hours. The last part of the logs that you posted may very well be that, or more likely it is trying to regain connectivity after an error (security context expiring is not something that is seen frequently - this is abnormal!).
Yes -- actually, the filer did try to "improve" its LDAP connection *just before* the first GSSAPI error:
06:33:05 CET -- AUTH: TraceLDAPServer- Attempting to improve AD LDAP connection for domain.tld 06:33:05 CET -- AUTH: TraceLDAPServer- AD LDAP connection improved for PC.LOCAL. 06:33:59 CET -- AUTH: notice- Unexpected GSSAPI security context error.
this might be related
The biggest problems are the context expiring, the GSSAPI security context error (result of the security context expiring, no doubt), and the RPC rejection.
I'm assuming that the RPC rejection and the GSSAPI errors are a direct result of the context expiration.
Digging a bit more, I looked into the details of what is going over the wire during that time:
From 06:34:00 (first GSSAPI error) until 06:38 (no GSSAPI error seen after this), the filer constantly issues LDAP Bind requests, to which the DC replies with 'saslBindInProgress'
From what I know of the GSSAPI/Krb5 negociation, the filer should then continue the challeng-response until it has a valid Bind response. But instead, it issues an Unbind request, suggesting it doesn't have valid Kerberos credentials and is giving up :
06:34:00 filer -> DC -- Bind request (Initial: MsgId=1) 06:34:00 DC->filer -- saslBindInProgress 06:34:00 filer->DC -- Unbind request (answer: MsgId=2)
At 06:38:10, it seems to re-negociate a Kerberos ticket with the DC, and the following LDAP Bind request then proceeds as (I think) it should:
06:38:11 filer -> DC -- Bind request (Initial: MsgId=1) 06:38:11 DC->filer -- saslBindInProgress 06:38:11 filer->DC -- Bind request (answer: MsgId=2) 06:38:11 DC->filer -- saslBindInProgress 06:38:11 filer->DC -- Bind request (answer: MsgId=3) 06:38:11 DC->filer -- Bind success
So it seems the question is indeed: why does the filer looses its Kerberos credentials ? I'm thinking with Andrew that this could be due to buggy MS patches : we did indeed apply several patches two weeks ago, and that's were the problems seem to start. But identifying the incorrect patch will not be easy, which is why I wanted to check the general behaviour first.
The reason why I'm thinking this is that the successful LDAP connection above is brutally interrupted by the DC at 06:54:29 with a RST in reply to an apparently typical LDAP query -- I'll have to dig a bit more into Kerberos to check the credentials are still valid at this time : We have another RST-interrupted LDAP session between 06:54:29 and 07:18:31, and then another one , starting at 07:18:31, which didn't get interrupted until the end of the dump, at 11:44.
This all thing is very strange: the RST could be due to wrong/expired credentials or a buggy MS patch : oddly enough, every "Bind success" issued by the DC can't be completely decoded by Ethereal (ethereal bug or MS bug ?) Maybe the ticket expiration time somehow ends badly encoded ?
Thanks for the input so far, Simon