As a wild guess - have you checked that the clocks are synchronized between the DCs and the filer? Is it possible that the filer's clock drifts too far between the time daemon's scheduled updates?
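
For context: Kerberos, which Active Directory uses for authentication, tolerates 5 minutes of clock skew by default; beyond that, authentication requests are rejected. Below is a minimal sketch of that comparison in Python. The two timestamps are made-up values standing in for whatever you read off the filer and a DC, and the 5-minute limit assumes the domain policy has not been changed.

```python
from datetime import datetime, timedelta, timezone

# Kerberos default clock-skew tolerance. Assumption: the domain policy
# still uses the default value.
MAX_SKEW = timedelta(minutes=5)


def clock_skew(filer_time: datetime, dc_time: datetime) -> timedelta:
    """Return the absolute difference between the two clocks."""
    return abs(filer_time - dc_time)


def within_tolerance(filer_time: datetime, dc_time: datetime,
                     max_skew: timedelta = MAX_SKEW) -> bool:
    """True if the two clocks are close enough for Kerberos to work."""
    return clock_skew(filer_time, dc_time) <= max_skew


if __name__ == "__main__":
    # Hypothetical readings taken at the same moment (both in UTC).
    dc = datetime(2006, 11, 29, 11, 10, 0, tzinfo=timezone.utc)
    filer = dc + timedelta(minutes=6, seconds=30)
    print("skew:", clock_skew(filer, dc))
    print("within tolerance:", within_tolerance(filer, dc))
```

If the skew only crosses the limit shortly before each scheduled time update, that would fit the intermittent pattern.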
 
-andrey


From: Glenn Walker [mailto:ggwalker@mindspring.com]
Sent: Wed 11/29/2006 2:08 PM
To: Simon Vallet
Cc: Borzenkov, Andrey; toasters@mathworks.com
Subject: RE: Intermittent "Permission denied" on NTFS qtree

More logs would be helpful - by default, the filer will try to 'improve'
DC connectivity by searching every 4 hours.  The last part of the logs
that you posted may very well be that, or more likely it is trying to
regain connectivity after an error (security context expiring is not
something that is seen frequently - this is abnormal!).

The biggest problems are the context expiring, the GSSAPI security
context error (result of the security context expiring, no doubt), and
the RPC rejection.

I'm assuming that the RPC rejection and the GSSAPI errors are a direct
result of the context expiration.

Glenn

-----Original Message-----
From: Simon Vallet [mailto:svallet@genoscope.cns.fr]
Sent: Wednesday, November 29, 2006 6:10 AM
To: Glenn Walker
Cc: Andrey.Borzenkov@fujitsu-siemens.com; toasters@mathworks.com
Subject: Re: Intermittent "Permission denied" on NTFS qtree

Hi,

On Tue, 28 Nov 2006 07:41:58 -0500
"Glenn Walker" <ggwalker@mindspring.com> wrote:

> Enable the option (temporarily) 'cifs.trace_dc_connection'.  The
> output (via screen\messages file) will help.
>
> It may not be an issue with complete connectivity drop, but the DC is
> definitely rejecting the RPC request
> to look up group membership (SamrGetAliasMembership).

Apparently, there are some connectivity problems, but they seem to be
quite random -- a trace of network traffic between the filer and the PDC
reveals some unexpected TCP resets issued by the DC:

[...]
filer -> DC [FIN,ACK]
DC->filer [ACK]
DC->filer [RST,ACK]
[...]

This shouldn't be a problem, since the filer had already sent a FIN, but
the timing is troubling...

Enabling cifs.trace_dc_connection and cifs.trace_login yields some more
information:

AUTH: notice- The context has expired.
AUTH: notice- No error.
AUTH: notice- Unexpected GSSAPI security context error.
AUTH: notice- The context has expired.
AUTH: notice- No error.
CIFSRPC SamrGetAliasMembership: Exception rpc_s_unknown_reject caught.
AUTH: Error looking up domain groups during login from
192.168.x.x:RPC_NT_CALL_FAILED (0xc002001b).

and ten seconds later:
AUTH: TraceLDAPServer- Starting AD LDAP server address discovery for
domain.tld
AUTH: TraceLDAPServer- Found 2 AD LDAP server addresses using generic
DNS query.
AUTH: TraceLDAPServer- AD LDAP server address discovery for domain.tld
complete. 2 unique addresses found.
AUTH: notice- Unexpected GSSAPI security context error.
[...]

This goes on for ten minutes, then the filer tries to locate a DC again,
and after that everything works fine again:

AUTH: TraceDC- Starting DC address discovery for domain.
AUTH: TraceDC- Filer is not a member of a site.
AUTH: TraceDC- Found 2 addresses using generic DNS query.
AUTH: TraceDC- Starting WINS queries.
AUTH: TraceDC- Found 2 BDC addresses through WINS.
AUTH: TraceDC- Found 1 PDC addresses through WINS.
AUTH: TraceDC- DC address discovery for PC complete. 2 unique addresses
found.
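
To get an overview of longer trace output like the excerpts above, the messages can be tallied by tag. This is only a rough Python sketch: it assumes the line shapes seen in these excerpts ("AUTH: <tag>- ...", "CIFSRPC ..."), treats any other line as a wrapped continuation of the previous message, and the function names are made up for illustration.

```python
import re
from collections import Counter

# A new message starts with one of the prefixes seen in the excerpts.
RECORD_START = re.compile(r"^(AUTH|CIFSRPC)\b")
# "AUTH: notice- ...", "AUTH: TraceDC- ..." and so on.
AUTH_TAG = re.compile(r"^AUTH:\s+(\w+)-\s")


def join_wrapped(lines):
    """Merge mail-wrapped continuation lines into single messages."""
    records = []
    for line in lines:
        line = line.strip()
        if not line or line == "[...]":
            continue
        if RECORD_START.match(line) or not records:
            records.append(line)
        else:
            records[-1] += " " + line
    return records


def tally(records):
    """Count messages per tag (notice, TraceDC, CIFSRPC, other...)."""
    counts = Counter()
    for rec in records:
        match = AUTH_TAG.match(rec)
        if match:
            counts[match.group(1)] += 1
        elif rec.startswith("CIFSRPC"):
            counts["CIFSRPC"] += 1
        else:
            counts["other"] += 1
    return counts


sample = """\
AUTH: notice- The context has expired.
AUTH: notice- No error.
AUTH: notice- Unexpected GSSAPI security context error.
CIFSRPC SamrGetAliasMembership: Exception rpc_s_unknown_reject caught.
AUTH: Error looking up domain groups during login from
192.168.x.x:RPC_NT_CALL_FAILED (0xc002001b).
AUTH: TraceDC- Starting DC address discovery for domain.
""".splitlines()

if __name__ == "__main__":
    for tag, count in tally(join_wrapped(sample)).most_common():
        print(tag, count)
```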

I'm not really sure of what *should* happen, but this definitely does
*not* look good...
I understand that a security context expires sometimes, but I wonder why
it takes so long to renegotiate.

Simon