[Pkg-openldap-devel] Bug#464024: syncrepl provider kills consumer by sending truncated cookie

Ralph Rößner roessner at capcom.de
Mon Feb 4 19:33:30 UTC 2008


Package: slapd
Version: 2.4.7-3
Severity: Important

Hi,

when our syncrepl consumers (refreshOnly mode) query the provider for
changes, the provider sometimes sends back an intermediate message in
which the synchronization cookie is truncated (the csn is missing). This
causes the consumer to die (segfault). Upon restart, the consumer
database is empty. In a rarer case, the consumer survives but has its
database cleaned out as well. This problem appeared after the upgrade
from 2.3.83-1+lenny1.

Our LDAP infrastructure contains a syncrepl provider and three consumers
in refreshOnly mode. Two of the consumers get an identical subset of the
data and are configured alike except for the replication user, while the
third serves a different purpose. All consumers have been hit by the
problem; the two configured alike die at the same time. The problem
occurs at seemingly random intervals, from a few hours to a few days.

Since then I have tried a few changes to our configuration and an
upgrade to 2.4.7-4, mainly to keep things alive (mail customers were
not happy). This has yielded only one result: switching to
refreshAndPersist mode avoids the problem. I had one of the two
identically configured consumers running in refreshAndPersist, and it
survived when the other failed.

I have set up a test consumer server, copying the existing
configuration, and it has nicely duplicated the problem, even
reproducibly for a stretch of time. So I am able to provide sane (i.e.
without a lot of queries for mail addresses) debug logs that show the
consumer failing. I have also captured a debug log of the provider
handling the replication query. That log is from a later point in time,
because restarting the provider to change the log level cleared the
problem for a while.

You will notice in the logs that the intermediate message returned to
the client contains a cookie that stops after the "csn=" string, i.e. it
does not actually contain a value for the csn. I think that is what
kills the consumer. I don't have a clue why the provider does that.
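
To illustrate what I mean by a truncated cookie, here is a minimal
Python sketch. It assumes the cookie is a comma-separated list of
key=value fields such as rid and csn; the sample values are made up for
illustration and are not copied from the logs, and the helper names are
my own. It only flags a cookie whose csn field is absent or empty and
says nothing about why the provider produces one.

```python
def parse_sync_cookie(cookie):
    """Split a syncrepl cookie string into a dict of its key=value fields."""
    fields = {}
    for part in cookie.split(","):
        key, sep, value = part.partition("=")
        if sep:  # skip fragments that contain no '='
            fields[key.strip()] = value
    return fields


def csn_is_missing(cookie):
    """Return True if the cookie has no csn field or an empty one."""
    return not parse_sync_cookie(cookie).get("csn")


# Hypothetical sample cookies, for illustration only.
good = "rid=001,csn=20080204193330Z#000000#00#000000"
bad = "rid=001,csn="

print(csn_is_missing(good))
print(csn_is_missing(bad))
```

A consumer-side check along these lines would at least allow rejecting
the message instead of crashing on it.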

I have provided a network trace (in pcap format) of the exchange,
leaving out the handshake and bind request message to avoid password
disclosure. Unless I'm mistaken, the refreshDeletes flag of the
intermediate message is set to TRUE, indicating multiple deletes
(right?). This fits well with the rare case of the consumer deleting all
its entries (which I have not been able to get logs of so far). 
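
For reference, my reading of RFC 4533 is that a syncIdSet message with
refreshDeletes set to TRUE lists entries the consumer should delete,
while FALSE lists entries that are still present. A rough Python sketch
of that rule follows. The function name and the UUID strings are
hypothetical, and this is not slapd's actual code.

```python
def uuids_to_delete(local_uuids, sync_uuids, refresh_deletes):
    """Return the local entries a syncIdSet message asks to delete.

    With refresh_deletes True, the listed UUIDs are deletions, so every
    listed UUID that exists locally is removed. With refresh_deletes
    False, the listed UUIDs are merely confirmed present and nothing is
    deleted by this message itself.
    """
    if refresh_deletes:
        return set(local_uuids) & set(sync_uuids)
    return set()


# Made-up UUIDs, for illustration only.
local = {"uuid-a", "uuid-b", "uuid-c"}
listed = {"uuid-a", "uuid-b", "uuid-c"}

print(sorted(uuids_to_delete(local, listed, True)))
print(sorted(uuids_to_delete(local, listed, False)))
```

If the provider lists every entry with the flag set to TRUE, the
consumer would end up empty, which matches what we see in the rare case.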


