Redo RelationCounterField unsaved state tracking to avoid deadlocks.
Review Request #9268 — Created Oct. 14, 2017 and submitted
We were hitting a very peculiar deadlock bug that was reproduceable but
depended on the operations being performed and the state of the garbage
collector. The deadlock occurred due to Django'sSignal.connect()
triggering aRelationCounterField
weak reference to drop while
iterating through signal handlers, which resulted in a callback in
RelationCounterField
that then attempted to disconnect signals. Since
Signal.connect()
already grabbed a lock,Signal.disconnect()
was
blocked waiting on that lock, causing the process (whether unit tests or
web server) to hang.In order to get around this, the state tracking has been changed a bit,
in a way that actually cleans up the code and brings some consistency
and order to how unsaved instances work. Previously, unsaved model
instances (those with a primary key of None) required listening both for
a save (to begin tracking state) and a weak reference drop (to clean up
signal handlers), which required the use of non-weak signal callbacks,
preventingSignal
from doing its own cleanup.In the new design,
InstanceState
has been updated to be used for both
saved and unsaved instances. It now owns the signal handlers, and
connects to them using weak references, so thatSignal
can handle
disconnecting (in a way that avoids deadlocks) when the state is
no longer referenced. AnInstanceState
that represents an unsaved
instance will listen for when the instance is saved, remove the existing
references to itself, and register newInstanceStates
for the saved
reference.
InstanceState
also listens for when the weak reference is cleared for
both saved and unsaved instances, so that the resets always happen when
needed. It also no longer listens topre_delete
for unsaved instances,
reducing the number of signal connections.This all avoids the problem of nested locks in
Signal
, creates
consistency in how all instances are tracked, and further reduces the
chances of state or signal connection leaks.
Set up a reproduceable unit test run in Review Board that consistently
eventually ended up in a deadlock (due to the garbage collector deciding
to run at some point). After this change, I could not reproduce any
deadlocks in the test runs.All unit tests for Djblets and Review Board pass.
Description | From | Last Updated |
---|---|---|
typo: represented -> represent |
david |