=========================================
Application and Resync IO Synchronization
=========================================

This summary describes one aspect of the Distributed Replicated Block Device
(DRBD) protocol. For a full definition of the protocol see the DRBD code.

IO operations to the DRBD backing device can originate from two sources. Normal
operations from a filesystem or other user of DRBD are called application IO.
Resync operations in DRBD also perform reads and writes. These operations must
be synchronized to ensure that the data on the backing device is correct.

For instance, the following must be prevented:

* Resync read obtains block version v1 on some node
* Application writes block version v2 on all nodes
* Resync write overwrites block version v2 with version v1

In addition, care must be taken to ensure that a bitmap bit is only cleared
when the corresponding block is genuinely in sync between the nodes concerned.
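
The sequence listed above can be replayed with a toy model. This is only an
illustrative sketch: the dictionaries stand in for the backing devices of two
nodes, and the function name and the initial version ``v0`` are hypothetical
and do not come from the DRBD code.

```python
def unsynchronized_resync():
    """Replay the forbidden interleaving with no synchronization at all."""
    source = {"block": "v1"}   # backing device of the sync source
    target = {"block": "v0"}   # backing device of the sync target (assumed stale)

    resync_data = source["block"]    # resync read obtains version v1

    source["block"] = "v2"           # application writes version v2 ...
    target["block"] = "v2"           # ... on all nodes

    target["block"] = resync_data    # resync write overwrites v2 with v1
    return source["block"], target["block"]


source_version, target_version = unsynchronized_resync()
```

In this model the target ends up holding older data than the source, which is
the outcome the synchronization described in this document has to rule out.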

Synchronization with resync extents
===================================

Older versions of DRBD without the feature flag ``DRBD_FF_RESYNC_DAGTAG``
perform synchronization using "resync extents". These are also known as "bitmap
extents". They are stored in an LRU cache. These extents are exclusive with the
activity log extents.

Synchronization with data generation tags
=========================================

When the feature flag ``DRBD_FF_RESYNC_DAGTAG`` is present, DRBD synchronizes
resync and application IO using the concept of the "data generation tag"
(dagtag). This is coupled with fine-grained locking of the request intervals,
implemented internally using an interval tree.

For a given node, the dagtag is the number of sectors written by the
application on that node. The dagtag is used to determine which node has newer
data in certain scenarios. DRBD keeps track of the dagtags of its peers.

Whenever a resync request is made, the ``L_SYNC_TARGET`` node making the
request sends the dagtag from the current Primary node, if any. The
``L_SYNC_SOURCE`` node must wait until it has received the data corresponding
to this dagtag before responding to the resync request. This is important for
preventing the resync from writing older data over newer data.
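
The waiting rule can be sketched as follows. This is a simplified model under
stated assumptions, not the DRBD implementation: the class ``SyncSource``, its
fields and its methods are hypothetical names, and a dictionary stands in for
the backing disk.

```python
from dataclasses import dataclass, field


@dataclass
class SyncSource:
    """Toy model of a sync source that defers resync reads (hypothetical)."""
    received_dagtag: int = 0                      # sectors of Primary data received
    disk: dict = field(default_factory=dict)      # sector -> data
    waiting: list = field(default_factory=list)   # (required_dagtag, sector)
    replies: list = field(default_factory=list)   # (sector, data) sent back

    def receive_write(self, sector, data, sectors=1):
        # An application write from the Primary advances the dagtag.
        self.disk[sector] = data
        self.received_dagtag += sectors
        self._serve_ready()

    def receive_resync_request(self, sector, required_dagtag):
        # The target includes the Primary dagtag it had seen when requesting.
        self.waiting.append((required_dagtag, sector))
        self._serve_ready()

    def _serve_ready(self):
        # Reply only to requests whose dagtag has been reached.
        still_waiting = []
        for required, sector in self.waiting:
            if self.received_dagtag >= required:
                self.replies.append((sector, self.disk.get(sector)))
            else:
                still_waiting.append((required, sector))
        self.waiting = still_waiting


src = SyncSource(disk={0: "v1"})
src.receive_resync_request(0, required_dagtag=8)   # deferred: dagtag is still 0
deferred = list(src.replies)
src.receive_write(0, "v2", sectors=8)              # dagtag reaches 8, reply is sent
```

The reply is produced only after the write has arrived, so it carries the data
that the target already knew about when it made the request.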

In addition, the request intervals are locked according to the following
scheme:

* Primary: The application IO interval is locked while the data is being
  written to the backing disk. In addition, conflicting application IO is
  prevented until the epoch containing the request is complete. However, this
  is effectively a separate lock. Resync IO is not blocked while this lock is
  held.
* Secondary: The application IO interval is locked while the data is being
  written to the backing disk.
* Sync target: The ``P_RS_DATA_REQUEST`` interval is locked. The lock is taken
  in two phases. Before sending the request, the interval is locked for
  conflicts with other peers. Then the dagtag is recorded and the request is
  sent. When the reply is received, the interval is additionally locked for
  conflicts with ongoing IO, in particular writes from the same peer. The lock
  is released when the reply has been received and the data written to the
  backing disk.
* Sync source: The ``P_RS_DATA_REQUEST`` interval is locked. The lock is taken
  when the dagtag for the request is reached. It is released when the
  ``P_RS_WRITE_ACK`` is received. This lock is a read lock; it is exclusive
  with write locks, but not with other read locks.
* Verify source: The online verify interval is marked but does not block any
  other requests. The mark is set, then the dagtag is recorded and the request
  is sent. The mark is removed when the ``P_OV_REPLY`` has been received, the
  dagtag from the reply has been reached and the data read. If any conflicting
  writes occur while the mark is set, the sectors are skipped.
* Verify target: The online verify interval is marked but does not block any
  other requests. The mark is set when the dagtag for the request has been
  reached. It is removed after reading the data. The latest dagtag received by
  this node is sent with the ``P_OV_REPLY``. If any conflicting writes occur
  while the mark is set, the sectors are skipped.
* Sending ``P_PEERS_IN_SYNC``: Intervals are briefly locked while sending
  ``P_PEERS_IN_SYNC`` to ensure that the bits remain in sync until the packet
  has been sent.
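
The online verify marks in the scheme above differ from the locks in that they
never delay a write. A minimal sketch of that behavior, assuming hypothetical
names ``VerifyMark`` and ``VerifyTracker`` and a plain list in place of the
interval tree:

```python
from dataclasses import dataclass


@dataclass(eq=False)   # compare marks by identity, not by field values
class VerifyMark:
    start: int                 # first sector
    size: int                  # length in sectors
    conflicted: bool = False   # set when a write overlapped the mark


class VerifyTracker:
    def __init__(self):
        self.marks = []

    def begin(self, start, size):
        mark = VerifyMark(start, size)
        self.marks.append(mark)
        return mark

    def note_write(self, start, size):
        """Writes are never delayed; they only flag overlapping marks."""
        for mark in self.marks:
            if mark.start < start + size and start < mark.start + mark.size:
                mark.conflicted = True

    def finish(self, mark):
        """Remove the mark; return True if the sectors should be compared."""
        self.marks.remove(mark)
        return not mark.conflicted


tracker = VerifyTracker()
first = tracker.begin(0, 8)
second = tracker.begin(8, 8)
tracker.note_write(4, 2)               # overlaps only the first mark
compare_first = tracker.finish(first)
compare_second = tracker.finish(second)
```

Here the first interval is skipped because a write touched it while the mark
was set, and the second interval is compared as usual.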

If a conflict occurs when an interval should be locked, the request is delayed
until the conflict resolves. Internally this is implemented by storing the
interval in the tree in an unlocked form. When an interval is removed from the
tree, the tree is searched for any intervals which can now be released.
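
The deferral mechanism can be sketched as follows. This is a simplified model
under stated assumptions: a plain list stands in for the interval tree, only
read and write locks are distinguished, and the names ``Interval`` and
``IntervalLocks`` are hypothetical.

```python
from dataclasses import dataclass


@dataclass(eq=False)   # compare intervals by identity, not by field values
class Interval:
    start: int            # first sector
    size: int             # length in sectors
    write: bool           # True for a write lock, False for a read lock
    locked: bool = False  # False while the interval is waiting in the tree

    def overlaps(self, other):
        return (self.start < other.start + other.size
                and other.start < self.start + self.size)


class IntervalLocks:
    def __init__(self):
        self.intervals = []   # stand-in for the interval tree

    def _blocked(self, candidate):
        # Only locked intervals block; two read locks are compatible.
        return any(
            other.locked and other.overlaps(candidate)
            and (other.write or candidate.write)
            for other in self.intervals
            if other is not candidate
        )

    def acquire(self, interval):
        """Insert the interval; lock it now if possible, else leave it waiting."""
        self.intervals.append(interval)
        interval.locked = not self._blocked(interval)
        return interval.locked

    def release(self, interval):
        """Remove the interval, then lock waiters that no longer conflict."""
        self.intervals.remove(interval)
        released = []
        for waiter in self.intervals:
            if not waiter.locked and not self._blocked(waiter):
                waiter.locked = True
                released.append(waiter)
        return released


locks = IntervalLocks()
app_write = Interval(0, 8, write=True)
resync_read = Interval(4, 8, write=False)
other_read = Interval(6, 2, write=False)

got_write = locks.acquire(app_write)     # no conflict, locked immediately
got_read = locks.acquire(resync_read)    # overlaps a write lock, so it waits
released = locks.release(app_write)      # the waiting read lock is now granted
got_other = locks.acquire(other_read)    # read locks do not exclude each other
```

A real interval tree finds the overlapping intervals without scanning every
entry; the linear scan here only keeps the sketch short.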

Application IO defers to resync IO. That is, application IO is blocked by
resync IO even when that resync IO has not yet obtained the lock for its
interval. This is important for ensuring progress. In the normal case, resyncs
only make one pass through the data. Hence they will eventually terminate.
Application IO, on the other hand, can keep a given region busy for an
arbitrary length of time. So resync IO must not wait indefinitely for
application IO.
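
The priority rule can be expressed as a small predicate. This is a sketch under
stated assumptions: the ``Request`` type and the ``must_wait`` function are
hypothetical, and the distinction between read and write locks is left out to
keep the rule visible.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Request:
    kind: str      # "application" or "resync"
    start: int     # first sector
    size: int      # length in sectors
    locked: bool   # False while the request is still waiting for its lock


def overlaps(a, b):
    return a.start < b.start + b.size and b.start < a.start + a.size


def must_wait(new, existing):
    """Return True if `new` has to wait for `existing`."""
    if not overlaps(new, existing):
        return False
    if existing.locked:
        return True
    # `existing` is itself still waiting. Only application IO defers to it,
    # and only when it is resync IO.
    return new.kind == "application" and existing.kind == "resync"


waiting_resync = Request("resync", 0, 8, locked=False)
new_app = Request("application", 4, 4, locked=False)
app_waits = must_wait(new_app, waiting_resync)

waiting_app = Request("application", 0, 8, locked=False)
new_resync = Request("resync", 4, 4, locked=False)
resync_waits = must_wait(new_resync, waiting_app)
```

The asymmetry in the last line of ``must_wait`` is what keeps a stream of
application writes from starving a resync request for the same region.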

Correctness of data
-------------------

We only consider the synchronization between application and resync IO here.

The locking scheme prevents any writes from other peers to the resync request
interval from when the request is initiated until the received data is written.
After the lock is taken on the target, the dagtag is recorded and the request
is sent to the source. The source then waits until it has reached this dagtag
before reading. This ensures that the resync data is at least as new as the
data on the target when the request was made.

Conflicting application writes that reach the target while the resync request
is in progress are held until the resync data has been written. Hence they
overwrite the resync data. In the case where the source had already received
this application write when it performed the resync read, the application write
will overwrite the resync write on the target with identical data. This is
harmless.

Resync requests sent from the target are not exclusive with application writes
from the same peer. However, since the resync and application data originate
from the same node, they are transmitted in the correct order in the data
stream. Application IO defers to received resync IO, ensuring that a resync
write received before an application write is also submitted first.

Correctness of bitmap bits
--------------------------

DRBD guarantees that bitmap bits are set, or the corresponding activity log
extent is active, on at least one peer whenever two nodes are out of sync with
each other. A resync is called "stable" when the target is a neighbor of the
Primary node, if there is one. After a stable resync, all bitmap bits should be
clear. In other situations, DRBD makes a best effort attempt to clear bits when
appropriate.

Hence we need to ensure that:

1. Bits are set when out of sync
2. Bits are only cleared when in sync
3. Bits are cleared in a stable resync

We are only considering the synchronization of application and resync IO here,
so we only need to consider interactions between the two types. Requirement (1)
holds due to the general design for how writes work in DRBD. Requirement (3)
holds because there is no operation that sets bits in a stable resync. The
potential issues with these interactions arise with requirement (2). We need to
ensure that a bit is never cleared if its block has become out of sync during
the operation.

On a Primary node, writes cause bits to be set and cleared when the
corresponding ``P_BARRIER_ACK`` packets are received. On a Secondary node,
writes cause bits to be set and cleared when the corresponding ``P_PEER_ACK``
is received. On a sync target, bits are cleared when the resync data has been
written. On a sync source, bits are cleared when ``P_RS_WRITE_ACK`` is
received.

The bits cleared by writes must always be in sync because the corresponding
nodes have received the write. As demonstrated in the section "Correctness of
data", they cannot lose this data due to a resync.

For a stable resync, bits will not become out of sync for the peer device on
either side during the resync operation because both peers receive the
application writes.

On a sync target for an unstable resync, no application writes are received, so
there will be no bits set that could be incorrectly cleared.

On a sync source for an unstable resync, the interval is locked until
``P_RS_WRITE_ACK`` is received. Hence, when the bit is cleared, the target has
the same data for the interval as the source. That is, they are still in sync.

For ``P_PEERS_IN_SYNC`` we consider only the three-node case. There is only one
configuration with an unstable resync with three nodes. That is a chain
A - B - C with A being Primary and a sync from B to C. The only
``P_PEERS_IN_SYNC`` packets that have an effect in this configuration are those
from B to A indicating that C is in sync for some interval. B only sends this
packet when no bitmap bits are set towards C for the interval. In addition,
B must ensure that no application write causes bits to be incorrectly cleared
on A towards C. This could occur when B has sent ``P_BARRIER_ACK`` for a write
which is not yet represented in its bitmap towards C. So B must not send
``P_PEERS_IN_SYNC`` for an interval where this may be the case. To do this, it
checks that there is no activity in the activity log that overlaps with this
interval. To ensure that no writes occur between this check and sending
``P_PEERS_IN_SYNC``, it locks the interval temporarily.
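
The checks that B performs can be sketched as follows. This is a simplified
model under stated assumptions: the class ``NodeB`` and its fields are
hypothetical, and both the bitmap and the activity log are tracked per sector
here, whereas DRBD tracks them at bitmap-bit and extent granularity.

```python
from dataclasses import dataclass, field


@dataclass
class NodeB:
    """Toy model of node B in the chain A - B - C (hypothetical)."""
    out_of_sync_towards_c: set = field(default_factory=set)  # sectors with bits set
    active_al_sectors: set = field(default_factory=set)      # sectors in active AL extents
    locked: list = field(default_factory=list)               # temporarily locked intervals
    sent: list = field(default_factory=list)                 # packets sent to A

    def try_send_peers_in_sync(self, start, size):
        sectors = set(range(start, start + size))
        # Lock first so that no write can slip in between check and send.
        self.locked.append((start, size))
        try:
            if sectors & self.out_of_sync_towards_c:
                return False   # C is not known to be in sync here
            if sectors & self.active_al_sectors:
                return False   # a write may not be in the bitmap yet
            self.sent.append(("P_PEERS_IN_SYNC", start, size))
            return True
        finally:
            self.locked.remove((start, size))


node_b = NodeB(active_al_sectors={10})
sent_quiet = node_b.try_send_peers_in_sync(0, 8)    # no bits, no activity
sent_active = node_b.try_send_peers_in_sync(8, 8)   # overlaps activity at sector 10
```

The lock is held only across the check and the send, which matches the "briefly
locked" description of this interval in the locking scheme.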

Deadlock safety
---------------

We can ignore the locking of application IO until the containing epoch is
complete. No other lock acquisition depends on it. To put it another way, it
operates on a level above the rest of the locking.

The locking on Primary and Secondary while application IO is being written to
the backing disk does not depend on any other lock acquisition. So it is
guaranteed that a locked interval of this type will eventually be unlocked.

Online verify does not block any other operations, so it cannot be involved in
causing a deadlock.

Sending ``P_PEERS_IN_SYNC`` also cannot be involved in causing a deadlock
because it does not depend on any other lock acquisition.

Resync requests depend on the corresponding peer. If the connection is lost,
the operation is aborted, so no deadlock will occur as a result of
non-responsive peers.

A node cannot be both sync source and sync target simultaneously. Hence there
are no locks in the scheme which can block sync source reads indefinitely. So
a resync request from a sync target will always eventually receive a reply,
which allows it to perform the write and unlock its interval. This in turn
guarantees that the sync source will receive an ack and unlock its interval.

Hence the locking scheme itself is free from distributed deadlocks.