Mirror Design Document (Version 1)
Kernel source files:
- linux/drivers/md/dm-raid1.c
- linux/drivers/md/dm-log.[ch]
Kernel module name:
- dm-mirror
Note: The following information is stale and needs updating/expanding.
A mirror consists of a variable number of replicated volumes and a log. The log tracks the state of portions of the replicated volumes. A portion can be clean (in sync), dirty, or not in sync. Clean means that the portion holds the same data on every replicated volume. Dirty means that the portion is being altered. Not in sync means that the portion differs between volumes. A write to a mirror consists of writing to the log to mark the destination address as “dirty”, writing to all replicated volumes, and writing to the log to mark the destination address as “clean” once all writes are complete. To prevent excessive logging, the logical address space of the replicated volumes is divided into “regions”. Region sizes are powers of 2, typically in the high-kilobyte or low-megabyte range. A second write to a region that is already marked dirty therefore does not require informing the log, since the first write has already done so. The log does not mark a region clean until all writes to that region complete. If a failure occurs, regions that are in the dirty state are considered out-of-sync.
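The dirty/clean bookkeeping described above can be sketched as a small in-memory model. This is only an illustration, not the dm-log code: the class name `RegionLog`, the `REGION_SHIFT` value, and the `log_writes` counter are all assumptions made for the example.

```python
from collections import defaultdict

# Hypothetical region size: 2**10 sectors (512 KiB with 512-byte sectors).
REGION_SHIFT = 10


class RegionLog:
    """Toy in-memory log that counts how often the log would be updated."""

    def __init__(self):
        self.dirty = set()               # regions currently marked dirty
        self.pending = defaultdict(int)  # in-flight writes per region
        self.log_writes = 0              # number of log updates issued

    def region_of(self, sector):
        # Power-of-2 region sizes make this a simple shift.
        return sector >> REGION_SHIFT

    def start_write(self, sector):
        region = self.region_of(sector)
        if self.pending[region] == 0:
            # First in-flight write to this region: mark it dirty in the log.
            self.dirty.add(region)
            self.log_writes += 1
        self.pending[region] += 1
        return region

    def end_write(self, region):
        self.pending[region] -= 1
        if self.pending[region] == 0:
            # Last in-flight write completed: mark the region clean.
            self.dirty.discard(region)
            self.log_writes += 1

    def out_of_sync_after_failure(self):
        # Regions still dirty at failure time are treated as out-of-sync.
        return set(self.dirty)
```

In this model, two overlapping writes to sectors 0 and 100 fall into the same region and cost two log updates in total (one dirty, one clean) rather than four.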
During I/O, reads to an in-sync region are free to choose the mirror device from which they read. This can improve performance, although the current implementation leaves much to be desired. (Currently, read balancing is round-robin, switching between mirror devices every 128 requests.) Reads to a region that is not in sync, along with all writes, are queued and processed. Recovery of any out-of-sync regions takes place concurrently with normal I/O. A single task handles dispatching both queued I/O and recovery I/O, and it proceeds as follows:
- Update region states: Notify the log to clear regions that have recently completed all writes or which have recovered successfully. Also, dispatch any writes that were delayed behind a recovering region.
- Do (more) recovery: Query the log for an out-of-sync region and note that the region is recovering. Copy the region from the primary device to the other replicated volumes.
- Do reads: If the mirror is in sync now, do read balancing; otherwise read from the primary mirror.
- Do writes: Separate writes into three groups according to whether the region is in-sync, not-in-sync, or flagged as recovering (as noted in the recovery step). Writes to in-sync regions are written to all replicated volumes. Writes to not-in-sync regions are written only to the primary volume. Writes to recovering regions are placed on a list to be dispatched by the region-state update step in the next pass. (Note: Writes to regions which are being recovered by remote nodes are ignored until the next pass.)
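The read and write steps of the pass above can be sketched as follows. This is a simplified model under stated assumptions, not the dm-raid1.c code: `RegionState`, `classify_writes`, and `ReadBalancer` are hypothetical names, and the sketch ignores the remote-node case.

```python
from enum import Enum


class RegionState(Enum):
    IN_SYNC = "in-sync"
    NOT_IN_SYNC = "not-in-sync"
    RECOVERING = "recovering"


def classify_writes(writes, state_of):
    """Split (region, payload) writes into the three groups described above.

    `state_of` is a callable mapping a region number to a RegionState.
    """
    to_all, to_primary, delayed = [], [], []
    for region, payload in writes:
        state = state_of(region)
        if state is RegionState.IN_SYNC:
            to_all.append((region, payload))      # write to every volume
        elif state is RegionState.NOT_IN_SYNC:
            to_primary.append((region, payload))  # write to primary only
        else:
            delayed.append((region, payload))     # hold until the next pass
    return to_all, to_primary, delayed


class ReadBalancer:
    """Round-robin read selection that switches device every `interval` reads."""

    def __init__(self, n_devices, interval=128, primary=0):
        self.n_devices = n_devices
        self.interval = interval
        self.primary = primary
        self.current = primary
        self.count = 0

    def choose(self, in_sync):
        if not in_sync:
            # Out-of-sync data is only trustworthy on the primary.
            return self.primary
        device = self.current
        self.count += 1
        if self.count >= self.interval:
            self.count = 0
            self.current = (self.current + 1) % self.n_devices
        return device
```

With two devices and `interval=2`, successive in-sync reads are served by devices 0, 0, 1, 1, 0, while any out-of-sync read goes to the primary without advancing the rotation.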
The mirror kernel component is a device-mapper target composed of two files: dm-raid1.c and dm-log.c. The first file implements the algorithm previously described. It also provides a mechanism to support different logging implementations, so each mirror instantiation can use a different logging implementation. The dm-log.c file holds the implementation of two kinds of logging: “core” and “disk”. The core version tracks region state in memory, while the disk version requires a separate device to which it can write the region state persistently. While the core version is fast during use, it lacks fast recovery and start-up. Upon device activation, the core version must assume that all regions are out-of-sync, which slows I/O while the mirror re-syncs. When the disk version is used and the device is activated, it can read which regions were dirty when the device failed or was shut down, and it considers only those regions to be out-of-sync. This greatly improves recoverability. With a persistent log, the choice of region size is a trade-off between speed and recoverability. The larger the region, the fewer the writes to the log (disk). However, larger regions also mean that larger portions of the disk need to be re-synced in the event of a failure.
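The region-size trade-off can be made concrete with a little arithmetic. The helpers below are illustrative only; the function names and the example sizes are assumptions, not values taken from the kernel sources.

```python
def region_count(device_bytes, region_bytes):
    """Number of regions (and therefore log entries) covering the device."""
    if region_bytes <= 0 or region_bytes & (region_bytes - 1):
        raise ValueError("region size must be a positive power of 2")
    return -(-device_bytes // region_bytes)  # ceiling division


def resync_bytes(dirty_regions, region_bytes):
    """Bytes to copy if `dirty_regions` regions were dirty at failure time."""
    return dirty_regions * region_bytes


KIB = 1024
MIB = 1024 * KIB
GIB = 1024 * MIB

# Example: a 1 GiB mirror with 512 KiB regions versus 4 MiB regions.
small = region_count(1 * GIB, 512 * KIB)   # 2048 regions to track
large = region_count(1 * GIB, 4 * MIB)     # 256 regions to track
```

For the same 1 GiB device, the 4 MiB region size gives the log eight times fewer regions to track, but a single region that was dirty at failure time costs 4 MiB of resync instead of 512 KiB.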
A disk failure is treated differently for reads and writes. If an I/O fails during a read, it is simply retried on a different device. If a failure occurs during a write, an event is raised so that user-land tools can recreate the mirror volume without the failed device. The suspension, reconfiguration, and resumption are done in user space in response to the event signaled during the failed write. A daemon (dmeventd) “waits” for events. If one occurs on a device, the handling code for that device's target type (mirror in this case) is run. The handling code is made available to the daemon via a dynamically loaded shared object (DSO), which is loaded when the device is “registered” to listen for events. So, in the mirror case, the flow might look like the following:
- The call to create/activate a mirrored volume is made
- The device-mapper table is created and loaded
- A “register for events” call is made
- If there is no dmeventd daemon running, it is started
- The daemon recognizes that a mirror device is being registered for monitoring and loads the appropriate DSO
- If a failure occurs, the daemon sees the event and it calls the processing code in the DSO
- The DSO reconfigures the mirror device to exclude the failed device
- Operations to the device proceed
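The register-then-dispatch flow above can be modeled in a few lines. This is a toy stand-in and does not reflect the real dmeventd or libdevmapper-event interfaces: `ToyEventDaemon`, `mirror_handler`, and the device names are all hypothetical, and a plain Python callable plays the role of the per-target DSO.

```python
class ToyEventDaemon:
    """Toy model of an event daemon; not the real dmeventd interface."""

    def __init__(self, available_handlers):
        # Maps target type -> callable, standing in for per-target DSOs on disk.
        self.available = available_handlers
        self.loaded = {}      # target type -> handler loaded so far
        self.registered = {}  # device name -> target type

    def register(self, device, target_type):
        if target_type not in self.loaded:
            # Load the handler the first time this target type is registered.
            self.loaded[target_type] = self.available[target_type]
        self.registered[device] = target_type

    def on_event(self, device, legs, failed_leg):
        # Look up the handler for this device's target type and run it.
        handler = self.loaded[self.registered[device]]
        return handler(legs, failed_leg)


def mirror_handler(legs, failed_leg):
    """Hypothetical mirror handler: return the leg list without the failed leg."""
    return [leg for leg in legs if leg != failed_leg]
```

Registering a device named `vg-lv` as a `mirror` and then reporting a failure of leg `sdb` out of `["sda", "sdb"]` yields the reconfigured leg list `["sda"]`. In the real system that reconfiguration also involves suspending and resuming the device.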