146 lines
6.2 KiB
Text
146 lines
6.2 KiB
Text
|
|
When managing a RAID1 array which uses metadata other than the
|
|
"native" metadata understood by the kernel, mdadm makes use of a
|
|
partner program named 'mdmon' to manage some aspects of updating
|
|
that metadata and synchronising the metadata with the array state.
|
|
|
|
This document provides some details on how mdmon works.
|
|
|
|
Containers
|
|
----------
|
|
|
|
As background: mdadm makes a distinction between an 'array' and a
|
|
'container'. Other sources sometimes use the term 'volume' or
|
|
'device' for an 'array', and may use the term 'array' for a
|
|
'container'.
|
|
|
|
For our purposes:
|
|
- a 'container' is a collection of devices which are described by a
|
|
single set of metadata. The metadata may be stored equally
|
|
on all devices, or different devices may have quite different
|
|
subsets of the total metadata. But there is conceptually one set
|
|
of metadata that unifies the devices.
|
|
|
|
- an 'array' is a set of datablock from various devices which
|
|
together are used to present the abstraction of a single linear
|
|
sequence of block, which may provide data redundancy or enhanced
|
|
performance.
|
|
|
|
So a container has some metadata and provides a number of arrays which
|
|
are described by that metadata.
|
|
|
|
Sometimes this model doesn't work perfectly. For example, global
|
|
spares may have their own metadata which is quite different from the
|
|
metadata from any device that participates in one or more arrays.
|
|
Such a global spare might still need to belong to some container so
|
|
that it is available to be used should a failure arise. In that case
|
|
we consider the 'metadata' to be the union of the metadata on the
|
|
active devices which describes the arrays, and the metadata on the
|
|
global spares which only describes the spares. In this case different
|
|
devices in the one container will have quite different metadata.
|
|
|
|
|
|
Purpose
|
|
-------
|
|
|
|
The main purpose of mdmon is to update the metadata in response to
|
|
changes to the array which need to be reflected in the metadata before
|
|
futures writes to the array can safely be performed.
|
|
These include:
|
|
- transitions from 'clean' to 'dirty'.
|
|
- recording the devices have failed.
|
|
- recording the progress of a 'reshape'
|
|
|
|
This requires mdmon to be running at any time that the array is
|
|
writable (a read-only array does not require mdmon to be running).
|
|
|
|
Because mdmon must be able to process these metadata updates at any
|
|
time, it must (when running) have exclusive write access to the
|
|
metadata. Any other changes (e.g. reconfiguration of the array) must
|
|
go through mdmon.
|
|
|
|
A secondary role for mdmon is to activate spares when a device fails.
|
|
This role is much less time-critical than the other metadata updates,
|
|
so it could be performed by a separate process, possibly
|
|
"mdadm --monitor" which has a related role of moving devices between
|
|
arrays. A main reason for including this functionality in mdmon is
|
|
that in the native-metadata case this function is handled in the
|
|
kernel, and mdmon's reason for existence to provide functionality
|
|
which is otherwise handled by the kernel.
|
|
|
|
|
|
Design overview
|
|
---------------
|
|
|
|
mdmon is structured as two threads with a common address space and
|
|
common data structures. These threads are know as the 'monitor' and
|
|
the 'manager'.
|
|
|
|
The 'monitor' has the primary role of monitoring the array for
|
|
important state changes and updating the metadata accordingly. As
|
|
writes to the array can be blocked until 'monitor' completes and
|
|
acknowledges the update, it much be very careful not to block itself.
|
|
In particular it must not block waiting for any write to complete else
|
|
it could deadlock. This means that it must not allocate memory as
|
|
doing this can require dirty memory to be written out and if the
|
|
system choose to write to the array that mdmon is monitoring, the
|
|
memory allocation could deadlock.
|
|
|
|
So 'monitor' must never allocate memory and must limit the number of
|
|
other system call it performs. It may:
|
|
- use select (or poll) to wait for activity on a file descriptor
|
|
- read from a sysfs file descriptor
|
|
- write to a sysfs file descriptor
|
|
- write the metadata out to the block devices using O_DIRECT
|
|
- send a signal (kill) to the manager thread
|
|
|
|
It must not e.g. open files or do anything similar that might allocate
|
|
resources.
|
|
|
|
The 'manager' thread does everything else that is needed. If any
|
|
files are to be opened (e.g. because a device has been added to the
|
|
array), the manager does that. If any memory needs to be allocated
|
|
(e.g. to hold data about a new array as can happen when one set of
|
|
metadata describes several arrays), the manager performs that
|
|
allocation.
|
|
|
|
The 'manager' is also responsible for communicating with mdadm and
|
|
assigning spares to replace failed devices.
|
|
|
|
|
|
Handling metadata updates
|
|
-------------------------
|
|
|
|
There are a number of cases in which mdadm needs to update the
|
|
metdata which mdmon is managing. These include:
|
|
- creating a new array in an active container
|
|
- adding a device to a container
|
|
- reconfiguring an array
|
|
etc.
|
|
|
|
To complete these updates, mdadm must send a message to mdmon which
|
|
will merge the update into the metadata as it is at that moment.
|
|
|
|
To achieve this, mdmon creates a Unix Domain Socket which the manager
|
|
thread listens on. mdadm sends a message over this socket. The
|
|
manager thread examines the message to see if it will require
|
|
allocating any memory and allocates it. This is done in the
|
|
'prepare_update' metadata method.
|
|
|
|
The update message is then queued for handling by the monitor thread
|
|
which it will do when convenient. The monitor thread calls
|
|
->process_update which should atomically make the required changes to
|
|
the metadata, making use of the pre-allocate memory as required. Any
|
|
memory the is no-longer needed can be placed back in the request and
|
|
the manager thread will free it.
|
|
|
|
The exact format of a metadata update is up to the implementer of the
|
|
metadata handlers. It will simply describe a change that needs to be
|
|
made. It will sometimes contain fragments of the metadata to be
|
|
copied in to place. However the ->process_update routine must make
|
|
sure not to over-write any field that the monitor thread might have
|
|
updated, such as a 'device failed' or 'array is dirty' state.
|
|
|
|
When the monitor thread has completed the update and written it to the
|
|
devices, an acknowledgement message is sent back over the socket so
|
|
that mdadm knows it is complete.
|