
Adding upstream version 4.3+20240412.

Signed-off-by: Daniel Baumann <daniel@debian.org>
Daniel Baumann 2025-02-14 06:34:09 +01:00
parent e2c5cc815f
commit c0f6a5a1b7
Signed by: daniel
GPG key ID: FBB4F0E80A80222F
67 changed files with 2247 additions and 2747 deletions


@@ -0,0 +1,280 @@
External Reshape
1 Problem statement
External (third-party metadata) reshape differs from native-metadata
reshape in three key ways:
1.1 Format specific constraints
In the native case reshape is limited by what is implemented in the
generic reshape routine (Grow_reshape()) and what is supported by the
kernel. There are exceptional cases where Grow_reshape() may block
operations when it knows that the kernel implementation is broken, but
otherwise the kernel is relied upon to be the final arbiter of what
reshape operations are supported.
In the external case the kernel, and the generic checks in
Grow_reshape(), become the super-set of what reshapes are possible. The
metadata format may not support, or may not yet implement, a given
reshape type. The implication for Grow_reshape() is that it must query
the metadata handler and effect changes in the metadata before the new
geometry is posted to the kernel. The ->reshape_super method allows
Grow_reshape() to validate the requested operation and post the metadata
update.
1.2 Scope of reshape
Native metadata reshape is always performed at the array scope (no
metadata relationship with sibling arrays on the same disks). External
reshape, depending on the format, may not allow the number of member
disks to be changed in a subarray unless the change is simultaneously
applied to all subarrays in the container. For example the imsm format
requires all member disks to be a member of all subarrays, so a 4-disk
raid5 in a container that also houses a 4-disk raid10 array could not be
reshaped to 5 disks as the imsm format does not support a 5-disk raid10
representation. This requires the ->reshape_super method to check the
contents of the array and ask the user to run the reshape at container
scope (if all subarrays are agreeable to the change), or report an
error in the case where one subarray cannot support the change.
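As a hedged illustration of that policy check, the sketch below walks
every subarray and refuses a disk-count change that any of them cannot
represent; the types, the helper, and the capability rule are invented
for the example and are not mdadm code.

  #include <stdbool.h>

  struct subarray { int level; int raid_disks; };

  /* Invented capability rule standing in for format-specific limits,
     e.g. imsm having no 5-disk raid10 representation. */
  static bool level_supports_disks(int level, int disks)
  {
      if (level == 10)
          return disks >= 4 && disks % 2 == 0;
      return disks >= 2;
  }

  /* 0: the grow may proceed; -1: a subarray blocks the change;
     -2: the user must re-run the reshape at container scope. */
  static int check_grow(struct subarray *sub, int n, int delta,
                        bool container_scope)
  {
      for (int i = 0; i < n; i++)
          if (!level_supports_disks(sub[i].level,
                                    sub[i].raid_disks + delta))
              return -1;
      return container_scope ? 0 : -2;
  }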
1.3 Monitoring / checkpointing
Reshape, unlike rebuild/resync, requires strict checkpointing to survive
interrupted reshape operations. For example when expanding a raid5
array the first few stripes of the array will be overwritten in a
destructive manner. When restarting the reshape process we need to know
the exact location of the last successfully written stripe, and we need
to restore the data in any partially overwritten stripe. Native
metadata stores this backup data in the unused portion of spares that
are being promoted to array members, or in an external backup file
(located on a non-involved block device).
The kernel is in charge of recording checkpoints of reshape progress,
but mdadm is delegated the task of managing the backup space which
involves:
1/ Identifying what data will be overwritten in the next unit of reshape
operation
2/ Suspending access to that region so that a snapshot of the data can
be transferred to the backup space.
3/ Allowing the kernel to reshape the saved region and setting the
boundary for the next backup.
In the external reshape case we want to preserve this mdadm
'reshape-manager' arrangement, but have a third actor, mdmon, to
consider. It is tempting to give the role of managing reshape to mdmon,
but that is counter to its role as a monitor, and conflicts with the
existing capabilities and role of mdadm to manage the progress of
reshape. For clarity the external reshape implementation maintains the
role of mdmon as a (mostly) passive recorder of raid events, and mdadm
treats it as it would the kernel in the native reshape case (modulo
needing to send explicit metadata update messages and checking that
mdmon took the expected action).
External reshape can use the generic md backup file as a fallback, but in the
optimal/firmware-compatible case the reshape-manager will use the metadata
specific areas for managing reshape. The implementation also needs to spawn a
reshape-manager per subarray when the reshape is being carried out at the
container level. For these two reasons the ->manage_reshape() method is
introduced. In addition to the base tasks mentioned above, this method:
1/ Processes each subarray one at a time, in series - where appropriate.
2/ Uses either generic routines in Grow.c for md-style backup file
support, or uses the metadata-format specific location for storing
recovery data.
This aims to avoid a "midlayer mistake"[1] and lets the metadata handler
optionally take advantage of generic infrastructure in Grow.c.
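A minimal sketch of the per-subarray dispatch implied by task 1/ above;
the list type and the manage_one callback are illustrative stand-ins
for the per-format ->manage_reshape() plumbing.

  struct subarray_info {
      const char *name;
      struct subarray_info *next;
  };

  /* Run one reshape-manager pass per member array, strictly in
     series; manage_one stands in for ->manage_reshape(). */
  static int container_reshape(struct subarray_info *list,
                               int (*manage_one)(const char *name))
  {
      for (struct subarray_info *s = list; s; s = s->next) {
          int err = manage_one(s->name);
          if (err)
              return err;   /* stop; later arrays are left untouched */
      }
      return 0;
  }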
2 Details for specific reshape requests
There are quite a few moving pieces spread out across md, mdadm, and mdmon for
the support of external reshape, and there are several different types of
reshape that need to be comprehended by the implementation. A rundown of
these details follows.
2.0 General provisions:
Obtain an exclusive open on the container to make sure we are not
running concurrently with a Create() event.
2.1 Freezing sync_action
Before making any attempt at a reshape we 'freeze' every array in
the container to ensure no spare assignment or recovery happens.
This involves writing 'frozen' to sync_action and changing the '/'
after 'external:' in metadata_version to a '-'. mdmon knows that
this means not to perform any management.
Before doing this we check that all sync_actions are 'idle', which
is racy but still useful.
Afterwards we check that all member arrays have no spares
or partial spares (recovery_start != 'none') which would indicate a
race. If they do, we unfreeze again.
Once this completes we know all the arrays are stable. They may
still have failed devices as devices can fail at any time. However
we treat those like failures that happen during the reshape.
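A minimal sketch of the freeze step for one member array, assuming raw
sysfs I/O; error handling is trimmed, and the metadata_version rewrite
is only noted in a comment.

  #include <fcntl.h>
  #include <stdio.h>
  #include <string.h>
  #include <unistd.h>

  static int freeze_member(const char *md)   /* e.g. "md127" */
  {
      char path[256], buf[32];
      ssize_t n;
      int fd;

      snprintf(path, sizeof(path), "/sys/block/%s/md/sync_action", md);
      fd = open(path, O_RDWR);
      if (fd < 0)
          return -1;
      /* check for 'idle' first - racy, but still useful */
      n = read(fd, buf, sizeof(buf) - 1);
      if (n <= 0) {
          close(fd);
          return -1;
      }
      buf[n] = '\0';
      if (strncmp(buf, "idle", 4) != 0) {
          close(fd);
          return -1;
      }
      lseek(fd, 0, SEEK_SET);
      if (write(fd, "frozen\n", 7) != 7) {
          close(fd);
          return -1;
      }
      close(fd);
      /* the real code also rewrites metadata_version, turning
         "external:/..." into "external:-..." so that mdmon stops
         managing the array */
      return 0;
  }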
2.2 Reshape size
1/ mdadm::Grow_reshape(): checks if mdmon is running and optionally
initializes st->update_tail
2/ mdadm::Grow_reshape(): calls ->reshape_super() to check that the size change
   is allowed (it is being performed at subarray scope and there is enough
   room), and prepares a metadata update
3/ mdadm::Grow_reshape(): flushes the metadata update (via
flush_metadata_update(), or ->sync_metadata())
4/ mdadm::Grow_reshape(): posts the new size to the kernel
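Condensed into code, that flow might look like the sketch below; the
handler struct and helper names are stand-ins for mdadm internals, and
the sysfs write in step 4/ is reduced to a comment.

  struct size_handler {
      /* ->reshape_super(): validate the change, prepare the update */
      int  (*reshape_super)(void *ctx, long long new_size);
      /* ->sync_metadata(): write the metadata directly (no mdmon) */
      void (*sync_metadata)(void *ctx);
  };

  static int grow_size(struct size_handler *ss, void *ctx,
                       int mdmon_running, long long new_size)
  {
      /* 2/ let the format validate and prepare a metadata update */
      if (ss->reshape_super(ctx, new_size) < 0)
          return -1;                /* format cannot do it */

      /* 3/ flush the prepared update */
      if (mdmon_running) {
          /* flush_metadata_updates(): send the update to mdmon */
      } else {
          ss->sync_metadata(ctx);
      }

      /* 4/ post the new size to the kernel, i.e. write it to the
         md/component_size sysfs attribute */
      return 0;
  }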
2.3 Reshape level (simple-takeover)
"simple-takeover" implies the level change can be satisfied without touching
sync_action
1/ mdadm::Grow_reshape(): checks if mdmon is running and optionally
initializes st->update_tail
2/ mdadm::Grow_reshape(): calls ->reshape_super() to check that the level change
   is allowed (it is being performed at subarray scope), and prepares a
   metadata update
2a/ raid10 --> raid0: degrade all mirror legs prior to calling
->reshape_super
3/ mdadm::Grow_reshape(): flushes the metadata update (via
flush_metadata_update(), or ->sync_metadata())
4/ mdadm::Grow_reshape(): posts the new level to the kernel
2.4 Reshape chunk, layout
2.5 Reshape raid disks (grow)
1/ mdadm::Grow_reshape(): unconditionally initializes st->update_tail
because only redundant raid levels can modify the number of raid disks
2/ mdadm::Grow_reshape(): calls ->reshape_super() to check that the level
change is allowed (being performed at proper scope / permissible
geometry / proper spares available in the container), chooses
the spares to use, and prepares a metadata update.
3/ mdadm::Grow_reshape(): Converts each subarray in the container to the
raid level that can perform the reshape and starts mdmon.
4/ mdadm::Grow_reshape(): Pushes the update to mdmon.
5/ mdadm::Grow_reshape(): uses container_content to find details of
the spares and passes them to the kernel.
6/ mdadm::Grow_reshape(): gives raid_disks update to the kernel,
sets sync_max, sync_min, suspend_lo, suspend_hi all to zero,
and starts the reshape by writing 'reshape' to sync_action.
7/ mdmon::monitor notices the sync_action change and tells
managemon to check for new devices. managemon notices the new
devices, opens the relevant sysfs files, and passes them all to
monitor.
8/ mdadm::Grow_reshape() calls ->manage_reshape to oversee the
rest of the reshape.
9/ mdadm::<format>->manage_reshape(): saves data that will be overwritten by
the kernel to either the backup file or the metadata specific location,
advances sync_max, waits for the reshape, pings mdmon, and repeats.
Meanwhile mdmon::read_and_act(): records checkpoints.
Specifically:
9a/ if the 'next' stripe to be reshaped will over-write
itself during reshape then:
9a.1/ increase suspend_hi to cover a suitable number of
stripes.
9a.2/ backup those stripes safely.
9a.3/ advance sync_max to allow those stripes to be backed up
9a.4/ when sync_completed indicates that those stripes have
been reshaped, manage_reshape must call ping_manager
9a.5/ when mdmon notices that sync_completed has been updated,
it records the new checkpoint in the metadata
9a.6/ after the ping_manager, manage_reshape will increase
suspend_lo to allow access to those stripes again
9b/ if the 'next' stripe to be reshaped will over-write unused
space during reshape then we apply the same process as above,
except that there is no need to back anything up.
Note that we *do* need to keep suspend_hi progressing as
it is not safe to write to the area-under-reshape. For
kernel-managed-metadata this protection is provided by
->reshape_safe, but that does not protect us in the case
of user-space-managed-metadata.
10/ mdadm::<format>->manage_reshape(): once the reshape completes, changes the
    raid level back to the nominal raid level (if necessary)
FIXME: native metadata does not have the capability to record the original
raid level in the reshape-restart case because the kernel always records the
current raid level to the metadata, whereas external metadata can masquerade at an
alternate level based on the reshape state.
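The 9a sequence can be summarised as a loop. In this hedged sketch the
helpers only mark where the backup, the wait on sync_completed, and the
mdmon ping would happen; all names and the unit size are illustrative.

  #include <stdbool.h>

  struct reshape_pos {
      unsigned long long suspend_lo, suspend_hi;      /* sysfs values */
      unsigned long long sync_max, array_size, unit;  /* in sectors   */
  };

  /* Stand-ins for: copying a region to the backup space; blocking
     until the kernel's sync_completed reaches 'target' (e.g. by
     select()ing on its sysfs fd); pinging mdmon so it records a
     checkpoint in the metadata. */
  static void backup_stripes(unsigned long long lo, unsigned long long hi)
  { (void)lo; (void)hi; }
  static void wait_sync_completed(unsigned long long target)
  { (void)target; }
  static void ping_manager(void) { }

  static void reshape_loop(struct reshape_pos *p, bool destructive)
  {
      while (p->sync_max < p->array_size) {
          unsigned long long next = p->sync_max + p->unit;

          p->suspend_hi = next;          /* 9a.1/ suspend the region */
          if (destructive)
              backup_stripes(p->sync_max, next);  /* 9a.2/ back it up */
          p->sync_max = next;            /* 9a.3/ let the kernel go   */
          wait_sync_completed(next);     /* 9a.4/ region is reshaped  */
          ping_manager();                /* 9a.5/ mdmon checkpoints   */
          p->suspend_lo = next;          /* 9a.6/ re-allow writes     */
      }
  }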
2.6 Reshape raid disks (shrink)
3 Interaction with the metadata handler.
The following calls are made into the metadata handler to assist
with initiating and monitoring a 'reshape'.
1/ ->reshape_super is called quite early (after only minimal
checks) to make sure that the metadata can record the new shape
and any necessary transitions. It may be passed a 'container'
or an individual array within a container, and it should notice
the difference and act accordingly.
When a reshape is requested against a container it is expected
that it should be applied to every array in the container;
however, it is up to the metadata handler to determine the final
policy.
If the reshape is supportable, the internal copy of the metadata
should be updated, and a metadata update suitable for sending
to mdmon should be queued.
If the reshape will involve converting spares into array members,
this must be recorded in the metadata too.
2/ ->container_content will be called to find out the new state
of the array, or of all arrays in the container. Any newly
added devices (with state==0 and raid_disk >= 0) will be added
to the array as spares with the relevant slot number.
It is likely that the info returned by ->container_content will
have ->reshape_active set, ->reshape_progress set to e.g. 0, and
new_* set appropriately. mdadm will use this information to
cause the correct reshape to start at an appropriate time.
3/ ->set_array_state will be called by mdmon when reshape has
started and again periodically as it progresses. This should
record the ->last_checkpoint as the point where reshape has
progressed to. When the reshape finishes this will be called
again, and it should notice that ->curr_action is no longer
'reshape' and so record that the reshape has finished,
provided 'last_checkpoint' has progressed suitably.
4/ ->manage_reshape will be called once the reshape has been set
up in the kernel but before sync_max has been moved from 0, so
no actual reshape will have happened.
->manage_reshape should call progress_reshape() to allow the
reshape to progress, and should back-up any data as indicated
by the return value. See the documentation of that function
for more details.
->manage_reshape will be called multiple times when a
container is being reshaped, once for each member array in
the container.
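Gathered into an operations table, the four calls above might be
declared as in this simplified sketch; the real struct superswitch in
mdadm carries many more methods and fuller signatures.

  struct supertype;   /* per-format metadata context */
  struct mdinfo;      /* array state as mdadm and mdmon see it */

  struct reshape_ops {
      /* 1/ early: validate the new shape, queue a metadata update */
      int (*reshape_super)(struct supertype *st, long long size,
                           int level, int layout, int chunk,
                           int raid_disks);
      /* 2/ report the post-update state, new spares included */
      struct mdinfo *(*container_content)(struct supertype *st);
      /* 3/ called by mdmon as the reshape progresses; records
         last_checkpoint and notices completion */
      void (*set_array_state)(struct supertype *st, int consistent);
      /* 4/ drive the reshape once the kernel is set up */
      int (*manage_reshape)(struct supertype *st, struct mdinfo *a);
  };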
The progress of the metadata is as follows:
1/ mdadm sends a metadata update to mdmon which marks the array
as undergoing a reshape. This is set up by
->reshape_super and applied by ->process_update
For container-wide reshape, this happens once for the whole
container.
2/ mdmon notices progress via the sysfs files and calls
->set_array_state to update the state periodically
For container-wide reshape, this happens repeatedly for
one array, then repeatedly for the next, etc.
3/ mdmon notices when the reshape has finished and calls
->set_array_state to record that the reshape is complete.
For container-wide reshape, this happens once for each
member array.
...
[1]: Neil Brown, Linux kernel design patterns - part 3, https://lwn.net/Articles/336262/


@@ -0,0 +1,65 @@
# mdadm configuration file
#
# mdadm will function properly without the use of a configuration file,
# but this file is useful for keeping track of arrays and member disks.
# In general, an mdadm.conf file is created, and updated, after arrays
# are created. This is the opposite behavior of /etc/raidtab which is
# created prior to array construction.
#
#
# the config file takes two types of lines:
#
# DEVICE lines specify a list of devices in which to look for
# potential member disks
#
# ARRAY lines specify information about how to identify arrays
# so that they can be activated
#
# You can have more than one device line and use wild cards. The first
# example includes the first partition of the SCSI disks /dev/sdb,
# /dev/sdc, /dev/sdd, /dev/sdj, /dev/sdk, and /dev/sdl. The second
# line looks for array slices on IDE disks.
#
#DEVICE /dev/sd[bcdjkl]1
#DEVICE /dev/hda1 /dev/hdb1
#
# If you mount devfs on /dev, then a suitable way to list all devices is:
#DEVICE /dev/discs/*/*
#
#
# The AUTO line can control which arrays get assembled by auto-assembly,
# meaning either "mdadm -As" when there are no 'ARRAY' lines in this file,
# or "mdadm --incremental" when the array found is not listed in this file.
# By default, all arrays that are found are assembled.
# If you want to ignore all DDF arrays (maybe they are managed by dmraid),
# and only assemble 1.x arrays which are marked for 'this' homehost,
# but assemble all others, then use
#AUTO -ddf homehost -1.x +all
#
# ARRAY lines specify an array to assemble and a method of identification.
# Arrays can currently be identified by using a UUID, superblock minor number,
# or a listing of devices.
#
# super-minor is usually the minor number of the metadevice
# UUID is the Universally Unique Identifier for the array
# Each can be obtained using
#
# mdadm -D <md>
#
#ARRAY /dev/md0 UUID=3aaa0122:29827cfa:5331ad66:ca767371
#ARRAY /dev/md1 super-minor=1
#ARRAY /dev/md2 devices=/dev/hda1,/dev/hdb1
#
# ARRAY lines can also specify a "spare-group" for each array. mdadm --monitor
# will then move a spare between arrays in a spare-group if one array has a failed
# drive but no spare
#ARRAY /dev/md4 uuid=b23f3c6d:aec43a9f:fd65db85:369432df spare-group=group1
#ARRAY /dev/md5 uuid=19464854:03f71b1b:e0df2edd:246cc977 spare-group=group1
#
# When used in --follow (aka --monitor) mode, mdadm needs a
# mail address and/or a program. This can be given with "mailaddr"
# and "program" lines to that monitoring can be started using
# mdadm --follow --scan & echo $! > /run/mdadm/mon.pid
# If these lines are not found, mdadm will exit quietly
#MAILADDR root@mydomain.tld
#PROGRAM /usr/sbin/handle-mdadm-events


@@ -0,0 +1,146 @@
When managing a RAID1 array which uses metadata other than the
"native" metadata understood by the kernel, mdadm makes use of a
partner program named 'mdmon' to manage some aspects of updating
that metadata and synchronising the metadata with the array state.
This document provides some details on how mdmon works.
Containers
----------
As background: mdadm makes a distinction between an 'array' and a
'container'. Other sources sometimes use the term 'volume' or
'device' for an 'array', and may use the term 'array' for a
'container'.
For our purposes:
- a 'container' is a collection of devices which are described by a
single set of metadata. The metadata may be stored equally
on all devices, or different devices may have quite different
subsets of the total metadata. But there is conceptually one set
of metadata that unifies the devices.
- an 'array' is a set of data blocks from various devices which
together are used to present the abstraction of a single linear
sequence of blocks, which may provide data redundancy or enhanced
performance.
So a container has some metadata and provides a number of arrays which
are described by that metadata.
Sometimes this model doesn't work perfectly. For example, global
spares may have their own metadata which is quite different from the
metadata from any device that participates in one or more arrays.
Such a global spare might still need to belong to some container so
that it is available to be used should a failure arise. In that case
we consider the 'metadata' to be the union of the metadata on the
active devices which describes the arrays, and the metadata on the
global spares which only describes the spares. In this case different
devices in the one container will have quite different metadata.
Purpose
-------
The main purpose of mdmon is to update the metadata in response to
changes to the array which need to be reflected in the metadata before
future writes to the array can safely be performed.
These include:
- transitions from 'clean' to 'dirty'.
- recording that devices have failed.
- recording the progress of a 'reshape'.
This requires mdmon to be running at any time that the array is
writable (a read-only array does not require mdmon to be running).
Because mdmon must be able to process these metadata updates at any
time, it must (when running) have exclusive write access to the
metadata. Any other changes (e.g. reconfiguration of the array) must
go through mdmon.
A secondary role for mdmon is to activate spares when a device fails.
This role is much less time-critical than the other metadata updates,
so it could be performed by a separate process, possibly
"mdadm --monitor" which has a related role of moving devices between
arrays. The main reason for including this functionality in mdmon is
that in the native-metadata case this function is handled in the
kernel, and mdmon's reason for existence is to provide functionality
which is otherwise handled by the kernel.
Design overview
---------------
mdmon is structured as two threads with a common address space and
common data structures. These threads are known as the 'monitor' and
the 'manager'.
The 'monitor' has the primary role of monitoring the array for
important state changes and updating the metadata accordingly. As
writes to the array can be blocked until 'monitor' completes and
acknowledges the update, it must be very careful not to block itself.
In particular it must not block waiting for any write to complete, else
it could deadlock. This means that it must not allocate memory, as
doing so can require dirty memory to be written out, and if the
system chooses to write to the array that mdmon is monitoring, the
memory allocation could deadlock.
So 'monitor' must never allocate memory and must limit the number of
other system calls it performs. It may:
- use select (or poll) to wait for activity on a file descriptor
- read from a sysfs file descriptor
- write to a sysfs file descriptor
- write the metadata out to the block devices using O_DIRECT
- send a signal (kill) to the manager thread
It must not, for example, open files or do anything similar that might
allocate resources.
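A hedged sketch of such a loop, using only pre-opened descriptors and
stack buffers; sysfs reports attribute changes through select()'s
exception set, and all names here are illustrative rather than mdmon's
actual code.

  #include <signal.h>
  #include <string.h>
  #include <sys/select.h>
  #include <sys/types.h>
  #include <unistd.h>

  static void monitor_loop(int state_fd, pid_t manager_pid)
  {
      char buf[64];
      fd_set ex;
      ssize_t n;

      for (;;) {
          FD_ZERO(&ex);
          FD_SET(state_fd, &ex);
          /* sysfs attribute changes arrive as exception events */
          if (select(state_fd + 1, NULL, NULL, &ex, NULL) <= 0)
              continue;
          lseek(state_fd, 0, SEEK_SET);
          n = read(state_fd, buf, sizeof(buf) - 1);
          if (n <= 0)
              continue;
          buf[n] = '\0';
          if (strncmp(buf, "write-pending", 13) == 0) {
              /* write the metadata out with O_DIRECT, then write
                 "active" back to the array_state attribute */
          } else {
              /* anything that needs open()/malloc() is delegated
                 to the manager thread */
              kill(manager_pid, SIGUSR1);
          }
      }
  }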
The 'manager' thread does everything else that is needed. If any
files are to be opened (e.g. because a device has been added to the
array), the manager does that. If any memory needs to be allocated
(e.g. to hold data about a new array as can happen when one set of
metadata describes several arrays), the manager performs that
allocation.
The 'manager' is also responsible for communicating with mdadm and
assigning spares to replace failed devices.
Handling metadata updates
-------------------------
There are a number of cases in which mdadm needs to update the
metadata which mdmon is managing. These include:
- creating a new array in an active container
- adding a device to a container
- reconfiguring an array
etc.
To complete these updates, mdadm must send a message to mdmon which
will merge the update into the metadata as it is at that moment.
To achieve this, mdmon creates a Unix Domain Socket which the manager
thread listens on. mdadm sends a message over this socket. The
manager thread examines the message to see whether it will require
any memory to be allocated, and allocates it. This is done in the
'prepare_update' metadata method.
The update message is then queued for handling by the monitor
thread, which processes it when convenient. The monitor thread calls
->process_update which should atomically make the required changes to
the metadata, making use of the pre-allocated memory as required. Any
memory that is no longer needed can be placed back in the request, and
the manager thread will free it.
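A small sketch of that two-phase split, with invented types: the
manager allocates up front, and the monitor applies the change without
allocating.

  #include <stdlib.h>

  struct metadata_update {
      char *buf;                     /* message received from mdadm  */
      int   len;
      void *space;                   /* pre-allocated by the manager */
      struct metadata_update *next;  /* queued towards the monitor   */
  };

  /* manager thread: may allocate, does not touch live metadata */
  static void prepare_update(struct metadata_update *u)
  {
      /* size up what process_update will need, e.g. room for a new
         array record, and allocate it now */
      u->space = malloc(1024);
  }

  /* monitor thread: applies the change atomically, never allocates */
  static void process_update(struct metadata_update *u, char *metadata)
  {
      (void)metadata;
      /* merge u->buf into the metadata, using u->space as needed,
         while preserving fields the monitor itself may have changed
         (failed devices, clean/dirty state, ...); any unused memory
         stays in u->space for the manager thread to free */
      (void)u;
  }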
The exact format of a metadata update is up to the implementer of the
metadata handlers. It will simply describe a change that needs to be
made. It will sometimes contain fragments of the metadata to be
copied in to place. However the ->process_update routine must make
sure not to over-write any field that the monitor thread might have
updated, such as a 'device failed' or 'array is dirty' state.
When the monitor thread has completed the update and written it to the
devices, an acknowledgement message is sent back over the socket so
that mdadm knows it is complete.