External Reshape

1 Problem statement

External (third-party metadata) reshape differs from native-metadata
reshape in three key ways:

1.1 Format-specific constraints

In the native case reshape is limited by what is implemented in the
generic reshape routine (Grow_reshape()) and what is supported by the
kernel. There are exceptional cases where Grow_reshape() may block
operations when it knows that the kernel implementation is broken, but
otherwise the kernel is relied upon to be the final arbiter of what
reshape operations are supported.

In the external case the kernel, and the generic checks in
Grow_reshape(), become the super-set of what reshapes are possible. The
metadata format may not support, or may not yet have implemented, a
given reshape type. The implication for Grow_reshape() is that it must
query the metadata handler and effect changes in the metadata before the
new geometry is posted to the kernel. The ->reshape_super method allows
Grow_reshape() to validate the requested operation and post the metadata
update.
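
Schematically, the resulting call order inside Grow_reshape() reduces
to the sketch below. This is only a sketch: the real ->reshape_super()
takes many more arguments, and 'struct reshape_request' and
post_new_geometry() are invented here for illustration.

    struct reshape_request {
        long long new_size;     /* new component size, or -1 if unchanged */
        int new_level;          /* new raid level, or UnSet */
        int new_raid_disks;     /* new member count, or 0 if unchanged */
    };

    static int external_reshape(struct supertype *st, int fd,
                                struct reshape_request *req)
    {
        /* Ask the metadata handler to validate the new geometry and
         * queue a metadata update describing it. */
        if (st->ss->reshape_super(st, req) < 0)
            return -1;  /* the format cannot express this reshape */

        /* Flush the queued update to mdmon, or write the metadata
         * directly when no monitor is running. */
        st->ss->sync_metadata(st);

        /* Only after the metadata is committed is the new geometry
         * posted to the kernel through sysfs. */
        return post_new_geometry(fd, req);  /* hypothetical helper */
    }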

1.2 Scope of reshape

Native metadata reshape is always performed at the array scope (there is
no metadata relationship with sibling arrays on the same disks). External
reshape, depending on the format, may not allow the number of member
disks to be changed in a subarray unless the change is simultaneously
applied to all subarrays in the container. For example the imsm format
requires all member disks to be members of all subarrays, so a 4-disk
raid5 in a container that also houses a 4-disk raid10 array could not be
reshaped to 5 disks, as the imsm format does not support a 5-disk raid10
representation. This requires the ->reshape_super method to check the
contents of the container and either ask the user to run the reshape at
container scope (if all subarrays are agreeable to the change), or
report an error in the case where one subarray cannot support the
change.
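
A format's ->reshape_super() can therefore be expected to walk every
subarray before accepting a disk-count change, along the lines of the
sketch below. The format_supports() helper is hypothetical, and the
->container_content() signature is simplified.

    /* Reject a disk-count change unless every subarray in the
     * container can be represented with the new member count. */
    static int check_all_subarrays(struct supertype *st, int new_disks)
    {
        struct mdinfo *content = st->ss->container_content(st, NULL);
        struct mdinfo *a;

        for (a = content; a; a = a->next)
            /* e.g. imsm has no 5-disk raid10 representation */
            if (!format_supports(a->array.level, new_disks))
                return -1;      /* caller reports the error */
        return 0;               /* safe to apply container-wide */
    }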

1.3 Monitoring / checkpointing

Reshape, unlike rebuild/resync, requires strict checkpointing to survive
interrupted reshape operations. For example when expanding a raid5
array the first few stripes of the array will be overwritten in a
destructive manner. When restarting the reshape process we need to know
the exact location of the last successfully written stripe, and we need
to restore the data in any partially overwritten stripe. Native
metadata stores this backup data in the unused portion of spares that
are being promoted to array members, or in an external backup file
(located on a block device not otherwise involved in the reshape).

The kernel is in charge of recording checkpoints of reshape progress,
but mdadm is delegated the task of managing the backup space, which
involves:
1/ Identifying what data will be overwritten in the next unit of reshape
   operation
2/ Suspending access to that region so that a snapshot of the data can
   be transferred to the backup space.
3/ Allowing the kernel to reshape the saved region and setting the
   boundary for the next backup.

In the external reshape case we want to preserve this mdadm
'reshape-manager' arrangement, but have a third actor, mdmon, to
consider. It is tempting to give the role of managing reshape to mdmon,
but that is counter to its role as a monitor, and conflicts with the
existing capabilities and role of mdadm to manage the progress of
reshape. For clarity the external reshape implementation maintains the
role of mdmon as a (mostly) passive recorder of raid events, and mdadm
treats it as it would the kernel in the native reshape case (modulo
needing to send explicit metadata update messages and checking that
mdmon took the expected action).

External reshape can use the generic md backup file as a fallback, but
in the optimal/firmware-compatible case the reshape-manager will use the
metadata-specific areas for managing reshape. The implementation also
needs to spawn a reshape-manager per subarray when the reshape is being
carried out at the container level. For these two reasons the
->manage_reshape() method is introduced. In addition to the base tasks
mentioned above, this method:
1/ Processes each subarray one at a time in series - where appropriate.
2/ Uses either generic routines in Grow.c for md-style backup file
   support, or uses the metadata-format specific location for storing
   recovery data.
This aims to avoid a "midlayer mistake"[1] and lets the metadata handler
optionally take advantage of generic infrastructure in Grow.c.
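
The overall shape of such a ->manage_reshape() is sketched below. Every
helper named here is illustrative rather than actual mdadm API; only the
structure of the loop is the point.

    /* Per-subarray reshape manager, as described above. */
    static int format_manage_reshape(struct supertype *st)
    {
        struct mdinfo *sra;

        /* At container scope, handle one member array at a time,
         * in series. */
        while ((sra = next_subarray_to_reshape(st)) != NULL) {
            if (format_has_backup_area(st))
                /* metadata-specific recovery area on the disks */
                reshape_with_native_backup(st, sra);
            else
                /* generic md-style backup-file support in Grow.c */
                reshape_with_backup_file(st, sra);
        }
        return 0;
    }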

2 Details for specific reshape requests

There are quite a few moving pieces spread out across md, mdadm, and
mdmon for the support of external reshape, and there are several
different types of reshape that need to be comprehended by the
implementation. A rundown of these details follows.

2.0 General provisions:

Obtain an exclusive open on the container to make sure we are not
running concurrently with a Create() event.

2.1 Freezing sync_action

Before making any attempt at a reshape we 'freeze' every array in
the container to ensure no spare assignment or recovery happens.
This involves writing 'frozen' to sync_action and changing the '/'
after 'external:' in metadata_version to a '-'. mdmon knows that
this means not to perform any management.

Before doing this we check that all sync_actions are 'idle', which
is racy but still useful.
Afterwards we check that all member arrays have no spares
or partial spares (recovery_start != 'none'), which would indicate a
race. If they do, we unfreeze again.

Once this completes we know all the arrays are stable. They may
still have failed devices, as devices can fail at any time. However
we treat those like failures that happen during the reshape.
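
A stripped-down sketch of the freeze operation on one member array
follows; mdadm itself goes through its sysfs helpers and checks every
error, both of which are elided here. 'sysfs_dir' is the array's md
directory, e.g. /sys/block/md126/md.

    #include <fcntl.h>
    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>

    static void freeze_array(const char *sysfs_dir)
    {
        char path[256], ver[64];
        ssize_t n;
        int fd;

        /* 1. Stop any resync/recovery and prevent new ones. */
        snprintf(path, sizeof(path), "%s/sync_action", sysfs_dir);
        fd = open(path, O_WRONLY);
        write(fd, "frozen", 6);
        close(fd);

        /* 2. Turn "external:/container/n" into "external:-container/n"
         * so that mdmon stops performing management for this array. */
        snprintf(path, sizeof(path), "%s/metadata_version", sysfs_dir);
        fd = open(path, O_RDONLY);
        n = read(fd, ver, sizeof(ver) - 1);
        close(fd);
        if (n <= 0)
            return;
        ver[n] = '\0';
        ver[strcspn(ver, "\n")] = '\0';
        if (strncmp(ver, "external:/", 10) != 0)
            return;
        ver[9] = '-';
        fd = open(path, O_WRONLY);
        write(fd, ver, strlen(ver));
        close(fd);
    }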

2.2 Reshape size

1/ mdadm::Grow_reshape(): checks if mdmon is running and optionally
   initializes st->update_tail
2/ mdadm::Grow_reshape(): calls ->reshape_super() to check that the size
   change is allowed (it is being performed at subarray scope and there
   is enough room), and prepares a metadata update
3/ mdadm::Grow_reshape(): flushes the metadata update (via
   flush_metadata_update(), or ->sync_metadata())
4/ mdadm::Grow_reshape(): posts the new size to the kernel
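
Step 4 amounts to a single sysfs write. A stripped-down equivalent of
what mdadm does through its sysfs helpers is sketched below; note that
md's component_size attribute is expressed in KiB.

    /* Post the new per-device size (in KiB) to the kernel.
     * 'sysfs_dir' is the array's md directory; error handling is
     * reduced to a pass/fail return. */
    static int post_new_size(const char *sysfs_dir, unsigned long long kib)
    {
        char path[256], val[32];
        int fd, len, ret;

        snprintf(path, sizeof(path), "%s/component_size", sysfs_dir);
        len = snprintf(val, sizeof(val), "%llu", kib);
        fd = open(path, O_WRONLY);
        if (fd < 0)
            return -1;
        ret = (write(fd, val, len) == len) ? 0 : -1;
        close(fd);
        return ret;     /* -1 if the kernel rejected the new size */
    }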

2.3 Reshape level (simple-takeover)

"simple-takeover" implies the level change can be satisfied without
touching sync_action.

1/ mdadm::Grow_reshape(): checks if mdmon is running and optionally
   initializes st->update_tail
2/ mdadm::Grow_reshape(): calls ->reshape_super() to check that the level
   change is allowed (it is being performed at subarray scope), and
   prepares a metadata update
2a/ raid10 --> raid0: degrade all mirror legs prior to calling
    ->reshape_super
3/ mdadm::Grow_reshape(): flushes the metadata update (via
   flush_metadata_update(), or ->sync_metadata())
4/ mdadm::Grow_reshape(): posts the new level to the kernel

2.4 Reshape chunk, layout

2.5 Reshape raid disks (grow)

1/ mdadm::Grow_reshape(): unconditionally initializes st->update_tail
   because only redundant raid levels can modify the number of raid disks
2/ mdadm::Grow_reshape(): calls ->reshape_super() to check that the
   change of raid disks is allowed (being performed at the proper scope /
   permissible geometry / proper spares available in the container),
   chooses the spares to use, and prepares a metadata update.
3/ mdadm::Grow_reshape(): converts each subarray in the container to the
   raid level that can perform the reshape and starts mdmon.
4/ mdadm::Grow_reshape(): pushes the update to mdmon.
5/ mdadm::Grow_reshape(): uses container_content to find details of
   the spares and passes them to the kernel.
6/ mdadm::Grow_reshape(): gives the raid_disks update to the kernel,
   sets sync_max, sync_min, suspend_lo, suspend_hi all to zero,
   and starts the reshape by writing 'reshape' to sync_action.
7/ mdmon::monitor notices the sync_action change and tells
   managemon to check for new devices. managemon notices the new
   devices, opens the relevant sysfs files, and passes them all to
   monitor.
8/ mdadm::Grow_reshape(): calls ->manage_reshape to oversee the
   rest of the reshape.

9/ mdadm::<format>->manage_reshape(): saves data that will be overwritten
   by the kernel to either the backup file or the metadata-specific
   location, advances sync_max, waits for the reshape, pings mdmon, and
   repeats. Meanwhile mdmon::read_and_act() records checkpoints.
   Specifically (a schematic of this loop appears at the end of this
   section):

   9a/ if the 'next' stripe to be reshaped will over-write
       itself during reshape then:
       9a.1/ increase suspend_hi to cover a suitable number of
             stripes.
       9a.2/ backup those stripes safely.
       9a.3/ advance sync_max to allow those stripes to be reshaped.
       9a.4/ when sync_completed indicates that those stripes have
             been reshaped, manage_reshape must call ping_manager().
       9a.5/ when mdmon notices that sync_completed has been updated,
             it records the new checkpoint in the metadata.
       9a.6/ after the ping_manager, manage_reshape will increase
             suspend_lo to allow access to those stripes again.

   9b/ if the 'next' stripe to be reshaped will over-write unused
       space during reshape then we apply the same process as above,
       except that there is no need to back anything up.
       Note that we *do* need to keep suspend_hi progressing as
       it is not safe to write to the area-under-reshape. For
       kernel-managed metadata this protection is provided by
       ->reshape_safe, but that does not protect us in the case
       of user-space-managed metadata.

10/ mdadm::<format>->manage_reshape(): once the reshape completes, changes
    the raid level back to the nominal raid level (if necessary).

FIXME: native metadata does not have the capability to record the original
raid level in the reshape-restart case because the kernel always records
the current raid level in the metadata, whereas external metadata can
masquerade at an alternate level based on the reshape state.
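
The 9a cycle above, reduced to code, has the following shape. The
sysfs_set_num() and ping_manager() calls are real mdadm helpers; the
remaining helpers and variables are illustrative stand-ins for logic
that lives in Grow.c:progress_reshape() and the format's
->manage_reshape(), and conversions between array and per-device
offsets are elided.

    /* One backup/advance cycle per iteration; 'sra' describes the
     * member array, 'container' names the container device. */
    unsigned long long last = start;

    while (!reshape_complete(sra)) {
        unsigned long long next = last + stripes_per_backup;

        if (overwrites_live_data(last, next)) {
            sysfs_set_num(sra, NULL, "suspend_hi", next);  /* 9a.1 */
            backup_stripes(last, next);                    /* 9a.2 */
        }
        sysfs_set_num(sra, NULL, "sync_max", next);        /* 9a.3 */
        wait_for_sync_completed(sra, next);                /* 9a.4 */
        ping_manager(container);  /* mdmon checkpoints        9a.5 */
        sysfs_set_num(sra, NULL, "suspend_lo", next);      /* 9a.6 */
        last = next;
    }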

2.6 Reshape raid disks (shrink)

3 Interaction with the metadata handler

The following calls are made into the metadata handler to assist
with initiating and monitoring a 'reshape'.

1/ ->reshape_super is called quite early (after only minimal
   checks) to make sure that the metadata can record the new shape
   and any necessary transitions. It may be passed a 'container'
   or an individual array within a container, and it should notice
   the difference and act accordingly.
   When a reshape is requested against a container it is expected
   that it should be applied to every array in the container,
   however it is up to the metadata handler to determine final
   policy.

   If the reshape is supportable, the internal copy of the metadata
   should be updated, and a metadata update suitable for sending
   to mdmon should be queued.

   If the reshape will involve converting spares into array members,
   this must be recorded in the metadata too.

2/ ->container_content will be called to find out the new state
   of the array, or of all arrays in the container. Any newly
   added devices (with state==0 and raid_disk >= 0) will be added
   to the array as spares with the relevant slot number.

   It is likely that the info returned by ->container_content will
   have ->reshape_active set, ->reshape_progress set to e.g. 0, and
   the new_* fields set appropriately. mdadm will use this
   information to cause the correct reshape to start at an
   appropriate time.

3/ ->set_array_state will be called by mdmon when the reshape has
   started and again periodically as it progresses. This should
   record ->last_checkpoint as the point to which the reshape has
   progressed. When the reshape finishes this will be called
   again and it should notice that ->curr_action is no longer
   'reshape' and so should record that the reshape has finished,
   provided 'last_checkpoint' has progressed suitably.

4/ ->manage_reshape will be called once the reshape has been set
   up in the kernel but before sync_max has been moved from 0, so
   no actual reshape will have happened.

   ->manage_reshape should call progress_reshape() to allow the
   reshape to progress, and should back up any data as indicated
   by the return value. See the documentation of that function
   for more details.
   ->manage_reshape will be called multiple times when a
   container is being reshaped, once for each member array in
   the container.
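
For reference, the four hooks above live in struct superswitch. The
grouping below is a simplified, illustrative view; the argument lists
are abbreviated and mdadm.h has the authoritative declarations.

    /* Reshape-related superswitch hooks (abbreviated signatures). */
    struct reshape_hooks {
        /* validate a requested reshape and queue a metadata update */
        int (*reshape_super)(struct supertype *st /* , new geometry */);
        /* report the (new) state of one or all member arrays */
        struct mdinfo *(*container_content)(struct supertype *st,
                                            char *subarray);
        /* record checkpoints and completion as the reshape progresses */
        int (*set_array_state)(struct active_array *a, int consistent);
        /* drive backup and sync_max for one member array */
        int (*manage_reshape)(int afd, struct mdinfo *sra
                              /* , backup details */);
    };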

The progress of the metadata is as follows:
1/ mdadm sends a metadata update to mdmon which marks the array
   as undergoing a reshape. This is set up by ->reshape_super
   and applied by ->process_update.
   For container-wide reshape, this happens once for the whole
   container.
2/ mdmon notices progress via the sysfs files and calls
   ->set_array_state to update the state periodically.
   For container-wide reshape, this happens repeatedly for
   one array, then repeatedly for the next, etc.
3/ mdmon notices when the reshape has finished and calls
   ->set_array_state to record that the reshape is complete.
   For container-wide reshape, this happens once for each
   member array.

...

[1]: Linux kernel design patterns - part 3, Neil Brown,
     https://lwn.net/Articles/336262/