.\" Copyright Neil Brown and others.
|
|
.\" This program is free software; you can redistribute it and/or modify
|
|
.\" it under the terms of the GNU General Public License as published by
|
|
.\" the Free Software Foundation; either version 2 of the License, or
|
|
.\" (at your option) any later version.
|
|
.\" See file COPYING in distribution for details.
|
|
.if n .pl 1000v
|
|
.TH MD 4
|
|
.SH NAME
|
|
md \- Multiple Device driver aka Linux Software RAID
|
|
.SH SYNOPSIS
|
|
.BI /dev/md n
|
|
.br
|
|
.BI /dev/md/ n
|
|
.br
|
|
.BR /dev/md/ name
|
|
.SH DESCRIPTION
|
|
The
|
|
.B md
|
|
driver provides virtual devices that are created from one or more
|
|
independent underlying devices. This array of devices often contains
|
|
redundancy and the devices are often disk drives, hence the acronym RAID
|
|
which stands for a Redundant Array of Independent Disks.
|
|
.PP
|
|
.B md
|
|
supports RAID levels
|
|
1 (mirroring),
|
|
4 (striped array with parity device),
|
|
5 (striped array with distributed parity information),
|
|
6 (striped array with distributed dual redundancy information), and
|
|
10 (striped and mirrored).
|
|
If some number of underlying devices fails while using one of these
|
|
levels, the array will continue to function; this number is one for
|
|
RAID levels 4 and 5, two for RAID level 6, and all but one (N-1) for
|
|
RAID level 1, and dependent on configuration for level 10.
|
|
.PP
|
|
.B md
|
|
also supports a number of pseudo RAID (non-redundant) configurations
|
|
including RAID0 (striped array), LINEAR (catenated array),
|
|
MULTIPATH (a set of different interfaces to the same device),
|
|
and FAULTY (a layer over a single device into which errors can be injected).
|
|
|
|
.SS MD METADATA
Each device in an array may have some
.I metadata
stored in the device. This metadata is sometimes called a
.BR superblock .
The metadata records information about the structure and state of the array.
This allows the array to be reliably re-assembled after a shutdown.

.B md
provides support for two different formats of metadata, and
other formats can be added.

The common format \(em known as version 0.90 \(em has
a superblock that is 4K long and is written into a 64K aligned block that
starts at least 64K and less than 128K from the end of the device
(i.e. to get the address of the superblock round the size of the
device down to a multiple of 64K and then subtract 64K).
The available size of each device is the amount of space before the
super block, so between 64K and 128K is lost when a device is
incorporated into an MD array.
This superblock stores multi-byte fields in a processor-dependent
manner, so arrays cannot easily be moved between computers with
different processors.

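For example, the expected 0.90 superblock offset can be computed from the
device size (an illustrative sketch only; the device name is a placeholder,
and blockdev(8) reports the size in 512-byte sectors):
.nf
size_kib=$(( $(blockdev --getsz /dev/sdX) / 2 ))  # device size in KiB
sb_kib=$(( size_kib / 64 * 64 - 64 ))             # round down to 64K, subtract 64K
echo "0.90 superblock expected at ${sb_kib} KiB"
.fi
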
The new format \(em known as version 1 \(em has a superblock that is
normally 1K long, but can be longer. It is normally stored between 8K
and 12K from the end of the device, on a 4K boundary, though
variations can be stored at the start of the device (version 1.1) or 4K from
the start of the device (version 1.2).
This metadata format stores multibyte data in a
processor-independent format and supports up to hundreds of
component devices (version 0.90 only supports 28).

The metadata contains, among other things:
.TP
LEVEL
The manner in which the devices are arranged into the array
(LINEAR, RAID0, RAID1, RAID4, RAID5, RAID10, MULTIPATH).
.TP
UUID
a 128 bit Universally Unique Identifier that identifies the array that
contains this device.

.PP
When a version 0.90 array is being reshaped (e.g. adding extra devices
to a RAID5), the version number is temporarily set to 0.91. This
ensures that if the reshape process is stopped in the middle (e.g. by
a system crash) and the machine boots into an older kernel that does
not support reshaping, then the array will not be assembled (which
would cause data corruption) but will be left untouched until a kernel
that can complete the reshape process is used.

.SS ARRAYS WITHOUT METADATA
While it is usually best to create arrays with superblocks so that
they can be assembled reliably, there are some circumstances when an
array without superblocks is preferred. These include:
.TP
LEGACY ARRAYS
Early versions of the
.B md
driver only supported LINEAR and RAID0 configurations and did not use
a superblock (which is less critical with these configurations).
While such arrays should be rebuilt with superblocks if possible,
.B md
continues to support them.
.TP
FAULTY
Being a largely transparent layer over a different device, the FAULTY
personality doesn't gain anything from having a superblock.
.TP
MULTIPATH
It is often possible to detect devices which are different paths to
the same storage directly rather than having a distinctive superblock
written to the device and searched for on all paths. In this case,
a MULTIPATH array with no superblock makes sense.
.TP
RAID1
In some configurations it might be desired to create a RAID1
configuration that does not use a superblock, and to maintain the state of
the array elsewhere. While not encouraged for general use, it does
have special-purpose uses and is supported.

.SS ARRAYS WITH EXTERNAL METADATA

The
.I md
driver supports arrays with externally managed metadata. That is,
the metadata is not managed by the kernel but rather by a user-space
program which is external to the kernel. This allows support for a
variety of metadata formats without cluttering the kernel with lots of
details.
.PP
.I md
is able to communicate with the user-space program through various
sysfs attributes so that it can make appropriate changes to the
metadata \- for example to mark a device as faulty. When necessary,
.I md
will wait for the program to acknowledge the event by writing to a
sysfs attribute.
The manual page for
.IR mdmon (8)
contains more detail about this interaction.

.SS CONTAINERS
Many metadata formats use a single block of metadata to describe a
number of different arrays which all use the same set of devices.
In this case it is helpful for the kernel to know about the full set
of devices as a whole. This set is known to md as a
.IR container .
A container is an
.I md
array with externally managed metadata and with device offset and size
so that it just covers the metadata part of the devices. The
remainder of each device is available to be incorporated into various
arrays.

.SS LINEAR

A LINEAR array simply catenates the available space on each
drive to form one large virtual drive.

One advantage of this arrangement over the more common RAID0
arrangement is that the array may be reconfigured at a later time with
an extra drive, so the array is made bigger without disturbing the
data that is on the array. This can even be done on a live
array.

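For example (an illustrative sketch only; the names are placeholders and the
details are in mdadm(8)), a device can usually be appended to a running LINEAR
array with:
.nf
mdadm --grow /dev/md0 --add /dev/sdd1
.fi
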
If a chunksize is given with a LINEAR array, the usable space on each
device is rounded down to a multiple of this chunksize.

.SS RAID0

A RAID0 array (which has zero redundancy) is also known as a
striped array.
A RAID0 array is configured at creation with a
.B "Chunk Size"
which must be at least 4 kibibytes.

The RAID0 driver assigns the first chunk of the array to the first
device, the second chunk to the second device, and so on until all
drives have been assigned one chunk. This collection of chunks forms a
.BR stripe .
Further chunks are gathered into stripes in the same way, and are
assigned to the remaining space in the drives.

If devices in the array are not all the same size, then once the
smallest device has been exhausted, the RAID0 driver starts
collecting chunks into smaller stripes that only span the drives which
still have remaining space.

A bug was introduced in Linux 3.14 which changed the layout of blocks in
a RAID0 beyond the region that is striped over all devices. This bug
does not affect an array with all devices the same size, but can affect
other RAID0 arrays.

Linux 5.4 (and some stable kernels to which the change was backported)
will not normally assemble such an array as it cannot know which layout
to use. There is a module parameter "raid0.default_layout" which can be
set to "1" to force the kernel to use the pre-3.14 layout or to "2" to
force it to use the 3.14-and-later layout. When creating a new RAID0
array,
.I mdadm
will record the chosen layout in the metadata in a way that allows newer
kernels to assemble the array without needing a module parameter.

To assemble an old array on a new kernel without using the module parameter,
use either the
.B "--update=layout-original"
option or the
.B "--update=layout-alternate"
option.

Once you have updated the layout you will not be able to mount the array
on an older kernel. If you need to revert to an older kernel, the
layout information can be erased with the
.B "--update=layout-unspecified"
option. If you use this option to
.B --assemble
while running a newer kernel, the array will NOT assemble, but the
metadata will be updated so that it can be assembled on an older kernel.

Note that setting the layout to "unspecified" removes protections against
this bug, and you must be sure that the kernel you use matches the
layout of the array.

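As an illustration (a hedged sketch; the device names are examples), such an
old array can be started on a new kernel in either of two ways:
.nf
# 1. tell the kernel which layout to assume, on the kernel command line:
#      raid0.default_layout=1          (pre-3.14 layout)
# 2. record the layout in the metadata once, using mdadm(8):
mdadm --stop /dev/md0
mdadm --assemble /dev/md0 --update=layout-original /dev/sda1 /dev/sdb1
.fi
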
.SS RAID1

A RAID1 array is also known as a mirrored set (though mirrors tend to
provide reflected images, which RAID1 does not) or a plex.

Once initialised, each device in a RAID1 array contains exactly the
same data. Changes are written to all devices in parallel. Data is
read from any one device. The driver attempts to distribute read
requests across all devices to maximise performance.

All devices in a RAID1 array should be the same size. If they are
not, then only the amount of space available on the smallest device is
used (any extra space on other devices is wasted).

Note that the read balancing done by the driver does not make the RAID1
performance profile the same as that of RAID0; a single stream of
sequential input will not be accelerated (e.g. a single dd), but
multiple sequential streams or a random workload will use more than one
spindle. In theory, having an N-disk RAID1 will allow N sequential
threads to read from all disks.

Individual devices in a RAID1 can be marked as "write-mostly".
These drives are excluded from the normal read balancing and will only
be read from when there is no other option. This can be useful for
devices connected over a slow link.

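For example (an illustrative sketch; the device names are placeholders), a
slow member can typically be flagged at creation time with mdadm(8):
.nf
mdadm --create /dev/md0 --level=1 --raid-devices=2 /dev/fast1 --write-mostly /dev/slow1
.fi
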
.SS RAID4

A RAID4 array is like a RAID0 array with an extra device for storing
parity. This device is the last of the active devices in the
array. Unlike RAID0, RAID4 also requires that all stripes span all
drives, so extra space on devices that are larger than the smallest is
wasted.

When any block in a RAID4 array is modified, the parity block for that
stripe (i.e. the block in the parity device at the same device offset
as the stripe) is also modified so that the parity block always
contains the "parity" for the whole stripe. I.e. its content is
equivalent to the result of performing an exclusive-or operation
between all the data blocks in the stripe.

This allows the array to continue to function if one device fails.
The data that was on that device can be calculated as needed from the
parity block and the other data blocks.

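A toy illustration of that exclusive-or relationship, using single byte
values rather than whole blocks (a hedged sketch, not how md stores anything):
.nf
d1=0x5a; d2=0xc3; d3=0x0f                  # three "data blocks"
parity=$(( d1 ^ d2 ^ d3 ))                 # parity block, here 0x96
echo $(( parity ^ d1 ^ d3 ))               # rebuilds d2: prints 195 (0xc3)
.fi
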
.SS RAID5

RAID5 is very similar to RAID4. The difference is that the parity
blocks for each stripe, instead of being on a single device, are
distributed across all devices. This allows more parallelism when
writing, as two different block updates will quite possibly affect
parity blocks on different devices so there is less contention.

This also allows more parallelism when reading, as read requests are
distributed over all the devices in the array instead of all but one.

.SS RAID6

RAID6 is similar to RAID5, but can handle the loss of any \fItwo\fP
devices without data loss. Accordingly, it requires N+2 drives to
store N drives worth of data.

The performance for RAID6 is slightly lower but comparable to RAID5 in
normal mode and single disk failure mode. It is very slow in dual
disk failure mode, however.

.SS RAID10

RAID10 provides a combination of RAID1 and RAID0, and is sometimes known
as RAID1+0. Every datablock is duplicated some number of times, and
the resulting collection of datablocks are distributed over multiple
drives.

When configuring a RAID10 array, it is necessary to specify the number
of replicas of each data block that are required (this will usually
be\ 2) and whether their layout should be "near", "far" or "offset".

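These choices are normally made at creation time, for example with mdadm(8)
(an illustrative sketch; the --layout values n2, f2 and o2 request the
"near", "far" and "offset" layouts with 2 copies, and the device names are
placeholders):
.nf
mdadm --create /dev/md0 --level=10 --raid-devices=4 --layout=n2 /dev/sd[abcd]1
.fi
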
.B About the RAID10 Layout Examples:
.br
The examples below visualise the chunk distribution on the underlying
devices for the respective layout.

For simplicity it is assumed that the size of the chunks equals the
size of the blocks of the underlying devices as well as those of the
RAID10 device exported by the kernel (for example \fB/dev/md/\fPname).
.br
Therefore the chunks\ /\ chunk numbers map directly to the blocks\ /\
block addresses of the exported RAID10 device.

Decimal numbers (0,\ 1, 2,\ ...) are the chunks of the RAID10 and due
to the above assumption also the blocks and block addresses of the
exported RAID10 device.
.br
Repeated numbers mean copies of a chunk\ /\ block (obviously on
different underlying devices).
.br
Hexadecimal numbers (0x00,\ 0x01, 0x02,\ ...) are the block addresses
of the underlying devices.

.TP
\fB "near" Layout\fP
When "near" replicas are chosen, the multiple copies of a given chunk are laid
out consecutively ("as close to each other as possible") across the stripes of
the array.

With an even number of devices, they will likely (unless some misalignment is
present) lie at the very same offset on the different devices.
.br
This is the "classic" RAID1+0; that is, two groups of mirrored devices (in the
example below the groups Device\ #1\ /\ #2 and Device\ #3\ /\ #4 are each a
RAID1) both in turn forming a striped RAID0.

.ne 10
.B Example with 2\ copies per chunk and an even number\ (4) of devices:
.TS
tab(;);
C - - - -
C | C | C | C | C |
| - | - | - | - | - |
| C | C | C | C | C |
| C | C | C | C | C |
| C | C | C | C | C |
| C | C | C | C | C |
| C | C | C | C | C |
| C | C | C | C | C |
| - | - | - | - | - |
C C S C S
C C S C S
C C S S S
C C S S S.
;
;Device #1;Device #2;Device #3;Device #4
0x00;0;0;1;1
0x01;2;2;3;3
\.\.\.;\.\.\.;\.\.\.;\.\.\.;\.\.\.
:;:;:;:;:
\.\.\.;\.\.\.;\.\.\.;\.\.\.;\.\.\.
0x80;254;254;255;255
;\\---------v---------/;\\---------v---------/
;RAID1;RAID1
;\\---------------------v---------------------/
;RAID0
.TE

.ne 10
.B Example with 2\ copies per chunk and an odd number\ (5) of devices:
.TS
tab(;);
C - - - - -
C | C | C | C | C | C |
| - | - | - | - | - | - |
| C | C | C | C | C | C |
| C | C | C | C | C | C |
| C | C | C | C | C | C |
| C | C | C | C | C | C |
| C | C | C | C | C | C |
| C | C | C | C | C | C |
| - | - | - | - | - | - |
C.
;
;Dev #1;Dev #2;Dev #3;Dev #4;Dev #5
0x00;0;0;1;1;2
0x01;2;3;3;4;4
\.\.\.;\.\.\.;\.\.\.;\.\.\.;\.\.\.;\.\.\.
:;:;:;:;:;:
\.\.\.;\.\.\.;\.\.\.;\.\.\.;\.\.\.;\.\.\.
0x80;317;318;318;319;319
;
.TE

.TP
\fB "far" Layout\fP
When "far" replicas are chosen, the multiple copies of a given chunk
are laid out quite distant ("as far as reasonably possible") from each
other.

First a complete sequence of all data blocks (that is all the data one
sees on the exported RAID10 block device) is striped over the
devices. Then another (though "shifted") complete sequence of all data
blocks; and so on (in the case of more than 2\ copies per chunk).

The "shift" needed to prevent placing copies of the same chunks on the
same devices is actually a cyclic permutation with offset\ 1 of each
of the stripes within a complete sequence of chunks.
.br
The offset\ 1 is relative to the previous complete sequence of chunks,
so in case of more than 2\ copies per chunk one gets the following
offsets:
.br
1.\ complete sequence of chunks: offset\ =\ \ 0
.br
2.\ complete sequence of chunks: offset\ =\ \ 1
.br
3.\ complete sequence of chunks: offset\ =\ \ 2
.br
:
.br
n.\ complete sequence of chunks: offset\ =\ n-1

.ne 10
.B Example with 2\ copies per chunk and an even number\ (4) of devices:
.TS
tab(;);
C - - - -
C | C | C | C | C |
| - | - | - | - | - |
| C | C | C | C | C | L
| C | C | C | C | C | L
| C | C | C | C | C | L
| C | C | C | C | C | L
| C | C | C | C | C | L
| C | C | C | C | C | L
| C | C | C | C | C | L
| C | C | C | C | C | L
| C | C | C | C | C | L
| C | C | C | C | C | L
| C | C | C | C | C | L
| C | C | C | C | C | L
| - | - | - | - | - |
C.
;
;Device #1;Device #2;Device #3;Device #4
;
0x00;0;1;2;3;\\
0x01;4;5;6;7;> [#]
\.\.\.;\.\.\.;\.\.\.;\.\.\.;\.\.\.;:
:;:;:;:;:;:
\.\.\.;\.\.\.;\.\.\.;\.\.\.;\.\.\.;:
0x40;252;253;254;255;/
0x41;3;0;1;2;\\
0x42;7;4;5;6;> [#]~
\.\.\.;\.\.\.;\.\.\.;\.\.\.;\.\.\.;:
:;:;:;:;:;:
\.\.\.;\.\.\.;\.\.\.;\.\.\.;\.\.\.;:
0x80;255;252;253;254;/
;
.TE

.ne 10
.B Example with 2\ copies per chunk and an odd number\ (5) of devices:
.TS
tab(;);
C - - - - -
C | C | C | C | C | C |
| - | - | - | - | - | - |
| C | C | C | C | C | C | L
| C | C | C | C | C | C | L
| C | C | C | C | C | C | L
| C | C | C | C | C | C | L
| C | C | C | C | C | C | L
| C | C | C | C | C | C | L
| C | C | C | C | C | C | L
| C | C | C | C | C | C | L
| C | C | C | C | C | C | L
| C | C | C | C | C | C | L
| C | C | C | C | C | C | L
| C | C | C | C | C | C | L
| - | - | - | - | - | - |
C.
;
;Dev #1;Dev #2;Dev #3;Dev #4;Dev #5
;
0x00;0;1;2;3;4;\\
0x01;5;6;7;8;9;> [#]
\.\.\.;\.\.\.;\.\.\.;\.\.\.;\.\.\.;\.\.\.;:
:;:;:;:;:;:;:
\.\.\.;\.\.\.;\.\.\.;\.\.\.;\.\.\.;\.\.\.;:
0x40;315;316;317;318;319;/
0x41;4;0;1;2;3;\\
0x42;9;5;6;7;8;> [#]~
\.\.\.;\.\.\.;\.\.\.;\.\.\.;\.\.\.;\.\.\.;:
:;:;:;:;:;:;:
\.\.\.;\.\.\.;\.\.\.;\.\.\.;\.\.\.;\.\.\.;:
0x80;319;315;316;317;318;/
;
.TE

With [#]\ being the complete sequence of chunks and [#]~\ the cyclic permutation
with offset\ 1 thereof (in the case of more than 2 copies per chunk there would
be ([#]~)~,\ (([#]~)~)~,\ ...).

The advantage of this layout is that MD can easily spread sequential reads over
the devices, making them similar to RAID0 in terms of speed.
.br
The cost is more seeking for writes, making them substantially slower.

.TP
\fB"offset" Layout\fP
When "offset" replicas are chosen, all the copies of a given chunk are
striped consecutively ("offset by the stripe length after each other")
over the devices.

Explained in detail, <number of devices> consecutive chunks are
striped over the devices, immediately followed by a "shifted" copy of
these chunks (and by further such "shifted" copies in the case of more
than 2\ copies per chunk).
.br
This pattern repeats for all further consecutive chunks of the
exported RAID10 device (in other words: all further data blocks).

The "shift" needed to prevent placing copies of the same chunks on the
same devices is actually a cyclic permutation with offset\ 1 of each
of the striped copies of <number of devices> consecutive chunks.
.br
The offset\ 1 is relative to the previous striped copy of <number of
devices> consecutive chunks, so in case of more than 2\ copies per
chunk one gets the following offsets:
.br
1.\ <number of devices> consecutive chunks: offset\ =\ \ 0
.br
2.\ <number of devices> consecutive chunks: offset\ =\ \ 1
.br
3.\ <number of devices> consecutive chunks: offset\ =\ \ 2
.br
:
.br
n.\ <number of devices> consecutive chunks: offset\ =\ n-1

.ne 10
.B Example with 2\ copies per chunk and an even number\ (4) of devices:
.TS
tab(;);
C - - - -
C | C | C | C | C |
| - | - | - | - | - |
| C | C | C | C | C | L
| C | C | C | C | C | L
| C | C | C | C | C | L
| C | C | C | C | C | L
| C | C | C | C | C | L
| C | C | C | C | C | L
| C | C | C | C | C | L
| C | C | C | C | C | L
| C | C | C | C | C | L
| - | - | - | - | - |
C.
;
;Device #1;Device #2;Device #3;Device #4
;
0x00;0;1;2;3;) AA
0x01;3;0;1;2;) AA~
0x02;4;5;6;7;) AB
0x03;7;4;5;6;) AB~
\.\.\.;\.\.\.;\.\.\.;\.\.\.;\.\.\.;) \.\.\.
:;:;:;:;:; :
\.\.\.;\.\.\.;\.\.\.;\.\.\.;\.\.\.;) \.\.\.
0x79;251;252;253;254;) EX
0x80;254;251;252;253;) EX~
;
.TE

.ne 10
.B Example with 2\ copies per chunk and an odd number\ (5) of devices:
.TS
tab(;);
C - - - - -
C | C | C | C | C | C |
| - | - | - | - | - | - |
| C | C | C | C | C | C | L
| C | C | C | C | C | C | L
| C | C | C | C | C | C | L
| C | C | C | C | C | C | L
| C | C | C | C | C | C | L
| C | C | C | C | C | C | L
| C | C | C | C | C | C | L
| C | C | C | C | C | C | L
| C | C | C | C | C | C | L
| - | - | - | - | - | - |
C.
;
;Dev #1;Dev #2;Dev #3;Dev #4;Dev #5
;
0x00;0;1;2;3;4;) AA
0x01;4;0;1;2;3;) AA~
0x02;5;6;7;8;9;) AB
0x03;9;5;6;7;8;) AB~
\.\.\.;\.\.\.;\.\.\.;\.\.\.;\.\.\.;\.\.\.;) \.\.\.
:;:;:;:;:;:; :
\.\.\.;\.\.\.;\.\.\.;\.\.\.;\.\.\.;\.\.\.;) \.\.\.
0x79;314;315;316;317;318;) EX
0x80;318;314;315;316;317;) EX~
;
.TE

With AA,\ AB,\ ..., AZ,\ BA,\ ... being the sets of <number of devices> consecutive
chunks and AA~,\ AB~,\ ..., AZ~,\ BA~,\ ... the cyclic permutations with offset\ 1
thereof (in the case of more than 2 copies per chunk there would be (AA~)~,\ ...
as well as ((AA~)~)~,\ ... and so on).

This should give similar read characteristics to "far" if a suitably large chunk
size is used, but without as much seeking for writes.
.PP

It should be noted that the number of devices in a RAID10 array need
not be a multiple of the number of replicas of each data block; however,
there must be at least as many devices as replicas.

If, for example, an array is created with 5 devices and 2 replicas,
then space equivalent to 2.5 of the devices will be available, and
every block will be stored on two different devices.

Finally, it is possible to have an array with both "near" and "far"
copies. If an array is configured with 2 near copies and 2 far
copies, then there will be a total of 4 copies of each block, each on
a different drive. This is an artifact of the implementation and is
unlikely to be of real value.

.SS MULTIPATH

MULTIPATH is not really a RAID at all as there is only one real device
in a MULTIPATH md array. However there are multiple access points
(paths) to this device, and one of these paths might fail, so there
are some similarities.

A MULTIPATH array is composed of a number of logically different
devices, often fibre channel interfaces, that all refer to the same
real device. If one of these interfaces fails (e.g. due to cable
problems), the MULTIPATH driver will attempt to redirect requests to
another interface.

The MULTIPATH driver is not receiving any ongoing development and
should be considered a legacy driver. The device-mapper based
multipath drivers should be preferred for new installations.

.SS FAULTY
The FAULTY md module is provided for testing purposes. A FAULTY array
has exactly one component device and is normally assembled without a
superblock, so the md array created provides direct access to all of
the data in the component device.

The FAULTY module may be requested to simulate faults to allow testing
of other md levels or of filesystems. Faults can be chosen to trigger
on read requests or write requests, and can be transient (a subsequent
read/write at the address will probably succeed) or persistent
(subsequent read/write of the same address will fail). Further, read
faults can be "fixable" meaning that they persist until a write
request at the same address.

Fault types can be requested with a period. In this case, the fault
will recur repeatedly after the given number of requests of the
relevant type. For example if persistent read faults have a period of
100, then every 100th read request would generate a fault, and the
faulty sector would be recorded so that subsequent reads on that
sector would also fail.

There is a limit to the number of faulty sectors that are remembered.
Faults generated after this limit is exhausted are treated as
transient.

The list of faulty sectors can be flushed, and the active list of
failure modes can be cleared.

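For example (an illustrative sketch only; the device name is a placeholder),
a FAULTY array is typically created and driven from user space with mdadm(8):
.nf
mdadm --create /dev/md9 --level=faulty --raid-devices=1 /dev/sdX1
# failure modes and periods are then selected with "mdadm --grow --layout=..."
# (see the FAULTY notes in mdadm(8))
.fi
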
.SS UNCLEAN SHUTDOWN

When changes are made to a RAID1, RAID4, RAID5, RAID6, or RAID10 array
there is a possibility of inconsistency for short periods of time as
each update requires at least two blocks to be written to different
devices, and these writes probably won't happen at exactly the same
time. Thus if a system with one of these arrays is shut down in the
middle of a write operation (e.g. due to power failure), the array may
not be consistent.

To handle this situation, the md driver marks an array as "dirty"
before writing any data to it, and marks it as "clean" when the array
is being disabled, e.g. at shutdown. If the md driver finds an array
to be dirty at startup, it proceeds to correct any possible
inconsistency. For RAID1, this involves copying the contents of the
first drive onto all other drives. For RAID4, RAID5 and RAID6 this
involves recalculating the parity for each stripe and making sure that
the parity block has the correct data. For RAID10 it involves copying
one of the replicas of each block onto all the others. This process,
known as "resynchronising" or "resync", is performed in the background.
The array can still be used, though possibly with reduced performance.

If a RAID4, RAID5 or RAID6 array is degraded (missing at least one
drive, two for RAID6) when it is restarted after an unclean shutdown, it cannot
recalculate parity, and so it is possible that data might be
undetectably corrupted. The md driver will fail to
start an array in this condition without manual intervention, though
this behaviour can be overridden by a kernel parameter.

.SS RECOVERY

If the md driver detects a write error on a device in a RAID1, RAID4,
RAID5, RAID6, or RAID10 array, it immediately disables that device
(marking it as faulty) and continues operation on the remaining
devices. If there are spare drives, the driver will start recreating
on one of the spare drives the data which was on that failed drive,
either by copying a working drive in a RAID1 configuration, or by
doing calculations with the parity block on RAID4, RAID5 or RAID6, or
by finding and copying originals for RAID10.

A read error will cause md to attempt a recovery by overwriting the bad
block. That is, it will find the correct data from elsewhere, write it
over the block that failed, and then try to read it back again. If
either the write or the re-read fails, md will treat the error the same
way that a write error is treated, and will fail the whole device.

While this recovery process is happening, the md driver will monitor
accesses to the array and will slow down the rate of recovery if other
activity is happening, so that normal access to the array will not be
unduly affected. When no other activity is happening, the recovery
process proceeds at full speed. The actual speed targets for the two
different situations can be controlled by the
.B speed_limit_min
and
.B speed_limit_max
control files mentioned below.

.SS SCRUBBING AND MISMATCHES

As storage devices can develop bad blocks at any time it is valuable
to regularly read all blocks on all devices in an array so as to catch
such bad blocks early. This process is called
.IR scrubbing .

md arrays can be scrubbed by writing either
.I check
or
.I repair
to the file
.I md/sync_action
in the
.I sysfs
directory for the device.

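For example (an illustrative sketch, assuming the array is md0):
.nf
echo check  > /sys/block/md0/md/sync_action    # scrub and count mismatches only
cat /sys/block/md0/md/mismatch_cnt             # inspect the result afterwards
echo repair > /sys/block/md0/md/sync_action    # scrub and correct mismatches
.fi
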
Requesting a scrub will cause
.I md
to read every block on every device in the array, and check that the
data is consistent. For RAID1 and RAID10, this means checking that the copies
are identical. For RAID4, RAID5, RAID6 this means checking that the
parity block is (or blocks are) correct.

If a read error is detected during this process, the normal read-error
handling causes correct data to be found from other devices and to be
written back to the faulty device. In many cases this will
effectively
.I fix
the bad block.

If all blocks read successfully but are found to not be consistent,
then this is regarded as a
.IR mismatch .

If
.I check
was used, then no action is taken to handle the mismatch; it is simply
recorded.
If
.I repair
was used, then a mismatch will be repaired in the same way that
.I resync
repairs arrays. For RAID5/RAID6 new parity blocks are written. For RAID1/RAID10,
all but one block are overwritten with the content of that one block.

A count of mismatches is recorded in the
.I sysfs
file
.IR md/mismatch_cnt .
This is set to zero when a
scrub starts and is incremented whenever a sector is
found that is a mismatch.
.I md
normally works in units much larger than a single sector and when it
finds a mismatch, it does not determine exactly how many actual sectors were
affected but simply adds the number of sectors in the IO unit that was
used. So a value of 128 could simply mean that a single 64KB check
found an error (128 x 512 bytes = 64KB).

If an array is created by
.I mdadm
with
.I \-\-assume\-clean
then a subsequent check could be expected to find some mismatches.

On a truly clean RAID5 or RAID6 array, any mismatches should indicate
a hardware problem at some level - software issues should never cause
such a mismatch.

However on RAID1 and RAID10 it is possible for software issues to
cause a mismatch to be reported. This does not necessarily mean that
the data on the array is corrupted. It could simply be that the
system does not care what is stored on that part of the array - it is
unused space.

The most likely cause for an unexpected mismatch on RAID1 or RAID10
occurs if a swap partition or swap file is stored on the array.

When the swap subsystem wants to write a page of memory out, it flags
the page as 'clean' in the memory manager and requests the swap device
to write it out. It is quite possible that the memory will be
changed while the write-out is happening. In that case the 'clean'
flag will be found to be clear when the write completes and so the
swap subsystem will simply forget that the swapout had been attempted,
and will possibly choose a different page to write out.

If the swap device was on RAID1 (or RAID10), then the data is sent
from memory to a device twice (or more depending on the number of
devices in the array). Thus it is possible that the memory gets changed
between the times it is sent, so different data can be written to
the different devices in the array. This will be detected by
.I check
as a mismatch. However it does not reflect any corruption as the
block where this mismatch occurs is being treated by the swap system as
being empty, and the data will never be read from that block.

It is conceivable for a similar situation to occur on non-swap files,
though it is less likely.

Thus the
.I mismatch_cnt
value cannot be interpreted very reliably on RAID1 or RAID10,
especially when the device is used for swap.

.SS BITMAP WRITE-INTENT LOGGING

.I md
supports a bitmap based write-intent log. If configured, the bitmap
is used to record which blocks of the array may be out of sync.
Before any write request is honoured, md will make sure that the
corresponding bit in the log is set. After a period of time with no
writes to an area of the array, the corresponding bit will be cleared.

This bitmap is used for two optimisations.

Firstly, after an unclean shutdown, the resync process will consult
the bitmap and only resync those blocks that correspond to bits in the
bitmap that are set. This can dramatically reduce resync time.

Secondly, when a drive fails and is removed from the array, md stops
clearing bits in the intent log. If that same drive is re-added to
the array, md will notice and will only recover the sections of the
drive that are covered by bits in the intent log that are set. This
can allow a device to be temporarily removed and reinserted without
causing an enormous recovery cost.

The intent log can be stored in a file on a separate device, or it can
be stored near the superblocks of an array which has superblocks.

It is possible to add an intent log to an active array, or remove an
intent log if one is present.

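For example (an illustrative sketch; the device name is a placeholder), an
internal bitmap can be added to or removed from a live array with mdadm(8):
.nf
mdadm --grow /dev/md0 --bitmap=internal
mdadm --grow /dev/md0 --bitmap=none
.fi
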
All RAID levels with redundancy are supported.

.SS BAD BLOCK LIST

Each device in an
.I md
array can store a list of known-bad-blocks. This list is 4K in size
and usually positioned at the end of the space between the superblock
and the data.

When a block cannot be read and cannot be repaired by writing data
recovered from other devices, the address of the block is stored in
the bad block list. Similarly if an attempt to write a block fails,
the address will be recorded as a bad block. If attempting to record
the bad block fails, the whole device will be marked faulty.

Attempting to read from a known bad block will cause a read error.
Attempting to write to a known bad block will be ignored if any write
errors have been reported by the device. If there have been no write
errors then the data will be written to the known bad block and if
that succeeds, the address will be removed from the list.

This allows an array to fail more gracefully - a few blocks on
different devices can be faulty without taking the whole array out of
action.

The list is particularly useful when recovering to a spare. If a few blocks
cannot be read from the other devices, the bulk of the recovery can
complete and those few bad blocks will be recorded in the bad block list.

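The recorded bad blocks of each member can be inspected through sysfs, for
example (an illustrative sketch; the array and member names are placeholders,
see Documentation/admin-guide/md.rst):
.nf
cat /sys/block/md0/md/dev-sda1/bad_blocks
.fi
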
.SS RAID WRITE HOLE

Due to the non-atomic nature of RAID write operations,
an interruption of write operations (system crash, etc.) to a RAID456
array can lead to inconsistent parity and data loss (the so-called
RAID-5 write hole).
To plug the write hole, md supports the two mechanisms described below.

.TP
DIRTY STRIPE JOURNAL
From Linux 4.4, md supports a write-ahead journal for RAID456.
When the array is created, an additional journal device can be added to
the array through the write-journal option. The RAID write journal works
similarly to file system journals. Before writing to the data
disks, md persists data AND parity of the stripe to the journal
device. After crashes, md searches the journal device for
incomplete write operations, and replays them to the data disks.

When the journal device fails, the RAID array is forced to run in
read-only mode.

.TP
PARTIAL PARITY LOG
From Linux 4.12, md supports a Partial Parity Log (PPL) for RAID5 arrays only.
Partial parity for a write operation is the XOR of stripe data chunks not
modified by the write. PPL is stored in the metadata region of the RAID member
drives, so no additional journal drive is needed.
After crashes, if one of the unmodified data disks of
the stripe is missing, this updated parity can be used to recover its data.

See Documentation/driver-api/md/raid5-ppl.rst for implementation details.

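.PP
Both mechanisms can be selected when the array is created, for example with
mdadm(8) (an illustrative sketch; device names are placeholders):
.nf
mdadm --create /dev/md0 --level=5 --raid-devices=4 /dev/sd[abcd]1 --write-journal /dev/nvme0n1p1
mdadm --create /dev/md1 --level=5 --raid-devices=3 --consistency-policy=ppl /dev/sd[efg]1
.fi
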
.SS WRITE-BEHIND

This allows certain devices in the array to be flagged as
.IR write-mostly .
MD will only read from such devices if there is no
other option.

If a write-intent bitmap is also provided, write requests to
write-mostly devices will be treated as write-behind requests and md
will not wait for writes to those devices to complete before
reporting the write as complete to the filesystem.

This allows for a RAID1 with WRITE-BEHIND to be used to mirror data
over a slow link to a remote computer (providing the link isn't too
slow). The extra latency of the remote link will not slow down normal
operations, but the remote system will still have a reasonably
up-to-date copy of all data.

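For example (an illustrative sketch; device names are placeholders), such a
mirror could be created with mdadm(8), combining a bitmap, a write-mostly
member and a write-behind limit:
.nf
mdadm --create /dev/md0 --level=1 --raid-devices=2 --bitmap=internal --write-behind=256 /dev/local1 --write-mostly /dev/remote1
.fi
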
.SS FAILFAST

From Linux 4.10,
.I md
supports FAILFAST for RAID1 and RAID10 arrays. This is a flag that
can be set on individual drives, though it is usually set on all
drives, or no drives.

When
.I md
sends an I/O request to a drive that is marked as FAILFAST, and when
the array could survive the loss of that drive without losing data,
.I md
will request that the underlying device does not perform any retries.
This means that a failure will be reported to
.I md
promptly, and it can mark the device as faulty and continue using the
other device(s).
.I md
cannot control the timeout that the underlying devices use to
determine failure. Any changes desired to that timeout must be set
explicitly on the underlying device, separately from using
.IR mdadm .

If a FAILFAST request does fail, and if it is still safe to mark the
device as faulty without data loss, that will be done and the array
will continue functioning on a reduced number of devices. If it is not
possible to safely mark the device as faulty,
.I md
will retry the request without disabling retries in the underlying
device. In any case,
.I md
will not attempt to repair read errors on a device marked as FAILFAST
by writing out the correct data. It will just mark the device as faulty.

FAILFAST is appropriate for storage arrays that have a low probability
of true failure, but will sometimes introduce unacceptable delays to
I/O requests while performing internal maintenance. The value of
setting FAILFAST involves a trade-off. The gain is that the chance of
unacceptable delays is substantially reduced. The cost is that the
unlikely event of data-loss on one device is slightly more likely to
result in data-loss for the array.

When a device in an array using FAILFAST is marked as faulty, it will
usually become usable again in a short while.
.I mdadm
makes no attempt to detect that possibility. Some separate
mechanism, tuned to the specific details of the expected failure modes,
needs to be created to monitor devices to see when they return to full
functionality, and to then re-add them to the array. In order for
this "re-add" functionality to be effective, an array using FAILFAST
should always have a write-intent bitmap.

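For example (an illustrative sketch; device names are placeholders), the flag
is usually set when the array is created, and a recovered device is later
returned with a re-add:
.nf
mdadm --create /dev/md0 --level=1 --raid-devices=2 --bitmap=internal --failfast /dev/sda1 /dev/sdb1
mdadm /dev/md0 --re-add /dev/sdb1      # once the device is healthy again
.fi
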
.SS RESTRIPING

.IR Restriping ,
also known as
.IR Reshaping ,
is the process of re-arranging the data stored in each stripe into a
new layout. This might involve changing the number of devices in the
array (so the stripes are wider), changing the chunk size (so stripes
are deeper or shallower), or changing the arrangement of data and
parity (possibly changing the RAID level, e.g. 1 to 5 or 5 to 6).

.I md
can reshape a RAID4, RAID5, or RAID6 array to
have a different number of devices (more or fewer) and to have a
different layout or chunk size. It can also convert between these
different RAID levels. It can also convert between RAID0 and RAID10,
and between RAID0 and RAID4 or RAID5.
Other possibilities may follow in future kernels.

During any restriping process there is a 'critical section' during which
live data is being overwritten on disk. For the operation of
increasing the number of drives in a RAID5, this critical section
covers the first few stripes (the number being the product of the old
and new number of devices). After this critical section is passed,
data is only written to areas of the array which no longer hold live
data \(em the live data has already been relocated away.

For a reshape which reduces the number of devices, the 'critical
section' is at the end of the reshape process.

md is not able to ensure data preservation if there is a crash
(e.g. power failure) during the critical section. If md is asked to
start an array which failed during a critical section of restriping,
it will fail to start the array.

To deal with this possibility, a user-space program must
.IP \(bu 4
disable writes to that section of the array (using the
.B sysfs
interface),
.IP \(bu 4
take a copy of the data somewhere (i.e. make a backup),
.IP \(bu 4
allow the process to continue and invalidate the backup and restore
write access once the critical section is passed, and
.IP \(bu 4
provide for restoring the critical data before restarting the array
after a system crash.
.PP

.B mdadm
does this for growing a RAID5 array.

For operations that do not change the size of the array, like simply
increasing chunk size, or converting RAID5 to RAID6 with one extra
device, the entire process is the critical section. In this case, the
restripe will need to progress in stages, as a section is suspended,
backed up, restriped, and released.

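For example (an illustrative sketch; names are placeholders), growing a RAID5
from 4 to 5 devices with mdadm(8), where the backup file protects the critical
section across a crash:
.nf
mdadm /dev/md0 --add /dev/sde1
mdadm --grow /dev/md0 --raid-devices=5 --backup-file=/root/md0-grow.backup
.fi
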
.SS SYSFS INTERFACE
Each block device appears as a directory in
.I sysfs
(which is usually mounted at
.BR /sys ).
For MD devices, this directory will contain a subdirectory called
.B md
which contains various files for providing access to information about
the array.

This interface is documented more fully in the file
.B Documentation/admin-guide/md.rst
which is distributed with the kernel sources. That file should be
consulted for full documentation. The following are just a selection
of attribute files that are available.

.TP
.B md/sync_speed_min
This value, if set, overrides the system-wide setting in
.B /proc/sys/dev/raid/speed_limit_min
for this array only.
Writing the value
.B "system"
to this file will cause the system-wide setting to have effect.

.TP
.B md/sync_speed_max
This is the partner of
.B md/sync_speed_min
and overrides
.B /proc/sys/dev/raid/speed_limit_max
described below.

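For example (an illustrative sketch, assuming the array is md0):
.nf
echo 50000  > /sys/block/md0/md/sync_speed_min   # per-array floor, KiB/s per device
echo system > /sys/block/md0/md/sync_speed_max   # revert to the system-wide limit
.fi
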
.TP
.B md/sync_action
This can be used to monitor and control the resync/recovery process of
MD.
In particular, writing "check" here will cause the array to read all
data blocks and check that they are consistent (e.g. parity is correct,
or all mirror replicas are the same). Any discrepancies found are
.B NOT
corrected.

A count of problems found will be stored in
.BR md/mismatch_cnt .

Alternatively, "repair" can be written which will cause the same check
to be performed, but any errors will be corrected.

Finally, "idle" can be written to stop the check/repair process.

.TP
.B md/stripe_cache_size
This is only available on RAID5 and RAID6. It records the size (in
pages per device) of the stripe cache which is used for synchronising
all write operations to the array and all read operations if the array
is degraded. The default is 256. Valid values are 17 to 32768.
Increasing this number can increase performance in some situations, at
some cost in system memory. Note, setting this value too high can
result in an "out of memory" condition for the system.

memory_consumed = system_page_size * nr_disks * stripe_cache_size

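As an illustrative calculation, with 4096-byte pages, 4 member disks and the
default stripe_cache_size of 256:
.nf
4096 * 4 * 256 = 4194304 bytes, i.e. 4 MiB of stripe cache
.fi
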
.TP
.B md/preread_bypass_threshold
This is only available on RAID5 and RAID6. This variable sets the
number of times MD will service a full-stripe-write before servicing a
stripe that requires some "prereading". For fairness this defaults to
1. Valid values are 0 to stripe_cache_size. Setting this to 0
maximizes sequential-write throughput at the cost of fairness to threads
doing small or random writes.

.TP
.B md/bitmap/backlog
The value stored in the file only has any effect on RAID1 when write-mostly
devices are active, and write requests to those devices proceed in the
background.

This variable sets a limit on the number of concurrent background writes.
The valid values are 0 to 16383; 0 means that write-behind is not allowed,
while any other number means it can happen. If there are more write requests
than the number, new writes will be synchronous.

.TP
.B md/bitmap/can_clear
This is for externally managed bitmaps, where the kernel writes the bitmap
itself, but metadata describing the bitmap is managed by mdmon or similar.

When the array is degraded, bits mustn't be cleared. When the array becomes
optimal again, bits can be cleared, but first the metadata needs to record
the current event count. So md sets this to 'false' and notifies mdmon,
then mdmon updates the metadata and writes 'true'.

There is no code in mdmon to actually do this, so maybe it doesn't even
work.

.TP
.B md/bitmap/chunksize
The bitmap chunksize can only be changed when no bitmap is active, and
the value should be a power of 2 and at least 512.

.TP
.B md/bitmap/location
This indicates where the write-intent bitmap for the array is stored.
It can be "none" or "file" or a signed offset from the array metadata
- measured in sectors. You cannot set a file by writing here - that can
only be done with the SET_BITMAP_FILE ioctl.

Writing 'none' to 'bitmap/location' will clear the bitmap, and the previous
location value must be written to it to restore the bitmap.

.TP
.B md/bitmap/max_backlog_used
This keeps track of the maximum number of concurrent write-behind requests
for an md array; writing any value to this file will clear it.

.TP
.B md/bitmap/metadata
This can be 'internal' or 'clustered' or 'external'. 'internal' is set
by default, which means the metadata for the bitmap is stored in the first 256
bytes of the bitmap space. 'clustered' means separate bitmap metadata are
used for each cluster node. 'external' means that bitmap metadata is managed
externally to the kernel.

.TP
.B md/bitmap/space
This shows the space (in sectors) which is available at md/bitmap/location,
and allows the kernel to know when it is safe to resize the bitmap to match
a resized array. It should be big enough to contain the total bytes in the bitmap.

For 1.0 metadata, assume the bitmap can use the space up to the superblock
if it is located before the superblock, otherwise up to 4K beyond the
superblock. For other metadata versions, assume no change is possible.

.TP
.B md/bitmap/time_base
This shows the time (in seconds) between disk flushes, and is used when looking
for bits in the bitmap to be cleared.

The default value is 5 seconds, and it should be an unsigned long value.

.SS KERNEL PARAMETERS

The md driver recognises several different kernel parameters.
.TP
.B raid=noautodetect
This will disable the normal detection of md arrays that happens at
boot time. If a drive is partitioned with MS-DOS style partitions,
then if any of the 4 main partitions has a partition type of 0xFD,
then that partition will normally be inspected to see if it is part of
an MD array, and if any full arrays are found, they are started. This
kernel parameter disables this behaviour.

.TP
.B md_mod.start_ro=1
.TP
.B /sys/module/md_mod/parameters/start_ro
This tells md to start all arrays in read-only mode. This is a soft
read-only that will automatically switch to read-write on the first
write request. However until that write request, nothing is written
to any device by md, and in particular, no resync or recovery
operation is started.

.TP
.B md_mod.start_dirty_degraded=1
.TP
.B /sys/module/md_mod/parameters/start_dirty_degraded
As mentioned above, md will not normally start a RAID4, RAID5, or
RAID6 that is both dirty and degraded as this situation can imply
hidden data loss. This can be awkward if the root filesystem is
affected. Using this module parameter allows such arrays to be started
at boot time. It should be understood that there is a real (though
small) risk of data corruption in this situation.

.TP
.BI md= n , dev , dev ,...
.TP
.BI md=d n , dev , dev ,...
This tells the md driver to assemble
.BI /dev/md n
from the listed devices. It is only necessary to start the device
holding the root filesystem this way. Other arrays are best started
once the system is booted.

.TP
.BI md= n , l , c , i , dev...
This tells the md driver to assemble a legacy RAID0 or LINEAR array
without a superblock.
.I n
gives the md device number,
.I l
gives the level, 0 for RAID0 or \-1 for LINEAR,
.I c
gives the chunk size as a base-2 logarithm offset by twelve, so 0
means 4K, 1 means 8K.
.I i
is ignored (legacy support).

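.PP
For example (an illustrative sketch; device names are placeholders), these
parameters could appear on the kernel command line as:
.nf
md=0,/dev/sda1,/dev/sdb1           # assemble /dev/md0 from its superblocks
md=1,0,2,0,/dev/sdc1,/dev/sdd1     # legacy RAID0, chunk size 4K << 2 = 16K
.fi
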
.SH FILES
.TP
.B /proc/mdstat
Contains information about the status of currently running arrays.
.TP
.B /proc/sys/dev/raid/speed_limit_min
A readable and writable file that reflects the current "goal" rebuild
speed for times when non-rebuild activity is current on an array.
The speed is in Kibibytes per second, and is a per-device rate, not a
per-array rate (which means that an array with more disks will shuffle
more data for a given speed). The default is 1000.

.TP
.B /proc/sys/dev/raid/speed_limit_max
A readable and writable file that reflects the current "goal" rebuild
speed for times when no non-rebuild activity is current on an array.
The default is 200,000.

.SH SEE ALSO
.BR mdadm (8),