Adding upstream version 0.17.
Signed-off-by: Daniel Baumann <daniel@debian.org>
parent bb26c2917c
commit 739f200278
29 changed files with 2935 additions and 2272 deletions
338
doc/tarlz.texi
@@ -6,8 +6,8 @@
 @finalout
 @c %**end of header
 
-@set UPDATED 8 October 2019
-@set VERSION 0.16
+@set UPDATED 30 July 2020
+@set VERSION 0.17
 
 @dircategory Data Compression
 @direntry
@@ -40,7 +40,8 @@ This manual is for Tarlz (version @value{VERSION}, @value{UPDATED}).
 * Portable character set:: POSIX portable filename character set
 * File format:: Detailed format of the compressed archive
 * Amendments to pax format:: The reasons for the differences with pax
-* Multi-threaded tar:: Limitations of parallel tar decoding
+* Program design:: Internal structure of tarlz
+* Multi-threaded decoding:: Limitations of parallel tar decoding
 * Minimum archive sizes:: Sizes required for full multi-threaded speed
 * Examples:: A small tutorial with examples
 * Problems:: Reporting bugs
@@ -48,10 +49,10 @@ This manual is for Tarlz (version @value{VERSION}, @value{UPDATED}).
 @end menu
 
 @sp 1
-Copyright @copyright{} 2013-2019 Antonio Diaz Diaz.
+Copyright @copyright{} 2013-2020 Antonio Diaz Diaz.
 
 This manual is free documentation: you have unlimited permission
-to copy, distribute and modify it.
+to copy, distribute, and modify it.
 
 
 @node Introduction
@@ -77,7 +78,8 @@ because it does not keep the members aligned.
 
 Tarlz can create tar archives with five levels of compression granularity;
 per file (---no-solid), per block (---bsolid, default), per directory
-(---dsolid), appendable solid (---asolid), and solid (---solid).
+(---dsolid), appendable solid (---asolid), and solid (---solid). It can also
+create uncompressed tar archives.
 
 @noindent
 Of course, compressing each file (or each directory) individually can't
@@ -105,16 +107,16 @@ and lziprecover can be used to recover some of the damaged members.
 @item
 A multimember tar.lz archive is usually smaller than the
 corresponding solidly compressed tar.gz archive, except when
-individually compressing files smaller than about 32 KiB.
+compressing files smaller than about 32 KiB individually.
 @end itemize
 
-Tarlz protects the extended records with a CRC in a way compatible with
-standard tar tools. @xref{crc32}.
+Tarlz protects the extended records with a Cyclic Redundancy Check (CRC) in
+a way compatible with standard tar tools. @xref{crc32}.
 
 Tarlz does not understand other tar formats like @samp{gnu}, @samp{oldgnu},
-@samp{star} or @samp{v7}. @w{@samp{tarlz -tf archive.tar.lz > /dev/null}}
-can be used to verify that the format of the archive is compatible with
-tarlz.
+@samp{star} or @samp{v7}. The command
+@w{@samp{tarlz -tf archive.tar.lz > /dev/null}} can be used to verify that
+the format of the archive is compatible with tarlz.
 
 
 @node Invoking tarlz
@@ -151,7 +153,11 @@ If several compression levels or @samp{--*solid} options are given, the last
 setting is used. For example @w{@samp{-9 --solid --uncompressed -1}} is
 equivalent to @samp{-1 --solid}
 
-tarlz supports the following options:
+tarlz supports the following
+@uref{http://www.nongnu.org/arg-parser/manual/arg_parser_manual.html#Argument-syntax,,options}:
+@ifnothtml
+@xref{Argument syntax,,,arg_parser}.
+@end ifnothtml
 
 @table @code
 @item --help
@@ -177,7 +183,7 @@ modifying the archive if no @var{files} have been specified.
 @anchor{--data-size}
 @item -B @var{bytes}
 @itemx --data-size=@var{bytes}
-Set target size of input data blocks for the @samp{--bsolid} option.
+Set target size of input data blocks for the option @samp{--bsolid}.
 @xref{--bsolid}. Valid values range from @w{8 KiB} to @w{1 GiB}. Default
 value is two times the dictionary size, except for option @samp{-0} where it
 defaults to @w{1 MiB}. @xref{Minimum archive sizes}.
@@ -210,7 +216,7 @@ standard output the differences found in type, mode (permissions), owner and
 group IDs, modification time, file size, file contents (of regular files),
 target (of symlinks) and device number (of block/character special files).
 
-As tarlz removes leading slashes from member names, the @samp{-C} option may
+As tarlz removes leading slashes from member names, the option @samp{-C} may
 be used in combination with @samp{--diff} when absolute file names were used
 on archive creation: @w{@samp{tarlz -C / -d}}. Alternatively, tarlz may be
 run from the root directory to perform the comparison.
@@ -220,14 +226,18 @@ Make @samp{--diff} ignore differences in owner and group IDs. This option is
 useful when comparing an @samp{--anonymous} archive.
 
 @item --delete
-Delete the specified files and directories from an archive in place. It
-currently can delete only from uncompressed archives and from archives with
-individually compressed files (@samp{--no-solid} archives). Note that files
-of about @samp{--data-size} or larger are compressed individually even if
+Delete files and directories from an archive in place. It currently can
+delete only from uncompressed archives and from archives with files
+compressed individually (@samp{--no-solid} archives). Note that files of
+about @samp{--data-size} or larger are compressed individually even if
 @samp{--bsolid} is used, and can therefore be deleted. Tarlz takes care to
 not delete a tar member unless it is possible to do so. For example it won't
-try to delete a tar member that is not individually compressed. To delete a
-directory without deleting the files under it, use
+try to delete a tar member that is not compressed individually. Even in the
+case of finding a corrupt member after having deleted some member(s), tarlz
+stops and copies the rest of the file as soon as corruption is found,
+leaving it just as corrupt as it was, but not worse.
+
+To delete a directory without deleting the files under it, use
 @w{@samp{tarlz --delete -f foo --exclude='dir/*' dir}}. Deleting in place
 may be dangerous. A corrupt archive, a power cut, or an I/O error may cause
 data loss.
@@ -241,14 +251,22 @@ the file name. For example, @samp{foo/*.o} matches @samp{foo/bar.o}.
 
 @item -f @var{archive}
 @itemx --file=@var{archive}
-Use archive file @var{archive}. @samp{-} used as an @var{archive} argument
-reads from standard input or writes to standard output.
+Use archive file @var{archive}. A hyphen @samp{-} used as an @var{archive}
+argument reads from standard input or writes to standard output.
 
 @item -h
 @itemx --dereference
 Follow symbolic links during archive creation, appending or comparison.
 Archive or compare the files they point to instead of the links themselves.
 
+@item --mtime=@var{date}
+When creating or appending, use @var{date} as the modification time for
+files added to the archive instead of their actual modification times. The
+value of @var{date} may be either @samp{@@} followed by the number of
+seconds since the epoch, or a date in format @w{@samp{YYYY-MM-DD HH:MM:SS}},
+or the name of an existing file starting with @samp{.} or @samp{/}. In the
+latter case, the modification time of that file is used.
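The three forms accepted by @samp{--mtime} can be told apart by their first character. The following is a minimal Python sketch of that dispatch, not tarlz's actual code: `parse_mtime` is a hypothetical helper, and reading the date form as UTC is an assumption, since the text does not say which time zone applies.

```python
import calendar
import os
import time


def parse_mtime(date: str) -> int:
    """Return seconds since the epoch for a --mtime style argument."""
    if date.startswith("@"):
        # '@' followed by the number of seconds since the epoch
        return int(date[1:])
    if date.startswith((".", "/")):
        # Name of an existing file: use that file's modification time
        return int(os.stat(date).st_mtime)
    # 'YYYY-MM-DD HH:MM:SS'; interpreted as UTC here (an assumption)
    return calendar.timegm(time.strptime(date, "%Y-%m-%d %H:%M:%S"))
```

Requiring a leading `.` or `/` for file names is what keeps a file name from being confused with a date or with the `@` form.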
+
 @item -n @var{n}
 @itemx --threads=@var{n}
 Set the number of (de)compression threads, overriding the system's default.
@@ -256,15 +274,22 @@ Valid values range from 0 to "as many as your system can support". A value
 of 0 disables threads entirely. If this option is not used, tarlz tries to
 detect the number of processors in the system and use it as default value.
 @w{@samp{tarlz --help}} shows the system's default value. See the note about
-multi-threaded archive creation in the @samp{-C} option above.
+multi-threaded archive creation in the option @samp{-C} above.
 Multi-threaded extraction of files from an archive is not yet implemented.
-@xref{Multi-threaded tar}.
+@xref{Multi-threaded decoding}.
 
 Note that the number of usable threads is limited during compression to
 @w{ceil( uncompressed_size / data_size )} (@pxref{Minimum archive sizes}),
 and during decompression to the number of lzip members in the tar.lz
 archive, which you can find by running @w{@samp{lzip -lv archive.tar.lz}}.
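The compression-side limit is a plain ceiling division. A small Python sketch of the formula follows; the function name and the sizes in the example are illustrative assumptions, not values taken from tarlz.

```python
def max_compression_threads(uncompressed_size: int, data_size: int) -> int:
    """ceil(uncompressed_size / data_size) using integer arithmetic."""
    if data_size <= 0:
        raise ValueError("data_size must be positive")
    # Negating twice turns floor division into ceiling division
    return -(-uncompressed_size // data_size)


MiB = 1 << 20
# A hypothetical 100 MiB tarball split into 16 MiB data blocks
print(max_compression_threads(100 * MiB, 16 * MiB))  # prints 7
```

With these numbers, asking for more than 7 threads would leave the extra workers without a block to compress.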
 
+@item -p
+@itemx --preserve-permissions
+On extraction, set file permissions as they appear in the archive. This is
+the default behavior when tarlz is run by the superuser. The default for
+other users is to subtract the umask of the user running tarlz from the
+permissions specified in the archive.
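"Subtracting the umask" means clearing, in the archived mode, every bit that is set in the umask. A minimal Python sketch under that reading; `extracted_mode` and its `preserve` parameter are hypothetical names standing in for the effect of @samp{-p} (or of running as the superuser).

```python
def extracted_mode(archived_mode: int, umask: int, preserve: bool = False) -> int:
    """Permission bits given to an extracted file."""
    mode = archived_mode & 0o7777  # keep only the 12 permission bits
    if preserve:
        # -p given, or run by the superuser: use the archived bits as they are
        return mode
    # Otherwise clear every bit that is set in the umask
    return mode & ~umask


print(oct(extracted_mode(0o666, 0o022)))        # prints 0o644
print(oct(extracted_mode(0o666, 0o022, True)))  # prints 0o666
```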
+
 @item -q
 @itemx --quiet
 Quiet operation. Suppress all messages.
@@ -298,7 +323,10 @@ Verbosely list files processed.
 Extract files from an archive. If @var{files} are given, extract only the
 @var{files} given. Else extract all the files in the archive. To extract a
 directory without extracting the files under it, use
-@w{@samp{tarlz -xf foo --exclude='dir/*' dir}}.
+@w{@samp{tarlz -xf foo --exclude='dir/*' dir}}. Tarlz will not make any
+special effort to extract a file over an incompatible type of file. For
+example, extracting a link over a directory will usually fail. (Principle of
+least surprise).
 
 @item -0 .. -9
 Set the compression level for @samp{--create} and @samp{--append}. The
@@ -411,9 +439,9 @@ keyword appearing in the same block of extended records.
 @end table
 
 Exit status: 0 for a normal exit, 1 for environmental problems (file not
-found, invalid flags, I/O errors, etc), 2 to indicate a corrupt or
-invalid input file, 3 for an internal consistency error (eg, bug) which
-caused tarlz to panic.
+found, files differ, invalid flags, I/O errors, etc), 2 to indicate a
+corrupt or invalid input file, 3 for an internal consistency error (eg, bug)
+which caused tarlz to panic.
 
 
 @node Portable character set
@@ -431,12 +459,16 @@ a b c d e f g h i j k l m n o p q r s t u v w x y z
 The last three characters are the period, underscore, and hyphen-minus
 characters, respectively.
 
+File names are identifiers. Therefore, archiving works better when file
+names use only the portable character set without spaces added.
+
 
 @node File format
 @chapter File format
 @cindex file format
 
 In the diagram below, a box like this:
+
 @verbatim
 +---+
 |   | <-- the vertical bars might be missing
@@ -444,6 +476,7 @@ In the diagram below, a box like this:
 @end verbatim
 
 represents one byte; a box like this:
+
 @verbatim
 +==============+
 |              |
@@ -486,7 +519,7 @@ Zero or more blocks that contain the contents of the file.
 Each tar member must be contiguously stored in a lzip member for the
 parallel decoding operations like @samp{--list} to work. If any tar member
 is split over two or more lzip members, the archive must be decoded
-sequentially. @xref{Multi-threaded tar}.
+sequentially. @xref{Multi-threaded decoding}.
 
 At the end of the archive file there are two 512-byte blocks filled with
 binary zeros, interpreted as an end-of-archive indicator. These EOF
@@ -530,28 +563,29 @@ An extended header just before the EOF blocks.
 @section Pax header block
 
 The pax header block is identical to the ustar header block described below
-except that the typeflag has the value @samp{x} (extended). The size field
-is the size of the extended header data in bytes. Most other fields in the
-pax header block are zeroed on archive creation to prevent trouble if the
-archive is read by an ustar tool, and are ignored by tarlz on archive
-extraction. @xref{flawed-compat}.
+except that the typeflag has the value @samp{x} (extended). The field
+@samp{size} is the size of the extended header data in bytes. Most other
+fields in the pax header block are zeroed on archive creation to prevent
+trouble if the archive is read by an ustar tool, and are ignored by tarlz on
+archive extraction. @xref{flawed-compat}.
 
 The pax extended header data consists of one or more records, each of
 them constructed as follows:@*
 @samp{"%d %s=%s\n", <length>, <keyword>, <value>}
 
-The <length>, <blank>, <keyword>, <equals-sign>, and <newline> in the
-record must be limited to the portable character set. The <length> field
-contains the decimal length of the record in bytes, including the
-trailing <newline>. The <value> field is stored as-is, without
-conversion to UTF-8 nor any other transformation.
+The fields <length> and <keyword> in the record must be limited to the
+portable character set (@pxref{Portable character set}). The field <length>
+contains the decimal length of the record in bytes, including the trailing
+newline. The field <value> is stored as-is, without conversion to UTF-8 nor
+any other transformation. The fields are separated by the ASCII characters
+space, equal-sign, and newline.
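Because <length> counts its own digits, building a record takes a small fixed-point search: adding a digit to the length can itself make the record one byte longer. A minimal Python sketch of the record layout described above; `pax_record` is a hypothetical helper, not a tarlz function.

```python
def pax_record(keyword: str, value: bytes) -> bytes:
    """Build one '<length> <keyword>=<value>\\n' extended record."""
    # Everything after the length digits; the value is stored as-is
    body = b" " + keyword.encode("ascii") + b"=" + value + b"\n"
    length = len(body) + 1  # first guess: a one-digit length
    # Grow the guess until the digits of 'length' are counted in 'length'
    while len(str(length)) + len(body) != length:
        length = len(str(length)) + len(body)
    return str(length).encode("ascii") + body


print(pax_record("path", b"foo"))  # prints b'12 path=foo\n'
```

Here the body ` path=foo\n` is 10 bytes, so the first guess of 11 is too small once the two digits are counted, and the search settles on 12.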
 
-These are the <keyword> fields currently supported by tarlz:
+These are the <keyword> values currently supported by tarlz:
 
 @table @code
 @item linkpath
 The pathname of a link being created to another file, of any type,
-previously archived. This record overrides the linkname field in the
+previously archived. This record overrides the field @samp{linkname} in the
 following ustar header block. The following ustar header block
 determines the type of link created. If typeflag of the following header
 block is 1, it will be a hard link. If typeflag is 2, it will be a
@@ -559,8 +593,8 @@ symbolic link and the linkpath value will be used as the contents of the
 symbolic link.
 
 @item path
-The pathname of the following file. This record overrides the name and
-prefix fields in the following ustar header block.
+The pathname of the following file. This record overrides the fields
+@samp{name} and @samp{prefix} in the following ustar header block.
 
 @item size
 The size of the file in bytes, expressed as a decimal number using
@@ -610,31 +644,30 @@ shown in the following table. All lengths and offsets are in decimal.
 All characters in the header block are coded using the ISO/IEC 646:1991
 (ASCII) standard, except in fields storing names for files, users, and
 groups. For maximum portability between implementations, names should only
-contain characters from the portable character set. But if an implementation
-supports the use of characters outside of @samp{/} and the portable
-character set in names for files, users, and groups, tarlz will use the byte
-values in these names unmodified.
+contain characters from the portable character set (@pxref{Portable
+character set}), but if an implementation supports the use of characters
+outside of @samp{/} and the portable character set in names for files,
+users, and groups, tarlz will use the byte values in these names unmodified.
 
-The fields name, linkname, and prefix are null-terminated character
-strings except when all characters in the array contain non-null
-characters including the last character.
+The fields @samp{name}, @samp{linkname}, and @samp{prefix} are
+null-terminated character strings except when all characters in the array
+contain non-null characters including the last character.
 
-The name and the prefix fields produce the pathname of the file. A new
-pathname is formed, if prefix is not an empty string (its first
+The fields @samp{prefix} and @samp{name} produce the pathname of the file. A
+new pathname is formed, if prefix is not an empty string (its first
 character is not null), by concatenating prefix (up to the first null
-character), a <slash> character, and name; otherwise, name is used
-alone. In either case, name is terminated at the first null character.
-If prefix begins with a null character, it is ignored. In this manner,
-pathnames of at most 256 characters can be supported. If a pathname does
-not fit in the space provided, an extended record is used to store the
-pathname.
+character), a slash character, and name; otherwise, name is used alone. In
+either case, name is terminated at the first null character. If prefix
+begins with a null character, it is ignored. In this manner, pathnames of at
+most 256 characters can be supported. If a pathname does not fit in the
+space provided, an extended record is used to store the pathname.
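The rule for joining @samp{prefix} and @samp{name} can be written in a few lines. A Python sketch, assuming the two fields have already been sliced out of the 512-byte header; `ustar_pathname` is a hypothetical name, and the field widths in the comment (155 and 100 bytes) come from the standard ustar layout, not from this excerpt.

```python
def ustar_pathname(name: bytes, prefix: bytes) -> bytes:
    """Join the raw 'prefix' and 'name' header fields into a pathname."""
    # Each field ends at its first null byte, if it has one
    name = name.split(b"\0", 1)[0]
    prefix = prefix.split(b"\0", 1)[0]
    # A prefix that begins with a null byte is empty and is ignored
    return prefix + b"/" + name if prefix else name


# 155 bytes of prefix + 1 slash + 100 bytes of name = 256 characters at most
print(ustar_pathname(b"file.txt\0\0", b"dir/sub\0"))  # prints b'dir/sub/file.txt'
```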
 
-The linkname field does not use the prefix to produce a pathname. If the
-linkname does not fit in the 100 characters provided, an extended record
+The field @samp{linkname} does not use the prefix to produce a pathname. If
+the linkname does not fit in the 100 characters provided, an extended record
 is used to store the linkname.
 
-The mode field provides 12 access permission bits. The following table
-shows the symbolic name of each bit and its octal value:
+The field @samp{mode} provides 12 access permission bits. The following
+table shows the symbolic name of each bit and its octal value:
 
 @multitable {Bit Name} {Value} {Bit Name} {Value} {Bit Name} {Value}
 @headitem Bit Name @tab Value @tab Bit Name @tab Value @tab Bit Name @tab Value
@@ -644,29 +677,28 @@ shows the symbolic name of each bit and its octal value:
 @item S_IROTH @tab 00004 @tab S_IWOTH @tab 00002 @tab S_IXOTH @tab 00001
 @end multitable
 
-The uid and gid fields are the user and group ID of the owner and group
-of the file, respectively.
+The fields @samp{uid} and @samp{gid} are the user and group IDs of the owner
+and group of the file, respectively.
 
-The size field contains the octal representation of the size of the file
-in bytes. If the typeflag field specifies a file of type '0' (regular
-file) or '7' (high performance regular file), the number of logical
+The field @samp{size} contains the octal representation of the size of the
+file in bytes. If the field @samp{typeflag} specifies a file of type '0'
+(regular file) or '7' (high performance regular file), the number of logical
 records following the header is @w{(size / 512)} rounded to the next
-integer. For all other values of typeflag, tarlz either sets the size
-field to 0 or ignores it, and does not store or expect any logical
-records following the header. If the file size is larger than
-8_589_934_591 bytes @w{(octal 77777777777)}, an extended record is used
-to store the file size.
+integer. For all other values of typeflag, tarlz either sets the size field
+to 0 or ignores it, and does not store or expect any logical records
+following the header. If the file size is larger than 8_589_934_591 bytes
+@w{(octal 77777777777)}, an extended record is used to store the file size.
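Both numbers in this paragraph follow from simple integer arithmetic. A short Python sketch; the constant and function names are illustrative, not taken from tarlz.

```python
BLOCK_SIZE = 512
MAX_USTAR_SIZE = 0o77777777777  # 11 octal digits = 8_589_934_591 bytes


def data_records(size: int) -> int:
    """Number of 512-byte logical records following a regular file's header."""
    return (size + BLOCK_SIZE - 1) // BLOCK_SIZE  # size / 512, rounded up


def needs_extended_size(size: int) -> bool:
    """True if the size does not fit in the octal 'size' field."""
    return size > MAX_USTAR_SIZE


print(data_records(513))                    # prints 2
print(needs_extended_size(8_589_934_592))   # prints True
```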
 
-The mtime field contains the octal representation of the modification
-time of the file at the time it was archived, obtained from the stat()
-function.
+The field @samp{mtime} contains the octal representation of the modification
+time of the file at the time it was archived, obtained from the function
+@samp{stat}.
 
-The chksum field contains the octal representation of the value of the
-simple sum of all bytes in the header logical record. Each byte in the
-header is treated as an unsigned value. When calculating the checksum,
-the chksum field is treated as if it were all <space> characters.
+The field @samp{chksum} contains the octal representation of the value of
+the simple sum of all bytes in the header logical record. Each byte in the
+header is treated as an unsigned value. When calculating the checksum, the
+chksum field is treated as if it were all space characters.
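The checksum rule can be sketched in Python as follows. The offset (148) and length (8) of the chksum field are taken from the standard ustar header layout and are an assumption here, since the field table is not part of this excerpt; `ustar_checksum` is a hypothetical helper.

```python
CHKSUM_OFFSET = 148  # assumed: standard ustar layout
CHKSUM_LENGTH = 8


def ustar_checksum(header: bytes) -> int:
    """Simple sum of the 512 header bytes, with chksum counted as spaces."""
    if len(header) != 512:
        raise ValueError("a ustar header is exactly 512 bytes")
    end = CHKSUM_OFFSET + CHKSUM_LENGTH
    # Indexing a bytes object yields unsigned values 0..255
    return (sum(header[:CHKSUM_OFFSET])
            + CHKSUM_LENGTH * ord(" ")
            + sum(header[end:]))


print(ustar_checksum(bytes(512)))  # prints 256
```

An all-zero header sums to 256 because only the eight assumed spaces (8 x 32) contribute.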
 
-The typeflag field contains a single character specifying the type of
+The field @samp{typeflag} contains a single character specifying the type of
 file archived:
 
 @table @code
@@ -680,8 +712,8 @@ Hard link to another file, of any type, previously archived.
 Symbolic link.
 
 @item '3', '4'
-Character special file and block special file respectively. In this case
-the devmajor and devminor fields contain information defining the
+Character special file and block special file respectively. In this case the
+fields @samp{devmajor} and @samp{devminor} contain information defining the
 device in unspecified format.
 
 @item '5'
@@ -697,14 +729,15 @@ regular file (type 0).
 
 @end table
 
-The magic field contains the ASCII null-terminated string "ustar". The
-version field contains the characters "00" (0x30,0x30). The fields uname,
-and gname are null-terminated character strings except when all characters
-in the array contain non-null characters including the last character. Each
-numeric field contains a leading space- or zero-filled, optionally
-null-terminated octal number using digits from the ISO/IEC 646:1991 (ASCII)
-standard. Tarlz is able to decode numeric fields 1 byte longer than standard
-ustar by not requiring a terminating null character.
+The field @samp{magic} contains the ASCII null-terminated string "ustar".
+The field @samp{version} contains the characters "00" (0x30,0x30). The
+fields @samp{uname} and @samp{gname} are null-terminated character strings
+except when all characters in the array contain non-null characters
+including the last character. Each numeric field contains a leading space-
+or zero-filled, optionally null-terminated octal number using digits from
+the ISO/IEC 646:1991 (ASCII) standard. Tarlz is able to decode numeric
+fields 1 byte longer than standard ustar by not requiring a terminating null
+character.
 
 
 @node Amendments to pax format
@@ -714,10 +747,10 @@ ustar by not requiring a terminating null character.
 Tarlz creates safe archives that allow the reliable detection of invalid or
 corrupt metadata during decoding even when the integrity checking of lzip
 can't be used because the lzip members are only decompressed partially, as
-it happens in parallel @samp{--list} and @samp{--extract}. In order to
-achieve this goal, tarlz makes some changes to the variant of the pax format
-that it uses. This chapter describes these changes and the concrete reasons
-to implement them.
+it happens in parallel @samp{--diff}, @samp{--list}, and @samp{--extract}.
+In order to achieve this goal, tarlz makes some changes to the variant of
+the pax format that it uses. This chapter describes these changes and the
+concrete reasons to implement them.
 
 @sp 1
 @anchor{crc32}
@@ -735,7 +768,7 @@ Metadata like file name and file size must be always protected in an archive
 format because of the adverse effects of undetected corruption in them,
 potentially much worse that undetected corruption in the data. Even more so
 in the case of pax because the amount of metadata it stores is potentially
-large, making undetected corruption more probable.
+large, making undetected corruption and archiver misbehavior more probable.
 
 Headers and metadata must be protected separately from data because the
 integrity checking of lzip may not be able to detect the corruption before
@@ -753,12 +786,12 @@ In order to allow the extraction of pax archives by a tar utility conforming
 to the POSIX-2:1993 standard, POSIX.1-2008 recommends selecting extended
 header field values that allow such tar to create a regular file containing
 the extended header records as data. This approach is broken because if the
-extended header is needed because of a long file name, the name and prefix
-fields will be unable to contain the full pathname of the file. Therefore
-the files corresponding to both the extended header and the overridden ustar
-header will be extracted using truncated file names, perhaps overwriting
-existing files or directories. It may be a security risk to extract a file
-with a truncated file name.
+extended header is needed because of a long file name, the fields
+@samp{prefix} and @samp{name} will be unable to contain the full pathname of
+the file. Therefore the files corresponding to both the extended header and
+the overridden ustar header will be extracted using truncated file names,
+perhaps overwriting existing files or directories. It may be a security risk
+to extract a file with a truncated file name.
 
 To avoid this problem, tarlz writes extended headers with all fields zeroed
 except size, chksum, typeflag, magic and version. This prevents old tar
@@ -778,10 +811,10 @@ between the extended header and the ustar header.
 
 The tarlz format is mainly ustar. Extended pax headers are used only when
 needed because the length of a file name or link name, or the size of a file
-exceed the limits of the ustar format. Adding extended headers to each
-member just to record subsecond timestamps seems wasteful for a backup
-format. Moreover, minimizing the overhead may help recovering the archive
-with lziprecover in case of corruption.
+exceed the limits of the ustar format. Adding @w{1 KiB} of extended headers
+to each member just to record subsecond timestamps seems wasteful for a
+backup format. Moreover, minimizing the overhead may help recovering the
+archive with lziprecover in case of corruption.
 
 Global pax headers are tolerated, but not supported; they are parsed and
 ignored. Some operations may not behave as expected if the archive contains
@@ -797,7 +830,88 @@ accidental double UTF-8 conversions. If the need arises this behavior will
 be adjusted with a command line option in the future.
 
 
-@node Multi-threaded tar
+@node Program design
+@chapter Internal structure of tarlz
+@cindex program design
+
+The parts of tarlz related to sequential processing of the archive are more
+or less similar to any other tar and won't be described here. The interesting
+parts described here are those related to Multi-threaded processing.
+
+The structure of the part of tarlz performing Multi-threaded archive
+creation is somewhat similar to that of plzip with the added complication of
+the solidity levels. A grouper thread and several worker threads are
+created, acting the main thread as muxer (multiplexer) thread. A "packet
+courier" takes care of data transfers among threads and limits the maximum
+number of data blocks (packets) being processed simultaneously.
+
+The grouper traverses the directory tree, groups together the metadata of
+the files to be archived in each lzip member, and distributes them to the
+workers. The workers compress the metadata received from the grouper along
+with the file data read from the file system. The muxer collects processed
+packets from the workers, and writes them to the archive.
+
+@verbatim
+,--------,
+|    data|---> to each worker below
+|        |                    ,------------,
+|  file  |                ,-->|  worker 0  |--,
+| system |                |   `------------'  |
+|        |    ,---------, |   ,------------,  |   ,-------,   ,---------,
+|metadata|--->| grouper |-+-->|  worker 1  |--+-->| muxer |-->| archive |
+`--------'    `---------' |   `------------'  |   `-------'   `---------'
+                          |        ...        |
+                          |   ,------------,  |
+                          `-->| worker N-1 |--'
+                              `------------'
+@end verbatim
+
+Decoding an archive is somewhat similar to how plzip decompresses a regular
+file to standard output, with the differences that it is not the data but
+only messages what is written to stdout/stderr, and that each worker may
+access files in the file system either to read them (diff) or write them
+(extract). As in plzip, each worker reads members directly from the archive.
+
+@verbatim
+,--------,
+|  file  |<---> data to/from each worker below
+| system |
+`--------'
+                ,------------,
+            ,-->|  worker 0  |--,
+            |   `------------'  |
+,---------, |   ,------------,  |   ,-------,   ,--------,
+| archive |-+-->|  worker 1  |--+-->| muxer |-->| stdout |
+`---------' |   `------------'  |   `-------'   | stderr |
+            |        ...        |               `--------'
+            |   ,------------,  |
+            `-->| worker N-1 |--'
+                `------------'
+@end verbatim
+
+As misaligned tar.lz archives can't be decoded in parallel, and the
+misalignment can't be detected until after decoding has started, a
+"mastership request" mechanism has been designed that allows the decoding to
+continue instead of signalling an error.
+
+During parallel decoding, if a worker finds a misalignment, it requests
+mastership to decode the rest of the archive. When mastership is requested,
+an error_member_id is set, and all subsequently received packets with
+member_id > error_member_id are rejected. All workers requesting mastership
+are blocked at the request_mastership call until mastership is granted.
+Mastership is granted to the delivering worker when its queue is empty to
+make sure that all preceding packets have been processed. When mastership is
+granted, all packets are deleted and all subsequently received packets not
+coming from the master are rejected.
+
+If a worker can't continue decoding for any cause (for example lack of
+memory or finding a split tar member at the beginning of a lzip member), it
+requests mastership to print an error and terminate the program. Only if
+some other worker requests mastership in a previous lzip member can this
+error be avoided.
+
+
+@node Multi-threaded decoding
 @chapter Limitations of parallel tar decoding
 @cindex parallel tar decoding
 
@@ -827,8 +941,8 @@ decoding it safely in parallel.
 Tarlz is able to automatically decode aligned and unaligned multimember
 tar.lz archives, keeping backwards compatibility. If tarlz finds a member
 misalignment during multi-threaded decoding, it switches to single-threaded
-mode and continues decoding the archive. Currently only the @samp{--list}
-option is able to do multi-threaded decoding.
+mode and continues decoding the archive. Currently only the options
+@samp{--diff} and @samp{--list} are able to do multi-threaded decoding.
 
 If the files in the archive are large, multi-threaded @samp{--list} on a
 regular (seekable) tar.lz archive can be hundreds of times faster than
@@ -843,6 +957,10 @@ time plzip -cd silesia.tar.lz | tar -tf - (3.256s)
 time tarlz -tf silesia.tar.lz (0.020s)
 @end example
 
+On the other hand, multi-threaded @samp{--list} won't detect corruption in
+the tar member data because it only decodes the part of each lzip member
+corresponding to the tar member header.
+
 
 @node Minimum archive sizes
 @chapter Minimum archive sizes required for multi-threaded block compression
@@ -860,7 +978,7 @@ least as large as the number of worker threads times the block size
 compress, and compression will be proportionally slower. The maximum speed
 increase achievable on a given archive is limited by the ratio
 @w{(uncompressed_size / data_size)}. For example, a tarball the size of gcc
-or linux will scale up to 10 or 12 processors at level -9.
+or linux will scale up to 10 or 14 processors at level -9.
 
 The following table shows the minimum uncompressed archive size needed for
 full use of N processors at a given compression level, using the default