Adding upstream version 0.16.
Signed-off-by: Daniel Baumann <daniel@debian.org>
parent
5d67ab9e97
commit
bb26c2917c
20 changed files with 854 additions and 662 deletions
171
doc/tarlz.texi
|
@@ -6,8 +6,8 @@
|
|||
@finalout
|
||||
@c %**end of header
|
||||
|
||||
@set UPDATED 11 April 2019
|
||||
@set VERSION 0.15
|
||||
@set UPDATED 8 October 2019
|
||||
@set VERSION 0.16
|
||||
|
||||
@dircategory Data Compression
|
||||
@direntry
|
||||
|
@@ -37,6 +37,7 @@ This manual is for Tarlz (version @value{VERSION}, @value{UPDATED}).
|
|||
@menu
|
||||
* Introduction:: Purpose and features of tarlz
|
||||
* Invoking tarlz:: Command line interface
|
||||
* Portable character set:: POSIX portable filename character set
|
||||
* File format:: Detailed format of the compressed archive
|
||||
* Amendments to pax format:: The reasons for the differences with pax
|
||||
* Multi-threaded tar:: Limitations of parallel tar decoding
|
||||
|
@@ -60,13 +61,19 @@ to copy, distribute and modify it.
|
|||
@uref{http://www.nongnu.org/lzip/tarlz.html,,Tarlz} is a massively parallel
|
||||
(multi-threaded) combined implementation of the tar archiver and the
|
||||
@uref{http://www.nongnu.org/lzip/lzip.html,,lzip} compressor. Tarlz creates,
|
||||
lists and extracts archives in a simplified posix pax format compressed with
|
||||
lzip, keeping the alignment between tar members and lzip members. This
|
||||
method adds an indexed lzip layer on top of the tar archive, making it
|
||||
possible to decode the archive safely in parallel. The resulting multimember
|
||||
tar.lz archive is fully backward compatible with standard tar tools like GNU
|
||||
tar, which treat it like any other tar.lz archive. Tarlz can append files to
|
||||
the end of such compressed archives.
|
||||
lists and extracts archives in a simplified and safer variant of the POSIX
|
||||
pax format compressed with lzip, keeping the alignment between tar members
|
||||
and lzip members. The resulting multimember tar.lz archive is fully backward
|
||||
compatible with standard tar tools like GNU tar, which treat it like any
|
||||
other tar.lz archive. Tarlz can append files to the end of such compressed
|
||||
archives.
|
||||
|
||||
Keeping the alignment between tar members and lzip members has two
|
||||
advantages. It adds an indexed lzip layer on top of the tar archive, making
|
||||
it possible to decode the archive safely in parallel. It also minimizes the
|
||||
amount of data lost in case of corruption. Compressing a tar archive with
|
||||
plzip may even double the amount of files lost for each lzip member damaged
|
||||
because it does not keep the members aligned.
|
||||
|
||||
Tarlz can create tar archives with five levels of compression granularity;
|
||||
per file (---no-solid), per block (---bsolid, default), per directory
|
||||
|
@@ -88,7 +95,7 @@ member), and unwanted members can be deleted from the archive. Just
|
|||
like an uncompressed tar archive.
|
||||
|
||||
@item
|
||||
It is a safe posix-style backup format. In case of corruption,
|
||||
It is a safe POSIX-style backup format. In case of corruption,
|
||||
tarlz can extract all the undamaged members from the tar.lz
|
||||
archive, skipping over the damaged members, just like the standard
|
||||
(uncompressed) tar. Moreover, the option @samp{--keep-damaged} can be
|
||||
|
@@ -105,7 +112,9 @@ Tarlz protects the extended records with a CRC in a way compatible with
|
|||
standard tar tools. @xref{crc32}.
|
||||
|
||||
Tarlz does not understand other tar formats like @samp{gnu}, @samp{oldgnu},
|
||||
@samp{star} or @samp{v7}.
|
||||
@samp{star} or @samp{v7}. @w{@samp{tarlz -tf archive.tar.lz > /dev/null}}
|
||||
can be used to verify that the format of the archive is compatible with
|
||||
tarlz.
|
||||
|
||||
|
||||
@node Invoking tarlz
|
||||
|
@@ -126,10 +135,10 @@ All operations except @samp{--concatenate} operate on whole trees if any
|
|||
@var{file} is a directory.
|
||||
|
||||
On archive creation or appending tarlz archives the files specified, but
|
||||
removes from member names any leading and trailing slashes and any filename
|
||||
removes from member names any leading and trailing slashes and any file name
|
||||
prefixes containing a @samp{..} component. On extraction, leading and
|
||||
trailing slashes are also removed from member names, and archive members
|
||||
containing a @samp{..} component in the filename are skipped. Tarlz detects
|
||||
containing a @samp{..} component in the file name are skipped. Tarlz detects
|
||||
when the archive being created or enlarged is among the files to be dumped,
|
||||
appended or concatenated, and skips it.
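
The name handling described above can be illustrated with a short Python sketch. It is not tarlz's code: the helper names @samp{sanitize_member_name} and @samp{skipped_on_extraction} are hypothetical, and the treatment of edge cases (for example a name consisting only of @samp{..}) is an assumption.

```python
def sanitize_member_name(name: str) -> str:
    """Hypothetical helper: approximate the clean-up applied on creation.

    Removes leading and trailing slashes, then drops any prefix that
    contains a '..' component, keeping only what follows the last one.
    """
    parts = name.strip("/").split("/")
    for i in range(len(parts) - 1, -1, -1):
        if parts[i] == "..":
            parts = parts[i + 1:]  # discard the prefix up to and including '..'
            break
    return "/".join(parts)


def skipped_on_extraction(name: str) -> bool:
    """Hypothetical helper: a member is skipped if any component is '..'."""
    return ".." in name.split("/")


if __name__ == "__main__":
    for n in ("/usr/lib/", "a/../b/c", "../../x", "dir/file"):
        print(repr(n), "->", repr(sanitize_member_name(n)))
```

The sketch treats creation and extraction separately because the text does: on creation the offending prefix is removed, whereas on extraction the whole member is skipped.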
|
||||
|
||||
|
@@ -179,30 +188,30 @@ Create a new archive from @var{files}.
|
|||
|
||||
@item -C @var{dir}
|
||||
@itemx --directory=@var{dir}
|
||||
Change to directory @var{dir}. When creating or appending, the position
|
||||
of each @samp{-C} option in the command line is significant; it will
|
||||
change the current working directory for the following @var{files} until
|
||||
a new @samp{-C} option appears in the command line. When extracting, all
|
||||
the @samp{-C} options are executed in sequence before starting the
|
||||
extraction. Listing ignores any @samp{-C} options specified. @var{dir}
|
||||
is relative to the then current working directory, perhaps changed by a
|
||||
Change to directory @var{dir}. When creating or appending, the position of
|
||||
each @samp{-C} option in the command line is significant; it will change the
|
||||
current working directory for the following @var{files} until a new
|
||||
@samp{-C} option appears in the command line. When extracting or comparing,
|
||||
all the @samp{-C} options are executed in sequence before reading the
|
||||
archive. Listing ignores any @samp{-C} options specified. @var{dir} is
|
||||
relative to the then current working directory, perhaps changed by a
|
||||
previous @samp{-C} option.
|
||||
|
||||
Note that a process can only have one current working directory (CWD).
|
||||
Therefore multi-threading can't be used to create an archive if a @samp{-C}
|
||||
option appears after a relative filename in the command line.
|
||||
option appears after a relative file name in the command line.
|
||||
|
||||
@item -d
|
||||
@itemx --diff
|
||||
Find differences between archive and file system. For each tar member in the
|
||||
archive, verify that the corresponding file exists and is of the same type
|
||||
(regular file, directory, etc). Report on standard output the differences
|
||||
found in type, mode (permissions), owner and group IDs, modification time,
|
||||
file size, file contents (of regular files), target (of symlinks) and device
|
||||
number (of block/character special files).
|
||||
Compare and report differences between archive and file system. For each tar
|
||||
member in the archive, verify that the corresponding file in the file system
|
||||
exists and is of the same type (regular file, directory, etc). Report on
|
||||
standard output the differences found in type, mode (permissions), owner and
|
||||
group IDs, modification time, file size, file contents (of regular files),
|
||||
target (of symlinks) and device number (of block/character special files).
|
||||
|
||||
As tarlz removes leading slashes from member names, the @samp{-C} option may
|
||||
be used in combination with @samp{--diff} when absolute filenames were used
|
||||
be used in combination with @samp{--diff} when absolute file names were used
|
||||
on archive creation: @w{@samp{tarlz -C / -d}}. Alternatively, tarlz may be
|
||||
run from the root directory to perform the comparison.
|
||||
|
||||
|
@@ -213,16 +222,22 @@ useful when comparing an @samp{--anonymous} archive.
|
|||
@item --delete
|
||||
Delete the specified files and directories from an archive in place. It
|
||||
currently can delete only from uncompressed archives and from archives with
|
||||
individually compressed files (@samp{--no-solid} archives). To delete a
|
||||
individually compressed files (@samp{--no-solid} archives). Note that files
|
||||
of about @samp{--data-size} or larger are compressed individually even if
|
||||
@samp{--bsolid} is used, and can therefore be deleted. Tarlz takes care to
|
||||
not delete a tar member unless it is possible to do so. For example it won't
|
||||
try to delete a tar member that is not individually compressed. To delete a
|
||||
directory without deleting the files under it, use
|
||||
@w{@code{tarlz --delete -f foo --exclude='dir/*' dir}}. Deleting in place
|
||||
@w{@samp{tarlz --delete -f foo --exclude='dir/*' dir}}. Deleting in place
|
||||
may be dangerous. A corrupt archive, a power cut, or an I/O error may cause
|
||||
data loss.
|
||||
|
||||
@item --exclude=@var{pattern}
|
||||
Exclude files matching a shell pattern like @samp{*.o}. A file is considered
|
||||
to match if any component of the filename matches. For example, @samp{*.o}
|
||||
matches @samp{foo.o}, @samp{foo.o/bar} and @samp{foo/bar.o}.
|
||||
to match if any component of the file name matches. For example, @samp{*.o}
|
||||
matches @samp{foo.o}, @samp{foo.o/bar} and @samp{foo/bar.o}. If
|
||||
@var{pattern} contains a @samp{/}, it matches a corresponding @samp{/} in
|
||||
the file name. For example, @samp{foo/*.o} matches @samp{foo/bar.o}.
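
The matching rule for @samp{--exclude} can be approximated with the following Python sketch. It is an illustration under stated assumptions, not the implementation used by tarlz: the function name @samp{is_excluded} is hypothetical, and the pattern is tried against every run of consecutive file name components using Python's @samp{fnmatch} module, in which @samp{*} also matches @samp{/}.

```python
from fnmatch import fnmatchcase


def is_excluded(filename: str, pattern: str) -> bool:
    """Hypothetical helper: True if 'pattern' matches any run of components.

    Every contiguous run of components of 'filename' (joined with '/')
    is tested against the shell pattern, so a pattern can match a single
    component or a multi-component portion of the name.
    """
    parts = [p for p in filename.split("/") if p]
    for start in range(len(parts)):
        for end in range(start + 1, len(parts) + 1):
            if fnmatchcase("/".join(parts[start:end]), pattern):
                return True
    return False


if __name__ == "__main__":
    for name in ("foo.o", "foo.o/bar", "foo/bar.o", "foo/bar.c"):
        print(name, is_excluded(name, "*.o"))
```

With this sketch, @samp{*.o} matches the three names given in the text, and @samp{foo/*.o} matches @samp{foo/bar.o}.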
|
||||
|
||||
@item -f @var{archive}
|
||||
@itemx --file=@var{archive}
|
||||
|
@@ -261,12 +276,13 @@ Append files to the end of an archive. The archive must be a regular
|
|||
be appended to an uncompressed archive, nor vice versa. If the archive is
|
||||
compressed, it must be a multimember lzip file with the two end-of-file
|
||||
blocks plus any zero padding contained in the last lzip member of the
|
||||
archive. Appending works as follows; first the end-of-file blocks are
|
||||
removed, then the new members are appended, and finally two new end-of-file
|
||||
blocks are appended to the archive. If the archive is uncompressed, tarlz
|
||||
parses and skips tar headers until it finds the end-of-file blocks. Exit
|
||||
with status 0 without modifying the archive if no @var{files} have been
|
||||
specified.
|
||||
archive. It is possible to append files to an archive with a different
|
||||
compression granularity. Appending works as follows; first the end-of-file
|
||||
blocks are removed, then the new members are appended, and finally two new
|
||||
end-of-file blocks are appended to the archive. If the archive is
|
||||
uncompressed, tarlz parses and skips tar headers until it finds the
|
||||
end-of-file blocks. Exit with status 0 without modifying the archive if no
|
||||
@var{files} have been specified.
|
||||
|
||||
@item -t
|
||||
@itemx --list
|
||||
|
@@ -282,7 +298,7 @@ Verbosely list files processed.
|
|||
Extract files from an archive. If @var{files} are given, extract only the
|
||||
@var{files} given. Else extract all the files in the archive. To extract a
|
||||
directory without extracting the files under it, use
|
||||
@w{@code{tarlz -xf foo --exclude='dir/*' dir}}.
|
||||
@w{@samp{tarlz -xf foo --exclude='dir/*' dir}}.
|
||||
|
||||
@item -0 .. -9
|
||||
Set the compression level for @samp{--create} and @samp{--append}. The
|
||||
|
@@ -326,7 +342,7 @@ compressed data block must contain an integer number of tar members. Block
|
|||
compression is the default because it improves compression ratio for
|
||||
archives with many files smaller than the block size. This option allows
|
||||
tarlz to revert to default behavior if, for example, it is invoked through an
|
||||
alias like @code{tar='tarlz --solid'}. @xref{--data-size}, to set the target
|
||||
alias like @samp{tar='tarlz --solid'}. @xref{--data-size}, to set the target
|
||||
block size.
|
||||
|
||||
@item --dsolid
|
||||
|
@@ -374,7 +390,7 @@ When this option is used, tarlz detects any corruption in the extended
|
|||
records (only limited by CRC collisions). But note that a corrupt
|
||||
@samp{GNU.crc32} keyword, for example @samp{GNU.crc33}, is reported as a
|
||||
missing CRC instead of as a corrupt record. This misleading
|
||||
@samp{Missing CRC} message is the consequence of a flaw in the posix pax
|
||||
@samp{Missing CRC} message is the consequence of a flaw in the POSIX pax
|
||||
format; i.e., the lack of a mandatory check sequence in the extended
|
||||
records. @xref{crc32}.
|
||||
|
||||
|
@@ -400,6 +416,22 @@ invalid input file, 3 for an internal consistency error (eg, bug) which
|
|||
caused tarlz to panic.
|
||||
|
||||
|
||||
@node Portable character set
|
||||
@chapter POSIX portable filename character set
|
||||
@cindex portable character set
|
||||
|
||||
The set of characters from which portable file names are constructed.
|
||||
|
||||
@example
|
||||
A B C D E F G H I J K L M N O P Q R S T U V W X Y Z
|
||||
a b c d e f g h i j k l m n o p q r s t u v w x y z
|
||||
0 1 2 3 4 5 6 7 8 9 . _ -
|
||||
@end example
|
||||
|
||||
The last three characters are the period, underscore, and hyphen-minus
|
||||
characters, respectively.
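
A file name can be checked against this set with a few lines of Python. This is a minimal sketch; the names @samp{PORTABLE} and @samp{is_portable_filename} are hypothetical, and rejecting the empty string is an assumption made here for convenience.

```python
import string

# The 65 characters listed above: letters, digits, period, underscore, hyphen.
PORTABLE = frozenset(string.ascii_letters + string.digits + "._-")


def is_portable_filename(name: str) -> bool:
    """Hypothetical helper: True if every character of 'name' is portable.

    'name' is a single file name component; '/' is not in the set.
    """
    return bool(name) and all(c in PORTABLE for c in name)


if __name__ == "__main__":
    for n in ("archive-1.tar.lz", "my file.txt", "dir/file"):
        print(repr(n), is_portable_filename(n))
```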
|
||||
|
||||
|
||||
@node File format
|
||||
@chapter File format
|
||||
@cindex file format
|
||||
|
@@ -426,7 +458,7 @@ A tar.lz file consists of a series of lzip members (compressed data sets).
|
|||
The members simply appear one after another in the file, with no
|
||||
additional information before, between, or after them.
|
||||
|
||||
Each lzip member contains one or more tar members in a simplified posix
|
||||
Each lzip member contains one or more tar members in a simplified POSIX
|
||||
pax interchange format. The only pax typeflag value supported by tarlz
|
||||
(in addition to the typeflag values defined by the ustar format) is
|
||||
@samp{x}. The pax format is an extension on top of the ustar format that
|
||||
|
@@ -506,7 +538,7 @@ extraction. @xref{flawed-compat}.
|
|||
|
||||
The pax extended header data consists of one or more records, each of
|
||||
them constructed as follows:@*
|
||||
@code{"%d %s=%s\n", <length>, <keyword>, <value>}
|
||||
@samp{"%d %s=%s\n", <length>, <keyword>, <value>}
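
Building such a record takes some care because, in the POSIX pax format, <length> is the decimal length of the whole record, including the <length> field itself and the trailing newline. The following Python sketch shows one way to compute it; the function name @samp{pax_record} is hypothetical and the UTF-8 encoding of keyword and value is an assumption.

```python
def pax_record(keyword: str, value: str) -> bytes:
    """Hypothetical helper: build one pax extended record.

    The record is "<length> <keyword>=<value>\n", where <length> counts
    every byte of the record, including the digits of <length> itself.
    """
    body = f" {keyword}={value}\n".encode("utf-8")
    length = len(body) + 1  # first guess: a one-digit length field
    # Adding digits to the length field may itself change the length,
    # so iterate until the value is self-consistent.
    while len(str(length)) + len(body) != length:
        length = len(str(length)) + len(body)
    return str(length).encode("ascii") + body


if __name__ == "__main__":
    print(pax_record("path", "foo"))
    print(pax_record("size", "1"))
```

The loop is needed for bodies whose size sits just below a power of ten: a 9-byte body yields a record of 11 bytes, not 10, because the length field then needs two digits.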
|
||||
|
||||
The <length>, <blank>, <keyword>, <equals-sign>, and <newline> in the
|
||||
record must be limited to the portable character set. The <length> field
|
||||
|
@@ -577,11 +609,11 @@ shown in the following table. All lengths and offsets are in decimal.
|
|||
|
||||
All characters in the header block are coded using the ISO/IEC 646:1991
|
||||
(ASCII) standard, except in fields storing names for files, users, and
|
||||
groups. For maximum portability between implementations, names should
|
||||
only contain characters from the portable filename character set. But if
|
||||
an implementation supports the use of characters outside of @samp{/} and
|
||||
the portable filename character set in names for files, users, and
|
||||
groups, tarlz will use the byte values in these names unmodified.
|
||||
groups. For maximum portability between implementations, names should only
|
||||
contain characters from the portable character set. But if an implementation
|
||||
supports the use of characters outside of @samp{/} and the portable
|
||||
character set in names for files, users, and groups, tarlz will use the byte
|
||||
values in these names unmodified.
|
||||
|
||||
The fields name, linkname, and prefix are null-terminated character
|
||||
strings except when all characters in the array contain non-null
|
||||
|
@@ -679,32 +711,39 @@ ustar by not requiring a terminating null character.
|
|||
@chapter The reasons for the differences with pax
|
||||
@cindex Amendments to pax format
|
||||
|
||||
Tarlz is meant to reliably detect invalid or corrupt metadata during
|
||||
decoding, and to create safe archives where corrupt metadata can be reliably
|
||||
detected. In order to achieve these goals, tarlz makes some changes to the
|
||||
variant of the pax format that it uses. This chapter describes these changes
|
||||
and the concrete reasons to implement them.
|
||||
Tarlz creates safe archives that allow the reliable detection of invalid or
|
||||
corrupt metadata during decoding even when the integrity checking of lzip
|
||||
can't be used because the lzip members are only decompressed partially, as
|
||||
it happens in parallel @samp{--list} and @samp{--extract}. In order to
|
||||
achieve this goal, tarlz makes some changes to the variant of the pax format
|
||||
that it uses. This chapter describes these changes and the concrete reasons
|
||||
to implement them.
|
||||
|
||||
@sp 1
|
||||
@anchor{crc32}
|
||||
@section Add a CRC of the extended records
|
||||
|
||||
The posix pax format has a serious flaw. The metadata stored in pax extended
|
||||
The POSIX pax format has a serious flaw. The metadata stored in pax extended
|
||||
records are not protected by any kind of check sequence. Corruption in a
|
||||
long filename may cause the extraction of the file in the wrong place
|
||||
long file name may cause the extraction of the file in the wrong place
|
||||
without warning. Corruption in a large file size may cause the truncation of
|
||||
the file or the appending of garbage to the file, both followed by a
|
||||
spurious warning about a corrupt header far from the place of the undetected
|
||||
corruption.
|
||||
|
||||
Metadata like filename and file size must be always protected in an archive
|
||||
Metadata like file name and file size must be always protected in an archive
|
||||
format because of the adverse effects of undetected corruption in them,
|
||||
potentially much worse than undetected corruption in the data. Even more so
|
||||
in the case of pax because the amount of metadata it stores is potentially
|
||||
large, making undetected corruption more probable.
|
||||
|
||||
Because of the above, tarlz protects the extended records with a CRC in
|
||||
a way compatible with standard tar tools. @xref{key_crc32}.
|
||||
Headers and metadata must be protected separately from data because the
|
||||
integrity checking of lzip may not be able to detect the corruption before
|
||||
the metadata has been used, for example, to create a new file in the wrong
|
||||
place.
|
||||
|
||||
Because of the above, tarlz protects the extended records with a CRC in a
|
||||
way compatible with standard tar tools. @xref{key_crc32}.
|
||||
|
||||
@sp 1
|
||||
@anchor{flawed-compat}
|
||||
|
@@ -714,12 +753,12 @@ In order to allow the extraction of pax archives by a tar utility conforming
|
|||
to the POSIX-2:1993 standard, POSIX.1-2008 recommends selecting extended
|
||||
header field values that allow such tar to create a regular file containing
|
||||
the extended header records as data. This approach is broken because if the
|
||||
extended header is needed because of a long filename, the name and prefix
|
||||
extended header is needed because of a long file name, the name and prefix
|
||||
fields will be unable to contain the full pathname of the file. Therefore
|
||||
the files corresponding to both the extended header and the overridden ustar
|
||||
header will be extracted using truncated filenames, perhaps overwriting
|
||||
header will be extracted using truncated file names, perhaps overwriting
|
||||
existing files or directories. It may be a security risk to extract a file
|
||||
with a truncated filename.
|
||||
with a truncated file name.
|
||||
|
||||
To avoid this problem, tarlz writes extended headers with all fields zeroed
|
||||
except size, chksum, typeflag, magic and version. This prevents old tar
|
||||
|
@@ -729,8 +768,8 @@ extended records.
|
|||
|
||||
If an extended header is required for any reason (for example a file size
|
||||
larger than @w{8 GiB} or a link name longer than 100 bytes), tarlz moves the
|
||||
filename also to the extended header to prevent an ustar tool from trying to
|
||||
extract the file or link. This also makes easier during parallel decoding
|
||||
file name also to the extended header to prevent an ustar tool from trying
|
||||
to extract the file or link. This also eases, during parallel decoding,
|
||||
the detection of a tar member split between two lzip members at the boundary
|
||||
between the extended header and the ustar header.
|
||||
|
||||
|
@@ -738,10 +777,11 @@ between the extended header and the ustar header.
|
|||
@section As simple as possible (but not simpler)
|
||||
|
||||
The tarlz format is mainly ustar. Extended pax headers are used only when
|
||||
needed because the length of a filename or link name, or the size of a file
|
||||
needed because the length of a file name or link name, or the size of a file
|
||||
exceed the limits of the ustar format. Adding extended headers to each
|
||||
member just to record subsecond timestamps seems wasteful for a backup
|
||||
format.
|
||||
format. Moreover, minimizing the overhead may help recovering the archive
|
||||
with lziprecover in case of corruption.
|
||||
|
||||
Global pax headers are tolerated, but not supported; they are parsed and
|
||||
ignored. Some operations may not behave as expected if the archive contains
|
||||
|
@@ -759,6 +799,7 @@ be adjusted with a command line option in the future.
|
|||
|
||||
@node Multi-threaded tar
|
||||
@chapter Limitations of parallel tar decoding
|
||||
@cindex parallel tar decoding
|
||||
|
||||
Safely decoding an arbitrary tar archive in parallel is impossible. For
|
||||
example, if a tar archive containing another tar archive is decoded starting
|
||||
|
|