Adding upstream version 0.16.
Signed-off-by: Daniel Baumann <daniel@debian.org>
parent
5d67ab9e97
commit
bb26c2917c
20 changed files with 854 additions and 662 deletions
232
doc/tarlz.info
|
@ -11,12 +11,13 @@ File: tarlz.info, Node: Top, Next: Introduction, Up: (dir)
|
|||
Tarlz Manual
|
||||
************
|
||||
|
||||
This manual is for Tarlz (version 0.15, 11 April 2019).
|
||||
This manual is for Tarlz (version 0.16, 8 October 2019).
|
||||
|
||||
* Menu:
|
||||
|
||||
* Introduction:: Purpose and features of tarlz
|
||||
* Invoking tarlz:: Command line interface
|
||||
* Portable character set:: POSIX portable filename character set
|
||||
* File format:: Detailed format of the compressed archive
|
||||
* Amendments to pax format:: The reasons for the differences with pax
|
||||
* Multi-threaded tar:: Limitations of parallel tar decoding
|
||||
|
@ -39,13 +40,19 @@ File: tarlz.info, Node: Introduction, Next: Invoking tarlz, Prev: Top, Up: T
|
|||
|
||||
Tarlz is a massively parallel (multi-threaded) combined implementation
|
||||
of the tar archiver and the lzip compressor. Tarlz creates, lists and
|
||||
extracts archives in a simplified posix pax format compressed with
|
||||
lzip, keeping the alignment between tar members and lzip members. This
|
||||
method adds an indexed lzip layer on top of the tar archive, making it
|
||||
possible to decode the archive safely in parallel. The resulting
|
||||
multimember tar.lz archive is fully backward compatible with standard
|
||||
tar tools like GNU tar, which treat it like any other tar.lz archive.
|
||||
Tarlz can append files to the end of such compressed archives.
|
||||
extracts archives in a simplified and safer variant of the POSIX pax
|
||||
format compressed with lzip, keeping the alignment between tar members
|
||||
and lzip members. The resulting multimember tar.lz archive is fully
|
||||
backward compatible with standard tar tools like GNU tar, which treat
|
||||
it like any other tar.lz archive. Tarlz can append files to the end of
|
||||
such compressed archives.
|
||||
|
||||
Keeping the alignment between tar members and lzip members has two
|
||||
advantages. It adds an indexed lzip layer on top of the tar archive,
|
||||
making it possible to decode the archive safely in parallel. It also
|
||||
minimizes the amount of data lost in case of corruption. Compressing a
|
||||
tar archive with plzip may even double the amount of files lost for
|
||||
each lzip member damaged because it does not keep the members aligned.
|
||||
|
||||
Tarlz can create tar archives with five levels of compression
|
||||
granularity; per file (--no-solid), per block (--bsolid, default), per
|
||||
|
@ -62,7 +69,7 @@ archive, but it has the following advantages:
|
|||
member), and unwanted members can be deleted from the archive. Just
|
||||
like an uncompressed tar archive.
|
||||
|
||||
* It is a safe posix-style backup format. In case of corruption,
|
||||
* It is a safe POSIX-style backup format. In case of corruption,
|
||||
tarlz can extract all the undamaged members from the tar.lz
|
||||
archive, skipping over the damaged members, just like the standard
|
||||
(uncompressed) tar. Moreover, the option '--keep-damaged' can be
|
||||
|
@ -77,10 +84,11 @@ archive, but it has the following advantages:
|
|||
with standard tar tools. *Note crc32::.
|
||||
|
||||
Tarlz does not understand other tar formats like 'gnu', 'oldgnu',
|
||||
'star' or 'v7'.
|
||||
'star' or 'v7'. 'tarlz -tf archive.tar.lz > /dev/null' can be used to
|
||||
verify that the format of the archive is compatible with tarlz.
|
||||
|
||||
|
||||
File: tarlz.info, Node: Invoking tarlz, Next: File format, Prev: Introduction, Up: Top
|
||||
File: tarlz.info, Node: Invoking tarlz, Next: Portable character set, Prev: Introduction, Up: Top
|
||||
|
||||
2 Invoking tarlz
|
||||
****************
|
||||
|
@ -94,9 +102,9 @@ FILE is a directory.
|
|||
|
||||
On archive creation or appending tarlz archives the files specified,
|
||||
but removes from member names any leading and trailing slashes and any
|
||||
filename prefixes containing a '..' component. On extraction, leading
|
||||
file name prefixes containing a '..' component. On extraction, leading
|
||||
and trailing slashes are also removed from member names, and archive
|
||||
members containing a '..' component in the filename are skipped. Tarlz
|
||||
members containing a '..' component in the file name are skipped. Tarlz
|
||||
detects when the archive being created or enlarged is among the files
|
||||
to be dumped, appended or concatenated, and skips it.
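Editor's illustration (not part of the commit): the name clean-up described above can be sketched in Python. This is a minimal sketch of the documented behavior only, not tarlz's own code; the function names `clean_member_name` and `skipped_on_extraction` are hypothetical.

```python
def clean_member_name(name: str) -> str:
    """Approximate the documented clean-up of a member name on creation.

    Hypothetical helper: strips leading and trailing slashes and drops
    any prefix that contains a '..' component.
    """
    parts = name.strip("/").split("/")
    if ".." in parts:
        # Index of the last '..' component; everything up to it is dropped.
        last = len(parts) - 1 - parts[::-1].index("..")
        parts = parts[last + 1:]
    return "/".join(parts).strip("/")


def skipped_on_extraction(name: str) -> bool:
    """Members whose name contains a '..' component are skipped on extraction."""
    return ".." in name.split("/")


if __name__ == "__main__":
    for example in ("/usr/lib/", "a/../b/c", "../../x"):
        print(repr(example), "->", repr(clean_member_name(example)))
```

Only whole path components count: a name such as 'a..b/c' contains no '..' component and is left alone.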
|
||||
|
||||
|
@ -149,30 +157,31 @@ equivalent to '-1 --solid'
|
|||
Change to directory DIR. When creating or appending, the position
|
||||
of each '-C' option in the command line is significant; it will
|
||||
change the current working directory for the following FILES until
|
||||
a new '-C' option appears in the command line. When extracting, all
|
||||
the '-C' options are executed in sequence before starting the
|
||||
extraction. Listing ignores any '-C' options specified. DIR is
|
||||
relative to the then current working directory, perhaps changed by
|
||||
a previous '-C' option.
|
||||
a new '-C' option appears in the command line. When extracting or
|
||||
comparing, all the '-C' options are executed in sequence before
|
||||
reading the archive. Listing ignores any '-C' options specified.
|
||||
DIR is relative to the then current working directory, perhaps
|
||||
changed by a previous '-C' option.
|
||||
|
||||
Note that a process can only have one current working directory
|
||||
(CWD). Therefore multi-threading can't be used to create an
|
||||
archive if a '-C' option appears after a relative filename in the
|
||||
archive if a '-C' option appears after a relative file name in the
|
||||
command line.
|
||||
|
||||
'-d'
|
||||
'--diff'
|
||||
Find differences between archive and file system. For each tar
|
||||
member in the archive, verify that the corresponding file exists
|
||||
and is of the same type (regular file, directory, etc). Report on
|
||||
standard output the differences found in type, mode (permissions),
|
||||
owner and group IDs, modification time, file size, file contents
|
||||
(of regular files), target (of symlinks) and device number (of
|
||||
block/character special files).
|
||||
Compare and report differences between archive and file system.
|
||||
For each tar member in the archive, verify that the corresponding
|
||||
file in the file system exists and is of the same type (regular
|
||||
file, directory, etc). Report on standard output the differences
|
||||
found in type, mode (permissions), owner and group IDs,
|
||||
modification time, file size, file contents (of regular files),
|
||||
target (of symlinks) and device number (of block/character special
|
||||
files).
|
||||
|
||||
As tarlz removes leading slashes from member names, the '-C'
|
||||
option may be used in combination with '--diff' when absolute
|
||||
filenames were used on archive creation: 'tarlz -C / -d'.
|
||||
option may be used in combination with '--diff' when absolute file
|
||||
names were used on archive creation: 'tarlz -C / -d'.
|
||||
Alternatively, tarlz may be run from the root directory to perform
|
||||
the comparison.
|
||||
|
||||
|
@ -184,15 +193,22 @@ equivalent to '-1 --solid'
|
|||
Delete the specified files and directories from an archive in
|
||||
place. It currently can delete only from uncompressed archives and
|
||||
from archives with individually compressed files ('--no-solid'
|
||||
archives). To delete a directory without deleting the files under
|
||||
it, use 'tarlz --delete -f foo --exclude='dir/*' dir'. Deleting in
|
||||
place may be dangerous. A corrupt archive, a power cut, or an I/O
|
||||
error may cause data loss.
|
||||
archives). Note that files of about '--data-size' or larger are
|
||||
compressed individually even if '--bsolid' is used, and can
|
||||
therefore be deleted. Tarlz takes care to not delete a tar member
|
||||
unless it is possible to do so. For example it won't try to delete
|
||||
a tar member that is not individually compressed. To delete a
|
||||
directory without deleting the files under it, use
|
||||
'tarlz --delete -f foo --exclude='dir/*' dir'. Deleting in place
|
||||
may be dangerous. A corrupt archive, a power cut, or an I/O error
|
||||
may cause data loss.
|
||||
|
||||
'--exclude=PATTERN'
|
||||
Exclude files matching a shell pattern like '*.o'. A file is
|
||||
considered to match if any component of the filename matches. For
|
||||
example, '*.o' matches 'foo.o', 'foo.o/bar' and 'foo/bar.o'.
|
||||
considered to match if any component of the file name matches. For
|
||||
example, '*.o' matches 'foo.o', 'foo.o/bar' and 'foo/bar.o'. If
|
||||
PATTERN contains a '/', it matches a corresponding '/' in the file
|
||||
name. For example, 'foo/*.o' matches 'foo/bar.o'.
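Editor's illustration (not part of the commit): the matching rule stated above can be approximated in Python. This sketch assumes that a pattern with N '/'-separated parts is compared against every run of N consecutive components of the file name. It reproduces the examples given in the text, but it is not tarlz's implementation and may differ from it in corner cases. The function name `excluded` is hypothetical.

```python
from fnmatch import fnmatchcase


def excluded(name: str, pattern: str) -> bool:
    """Return True if 'pattern' matches any run of components of 'name'.

    Assumption: each '/' in the pattern must line up with a '/' in the
    file name, and the match may start at any component.
    """
    name_parts = [p for p in name.split("/") if p]
    pat_parts = [p for p in pattern.split("/") if p]
    width = len(pat_parts)
    if width == 0:
        return False
    for start in range(len(name_parts) - width + 1):
        window = name_parts[start:start + width]
        if all(fnmatchcase(n, p) for n, p in zip(window, pat_parts)):
            return True
    return False


if __name__ == "__main__":
    for name in ("foo.o", "foo.o/bar", "foo/bar.o", "bar/baz.c"):
        print(name, excluded(name, "*.o"))
```

Splitting on '/' before calling `fnmatchcase` keeps '*' from matching across directory separators, which Python's `fnmatch` would otherwise allow.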
|
||||
|
||||
'-f ARCHIVE'
|
||||
'--file=ARCHIVE'
|
||||
|
@ -234,13 +250,15 @@ equivalent to '-1 --solid'
|
|||
Compressed members can't be appended to an uncompressed archive,
|
||||
nor vice versa. If the archive is compressed, it must be a
|
||||
multimember lzip file with the two end-of-file blocks plus any
|
||||
zero padding contained in the last lzip member of the archive.
|
||||
Appending works as follows; first the end-of-file blocks are
|
||||
removed, then the new members are appended, and finally two new
|
||||
end-of-file blocks are appended to the archive. If the archive is
|
||||
uncompressed, tarlz parses and skips tar headers until it finds
|
||||
the end-of-file blocks. Exit with status 0 without modifying the
|
||||
archive if no FILES have been specified.
|
||||
zero padding contained in the last lzip member of the archive. It
|
||||
is possible to append files to an archive with a different
|
||||
compression granularity. Appending works as follows; first the
|
||||
end-of-file blocks are removed, then the new members are appended,
|
||||
and finally two new end-of-file blocks are appended to the
|
||||
archive. If the archive is uncompressed, tarlz parses and skips
|
||||
tar headers until it finds the end-of-file blocks. Exit with
|
||||
status 0 without modifying the archive if no FILES have been
|
||||
specified.
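Editor's illustration (not part of the commit): for the uncompressed case, the append procedure described above can be sketched with Python's standard `tarfile` module. This is a minimal sketch under stated assumptions: the archive is plain ustar with octal size fields, and the helper names `find_eof_offset`, `append_blocks` and `make_archive` are hypothetical. It does not handle compressed archives or base-256 size fields.

```python
import io
import tarfile

BLOCK = 512  # tar block size in bytes


def find_eof_offset(data: bytes) -> int:
    """Parse and skip tar headers until the first all-zero (end-of-file) block."""
    pos = 0
    while pos + BLOCK <= len(data):
        header = data[pos:pos + BLOCK]
        if header == bytes(BLOCK):
            return pos
        size_field = header[124:136].rstrip(b"\0 ")  # octal size, ustar layout
        size = int(size_field, 8) if size_field else 0
        # Skip the header plus the member data rounded up to whole blocks.
        pos += BLOCK + (size + BLOCK - 1) // BLOCK * BLOCK
    raise ValueError("no end-of-file block found")


def append_blocks(data: bytes, new_members: bytes) -> bytes:
    """Drop the old EOF blocks, add the new members, then write two new EOF blocks."""
    return data[:find_eof_offset(data)] + new_members + bytes(2 * BLOCK)


def make_archive(name: str, payload: bytes) -> bytes:
    """Build a one-member ustar archive in memory (test data only)."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w", format=tarfile.USTAR_FORMAT) as tar:
        info = tarfile.TarInfo(name)
        info.size = len(payload)
        tar.addfile(info, io.BytesIO(payload))
    return buf.getvalue()


if __name__ == "__main__":
    first = make_archive("a.txt", b"hello")
    second = make_archive("b.txt", b"world")
    combined = append_blocks(first, second[:find_eof_offset(second)])
    with tarfile.open(fileobj=io.BytesIO(combined)) as tar:
        print(tar.getnames())
```

The sketch works on whole byte strings for brevity; a real tool would seek within the archive file instead of copying it.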
|
||||
|
||||
'-t'
|
||||
'--list'
|
||||
|
@ -351,7 +369,7 @@ equivalent to '-1 --solid'
|
|||
that a corrupt 'GNU.crc32' keyword, for example 'GNU.crc33', is
|
||||
reported as a missing CRC instead of as a corrupt record. This
|
||||
misleading 'Missing CRC' message is the consequence of a flaw in
|
||||
the posix pax format; i.e., the lack of a mandatory check sequence
|
||||
the POSIX pax format; i.e., the lack of a mandatory check sequence
|
||||
in the extended records. *Note crc32::.
|
||||
|
||||
'--out-slots=N'
|
||||
|
@ -369,9 +387,24 @@ invalid input file, 3 for an internal consistency error (eg, bug) which
|
|||
caused tarlz to panic.
|
||||
|
||||
|
||||
File: tarlz.info, Node: File format, Next: Amendments to pax format, Prev: Invoking tarlz, Up: Top
|
||||
File: tarlz.info, Node: Portable character set, Next: File format, Prev: Invoking tarlz, Up: Top
|
||||
|
||||
3 File format
|
||||
3 POSIX portable filename character set
|
||||
***************************************
|
||||
|
||||
The set of characters from which portable file names are constructed.
|
||||
|
||||
A B C D E F G H I J K L M N O P Q R S T U V W X Y Z
|
||||
a b c d e f g h i j k l m n o p q r s t u v w x y z
|
||||
0 1 2 3 4 5 6 7 8 9 . _ -
|
||||
|
||||
The last three characters are the period, underscore, and
|
||||
hyphen-minus characters, respectively.
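Editor's illustration (not part of the commit): a name can be checked against the character set listed above with a few lines of Python. This is a minimal sketch; the names `PORTABLE`, `is_portable` and `non_portable_chars` are hypothetical, and '/' is treated only as a separator between components, not as a member of the set.

```python
import string

# The 65 characters listed above: A-Z, a-z, 0-9, period, underscore, hyphen-minus.
PORTABLE = frozenset(string.ascii_letters + string.digits + "._-")


def is_portable(component: str) -> bool:
    """True if a single file name component uses only portable characters."""
    return component != "" and all(c in PORTABLE for c in component)


def non_portable_chars(path: str) -> list:
    """Sorted list of the characters in 'path' outside '/' and the portable set."""
    return sorted({c for c in path if c not in PORTABLE and c != "/"})


if __name__ == "__main__":
    print(is_portable("tarlz-0.16.tar.lz"))
    print(non_portable_chars("doc/my file.txt"))
```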
|
||||
|
||||
|
||||
File: tarlz.info, Node: File format, Next: Amendments to pax format, Prev: Portable character set, Up: Top
|
||||
|
||||
4 File format
|
||||
*************
|
||||
|
||||
In the diagram below, a box like this:
|
||||
|
@ -393,7 +426,7 @@ sets). The members simply appear one after another in the file, with no
|
|||
additional information before, between, or after them.
|
||||
|
||||
Each lzip member contains one or more tar members in a simplified
|
||||
posix pax interchange format. The only pax typeflag value supported by
|
||||
POSIX pax interchange format. The only pax typeflag value supported by
|
||||
tarlz (in addition to the typeflag values defined by the ustar format)
|
||||
is 'x'. The pax format is an extension on top of the ustar format that
|
||||
removes the size limitations of the ustar format.
|
||||
|
@ -438,7 +471,7 @@ tar.lz
|
|||
+===============+=================================================+========+
|
||||
|
||||
|
||||
3.1 Pax header block
|
||||
4.1 Pax header block
|
||||
====================
|
||||
|
||||
The pax header block is identical to the ustar header block described
|
||||
|
@ -492,7 +525,7 @@ conversion to UTF-8 nor any other transformation.
|
|||
swapping of two bytes.
|
||||
|
||||
|
||||
3.2 Ustar header block
|
||||
4.2 Ustar header block
|
||||
======================
|
||||
|
||||
The ustar header block has a length of 512 bytes and is structured as
|
||||
|
@ -519,11 +552,10 @@ prefix 345 155
|
|||
All characters in the header block are coded using the ISO/IEC
|
||||
646:1991 (ASCII) standard, except in fields storing names for files,
|
||||
users, and groups. For maximum portability between implementations,
|
||||
names should only contain characters from the portable filename
|
||||
character set. But if an implementation supports the use of characters
|
||||
outside of '/' and the portable filename character set in names for
|
||||
files, users, and groups, tarlz will use the byte values in these names
|
||||
unmodified.
|
||||
names should only contain characters from the portable character set.
|
||||
But if an implementation supports the use of characters outside of '/'
|
||||
and the portable character set in names for files, users, and groups,
|
||||
tarlz will use the byte values in these names unmodified.
|
||||
|
||||
The fields name, linkname, and prefix are null-terminated character
|
||||
strings except when all characters in the array contain non-null
|
||||
|
@ -618,38 +650,45 @@ character.
|
|||
|
||||
File: tarlz.info, Node: Amendments to pax format, Next: Multi-threaded tar, Prev: File format, Up: Top
|
||||
|
||||
4 The reasons for the differences with pax
|
||||
5 The reasons for the differences with pax
|
||||
******************************************
|
||||
|
||||
Tarlz is meant to reliably detect invalid or corrupt metadata during
|
||||
decoding, and to create safe archives where corrupt metadata can be
|
||||
reliably detected. In order to achieve these goals, tarlz makes some
|
||||
changes to the variant of the pax format that it uses. This chapter
|
||||
describes these changes and the concrete reasons to implement them.
|
||||
Tarlz creates safe archives that allow the reliable detection of
|
||||
invalid or corrupt metadata during decoding even when the integrity
|
||||
checking of lzip can't be used because the lzip members are only
|
||||
decompressed partially, as it happens in parallel '--list' and
|
||||
'--extract'. In order to achieve this goal, tarlz makes some changes to
|
||||
the variant of the pax format that it uses. This chapter describes
|
||||
these changes and the concrete reasons to implement them.
|
||||
|
||||
|
||||
4.1 Add a CRC of the extended records
|
||||
5.1 Add a CRC of the extended records
|
||||
=====================================
|
||||
|
||||
The posix pax format has a serious flaw. The metadata stored in pax
|
||||
The POSIX pax format has a serious flaw. The metadata stored in pax
|
||||
extended records are not protected by any kind of check sequence.
|
||||
Corruption in a long filename may cause the extraction of the file in
|
||||
Corruption in a long file name may cause the extraction of the file in
|
||||
the wrong place without warning. Corruption in a large file size may
|
||||
cause the truncation of the file or the appending of garbage to the
|
||||
file, both followed by a spurious warning about a corrupt header far
|
||||
from the place of the undetected corruption.
|
||||
|
||||
Metadata like filename and file size must be always protected in an
|
||||
Metadata like file name and file size must be always protected in an
|
||||
archive format because of the adverse effects of undetected corruption
|
||||
in them, potentially much worse than undetected corruption in the data.
|
||||
Even more so in the case of pax because the amount of metadata it
|
||||
stores is potentially large, making undetected corruption more probable.
|
||||
|
||||
Headers and metadata must be protected separately from data because
|
||||
the integrity checking of lzip may not be able to detect the corruption
|
||||
before the metadata has been used, for example, to create a new file in
|
||||
the wrong place.
|
||||
|
||||
Because of the above, tarlz protects the extended records with a CRC
|
||||
in a way compatible with standard tar tools. *Note key_crc32::.
|
||||
|
||||
|
||||
4.2 Remove flawed backward compatibility
|
||||
5.2 Remove flawed backward compatibility
|
||||
========================================
|
||||
|
||||
In order to allow the extraction of pax archives by a tar utility
|
||||
|
@ -657,12 +696,12 @@ conforming to the POSIX-2:1993 standard, POSIX.1-2008 recommends
|
|||
selecting extended header field values that allow such tar to create a
|
||||
regular file containing the extended header records as data. This
|
||||
approach is broken because if the extended header is needed because of
|
||||
a long filename, the name and prefix fields will be unable to contain
|
||||
a long file name, the name and prefix fields will be unable to contain
|
||||
the full pathname of the file. Therefore the files corresponding to
|
||||
both the extended header and the overridden ustar header will be
|
||||
extracted using truncated filenames, perhaps overwriting existing files
|
||||
or directories. It may be a security risk to extract a file with a
|
||||
truncated filename.
|
||||
extracted using truncated file names, perhaps overwriting existing
|
||||
files or directories. It may be a security risk to extract a file with
|
||||
a truncated file name.
|
||||
|
||||
To avoid this problem, tarlz writes extended headers with all fields
|
||||
zeroed except size, chksum, typeflag, magic and version. This prevents
|
||||
|
@ -672,28 +711,29 @@ overridden by extended records.
|
|||
|
||||
If an extended header is required for any reason (for example a file
|
||||
size larger than 8 GiB or a link name longer than 100 bytes), tarlz
|
||||
moves the filename also to the extended header to prevent an ustar tool
|
||||
from trying to extract the file or link. This also makes easier during
|
||||
parallel decoding the detection of a tar member split between two lzip
|
||||
members at the boundary between the extended header and the ustar
|
||||
header.
|
||||
moves the file name also to the extended header to prevent an ustar
|
||||
tool from trying to extract the file or link. This also makes easier
|
||||
during parallel decoding the detection of a tar member split between
|
||||
two lzip members at the boundary between the extended header and the
|
||||
ustar header.
|
||||
|
||||
|
||||
4.3 As simple as possible (but not simpler)
|
||||
5.3 As simple as possible (but not simpler)
|
||||
===========================================
|
||||
|
||||
The tarlz format is mainly ustar. Extended pax headers are used only
|
||||
when needed because the length of a filename or link name, or the size
|
||||
when needed because the length of a file name or link name, or the size
|
||||
of a file exceed the limits of the ustar format. Adding extended
|
||||
headers to each member just to record subsecond timestamps seems
|
||||
wasteful for a backup format.
|
||||
wasteful for a backup format. Moreover, minimizing the overhead may
|
||||
help recovering the archive with lziprecover in case of corruption.
|
||||
|
||||
Global pax headers are tolerated, but not supported; they are parsed
|
||||
and ignored. Some operations may not behave as expected if the archive
|
||||
contains global headers.
|
||||
|
||||
|
||||
4.4 Avoid misconversions to/from UTF-8
|
||||
5.4 Avoid misconversions to/from UTF-8
|
||||
======================================
|
||||
|
||||
There is no portable way to tell what charset a text string is coded
|
||||
|
@ -705,7 +745,7 @@ this behavior will be adjusted with a command line option in the future.
|
|||
|
||||
File: tarlz.info, Node: Multi-threaded tar, Next: Minimum archive sizes, Prev: Amendments to pax format, Up: Top
|
||||
|
||||
5 Limitations of parallel tar decoding
|
||||
6 Limitations of parallel tar decoding
|
||||
**************************************
|
||||
|
||||
Safely decoding an arbitrary tar archive in parallel is impossible. For
|
||||
|
@ -753,7 +793,7 @@ example listing the Silesia corpus on a dual core machine:
|
|||
|
||||
File: tarlz.info, Node: Minimum archive sizes, Next: Examples, Prev: Multi-threaded tar, Up: Top
|
||||
|
||||
6 Minimum archive sizes required for multi-threaded block compression
|
||||
7 Minimum archive sizes required for multi-threaded block compression
|
||||
*********************************************************************
|
||||
|
||||
When creating or appending to a compressed archive using multi-threaded
|
||||
|
@ -791,7 +831,7 @@ Level
|
|||
|
||||
File: tarlz.info, Node: Examples, Next: Problems, Prev: Minimum archive sizes, Up: Top
|
||||
|
||||
7 A small tutorial with examples
|
||||
8 A small tutorial with examples
|
||||
********************************
|
||||
|
||||
Example 1: Create a multimember compressed archive 'archive.tar.lz'
|
||||
|
@ -850,7 +890,7 @@ Example 8: Copy the contents of directory 'sourcedir' to the directory
|
|||
|
||||
File: tarlz.info, Node: Problems, Next: Concept index, Prev: Examples, Up: Top
|
||||
|
||||
8 Reporting bugs
|
||||
9 Reporting bugs
|
||||
****************
|
||||
|
||||
There are probably bugs in tarlz. There are certainly errors and
|
||||
|
@ -881,6 +921,9 @@ Concept index
|
|||
* invoking: Invoking tarlz. (line 6)
|
||||
* minimum archive sizes: Minimum archive sizes. (line 6)
|
||||
* options: Invoking tarlz. (line 6)
|
||||
* parallel tar decoding: Multi-threaded tar. (line 6)
|
||||
* portable character set: Portable character set.
|
||||
(line 6)
|
||||
* usage: Invoking tarlz. (line 6)
|
||||
* version: Invoking tarlz. (line 6)
|
||||
|
||||
|
@ -888,20 +931,21 @@ Concept index
|
|||
|
||||
Tag Table:
|
||||
Node: Top223
|
||||
Node: Introduction1086
|
||||
Node: Invoking tarlz3337
|
||||
Ref: --data-size5489
|
||||
Ref: --bsolid12172
|
||||
Node: File format15802
|
||||
Ref: key_crc3220622
|
||||
Node: Amendments to pax format26039
|
||||
Ref: crc3226580
|
||||
Ref: flawed-compat27605
|
||||
Node: Multi-threaded tar30128
|
||||
Node: Minimum archive sizes32667
|
||||
Node: Examples34800
|
||||
Node: Problems36517
|
||||
Node: Concept index37043
|
||||
Node: Introduction1155
|
||||
Node: Invoking tarlz3841
|
||||
Ref: --data-size6006
|
||||
Ref: --bsolid13287
|
||||
Node: Portable character set16917
|
||||
Node: File format17420
|
||||
Ref: key_crc3222248
|
||||
Node: Amendments to pax format27647
|
||||
Ref: crc3228304
|
||||
Ref: flawed-compat29564
|
||||
Node: Multi-threaded tar32198
|
||||
Node: Minimum archive sizes34737
|
||||
Node: Examples36870
|
||||
Node: Problems38587
|
||||
Node: Concept index39113
|
||||
|
||||
End Tag Table
|
||||
|
||||
|
|