Merging upstream version 0.16.
Signed-off-by: Daniel Baumann <daniel@debian.org>
parent cf7dc90711
commit e896ecf9fe
20 changed files with 854 additions and 662 deletions
22
doc/tarlz.1
|
@ -1,5 +1,5 @@
|
|||
.\" DO NOT MODIFY THIS FILE! It was generated by help2man 1.46.1.
|
||||
.TH TARLZ "1" "April 2019" "tarlz 0.15" "User Commands"
|
||||
.TH TARLZ "1" "October 2019" "tarlz 0.16" "User Commands"
|
||||
.SH NAME
|
||||
tarlz \- creates tar archives with multimember lzip compression
|
||||
.SH SYNOPSIS
|
||||
|
@ -8,15 +8,19 @@ tarlz \- creates tar archives with multimember lzip compression
|
|||
.SH DESCRIPTION
|
||||
Tarlz is a massively parallel (multi\-threaded) combined implementation of
|
||||
the tar archiver and the lzip compressor. Tarlz creates, lists and extracts
|
||||
archives in a simplified posix pax format compressed with lzip, keeping the
|
||||
alignment between tar members and lzip members. This method adds an indexed
|
||||
lzip layer on top of the tar archive, making it possible to decode the
|
||||
archive safely in parallel. The resulting multimember tar.lz archive is
|
||||
fully backward compatible with standard tar tools like GNU tar, which treat
|
||||
it like any other tar.lz archive. Tarlz can append files to the end of such
|
||||
compressed archives.
|
||||
archives in a simplified and safer variant of the POSIX pax format
|
||||
compressed with lzip, keeping the alignment between tar members and lzip
|
||||
members. The resulting multimember tar.lz archive is fully backward
|
||||
compatible with standard tar tools like GNU tar, which treat it like any
|
||||
other tar.lz archive. Tarlz can append files to the end of such compressed
|
||||
archives.
|
||||
.PP
|
||||
The tarlz file format is a safe posix\-style backup format. In case of
|
||||
Keeping the alignment between tar members and lzip members has two
|
||||
advantages. It adds an indexed lzip layer on top of the tar archive, making
|
||||
it possible to decode the archive safely in parallel. It also minimizes the
|
||||
amount of data lost in case of corruption.
|
||||
.PP
|
||||
The tarlz file format is a safe POSIX\-style backup format. In case of
|
||||
corruption, tarlz can extract all the undamaged members from the tar.lz
|
||||
archive, skipping over the damaged members, just like the standard
|
||||
(uncompressed) tar. Moreover, the option '\-\-keep\-damaged' can be used to
|
||||
|
|
232
doc/tarlz.info
|
@ -11,12 +11,13 @@ File: tarlz.info, Node: Top, Next: Introduction, Up: (dir)
|
|||
Tarlz Manual
|
||||
************
|
||||
|
||||
This manual is for Tarlz (version 0.15, 11 April 2019).
|
||||
This manual is for Tarlz (version 0.16, 8 October 2019).
|
||||
|
||||
* Menu:
|
||||
|
||||
* Introduction:: Purpose and features of tarlz
|
||||
* Invoking tarlz:: Command line interface
|
||||
* Portable character set:: POSIX portable filename character set
|
||||
* File format:: Detailed format of the compressed archive
|
||||
* Amendments to pax format:: The reasons for the differences with pax
|
||||
* Multi-threaded tar:: Limitations of parallel tar decoding
|
||||
|
@ -39,13 +40,19 @@ File: tarlz.info, Node: Introduction, Next: Invoking tarlz, Prev: Top, Up: T
|
|||
|
||||
Tarlz is a massively parallel (multi-threaded) combined implementation
|
||||
of the tar archiver and the lzip compressor. Tarlz creates, lists and
|
||||
extracts archives in a simplified posix pax format compressed with
|
||||
lzip, keeping the alignment between tar members and lzip members. This
|
||||
method adds an indexed lzip layer on top of the tar archive, making it
|
||||
possible to decode the archive safely in parallel. The resulting
|
||||
multimember tar.lz archive is fully backward compatible with standard
|
||||
tar tools like GNU tar, which treat it like any other tar.lz archive.
|
||||
Tarlz can append files to the end of such compressed archives.
|
||||
extracts archives in a simplified and safer variant of the POSIX pax
|
||||
format compressed with lzip, keeping the alignment between tar members
|
||||
and lzip members. The resulting multimember tar.lz archive is fully
|
||||
backward compatible with standard tar tools like GNU tar, which treat
|
||||
it like any other tar.lz archive. Tarlz can append files to the end of
|
||||
such compressed archives.
|
||||
|
||||
Keeping the alignment between tar members and lzip members has two
|
||||
advantages. It adds an indexed lzip layer on top of the tar archive,
|
||||
making it possible to decode the archive safely in parallel. It also
|
||||
minimizes the amount of data lost in case of corruption. Compressing a
|
||||
tar archive with plzip may even double the amount of files lost for
|
||||
each lzip member damaged because it does not keep the members aligned.
|
||||
|
||||
Tarlz can create tar archives with five levels of compression
|
||||
granularity; per file (--no-solid), per block (--bsolid, default), per
|
||||
|
@ -62,7 +69,7 @@ archive, but it has the following advantages:
|
|||
member), and unwanted members can be deleted from the archive. Just
|
||||
like an uncompressed tar archive.
|
||||
|
||||
* It is a safe posix-style backup format. In case of corruption,
|
||||
* It is a safe POSIX-style backup format. In case of corruption,
|
||||
tarlz can extract all the undamaged members from the tar.lz
|
||||
archive, skipping over the damaged members, just like the standard
|
||||
(uncompressed) tar. Moreover, the option '--keep-damaged' can be
|
||||
|
@ -77,10 +84,11 @@ archive, but it has the following advantages:
|
|||
with standard tar tools. *Note crc32::.
|
||||
|
||||
Tarlz does not understand other tar formats like 'gnu', 'oldgnu',
|
||||
'star' or 'v7'.
|
||||
'star' or 'v7'. 'tarlz -tf archive.tar.lz > /dev/null' can be used to
|
||||
verify that the format of the archive is compatible with tarlz.
|
||||
|
||||
|
||||
File: tarlz.info, Node: Invoking tarlz, Next: File format, Prev: Introduction, Up: Top
|
||||
File: tarlz.info, Node: Invoking tarlz, Next: Portable character set, Prev: Introduction, Up: Top
|
||||
|
||||
2 Invoking tarlz
|
||||
****************
|
||||
|
@ -94,9 +102,9 @@ FILE is a directory.
|
|||
|
||||
On archive creation or appending tarlz archives the files specified,
|
||||
but removes from member names any leading and trailing slashes and any
|
||||
filename prefixes containing a '..' component. On extraction, leading
|
||||
file name prefixes containing a '..' component. On extraction, leading
|
||||
and trailing slashes are also removed from member names, and archive
|
||||
members containing a '..' component in the filename are skipped. Tarlz
|
||||
members containing a '..' component in the file name are skipped. Tarlz
|
||||
detects when the archive being created or enlarged is among the files
|
||||
to be dumped, appended or concatenated, and skips it.
|
||||
|
||||
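The member-name rules described in the hunk above (strip leading and trailing slashes, drop prefixes containing a '..' component on creation, skip members containing '..' on extraction) can be sketched as follows. This is only an illustration of the documented behavior, not tarlz's actual code; the function names `name_for_archive` and `name_for_extraction` are hypothetical, and how tarlz treats a name that becomes empty is not stated in the manual.

```python
def name_for_archive(path: str) -> str:
    """Creation/append side: drop any prefix that contains a '..'
    component, then strip leading and trailing slashes."""
    parts = path.split("/")
    if ".." in parts:
        # Index of the last '..' component; keep only what follows it.
        last = len(parts) - 1 - parts[::-1].index("..")
        parts = parts[last + 1:]
    return "/".join(parts).strip("/")


def name_for_extraction(member: str):
    """Extraction side: strip slashes; return None when the member
    contains a '..' component and would therefore be skipped."""
    name = member.strip("/")
    if ".." in name.split("/"):
        return None
    return name


if __name__ == "__main__":
    print(name_for_archive("/usr/../etc/passwd/"))
    print(name_for_extraction("/a/../b"))
```

Note that only a whole '..' component triggers the rule; a name such as 'a..b/c' is left alone in this sketch.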
|
@ -149,30 +157,31 @@ equivalent to '-1 --solid'
|
|||
Change to directory DIR. When creating or appending, the position
|
||||
of each '-C' option in the command line is significant; it will
|
||||
change the current working directory for the following FILES until
|
||||
a new '-C' option appears in the command line. When extracting, all
|
||||
the '-C' options are executed in sequence before starting the
|
||||
extraction. Listing ignores any '-C' options specified. DIR is
|
||||
relative to the then current working directory, perhaps changed by
|
||||
a previous '-C' option.
|
||||
a new '-C' option appears in the command line. When extracting or
|
||||
comparing, all the '-C' options are executed in sequence before
|
||||
reading the archive. Listing ignores any '-C' options specified.
|
||||
DIR is relative to the then current working directory, perhaps
|
||||
changed by a previous '-C' option.
|
||||
|
||||
Note that a process can only have one current working directory
|
||||
(CWD). Therefore multi-threading can't be used to create an
|
||||
archive if a '-C' option appears after a relative filename in the
|
||||
archive if a '-C' option appears after a relative file name in the
|
||||
command line.
|
||||
|
||||
'-d'
|
||||
'--diff'
|
||||
Find differences between archive and file system. For each tar
|
||||
member in the archive, verify that the corresponding file exists
|
||||
and is of the same type (regular file, directory, etc). Report on
|
||||
standard output the differences found in type, mode (permissions),
|
||||
owner and group IDs, modification time, file size, file contents
|
||||
(of regular files), target (of symlinks) and device number (of
|
||||
block/character special files).
|
||||
Compare and report differences between archive and file system.
|
||||
For each tar member in the archive, verify that the corresponding
|
||||
file in the file system exists and is of the same type (regular
|
||||
file, directory, etc). Report on standard output the differences
|
||||
found in type, mode (permissions), owner and group IDs,
|
||||
modification time, file size, file contents (of regular files),
|
||||
target (of symlinks) and device number (of block/character special
|
||||
files).
|
||||
|
||||
As tarlz removes leading slashes from member names, the '-C'
|
||||
option may be used in combination with '--diff' when absolute
|
||||
filenames were used on archive creation: 'tarlz -C / -d'.
|
||||
option may be used in combination with '--diff' when absolute file
|
||||
names were used on archive creation: 'tarlz -C / -d'.
|
||||
Alternatively, tarlz may be run from the root directory to perform
|
||||
the comparison.
|
||||
|
||||
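The position-dependent meaning of '-C' during creation or appending, as described in the hunk above, can be modelled with a small sketch. It assumes the command line has already been split into hypothetical `("C", dir)` and `("file", name)` tuples, and it resolves paths lexically with `posixpath`, so symlinks are ignored; tarlz itself changes the real working directory.

```python
import posixpath


def resolve_files(args, start_dir):
    """Return the absolute path each FILE argument refers to.

    args is a list of ("C", dir) or ("file", name) tuples in
    command-line order; start_dir is the initial working directory.
    """
    cwd = start_dir
    resolved = []
    for kind, value in args:
        if kind == "C":
            # DIR is relative to the then current working directory,
            # which an earlier -C may already have changed.
            cwd = posixpath.normpath(posixpath.join(cwd, value))
        else:
            resolved.append(posixpath.normpath(posixpath.join(cwd, value)))
    return resolved


if __name__ == "__main__":
    cmdline = [("file", "a"), ("C", "src"), ("file", "b"),
               ("C", "../doc"), ("file", "c")]
    for path in resolve_files(cmdline, "/home/u"):
        print(path)
```

In this example 'a' is read from the starting directory, 'b' from 'src', and 'c' from 'doc', which also shows why a '-C' placed after a relative file name prevents the files from being read concurrently from a single working directory.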
|
@ -184,15 +193,22 @@ equivalent to '-1 --solid'
|
|||
Delete the specified files and directories from an archive in
|
||||
place. It currently can delete only from uncompressed archives and
|
||||
from archives with individually compressed files ('--no-solid'
|
||||
archives). To delete a directory without deleting the files under
|
||||
it, use 'tarlz --delete -f foo --exclude='dir/*' dir'. Deleting in
|
||||
place may be dangerous. A corrupt archive, a power cut, or an I/O
|
||||
error may cause data loss.
|
||||
archives). Note that files of about '--data-size' or larger are
|
||||
compressed individually even if '--bsolid' is used, and can
|
||||
therefore be deleted. Tarlz takes care to not delete a tar member
|
||||
unless it is possible to do so. For example it won't try to delete
|
||||
a tar member that is not individually compressed. To delete a
|
||||
directory without deleting the files under it, use
|
||||
'tarlz --delete -f foo --exclude='dir/*' dir'. Deleting in place
|
||||
may be dangerous. A corrupt archive, a power cut, or an I/O error
|
||||
may cause data loss.
|
||||
|
||||
'--exclude=PATTERN'
|
||||
Exclude files matching a shell pattern like '*.o'. A file is
|
||||
considered to match if any component of the filename matches. For
|
||||
example, '*.o' matches 'foo.o', 'foo.o/bar' and 'foo/bar.o'.
|
||||
considered to match if any component of the file name matches. For
|
||||
example, '*.o' matches 'foo.o', 'foo.o/bar' and 'foo/bar.o'. If
|
||||
PATTERN contains a '/', it matches a corresponding '/' in the file
|
||||
name. For example, 'foo/*.o' matches 'foo/bar.o'.
|
||||
|
||||
'-f ARCHIVE'
|
||||
'--file=ARCHIVE'
|
||||
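The '--exclude' matching rule documented above (a pattern matches if it matches any component, and a '/' in the pattern matches a corresponding '/' in the file name) can be approximated as below. This is a rough model built only from the manual's examples; the function name `is_excluded` is hypothetical, Python's `fnmatchcase` stands in for whatever matcher tarlz really uses, and edge cases may differ.

```python
from fnmatch import fnmatchcase


def is_excluded(name: str, pattern: str) -> bool:
    """True if some run of consecutive components of name matches
    the components of pattern one by one."""
    name_parts = [p for p in name.split("/") if p]
    pat_parts = [p for p in pattern.split("/") if p]
    width = len(pat_parts)
    if width == 0:
        return False
    # Slide a window of 'width' components over the file name.
    for start in range(len(name_parts) - width + 1):
        window = name_parts[start:start + width]
        if all(fnmatchcase(n, p) for n, p in zip(window, pat_parts)):
            return True
    return False


if __name__ == "__main__":
    for name in ("foo.o", "foo.o/bar", "foo/bar.o", "foo/bar.c"):
        print(name, is_excluded(name, "*.o"))
```

Matching component by component keeps '*' from crossing a '/', which is what lets 'foo/*.o' match 'foo/bar.o' but not a bare 'bar.o'.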
|
@ -234,13 +250,15 @@ equivalent to '-1 --solid'
|
|||
Compressed members can't be appended to an uncompressed archive,
|
||||
nor vice versa. If the archive is compressed, it must be a
|
||||
multimember lzip file with the two end-of-file blocks plus any
|
||||
zero padding contained in the last lzip member of the archive.
|
||||
Appending works as follows; first the end-of-file blocks are
|
||||
removed, then the new members are appended, and finally two new
|
||||
end-of-file blocks are appended to the archive. If the archive is
|
||||
uncompressed, tarlz parses and skips tar headers until it finds
|
||||
the end-of-file blocks. Exit with status 0 without modifying the
|
||||
archive if no FILES have been specified.
|
||||
zero padding contained in the last lzip member of the archive. It
|
||||
is possible to append files to an archive with a different
|
||||
compression granularity. Appending works as follows; first the
|
||||
end-of-file blocks are removed, then the new members are appended,
|
||||
and finally two new end-of-file blocks are appended to the
|
||||
archive. If the archive is uncompressed, tarlz parses and skips
|
||||
tar headers until it finds the end-of-file blocks. Exit with
|
||||
status 0 without modifying the archive if no FILES have been
|
||||
specified.
|
||||
|
||||
'-t'
|
||||
'--list'
|
||||
|
@ -351,7 +369,7 @@ equivalent to '-1 --solid'
|
|||
that a corrupt 'GNU.crc32' keyword, for example 'GNU.crc33', is
|
||||
reported as a missing CRC instead of as a corrupt record. This
|
||||
misleading 'Missing CRC' message is the consequence of a flaw in
|
||||
the posix pax format; i.e., the lack of a mandatory check sequence
|
||||
the POSIX pax format; i.e., the lack of a mandatory check sequence
|
||||
in the extended records. *Note crc32::.
|
||||
|
||||
'--out-slots=N'
|
||||
|
@ -369,9 +387,24 @@ invalid input file, 3 for an internal consistency error (eg, bug) which
|
|||
caused tarlz to panic.
|
||||
|
||||
|
||||
File: tarlz.info, Node: File format, Next: Amendments to pax format, Prev: Invoking tarlz, Up: Top
|
||||
File: tarlz.info, Node: Portable character set, Next: File format, Prev: Invoking tarlz, Up: Top
|
||||
|
||||
3 File format
|
||||
3 POSIX portable filename character set
|
||||
***************************************
|
||||
|
||||
The set of characters from which portable file names are constructed.
|
||||
|
||||
A B C D E F G H I J K L M N O P Q R S T U V W X Y Z
|
||||
a b c d e f g h i j k l m n o p q r s t u v w x y z
|
||||
0 1 2 3 4 5 6 7 8 9 . _ -
|
||||
|
||||
The last three characters are the period, underscore, and
|
||||
hyphen-minus characters, respectively.
|
||||
|
||||
|
||||
File: tarlz.info, Node: File format, Next: Amendments to pax format, Prev: Portable character set, Up: Top
|
||||
|
||||
4 File format
|
||||
*************
|
||||
|
||||
In the diagram below, a box like this:
|
||||
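A check against the POSIX portable filename character set listed in the new "Portable character set" node might look like the following sketch. The helper names `is_portable_name` and `is_portable_path` are hypothetical; tarlz itself stores names outside this set unmodified, so such a check would only be advisory.

```python
import string

# A-Z, a-z, 0-9, period, underscore and hyphen-minus.
PORTABLE_CHARS = frozenset(string.ascii_letters + string.digits + "._-")


def is_portable_name(name: str) -> bool:
    """True if a single, non-empty file name component uses only
    characters from the portable filename character set."""
    return bool(name) and all(c in PORTABLE_CHARS for c in name)


def is_portable_path(path: str) -> bool:
    """True if every '/'-separated component of path is portable."""
    parts = [p for p in path.split("/") if p]
    return bool(parts) and all(is_portable_name(p) for p in parts)


if __name__ == "__main__":
    for candidate in ("archive-0.16.tar.lz", "my file", "doc/tarlz.texi"):
        print(candidate, is_portable_path(candidate))
```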
|
@ -393,7 +426,7 @@ sets). The members simply appear one after another in the file, with no
|
|||
additional information before, between, or after them.
|
||||
|
||||
Each lzip member contains one or more tar members in a simplified
|
||||
posix pax interchange format. The only pax typeflag value supported by
|
||||
POSIX pax interchange format. The only pax typeflag value supported by
|
||||
tarlz (in addition to the typeflag values defined by the ustar format)
|
||||
is 'x'. The pax format is an extension on top of the ustar format that
|
||||
removes the size limitations of the ustar format.
|
||||
|
@ -438,7 +471,7 @@ tar.lz
|
|||
+===============+=================================================+========+
|
||||
|
||||
|
||||
3.1 Pax header block
|
||||
4.1 Pax header block
|
||||
====================
|
||||
|
||||
The pax header block is identical to the ustar header block described
|
||||
|
@ -492,7 +525,7 @@ conversion to UTF-8 nor any other transformation.
|
|||
swapping of two bytes.
|
||||
|
||||
|
||||
3.2 Ustar header block
|
||||
4.2 Ustar header block
|
||||
======================
|
||||
|
||||
The ustar header block has a length of 512 bytes and is structured as
|
||||
|
@ -519,11 +552,10 @@ prefix 345 155
|
|||
All characters in the header block are coded using the ISO/IEC
|
||||
646:1991 (ASCII) standard, except in fields storing names for files,
|
||||
users, and groups. For maximum portability between implementations,
|
||||
names should only contain characters from the portable filename
|
||||
character set. But if an implementation supports the use of characters
|
||||
outside of '/' and the portable filename character set in names for
|
||||
files, users, and groups, tarlz will use the byte values in these names
|
||||
unmodified.
|
||||
names should only contain characters from the portable character set.
|
||||
But if an implementation supports the use of characters outside of '/'
|
||||
and the portable character set in names for files, users, and groups,
|
||||
tarlz will use the byte values in these names unmodified.
|
||||
|
||||
The fields name, linkname, and prefix are null-terminated character
|
||||
strings except when all characters in the array contain non-null
|
||||
|
@ -618,38 +650,45 @@ character.
|
|||
|
||||
File: tarlz.info, Node: Amendments to pax format, Next: Multi-threaded tar, Prev: File format, Up: Top
|
||||
|
||||
4 The reasons for the differences with pax
|
||||
5 The reasons for the differences with pax
|
||||
******************************************
|
||||
|
||||
Tarlz is meant to reliably detect invalid or corrupt metadata during
|
||||
decoding, and to create safe archives where corrupt metadata can be
|
||||
reliably detected. In order to achieve these goals, tarlz makes some
|
||||
changes to the variant of the pax format that it uses. This chapter
|
||||
describes these changes and the concrete reasons to implement them.
|
||||
Tarlz creates safe archives that allow the reliable detection of
|
||||
invalid or corrupt metadata during decoding even when the integrity
|
||||
checking of lzip can't be used because the lzip members are only
|
||||
decompressed partially, as it happens in parallel '--list' and
|
||||
'--extract'. In order to achieve this goal, tarlz makes some changes to
|
||||
the variant of the pax format that it uses. This chapter describes
|
||||
these changes and the concrete reasons to implement them.
|
||||
|
||||
|
||||
4.1 Add a CRC of the extended records
|
||||
5.1 Add a CRC of the extended records
|
||||
=====================================
|
||||
|
||||
The posix pax format has a serious flaw. The metadata stored in pax
|
||||
The POSIX pax format has a serious flaw. The metadata stored in pax
|
||||
extended records are not protected by any kind of check sequence.
|
||||
Corruption in a long filename may cause the extraction of the file in
|
||||
Corruption in a long file name may cause the extraction of the file in
|
||||
the wrong place without warning. Corruption in a large file size may
|
||||
cause the truncation of the file or the appending of garbage to the
|
||||
file, both followed by a spurious warning about a corrupt header far
|
||||
from the place of the undetected corruption.
|
||||
|
||||
Metadata like filename and file size must be always protected in an
|
||||
Metadata like file name and file size must be always protected in an
|
||||
archive format because of the adverse effects of undetected corruption
|
||||
in them, potentially much worse that undetected corruption in the data.
|
||||
Even more so in the case of pax because the amount of metadata it
|
||||
stores is potentially large, making undetected corruption more probable.
|
||||
|
||||
Headers and metadata must be protected separately from data because
|
||||
the integrity checking of lzip may not be able to detect the corruption
|
||||
before the metadata has been used, for example, to create a new file in
|
||||
the wrong place.
|
||||
|
||||
Because of the above, tarlz protects the extended records with a CRC
|
||||
in a way compatible with standard tar tools. *Note key_crc32::.
|
||||
|
||||
|
||||
4.2 Remove flawed backward compatibility
|
||||
5.2 Remove flawed backward compatibility
|
||||
========================================
|
||||
|
||||
In order to allow the extraction of pax archives by a tar utility
|
||||
|
@ -657,12 +696,12 @@ conforming to the POSIX-2:1993 standard, POSIX.1-2008 recommends
|
|||
selecting extended header field values that allow such tar to create a
|
||||
regular file containing the extended header records as data. This
|
||||
approach is broken because if the extended header is needed because of
|
||||
a long filename, the name and prefix fields will be unable to contain
|
||||
a long file name, the name and prefix fields will be unable to contain
|
||||
the full pathname of the file. Therefore the files corresponding to
|
||||
both the extended header and the overridden ustar header will be
|
||||
extracted using truncated filenames, perhaps overwriting existing files
|
||||
or directories. It may be a security risk to extract a file with a
|
||||
truncated filename.
|
||||
extracted using truncated file names, perhaps overwriting existing
|
||||
files or directories. It may be a security risk to extract a file with
|
||||
a truncated file name.
|
||||
|
||||
To avoid this problem, tarlz writes extended headers with all fields
|
||||
zeroed except size, chksum, typeflag, magic and version. This prevents
|
||||
|
@ -672,28 +711,29 @@ overridden by extended records.
|
|||
|
||||
If an extended header is required for any reason (for example a file
|
||||
size larger than 8 GiB or a link name longer than 100 bytes), tarlz
|
||||
moves the filename also to the extended header to prevent an ustar tool
|
||||
from trying to extract the file or link. This also makes easier during
|
||||
parallel decoding the detection of a tar member split between two lzip
|
||||
members at the boundary between the extended header and the ustar
|
||||
header.
|
||||
moves the file name also to the extended header to prevent an ustar
|
||||
tool from trying to extract the file or link. This also makes easier
|
||||
during parallel decoding the detection of a tar member split between
|
||||
two lzip members at the boundary between the extended header and the
|
||||
ustar header.
|
||||
|
||||
|
||||
4.3 As simple as possible (but not simpler)
|
||||
5.3 As simple as possible (but not simpler)
|
||||
===========================================
|
||||
|
||||
The tarlz format is mainly ustar. Extended pax headers are used only
|
||||
when needed because the length of a filename or link name, or the size
|
||||
when needed because the length of a file name or link name, or the size
|
||||
of a file exceed the limits of the ustar format. Adding extended
|
||||
headers to each member just to record subsecond timestamps seems
|
||||
wasteful for a backup format.
|
||||
wasteful for a backup format. Moreover, minimizing the overhead may
|
||||
help recovering the archive with lziprecover in case of corruption.
|
||||
|
||||
Global pax headers are tolerated, but not supported; they are parsed
|
||||
and ignored. Some operations may not behave as expected if the archive
|
||||
contains global headers.
|
||||
|
||||
|
||||
4.4 Avoid misconversions to/from UTF-8
|
||||
5.4 Avoid misconversions to/from UTF-8
|
||||
======================================
|
||||
|
||||
There is no portable way to tell what charset a text string is coded
|
||||
|
@ -705,7 +745,7 @@ this behavior will be adjusted with a command line option in the future.
|
|||
|
||||
File: tarlz.info, Node: Multi-threaded tar, Next: Minimum archive sizes, Prev: Amendments to pax format, Up: Top
|
||||
|
||||
5 Limitations of parallel tar decoding
|
||||
6 Limitations of parallel tar decoding
|
||||
**************************************
|
||||
|
||||
Safely decoding an arbitrary tar archive in parallel is impossible. For
|
||||
|
@ -753,7 +793,7 @@ example listing the Silesia corpus on a dual core machine:
|
|||
|
||||
File: tarlz.info, Node: Minimum archive sizes, Next: Examples, Prev: Multi-threaded tar, Up: Top
|
||||
|
||||
6 Minimum archive sizes required for multi-threaded block compression
|
||||
7 Minimum archive sizes required for multi-threaded block compression
|
||||
*********************************************************************
|
||||
|
||||
When creating or appending to a compressed archive using multi-threaded
|
||||
|
@ -791,7 +831,7 @@ Level
|
|||
|
||||
File: tarlz.info, Node: Examples, Next: Problems, Prev: Minimum archive sizes, Up: Top
|
||||
|
||||
7 A small tutorial with examples
|
||||
8 A small tutorial with examples
|
||||
********************************
|
||||
|
||||
Example 1: Create a multimember compressed archive 'archive.tar.lz'
|
||||
|
@ -850,7 +890,7 @@ Example 8: Copy the contents of directory 'sourcedir' to the directory
|
|||
|
||||
File: tarlz.info, Node: Problems, Next: Concept index, Prev: Examples, Up: Top
|
||||
|
||||
8 Reporting bugs
|
||||
9 Reporting bugs
|
||||
****************
|
||||
|
||||
There are probably bugs in tarlz. There are certainly errors and
|
||||
|
@ -881,6 +921,9 @@ Concept index
|
|||
* invoking: Invoking tarlz. (line 6)
|
||||
* minimum archive sizes: Minimum archive sizes. (line 6)
|
||||
* options: Invoking tarlz. (line 6)
|
||||
* parallel tar decoding: Multi-threaded tar. (line 6)
|
||||
* portable character set: Portable character set.
|
||||
(line 6)
|
||||
* usage: Invoking tarlz. (line 6)
|
||||
* version: Invoking tarlz. (line 6)
|
||||
|
||||
|
@ -888,20 +931,21 @@ Concept index
|
|||
|
||||
Tag Table:
|
||||
Node: Top223
|
||||
Node: Introduction1086
|
||||
Node: Invoking tarlz3337
|
||||
Ref: --data-size5489
|
||||
Ref: --bsolid12172
|
||||
Node: File format15802
|
||||
Ref: key_crc3220622
|
||||
Node: Amendments to pax format26039
|
||||
Ref: crc3226580
|
||||
Ref: flawed-compat27605
|
||||
Node: Multi-threaded tar30128
|
||||
Node: Minimum archive sizes32667
|
||||
Node: Examples34800
|
||||
Node: Problems36517
|
||||
Node: Concept index37043
|
||||
Node: Introduction1155
|
||||
Node: Invoking tarlz3841
|
||||
Ref: --data-size6006
|
||||
Ref: --bsolid13287
|
||||
Node: Portable character set16917
|
||||
Node: File format17420
|
||||
Ref: key_crc3222248
|
||||
Node: Amendments to pax format27647
|
||||
Ref: crc3228304
|
||||
Ref: flawed-compat29564
|
||||
Node: Multi-threaded tar32198
|
||||
Node: Minimum archive sizes34737
|
||||
Node: Examples36870
|
||||
Node: Problems38587
|
||||
Node: Concept index39113
|
||||
|
||||
End Tag Table
|
||||
|
||||
|
|
171
doc/tarlz.texi
|
@ -6,8 +6,8 @@
|
|||
@finalout
|
||||
@c %**end of header
|
||||
|
||||
@set UPDATED 11 April 2019
|
||||
@set VERSION 0.15
|
||||
@set UPDATED 8 October 2019
|
||||
@set VERSION 0.16
|
||||
|
||||
@dircategory Data Compression
|
||||
@direntry
|
||||
|
@ -37,6 +37,7 @@ This manual is for Tarlz (version @value{VERSION}, @value{UPDATED}).
|
|||
@menu
|
||||
* Introduction:: Purpose and features of tarlz
|
||||
* Invoking tarlz:: Command line interface
|
||||
* Portable character set:: POSIX portable filename character set
|
||||
* File format:: Detailed format of the compressed archive
|
||||
* Amendments to pax format:: The reasons for the differences with pax
|
||||
* Multi-threaded tar:: Limitations of parallel tar decoding
|
||||
|
@ -60,13 +61,19 @@ to copy, distribute and modify it.
|
|||
@uref{http://www.nongnu.org/lzip/tarlz.html,,Tarlz} is a massively parallel
|
||||
(multi-threaded) combined implementation of the tar archiver and the
|
||||
@uref{http://www.nongnu.org/lzip/lzip.html,,lzip} compressor. Tarlz creates,
|
||||
lists and extracts archives in a simplified posix pax format compressed with
|
||||
lzip, keeping the alignment between tar members and lzip members. This
|
||||
method adds an indexed lzip layer on top of the tar archive, making it
|
||||
possible to decode the archive safely in parallel. The resulting multimember
|
||||
tar.lz archive is fully backward compatible with standard tar tools like GNU
|
||||
tar, which treat it like any other tar.lz archive. Tarlz can append files to
|
||||
the end of such compressed archives.
|
||||
lists and extracts archives in a simplified and safer variant of the POSIX
|
||||
pax format compressed with lzip, keeping the alignment between tar members
|
||||
and lzip members. The resulting multimember tar.lz archive is fully backward
|
||||
compatible with standard tar tools like GNU tar, which treat it like any
|
||||
other tar.lz archive. Tarlz can append files to the end of such compressed
|
||||
archives.
|
||||
|
||||
Keeping the alignment between tar members and lzip members has two
|
||||
advantages. It adds an indexed lzip layer on top of the tar archive, making
|
||||
it possible to decode the archive safely in parallel. It also minimizes the
|
||||
amount of data lost in case of corruption. Compressing a tar archive with
|
||||
plzip may even double the amount of files lost for each lzip member damaged
|
||||
because it does not keep the members aligned.
|
||||
|
||||
Tarlz can create tar archives with five levels of compression granularity;
|
||||
per file (---no-solid), per block (---bsolid, default), per directory
|
||||
|
@ -88,7 +95,7 @@ member), and unwanted members can be deleted from the archive. Just
|
|||
like an uncompressed tar archive.
|
||||
|
||||
@item
|
||||
It is a safe posix-style backup format. In case of corruption,
|
||||
It is a safe POSIX-style backup format. In case of corruption,
|
||||
tarlz can extract all the undamaged members from the tar.lz
|
||||
archive, skipping over the damaged members, just like the standard
|
||||
(uncompressed) tar. Moreover, the option @samp{--keep-damaged} can be
|
||||
|
@ -105,7 +112,9 @@ Tarlz protects the extended records with a CRC in a way compatible with
|
|||
standard tar tools. @xref{crc32}.
|
||||
|
||||
Tarlz does not understand other tar formats like @samp{gnu}, @samp{oldgnu},
|
||||
@samp{star} or @samp{v7}.
|
||||
@samp{star} or @samp{v7}. @w{@samp{tarlz -tf archive.tar.lz > /dev/null}}
|
||||
can be used to verify that the format of the archive is compatible with
|
||||
tarlz.
|
||||
|
||||
|
||||
@node Invoking tarlz
|
||||
|
@ -126,10 +135,10 @@ All operations except @samp{--concatenate} operate on whole trees if any
|
|||
@var{file} is a directory.
|
||||
|
||||
On archive creation or appending tarlz archives the files specified, but
|
||||
removes from member names any leading and trailing slashes and any filename
|
||||
removes from member names any leading and trailing slashes and any file name
|
||||
prefixes containing a @samp{..} component. On extraction, leading and
|
||||
trailing slashes are also removed from member names, and archive members
|
||||
containing a @samp{..} component in the filename are skipped. Tarlz detects
|
||||
containing a @samp{..} component in the file name are skipped. Tarlz detects
|
||||
when the archive being created or enlarged is among the files to be dumped,
|
||||
appended or concatenated, and skips it.
|
||||
|
||||
|
@ -179,30 +188,30 @@ Create a new archive from @var{files}.
|
|||
|
||||
@item -C @var{dir}
|
||||
@itemx --directory=@var{dir}
|
||||
Change to directory @var{dir}. When creating or appending, the position
|
||||
of each @samp{-C} option in the command line is significant; it will
|
||||
change the current working directory for the following @var{files} until
|
||||
a new @samp{-C} option appears in the command line. When extracting, all
|
||||
the @samp{-C} options are executed in sequence before starting the
|
||||
extraction. Listing ignores any @samp{-C} options specified. @var{dir}
|
||||
is relative to the then current working directory, perhaps changed by a
|
||||
Change to directory @var{dir}. When creating or appending, the position of
|
||||
each @samp{-C} option in the command line is significant; it will change the
|
||||
current working directory for the following @var{files} until a new
|
||||
@samp{-C} option appears in the command line. When extracting or comparing,
|
||||
all the @samp{-C} options are executed in sequence before reading the
|
||||
archive. Listing ignores any @samp{-C} options specified. @var{dir} is
|
||||
relative to the then current working directory, perhaps changed by a
|
||||
previous @samp{-C} option.
|
||||
|
||||
Note that a process can only have one current working directory (CWD).
|
||||
Therefore multi-threading can't be used to create an archive if a @samp{-C}
|
||||
option appears after a relative filename in the command line.
|
||||
option appears after a relative file name in the command line.
|
||||
|
||||
@item -d
|
||||
@itemx --diff
|
||||
Find differences between archive and file system. For each tar member in the
|
||||
archive, verify that the corresponding file exists and is of the same type
|
||||
(regular file, directory, etc). Report on standard output the differences
|
||||
found in type, mode (permissions), owner and group IDs, modification time,
|
||||
file size, file contents (of regular files), target (of symlinks) and device
|
||||
number (of block/character special files).
|
||||
Compare and report differences between archive and file system. For each tar
|
||||
member in the archive, verify that the corresponding file in the file system
|
||||
exists and is of the same type (regular file, directory, etc). Report on
|
||||
standard output the differences found in type, mode (permissions), owner and
|
||||
group IDs, modification time, file size, file contents (of regular files),
|
||||
target (of symlinks) and device number (of block/character special files).
|
||||
|
||||
As tarlz removes leading slashes from member names, the @samp{-C} option may
|
||||
be used in combination with @samp{--diff} when absolute filenames were used
|
||||
be used in combination with @samp{--diff} when absolute file names were used
|
||||
on archive creation: @w{@samp{tarlz -C / -d}}. Alternatively, tarlz may be
|
||||
run from the root directory to perform the comparison.
|
||||
|
||||
|
@ -213,16 +222,22 @@ useful when comparing an @samp{--anonymous} archive.
|
|||
@item --delete
|
||||
Delete the specified files and directories from an archive in place. It
|
||||
currently can delete only from uncompressed archives and from archives with
|
||||
individually compressed files (@samp{--no-solid} archives). To delete a
|
||||
individually compressed files (@samp{--no-solid} archives). Note that files
|
||||
of about @samp{--data-size} or larger are compressed individually even if
|
||||
@samp{--bsolid} is used, and can therefore be deleted. Tarlz takes care to
|
||||
not delete a tar member unless it is possible to do so. For example it won't
|
||||
try to delete a tar member that is not individually compressed. To delete a
|
||||
directory without deleting the files under it, use
|
||||
@w{@code{tarlz --delete -f foo --exclude='dir/*' dir}}. Deleting in place
|
||||
@w{@samp{tarlz --delete -f foo --exclude='dir/*' dir}}. Deleting in place
|
||||
may be dangerous. A corrupt archive, a power cut, or an I/O error may cause
|
||||
data loss.
|
||||
|
||||
@item --exclude=@var{pattern}
|
||||
Exclude files matching a shell pattern like @samp{*.o}. A file is considered
|
||||
to match if any component of the filename matches. For example, @samp{*.o}
|
||||
matches @samp{foo.o}, @samp{foo.o/bar} and @samp{foo/bar.o}.
|
||||
to match if any component of the file name matches. For example, @samp{*.o}
|
||||
matches @samp{foo.o}, @samp{foo.o/bar} and @samp{foo/bar.o}. If
|
||||
@var{pattern} contains a @samp{/}, it matches a corresponding @samp{/} in
|
||||
the file name. For example, @samp{foo/*.o} matches @samp{foo/bar.o}.
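
The matching rule described above can be approximated in Python. This is only a sketch of the documented behavior, not tarlz's implementation: the function name is_excluded is hypothetical, and Python's fnmatch lets a wildcard match a slash, which may differ from the C library matching that tarlz actually uses.

```python
from fnmatch import fnmatchcase

def is_excluded(file_name, pattern):
    """Return True if any run of consecutive components matches pattern.

    Hypothetical approximation of the --exclude rule: the pattern is tried
    against every sequence of adjacent name components, so a pattern with
    a slash can match the corresponding slash in the file name.
    """
    components = [c for c in file_name.split("/") if c]
    for start in range(len(components)):
        for end in range(start + 1, len(components) + 1):
            candidate = "/".join(components[start:end])
            if fnmatchcase(candidate, pattern):
                return True
    return False
```

Under these assumptions the pattern *.o matches foo.o, foo.o/bar and foo/bar.o, and the pattern foo/*.o matches foo/bar.o, as in the examples given in the text.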
|
||||
|
||||
@item -f @var{archive}
|
||||
@itemx --file=@var{archive}
|
||||
|
@ -261,12 +276,13 @@ Append files to the end of an archive. The archive must be a regular
|
|||
be appended to an uncompressed archive, nor vice versa. If the archive is
|
||||
compressed, it must be a multimember lzip file with the two end-of-file
|
||||
blocks plus any zero padding contained in the last lzip member of the
|
||||
archive. Appending works as follows; first the end-of-file blocks are
|
||||
removed, then the new members are appended, and finally two new end-of-file
|
||||
blocks are appended to the archive. If the archive is uncompressed, tarlz
|
||||
parses and skips tar headers until it finds the end-of-file blocks. Exit
|
||||
with status 0 without modifying the archive if no @var{files} have been
|
||||
specified.
|
||||
archive. It is possible to append files to an archive with a different
|
||||
compression granularity. Appending works as follows; first the end-of-file
|
||||
blocks are removed, then the new members are appended, and finally two new
|
||||
end-of-file blocks are appended to the archive. If the archive is
|
||||
uncompressed, tarlz parses and skips tar headers until it finds the
|
||||
end-of-file blocks. Exit with status 0 without modifying the archive if no
|
||||
@var{files} have been specified.
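
For the uncompressed case, the search for the end-of-file blocks can be sketched as follows. The function name find_eof_offset is hypothetical, and the sketch makes simplifying assumptions that a real implementation cannot: it trusts the octal size field of each ustar header and ignores sizes overridden by pax extended records.

```python
BLOCK = 512  # tar header and data block size in bytes

def find_eof_offset(archive: bytes) -> int:
    """Return the offset of the first all-zero (end-of-file) block.

    Simplified sketch: walks the headers of an uncompressed tar archive,
    skipping the data blocks of each member.
    """
    zero_block = bytes(BLOCK)
    pos = 0
    while pos + BLOCK <= len(archive):
        header = archive[pos:pos + BLOCK]
        if header == zero_block:
            return pos
        # The size field occupies bytes 124..135 of the header, in octal
        size_field = header[124:136].rstrip(b"\0 ").decode("ascii")
        size = int(size_field, 8) if size_field else 0
        data_blocks = (size + BLOCK - 1) // BLOCK
        pos += BLOCK + data_blocks * BLOCK
    raise ValueError("end-of-file blocks not found")
```

Appending would then truncate the archive at the returned offset, write the new members, and write two new zero blocks.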
|
||||
|
||||
@item -t
|
||||
@itemx --list
|
||||
|
@ -282,7 +298,7 @@ Verbosely list files processed.
|
|||
Extract files from an archive. If @var{files} are given, extract only the
|
||||
@var{files} given. Else extract all the files in the archive. To extract a
|
||||
directory without extracting the files under it, use
|
||||
@w{@code{tarlz -xf foo --exclude='dir/*' dir}}.
|
||||
@w{@samp{tarlz -xf foo --exclude='dir/*' dir}}.
|
||||
|
||||
@item -0 .. -9
|
||||
Set the compression level for @samp{--create} and @samp{--append}. The
|
||||
|
@ -326,7 +342,7 @@ compressed data block must contain an integer number of tar members. Block
|
|||
compression is the default because it improves compression ratio for
|
||||
archives with many files smaller than the block size. This option allows
|
||||
tarlz to revert to default behavior if, for example, it is invoked through an
|
||||
alias like @code{tar='tarlz --solid'}. @xref{--data-size}, to set the target
|
||||
alias like @samp{tar='tarlz --solid'}. @xref{--data-size}, to set the target
|
||||
block size.
|
||||
|
||||
@item --dsolid
|
||||
|
@ -374,7 +390,7 @@ When this option is used, tarlz detects any corruption in the extended
|
|||
records (only limited by CRC collisions). But note that a corrupt
|
||||
@samp{GNU.crc32} keyword, for example @samp{GNU.crc33}, is reported as a
|
||||
missing CRC instead of as a corrupt record. This misleading
|
||||
@samp{Missing CRC} message is the consequence of a flaw in the posix pax
|
||||
@samp{Missing CRC} message is the consequence of a flaw in the POSIX pax
|
||||
format; i.e., the lack of a mandatory check sequence in the extended
|
||||
records. @xref{crc32}.
|
||||
|
||||
|
@ -400,6 +416,22 @@ invalid input file, 3 for an internal consistency error (eg, bug) which
|
|||
caused tarlz to panic.
|
||||
|
||||
|
||||
@node Portable character set
|
||||
@chapter POSIX portable filename character set
|
||||
@cindex portable character set
|
||||
|
||||
The set of characters from which portable file names are constructed.
|
||||
|
||||
@example
|
||||
A B C D E F G H I J K L M N O P Q R S T U V W X Y Z
|
||||
a b c d e f g h i j k l m n o p q r s t u v w x y z
|
||||
0 1 2 3 4 5 6 7 8 9 . _ -
|
||||
@end example
|
||||
|
||||
The last three characters are the period, underscore, and hyphen-minus
|
||||
characters, respectively.
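
A check against this character set is easy to express with a regular expression. The following Python sketch uses a hypothetical helper name, is_portable_name, and tests a single file name component (the slash that separates components is not part of the set).

```python
import re

# Letters, digits, period, underscore and hyphen-minus, as listed above
_PORTABLE = re.compile(r"[A-Za-z0-9._-]+")

def is_portable_name(component: str) -> bool:
    """Return True if a non-empty name uses only the portable character set."""
    return _PORTABLE.fullmatch(component) is not None
```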
|
||||
|
||||
|
||||
@node File format
|
||||
@chapter File format
|
||||
@cindex file format
|
||||
|
@ -426,7 +458,7 @@ A tar.lz file consists of a series of lzip members (compressed data sets).
|
|||
The members simply appear one after another in the file, with no
|
||||
additional information before, between, or after them.
|
||||
|
||||
Each lzip member contains one or more tar members in a simplified posix
|
||||
Each lzip member contains one or more tar members in a simplified POSIX
|
||||
pax interchange format. The only pax typeflag value supported by tarlz
|
||||
(in addition to the typeflag values defined by the ustar format) is
|
||||
@samp{x}. The pax format is an extension on top of the ustar format that
|
||||
|
@ -506,7 +538,7 @@ extraction. @xref{flawed-compat}.
|
|||
|
||||
The pax extended header data consists of one or more records, each of
|
||||
them constructed as follows:@*
|
||||
@code{"%d %s=%s\n", <length>, <keyword>, <value>}
|
||||
@samp{"%d %s=%s\n", <length>, <keyword>, <value>}
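
Because the decimal length counts every byte of the record, including the length digits themselves and the trailing newline, building a record needs a small fixed-point computation. A minimal Python sketch, using a hypothetical function name pax_record and assuming UTF-8 encoded keyword and value:

```python
def pax_record(keyword: str, value: str) -> bytes:
    """Build one pax extended record: b"<length> <keyword>=<value>\\n".

    The length field includes its own digits, so it is recomputed
    until it stops changing.
    """
    body = f" {keyword}={value}\n".encode("utf-8")
    length = len(body)
    while True:
        total = len(body) + len(str(length))
        if total == length:
            break
        length = total
    return str(length).encode("ascii") + body
```

For the keyword path and the value foo, the body is 10 bytes long and the two length digits bring the total to 12, so the record starts with 12.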
|
||||
|
||||
The <length>, <blank>, <keyword>, <equals-sign>, and <newline> in the
|
||||
record must be limited to the portable character set. The <length> field
|
||||
|
@ -577,11 +609,11 @@ shown in the following table. All lengths and offsets are in decimal.
|
|||
|
||||
All characters in the header block are coded using the ISO/IEC 646:1991
|
||||
(ASCII) standard, except in fields storing names for files, users, and
|
||||
groups. For maximum portability between implementations, names should
|
||||
only contain characters from the portable filename character set. But if
|
||||
an implementation supports the use of characters outside of @samp{/} and
|
||||
the portable filename character set in names for files, users, and
|
||||
groups, tarlz will use the byte values in these names unmodified.
|
||||
groups. For maximum portability between implementations, names should only
|
||||
contain characters from the portable character set. But if an implementation
|
||||
supports the use of characters outside of @samp{/} and the portable
|
||||
character set in names for files, users, and groups, tarlz will use the byte
|
||||
values in these names unmodified.
|
||||
|
||||
The fields name, linkname, and prefix are null-terminated character
|
||||
strings except when all characters in the array contain non-null
|
||||
|
@ -679,32 +711,39 @@ ustar by not requiring a terminating null character.
|
|||
@chapter The reasons for the differences with pax
|
||||
@cindex Amendments to pax format
|
||||
|
||||
Tarlz is meant to reliably detect invalid or corrupt metadata during
|
||||
decoding, and to create safe archives where corrupt metadata can be reliably
|
||||
detected. In order to achieve these goals, tarlz makes some changes to the
|
||||
variant of the pax format that it uses. This chapter describes these changes
|
||||
and the concrete reasons to implement them.
|
||||
Tarlz creates safe archives that allow the reliable detection of invalid or
|
||||
corrupt metadata during decoding even when the integrity checking of lzip
|
||||
can't be used because the lzip members are only decompressed partially, as
|
||||
happens in parallel @samp{--list} and @samp{--extract}. In order to
|
||||
achieve this goal, tarlz makes some changes to the variant of the pax format
|
||||
that it uses. This chapter describes these changes and the concrete reasons
|
||||
to implement them.
|
||||
|
||||
@sp 1
|
||||
@anchor{crc32}
|
||||
@section Add a CRC of the extended records
|
||||
|
||||
The posix pax format has a serious flaw. The metadata stored in pax extended
|
||||
The POSIX pax format has a serious flaw. The metadata stored in pax extended
|
||||
records are not protected by any kind of check sequence. Corruption in a
|
||||
long filename may cause the extraction of the file in the wrong place
|
||||
long file name may cause the extraction of the file in the wrong place
|
||||
without warning. Corruption in a large file size may cause the truncation of
|
||||
the file or the appending of garbage to the file, both followed by a
|
||||
spurious warning about a corrupt header far from the place of the undetected
|
||||
corruption.
|
||||
|
||||
Metadata like filename and file size must be always protected in an archive
|
||||
Metadata like file name and file size must always be protected in an archive
|
||||
format because of the adverse effects of undetected corruption in them,
|
||||
potentially much worse than undetected corruption in the data. Even more so
|
||||
in the case of pax because the amount of metadata it stores is potentially
|
||||
large, making undetected corruption more probable.
|
||||
|
||||
Because of the above, tarlz protects the extended records with a CRC in
|
||||
a way compatible with standard tar tools. @xref{key_crc32}.
|
||||
Headers and metadata must be protected separately from data because the
|
||||
integrity checking of lzip may not be able to detect the corruption before
|
||||
the metadata has been used, for example, to create a new file in the wrong
|
||||
place.
|
||||
|
||||
Because of the above, tarlz protects the extended records with a CRC in a
|
||||
way compatible with standard tar tools. @xref{key_crc32}.
|
||||
|
||||
@sp 1
|
||||
@anchor{flawed-compat}
|
||||
|
@ -714,12 +753,12 @@ In order to allow the extraction of pax archives by a tar utility conforming
|
|||
to the POSIX-2:1993 standard, POSIX.1-2008 recommends selecting extended
|
||||
header field values that allow such tar to create a regular file containing
|
||||
the extended header records as data. This approach is broken because if the
|
||||
extended header is needed because of a long filename, the name and prefix
|
||||
extended header is needed because of a long file name, the name and prefix
|
||||
fields will be unable to contain the full pathname of the file. Therefore
|
||||
the files corresponding to both the extended header and the overridden ustar
|
||||
header will be extracted using truncated filenames, perhaps overwriting
|
||||
header will be extracted using truncated file names, perhaps overwriting
|
||||
existing files or directories. It may be a security risk to extract a file
|
||||
with a truncated filename.
|
||||
with a truncated file name.
|
||||
|
||||
To avoid this problem, tarlz writes extended headers with all fields zeroed
|
||||
except size, chksum, typeflag, magic and version. This prevents old tar
|
||||
|
@ -729,8 +768,8 @@ extended records.
|
|||
|
||||
If an extended header is required for any reason (for example a file size
|
||||
larger than @w{8 GiB} or a link name longer than 100 bytes), tarlz moves the
|
||||
filename also to the extended header to prevent an ustar tool from trying to
|
||||
extract the file or link. This also makes easier during parallel decoding
|
||||
file name also to the extended header to prevent a ustar tool from trying
|
||||
to extract the file or link. This also makes easier during parallel decoding
|
||||
the detection of a tar member split between two lzip members at the boundary
|
||||
between the extended header and the ustar header.
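
The zeroed extended header described in this section can be sketched in Python. The field offsets are those of the ustar header, but the function name pax_extended_header and the exact textual layout of the size and chksum fields (zero-padded octal digits, with the checksum followed by a null and a space) are assumptions for illustration; tarlz may format these fields differently.

```python
def pax_extended_header(data_size: int) -> bytes:
    """Build a 512-byte ustar header of typeflag 'x' with most fields zeroed.

    Only size, chksum, typeflag, magic and version are filled in; name,
    mode, uid, gid, mtime, linkname and prefix stay as null bytes.
    """
    h = bytearray(512)
    h[124:136] = b"%011o\0" % data_size   # size: 11 octal digits plus null
    h[156:157] = b"x"                     # typeflag: pax extended header
    h[257:263] = b"ustar\0"               # magic
    h[263:265] = b"00"                    # version
    # The checksum is the sum of all header bytes, computed with the
    # chksum field itself filled with spaces.
    h[148:156] = b" " * 8
    chksum = sum(h)
    h[148:156] = b"%06o\0 " % chksum
    return bytes(h)
```

Since the name and prefix fields are empty, a tool that does not understand typeflag x has no truncated name under which to extract the extended records.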
|
||||
|
||||
|
@ -738,10 +777,11 @@ between the extended header and the ustar header.
|
|||
@section As simple as possible (but not simpler)
|
||||
|
||||
The tarlz format is mainly ustar. Extended pax headers are used only when
|
||||
needed because the length of a filename or link name, or the size of a file
|
||||
needed because the length of a file name or link name, or the size of a file
|
||||
exceeds the limits of the ustar format. Adding extended headers to each
|
||||
member just to record subsecond timestamps seems wasteful for a backup
|
||||
format.
|
||||
format. Moreover, minimizing the overhead may help recover the archive
|
||||
with lziprecover in case of corruption.
|
||||
|
||||
Global pax headers are tolerated, but not supported; they are parsed and
|
||||
ignored. Some operations may not behave as expected if the archive contains
|
||||
|
@ -759,6 +799,7 @@ be adjusted with a command line option in the future.
|
|||
|
||||
@node Multi-threaded tar
|
||||
@chapter Limitations of parallel tar decoding
|
||||
@cindex parallel tar decoding
|
||||
|
||||
Safely decoding an arbitrary tar archive in parallel is impossible. For
|
||||
example, if a tar archive containing another tar archive is decoded starting
|
||||
|
|