\input texinfo @c -*-texinfo-*-
|
|
@c %**start of header
|
|
@setfilename tarlz.info
|
|
@documentencoding ISO-8859-15
|
|
@settitle Tarlz Manual
|
|
@finalout
|
|
@c %**end of header
|
|
|
|
@set UPDATED 4 March 2025
|
|
@set VERSION 0.27.1
|
|
|
|
@dircategory Archiving
|
|
@direntry
|
|
* Tarlz: (tarlz). Archiver with multimember lzip compression
|
|
@end direntry
|
|
|
|
|
|
@ifnothtml
|
|
@titlepage
|
|
@title Tarlz
|
|
@subtitle Archiver with multimember lzip compression
|
|
@subtitle for Tarlz version @value{VERSION}, @value{UPDATED}
|
|
@author by Antonio Diaz Diaz
|
|
|
|
@page
|
|
@vskip 0pt plus 1filll
|
|
@end titlepage
|
|
|
|
@contents
|
|
@end ifnothtml
|
|
|
|
@ifnottex
|
|
@node Top
|
|
@top
|
|
|
|
This manual is for Tarlz (version @value{VERSION}, @value{UPDATED}).
|
|
|
|
@menu
|
|
* Introduction:: Purpose and features of tarlz
|
|
* Invoking tarlz:: Command-line interface
|
|
* Argument syntax:: By convention, options start with a hyphen
|
|
* Creating backups safely:: Checking integrity and accuracy of archives
|
|
* Portable character set:: POSIX portable filename character set
|
|
* File format:: Detailed format of the compressed archive
|
|
* Amendments to pax format:: The reasons for the differences with pax
|
|
* Program design:: Internal structure of tarlz
|
|
* Multi-threaded decoding:: Limitations of parallel tar decoding
|
|
* Minimum archive sizes:: Sizes required for full multi-threaded speed
|
|
* Examples:: A small tutorial with examples
|
|
* Problems:: Reporting bugs
|
|
* Concept index:: Index of concepts
|
|
@end menu
|
|
|
|
@sp 1
|
|
Copyright @copyright{} 2013-2025 Antonio Diaz Diaz.
|
|
|
|
This manual is free documentation: you have unlimited permission to copy,
|
|
distribute, and modify it.
|
|
@end ifnottex
|
|
|
|
|
|
@node Introduction
|
|
@chapter Introduction
|
|
@cindex introduction
|
|
|
|
@uref{http://www.nongnu.org/lzip/tarlz.html,,Tarlz} is a massively parallel
|
|
(multi-threaded) combined implementation of the tar archiver and the
|
|
@uref{http://www.nongnu.org/lzip/lzip.html,,lzip} compressor. Tarlz uses the
|
|
compression library @uref{http://www.nongnu.org/lzip/lzlib.html,,lzlib}.
|
|
|
|
Tarlz creates tar archives using a simplified and safer variant of the POSIX
|
|
pax format compressed in lzip format, keeping the alignment between tar
|
|
members and lzip members. The resulting multimember tar.lz archive is
|
|
backward compatible with standard tar tools like GNU tar, which treat it
|
|
like any other tar.lz archive. Tarlz can append files to the end of such
|
|
compressed archives.
|
|
|
|
Keeping the alignment between tar members and lzip members has two
|
|
advantages. It adds an indexed lzip layer on top of the tar archive, making
|
|
it possible to decode the archive safely in parallel. It also reduces the
|
|
amount of data lost in case of corruption. Compressing a tar archive with
plzip may even double the number of files lost for each damaged lzip member
because plzip does not keep the members aligned.
|
|
|
|
Tarlz can create tar archives with five levels of compression granularity:
|
|
per file (@option{--no-solid}), per block (@option{--bsolid}, default), per
|
|
directory (@option{--dsolid}), appendable solid (@option{--asolid}), and
|
|
solid (@option{--solid}). It can also create uncompressed tar archives.
|
|
|
|
@noindent
|
|
Of course, compressing each file (or each directory) individually can't
|
|
achieve a compression ratio as high as compressing solidly the whole tar
|
|
archive, but it has the following advantages:
|
|
|
|
@itemize @bullet
|
|
@item
|
|
The resulting multimember tar.lz archive can be decompressed in
|
|
parallel, multiplying the decompression speed.
|
|
|
|
@item
|
|
New members can be appended to the archive (by removing the
|
|
end-of-archive member), and unwanted members can be deleted from the
archive, just like an uncompressed tar archive.
|
|
|
|
@item
|
|
It is a safe POSIX-style backup format. In case of corruption, tarlz
|
|
can extract all the undamaged members from the tar.lz archive,
|
|
skipping over the damaged members, just like the standard
|
|
(uncompressed) tar. Moreover, the option @option{--keep-damaged} can be used
|
|
to recover as much data as possible from each damaged member, and
|
|
lziprecover can be used to recover some of the damaged members.
|
|
|
|
@item
|
|
A multimember tar.lz archive is usually smaller than the corresponding
|
|
solidly compressed tar.gz archive, except when individually
|
|
compressing files smaller than about @w{32 KiB}.
|
|
@end itemize
|
|
|
|
Tarlz protects the extended records with a Cyclic Redundancy Check (CRC) in
|
|
a way compatible with standard tar tools. @xref{crc32}.
|
|
|
|
Tarlz does not understand other tar formats like @samp{gnu}, @samp{oldgnu},
|
|
@samp{star}, or @samp{v7}. The command
|
|
@w{@samp{tarlz -t -f archive.tar.lz > /dev/null}} can be used to check that
|
|
the format of the archive is compatible with tarlz.
|
|
|
|
|
|
@node Invoking tarlz
|
|
@chapter Invoking tarlz
|
|
@cindex invoking
|
|
@cindex options
|
|
@cindex usage
|
|
@cindex version
|
|
|
|
The format for running tarlz is:
|
|
|
|
@example
tarlz @var{operation} [@var{options}] [@var{files}]
@end example
|
|
|
|
@noindent
|
|
All operations except @option{--concatenate} and @option{--compress} operate
|
|
on whole trees if any @var{file} is a directory. All operations except
|
|
@option{--compress} overwrite output files without warning. If no archive is
|
|
specified, tarlz tries to read it from standard input or write it to
|
|
standard output. Tarlz refuses to read archive data from a terminal or write
|
|
archive data to a terminal. Tarlz detects when the archive being created or
|
|
enlarged is among the files to be archived, appended, or concatenated, and
|
|
skips it.
|
|
|
|
Tarlz does not use absolute file names nor file names above the current
|
|
working directory (perhaps changed by option @option{-C}). On archive creation
|
|
or appending tarlz archives the files specified, but removes from member
|
|
names any leading and trailing slashes and any file name prefixes containing
|
|
a @file{..} component. On extraction, leading and trailing slashes are also
|
|
removed from member names, and archive members containing a @file{..}
|
|
component in the file name are skipped. Tarlz does not follow symbolic links
|
|
during extraction; not even symbolic links replacing intermediate
|
|
directories.
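
@noindent
The extraction rule above can be illustrated with a small shell sketch. It
is not the code used by tarlz, only an approximation of the rule as stated,
and the member name is hypothetical:

@example
name='dir/../etc/passwd'     # hypothetical member name
case "$name" in
  ..|../*|*/..|*/../*) verdict=skip ;;      # a component is exactly ".."
  *)                   verdict=extract ;;
esac
echo "$name: $verdict"
@end example

@noindent
In this sketch a name like @file{a..b} is not affected because only whole
components of the file name are compared.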
|
|
|
|
On extraction and listing, tarlz removes leading @file{./} strings from
|
|
member names in the archive or given in the command line, so that
|
|
@w{@samp{tarlz -xf foo ./bar baz}} extracts members @file{bar} and
|
|
@file{./baz} from archive @file{foo}.
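
@noindent
The normalization described above can be approximated with @samp{sed}. This
is only a sketch; it assumes that repeated leading @file{./} strings are all
removed, and the name used is hypothetical:

@example
name='././bar'               # hypothetical member name
# Remove every leading "./" string from the name.
stripped=$(printf '%s\n' "$name" | sed 's|^\(\./\)*||')
echo "$stripped"
@end example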
|
|
|
|
If several compression levels or @option{--*solid} options are given, the last
|
|
setting is used. For example @w{@option{-9 --solid --uncompressed -1}} is
|
|
equivalent to @w{@option{-1 --solid}}.
|
|
|
|
@noindent
|
|
tarlz supports the following operations:
|
|
|
|
@table @code
|
|
@item --help
|
|
Print an informative help message describing the options and exit.
|
|
|
|
@item -V
|
|
@itemx --version
|
|
Print the version number of tarlz on the standard output and exit.
|
|
This version number should be included in all bug reports.
|
|
|
|
@item -A
|
|
@itemx --concatenate
|
|
Append one or more archives to the end of an archive. If no archive is
|
|
specified with the option @option{-f}, concatenate the input archives to
|
|
standard output. All the archives involved must be regular (seekable) files,
|
|
and must be either all compressed or all uncompressed. Compressed and
|
|
uncompressed archives can't be mixed. Compressed archives must be
|
|
multimember lzip files with the two end-of-archive blocks plus any zero
|
|
padding contained in the last lzip member of each archive. The intermediate
|
|
end-of-archive blocks are removed as each new archive is concatenated. If
|
|
the archive is uncompressed, tarlz parses tar headers until it finds the
|
|
end-of-archive blocks. Exit with status 0 without modifying the archive if
|
|
no @var{files} have been specified.
|
|
|
|
Concatenating archives containing files in common results in two or more tar
|
|
members with the same name in the resulting archive, which may produce
|
|
nondeterministic behavior during multi-threaded extraction.
|
|
@xref{mt-extraction}.
|
|
|
|
@item -c
|
|
@itemx --create
|
|
Create a new archive from @var{files}.
|
|
|
|
@item -d
|
|
@itemx --diff
|
|
Compare and report differences between archive and file system. For each tar
|
|
member in the archive, check that the corresponding file in the file system
|
|
exists and is of the same type (regular file, directory, etc). Report on
|
|
standard output the differences found in type, mode (permissions), owner and
|
|
group IDs, modification time, file size, file contents (of regular files),
|
|
target (of symlinks) and device number (of block/character special files).
|
|
|
|
As tarlz removes leading slashes from member names, the option @option{-C} may
|
|
be used in combination with @option{--diff} when absolute file names were used
|
|
on archive creation: @w{@samp{tarlz -C / -d}}. Alternatively, tarlz may be
|
|
run from the root directory to perform the comparison.
|
|
|
|
@item --delete
|
|
Delete files and directories from an archive in place. It currently can
|
|
delete only from uncompressed archives and from archives with files
|
|
compressed individually (@option{--no-solid} archives). Note that files of
|
|
about @option{--data-size} or larger are compressed individually even if
|
|
@option{--bsolid} is used, and can therefore be deleted. Tarlz takes care to
|
|
not delete a tar member unless it is possible to do so. For example it won't
|
|
try to delete a tar member that is not compressed individually. Even in the
|
|
case of finding a corrupt member after having deleted some member(s), tarlz
|
|
stops and copies the rest of the file as soon as corruption is found,
|
|
leaving it just as corrupt as it was, but not worse.
|
|
|
|
To delete a directory without deleting the files under it, use
|
|
@w{@samp{tarlz --delete -f foo --exclude='dir/*' dir}}. Deleting in place
|
|
may be dangerous. A corrupt archive, a power cut, or an I/O error may cause
|
|
data loss.
|
|
|
|
@item -r
|
|
@itemx --append
|
|
Append files to the end of an archive. The archive must be a regular
|
|
(seekable) file either compressed or uncompressed. Compressed members can't
|
|
be appended to an uncompressed archive, nor vice versa. If the archive is
|
|
compressed, it must be a multimember lzip file with the two end-of-archive
|
|
blocks plus any zero padding contained in the last lzip member of the
|
|
archive. It is possible to append files to an archive with a different
|
|
compression granularity. Appending works as follows: first the
|
|
end-of-archive blocks are removed, then the new members are appended, and
|
|
finally two new end-of-archive blocks are appended to the archive. If the
|
|
archive is uncompressed, tarlz parses and skips tar headers until it finds
|
|
the end-of-archive blocks. Exit with status 0 without modifying the archive
|
|
if no @var{files} have been specified.
|
|
|
|
Appending files already present in the archive results in two or more tar
|
|
members with the same name, which may produce nondeterministic behavior
|
|
during multi-threaded extraction. @xref{mt-extraction}.
|
|
|
|
@item -t
|
|
@itemx --list
|
|
List the contents of an archive. If @var{files} are given, list only the
|
|
@var{files} given. @xref{mt-listing}.
|
|
|
|
@item -x
|
|
@itemx --extract
|
|
Extract files from an archive. If @var{files} are given, extract only the
|
|
@var{files} given. Else extract all the files in the archive. To extract a
|
|
directory without extracting the files under it, use
|
|
@w{@samp{tarlz -xf foo --exclude='dir/*' dir}}. Tarlz removes files and
|
|
empty directories unconditionally before extracting over them. Other than
|
|
that, it does not make any special effort to extract a file over an
|
|
incompatible type of file. For example, extracting a file over a non-empty
|
|
directory usually fails. @xref{mt-extraction}.
|
|
|
|
@item -z
|
|
@itemx --compress
|
|
Compress existing POSIX tar archives aligning the lzip members to the tar
|
|
members with choice of granularity (@option{--bsolid} by default,
|
|
@option{--dsolid} works like @option{--asolid}). Each input archive is
|
|
compressed to a file with the extension @file{.lz} added unless the option
|
|
@option{--output} is used. If no archives are specified, or if a hyphen
|
|
@samp{-} is used as the name of an archive, tarlz reads from standard input
|
|
and writes to standard output (unless the option @option{--output} is used).
|
|
When @option{--output} is used, only one input archive can be specified.
|
|
Exit with error status 2 if any input archive is an empty file. The input
|
|
archives are kept unchanged. Existing compressed archives are not
|
|
overwritten. Tarlz can be used as compressor for GNU tar by using a command
|
|
like @w{@samp{tar -c -Hustar foo | tarlz -z -o foo.tar.lz}}. Tarlz can be
|
|
used as compressor for zupdate (zutils) by using a command like
|
|
@w{@samp{zupdate --lz='tarlz -z' foo.tar.gz}}. Note that tarlz only works
|
|
reliably on archives without global headers, or with global headers whose
|
|
content can be ignored.
|
|
|
|
The compression is reversible, including any garbage present after the
|
|
end-of-archive blocks. Tarlz stops parsing after the first end-of-archive
|
|
block is found, and then compresses the rest of the archive. Unless solid
|
|
compression is requested, the end-of-archive blocks are compressed in a lzip
|
|
member separated from the preceding members and from any nonzero garbage
|
|
following the end-of-archive blocks. @option{--compress} implies plzip
|
|
argument style, not tar style. @option{-f} can't be used with
|
|
@option{--compress}.
|
|
|
|
@item --check-lib
|
|
Compare the
|
|
@uref{http://www.nongnu.org/lzip/manual/lzlib_manual.html#Library-version,,version of lzlib}
|
|
used to compile tarlz with the version actually being used at run time and
|
|
exit. Report any differences found. Exit with error status 1 if differences
|
|
are found. A mismatch may indicate that lzlib is not correctly installed or
|
|
that a different version of lzlib has been installed after compiling tarlz.
|
|
Exit with error status 2 if LZ_API_VERSION and LZ_version_string don't
|
|
match. @w{@samp{tarlz -v --check-lib}} shows the version of lzlib being used
|
|
and the value of LZ_API_VERSION (if defined).
|
|
@ifnothtml
|
|
@xref{Library version,,,lzlib}.
|
|
@end ifnothtml
|
|
|
|
@end table
|
|
|
|
@noindent
|
|
tarlz supports the following options (@pxref{Argument syntax}):
|
|
|
|
@table @code
|
|
@anchor{--data-size}
|
|
@item -B @var{bytes}
|
|
@itemx --data-size=@var{bytes}
|
|
Set target size of input data blocks for the option @option{--bsolid}.
|
|
@xref{--bsolid}. Valid values range from @w{8 KiB} to @w{1 GiB}. Default
|
|
value is two times the dictionary size, except for option @option{-0} where
|
|
it defaults to @w{1 MiB}. @xref{Minimum archive sizes}. Tarlz does not split
|
|
tar members. If a file is larger than @var{bytes}, tarlz will create a lzip
|
|
member large enough to contain the file.
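
The default value described above can be worked out from the dictionary
sizes listed under the compression level options. The following shell sketch
does the arithmetic in KiB; the variable names are hypothetical and the
script is not part of tarlz:

@example
level=6                      # compression level, 0 to 9
case "$level" in
  0) dict_kib=64 ;;
  1) dict_kib=1024 ;;
  2) dict_kib=1536 ;;
  3) dict_kib=2048 ;;
  4) dict_kib=3072 ;;
  5) dict_kib=4096 ;;
  6) dict_kib=8192 ;;
  7) dict_kib=16384 ;;
  8) dict_kib=24576 ;;
  9) dict_kib=32768 ;;
esac
if [ "$level" -eq 0 ]; then
  data_kib=1024                # level -0 is the exception: 1 MiB
else
  data_kib=$((2 * dict_kib))   # two times the dictionary size
fi
echo "default data size for -$level: $data_kib KiB"
@end example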
|
|
|
|
@item -C @var{dir}
|
|
@itemx --directory=@var{dir}
|
|
Change to directory @var{dir}. When creating, appending, comparing, or
|
|
extracting, the position of each option @option{-C} in the command line is
|
|
significant; it changes the current working directory for the following
|
|
@var{files} until a new option @option{-C} appears in the command line.
|
|
@option{--list} and @option{--delete} ignore any option @option{-C}
|
|
specified. @var{dir} is relative to the then current working directory,
|
|
perhaps changed by a previous option @option{-C}.
|
|
|
|
Note that a process can only have one current working directory (CWD).
|
|
Therefore multi-threading can't be used to create or decode an archive if an
|
|
option @option{-C} appears after a (relative) file name in the command line.
|
|
(All file names are made relative by removing leading slashes when decoding).
|
|
|
|
@item -f @var{archive}
|
|
@itemx --file=@var{archive}
|
|
Use archive file @var{archive}. A hyphen @samp{-} used as an @var{archive}
|
|
argument reads from standard input or writes to standard output.
|
|
|
|
@item -h
|
|
@itemx --dereference
|
|
Follow symbolic links during archive creation, appending or comparison.
|
|
Archive or compare the files they point to instead of the links themselves.
|
|
|
|
@item -n @var{n}
|
|
@itemx --threads=@var{n}
|
|
Set the number of (de)compression threads, overriding the system's default.
|
|
Valid values range from 0 to as many as your system can support. A value
|
|
of 0 disables threads entirely. If this option is not used, tarlz tries to
|
|
detect the number of processors in the system and use it as default value.
|
|
@w{@samp{tarlz --help}} shows the system's default value. See the note about
|
|
multi-threading in the option @option{-C} above.
|
|
|
|
Note that the number of usable threads is limited during compression to
|
|
@w{ceil( uncompressed_size / data_size )} (@pxref{Minimum archive sizes}),
|
|
and during decompression to the number of lzip members in the tar.lz
|
|
archive, which you can find by running @w{@samp{lzip -lv archive.tar.lz}}.
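
As a rough illustration of the compression limit above, the following shell
sketch computes @w{ceil( uncompressed_size / data_size )} with integer
arithmetic. The variable names and the input size are hypothetical, and the
data size is assumed to be @w{16 MiB}:

@example
uncompressed_size=100000000           # hypothetical input size in bytes
data_size=$((16 * 1024 * 1024))       # assumed data size: 16 MiB
# Integer ceiling division: (a + b - 1) / b
max_threads=$(( (uncompressed_size + data_size - 1) / data_size ))
echo "$max_threads"
@end example

@noindent
With these numbers the sketch prints 6, so requesting more than 6 threads
would not be expected to speed up the compression of such an input.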
|
|
|
|
@item -o @var{file}
|
|
@itemx --output=@var{file}
|
|
Write the compressed output to @var{file}. @w{@option{-o -}} writes the
|
|
compressed output to standard output. Currently @option{--output} only works
|
|
with @option{--compress}.
|
|
|
|
@item -p
|
|
@itemx --preserve-permissions
|
|
On extraction, set file permissions as they appear in the archive. This is
|
|
the default behavior when tarlz is run by the superuser. The default for
|
|
other users is to subtract the umask of the user running tarlz from the
|
|
permissions specified in the archive.
|
|
|
|
@item -q
|
|
@itemx --quiet
|
|
Quiet operation. Suppress all messages.
|
|
|
|
@item -v
|
|
@itemx --verbose
|
|
Verbosely list files processed. Further -v's (up to 4) increase the
|
|
verbosity level.
|
|
|
|
@item -0 .. -9
|
|
Set the compression level for @option{--create}, @option{--append}, and
|
|
@option{--compress}. The default compression level is @option{-6}. Like lzip,
|
|
tarlz also minimizes the dictionary size of the lzip members it creates,
|
|
reducing the amount of memory required for decompression.
|
|
|
|
@multitable {Level} {Dictionary size} {Match length limit}
@headitem Level @tab Dictionary size @tab Match length limit
@item -0 @tab 64 KiB @tab 16 bytes
@item -1 @tab 1 MiB @tab 5 bytes
@item -2 @tab 1.5 MiB @tab 6 bytes
@item -3 @tab 2 MiB @tab 8 bytes
@item -4 @tab 3 MiB @tab 12 bytes
@item -5 @tab 4 MiB @tab 20 bytes
@item -6 @tab 8 MiB @tab 36 bytes
@item -7 @tab 16 MiB @tab 68 bytes
@item -8 @tab 24 MiB @tab 132 bytes
@item -9 @tab 32 MiB @tab 273 bytes
@end multitable
|
|
|
|
@item --uncompressed
|
|
With @option{--create}, don't compress the tar archive created. Create an
|
|
uncompressed tar archive instead. With @option{--append}, don't compress the
|
|
new members appended to the tar archive. Compressed members can't be
|
|
appended to an uncompressed archive, nor vice versa. @option{--uncompressed}
|
|
can be omitted if it can be deduced from the archive name. (An uncompressed
|
|
archive name lacks a @file{.lz} or @file{.tlz} extension).
|
|
|
|
@item --asolid
|
|
When creating or appending to a compressed archive, use appendable solid
|
|
compression. All the files being added to the archive are compressed into a
|
|
single lzip member, but the end-of-archive blocks are compressed into a
|
|
separate lzip member. This creates a solidly compressed appendable archive.
|
|
Solid archives can't be created nor decoded in parallel.
|
|
|
|
@anchor{--bsolid}
|
|
@item --bsolid
|
|
When creating or appending to a compressed archive, use block compression.
|
|
Tar members are compressed together in a lzip member until they approximate
|
|
a target uncompressed size. The size can't be exact because each solidly
|
|
compressed data block must contain an integer number of tar members. Block
|
|
compression is the default because it improves compression ratio for
|
|
archives with many files smaller than the block size. This option allows
tarlz to revert to default behavior if, for example, it is invoked through an
|
|
alias like @w{@samp{tar='tarlz --solid'}}. @xref{--data-size}, to set the
|
|
target block size.
|
|
|
|
@item --dsolid
|
|
When creating or appending to a compressed archive, compress each file
|
|
specified in the command line separately in its own lzip member, and use
|
|
solid compression for each directory specified in the command line. The
|
|
end-of-archive blocks are compressed into a separate lzip member. This
|
|
creates a compressed appendable archive with a separate lzip member for each
|
|
file or top-level directory specified.
|
|
|
|
@item --no-solid
|
|
When creating or appending to a compressed archive, compress each file
|
|
separately in its own lzip member. The end-of-archive blocks are compressed
|
|
into a separate lzip member. This creates a compressed appendable archive
|
|
with a lzip member for each file.
|
|
|
|
@item --solid
|
|
When creating or appending to a compressed archive, use solid compression.
|
|
The files being added to the archive, along with the end-of-archive blocks,
|
|
are compressed into a single lzip member. The resulting archive is not
|
|
appendable; no more files can be appended to it later. Solid
|
|
archives can't be created nor decoded in parallel.
|
|
|
|
@item --anonymous
|
|
Equivalent to @w{@option{--owner=root --group=root}}.
|
|
|
|
@item --owner=@var{owner}
|
|
When creating or appending, use @var{owner} for files added to the archive.
|
|
If @var{owner} is not a valid user name, it is decoded as a decimal numeric
|
|
user ID.
|
|
|
|
@item --group=@var{group}
|
|
When creating or appending, use @var{group} for files added to the archive.
|
|
If @var{group} is not a valid group name, it is decoded as a decimal numeric
|
|
group ID.
|
|
|
|
@item --exclude=@var{pattern}
|
|
Exclude files matching a shell pattern like @file{*.o}, even if the files
|
|
are specified in the command line. A file is considered to match if any
|
|
component of the file name matches. For example, @file{*.o} matches
|
|
@file{foo.o}, @file{foo.o/bar} and @file{foo/bar.o}. If @var{pattern}
|
|
contains a @samp{/}, it matches a corresponding @samp{/} in the file name.
|
|
For example, @file{foo/*.o} matches @file{foo/bar.o}. Multiple
|
|
@option{--exclude} options can be specified.
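
The component rule above, for a @var{pattern} without a @samp{/}, can be
approximated in the shell as follows. This is only a sketch of the rule as
described, not the matching code used by tarlz, and the pattern and file
name are hypothetical:

@example
pattern='*.o'                # hypothetical pattern without a slash
name='foo.o/bar'             # hypothetical file name
result=kept
old_ifs=$IFS
IFS=/                        # split the name into its components
for component in $name; do
  case "$component" in
    $pattern) result=excluded ;;   # one matching component is enough
  esac
done
IFS=$old_ifs
echo "$name: $result"
@end example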
|
|
|
|
@item --ignore-ids
|
|
Make @option{--diff} ignore differences in owner and group IDs. This option is
|
|
useful when comparing an @option{--anonymous} archive.
|
|
|
|
@item --ignore-metadata
|
|
Make @option{--diff} ignore any differences in metadata (file permissions,
|
|
owner and group IDs, modification time). Compare only file type, file size,
|
|
and file content. This option is useful when file permissions have not been
|
|
fully restored because uid/gid changed on extraction.
|
|
|
|
@item --ignore-overflow
|
|
Make @option{--diff} ignore differences in mtime caused by overflow on 32-bit
|
|
systems with a 32-bit time_t.
|
|
|
|
@item --keep-damaged
|
|
Don't delete partially extracted files. If a decompression error happens
|
|
while extracting a file, keep the partial data extracted. Use this option to
|
|
recover as much data as possible from each damaged member. It is recommended
|
|
to run tarlz in single-threaded mode (@option{--threads=0}) when using this
|
|
option.
|
|
|
|
@anchor{--missing-crc}
|
|
@item --missing-crc
|
|
Exit with error status 2 if the CRC of the extended records is missing. When
|
|
this option is used, tarlz detects any corruption in the extended records
|
|
(only limited by CRC collisions). But note that a corrupt @samp{GNU.crc32}
|
|
keyword, for example @samp{GNU.crc30}, is reported as a missing CRC instead
|
|
of as a corrupt record. This misleading @w{@samp{Missing CRC}} message is
|
|
the consequence of a flaw in the POSIX pax format; i.e., the lack of a
|
|
mandatory check sequence of the extended records. @xref{crc32}.
|
|
|
|
@item --mtime=@var{date}
|
|
When creating or appending, use @var{date} as the modification time for
|
|
files added to the archive instead of their actual modification times. The
|
|
value of @var{date} may be either @samp{@@} followed by the number of
|
|
seconds since (or before) the epoch, or a date in format
|
|
@w{@samp{[-]YYYY-MM-DD HH:MM:SS}} or @samp{[-]YYYY-MM-DDTHH:MM:SS}, or the
|
|
name of an existing reference file starting with @samp{.} or @samp{/} whose
|
|
modification time is used. The time of day @samp{HH:MM:SS} in the date
|
|
format is optional and defaults to @samp{00:00:00}. The epoch is
|
|
@w{@samp{1970-01-01 00:00:00 UTC}}. Negative seconds or years define a
|
|
modification time before the epoch.
|
|
|
|
@item --out-slots=@var{n}
|
|
Number of @w{1 MiB} output packets buffered per worker thread during
|
|
multi-threaded creation or appending to compressed archives. Increasing the
|
|
number of packets may increase compression speed if the files being archived
|
|
are larger than @w{64 MiB} compressed, but requires more memory. Valid
|
|
values range from 1 to 1024. The default value is 64.
|
|
|
|
@item --warn-newer
|
|
During archive creation, warn if any file being archived has a modification
|
|
time newer than the archive creation time. This option may slow archive
|
|
creation somewhat because it makes an extra call to @samp{stat} after
|
|
archiving each file, but it nearly guarantees that file contents were not
|
|
modified during the creation of the archive. Note that the file must be at
|
|
least one second newer than the archive for it to be detected as newer.
|
|
|
|
@ignore
|
|
@item --permissive
|
|
Allow some violations of the archive format, like consecutive extended
|
|
headers preceding a ustar header, or several records with the same
|
|
keyword appearing in the same block of extended records.
|
|
@end ignore
|
|
|
|
@end table
|
|
|
|
Exit status: 0 for a normal exit, 1 for environmental problems
|
|
(file not found, files differ, invalid command-line options, I/O errors,
|
|
etc), 2 to indicate a corrupt or invalid input file, 3 for an internal
|
|
consistency error (e.g., bug) which caused tarlz to panic.
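
@noindent
A script calling tarlz might map the exit status to a message as sketched
below. The status is hard-coded here for illustration; a real script would
take it from @samp{$?} right after running tarlz. The variable names and the
wording of the messages are hypothetical:

@example
status=2                     # hypothetical; normally taken from "$?"
case "$status" in
  0) meaning='normal exit' ;;
  1) meaning='environmental problem' ;;
  2) meaning='corrupt or invalid input file' ;;
  3) meaning='internal consistency error' ;;
  *) meaning='status not documented here' ;;
esac
echo "tarlz exit status $status: $meaning"
@end example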
|
|
|
|
|
|
@node Argument syntax
|
|
@chapter Syntax of command-line arguments
|
|
@cindex argument syntax
|
|
|
|
POSIX recommends these conventions for command-line arguments.
|
|
|
|
@itemize @bullet
|
|
@item A command-line argument is an option if it begins with a hyphen
|
|
(@samp{-}).
|
|
|
|
@item Option names are single alphanumeric characters.
|
|
|
|
@item Certain options require an argument.
|
|
|
|
@item An option and its argument may or may not appear as separate tokens.
|
|
(In other words, the whitespace separating them is optional).
|
|
Thus, @w{@option{-o foo}} and @option{-ofoo} are equivalent.
|
|
|
|
@item One or more options without arguments, followed by at most one option
|
|
that takes an argument, may follow a hyphen in a single token.
|
|
Thus, @option{-abc} is equivalent to @w{@option{-a -b -c}}.
|
|
|
|
@item Options typically precede other non-option arguments.
|
|
|
|
@item The argument @samp{--} terminates all options; any following arguments
|
|
are treated as non-option arguments, even if they begin with a hyphen.
|
|
|
|
@item A token consisting of a single hyphen character is interpreted as an
|
|
ordinary non-option argument. By convention, it is used to specify standard
|
|
input, standard output, or a file named @samp{-}.
|
|
@end itemize
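
@noindent
Several of the conventions above can be observed with the shell builtin
@samp{getopts}. The sketch below does not use the parser of tarlz; the
option letters (@option{-a} and @option{-b} without argument, @option{-o}
with an argument) and the arguments are hypothetical:

@example
set -- -ab -ofoo -- -notanoption     # hypothetical command-line arguments
seen=''
while getopts abo: opt; do
  case "$opt" in
    a|b) seen="$seen $opt" ;;           # grouped options without argument
    o)   seen="$seen o=$OPTARG" ;;      # option with attached argument
    *)   echo "bad option" >&2 ;;
  esac
done
shift $((OPTIND - 1))                # drop the options and the "--"
echo "options:$seen"
echo "operands: $*"
@end example

@noindent
The token following @samp{--} is left as an operand even though it begins
with a hyphen.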
|
|
|
|
@noindent
|
|
GNU adds @dfn{long options} to these conventions:
|
|
|
|
@itemize @bullet
|
|
@item A long option consists of two hyphens (@samp{--}) followed by a name
|
|
made of alphanumeric characters and hyphens. Option names are typically one
|
|
to three words long, with hyphens to separate words. Abbreviations can be
|
|
used for the long option names as long as the abbreviations are unique.
|
|
|
|
@item A long option and its argument may or may not appear as separate
|
|
tokens. In the latter case they must be separated by an equal sign @samp{=}.
|
|
Thus, @w{@option{--foo bar}} and @option{--foo=bar} are equivalent.
|
|
@end itemize
|
|
|
|
|
|
@node Creating backups safely
|
|
@chapter Checking the integrity and accuracy of tar.lz archives
|
|
@cindex creating backups
|
|
|
|
Uncompressed tar archives do not offer any integrity checking for the files
|
|
they store. The pax format even fails to offer integrity checking for some
|
|
of the metadata. @xref{crc32}. The integrity checking of tar archives is
|
|
usually provided by a compression layer or by an external hash.
|
|
|
|
Lzip compression provides safe integrity checking to tar archives. But it
|
|
does not matter how safe the archiving format is if the archive is created
|
|
corrupt because of a concurrent modification of the files being archived, a
|
|
faulty RAM, or a bug in the archiving tool. The only way of guaranteeing
|
|
that a backup archive is correct is to check its integrity and accuracy
|
|
after creating it.
|
|
|
|
Testing the integrity of the archive with @w{@samp{lzip -tv}} guarantees
|
|
that the compression layer of the archive is valid, but it does not
|
|
guarantee that the tar layer is valid nor that the files in the archive
|
|
match the files in the file system. For example, if the RAM is faulty and a
|
|
bit flip happens in the input buffer before tarlz compresses it, the archive
|
|
will not match the files. It is safer to check the archive with
|
|
@w{@samp{tarlz -d}} just after creation because it checks the compression
|
|
layer and the tar layer, and it compares the files in the archive with the
|
|
files in the file system:
|
|
|
|
@example
tarlz -cf archive.tar.lz somedir    # create the archive
tarlz -df archive.tar.lz            # check the archive
@end example
|
|
|
|
Once the integrity and accuracy of an archive have been verified as in the
|
|
example above, they can be verified again anywhere at any time with
|
|
@w{@samp{tarlz -t -n0}}. It is important to disable multi-threading with
|
|
@option{-n0} because multi-threaded listing does not detect corruption in
|
|
the tar member data of multimember archives. @xref{mt-listing}.
|
|
|
|
@example
tarlz -t -n0 -f archive.tar.lz > /dev/null
@end example
|
|
|
|
@w{@samp{lzip -tv}} checks the integrity of the compression layer, and
|
|
therefore the integrity and accuracy of any archive created and verified as
|
|
explained above. This test is reliable for solidly compressed archives, but
|
|
it does not detect a truncated multimember archive if the truncation happens
|
|
just at a member boundary:
|
|
|
|
@example
lzip -tv archive.tar.lz
@end example
|
|
|
|
|
|
@node Portable character set
|
|
@chapter POSIX portable filename character set
|
|
@cindex portable character set
|
|
|
|
The following is the set of characters from which portable file names are
constructed:
|
|
|
|
@example
A B C D E F G H I J K L M N O P Q R S T U V W X Y Z
a b c d e f g h i j k l m n o p q r s t u v w x y z
0 1 2 3 4 5 6 7 8 9 . _ -
@end example
|
|
|
|
The last three characters are the period, underscore, and hyphen-minus
|
|
characters, respectively.
|
|
|
|
File names are identifiers. Therefore, archiving works better when file
|
|
names use only the portable character set without spaces added.
|
|
|
|
|
|
@node File format
|
|
@chapter File format
|
|
@cindex file format
|
|
|
|
In the diagram below, a box like this:
|
|
|
|
@verbatim
+---+
|   | <-- the vertical bars might be missing
+---+
@end verbatim
|
|
|
|
represents one byte; a box like this:
|
|
|
|
@verbatim
+==============+
|              |
+==============+
@end verbatim
|
|
|
|
represents a variable number of bytes or a fixed but large number of
|
|
bytes (for example 512).
|
|
|
|
@noindent
|
|
A tar.lz file consists of one or more lzip members (compressed data sets).
|
|
The members simply appear one after another in the file, with no additional
|
|
information before, between, or after them. Empty members (data size = 0)
|
|
are not allowed in multimember files.
|
|
|
|
Each lzip member contains one or more tar members in a simplified POSIX pax
|
|
interchange format. The only pax typeflag value supported by tarlz (in
|
|
addition to the typeflag values defined by the ustar format) is 'x'.
|
|
The pax format is an extension on top of the ustar format that removes the
|
|
size limitations of the ustar format.
|
|
|
|
Each tar member contains one file archived, and is represented by the
|
|
following sequence:
|
|
|
|
@itemize @bullet
|
|
@item
|
|
An optional extended header block followed by one or more blocks that
|
|
contain the extended header records as if they were the contents of a file;
|
|
i.e., the extended header records are included as the data for this header
|
|
block. This header block is of the form described in pax header block, with
|
|
a typeflag value of 'x'.
|
|
|
|
@item
|
|
A header block in ustar format that describes the file. Any fields defined
|
|
in the preceding optional extended header records override the associated
|
|
fields in this header block for this file.
|
|
|
|
@item
|
|
Zero or more blocks that contain the contents of the file.
|
|
@end itemize
|
|
|
|
Each tar member must be contiguously stored in a lzip member for the
|
|
parallel decoding operations like @option{--list} to work. If any tar member
|
|
is split over two or more lzip members, the archive must be decoded
|
|
sequentially. @xref{Multi-threaded decoding}.
|
|
|
|
At the end of the archive file there are two 512-byte blocks filled with
|
|
binary zeros, interpreted as an end-of-archive indicator. These EOA blocks
|
|
are either compressed in a separate lzip member or compressed along with the
|
|
tar members contained in the last lzip member. For a compressed archive to
|
|
be recognized by tarlz as appendable, the last lzip member must contain
|
|
between 512 and 32256 zeros alone (without any nonzero bytes).
|
|
|
|
The diagram below shows the correspondence between each tar member (formed
|
|
by one or two headers plus optional data) in the tar archive and each
|
|
@uref{http://www.nongnu.org/lzip/manual/lzip_manual.html#File-format,,lzip member}
|
|
in the resulting multimember tar.lz archive, when per file compression is
|
|
used:
|
|
@ifnothtml
|
|
@xref{File format,,,lzip}.
|
|
@end ifnothtml
|
|
|
|
@verbatim
tar
+========+======+=================+===============+========+======+========+
| header | data | extended header | extended data | header | data |   EOA  |
+========+======+=================+===============+========+======+========+

tar.lz
+===============+=================================================+========+
|     member    |                      member                     | member |
+===============+=================================================+========+
@end verbatim
|
|
|
|
@ignore
|
|
When @option{--permissive} is used, the following violations of the archive
|
|
format are allowed:@*
|
|
If several extended headers precede a ustar header, only the last extended
|
|
header takes effect. The other extended headers are ignored. Similarly, if
|
|
several records with the same keyword appear in the same block of extended
|
|
records, only the last record for the repeated keyword takes effect. The
|
|
other records for the repeated keyword are ignored.@*
|
|
A global header inserted between an extended header and a ustar header.@*
|
|
An extended header just before the end-of-archive blocks.
|
|
@end ignore
|
|
|
|
@section Pax header block
|
|
|
|
The pax header block is identical to the ustar header block described below
|
|
except that the typeflag has the value 'x' (extended). The field
|
|
@samp{size} is the size of the extended header data in bytes. Most other
|
|
fields in the pax header block are zeroed on archive creation to prevent
|
|
trouble if the archive is read by a ustar tool, and are ignored by tarlz on
|
|
archive extraction. @xref{flawed-compat}.
|
|
|
|
Tarlz limits the size of the pax extended header data so that the whole
|
|
header set (extended header + extended data + ustar header) can be read and
|
|
decoded in a buffer of size INT_MAX.
|
|
|
|
The pax extended header data consists of one or more records, each of
|
|
them constructed as follows:@*
|
|
@w{@samp{"%d %s=%s\n", <length>, <keyword>, <value>}}
|
|
|
|
The fields <length> and <keyword> in the record must be limited to the
|
|
portable character set (@pxref{Portable character set}). The field <length>
|
|
contains the decimal length of the record in bytes, including the trailing
|
|
newline. The field <value> is stored as-is, without conversion to UTF-8 or
|
|
any other transformation. The fields are separated by the ASCII characters
|
|
space, equal-sign, and newline.
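
The record layout above can be illustrated with a minimal Python sketch. It
is not taken from tarlz, and the function name @code{pax_record} is
hypothetical. Because <length> counts its own digits, the sketch grows the
candidate length until it is self-consistent, which yields the shortest
valid record:

@verbatim
```python
def pax_record(keyword: str, value: bytes) -> bytes:
    """Build one pax extended record: b"<length> <keyword>=<value>\\n".

    <length> is the size in bytes of the whole record, including the
    digits of <length> itself and the trailing newline.
    """
    body = b" " + keyword.encode("ascii") + b"=" + value + b"\n"
    length = len(body) + 1                # first guess: one-digit length
    # Grow the guess until the digits of <length> plus the body add up
    # to <length>. Starting from a lower bound finds the minimum length.
    while len(str(length)) + len(body) != length:
        length = len(str(length)) + len(body)
    return str(length).encode("ascii") + body
```
@end verbatim

With this sketch, @code{pax_record("GNU.crc32", b"00000000")} produces the
22-byte record @w{@samp{22 GNU.crc32=00000000\n}}.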
|
|
|
|
These are the <keyword> values currently supported by tarlz:
|
|
|
|
@table @code
|
|
@item atime
|
|
The signed decimal representation of the access time of the following file
|
|
in seconds since (or before) the epoch, obtained from the function
|
|
@samp{stat}. The atime record is created only for files with a modification
|
|
time outside of the ustar range. @xref{ustar-mtime}.
|
|
|
|
@item gid
|
|
The unsigned decimal representation of the group ID of the group that owns
|
|
the following file. The gid record is created only for files with a group ID
|
|
greater than 2_097_151 @w{(octal 7_777_777)}. @xref{ustar-uid-gid}.
|
|
|
|
@item linkpath
|
|
The file name of a link being created to another file, of any type,
|
|
previously archived. This record overrides the field @samp{linkname} in the
|
|
following ustar header block. The following ustar header block determines
|
|
the type of link created. If typeflag of the following header block is '1', a
|
|
hard link is created. If typeflag is '2', a symbolic link is created and the
|
|
linkpath value is used as the contents of the symbolic link. The linkpath
|
|
record is created only for links with a link name that does not fit in the
|
|
space provided by the ustar header.
|
|
|
|
@item mtime
|
|
The signed decimal representation of the modification time of the following
|
|
file in seconds since (or before) the epoch, obtained from the function
|
|
@samp{stat}. This record overrides the field @samp{mtime} in the following
|
|
ustar header block. The mtime record is created only for files with a
|
|
modification time outside of the ustar range. @xref{ustar-mtime}.
|
|
|
|
@item path
|
|
The file name of the following file. This record overrides the fields
|
|
@samp{name} and @samp{prefix} in the following ustar header block. The path
|
|
record is created for files with a name that does not fit in the space
|
|
provided by the ustar header, but is also created for files that require any
|
|
other extended record so that the fields @samp{name} and @samp{prefix} in
|
|
the following ustar header block can be zeroed.
|
|
|
|
@item size
|
|
The size of the file in bytes, expressed as a decimal number using digits
|
|
from the ISO/IEC 646:1991 (ASCII) standard. This record overrides the field
|
|
@samp{size} in the following ustar header block. The size record is created
|
|
only for files with a size value greater than 8_589_934_591
|
|
@w{(octal 77_777_777_777)}; that is, @w{8 GiB} (2^33 bytes) or larger.
|
|
|
|
@item uid
|
|
The unsigned decimal representation of the user ID of the file owner of the
|
|
following file. The uid record is created only for files with a user ID
|
|
greater than 2_097_151 @w{(octal 7_777_777)}. @xref{ustar-uid-gid}.
|
|
|
|
@anchor{key_crc32}
|
|
@item GNU.crc32
|
|
CRC32-C (Castagnoli) of the extended header data excluding the 8 bytes
|
|
representing the CRC <value> itself. The <value> is represented as 8
|
|
hexadecimal digits in big endian order, @w{@samp{22 GNU.crc32=00000000\n}}.
|
|
The option @option{--missing-crc} guarantees that corruption is always
|
|
detected (except in case of CRC collision). A CRC was chosen because a
|
|
checksum is too weak for a potentially large list of variable sized records.
|
|
A checksum can't detect simple errors like the swapping of two bytes.
|
|
@xref{--missing-crc}.
|
|
|
|
@end table
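
The @samp{GNU.crc32} record can be illustrated with a short Python sketch of
a bitwise CRC32-C. The function names are hypothetical. The sketch assumes
the conventional CRC32-C parameters (initial value and final XOR of all
ones), and it assumes that the bytes before and after the 8 hexadecimal
digits are fed to the CRC as one continuous stream. Neither assumption is
confirmed by this manual, so the values it produces may differ from those
written by tarlz:

@verbatim
```python
def crc32c(data: bytes) -> int:
    """Bitwise CRC32-C (Castagnoli), reflected polynomial 0x82F63B78."""
    crc = 0xFFFFFFFF
    for byte in data:
        crc ^= byte
        for _ in range(8):
            crc = (crc >> 1) ^ (0x82F63B78 if crc & 1 else 0)
    return crc ^ 0xFFFFFFFF

CRC_KEY = b"GNU.crc32="

def crc_excluding_value(ext_data: bytes) -> int:
    """CRC of the extended data, skipping the 8 digits of the CRC value.

    Assumption: the keyword appears exactly once in ext_data.
    """
    pos = ext_data.index(CRC_KEY) + len(CRC_KEY)   # first hex digit
    return crc32c(ext_data[:pos] + ext_data[pos + 8:])
```
@end verbatim

Because the 8 digits are excluded, the result does not depend on the
placeholder stored in the record while the CRC is being computed.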
|
|
|
|
At verbosity level 1 or higher tarlz prints a diagnostic for each unknown
|
|
extended header keyword found in an archive, once per keyword.
|
|
|
|
@section Ustar header block
|
|
|
|
The ustar header block has a length of 512 bytes and is structured as
|
|
shown in the following table. All lengths and offsets are in decimal:
|
|
|
|
@multitable {Field Name} {Offset} {Length (in bytes)}
|
|
@headitem Field Name @tab Offset @tab Length (in bytes)
|
|
@item name @tab 0 @tab 100
|
|
@item mode @tab 100 @tab 8
|
|
@item uid @tab 108 @tab 8
|
|
@item gid @tab 116 @tab 8
|
|
@item size @tab 124 @tab 12
|
|
@item mtime @tab 136 @tab 12
|
|
@item chksum @tab 148 @tab 8
|
|
@item typeflag @tab 156 @tab 1
|
|
@item linkname @tab 157 @tab 100
|
|
@item magic @tab 257 @tab 6
|
|
@item version @tab 263 @tab 2
|
|
@item uname @tab 265 @tab 32
|
|
@item gname @tab 297 @tab 32
|
|
@item devmajor @tab 329 @tab 8
|
|
@item devminor @tab 337 @tab 8
|
|
@item prefix @tab 345 @tab 155
|
|
@item padding @tab 500 @tab 12
|
|
@end multitable
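
As an illustration only, the table above can be transcribed into a Python
@code{struct} layout that splits a 512-byte block into its raw fields. The
names @code{USTAR_FIELDS} and @code{split_header} are hypothetical. The
sketch does not interpret or validate the field contents:

@verbatim
```python
import struct

# Field names and lengths copied from the table above, in offset order.
USTAR_FIELDS = [
    ("name", 100), ("mode", 8), ("uid", 8), ("gid", 8), ("size", 12),
    ("mtime", 12), ("chksum", 8), ("typeflag", 1), ("linkname", 100),
    ("magic", 6), ("version", 2), ("uname", 32), ("gname", 32),
    ("devmajor", 8), ("devminor", 8), ("prefix", 155), ("padding", 12),
]

# "<" disables alignment padding; every field is a raw byte string.
USTAR_STRUCT = struct.Struct("<" + "".join(f"{n}s" for _, n in USTAR_FIELDS))

def split_header(block: bytes) -> dict:
    """Return the raw bytes of each field of one ustar header block."""
    if len(block) != USTAR_STRUCT.size:
        raise ValueError("a ustar header block must be 512 bytes long")
    names = [name for name, _ in USTAR_FIELDS]
    return dict(zip(names, USTAR_STRUCT.unpack(block)))
```
@end verbatim

The field lengths add up to 512, so @code{USTAR_STRUCT.size} matches the
block length stated above.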
|
|
|
|
All characters in the header block are coded using the ISO/IEC 646:1991
|
|
(ASCII) standard, except in fields storing names for files, users, and
|
|
groups. For maximum portability between implementations, names should only
|
|
contain characters from the portable character set (@pxref{Portable
|
|
character set}), but if an implementation supports the use of characters
|
|
outside of @samp{/} and the portable character set in names for files,
|
|
users, and groups, tarlz will use the byte values in these names unmodified.
|
|
|
|
The fields @samp{name}, @samp{linkname}, and @samp{prefix} are
|
|
null-terminated character strings, except when every byte of the field,
including the last one, is non-null.
|
|
|
|
The fields @samp{name} and @samp{prefix} produce the file name. A new file
|
|
name is formed, if prefix is not an empty string (its first character is not
|
|
null), by concatenating prefix (up to the first null character), a slash
|
|
character, and name; otherwise, name is used alone. In either case, name is
|
|
terminated at the first null character. If prefix begins with a null
|
|
character, it is ignored. In this manner, file names of at most 256
|
|
characters can be supported. If a file name does not fit in the space
|
|
provided, an extended record is used to store the file name.
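
The rule for combining @samp{prefix} and @samp{name} can be sketched in
Python as follows. The function name @code{ustar_file_name} is hypothetical.
The sketch takes the raw 100-byte and 155-byte fields and does not consider
any @samp{path} extended record:

@verbatim
```python
def ustar_file_name(name: bytes, prefix: bytes) -> bytes:
    """Form the file name from the raw 'name' and 'prefix' fields."""
    name = name.split(b"\0", 1)[0]        # stop at the first null, if any
    prefix = prefix.split(b"\0", 1)[0]    # empty if it begins with a null
    return prefix + b"/" + name if prefix else name
```
@end verbatim

When both fields are completely filled, the result is 155 + 1 + 100 = 256
bytes long, which is the limit mentioned above.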
|
|
|
|
The field @samp{linkname} does not use the prefix to produce a file name. If
|
|
the link name does not fit in the 100 characters provided, an extended
|
|
record is used to store the link name.
|
|
|
|
The field @samp{mode} provides 12 access permission bits. The following
|
|
table shows the symbolic name of each bit and its octal value:
|
|
|
|
@multitable {Bit Name} {Value} {Bit Name} {Value} {Bit Name} {Value}
|
|
@headitem Bit Name @tab Value @tab Bit Name @tab Value @tab Bit Name @tab Value
|
|
@item S_ISUID @tab 04000 @tab S_ISGID @tab 02000 @tab S_ISVTX @tab 01000
|
|
@item S_IRUSR @tab 00400 @tab S_IWUSR @tab 00200 @tab S_IXUSR @tab 00100
|
|
@item S_IRGRP @tab 00040 @tab S_IWGRP @tab 00020 @tab S_IXGRP @tab 00010
|
|
@item S_IROTH @tab 00004 @tab S_IWOTH @tab 00002 @tab S_IXOTH @tab 00001
|
|
@end multitable
|
|
|
|
@anchor{ustar-uid-gid}
|
|
The fields @samp{uid} and @samp{gid} are the user and group IDs of the owner
|
|
and group of the file, respectively. If the file uid or gid are greater than
|
|
2_097_151 @w{(octal 7_777_777)}, an extended record is used to store the uid
|
|
or gid.
|
|
|
|
The field @samp{size} contains the octal representation of the size of the
|
|
file in bytes. If the field @samp{typeflag} specifies a file of type '0'
|
|
(regular file) or '7' (high performance regular file), the number of logical
|
|
records following the header is @w{(size / 512)} rounded up to the next
|
|
integer. For all other values of typeflag, tarlz either sets the size field
|
|
to 0 or ignores it, and does not store or expect any logical records
|
|
following the header. If the file size is larger than 8_589_934_591 bytes
|
|
@w{(octal 77_777_777_777)}, an extended record is used to store the file size.
|
|
|
|
@anchor{ustar-mtime}
|
|
The field @samp{mtime} contains the octal representation of the modification
|
|
time of the file at the time it was archived, obtained from the function
|
|
@samp{stat}. If the modification time is negative or larger than
|
|
8_589_934_591 @w{(octal 77_777_777_777)} seconds since the epoch, an extended
|
|
record is used to store the modification time. The ustar range of mtime goes
|
|
from @w{@samp{1970-01-01 00:00:00 UTC}} to @w{@samp{2242-03-16 12:56:31 UTC}}.
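
The numeric limits in the paragraphs above follow from the field widths: an
8-byte field holds 7 octal digits and a 12-byte field holds 11 octal digits,
plus a terminating character. The Python sketch below is only an arithmetic
check of those limits and of the number of data blocks. All names in it are
hypothetical:

@verbatim
```python
from datetime import datetime, timedelta, timezone

USTAR_MAX_ID = 0o7_777_777            # 2_097_151, largest 7-digit octal
USTAR_MAX_SIZE = 0o77_777_777_777     # 8_589_934_591, largest 11-digit octal
USTAR_MAX_MTIME = USTAR_MAX_SIZE      # mtime has the same field width

def data_blocks(size: int) -> int:
    """Number of 512-byte logical records holding 'size' bytes of data."""
    return (size + 511) // 512

def needs_size_record(size: int) -> bool:
    """True if the size does not fit in the ustar 'size' field."""
    return size > USTAR_MAX_SIZE

def last_ustar_mtime() -> datetime:
    """Upper end of the ustar mtime range as a UTC date."""
    epoch = datetime(1970, 1, 1, tzinfo=timezone.utc)
    return epoch + timedelta(seconds=USTAR_MAX_MTIME)
```
@end verbatim

Here @code{last_ustar_mtime()} corresponds to the upper end of the mtime
range given above, and a file of exactly @w{8 GiB} is the smallest one for
which @code{needs_size_record} returns true.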
|
|
|
|
The field @samp{chksum} contains the octal representation of the value of
|
|
the simple sum of all bytes in the header logical record. Each byte in the
|
|
header is treated as an unsigned value. When calculating the checksum, the
|
|
chksum field is treated as if it were all space characters.
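
A minimal Python sketch of this calculation follows. The function name
@code{ustar_chksum} is hypothetical, and the offsets 148 and 156 delimit the
@samp{chksum} field as shown in the table of fields:

@verbatim
```python
def ustar_chksum(block: bytes) -> int:
    """Simple sum of the 512 header bytes, each taken as unsigned.

    The 8 bytes of the chksum field (offsets 148 to 155) are counted
    as if they were ASCII spaces (0x20), whatever they contain.
    """
    if len(block) != 512:
        raise ValueError("a ustar header block must be 512 bytes long")
    return sum(block[:148]) + 8 * 0x20 + sum(block[156:])
```
@end verbatim

A header can be checked by comparing this value with the octal number
decoded from its @samp{chksum} field.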
|
|
|
|
The field @samp{typeflag} contains a single character specifying the type of
|
|
file archived:
|
|
|
|
@table @code
|
|
@item '0'
|
|
Regular file.
|
|
|
|
@item '1'
|
|
Hard link to another file, of any type, previously archived. Hard links must
|
|
not contain file data.
|
|
|
|
@item '2'
|
|
Symbolic link.
|
|
|
|
@item '3', '4'
|
|
Character special file and block special file respectively. In this case the
|
|
fields @samp{devmajor} and @samp{devminor} contain information defining the
|
|
device in unspecified format.
|
|
|
|
@item '5'
|
|
Directory.
|
|
|
|
@item '6'
|
|
FIFO special file.
|
|
|
|
@item '7'
|
|
Reserved to represent a file to which an implementation has associated some
|
|
high-performance attribute (contiguous file). Tarlz treats this type of file
|
|
as a regular file (type '0').
|
|
|
|
@end table
|
|
|
|
The field @samp{magic} contains the ASCII null-terminated string "ustar".
|
|
The field @samp{version} contains the characters "00" (0x30,0x30). The
|
|
fields @samp{uname} and @samp{gname} are null-terminated character strings,
except when every byte of the field, including the last one, is non-null.
Each numeric field contains a leading space-
or zero-filled, optionally null-terminated octal number using digits from
|
|
the ISO/IEC 646:1991 (ASCII) standard. Tarlz is able to decode numeric
|
|
fields one byte longer than standard ustar by not requiring a terminating
|
|
null character.
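
Decoding such a numeric field can be sketched in Python as shown below. The
function name @code{parse_octal_field} is hypothetical. Returning 0 for an
empty field and stripping trailing spaces are assumptions of the sketch, not
documented behavior of tarlz:

@verbatim
```python
def parse_octal_field(field: bytes) -> int:
    """Decode a space- or zero-filled octal number from a header field.

    The terminating null is optional, so a field filled entirely with
    digits yields one more digit than standard ustar allows.
    """
    digits = field.split(b"\0", 1)[0].strip(b" ")
    if not digits:
        return 0                      # assumption: empty field means 0
    return int(digits.decode("ascii"), 8)   # ValueError on invalid digits
```
@end verbatim

For example, a 12-byte field holding twelve @samp{7} digits and no null
decodes to 8^12 - 1, while the standard 11 digits plus null decode to at
most 8^11 - 1.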
|
|
|
|
|
|
@node Amendments to pax format
|
|
@chapter The reasons for the differences with pax
|
|
@cindex Amendments to pax format
|
|
|
|
Tarlz creates safe archives that allow the reliable detection of invalid or
|
|
corrupt metadata during decoding even when the integrity checking of lzip
|
|
can't be used because the lzip members are only decompressed partially, as
happens in parallel @option{--diff}, @option{--list}, and @option{--extract}.
|
|
In order to achieve this goal and avoid some other flaws in the pax format,
|
|
tarlz makes some changes to the variant of the pax format that it uses. This
|
|
chapter describes these changes and the concrete reasons to implement them.
|
|
|
|
@anchor{crc32}
|
|
@section Add a CRC of the extended records
|
|
|
|
The POSIX pax format has a serious flaw. The metadata stored in pax extended
|
|
records are not protected by any kind of check sequence. Corruption in a
|
|
long file name may cause the extraction of the file in the wrong place
|
|
without warning. Corruption in a large file size may cause the truncation of
|
|
the file or the appending of garbage to the file, both followed by a
|
|
spurious warning about a corrupt header far from the place of the undetected
|
|
corruption.
|
|
|
|
Metadata like file name and file size must always be protected in an archive
format because of the adverse effects of undetected corruption in them,
potentially much worse than undetected corruption in the data. Even more so
|
|
in the case of pax because the amount of metadata it stores is potentially
|
|
large, making undetected corruption and archiver misbehavior more probable.
|
|
|
|
Headers and metadata must be protected separately from data because the
|
|
integrity checking of lzip may not be able to detect the corruption before
|
|
the metadata have been used, for example, to create a new file in the wrong
|
|
place.
|
|
|
|
Because of the above, tarlz protects the extended records with a Cyclic
|
|
Redundancy Check (CRC) in a way compatible with standard tar tools.
|
|
@xref{key_crc32}.
|
|
|
|
@anchor{flawed-compat}
|
|
@section Remove flawed backward compatibility
|
|
|
|
In order to allow the extraction of pax archives by a tar utility conforming
|
|
to the POSIX-2:1993 standard, POSIX.1-2008 recommends selecting extended
|
|
header field values that allow such tar to create a regular file containing
|
|
the extended header records as data. This approach is broken because if the
|
|
extended header is needed because of a long file name, the fields
|
|
@samp{name} and @samp{prefix} are unable to contain the full file name.
|
|
(Some tar implementations store the truncated name in the field @samp{name}
|
|
alone, truncating the name to only 100 bytes instead of 256). Therefore the
|
|
files corresponding to both the extended header and the overridden ustar
|
|
header are extracted using truncated file names, perhaps overwriting
|
|
existing files or directories. It may be a security risk to extract a file
|
|
with a truncated file name.
|
|
|
|
To avoid this problem, tarlz writes extended headers with all fields zeroed
|
|
except @samp{size} (which contains the size of the extended records),
|
|
@samp{chksum}, @samp{typeflag}, @samp{magic}, and @samp{version}. In
|
|
particular, tarlz sets the fields @samp{name} and @samp{prefix} to zero.
|
|
This prevents old tar programs from extracting the extended records as a
|
|
file in the wrong place. Tarlz also sets to zero those fields of the ustar
|
|
header overridden by extended records. Finally, tarlz skips members with
|
|
zeroed @samp{name} and @samp{prefix} when decoding, except when listing.
|
|
This is needed to detect certain format violations during parallel
|
|
extraction.
|
|
|
|
If an extended header is required for any reason (for example a file size of
|
|
@w{8 GiB} or larger, or a link name longer than 100 bytes), tarlz also moves
|
|
the file name to the extended records to prevent a ustar tool from trying to
|
|
extract the file or link. During parallel decoding, this also makes it easier
to detect a tar member split between two lzip members at the boundary
between the extended header and the ustar header.
|
|
|
|
@section As simple as possible (but not simpler)
|
|
|
|
The tarlz format is mainly ustar. Extended pax headers are used only when
|
|
needed because the length of a file name or link name, or the size or other
attribute of a file, exceeds the limits of the ustar format. Adding @w{1 KiB}
|
|
of extended header and records to each member just to save subsecond
|
|
timestamps seems wasteful for a backup format. Moreover, minimizing the
|
|
overhead may help recover the archive with lziprecover in case of
|
|
corruption.
|
|
|
|
Global pax headers are tolerated, but not supported; they are parsed and
|
|
ignored. Some operations may not behave as expected if the archive contains
|
|
global headers.
|
|
|
|
@section Improve reproducibility
|
|
|
|
Pax includes by default the process ID of the pax process in the ustar name
|
|
of the extended headers, making the archive not reproducible. Tarlz stores
|
|
the true name of the file just once, either in the ustar header or in the
|
|
extended records, making it easier to produce reproducible archives.
|
|
|
|
Pax allows an extended record to have length x-1 or x if x is a power of
|
|
ten; @samp{99<97_bytes>} or @samp{100<97_bytes>}. Tarlz minimizes the length
|
|
of the record and always produces a length of x-1 in these cases.
|
|
|
|
@section No data in hard links
|
|
|
|
Tarlz does not allow data in hard link members. The data (if any) must be in
|
|
the member determining the type of the file (which can't be a link). If all
|
|
the names of a file are stored as hard links, the type of the file is lost.
|
|
Not allowing data in hard links also prevents invalid actions like
|
|
extracting file data for a hard link to a symbolic link or to a directory.
|
|
|
|
@section Avoid misconversions to/from UTF-8
|
|
|
|
There is no portable way to tell which charset a text string is encoded in.
Therefore, tarlz stores all fields representing text strings unmodified,
without conversion to UTF-8 or any other transformation. This prevents
|
|
accidental double UTF-8 conversions.
|
|
|
|
|
|
@node Program design
|
|
@chapter Internal structure of tarlz
|
|
@cindex program design
|
|
|
|
The parts of tarlz related to sequential processing of the archive are more
|
|
or less similar to any other tar and won't be described here. The interesting
|
|
parts described here are those related to multi-threaded processing.
|
|
|
|
The structure of the part of tarlz performing multi-threaded archive
|
|
creation is somewhat similar to that of
|
|
@uref{http://www.nongnu.org/lzip/manual/plzip_manual.html#Program-design,,plzip}
|
|
with the added complication of the solidity levels.
|
|
@ifnothtml
|
|
@xref{Program design,,,plzip}.
|
|
@end ifnothtml
|
|
A grouper thread and several worker threads are created, with the main
thread acting as the muxer (multiplexer) thread. A 'packet courier' takes care of data
|
|
transfers among threads and limits the maximum number of data blocks
|
|
(packets) being processed simultaneously.
|
|
|
|
The grouper traverses the directory tree, groups together the metadata of
|
|
the files to be archived in each lzip member, and distributes them to the
|
|
workers. The workers compress the metadata received from the grouper along
|
|
with the file data read from the file system. The muxer collects processed
|
|
packets from the workers, and writes them to the archive.
|
|
|
|
@verbatim
.--------.
|    data|---> to each worker below
|        |                    .------------.
|  file  |                ,-->|  worker 0  |--,
| system |                |   `------------'  |
|        |    .---------. |   .------------.  |   .-------.   .---------.
|metadata|--->| grouper |-+-->|  worker 1  |--+-->| muxer |-->| archive |
`--------'    `---------' |   `------------'  |   `-------'   `---------'
                          |        ...        |
                          |   .------------.  |
                          `-->| worker N-1 |--'
                              `------------'
@end verbatim
|
|
|
|
Decoding an archive is somewhat similar to how plzip decompresses a regular
|
|
file to standard output, with the differences that only messages, not data,
are written to stdout/stderr, and that each worker may
|
|
access files in the file system either to read them (diff) or write them
|
|
(extract). As in plzip, each worker reads members directly from the archive.
|
|
|
|
@verbatim
.--------.
|  file  |<---> data to/from each worker below
| system |
`--------'      .------------.
            ,-->|  worker 0  |--,
            |   `------------'  |
.---------. |   .------------.  |   .-------.   .--------.
| archive |-+-->|  worker 1  |--+-->| muxer |-->| stdout |
`---------' |   `------------'  |   `-------'   | stderr |
            |        ...        |               `--------'
            |   .------------.  |
            `-->| worker N-1 |--'
                `------------'
@end verbatim
|
|
|
|
As misaligned tar.lz archives can't be decoded in parallel, and the
|
|
misalignment can't be detected until after decoding has started, a
|
|
'mastership request' mechanism has been designed that allows the decoding to
|
|
continue instead of exiting with an error.
|
|
|
|
During parallel decoding, if a worker finds a misalignment, it requests
|
|
mastership to decode the rest of the archive. When mastership is requested,
|
|
an error_member_id is set, and all subsequently received packets with
|
|
member_id > error_member_id are rejected. All workers requesting mastership
|
|
are blocked at the request_mastership call until mastership is granted.
|
|
Mastership is granted to the delivering worker when its queue is empty to
|
|
make sure that all preceding packets have been processed. When mastership is
|
|
granted, all packets are deleted and all subsequently received packets not
|
|
coming from the master are rejected.
|
|
|
|
If a worker can't continue decoding for any cause (for example lack of
|
|
memory or finding a split tar member at the beginning of a lzip member), it
|
|
requests mastership to print an error and terminate the program. Only if
|
|
some other worker requests mastership in a previous lzip member can this
|
|
error be avoided.
|
|
|
|
|
|
@node Multi-threaded decoding
|
|
@chapter Limitations of parallel tar decoding
|
|
@cindex parallel tar decoding
|
|
|
|
Safely decoding a tar archive in parallel is only possible if one decodes
|
|
the headers sequentially first. For example, if a tar archive containing
|
|
another tar archive is decoded starting from some position other than the
|
|
beginning, there is no way to know if the first header found there belongs
|
|
to the outer tar archive or to the inner tar archive. Tar is an inherently
serial format; it was designed for tapes.
|
|
|
|
The pax format is even more serial than the ustar format. Two headers need
|
|
to be decoded sequentially for each file. The extended header may even need
|
|
parsing to reveal something as basic as file size. If a thread decodes the
|
|
ustar header skipping the preceding extended header, it may extract a file
|
|
of incorrect size at the wrong place. Moreover, a pax archive with global
|
|
headers can't be decoded in parallel because each thread can't know about
|
|
the global headers decoded by other threads.
|
|
|
|
In the case of compressed tar archives, the start of each compressed block
|
|
determines one point through which the tar archive can be decoded in
|
|
parallel. Therefore, in tar.lz archives the decoding operations can't be
|
|
parallelized if the tar members are not aligned with the lzip members. Tar
|
|
archives compressed with plzip can't be decoded in parallel because tar and
|
|
plzip do not have a way to align both sets of members. Certainly one can
|
|
decompress one such archive with a multi-threaded tool like plzip, but the
|
|
increase in speed is not as large as it could be because plzip must
|
|
serialize the decompressed data and pass them to tar, which decodes them
|
|
sequentially, one tar member at a time.
|
|
|
|
On the other hand, if the tar.lz archive is created with a tool like tarlz,
|
|
which can guarantee the alignment between tar members and lzip members
|
|
because it controls both archiving and compression, then the lzip format
|
|
becomes an indexed layer on top of the tar archive, which makes it possible
to decode it safely in parallel.
|
|
|
|
Tarlz is able to automatically decode aligned and unaligned multimember
|
|
tar.lz archives, keeping backwards compatibility. If tarlz finds a member
|
|
misalignment during multi-threaded decoding, it switches to single-threaded
|
|
mode and continues decoding the archive.
|
|
|
|
@anchor{mt-listing}
|
|
@section Multi-threaded listing
|
|
|
|
If the files in the archive are large, multi-threaded @option{--list} on a
|
|
regular (seekable) tar.lz archive can be hundreds of times faster than
|
|
sequential @option{--list} because, in addition to using several processors,
|
|
it only needs to decompress part of each lzip member. See the following
|
|
example listing the Silesia corpus on a dual core machine:
|
|
|
|
@example
|
|
tarlz -9 --no-solid -cf silesia.tar.lz silesia
|
|
time lzip -cd silesia.tar.lz | tar -tf - (5.032s)
|
|
time plzip -cd silesia.tar.lz | tar -tf - (3.256s)
|
|
time tarlz -tf silesia.tar.lz (0.020s)
|
|
@end example
|
|
|
|
On the other hand, multi-threaded @option{--list} won't detect corruption in
|
|
the tar member data because it only decodes the part of each lzip member
|
|
corresponding to the tar member header. Partial decoding of a lzip member
|
|
can't guarantee the integrity of the data decoded. This is another reason
|
|
why the tar headers (including the extended records) must provide their own
|
|
integrity checking.
|
|
|
|
@anchor{mt-extraction}
|
|
@section Limitations of multi-threaded extraction
|
|
|
|
Multi-threaded extraction may produce different output than single-threaded
|
|
extraction in some cases:
|
|
|
|
During multi-threaded extraction, several independent threads are
|
|
simultaneously reading the archive and creating files in the file system.
|
|
The archive is not read sequentially. As a consequence, any error or
|
|
weirdness in the archive (like a corrupt member or an end-of-archive block
|
|
in the middle of the archive) usually won't be detected until part of the
|
|
archive beyond that point has been processed.
|
|
|
|
If the archive contains two or more tar members with the same name,
|
|
single-threaded extraction extracts the members in the order they appear in
|
|
the archive and leaves in the file system the last version of the file. But
|
|
multi-threaded extraction may extract the members in any order and leave in
|
|
the file system any version of the file nondeterministically. It is
|
|
unspecified which of the tar members is extracted.
|
|
|
|
If the same file is extracted through several paths (different member names
|
|
resolve to the same file in the file system), the result is undefined.
|
|
(Probably the resulting file will be mangled).
|
|
|
|
Extraction of a hard link may fail if it is extracted before the file it
|
|
links to.
|
|
|
|
|
|
@node Minimum archive sizes
|
|
@chapter Minimum archive sizes required for multi-threaded block compression
|
|
@cindex minimum archive sizes
|
|
|
|
When creating or appending to a compressed archive using multi-threaded
|
|
block compression, tarlz puts tar members together in blocks and compresses
|
|
as many blocks simultaneously as worker threads are chosen, creating a
|
|
multimember compressed archive.
|
|
|
|
For this to work as expected (and roughly multiply the compression speed by
|
|
the number of available processors), the uncompressed archive must be at
|
|
least as large as the number of worker threads times the block size
|
|
(@pxref{--data-size}). Otherwise some processors do not get any data to compress,
|
|
and compression is proportionally slower. The maximum speed increase
|
|
achievable on a given archive is limited by the ratio
|
|
@w{(uncompressed_size / data_size)}. For example, a tarball the size of gcc
|
|
or linux scales up to 10 or 14 processors at level -9.
|
|
|
|
The following table shows the minimum uncompressed archive size needed for
|
|
full use of N processors at a given compression level, using the default
|
|
data size for each level:
|
|
|
|
@multitable {Processors} {512 MiB} {512 MiB} {512 MiB} {512 MiB} {512 MiB} {512 MiB}
@headitem Processors @tab 2 @tab 4 @tab 8 @tab 16 @tab 64 @tab 256
@item Level
@item -0 @tab 2 MiB @tab 4 MiB @tab 8 MiB @tab 16 MiB @tab 64 MiB @tab 256 MiB
@item -1 @tab 4 MiB @tab 8 MiB @tab 16 MiB @tab 32 MiB @tab 128 MiB @tab 512 MiB
@item -2 @tab 6 MiB @tab 12 MiB @tab 24 MiB @tab 48 MiB @tab 192 MiB @tab 768 MiB
@item -3 @tab 8 MiB @tab 16 MiB @tab 32 MiB @tab 64 MiB @tab 256 MiB @tab 1 GiB
@item -4 @tab 12 MiB @tab 24 MiB @tab 48 MiB @tab 96 MiB @tab 384 MiB @tab 1.5 GiB
@item -5 @tab 16 MiB @tab 32 MiB @tab 64 MiB @tab 128 MiB @tab 512 MiB @tab 2 GiB
@item -6 @tab 32 MiB @tab 64 MiB @tab 128 MiB @tab 256 MiB @tab 1 GiB @tab 4 GiB
@item -7 @tab 64 MiB @tab 128 MiB @tab 256 MiB @tab 512 MiB @tab 2 GiB @tab 8 GiB
@item -8 @tab 96 MiB @tab 192 MiB @tab 384 MiB @tab 768 MiB @tab 3 GiB @tab 12 GiB
@item -9 @tab 128 MiB @tab 256 MiB @tab 512 MiB @tab 1 GiB @tab 4 GiB @tab 16 GiB
@end multitable
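
@noindent
As a rough worked illustration of the sizing rule (a sketch, not a
measurement): the -9 row of the table implies a default data size of
64 MiB @w{(128 MiB / 2 processors)}. The choice of 8 worker threads and the
900 MiB archive size below are hypothetical values used only for the
arithmetic.

@example
minimum size for 8 threads = 8 * 64 MiB       = 512 MiB
speed limit for 900 MiB    = 900 MiB / 64 MiB = about 14
@end example

@noindent
Under these assumptions the archive is large enough to keep 8 threads busy,
and more than about 14 threads would not be expected to help. Assuming the
@samp{-n} (@samp{--threads}) option is used to choose the number of worker
threads, and with @file{dir1} as a placeholder directory name, such a run
might be requested as:

@example
tarlz -9 -n 8 -cf archive.tar.lz dir1
@end example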
@node Examples
@chapter A small tutorial with examples
@cindex examples

@noindent
Example 1: Create a multimember compressed archive @file{archive.tar.lz}
containing files @file{a}, @file{b} and @file{c}.

@example
tarlz -cf archive.tar.lz a b c
@end example

@noindent
Example 2: Append files @file{d} and @file{e} to the multimember compressed
archive @file{archive.tar.lz}.

@example
tarlz -rf archive.tar.lz d e
@end example

@noindent
Example 3: Create a solidly compressed appendable archive
@file{archive.tar.lz} containing files @file{a}, @file{b} and @file{c}.
Then append files @file{d} and @file{e} to the archive.

@example
tarlz --asolid -cf archive.tar.lz a b c
tarlz --asolid -rf archive.tar.lz d e
@end example

@noindent
Example 4: Create a compressed appendable archive containing directories
@file{dir1}, @file{dir2} and @file{dir3} with a separate lzip member per
directory. Then append files @file{a}, @file{b}, @file{c}, @file{d} and
@file{e} to the archive, all of them contained in a single lzip member.
The resulting archive @file{archive.tar.lz} contains 5 lzip members
(including the end-of-archive member).

@example
tarlz --dsolid -cf archive.tar.lz dir1 dir2 dir3
tarlz --asolid -rf archive.tar.lz a b c d e
@end example

@noindent
Example 5: Create a solidly compressed archive @file{archive.tar.lz}
containing files @file{a}, @file{b} and @file{c}. Note that no more
files can be appended to the archive later.

@example
tarlz --solid -cf archive.tar.lz a b c
@end example

@noindent
Example 6: Extract all files from archive @file{archive.tar.lz}.

@example
tarlz -xf archive.tar.lz
@end example

@noindent
Example 7: Extract files @file{a} and @file{c}, and the whole tree under
directory @file{dir1} from archive @file{archive.tar.lz}.

@example
tarlz -xf archive.tar.lz a c dir1
@end example

@noindent
Example 8: Copy the contents of directory @file{sourcedir} to the directory
@file{destdir}.

@example
tarlz -C sourcedir --uncompressed -cf - . | tarlz -C destdir -xf -
@end example

@noindent
Example 9: Compress the existing POSIX archive @file{archive.tar} and write
the output to @file{archive.tar.lz}. Compress each member individually for
maximum availability. (If one member in the compressed archive gets damaged,
the other members can still be extracted.)

@example
tarlz -z --no-solid archive.tar
@end example

@noindent
Example 10: Recompress the archive @file{archive.tar.lz} with different
solidity, write the output to @file{archive-ns.tar.lz}, and compare both
archives.

@example
lzip -cd archive.tar.lz | tarlz -9z --no-solid -o archive-ns.tar.lz
zcmp archive.tar.lz archive-ns.tar.lz
@end example

@noindent
Example 11: Concatenate and compress two archives @file{archive1.tar} and
@file{archive2.tar}, and write the output to @file{foo.tar.lz}.

@example
tarlz -A archive1.tar archive2.tar | tarlz -z -o foo.tar.lz
@end example


@node Problems
@chapter Reporting bugs
@cindex bugs
@cindex getting help

There are probably bugs in tarlz. There are certainly errors and
omissions in this manual. If you report them, they will get fixed. If
you don't, no one will ever know about them and they will remain unfixed
for all eternity, if not longer.

If you find a bug in tarlz, please send electronic mail to
@email{lzip-bug@@nongnu.org}. Include the version number, which you can
find by running @w{@samp{tarlz --version}} and
@w{@samp{tarlz -v --check-lib}}.


@node Concept index
@unnumbered Concept index

@printindex cp

@bye