\input texinfo @c -*-texinfo-*-
|
|
@c %**start of header
|
|
@setfilename tarlz.info
|
|
@documentencoding ISO-8859-15
|
|
@settitle Tarlz Manual
|
|
@finalout
|
|
@c %**end of header
|
|
|
|
@set UPDATED 11 April 2019
|
|
@set VERSION 0.15
|
|
|
|
@dircategory Data Compression
|
|
@direntry
|
|
* Tarlz: (tarlz). Archiver with multimember lzip compression
|
|
@end direntry
|
|
|
|
|
|
@ifnothtml
|
|
@titlepage
|
|
@title Tarlz
|
|
@subtitle Archiver with multimember lzip compression
|
|
@subtitle for Tarlz version @value{VERSION}, @value{UPDATED}
|
|
@author by Antonio Diaz Diaz
|
|
|
|
@page
|
|
@vskip 0pt plus 1filll
|
|
@end titlepage
|
|
|
|
@contents
|
|
@end ifnothtml
|
|
|
|
@node Top
|
|
@top
|
|
|
|
This manual is for Tarlz (version @value{VERSION}, @value{UPDATED}).
|
|
|
|
@menu
|
|
* Introduction:: Purpose and features of tarlz
|
|
* Invoking tarlz:: Command line interface
|
|
* File format:: Detailed format of the compressed archive
|
|
* Amendments to pax format:: The reasons for the differences with pax
|
|
* Multi-threaded tar:: Limitations of parallel tar decoding
|
|
* Minimum archive sizes:: Sizes required for full multi-threaded speed
|
|
* Examples:: A small tutorial with examples
|
|
* Problems:: Reporting bugs
|
|
* Concept index:: Index of concepts
|
|
@end menu
|
|
|
|
@sp 1
|
|
Copyright @copyright{} 2013-2019 Antonio Diaz Diaz.
|
|
|
|
This manual is free documentation: you have unlimited permission
|
|
to copy, distribute and modify it.
|
|
|
|
|
|
@node Introduction
|
|
@chapter Introduction
|
|
@cindex introduction
|
|
|
|
@uref{http://www.nongnu.org/lzip/tarlz.html,,Tarlz} is a massively parallel
|
|
(multi-threaded) combined implementation of the tar archiver and the
|
|
@uref{http://www.nongnu.org/lzip/lzip.html,,lzip} compressor. Tarlz creates,
|
|
lists and extracts archives in a simplified posix pax format compressed with
|
|
lzip, keeping the alignment between tar members and lzip members. This
|
|
method adds an indexed lzip layer on top of the tar archive, making it
|
|
possible to decode the archive safely in parallel. The resulting multimember
|
|
tar.lz archive is fully backward compatible with standard tar tools like GNU
|
|
tar, which treat it like any other tar.lz archive. Tarlz can append files to
|
|
the end of such compressed archives.
|
|
|
|
Tarlz can create tar archives with five levels of compression granularity:
per file (@samp{--no-solid}), per block (@samp{--bsolid}, default), per
directory (@samp{--dsolid}), appendable solid (@samp{--asolid}), and solid
(@samp{--solid}).
|
|
|
|
@noindent
|
|
Of course, compressing each file (or each directory) individually can't
|
|
achieve a compression ratio as high as compressing the whole tar archive
solidly, but it has the following advantages:
|
|
|
|
@itemize @bullet
|
|
@item
|
|
The resulting multimember tar.lz archive can be decompressed in
|
|
parallel, multiplying the decompression speed.
|
|
|
|
@item
|
|
New members can be appended to the archive (by removing the EOF
|
|
member), and unwanted members can be deleted from the archive, just
like an uncompressed tar archive.
|
|
|
|
@item
|
|
It is a safe posix-style backup format. In case of corruption,
|
|
tarlz can extract all the undamaged members from the tar.lz
|
|
archive, skipping over the damaged members, just like the standard
|
|
(uncompressed) tar. Moreover, the option @samp{--keep-damaged} can be
|
|
used to recover as much data as possible from each damaged member,
|
|
and lziprecover can be used to recover some of the damaged members.
|
|
|
|
@item
|
|
A multimember tar.lz archive is usually smaller than the
|
|
corresponding solidly compressed tar.gz archive, except when
|
|
individually compressing files smaller than about 32 KiB.
|
|
@end itemize
|
|
|
|
Tarlz protects the extended records with a CRC in a way compatible with
|
|
standard tar tools. @xref{crc32}.
|
|
|
|
Tarlz does not understand other tar formats like @samp{gnu}, @samp{oldgnu},
|
|
@samp{star} or @samp{v7}.
|
|
|
|
|
|
@node Invoking tarlz
|
|
@chapter Invoking tarlz
|
|
@cindex invoking
|
|
@cindex options
|
|
@cindex usage
|
|
@cindex version
|
|
|
|
The format for running tarlz is:
|
|
|
|
@example
|
|
tarlz [@var{options}] [@var{files}]
|
|
@end example
|
|
|
|
@noindent
|
|
All operations except @samp{--concatenate} operate on whole trees if any
|
|
@var{file} is a directory.
|
|
|
|
On archive creation or appending, tarlz archives the files specified, but
|
|
removes from member names any leading and trailing slashes and any filename
|
|
prefixes containing a @samp{..} component. On extraction, leading and
|
|
trailing slashes are also removed from member names, and archive members
|
|
containing a @samp{..} component in the filename are skipped. Tarlz detects
|
|
when the archive being created or enlarged is among the files to be dumped,
|
|
appended or concatenated, and skips it.
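
The name cleaning described above can be sketched in Python. This is only
an illustration of the stated rules, not the code used by tarlz; the
function name @code{sanitize_member_name} is hypothetical, and the handling
of corner cases (for example a name that ends in @samp{..}) is an
assumption.

@verbatim
```python
def sanitize_member_name(name: str) -> str:
    """Return the member name as it might be stored (illustrative only)."""
    parts = name.split("/")
    # Drop any prefix containing a '..' component: keep only the
    # components that follow the last '..'.
    if ".." in parts:
        last = len(parts) - 1 - parts[::-1].index("..")
        parts = parts[last + 1:]
    # Remove leading and trailing slashes from what is left.
    return "/".join(parts).strip("/")
```
@end verbatim

Under these assumptions, @samp{/usr/../etc/passwd/} would be stored as
@samp{etc/passwd}.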
|
|
|
|
On extraction and listing, tarlz removes leading @samp{./} strings from
|
|
member names in the archive or given in the command line, so that
|
|
@w{@samp{tarlz -xf foo ./bar baz}} extracts members @samp{bar} and
|
|
@samp{./baz} from archive @samp{foo}.
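
A minimal Python sketch of this matching rule follows. The helper names
@code{strip_dot_slash} and @code{select_members} are hypothetical, and the
sketch compares whole names only; it does not model the selection of whole
directory trees.

@verbatim
```python
def strip_dot_slash(name: str) -> str:
    """Remove every leading './' from a member name."""
    while name.startswith("./"):
        name = name[2:]
    return name


def select_members(archive_names, requested):
    """Return the archive members whose cleaned name was requested."""
    wanted = {strip_dot_slash(r) for r in requested}
    return [n for n in archive_names if strip_dot_slash(n) in wanted]
```
@end verbatim

With an archive holding @samp{bar}, @samp{./baz} and @samp{qux}, requesting
@samp{./bar} and @samp{baz} selects @samp{bar} and @samp{./baz}.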
|
|
|
|
If several compression levels or @samp{--*solid} options are given, the last
|
|
setting is used. For example @w{@samp{-9 --solid --uncompressed -1}} is
|
|
equivalent to @samp{-1 --solid}.
|
|
|
|
Tarlz supports the following options:
|
|
|
|
@table @code
|
|
@item --help
|
|
Print an informative help message describing the options and exit.
|
|
|
|
@item -V
|
|
@itemx --version
|
|
Print the version number of tarlz on the standard output and exit.
|
|
This version number should be included in all bug reports.
|
|
|
|
@item -A
|
|
@itemx --concatenate
|
|
Append one or more archives to the end of an archive. All the archives
|
|
involved must be regular (seekable) files, and must be either all compressed
|
|
or all uncompressed. Compressed and uncompressed archives can't be mixed.
|
|
Compressed archives must be multimember lzip files with the two end-of-file
|
|
blocks plus any zero padding contained in the last lzip member of each
|
|
archive. The intermediate end-of-file blocks are removed as each new archive
|
|
is concatenated. If the archive is uncompressed, tarlz parses and skips tar
|
|
headers until it finds the end-of-file blocks. Exit with status 0 without
|
|
modifying the archive if no @var{files} have been specified.
|
|
|
|
@anchor{--data-size}
|
|
@item -B @var{bytes}
|
|
@itemx --data-size=@var{bytes}
|
|
Set target size of input data blocks for the @samp{--bsolid} option.
|
|
@xref{--bsolid}. Valid values range from @w{8 KiB} to @w{1 GiB}. Default
|
|
value is two times the dictionary size, except for option @samp{-0} where it
|
|
defaults to @w{1 MiB}. @xref{Minimum archive sizes}.
|
|
|
|
@item -c
|
|
@itemx --create
|
|
Create a new archive from @var{files}.
|
|
|
|
@item -C @var{dir}
|
|
@itemx --directory=@var{dir}
|
|
Change to directory @var{dir}. When creating or appending, the position
|
|
of each @samp{-C} option in the command line is significant; it will
|
|
change the current working directory for the following @var{files} until
|
|
a new @samp{-C} option appears in the command line. When extracting, all
|
|
the @samp{-C} options are executed in sequence before starting the
|
|
extraction. Listing ignores any @samp{-C} options specified. @var{dir}
|
|
is relative to the then current working directory, perhaps changed by a
|
|
previous @samp{-C} option.
|
|
|
|
Note that a process can only have one current working directory (CWD).
|
|
Therefore multi-threading can't be used to create an archive if a @samp{-C}
|
|
option appears after a relative filename in the command line.
|
|
|
|
@item -d
|
|
@itemx --diff
|
|
Find differences between archive and file system. For each tar member in the
|
|
archive, verify that the corresponding file exists and is of the same type
|
|
(regular file, directory, etc). Report on standard output the differences
|
|
found in type, mode (permissions), owner and group IDs, modification time,
|
|
file size, file contents (of regular files), target (of symlinks) and device
|
|
number (of block/character special files).
|
|
|
|
As tarlz removes leading slashes from member names, the @samp{-C} option may
|
|
be used in combination with @samp{--diff} when absolute filenames were used
|
|
on archive creation: @w{@samp{tarlz -C / -d}}. Alternatively, tarlz may be
|
|
run from the root directory to perform the comparison.
|
|
|
|
@item --ignore-ids
|
|
Make @samp{--diff} ignore differences in owner and group IDs. This option is
|
|
useful when comparing an @samp{--anonymous} archive.
|
|
|
|
@item --delete
|
|
Delete the specified files and directories from an archive in place. It
|
|
currently can delete only from uncompressed archives and from archives with
|
|
individually compressed files (@samp{--no-solid} archives). To delete a
|
|
directory without deleting the files under it, use
|
|
@w{@code{tarlz --delete -f foo --exclude='dir/*' dir}}. Deleting in place
|
|
may be dangerous. A corrupt archive, a power cut, or an I/O error may cause
|
|
data loss.
|
|
|
|
@item --exclude=@var{pattern}
|
|
Exclude files matching a shell pattern like @samp{*.o}. A file is considered
|
|
to match if any component of the filename matches. For example, @samp{*.o}
|
|
matches @samp{foo.o}, @samp{foo.o/bar} and @samp{foo/bar.o}.
|
|
|
|
@item -f @var{archive}
|
|
@itemx --file=@var{archive}
|
|
Use archive file @var{archive}. @samp{-} used as an @var{archive} argument
|
|
reads from standard input or writes to standard output.
|
|
|
|
@item -h
|
|
@itemx --dereference
|
|
Follow symbolic links during archive creation, appending or comparison.
|
|
Archive or compare the files they point to instead of the links themselves.
|
|
|
|
@item -n @var{n}
|
|
@itemx --threads=@var{n}
|
|
Set the number of (de)compression threads, overriding the system's default.
|
|
Valid values range from 0 to "as many as your system can support". A value
|
|
of 0 disables threads entirely. If this option is not used, tarlz tries to
|
|
detect the number of processors in the system and use it as default value.
|
|
@w{@samp{tarlz --help}} shows the system's default value. See the note about
|
|
multi-threaded archive creation in the @samp{-C} option above.
|
|
Multi-threaded extraction of files from an archive is not yet implemented.
|
|
@xref{Multi-threaded tar}.
|
|
|
|
Note that the number of usable threads is limited during compression to
|
|
@w{ceil( uncompressed_size / data_size )} (@pxref{Minimum archive sizes}),
|
|
and during decompression to the number of lzip members in the tar.lz
|
|
archive, which you can find by running @w{@samp{lzip -lv archive.tar.lz}}.
|
|
|
|
@item -q
|
|
@itemx --quiet
|
|
Quiet operation. Suppress all messages.
|
|
|
|
@item -r
|
|
@itemx --append
|
|
Append files to the end of an archive. The archive must be a regular
|
|
(seekable) file either compressed or uncompressed. Compressed members can't
|
|
be appended to an uncompressed archive, nor vice versa. If the archive is
|
|
compressed, it must be a multimember lzip file with the two end-of-file
|
|
blocks plus any zero padding contained in the last lzip member of the
|
|
archive. Appending works as follows: first the end-of-file blocks are
|
|
removed, then the new members are appended, and finally two new end-of-file
|
|
blocks are appended to the archive. If the archive is uncompressed, tarlz
|
|
parses and skips tar headers until it finds the end-of-file blocks. Exit
|
|
with status 0 without modifying the archive if no @var{files} have been
|
|
specified.
|
|
|
|
@item -t
|
|
@itemx --list
|
|
List the contents of an archive. If @var{files} are given, list only the
|
|
@var{files} given.
|
|
|
|
@item -v
|
|
@itemx --verbose
|
|
Verbosely list files processed.
|
|
|
|
@item -x
|
|
@itemx --extract
|
|
Extract files from an archive. If @var{files} are given, extract only the
|
|
@var{files} given. Else extract all the files in the archive. To extract a
|
|
directory without extracting the files under it, use
|
|
@w{@code{tarlz -xf foo --exclude='dir/*' dir}}.
|
|
|
|
@item -0 .. -9
|
|
Set the compression level for @samp{--create} and @samp{--append}. The
|
|
default compression level is @samp{-6}. Like lzip, tarlz also minimizes the
|
|
dictionary size of the lzip members it creates, reducing the amount of
|
|
memory required for decompression.
|
|
|
|
@multitable {Level} {Dictionary size} {Match length limit}
|
|
@item Level @tab Dictionary size @tab Match length limit
|
|
@item -0 @tab 64 KiB @tab 16 bytes
|
|
@item -1 @tab 1 MiB @tab 5 bytes
|
|
@item -2 @tab 1.5 MiB @tab 6 bytes
|
|
@item -3 @tab 2 MiB @tab 8 bytes
|
|
@item -4 @tab 3 MiB @tab 12 bytes
|
|
@item -5 @tab 4 MiB @tab 20 bytes
|
|
@item -6 @tab 8 MiB @tab 36 bytes
|
|
@item -7 @tab 16 MiB @tab 68 bytes
|
|
@item -8 @tab 24 MiB @tab 132 bytes
|
|
@item -9 @tab 32 MiB @tab 273 bytes
|
|
@end multitable
|
|
|
|
@item --uncompressed
|
|
With @samp{--create}, don't compress the tar archive created. Create an
|
|
uncompressed tar archive instead. With @samp{--append}, don't compress the
|
|
new members appended to the tar archive. Compressed members can't be
|
|
appended to an uncompressed archive, nor vice versa.
|
|
|
|
@item --asolid
|
|
When creating or appending to a compressed archive, use appendable solid
|
|
compression. All the files being added to the archive are compressed into a
|
|
single lzip member, but the end-of-file blocks are compressed into a
|
|
separate lzip member. This creates a solidly compressed appendable archive.
|
|
Solid archives can be neither created nor decoded in parallel.
|
|
|
|
@anchor{--bsolid}
|
|
@item --bsolid
|
|
When creating or appending to a compressed archive, use block compression.
|
|
Tar members are compressed together in a lzip member until they approximate
|
|
a target uncompressed size. The size can't be exact because each solidly
|
|
compressed data block must contain an integer number of tar members. Block
|
|
compression is the default because it improves compression ratio for
|
|
archives with many files smaller than the block size. This option allows
|
|
tarlz to revert to default behavior if, for example, it is invoked through an
|
|
alias like @code{tar='tarlz --solid'}. @xref{--data-size}, to set the target
|
|
block size.
|
|
|
|
@item --dsolid
|
|
When creating or appending to a compressed archive, compress each file
|
|
specified in the command line separately in its own lzip member, and use
|
|
solid compression for each directory specified in the command line. The
|
|
end-of-file blocks are compressed into a separate lzip member. This creates
|
|
a compressed appendable archive with a separate lzip member for each file or
|
|
top-level directory specified.
|
|
|
|
@item --no-solid
|
|
When creating or appending to a compressed archive, compress each file
|
|
separately in its own lzip member. The end-of-file blocks are compressed
|
|
into a separate lzip member. This creates a compressed appendable archive
|
|
with a lzip member for each file.
|
|
|
|
@item --solid
|
|
When creating or appending to a compressed archive, use solid compression.
|
|
The files being added to the archive, along with the end-of-file blocks, are
|
|
compressed into a single lzip member. The resulting archive is not
|
|
appendable. No more files can be later appended to the archive. Solid
|
|
archives can be neither created nor decoded in parallel.
|
|
|
|
@item --anonymous
|
|
Equivalent to @samp{--owner=root --group=root}.
|
|
|
|
@item --owner=@var{owner}
|
|
When creating or appending, use @var{owner} for files added to the
|
|
archive. If @var{owner} is not a valid user name, it is decoded as a
|
|
decimal numeric user ID.
|
|
|
|
@item --group=@var{group}
|
|
When creating or appending, use @var{group} for files added to the
|
|
archive. If @var{group} is not a valid group name, it is decoded as a
|
|
decimal numeric group ID.
|
|
|
|
@item --keep-damaged
|
|
Don't delete partially extracted files. If a decompression error happens
|
|
while extracting a file, keep the partial data extracted. Use this
|
|
option to recover as much data as possible from each damaged member.
|
|
|
|
@item --missing-crc
|
|
Exit with error status 2 if the CRC of the extended records is missing.
|
|
When this option is used, tarlz detects any corruption in the extended
|
|
records (only limited by CRC collisions). But note that a corrupt
|
|
@samp{GNU.crc32} keyword, for example @samp{GNU.crc33}, is reported as a
|
|
missing CRC instead of as a corrupt record. This misleading
|
|
@samp{Missing CRC} message is the consequence of a flaw in the posix pax
|
|
format; i.e., the lack of a mandatory check sequence in the extended
|
|
records. @xref{crc32}.
|
|
|
|
@item --out-slots=@var{n}
|
|
Number of @w{1 MiB} output packets buffered per worker thread during
|
|
multi-threaded creation or appending to compressed archives. Increasing the
|
|
number of packets may increase compression speed if the files being archived
|
|
are larger than @w{64 MiB} compressed, but requires more memory. Valid
|
|
values range from 1 to 1024. The default value is 64.
|
|
|
|
@ignore
|
|
@item --permissive
|
|
Allow some violations of the archive format, like consecutive extended
|
|
headers preceding a ustar header, or several records with the same
|
|
keyword appearing in the same block of extended records.
|
|
@end ignore
|
|
|
|
@end table
|
|
|
|
Exit status: 0 for a normal exit, 1 for environmental problems (file not
|
|
found, invalid flags, I/O errors, etc), 2 to indicate a corrupt or
|
|
invalid input file, 3 for an internal consistency error (e.g., a bug) which
|
|
caused tarlz to panic.
|
|
|
|
|
|
@node File format
|
|
@chapter File format
|
|
@cindex file format
|
|
|
|
In the diagram below, a box like this:
|
|
@verbatim
|
|
+---+
|
|
| | <-- the vertical bars might be missing
|
|
+---+
|
|
@end verbatim
|
|
|
|
represents one byte; a box like this:
|
|
@verbatim
|
|
+==============+
|
|
| |
|
|
+==============+
|
|
@end verbatim
|
|
|
|
represents a variable number of bytes or a fixed but large number of
|
|
bytes (for example 512).
|
|
|
|
@sp 1
|
|
A tar.lz file consists of a series of lzip members (compressed data sets).
|
|
The members simply appear one after another in the file, with no
|
|
additional information before, between, or after them.
|
|
|
|
Each lzip member contains one or more tar members in a simplified posix
|
|
pax interchange format. The only pax typeflag value supported by tarlz
|
|
(in addition to the typeflag values defined by the ustar format) is
|
|
@samp{x}. The pax format is an extension on top of the ustar format that
|
|
removes the size limitations of the ustar format.
|
|
|
|
Each tar member contains one file archived, and is represented by the
|
|
following sequence:
|
|
|
|
@itemize @bullet
|
|
@item
|
|
An optional extended header block with extended header records. This
|
|
header block is of the form described in pax header block, with a
|
|
typeflag value of @samp{x}. The extended header records are included as
|
|
the data for this header block.
|
|
|
|
@item
|
|
A header block in ustar format that describes the file. Any fields
|
|
defined in the preceding optional extended header records override the
|
|
associated fields in this header block for this file.
|
|
|
|
@item
|
|
Zero or more blocks that contain the contents of the file.
|
|
@end itemize
|
|
|
|
Each tar member must be contiguously stored in a lzip member for the
|
|
parallel decoding operations like @samp{--list} to work. If any tar member
|
|
is split over two or more lzip members, the archive must be decoded
|
|
sequentially. @xref{Multi-threaded tar}.
|
|
|
|
At the end of the archive file there are two 512-byte blocks filled with
|
|
binary zeros, interpreted as an end-of-archive indicator. These EOF
|
|
blocks are either compressed in a separate lzip member or compressed
|
|
along with the tar members contained in the last lzip member.
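
As an illustration, the end-of-archive indicator of an uncompressed tar
stream can be located by walking the headers. The Python sketch below is not
taken from tarlz; the function name @code{find_eof_offset} is hypothetical.
It assumes that every header is followed by the number of data blocks given
by its ustar size field, and it ignores both the typeflag and any
@samp{size} extended record.

@verbatim
```python
BLOCK = 512


def find_eof_offset(archive: bytes) -> int:
    """Return the offset of the first of the two zero blocks, or -1."""
    zero = bytes(BLOCK)
    pos = 0
    while pos + 2 * BLOCK <= len(archive):
        header = archive[pos:pos + BLOCK]
        if header == zero and archive[pos + BLOCK:pos + 2 * BLOCK] == zero:
            return pos
        # The size field is a 12-byte octal number at offset 124.
        field = header[124:136].split(b"\0", 1)[0].strip(b" ")
        size = int(field, 8) if field else 0
        # Skip the header plus the data rounded up to whole blocks.
        pos += BLOCK + -(-size // BLOCK) * BLOCK
    return -1
```
@end verbatim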
|
|
|
|
The diagram below shows the correspondence between each tar member (formed
|
|
by one or two headers plus optional data) in the tar archive and each
|
|
@uref{http://www.nongnu.org/lzip/manual/lzip_manual.html#File-format,,lzip member}
|
|
in the resulting multimember tar.lz archive, when per file compression is
|
|
used:
|
|
@ifnothtml
|
|
@xref{File format,,,lzip}.
|
|
@end ifnothtml
|
|
|
|
@verbatim
|
|
tar
|
|
+========+======+=================+===============+========+======+========+
|
|
| header | data | extended header | extended data | header | data | EOF |
|
|
+========+======+=================+===============+========+======+========+
|
|
|
|
tar.lz
|
|
+===============+=================================================+========+
|
|
| member | member | member |
|
|
+===============+=================================================+========+
|
|
@end verbatim
|
|
|
|
@ignore
|
|
When @samp{--permissive} is used, the following violations of the
|
|
archive format are allowed:@*
|
|
If several extended headers precede an ustar header, only the last
|
|
extended header takes effect. The other extended headers are ignored.
|
|
Similarly, if several records with the same keyword appear in the same
|
|
block of extended records, only the last record for the repeated keyword
|
|
takes effect. The other records for the repeated keyword are ignored.@*
|
|
A global header inserted between an extended header and an ustar header.@*
|
|
An extended header just before the EOF blocks.
|
|
@end ignore
|
|
|
|
@sp 1
|
|
@section Pax header block
|
|
|
|
The pax header block is identical to the ustar header block described below
|
|
except that the typeflag has the value @samp{x} (extended). The size field
|
|
is the size of the extended header data in bytes. Most other fields in the
|
|
pax header block are zeroed on archive creation to prevent trouble if the
|
|
archive is read by an ustar tool, and are ignored by tarlz on archive
|
|
extraction. @xref{flawed-compat}.
|
|
|
|
The pax extended header data consists of one or more records, each of
|
|
them constructed as follows:@*
|
|
@code{"%d %s=%s\n", <length>, <keyword>, <value>}
|
|
|
|
The <length>, <blank>, <keyword>, <equals-sign>, and <newline> in the
|
|
record must be limited to the portable character set. The <length> field
|
|
contains the decimal length of the record in bytes, including the
|
|
trailing <newline>. The <value> field is stored as-is, without
|
|
conversion to UTF-8 nor any other transformation.
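
Because the <length> field counts its own digits, building a record needs a
small fixed-point computation. The following Python sketch shows one way to
build and parse such records; the names @code{make_record} and
@code{parse_records} are hypothetical and the error handling is deliberately
minimal.

@verbatim
```python
def make_record(keyword: str, value: bytes) -> bytes:
    """Build one '<length> <keyword>=<value>\\n' record."""
    body = b" " + keyword.encode("ascii") + b"=" + value + b"\n"
    length = len(body)
    # The length includes its own decimal digits; iterate until stable.
    while len(body) + len(str(length)) != length:
        length = len(body) + len(str(length))
    return str(length).encode("ascii") + body


def parse_records(data: bytes):
    """Split extended header data into (keyword, value) pairs."""
    records = []
    pos = 0
    while pos < len(data):
        space = data.index(b" ", pos)
        length = int(data[pos:space])
        record = data[pos:pos + length]
        if len(record) != length or not record.endswith(b"\n"):
            raise ValueError("malformed extended record")
        keyword, _, value = record[space - pos + 1:-1].partition(b"=")
        records.append((keyword.decode("ascii"), value))
        pos += length
    return records
```
@end verbatim

The value is kept as bytes, in line with the rule that it is stored as-is.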
|
|
|
|
These are the <keyword> fields currently supported by tarlz:
|
|
|
|
@table @code
|
|
@item linkpath
|
|
The pathname of a link being created to another file, of any type,
|
|
previously archived. This record overrides the linkname field in the
|
|
following ustar header block. The following ustar header block
|
|
determines the type of link created. If typeflag of the following header
|
|
block is 1, it will be a hard link. If typeflag is 2, it will be a
|
|
symbolic link and the linkpath value will be used as the contents of the
|
|
symbolic link.
|
|
|
|
@item path
|
|
The pathname of the following file. This record overrides the name and
|
|
prefix fields in the following ustar header block.
|
|
|
|
@item size
|
|
The size of the file in bytes, expressed as a decimal number using
|
|
digits from the ISO/IEC 646:1991 (ASCII) standard. This record overrides
|
|
the size field in the following ustar header block. The size record is
|
|
used only for files with a size value greater than 8_589_934_591
|
|
@w{(octal 77777777777)}, that is, for files of 2^33 bytes (@w{8 GiB}) or larger.
|
|
|
|
@anchor{key_crc32}
|
|
@item GNU.crc32
|
|
CRC32-C (Castagnoli) of the extended header data excluding the 8 bytes
|
|
representing the CRC <value> itself. The <value> is represented as 8
|
|
hexadecimal digits in big endian order, as in
@w{@samp{22 GNU.crc32=00000000\n}}. The keyword of the CRC record is
protected by the CRC to guarantee that corruption is always detected
|
|
(except in case of CRC collision). A CRC was chosen because a checksum
|
|
is too weak for a potentially large list of variable sized records. A
|
|
checksum can't detect simple errors like the swapping of two bytes.
|
|
@end table
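
The following Python sketch illustrates the @samp{GNU.crc32} record. The
@code{crc32c} function is the standard reflected CRC32-C (polynomial
0x1EDC6F41, reflected form 0x82F63B78). The helpers @code{extended_crc} and
@code{seal} are hypothetical. They assume that the 8 bytes of the CRC value
are simply skipped when computing the CRC, that the digits are written in
upper case, and that the string @samp{GNU.crc32=} does not also occur inside
another record. None of these details is confirmed by this manual.

@verbatim
```python
def _make_table():
    table = []
    for n in range(256):
        c = n
        for _ in range(8):
            c = (c >> 1) ^ 0x82F63B78 if c & 1 else c >> 1
        table.append(c)
    return table


_TABLE = _make_table()
CRC_KEY = b"GNU.crc32="


def crc32c(data: bytes) -> int:
    """Standard CRC32-C (Castagnoli) of data."""
    crc = 0xFFFFFFFF
    for byte in data:
        crc = _TABLE[(crc ^ byte) & 0xFF] ^ (crc >> 8)
    return crc ^ 0xFFFFFFFF


def extended_crc(ext_data: bytes) -> int:
    """CRC32-C of the extended data without the 8 CRC digits (assumed)."""
    start = ext_data.index(CRC_KEY) + len(CRC_KEY)
    return crc32c(ext_data[:start] + ext_data[start + 8:])


def seal(ext_data: bytes) -> bytes:
    """Return ext_data with the CRC digits filled in."""
    start = ext_data.index(CRC_KEY) + len(CRC_KEY)
    digits = b"%08X" % extended_crc(ext_data)
    return ext_data[:start] + digits + ext_data[start + 8:]
```
@end verbatim

Since the keyword itself is covered by the CRC, changing any byte outside
the 8 digits changes the computed value.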
|
|
|
|
@sp 1
|
|
@section Ustar header block
|
|
|
|
The ustar header block has a length of 512 bytes and is structured as
|
|
shown in the following table. All lengths and offsets are in decimal.
|
|
|
|
@multitable {Field Name} {Offset} {Length (in bytes)}
|
|
@item Field Name @tab Offset @tab Length (in bytes)
|
|
@item name @tab 0 @tab 100
|
|
@item mode @tab 100 @tab 8
|
|
@item uid @tab 108 @tab 8
|
|
@item gid @tab 116 @tab 8
|
|
@item size @tab 124 @tab 12
|
|
@item mtime @tab 136 @tab 12
|
|
@item chksum @tab 148 @tab 8
|
|
@item typeflag @tab 156 @tab 1
|
|
@item linkname @tab 157 @tab 100
|
|
@item magic @tab 257 @tab 6
|
|
@item version @tab 263 @tab 2
|
|
@item uname @tab 265 @tab 32
|
|
@item gname @tab 297 @tab 32
|
|
@item devmajor @tab 329 @tab 8
|
|
@item devminor @tab 337 @tab 8
|
|
@item prefix @tab 345 @tab 155
|
|
@end multitable
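
The layout in the table above maps directly onto a fixed-size structure. As
a hedged illustration (the names @code{USTAR}, @code{FIELDS} and
@code{split_header} are hypothetical), the fields can be split in Python
with the @code{struct} module. The sixteen fields add up to 500 bytes; the
remaining 12 bytes of the block are padding.

@verbatim
```python
import struct

# One format item per row of the table, in the same order.
USTAR = struct.Struct("=100s8s8s8s12s12s8sc100s6s2s32s32s8s8s155s")
FIELDS = ("name", "mode", "uid", "gid", "size", "mtime", "chksum",
          "typeflag", "linkname", "magic", "version", "uname", "gname",
          "devmajor", "devminor", "prefix")


def split_header(block: bytes) -> dict:
    """Return the raw bytes of each field of a 512-byte header block."""
    if len(block) != 512:
        raise ValueError("a ustar header block is exactly 512 bytes")
    return dict(zip(FIELDS, USTAR.unpack_from(block)))
```
@end verbatim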
|
|
|
|
All characters in the header block are coded using the ISO/IEC 646:1991
|
|
(ASCII) standard, except in fields storing names for files, users, and
|
|
groups. For maximum portability between implementations, names should
|
|
only contain characters from the portable filename character set. But if
|
|
an implementation supports the use of characters outside of @samp{/} and
|
|
the portable filename character set in names for files, users, and
|
|
groups, tarlz will use the byte values in these names unmodified.
|
|
|
|
The fields name, linkname, and prefix are null-terminated character
|
|
strings, except when every character in the array, including the last
one, is non-null.
|
|
|
|
The name and the prefix fields produce the pathname of the file. A new
|
|
pathname is formed, if prefix is not an empty string (its first
|
|
character is not null), by concatenating prefix (up to the first null
|
|
character), a <slash> character, and name; otherwise, name is used
|
|
alone. In either case, name is terminated at the first null character.
|
|
If prefix begins with a null character, it is ignored. In this manner,
|
|
pathnames of at most 256 characters can be supported. If a pathname does
|
|
not fit in the space provided, an extended record is used to store the
|
|
pathname.
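
The rule for combining the two fields can be sketched in Python as follows.
The function name @code{ustar_pathname} is hypothetical, and the sketch does
not consider a @samp{path} extended record.

@verbatim
```python
def _until_null(field: bytes) -> bytes:
    """Return the bytes before the first null, or the whole field."""
    return field.split(b"\0", 1)[0]


def ustar_pathname(name_field: bytes, prefix_field: bytes) -> bytes:
    """Combine the 100-byte name and 155-byte prefix fields."""
    name = _until_null(name_field)
    prefix = _until_null(prefix_field)
    # An empty prefix (first byte null) is ignored.
    return prefix + b"/" + name if prefix else name
```
@end verbatim

With both fields completely filled, the result has 155 + 1 + 100 = 256
bytes, which is the limit mentioned above.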
|
|
|
|
The linkname field does not use the prefix to produce a pathname. If the
|
|
linkname does not fit in the 100 characters provided, an extended record
|
|
is used to store the linkname.
|
|
|
|
The mode field provides 12 access permission bits. The following table
|
|
shows the symbolic name of each bit and its octal value:
|
|
|
|
@multitable {Bit Name} {Value} {Bit Name} {Value} {Bit Name} {Value}
|
|
@headitem Bit Name @tab Value @tab Bit Name @tab Value @tab Bit Name @tab Value
|
|
@item S_ISUID @tab 04000 @tab S_ISGID @tab 02000 @tab S_ISVTX @tab 01000
|
|
@item S_IRUSR @tab 00400 @tab S_IWUSR @tab 00200 @tab S_IXUSR @tab 00100
|
|
@item S_IRGRP @tab 00040 @tab S_IWGRP @tab 00020 @tab S_IXGRP @tab 00010
|
|
@item S_IROTH @tab 00004 @tab S_IWOTH @tab 00002 @tab S_IXOTH @tab 00001
|
|
@end multitable
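
The bit names in the table are the same ones exported by the Python
@code{stat} module, so a mode value can be decoded as in the sketch below.
The function names @code{permission_string} and @code{special_bits} are
hypothetical.

@verbatim
```python
import stat

PERMISSION_BITS = (
    (stat.S_IRUSR, "r"), (stat.S_IWUSR, "w"), (stat.S_IXUSR, "x"),
    (stat.S_IRGRP, "r"), (stat.S_IWGRP, "w"), (stat.S_IXGRP, "x"),
    (stat.S_IROTH, "r"), (stat.S_IWOTH, "w"), (stat.S_IXOTH, "x"),
)
SPECIAL_BITS = (
    (stat.S_ISUID, "S_ISUID"),
    (stat.S_ISGID, "S_ISGID"),
    (stat.S_ISVTX, "S_ISVTX"),
)


def permission_string(mode: int) -> str:
    """Return the nine rwx permission characters for mode."""
    return "".join(ch if mode & bit else "-" for bit, ch in PERMISSION_BITS)


def special_bits(mode: int) -> list:
    """Return the names of the set-ID and sticky bits set in mode."""
    return [name for bit, name in SPECIAL_BITS if mode & bit]
```
@end verbatim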
|
|
|
|
The uid and gid fields are the user and group ID of the owner and group
|
|
of the file, respectively.
|
|
|
|
The size field contains the octal representation of the size of the file
|
|
in bytes. If the typeflag field specifies a file of type '0' (regular
|
|
file) or '7' (high performance regular file), the number of logical
|
|
records following the header is @w{(size / 512)} rounded up to the next
|
|
integer. For all other values of typeflag, tarlz either sets the size
|
|
field to 0 or ignores it, and does not store or expect any logical
|
|
records following the header. If the file size is larger than
|
|
8_589_934_591 bytes @w{(octal 77777777777)}, an extended record is used
|
|
to store the file size.
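
The arithmetic in this paragraph is small enough to check directly. In the
Python sketch below the names @code{data_blocks} and
@code{needs_size_record} are hypothetical.

@verbatim
```python
BLOCK = 512
USTAR_MAX_SIZE = 0o77777777777  # 8_589_934_591, that is 2**33 - 1


def data_blocks(size: int) -> int:
    """Number of 512-byte logical records needed for size bytes."""
    return (size + BLOCK - 1) // BLOCK


def needs_size_record(size: int) -> bool:
    """True if size does not fit in the 11 octal digits of the field."""
    return size > USTAR_MAX_SIZE
```
@end verbatim

For example, a file of 513 bytes occupies two logical records, and a file of
exactly 2^33 bytes is the smallest one that needs a @samp{size} extended
record.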
|
|
|
|
The mtime field contains the octal representation of the modification
|
|
time of the file at the time it was archived, obtained from the stat()
|
|
function.
|
|
|
|
The chksum field contains the octal representation of the value of the
|
|
simple sum of all bytes in the header logical record. Each byte in the
|
|
header is treated as an unsigned value. When calculating the checksum,
|
|
the chksum field is treated as if it were all <space> characters.
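
A minimal Python sketch of this computation follows. The function names
@code{ustar_checksum} and @code{verify_checksum} are hypothetical, and the
way the stored field is parsed (digits up to the first null, surrounding
spaces ignored) is an assumption.

@verbatim
```python
def ustar_checksum(block: bytes) -> int:
    """Sum of all header bytes with the chksum field read as spaces."""
    # The chksum field occupies bytes 148..155 of the 512-byte header.
    return sum(block[:148]) + 8 * ord(" ") + sum(block[156:512])


def verify_checksum(block: bytes) -> bool:
    """Compare the stored octal checksum with the computed one."""
    field = block[148:156].split(b"\0", 1)[0].strip(b" ")
    try:
        return int(field, 8) == ustar_checksum(block)
    except ValueError:
        return False
```
@end verbatim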
|
|
|
|
The typeflag field contains a single character specifying the type of
|
|
file archived:
|
|
|
|
@table @code
|
|
@item '0'
|
|
Regular file.
|
|
|
|
@item '1'
|
|
Hard link to another file, of any type, previously archived.
|
|
|
|
@item '2'
|
|
Symbolic link.
|
|
|
|
@item '3', '4'
|
|
Character special file and block special file respectively. In this case
|
|
the devmajor and devminor fields contain information defining the
|
|
device in unspecified format.
|
|
|
|
@item '5'
|
|
Directory.
|
|
|
|
@item '6'
|
|
FIFO special file.
|
|
|
|
@item '7'
|
|
Reserved to represent a file to which an implementation has associated
|
|
some high-performance attribute. Tarlz treats this type of file as a
|
|
regular file (type 0).
|
|
|
|
@end table
|
|
|
|
The magic field contains the ASCII null-terminated string "ustar". The
|
|
version field contains the characters "00" (0x30,0x30). The fields uname
and gname are null-terminated character strings, except when every character
in the array, including the last one, is non-null. Each
|
|
numeric field contains a leading space- or zero-filled, optionally
|
|
null-terminated octal number using digits from the ISO/IEC 646:1991 (ASCII)
|
|
standard. Tarlz is able to decode numeric fields 1 byte longer than standard
|
|
ustar by not requiring a terminating null character.
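
Decoding a numeric field along these lines can be sketched in Python. The
function name @code{parse_octal} is hypothetical, and returning 0 for a
field without digits is an assumption.

@verbatim
```python
def parse_octal(field: bytes) -> int:
    """Decode a space- or zero-filled, optionally null-terminated field."""
    digits = field.split(b"\0", 1)[0].lstrip(b" ")
    if not digits:
        return 0  # assumption: a field without digits counts as zero
    if any(d not in b"01234567" for d in digits):
        raise ValueError("invalid octal digit in numeric field")
    return int(digits, 8)
```
@end verbatim

A 12-byte size field holding 12 digits and no terminating null decodes to
values up to 2^36 - 1, instead of the 2^33 - 1 reachable with 11 digits.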
|
|
|
|
|
|
@node Amendments to pax format
|
|
@chapter The reasons for the differences with pax
|
|
@cindex Amendments to pax format
|
|
|
|
Tarlz is meant to reliably detect invalid or corrupt metadata during
|
|
decoding, and to create safe archives where corrupt metadata can be reliably
|
|
detected. In order to achieve these goals, tarlz makes some changes to the
|
|
variant of the pax format that it uses. This chapter describes these changes
|
|
and the concrete reasons to implement them.
|
|
|
|
@sp 1
|
|
@anchor{crc32}
|
|
@section Add a CRC of the extended records
|
|
|
|
The posix pax format has a serious flaw. The metadata stored in pax extended
|
|
records are not protected by any kind of check sequence. Corruption in a
|
|
long filename may cause the extraction of the file in the wrong place
|
|
without warning. Corruption in a large file size may cause the truncation of
|
|
the file or the appending of garbage to the file, both followed by a
|
|
spurious warning about a corrupt header far from the place of the undetected
|
|
corruption.
|
|
|
|
Metadata like filename and file size must always be protected in an archive
format because of the adverse effects of undetected corruption in them,
potentially much worse than undetected corruption in the data. Even more so
|
|
in the case of pax because the amount of metadata it stores is potentially
|
|
large, making undetected corruption more probable.
|
|
|
|
Because of the above, tarlz protects the extended records with a CRC in
|
|
a way compatible with standard tar tools. @xref{key_crc32}.
|
|
|
|
@sp 1
|
|
@anchor{flawed-compat}
|
|
@section Remove flawed backward compatibility
|
|
|
|
In order to allow the extraction of pax archives by a tar utility conforming
|
|
to the POSIX-2:1993 standard, POSIX.1-2008 recommends selecting extended
|
|
header field values that allow such tar to create a regular file containing
|
|
the extended header records as data. This approach is broken because if the
|
|
extended header is needed because of a long filename, the name and prefix
|
|
fields will be unable to contain the full pathname of the file. Therefore
|
|
the files corresponding to both the extended header and the overridden ustar
|
|
header will be extracted using truncated filenames, perhaps overwriting
|
|
existing files or directories. It may be a security risk to extract a file
|
|
with a truncated filename.
|
|
|
|
To avoid this problem, tarlz writes extended headers with all fields zeroed
|
|
except size, chksum, typeflag, magic and version. This prevents old tar
|
|
programs from extracting the extended records as a file in the wrong place.
|
|
Tarlz also sets to zero those fields of the ustar header overridden by
|
|
extended records.
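
The layout described above can be sketched in Python as follows. This is an
illustration, not tarlz's code: the field offsets are those of the standard
512-byte ustar header, the function name @samp{extended_header_block} is
hypothetical, and the way the chksum field is formatted (six octal digits, a
NUL and a space) is a common convention assumed here.

```python
def extended_header_block(records_size):
    # Start from an all-zero 512-byte header: name, mode, uid, gid, mtime,
    # linkname, uname, gname, devmajor, devminor and prefix stay zeroed.
    block = bytearray(512)
    block[124:136] = b"%011o\x00" % records_size   # size of the records
    block[156:157] = b"x"                          # typeflag: extended header
    block[257:263] = b"ustar\x00"                  # magic
    block[263:265] = b"00"                         # version
    # The checksum is the byte sum of the header with chksum set to spaces.
    block[148:156] = b" " * 8
    checksum = sum(block)
    block[148:156] = b"%06o\x00 " % checksum
    return bytes(block)

header = extended_header_block(21)
print(header[0:100] == bytes(100))   # True: the name field is empty
```

Because the name and prefix fields are empty, an old tar program has no
pathname under which to extract the extended records as a file.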
|
|
|
|
If an extended header is required for any reason (for example a file size
|
|
larger than @w{8 GiB} or a link name longer than 100 bytes), tarlz moves the
|
|
filename also to the extended header to prevent a ustar tool from trying to
|
|
extract the file or link. During parallel decoding, this also eases
|
|
the detection of a tar member split between two lzip members at the boundary
|
|
between the extended header and the ustar header.
|
|
|
|
@sp 1
|
|
@section As simple as possible (but not simpler)
|
|
|
|
The tarlz format is mainly ustar. Extended pax headers are used only when
|
|
needed because the length of a filename or link name, or the size of a file
|
|
exceeds the limits of the ustar format. Adding extended headers to each
|
|
member just to record subsecond timestamps seems wasteful for a backup
|
|
format.
|
|
|
|
Global pax headers are tolerated, but not supported; they are parsed and
|
|
ignored. Some operations may not behave as expected if the archive contains
|
|
global headers.
|
|
|
|
@sp 1
|
|
@section Avoid misconversions to/from UTF-8
|
|
|
|
There is no portable way to tell which charset a text string is encoded in.
|
|
Therefore, tarlz stores all fields representing text strings unmodified,
|
|
without conversion to UTF-8 or any other transformation. This prevents
|
|
accidental double UTF-8 conversions. If the need arises this behavior will
|
|
be adjusted with a command line option in the future.
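
A short Python sketch of the double conversion that this policy avoids. It
assumes a hypothetical tool that takes a name already encoded in UTF-8,
wrongly treats it as Latin-1 text, and converts it to UTF-8 a second time:

```python
# "caf\u00e9" is the word "cafe" with an acute accent on the final letter.
name = "caf\u00e9"

once = name.encode("utf-8")
# A tool that wrongly assumes the bytes are Latin-1 and converts again:
twice = once.decode("latin-1").encode("utf-8")

print(once)    # b'caf\xc3\xa9'
print(twice)   # b'caf\xc3\x83\xc2\xa9'
```

The second result names a different file. Storing the bytes unmodified, as
tarlz does, leaves no place for such a step.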
|
|
|
|
|
|
@node Multi-threaded tar
|
|
@chapter Limitations of parallel tar decoding
|
|
|
|
Safely decoding an arbitrary tar archive in parallel is impossible. For
|
|
example, if a tar archive containing another tar archive is decoded starting
|
|
from some position other than the beginning, there is no way to know if the
|
|
first header found there belongs to the outer tar archive or to the inner
|
|
tar archive. Tar is an inherently serial format; it was designed for tapes.
|
|
|
|
In the case of compressed tar archives, the start of each compressed block
|
|
determines one point through which the tar archive can be decoded in
|
|
parallel. Therefore, in tar.lz archives the decoding operations can't be
|
|
parallelized if the tar members are not aligned with the lzip members. Tar
|
|
archives compressed with plzip can't be decoded in parallel because tar and
|
|
plzip do not have a way to align both sets of members. Certainly one can
|
|
decompress one such archive with a multi-threaded tool like plzip, but the
|
|
increase in speed is not as large as it could be because plzip must
|
|
serialize the decompressed data and pass them to tar, which decodes them
|
|
sequentially, one tar member at a time.
|
|
|
|
On the other hand, if the tar.lz archive is created with a tool like tarlz,
|
|
which can guarantee the alignment between tar members and lzip members
|
|
because it controls both archiving and compression, then the lzip format
|
|
becomes an indexed layer on top of the tar archive which makes it possible
to decode it safely in parallel.
|
|
|
|
Tarlz is able to automatically decode aligned and unaligned multimember
|
|
tar.lz archives, keeping backwards compatibility. If tarlz finds a member
|
|
misalignment during multi-threaded decoding, it switches to single-threaded
|
|
mode and continues decoding the archive. Currently only the @samp{--list}
|
|
option is able to do multi-threaded decoding.
|
|
|
|
If the files in the archive are large, multi-threaded @samp{--list} on a
|
|
regular (seekable) tar.lz archive can be hundreds of times faster than
|
|
sequential @samp{--list} because, in addition to using several processors,
|
|
it only needs to decompress part of each lzip member. See the following
|
|
example listing the Silesia corpus on a dual core machine:
|
|
|
|
@example
tarlz -9 --no-solid -cf silesia.tar.lz silesia
time lzip -cd silesia.tar.lz | tar -tf -       (5.032s)
time plzip -cd silesia.tar.lz | tar -tf -      (3.256s)
time tarlz -tf silesia.tar.lz                  (0.020s)
@end example
|
|
|
|
|
|
@node Minimum archive sizes
|
|
@chapter Minimum archive sizes required for multi-threaded block compression
|
|
@cindex minimum archive sizes
|
|
|
|
When creating or appending to a compressed archive using multi-threaded
|
|
block compression, tarlz puts tar members together in blocks and compresses
|
|
as many blocks simultaneously as worker threads are chosen, creating a
|
|
multimember compressed archive.
|
|
|
|
For this to work as expected (and roughly multiply the compression speed by
|
|
the number of available processors), the uncompressed archive must be at
|
|
least as large as the number of worker threads times the block size
|
|
(@pxref{--data-size}). Otherwise some processors will not get any data to
|
|
compress, and compression will be proportionally slower. The maximum speed
|
|
increase achievable on a given archive is limited by the ratio
|
|
@w{(uncompressed_size / data_size)}. For example, a tarball the size of gcc
|
|
or linux will scale up to 10 or 12 processors at level -9.
|
|
|
|
The following table shows the minimum uncompressed archive size needed for
|
|
full use of N processors at a given compression level, using the default
|
|
data size for each level:
|
|
|
|
@multitable {Processors} {512 MiB} {512 MiB} {512 MiB} {512 MiB} {512 MiB} {512 MiB}
@headitem Processors @tab 2 @tab 4 @tab 8 @tab 16 @tab 64 @tab 256
@item Level
@item -0 @tab 2 MiB @tab 4 MiB @tab 8 MiB @tab 16 MiB @tab 64 MiB @tab 256 MiB
@item -1 @tab 4 MiB @tab 8 MiB @tab 16 MiB @tab 32 MiB @tab 128 MiB @tab 512 MiB
@item -2 @tab 6 MiB @tab 12 MiB @tab 24 MiB @tab 48 MiB @tab 192 MiB @tab 768 MiB
@item -3 @tab 8 MiB @tab 16 MiB @tab 32 MiB @tab 64 MiB @tab 256 MiB @tab 1 GiB
@item -4 @tab 12 MiB @tab 24 MiB @tab 48 MiB @tab 96 MiB @tab 384 MiB @tab 1.5 GiB
@item -5 @tab 16 MiB @tab 32 MiB @tab 64 MiB @tab 128 MiB @tab 512 MiB @tab 2 GiB
@item -6 @tab 32 MiB @tab 64 MiB @tab 128 MiB @tab 256 MiB @tab 1 GiB @tab 4 GiB
@item -7 @tab 64 MiB @tab 128 MiB @tab 256 MiB @tab 512 MiB @tab 2 GiB @tab 8 GiB
@item -8 @tab 96 MiB @tab 192 MiB @tab 384 MiB @tab 768 MiB @tab 3 GiB @tab 12 GiB
@item -9 @tab 128 MiB @tab 256 MiB @tab 512 MiB @tab 1 GiB @tab 4 GiB @tab 16 GiB
@end multitable
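
The arithmetic behind the table can be sketched in Python. The per-level
block sizes below are not taken from the tarlz sources; they are inferred
from the table above by dividing the two-processor column by two. The
function names are hypothetical.

```python
MIB = 1024 * 1024

# Block size in MiB for levels -0 to -9, inferred from the table above.
DATA_SIZE_MIB = [1, 2, 3, 4, 6, 8, 16, 32, 48, 64]

def min_archive_size(level, processors):
    # Smallest uncompressed size that gives every processor one full block.
    return processors * DATA_SIZE_MIB[level] * MIB

def max_useful_processors(uncompressed_size, level):
    # The speed increase is limited by (uncompressed_size / data_size).
    return max(1, uncompressed_size // (DATA_SIZE_MIB[level] * MIB))

print(min_archive_size(9, 16) // MIB)          # 1024 MiB, i.e. 1 GiB
print(max_useful_processors(700 * MIB, 9))     # 10
```

For instance, an uncompressed tarball of about @w{700 MiB} can keep at most
10 processors busy at level -9 under these assumptions.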
|
|
|
|
|
|
@node Examples
|
|
@chapter A small tutorial with examples
|
|
@cindex examples
|
|
|
|
@noindent
|
|
Example 1: Create a multimember compressed archive @samp{archive.tar.lz}
|
|
containing files @samp{a}, @samp{b} and @samp{c}.
|
|
|
|
@example
tarlz -cf archive.tar.lz a b c
@end example
|
|
|
|
@sp 1
|
|
@noindent
|
|
Example 2: Append files @samp{d} and @samp{e} to the multimember
|
|
compressed archive @samp{archive.tar.lz}.
|
|
|
|
@example
tarlz -rf archive.tar.lz d e
@end example
|
|
|
|
@sp 1
|
|
@noindent
|
|
Example 3: Create a solidly compressed appendable archive
|
|
@samp{archive.tar.lz} containing files @samp{a}, @samp{b} and @samp{c}.
|
|
Then append files @samp{d} and @samp{e} to the archive.
|
|
|
|
@example
tarlz --asolid -cf archive.tar.lz a b c
tarlz --asolid -rf archive.tar.lz d e
@end example
|
|
|
|
@sp 1
|
|
@noindent
|
|
Example 4: Create a compressed appendable archive containing directories
|
|
@samp{dir1}, @samp{dir2} and @samp{dir3} with a separate lzip member per
|
|
directory. Then append files @samp{a}, @samp{b}, @samp{c}, @samp{d} and
|
|
@samp{e} to the archive, all of them contained in a single lzip member.
|
|
The resulting archive @samp{archive.tar.lz} contains 5 lzip members
|
|
(including the EOF member).
|
|
|
|
@example
tarlz --dsolid -cf archive.tar.lz dir1 dir2 dir3
tarlz --asolid -rf archive.tar.lz a b c d e
@end example
|
|
|
|
@sp 1
|
|
@noindent
|
|
Example 5: Create a solidly compressed archive @samp{archive.tar.lz}
|
|
containing files @samp{a}, @samp{b} and @samp{c}. Note that no more
|
|
files can be later appended to the archive.
|
|
|
|
@example
tarlz --solid -cf archive.tar.lz a b c
@end example
|
|
|
|
@sp 1
|
|
@noindent
|
|
Example 6: Extract all files from archive @samp{archive.tar.lz}.
|
|
|
|
@example
tarlz -xf archive.tar.lz
@end example
|
|
|
|
@sp 1
|
|
@noindent
|
|
Example 7: Extract files @samp{a} and @samp{c}, and the whole tree under
|
|
directory @samp{dir1} from archive @samp{archive.tar.lz}.
|
|
|
|
@example
tarlz -xf archive.tar.lz a c dir1
@end example
|
|
|
|
@sp 1
|
|
@noindent
|
|
Example 8: Copy the contents of directory @samp{sourcedir} to the
|
|
directory @samp{destdir}.
|
|
|
|
@example
tarlz -C sourcedir -c . | tarlz -C destdir -x
@end example
|
|
|
|
|
|
@node Problems
|
|
@chapter Reporting bugs
|
|
@cindex bugs
|
|
@cindex getting help
|
|
|
|
There are probably bugs in tarlz. There are certainly errors and
|
|
omissions in this manual. If you report them, they will get fixed. If
|
|
you don't, no one will ever know about them and they will remain unfixed
|
|
for all eternity, if not longer.
|
|
|
|
If you find a bug in tarlz, please send electronic mail to
|
|
@email{lzip-bug@@nongnu.org}. Include the version number, which you can
|
|
find by running @w{@samp{tarlz --version}}.
|
|
|
|
|
|
@node Concept index
|
|
@unnumbered Concept index
|
|
|
|
@printindex cp
|
|
|
|
@bye
|