
Adding upstream version 0.11.

Signed-off-by: Daniel Baumann <daniel@debian.org>
Daniel Baumann 2025-02-17 21:12:08 +01:00
parent 7a2248990c
commit 6bd0c00498
Signed by: daniel
GPG key ID: FBB4F0E80A80222F
18 changed files with 1504 additions and 654 deletions


@ -6,8 +6,8 @@
@finalout
@c %**end of header
@set UPDATED 31 January 2019
@set VERSION 0.10
@set UPDATED 13 February 2019
@set VERSION 0.11
@dircategory Data Compression
@direntry
@ -40,6 +40,7 @@ This manual is for Tarlz (version @value{VERSION}, @value{UPDATED}).
* File format:: Detailed format of the compressed archive
* Amendments to pax format:: The reasons for the differences with pax
* Multi-threaded tar:: Limitations of parallel tar decoding
* Minimum archive sizes:: Sizes required for full multi-threaded speed
* Examples:: A small tutorial with examples
* Problems:: Reporting bugs
* Concept index:: Index of concepts
@ -56,25 +57,24 @@ to copy, distribute and modify it.
@chapter Introduction
@cindex introduction
@uref{http://www.nongnu.org/lzip/tarlz.html,,Tarlz} is a combined
implementation of the tar archiver and the
@uref{http://www.nongnu.org/lzip/lzip.html,,lzip} compressor. By default
tarlz creates, lists and extracts archives in a simplified posix pax format
compressed with lzip on a per file basis. Each tar member is compressed in
its own lzip member, as well as the end-of-file blocks. This method adds an
indexed lzip layer on top of the tar archive, making it possible to decode
the archive safely in parallel. The resulting multimember tar.lz archive is
fully backward compatible with standard tar tools like GNU tar, which treat
it like any other tar.lz archive. Tarlz can append files to the end of such
compressed archives.
@uref{http://www.nongnu.org/lzip/tarlz.html,,Tarlz} is a massively parallel
(multi-threaded) combined implementation of the tar archiver and the
@uref{http://www.nongnu.org/lzip/lzip.html,,lzip} compressor. Tarlz creates,
lists and extracts archives in a simplified posix pax format compressed with
lzip, keeping the alignment between tar members and lzip members. This
method adds an indexed lzip layer on top of the tar archive, making it
possible to decode the archive safely in parallel. The resulting multimember
tar.lz archive is fully backward compatible with standard tar tools like GNU
tar, which treat it like any other tar.lz archive. Tarlz can append files to
the end of such compressed archives.
Tarlz can create tar archives with four levels of compression granularity;
per file, per directory, appendable solid, and solid.
Tarlz can create tar archives with five levels of compression granularity;
per file, per block, per directory, appendable solid, and solid.
@noindent
Of course, compressing each file (or each directory) individually is
less efficient than compressing the whole tar archive, but it has the
following advantages:
Of course, compressing each file (or each directory) individually can't
achieve a compression ratio as high as compressing solidly the whole tar
archive, but it has the following advantages:
@itemize @bullet
@item
@ -120,18 +120,23 @@ tarlz [@var{options}] [@var{files}]
@end example
@noindent
On archive creation or appending, tarlz removes leading and trailing
slashes from filenames, as well as filename prefixes containing a
@samp{..} component. On extraction, archive members containing a
@samp{..} component are skipped. Tarlz detects when the archive being
created or enlarged is among the files to be dumped, appended or
concatenated, and skips it.
On archive creation or appending tarlz archives the files specified, but
removes from member names any leading and trailing slashes and any filename
prefixes containing a @samp{..} component. On extraction, leading and
trailing slashes are also removed from member names, and archive members
containing a @samp{..} component in the filename are skipped. Tarlz detects
when the archive being created or enlarged is among the files to be dumped,
appended or concatenated, and skips it.
On extraction and listing, tarlz removes leading @samp{./} strings from
member names in the archive or given in the command line, so that
@w{@code{tarlz -xf foo ./bar baz}} extracts members @samp{bar} and
@samp{./baz} from archive @samp{foo}.
If several compression levels or @samp{--*solid} options are given, the last
setting is used. For example @w{@samp{-9 --solid --uncompressed -1}} is
equivalent to @samp{-1 --solid}
tarlz supports the following options:
@table @code
@ -160,6 +165,7 @@ specified. Tarlz can't concatenate uncompressed tar archives.
Set target size of input data blocks for the @samp{--bsolid} option. Valid
values range from @w{8 KiB} to @w{1 GiB}. Default value is two times the
dictionary size, except for option @samp{-0} where it defaults to @w{1 MiB}.
@xref{Minimum archive sizes}.
@item -c
@itemx --create
@ -176,6 +182,10 @@ extraction. Listing ignores any @samp{-C} options specified. @var{dir}
is relative to the then current working directory, perhaps changed by a
previous @samp{-C} option.
Note that a process can only have one current working directory (CWD).
Therefore multi-threading can't be used to create an archive if a @samp{-C}
option appears after a relative filename in the command line.
@item -f @var{archive}
@itemx --file=@var{archive}
Use archive file @var{archive}. @samp{-} used as an @var{archive}
@ -183,17 +193,19 @@ argument reads from standard input or writes to standard output.
@item -n @var{n}
@itemx --threads=@var{n}
Set the number of decompression threads, overriding the system's default.
Set the number of (de)compression threads, overriding the system's default.
Valid values range from 0 to "as many as your system can support". A value
of 0 disables threads entirely. If this option is not used, tarlz tries to
detect the number of processors in the system and use it as default value.
@w{@samp{tarlz --help}} shows the system's default value. This option
currently only has effect when listing the contents of a multimember
compressed archive. @xref{Multi-threaded tar}.
@w{@samp{tarlz --help}} shows the system's default value. See the note about
multi-threaded archive creation in the @samp{-C} option above.
Multi-threaded extraction of files from an archive is not yet implemented.
@xref{Multi-threaded tar}.
Note that the number of usable threads is limited during decompression to
the number of lzip members in the tar.lz archive, which you can find by
running @w{@code{lzip -lv archive.tar.lz}}.
Note that the number of usable threads is limited during compression to
@w{ceil( uncompressed_size / data_size )} (@pxref{Minimum archive sizes}),
and during decompression to the number of lzip members in the tar.lz
archive, which you can find by running @w{@code{lzip -lv archive.tar.lz}}.
@item -q
@itemx --quiet
@ -213,7 +225,7 @@ to an uncompressed tar archive.
@item -t
@itemx --list
List the contents of an archive. If @var{files} are given, list only the
given @var{files}.
@var{files} given.
@item -v
@itemx --verbose
@ -222,7 +234,7 @@ Verbosely list files processed.
@item -x
@itemx --extract
Extract files from an archive. If @var{files} are given, extract only
the given @var{files}. Else extract all the files in the archive.
the @var{files} given. Else extract all the files in the archive.
@item -0 .. -9
Set the compression level. The default compression level is @samp{-6}.
@ -245,40 +257,42 @@ it creates, reducing the amount of memory required for decompression.
@item --asolid
When creating or appending to a compressed archive, use appendable solid
compression. All the files being added to the archive are compressed
into a single lzip member, but the end-of-file blocks are compressed
into a separate lzip member. This creates a solidly compressed
appendable archive.
compression. All the files being added to the archive are compressed into a
single lzip member, but the end-of-file blocks are compressed into a
separate lzip member. This creates a solidly compressed appendable archive.
Solid archives can't be created nor decoded in parallel.
@item --bsolid
When creating or appending to a compressed archive, compress tar members
together in a lzip member until they approximate a target uncompressed size.
The size can't be exact because each solidly compressed data block must
contain an integer number of tar members. This option improves compression
efficiency for archives with lots of small files. @xref{--data-size}, to set
the target block size.
When creating or appending to a compressed archive, use block compression.
Tar members are compressed together in a lzip member until they approximate
a target uncompressed size. The size can't be exact because each solidly
compressed data block must contain an integer number of tar members. Block
compression is the default because it improves compression ratio for
archives with many files smaller than the block size. This option allows
tarlz revert to default behavior if, for example, it is invoked through an
alias like @code{tar='tarlz --solid'}. @xref{--data-size}, to set the target
block size.
@item --dsolid
When creating or appending to a compressed archive, use solid
compression for each directory especified in the command line. The
end-of-file blocks are compressed into a separate lzip member. This
creates a compressed appendable archive with a separate lzip member for
each top-level directory.
When creating or appending to a compressed archive, compress each file
specified in the command line separately in its own lzip member, and use
solid compression for each directory specified in the command line. The
end-of-file blocks are compressed into a separate lzip member. This creates
a compressed appendable archive with a separate lzip member for each file or
top-level directory specified.
@item --no-solid
When creating or appending to a compressed archive, compress each file
separately. The end-of-file blocks are compressed into a separate lzip
member. This creates a compressed appendable archive with a separate
lzip member for each file. This option allows tarlz revert to default
behavior if, for example, tarlz is invoked through an alias like
@code{tar='tarlz --solid'}.
separately in its own lzip member. The end-of-file blocks are compressed
into a separate lzip member. This creates a compressed appendable archive
with a lzip member for each file.
@item --solid
When creating or appending to a compressed archive, use solid
compression. The files being added to the archive, along with the
end-of-file blocks, are compressed into a single lzip member. The
resulting archive is not appendable. No more files can be later appended
to the archive.
When creating or appending to a compressed archive, use solid compression.
The files being added to the archive, along with the end-of-file blocks, are
compressed into a single lzip member. The resulting archive is not
appendable. No more files can be later appended to the archive. Solid
archives can't be created nor decoded in parallel.
@item --anonymous
Equivalent to @samp{--owner=root --group=root}.
@ -388,11 +402,11 @@ binary zeros, interpreted as an end-of-archive indicator. These EOF
blocks are either compressed in a separate lzip member or compressed
along with the tar members contained in the last lzip member.
The diagram below shows the correspondence between each tar member
(formed by one or two headers plus optional data) in the tar archive and
each
The diagram below shows the correspondence between each tar member (formed
by one or two headers plus optional data) in the tar archive and each
@uref{http://www.nongnu.org/lzip/manual/lzip_manual.html#File-format,,lzip member}
in the resulting multimember tar.lz archive:
in the resulting multimember tar.lz archive, when per file compression is
used:
@ifnothtml
@xref{File format,,,lzip}.
@end ifnothtml
@ -672,10 +686,10 @@ format.
@section Avoid misconversions to/from UTF-8
There is no portable way to tell what charset a text string is coded into.
Therefore, tarlz stores all fields representing text strings as-is, without
conversion to UTF-8 nor any other transformation. This prevents accidental
double UTF-8 conversions. If the need arises this behavior will be adjusted
with a command line option in the future.
Therefore, tarlz stores all fields representing text strings unmodified,
without conversion to UTF-8 nor any other transformation. This prevents
accidental double UTF-8 conversions. If the need arises this behavior will
be adjusted with a command line option in the future.
@node Multi-threaded tar
@ -717,13 +731,51 @@ it only needs to decompress part of each lzip member. See the following
example listing the Silesia corpus on a dual core machine:
@example
tarlz -9 -cf silesia.tar.lz silesia
tarlz -9 --no-solid -cf silesia.tar.lz silesia
time lzip -cd silesia.tar.lz | tar -tf - (5.032s)
time plzip -cd silesia.tar.lz | tar -tf - (3.256s)
time tarlz -tf silesia.tar.lz (0.020s)
@end example
@node Minimum archive sizes
@chapter Minimum archive sizes required for multi-threaded block compression
@cindex minimum archive sizes
When creating or appending to a compressed archive using multi-threaded
block compression, tarlz puts tar members together in blocks and compresses
as many blocks simultaneously as worker threads are chosen, creating a
multimember compressed archive.
For this to work as expected (and roughly multiply the compression speed by
the number of available processors), the uncompressed archive must be at
least as large as the number of worker threads times the block size
(@pxref{--data-size}). Else some processors will not get any data to
compress, and compression will be proportionally slower. The maximum speed
increase achievable on a given file is limited by the ratio
@w{(uncompressed_size / data_size)}. For example, a tarball the size of gcc
or linux will scale up to 10 or 12 processors at level -9.
The following table shows the minimum uncompressed archive size needed for
full use of N processors at a given compression level, using the default
data size for each level:
@multitable {Processors} {512 MiB} {512 MiB} {512 MiB} {512 MiB} {512 MiB} {512 MiB}
@headitem Processors @tab 2 @tab 4 @tab 8 @tab 16 @tab 64 @tab 256
@item Level
@item -0 @tab 2 MiB @tab 4 MiB @tab 8 MiB @tab 16 MiB @tab 64 MiB @tab 256 MiB
@item -1 @tab 4 MiB @tab 8 MiB @tab 16 MiB @tab 32 MiB @tab 128 MiB @tab 512 MiB
@item -2 @tab 6 MiB @tab 12 MiB @tab 24 MiB @tab 48 MiB @tab 192 MiB @tab 768 MiB
@item -3 @tab 8 MiB @tab 16 MiB @tab 32 MiB @tab 64 MiB @tab 256 MiB @tab 1 GiB
@item -4 @tab 12 MiB @tab 24 MiB @tab 48 MiB @tab 96 MiB @tab 384 MiB @tab 1.5 GiB
@item -5 @tab 16 MiB @tab 32 MiB @tab 64 MiB @tab 128 MiB @tab 512 MiB @tab 2 GiB
@item -6 @tab 32 MiB @tab 64 MiB @tab 128 MiB @tab 256 MiB @tab 1 GiB @tab 4 GiB
@item -7 @tab 64 MiB @tab 128 MiB @tab 256 MiB @tab 512 MiB @tab 2 GiB @tab 8 GiB
@item -8 @tab 96 MiB @tab 192 MiB @tab 384 MiB @tab 768 MiB @tab 3 GiB @tab 12 GiB
@item -9 @tab 128 MiB @tab 256 MiB @tab 512 MiB @tab 1 GiB @tab 4 GiB @tab 16 GiB
@end multitable
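
The arithmetic behind the table above can be sketched as follows. This is only an illustration, not code from tarlz: the function names are hypothetical, and the per-level default data sizes are an assumption inferred from the 2-processor column of the table (minimum size divided by 2), which agrees with the "two times the dictionary size" rule and the 1 MiB default stated for -0.

```python
MIB = 1024 * 1024

# Assumed default --data-size per compression level, in MiB.
# Inferred from the 2-processor column of the table (size / 2),
# not taken from the tarlz source.
DEFAULT_DATA_SIZE_MIB = {
    0: 1, 1: 2, 2: 3, 3: 4, 4: 6,
    5: 8, 6: 16, 7: 32, 8: 48, 9: 64,
}


def min_archive_size(level, processors):
    """Minimum uncompressed size (bytes) giving every processor one block."""
    return processors * DEFAULT_DATA_SIZE_MIB[level] * MIB


def usable_threads(uncompressed_size, data_size):
    """ceil(uncompressed_size / data_size), using integer arithmetic."""
    return -(-uncompressed_size // data_size)


if __name__ == "__main__":
    for level in range(10):
        sizes = [min_archive_size(level, n) // MIB for n in (2, 4, 8, 16, 64, 256)]
        print(f"-{level}", " ".join(f"{s} MiB" for s in sizes))
```

Under these assumptions, a 100 MiB archive compressed at level -9 (64 MiB blocks) can keep at most two compression threads busy, which is why larger archives are needed before extra processors help.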
@node Examples
@chapter A small tutorial with examples
@cindex examples