Merging upstream version 0.27.

Signed-off-by: Daniel Baumann <daniel@debian.org>
2025-03-04 07:39:30 +01:00 · 5e422e043e
commit 5e422e043e
parent 619358407d
83 changed files with 980 additions and 726 deletions
--- a/doc/tarlz.texi
+++ b/doc/tarlz.texi
@@ -6,8 +6,8 @@
@finalout
@c %**end of header

-@set UPDATED 7 December 2024
-@set VERSION 0.26
+@set UPDATED 28 February 2025
+@set VERSION 0.27

@dircategory Archiving
@direntry
@@ -39,6 +39,7 @@ This manual is for Tarlz (version @value{VERSION}, @value{UPDATED}).
 * Introduction::              Purpose and features of tarlz
 * Invoking tarlz::            Command-line interface
 * Argument syntax::           By convention, options start with a hyphen
+* Creating backups safely::   Checking integrity and accuracy of archives
 * Portable character set::    POSIX portable filename character set
 * File format::               Detailed format of the compressed archive
 * Amendments to pax format::  The reasons for the differences with pax
@@ -51,7 +52,7 @@ This manual is for Tarlz (version @value{VERSION}, @value{UPDATED}).
@end menu

@sp 1
-Copyright @copyright{} 2013-2024 Antonio Diaz Diaz.
+Copyright @copyright{} 2013-2025 Antonio Diaz Diaz.

 This manual is free documentation: you have unlimited permission to copy,
 distribute, and modify it.
@@ -76,7 +77,7 @@ compressed archives.

 Keeping the alignment between tar members and lzip members has two
 advantages. It adds an indexed lzip layer on top of the tar archive, making
-it possible to decode the archive safely in parallel. It also minimizes the
+it possible to decode the archive safely in parallel. It also reduces the
 amount of data lost in case of corruption. Compressing a tar archive with
 plzip may even double the number of files lost for each lzip member damaged
 because it does not keep the members aligned.
@@ -254,7 +255,7 @@ during multi-threaded extraction. @xref{mt-extraction}.
@item -t
@itemx --list
 List the contents of an archive. If @var{files} are given, list only the
-@var{files} given.
+@var{files} given. @xref{mt-listing}.

@item -x
@itemx --extract
@@ -265,20 +266,23 @@ directory without extracting the files under it, use
 empty directories unconditionally before extracting over them. Other than
 that, it does not make any special effort to extract a file over an
 incompatible type of file. For example, extracting a file over a non-empty
-directory usually fails.
+directory usually fails. @xref{mt-extraction}.

@item -z
@itemx --compress
 Compress existing POSIX tar archives aligning the lzip members to the tar
 members with choice of granularity (@option{--bsolid} by default,
-@option{--dsolid} works like @option{--asolid}). Exit with error status 2 if
-any input archive is an empty file. The input archives are kept unchanged.
-Existing compressed archives are not overwritten. A hyphen @samp{-} used as
-the name of an input archive reads from standard input and writes to
-standard output (unless the option @option{--output} is used). Tarlz can be
-used as compressor for GNU tar by using a command like
-@w{@samp{tar -c -Hustar foo | tarlz -z -o foo.tar.lz}}. Tarlz can be used as
-compressor for zupdate (zutils) by using a command like
+@option{--dsolid} works like @option{--asolid}). Each input archive is
+compressed to a file with the extension @file{.lz} added unless the option
+@option{--output} is used. If no archives are specified, or if a hyphen
+@samp{-} is used as the name of an archive, tarlz reads from standard input
+and writes to standard output (unless the option @option{--output} is used).
+When @option{--output} is used, only one input archive can be specified.
+Exit with error status 2 if any input archive is an empty file. The input
+archives are kept unchanged. Existing compressed archives are not
+overwritten. Tarlz can be used as compressor for GNU tar by using a command
+like @w{@samp{tar -c -Hustar foo | tarlz -z -o foo.tar.lz}}. Tarlz can be
+used as compressor for zupdate (zutils) by using a command like
@w{@samp{zupdate --lz='tarlz -z' foo.tar.gz}}. Note that tarlz only works
 reliably on archives without global headers, or with global headers whose
 content can be ignored.
@@ -289,10 +293,8 @@ block is found, and then compresses the rest of the archive. Unless solid
 compression is requested, the end-of-archive blocks are compressed in a lzip
 member separated from the preceding members and from any nonzero garbage
 following the end-of-archive blocks. @option{--compress} implies plzip
-argument style, not tar style. Each input archive is compressed to a file
-with the extension @file{.lz} added unless the option @option{--output} is
-used. When @option{--output} is used, only one input archive can be specified.
-@option{-f} can't be used with @option{--compress}.
+argument style, not tar style. @option{-f} can't be used with
+@option{--compress}.

@item --check-lib
 Compare the
@@ -319,8 +321,10 @@ tarlz supports the following options: @xref{Argument syntax}.
@itemx --data-size=@var{bytes}
 Set target size of input data blocks for the option @option{--bsolid}.
@xref{--bsolid}. Valid values range from @w{8 KiB} to @w{1 GiB}. Default
-value is two times the dictionary size, except for option @option{-0} where it
-defaults to @w{1 MiB}. @xref{Minimum archive sizes}.
+value is two times the dictionary size, except for option @option{-0} where
+it defaults to @w{1 MiB}. @xref{Minimum archive sizes}. Tarlz does not split
+tar members. If a file is larger than @var{bytes}, tarlz will create a lzip
+member large enough to contain the file.

@item -C @var{dir}
@itemx --directory=@var{dir}
@@ -465,12 +469,13 @@ If @var{group} is not a valid group name, it is decoded as a decimal numeric
 group ID.

@item --exclude=@var{pattern}
-Exclude files matching a shell pattern like @file{*.o}. A file is considered
-to match if any component of the file name matches. For example, @file{*.o}
-matches @file{foo.o}, @file{foo.o/bar} and @file{foo/bar.o}. If
-@var{pattern} contains a @samp{/}, it matches a corresponding @samp{/} in
-the file name. For example, @file{foo/*.o} matches @file{foo/bar.o}.
-Multiple @option{--exclude} options can be specified.
+Exclude files matching a shell pattern like @file{*.o}, even if the files
+are specified in the command line. A file is considered to match if any
+component of the file name matches. For example, @file{*.o} matches
+@file{foo.o}, @file{foo.o/bar} and @file{foo/bar.o}. If @var{pattern}
+contains a @samp{/}, it matches a corresponding @samp{/} in the file name.
+For example, @file{foo/*.o} matches @file{foo/bar.o}. Multiple
+@option{--exclude} options can be specified.

@item --ignore-ids
 Make @option{--diff} ignore differences in owner and group IDs. This option is
@@ -493,6 +498,7 @@ recover as much data as possible from each damaged member. It is recommended
 to run tarlz in single-threaded mode (@option{--threads=0}) when using this
 option.

+@anchor{--missing-crc}
@item --missing-crc
 Exit with error status 2 if the CRC of the extended records is missing. When
 this option is used, tarlz detects any corruption in the extended records
@@ -525,9 +531,9 @@ values range from 1 to 1024. The default value is 64.
 During archive creation, warn if any file being archived has a modification
 time newer than the archive creation time. This option may slow archive
 creation somewhat because it makes an extra call to @samp{stat} after
-archiving each file, but it guarantees that file contents were not modified
-during the creation of the archive. Note that the file must be at least one
-second newer than the archive for it to be detected as newer.
+archiving each file, but it nearly guarantees that file contents were not
+modified during the creation of the archive. Note that the file must be at
+least one second newer than the archive for it to be detected as newer.

@ignore
@item --permissive
@@ -591,6 +597,58 @@ Thus, @w{@option{--foo bar}} and @option{--foo=bar} are equivalent.
@end itemize


+@node Creating backups safely
+@chapter Checking the integrity and accuracy of tar.lz archives
+@cindex creating backups
+
+Uncompressed tar archives do not offer any integrity checking for the files
+they store. The pax format even fails to offer integrity checking for some
+of the metadata. @xref{crc32}. The integrity checking of tar archives is
+usually provided by a compression layer or by an external hash.
+
+Lzip compression provides safe integrity checking to tar archives. But it
+does not matter how safe the archiving format is if the archive is created
+corrupt because of a concurrent modification of the files being archived,
+faulty RAM, or a bug in the archiving tool. The only way of guaranteeing
+that a backup archive is correct is to check its integrity and accuracy
+after creating it.
+
+Testing the integrity of the archive with @w{@samp{lzip -tv}} guarantees
+that the compression layer of the archive is valid, but it does not
+guarantee that the tar layer is valid nor that the files in the archive
+match the files in the file system. For example, if the RAM is faulty and a
+bit flip happens in the input buffer before tarlz compresses it, the archive
+will not match the files. It is safer to check the archive with
+@w{@samp{tarlz -d}} just after creation because it checks the compression
+layer and the tar layer, and it compares the files in the archive with the
+files in the file system:
+
+@example
+tarlz -cf archive.tar.lz somedir         # create the archive
+tarlz -df archive.tar.lz                 # check the archive
+@end example
+
+Once the integrity and accuracy of an archive have been verified as in the
+example above, they can be verified again anywhere at any time with
+@w{@samp{tarlz -t -n0}}. It is important to disable multi-threading with
+@option{-n0} because multi-threaded listing does not detect corruption in
+the tar member data of multimember archives. @xref{mt-listing}.
+
+@example
+tarlz -t -n0 -f archive.tar.lz > /dev/null
+@end example
+
+@w{@samp{lzip -tv}} checks the integrity of the compression layer, and
+therefore the integrity and accuracy of any archive created and verified as
+explained above. This test is reliable for solidly compressed archives, but
+it does not detect a truncated multimember archive if the truncation happens
+just at a member boundary:
+
+@example
+lzip -tv archive.tar.lz
+@end example
+
+
@node Portable character set
@chapter POSIX portable filename character set
@cindex portable character set
@@ -641,7 +699,7 @@ are not allowed in multimember files.

 Each lzip member contains one or more tar members in a simplified POSIX pax
 interchange format. The only pax typeflag value supported by tarlz (in
-addition to the typeflag values defined by the ustar format) is @samp{x}.
+addition to the typeflag values defined by the ustar format) is 'x'.
 The pax format is an extension on top of the ustar format that removes the
 size limitations of the ustar format.

@@ -654,7 +712,7 @@ An optional extended header block followed by one or more blocks that
 contain the extended header records as if they were the contents of a file;
 i.e., the extended header records are included as the data for this header
 block. This header block is of the form described in pax header block, with
-a typeflag value of @samp{x}.
+a typeflag value of 'x'.

@item
 A header block in ustar format that describes the file. Any fields defined
@@ -713,7 +771,7 @@ An extended header just before the end-of-archive blocks.
@section Pax header block

 The pax header block is identical to the ustar header block described below
-except that the typeflag has the value @samp{x} (extended). The field
+except that the typeflag has the value 'x' (extended). The field
@samp{size} is the size of the extended header data in bytes. Most other
 fields in the pax header block are zeroed on archive creation to prevent
 trouble if the archive is read by a ustar tool, and are ignored by tarlz on
@@ -752,8 +810,8 @@ greater than 2_097_151 @w{(octal 7_777_777)}. @xref{ustar-uid-gid}.
 The file name of a link being created to another file, of any type,
 previously archived. This record overrides the field @samp{linkname} in the
 following ustar header block. The following ustar header block determines
-the type of link created. If typeflag of the following header block is 1, a
-hard link is created. If typeflag is 2, a symbolic link is created and the
+the type of link created. If typeflag of the following header block is '1', a
+hard link is created. If typeflag is '2', a symbolic link is created and the
 linkpath value is used as the contents of the symbolic link. The linkpath
 record is created only for links with a link name that does not fit in the
 space provided by the ustar header.
@@ -789,13 +847,12 @@ greater than 2_097_151 @w{(octal 7_777_777)}. @xref{ustar-uid-gid}.
@item GNU.crc32
 CRC32-C (Castagnoli) of the extended header data excluding the 8 bytes
 representing the CRC <value> itself. The <value> is represented as 8
-hexadecimal digits in big endian order,
-@w{@samp{22 GNU.crc32=00000000\n}}. The keyword of the CRC record is
-protected by the CRC to guarantee that corruption is always detected when
-using @option{--missing-crc} (except in case of CRC collision). A CRC was
-chosen because a checksum is too weak for a potentially large list of
-variable sized records. A checksum can't detect simple errors like the
-swapping of two bytes.
+hexadecimal digits in big endian order, @w{@samp{22 GNU.crc32=00000000\n}}.
+The option @option{--missing-crc} guarantees that corruption is always
+detected (except in case of CRC collision). A CRC was chosen because a
+checksum is too weak for a potentially large list of variable sized records.
+A checksum can't detect simple errors like the swapping of two bytes.
+@xref{--missing-crc}.

@end table

@@ -825,6 +882,7 @@ shown in the following table. All lengths and offsets are in decimal:
@item devmajor @tab 329 @tab   8
@item devminor @tab 337 @tab   8
@item prefix   @tab 345 @tab 155
+@item padding  @tab 500 @tab  12
@end multitable

 All characters in the header block are coded using the ISO/IEC 646:1991
@@ -919,7 +977,7 @@ FIFO special file.
@item '7'
 Reserved to represent a file to which an implementation has associated some
 high-performance attribute (contiguous file). Tarlz treats this type of file
-as a regular file (type 0).
+as a regular file (type '0').

@end table

@@ -930,8 +988,8 @@ except when all characters in the array contain non-null characters
 including the last character. Each numeric field contains a leading space-
 or zero-filled, optionally null-terminated octal number using digits from
 the ISO/IEC 646:1991 (ASCII) standard. Tarlz is able to decode numeric
-fields 1 byte longer than standard ustar by not requiring a terminating null
-character.
+fields one byte longer than standard ustar by not requiring a terminating
+null character.


@node Amendments to pax format
@@ -1044,8 +1102,7 @@ extracting file data for a hard link to a symbolic link or to a directory.
 There is no portable way to tell what charset a text string is coded into.
 Therefore, tarlz stores all fields representing text strings unmodified,
 without conversion to UTF-8 nor any other transformation. This prevents
-accidental double UTF-8 conversions. If the need arises this behavior will
-be adjusted with a command-line option in the future.
+accidental double UTF-8 conversions.


@node Program design
@@ -1054,12 +1111,12 @@ be adjusted with a command-line option in the future.

 The parts of tarlz related to sequential processing of the archive are more
 or less similar to any other tar and won't be described here. The interesting
-parts described here are those related to Multi-threaded processing.
+parts described here are those related to multi-threaded processing.

-The structure of the part of tarlz performing Multi-threaded archive
+The structure of the part of tarlz performing multi-threaded archive
 creation is somewhat similar to that of
-@uref{http://www.nongnu.org/lzip/plzip.html#Program-design,,plzip} with the
-added complication of the solidity levels.
+@uref{http://www.nongnu.org/lzip/manual/plzip_manual.html#Program-design,,plzip}
+with the added complication of the solidity levels.
@ifnothtml
@xref{Program design,,,plzip}.
@end ifnothtml
@@ -1174,6 +1231,9 @@ tar.lz archives, keeping backwards compatibility. If tarlz finds a member
 misalignment during multi-threaded decoding, it switches to single-threaded
 mode and continues decoding the archive.

+@anchor{mt-listing}
+@section Multi-threaded listing
+
 If the files in the archive are large, multi-threaded @option{--list} on a
 regular (seekable) tar.lz archive can be hundreds of times faster than
 sequential @option{--list} because, in addition to using several processors,
@@ -1189,8 +1249,10 @@ time tarlz -tf silesia.tar.lz                       (0.020s)

 On the other hand, multi-threaded @option{--list} won't detect corruption in
 the tar member data because it only decodes the part of each lzip member
-corresponding to the tar member header. This is another reason why the tar
-headers must provide their own integrity checking.
+corresponding to the tar member header. Partial decoding of a lzip member
+can't guarantee the integrity of the data decoded. This is another reason
+why the tar headers (including the extended records) must provide their own
+integrity checking.

@anchor{mt-extraction}
@section Limitations of multi-threaded extraction
@@ -1344,11 +1406,13 @@ tarlz -z --no-solid archive.tar
@end example

@noindent
-Example 10: Compress the archive @file{archive.tar} and write the output to
-@file{foo.tar.lz}.
+Example 10: Recompress the archive @file{archive.tar.lz} with different
+solidity, write the output to @file{archive-ns.tar.lz}, and compare both
+archives.

@example
-tarlz -z -o foo.tar.lz archive.tar
+lzip -cd archive.tar.lz | tarlz -9z --no-solid -o archive-ns.tar.lz
+zcmp archive.tar.lz archive-ns.tar.lz
@end example

@noindent