Class SeekableXZInputStream

  • All Implemented Interfaces:
    java.io.Closeable, java.lang.AutoCloseable

    public class SeekableXZInputStream
    extends SeekableInputStream
    Decompresses a .xz file in random access mode. This supports decompressing concatenated .xz files.

    Each .xz file consists of one or more Streams. Each Stream consists of zero or more Blocks. Each Stream contains an Index of that Stream's Blocks. The Indexes from all Streams are loaded in RAM by a constructor of this class. A typical .xz file has only one Stream, and parsing its Index will need only three or four seeks.

    To make random access possible, the data in a .xz file must be split into multiple Blocks of reasonable size. Decompression can only start at a Block boundary. When seeking to an uncompressed position that is not at a Block boundary, decompression starts at the beginning of the Block and throws away data until the target position is reached. Thus, smaller Blocks mean faster seeks to arbitrary uncompressed positions. On the other hand, smaller Blocks mean worse compression, so one has to compromise between random access speed and compression ratio.
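The cost of a seek can be estimated with a little arithmetic. The following is a minimal sketch that assumes every Block holds the same number of uncompressed bytes (a simplification made for illustration; real files may have Blocks of varying sizes). The class and method names are hypothetical and not part of this library.

```java
// Toy model of the seek cost described above. It assumes every Block holds
// exactly blockSize uncompressed bytes, which real .xz files need not do.
final class SeekCostSketch {
    /** Uncompressed start position of the Block that contains pos. */
    static long blockStart(long pos, long blockSize) {
        return (pos / blockSize) * blockSize;
    }

    /** Bytes that are decompressed and thrown away before pos is reached. */
    static long discardedBytes(long pos, long blockSize) {
        return pos - blockStart(pos, blockSize);
    }

    public static void main(String[] args) {
        long pos = 10_000_000L;
        // Compare 1 MiB, 8 MiB and 64 MiB Blocks for the same target position.
        for (long blockSize : new long[] {1L << 20, 8L << 20, 64L << 20}) {
            System.out.println("blockSize=" + blockSize
                    + " discarded=" + discardedBytes(pos, blockSize));
        }
    }
}
```

With uniformly distributed seek targets, the expected amount of discarded data is roughly half a Block, which is why halving the Block size roughly halves the average seek cost.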

    Implementation note: This class uses linear search to locate the correct Stream from the data structures in RAM. It was the simplest to implement and should be fine as long as there aren't too many Streams. The correct Block inside a Stream is located using binary search and thus is fast even with a huge number of Blocks.
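The Block lookup can be pictured as a binary search over running totals of uncompressed sizes. This is only a sketch of the general technique: the `blockEnds` array and the `findBlock` name are assumptions made for illustration and do not reflect the internal data structures of this class.

```java
// Sketch of locating a Block by uncompressed position with binary search.
// blockEnds[i] is the hypothetical exclusive end offset of Block i, i.e. the
// running total of uncompressed sizes; this is not the library's own layout.
final class BlockLookupSketch {
    static int findBlock(long[] blockEnds, long pos) {
        if (blockEnds.length == 0 || pos < 0
                || pos >= blockEnds[blockEnds.length - 1]) {
            throw new IndexOutOfBoundsException(
                    "Invalid uncompressed position: " + pos);
        }

        int left = 0;
        int right = blockEnds.length - 1;

        // Invariant: the Block containing pos is always within [left, right].
        while (left < right) {
            int mid = left + (right - left) / 2;
            if (blockEnds[mid] <= pos) {
                left = mid + 1;  // Block mid ends at or before pos.
            } else {
                right = mid;     // Block mid ends after pos; it may be the answer.
            }
        }

        return left;
    }

    public static void main(String[] args) {
        long[] blockEnds = {100, 250, 1000}; // Blocks of 100, 150 and 750 bytes.
        System.out.println(findBlock(blockEnds, 0));
        System.out.println(findBlock(blockEnds, 100));
        System.out.println(findBlock(blockEnds, 999));
    }
}
```

The loop runs in O(log n) steps for n Blocks, which is why a huge number of Blocks does not slow the lookup noticeably.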

    Memory usage

    The amount of memory needed for the Indexes is taken into account when checking the memory usage limit. Each Stream is calculated to need at least 1 KiB of memory and each Block 16 bytes of memory, rounded up to the next kibibyte. So unless the file has a huge number of Streams or Blocks, these don't take a significant amount of memory.
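As a rough illustration of this accounting rule, the sketch below estimates the Index memory usage from the number of Blocks in each Stream. Applying the rounding separately to each Stream is an assumption of this sketch, and the class and method names are hypothetical; use getIndexMemoryUsage() for the value actually calculated by this class.

```java
// Rough estimate of Index memory usage following the accounting rule above:
// 1 KiB per Stream plus 16 bytes per Block, rounded up to whole KiB.
// Applying the rounding per Stream is an assumption made for this sketch.
final class IndexMemorySketch {
    /** Estimated KiB for one Stream that contains blockCount Blocks. */
    static long streamIndexKiB(long blockCount) {
        return 1 + (16L * blockCount + 1023) / 1024;
    }

    /** Estimated KiB for a file; blocksPerStream has one entry per Stream. */
    static long totalIndexKiB(long[] blocksPerStream) {
        long total = 0;
        for (long blocks : blocksPerStream) {
            total += streamIndexKiB(blocks);
        }
        return total;
    }

    public static void main(String[] args) {
        // A single Stream with one million Blocks needs about 15 MiB.
        System.out.println(streamIndexKiB(1_000_000L) + " KiB");
    }
}
```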

    Creating random-accessible .xz files

    When using XZOutputStream, a new Block can be started by calling its endBlock method. If you know that the decompressor will only need to seek to certain uncompressed positions, it can be a good idea to start a new Block at (some of) these positions (and only at these positions to get better compression ratio).
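The simplest strategy is to end a Block after every fixed number of uncompressed bytes. The sketch below shows only the splitting logic. To keep it compilable without the library, it writes to a hypothetical `BlockSink` interface that stands in for the `write(byte[], int, int)` and `endBlock()` methods of XZOutputStream; in real code the calls would go to an XZOutputStream instance instead.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

final class BlockSplitSketch {
    /** Hypothetical stand-in for the two XZOutputStream methods used here. */
    interface BlockSink {
        void write(byte[] buf, int off, int len) throws IOException;
        void endBlock() throws IOException;
    }

    /** Writes data to sink and ends the Block after every blockSize bytes. */
    static void writeInBlocks(BlockSink sink, byte[] data, int blockSize)
            throws IOException {
        if (blockSize <= 0) {
            throw new IllegalArgumentException("blockSize must be positive");
        }

        int off = 0;
        while (off < data.length) {
            int len = Math.min(blockSize, data.length - off);
            sink.write(data, off, len);
            sink.endBlock();
            off += len;
        }
    }

    /** Runs writeInBlocks against a recording sink and returns the Block sizes. */
    static List<Integer> blockSizes(byte[] data, int blockSize) {
        List<Integer> sizes = new ArrayList<>();
        int[] current = {0};
        BlockSink sink = new BlockSink() {
            @Override
            public void write(byte[] buf, int off, int len) {
                current[0] += len;
            }

            @Override
            public void endBlock() {
                sizes.add(current[0]);
                current[0] = 0;
            }
        };

        try {
            writeInBlocks(sink, data, blockSize);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return sizes;
    }

    public static void main(String[] args) {
        System.out.println(blockSizes(new byte[10], 4)); // prints [4, 4, 2]
    }
}
```

If the seek targets are known in advance, the fixed `blockSize` step can be replaced with a list of those positions so that Blocks start exactly where the decompressor will seek.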

    liblzma in XZ Utils supports starting a new Block with LZMA_FULL_FLUSH. XZ Utils 5.1.1alpha added threaded compression which creates multi-Block .xz files. XZ Utils 5.1.1alpha also added the option --block-size=SIZE to the xz command line tool. XZ Utils 5.1.2alpha added a partial implementation of --block-list=SIZES which allows specifying sizes of individual Blocks.

    See Also:
    SeekableFileInputStream, XZInputStream, XZOutputStream
    • Method Summary

      Modifier and Type Method Description
      int available()
      Returns the number of uncompressed bytes that can be read without blocking.
      void close()
      Closes the stream and calls in.close().
      void close​(boolean closeInput)
      Closes the stream and optionally calls in.close().
      int getBlockCheckType​(int blockNumber)
      Gets integrity check type (Check ID) of the given Block.
      long getBlockCompPos​(int blockNumber)
      Gets the position where the given compressed Block starts in the underlying .xz file.
      long getBlockCompSize​(int blockNumber)
      Gets the compressed size of the given Block.
      int getBlockCount()
      Gets the number of Blocks in the .xz file.
      int getBlockNumber​(long pos)
      Gets the number of the Block that contains the byte at the given uncompressed position.
      long getBlockPos​(int blockNumber)
      Gets the uncompressed start position of the given Block.
      long getBlockSize​(int blockNumber)
      Gets the uncompressed size of the given Block.
      int getCheckTypes()
      Gets the types of integrity checks used in the .xz file.
      int getIndexMemoryUsage()
      Gets the amount of memory in kibibytes (KiB) used by the data structures needed to locate the XZ Blocks.
      long getLargestBlockSize()
      Gets the uncompressed size of the largest XZ Block in bytes.
      int getStreamCount()
      Gets the number of Streams in the .xz file.
      long length()
      Gets the uncompressed size of this input stream.
      long position()
      Gets the current uncompressed position in this input stream.
      int read()
      Decompresses the next byte from this input stream.
      int read​(byte[] buf, int off, int len)
      Decompresses into an array of bytes.
      void seek​(long pos)
      Seeks to the specified absolute uncompressed position in the stream.
      void seekToBlock​(int blockNumber)
      Seeks to the beginning of the given XZ Block.
      • Methods inherited from class java.io.InputStream

        mark, markSupported, nullInputStream, read, readAllBytes, readNBytes, readNBytes, reset, transferTo
      • Methods inherited from class java.lang.Object

        clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
    • Constructor Detail

      • SeekableXZInputStream

        public SeekableXZInputStream​(SeekableInputStream in)
                              throws java.io.IOException
        Creates a new seekable XZ decompressor without a memory usage limit.
        Parameters:
        in - seekable input stream containing one or more XZ Streams; the whole input stream is used
        Throws:
        XZFormatException - input is not in the XZ format
        CorruptedInputException - XZ data is corrupt or truncated
        UnsupportedOptionsException - XZ headers seem valid but they specify options not supported by this implementation
        java.io.EOFException - less than 6 bytes of input was available from in, or (unlikely) the size of the underlying stream got smaller while this was reading from it
        java.io.IOException - may be thrown by in
      • SeekableXZInputStream

        public SeekableXZInputStream​(SeekableInputStream in,
                                     ArrayCache arrayCache)
                              throws java.io.IOException
        Creates a new seekable XZ decompressor without a memory usage limit.

        This is identical to SeekableXZInputStream(SeekableInputStream) except that this also takes the arrayCache argument.

        Parameters:
        in - seekable input stream containing one or more XZ Streams; the whole input stream is used
        arrayCache - cache to be used for allocating large arrays
        Throws:
        XZFormatException - input is not in the XZ format
        CorruptedInputException - XZ data is corrupt or truncated
        UnsupportedOptionsException - XZ headers seem valid but they specify options not supported by this implementation
        java.io.EOFException - less than 6 bytes of input was available from in, or (unlikely) the size of the underlying stream got smaller while this was reading from it
        java.io.IOException - may be thrown by in
        Since:
        1.7
      • SeekableXZInputStream

        public SeekableXZInputStream​(SeekableInputStream in,
                                     int memoryLimit)
                              throws java.io.IOException
        Creates a new seekable XZ decompressor with an optional memory usage limit.
        Parameters:
        in - seekable input stream containing one or more XZ Streams; the whole input stream is used
        memoryLimit - memory usage limit in kibibytes (KiB) or -1 to impose no memory usage limit
        Throws:
        XZFormatException - input is not in the XZ format
        CorruptedInputException - XZ data is corrupt or truncated
        UnsupportedOptionsException - XZ headers seem valid but they specify options not supported by this implementation
        MemoryLimitException - decoded XZ Indexes would need more memory than allowed by the memory usage limit
        java.io.EOFException - less than 6 bytes of input was available from in, or (unlikely) the size of the underlying stream got smaller while this was reading from it
        java.io.IOException - may be thrown by in
      • SeekableXZInputStream

        public SeekableXZInputStream​(SeekableInputStream in,
                                     int memoryLimit,
                                     ArrayCache arrayCache)
                              throws java.io.IOException
        Creates a new seekable XZ decompressor with an optional memory usage limit.

        This is identical to SeekableXZInputStream(SeekableInputStream,int) except that this also takes the arrayCache argument.

        Parameters:
        in - seekable input stream containing one or more XZ Streams; the whole input stream is used
        memoryLimit - memory usage limit in kibibytes (KiB) or -1 to impose no memory usage limit
        arrayCache - cache to be used for allocating large arrays
        Throws:
        XZFormatException - input is not in the XZ format
        CorruptedInputException - XZ data is corrupt or truncated
        UnsupportedOptionsException - XZ headers seem valid but they specify options not supported by this implementation
        MemoryLimitException - decoded XZ Indexes would need more memory than allowed by the memory usage limit
        java.io.EOFException - less than 6 bytes of input was available from in, or (unlikely) the size of the underlying stream got smaller while this was reading from it
        java.io.IOException - may be thrown by in
        Since:
        1.7
      • SeekableXZInputStream

        public SeekableXZInputStream​(SeekableInputStream in,
                                     int memoryLimit,
                                     boolean verifyCheck)
                              throws java.io.IOException
        Creates a new seekable XZ decompressor with an optional memory usage limit and the ability to disable verification of integrity checks.

        Note that integrity check verification should almost never be disabled. Possible reasons to disable integrity check verification:

        • Trying to recover data from a corrupt .xz file.
        • Speeding up decompression. This matters mostly with SHA-256 or with files that have compressed extremely well. It's recommended that integrity checking isn't disabled for performance reasons unless the file integrity is verified externally in some other way.

        verifyCheck only affects the integrity check of the actual compressed data. The CRC32 fields in the headers are always verified.

        Parameters:
        in - seekable input stream containing one or more XZ Streams; the whole input stream is used
        memoryLimit - memory usage limit in kibibytes (KiB) or -1 to impose no memory usage limit
        verifyCheck - if true, the integrity checks will be verified; this should almost never be set to false
        Throws:
        XZFormatException - input is not in the XZ format
        CorruptedInputException - XZ data is corrupt or truncated
        UnsupportedOptionsException - XZ headers seem valid but they specify options not supported by this implementation
        MemoryLimitException - decoded XZ Indexes would need more memory than allowed by the memory usage limit
        java.io.EOFException - less than 6 bytes of input was available from in, or (unlikely) the size of the underlying stream got smaller while this was reading from it
        java.io.IOException - may be thrown by in
        Since:
        1.6
      • SeekableXZInputStream

        public SeekableXZInputStream​(SeekableInputStream in,
                                     int memoryLimit,
                                     boolean verifyCheck,
                                     ArrayCache arrayCache)
                              throws java.io.IOException
        Creates a new seekable XZ decompressor with an optional memory usage limit and the ability to disable verification of integrity checks.

        This is identical to SeekableXZInputStream(SeekableInputStream,int,boolean) except that this also takes the arrayCache argument.

        Parameters:
        in - seekable input stream containing one or more XZ Streams; the whole input stream is used
        memoryLimit - memory usage limit in kibibytes (KiB) or -1 to impose no memory usage limit
        verifyCheck - if true, the integrity checks will be verified; this should almost never be set to false
        arrayCache - cache to be used for allocating large arrays
        Throws:
        XZFormatException - input is not in the XZ format
        CorruptedInputException - XZ data is corrupt or truncated
        UnsupportedOptionsException - XZ headers seem valid but they specify options not supported by this implementation
        MemoryLimitException - decoded XZ Indexes would need more memory than allowed by the memory usage limit
        java.io.EOFException - less than 6 bytes of input was available from in, or (unlikely) the size of the underlying stream got smaller while this was reading from it
        java.io.IOException - may be thrown by in
        Since:
        1.7
    • Method Detail

      • getCheckTypes

        public int getCheckTypes()
        Gets the types of integrity checks used in the .xz file. Multiple checks are possible only if there are multiple concatenated XZ Streams.

        The returned value has a bit set for every check type that is present. For example, if CRC64 and SHA-256 were used, the return value is (1 << XZ.CHECK_CRC64) | (1 << XZ.CHECK_SHA256).
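A minimal sketch of decoding such a bit mask follows. The Check ID constants are copied into the sketch so that it compiles on its own; their values are meant to match XZ.CHECK_NONE, XZ.CHECK_CRC32, XZ.CHECK_CRC64 and XZ.CHECK_SHA256. The class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

final class CheckTypesSketch {
    // Check IDs from the .xz format, copied here so the sketch is
    // self-contained. Real code would use the XZ.CHECK_* constants.
    static final int CHECK_NONE = 0;
    static final int CHECK_CRC32 = 1;
    static final int CHECK_CRC64 = 4;
    static final int CHECK_SHA256 = 10;

    /** Tests whether the bit for checkId is set in a getCheckTypes() value. */
    static boolean hasCheck(int checkTypes, int checkId) {
        return (checkTypes & (1 << checkId)) != 0;
    }

    /** Lists every Check ID (0 to 15) whose bit is set in checkTypes. */
    static List<Integer> presentCheckIds(int checkTypes) {
        List<Integer> ids = new ArrayList<>();
        for (int id = 0; id <= 15; ++id) {
            if (hasCheck(checkTypes, id)) {
                ids.add(id);
            }
        }
        return ids;
    }

    public static void main(String[] args) {
        int checkTypes = (1 << CHECK_CRC64) | (1 << CHECK_SHA256);
        System.out.println(presentCheckIds(checkTypes)); // prints [4, 10]
    }
}
```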

      • getIndexMemoryUsage

        public int getIndexMemoryUsage()
        Gets the amount of memory in kibibytes (KiB) used by the data structures needed to locate the XZ Blocks. This is usually useless information, but since it is calculated for the memory usage limit anyway, it is nice to make it available too.
      • getLargestBlockSize

        public long getLargestBlockSize()
        Gets the uncompressed size of the largest XZ Block in bytes. This can be useful if you want to check that the file doesn't have huge XZ Blocks which could make seeking to arbitrary offsets very slow. Note that huge Blocks don't automatically mean that seeking would be slow; for example, seeking to the beginning of any Block is always fast.
      • getStreamCount

        public int getStreamCount()
        Gets the number of Streams in the .xz file.
        Since:
        1.3
      • getBlockCount

        public int getBlockCount()
        Gets the number of Blocks in the .xz file.
        Since:
        1.3
      • getBlockPos

        public long getBlockPos​(int blockNumber)
        Gets the uncompressed start position of the given Block.
        Throws:
        java.lang.IndexOutOfBoundsException - if blockNumber < 0 or blockNumber >= getBlockCount().
        Since:
        1.3
      • getBlockSize

        public long getBlockSize​(int blockNumber)
        Gets the uncompressed size of the given Block.
        Throws:
        java.lang.IndexOutOfBoundsException - if blockNumber < 0 or blockNumber >= getBlockCount().
        Since:
        1.3
      • getBlockCompPos

        public long getBlockCompPos​(int blockNumber)
        Gets the position where the given compressed Block starts in the underlying .xz file. This information is rarely useful to the users of this class.
        Throws:
        java.lang.IndexOutOfBoundsException - if blockNumber < 0 or blockNumber >= getBlockCount().
        Since:
        1.3
      • getBlockCompSize

        public long getBlockCompSize​(int blockNumber)
        Gets the compressed size of the given Block. This together with the uncompressed size can be used to calculate the compression ratio of the specific Block.
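As a small illustration, the ratio can be computed from the two sizes as shown below. The sketch defines the ratio as compressed size divided by uncompressed size (so smaller is better); the inverse convention is also common. The class and method names are hypothetical, and the arguments would come from getBlockCompSize(blockNumber) and getBlockSize(blockNumber).

```java
final class BlockRatioSketch {
    /**
     * Compressed size divided by uncompressed size; for example,
     * 0.25 means the Block shrank to a quarter of its original size.
     */
    static double ratio(long compSize, long uncompSize) {
        if (compSize < 0 || uncompSize <= 0) {
            throw new IllegalArgumentException("Sizes must be positive");
        }
        return (double) compSize / uncompSize;
    }

    public static void main(String[] args) {
        System.out.println(ratio(256, 1024)); // prints 0.25
    }
}
```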
        Throws:
        java.lang.IndexOutOfBoundsException - if blockNumber < 0 or blockNumber >= getBlockCount().
        Since:
        1.3
      • getBlockCheckType

        public int getBlockCheckType​(int blockNumber)
        Gets integrity check type (Check ID) of the given Block.
        Throws:
        java.lang.IndexOutOfBoundsException - if blockNumber < 0 or blockNumber >= getBlockCount().
        Since:
        1.3
        See Also:
        getCheckTypes()
      • getBlockNumber

        public int getBlockNumber​(long pos)
        Gets the number of the Block that contains the byte at the given uncompressed position.
        Throws:
        java.lang.IndexOutOfBoundsException - if pos < 0 or pos >= length().
        Since:
        1.3
      • read

        public int read​(byte[] buf,
                        int off,
                        int len)
                 throws java.io.IOException
        Decompresses into an array of bytes.

        If len is zero, no bytes are read and 0 is returned. Otherwise this will try to decompress len bytes of uncompressed data. Less than len bytes may be read only in the following situations:

        • The end of the compressed data was reached successfully.
        • An error is detected after at least one but less than len bytes have already been successfully decompressed. The next call with non-zero len will immediately throw the pending exception.
        • An exception is thrown.
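Because a short count is possible, callers that need exactly len bytes usually loop until the buffer is full. The following is a minimal sketch written against plain java.io.InputStream, so it applies to this class as well as to any other stream. The `readFully`, `shortReads` and `demo` names are hypothetical; `shortReads` only exists to simulate short reads in the demonstration.

```java
import java.io.ByteArrayInputStream;
import java.io.EOFException;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;

final class ReadFullySketch {
    /** Reads exactly len bytes into buf or throws EOFException. */
    static void readFully(InputStream in, byte[] buf, int off, int len)
            throws IOException {
        while (len > 0) {
            int n = in.read(buf, off, len);
            if (n == -1) {
                throw new EOFException(
                        "Stream ended with " + len + " bytes still missing");
            }
            off += n;
            len -= n;
        }
    }

    /** Test double that hands out at most 3 bytes per read call. */
    static InputStream shortReads(byte[] data) {
        return new ByteArrayInputStream(data) {
            @Override
            public synchronized int read(byte[] b, int off, int len) {
                return super.read(b, off, Math.min(len, 3));
            }
        };
    }

    /** Reads len bytes from a short-reading stream over data. */
    static byte[] demo(byte[] data, int len) {
        byte[] out = new byte[len];
        try {
            readFully(shortReads(data), out, 0, len);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out;
    }

    public static void main(String[] args) {
        byte[] data = {1, 2, 3, 4, 5, 6, 7, 8};
        System.out.println(demo(data, 8).length); // prints 8
    }
}
```

On Java 9 and later, the inherited readNBytes(byte[], int, int) method offers similar looping behavior, except that it returns a short count at the end of the stream instead of throwing.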
        Overrides:
        read in class java.io.InputStream
        Parameters:
        buf - target buffer for uncompressed data
        off - start offset in buf
        len - maximum number of uncompressed bytes to read
        Returns:
        number of bytes read, or -1 to indicate the end of the compressed stream
        Throws:
        CorruptedInputException
        UnsupportedOptionsException
        MemoryLimitException
        XZIOException - if the stream has been closed
        java.io.IOException - may be thrown by in
      • available

        public int available()
                      throws java.io.IOException
        Returns the number of uncompressed bytes that can be read without blocking. The value is returned with an assumption that the compressed input data will be valid. If the compressed data is corrupt, CorruptedInputException may get thrown before the number of bytes claimed to be available have been read from this input stream.
        Overrides:
        available in class java.io.InputStream
        Returns:
        the number of uncompressed bytes that can be read without blocking
        Throws:
        java.io.IOException
      • close

        public void close()
                   throws java.io.IOException
        Closes the stream and calls in.close(). If the stream was already closed, this does nothing.

        This is equivalent to close(true).

        Specified by:
        close in interface java.lang.AutoCloseable
        Specified by:
        close in interface java.io.Closeable
        Overrides:
        close in class java.io.InputStream
        Throws:
        java.io.IOException - if thrown by in.close()
      • close

        public void close​(boolean closeInput)
                   throws java.io.IOException
        Closes the stream and optionally calls in.close(). If the stream was already closed, this does nothing. If close(false) has been called, a further call of close(true) does nothing (it doesn't call in.close()).

        If you don't want to close the underlying InputStream, there is usually no need to worry about closing this stream either; it's fine to do nothing and let the garbage collector handle it. However, if you are using ArrayCache, close(false) can be useful to put the allocated arrays back to the cache without closing the underlying InputStream.

        Note that if you successfully reach the end of the stream (read returns -1), the arrays are automatically put back to the cache by that read call. In this situation close(false) is redundant (but harmless).

        Throws:
        java.io.IOException - if thrown by in.close()
        Since:
        1.7
      • length

        public long length()
        Gets the uncompressed size of this input stream. If there are multiple XZ Streams, the total uncompressed size of all XZ Streams is returned.
        Specified by:
        length in class SeekableInputStream
      • position

        public long position()
                      throws java.io.IOException
        Gets the current uncompressed position in this input stream.
        Specified by:
        position in class SeekableInputStream
        Throws:
        XZIOException - if the stream has been closed
        java.io.IOException
      • seek

        public void seek​(long pos)
                  throws java.io.IOException
        Seeks to the specified absolute uncompressed position in the stream. This only stores the new position, so this function itself is always very fast. The actual seek is done when read is called to read at least one byte.

        Seeking past the end of the stream is possible. In that case read will return -1 to indicate the end of the stream.
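The contract above can be modeled with a toy in-memory class. This sketch is not the implementation of this class and decompresses nothing; it only mirrors the documented behavior that seek merely records the target position, that a negative position is rejected, and that reading past the end returns -1. All names are hypothetical.

```java
import java.io.IOException;
import java.io.UncheckedIOException;

// Toy in-memory model of the seek contract described above; it is not the
// library's implementation and decompresses nothing.
final class LazySeekSketch {
    private final byte[] data;
    private long pos = 0;

    LazySeekSketch(byte[] data) {
        this.data = data;
    }

    /** Only records the target; no work happens until read() is called. */
    void seek(long newPos) throws IOException {
        if (newPos < 0) {
            throw new IOException("Negative seek position: " + newPos);
        }
        pos = newPos;
    }

    /** Returns the next byte as 0..255, or -1 at or past the end. */
    int read() {
        if (pos >= data.length) {
            return -1;
        }
        return data[(int) pos++] & 0xFF;
    }

    /** Convenience helper: seeks to pos in data and reads one byte. */
    static int readAt(byte[] data, long pos) {
        LazySeekSketch in = new LazySeekSketch(data);
        try {
            in.seek(pos);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return in.read();
    }

    public static void main(String[] args) {
        byte[] data = {10, 20, 30};
        System.out.println(readAt(data, 1));   // prints 20
        System.out.println(readAt(data, 100)); // prints -1
    }
}
```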

        Specified by:
        seek in class SeekableInputStream
        Parameters:
        pos - new uncompressed read position
        Throws:
        XZIOException - if pos is negative, or if the stream has been closed
        java.io.IOException - if pos is negative or if a stream-specific I/O error occurs
      • seekToBlock

        public void seekToBlock​(int blockNumber)
                         throws java.io.IOException
        Seeks to the beginning of the given XZ Block.
        Throws:
        XZIOException - if blockNumber < 0 or blockNumber >= getBlockCount(), or if the stream has been closed
        java.io.IOException
        Since:
        1.3