Title: Docs: More description(warning) about LZMA1 + BCJ with FORMAT_RAW
Type: Stage:
Components: Library (Lib) Versions: Python 3.9, Python 3.8, Python 3.7, Python 3.6
Status: open Resolution:
Dependencies: Superseder:
Assigned To: Nosy List: Janae147, malin, miurahr
Priority: normal Keywords: patch

Created on 2020-07-05 01:51 by miurahr, last changed 2020-08-03 04:26 by miurahr.

File name Uploaded Description Edit
lzmabcj.bin miurahr, 2020-07-05 01:51 test data to reproduce a problem
0001-lzma-support-LZMA1-with-FORMAT_RAW.patch miurahr, 2020-07-07 05:56 add test and update doc
0001-lzma-support-LZMA1-with-FORMAT_RAW.patch miurahr, 2020-07-07 08:21 Add tests and update doc
Messages (11)
msg373008 - (view) Author: Hiroshi Miura (miurahr) * Date: 2020-07-05 01:51
When decompressing a particular archive, the last word of the result is truncated.
The attached test data has an uncompressed size of 12800 bytes and was compressed with the LZMA1+BCJ algorithm into 11327 bytes.
The data is the payload of a 7-zip archive.

Here is pytest code to reproduce it.

:: code-block::

    import lzma

    # testdata_path is assumed to be a pathlib.Path to the directory holding lzmabcj.bin.
    def test_lzma_raw_decompressor_lzmabcj():
        # Filter chain as stored in the 7-zip archive: x86 BCJ, then LZMA1.
        filters = []
        filters.append({'id': lzma.FILTER_X86})
        filters.append(lzma._decode_filter_properties(lzma.FILTER_LZMA1, b']\x00\x00\x01\x00'))
        decompressor = lzma.LZMADecompressor(format=lzma.FORMAT_RAW, filters=filters)
        with testdata_path.joinpath('lzmabcj.bin').open('rb') as infile:
            out = decompressor.decompress(infile.read())
        assert len(out) == 12800

The test fails because len(out) is 12796 bytes; the last 4 bytes, which should be b'\x00\x00\x00\x00', are missing.
When I specify the filters as a single LZMA1 decompression, I get the expected length of data, 12800 bytes. (*1)

When I create test data with LZMA2+BCJ and examine it, I get the expected data.
When I specify the filters as a single LZMA2 decompression against the LZMA2+BCJ payload, the result is exactly the same as the (*1) data.
This indicates that the LZMA1/LZMA2 --> BCJ pipeline is suspect.

After investigating and understanding that _lzmamodule.c is a thin wrapper around liblzma, I found that the problem can be reproduced in liblzma itself.
I've reported it to the upstream xz-devel mailing list with test code.
msg373086 - (view) Author: Ma Lin (malin) * Date: 2020-07-06 09:49
The docs[1] say:

    Compression filters:
            FILTER_LZMA1 (for use with FORMAT_ALONE)
            FILTER_LZMA2 (for use with FORMAT_XZ and FORMAT_RAW)

But your code uses a combination of `FILTER_LZMA1` and `FORMAT_RAW`, is this ok?

msg373199 - (view) Author: Hiroshi Miura (miurahr) * Date: 2020-07-07 00:07
>    Compression filters:
>            FILTER_LZMA1 (for use with FORMAT_ALONE)
>            FILTER_LZMA2 (for use with FORMAT_XZ and FORMAT_RAW)

I looked into the past discussion in BPO-6715, from when the lzma module was proposed.

The only comment there about FORMAT_ALONE and LZMA1 is this:

> .lzma is actually not a format. It is just the raw output of the LZMA1
> coder. XZ instead is a container format for the LZMA2 coder, which
> probably means LZMA+some metadata.

It says that FORMAT_ALONE decodes .lzma archives, which use LZMA1 as the coder, and FORMAT_XZ decodes .xz archives, which use LZMA2 as the coder.
There is no discussion about FORMAT_RAW.

This indicates that the relation between the two runs in the opposite direction.
FORMAT_ALONE should be used with LZMA1.
FORMAT_XZ should be used with LZMA2.

FORMAT_RAW actually has no limitation regarding LZMA1/2.

Here is another discussion about lzma_raw_encoder and LZMA1.
Lasse, the xz/liblzma maintainer, suggests that lzma_raw_encoder is usable with LZMA1.

I think we need to fix the documentation.
msg373206 - (view) Author: Hiroshi Miura (miurahr) * Date: 2020-07-07 05:56
I think FORMAT_RAW is only tested with LZMA2 in Lib/test/. Since there is no test for LZMA1, the documentation states that FORMAT_RAW is for LZMA2.

I'd like to add tests for LZMA1 and change the wording in the documentation.
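
A round-trip test along these lines might look like the sketch below. It is not the content of the attached patch; the filter chain, the `lc` value, and the sample payload are assumptions chosen for illustration. It only exercises data that the lzma module compressed itself, so it does not cover the 7-zip payload attached to this issue.

```python
import lzma

# Hypothetical filter chain for illustration: x86 BCJ followed by LZMA1.
FILTERS_LZMA1_BCJ = [
    {"id": lzma.FILTER_X86},
    {"id": lzma.FILTER_LZMA1, "lc": 3},
]


def roundtrip_raw(data, filters):
    """Compress and decompress data as a raw stream with the given filters."""
    compressed = lzma.compress(data, format=lzma.FORMAT_RAW, filters=filters)
    return lzma.decompress(compressed, format=lzma.FORMAT_RAW, filters=filters)


def test_roundtrip_raw_lzma1_bcj():
    # Sample input (an assumption): 12800 bytes, the same size as the attached data.
    data = bytes(range(256)) * 50
    assert roundtrip_raw(data, FILTERS_LZMA1_BCJ) == data
```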
msg373208 - (view) Author: Ma Lin (malin) * Date: 2020-07-07 06:34
There was a similar issue (issue21872).

When decompressing lzma.FORMAT_ALONE data that doesn't have the end marker (but has the correct "Uncompressed Size" in the .lzma header), sometimes the last one to dozens of bytes can't be output.

issue21872 fixed the problem in `_lzmamodule.c`. But if liblzma strictly followed zlib's API (IMO it should), this problem would not exist.

I debugged your code with the attached file `lzmabcj.bin`; when it had output 12796 bytes, the output buffer still had 353 bytes of space. So it seems to be a problem in liblzma.

IMHO, we should first wait for the reply from the liblzma maintainer; if Lasse Collin thinks this is a bug, let us wait for the upstream fix. I will also report issue21872 to see if he can fix that problem upstream as well.
msg373210 - (view) Author: Hiroshi Miura (miurahr) * Date: 2020-07-07 08:30
Thank you for the information about the similar problem.

This problem was observed and reported on the 7-zip library project py7zr, which depends heavily on the lzma FORMAT_RAW interface.

Fortunately, the 7-zip container format has a size database, so the library can know whether the output is complete or not.

In the reported case, the library/caller ends up in a state where all input data has been sent to the decompressor, but the decompressor does not output anything.

I'd like to wait for the upstream reaction.
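
Because the container records the uncompressed size, a caller can cap the decompressor's output at that size and detect a short result itself. The sketch below is only an illustration: the helper name `decompress_exact` is hypothetical and `expected_size` is assumed to come from the container's size database; `max_length` is the existing parameter of `LZMADecompressor.decompress()`.

```python
import lzma


def decompress_exact(decompressor, payload, expected_size):
    """Decompress payload, never asking for more than expected_size bytes.

    Raises ValueError when fewer bytes come out than the container promised.
    """
    out = decompressor.decompress(payload, max_length=expected_size)
    if len(out) != expected_size:
        raise ValueError(f"expected {expected_size} bytes, got {len(out)}")
    return out
```

With the data attached to this issue, a check like this would report the 4 missing bytes instead of silently returning a short buffer.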
msg373519 - (view) Author: Hiroshi Miura (miurahr) * Date: 2020-07-11 07:19
Here is a BCJ-only CFFI test project.

It imports two bcj_x86 C sources: one from liblzma (src/xz_bcj_x86.c), which is bound to Python's lzma module, and the other from the xz-embedded project for the Linux kernel (src/xz_simple_bcj.c).

We can observe that

1. it has an interface that overwrites the buffer
2. it returns a correct resulting buffer (digest assertion) in both cases
3. it returns a size 4 bytes less than expected.

For 3, it is because the return value of BCJ is defined as

    size -= 4;
    for (i = 0; i < size; ++i) {...}
    return i;

and variable i is sometimes incremented by 4 bytes when the target sequence is found and processed.

So it may be natural that the size value returned from the BCJ filter is often 4 bytes smaller than the actual size.
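
The effect of that hold-back can be illustrated with a toy model in Python. This is not the x86 BCJ transform itself: the class name is hypothetical, the bytes pass through unchanged, and the 4-byte hold-back simply mirrors `size -= 4` in the snippet above. It only shows why the tail stays inside the filter until something tells it that the stream has ended.

```python
class LookaheadFilter:
    """Toy pass-through filter that withholds the last LOOKAHEAD bytes it has seen."""

    LOOKAHEAD = 4  # assumption: mirrors `size -= 4` in the C snippet

    def __init__(self):
        self._pending = b""

    def feed(self, chunk):
        """Return the bytes that are safe to emit; keep the tail for later."""
        buf = self._pending + chunk
        ready = max(len(buf) - self.LOOKAHEAD, 0)
        self._pending = buf[ready:]
        return buf[:ready]

    def finish(self):
        """Emit the withheld tail once the end of the stream is known."""
        out, self._pending = self._pending, b""
        return out
```

Feeding 12800 bytes into such a filter yields 12796 bytes from `feed()`, and the remaining 4 bytes only appear when `finish()` is called.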
msg373581 - (view) Author: Hiroshi Miura (miurahr) * Date: 2020-07-13 02:34
Lasse Collin gave me an explanation of the LZMA1 data format and a suggestion on how to implement this.

I'd like to change this issue to a documentation issue, to add more description about the limitations on FORMAT_ALONE and LZMA1.

The suggestion from Lasse is as follows:

> liblzma cannot be used to decode data from .7z files except in certain
> cases. This isn't a bug, it's a missing feature.
>
> The raw encoder and decoder APIs only support streams that contain an
> end of payload marker (EOPM) alias end of stream (EOS) marker. .7z
> files use LZMA1 without such an end marker. Instead, the end is handled
> by the decoder knowing the exact uncompressed size of the data.
>
> The API of liblzma supports LZMA1 without end marker via
> lzma_alone_decoder(). That API can be abused to properly decode raw
> LZMA1 with known uncompressed size by feeding the decoder a fake 13-byte
> header. Everything else in the public API requires some end marker.
>
> Decoding LZMA1 without BCJ or other extra filters from .7z with
> lzma_raw_decoder() kind of works but you will notice that it will never
> return LZMA_STREAM_END, only LZMA_OK. This is because it will never see
> an end marker. A minor downside is that it won't then do a small
> integrity check at the end either (one variable in the range decoder
> state will be 0 at the end of any valid LZMA1 stream);
> lzma_alone_decoder() does this check even when end marker is missing.
>
> If you use lzma_raw_decoder() for end-markerless LZMA1, make sure that
> you never give it more output space than the real uncompressed size. In
> rare cases this could result in extra output or an error since the
> decoder would try to decode more output using the input it has gotten
> so far. Overall I think the hack with lzma_alone_decoder() is a better
> way with the current API.
>
> BCJ filters process the input data in chunks of a few bytes long, thus
> they need to hold a few bytes of look-ahead buffer. With some filters
> like ARM the look-ahead is aligned and if the uncompressed size is a
> multiple of this alignment, lzma_raw_decoder() will give you all the
> data even when the LZMA1 layer doesn't have an end marker. The x86
> filter has one-byte alignment but needs to see five bytes from the
> future before producing output. When LZMA1 layer doesn't return
> LZMA_STREAM_END, the x86 filter doesn't know that the end was reached
> and cannot flush the last bytes out.
>
> Using liblzma to decode .7z works in these cases:
>
> - LZMA1 using a fake 13-byte header with lzma_alone_decoder():
>     1 byte LZMA properties byte that encodes lc/lp/pb
>     4 bytes dictionary size as little endian uint32_t
>     8 bytes uncompressed size as little endian uint64_t;
>         UINT64_MAX means unknown and then (and only then)
>         EOPM must be present
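
The fake-header approach described above can be sketched with the Python lzma module as follows. This is only a sketch under assumptions: the helper name `decode_lzma1_with_fake_header` is hypothetical, `props` is assumed to be the 5-byte LZMA1 property blob (properties byte plus dictionary size, such as b']\x00\x00\x01\x00' in the first message), and it relies on `FORMAT_ALONE` going through lzma_alone_decoder(). It handles the LZMA1 layer only; any BCJ filter would have to be applied separately.

```python
import lzma
import struct

# UINT64_MAX in the size field means "unknown"; the stream must then carry an end marker.
UNKNOWN_SIZE = 0xFFFFFFFFFFFFFFFF


def decode_lzma1_with_fake_header(raw, props, uncompressed_size=UNKNOWN_SIZE):
    """Decode raw LZMA1 data by prepending a 13-byte .lzma header.

    props: 1 properties byte (lc/lp/pb) + 4 bytes dictionary size, little endian.
    uncompressed_size: exact size from the container, or UNKNOWN_SIZE.
    """
    if len(props) != 5:
        raise ValueError("props must be exactly 5 bytes")
    # 5 bytes of properties + 8 bytes little-endian uint64 size = 13 bytes.
    header = props + struct.pack("<Q", uncompressed_size)
    decompressor = lzma.LZMADecompressor(format=lzma.FORMAT_ALONE)
    return decompressor.decompress(header + raw)
```

A caller reading a .7z payload would pass the uncompressed size from the container; passing `UNKNOWN_SIZE` is only appropriate for streams that do contain an end marker.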
msg373590 - (view) Author: Ma Lin (malin) * Date: 2020-07-13 10:45
It would be better to raise a warning when a problematic combination is used.

But IMO both "raising a warning" and "adding more description to the doc" depend too much on the implementation details of liblzma.
msg373591 - (view) Author: Janae (Janae147) Date: 2020-07-13 10:54
Here is a BCJ only CFFI test project.
All of this work is very interesting. Thanks for posting and for your work.
msg374715 - (view) Author: Hiroshi Miura (miurahr) * Date: 2020-08-03 04:23
Here is a draft of the additional text:

Usage of :const:`FILTER_LZMA1` with :const:`FORMAT_RAW` is not recommended,
because decompressing a combination of :const:`FILTER_LZMA1` and BCJ filters
in :const:`FORMAT_RAW` may produce wrong output under certain conditions.
This is because the LZMA1 format sometimes lacks an End of Stream (EOS) marker,
which means that the BCJ filters cannot be flushed.

I've tried to write it without describing the liblzma implementation, relying only on the nature of the API and the file format specification.
Date User Action Args
2020-08-03 04:26:37 miurahr set title: Docs: More description of reason about LZMA1 data handling with FORMAT_ALONE -> Docs: More description(warning) about LZMA1 + BCJ with FORMAT_RAW
2020-08-03 04:23:37 miurahr set messages: + msg374715
2020-07-13 10:54:16 Janae147 set nosy: + Janae147; messages: + msg373591
2020-07-13 10:45:13 malin set messages: + msg373590
2020-07-13 02:34:03 miurahr set messages: + msg373581; title: LZMADecompressor.decompress(FORMAT_RAW) truncate output when input is paticular LZMA+BCJ data -> Docs: More description of reason about LZMA1 data handling with FORMAT_ALONE
2020-07-11 07:19:49 miurahr set messages: + msg373519
2020-07-07 08:30:28 miurahr set messages: + msg373210
2020-07-07 08:21:51 miurahr set files: + 0001-lzma-support-LZMA1-with-FORMAT_RAW.patch
2020-07-07 06:34:33 malin set messages: + msg373208
2020-07-07 05:56:19 miurahr set files: + 0001-lzma-support-LZMA1-with-FORMAT_RAW.patch; keywords: + patch; messages: + msg373206
2020-07-07 00:07:23 miurahr set messages: + msg373199
2020-07-06 09:49:54 malin set messages: + msg373086
2020-07-05 10:48:03 malin set nosy: + malin; components: + Library (Lib), - Extension Modules
2020-07-05 01:51:25 miurahr create