Message 373581 - Python tracker

This issue tracker has been migrated to GitHub, and is currently read-only.
For more information, see the GitHub FAQs in the Python's Developer Guide.

Author	miurahr
Recipients	malin, miurahr
Date	2020-07-13.02:34:03
Message-id	<1594607643.91.0.354553409074.issue41210@roundup.psfhosted.org>
In-reply-to

Content

Lasse Collin gave me an explanation of the LZMA1 data format and a suggestion on how to implement decoding of it.

I'd like to change this issue to a documentation issue, to add more description of the limitations of FORMAT_ALONE and LZMA1.

The suggestion from Lasse is as follows:

> liblzma cannot be used to decode data from .7z files except in certain
> cases. This isn't a bug, it's a missing feature.
>
> The raw encoder and decoder APIs only support streams that contain an
> end of payload marker (EOPM) alias end of stream (EOS) marker. .7z
> files use LZMA1 without such an end marker. Instead, the end is handled
> by the decoder knowing the exact uncompressed size of the data.
>
> The API of liblzma supports LZMA1 without end marker via
> lzma_alone_decoder(). That API can be abused to properly decode raw
> LZMA1 with known uncompressed size by feeding the decoder a fake 13-byte
> header. Everything else in the public API requires some end marker.
>
> Decoding LZMA1 without BCJ or other extra filters from .7z with
> lzma_raw_decoder() kind of works but you will notice that it will never
> return LZMA_STREAM_END, only LZMA_OK. This is because it will never see
> an end marker. A minor downside is that it won't then do a small
> integrity check at the end either (one variable in the range decoder
> state will be 0 at the end of any valid LZMA1 stream);
> lzma_alone_decoder() does this check even when end marker is missing.
>
> If you use lzma_raw_decoder() for end-markerless LZMA1, make sure that
> you never give it more output space than the real uncompressed size. In
> rare cases this could result in extra output or an error since the
> decoder would try to decode more output using the input it has gotten
> so far. Overall I think the hack with lzma_alone_decoder() is a better
> way with the current API.
>
> BCJ filters process the input data in chunks of a few bytes long, thus
> they need to hold a few bytes of look-ahead buffer. With some filters
> like ARM the look-ahead is aligned and if the uncompressed size is a
> multiple of this alignment, lzma_raw_decoder() will give you all the
> data even when the LZMA1 layer doesn't have an end marker. The x86
> filter has one-byte alignment but needs to see five bytes from the
> future before producing output. When LZMA1 layer doesn't return
> LZMA_STREAM_END, the x86 filter doesn't know that the end was reached
> and cannot flush the last bytes out.
>
> Using liblzma to decode .7z works in these cases:
>
> - LZMA1 using a fake 13-byte header with lzma_alone_decoder():
>
>       1 byte   LZMA properties byte that encodes lc/lp/pb
>       4 bytes  dictionary size as little endian uint32_t
>       8 bytes  uncompressed size as little endian uint64_t;
>                UINT64_MAX means unknown and then (and only then)
>                EOPM must be present
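
To make the fake-header approach above concrete, here is a minimal Python sketch using the standard `lzma` module. The helper names `build_alone_header` and `decode_raw_lzma1` are hypothetical and only for illustration. The sketch assumes the properties byte is packed as `(pb * 5 + lp) * 9 + lc`, as in the .lzma file format, and that `lzma.FORMAT_ALONE` maps to `lzma_alone_decoder()`.

```python
import lzma
import struct

# Header value meaning "uncompressed size unknown"; the stream must then
# end with an end of payload marker (EOPM).
UINT64_MAX = 0xFFFFFFFFFFFFFFFF


def build_alone_header(lc, lp, pb, dict_size, uncompressed_size=None):
    """Hypothetical helper: build the 13-byte FORMAT_ALONE (.lzma) header."""
    props = (pb * 5 + lp) * 9 + lc  # 1 byte encoding lc/lp/pb
    size = UINT64_MAX if uncompressed_size is None else uncompressed_size
    # "<BIQ" = little endian, no padding: uint8, uint32, uint64 (13 bytes)
    return struct.pack("<BIQ", props, dict_size, size)


def decode_raw_lzma1(raw, lc, lp, pb, dict_size, uncompressed_size=None):
    """Hypothetical helper: decode raw LZMA1 by prepending a fake header."""
    header = build_alone_header(lc, lp, pb, dict_size, uncompressed_size)
    decoder = lzma.LZMADecompressor(format=lzma.FORMAT_ALONE)
    return decoder.decompress(header + raw)
```

For LZMA1 data taken from a .7z file, `uncompressed_size` would be the exact size recorded in the archive metadata, and lc/lp/pb and the dictionary size would come from the coder properties stored there. Leaving `uncompressed_size` as `None` is only valid when the stream carries an end marker.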

History
Date	User	Action	Args
2020-07-13 02:34:03	miurahr	set	recipients: + miurahr, malin
2020-07-13 02:34:03	miurahr	set	messageid: <1594607643.91.0.354553409074.issue41210@roundup.psfhosted.org>
2020-07-13 02:34:03	miurahr	link	issue41210 messages
2020-07-13 02:34:03	miurahr	create