classification
Title: zlib can't decompress DEFLATE using shared dictionary
Type: behavior Stage: resolved
Components: Extension Modules Versions: Python 3.6, Python 3.5
process
Status: closed Resolution: fixed
Dependencies: Superseder:
Assigned To: Nosy List: Vladimir Mihailenco, gregory.p.smith, martin.panter, nadeem.vawda, python-dev, twouters, xiang.zhang
Priority: normal Keywords: patch

Created on 2016-05-31 09:41 by Vladimir Mihailenco, last changed 2016-06-06 01:48 by martin.panter. This issue is now closed.

Files
File name Uploaded Description Edit
issue27164.patch xiang.zhang, 2016-06-01 11:02 review
issue27164_with_test.patch xiang.zhang, 2016-06-01 14:59 review
raw-zdict.v3.patch martin.panter, 2016-06-05 06:21 review
Messages (9)
msg266747 - (view) Author: Vladimir Mihailenco (Vladimir Mihailenco) Date: 2016-05-31 09:41
Consider following code:

```
import zlib


hello = b'hello'

# Compress data using deflate and shared/predefined dict.
co = zlib.compressobj(wbits=-zlib.MAX_WBITS, zdict=hello)
data = co.compress(hello) + co.flush()

# Decompress deflate by providing same dict.
do = zlib.decompressobj(wbits=-zlib.MAX_WBITS, zdict=hello)
data = do.decompress(data)

print(data)
```

Decompression fails with "zlib.error: Error -3 while decompressing data: invalid distance too far back".

My original task was to decompress data I get from Golang library - https://golang.org/pkg/compress/flate/#NewWriterDict
msg266804 - (view) Author: Xiang Zhang (xiang.zhang) * (Python committer) Date: 2016-06-01 07:48
This seems related with the difference between zlib format and raw format. When you do raw inflate, you have to inflateSetDictionary before any inflate. You cannot rely on the first inflate to return Z_NEED_DICT and then do inflateSetDictionary.

Although I have the solution with experiment, but there is nothing can be found in the official zlib documentation. It is rather vague. I'll make more research and submit a patch later..

Right now, you can use zlib format (positive wbits) if it can achieve your goal. It's behaviour is right.
msg266814 - (view) Author: Xiang Zhang (xiang.zhang) * (Python committer) Date: 2016-06-01 11:02
I think my conclusion in last message is right. A similar bug in CPAN can be found in https://rt.cpan.org/Public/Bug/Display.html?id=36046. And I can find in zlib's changelog inflateSetDictionary starts to support raw inflate from 1.2.2.1.

Patch attached now allows Python to decompress raw deflate format with a dictionary. Hope some core devs are willing to review.
msg266819 - (view) Author: Xiang Zhang (xiang.zhang) * (Python committer) Date: 2016-06-01 14:59
Forget to add test. Upload a new one with test included.
msg266871 - (view) Author: Martin Panter (martin.panter) * (Python committer) Date: 2016-06-02 03:59
Thanks for the patches. Maybe Gregory knows more about this than I do (I never used zdict). But I think I have figured it out today, so I left some review comments :)

I’m not sure whether this really is a bug fix, although it would have little if any impact on existing code. It seems zdict was implemented with just the non-raw zlib format in mind, so supporting raw deflate mode is a new feature. If we were to add full support of inflateSetDictionary() for raw deflate mode, I imagine it would be a new decompressobj.set_dictionary() method.

I did some quick tests with specifying an unnecessary zdict buffer to the decompressor (when zdict not used for compression), and it doesn’t _seem_ to break anything, but I am not 100% sure.

Notes from my learning what zdict means:

Zdict seems to be a bunch of user data to setup or train the (de)compressor with. It was added by Issue 14684. (Not a Python dictionary, nor any special zlib dictionary format. The Python documentation is confusing, and the Go documentation made more sense, assuming it is equivalent.)

See the documentation of in/deflateSetDictionary() in <http://www.zlib.net/manual.html> or zlib.h. Looking through the history at <https://github.com/madler/zlib> can also be useful.
msg266876 - (view) Author: Xiang Zhang (xiang.zhang) * (Python committer) Date: 2016-06-02 05:59
Thanks for your review, Martin. :) I have replied with after thinking and try.

Yes, this is more like a feature request. But for the end user, this behaviour is quite like a bug.

> If we were to add full support of inflateSetDictionary() for raw deflate mode, I imagine it would be a new decompressobj.set_dictionary() method.

I don't support adding a new set_dictionary() method. Right now decompressobj handles raw deflate format, zlib format and gzip format. Adding a set_dictionary() method for raw deflate format seems weird. If we are going to do so, I suggest reorganize the zlib like PHP and nodejs do, separating different formats into different objects.

> I did some quick tests with specifying an unnecessary zdict buffer to the decompressor (when zdict not used for compression), and it doesn’t _seem_ to break anything, but I am not 100% sure.

This patch only affects raw inflate. You have to set wbits negative. If you are decompressing with an unnecessary zdict, I think it ought to emit an error.

With Python background, zlib dictionary is really hard to understand. I think manual doesn't tell much. I understand why zdict is needed until reading https://blog.cloudflare.com/improving-compression-with-preset-deflate-dictionary/.
msg267391 - (view) Author: Martin Panter (martin.panter) * (Python committer) Date: 2016-06-05 06:21
Here is a patch which I propose to use that factors out a set_inflate_zdict() function. I think this will help with maintainence, e.g. if someone changes the UINT_MAX checking, but forgets that there are two places than need changing. Let me know if that is okay with you :)
msg267405 - (view) Author: Xiang Zhang (xiang.zhang) * (Python committer) Date: 2016-06-05 09:05
It looks good. I don't think that is a matter since usually I will do a search before change. But of course extracting them to a subroutine helps.
msg267416 - (view) Author: Roundup Robot (python-dev) (Python triager) Date: 2016-06-05 12:36
New changeset d91b951e676f by Martin Panter in branch '3.5':
Issue #27164: Allow decompressing raw Deflate streams with predefined zdict
https://hg.python.org/cpython/rev/d91b951e676f

New changeset 470954641f3b by Martin Panter in branch 'default':
Issue #27164: Merge raw Deflate zdict support from 3.5
https://hg.python.org/cpython/rev/470954641f3b
History
Date User Action Args
2016-06-06 01:48:25martin.pantersetstatus: open -> closed
resolution: fixed
stage: commit review -> resolved
2016-06-05 12:36:56python-devsetnosy: + python-dev
messages: + msg267416
2016-06-05 09:05:31xiang.zhangsetmessages: + msg267405
2016-06-05 06:21:19martin.pantersetfiles: + raw-zdict.v3.patch
versions: + Python 3.6
messages: + msg267391

components: + Extension Modules, - Library (Lib)
stage: patch review -> commit review
2016-06-02 18:50:16gregory.p.smithsetassignee: gregory.p.smith ->
2016-06-02 05:59:08xiang.zhangsetmessages: + msg266876
2016-06-02 03:59:46martin.pantersetmessages: + msg266871
stage: patch review
2016-06-01 21:30:16gregory.p.smithsetassignee: gregory.p.smith

nosy: + gregory.p.smith
2016-06-01 14:59:58xiang.zhangsetfiles: + issue27164_with_test.patch

messages: + msg266819
2016-06-01 11:02:41xiang.zhangsetfiles: + issue27164.patch
keywords: + patch
messages: + msg266814
2016-06-01 07:48:27xiang.zhangsetnosy: + twouters, nadeem.vawda, martin.panter
messages: + msg266804
2016-05-31 10:33:32xiang.zhangsetnosy: + xiang.zhang
2016-05-31 09:41:03Vladimir Mihailencocreate