classification
Title: zlib set dictionary support inflateSetDictionary
Type: enhancement Stage: resolved
Components: Extension Modules Versions: Python 3.3
process
Status: closed Resolution: fixed
Dependencies: Superseder:
Assigned To: Nosy List: Jim.Jewett, Sam.Rushing, eric.araujo, jcea, nadeem.vawda, python-dev
Priority: normal Keywords: needs review, patch

Created on 2012-04-27 20:49 by Sam.Rushing, last changed 2012-06-21 23:51 by python-dev. This issue is now closed.

Files
File name Uploaded Description Edit
zlib_set_dictionary.patch Sam.Rushing, 2012-04-27 20:49 adds inflateSetDictionary(), deflateSetDictionary() review
zlib_set_dictionary_2.patch Sam.Rushing, 2012-05-03 23:48 set dictionary in constructors review
tz2.py Sam.Rushing, 2012-05-03 23:50 verify correct stream behavior with dictionary
zlib_set_dictionary_3.patch Sam.Rushing, 2012-05-04 00:51 review
zlib_set_dictionary_4.patch Sam.Rushing, 2012-05-05 19:33 review
Messages (23)
msg159492 - (view) Author: Sam Rushing (Sam.Rushing) Date: 2012-04-27 20:49
Google's SPDY protocol requires the use of a pre-defined compression dictionary.  The current zlib module doesn't expose the two functions for setting the dictionary.

This patch is minimal in the sense that it only exposes the two functions, but unfortunately the sequence of zlib calls required is clumsy: a call to inflate() must fail first (with an error of Z_NEED_DICT):


import zlib
zdict = b"thequickbrownfoxjumped\x00"
c = zlib.compressobj()
c.set_dictionary (zdict)
cd = c.compress (b"the quick brown fox jumped over the candlestick")
cd += c.flush()
d = zlib.decompressobj()
try:
    print (d.decompress (cd))
except zlib.error as what:
    if what.args[0].startswith ('Error 2 '):
        d.set_dictionary (zdict)
        print (d.flush())

Obviously a better way to catch/match Z_NEED_DICT would be nice.
A much nicer solution would allow you to associate the dictionary at creation time, rather than having to wait for an error condition.  I can only guess that the zlib authors designed it that way for a reason?
msg159870 - (view) Author: Jesús Cea Avión (jcea) * (Python committer) Date: 2012-05-03 19:31
A dictionary could be provided an init time. Then, the Z_NEED_DICT could be intercepted in the binding and automatically inject the dictionary provided in the init.

Anyway, for a patch to be approved, we need a test too.

PS: Why is this NOT targeted to 3.3?. We have time, yet.
msg159886 - (view) Author: Sam Rushing (Sam.Rushing) Date: 2012-05-03 23:04
I'm currently reworking this so that the dictionaries are provided in the constructor, and inflateSetDictionary() is called automatically.  I've gone over the zlib RFC's and zlibmodule.c, and I'm fairly certain that whatever usage mode might involve multiple calls to SetDictionary() couldn't be supported by the zlib object anyway.

r.e. 3.3/3.4 - I merely chose the highest version number I saw, since this diff is against HEAD (as recommended by the docs).  It's been quite a few years since I've submitted a patch to CPython!
msg159887 - (view) Author: Jesús Cea Avión (jcea) * (Python committer) Date: 2012-05-03 23:38
Retargetting to python 3.3.

If you hurry a bit and I find your patch acceptable (remember the tests!), I will try to integrate it.
msg159888 - (view) Author: Sam Rushing (Sam.Rushing) Date: 2012-05-03 23:48
Ok, here's the patch.  It has a single short test.  For use with SPDY, it's necessary to test that the following stream data also correctly decompresses, I'll attach that to the next comment.
msg159889 - (view) Author: Sam Rushing (Sam.Rushing) Date: 2012-05-03 23:50
This test is rather large, since it includes the predefined SPDY draft 2 dictionary, and some real-world data.  Not sure what the policy is on including so much data in a test.  If there's enough time I could make a smaller test that also verifies the correct behavior on a stream...
msg159890 - (view) Author: Sam Rushing (Sam.Rushing) Date: 2012-05-04 00:00
Argh, probably need to add the 'dict' field to the copy() method.
msg159893 - (view) Author: Sam Rushing (Sam.Rushing) Date: 2012-05-04 00:50
Updated version of the patch: extends the test, including a test of the streaming behavior needed for SPDY (both compression and decompression).

Also wik: copy()/uncopy() are aware of the 'dict' attribute.
msg159897 - (view) Author: Éric Araujo (eric.araujo) * (Python committer) Date: 2012-05-04 03:33
Added a few comments on Rietveld.
msg160004 - (view) Author: Nadeem Vawda (nadeem.vawda) * (Python committer) Date: 2012-05-05 16:43
I've posted a review on Rietveld.
msg160028 - (view) Author: Sam Rushing (Sam.Rushing) Date: 2012-05-05 19:33
renames dict->zdict, splits the test, adds BEGIN/END around inflate call.
msg161082 - (view) Author: Jesús Cea Avión (jcea) * (Python committer) Date: 2012-05-18 19:42
Status of this feature?. Ready to integrate?
msg161093 - (view) Author: Nadeem Vawda (nadeem.vawda) * (Python committer) Date: 2012-05-19 06:08
The code should be changed to use the buffer API (instead of accepting
only bytes objects). Other than that, I think it's ready for integration.
msg163118 - (view) Author: Jesús Cea Avión (jcea) * (Python committer) Date: 2012-06-18 20:21
Sam, the window for Python 3.3 integration is almost close. Could you possibly update your patch with Nadeem's feedback?.
msg163130 - (view) Author: Sam Rushing (Sam.Rushing) Date: 2012-06-19 01:00
I think other than the disagreement about whether the dictionary constructor arg should be a buffer object, it's good to go.
To restate my position: the need is for an immutable string of bytes, and that's exactly what PyBytes_Type is for.  I see no advantage to allowing a buffer object, which will require extra code to check that it is both a buffer object and set to be readonly.  I believe the argument for aesthetics does not apply, as the constant dictionary constructor argument is a morally different kind of parameter, comparable to (say) the compression level.

You folks are of course welcome to change it, though. 8^)
msg163147 - (view) Author: Nadeem Vawda (nadeem.vawda) * (Python committer) Date: 2012-06-19 08:57
> To restate my position: the need is for an immutable string of bytes, [...]

I disagree that we should require the dictionary to be immutable - if the
caller wishes to use a mutable buffer here, it is their responsibility to
ensure that it is not modified until the compressor is finished with it
("consenting adults" and all that). The documentation can inform users of
this requirement.


> I believe the argument for aesthetics does not apply, as the constant
> dictionary constructor argument is a morally different kind of
> parameter, comparable to (say) the compression level.

Even so, the surrounding code sets a precedent for how it accepts binary
data buffers, and deviating from this existing convention should not be
taken lightly.


Nitpicking about the API aside, thanks for the patch :-)
msg163199 - (view) Author: Jesús Cea Avión (jcea) * (Python committer) Date: 2012-06-19 18:21
So my question is easy: could we apply this patch as is and defer any "improvement" to 3.4?. The risk of not doing so would be to miss 3.3 completely.
msg163228 - (view) Author: Nadeem Vawda (nadeem.vawda) * (Python committer) Date: 2012-06-19 22:28
I plan to commit it (along with the buffer API changes) tomorrow.
msg163298 - (view) Author: Roundup Robot (python-dev) Date: 2012-06-21 00:20
New changeset dd4f7d5c51c7 by Nadeem Vawda in branch 'default':
Issue #14684: Add support for predefined compression dictionaries to the zlib module.
http://hg.python.org/cpython/rev/dd4f7d5c51c7
msg163300 - (view) Author: Nadeem Vawda (nadeem.vawda) * (Python committer) Date: 2012-06-21 00:27
Committed. Once again, thanks for the patch!
msg163339 - (view) Author: Jim Jewett (Jim.Jewett) Date: 2012-06-21 15:24
Just saw this on the checkins list; where are the other options documented? 

"""
  PyDoc_STRVAR(compressobj__doc__,
-"compressobj([level]) -- Return a compressor object.\n"
+"compressobj([level[, method[, wbits[, memlevel[, strategy[, zdict]]]]]])\n"
+" -- Return a compressor object.\n"
 "\n"
-"Optional arg level is the compression level, in 1-9.");
+"Optional arg level is the compression level, in 1-9.\n"
+"\n"
+"Optional arg zdict is the predefined compression dictionary - a sequence of\n"
+"bytes containing subsequences that are likely to occur in the input data.");
"""

I'm honestly not certain what they should be, but the following is my best guess:

"""
 PyDoc_STRVAR(compressobj__doc__,
 "compressobj([level[, method[, wbits[, memlevel[, strategy[, zdict]]]]]])\n"
 " -- Return a compressor object.\n"
 "\n"
-"Optional arg level is the compression level, in 1-9.\n"
+"Optional arg level (1-9) is the compression level.\n"
+"Larger numbers take longer, but produce smaller results.\n"
 "\n"
+"Optional arg method is the compression method.\n"
+"The only currently supported method is zlib.DEFLATED.\n"
+"\n"
+"Optional arg wbits determines the window buffer size.\n"
+"Normal values are 8 (least memory) to 15 (best compression).\n"
+"\n"
+"Optional arg memlevel (1-9) controls working memory size.\n"
+"Larger numbers use more memory, but produce smaller results more quickly.\n"
+"\n"
+"Optional arg strategy tunes the compression algorithm.\n"
+"Supported options include zlib.Z_DEFAULT_STRATEGY, zlib.Z_FILTERED, and zlib.Z_HUFFMAN_ONLY.\n"
+"\n"
+"Optional arg zdict is the predefined compression dictionary - a sequence of\n"
+"bytes containing subsequences that are likely to occur in the input data.");
"""
msg163341 - (view) Author: Nadeem Vawda (nadeem.vawda) * (Python committer) Date: 2012-06-21 15:36
> Just saw this on the checkins list; where are the other options documented? 

They aren't, AFAIK. I've been planning on adding them when I've got time
(based on the zlib manual at http://zlib.net/manual.html), but with the
upcoming feature freeze for 3.3, this issue was higher priority.
msg163380 - (view) Author: Roundup Robot (python-dev) Date: 2012-06-21 23:51
New changeset 1cfa44cb5af0 by Nadeem Vawda in branch 'default':
Document the rest of zlib.compressobj()'s arguments.
http://hg.python.org/cpython/rev/1cfa44cb5af0
History
Date User Action Args
2012-06-21 23:51:41python-devsetmessages: + msg163380
2012-06-21 15:36:20nadeem.vawdasetmessages: + msg163341
2012-06-21 15:24:40Jim.Jewettsetnosy: + Jim.Jewett
messages: + msg163339
2012-06-21 00:27:01nadeem.vawdasetstatus: open -> closed
resolution: fixed
messages: + msg163300

stage: patch review -> resolved
2012-06-21 00:20:56python-devsetnosy: + python-dev
messages: + msg163298
2012-06-19 22:28:19nadeem.vawdasetmessages: + msg163228
2012-06-19 18:21:05jceasetmessages: + msg163199
2012-06-19 08:57:15nadeem.vawdasetmessages: + msg163147
2012-06-19 01:01:00Sam.Rushingsetmessages: + msg163130
2012-06-18 20:21:36jceasetmessages: + msg163118
2012-05-19 06:08:09nadeem.vawdasetmessages: + msg161093
2012-05-18 19:58:31eric.araujosetkeywords: + needs review
stage: patch review
2012-05-18 19:42:30jceasetmessages: + msg161082
2012-05-05 19:33:31Sam.Rushingsetfiles: + zlib_set_dictionary_4.patch

messages: + msg160028
2012-05-05 16:43:06nadeem.vawdasetmessages: + msg160004
2012-05-05 10:01:11nadeem.vawdasetnosy: + nadeem.vawda
2012-05-04 03:33:02eric.araujosetnosy: + eric.araujo
messages: + msg159897
2012-05-04 00:51:05Sam.Rushingsetfiles: + zlib_set_dictionary_3.patch
2012-05-04 00:50:27Sam.Rushingsetmessages: + msg159893
2012-05-04 00:00:22Sam.Rushingsetmessages: + msg159890
2012-05-03 23:50:04Sam.Rushingsetfiles: + tz2.py

messages: + msg159889
2012-05-03 23:48:08Sam.Rushingsetfiles: + zlib_set_dictionary_2.patch

messages: + msg159888
2012-05-03 23:39:10jceasetassignee: roger.serwy ->

nosy: - roger.serwy
2012-05-03 23:38:24jceasetversions: + Python 3.3, - Python 3.4
nosy: + roger.serwy

messages: + msg159887

assignee: roger.serwy
2012-05-03 23:04:16Sam.Rushingsetmessages: + msg159886
2012-05-03 19:31:39jceasetnosy: + jcea
messages: + msg159870
2012-04-27 20:49:09Sam.Rushingcreate