transparent gzip compression in urllib #43521

Open
antialize mannequin opened this issue Jun 19, 2006 · 23 comments
Assignees
Labels
stdlib Python modules in the Lib dir type-feature A feature request or enhancement

Comments

@antialize
Mannequin

antialize mannequin commented Jun 19, 2006

BPO 1508475
Nosy @rhettinger, @jcea, @orsenthil, @vstinner, @merwok, @berkerpeksag, @vadmium, @JimJJewett, @serhiy-storchaka, @demianbrecht
Files
  • urllib2-gzip.patch: urllib2-gzip.patch
  • issue1508475.diff
  • http_client_gzip.patch
  • Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.

    GitHub fields:

    assignee = 'https://github.com/orsenthil'
    closed_at = None
    created_at = <Date 2006-06-19.08:59:09.000>
    labels = ['type-feature', 'library']
    title = 'transparent gzip compression in urllib'
    updated_at = <Date 2015-03-07.02:32:39.334>
    user = 'https://bugs.python.org/antialize'

    bugs.python.org fields:

    activity = <Date 2015-03-07.02:32:39.334>
    actor = 'demian.brecht'
    assignee = 'orsenthil'
    closed = False
    closed_date = None
    closer = None
    components = ['Library (Lib)']
    creation = <Date 2006-06-19.08:59:09.000>
    creator = 'antialize'
    dependencies = []
    files = ['7335', '19811', '34177']
    hgrepos = []
    issue_num = 1508475
    keywords = ['patch']
    message_count = 23.0
    messages = ['50500', '50501', '114671', '114725', '114726', '122342', '122343', '122351', '158315', '158380', '158400', '160355', '160384', '163925', '163935', '211893', '212252', '212473', '213982', '226258', '226399', '234363', '236013']
    nosy_count = 18.0
    nosy_names = ['rhettinger', 'jjlee', 'jcea', 'orsenthil', 'jerub', 'vstinner', 'antialize', 'ruseel', 'nadeem.vawda', 'thomaspinckney3', 'eric.araujo', 'abacabadabacaba', 'jcon', 'berker.peksag', 'martin.panter', 'Jim.Jewett', 'serhiy.storchaka', 'demian.brecht']
    pr_nums = []
    priority = 'high'
    resolution = None
    stage = 'patch review'
    status = 'open'
    superseder = None
    type = 'enhancement'
    url = 'https://bugs.python.org/issue1508475'
    versions = ['Python 3.5']

    @antialize
    Mannequin Author

    antialize mannequin commented Jun 19, 2006

    Some web servers support gzipping content before sending
    it. This patch adds transparent support for this in
    urllib2 (documentation: http://www.http-compression.com/).

    This patch *requires* patch #914340 as a
    prerequisite, as that enables stream support in the
    gzip library.
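
For context, here is a minimal sketch of what a caller has to do by hand when the library does not decode gzip itself. It assumes Python 3 and the standard `gzip` module; the function name `decode_body` is hypothetical and is not taken from the attached patch.

```python
import gzip


def decode_body(headers, body):
    """Return the decoded body for a response given as (header dict, raw bytes).

    Hypothetical helper: `headers` is a plain dict keyed by header name.
    """
    encoding = headers.get("Content-Encoding", "").strip().lower()
    if encoding == "gzip":
        return gzip.decompress(body)
    # Unknown or absent encoding: hand the bytes back untouched.
    return body


# Simulate a server that compressed the payload before sending it.
wire_bytes = gzip.compress(b"hello world")
plain = decode_body({"Content-Encoding": "gzip"}, wire_bytes)
```

The request asks for this step to happen inside the library, so that callers only ever see the decoded bytes.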

    @antialize antialize mannequin added extension-modules C modules in the Modules dir labels Jun 19, 2006
    @jjlee
    Mannequin

    jjlee mannequin commented Jan 30, 2007

    Looks good.

    This needs tests and docs. As a new feature, this could not be released until Python 2.6.

    It would be nice to have support for managing content negotiation in general, but that wish isn't an obstacle to this patch.

    @devdanzin devdanzin mannequin added stdlib Python modules in the Lib dir type-feature A feature request or enhancement labels Feb 12, 2009
    @devdanzin devdanzin mannequin added easy labels Apr 22, 2009
    @BreamoreBoy
    Mannequin

    BreamoreBoy mannequin commented Aug 22, 2010

    @jakob could you provide an updated patch for py3k that includes unit test and doc changes?

    @antialize
    Mannequin Author

    antialize mannequin commented Aug 23, 2010

    No, I have long since moved on to other things.

    @orsenthil
    Member

    It's okay, Jakob, we will take it forward.

    @merwok merwok removed the extension-modules C modules in the Modules dir label Nov 20, 2010
    @merwok merwok changed the title transparent gzip compression in liburl2 transparent gzip compression in urllib Nov 20, 2010
    @orsenthil
    Member

    The transparent gzip Content-Encoding support should be done at the
    http.client level code.

    Before adding this feature, a question needs to be sorted out.

    If we support transparent gzip and wrap the file pointer in a
    GzipFile file pointer, should we reset the Content-Length value?

    What if a user of urllib is relying on the Content-Length of response
    to do something further?

    I observed that google-chrome returns the uncompressed output (which
    is correct for a browser), but has the Content-Length set to the
    compressed output length.
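
The mismatch behind this question can be shown with a small, self-contained sketch. It only uses the standard `gzip` module; the payload and variable names are illustrative assumptions.

```python
import gzip

payload = b"a" * 1000
wire_body = gzip.compress(payload)

# What a server would send alongside "Content-Encoding: gzip":
# the length of the bytes on the wire, not of the decoded content.
content_length = len(wire_body)

decoded = gzip.decompress(wire_body)

# After transparent decoding, the header no longer matches what the
# caller actually reads.
mismatch = content_length != len(decoded)
```

A caller that reads exactly `Content-Length` bytes from a transparently decoded stream would therefore stop short of the real content.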

    @orsenthil
    Member

    Patch for py3k.

    @merwok
    Member

    merwok commented Nov 25, 2010

    @serhiy-storchaka
    Member

    What if the gzip module is not available?

    I think that transparent decompression should delete the Content-Encoding header (so the user does not decompress a second time) and the Content-Length header (which would be wrong).
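
A minimal sketch of that suggestion, assuming the headers are held in an `email.message.Message` (the base class of `http.client.HTTPMessage`). The function name `decode_and_fix_headers` is hypothetical.

```python
import gzip
from email.message import Message


def decode_and_fix_headers(headers, body):
    """Decode a gzip body and drop the headers that no longer apply."""
    if headers.get("Content-Encoding", "").strip().lower() != "gzip":
        return headers, body
    decoded = gzip.decompress(body)
    # Message.__delitem__ does not raise if the header is missing.
    del headers["Content-Encoding"]  # prevents a second decompression
    del headers["Content-Length"]    # described the compressed length
    return headers, decoded


# Fabricated response for illustration.
raw = gzip.compress(b"data")
hdrs = Message()
hdrs["Content-Encoding"] = "gzip"
hdrs["Content-Length"] = str(len(raw))
hdrs["Content-Type"] = "text/plain"
hdrs, body = decode_and_fix_headers(hdrs, raw)
```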

    @orsenthil
    Member

    In that case, transparent decompression should not be available. (The
    request header should not be sent, and the response won't be compressed.)
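
One way to express this fallback is a guarded import, sketched below. The function name `default_request_headers` is a hypothetical illustration, not an existing http.client or urllib API.

```python
try:
    import gzip
except ImportError:  # e.g. an interpreter built without zlib
    gzip = None


def default_request_headers():
    """Advertise gzip only when we would be able to decode it."""
    headers = {}
    if gzip is not None:
        headers["Accept-Encoding"] = "gzip"
    return headers


request_headers = default_request_headers()
```

Note that a server may still send gzip unprompted, so the response side needs its own check as well.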

    @serhiy-storchaka
    Member

    The patch for py3k also has the disadvantage that the content is decoded even if the user has defined a Content-Encoding and is going to process the compressed response himself.

    @thomaspinckney3
    Mannequin

    thomaspinckney3 mannequin commented May 10, 2012

    What if this gzip decompression was optional and controlled via a flag or handler instead of making it automagic?

    It's not entirely trivial to implement so it is nice to have the option of this happening automatically if one wishes.

    Then, the caller would be aware that Content-length / Accept-encoding / Content-encoding etc have been modified iff they requested gzip decompression.
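
The handler variant could look roughly like the sketch below. It relies on the documented `urllib.request.BaseHandler` protocol (`http_request` and `http_response` methods), but the class `GzipHandler` itself is hypothetical and is not one of the attached patches. It reads the whole body into memory, which a real implementation would likely avoid.

```python
import gzip
import io
import urllib.request
import urllib.response
from email.message import Message


class GzipHandler(urllib.request.BaseHandler):
    """Hypothetical opt-in handler; not part of the standard library."""

    def http_request(self, req):
        # Add the header only if the caller has not set one already.
        if not req.has_header("Accept-encoding"):
            req.add_unredirected_header("Accept-encoding", "gzip")
        return req

    def http_response(self, req, response):
        headers = response.info()
        if headers.get("Content-Encoding", "").strip().lower() != "gzip":
            return response
        body = gzip.decompress(response.read())
        # These headers described the compressed representation.
        del headers["Content-Encoding"]
        del headers["Content-Length"]
        new = urllib.response.addinfourl(
            io.BytesIO(body), headers, response.geturl(), response.getcode())
        new.msg = getattr(response, "msg", "")
        return new

    https_request = http_request
    https_response = http_response


# Callers opt in by building an opener with the handler installed.
opener = urllib.request.build_opener(GzipHandler)

# Offline demonstration with a fabricated response (no network access).
raw = gzip.compress(b"hello")
fake_headers = Message()
fake_headers["Content-Encoding"] = "gzip"
fake_headers["Content-Length"] = str(len(raw))
fake = urllib.response.addinfourl(
    io.BytesIO(raw), fake_headers, "http://example.invalid/", 200)
request = urllib.request.Request("http://example.invalid/")
handler = GzipHandler()
handler.http_request(request)
decoded_response = handler.http_response(request, fake)
decoded_body = decoded_response.read()
```

Because installing the handler is an explicit choice, the caller knows the headers may have been rewritten.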

    @merwok
    Member

    merwok commented May 10, 2012

    Enabled by default with a knob to turn it off sounds good. Maybe the original headers could be preserved in some object.

    @merwok merwok removed easy labels May 10, 2012
    @serhiy-storchaka
    Member

    The first step is to answer the fundamental question: at what level should transparent decompression work? At the http.client level or at the urllib level? A patch for the first case will be much more difficult, but other HTTP-based protocols will then benefit from compression as well.

    @orsenthil
    Member

    I think the transparent compression should work at the http.client level. I also agree with the other points made by Serhiy:

    • Transparent decompression should delete the Content-Encoding and Content-Length headers (this is as per the RFC too).

    • It should not decode the body itself if the user has explicitly specified an intent to use Content-Encoding: gzip and is ready to do the decompression himself.

    • This transparent compression/decompression would require the availability of the gzip module; if it is not available, the feature may be disabled and the normal request-response cycle would proceed.

    • I think having it 'ON' with a flag to switch it 'OFF' would be more desirable than providing this feature via a Handler. The reason is that it can help the performance of any request to servers that support it, and browsers have adopted a similar approach too.
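
The points above amount to a small decision rule, sketched here under stated assumptions: `request_headers` holds only the headers the caller supplied explicitly, both header collections are plain dicts, and the function name `should_decode` and its parameters are hypothetical.

```python
def should_decode(request_headers, response_headers,
                  enabled=True, gzip_available=True):
    """Decide whether the library should decode the response body itself."""
    # The feature is on by default but can be switched off, and it is
    # silently disabled when the gzip module is missing.
    if not enabled or not gzip_available:
        return False
    # The caller negotiated an encoding explicitly: leave the body alone.
    if any(name.lower() == "accept-encoding" for name in request_headers):
        return False
    encoding = ""
    for name, value in response_headers.items():
        if name.lower() == "content-encoding":
            encoding = value.strip().lower()
    return encoding == "gzip"


decision = should_decode({}, {"Content-Encoding": "gzip"})
```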

    @vstinner
    Member

    I updated bpo-1508475.diff for Python 3.4 and removed the change in urllib: http_client_gzip.patch. The patch only changes http.client to support servers sending gzip data.

    For example, the new python.org website serves gzip data even if the Accept-Encoding header is not sent by the client: see the issue bpo-20719.

    @vadmium
    Member

    vadmium commented Feb 26, 2014

    I have code that already handles a “gzip” encoded response from urlopen(). All three patches leave the Content-Encoding header intact, so I suspect my code would try to decompress the body a second time. Deleting this header (as already suggested) would work for me.

    @rhettinger
    Contributor

    Victor, the patch looks good and would be a welcome enhancement.

    There should be an option for turning this on and off (perhaps, I want the zipped content and want to unzip later or in a different thread).

    Consider adding support for "deflate" as well.
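
For "deflate", a minimal sketch using the standard `zlib` module follows. Servers are known to send either zlib-wrapped data or a raw deflate stream under this label, so the sketch tries both; the function name `decode_deflate` is hypothetical.

```python
import zlib


def decode_deflate(body):
    """Decode a "Content-Encoding: deflate" body, tolerating raw streams."""
    try:
        # zlib-wrapped deflate (RFC 1950).
        return zlib.decompress(body)
    except zlib.error:
        # Raw deflate (RFC 1951): a negative wbits means "no header".
        return zlib.decompress(body, -zlib.MAX_WBITS)


wrapped = zlib.compress(b"hello deflate")

raw_compressor = zlib.compressobj(wbits=-zlib.MAX_WBITS)
raw = raw_compressor.compress(b"hello deflate") + raw_compressor.flush()

from_wrapped = decode_deflate(wrapped)
from_raw = decode_deflate(raw)
```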

    @jimjjewett
    Mannequin

    jimjjewett mannequin commented Mar 18, 2014

    This is an enhancement, so I am changing the affected version from 3.3 to 3.5.

    It is python-only, which works well with the cheeseshop.

    That said, the patch is truly short; if that is really sufficient, it could almost go into the documentation as a recipe. But I would prefer some more assurances that it actually does work; a quick skim suggests that it relies on a superclass happening to implement read via readinto.

    Needs tests and documentation change.

    @vadmium
    Member

    vadmium commented Sep 2, 2014

    I think the patch is indeed a bit short; for instance, it looks like calling read() without a size limit could bypass the decoding.
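
One way to make every read path go through the decoder is to subclass `io.RawIOBase`, which implements `read(size)` and `read()` in terms of `readinto()`. The class `GzipReader` below is a hypothetical sketch, not the code from the patch, and it ignores multi-member gzip streams.

```python
import gzip
import io
import zlib


class GzipReader(io.RawIOBase):
    """Hypothetical wrapper that decodes a gzip stream incrementally."""

    def __init__(self, fp):
        self._fp = fp
        # 16 + MAX_WBITS tells zlib to expect a gzip header and trailer.
        self._decoder = zlib.decompressobj(16 + zlib.MAX_WBITS)
        self._pending = b""

    def readable(self):
        return True

    def readinto(self, buffer):
        # Refill the pending buffer until there is data or the input ends.
        while not self._pending:
            chunk = self._fp.read(8192)
            if not chunk:
                self._pending = self._decoder.flush()
                break
            self._pending = self._decoder.decompress(chunk)
        count = min(len(buffer), len(self._pending))
        buffer[:count] = self._pending[:count]
        self._pending = self._pending[count:]
        return count  # 0 signals end of stream


original = b"x" * 100_000
reader = GzipReader(io.BytesIO(gzip.compress(original)))
first = reader.read(10)  # sized read
rest = reader.read()     # unsized read, served by readall()
```

Both calls end up in `readinto()`, so neither can return compressed bytes.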

    Also, I wonder if Content-Encoding handling is better done at a higher level. What if someone wants to download a *.tar.gz file? They may not expect the tar file to be transparently decompressed. And I suspect this would blow up if you tried a partial range request.
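
The range concern can be made concrete: a byte range of a gzip-encoded representation is a slice of the compressed stream and cannot be decoded on its own. Below is a minimal sketch of a guard, assuming headers arrive as a dict with lower-case keys; the name `is_safe_to_decode` is hypothetical.

```python
def is_safe_to_decode(status, headers):
    """Return True only for complete gzip-encoded responses.

    Assumption: `headers` is a dict whose keys are already lower-cased.
    """
    # 206 Partial Content (or any Content-Range) means we hold only a
    # fragment of the compressed stream.
    if status == 206 or "content-range" in headers:
        return False
    return headers.get("content-encoding", "").strip().lower() == "gzip"


full = is_safe_to_decode(200, {"content-encoding": "gzip"})
partial = is_safe_to_decode(
    206, {"content-encoding": "gzip", "content-range": "bytes 0-99/1000"})
```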

    Transfer-Encoding is meant to be the proper way to transparently compress HTTP messages at a low level, but it doesn’t seem to be used as much in the real world.

    @vadmium
    Member

    vadmium commented Sep 5, 2014

    Related: bpo-1243678, which includes a patch for “httplib” (now known as “http.client”?). That patch looks like it sets Accept-Encoding and decodes according to Content-Encoding. However I suspect it is also trying to be too “transparent” at the wrong level and would have many of the problems already mentioned here.

    @vadmium
    Member

    vadmium commented Jan 20, 2015

    The Lib/xmlrpc/client.py file appears to already support compression using “Content-Encoding: gzip”. Perhaps it could be leveraged for any work on this issue.
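
For reference, a short round trip through the module-level helpers `gzip_encode` and `gzip_decode` found in CPython's `xmlrpc.client`. Whether they are suitable for reuse in http.client is an open question in this thread; the sketch only shows that they exist and how they behave on a small payload.

```python
import xmlrpc.client

# Compress a request body the way xmlrpc.client does for
# "Content-Encoding: gzip".
encoded = xmlrpc.client.gzip_encode(b"<methodCall/>")

# gzip_decode() applies a size cap (its max_decode parameter) by default.
decoded = xmlrpc.client.gzip_decode(encoded)
```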

    @vadmium
    Member

    vadmium commented Feb 15, 2015

    I suggest resolving bpo-15955 first, then the GzipFile API could be used without fear of decompression bombs.
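
Until then, a size cap can be approximated with `zlib.decompressobj` and its `max_length` argument. This is a minimal sketch with a hypothetical function name, `bounded_gunzip`, and an arbitrary limit. It is not the GzipFile-based approach discussed in bpo-15955.

```python
import gzip
import zlib


def bounded_gunzip(data, limit):
    """Decompress gzip data, refusing to produce more than `limit` bytes."""
    decoder = zlib.decompressobj(16 + zlib.MAX_WBITS)
    # Ask for one byte more than allowed so an overrun is detectable.
    out = decoder.decompress(data, limit + 1)
    if len(out) > limit:
        raise ValueError("decompressed body exceeds limit")
    if not decoder.eof:
        raise ValueError("truncated gzip stream")
    return out


small = bounded_gunzip(gzip.compress(b"a" * 100), limit=100)

# A highly compressible payload stands in for a decompression bomb.
bomb = gzip.compress(b"\0" * 10_000_000)
try:
    bounded_gunzip(bomb, limit=1_000_000)
    rejected = False
except ValueError:
    rejected = True
```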

    @ezio-melotti ezio-melotti transferred this issue from another repository Apr 10, 2022