Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[http.client] HTTPConnection.request not support "chunked" Transfer-Encoding to send data #56528

Closed
harobed mannequin opened this issue Jun 12, 2011 · 63 comments
Closed
Assignees
Labels
stdlib Python modules in the Lib dir type-feature A feature request or enhancement

Comments

@harobed
Copy link
Mannequin

harobed mannequin commented Jun 12, 2011

BPO 12319
Nosy @orsenthil, @pitrou, @vstinner, @akheron, @vadmium, @eryksun, @demianbrecht, @matrixise, @RKrahl
Files
  • chunkedhttp.py: A custom module implementing upload with chunked transfer encoding for urllib
  • issue12319.patch
  • issue12319_1.patch: Removing unused imports
  • issue12319_2.patch
  • issue12319_3.patch
  • issue12319_4.patch
  • issue12319_5.patch
  • issue12319_6.patch
  • chunkedhttp-2.py: Illustration of the situation after applying issue12319_6.patch from user's point of view
  • issue12319_7.patch
  • issue12319_8.patch
  • issue12319_9.patch
  • issue12319_10.patch
  • issue12319_11.patch
  • issue12319_12.patch
  • chunked-file.patch
  • Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.

    Show more details

    GitHub fields:

    assignee = 'https://github.com/vadmium'
    closed_at = <Date 2016-08-27.04:12:55.150>
    created_at = <Date 2011-06-12.10:47:13.084>
    labels = ['type-feature', 'library']
    title = '[http.client] HTTPConnection.request not support "chunked" Transfer-Encoding to send data'
    updated_at = <Date 2016-08-27.04:12:55.149>
    user = 'https://bugs.python.org/harobed'

    bugs.python.org fields:

    activity = <Date 2016-08-27.04:12:55.149>
    actor = 'martin.panter'
    assignee = 'martin.panter'
    closed = True
    closed_date = <Date 2016-08-27.04:12:55.150>
    closer = 'martin.panter'
    components = ['Library (Lib)']
    creation = <Date 2011-06-12.10:47:13.084>
    creator = 'harobed'
    dependencies = []
    files = ['36493', '38351', '38352', '38528', '38670', '38772', '38797', '39447', '39466', '43398', '43579', '43973', '44057', '44076', '44133', '44217']
    hgrepos = []
    issue_num = 12319
    keywords = ['patch', 'needs review']
    message_count = 63.0
    messages = ['138203', '138242', '138258', '138296', '138357', '138691', '171268', '226012', '226018', '226024', '236024', '236037', '237309', '237312', '237313', '237427', '237509', '238314', '239119', '239151', '239769', '239771', '239793', '243518', '243601', '243730', '243731', '243747', '243751', '243762', '243800', '243823', '243837', '243878', '268611', '268715', '268894', '269430', '269497', '271771', '271815', '272121', '272164', '272239', '272241', '272322', '272354', '272394', '272395', '272413', '272438', '272442', '272937', '273532', '273533', '273534', '273536', '273538', '273547', '273548', '273628', '273751', '273758']
    nosy_count = 13.0
    nosy_names = ['orsenthil', 'pitrou', 'vstinner', 'harobed', 'python-dev', 'petri.lehtinen', 'martin.panter', 'piotr.dobrogost', 'eryksun', 'demian.brecht', 'matrixise', 'whitemice', 'Rotkraut']
    pr_nums = []
    priority = 'normal'
    resolution = 'fixed'
    stage = 'resolved'
    status = 'closed'
    superseder = None
    type = 'enhancement'
    url = 'https://bugs.python.org/issue12319'
    versions = ['Python 3.6']

    @harobed
    Copy link
    Mannequin Author

    harobed mannequin commented Jun 12, 2011

    Hi,

    HTTPConnection.putrequest not support "chunked" Transfer-Encodings to send data.

    Exemple, I can't do PUT request with chunk transfert.

    Regards,
    Stephane

    @harobed harobed mannequin added the stdlib Python modules in the Lib dir label Jun 12, 2011
    @akheron
    Copy link
    Member

    akheron commented Jun 13, 2011

    What's the use case? Do you have an iterable that yields data whose size is unknown?

    AFAIK, most web servers don't even support chunked uploads.

    (Removing Python 2.7 from versions as this is clearly a feature request.)

    @akheron akheron added the type-feature A feature request or enhancement label Jun 13, 2011
    @harobed
    Copy link
    Mannequin Author

    harobed mannequin commented Jun 13, 2011

    I use http.client in WebDAV client.

    Mac OS X Finder WebDAV client perform all his request in "chunk" mode : PUT and GET.

    Here, I use http.client to simulate Mac OS X Finder WebDAV client.

    Regards,
    Stephane

    @akheron
    Copy link
    Member

    akheron commented Jun 14, 2011

    harobed wrote:

    I use http.client in WebDAV client.

    Mac OS X Finder WebDAV client perform all his request in "chunk" mode : PUT and GET.

    Here, I use http.client to simulate Mac OS X Finder WebDAV client.

    Now I'm confused. Per the HTTP specification, GET requests don't have
    a body, so "Transfer-Encoding: chunked" doesn't apply to them.

    Are you sure you don't confuse with the response that the server
    sends? In responses, "Transfer-Encoding: chunked" is very common.

    @harobed
    Copy link
    Mannequin Author

    harobed mannequin commented Jun 15, 2011

    Now I'm confused. Per the HTTP specification, GET requests don't have
    a body, so "Transfer-Encoding: chunked" doesn't apply to them.

    Are you sure you don't confuse with the response that the server
    sends? In responses, "Transfer-Encoding: chunked" is very common.

    Sorry, yes GET requests have "Transfer-Encoding: chunked" in server response.
    PUT requests can send body data in transfer-encoding chunked mode.

    @orsenthil
    Copy link
    Member

    We had support for chunked transfer encoding for POST method recently, which is exposed via urllib2 wrapper function. PUT is not exposed via urllib2 and users should use httplib. This feature of chunked transfer can be added to PUT by taking the body of the message as iterable.

    @orsenthil orsenthil self-assigned this Jun 20, 2011
    @pitrou
    Copy link
    Member

    pitrou commented Sep 25, 2012

    We had support for chunked transfer encoding for POST method recently,
    which is exposed via urllib2 wrapper function.

    I couldn't find what you're talking about.
    If I look at AbstractHTTPHandler.do_request_, it actually mandates a Content-Length header for POST data (hence no chunked encoding).

    @RKrahl
    Copy link
    Mannequin

    RKrahl mannequin commented Aug 28, 2014

    I'd like to support the request. I have a use case where I definitely need this feature: I maintain a Python client for a scientific metadata catalogue, see 1 for details. The client also features the upload of the data files. The files may come in as a data stream from another process, so my client takes a file like object as input. The files may be large (several GB), so buffering them is not an option, they must get streamed to the server as they come in. Therefore, there is have no way to know the Content-length of the upload beforehand.

    I implemented chunked transfer encoding in a custom module that monkey patches the library, see the attached file. This works fine, but of course it's an awkward hack as I must meddle deeply into the internals of urllib and http.client to get this working. This module is tested to work with Python 2.7, 3.1, 3.2, 3.3, and 3.4 (for Python 3 you need to pass it through 2to3 first). I really would like to see this feature in the standard library in order to get rid of this hack in my package. I would happy to transform my module into a patch to the library if such a patch would have a chance to get accepted.

    @piotrdobrogost
    Copy link
    Mannequin

    piotrdobrogost mannequin commented Aug 28, 2014

    @rotkraut

    The truth is http in stdlib is dead.
    Your best option is to use 3rd party libs like requests or urllib3.
    Authors of these libs plan to get rid of httplib entirely; see "Moving away from httplib" (urllib3/urllib3#58)

    @RKrahl
    Copy link
    Mannequin

    RKrahl mannequin commented Aug 28, 2014

    Thanks for the notice! As far as I read from the link you cite, getting rid of the current httplib in urllib3 is planned but far from being done. Furthermore, I don't like packages with too many 3rd party dependencies. Since my package is working fine with the current standard lib, even though using an ugly hack in one place, I'd consider switching to urllib3 as soon as the latter makes it into the standard lib.

    I still believe that adding chunked transfer encoding to http.client and urllib in the current standard lib would only require a rather small change that can easily be done such that the lib remains fully compatible with existing code. Still waiting for feedback if such a patch is welcome.

    @vadmium
    Copy link
    Member

    vadmium commented Feb 15, 2015

    One interesting question is how to convey data to the chunked encoder. There are two sets of options in my mind, a pull interface:

    • iterable: seems to be the popular way amoung commenters here
    • file reader object: encoder calls into stream’s read() method

    and a push interface:

    • chunked encoder is a file writer object: user calls encoder’s write() and close() methods. This would suit APIs like saxutils.XMLGenerator and TextIOWrapper.
    • chunked encoder has a “feed parser” interface, codecs.IncrementalEncoder interface, or something else.

    The advantage of the push interface is that you could fairly easily feed data from an iterable or file reader into it simply by just doing shutil.copyfileobj() or equivalent. But to adapt the pull interface to a push interface would require “asyncio” support or a separate thread or something to invert the flow of control. So I think building the encoder with a push interface would be best. Rolf’s ChunkedHTTPConnectionMixin class appears to only support the pull interface (iterables and stream readers).

    I would welcome support for chunked uploading in Python’s “http.client” module, especially with push or stream writer support. I don’t think overwriting _send_request should be necessary; just call putrequest(), putheader() etc manually, and then call send() for each chunk. Perhaps there is scope for sharing the code with the “http.server” module (for encoding chunked responses).

    @RKrahl
    Copy link
    Mannequin

    RKrahl mannequin commented Feb 15, 2015

    The design goal for my implementation was compatibility. My version can be used as a drop in replacement for the existing urllib's HTTPHandler. The only thing that need to be changed in the calling code is that it must call build_opener() to select the chunked handler in the place of the default HTTPHandler. After this, the calling code can use the returned opener in the very same way as usual.

    I guess, switching to a push interface would require major changes in the calling code.

    In principle, you could certainly support both schemes at the same time: you could change the internal design to a push interface and than wrap this by a pull interface for the compatibility with existing code. But I'm not sure whether this would be worth the effort. If, as Piotr suggests, the current urllib is going to be replaced by urllib3, then I guess, its questionable if it makes sense to add major design changes that are incompatible with existing code to the current standard lib.

    @demianbrecht
    Copy link
    Mannequin

    demianbrecht mannequin commented Mar 6, 2015

    I've attached a patch that implements full Transfer-Encoding support for requests as specified in RFC 7230.

    @demianbrecht
    Copy link
    Mannequin

    demianbrecht mannequin commented Mar 6, 2015

    I hit "submit" a little too soon.

    The intention of the patch is to adhere to all aspects of Transfer-Encoding as specified in the RFC and to make best guesses as to encoding that should be used based on the data type of the given body.

    This will break backwards compatibility for cases where users are manually chunking the request bodies prior to passing them in and explicitly setting the Transfer-Encoding header. Additionally, if Transfer-Encoding was previously specified, but not chunked, the patch will automatically chunk the body.

    Otherwise, the patch should only be additive.

    @demianbrecht
    Copy link
    Mannequin

    demianbrecht mannequin commented Mar 6, 2015

    Also note that this patch includes the changes in bpo-23350 as it's directly relevant.

    @demianbrecht
    Copy link
    Mannequin

    demianbrecht mannequin commented Mar 7, 2015

    FWIW, so far I've tested this change against:

    cherrypy 3.6.0
    uwsgi 2.0.9 (--http-raw-body)
    nginx 1.6.2 (chunked_transfer_encoding on, proxy_buffering off;) + uwsgi 2.0.9 (--http-raw-body)

    The chunked body works as expected. Unfortunately, all implementations seem to be ignorant of the trailer part. So it seems that although RFC-compliant (and I can definitely see the use case for it), they trailer implementation may not be overly practical. I still think that it's worthwhile keeping it, but perhaps adding a note that it may not be supported at this point.

    Relevant gists: https://gist.github.com/demianbrecht/3fd60994eceeb3da8f13

    @demianbrecht
    Copy link
    Mannequin

    demianbrecht mannequin commented Mar 8, 2015

    After sleeping on this, I think that the best route to go would be to drop the trailer implementation (if it's not practical it doesn't belong in the standard library).

    Also, to better preserve backwards compatibility it may be better to circumvent the automatic chunking if transfer-encoding headers are present in the request call. That way, no changes would need to be made to existing code that already supports it at a higher level.

    @demianbrecht
    Copy link
    Mannequin

    demianbrecht mannequin commented Mar 17, 2015

    Updated patch changes the following:

    + Removes support for trailers in requests as they're not supported
    + If Transfer-Encoding is explicitly set by the client, it's assumed that the caller will handle all encoding (backwards compatibility)
    + Fixed a bug where chunk size was being sent as decimal instead of hex

    @vadmium
    Copy link
    Member

    vadmium commented Mar 24, 2015

    I left a few comments on Reitveld, mainly about the documentation and API design.

    However I understand Rolf specifically wanted chunked encoding to work with the existing urlopen() framework, at least after constructing a separate opener object. I think that should be practical with the existing HTTPConnection implementation. Here is some pseudocode of how I might write a urlopen() handler class, and an encoder class that should be usable for both clients and servers:

    class ChunkedHandler(...):
        def http_request(...):
            # Like AbstractHTTPHandler, but don’t force Content-Length
        
        def default_open(...):
            # Like AbstractHTTPHandler, but instead of calling h.request():
            encoder = ChunkedEncoder(h.send)
            h.putrequest(req.get_method(), req.selector)
            for item in headers:
                h.putheader(*item)
            h.putheader("Transfer-Encoding", encoder.encoding)
            h.endheaders()
            shutil.copyfileobj(req.data, writer)
            encoder.close()
    
    class ChunkedEncoder(io.BufferedIOBase):
        # Hook output() up to either http.client.HTTPConnection.send()
        # or http.server.BaseHTTPRequestHandler.wfile.write()
        
        encoding = "chunked"
        
        def write(self, b):
            self.output("{:X}\r\n".format(len(b)).encode("ascii"))
            self.output(b)
            self.output(b"\r\n")
        
        def close(self):
            self.output(b"0\r\n\r\n")

    @demianbrecht
    Copy link
    Mannequin

    demianbrecht mannequin commented Mar 24, 2015

    Thanks for the review Martin.

    However I understand Rolf specifically wanted chunked encoding to work with the existing urlopen() framework, at least after constructing a separate opener object. I think that should be practical with the existing HTTPConnection implementation.

    The original issue was that http.client doesn't support chunked encoding. With this patch, chunked encoding should more or less come for free with urlopen. There's absolutely no reason as to why HTTPConnection should not support transfer encoding out of the box given it's part of the HTTP1.1 spec. I do understand that there are some modifications needed in urllib.request in order to support the changes here, but I didn't include those in the patch as to not conflate the patch. I think that it's also reasonable to open a new issue to address the feature in urllib.request rather than piggy-backing on this one.

    @vadmium
    Copy link
    Member

    vadmium commented Apr 1, 2015

    Perhaps you should make a table of some potential body object types, and figure out what the behaviour should be for request() with and without relevant headers, and endheaders() and send() with and without encode_chunked=True:

    • Add/don’t add Content-Length/Transfer-Encoding
    • Send with/without chunked encoding
    • Raise exception
    • Not supported or undefined behaviour

    Potential body types:

    • None with GET/POST request
    • bytes()
    • Latin-1/non Latin-1 str()
    • BytesIO/StringIO
    • Ordinary binary/Latin-1/other file object
    • File object reading a special file like a pipe (st_size == 0)
    • File object wrapping a pipe or similar that does not support fileno() (ESPIPE)
    • Custom file object not supporting fileno() nor seeking
    • File object at non-zero offset
    • GzipFile object, where fileno() corresponds to the compressed size
    • GzipFile not supporting fileno(), where seeking is possible but expensive
    • Iterator yielding bytes() and/or strings
    • Generator
    • File object considered as an iterable of lines
    • List/tuple of bytes/text
    • Other sequences of bytes/text
    • Other iterables of bytes/text, e.g. set(), OrderedDict.values()

    This list could go on and on. I would rather have a couple of simple rules, or explicit APIs for the various modes so you don’t have to guess which mode your particular body type will trigger.

    @demianbrecht
    Copy link
    Mannequin

    demianbrecht mannequin commented Apr 1, 2015

    Agreed. However, I'm wondering if that should belong in a new issue geared towards further clarifying behaviour of request body types. The patch introduces the behaviour this specific issue was looking, with the largest change being that iterators may now result in chunked transfer encoding with the data currently handled by the library. I'd rather move forward with incremental improvements rather than keep building on each issue before the work's merged (there's still a /good/ amount of work to be done in this module).

    @vadmium
    Copy link
    Member

    vadmium commented Apr 1, 2015

    The incremental improvement thing sounds good. Here are some things which I think are orthogonal to sensible chunked encoding:

    • Automagic seeking to determine Content-Length
    • Setting Content-Length for iterables that are neither strings, iterators nor files (bpo-23350)
    • Latin-1 encoding of iterated items

    @pitrou
    Copy link
    Member

    pitrou commented May 18, 2015

    What's the status on this one? It looks like some review comments need addressing.

    @demianbrecht
    Copy link
    Mannequin

    demianbrecht mannequin commented May 19, 2015

    What's the status on this one? It looks like some review comments need addressing.

    That's about it. Unfortunately I've been pretty tied up over the last month and a bit. I'll try to get to hopefully closing this out over the next few days.

    @demianbrecht
    Copy link
    Mannequin

    demianbrecht mannequin commented May 21, 2015

    Latest patch should address all outstanding comments.

    @demianbrecht
    Copy link
    Mannequin

    demianbrecht mannequin commented May 21, 2015

    BTW, thanks for the reviews Martin and the nudge Antoine!

    @vadmium
    Copy link
    Member

    vadmium commented Aug 7, 2016

    Yes when I say “text” I mean str objects, as opposed to byte strings, etc.

    I wonder if it would be safer to test for TextIOBase. With the read(0) hack, the zero size may be unexpected. A file object may need special handling to generate an empty result, or to not interpret it as an EOF indicator. See e.g. bpo-23804, about zero-sized SSL reads, where both of these problems happened. As long as we document that the text file logic has changed, I think it would be okay. IMO it is better to keep things simple and robust for byte files than add more hacks for text files.

    A common base class like HTTPException is useful when a caller may need to handle a set of exceptions in a common way. But a problem with Content-Length or Transfer-Encoding seems like a programming error, so ValueError seems fine IMO.

    However I still don’t see why the checks are worthwhile. Sure having both is a technical violation of the current HTTP protocols. But if a caller is likely to end up in one of these situations, it may indicate a problem with the API that we should fix instead. If the checks are unlikely to ever fail, then drop them. Otherwise, I don’t think any of the bugs I pointed out have been addressed.

    I don’t like converting bytes-like to bytes. I understand one of the main purposes of accepting bytes-like objects is to avoid copying them, but converting to bytes defeats that. Also, the copying is inconsistent with the lack of copying in _read_iterable().

    I am wondering if it would be best to keep new logic out of send(); just put it into _send_output() instead. Once upon a time, send() was a simple wrapper around socket.sendall() without much bloat. But with the current patch, in order to send a single buffer, we have to jump through a bunch of checks, create a dummy lambda and singleton tuple, loop over the singleton, call the lambda, realize there is no chunked encoding, etc. IMO the existing support for iterables should have been put in _send_output() too (when that was an option). There is no need to use send() to send the whole body when endheaders() already supports that.

    @RKrahl
    Copy link
    Mannequin

    RKrahl mannequin commented Aug 8, 2016

    Martin,

    I wonder if it would be safer to test for TextIOBase. With the
    read(0) hack, the zero size may be unexpected. A file object may
    need special handling to generate an empty result, or to not
    interpret it as an EOF indicator. See e.g. bpo-23804, about
    zero-sized SSL reads, where both of these problems happened. As long
    as we document that the text file logic has changed, I think it
    would be okay. IMO it is better to keep things simple and robust for
    byte files than add more hacks for text files.

    The problem is that there is no accepted definition of what a file-like object is. As a consequence, each test method will fail in some cases. Not all file-like objects are derived from the base classes in the io module. It is common praxis to simply define an object with a read() method and to consider this a file-like. This pattern can even be found in the official Python test suite. Testing for TextIOBase will fail in these cases.

    There seem to be at least a minimal consensus that each file-like (that supports reading) must have the read() method implementing the interface as defined by the io base classes. The read(0) test (I wouldn't call it a hack, as it only uses this documented API) has the advantage of making no other assumptions then this. And I even would not accept bpo-23804 as a counter example as this was clearly a bug in the ssl module.

    The current test using the mode attribute is certainly the worst one as it even fails for standard classes from the io module.

    Anyway. Even though I would prefer the read(0) test, I am happy to accept any other method. Please tell me what I should use.

    However I still don’t see why the checks are worthwhile. Sure having
    both is a technical violation of the current HTTP protocols. But if
    a caller is likely to end up in one of these situations, it may
    indicate a problem with the API that we should fix instead. If the
    checks are unlikely to ever fail, then drop them.

    The question is whether we should apply sanity checks on caller's input or whether we should rather send bullshit out and wait for the request to fail. I'd rather give the caller a somewhat helpful error message rather then a lesser helpful "400 Bad Request" from the server. But if you really insist, I can also remove these checks.

    I don't understand your remark on the API and what the likeliness of rubbish header values set by the caller has to do with it, sorry.

    Otherwise, I don’t think any of the bugs I pointed out have been
    addressed.

    Could you elaborate on what bugs you are talking about? I really don't know, sorry.

    I don’t like converting bytes-like to bytes. I understand one of the
    main purposes of accepting bytes-like objects is to avoid copying
    them, but converting to bytes defeats that. Also, the copying is
    inconsistent with the lack of copying in _read_iterable().

    I added this conversion because you mentioned bpo-27340 and the problem to send byte-like objects with the ssl module. I agree to leave this to bpo-27340 to find a better solution and will remove the conversion.

    I am wondering if it would be best to keep new logic out of send();
    just put it into _send_output() instead. Once upon a time, send()
    was a simple wrapper around socket.sendall() without much bloat. But
    with the current patch, in order to send a single buffer, we have to
    jump through a bunch of checks, create a dummy lambda and singleton
    tuple, loop over the singleton, call the lambda, realize there is no
    chunked encoding, etc. IMO the existing support for iterables should
    have been put in _send_output() too (when that was an option). There
    is no need to use send() to send the whole body when endheaders()
    already supports that.

    No problem, I can move all this to _send_output(). The only problem that I see then, is that the current library version of send() already accepts file-like (both text and byte streams), bytes, and iterables. So we will end up with two different methods accepting this variety of data types and behaving differently in the respective cases.

    Maybe, as a consequence, one should restrict send() to accept only bytes in order to tighten consistency. But this is not in the scope of this Issue and will not be covered by my patch.

    Please note that I will be in holidays starting with next Monday. We should try to get this done during this week.

    @RKrahl
    Copy link
    Mannequin

    RKrahl mannequin commented Aug 9, 2016

    Ok, here comes yet another version of the patch, addressing the review comments.

    As indicated in the last post and requested by Martin, I moved the logic to deal with the different body types out of send() into _send_output(). I also removed the conversion from byte-like to bytes.

    I had some problems with some comments mainly on the documentation of urllib.request, asking for stuff unrelated to chunked transfer encoding and the present issue. I have the impression this is starting to get out of hands. Time is running out and I suggest that we should keep the focus on the present issue.

    There were also requests to drop support for stuff that is working in the current library, although not documented, and that is explicitly covered by the test suite. I'm fine with dropping that and also adapted the tests accordingly. But please be sure that what you are asking for is really what you want.

    Please note that this patch is not yet complete. I still need your decision on the two issues raised in the last post: the test method to use for text streams and whether I should drop the sanity checks for the transfer-encoding header.

    @RKrahl
    Copy link
    Mannequin

    RKrahl mannequin commented Aug 9, 2016

    I just wrote:

    There were also requests to drop support for stuff that is working
    in the current library, although not documented, and that is
    explicitly covered by the test suite. I'm fine with dropping that
    and also adapted the tests accordingly. But please be sure that
    what you are asking for is really what you want.

    Sorry, please disregards this comment. I just realized that these tests were added by previous versions of the patch by Damien.

    @orsenthil
    Copy link
    Member

    (Assigning this to Martin)

    I really like this patch, and I support that we land this before the feature freeze of 3.6.

    As along as the API interface to users is simple, the changes internal to _send_output should not be a concern. Yeah, I agree that multiple things needs to be taken care, but it is alright.

    Also, +1 to do validation of transfer-encoding usage as it's done by the patch. This is along the lines of RFC recommendation and it is better to fail early than let the server implementation be the decider.

    I would like to know the better suggestion for "testing text" too. (Adding haypo to steer us here) Although, I would point out that the static method has a generic name _is_textIO, while relying on .read attribute which is checked outside the static method.

    @orsenthil orsenthil assigned vadmium and unassigned orsenthil Aug 10, 2016
    @matrixise
    Copy link
    Member

    When I have read the last patch, I have seen a mutable as the default value of the HTTPConnection.request method. I have proposed this issue and the attached patch http://bugs.python.org/issue27728

    @vadmium
    Copy link
    Member

    vadmium commented Aug 11, 2016

    _is_textIO(): I’m sorry but I still prefer the TextIOBase check over read(0). Each option has a disadvantage, but with TextIOBase, the disadvantage only affects text files, not byte files.

    Maybe another option is to stick with the current checking of the “mode” attribute, or are its failings significant for chunked encoding? It seems to me that the most important use cases would be reading bytes from a pipe file, custom bytes file object reader, generator, etc, but not a text file without a “mode” attribute.

    Checking Content-Length and Transfer-Encoding: I would prefer to remove this, unless there is a “good” reason to keep them. I want to understand why the three specific checks were added, and what made them more special than other potential checks (e.g. the format of the Content-Length value, or that the resulting body matches it). If the reason is something like “it is too easy for the caller to accidentally trigger the problem”, then there may be another way to fix it. Or maybe it is a mitigation for a security problem? But at the moment, it just seems to me like code being too smart for its own good.

    I mentioned a few possible bugs with parsing Transfer-Encoding at <https://bugs.python.org/review/12319/diff/14900/Lib/http/client.py#newcode1210\>. The format of the Transfer-Encoding value is specified at <https://tools.ietf.org/html/rfc7230#section-3.3.1\> and <https://tools.ietf.org/html/rfc7230#section-4\>. In bpo-23498, I identified some parts of Python that parse header values like this, although it looks like http.cookiejar.split_header_words() may be the only one usable for Transfer-Encoding.

    @vadmium
    Copy link
    Member

    vadmium commented Aug 11, 2016

    Some of the header field validation was included in Demian’s first patch, raising HTTPEncodingError. But I don’t know why he included it. Maybe it was more appropriate then; it looks like that patch took a different approach than the current encode_chunked=True mechanism.

    @vadmium vadmium assigned orsenthil and vadmium and unassigned vadmium and orsenthil Aug 11, 2016
    @RKrahl
    Copy link
    Mannequin

    RKrahl mannequin commented Aug 11, 2016

    Martin,

    _is_textIO(): I’m sorry but I still prefer the TextIOBase check over
    read(0). Each option has a disadvantage, but with TextIOBase, the
    disadvantage only affects text files, not byte files.

    Ok, this is a valid argument.

    Maybe another option is to stick with the current checking of the
    “mode” attribute, or are its failings significant for chunked
    encoding? It seems to me that the most important use cases would be
    reading bytes from a pipe file, custom bytes file object reader,
    generator, etc, but not a text file without a “mode” attribute.

    No, this is not significant for chunked encoding. As I said, I am happy to accept any method.

    But I would say, it's a fairly save bet that any file-like that has the mode attribute is in fact derived from the base classes in the io module. If this assumption holds, then the TextIOBase check will always work in all cases where the mode attribute check would. The inverse is not true. So I'll use the TextIOBase check, hope you agree.

    Checking Content-Length and Transfer-Encoding: I would prefer to
    remove this, unless there is a “good” reason to keep them. I want to
    understand why the three specific checks were added, and what made
    them more special than other potential checks (e.g. the format of
    the Content-Length value, or that the resulting body matches it). If
    the reason is something like “it is too easy for the caller to
    accidentally trigger the problem”, then there may be another way to
    fix it. Or maybe it is a mitigation for a security problem? But at
    the moment, it just seems to me like code being too smart for its
    own good.

    Ok, I'll drop them.

    I mentioned a few possible bugs with parsing Transfer-Encoding at
    <https://bugs.python.org/review/12319/diff/14900/Lib/http/client.py#newcode1210\>.
    The format of the Transfer-Encoding value is specified at
    <https://tools.ietf.org/html/rfc7230#section-3.3.1\> and
    <https://tools.ietf.org/html/rfc7230#section-4\>. In bpo-23498, I
    identified some parts of Python that parse header values like this,
    although it looks like http.cookiejar.split_header_words() may be
    the only one usable for Transfer-Encoding.

    I admit, I didn't read all comments on Demian's versions of the patch. Since I'll drop the checks, I guess, these problems are gone now.

    Some of the header field validation was included in Demian’s first
    patch, raising HTTPEncodingError. But I don’t know why he included
    it. Maybe it was more appropriate then; it looks like that patch
    took a different approach than the current encode_chunked=True
    mechanism.

    I don't think this is in the current versions of the patch any more.

    @RKrahl
    Copy link
    Mannequin

    RKrahl mannequin commented Aug 11, 2016

    Ok, here comes the next version of the patch. I made the changes discussed in the last post and addressed the review comments. Looks like we are converging towards a final version.

    @vadmium
    Copy link
    Member

    vadmium commented Aug 11, 2016

    I am pretty happy with the latest patch. I left one comment. I will try to give it a more thorough review and actually test it out at some point, but I don’t anticipate any major problems.

    @vadmium
    Copy link
    Member

    vadmium commented Aug 17, 2016

    Patch 12 has the following changes:

    • Change the chunked_encoding parameter to keyword-only
    • Drop most of the change regarding the UpdatingFile test (see code review)
    • Update the urlopen() TypeError to mention “data” may be a file object
    • Fix and update the tests (they weren’t actually being run)
    • Various minor documentation corrections, e.g. wrong parameter name.

    I think there are a few of my comments to Demian’s patches that are still relevant, but they are just opportunities for improving the code (e.g. making the chunking iteration more efficient and readable). This version is hopefully “good enough”.

    @python-dev
    Copy link
    Mannequin

    python-dev mannequin commented Aug 24, 2016

    New changeset 0effb34f3276 by Martin Panter in branch 'default':
    Issue bpo-12319: Support for chunked encoding of HTTP request bodies
    https://hg.python.org/cpython/rev/0effb34f3276

    @orsenthil
    Copy link
    Member

    Thank you and great work everyone involved with this change. Regards.

    @vadmium
    Copy link
    Member

    vadmium commented Aug 24, 2016

    Rolf, just a note that I had to remove some trailing spaces on various continued lines in the Python code before it would let me push this.

    Other tweaks I made:

    • Eliminate _read_iterable() and lambda
    • Rename line → chunk

    The Windows buildbots fail the test sending a pipe with urlopen():

    http://buildbot.python.org/all/builders/AMD64%20Windows8.1%20Non-Debug%203.x/builds/1166/steps/test/logs/stdio
    ======================================================================
    FAIL: test_http_body_pipe (test.test_urllib2.HandlerTests)
    ----------------------------------------------------------------------

    Traceback (most recent call last):
      File "D:\buildarea\3.x.ware-win81-release\build\lib\test\test_urllib2.py", line 970, in test_http_body_pipe
        self.assertEqual(newreq.get_header('Content-length'), None)
    AssertionError: '0' != None

    I cannot figure out why exactly. My experiments with Wine suggest that tell() on a BufferedReader wrapping a pipe raises OSError, so I would expect _get_content_length() to return None, yet the test failure shows that Content-Length is set to zero.

    Is anyone able to investigate this on Windows? Will tell() and seek() reliably fail on a pipe or other unseekable file?

    @python-dev
    Copy link
    Mannequin

    python-dev mannequin commented Aug 24, 2016

    New changeset b004da19b869 by Martin Panter in branch 'default':
    Issue bpo-12319: Move NEWS under beta 1 heading
    https://hg.python.org/cpython/rev/b004da19b869

    @RKrahl
    Copy link
    Mannequin

    RKrahl mannequin commented Aug 24, 2016

    First of all, thanks all for the great work, in particular to you Martin for your patience!

    Martin, on your changes: as mentioned earlier, I'm in holidays right now, so I didn't had a close look yet. But I tested my application and it works, so I guess I'm happy with it.

    I don't use Windows, so I can't comment on the issue test_http_body_pipe, sorry.

    @eryksun
    Copy link
    Contributor

    eryksun commented Aug 24, 2016

    on Windows? Will tell() and seek() reliably fail on a pipe
    or other unseekable file?

    No they will not reliably fail.

    The file position is stored in the FILE_OBJECT CurrentByteOffset. This value can be queried and set using the WinAPI function SetFilePointerEx, which is implemented via the system calls NtQueryInformationFile and NtSetInformationFile with the information class FilePositionInformation.

    For files opened in synchronous mode, which is the only mode that the CRT supports, there's a fast path in the I/O manager to bypass the I/O request packet (IRP) path that would normally call the device driver. This fast-path code trivially updates the CurrentByteOffset field, regardless of the file type. In synchronous mode, a seek will only fail if the file is opened without buffering and the new position isn't sector-aligned, as is documented for SetFilePointerEx and NtSetInformationFile: FilePositionInformation.

    Prior to Windows 8, seek() and tell() will also fail for a console handle because it's not actually a kernel File handle that can be used with I/O system calls. OTOH, Windows 8+ uses the ConDrv device driver, so console handles trivially allow seek() and tell() to succeed, though the console itself ignores the CurrentByteOffset in the File object. Console screen-buffer files have a positionable cursor, which is similar but not quite the same.

    A fun fact is that for files opened in asynchronous mode, for FilePositionInformation the I/O manager does a regular device driver call with an IRP_MJ_QUERY_INFORMATION or IRP_MJ_SET_INFORMATION packet. It probably does this instead of using its fast path because there's no inherent meaning to the file position in asynchronous mode. I played around with this a bit with aynchronous named pipes. For setting FilePositionInformation, the NamedPipe filesystem driver always returns STATUS_INVALID_PARAMETER. However, querying the file 'position' returns the number of unread bytes in the queue for the handle's end of the pipe. This can only be non-zero if the handle allows reading, which depends on whether the pipe was created as duplex, inbound, or outbound. Querying this requires the NT API because SetFilePointerEx always calls NtSetInformationFile and fails.

    @vadmium
    Copy link
    Member

    vadmium commented Aug 24, 2016

    Thankyou Eryksun for the detailed explanation. Unfortunately, that means that uploading an unseekable file via urllib.request, or via http.client, isn’t going to work by default on Windows. I see a couple of workarounds with the current code:

    • Users can force chunked encoding (manually pass Transfer-Encoding: chunked, and/or encode_chunked=True, depending on the API)
    • Users can pass an iterator that reads the file: iter(partial(file.read, CHUNK_SIZE), b"")

    But I never liked how so much behaviour depends on specific details of the upload body object anyway. Perhaps this is an opportunity to change the behaviour so that a file object always triggers chunked encoding. We could add a note to <https://docs.python.org/3.6/whatsnew/3.6.html#changes-in-the-python-api\> to warn that uploading files to HTTP 1.0 servers will now require Content-Length to be manually specified.

    @vadmium
    Copy link
    Member

    vadmium commented Aug 25, 2016

    Here is my attempt to drop automatic Content-Length with files. I also dropped the special handling of zero-length iterables, added some tests for that, and removed some unused and misleading bits from other tests.

    @python-dev
    Copy link
    Mannequin

    python-dev mannequin commented Aug 27, 2016

    New changeset 216f5451c35a by Martin Panter in branch 'default':
    Issue bpo-12319: Always send file request bodies using chunked encoding
    https://hg.python.org/cpython/rev/216f5451c35a

    @vadmium
    Copy link
    Member

    vadmium commented Aug 27, 2016

    I committed my patch, so now file bodies are chunk-encoded by default. The Windows buildbots pass the relevant tests again.

    This does mean that old code that uploaded a file to a HTTP-1.0-only server, and relied on Python automatically setting Content-Length, will no longer work in 3.6. Let me know if you think this is a serious problem, and maybe we can find another solution.

    @vadmium vadmium closed this as completed Aug 27, 2016
    @ezio-melotti ezio-melotti transferred this issue from another repository Apr 10, 2022
    Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
    Labels
    stdlib Python modules in the Lib dir type-feature A feature request or enhancement
    Projects
    None yet
    Development

    No branches or pull requests

    6 participants