classification
Title: cgi.FieldStorage has different (wrong?) behavior on Python3 than Python2
Type: behavior Stage: resolved
Components: Versions: Python 3.6, Python 3.5, Python 3.4
process
Status: closed Resolution: fixed
Dependencies: Superseder:
Assigned To: Nosy List: benjamin.peterson, berker.peksag, dstufft, python-dev, r.david.murray
Priority: normal Keywords: patch

Created on 2015-03-29 02:22 by dstufft, last changed 2015-03-29 20:51 by dstufft. This issue is now closed.

Files
File name Uploaded Description Edit
reproduce.py dstufft, 2015-03-29 02:22
cgi-read-until-boundary.diff dstufft, 2015-03-29 02:30 review
Messages (11)
msg239471 - (view) Author: Donald Stufft (dstufft) * (Python committer) Date: 2015-03-29 02:22
While working on PyPI 2.0 (which is currently running Python 3) I discovered that ``setup.py upload`` was causing an exception. After tracing things I determined that the reason for this is that Python 3 fails to handle leading whitespace in a multipart body.

I've attached a minimum reproducer that runs without error on Python 2.6 and Python 2.7 which fails on Python 3.2, 3.3, and 3.4.

If I go into the cgi.py module and add a print() statement that will print the header of each part, I get output that looks like:

b'----------------GHSKFJDLGDS7543FJKLFHRE75642756743254\r\nContent-Disposition: form-data; name="protcol_version"\r\n\r\n'
b'Content-Disposition: form-data; name="summary"\r\n\r\n'
b'Content-Disposition: form-data; name="home_page"\r\n\r\n'
b'Content-Disposition: form-data; name="filetype"\r\n\r\n'
b'Content-Disposition: form-data; name="content"; filename="jasmin-13.13.13.tar.gz"\r\n\r\n'

The first line of that is obviously suspicious since it includes the inner boundary marker. Looking at the Python 3.x code it throws away the first line off the fp and then proceeds to process the rest of the fp. However in this case the first line is just a blank b'\r\n'. Looking at the Python 2.7 code it throws away an entire first part, not just the first line.

I'm guessing that the "read first line and throw it away code" needs to continue reading lines until it's read enough lines to get to the boundary marker.
msg239472 - (view) Author: Donald Stufft (dstufft) * (Python committer) Date: 2015-03-29 02:30
Added a patch that fixes this issue by reading lines until we find the line that is our expected boundary marker.
msg239478 - (view) Author: Donald Stufft (dstufft) * (Python committer) Date: 2015-03-29 07:50
Added R David Murray to the nosy list because this is kinda similar to the email stuff and there doesn't seem to be anyone better to look at this patch that I can find...
msg239480 - (view) Author: Donald Stufft (dstufft) * (Python committer) Date: 2015-03-29 07:52
Also adding Berker Peksag because they've touched this module recently :)
msg239490 - (view) Author: Benjamin Peterson (benjamin.peterson) * (Python committer) Date: 2015-03-29 14:27
The loop might be more elegantly expressed as

for line in self.fp:
    self.bytes_read += len(line)
    if line.strip() != b"--" + self.innerboundary:
        break

but otherwise lgtm
msg239497 - (view) Author: R. David Murray (r.david.murray) * (Python committer) Date: 2015-03-29 16:42
The code you are modifying looks completely wrong-headed to me.  Unfortunately I don't have time to rewrite the cgi module right now :).  Given that, your fix looks fine: the first part of a multipart (up to the first inner boundary) is the 'preamble', which can contain pretty much arbitrary content, and in the context of cgi should indeed be completely ignored.  

Benjamin's suggested modification has the if logic reversed.
msg239516 - (view) Author: Donald Stufft (dstufft) * (Python committer) Date: 2015-03-29 20:31
@Benjamin

The reason I didn't do that to begin with, was the code currently checks if the first line is a bytes object or not in order to be able to raise an error if it's returning str instead of bytes. I didn't want to redo that check on every iteration, so I left the original part alone and then used the while loop to handle doing more if needed.

Would you prefer the code written your way and either drop the bytes check or incur the cost of doing the type check on every line?
msg239517 - (view) Author: Benjamin Peterson (benjamin.peterson) * (Python committer) Date: 2015-03-29 20:39
Oh, I see. I think your way is fine then.
msg239518 - (view) Author: Roundup Robot (python-dev) Date: 2015-03-29 20:43
New changeset 3af77d1c5c47 by Donald Stufft in branch '3.4':
Closes #23801 - Ignore entire preamble to multipart in cgi.FieldStorage
https://hg.python.org/cpython/rev/3af77d1c5c47
msg239519 - (view) Author: Roundup Robot (python-dev) Date: 2015-03-29 20:45
New changeset b9364903d91c by Benjamin Peterson in branch 'default':
merge 3.4 (#23801)
https://hg.python.org/cpython/rev/b9364903d91c
msg239520 - (view) Author: Donald Stufft (dstufft) * (Python committer) Date: 2015-03-29 20:50
Thanks everyone for taking a look at this!
History
Date User Action Args
2015-03-29 20:51:22dstufftsetstage: resolved
2015-03-29 20:50:07dstufftsetmessages: + msg239520
stage: resolved -> (no value)
2015-03-29 20:45:25python-devsetmessages: + msg239519
2015-03-29 20:43:32python-devsetstatus: open -> closed

nosy: + python-dev
messages: + msg239518

resolution: fixed
stage: resolved
2015-03-29 20:39:27benjamin.petersonsetmessages: + msg239517
2015-03-29 20:31:42dstufftsetmessages: + msg239516
2015-03-29 16:42:53r.david.murraysetmessages: + msg239497
2015-03-29 14:27:31benjamin.petersonsetnosy: + benjamin.peterson
messages: + msg239490
2015-03-29 07:52:50dstufftsetnosy: + berker.peksag
messages: + msg239480
2015-03-29 07:50:10dstufftsetnosy: + r.david.murray
messages: + msg239478
2015-03-29 02:30:39dstufftsetfiles: + cgi-read-until-boundary.diff
keywords: + patch
messages: + msg239472
2015-03-29 02:22:20dstufftcreate