This issue tracker has been migrated to GitHub, and is currently read-only.
For more information, see the GitHub FAQs in the Python's Developer Guide.

classification
Title: urllib.urlretrieve() fails on second ftp transfer
Type: behavior Stage: patch review
Components: Library (Lib) Versions: Python 2.7
process
Status: open Resolution:
Dependencies: Superseder:
Assigned To: orsenthil Nosy List: Sohaib Ahmad, db3l, gson, orsenthil, pablogsal, r.david.murray, scoder
Priority: normal Keywords: patch

Created on 2016-09-06 15:09 by Sohaib Ahmad, last changed 2022-04-11 14:58 by admin.

Files
File name Uploaded Description Edit
multiple_ftp_download.py Sohaib Ahmad, 2016-09-06 15:09
test.py gson, 2016-09-13 14:46 Test case
urllib.patch Sohaib Ahmad, 2016-09-17 13:48 review
issue27973.patch orsenthil, 2017-01-22 17:12
Pull Requests
URL Status Linked Edit
PR 1040 merged orsenthil, 2017-04-08 05:06
PR 17774 merged orsenthil, 2019-12-31 21:34
PR 17819 closed orsenthil, 2020-01-04 02:20
Messages (28)
msg274559 - (view) Author: Sohaib Ahmad (Sohaib Ahmad) * Date: 2016-09-06 15:09
urllib.urlretrieve() fails on ftp:
- start and complete a transfer
- immediately start another transfer
The second transfer will fail with the following error:
[Errno ftp error] 200 Type set to I

I am using urllib.urlretrieve(url, filename) to retrieve two files (one by one) from FTP server.

Sample code to reproduce the problem is attached. Please update url1 and url2 with correct values.

This problem was reported several years ago and was fixed but it is now reproducible on latest python 2.7 package (2.7.12).

http://bugs.python.org/issue1067702

I tried the same scenario on 2.7.10 and it worked fine. So a patch after 2.7.10 must have broken something.
msg274575 - (view) Author: R. David Murray (r.david.murray) * (Python committer) Date: 2016-09-06 17:47
If you want to help out with tracking this down, you could hg bisect the commits and find out which one broke it.
msg274871 - (view) Author: Sohaib Ahmad (Sohaib Ahmad) * Date: 2016-09-07 19:47
I am not much familiar with mercurial. I will try to setup the development environment.

Traceback is:

[Errno ftp error] 200 Switching to Binary mode.
Traceback (most recent call last):
  File "multiple_ftp_download.py", line 49, in main
    file2_path = download_from_url(url2, local_folder=tmpDir)
  File "multiple_ftp_download.py", line 32, in download_from_url
    filename = urllib.urlretrieve(url, local_path)[0]
  File "C:\Python27\lib\urllib.py", line 98, in urlretrieve
    return opener.retrieve(url, filename, reporthook, data)
  File "C:\Python27\lib\urllib.py", line 245, in retrieve
    fp = self.open(url, data)
  File "C:\Python27\lib\urllib.py", line 213, in open
    return getattr(self, name)(url)
  File "C:\Python27\lib\urllib.py", line 558, in open_ftp
    (fp, retrlen) = self.ftpcache[key].retrfile(file, type)
  File "C:\Python27\lib\urllib.py", line 906, in retrfile
    conn, retrlen = self.ftp.ntransfercmd(cmd)
  File "C:\Python27\lib\ftplib.py", line 334, in ntransfercmd
    host, port = self.makepasv()
  File "C:\Python27\lib\ftplib.py", line 312, in makepasv
    host, port = parse227(self.sendcmd('PASV'))
  File "C:\Python27\lib\ftplib.py", line 830, in parse227
    raise error_reply, resp
IOError: [Errno ftp error] 200 Switching to Binary mode.
msg276280 - (view) Author: Andreas Gustafsson (gson) Date: 2016-09-13 14:46
I am also suffering from this bug, using Python 2.7.12 on NetBSD, and it is blocking my efforts to do automated testing of NetBSD/Xen.

I'm attaching a minimal test case (only 3 lines).
msg276518 - (view) Author: Sohaib Ahmad (Sohaib Ahmad) * Date: 2016-09-15 06:02
Thank you for pointing me towards hg bisect. I got some time to look into it and was able to find the commit that broke this functionality.

A fix from Python 3 was backported in issue "urllib hangs when closing connection" which removed a call to ftp.voidresp(). Without this call the second download using urlretrieve() now fails in 2.7.12.

Issue ID:
http://bugs.python.org/issue26960

Commit ID:
https://hg.python.org/cpython/rev/44d02a5d59fb

voidresp() itself calls getresp(). So issue26960 could be because control never returns from getresp().

In my opinion this commit (101286) should be reverted and getresp() should be updated with some sort of timeout to fix issue26960.
msg276545 - (view) Author: Sohaib Ahmad (Sohaib Ahmad) * Date: 2016-09-15 10:44
I didn't know that urllib.urlopen() retrieves complete object in case of FTP. When getresp() is called for big files (the one in issue26960), RETR command is initiated and server returns code 150 which means "standby for another reply" and there is where the control got stuck and issue26960 was reported.

This is the end of debug log with the file mentioned in issue26960, after which the control got stuck:

*cmd* 'TYPE I'
*put* 'TYPE I\r\n'
*get* '200 Type set to I\r\n'
*resp* '200 Type set to I'
*cmd* 'PASV'
*put* 'PASV\r\n'
*get* '227 Entering Passive Mode (130,133,3,130,207,26).\r\n'
*resp* '227 Entering Passive Mode (130,133,3,130,207,26).'
*cmd* 'RETR ratings.list.gz'
*put* 'RETR ratings.list.gz\r\n'
*get* '150 Opening BINARY mode data connection for ratings.list.gz (12643237 bytes)\r\n'
*resp* '150 Opening BINARY mode data connection for ratings.list.gz (12643237 bytes)'

And this is the end of debug log of a very small file transfer over FTP:

*cmd* 'PASV'
*put* 'PASV\r\n'
*get* '227 Entering Passive Mode (130,239,18,165,234,243).\r\n'
*resp* '227 Entering Passive Mode (130,239,18,165,234,243).'
*cmd* 'RETR Contents-udeb-ppc64el.gz'
*put* 'RETR Contents-udeb-ppc64el.gz\r\n'
*get* '150 Opening BINARY mode data connection for Contents-udeb-ppc64el.gz (26555 bytes).\r\n'
*resp* '150 Opening BINARY mode data connection for Contents-udeb-ppc64el.gz (26555 bytes).'
*get* '226 Transfer complete.\r\n'
*resp* '226 Transfer complete.'

The control returned successfully once FTP returned 2xx.

Please correct me if I am wrong but from the RETR command it looks like it is trying the retrieve the whole file in both cases. Is urlopen() supposed to retrieve files when called or just get the headers/information etc.?
msg276587 - (view) Author: Sohaib Ahmad (Sohaib Ahmad) * Date: 2016-09-15 18:37
I manually reverted the issue26960 patch which fixed my issue of consecutive downloads but it also caused regression of issue26960.

I am looking into what could be causing this hang when voidresp() is called using the demo available in issue26960 and it looks when urlopen() is called following happens:

urlopen() > URLopener.open() > URLopener.open_ftp > ftpwrapper.retrfile() > FTP.ntransfercmd()

Now this retrfile() calls FTP.ntransfercmd() in ftplib which sends RETR command to ftp server which, if I understand correctly, means that retrieve a copy of the file from FTP server. If RETR does retrieve complete file then I think the behavior after reverting issue26960 patch is fine and the hang would be there for large files.

I think we can fix this freeze for large files but I have two questions regarding this:

1) Is urlopen() supposed to download complete files? From Python doc, it looks like it only returns a network object or an exception in case of invalid URL.

2) If it is not supposed to download complete files, can we switch to LIST instead of RETR for FTP files?

I'd be grateful if a urllib / ftplib expert can answer the above questions.
msg276700 - (view) Author: Sohaib Ahmad (Sohaib Ahmad) * Date: 2016-09-16 11:39
The attached patch fixes the problem with multiple ftp downloads while keeping the fix for issue1067702 intact.

The fix basically uses a new parameter ftp_retrieve to change the behavior of ftpwrapper.retrfile() if it is being called by urlretrieve().

I am not familiar with the process of contributing a patch in Python repo so please review and commit the attached urllib.patch file.

Tested with urlopen (https, http, ftp) and urlretrieve (ftp).
msg276714 - (view) Author: Senthil Kumaran (orsenthil) * (Python committer) Date: 2016-09-16 14:01
Hi Sohaib,

I will get the proper fix for this issue. 

A comment on the patch. Changing the API to `def open(self, fullurl, data=None, ftp_retrieve=False):`  just breaks the abstraction of the open method and may not be the way to go for this. Any changes that is required should be within the FTPHandler classes.
msg276723 - (view) Author: Sohaib Ahmad (Sohaib Ahmad) * Date: 2016-09-16 14:59
Hi Senthil,

Thanks for the review. Now that I look at it, even with a default value, an ftp specific parameter sure does break the open() API abstraction.
msg276795 - (view) Author: Sohaib Ahmad (Sohaib Ahmad) * Date: 2016-09-17 13:48
I finally found the actual problem causing the failure of second download. urlretrieve() works with FTP in PASV mode, and in PASV mode after sending the file to client, the FTP server sends an ACK that the file has been transferred. After the fix of issue1067702 socket was being closed without receiving this ACK.

Now, when a user tries to download the same file or another file from same directory, the key (host, port, dirs) remains the same so open_ftp() skips ftp initialization. Because of this skipping, previous FTP connection is reused and when new commands are sent to the server, server first sends the previous ACK. This causes a domino effect and each response gets delayed by one and we get an exception from parse227().

Expected response:
        *cmd* 'RETR Contents-udeb-ppc64el.gz'
        *resp* '150 Opening BINARY mode data connection for Contents-udeb-ppc64el.gz (26555 bytes).'
        *resp* '226 Transfer complete.'

        *cmd* 'TYPE I'
        *resp* '200 Switching to Binary mode.'
        *cmd* 'PASV'
        *resp* '227 Entering Passive Mode (130,239,18,173,137,59).'

Actual response:
        *cmd* 'RETR Contents-udeb-ppc64el.gz'
        *resp* '150 Opening BINARY mode data connection for Contents-udeb-ppc64el.gz (26555 bytes).'

        *cmd* 'TYPE I'
        *resp* '226 Transfer complete.'
        *cmd* 'PASV'
        *resp* '200 Switching to Binary mode.'

I am attaching a new patch (urllib.patch) which fixes this problem by clearing the FTP server responses first if an existing connection is being used to download a file. Please review and let me know if it looks good.
msg280078 - (view) Author: Sohaib Ahmad (Sohaib Ahmad) * Date: 2016-11-04 20:20
Can someone please review this patch so that it would be in 2.7.13 when it comes out?
msg280089 - (view) Author: Senthil Kumaran (orsenthil) * (Python committer) Date: 2016-11-04 21:11
@Sohaib,

Thanks for the ping. Yeah, I will act on it.

Your analysis in msg276795 seems plausible, On the patch itself, I will try to reproduce this and see if I can avoid introducing this clear_buffer function. If no other go, then it should just be a private method (_clear_buffer).

That's my update and we will have it included by 2.7.13 and in 3.x versions if have a similar fate.

Thanks.
msg280537 - (view) Author: Sohaib Ahmad (Sohaib Ahmad) * Date: 2016-11-10 19:41
@Senthil, thanks for looking into this.

Looking forward to your commit.

Regards.
msg283525 - (view) Author: Senthil Kumaran (orsenthil) * (Python committer) Date: 2016-12-18 01:06
Just an update. We missed the last 2.7.13 release train. Sorry for that. 
It's worth to fix this bug still ASAP. It's intricately related to some other bugs like (issue25458, issue26960). I got to this and realized the dependency and other bugs. All those will be cleaned up and fixed together.
msg284977 - (view) Author: Stefan Behnel (scoder) * (Python committer) Date: 2017-01-08 11:28
This bug makes the installation of lxml fail on many systems (especially MacOS) when pip installs from sources, because it tries to download its library dependencies from FTP and crashes due to the FTP "error". I guess the current fix is to not use urllib for that and instead implement the FTP downloads separately. That's unfortunate.
msg284986 - (view) Author: Stefan Behnel (scoder) * (Python committer) Date: 2017-01-08 12:00
Actually, it seems that calling urlcleanup() is sufficient as a work-around.
msg286016 - (view) Author: Senthil Kumaran (orsenthil) * (Python committer) Date: 2017-01-22 17:12
I spent time digging for a proper fix for this issue.

I've come to a conclusion that this commit, https://hg.python.org/cpython/rev/44d02a5d59fb (10 May 2016) in 2.7.12, was a mistake and needs to be reverted.

The reason this change was made was apparently, it fixed a particular problem with retrieving files in 3.x, and the change was backported (issue26960). It was reported and fixed in 3.x code line here http://bugs.python.org/issue16270 (This is an improper change too and it needs to be reverted, we will come to it at the end [3].

Why is this a problem?

1. The change made was essentially taking the logic of draining ftp response from endtransfer method. Historically, in the module, endtransfer() has been used before closing the connection and it is correct to drain response (voidresp() method call).

2. If we don't drain responses, we end up with problems like the current (issue27973) bug report.

3. The problem with issue16270 was the fail transfer failed only when urllopen was used as a context manager (which calls close implicitly on the same host). But ftp scenarios were designed for reusing the same host without calling close from a long time (3a) and context manager scenario is not applicable to 2.7 code (3b).

Here are some references on how the module shaped up:
 
3a. https://hg.python.org/cpython/rev/6e0eddfa404a - it talks about repeated retriving of the same file from host without closing as a feature.

3b. The urllopen context manager was applicable only in 3.x so the original problem of issue16270 was not reproducible in 2.7 and it was improper to backport those changes (issue26960). Even issue16270 it is improper to change the code in endtransfer, because the problem is specific with context-manager usecase of urlopen, regular usage is just fine.

This patch fixes the problem for 2.7. I have included tests in test_urllibnet to cover the scenarios that were reported. Please review this.

For 3.x code, I will reopen issue16270, I will port this patch with test cases and an additional case for context manager scenario.
msg359099 - (view) Author: Senthil Kumaran (orsenthil) * (Python committer) Date: 2019-12-31 05:15
New changeset f82e59ac4020a64c262a925230a8eb190b652e87 by Senthil Kumaran in branch '2.7':
[2.7] bpo-27973 - Fix for urllib.urlretrieve() failing on second ftp transfer (#1040)
https://github.com/python/cpython/commit/f82e59ac4020a64c262a925230a8eb190b652e87
msg359125 - (view) Author: Pablo Galindo Salgado (pablogsal) * (Python committer) Date: 2019-12-31 19:57
This PR1040 is failing in several 2.7 buildbots:

Tests result: FAILURE then FAILURE
test test_urllibnet failed -- Traceback (most recent call last):
  File "D:\cygwin\home\db3l\buildarea\2.7.bolen-windows7\build\lib\test\test_urllibnet.py", line 232, in test_multiple_ftp_retrieves
    "multiple times.\n Error message was : %s" % e)
AssertionError: Failed FTP retrieve while accessing ftp url multiple times.
 Error message was : [Errno 13] Permission denied: 'd:\\temp\\tmpiuquqa'
Example failure:

https://buildbot.python.org/all/#/builders/209/builds/4
msg359128 - (view) Author: Senthil Kumaran (orsenthil) * (Python committer) Date: 2019-12-31 21:35
Thanks for the note, Pablo.
I am going to check if this patch https://github.com/python/cpython/pull/17774 will solve the Windows buildbot issues.
msg359135 - (view) Author: Senthil Kumaran (orsenthil) * (Python committer) Date: 2019-12-31 22:42
Hi Pablo, Is there a way for us to test https://github.com/python/cpython/pull/17774 on a Windows Builder which displayed the post-commit failure?

The CI custom-builders seem to be broken for a different reason: https://buildbot.python.org/all/#/builders/106
msg359176 - (view) Author: David Bolen (db3l) * Date: 2020-01-01 22:54
The issue appears to be the temporary flag (O_TEMPORARY) that is used under Windows with delete on close temporary files.  That appears to prevent any separate access to the file by anyone else including obtaining another reference in the same process:

>>> temp = tempfile.NamedTemporaryFile()
>>> temp, temp.name
(<open file '<fdopen>', mode 'w+b' at 0x017958E8>, 'd:\\temp\\tmp44kugh')
>>> other = open(temp.name, 'r')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
IOError: [Errno 13] Permission denied: 'd:\\temp\\tmp44kugh'

Changing the mode (PR 17774) of the temporary file has no effect.  Setting delete=False will work, but would defeat the point of using NamedTemporaryFile for the cleanup.

I don't see any way to have both auto-delete and the ability to write separately to a file under Windows with NamedTemporaryFile.

Perhaps instead use test.support.temp_dir for the overall test, and let it take care of the iteration temp files when cleaning up the entire directory?  Something like:

    with test_support.temp_dir() as td:
        for i in range(self.NUM_FTP_RETRIEVES):
            urllib.FancyURLopener().retrieve(self.FTP_TEST_FILE, os.path.join(td, str(i)))

That seems to work fine under Windows.
msg359227 - (view) Author: Senthil Kumaran (orsenthil) * (Python committer) Date: 2020-01-03 05:48
Thanks for the suggestion, David. I have updated the PR 17774 to use temp_support instead of NamedTemporaryFile. Please review this.
msg359228 - (view) Author: David Bolen (db3l) * Date: 2020-01-03 06:40
I'd be happy to test an updated PR 17774 on a Windows builder, but I don't actually see any change yet.  It still appears to hold the earlier NamedTemporaryFile with mode='w' change from a few days ago.
msg359231 - (view) Author: Senthil Kumaran (orsenthil) * (Python committer) Date: 2020-01-03 11:35
Sorry, I have updated it now: https://github.com/python/cpython/pull/17774

(I had pushed to a different branch earlier and it didn't reflect in my PR)
msg359258 - (view) Author: David Bolen (db3l) * Date: 2020-01-03 21:32
Ok, I can confirm that the updated PR 17774 test passes under Windows (and still cleans up after itself).
msg359271 - (view) Author: Senthil Kumaran (orsenthil) * (Python committer) Date: 2020-01-04 02:14
New changeset 5bba60290b4ac8c95ac46cfdaba5deee37be1fab by Senthil Kumaran in branch '2.7':
bpo-27973 - Use test.support.temp_dir instead of NamedTemporaryFile for the (#17774)
https://github.com/python/cpython/commit/5bba60290b4ac8c95ac46cfdaba5deee37be1fab
History
Date User Action Args
2022-04-11 14:58:35adminsetgithub: 72160
2020-01-04 02:20:51orsenthilsetstage: resolved -> patch review
pull_requests: + pull_request17245
2020-01-04 02:14:22orsenthilsetmessages: + msg359271
2020-01-03 21:32:00db3lsetmessages: + msg359258
2020-01-03 11:35:33orsenthilsetmessages: + msg359231
2020-01-03 06:40:34db3lsetmessages: + msg359228
2020-01-03 05:48:24orsenthilsetmessages: + msg359227
2020-01-01 22:54:50db3lsetnosy: + db3l
messages: + msg359176
2019-12-31 22:42:41orsenthilsetmessages: + msg359135
2019-12-31 21:35:14orsenthilsetmessages: + msg359128
stage: patch review -> resolved
2019-12-31 21:34:02orsenthilsetstage: resolved -> patch review
pull_requests: + pull_request17207
2019-12-31 19:57:10pablogsalsetstatus: closed -> open

nosy: + pablogsal
messages: + msg359125

resolution: fixed ->
2019-12-31 05:54:58orsenthilsetstatus: open -> closed
resolution: fixed
stage: resolved
2019-12-31 05:15:00orsenthilsetmessages: + msg359099
2017-04-08 05:06:55orsenthilsetpull_requests: + pull_request1193
2017-01-22 17:12:08orsenthilsetfiles: + issue27973.patch

messages: + msg286016
2017-01-08 12:00:55scodersetmessages: + msg284986
2017-01-08 11:28:46scodersetnosy: + scoder
messages: + msg284977
2016-12-18 01:06:28orsenthilsetmessages: + msg283525
2016-11-10 19:42:58orsenthilsetassignee: orsenthil
2016-11-10 19:41:55Sohaib Ahmadsetmessages: + msg280537
2016-11-04 21:11:38orsenthilsetmessages: + msg280089
2016-11-04 20:20:20Sohaib Ahmadsetmessages: + msg280078
2016-09-17 13:48:32Sohaib Ahmadsetfiles: + urllib.patch

messages: + msg276795
2016-09-17 13:42:47Sohaib Ahmadsetfiles: - urllib.patch
2016-09-16 14:59:17Sohaib Ahmadsetmessages: + msg276723
2016-09-16 14:01:27orsenthilsetmessages: + msg276714
2016-09-16 11:39:14Sohaib Ahmadsetfiles: + urllib.patch
keywords: + patch
messages: + msg276700
2016-09-15 18:37:56Sohaib Ahmadsetmessages: + msg276587
2016-09-15 10:44:10Sohaib Ahmadsetmessages: + msg276545
2016-09-15 06:20:35berker.peksagsetnosy: + orsenthil
2016-09-15 06:02:04Sohaib Ahmadsetmessages: + msg276518
2016-09-13 14:46:40gsonsetfiles: + test.py
nosy: + gson
messages: + msg276280

2016-09-07 19:47:39Sohaib Ahmadsetmessages: + msg274871
2016-09-06 17:47:58r.david.murraysetnosy: + r.david.murray
messages: + msg274575
2016-09-06 15:09:56Sohaib Ahmadcreate