Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

urllib2 does not download 4 MB file completely using ftp #59207

Closed
sspapilin mannequin opened this issue Jun 4, 2012 · 17 comments
Closed

urllib2 does not download 4 MB file completely using ftp #59207

sspapilin mannequin opened this issue Jun 4, 2012 · 17 comments
Assignees
Labels
stdlib Python modules in the Lib dir type-bug An unexpected behavior, bug, or error

Comments

@sspapilin
Copy link
Mannequin

sspapilin mannequin commented Jun 4, 2012

BPO 15002
Nosy @orsenthil, @pitrou, @giampaolo, @bitdancer, @iritkatriel
Files
  • test.py: Downloads file into stdout
  • issue15002-tcpdump-early-fin
  • d3c6ab639306.diff
  • issue15002.patch
  • Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.

    Show more details

    GitHub fields:

    assignee = 'https://github.com/orsenthil'
    closed_at = <Date 2022-03-01.18:22:22.825>
    created_at = <Date 2012-06-04.20:14:41.261>
    labels = ['type-bug', 'library']
    title = 'urllib2 does not download 4 MB file completely using ftp'
    updated_at = <Date 2022-03-01.18:22:22.824>
    user = 'https://bugs.python.org/sspapilin'

    bugs.python.org fields:

    activity = <Date 2022-03-01.18:22:22.824>
    actor = 'iritkatriel'
    assignee = 'orsenthil'
    closed = True
    closed_date = <Date 2022-03-01.18:22:22.825>
    closer = 'iritkatriel'
    components = ['Library (Lib)']
    creation = <Date 2012-06-04.20:14:41.261>
    creator = 'sspapilin'
    dependencies = []
    files = ['25821', '26290', '34915', '34970']
    hgrepos = ['237']
    issue_num = 15002
    keywords = ['patch']
    message_count = 17.0
    messages = ['162287', '162319', '162320', '164828', '164833', '216465', '216466', '216472', '216507', '216508', '216509', '216511', '216513', '216849', '216910', '216911', '414287']
    nosy_count = 9.0
    nosy_names = ['orsenthil', 'ctheune', 'pitrou', 'giampaolo.rodola', 'r.david.murray', 'bbrazil', 'python-dev', 'sspapilin', 'iritkatriel']
    pr_nums = []
    priority = 'normal'
    resolution = 'fixed'
    stage = 'resolved'
    status = 'closed'
    superseder = None
    type = 'behavior'
    url = 'https://bugs.python.org/issue15002'
    versions = ['Python 2.7', 'Python 3.4', 'Python 3.5']

    @sspapilin
    Copy link
    Mannequin Author

    sspapilin mannequin commented Jun 4, 2012

    File test.py is

    #!/usr/bin/env python
    import urllib2
    print urllib2.urlopen('ftp://ftp.ripe.net/pub/stats/ripencc/delegated-ripencc-extended-latest').read()

    When I issue

    python test.py > out.txt

    , I get file about 100KB in size, the beginning of the actual file. I repeated it a hundred times, and almost every time I get 98305 byte file, and a couple of times a 49153 bytes or 188417 bytes file.

    When I replace urllib2 with urllib in test.py, I get full (4 MB) file.

    I have Ubuntu 12.04 64-bit, Python 2.7.3 (from default Ubuntu repository, up-to-date as of 4-june-2012) and slow, 64KB/s, Internet connection.

    However, I asked my friend with Windows and faster connection to check it, and he got partial download as well, while he had another size of partial file (50109 bytes). I do not know his OS ant Python versions.

    @sspapilin sspapilin mannequin added stdlib Python modules in the Lib dir type-bug An unexpected behavior, bug, or error labels Jun 4, 2012
    @sspapilin sspapilin mannequin changed the title urllib2 does not download 4 MB file fully using ftp urllib2 does not download 4 MB file completely using ftp Jun 4, 2012
    @bitdancer
    Copy link
    Member

    The same problem exists in Python3.

    @orsenthil
    Copy link
    Member

    That's surprising! I shall test it with http debug mode and see what's happening.

    @orsenthil orsenthil self-assigned this Jun 5, 2012
    @bbrazil
    Copy link
    Mannequin

    bbrazil mannequin commented Jul 7, 2012

    I've tested this on head, and the issue appears to be buggy ftp code in python.

    From the attached tcpdump for fetching delegated-ripencc-20120706:

    12:57:19.933607 IP myhost.39627 > ftp.ripe.net.ftp: Flags [.], ack 511, win 115, options [nop,nop,TS val 129353190 ecr 1632444059], length 0
    12:57:19.934853 IP myhost.39627 > ftp.ripe.net.ftp: Flags [F.], seq 97, ack 511, win 115, options [nop,nop,TS val 129353191 ecr 1632444059], length 0

    and a bit later:

    12:57:20.043701 IP ftp.ripe.net.42707 > myhost.50818: Flags [.], seq 46337:47785, ack 1, win 227, options [nop,nop,TS val 2552550247 ecr 129353204], length 1448
    12:57:20.043717 IP myhost.50818 > ftp.ripe.net.42707: Flags [.], ack 47785, win 353, options [nop,nop,TS val 129353218 ecr 2552550247], length 0
    12:57:20.043816 IP ftp.ripe.net.42707 > myhost.50818: Flags [FP.], seq 47785:49153, ack 1, win 227, options [nop,nop,TS val 2552550247 ecr 129353204], length 1368
    12:57:20.043992 IP myhost.50818 > ftp.ripe.net.42707: Flags [F.], seq 1, ack 49154, win 376, options [nop,nop,TS val 129353218 ecr 2552550247], length 0
    12:57:20.094067 IP ftp.ripe.net.42707 > myhost.50818: Flags [.], ack 2, win 227, options [nop,nop,TS val 2552550299 ecr 129353218], length 0

    As you can see we're sending a FIN without sending a close command to the control connection, and in response the server stops sending data about 49k in. Per RFC 959 section 2.3: "The server may abort data transfer if the control connections are closed without command." so this is acceptable behaviour on the part of the server, and means we need to keep the control connection open for longer.

    @bbrazil
    Copy link
    Mannequin

    bbrazil mannequin commented Jul 7, 2012

    More particularly, the ftpwrapper's ftp member is being GCed sometime after FtpHandler.ftp_open returns.

    @ctheune
    Copy link
    Mannequin

    ctheune mannequin commented Apr 16, 2014

    Looking into this.

    It seems that it doesn't happen for all servers, I can download large files reliably from other sources.

    I'll make another wireshark recording to get more details for me to analyze.

    @orsenthil
    Copy link
    Member

    I'll make another wireshark recording to get more details for me to analyze.

    Thank you! That will be useful. Please test it against 3.x version as it has seen cleanups recently.

    @ctheune
    Copy link
    Mannequin

    ctheune mannequin commented Apr 16, 2014

    This is actually the same problem as bpo-18879.

    Changing the sample to keep a reference to the addinfourl object avoids this issue.

    This is even worse than bpo-18879 in the sense that the error goes undetected and just leaves you with partial data.

    Looking at the solution in bpo-18879 I think we can reuse that, maybe even better by refactoring that to a common file proxy object.

    @ctheune
    Copy link
    Mannequin

    ctheune mannequin commented Apr 16, 2014

    I wasn't able to come up with a good testcase. :(

    I tried similar approaches as in bpo-18879 but I wasn't able to make them trigger the behaviour as it also seems to be an issue regarding actual network performance ... :/

    Backport to 2.7 is currently missing as I'd need bpo-18879 to be backported. If that is OK (I'd like to have this in 2.7) then I'd be happy to port both.

    @ctheune
    Copy link
    Mannequin

    ctheune mannequin commented Apr 16, 2014

    Antoine, I'm adding you here as I'm leveraging your patch from bpo-18879.

    I'd need some feedback about the backport, but this patch should be OK for 3.4. Also, if you had an idea how to test this - I tried, but failed so far.

    @ctheune
    Copy link
    Mannequin

    ctheune mannequin commented Apr 16, 2014

    Antoine, could you check my last comment in here?

    (The nosy list got reset accidentally when I made that comment and got a conflict from the tracker).

    @orsenthil
    Copy link
    Member

    Christian , with respect to patch, I agree with the logic (using something similar to bpo-18879). Does all current unittests succeed with this? (I suspect not) A unittest for coverage would be helpful.

    @pitrou
    Copy link
    Member

    pitrou commented Apr 16, 2014

    Well, this looks ok on the principle, but I haven't investigated the urllib-specific parts, so I'll let Senthil delve into this :)

    @orsenthil
    Copy link
    Member

    Christian's patch is good.It helps in setting the socket.makefile file descriptor to a well behaving well file close wrapper and thus will help us prevent some tricky fd close issues.

    I added tests for coverage to ensure that we are asserting the type and covering all the methods in urllib.response. Attaching the patch built on top of Chritians one.

    @python-dev
    Copy link
    Mannequin

    python-dev mannequin commented Apr 20, 2014

    New changeset bb71b71322a3 by Senthil Kumaran in branch '3.4':
    urllib.response object to use _TemporaryFileWrapper (and _TemporaryFileCloser)
    http://hg.python.org/cpython/rev/bb71b71322a3

    New changeset 72fe23edfec6 by Senthil Kumaran in branch '3.4':
    NEWS entry for bpo-15002
    http://hg.python.org/cpython/rev/72fe23edfec6

    New changeset 8c8315bac6a8 by Senthil Kumaran in branch 'default':
    merge 3.4
    http://hg.python.org/cpython/rev/8c8315bac6a8

    @orsenthil
    Copy link
    Member

    This is fixed in 3.4 and 3.5. I will backport to 2.7 ( I think, it is worth it).

    @iritkatriel
    Copy link
    Member

    2.7 backport is no longer relevant.

    @ezio-melotti ezio-melotti transferred this issue from another repository Apr 10, 2022
    Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
    Labels
    stdlib Python modules in the Lib dir type-bug An unexpected behavior, bug, or error
    Projects
    None yet
    Development

    No branches or pull requests

    4 participants