This issue tracker has been migrated to GitHub, and is currently read-only.
For more information, see the GitHub FAQs in the Python's Developer Guide.

classification
Title: [2.7] Python on 64-bit Windows uses signed 32-bit type for read length
Type: behavior Stage: resolved
Components: IO Versions: Python 2.7
process
Status: closed Resolution: wont fix
Dependencies: Superseder:
Assigned To: larry Nosy List: Matt.Mackall, benjamin.peterson, eric.smith, ezio.melotti, josh.r, larry, loewis, ncoghlan, pitrou, serhiy.storchaka, vstinner
Priority: normal Keywords: patch

Created on 2014-04-10 16:58 by Matt.Mackall, last changed 2022-04-11 14:58 by admin. This issue is now closed.

Files
File name Uploaded Description Edit
file_read_size_t.patch vstinner, 2014-04-16 06:19 review
larry.file_read.8.bytes.1.diff larry, 2014-04-16 23:03 review
larry.make.file_read.use.size_t.2.diff larry, 2016-05-15 15:48 review
Messages (26)
msg215893 - (view) Author: Matt Mackall (Matt.Mackall) Date: 2014-04-10 16:58
Python's file_read uses an 'l' type to parse its args which results in a 31-bit size limit on reads on 64-bit Windows. It should instead use an ssize_t.

Related Mercurial bug:

http://bz.selenic.com/show_bug.cgi?id=4215
msg215896 - (view) Author: Larry Hastings (larry) * (Python committer) Date: 2014-04-10 17:57
Benjamin: given the brave new world of 2.7 living even longer, Guido said this sort of bug fix was fair game.  Will you accept a patch for this in 2.7?
msg215899 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2014-04-10 19:27
It is much simpler for Mercurial to add a workaround (which they already did, apparently :-)), than to risk introducing regressions by fixing this. AFAIK nobody else complained before.
msg216008 - (view) Author: Martin v. Löwis (loewis) * (Python committer) Date: 2014-04-13 09:37
FWIW, using "l" for file_read goes back to

http://hg.python.org/cpython/diff/f44a56e697fb/Objects/fileobject.c
from Fri, 09 May 1997 22:27:31 +0000
msg216016 - (view) Author: Josh Rosenberg (josh.r) * (Python triager) Date: 2014-04-13 13:51
So it predates the existence of type code 'n', which would be the appropriate type code, but no one updated it.

Antoine: Inability to perform a 2GB+ read properly is not something that should be worked around on a case by case basis. There is one compatibility risk here, which is that 'n' requires index integers (__index__), where 'l' will accept coercible types (__int__). If you think people are reading from files using Fraction and Decimal, then this would be a compatibility issue (and you could work around it be making it use type code 'L' only on 64 bit Windows), but this just looks like an oversight from the upgrade from int to Py_ssize_t across the Python code base.
msg216023 - (view) Author: Martin v. Löwis (loewis) * (Python committer) Date: 2014-04-13 15:53
Josh: it's not as simple as just changing the type code and variable type. Systems used to require all kinds of types in the length parameter of read(2) and fread(3), including int, long, and size_t. If it was int, passing size_t would lead to silent truncation. So at a minimum, a careful review of the further processing, and a test case is necessary.

I'm personally not interested in Python 2.7 anymore, so I certainly won't work on this.

As for why this isn't reported frequently, I guess a number of causes:
a) people don't use Python on Windows that much to process large files
b) when they do and run into this problem, they feel guilty for actually asking for a such large chunk, which could well exhaust the memory of the system.
c) there is a straight-forward work-around
d) people that do read large files typically either read them at once, or in much smaller chunks

So it may well be that this remains the only report of this for the rest of Python 2.7's life.
msg216026 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2014-04-13 16:43
> Antoine: Inability to perform a 2GB+ read properly is not something
> that should be worked around on a case by case basis.

Really we are talking about Python 2.7 here. Anyone having this problem *has* to work it around in order for their code to work on published versions. The benefit of fixing this in the next 2.7.x iteration is small compared to the cost of a potential regression in the core I/O implementation.
msg216165 - (view) Author: Brett Cannon (brett.cannon) * (Python committer) Date: 2014-04-14 18:41
We discussed this bug at the sprint and we all agreed that it should be fixed. It's an oversight and a bug that this behaviour is different between Windows and other platforms due to how 64-bit Windows treats longs.
msg216377 - (view) Author: Larry Hastings (larry) * (Python committer) Date: 2014-04-15 20:15
I don't think we can fix this.  file_read returns a str, and str can have at most SSIZE_T_MAX entries.  So, file_read could read size_t characters, but it couldn't return them.
msg216389 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2014-04-15 21:24
> I don't think we can fix this. file_read returns a str, and str can
have at most SSIZE_T_MAX entries. So, file_read could read size_t
characters, but it couldn't return them.

But at least it could read more than LONG_MAX characters.
msg216396 - (view) Author: Matt Mackall (Matt.Mackall) Date: 2014-04-15 21:42
Actually, no, all the size_t types are 64-bit on Windows 64:

https://en.wikipedia.org/wiki/64-bit_computing#64-bit_data_models

To confirm this, I found someone nearby with a 64-bit Windows and he had no trouble allocating a 3G string with a = b'a' * 3000000000.
msg216398 - (view) Author: Larry Hastings (larry) * (Python committer) Date: 2014-04-15 22:04
D'oh, yeah, you guys are right.  SSIZE_T_MAX > LONG_MAX, so that's an improvement.  We just can't do the full size_t range.
msg216435 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2014-04-16 05:53
On Windows, the type of the size parameter of read() is an unsigned int, not long nor size_t:
http://msdn.microsoft.com/en-us/library/1570wh78.aspx

FileIO.read() of Python 3 uses a size_t type but also:

#ifdef MS_WINDOWS
    if (size > INT_MAX)
        size = INT_MAX;
#endif
...
#ifdef MS_WINDOWS
        n = read(self->fd, ptr, (int)size);
#else
        n = read(self->fd, ptr, size);
#endif


Using a size_t to parse the Python parameter would avoid an OverflowError and the function would be more portable. But it would not permit to read more than 2 GB at once.
msg216436 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2014-04-16 06:09
"It is much simpler for Mercurial to add a workaround (which they already did, apparently :-)), than to risk introducing regressions by fixing this. AFAIK nobody else complained before."

We fixed a lot of issues for 32-bit long and 64-bit size_t types (Windows 64 bits) in Python 3.3 and 3.4. I don't think that they were fixed in Python 2.7, because the required changes are large and it would be risky to backport them. At least, I don't feel confortable to backport them.

I suggest to switch to Python 3 if you would like to handle large objects (bigger than 2 GB) on Windows 64-bit.

See the issue #9566 for examples of fixes, but there were many other changes related to 64-bit Windows issues.

I see for example that send() & send_bytes() methods of multiprocessing connection doesn't truncate the length to DWORD_MAX bytes on Windows in Python 2.7, whereas it was fixed in Python 3:
http://hg.python.org/cpython/rev/c75ab7b802df
(On UNIX, the same methods loop on write() to ensure that all bytes are written.)

If you want to announce that Python 2.7.x supports large objects on Windows 64 bits, be prepared to have to fix Python in various different places.
msg216437 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2014-04-16 06:19
> On Windows, the type of the size parameter of read() is an unsigned int, not long nor size_t (...)

Oh, I read the wrong function. In fact, file_read() of Python 2.7 calls fread() and fread() uses size_t types, even on Windows.

To make sure that we are talking about the same thing, I wrote the attached file_read_size_t.patch file which replaces "l" with "n" in file_read().

Note: os.read() uses int types, even on Python 3.5, whereas read() uses a size_t type for the number of bytes on Linux.
msg216439 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2014-04-16 06:36
"If you want to announce that Python 2.7.x supports large objects on Windows 64 bits, be prepared to have to fix Python in various different places."

You can compare which modules define PY_SSIZE_T_CLEAN in Python 2.7 and 3.x. For example, it looks like bz2 and zlib modules handle correctly 64-bit lengths in Python 3, but don't in Python 2.

By the way, BZ2File_read() in Python 2.7 uses also the "l" format to parse the input length. It looks like the code was copied from fileobject.c.
msg216617 - (view) Author: Larry Hastings (larry) * (Python committer) Date: 2014-04-16 23:03
Here's my version of the patch.  It does the same thing Victor's patch does, but removes a now-completely-irrelevant stanza of code, and adds a test.

I'm on 64-bit Linux, so the test was always going to work anyway.  So I tested the test by changing file_read to use an int (and the "i" format unit), which happily causes it to fail.
msg226773 - (view) Author: Larry Hastings (larry) * (Python committer) Date: 2014-09-11 13:15
Ping?
msg226781 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2014-09-11 14:48
I added comments to test on Rietveld.
msg265402 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2016-05-12 12:47
Ping?
msg265552 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2016-05-14 22:58
I'm lost in this old issue. Can someone please try to summarize it? What is the problem? What is the proposed fix? What is the status?
msg265595 - (view) Author: Larry Hastings (larry) * (Python committer) Date: 2016-05-15 08:38
Summary: read Matt's original message on the bug.

Status: still broken.  Some people think it should be left that way.

Proposed fix: it's straightforward to fix.  You wrote a patch, I modified yours a little.  The patch is short and easy to read.
msg265624 - (view) Author: Larry Hastings (larry) * (Python committer) Date: 2016-05-15 15:48
Updated patch based on feedback from Serihy.  Thanks for the ping.
msg265627 - (view) Author: Larry Hastings (larry) * (Python committer) Date: 2016-05-15 16:04
Martin said:

> Josh: it's not as simple as just changing the type code and variable type. Systems used to require all kinds of types in the length parameter of read(2) and fread(3), including int, long, and size_t. If it was int, passing size_t would lead to silent truncation. So at a minimum, a careful review of the further processing, and a test case is necessary.

file_read() calls Py_UniversalNewlineFread() to do the actual reading.  The length argument of Py_UniversalNewlineFread(), "n", is already of type size_t, and it passes this parameter in to fread() for the number-of-elements argument.  Py_UniversalNewlineFread() is also used by file_readinto(), and that function already passes in a size_t.  So it's already successfully being used in 2.7.

If Python 2.7 still supports systems where fread()'s number-of-elements argument is smaller than size_t, this should already be causing a compilation warning on such platforms.  One would hope people supporting such platforms would fix it.  I don't know of any such platforms, so I cannot address this hypothetical issue.  Do you know of platforms supported by 2.7 where fread's number-of-elements parameter is smaller than size_t?


Antoine said:

> The benefit of fixing this in the next 2.7.x iteration is small compared to the cost of a potential regression in the core I/O implementation.

Although nobody wants to cause a regression in CPython, I'm having difficulty imagining how this change could plausibly cause one.  On the other hand, Matt makes it clear that it's a bug that has bitten hg, so I think it's worth fixing.

Can you suggest a way this change could plausibly cause a regression?
msg265673 - (view) Author: Benjamin Peterson (benjamin.peterson) * (Python committer) Date: 2016-05-16 06:43
Fixing seems fine with me.
msg364548 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2020-03-18 18:28
Python 3 doesn't have this issue. Python 2 no longer accepts bugfixes. I close the issue.

Note: there was a similar issue specific to macOS: bpo-24658 and PR 9938 (closed since Python 2 moved to end of life).
History
Date User Action Args
2022-04-11 14:58:01adminsetgithub: 65398
2020-03-18 18:28:42vstinnersetstatus: open -> closed
title: Python on 64-bit Windows uses signed 32-bit type for read length -> [2.7] Python on 64-bit Windows uses signed 32-bit type for read length
messages: + msg364548

resolution: wont fix
stage: patch review -> resolved
2020-03-18 18:22:01brett.cannonsetnosy: - brett.cannon
2016-05-16 06:43:01benjamin.petersonsetmessages: + msg265673
2016-05-15 16:04:28larrysetmessages: + msg265627
2016-05-15 15:48:17larrysetfiles: + larry.make.file_read.use.size_t.2.diff

messages: + msg265624
2016-05-15 08:38:52larrysetmessages: + msg265595
2016-05-14 22:58:14vstinnersetmessages: + msg265552
2016-05-12 12:47:40serhiy.storchakasetmessages: + msg265402
2014-09-11 14:48:15serhiy.storchakasetnosy: + serhiy.storchaka
messages: + msg226781
2014-09-11 13:15:32larrysetmessages: + msg226773
2014-04-16 23:03:52larrysetfiles: + larry.file_read.8.bytes.1.diff

messages: + msg216617
stage: patch review
2014-04-16 06:36:27vstinnersetmessages: + msg216439
2014-04-16 06:19:18vstinnersetfiles: + file_read_size_t.patch
keywords: + patch
messages: + msg216437
2014-04-16 06:09:45vstinnersetmessages: + msg216436
2014-04-16 05:53:51vstinnersetnosy: + vstinner
messages: + msg216435
2014-04-15 22:04:59larrysetmessages: + msg216398
2014-04-15 21:42:25Matt.Mackallsetmessages: + msg216396
2014-04-15 21:24:04pitrousetmessages: + msg216389
2014-04-15 20:15:49larrysetmessages: + msg216377
2014-04-14 18:41:14brett.cannonsetnosy: + brett.cannon
messages: + msg216165
2014-04-14 18:39:40eric.smithsetnosy: + eric.smith
2014-04-13 16:43:45pitrousetmessages: + msg216026
2014-04-13 15:53:49loewissetmessages: + msg216023
2014-04-13 13:51:22josh.rsetnosy: + josh.r
messages: + msg216016
2014-04-13 09:37:36loewissetnosy: + loewis
messages: + msg216008
2014-04-10 19:27:11pitrousetnosy: + pitrou
messages: + msg215899
2014-04-10 17:57:39larrysetassignee: larry

messages: + msg215896
nosy: + benjamin.peterson
2014-04-10 17:09:54ezio.melottisetnosy: + ezio.melotti
2014-04-10 16:58:53Matt.Mackallcreate