msg215893
Author: Matt Mackall (Matt.Mackall)
Date: 2014-04-10 16:58
Python's file_read uses an 'l' format code to parse its size argument, which results in a 31-bit size limit on reads on 64-bit Windows (where a C long is only 32 bits). It should instead use an ssize_t.
Related Mercurial bug:
http://bz.selenic.com/show_bug.cgi?id=4215
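The truncation can be modeled on any platform. On 64-bit Windows (LLP64) a C long is only 32 bits wide, so a read size of 2 GiB or more wraps when it passes through the 'l' format code. A minimal sketch (to_int32 is an illustrative helper, not the CPython code path):

```python
def to_int32(n):
    """Wrap an integer to a signed 32-bit value, as storing it in a
    32-bit C long would. Models the 'l' format code on 64-bit Windows."""
    return ((n + 2**31) % 2**32) - 2**31

print(to_int32(2**31 - 1))  # 2147483647: the largest size that survives
print(to_int32(2**31))      # -2147483648: a 2 GiB request goes negative
```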

msg215896
Author: Larry Hastings (larry)
Date: 2014-04-10 17:57
Benjamin: given the brave new world of 2.7 living even longer, Guido said this sort of bug fix was fair game. Will you accept a patch for this in 2.7?

msg215899
Author: Antoine Pitrou (pitrou)
Date: 2014-04-10 19:27
It is much simpler for Mercurial to add a workaround (which they already did, apparently :-)) than to risk introducing regressions by fixing this. AFAIK nobody else complained before.

msg216008
Author: Martin v. Löwis (loewis)
Date: 2014-04-13 09:37
FWIW, using "l" for file_read goes back to
http://hg.python.org/cpython/diff/f44a56e697fb/Objects/fileobject.c
from Fri, 09 May 1997 22:27:31 +0000

msg216016
Author: Josh Rosenberg (josh.r)
Date: 2014-04-13 13:51
So it predates the existence of type code 'n', which would be the appropriate type code, but no one updated it.
Antoine: the inability to perform a 2 GB+ read properly is not something that should be worked around on a case-by-case basis. There is one compatibility risk here: 'n' requires index integers (__index__), whereas 'l' will accept coercible types (__int__). If you think people are reading from files using Fraction and Decimal, then this would be a compatibility issue (and you could work around it by making it use type code 'L' only on 64-bit Windows), but this just looks like an oversight from the upgrade from int to Py_ssize_t across the Python code base.
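The compatibility point can be seen from pure Python: operator.index() is the Python-level counterpart of what the 'n' format code requires, and it rejects objects that only define __int__. A small illustration (IntOnly and IndexLike are made-up example classes):

```python
import operator

class IntOnly(object):
    """Coercible via __int__ but not an index type (no __index__)."""
    def __int__(self):
        return 42

class IndexLike(object):
    """Usable anywhere an integer index is required."""
    def __index__(self):
        return 42

print(int(IntOnly()))               # 42: the looser coercion 'l' allows
print(operator.index(IndexLike()))  # 42: what 'n' accepts
try:
    operator.index(IntOnly())       # the case 'n' would start rejecting
except TypeError as exc:
    print("TypeError:", exc)
```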

msg216023
Author: Martin v. Löwis (loewis)
Date: 2014-04-13 15:53
Josh: it's not as simple as just changing the type code and variable type. Systems used to require all kinds of types in the length parameter of read(2) and fread(3), including int, long, and size_t. If it was int, passing size_t would lead to silent truncation. So at a minimum, a careful review of the further processing, and a test case is necessary.
I'm personally not interested in Python 2.7 anymore, so I certainly won't work on this.
As for why this isn't reported frequently, I guess a number of causes:
a) people don't use Python on Windows that much to process large files
b) when they do and run into this problem, they feel guilty for asking for such a large chunk, which could well exhaust the memory of the system.
c) there is a straightforward workaround
d) people that do read large files typically either read them at once, or in much smaller chunks
So it may well be that this remains the only report of this for the rest of Python 2.7's life.
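The workaround alluded to in (c) and (d) is to cap each read() call and loop. A sketch of such a helper (read_in_chunks is an illustrative name, not Mercurial's actual fix):

```python
import io

def read_in_chunks(f, size, chunk=2**20):
    """Read up to `size` bytes from `f` without ever asking the
    underlying read() for a chunk anywhere near 2 GiB in one call."""
    parts = []
    remaining = size
    while remaining > 0:
        data = f.read(min(chunk, remaining))
        if not data:          # EOF before `size` bytes were available
            break
        parts.append(data)
        remaining -= len(data)
    return b"".join(parts)

f = io.BytesIO(b"x" * 100)
print(len(read_in_chunks(f, 2**31, chunk=64)))  # 100: limited by EOF
```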

msg216026
Author: Antoine Pitrou (pitrou)
Date: 2014-04-13 16:43
> Antoine: Inability to perform a 2GB+ read properly is not something
> that should be worked around on a case by case basis.
Really, we are talking about Python 2.7 here. Anyone having this problem *has* to work around it in order for their code to work on published versions. The benefit of fixing this in the next 2.7.x iteration is small compared to the cost of a potential regression in the core I/O implementation.

msg216165
Author: Brett Cannon (brett.cannon)
Date: 2014-04-14 18:41
We discussed this bug at the sprint and we all agreed that it should be fixed. It's an oversight and a bug that this behaviour is different between Windows and other platforms due to how 64-bit Windows treats longs.

msg216377
Author: Larry Hastings (larry)
Date: 2014-04-15 20:15
I don't think we can fix this. file_read returns a str, and str can have at most SSIZE_T_MAX entries. So, file_read could read size_t characters, but it couldn't return them.

msg216389
Author: Antoine Pitrou (pitrou)
Date: 2014-04-15 21:24
> I don't think we can fix this. file_read returns a str, and str can
> have at most SSIZE_T_MAX entries. So, file_read could read size_t
> characters, but it couldn't return them.
But at least it could read more than LONG_MAX characters.

msg216396
Author: Matt Mackall (Matt.Mackall)
Date: 2014-04-15 21:42
Actually, no, all the size_t types are 64-bit on Windows 64:
https://en.wikipedia.org/wiki/64-bit_computing#64-bit_data_models
To confirm this, I found someone nearby with a 64-bit Windows and he had no trouble allocating a 3G string with a = b'a' * 3000000000.

msg216398
Author: Larry Hastings (larry)
Date: 2014-04-15 22:04
D'oh, yeah, you guys are right. SSIZE_T_MAX > LONG_MAX, so that's an improvement. We just can't do the full size_t range.

msg216435
Author: STINNER Victor (vstinner)
Date: 2014-04-16 05:53
On Windows, the type of the size parameter of read() is an unsigned int, not long nor size_t:
http://msdn.microsoft.com/en-us/library/1570wh78.aspx
FileIO.read() of Python 3 uses a size_t type but also:
#ifdef MS_WINDOWS
    if (size > INT_MAX)
        size = INT_MAX;
#endif
    ...
#ifdef MS_WINDOWS
    n = read(self->fd, ptr, (int)size);
#else
    n = read(self->fd, ptr, size);
#endif
Using a size_t to parse the Python parameter would avoid an OverflowError and make the function more portable, but it would still not permit reading more than 2 GB at once.
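A toy model of the clamping behaviour quoted above (illustrative only, not the C implementation): the request is accepted at any size, but a single underlying read is capped at INT_MAX bytes, so the caller must loop to get the rest.

```python
INT_MAX = 2**31 - 1  # the 32-bit int limit of Windows read()

def clamped_read_size(size):
    """Model of the Windows path in FileIO.read(): any request is
    accepted, but one read() call never asks for more than INT_MAX."""
    return min(size, INT_MAX)

print(clamped_read_size(2**33))  # 2147483647
print(clamped_read_size(4096))   # 4096
```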

msg216436
Author: STINNER Victor (vstinner)
Date: 2014-04-16 06:09
"It is much simpler for Mercurial to add a workaround (which they already did, apparently :-)), than to risk introducing regressions by fixing this. AFAIK nobody else complained before."
We fixed a lot of issues for 32-bit long and 64-bit size_t types (64-bit Windows) in Python 3.3 and 3.4. I don't think that they were fixed in Python 2.7, because the required changes are large and it would be risky to backport them. At least, I don't feel comfortable backporting them.
I suggest switching to Python 3 if you would like to handle large objects (bigger than 2 GB) on 64-bit Windows.
See the issue #9566 for examples of fixes, but there were many other changes related to 64-bit Windows issues.
I see, for example, that the send() & send_bytes() methods of multiprocessing's connection don't truncate the length to DWORD_MAX bytes on Windows in Python 2.7, whereas this was fixed in Python 3:
http://hg.python.org/cpython/rev/c75ab7b802df
(On UNIX, the same methods loop on write() to ensure that all bytes are written.)
If you want to announce that Python 2.7.x supports large objects on Windows 64 bits, be prepared to have to fix Python in various different places.
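The UNIX-side behaviour described above (looping on write() until all bytes are written) looks roughly like this in Python; the short_write function below is a stand-in for a real file descriptor that performs partial writes:

```python
def write_all(write, data):
    """Call `write` repeatedly until every byte of `data` is written,
    handling short writes the way the UNIX code path does."""
    view = memoryview(data)
    while view:
        n = write(view)   # a real write() may accept fewer bytes
        view = view[n:]

written = []
def short_write(view):
    """Stand-in writer that accepts at most 7 bytes per call."""
    taken = bytes(view[:7])
    written.append(taken)
    return len(taken)

write_all(short_write, b"hello world, hello moon")
print(b"".join(written))  # b'hello world, hello moon'
```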

msg216437
Author: STINNER Victor (vstinner)
Date: 2014-04-16 06:19
> On Windows, the type of the size parameter of read() is an unsigned int, not long nor size_t (...)
Oh, I read the wrong function. In fact, file_read() of Python 2.7 calls fread() and fread() uses size_t types, even on Windows.
To make sure that we are talking about the same thing, I wrote the attached file_read_size_t.patch file which replaces "l" with "n" in file_read().
Note: os.read() uses int types, even on Python 3.5, whereas read() uses a size_t type for the number of bytes on Linux.

msg216439
Author: STINNER Victor (vstinner)
Date: 2014-04-16 06:36
"If you want to announce that Python 2.7.x supports large objects on Windows 64 bits, be prepared to have to fix Python in various different places."
You can compare which modules define PY_SSIZE_T_CLEAN in Python 2.7 and 3.x. For example, it looks like the bz2 and zlib modules correctly handle 64-bit lengths in Python 3, but don't in Python 2.
By the way, BZ2File_read() in Python 2.7 also uses the "l" format to parse the input length. It looks like the code was copied from fileobject.c.

msg216617
Author: Larry Hastings (larry)
Date: 2014-04-16 23:03
Here's my version of the patch. It does the same thing Victor's patch does, but removes a now-completely-irrelevant stanza of code, and adds a test.
I'm on 64-bit Linux, so the test was always going to work anyway. So I tested the test by changing file_read to use an int (and the "i" format unit), which happily causes it to fail.

msg226773
Author: Larry Hastings (larry)
Date: 2014-09-11 13:15
Ping?

msg226781
Author: Serhiy Storchaka (serhiy.storchaka)
Date: 2014-09-11 14:48
I added comments to test on Rietveld.

msg265402
Author: Serhiy Storchaka (serhiy.storchaka)
Date: 2016-05-12 12:47
Ping?

msg265552
Author: STINNER Victor (vstinner)
Date: 2016-05-14 22:58
I'm lost in this old issue. Can someone please try to summarize it? What is the problem? What is the proposed fix? What is the status?

msg265595
Author: Larry Hastings (larry)
Date: 2016-05-15 08:38
Summary: read Matt's original message on the bug.
Status: still broken. Some people think it should be left that way.
Proposed fix: it's straightforward to fix. You wrote a patch, I modified yours a little. The patch is short and easy to read.

msg265624
Author: Larry Hastings (larry)
Date: 2016-05-15 15:48
Updated patch based on feedback from Serhiy. Thanks for the ping.

msg265627
Author: Larry Hastings (larry)
Date: 2016-05-15 16:04
Martin said:
> Josh: it's not as simple as just changing the type code and variable type. Systems used to require all kinds of types in the length parameter of read(2) and fread(3), including int, long, and size_t. If it was int, passing size_t would lead to silent truncation. So at a minimum, a careful review of the further processing, and a test case is necessary.
file_read() calls Py_UniversalNewlineFread() to do the actual reading. The length argument of Py_UniversalNewlineFread(), "n", is already of type size_t, and it passes this parameter in to fread() for the number-of-elements argument. Py_UniversalNewlineFread() is also used by file_readinto(), and that function already passes in a size_t. So it's already successfully being used in 2.7.
If Python 2.7 still supports systems where fread()'s number-of-elements argument is smaller than size_t, this should already be causing a compilation warning on such platforms. One would hope people supporting such platforms would fix it. I don't know of any such platforms, so I cannot address this hypothetical issue. Do you know of platforms supported by 2.7 where fread's number-of-elements parameter is smaller than size_t?
Antoine said:
> The benefit of fixing this in the next 2.7.x iteration is small compared to the cost of a potential regression in the core I/O implementation.
Although nobody wants to cause a regression in CPython, I'm having difficulty imagining how this change could plausibly cause one. On the other hand, Matt makes it clear that it's a bug that has bitten hg, so I think it's worth fixing.
Can you suggest a way this change could plausibly cause a regression?

msg265673
Author: Benjamin Peterson (benjamin.peterson)
Date: 2016-05-16 06:43
Fixing seems fine with me.

msg364548
Author: STINNER Victor (vstinner)
Date: 2020-03-18 18:28
Python 3 doesn't have this issue. Python 2 no longer accepts bugfixes. I close the issue.
Note: there was a similar issue specific to macOS: bpo-24658 and PR 9938 (closed since Python 2 moved to end of life).
|

History (Date | User | Action | Args):
2022-04-11 14:58:01 | admin | set | github: 65398
2020-03-18 18:28:42 | vstinner | set | status: open -> closed; title: Python on 64-bit Windows uses signed 32-bit type for read length -> [2.7] Python on 64-bit Windows uses signed 32-bit type for read length; messages: + msg364548; resolution: wont fix; stage: patch review -> resolved
2020-03-18 18:22:01 | brett.cannon | set | nosy: - brett.cannon
2016-05-16 06:43:01 | benjamin.peterson | set | messages: + msg265673
2016-05-15 16:04:28 | larry | set | messages: + msg265627
2016-05-15 15:48:17 | larry | set | files: + larry.make.file_read.use.size_t.2.diff; messages: + msg265624
2016-05-15 08:38:52 | larry | set | messages: + msg265595
2016-05-14 22:58:14 | vstinner | set | messages: + msg265552
2016-05-12 12:47:40 | serhiy.storchaka | set | messages: + msg265402
2014-09-11 14:48:15 | serhiy.storchaka | set | nosy: + serhiy.storchaka; messages: + msg226781
2014-09-11 13:15:32 | larry | set | messages: + msg226773
2014-04-16 23:03:52 | larry | set | files: + larry.file_read.8.bytes.1.diff; messages: + msg216617; stage: patch review
2014-04-16 06:36:27 | vstinner | set | messages: + msg216439
2014-04-16 06:19:18 | vstinner | set | files: + file_read_size_t.patch; keywords: + patch; messages: + msg216437
2014-04-16 06:09:45 | vstinner | set | messages: + msg216436
2014-04-16 05:53:51 | vstinner | set | nosy: + vstinner; messages: + msg216435
2014-04-15 22:04:59 | larry | set | messages: + msg216398
2014-04-15 21:42:25 | Matt.Mackall | set | messages: + msg216396
2014-04-15 21:24:04 | pitrou | set | messages: + msg216389
2014-04-15 20:15:49 | larry | set | messages: + msg216377
2014-04-14 18:41:14 | brett.cannon | set | nosy: + brett.cannon; messages: + msg216165
2014-04-14 18:39:40 | eric.smith | set | nosy: + eric.smith
2014-04-13 16:43:45 | pitrou | set | messages: + msg216026
2014-04-13 15:53:49 | loewis | set | messages: + msg216023
2014-04-13 13:51:22 | josh.r | set | nosy: + josh.r; messages: + msg216016
2014-04-13 09:37:36 | loewis | set | nosy: + loewis; messages: + msg216008
2014-04-10 19:27:11 | pitrou | set | nosy: + pitrou; messages: + msg215899
2014-04-10 17:57:39 | larry | set | assignee: larry; messages: + msg215896; nosy: + benjamin.peterson
2014-04-10 17:09:54 | ezio.melotti | set | nosy: + ezio.melotti
2014-04-10 16:58:53 | Matt.Mackall | create