msg215893
Author: Matt Mackall (Matt.Mackall)
Date: 2014-04-10 16:58
Python's file_read uses an 'l' format code to parse its size argument, which results in a 31-bit size limit on reads on 64-bit Windows (where a C long is only 32 bits). It should instead use an ssize_t.
Related Mercurial bug:
http://bz.selenic.com/show_bug.cgi?id=4215
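The truncation can be modeled on any platform. On 64-bit Windows (LLP64) a C long is only 32 bits wide, so a read size of 2 GiB or more wraps when it passes through the 'l' format code. A minimal sketch (to_int32 is an illustrative helper, not the CPython code path):

```python
def to_int32(n):
    """Wrap an integer to a signed 32-bit value, as storing it in a
    32-bit C long would. Models the 'l' format code on 64-bit Windows."""
    return ((n + 2**31) % 2**32) - 2**31

print(to_int32(2**31 - 1))  # 2147483647: the largest size that survives
print(to_int32(2**31))      # -2147483648: a 2 GiB request goes negative
```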

msg215896
Author: Larry Hastings (larry)
Date: 2014-04-10 17:57
Benjamin: given the brave new world of 2.7 living even longer, Guido said this sort of bug fix was fair game. Will you accept a patch for this in 2.7?

msg215899
Author: Antoine Pitrou (pitrou)
Date: 2014-04-10 19:27
It is much simpler for Mercurial to add a workaround (which they already did, apparently :-)) than to risk introducing regressions by fixing this. AFAIK nobody else complained before.

msg216008
Author: Martin v. Löwis (loewis)
Date: 2014-04-13 09:37
FWIW, using "l" for file_read goes back to
http://hg.python.org/cpython/diff/f44a56e697fb/Objects/fileobject.c
from Fri, 09 May 1997 22:27:31 +0000

msg216016
Author: Josh Rosenberg (josh.r)
Date: 2014-04-13 13:51
So it predates the existence of type code 'n', which would be the appropriate type code, but no one updated it.
Antoine: the inability to perform a 2 GB+ read properly is not something that should be worked around on a case-by-case basis. There is one compatibility risk here: 'n' requires index integers (__index__), whereas 'l' will accept coercible types (__int__). If you think people are reading from files using Fraction and Decimal, then this would be a compatibility issue (and you could work around it by making it use type code 'L' only on 64-bit Windows), but this just looks like an oversight from the upgrade from int to Py_ssize_t across the Python code base.
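The compatibility point can be seen from pure Python: operator.index() is the Python-level counterpart of what the 'n' format code requires, and it rejects objects that only define __int__. A small illustration (IntOnly and IndexLike are made-up example classes):

```python
import operator

class IntOnly(object):
    """Coercible via __int__ but not an index type (no __index__)."""
    def __int__(self):
        return 42

class IndexLike(object):
    """Usable anywhere an integer index is required."""
    def __index__(self):
        return 42

print(int(IntOnly()))               # 42: the looser coercion 'l' allows
print(operator.index(IndexLike()))  # 42: what 'n' accepts
try:
    operator.index(IntOnly())       # the case 'n' would start rejecting
except TypeError as exc:
    print("TypeError:", exc)
```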

msg216023
Author: Martin v. Löwis (loewis)
Date: 2014-04-13 15:53
Josh: it's not as simple as just changing the type code and variable type. Systems used to require all kinds of types in the length parameter of read(2) and fread(3), including int, long, and size_t. If it was int, passing size_t would lead to silent truncation. So at a minimum, a careful review of the further processing, and a test case is necessary.
I'm personally not interested in Python 2.7 anymore, so I certainly won't work on this.
As for why this isn't reported frequently, I guess a number of causes:
a) people don't use Python on Windows that much to process large files
b) when they do and run into this problem, they feel guilty for asking for such a large chunk, which could well exhaust the memory of the system.
c) there is a straightforward workaround
d) people that do read large files typically either read them at once, or in much smaller chunks
So it may well be that this remains the only report of this for the rest of Python 2.7's life.
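The workaround alluded to in (c) and (d) is to cap each read() call and loop. A sketch of such a helper (read_in_chunks is an illustrative name, not Mercurial's actual fix):

```python
import io

def read_in_chunks(f, size, chunk=2**20):
    """Read up to `size` bytes from `f` without ever asking the
    underlying read() for a chunk anywhere near 2 GiB in one call."""
    parts = []
    remaining = size
    while remaining > 0:
        data = f.read(min(chunk, remaining))
        if not data:          # EOF before `size` bytes were available
            break
        parts.append(data)
        remaining -= len(data)
    return b"".join(parts)

f = io.BytesIO(b"x" * 100)
print(len(read_in_chunks(f, 2**31, chunk=64)))  # 100: limited by EOF
```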

msg216026
Author: Antoine Pitrou (pitrou)
Date: 2014-04-13 16:43
> Antoine: Inability to perform a 2GB+ read properly is not something
> that should be worked around on a case by case basis.
Really, we are talking about Python 2.7 here. Anyone having this problem *has* to work around it in order for their code to work on published versions. The benefit of fixing this in the next 2.7.x iteration is small compared to the cost of a potential regression in the core I/O implementation.

msg216165
Author: Brett Cannon (brett.cannon)
Date: 2014-04-14 18:41
We discussed this bug at the sprint and we all agreed that it should be fixed. It's an oversight and a bug that this behaviour is different between Windows and other platforms due to how 64-bit Windows treats longs.

msg216377
Author: Larry Hastings (larry)
Date: 2014-04-15 20:15
I don't think we can fix this. file_read returns a str, and str can have at most SSIZE_T_MAX entries. So, file_read could read size_t characters, but it couldn't return them.

msg216389
Author: Antoine Pitrou (pitrou)
Date: 2014-04-15 21:24
> I don't think we can fix this. file_read returns a str, and str can
> have at most SSIZE_T_MAX entries. So, file_read could read size_t
> characters, but it couldn't return them.
But at least it could read more than LONG_MAX characters.

msg216396
Author: Matt Mackall (Matt.Mackall)
Date: 2014-04-15 21:42
Actually, no, all the size_t types are 64-bit on Windows 64:
https://en.wikipedia.org/wiki/64-bit_computing#64-bit_data_models
To confirm this, I found someone nearby with a 64-bit Windows and he had no trouble allocating a 3G string with a = b'a' * 3000000000.

msg216398
Author: Larry Hastings (larry)
Date: 2014-04-15 22:04
D'oh, yeah, you guys are right. SSIZE_T_MAX > LONG_MAX, so that's an improvement. We just can't do the full size_t range.

msg216435
Author: STINNER Victor (vstinner)
Date: 2014-04-16 05:53
On Windows, the type of the size parameter of read() is an unsigned int, not long nor size_t:
http://msdn.microsoft.com/en-us/library/1570wh78.aspx
FileIO.read() of Python 3 uses a size_t type but also:
#ifdef MS_WINDOWS
    if (size > INT_MAX)
        size = INT_MAX;
#endif
    ...
#ifdef MS_WINDOWS
    n = read(self->fd, ptr, (int)size);
#else
    n = read(self->fd, ptr, size);
#endif
Using a size_t to parse the Python parameter would avoid an OverflowError and make the function more portable, but it would still not permit reading more than 2 GB at once.
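A toy model of the clamping behaviour quoted above (illustrative only, not the C implementation): the request is accepted at any size, but a single underlying read is capped at INT_MAX bytes, so the caller must loop to get the rest.

```python
INT_MAX = 2**31 - 1  # the 32-bit int limit of Windows read()

def clamped_read_size(size):
    """Model of the Windows path in FileIO.read(): any request is
    accepted, but one read() call never asks for more than INT_MAX."""
    return min(size, INT_MAX)

print(clamped_read_size(2**33))  # 2147483647
print(clamped_read_size(4096))   # 4096
```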

msg216436
Author: STINNER Victor (vstinner)
Date: 2014-04-16 06:09
"It is much simpler for Mercurial to add a workaround (which they already did, apparently :-)), than to risk introducing regressions by fixing this. AFAIK nobody else complained before."
We fixed a lot of issues for 32-bit long and 64-bit size_t types (64-bit Windows) in Python 3.3 and 3.4. I don't think that they were fixed in Python 2.7, because the required changes are large and it would be risky to backport them. At least, I don't feel comfortable backporting them.
I suggest switching to Python 3 if you would like to handle large objects (bigger than 2 GB) on 64-bit Windows.
See the issue #9566 for examples of fixes, but there were many other changes related to 64-bit Windows issues.
I see, for example, that the send() & send_bytes() methods of multiprocessing's connection don't truncate the length to DWORD_MAX bytes on Windows in Python 2.7, whereas this was fixed in Python 3:
http://hg.python.org/cpython/rev/c75ab7b802df
(On UNIX, the same methods loop on write() to ensure that all bytes are written.)
If you want to announce that Python 2.7.x supports large objects on Windows 64 bits, be prepared to have to fix Python in various different places.
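The UNIX-side behaviour described above (looping on write() until all bytes are written) looks roughly like this in Python; the short_write function below is a stand-in for a real file descriptor that performs partial writes:

```python
def write_all(write, data):
    """Call `write` repeatedly until every byte of `data` is written,
    handling short writes the way the UNIX code path does."""
    view = memoryview(data)
    while view:
        n = write(view)   # a real write() may accept fewer bytes
        view = view[n:]

written = []
def short_write(view):
    """Stand-in writer that accepts at most 7 bytes per call."""
    taken = bytes(view[:7])
    written.append(taken)
    return len(taken)

write_all(short_write, b"hello world, hello moon")
print(b"".join(written))  # b'hello world, hello moon'
```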

msg216437
Author: STINNER Victor (vstinner)
Date: 2014-04-16 06:19
> On Windows, the type of the size parameter of read() is an unsigned int, not long nor size_t (...)
Oh, I read the wrong function. In fact, file_read() of Python 2.7 calls fread() and fread() uses size_t types, even on Windows.
To make sure that we are talking about the same thing, I wrote the attached file_read_size_t.patch file which replaces "l" with "n" in file_read().
Note: os.read() uses int types, even on Python 3.5, whereas read() uses a size_t type for the number of bytes on Linux.

msg216439
Author: STINNER Victor (vstinner)
Date: 2014-04-16 06:36
"If you want to announce that Python 2.7.x supports large objects on Windows 64 bits, be prepared to have to fix Python in various different places."
You can compare which modules define PY_SSIZE_T_CLEAN in Python 2.7 and 3.x. For example, it looks like the bz2 and zlib modules correctly handle 64-bit lengths in Python 3, but don't in Python 2.
By the way, BZ2File_read() in Python 2.7 also uses the "l" format to parse the input length. It looks like the code was copied from fileobject.c.

msg216617
Author: Larry Hastings (larry)
Date: 2014-04-16 23:03
Here's my version of the patch. It does the same thing Victor's patch does, but removes a now-completely-irrelevant stanza of code, and adds a test.
I'm on 64-bit Linux, so the test was always going to work anyway. So I tested the test by changing file_read to use an int (and the "i" format unit), which happily causes it to fail.

msg226773
Author: Larry Hastings (larry)
Date: 2014-09-11 13:15
Ping?

msg226781
Author: Serhiy Storchaka (serhiy.storchaka)
Date: 2014-09-11 14:48
I added comments to test on Rietveld.

msg265402
Author: Serhiy Storchaka (serhiy.storchaka)
Date: 2016-05-12 12:47
Ping?

msg265552
Author: STINNER Victor (vstinner)
Date: 2016-05-14 22:58
I'm lost in this old issue. Can someone please try to summarize it? What is the problem? What is the proposed fix? What is the status?

msg265595
Author: Larry Hastings (larry)
Date: 2016-05-15 08:38
Summary: read Matt's original message on the bug.
Status: still broken. Some people think it should be left that way.
Proposed fix: it's straightforward to fix. You wrote a patch, I modified yours a little. The patch is short and easy to read.

msg265624
Author: Larry Hastings (larry)
Date: 2016-05-15 15:48
Updated patch based on feedback from Serhiy. Thanks for the ping.

msg265627
Author: Larry Hastings (larry)
Date: 2016-05-15 16:04
Martin said:
> Josh: it's not as simple as just changing the type code and variable type. Systems used to require all kinds of types in the length parameter of read(2) and fread(3), including int, long, and size_t. If it was int, passing size_t would lead to silent truncation. So at a minimum, a careful review of the further processing, and a test case is necessary.
file_read() calls Py_UniversalNewlineFread() to do the actual reading. The length argument of Py_UniversalNewlineFread(), "n", is already of type size_t, and it passes this parameter in to fread() for the number-of-elements argument. Py_UniversalNewlineFread() is also used by file_readinto(), and that function already passes in a size_t. So it's already successfully being used in 2.7.
If Python 2.7 still supports systems where fread()'s number-of-elements argument is smaller than size_t, this should already be causing a compilation warning on such platforms. One would hope people supporting such platforms would fix it. I don't know of any such platforms, so I cannot address this hypothetical issue. Do you know of platforms supported by 2.7 where fread's number-of-elements parameter is smaller than size_t?
Antoine said:
> The benefit of fixing this in the next 2.7.x iteration is small compared to the cost of a potential regression in the core I/O implementation.
Although nobody wants to cause a regression in CPython, I'm having difficulty imagining how this change could plausibly cause one. On the other hand, Matt makes it clear that it's a bug that has bitten hg, so I think it's worth fixing.
Can you suggest a way this change could plausibly cause a regression?

msg265673
Author: Benjamin Peterson (benjamin.peterson)
Date: 2016-05-16 06:43
Fixing seems fine with me.

msg364548
Author: STINNER Victor (vstinner)
Date: 2020-03-18 18:28
Python 3 doesn't have this issue. Python 2 no longer accepts bugfixes. I close the issue.
Note: there was a similar issue specific to macOS: bpo-24658 and PR 9938 (closed since Python 2 moved to end of life).
|

History (Date | User | Action | Args):
2022-04-11 14:58:01 | admin | set | github: 65398
2020-03-18 18:28:42 | vstinner | set | status: open -> closed; title: Python on 64-bit Windows uses signed 32-bit type for read length -> [2.7] Python on 64-bit Windows uses signed 32-bit type for read length; messages: + msg364548; resolution: wont fix; stage: patch review -> resolved
2020-03-18 18:22:01 | brett.cannon | set | nosy: - brett.cannon
2016-05-16 06:43:01 | benjamin.peterson | set | messages: + msg265673
2016-05-15 16:04:28 | larry | set | messages: + msg265627
2016-05-15 15:48:17 | larry | set | files: + larry.make.file_read.use.size_t.2.diff; messages: + msg265624
2016-05-15 08:38:52 | larry | set | messages: + msg265595
2016-05-14 22:58:14 | vstinner | set | messages: + msg265552
2016-05-12 12:47:40 | serhiy.storchaka | set | messages: + msg265402
2014-09-11 14:48:15 | serhiy.storchaka | set | nosy: + serhiy.storchaka; messages: + msg226781
2014-09-11 13:15:32 | larry | set | messages: + msg226773
2014-04-16 23:03:52 | larry | set | files: + larry.file_read.8.bytes.1.diff; messages: + msg216617; stage: patch review
2014-04-16 06:36:27 | vstinner | set | messages: + msg216439
2014-04-16 06:19:18 | vstinner | set | files: + file_read_size_t.patch; keywords: + patch; messages: + msg216437
2014-04-16 06:09:45 | vstinner | set | messages: + msg216436
2014-04-16 05:53:51 | vstinner | set | nosy: + vstinner; messages: + msg216435
2014-04-15 22:04:59 | larry | set | messages: + msg216398
2014-04-15 21:42:25 | Matt.Mackall | set | messages: + msg216396
2014-04-15 21:24:04 | pitrou | set | messages: + msg216389
2014-04-15 20:15:49 | larry | set | messages: + msg216377
2014-04-14 18:41:14 | brett.cannon | set | nosy: + brett.cannon; messages: + msg216165
2014-04-14 18:39:40 | eric.smith | set | nosy: + eric.smith
2014-04-13 16:43:45 | pitrou | set | messages: + msg216026
2014-04-13 15:53:49 | loewis | set | messages: + msg216023
2014-04-13 13:51:22 | josh.r | set | nosy: + josh.r; messages: + msg216016
2014-04-13 09:37:36 | loewis | set | nosy: + loewis; messages: + msg216008
2014-04-10 19:27:11 | pitrou | set | nosy: + pitrou; messages: + msg215899
2014-04-10 17:57:39 | larry | set | assignee: larry; messages: + msg215896; nosy: + benjamin.peterson
2014-04-10 17:09:54 | ezio.melotti | set | nosy: + ezio.melotti
2014-04-10 16:58:53 | Matt.Mackall | create