msg177879 - (view) |
Author: Marc Schlaich (schlamar) * |
Date: 2012-12-21 14:43 |
Platform: Windows 7 64 bit
Interpreter: Python 2.7.2 (default, Jun 12 2011, 15:08:59) [MSC v.1500 32 bit Intel)] on win32
Here are the steps to reproduce:
1. Create a big file (5 GB):
with open('big', 'wb') as fobj:
for _ in xrange(1024 * 1024 * 5):
fobj.write('1' + '0' * 1023)
2. Open and process it with `mmap`:
import mmap
import re
import sys
with open('big', 'rb') as fobj:
data = mmap.mmap(fobj.fileno(), 0, access=mmap.ACCESS_READ)
print data.size()
try:
counter = 0
for match in re.finditer('1' + '0' * 1023, data):
counter += 1
print len(data[1073740800:1073741824]) # (1 GB - 1024, 1 GB)
print len(data[1073741824:1073742848]) # (1 GB, 1 GB + 1024)
finally:
data.close()
print counter
This returns the following lines:
5368709120
1024
0
1048576
So this is a behavioral issue. `mmap` accepts a file which cannot fit in the interpreter memory but fits in the system memory. On processing the data, it only reads data until the maximum interpreter memory is reached (1 GB).
|
msg177916 - (view) |
Author: Terry J. Reedy (terry.reedy) * |
Date: 2012-12-22 01:58 |
The immediate fix is to use a 64 bit build. That aside, what change in behavior are you suggesting? (and for 32 bit builds only?)
Should mmap.mmap warn if the file is longer that would be supported?
This could be added to all current versions.
Should it raise in the same circumstance? What is a person *knows* that the file is 'too big' but only wants to access the first gigabyte? Forcing people to explicitly pass the magic number 1073741824 would, to me, effectively be a 3.4-at-best api change.
Perhaps mmap.mmap should be left alone and only the attempt to access beyond the cutoff should raise or warn. (Is the 32-bit cutoff OS specific?) Given that there are multiple access methods and methods that access, and that all accesses are ultimately delegated to the os mmap functions, this could be major nuisance to get right.
Now that disks have grown to larger than a gigabyte, the doc should explicitly mention the memory space issue.
|
msg177968 - (view) |
Author: Antoine Pitrou (pitrou) * |
Date: 2012-12-23 10:14 |
Terry, what makes you think this is a feature request? This is a bug, quite simply.
|
msg178008 - (view) |
Author: Terry J. Reedy (terry.reedy) * |
Date: 2012-12-23 19:12 |
It is a report of behavior that lacks a specific request for change (that I can see). The implied code-change request could break working code. We don't usually do that. What do you think should be done?
|
msg178013 - (view) |
Author: Serhiy Storchaka (serhiy.storchaka) * |
Date: 2012-12-23 19:55 |
As I understand, the issue is that mmap slicing returns an empty string for large (but less than ssize_t limit) indices on 2.7.
May be it relates to 30-bit digits long integer implementation?
|
msg178021 - (view) |
Author: Terry J. Reedy (terry.reedy) * |
Date: 2012-12-23 23:09 |
To me, Marc's title and penultimate sentence imply that he thinks that mmap should not accept such files. (But he should speak for himself.) As I said, not accepting such files could break working code.
As for the alternative of 'fixing' methods: Is it only slicing or other methods, even *every* method that 'misbehaves' when attempting to read (or write) beyond the 1 gig limit? I am guessing the last. If so, just about every method (inherited from bytearray, like slicing, or mmap specific) would need a fix conditional on the build and access location (and OS or hardware?).
Even for slices, what change would you (or anyone) make? Keep in mind that is it a *feature* of slices that they generally always work, and that this is specifically true of bytearrays. ("Memory-mapped file objects behave like both bytearray and like file objects.")
I am actually a bit surprised that the limit is 1 gb rather than 2, 3, or 4 gb. Is it the same on *nix? What is the limit for a bytearray on Win 7?
|
msg178033 - (view) |
Author: Serhiy Storchaka (serhiy.storchaka) * |
Date: 2012-12-24 07:36 |
I have only 32-bit OS and can't answer this questions. I'm surprised by
1 GiB limit too.
Marc, can you please check 4.5 GiB file? What limit in this case, 1 GiB
or 0.5 GiB? What about slicing a big bytes object or bytearray (if you
have enough memory)?
If mmap on Windows can't work with files larger 4 GiB, then it should
raise exception on creation or at least on access. Silent production of
wrong result is not a way.
|
msg178109 - (view) |
Author: Terry J. Reedy (terry.reedy) * |
Date: 2012-12-24 22:35 |
Windows memory-maps multi-gigabyte files just fine as long as one uses the proper build (64-bit), which we provide.
Given that mmap produces a finite-length sequence object, as documented, slicing is working as it should. Slicing beyond the length returns an empty sequence. The is no different from 'abc'[4:6]==''.
Running Python with finite memory has many memory-associated limitations. They are mostly undocumented as the exact details may depend on hardware, OS, implementation, version, and build. One practical limitation is that mmap with a 32-bit build cannot completely map multi-gigabyte files.
The current doc says:
"class mmap.mmap(fileno, length, tagname=None, access=ACCESS_DEFAULT[, offset])
(Windows version) Maps length bytes from the file specified by the file handle fileno, and creates a mmap object. If length is larger than the current size of the file, the file is extended to contain length bytes. If length is 0, the maximum length of the map is the current size of the file, except that if the file is empty Windows raises an exception (you cannot create an empty mapping on Windows)."
It does not say what happens if the requested length is larger than the max possible on a particular system. In particular, there is no mention of exception raising. So failure to raise is not a bug for tracker purposes.
The two possibilities of what to do is such situations are best effort and bailout. The current choice (at least on Windows, and whether by us, Microsoft, or the original mmap authors, I don't know) is best effort. I think that is fine, but should be documented. Users who care can compare the mmap object length with the file length or needed length and raise or do whatever if the mmap length is too short.
So I think we should change this to a doc issue and add something like "If the requested length is larger than the limit for the current system, then that limit is used as the length."
or
"The length of the returned mmap object has a limit that depends on the details of the running system."
Or the header should say that there is a system limit and two of the sentences above revised. In the first, change 'length bytes' to 'min(length, system limit) bytes. (I am presuming this is true also when length is not given as 0.) In the last sentence, change 'current size' to 'min(current size, system limit)'.
The Unix version doc should also clarify behavior.
---
If we were to change mmap() (but only in a future release), then users who want the current behavior would have to discover, hard-code, and explicitly but conditionally pass the limit for each system their code might ever run on. I do not know that that is sensibly possible. I would not be surprised if the limit for a given 32-bit build varies for different windows versions and setups.
|
msg178110 - (view) |
Author: Richard Oudkerk (sbt) * |
Date: 2012-12-24 23:51 |
I suspect that the size of the 5GB file is originally a 64 bit quantity, but gets cast unsafely to a 32 bit size_t to give 1GB. This is causing the miscalculations.
There is no way to map all of a 5GB file in a 32 bit process -- 4GB is the maximum -- so any such attempt should raise an error. This does not prevent us from mapping *part* of a 5GB file.
|
msg178111 - (view) |
Author: Richard Oudkerk (sbt) * |
Date: 2012-12-25 00:15 |
This bit looks wrong to me:
if (offset - size > PY_SSIZE_T_MAX)
/* Map area too large to fit in memory */
m_obj->size = (Py_ssize_t) -1;
Should it not be "size - offset" instead of "offset - size"? (offset and size are Py_LONG_LONG.) And there is no check that offset is non-negative.
|
msg178217 - (view) |
Author: Richard Oudkerk (sbt) * |
Date: 2012-12-26 14:11 |
On 32 bit Unix mmap() will raise ValueError("mmap length is too large") in Marc's example. This is correct since Python's sequence protocol does not support indexes larger than sys.maxsize.
But on 32 bit Windows, if length == 0 then the size check always passes, and the actual size mapped is the file size modulo 4GB.
Fix for 3.x is attached with tests.
|
msg178221 - (view) |
Author: Serhiy Storchaka (serhiy.storchaka) * |
Date: 2012-12-26 14:47 |
This change is not backward compatible. Now user can mmap a larger file and safely access lower 2 GiB. With the patch it will fail.
Unix implementation uses unsafe integer overflow idiom which cause undefined behavior (Mark, you have the floor).
|
msg178228 - (view) |
Author: Richard Oudkerk (sbt) * |
Date: 2012-12-26 15:43 |
> This change is not backward compatible. Now user can mmap a larger file
> and safely access lower 2 GiB. With the patch it will fail.
They should specify length=2GiB-1 if that is what they want.
With length=0 you can only access the lower 2GiB if "file_size % 4GiB > 2GiB". If the file size is 4GiB+1 then you can only access *one byte* of the file. And if "2GiB < file_size < 4GiB" then presumably len(data) will be negative (or throw an exception or fail an assertion -- I have not tested that case). I would not be surprised if crashes are possible.
Basically if you had a "large" file and you did not hit a problem then it was Windows specific dumb luck. I see no point in retaining such unpredictable behaviour.
|
msg178230 - (view) |
Author: Serhiy Storchaka (serhiy.storchaka) * |
Date: 2012-12-26 15:52 |
Agree.
Please add the same check for Unix implementation (instead of unsafe overflow trick).
|
msg178235 - (view) |
Author: Richard Oudkerk (sbt) * |
Date: 2012-12-26 17:51 |
New patch with same check for Unix.
|
msg178237 - (view) |
Author: Serhiy Storchaka (serhiy.storchaka) * |
Date: 2012-12-26 19:23 |
LGTM. Isn't 2 GiB + 1 bytes mmap file enough for testing?
|
msg178242 - (view) |
Author: Richard Oudkerk (sbt) * |
Date: 2012-12-26 19:50 |
> Isn't 2 GiB + 1 bytes mmap file enough for testing?
Yes.
But creating multigigabyte files is very slow on Windows. On Linux/FreeBSD test_mmap takes a fraction of a second, whereas on Windows it takes over 2 minutes. (Presumably Linux/FreeBSD is automatically creating a sparse file.)
So adding assertions to an existing test is more convenient than creating another huge file just for these new tests.
|
msg178260 - (view) |
Author: Serhiy Storchaka (serhiy.storchaka) * |
Date: 2012-12-26 22:02 |
Sorry, I had missed you reuse an existing test. It's right. I think decreasing file size to 4 GiB + 1 byte will keep the purpose of the test and save a little time, but this is another issue. Let's go ahead!
|
msg178261 - (view) |
Author: Terry J. Reedy (terry.reedy) * |
Date: 2012-12-26 23:11 |
New data, new thoughts ;-)
I agree that an undocumented discrepancy between Windows and *nix should be fixed either in doc, code, or both. The fix might be version specific.
If the limit on Windows is less than it should or could be, because of our code, that should be changed. It would be best if the working limit were the same for all major platforms or at least the same for all versions of each major platform.
I am a bit reluctant to break working code on existing versions, even if the code depends on dumb luck because of an undocumented version-specific buggy 'feature'.
|
msg178275 - (view) |
Author: Serhiy Storchaka (serhiy.storchaka) * |
Date: 2012-12-27 06:19 |
As Richard explained, this will not break working code, this will break only broken code. Working limit will not be changed, it's sys.maxsize on all platforms.
|
msg180765 - (view) |
Author: Serhiy Storchaka (serhiy.storchaka) * |
Date: 2013-01-27 12:47 |
Let's go ahead.
|
msg180789 - (view) |
Author: Terry J. Reedy (terry.reedy) * |
Date: 2013-01-27 20:27 |
> As Richard explained, this will not break working code, this will break only broken code
If code is both working and broken, for some reasonable meaning of 'working' and 'broken', breaking such broken code *will* break working code. I understood Richard as saying that code that 'works by dumb luck' is *also* broken.
I agree we do not need to retain unpredictable 'dumb luck' -- in future versions. But in the absence of a clear discrepancy between doc and behavior (the definition of a bug) I believe breaking such code in a bugfix release would be contrary to current policy.
|
msg180797 - (view) |
Author: Serhiy Storchaka (serhiy.storchaka) * |
Date: 2013-01-27 21:06 |
Every bugfix breaks some code. As a compromise I propose set m_obj->size = PY_SSIZE_T_MAX in case of overflow and emit a warning.
|
msg180813 - (view) |
Author: Richard Oudkerk (sbt) * |
Date: 2013-01-28 00:29 |
On 27/01/2013 8:27pm, Terry J. Reedy wrote:
> I agree we do not need to retain unpredictable 'dumb luck' -- in
> future versions. But in the absence of a clear discrepancy
> between doc and behavior (the definition of a bug) I believe
> breaking such code in a bugfix release would be contrary to
> current policy.
Currently if you mmap a file with length 4GB+1, then you get an mmap of
length 1. Surely that is a *huge* discrepancy between docs and behaviour.
BTW, on 32 bit Windows it looks like the maximum size one can mmap in
practice is about 1.1GB:
PS> python -c "import mmap; m = mmap.mmap(-1, int(1.1*1024**3))"
PS> python -c "import mmap; m = mmap.mmap(-1, int(1.2*1024**3))"
Traceback (most recent call last):
File "<string>", line 1, in <module>
WindowsError: [Error 8] Not enough storage is available to process this
command
|
msg180814 - (view) |
Author: Richard Oudkerk (sbt) * |
Date: 2013-01-28 00:33 |
On 27/01/2013 9:06pm, Serhiy Storchaka wrote:
> Every bugfix breaks some code. As a compromise I propose set
> m_obj->size = PY_SSIZE_T_MAX in case of overflow and emit a warning.
Trying to allocate PY_SSIZE_T_MAX bytes always seems to fail with
WindowsError: [Error 8] Not enough storage is available to process
this command
|
msg180818 - (view) |
Author: Terry J. Reedy (terry.reedy) * |
Date: 2013-01-28 01:31 |
I missed the Ngb+1 case in your previous answer and agree that this is a bug issue.
|
msg181984 - (view) |
Author: Serhiy Storchaka (serhiy.storchaka) * |
Date: 2013-02-12 23:49 |
Ping.
|
msg182028 - (view) |
Author: Roundup Robot (python-dev) |
Date: 2013-02-13 12:55 |
New changeset b1bbe519770b by Richard Oudkerk in branch '2.7':
Issue #16743: Fix mmap overflow check on 32 bit Windows
http://hg.python.org/cpython/rev/b1bbe519770b
New changeset c2c84d3ab393 by Richard Oudkerk in branch '3.2':
Issue #16743: Fix mmap overflow check on 32 bit Windows
http://hg.python.org/cpython/rev/c2c84d3ab393
|
msg182033 - (view) |
Author: Serhiy Storchaka (serhiy.storchaka) * |
Date: 2013-02-13 13:26 |
Perhaps NEWS item needed for this change.
|
msg182038 - (view) |
Author: Roundup Robot (python-dev) |
Date: 2013-02-13 15:27 |
New changeset 82db097cd2e0 by Richard Oudkerk in branch '2.7':
Add Misc/NEWS entry for Issue #16743
http://hg.python.org/cpython/rev/82db097cd2e0
New changeset efe489f87881 by Richard Oudkerk in branch '3.2':
Add Misc/NEWS entry for Issue #16743
http://hg.python.org/cpython/rev/efe489f87881
|
msg182039 - (view) |
Author: Richard Oudkerk (sbt) * |
Date: 2013-02-13 15:29 |
> Perhaps NEWS item needed for this change.
Done.
|
|
Date |
User |
Action |
Args |
2022-04-11 14:57:39 | admin | set | github: 60947 |
2013-02-13 21:09:19 | sbt | set | status: open -> closed resolution: fixed stage: patch review -> resolved |
2013-02-13 15:29:18 | sbt | set | messages:
+ msg182039 |
2013-02-13 15:27:13 | python-dev | set | messages:
+ msg182038 |
2013-02-13 13:26:00 | serhiy.storchaka | set | messages:
+ msg182033 |
2013-02-13 12:55:16 | python-dev | set | nosy:
+ python-dev messages:
+ msg182028
|
2013-02-12 23:52:18 | brian.curtin | set | nosy:
- brian.curtin
|
2013-02-12 23:49:30 | serhiy.storchaka | set | messages:
+ msg181984 |
2013-01-28 01:31:49 | terry.reedy | set | messages:
+ msg180818 |
2013-01-28 00:33:05 | sbt | set | messages:
+ msg180814 |
2013-01-28 00:29:18 | sbt | set | messages:
+ msg180813 |
2013-01-27 21:06:24 | serhiy.storchaka | set | messages:
+ msg180797 |
2013-01-27 20:27:57 | terry.reedy | set | messages:
+ msg180789 |
2013-01-27 12:47:44 | serhiy.storchaka | set | messages:
+ msg180765 |
2012-12-27 06:19:43 | serhiy.storchaka | set | messages:
+ msg178275 |
2012-12-27 01:25:52 | jcea | set | nosy:
+ jcea
|
2012-12-26 23:11:56 | terry.reedy | set | messages:
+ msg178261 |
2012-12-26 22:02:23 | serhiy.storchaka | set | messages:
+ msg178260 |
2012-12-26 21:41:29 | sbt | set | type: enhancement -> behavior title: mmap accepts files > 1 GB, but processes only 1 GB -> mmap on Windows can mishandle files larger than sys.maxsize |
2012-12-26 19:50:02 | sbt | set | messages:
+ msg178242 |
2012-12-26 19:23:24 | serhiy.storchaka | set | messages:
+ msg178237 |
2012-12-26 17:51:55 | sbt | set | files:
+ mmap.patch
messages:
+ msg178235 |
2012-12-26 15:52:45 | serhiy.storchaka | set | messages:
+ msg178230 |
2012-12-26 15:43:15 | sbt | set | messages:
+ msg178228 |
2012-12-26 14:47:13 | serhiy.storchaka | set | nosy:
+ mark.dickinson messages:
+ msg178221
|
2012-12-26 14:11:54 | sbt | set | files:
+ mmap.patch keywords:
+ patch messages:
+ msg178217
stage: needs patch -> patch review |
2012-12-25 00:15:16 | sbt | set | messages:
+ msg178111 |
2012-12-24 23:51:46 | sbt | set | nosy:
+ sbt messages:
+ msg178110
|
2012-12-24 22:35:29 | terry.reedy | set | messages:
+ msg178109 |
2012-12-24 07:36:21 | serhiy.storchaka | set | messages:
+ msg178033 |
2012-12-23 23:09:52 | terry.reedy | set | messages:
+ msg178021 |
2012-12-23 19:55:22 | serhiy.storchaka | set | nosy:
+ serhiy.storchaka messages:
+ msg178013
|
2012-12-23 19:12:16 | terry.reedy | set | messages:
+ msg178008 |
2012-12-23 10:14:44 | pitrou | set | nosy:
+ pitrou
messages:
+ msg177968 versions:
+ Python 2.7, Python 3.2, Python 3.3 |
2012-12-22 01:58:16 | terry.reedy | set | versions:
- Python 2.7, Python 3.2, Python 3.3 nosy:
+ terry.reedy
messages:
+ msg177916
type: behavior -> enhancement stage: needs patch |
2012-12-21 21:28:39 | pitrou | set | nosy:
+ tim.golden, brian.curtin
components:
+ Library (Lib), Windows, - None versions:
+ Python 3.2, Python 3.3, Python 3.4 |
2012-12-21 14:43:31 | schlamar | create | |