classification
Title: mmap on Windows can mishandle files larger than sys.maxsize
Type: behavior Stage: resolved
Components: Library (Lib), Windows Versions: Python 3.4, Python 3.2, Python 3.3, Python 2.7
process
Status: closed Resolution: fixed
Dependencies: Superseder:
Assigned To: Nosy List: jcea, mark.dickinson, pitrou, python-dev, sbt, schlamar, serhiy.storchaka, terry.reedy, tim.golden
Priority: normal Keywords: patch

Created on 2012-12-21 14:43 by schlamar, last changed 2013-02-13 21:09 by sbt. This issue is now closed.

Files
File name Uploaded Description Edit
mmap.patch sbt, 2012-12-26 14:11 review
mmap.patch sbt, 2012-12-26 17:51 review
Messages (31)
msg177879 - (view) Author: Marc Schlaich (schlamar) * Date: 2012-12-21 14:43
Platform: Windows 7 64 bit
Interpreter: Python 2.7.2 (default, Jun 12 2011, 15:08:59) [MSC v.1500 32 bit Intel)] on win32

Here are the steps to reproduce:

1. Create a big file (5 GB):

with open('big', 'wb') as fobj:
    for _ in xrange(1024 * 1024 * 5):
        fobj.write('1' + '0' * 1023)

2. Open and process it with `mmap`:

import mmap
import re
import sys

with open('big', 'rb') as fobj:
    data = mmap.mmap(fobj.fileno(), 0, access=mmap.ACCESS_READ)
    print data.size()
    try:
        counter = 0
        for match in re.finditer('1' + '0' * 1023, data):
            counter += 1
        print len(data[1073740800:1073741824]) # (1 GB - 1024, 1 GB)
        print len(data[1073741824:1073742848]) # (1 GB, 1 GB + 1024)
    finally:
        data.close()

    print counter

This returns the following lines:

    5368709120
    1024
    0
    1048576

So this is a behavioral issue. `mmap` accepts a file which cannot fit in the interpreter memory but fits in the system memory. On processing the data, it only reads data until the maximum interpreter memory is reached (1 GB).
msg177916 - (view) Author: Terry J. Reedy (terry.reedy) * (Python committer) Date: 2012-12-22 01:58
The immediate fix is to use a 64 bit build. That aside, what change in behavior are you suggesting? (and for 32 bit builds only?)

Should mmap.mmap warn if the file is longer that would be supported?
This could be added to all current versions.

Should it raise in the same circumstance? What is a person *knows* that the file is 'too big' but only wants to access the first gigabyte? Forcing people to explicitly pass the magic number 1073741824 would, to me, effectively be a 3.4-at-best api change.

Perhaps mmap.mmap should be left alone and only the attempt to access beyond the cutoff should raise or warn. (Is the 32-bit cutoff OS specific?) Given that there are multiple access methods and methods that access, and that all accesses are ultimately delegated to the os mmap functions, this could be major nuisance to get right.

Now that disks have grown to larger than a gigabyte, the doc should explicitly mention the memory space issue.
msg177968 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2012-12-23 10:14
Terry, what makes you think this is a feature request? This is a bug, quite simply.
msg178008 - (view) Author: Terry J. Reedy (terry.reedy) * (Python committer) Date: 2012-12-23 19:12
It is a report of behavior that lacks a specific request for change (that I can see). The implied code-change request could break working code. We don't usually do that. What do you think should be done?
msg178013 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2012-12-23 19:55
As I understand, the issue is that mmap slicing returns an empty string for large (but less than ssize_t limit) indices on 2.7.

May be it relates to 30-bit digits long integer implementation?
msg178021 - (view) Author: Terry J. Reedy (terry.reedy) * (Python committer) Date: 2012-12-23 23:09
To me, Marc's title and penultimate sentence imply that he thinks that mmap should not accept such files. (But he should speak for himself.) As I said, not accepting such files could break working code.

As for the alternative of 'fixing' methods: Is it only slicing or other methods, even *every* method that 'misbehaves' when attempting to read (or write) beyond the 1 gig limit? I am guessing the last. If so, just about every method (inherited from bytearray, like slicing, or mmap specific) would need a fix conditional on the build and access location (and OS or hardware?).

Even for slices, what change would you (or anyone) make? Keep in mind that is it a *feature* of slices that they generally always work, and that this is specifically true of bytearrays. ("Memory-mapped file objects behave like both bytearray and like file objects.") 

I am actually a bit surprised that the limit is 1 gb rather than 2, 3, or 4 gb. Is it the same on *nix? What is the limit for a bytearray on Win 7?
msg178033 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2012-12-24 07:36
I have only 32-bit OS and can't answer this questions. I'm surprised by
1 GiB limit too.

Marc, can you please check 4.5 GiB file? What limit in this case, 1 GiB
or 0.5 GiB? What about slicing a big bytes object or bytearray (if you
have enough memory)?

If mmap on Windows can't work with files larger 4 GiB, then it should
raise exception on creation or at least on access. Silent production of
wrong result is not a way.
msg178109 - (view) Author: Terry J. Reedy (terry.reedy) * (Python committer) Date: 2012-12-24 22:35
Windows memory-maps multi-gigabyte files just fine as long as one uses the proper build (64-bit), which we provide.

Given that mmap produces a finite-length sequence object, as documented, slicing is working as it should. Slicing beyond the length  returns an empty sequence. The is no different from 'abc'[4:6]==''.

Running Python with finite memory has many memory-associated limitations. They are mostly undocumented as the exact details may depend on hardware, OS, implementation, version, and build. One practical limitation is that mmap with a 32-bit build cannot completely map multi-gigabyte files.

The current doc says:
"class mmap.mmap(fileno, length, tagname=None, access=ACCESS_DEFAULT[, offset]) 
(Windows version) Maps length bytes from the file specified by the file handle fileno, and creates a mmap object. If length is larger than the current size of the file, the file is extended to contain length bytes. If length is 0, the maximum length of the map is the current size of the file, except that if the file is empty Windows raises an exception (you cannot create an empty mapping on Windows)."

It does not say what happens if the requested length is larger than the max possible on a particular system. In particular, there is no mention of exception raising. So failure to raise is not a bug for tracker purposes.

The two possibilities of what to do is such situations are best effort and bailout. The current choice (at least on Windows, and whether by us, Microsoft, or the original mmap authors, I don't know) is best effort. I think that is fine, but should be documented. Users who care can compare the mmap object length with the file length or needed length and raise or do whatever if the mmap length is too short.

So I think we should change this to a doc issue and add something like "If the requested length is larger than the limit for the current system, then that limit is used as the length."
or
"The length of the returned mmap object has a limit that depends on the details of the running system."

Or the header should say that there is a system limit and two of the sentences above revised. In the first, change 'length bytes' to 'min(length, system limit) bytes. (I am presuming this is true also when length is not given as 0.) In the last sentence, change 'current size' to 'min(current size, system limit)'.

The Unix version doc should also clarify behavior.
---

If we were to change mmap() (but only in a future release), then users who want the current behavior would have to discover, hard-code, and explicitly but conditionally pass the limit for each system their code might ever run on. I do not know that that is sensibly possible. I would not be surprised if the limit for a given 32-bit build varies for different windows versions and setups.
msg178110 - (view) Author: Richard Oudkerk (sbt) * (Python committer) Date: 2012-12-24 23:51
I suspect that the size of the 5GB file is originally a 64 bit quantity, but gets cast unsafely to a 32 bit size_t to give 1GB.  This is causing the miscalculations.

There is no way to map all of a 5GB file in a 32 bit process -- 4GB is the maximum -- so any such attempt should raise an error.  This does not prevent us from mapping *part* of a 5GB file.
msg178111 - (view) Author: Richard Oudkerk (sbt) * (Python committer) Date: 2012-12-25 00:15
This bit looks wrong to me:

            if (offset - size > PY_SSIZE_T_MAX)
                /* Map area too large to fit in memory */
                m_obj->size = (Py_ssize_t) -1;

Should it not be "size - offset" instead of "offset - size"?  (offset and size are Py_LONG_LONG.)  And there is no check that offset is non-negative.
msg178217 - (view) Author: Richard Oudkerk (sbt) * (Python committer) Date: 2012-12-26 14:11
On 32 bit Unix mmap() will raise ValueError("mmap length is too large") in Marc's example.  This is correct since Python's sequence protocol does not support indexes larger than sys.maxsize.

But on 32 bit Windows, if length == 0 then the size check always passes, and the actual size mapped is the file size modulo 4GB.

Fix for 3.x is attached with tests.
msg178221 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2012-12-26 14:47
This change is not backward compatible. Now user can mmap a larger file and safely access lower 2 GiB. With the patch it will fail.

Unix implementation uses unsafe integer overflow idiom which cause undefined behavior (Mark, you have the floor).
msg178228 - (view) Author: Richard Oudkerk (sbt) * (Python committer) Date: 2012-12-26 15:43
> This change is not backward compatible. Now user can mmap a larger file 
> and safely access lower 2 GiB. With the patch it will fail.

They should specify length=2GiB-1 if that is what they want.

With length=0 you can only access the lower 2GiB if "file_size % 4GiB > 2GiB".  If the file size is 4GiB+1 then you can only access *one byte* of the file.  And if "2GiB < file_size < 4GiB" then presumably len(data) will be negative (or throw an exception or fail an assertion -- I have not tested that case).  I would not be surprised if crashes are possible.

Basically if you had a "large" file and you did not hit a problem then it was Windows specific dumb luck.  I see no point in retaining such unpredictable behaviour.
msg178230 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2012-12-26 15:52
Agree.

Please add the same check for Unix implementation (instead of unsafe overflow trick).
msg178235 - (view) Author: Richard Oudkerk (sbt) * (Python committer) Date: 2012-12-26 17:51
New patch with same check for Unix.
msg178237 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2012-12-26 19:23
LGTM. Isn't 2 GiB + 1 bytes mmap file enough for testing?
msg178242 - (view) Author: Richard Oudkerk (sbt) * (Python committer) Date: 2012-12-26 19:50
> Isn't 2 GiB + 1 bytes mmap file enough for testing?

Yes.

But creating multigigabyte files is very slow on Windows.  On Linux/FreeBSD test_mmap takes a fraction of a second, whereas on Windows it takes over 2 minutes.  (Presumably Linux/FreeBSD is automatically creating a sparse file.)

So adding assertions to an existing test is more convenient than creating another huge file just for these new tests.
msg178260 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2012-12-26 22:02
Sorry, I had missed you reuse an existing test. It's right. I think decreasing file size to 4 GiB + 1 byte will keep the purpose of the test and save a little time, but this is another issue. Let's go ahead!
msg178261 - (view) Author: Terry J. Reedy (terry.reedy) * (Python committer) Date: 2012-12-26 23:11
New data, new thoughts ;-)

I agree that an undocumented discrepancy between Windows and *nix should be fixed either in doc, code, or both. The fix might be version specific.

If the limit on Windows is less than it should or could be, because of our code, that should be changed. It would be best if the working limit were the same for all major platforms or at least the same for all versions of each major platform.

I am a bit reluctant to break working code on existing versions, even if the code depends on dumb luck because of an undocumented version-specific buggy 'feature'.
msg178275 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2012-12-27 06:19
As Richard explained, this will not break working code, this will break only broken code. Working limit will not be changed, it's sys.maxsize on all platforms.
msg180765 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2013-01-27 12:47
Let's go ahead.
msg180789 - (view) Author: Terry J. Reedy (terry.reedy) * (Python committer) Date: 2013-01-27 20:27
> As Richard explained, this will not break working code, this will break only broken code

If code is both working and broken, for some reasonable meaning of 'working' and 'broken', breaking such broken code *will* break working code. I understood Richard as saying that code that 'works by dumb luck' is *also* broken.

I agree we do not need to retain unpredictable 'dumb luck' -- in future versions. But in the absence of a clear discrepancy between doc and behavior (the definition of a bug) I believe breaking such code in a bugfix release would be contrary to current policy.
msg180797 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2013-01-27 21:06
Every bugfix breaks some code. As a compromise I propose set m_obj->size = PY_SSIZE_T_MAX in case of overflow and emit a warning.
msg180813 - (view) Author: Richard Oudkerk (sbt) * (Python committer) Date: 2013-01-28 00:29
On 27/01/2013 8:27pm, Terry J. Reedy wrote:
> I agree we do not need to retain unpredictable 'dumb luck' -- in
> future versions. But in the absence of a clear discrepancy
> between doc and behavior (the definition of a bug) I believe
> breaking such code in a bugfix release would be contrary to
> current policy.

Currently if you mmap a file with length 4GB+1, then you get an mmap of 
length 1.  Surely that is a *huge* discrepancy between docs and behaviour.

BTW, on 32 bit Windows it looks like the maximum size one can mmap in 
practice is about 1.1GB:

PS> python -c "import mmap; m = mmap.mmap(-1, int(1.1*1024**3))"
PS> python -c "import mmap; m = mmap.mmap(-1, int(1.2*1024**3))"
Traceback (most recent call last):
   File "<string>", line 1, in <module>
WindowsError: [Error 8] Not enough storage is available to process this 
command
msg180814 - (view) Author: Richard Oudkerk (sbt) * (Python committer) Date: 2013-01-28 00:33
On 27/01/2013 9:06pm, Serhiy Storchaka wrote:
> Every bugfix breaks some code. As a compromise I propose set
 > m_obj->size = PY_SSIZE_T_MAX in case of overflow and emit a warning.

Trying to allocate PY_SSIZE_T_MAX bytes always seems to fail with

     WindowsError: [Error 8] Not enough storage is available to process
     this command
msg180818 - (view) Author: Terry J. Reedy (terry.reedy) * (Python committer) Date: 2013-01-28 01:31
I missed the Ngb+1 case in your previous answer and agree that this is a bug issue.
msg181984 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2013-02-12 23:49
Ping.
msg182028 - (view) Author: Roundup Robot (python-dev) Date: 2013-02-13 12:55
New changeset b1bbe519770b by Richard Oudkerk in branch '2.7':
Issue #16743: Fix mmap overflow check on 32 bit Windows
http://hg.python.org/cpython/rev/b1bbe519770b

New changeset c2c84d3ab393 by Richard Oudkerk in branch '3.2':
Issue #16743: Fix mmap overflow check on 32 bit Windows
http://hg.python.org/cpython/rev/c2c84d3ab393
msg182033 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2013-02-13 13:26
Perhaps NEWS item needed for this change.
msg182038 - (view) Author: Roundup Robot (python-dev) Date: 2013-02-13 15:27
New changeset 82db097cd2e0 by Richard Oudkerk in branch '2.7':
Add Misc/NEWS entry for Issue #16743
http://hg.python.org/cpython/rev/82db097cd2e0

New changeset efe489f87881 by Richard Oudkerk in branch '3.2':
Add Misc/NEWS entry for Issue #16743
http://hg.python.org/cpython/rev/efe489f87881
msg182039 - (view) Author: Richard Oudkerk (sbt) * (Python committer) Date: 2013-02-13 15:29
> Perhaps NEWS item needed for this change.

Done.
History
Date User Action Args
2013-02-13 21:09:19sbtsetstatus: open -> closed
resolution: fixed
stage: patch review -> resolved
2013-02-13 15:29:18sbtsetmessages: + msg182039
2013-02-13 15:27:13python-devsetmessages: + msg182038
2013-02-13 13:26:00serhiy.storchakasetmessages: + msg182033
2013-02-13 12:55:16python-devsetnosy: + python-dev
messages: + msg182028
2013-02-12 23:52:18brian.curtinsetnosy: - brian.curtin
2013-02-12 23:49:30serhiy.storchakasetmessages: + msg181984
2013-01-28 01:31:49terry.reedysetmessages: + msg180818
2013-01-28 00:33:05sbtsetmessages: + msg180814
2013-01-28 00:29:18sbtsetmessages: + msg180813
2013-01-27 21:06:24serhiy.storchakasetmessages: + msg180797
2013-01-27 20:27:57terry.reedysetmessages: + msg180789
2013-01-27 12:47:44serhiy.storchakasetmessages: + msg180765
2012-12-27 06:19:43serhiy.storchakasetmessages: + msg178275
2012-12-27 01:25:52jceasetnosy: + jcea
2012-12-26 23:11:56terry.reedysetmessages: + msg178261
2012-12-26 22:02:23serhiy.storchakasetmessages: + msg178260
2012-12-26 21:41:29sbtsettype: enhancement -> behavior
title: mmap accepts files > 1 GB, but processes only 1 GB -> mmap on Windows can mishandle files larger than sys.maxsize
2012-12-26 19:50:02sbtsetmessages: + msg178242
2012-12-26 19:23:24serhiy.storchakasetmessages: + msg178237
2012-12-26 17:51:55sbtsetfiles: + mmap.patch

messages: + msg178235
2012-12-26 15:52:45serhiy.storchakasetmessages: + msg178230
2012-12-26 15:43:15sbtsetmessages: + msg178228
2012-12-26 14:47:13serhiy.storchakasetnosy: + mark.dickinson
messages: + msg178221
2012-12-26 14:11:54sbtsetfiles: + mmap.patch
keywords: + patch
messages: + msg178217

stage: needs patch -> patch review
2012-12-25 00:15:16sbtsetmessages: + msg178111
2012-12-24 23:51:46sbtsetnosy: + sbt
messages: + msg178110
2012-12-24 22:35:29terry.reedysetmessages: + msg178109
2012-12-24 07:36:21serhiy.storchakasetmessages: + msg178033
2012-12-23 23:09:52terry.reedysetmessages: + msg178021
2012-12-23 19:55:22serhiy.storchakasetnosy: + serhiy.storchaka
messages: + msg178013
2012-12-23 19:12:16terry.reedysetmessages: + msg178008
2012-12-23 10:14:44pitrousetnosy: + pitrou

messages: + msg177968
versions: + Python 2.7, Python 3.2, Python 3.3
2012-12-22 01:58:16terry.reedysetversions: - Python 2.7, Python 3.2, Python 3.3
nosy: + terry.reedy

messages: + msg177916

type: behavior -> enhancement
stage: needs patch
2012-12-21 21:28:39pitrousetnosy: + tim.golden, brian.curtin

components: + Library (Lib), Windows, - None
versions: + Python 3.2, Python 3.3, Python 3.4
2012-12-21 14:43:31schlamarcreate