msg77957 - (view) |
Author: (saa) |
Date: 2008-12-17 08:40 |
In trying to use the mmap offset to allow me to work with a 12GB file, I
encountered the fact that the new offset parameter was a ssize_t, which
is only 32 bits on my platform (OS X). The mmap offset in C is defined
to be an off_t
(http://opengroup.org/onlinepubs/007908775/xsh/mmap.html), which is 64
bits on my platform. The attached patch attempts to fix the non-Windows
code to have an off_t offset parameter and allows my code to access all
12 GB of my file. As a warning, this is the first time I've even looked
at the Python source, and wouldn't even consider myself a Python coder;
I just use Python every few months. Review the patch carefully.
|
msg77958 - (view) |
Author: Martin v. Löwis (loewis) * |
Date: 2008-12-17 09:07 |
I don't think fpos_t is a suitable type here. It isn't guaranteed to be
an integral type, and, on many systems, it isn't integral. So you can't
convert int to it, and you can't do arithmetic on it.
|
msg77959 - (view) |
Author: (saa) |
Date: 2008-12-17 09:17 |
Sorry, I should have explained that the bulk of my patch was derived from
existing code. The code that references fpos_t is basically a cut and
paste of code from Modules/bz2module.c. I don't know if this is a valid
usage of fpos_t, but if it isn't, you should probably file a bug on the
existing usage in bz2module and suggest a fix. Thanks for reviewing!
|
msg77960 - (view) |
Author: Martin v. Löwis (loewis) * |
Date: 2008-12-17 09:38 |
This is different from bz2, though. In bzlib, there is no API to seek to
a certain position. In mmap, the offset type in mmap specified as off_t,
so fpos_t plays no role here.
|
msg77961 - (view) |
Author: (saa) |
Date: 2008-12-17 09:42 |
Ah, I think I'm beginning to understand. Thanks for the explanation. So,
if I understand correctly, I should refactor the patch and just remove all
of the Py_off_t and just use off_t because the standard says that *is* the
type. Correct?
|
msg77962 - (view) |
Author: Martin v. Löwis (loewis) * |
Date: 2008-12-17 09:45 |
> Ah, I think I'm beginning to understand. Thanks for the explanation. So,
> if I understand correctly, I should refactor the patch and just remove all
> of the Py_off_t and just use off_t because the standard says that *is* the
> type. Correct?
Correct.
|
msg77963 - (view) |
Author: (saa) |
Date: 2008-12-17 09:55 |
Thanks again, revised patch attached.
|
msg78055 - (view) |
Author: (saa) |
Date: 2008-12-19 08:31 |
There are bigger problems here. When the code calculates the size to map
if size is passed as zero, it simply assigns the 64-bit st_size to the
size to mmap. On a 32-bit system, it is obvious that the amount to map
must be less than a 32-bit value. The calculation should take into
account at least the following two things: 1) it should be the size of the
file minus the offset, and 2) it shouldn't ask for an unreasonable amount
of memory. I will experiment with this and submit a new patch (and delete
the two I've already uploaded) within a couple of days, or at a minimum,
provide an update to this bug.
|
msg78076 - (view) |
Author: (saa) |
Date: 2008-12-19 23:24 |
Notes on the problem.
The Python mmap documentation says: If length is 0, the maximum length
of the map will be the current size of the file when mmap is called.
Currently, on a system where off_t is 64-bit and size_t is 32-bit, if
the size of the file (st_size) is > 0xFFFF'FFFF, the code will in effect
try to map (st_size & 0xFFFF'FFFF). This is suboptimal (imagine if the
size of the file is 0x1'0000'1000 or even 0x1'0000'0000).
In addition, it seems weird that a caller would have to check the size
of the returned mmap against the size of the file to see if the entire
file was mapped or not.
Finally, it appears that there isn't a reliable, architecture
independent, quick, and easy way to figure out what the maximum mmap
size is.
With these points in mind, I am going to work on coming up with a patch
that will change the behavior to simply return an error when a zero
length is passed to mmap and the entire file can not be mapped. This
will put the onus on the caller to determine an appropriate mmap size
for "chunking" the processing of the file, but the behavior will become
more consistent.
Of course, comments and suggestions are welcome. However, I'm going to
have to wrap up my immediate involvement on this in the next couple of
days (Yay, vacation!).
|
msg78085 - (view) |
Author: (saa) |
Date: 2008-12-20 04:50 |
Ok, put a fork in me, 'cuz I'm done with this. The latest is that
mmap.size() is defined to return the size of the file and not the size
of the data. It tries to return it as a ssize_t, which of course, on
systems where off_t is 64-bits and ssize_t is 32-bits, won't work for
sizes that won't fit in 32-bits.
Without the size of the data, it is unclear how one would traverse a
file using the offset parameter. As part of fixing mmap, I would
suggest someone should write an example of how it should be used in
these cases and use that as a test case.
I'm going to leave this up to someone with more knowledge of how this
stuff *should* work in Python, but I'm going to go redo my program in C.
Thanks for listening and the future efforts.
|
msg126454 - (view) |
Author: Ross Lagerwall (rosslagerwall) |
Date: 2011-01-18 06:30 |
Attached is a fix to make offset use off_t. This means that mmap will work with offset > 2GB on 32bit systems.
It also fixes that mmap.size() returns the correct value for files > 2GB on 32bit systems.
The first issue of msg78055 was fixed in issue10916, this also fixes the second part, raising an exception if the mmap length is too large instead of mmap()ing an invalid or wrong size.
|
msg128210 - (view) |
Author: Antoine Pitrou (pitrou) * |
Date: 2011-02-09 12:54 |
Thanks for doing this. Looking at the patch: Modules/_io/_iomodule.h has macros for dealing with off_t (it also defines Py_off_t for Windows), perhaps you want to use them.
Also, support.unlink() takes care of ignoring ENOENT for you.
Regardling the sizes used in the test: I think using hexadecimal literals (e.g. 0x180000000 instead of 6442450944) would make things a bit more readable :)
Part of the patch doesn't apply cleanly to latest py3k. Could you regenerate it, please?
|
msg128215 - (view) |
Author: Ross Lagerwall (rosslagerwall) |
Date: 2011-02-09 14:52 |
OK, this new patch applies cleanly, uses support.unlink and hexadecimal constants.
I left the off_t handling as is (it seems to work on *nix testing). Perhaps someone can handle the Windows side?
|
msg128236 - (view) |
Author: Antoine Pitrou (pitrou) * |
Date: 2011-02-09 20:18 |
> I left the off_t handling as is (it seems to work on *nix testing).
> Perhaps someone can handle the Windows side?
Ok, after looking at the code, it appears Windows already should have proper handling of 64-bit sizes under 32-bit builds.
|
msg128259 - (view) |
Author: wrobell (wrobell) |
Date: 2011-02-10 02:06 |
will this patch support off_t for "mmap.resize" as well? ftruncate uses off_t for the length of a file as well.
|
msg128261 - (view) |
Author: Ross Lagerwall (rosslagerwall) |
Date: 2011-02-10 04:38 |
> will this patch support off_t for "mmap.resize" as well?
> ftruncate uses off_t for the length of a file as well.
No, it doesn't because resize() resizes the amount mmapped in memory which can't be more than ssize_t anyway. However, resizing does work on a 12GiB file with a large offset, e.g. 6GiB.
|
msg128274 - (view) |
Author: wrobell (wrobell) |
Date: 2011-02-10 09:43 |
> > will this patch support off_t for "mmap.resize" as well?
> > ftruncate uses off_t for the length of a file as well.
> No, it doesn't because resize() resizes the amount mmapped in memory
> which can't be more than ssize_t anyway. However, resizing does work on
> a 12GiB file with a large offset, e.g. 6GiB.
i am bit lost. ftruncate is used to enlarge the file, isn't it?
please consider the script:
-- rr.py ---
import mmap
f = open('test.x', 'w+b')
f.write(b'\x00' * 10)
f.flush()
fm = mmap.mmap(f.fileno(), 0)
fm.resize(5 * 1024 ** 3)
fm.close()
f.close()
------------
the script above resizes file from 10 bytes
to 5 * 1024 ** 3 bytes with mmap.resize (but
not 5 * 1024 ** 3 + 10!)
now, on 32-bit linux (with off_t):
$ python3 rr.py
Traceback (most recent call last):
File "rr.py", line 8, in <module>
fm.resize(5 * 1024 ** 3)
OverflowError: Python int too large to convert to C ssize_t
above script works perfectly on 64-bit version of linux
with just 4GB of RAM.
maybe i do not understand some magic here, but clearly one
can mmap.resize to a file size much beyond the amount of memory.
|
msg128305 - (view) |
Author: Ross Lagerwall (rosslagerwall) |
Date: 2011-02-10 16:41 |
32-bit computers can address up to 4GiB of memory and 64-bit computers can address much more than this. mmap() allows a file to be mapped to a location in memory - the actual amount of memory that exists doesn't matter. This is the reason why a 5GiB file can be mmaped on 64-bit but not 32-bit.
Also, python uses the ssize_t type for sequences which allows about 2GiB to be mapped on a 32-bit platform. The mmap.resize() operation as well as changing the size of the underlying file size with ftruncate(), also changes the size of the map in memory. Thus in your sample script, this resizing would need to ftruncate the file to 5GiB and then map all of it into memory which is not possible on 32-bit.
I think the following might be a possible way of doing it:
import mmap
f = open('test.x', 'w+b')
f.write(b'\x00' * 10)
f.flush()
fm = mmap.mmap(f.fileno(), 0)
fm.close()
f.truncate(5 * 1024 ** 3)
# mmap only 1GiB of the 5GiB file
fm = mmap.mmap(f.fileno(), 1024**3)
fm.close()
f.close()
|
msg128312 - (view) |
Author: STINNER Victor (vstinner) * |
Date: 2011-02-10 17:03 |
Le jeudi 10 février 2011 à 16:41 +0000, Ross Lagerwall a écrit :
> Ross Lagerwall <rosslagerwall@gmail.com> added the comment:
>
> 32-bit computers can address up to 4GiB of memory
... at least 4 GB. With PAE (Physical Address Extension), we can address
up to 2^36 bytes. It looks like mmap() do support PAE, I suppose that
you need 64 bits offset on a 32 bits system to be able to use PAE :-)
Victor
|
msg128313 - (view) |
Author: Antoine Pitrou (pitrou) * |
Date: 2011-02-10 17:06 |
Le jeudi 10 février 2011 à 17:03 +0000, STINNER Victor a écrit :
> STINNER Victor <victor.stinner@haypocalc.com> added the comment:
>
> Le jeudi 10 février 2011 à 16:41 +0000, Ross Lagerwall a écrit :
> > Ross Lagerwall <rosslagerwall@gmail.com> added the comment:
> >
> > 32-bit computers can address up to 4GiB of memory
>
> ... at least 4 GB. With PAE (Physical Address Extension), we can address
> up to 2^36 bytes. It looks like mmap() do support PAE, I suppose that
> you need 64 bits offset on a 32 bits system to be able to use PAE :-)
We are talking about virtual addresses, not physical addresses. PAE
doesn't magically enlarge your address registers ;)
|
msg128336 - (view) |
Author: wrobell (wrobell) |
Date: 2011-02-10 20:17 |
Ross, you are completely right. thanks for the tips!
|
msg129012 - (view) |
Author: Antoine Pitrou (pitrou) * |
Date: 2011-02-21 23:44 |
I was a bit optimistic concerning 32-bit Windows. I had to do some changes, in part because off_t is 32-bit there. The final patch is committed in r88486 (tested under 32-bit and 64-bit Linux and Windows).
|
msg129014 - (view) |
Author: Antoine Pitrou (pitrou) * |
Date: 2011-02-21 23:59 |
Backported in r88487 to 3.2, and in r88488 to 2.7.
|
|
Date |
User |
Action |
Args |
2022-04-11 14:56:42 | admin | set | github: 48931 |
2011-02-21 23:59:56 | pitrou | set | status: open -> closed versions:
+ Python 2.7 nosy:
loewis, pitrou, vstinner, wrobell, nadeem.vawda, eckhardt, saa, rosslagerwall, jbeezley messages:
+ msg129014
resolution: fixed stage: patch review -> resolved |
2011-02-21 23:44:30 | pitrou | set | nosy:
loewis, pitrou, vstinner, wrobell, nadeem.vawda, eckhardt, saa, rosslagerwall, jbeezley messages:
+ msg129012 |
2011-02-10 20:17:59 | wrobell | set | nosy:
loewis, pitrou, vstinner, wrobell, nadeem.vawda, eckhardt, saa, rosslagerwall, jbeezley messages:
+ msg128336 |
2011-02-10 17:06:04 | pitrou | set | nosy:
loewis, pitrou, vstinner, wrobell, nadeem.vawda, eckhardt, saa, rosslagerwall, jbeezley messages:
+ msg128313 |
2011-02-10 17:03:48 | vstinner | set | nosy:
loewis, pitrou, vstinner, wrobell, nadeem.vawda, eckhardt, saa, rosslagerwall, jbeezley messages:
+ msg128312 |
2011-02-10 16:41:45 | rosslagerwall | set | nosy:
loewis, pitrou, vstinner, wrobell, nadeem.vawda, eckhardt, saa, rosslagerwall, jbeezley messages:
+ msg128305 |
2011-02-10 09:43:31 | wrobell | set | nosy:
loewis, pitrou, vstinner, wrobell, nadeem.vawda, eckhardt, saa, rosslagerwall, jbeezley messages:
+ msg128274 |
2011-02-10 04:38:01 | rosslagerwall | set | nosy:
loewis, pitrou, vstinner, wrobell, nadeem.vawda, eckhardt, saa, rosslagerwall, jbeezley messages:
+ msg128261 |
2011-02-10 02:06:36 | wrobell | set | nosy:
+ wrobell messages:
+ msg128259
|
2011-02-09 20:18:18 | pitrou | set | nosy:
loewis, pitrou, vstinner, nadeem.vawda, eckhardt, saa, rosslagerwall, jbeezley messages:
+ msg128236 |
2011-02-09 14:52:42 | rosslagerwall | set | files:
+ mmap_revised.patch nosy:
loewis, pitrou, vstinner, nadeem.vawda, eckhardt, saa, rosslagerwall, jbeezley messages:
+ msg128215
|
2011-02-09 12:54:13 | pitrou | set | nosy:
+ pitrou messages:
+ msg128210
|
2011-02-09 01:24:19 | jbeezley | set | nosy:
+ jbeezley
|
2011-01-18 21:38:38 | vstinner | set | nosy:
+ vstinner
|
2011-01-18 06:30:57 | rosslagerwall | set | files:
+ mmap.patch nosy:
+ rosslagerwall messages:
+ msg126454
|
2011-01-06 22:18:49 | nadeem.vawda | set | nosy:
+ nadeem.vawda
|
2011-01-06 16:47:12 | pitrou | set | nosy:
loewis, eckhardt, saa stage: patch review versions:
+ Python 3.2, Python 3.3, - Python 2.6 |
2009-01-02 13:37:08 | eckhardt | set | nosy:
+ eckhardt |
2008-12-20 04:50:40 | saa | set | messages:
+ msg78085 |
2008-12-19 23:24:10 | saa | set | messages:
+ msg78076 |
2008-12-19 08:31:32 | saa | set | messages:
+ msg78055 title: mmap offset should be off_t instead of ssize_t -> mmap offset should be off_t instead of ssize_t, and size calculation needs corrected |
2008-12-17 09:55:48 | saa | set | files:
+ mmap_off_t_2.patch messages:
+ msg77963 |
2008-12-17 09:45:51 | loewis | set | messages:
+ msg77962 |
2008-12-17 09:42:27 | saa | set | messages:
+ msg77961 |
2008-12-17 09:38:24 | loewis | set | messages:
+ msg77960 |
2008-12-17 09:17:40 | saa | set | messages:
+ msg77959 |
2008-12-17 09:07:08 | loewis | set | nosy:
+ loewis messages:
+ msg77958 |
2008-12-17 08:40:09 | saa | create | |