Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

mmap offset should be off_t instead of ssize_t, and size calculation needs corrected #48931

Closed
saa mannequin opened this issue Dec 17, 2008 · 23 comments
Closed

mmap offset should be off_t instead of ssize_t, and size calculation needs corrected #48931

saa mannequin opened this issue Dec 17, 2008 · 23 comments
Labels
extension-modules C modules in the Modules dir type-bug An unexpected behavior, bug, or error

Comments

@saa
Copy link
Mannequin

saa mannequin commented Dec 17, 2008

BPO 4681
Nosy @loewis, @pitrou, @vstinner
Files
  • mmap_off_t.patch: Noobie fix for non-Windows
  • mmap_off_t_2.patch: Revised patch that removes fpos_t.
  • mmap.patch
  • mmap_revised.patch
  • Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.

    Show more details

    GitHub fields:

    assignee = None
    closed_at = <Date 2011-02-21.23:59:56.807>
    created_at = <Date 2008-12-17.08:40:08.917>
    labels = ['extension-modules', 'type-bug']
    title = 'mmap offset should be off_t instead of ssize_t, and size calculation needs corrected'
    updated_at = <Date 2011-02-21.23:59:56.805>
    user = 'https://bugs.python.org/saa'

    bugs.python.org fields:

    activity = <Date 2011-02-21.23:59:56.805>
    actor = 'pitrou'
    assignee = 'none'
    closed = True
    closed_date = <Date 2011-02-21.23:59:56.807>
    closer = 'pitrou'
    components = ['Extension Modules']
    creation = <Date 2008-12-17.08:40:08.917>
    creator = 'saa'
    dependencies = []
    files = ['12375', '12376', '20434', '20722']
    hgrepos = []
    issue_num = 4681
    keywords = ['patch']
    message_count = 23.0
    messages = ['77957', '77958', '77959', '77960', '77961', '77962', '77963', '78055', '78076', '78085', '126454', '128210', '128215', '128236', '128259', '128261', '128274', '128305', '128312', '128313', '128336', '129012', '129014']
    nosy_count = 9.0
    nosy_names = ['loewis', 'pitrou', 'vstinner', 'wrobell', 'nadeem.vawda', 'eckhardt', 'saa', 'rosslagerwall', 'jbeezley']
    pr_nums = []
    priority = 'normal'
    resolution = 'fixed'
    stage = 'resolved'
    status = 'closed'
    superseder = None
    type = 'behavior'
    url = 'https://bugs.python.org/issue4681'
    versions = ['Python 2.7', 'Python 3.2', 'Python 3.3']

    @saa
    Copy link
    Mannequin Author

    saa mannequin commented Dec 17, 2008

    In trying to use the mmap offset to allow me to work with a 12GB file, I
    encountered the fact that the new offset parameter was a ssize_t, which
    is only 32 bits on my platform (OS X). The mmap offset in C is defined
    to be an off_t
    (http://opengroup.org/onlinepubs/007908775/xsh/mmap.html), which is 64
    bits on my platform. The attached patch attempts to fix the non-Windows
    code to have an off_t offset parameter and allows my code to access all
    12 GB of my file. As a warning, this is the first time I've even looked
    at the Python source, and wouldn't even consider myself a Python coder;
    I just use Python every few months. Review the patch carefully.

    @saa saa mannequin added extension-modules C modules in the Modules dir type-bug An unexpected behavior, bug, or error labels Dec 17, 2008
    @loewis
    Copy link
    Mannequin

    loewis mannequin commented Dec 17, 2008

    I don't think fpos_t is a suitable type here. It isn't guaranteed to be
    an integral type, and, on many systems, it isn't integral. So you can't
    convert int to it, and you can't do arithmetic on it.

    @saa
    Copy link
    Mannequin Author

    saa mannequin commented Dec 17, 2008

    Sorry, I should have explained that the bulk of my patch was derived from
    existing code. The code that references fpos_t is basically a cut and
    paste of code from Modules/bz2module.c. I don't know if this is a valid
    usage of fpos_t, but if it isn't, you should probably file a bug on the
    existing usage in bz2module and suggest a fix. Thanks for reviewing!

    @loewis
    Copy link
    Mannequin

    loewis mannequin commented Dec 17, 2008

    This is different from bz2, though. In bzlib, there is no API to seek to
    a certain position. In mmap, the offset type in mmap specified as off_t,
    so fpos_t plays no role here.

    @saa
    Copy link
    Mannequin Author

    saa mannequin commented Dec 17, 2008

    Ah, I think I'm beginning to understand. Thanks for the explanation. So,
    if I understand correctly, I should refactor the patch and just remove all
    of the Py_off_t and just use off_t because the standard says that *is* the
    type. Correct?

    @loewis
    Copy link
    Mannequin

    loewis mannequin commented Dec 17, 2008

    Ah, I think I'm beginning to understand. Thanks for the explanation. So,
    if I understand correctly, I should refactor the patch and just remove all
    of the Py_off_t and just use off_t because the standard says that *is* the
    type. Correct?

    Correct.

    @saa
    Copy link
    Mannequin Author

    saa mannequin commented Dec 17, 2008

    Thanks again, revised patch attached.

    @saa
    Copy link
    Mannequin Author

    saa mannequin commented Dec 19, 2008

    There are bigger problems here. When the code calculates the size to map
    if size is passed as zero, it simply assigns the 64-bit st_size to the
    size to mmap. On a 32-bit system, it is obvious that the amount to map
    must be less than a 32-bit value. The calculation should take into
    account at least the following two things: 1) it should be the size of the
    file minus the offset, and 2) it shouldn't ask for an unreasonable amount
    of memory. I will experiment with this and submit a new patch (and delete
    the two I've already uploaded) within a couple of days, or at a minimum,
    provide an update to this bug.

    @saa saa mannequin changed the title mmap offset should be off_t instead of ssize_t mmap offset should be off_t instead of ssize_t, and size calculation needs corrected Dec 19, 2008
    @saa
    Copy link
    Mannequin Author

    saa mannequin commented Dec 19, 2008

    Notes on the problem.

    The Python mmap documentation says: If length is 0, the maximum length
    of the map will be the current size of the file when mmap is called.

    Currently, on a system where off_t is 64-bit and size_t is 32-bit, if
    the size of the file (st_size) is > 0xFFFF'FFFF, the code will in effect
    try to map (st_size & 0xFFFF'FFFF). This is suboptimal (imagine if the
    size of the file is 0x1'0000'1000 or even 0x1'0000'0000).

    In addition, it seems weird that a caller would have to check the size
    of the returned mmap against the size of the file to see if the entire
    file was mapped or not.

    Finally, it appears that there isn't a reliable, architecture
    independent, quick, and easy way to figure out what the maximum mmap
    size is.

    With these points in mind, I am going to work on coming up with a patch
    that will change the behavior to simply return an error when a zero
    length is passed to mmap and the entire file can not be mapped. This
    will put the onus on the caller to determine an appropriate mmap size
    for "chunking" the processing of the file, but the behavior will become
    more consistent.

    Of course, comments and suggestions are welcome. However, I'm going to
    have to wrap up my immediate involvement on this in the next couple of
    days (Yay, vacation!).

    @saa
    Copy link
    Mannequin Author

    saa mannequin commented Dec 20, 2008

    Ok, put a fork in me, 'cuz I'm done with this. The latest is that
    mmap.size() is defined to return the size of the file and not the size
    of the data. It tries to return it as a ssize_t, which of course, on
    systems where off_t is 64-bits and ssize_t is 32-bits, won't work for
    sizes that won't fit in 32-bits.

    Without the size of the data, it is unclear how one would traverse a
    file using the offset parameter. As part of fixing mmap, I would
    suggest someone should write an example of how it should be used in
    these cases and use that as a test case.

    I'm going to leave this up to someone with more knowledge of how this
    stuff *should* work in Python, but I'm going to go redo my program in C.
    Thanks for listening and the future efforts.

    @rosslagerwall
    Copy link
    Mannequin

    rosslagerwall mannequin commented Jan 18, 2011

    Attached is a fix to make offset use off_t. This means that mmap will work with offset > 2GB on 32bit systems.

    It also fixes that mmap.size() returns the correct value for files > 2GB on 32bit systems.

    The first issue of msg78055 was fixed in bpo-10916, this also fixes the second part, raising an exception if the mmap length is too large instead of mmap()ing an invalid or wrong size.

    @pitrou
    Copy link
    Member

    pitrou commented Feb 9, 2011

    Thanks for doing this. Looking at the patch: Modules/_io/_iomodule.h has macros for dealing with off_t (it also defines Py_off_t for Windows), perhaps you want to use them.

    Also, support.unlink() takes care of ignoring ENOENT for you.

    Regardling the sizes used in the test: I think using hexadecimal literals (e.g. 0x180000000 instead of 6442450944) would make things a bit more readable :)

    Part of the patch doesn't apply cleanly to latest py3k. Could you regenerate it, please?

    @rosslagerwall
    Copy link
    Mannequin

    rosslagerwall mannequin commented Feb 9, 2011

    OK, this new patch applies cleanly, uses support.unlink and hexadecimal constants.

    I left the off_t handling as is (it seems to work on *nix testing). Perhaps someone can handle the Windows side?

    @pitrou
    Copy link
    Member

    pitrou commented Feb 9, 2011

    I left the off_t handling as is (it seems to work on *nix testing).
    Perhaps someone can handle the Windows side?

    Ok, after looking at the code, it appears Windows already should have proper handling of 64-bit sizes under 32-bit builds.

    @wrobell
    Copy link
    Mannequin

    wrobell mannequin commented Feb 10, 2011

    will this patch support off_t for "mmap.resize" as well? ftruncate uses off_t for the length of a file as well.

    @rosslagerwall
    Copy link
    Mannequin

    rosslagerwall mannequin commented Feb 10, 2011

    will this patch support off_t for "mmap.resize" as well?
    ftruncate uses off_t for the length of a file as well.

    No, it doesn't because resize() resizes the amount mmapped in memory which can't be more than ssize_t anyway. However, resizing does work on a 12GiB file with a large offset, e.g. 6GiB.

    @wrobell
    Copy link
    Mannequin

    wrobell mannequin commented Feb 10, 2011

    > will this patch support off_t for "mmap.resize" as well?
    > ftruncate uses off_t for the length of a file as well.

    No, it doesn't because resize() resizes the amount mmapped in memory
    which can't be more than ssize_t anyway. However, resizing does work on
    a 12GiB file with a large offset, e.g. 6GiB.

    i am bit lost. ftruncate is used to enlarge the file, isn't it?

    please consider the script:

    -- rr.py ---

    import mmap
    
    f = open('test.x', 'w+b')
    f.write(b'\x00' * 10)
    f.flush()
    
    fm = mmap.mmap(f.fileno(), 0)
    fm.resize(5 * 1024 ** 3)
    fm.close()
    
    f.close()

    the script above resizes file from 10 bytes
    to 5 * 1024 ** 3 bytes with mmap.resize (but
    not 5 * 1024 ** 3 + 10!)

    now, on 32-bit linux (with off_t):

    $ python3 rr.py
    Traceback (most recent call last):
      File "rr.py", line 8, in <module>
        fm.resize(5 * 1024 ** 3)
    OverflowError: Python int too large to convert to C ssize_t

    above script works perfectly on 64-bit version of linux
    with just 4GB of RAM.

    maybe i do not understand some magic here, but clearly one
    can mmap.resize to a file size much beyond the amount of memory.

    @rosslagerwall
    Copy link
    Mannequin

    rosslagerwall mannequin commented Feb 10, 2011

    32-bit computers can address up to 4GiB of memory and 64-bit computers can address much more than this. mmap() allows a file to be mapped to a location in memory - the actual amount of memory that exists doesn't matter. This is the reason why a 5GiB file can be mmaped on 64-bit but not 32-bit.

    Also, python uses the ssize_t type for sequences which allows about 2GiB to be mapped on a 32-bit platform. The mmap.resize() operation as well as changing the size of the underlying file size with ftruncate(), also changes the size of the map in memory. Thus in your sample script, this resizing would need to ftruncate the file to 5GiB and then map all of it into memory which is not possible on 32-bit.

    I think the following might be a possible way of doing it:
    import mmap

    f = open('test.x', 'w+b')
    f.write(b'\x00' * 10)
    f.flush()
    
    fm = mmap.mmap(f.fileno(), 0)
    fm.close()
    f.truncate(5 * 1024 ** 3)
    
    # mmap only 1GiB of the 5GiB file
    fm = mmap.mmap(f.fileno(), 1024**3)
    fm.close()
    
    f.close()

    @vstinner
    Copy link
    Member

    Le jeudi 10 février 2011 à 16:41 +0000, Ross Lagerwall a écrit :

    Ross Lagerwall <rosslagerwall@gmail.com> added the comment:

    32-bit computers can address up to 4GiB of memory

    ... at least 4 GB. With PAE (Physical Address Extension), we can address
    up to 2^36 bytes. It looks like mmap() do support PAE, I suppose that
    you need 64 bits offset on a 32 bits system to be able to use PAE :-)

    Victor

    @pitrou
    Copy link
    Member

    pitrou commented Feb 10, 2011

    Le jeudi 10 février 2011 à 17:03 +0000, STINNER Victor a écrit :

    STINNER Victor <victor.stinner@haypocalc.com> added the comment:

    Le jeudi 10 février 2011 à 16:41 +0000, Ross Lagerwall a écrit :
    > Ross Lagerwall <rosslagerwall@gmail.com> added the comment:
    >
    > 32-bit computers can address up to 4GiB of memory

    ... at least 4 GB. With PAE (Physical Address Extension), we can address
    up to 2^36 bytes. It looks like mmap() do support PAE, I suppose that
    you need 64 bits offset on a 32 bits system to be able to use PAE :-)

    We are talking about virtual addresses, not physical addresses. PAE
    doesn't magically enlarge your address registers ;)

    @wrobell
    Copy link
    Mannequin

    wrobell mannequin commented Feb 10, 2011

    Ross, you are completely right. thanks for the tips!

    @pitrou
    Copy link
    Member

    pitrou commented Feb 21, 2011

    I was a bit optimistic concerning 32-bit Windows. I had to do some changes, in part because off_t is 32-bit there. The final patch is committed in r88486 (tested under 32-bit and 64-bit Linux and Windows).

    @pitrou
    Copy link
    Member

    pitrou commented Feb 21, 2011

    Backported in r88487 to 3.2, and in r88488 to 2.7.

    @pitrou pitrou closed this as completed Feb 21, 2011
    @ezio-melotti ezio-melotti transferred this issue from another repository Apr 10, 2022
    Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
    Labels
    extension-modules C modules in the Modules dir type-bug An unexpected behavior, bug, or error
    Projects
    None yet
    Development

    No branches or pull requests

    2 participants