Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Problems with handling the file command output in platform.architecture() #79529

Closed
serhiy-storchaka opened this issue Nov 29, 2018 · 24 comments
Closed
Labels
3.7 (EOL) end of life 3.8 only security fixes stdlib Python modules in the Lib dir type-bug An unexpected behavior, bug, or error

Comments

@serhiy-storchaka
Copy link
Member

BPO 35348
Nosy @malemburg, @ronaldoussoren, @vstinner, @serhiy-storchaka, @Windsooon
PRs
  • bpo-35348: Fix platform.architecture() #11159
  • bpo-35348: Make parsing the output of the file command in platform.architecture() more reliable. #11160
  • bpo-35348: Fix platform.architecture() on macOS #11186
  • bpo-35348: platform.architecture() doc suggests sys.maxsize #11208
  • Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.

    Show more details

    GitHub fields:

    assignee = None
    closed_at = <Date 2018-12-18.16:40:49.785>
    created_at = <Date 2018-11-29.12:33:34.244>
    labels = ['3.7', '3.8', 'type-bug', 'library']
    title = 'Problems with handling the file command output in platform.architecture()'
    updated_at = <Date 2018-12-18.16:40:49.784>
    user = 'https://github.com/serhiy-storchaka'

    bugs.python.org fields:

    activity = <Date 2018-12-18.16:40:49.784>
    actor = 'vstinner'
    assignee = 'none'
    closed = True
    closed_date = <Date 2018-12-18.16:40:49.785>
    closer = 'vstinner'
    components = ['Library (Lib)']
    creation = <Date 2018-11-29.12:33:34.244>
    creator = 'serhiy.storchaka'
    dependencies = []
    files = []
    hgrepos = []
    issue_num = 35348
    keywords = ['patch']
    message_count = 24.0
    messages = ['330686', '330689', '330738', '331816', '331817', '331820', '331821', '331826', '331829', '331830', '331831', '331834', '331835', '331836', '331843', '331956', '331958', '331959', '331962', '331965', '332015', '332016', '332050', '332072']
    nosy_count = 5.0
    nosy_names = ['lemburg', 'ronaldoussoren', 'vstinner', 'serhiy.storchaka', 'Windson Yang']
    pr_nums = ['11159', '11160', '11186', '11208']
    priority = 'normal'
    resolution = 'fixed'
    stage = 'resolved'
    status = 'closed'
    superseder = None
    type = 'behavior'
    url = 'https://bugs.python.org/issue35348'
    versions = ['Python 3.6', 'Python 3.7', 'Python 3.8']

    @serhiy-storchaka
    Copy link
    Member Author

    The code of _syscmd_file() in the platform module does not match the docstring. The "-b" option was removed in 685fffa (bpo-16112), and this leads to inclusion the executable path in the file command output. If the executable path contains some key strings like "32-bit" or "PE", platform.architecture() can return an incorrect result. I think that the "-b" option should be restored.

    $ python3 -c 'import platform; print(platform.architecture("/usr/bin/python3.6"))'
    ('64bit', 'ELF')
    $ cp /usr/bin/python3.6 /tmp/32-bitPE
    $ python3 -c 'import platform; print(platform.architecture("/tmp/32-bitPE"))'
    ('32bit', 'ELF')

    Other problem is that the code tests if the string "executable" is contained in the file command output. But it is not always contained for executables on Linux.

    $ file python
    python: ELF 64-bit LSB shared object, x86-64, version 1 (SYSV), dynamically linked, interpreter /lib64/ld-linux-x86-64.so.2, for GNU/Linux 3.2.0, BuildID[sha1]=d3cfa06c2bdcbf7b6af9e4e6be5061cb8398c086, with debug_info, not stripped
    $ file /usr/bin/python2.7
    /usr/bin/python2.7: ELF 64-bit LSB shared object, x86-64, version 1 (SYSV), dynamically linked, interpreter /lib64/ld-linux-x86-64.so.2, for GNU/Linux 3.2.0, BuildID[sha1]=fd3904306c73383fb371287416257b82d6a3363b, stripped
    $ file /usr/bin/python3.6
    /usr/bin/python3.6: ELF 64-bit LSB executable, x86-64, version 1 (SYSV), dynamically linked, interpreter /lib64/ld-linux-x86-64.so.2, for GNU/Linux 3.2.0, BuildID[sha1]=9dae0eec9b3f9cb82612d20dc0c3088feab9e356, stripped

    @serhiy-storchaka serhiy-storchaka added 3.7 (EOL) end of life 3.8 only security fixes stdlib Python modules in the Lib dir type-bug An unexpected behavior, bug, or error labels Nov 29, 2018
    @vstinner
    Copy link
    Member

    Why does platform has to analyze sys.executable binary to check if it's 32 or 64 bits? Can't we use sizeof(void*) for example? Is it something related to FAT binary on macOS? (single binary for 32 and 64 bits, or single binary for PPC and x86)

    @Windsooon
    Copy link
    Mannequin

    Windsooon mannequin commented Nov 30, 2018

    I agreed with Serhiy. I also found the function decode the output with latin-1, I think it will be better to use utf-8 instead.

    @vstinner
    Copy link
    Member

    I removed the dependency between bpo-35346 and this issue. I don't see how they are related.

    @vstinner
    Copy link
    Member

    I agreed with Serhiy. I also found the function decode the output with latin-1, I think it will be better to use utf-8 instead.

    Decoding from UTF-8 can fail with UnicodeDecodeError, whereas decoding from latin-1 never fails.

    file output is ASCII, so I don't see the point of using UTF-8.

    Currently, the command displays the filename, but I don't think that we should care of the encoding of the filename.

    I think that the "-b" option should be restored.

    That, or strip/skip the filename in the output?

    @vstinner
    Copy link
    Member

    -b option added by:
    https://hg.python.org/cpython/rev/c73b90b6dadd

    and removed the day after by:
    https://hg.python.org/cpython/rev/b94a9ff13199

    @vstinner
    Copy link
    Member

    -b option added by:
    https://hg.python.org/cpython/rev/c73b90b6dadd

    Oh wait, this change is for the 2.7 branch. The change in master (old "default" branch) didn't add -b, but replaced "-b" with "-b --":
    https://hg.python.org/cpython/rev/cd026866b333

    @serhiy-storchaka
    Copy link
    Member Author

    I tested that the "-b" option is supported on Linux, *BSD and OpenIndiana. But it is not a part of POSIX. So perhaps we should fall back to "file" without "-b" if "file -b" failed.

    We can also check that the output starts with executable+': ' and strip this prefix.

    @vstinner
    Copy link
    Member

    In 2.7 branch, _syscmd_file() only used -b option during one day (no Python 2.7.x release used -b):

    • Oct 4, 2012: commit 95038fa added -b: "Closes bpo-16112: platform.architecture does not correctly escape argument to /usr/bin/file"
    • Oct 5, 2012: commit 2699c9d removed -b: "bpo-16112: platform.architecture does not correctly escape argument to /usr/bin/file. Fix original patch"

    Python 3.2.0 (Feb 2011) to 3.2.3 (Sep 2012) and Python 3.3.0 (Sep 2012) used -b:

    • Aug 13, 2010: commit ddfb2c3 added -b: "Omit the filename to avoid enconding issues, especially with non encodable characters in the Python full path."
    • Oct 5, 2012: commit 685fffa removed -b: "bpo-16112: platform.architecture does not correctly escape argument to /usr/bin/file. Fix original patch"

    @vstinner
    Copy link
    Member

    We can also check that the output starts with executable+': ' and strip this prefix.

    Technically, on UNIX, ':' is valid in a filename. Filename examples which contain ':' on my Fedora 29:

    /usr/share/man/man3/List::Util.3pm.gz
    /usr/share/usb_modeswitch/0408:f000
    /proc/irq/127/ahci[0000:00:17.0]
    /proc/irq/131/snd_hda_intel:card0
    /dev/block/259:3
    /sys/kernel/slab/:0002632
    /sys/module/psmouse/drivers/serio:psmouse

    Note: I cannot find a program name which contains ':'.

    @vstinner
    Copy link
    Member

    A convervative approach would be to leave stable branches unchanged and use -b in the master branch.

    @serhiy-storchaka
    Copy link
    Member Author

    PR 11160 is an alternate solution which strips the filename in the output. It does not matter if the filename contains ":", because the format of the output in the POSIX locale is strictly specified.

    @ronaldoussoren
    Copy link
    Contributor

    BTW. A related problem with platform.architecture() is that it doesn't know how to deal with fat binaries (such as those found on macOS).

    As an example:

    $  file /usr/bin/python
    /usr/bin/python: Mach-O universal binary with 2 architectures: [i386:Mach-O executable i386] [x86_64:Mach-O 64-bit executable x86_64]
    /usr/bin/python (for architecture i386):	Mach-O executable i386
    /usr/bin/python (for architecture x86_64):	Mach-O 64-bit executable x86_64

    This will be reported as "64-bit" by platform.architecture() because there is '64-bit' in the output of file(1).

    Using sizeof(void*) or sys.maxsize suffers from the a simular problem: this will only detect the pointer-size of the current proces and not that the binary is capable of running with a different pointer-size as well.

    P.S. platform.architecture() uses file(1) because you can specify different executables than sys.executable.

    @serhiy-storchaka
    Copy link
    Member Author

    What result of platform.architecture() do you expect for an universal binary?

    @vstinner
    Copy link
    Member

    I don't understand the purpose of the 'linkage' information of platform.architecture(). Does anyone care if Python is an ELF program or a WindowsPE program? Maybe it was useful 20 years ago when there were COFF on Unix, but right now ELF is the defacto standard on Unix, and WindowsPE on Windows.

    32-bit and 64-bit information should be enough, no?

    I would suggest to just return ('%sbit' % bits, '') if executable is not set. Use struct.calcsize('P')*8 or sys.maxsize to get bits.

    @ronaldoussoren
    Copy link
    Contributor

    What result of platform.architecture() do you expect for an universal binary?

    I honestly don't know. What is the purpose of this functionality in the first place? I have never had a problem where using this function was the right solution.

    To be honest I have the same problem with a number of other APIs in this module. As an example, platform.system_alias() looks interesting but has an API that won't work in general (macOS release version cannot be calculated from platform.uname() information, likewise for linux distribution information).

    @vstinner
    Copy link
    Member

    Ronald Oussoren gave more info on my previous PR 10780 ("platform.platform() uses mac_ver() on macOS"):

    #10780 (comment)

    """
    The information does not include data about fat binaries, resulting amongst others in the following inconsistency:

    ronald@Menegroth[0]$ arch -i386 python3.6 -m platform
    Darwin-18.2.0-x86_64-i386-64bit

    ronald@Menegroth[0]$ arch -i386 python3.6 -c 'import sys; print(sys.maxsize)'
    2147483647

    This platform output includes "64bit" because the binary for python3.6 includes support for both i386 and x86_64, and doesn't show that the command is using i386 instructions.
    """

    I made some tests:

    $ file /usr/local/bin/python3
    /usr/local/bin/python3: Mach-O universal binary with 2 architectures: [i386:Mach-O executable i386] [x86_64:Mach-O 64-bit executable x86_64]
    /usr/local/bin/python3 (for architecture i386):	Mach-O executable i386
    /usr/local/bin/python3 (for architecture x86_64):	Mach-O 64-bit executable x86_64
    
    $ /usr/local/bin/python3 -c 'import struct, sys, platform; print(platform.architecture(), struct.calcsize("P"), sys.maxsize)'
    ('64bit', '') 8 9223372036854775807
    
    $ arch -x86_64 /usr/local/bin/python3 -c 'import struct, sys, platform; print(platform.architecture(), struct.calcsize("P"), sys.maxsize)'
    ('64bit', '') 8 9223372036854775807
    
    $ arch -i386 /usr/local/bin/python3 -c 'import struct, sys, platform; print(platform.architecture(), struct.calcsize("P"), sys.maxsize)'
    ('64bit', '') 4 2147483647

    IMHO platform.architecture() should return 32bit when running "arch -i386 /usr/local/bin/python3" to be consistent with struct.calcsize("P") == 4 and sys.maxsize == 2147483647. Otherwise, how would you notice that you are using the 32-bit flavor of Python?

    My PR 11186 implements this fix.

    Ronald Oussoren:

    Using sizeof(void*) or sys.maxsize suffers from the a simular problem: this will only detect the pointer-size of the current proces and not that the binary is capable of running with a different pointer-size as well.

    Right, but I don't think that it's possible to report that Python executable is FAT binary in platform.architecture() result. If you want to provide such information, IMHO you should write a new function or at least add a new parameter to platform.architecture().

    IMHO it's more consistent to report "32bit" for "arch -i386 python3" and "64bit" for "arch -x86_64 python3".

    @ronaldoussoren
    Copy link
    Contributor

    IMHO platform.architecture() should return 32bit when running "arch -i386 /usr/local/bin/python3" to be consistent with struct.calcsize("P") == 4 and sys.maxsize == 2147483647. Otherwise, how would you notice that you are using the 32-bit flavor of Python?

    I don't agree. Platform.architecture() is defined to look at a specified binary, not the currently running process. That can lead to inconsistencies like this and is not something you can avoid.

    Ronald Oussoren:
    > Using sizeof(void*) or sys.maxsize suffers from the a simular problem: this will only detect the pointer-size of the current proces and not that the binary is capable of running with a different pointer-size as well.

    Right, but I don't think that it's possible to report that Python executable is FAT binary in platform.architecture() result. If you want to provide such information, IMHO you should write a new function or at least add a new parameter to platform.architecture().

    IMHO it's more consistent to report "32bit" for "arch -i386 python3" and "64bit" for "arch -x86_64 python3".

    This doesn't necessarily need a new function, platform.architecture could also return something like "32bit,64bit".

    But as I mentioned in my previous message I don't know why anyone would want to use this function in the first place. There are better ways to determine information about the current process (struct.calcsize, sys.maxsize, sys.byteorder), and I have never had a need to determine information about executable files that I couldn't get in a better way using other libraries (like macholib and pyelftools)

    @vstinner
    Copy link
    Member

    See also bpo-35516: "platform.system_alias(): add macOS support".

    @vstinner
    Copy link
    Member

    I don't agree. Platform.architecture() is defined to look at a specified binary, not the currently running process. That can lead to inconsistencies like this and is not something you can avoid.

    architecture() looks at running Python executable by default and documents a special case when executable equals to sys.executable:
    "(...) then only if the executable points to the Python interpreter. Reasonable defaults are used when the above needs are not met."
    https://docs.python.org/dev/library/platform.html#platform.architecture

    This doesn't necessarily need a new function, platform.architecture could also return something like "32bit,64bit".

    As an user, I don't need for this information.

    architecture() already contains a note:

    """
    On Mac OS X (and perhaps other platforms), executable files may be universal files containing multiple architectures.

    To get at the “64-bitness” of the current interpreter, it is more reliable to query the sys.maxsize attribute:

    is_64bits = sys.maxsize > 2**32
    """

    But as I mentioned in my previous message I don't know why anyone would want to use this function in the first place. There are better ways to determine information about the current process (struct.calcsize, sys.maxsize, sys.byteorder), and I have never had a need to determine information about executable files that I couldn't get in a better way using other libraries (like macholib and pyelftools)

    platform.architecture() has multiple issues:

    • It rely on the external program "file". It is not available on Windows. It is likely missing on small Linux containers. It doensn't report an error if the program is missing but "should be available".
    • Calling an external program can lead to security issues.
    • Parsing file output is not reliable, even if PR 11159 should make the parsing more reliable. For example, file output is locale dependent and the -b option is not standard.
    • The expected output on a macOS universal binary is unclear.
    • The purpose of the function seems to be unclear to most developers...

    Another solution is to deprecate the function. I agree with Ronald that sys.maxsize is enough for most use cases (get "bits"). For more accurate information, platform.architecture() is wrong and a third-party module is required.

    By the way, platform.architecture() is not used in the stdlib which is a sign that maybe the function is not really helpful. Moreover, sysconfig and distutils.util contain the following code:

                # We can't use "platform.architecture()[0]" because a
                # bootstrap problem. We use a dict to get an error
                # if some suspicious happens.
                bitness = {2147483647:"32bit", 9223372036854775807:"64bit"}
                machine += ".%s" % bitness[sys.maxsize]

    Serhiy, Ronald: What do you think of deprecating platform.architecture() instead of trying to fix it?

    @vstinner
    Copy link
    Member

    New changeset 0af9c33 by Victor Stinner in branch 'master':
    bpo-35348: Fix platform.architecture() (GH-11159)
    0af9c33

    @malemburg
    Copy link
    Member

    Guys, please read the doc-string of the platform.architecture() function (or ask the person who wrote most of the module). It clearly refers to inspecting a specific executable and only uses the Python interpreter as default.

    The running process can provide some sane defaults, but is not necessarily using the same values as the given executable.

    The function does not support multi-architecture executables. This is simply out of scope for the function.

    Victor: AFAIK, I still own this module, so if you want to deprecate something, please ping me first.

    @vstinner
    Copy link
    Member

    Ok, I closed my PR 11186 which modified architecture() to only return struct.calcsize('P') if the executable argument is equal to sys.executable.

    please read the doc-string of the platform.architecture() function (or ask the person who wrote most of the module). It clearly refers to inspecting a specific executable and only uses the Python interpreter as default. The running process can provide some sane defaults, but is not necessarily using the same values as the given executable.

    I see the platform module as a module to get info about the operating system and Python, but it seems like I misunderstood the purpose of the specific case of the architecture() function.

    I propose a small addition to the doc to avoid confusion:
    https://github.com/python/cpython/pull/11208/files

    @vstinner
    Copy link
    Member

    The initial issue has been fixed, I close the issue. Thanks for the review and feedback!

    @ezio-melotti ezio-melotti transferred this issue from another repository Apr 10, 2022
    Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
    Labels
    3.7 (EOL) end of life 3.8 only security fixes stdlib Python modules in the Lib dir type-bug An unexpected behavior, bug, or error
    Projects
    None yet
    Development

    No branches or pull requests

    4 participants