classification
Title: Problems with handling the file command output in platform.architecture()
Type: behavior Stage: resolved
Components: Library (Lib) Versions: Python 3.8, Python 3.7, Python 3.6
process
Status: closed Resolution: fixed
Dependencies: Superseder:
Assigned To: Nosy List: Windson Yang, lemburg, ronaldoussoren, serhiy.storchaka, vstinner
Priority: normal Keywords: patch

Created on 2018-11-29 12:33 by serhiy.storchaka, last changed 2018-12-18 16:40 by vstinner. This issue is now closed.

Pull Requests
URL Status Linked Edit
PR 11159 merged vstinner, 2018-12-14 12:29
PR 11160 closed serhiy.storchaka, 2018-12-14 14:55
PR 11186 closed vstinner, 2018-12-17 09:01
PR 11208 closed vstinner, 2018-12-18 11:06
Messages (24)
msg330686 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2018-11-29 12:33
The code of _syscmd_file() in the platform module does not match the docstring. The "-b" option was removed in 685fffa (issue16112), and this leads to inclusion the executable path in the file command output. If the executable path contains some key strings like "32-bit" or "PE", platform.architecture() can return an incorrect result. I think that the "-b" option should be restored.

$ python3 -c 'import platform; print(platform.architecture("/usr/bin/python3.6"))'
('64bit', 'ELF')
$ cp /usr/bin/python3.6 /tmp/32-bitPE
$ python3 -c 'import platform; print(platform.architecture("/tmp/32-bitPE"))'
('32bit', 'ELF')

Other problem is that the code tests if the string "executable" is contained in the file command output. But it is not always contained for executables on Linux.

$ file python
python: ELF 64-bit LSB shared object, x86-64, version 1 (SYSV), dynamically linked, interpreter /lib64/ld-linux-x86-64.so.2, for GNU/Linux 3.2.0, BuildID[sha1]=d3cfa06c2bdcbf7b6af9e4e6be5061cb8398c086, with debug_info, not stripped
$ file /usr/bin/python2.7
/usr/bin/python2.7: ELF 64-bit LSB shared object, x86-64, version 1 (SYSV), dynamically linked, interpreter /lib64/ld-linux-x86-64.so.2, for GNU/Linux 3.2.0, BuildID[sha1]=fd3904306c73383fb371287416257b82d6a3363b, stripped
$ file /usr/bin/python3.6
/usr/bin/python3.6: ELF 64-bit LSB executable, x86-64, version 1 (SYSV), dynamically linked, interpreter /lib64/ld-linux-x86-64.so.2, for GNU/Linux 3.2.0, BuildID[sha1]=9dae0eec9b3f9cb82612d20dc0c3088feab9e356, stripped
msg330689 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2018-11-29 14:27
Why does platform has to analyze sys.executable binary to check if it's 32 or 64 bits? Can't we use sizeof(void*) for example? Is it something related to FAT binary on macOS? (single binary for 32 and 64 bits, or single binary for PPC and x86)
msg330738 - (view) Author: Windson Yang (Windson Yang) * Date: 2018-11-30 00:10
I agreed with Serhiy. I also found the function decode the output with latin-1, I think it will be better to use utf-8 instead.
msg331816 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2018-12-14 12:18
I removed the dependency between bpo-35346 and this issue. I don't see how they are related.
msg331817 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2018-12-14 12:20
> I agreed with Serhiy. I also found the function decode the output with latin-1, I think it will be better to use utf-8 instead.

Decoding from UTF-8 can fail with UnicodeDecodeError, whereas decoding from latin-1 never fails.

file output is ASCII, so I don't see the point of using UTF-8.

Currently, the command displays the filename, but I don't think that we should care of the encoding of the filename.

> I think that the "-b" option should be restored.

That, or strip/skip the filename in the output?
msg331820 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2018-12-14 12:30
-b option added by:
https://hg.python.org/cpython/rev/c73b90b6dadd

and removed the day after by:
https://hg.python.org/cpython/rev/b94a9ff13199
msg331821 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2018-12-14 12:33
> -b option added by:
> https://hg.python.org/cpython/rev/c73b90b6dadd

Oh wait, this change is for the 2.7 branch. The change in master (old "default" branch) didn't add -b, but replaced "-b" with "-b --":
https://hg.python.org/cpython/rev/cd026866b333
msg331826 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2018-12-14 13:30
I tested that the "-b" option is supported on Linux, *BSD and OpenIndiana. But it is not a part of POSIX. So perhaps we should fall back to "file" without "-b" if "file -b" failed.

We can also check that the output starts with executable+': ' and strip this prefix.
msg331829 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2018-12-14 14:21
In 2.7 branch, _syscmd_file() only used -b option during one day (no Python 2.7.x release used -b):

* Oct 4, 2012: commit 95038fa526c8b93e42c59b0735edf1c80b7b6449 added -b: "Closes #16112: platform.architecture does not correctly escape argument to /usr/bin/file"
* Oct 5, 2012: commit 2699c9d24810196a07e0577215ec3beb7ecfb55b removed -b: "#16112: platform.architecture does not correctly escape argument to /usr/bin/file. Fix original patch"

Python 3.2.0 (Feb 2011) to 3.2.3 (Sep 2012) and Python 3.3.0 (Sep 2012) used -b:

* Aug 13, 2010: commit ddfb2c3a338a8c1550774648f816e96c41f59de1 added -b: "Omit the filename to avoid enconding issues, especially with non encodable characters in the Python full path."
* Oct 5, 2012: commit 685fffa8f427b3a768b232170752565d58c32112 removed -b: "#16112: platform.architecture does not correctly escape argument to /usr/bin/file. Fix original patch"
msg331830 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2018-12-14 14:27
> We can also check that the output starts with executable+': ' and strip this prefix.

Technically, on UNIX, ':' is valid in a filename. Filename examples which contain ':' on my Fedora 29:

/usr/share/man/man3/List::Util.3pm.gz
/usr/share/usb_modeswitch/0408:f000
/proc/irq/127/ahci[0000:00:17.0]
/proc/irq/131/snd_hda_intel:card0
/dev/block/259:3
/sys/kernel/slab/:0002632
/sys/module/psmouse/drivers/serio:psmouse

Note: I cannot find a program name which contains ':'.
msg331831 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2018-12-14 14:28
A convervative approach would be to leave stable branches unchanged and use -b in the master branch.
msg331834 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2018-12-14 15:04
PR 11160 is an alternate solution which strips the filename in the output. It does not matter if the filename contains ":", because the format of the output in the POSIX locale is strictly specified.
msg331835 - (view) Author: Ronald Oussoren (ronaldoussoren) * (Python committer) Date: 2018-12-14 15:20
BTW. A related problem with platform.architecture() is that it doesn't know how to deal with fat binaries (such as those found on macOS).

As an example:

$  file /usr/bin/python
/usr/bin/python: Mach-O universal binary with 2 architectures: [i386:Mach-O executable i386] [x86_64:Mach-O 64-bit executable x86_64]
/usr/bin/python (for architecture i386):	Mach-O executable i386
/usr/bin/python (for architecture x86_64):	Mach-O 64-bit executable x86_64

This will be reported as "64-bit" by platform.architecture() because there is '64-bit' in the output of file(1).

Using sizeof(void*) or sys.maxsize suffers from the a simular problem: this will only detect the pointer-size of the current proces and not that the binary is capable of running with a different pointer-size as well. 

P.S. platform.architecture() uses file(1) because you can specify different executables than sys.executable.
msg331836 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2018-12-14 15:59
What result of platform.architecture() do you expect for an universal binary?
msg331843 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2018-12-14 17:20
I don't understand the purpose of the 'linkage' information of platform.architecture(). Does anyone care if Python is an ELF program or a WindowsPE program? Maybe it was useful 20 years ago when there were COFF on Unix, but right now ELF is the defacto standard on Unix, and WindowsPE on Windows.

32-bit and 64-bit information should be enough, no?

I would suggest to just return ('%sbit' % bits, '') if executable is not set. Use struct.calcsize('P')*8 or sys.maxsize to get bits.
msg331956 - (view) Author: Ronald Oussoren (ronaldoussoren) * (Python committer) Date: 2018-12-17 09:06
> What result of platform.architecture() do you expect for an universal binary?

I honestly don't know. What is the purpose of this functionality in the first place? I have never had a problem where using this function was the right solution.

To be honest I have the same problem with a number of other APIs in this module. As an example, platform.system_alias() looks interesting but has an API that won't work in general (macOS release version cannot be calculated from platform.uname() information, likewise for linux distribution information).
msg331958 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2018-12-17 09:13
Ronald Oussoren gave more info on my previous PR 10780 ("platform.platform() uses mac_ver() on macOS"):

https://github.com/python/cpython/pull/10780#issuecomment-444529371

"""
The information does not include data about fat binaries, resulting amongst others in the following inconsistency:

ronald@Menegroth[0]$ arch -i386 python3.6 -m platform
Darwin-18.2.0-x86_64-i386-64bit

ronald@Menegroth[0]$ arch -i386 python3.6 -c 'import sys; print(sys.maxsize)'
2147483647

This platform output includes "64bit" because the binary for python3.6 includes support for both i386 and x86_64, and doesn't show that the command is using i386 instructions.
"""


I made some tests:

$ file /usr/local/bin/python3
/usr/local/bin/python3: Mach-O universal binary with 2 architectures: [i386:Mach-O executable i386] [x86_64:Mach-O 64-bit executable x86_64]
/usr/local/bin/python3 (for architecture i386):	Mach-O executable i386
/usr/local/bin/python3 (for architecture x86_64):	Mach-O 64-bit executable x86_64

$ /usr/local/bin/python3 -c 'import struct, sys, platform; print(platform.architecture(), struct.calcsize("P"), sys.maxsize)'
('64bit', '') 8 9223372036854775807

$ arch -x86_64 /usr/local/bin/python3 -c 'import struct, sys, platform; print(platform.architecture(), struct.calcsize("P"), sys.maxsize)'
('64bit', '') 8 9223372036854775807

$ arch -i386 /usr/local/bin/python3 -c 'import struct, sys, platform; print(platform.architecture(), struct.calcsize("P"), sys.maxsize)'
('64bit', '') 4 2147483647


IMHO platform.architecture() should return 32bit when running "arch -i386 /usr/local/bin/python3" to be consistent with struct.calcsize("P") == 4 and sys.maxsize == 2147483647. Otherwise, how would you notice that you are using the 32-bit flavor of Python?

My PR 11186 implements this fix.


Ronald Oussoren:
> Using sizeof(void*) or sys.maxsize suffers from the a simular problem: this will only detect the pointer-size of the current proces and not that the binary is capable of running with a different pointer-size as well.

Right, but I don't think that it's possible to report that Python executable is FAT binary in platform.architecture() result. If you want to provide such information, IMHO you should write a new function or at least add a new parameter to platform.architecture().

IMHO it's more consistent to report "32bit" for "arch -i386 python3" and "64bit" for "arch -x86_64 python3".
msg331959 - (view) Author: Ronald Oussoren (ronaldoussoren) * (Python committer) Date: 2018-12-17 09:30
> IMHO platform.architecture() should return 32bit when running "arch -i386 /usr/local/bin/python3" to be consistent with struct.calcsize("P") == 4 and sys.maxsize == 2147483647. Otherwise, how would you notice that you are using the 32-bit flavor of Python?

I don't agree. Platform.architecture() is defined to look at a specified binary, not the currently running process. That can lead to inconsistencies like this and is not something you can avoid. 


> Ronald Oussoren:
> > Using sizeof(void*) or sys.maxsize suffers from the a simular problem:  this will only detect the pointer-size of the current proces and not that the binary is capable of running with a different pointer-size as well.

> Right, but I don't think that it's possible to report that Python executable is FAT binary in platform.architecture() result. If you want to provide such information, IMHO you should write a new function or at least add a new parameter to platform.architecture().

> IMHO it's more consistent to report "32bit" for "arch -i386 python3" and "64bit" for "arch -x86_64 python3".

This doesn't necessarily need a new function, platform.architecture could also return something like "32bit,64bit". 

But as I mentioned in my previous message I don't know why anyone would want to use this function in the first place. There are better ways to determine information about the current process (struct.calcsize, sys.maxsize, sys.byteorder), and I have never had a need to determine information about executable files that I couldn't get in a better way using other libraries (like macholib and pyelftools)
msg331962 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2018-12-17 09:36
See also bpo-35516: "platform.system_alias(): add macOS support".
msg331965 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2018-12-17 10:15
> I don't agree. Platform.architecture() is defined to look at a specified binary, not the currently running process. That can lead to inconsistencies like this and is not something you can avoid.

architecture() looks at running Python executable by default and documents a special case when executable equals to sys.executable:
"(...) then only if the executable points to the Python interpreter. Reasonable defaults are used when the above needs are not met."
https://docs.python.org/dev/library/platform.html#platform.architecture


> This doesn't necessarily need a new function, platform.architecture could also return something like "32bit,64bit".

As an user, I don't need for this information.

architecture() already contains a note:

"""
On Mac OS X (and perhaps other platforms), executable files may be universal files containing multiple architectures.

To get at the “64-bitness” of the current interpreter, it is more reliable to query the sys.maxsize attribute:

is_64bits = sys.maxsize > 2**32
"""


> But as I mentioned in my previous message I don't know why anyone would want to use this function in the first place. There are better ways to determine information about the current process (struct.calcsize, sys.maxsize, sys.byteorder), and I have never had a need to determine information about executable files that I couldn't get in a better way using other libraries (like macholib and pyelftools)

platform.architecture() has multiple issues:

* It rely on the external program "file". It is not available on Windows. It is likely missing on small Linux containers. It doensn't report an error if the program is missing but "should be available".
* Calling an external program can lead to security issues.
* Parsing file output is not reliable, even if PR 11159 should make the parsing more reliable. For example, file output is locale dependent and the -b option is not standard.
* The expected output on a macOS universal binary is unclear.
* The purpose of the function seems to be unclear to most developers...

Another solution is to deprecate the function. I agree with Ronald that sys.maxsize is enough for most use cases (get "bits"). For more accurate information, platform.architecture() is wrong and a third-party module is required.

By the way, platform.architecture() is not used in the stdlib which is a sign that maybe the function is not really helpful. Moreover, sysconfig and distutils.util contain the following code:

            # We can't use "platform.architecture()[0]" because a
            # bootstrap problem. We use a dict to get an error
            # if some suspicious happens.
            bitness = {2147483647:"32bit", 9223372036854775807:"64bit"}
            machine += ".%s" % bitness[sys.maxsize]

Serhiy, Ronald: What do you think of deprecating platform.architecture() instead of trying to fix it?
msg332015 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2018-12-17 17:47
New changeset 0af9c33262fb43f39a6558e3f155689e83e10706 by Victor Stinner in branch 'master':
bpo-35348: Fix platform.architecture() (GH-11159)
https://github.com/python/cpython/commit/0af9c33262fb43f39a6558e3f155689e83e10706
msg332016 - (view) Author: Marc-Andre Lemburg (lemburg) * (Python committer) Date: 2018-12-17 18:45
Guys, please read the doc-string of the platform.architecture() function (or ask the person who wrote most of the module). It clearly refers to inspecting a specific executable and only uses the Python interpreter as default.

The running process can provide some sane defaults, but is not necessarily using the same values as the given executable.

The function does not support multi-architecture executables. This is simply out of scope for the function.

Victor: AFAIK, I still own this module, so if you want to deprecate something, please ping me first.
msg332050 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2018-12-18 11:12
Ok, I closed my PR 11186 which modified architecture() to only return struct.calcsize('P') if the executable argument is equal to sys.executable.

> please read the doc-string of the platform.architecture() function (or ask the person who wrote most of the module). It clearly refers to inspecting a specific executable and only uses the Python interpreter as default. The running process can provide some sane defaults, but is not necessarily using the same values as the given executable.

I see the platform module as a module to get info about the operating system and Python, but it seems like I misunderstood the purpose of the specific case of the architecture() function.

I propose a small addition to the doc to avoid confusion:
https://github.com/python/cpython/pull/11208/files
msg332072 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2018-12-18 16:40
The initial issue has been fixed, I close the issue. Thanks for the review and feedback!
History
Date User Action Args
2018-12-18 16:40:49vstinnersetstatus: open -> closed
resolution: fixed
messages: + msg332072

stage: patch review -> resolved
2018-12-18 11:12:59vstinnersetmessages: + msg332050
2018-12-18 11:06:34vstinnersetpull_requests: + pull_request10445
2018-12-17 18:45:51lemburgsetmessages: + msg332016
2018-12-17 17:47:28vstinnersetmessages: + msg332015
2018-12-17 10:15:35vstinnersetmessages: + msg331965
2018-12-17 09:36:26vstinnersetmessages: + msg331962
2018-12-17 09:30:17ronaldoussorensetmessages: + msg331959
2018-12-17 09:13:54vstinnersetmessages: + msg331958
2018-12-17 09:06:47ronaldoussorensetmessages: + msg331956
2018-12-17 09:01:41vstinnersetpull_requests: + pull_request10424
2018-12-14 17:20:59vstinnersetmessages: + msg331843
2018-12-14 15:59:00serhiy.storchakasetmessages: + msg331836
2018-12-14 15:20:37ronaldoussorensetnosy: + ronaldoussoren
messages: + msg331835
2018-12-14 15:04:04serhiy.storchakasetmessages: + msg331834
2018-12-14 14:55:24serhiy.storchakasetpull_requests: + pull_request10396
2018-12-14 14:28:01vstinnersetmessages: + msg331831
2018-12-14 14:27:31vstinnersetmessages: + msg331830
2018-12-14 14:21:25vstinnersetmessages: + msg331829
2018-12-14 13:30:57serhiy.storchakasetmessages: + msg331826
2018-12-14 12:33:08vstinnersetmessages: + msg331821
2018-12-14 12:30:56vstinnersetmessages: + msg331820
2018-12-14 12:29:41vstinnersetkeywords: + patch
stage: patch review
pull_requests: + pull_request10395
2018-12-14 12:20:39vstinnersetmessages: + msg331817
2018-12-14 12:18:25vstinnersetmessages: + msg331816
2018-12-14 12:15:57vstinnerunlinkissue35346 dependencies
2018-11-30 00:10:01Windson Yangsetnosy: + Windson Yang
messages: + msg330738
2018-11-29 14:27:12vstinnersetmessages: + msg330689
2018-11-29 12:34:03serhiy.storchakalinkissue35346 dependencies
2018-11-29 12:33:34serhiy.storchakacreate