classification
Title: platform.linux_distribution() should decode files from UTF-8, not from the locale encoding
Type: behavior Stage: patch review
Components: Versions: Python 3.4, Python 3.3
process
Status: closed Resolution: fixed
Dependencies: Superseder:
Assigned To: Nosy List: a.badger, haypo, loewis, python-dev, r.david.murray, zbysz
Priority: normal Keywords: patch

Created on 2013-03-15 15:08 by a.badger, last changed 2013-12-08 23:28 by haypo. This issue is now closed.

Files
File name Uploaded Description Edit
00175-platform-unicode.patch a.badger, 2013-03-15 15:11 Patch to use surrogateescape when reading from OS vendor provided file
00175-platform-unicode.patch a.badger, 2013-03-15 22:10 Patch to default to utf-8 and fallback to surrogateescape when reading from OS vendor provided file
00175-platform-unicode.patch a.badger, 2013-03-20 19:02 3rd version of the patch. Includes a unittest and some re-arranging of the code to allow the unittest to work.
00175-platform-unicode.patch a.badger, 2013-03-20 19:34 4th version of patch w/ NEWS; rebased against hg default review
00175-platform-unicode.patch a.badger, 2013-03-25 19:03 5th version of patch fixing issues raised by r.david.murray review
Messages (15)
msg184234 - (view) Author: Toshio Kuratomi (a.badger) * Date: 2013-03-15 15:08
Tested on python-3.2 and python-3.3.  platform.platform() looks for a file in /etc/ that looks like it will contain the name of the Linux distribution that python3 is running on.  Once found, it reads the contents of the file to have a name for the Linux distribution.  Most Linux distributions do create files inside of /etc/ with a single line which is the distribution name so this is a good heuristic.  However, these files are created by the operating system vendor and so they can have a different encoding than the encoding of the locale the user uses.  This means that if there are non-ascii characters inside the file, user code that invokes platform.platform() may throw a traceback.

Test:

$ LC_ALL=en_US.utf8 sudo echo ' Café' >> /etc/fedora-release
$ LC_ALL=C python3
>>> import platform
>>> platform.platform()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib64/python3.2/platform.py", line 1538, in platform
    distname,distversion,distid = dist('')
  File "/usr/lib64/python3.2/platform.py", line 358, in dist
    full_distribution_name=0)
  File "/usr/lib64/python3.2/platform.py", line 329, in linux_distribution
    firstline = f.readline()
  File "/usr/lib64/python3.2/encodings/ascii.py", line 26, in decode
    return codecs.ascii_decode(input, self.errors)[0]
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 22: ordinal not in range(128)

It seems that the standard method of fixing these that we're promoting in python3 is to use surrogateescape.  I'll provide a patch that does that.
msg184241 - (view) Author: Martin v. Löwis (loewis) * (Python committer) Date: 2013-03-15 17:20
What's the most likely encoding? UTF-8? I suggest we assume UTF-8, and use the surrogate-escape error handler to deal with the cases when it isn't.
msg184267 - (view) Author: Toshio Kuratomi (a.badger) * Date: 2013-03-15 22:10
I agree.  In my experience, utf-8 is the most common encoding.  Updated patch that defaults to utf-8 instead of the user's locale is attached.
msg184306 - (view) Author: STINNER Victor (haypo) * (Python committer) Date: 2013-03-16 11:15
00175-platform-unicode.patch looks good to me, but it lacks an unit test.
msg184319 - (view) Author: Toshio Kuratomi (a.badger) * Date: 2013-03-16 14:43
I'm at pycon.  I'll find someone during the sprints to teach me how the unittests are organized.
msg184634 - (view) Author: Zbyszek Jędrzejewski-Szmek (zbysz) * Date: 2013-03-19 14:26
At least for /etc/os-release, which is slated to replace /etc/fedora-release and other distribution specific files, the encoding in mandated to be UTF-8:

http://www.freedesktop.org/software/systemd/man/os-release.html
  All strings should be in UTF-8 format, and non-printable characters 
  should not be used.
msg184776 - (view) Author: Toshio Kuratomi (a.badger) * Date: 2013-03-20 19:02
Okay, new version of the patch with a unittest.

Re: os-release; I don't believe the current code can handle that file.  i\It changes format from a simple string (in most Linux distros) to key value pairs.  We'll probably need an update to the code to deal with that at some point in the future.
msg184783 - (view) Author: Toshio Kuratomi (a.badger) * Date: 2013-03-20 19:34
Added NEWS file.  Rebased against hg default.  Ready for review.
msg184918 - (view) Author: R. David Murray (r.david.murray) * (Python committer) Date: 2013-03-21 22:38
Review comments added.
msg185225 - (view) Author: Toshio Kuratomi (a.badger) * Date: 2013-03-25 19:03
Patch fixing the issues raised in r.david.murray's review:

* Merged _find_linux_release_file() back into linux_distribution() and broke out _UNIXCONFDIR module level variable to enable mocking of the unittest data
* Fix already present style issue in linux_distribution() code
* Separate test_dist_with_unicode() into test_dist_with_utf8 and test_dist_with_latin1
* Moved test data into the setUp() method and broke long lines to under 79 chars
* Switched from NamedTempfile() to TemporaryDirectory() => I think this fixes the Windows incompatibility but I can mark the tests as Skip on Windows if that isn't true (I don't have a Windows box to test on).
* Removed Misc/NEWS portion of patch

* I've switched from os.environ['LC_ALL'] = temp_locale to  setlocale(locale.LC_ALL, temp_locale).  Testing showed that the former did not provoke the bug while the latter does.  Could someone point me at documentation on the difference between these two?  I'd like to understand what the two different calls do differently so that I use them correctly in other unittests.
msg200260 - (view) Author: STINNER Victor (haypo) * (Python committer) Date: 2013-10-18 14:04
Ping myself, I just installed Fedora 19 and I cannot run the Python test suite with the ASCII locale encoding, because of this bug.
msg205616 - (view) Author: Roundup Robot (python-dev) Date: 2013-12-08 23:04
New changeset 4580976c07cb by Victor Stinner in branch '3.3':
Issue #17429: platform.linux_distribution() now decodes files from the UTF-8
http://hg.python.org/cpython/rev/4580976c07cb

New changeset 407f18c8ce8a by Victor Stinner in branch 'default':
(Merge 3.3) Issue #17429: platform.linux_distribution() now decodes files from
http://hg.python.org/cpython/rev/407f18c8ce8a
msg205618 - (view) Author: Roundup Robot (python-dev) Date: 2013-12-08 23:20
New changeset 831b2c80a9c9 by Victor Stinner in branch 'default':
Issue #17429: some PEP 8 compliance fixes for the platform modules, add whitespaces
http://hg.python.org/cpython/rev/831b2c80a9c9
msg205619 - (view) Author: STINNER Victor (haypo) * (Python committer) Date: 2013-12-08 23:25
Thanks Toshio Kuratomi for your patch. I simplified the unit test. I'm not sure that "resetlocale" restores the locale in its previous state. I don't want to rely on two specific locales ('pt_BR.UTF8' and 'pt_BR.ISO8859-1') for a such simple test. We have enough buildbots to test other various locales.

I prefered to only change minor syntax issues (PEP 8) in Python 3.4 to limit changes in the stable release.
msg205620 - (view) Author: Roundup Robot (python-dev) Date: 2013-12-08 23:26
New changeset a951ab03bda0 by Victor Stinner in branch '3.3':
Issue #17429: Oops, remove unused import
http://hg.python.org/cpython/rev/a951ab03bda0

New changeset 209bf9576dc8 by Victor Stinner in branch 'default':
(Merge 3.3) Issue #17429: Oops, remove unused import
http://hg.python.org/cpython/rev/209bf9576dc8
History
Date User Action Args
2013-12-08 23:28:41hayposetversions: - Python 3.2
2013-12-08 23:28:35hayposettitle: platform.platform() can throw Unicode error -> platform.linux_distribution() should decode files from UTF-8, not from the locale encoding
2013-12-08 23:26:45python-devsetmessages: + msg205620
2013-12-08 23:26:22hayposetstatus: open -> closed
resolution: fixed
2013-12-08 23:25:01hayposetmessages: + msg205619
2013-12-08 23:20:33python-devsetmessages: + msg205618
2013-12-08 23:04:40python-devsetnosy: + python-dev
messages: + msg205616
2013-10-18 14:04:52hayposetmessages: + msg200260
2013-03-25 19:03:43a.badgersetfiles: + 00175-platform-unicode.patch

messages: + msg185225
2013-03-21 22:38:24r.david.murraysetversions: + Python 3.4
nosy: + r.david.murray

messages: + msg184918

type: behavior
stage: patch review
2013-03-20 19:34:06a.badgersetfiles: + 00175-platform-unicode.patch

messages: + msg184783
2013-03-20 19:02:42a.badgersetfiles: + 00175-platform-unicode.patch

messages: + msg184776
2013-03-19 14:26:30zbyszsetnosy: + zbysz
messages: + msg184634
2013-03-16 14:43:06a.badgersetmessages: + msg184319
2013-03-16 11:15:12hayposetnosy: + haypo
messages: + msg184306
2013-03-15 22:10:45a.badgersetfiles: + 00175-platform-unicode.patch

messages: + msg184267
2013-03-15 17:20:26loewissetnosy: + loewis
messages: + msg184241
2013-03-15 15:11:25a.badgersetfiles: + 00175-platform-unicode.patch
keywords: + patch
2013-03-15 15:08:30a.badgercreate