This issue tracker has been migrated to GitHub, and is currently read-only.
For more information, see the GitHub FAQs in the Python Developer Guide.

classification
Title: Unnecessary non-ASCII characters in standard library
Type: behavior Stage: resolved
Components: Library (Lib) Versions: Python 3.6, Python 3.5, Python 2.7
process
Status: closed Resolution: fixed
Dependencies: Superseder:
Assigned To: Nosy List: Rosuav, benjamin.peterson, python-dev, serhiy.storchaka, steven.daprano, terry.reedy
Priority: normal Keywords: patch

Created on 2015-12-18 07:43 by Rosuav, last changed 2022-04-11 14:58 by admin. This issue is now closed.

Files
File name      Uploaded
asciify.diff   Rosuav, 2015-12-18 07:43
nonascii.py    Rosuav, 2015-12-18 09:05
nonascii.py    Rosuav, 2015-12-18 09:23
asciify.diff   Rosuav, 2015-12-18 10:01
nonascii.py    Rosuav, 2015-12-18 10:02
asciify.diff   Rosuav, 2015-12-18 10:31
Messages (18)
msg256647 - (view) Author: Chris Angelico (Rosuav) * Date: 2015-12-18 07:43
Discussion on python-list led to searching out unnecessary non-ASCII in the stdlib. While there are places where non-ASCII text is good and worthwhile (eg in comments identifying people such as Łukasz Langa, Peter Åstrand, Martin v. Löwis, and Gerhard Häring, or code specifically demonstrating or implementing non-ASCII behaviour), there are some instances which are unnecessary. Attached is a patch converting apostrophes, dashes, and one space.
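
As a rough illustration of the kind of substitution described above, here is a minimal sketch. The attached patch is a plain diff that edits the files directly; the `REPLACEMENTS` mapping and the `asciify` helper are assumptions made up for this example, not taken from asciify.diff.

```python
# Hypothetical mapping of typographic characters to ASCII equivalents.
# The real patch was prepared by hand; this only shows the idea.
REPLACEMENTS = {
    "\u2018": "'",  # LEFT SINGLE QUOTATION MARK
    "\u2019": "'",  # RIGHT SINGLE QUOTATION MARK
    "\u2013": "-",  # EN DASH
    "\u2014": "-",  # EM DASH
    "\u00a0": " ",  # NO-BREAK SPACE
}
_TABLE = str.maketrans(REPLACEMENTS)


def asciify(text):
    """Replace typographic punctuation, leaving all other characters alone."""
    return text.translate(_TABLE)


print(asciify("don\u2019t \u2013 ok"))
```

Characters outside the mapping, such as the letters in people's names, pass through unchanged, which matches the intent of keeping the worthwhile non-ASCII text.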
msg256655 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2015-12-18 08:32
Non-ASCII apostrophes and dashes in docstrings can be considered a bug, since they can cause pydoc or help() to fail.
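
One way to see the underlying problem: if the output stream can only encode ASCII (as can happen under a C locale), writing such a docstring raises UnicodeEncodeError. The sketch below uses a hypothetical `example` function and only demonstrates the encoding error itself; how pydoc or help() actually behaves depends on the Python version and the locale.

```python
def example():
    """Return the caller\u2019s value."""  # docstring contains U+2019


doc = example.__doc__
try:
    # Roughly what an ASCII-only output stream would have to do.
    doc.encode("ascii")
    encodable = True
except UnicodeEncodeError:
    encodable = False

print(encodable)
```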

The patch LGTM.

Could you please provide the script you used to search for non-ASCII sources, Chris?

Are there doubtful non-ASCII inclusions in tests, in C files, or in the documentation (including Misc/NEWS, Misc/HISTORY, etc)?
msg256657 - (view) Author: Chris Angelico (Rosuav) * Date: 2015-12-18 09:05
Whoops! Meant to include that as a second attachment. Now attached.

It's a quickly-thrown-together thing and not fully PEP 8 compliant.
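
For readers without access to the attachment, a search of this kind might look roughly like the sketch below. This is not the attached nonascii.py; the `find_non_ascii` name and the choice to report whole lines are assumptions for illustration.

```python
import os


def find_non_ascii(root):
    """Yield (path, line number, raw line) for lines of *.py files
    under root that contain any byte outside the ASCII range."""
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            if not name.endswith(".py"):
                continue
            path = os.path.join(dirpath, name)
            # Read bytes so that no assumption is made about the encoding.
            with open(path, "rb") as f:
                for lineno, raw in enumerate(f, 1):
                    if any(b > 0x7F for b in raw):
                        yield path, lineno, raw


if __name__ == "__main__":
    # "Lib" assumes the script is run from the cpython top-level directory.
    for path, lineno, raw in find_non_ascii("Lib"):
        print("%s:%d: %r" % (path, lineno, raw.rstrip(b"\n")))
```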
msg256659 - (view) Author: Chris Angelico (Rosuav) * Date: 2015-12-18 09:23
As an alternative to checking only *.py, the second version uses the 'file' command to recognize text files. Run from the cpython top-level directory (rather than Lib/), it finds a large number of additional results, many of which appear to have a UTF-8 BOM. Are they also worth removing?
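
A file that starts with a UTF-8 BOM can be recognized by its first three bytes. A minimal sketch, where the `has_utf8_bom` helper name is an assumption for this example:

```python
import codecs


def has_utf8_bom(path):
    """Return True if the file at path begins with the UTF-8 byte order mark."""
    with open(path, "rb") as f:
        return f.read(len(codecs.BOM_UTF8)) == codecs.BOM_UTF8
```

When reading such a file as text, the "utf-8-sig" codec skips the BOM if one is present, so the BOM does not show up as a character in the decoded string.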
msg256660 - (view) Author: Chris Angelico (Rosuav) * Date: 2015-12-18 09:37
There are non-ASCII dashes and apostrophes in .rst files; are they worth cleaning up?
msg256662 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2015-12-18 09:40
I think yes, if it is not a Windows-specific file. See also 4796dec0a7d0, 7255af1a1c50, a8568ea83599, 652baf23c368.
msg256663 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2015-12-18 09:44
In .rst files they are good.

Add | (LC_ALL=C egrep "$(printf '[\x80-\xff]+')";) to the pipe after your script to highlight non-ASCII characters.
msg256664 - (view) Author: Chris Angelico (Rosuav) * Date: 2015-12-18 10:01
Misc/NEWS has a UTF-8 BOM. Otherwise, it and Misc/HISTORY look fine (all names and other legit cases). Lib/idlelib/CREDITS.txt and Lib/idlelib/README.txt both have non-UTF8 text in them. I don't understand what's with the first line of .bzrignore, so I'm (pun fully intended) ignoring it. Modules/_ctypes/libffi/ChangeLog has several non-ASCII markers, but it looks to be a historical record, so I'm not sure it should be tampered with. Expanded patch attached.

That egrep needs --color to do its highlighting, if it's not the default.
msg256665 - (view) Author: Chris Angelico (Rosuav) * Date: 2015-12-18 10:02
Another version of detection script attached.
msg256666 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2015-12-18 10:27
Everything in Modules/_ctypes/libffi/ is third-party code.

The ACUTE ACCENT character at the start of .bzrignore was added in b635462a5798 by Benjamin. I think this is just a typo.

The patch LGTM (if you address Victor's comment).
msg256667 - (view) Author: Chris Angelico (Rosuav) * Date: 2015-12-18 10:31
Oops, didn't see Victor's comment. (How do I get notified when someone posts a patch review?) New patch uploaded which does this.

Note that Steven D'Aprano has expressed the opposite desire - that non-ASCII text be kept, as it should be acceptable and its presence makes for good toolchain testing. Noseying him in for his input.
msg256668 - (view) Author: Roundup Robot (python-dev) (Python triager) Date: 2015-12-18 11:13
New changeset c87b2f61650f by Serhiy Storchaka in branch '3.5':
Issue #25899: Converted non-ASCII characters in docstrings and manpage
https://hg.python.org/cpython/rev/c87b2f61650f

New changeset 1eeb25f08cfd by Serhiy Storchaka in branch 'default':
Issue #25899: Converted non-ASCII characters in docstrings and manpage
https://hg.python.org/cpython/rev/1eeb25f08cfd

New changeset 7b176dafb56b by Serhiy Storchaka in branch '3.5':
Issue #25899: Fixed typo in .bzrignore.
https://hg.python.org/cpython/rev/7b176dafb56b

New changeset 8873f34e2186 by Serhiy Storchaka in branch '2.7':
Issue #25899: Fixed typo in .bzrignore.
https://hg.python.org/cpython/rev/8873f34e2186

New changeset e1d5f645b476 by Serhiy Storchaka in branch 'default':
Issue #25899: Fixed typo in .bzrignore.
https://hg.python.org/cpython/rev/e1d5f645b476
msg256669 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2015-12-18 11:21
Lib/idlelib/CREDITS.txt and Lib/idlelib/README.txt are read by IDLE. The first is read with the iso-8859-1 encoding, the second with the locale encoding. Both are wrong if the files are in UTF-8.
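
To illustrate why reading UTF-8 data as iso-8859-1 is wrong, here is a small sketch. The sample name is only an example string, not a quotation from either file.

```python
text = "L\u00f6wis"                    # "Löwis", with U+00F6
raw = text.encode("utf-8")             # two bytes for the non-ASCII letter
as_latin1 = raw.decode("iso-8859-1")   # each byte becomes its own character

print(ascii(text))
print(ascii(as_latin1))
```

The decode never fails, because every byte is valid in iso-8859-1, but the single non-ASCII letter comes back as two unrelated characters.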
msg256670 - (view) Author: Roundup Robot (python-dev) (Python triager) Date: 2015-12-18 11:24
New changeset 505593490f4c by Serhiy Storchaka in branch 'default':
Issue #25899: Converted Objects/listsort.txt to UTF-8.
https://hg.python.org/cpython/rev/505593490f4c
msg256671 - (view) Author: Chris Angelico (Rosuav) * Date: 2015-12-18 11:25
Ah, got it. That definitely settles Idle's CREDITS.txt. Are there any locale encodings where \x92 isn't an apostrophe?
msg256675 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2015-12-18 12:15
> Are there any locale encodings where \x92 isn't an apostrophe?

Latin1 and all ISO-8859-*. CP437 and perhaps all DOS codepages. KOI8 family. And of course UTF-8.
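
This can be checked directly by decoding the single byte with a few codecs. A minimal sketch; the list of encodings is a small sample chosen for illustration, and `results` is just a helper dict for this example.

```python
byte = b"\x92"
results = {}

for encoding in ("cp1252", "latin-1", "cp437", "koi8-r", "utf-8"):
    try:
        results[encoding] = byte.decode(encoding)
    except UnicodeDecodeError:
        # A lone 0x92 is not a valid UTF-8 sequence.
        results[encoding] = None

for encoding, decoded in results.items():
    print(encoding, ascii(decoded))
```

Only the Windows codepage in this sample yields U+2019 (RIGHT SINGLE QUOTATION MARK); the others give a different character or fail to decode.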
msg256677 - (view) Author: Chris Angelico (Rosuav) * Date: 2015-12-18 12:18
So Lib/idlelib/README.txt would decode wrongly in anything other than a Windows codepage? Seems a good reason to asciify line 3.
msg256680 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2015-12-18 12:31
Opened issue25905 for IDLE-related files.
History
Date User Action Args
2022-04-11 14:58:25 admin set github: 70087
2015-12-18 12:31:39 serhiy.storchaka set status: open -> closed; resolution: fixed; messages: + msg256680; stage: resolved
2015-12-18 12:18:38 Rosuav set messages: + msg256677
2015-12-18 12:15:21 serhiy.storchaka set messages: + msg256675
2015-12-18 11:25:12 Rosuav set messages: + msg256671
2015-12-18 11:24:46 serhiy.storchaka set nosy: + terry.reedy
2015-12-18 11:24:27 python-dev set messages: + msg256670
2015-12-18 11:21:11 serhiy.storchaka set messages: + msg256669
2015-12-18 11:13:59 python-dev set nosy: + python-dev; messages: + msg256668
2015-12-18 10:31:25 Rosuav set files: + asciify.diff; nosy: + steven.daprano; messages: + msg256667
2015-12-18 10:27:17 serhiy.storchaka set nosy: + benjamin.peterson; messages: + msg256666
2015-12-18 10:02:32 Rosuav set files: + nonascii.py; messages: + msg256665
2015-12-18 10:01:20 Rosuav set files: + asciify.diff; messages: + msg256664
2015-12-18 09:44:05 serhiy.storchaka set messages: + msg256663
2015-12-18 09:40:08 serhiy.storchaka set messages: + msg256662
2015-12-18 09:37:04 Rosuav set messages: + msg256660
2015-12-18 09:23:16 Rosuav set files: + nonascii.py; messages: + msg256659
2015-12-18 09:05:51 Rosuav set files: + nonascii.py; messages: + msg256657
2015-12-18 08:32:05 serhiy.storchaka set versions: + Python 2.7, Python 3.5; nosy: + serhiy.storchaka; messages: + msg256655; type: behavior
2015-12-18 07:43:12 Rosuav create