Issue 25899: Unnecessary non-ASCII characters in standard library

This issue has been migrated to GitHub: https://github.com/python/cpython/issues/70087

classification

Title:	Unnecessary non-ASCII characters in standard library
Type:	behavior	Stage:	resolved
Components:	Library (Lib)	Versions:	Python 3.6, Python 3.5, Python 2.7

process

Status:	closed	Resolution:	fixed
Dependencies:		Superseder:
Assigned To:		Nosy List:	Rosuav, benjamin.peterson, python-dev, serhiy.storchaka, steven.daprano, terry.reedy
Priority:	normal	Keywords:	patch

Created on 2015-12-18 07:43 by Rosuav, last changed 2022-04-11 14:58 by admin. This issue is now closed.

Files
File name	Uploaded	Description	Edit
asciify.diff	Rosuav, 2015-12-18 07:43		review
nonascii.py	Rosuav, 2015-12-18 09:05
nonascii.py	Rosuav, 2015-12-18 09:23
asciify.diff	Rosuav, 2015-12-18 10:01		review
nonascii.py	Rosuav, 2015-12-18 10:02
asciify.diff	Rosuav, 2015-12-18 10:31		review

Messages (18)
msg256647 - (view)	Author: Chris Angelico (Rosuav) *	Date: 2015-12-18 07:43
Discussion on python-list led to searching out unnecessary non-ASCII in the stdlib. While there are places where non-ASCII text is good and worthwhile (eg in comments identifying people such as Łukasz Langa, Peter Åstrand, Martin v. Löwis, and Gerhard Häring, or code specifically demonstrating or implementing non-ASCII behaviour), there are some instances which are unnecessary. Attached is a patch converting apostrophes, dashes, and one space.
msg256655 - (view)	Author: Serhiy Storchaka (serhiy.storchaka) *	Date: 2015-12-18 08:32
Non-ASCII apostrophes and dashes in docstrings can be considered a bug, since they can cause pydoc or help() to fail. The patch LGTM. Chris, could you please provide the script you used to search for non-ASCII sources? Are there doubtful non-ASCII inclusions in tests, in C files, or in the documentation (including Misc/NEWS, Misc/HISTORY, etc.)?
msg256657 - (view)	Author: Chris Angelico (Rosuav) *	Date: 2015-12-18 09:05
Whoops! Meant to include that as a second attachment. Now attached. It's a quickly-thrown-together thing and not fully PEP 8 compliant.
msg256659 - (view)	Author: Chris Angelico (Rosuav) *	Date: 2015-12-18 09:23
As an alternative to checking only *.py, the second version uses the 'file' command to recognize text files. Run from the cpython top-level directory (rather than Lib/), it finds a large number of additional results, many of which appear to have a UTF-8 BOM. Are they also worth removing?
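The attached nonascii.py is not reproduced on this page. As a rough illustration of the *.py-only approach described above, a scan of this kind could look like the following sketch. The function name `find_non_ascii` and its `suffixes` parameter are hypothetical and are not taken from the attachment.

```python
import os


def find_non_ascii(root, suffixes=(".py",)):
    """Yield (path, line_number, raw_line) for each line with a byte above 0x7F.

    Hypothetical helper: files are read as bytes, so no encoding is assumed.
    """
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in sorted(filenames):
            if not name.endswith(suffixes):
                continue
            path = os.path.join(dirpath, name)
            with open(path, "rb") as f:
                for lineno, raw in enumerate(f, 1):
                    # Iterating over bytes yields ints in Python 3.
                    if any(byte > 0x7F for byte in raw):
                        yield path, lineno, raw


# Example use, run from the directory to be scanned:
#     for path, lineno, raw in find_non_ascii("Lib"):
#         print("%s:%d: %r" % (path, lineno, raw))
```

Reading in binary mode avoids decode errors on files that are not valid UTF-8, which matters here because some of the files discussed later in the thread turn out not to be.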
msg256660 - (view)	Author: Chris Angelico (Rosuav) *	Date: 2015-12-18 09:37
There are non-ASCII dashes and apostrophes in .rst files; are they worth cleaning up?
msg256662 - (view)	Author: Serhiy Storchaka (serhiy.storchaka) *	Date: 2015-12-18 09:40
I think yes, if it is not a Windows-specific file. See also 4796dec0a7d0, 7255af1a1c50, a8568ea83599, 652baf23c368.
msg256663 - (view)	Author: Serhiy Storchaka (serhiy.storchaka) *	Date: 2015-12-18 09:44
In .rst files they are good. Add | (LC_ALL=C egrep "$(printf '[\x80-\xff]+')";) to the pipe after your script to highlight non-ASCII characters.
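For readers without egrep at hand, the same byte range can be marked from Python. This is only a sketch of the idea, not part of the thread's tooling; `mark_non_ascii` is a hypothetical name, and it shows each non-ASCII run as hex inside angle brackets rather than in colour.

```python
import re

# Same byte class as the egrep pattern: one or more bytes in 0x80-0xFF.
NON_ASCII_RUN = re.compile(rb"[\x80-\xff]+")


def mark_non_ascii(raw):
    """Return the line as ASCII text with each non-ASCII run shown as <hex>."""
    marked = NON_ASCII_RUN.sub(
        lambda m: b"<" + m.group().hex().encode("ascii") + b">", raw
    )
    return marked.decode("ascii")


print(mark_non_ascii(b"don\xe2\x80\x99t"))
```

The three bytes e2 80 99 are the UTF-8 encoding of U+2019 RIGHT SINGLE QUOTATION MARK, the kind of "smart" apostrophe the patch replaces.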
msg256664 - (view)	Author: Chris Angelico (Rosuav) *	Date: 2015-12-18 10:01
Misc/NEWS has a UTF-8 BOM. Otherwise, it and Misc/HISTORY look fine (all names and other legit cases). Lib/idlelib/CREDITS.txt and Lib/idlelib/README.txt both have non-UTF8 text in them. I don't understand what's with the first line of .bzrignore, so I'm (pun fully intended) ignoring it. Modules/_ctypes/libffi/ChangeLog has several non-ASCII markers, but it looks to be a historical record, so I'm not sure it should be tampered with. Expanded patch attached. That egrep needs --color to do its highlighting, if it's not the default.
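A UTF-8 BOM such as the one reported for Misc/NEWS is the three bytes EF BB BF at the start of a file. A minimal sketch of detecting and dropping it, using hypothetical helper names, might be:

```python
import codecs


def has_utf8_bom(data):
    """Return True if the byte string starts with the UTF-8 BOM (EF BB BF)."""
    return data.startswith(codecs.BOM_UTF8)


def strip_utf8_bom(data):
    """Return the byte string without a leading UTF-8 BOM, if one is present."""
    if has_utf8_bom(data):
        return data[len(codecs.BOM_UTF8):]
    return data


sample = codecs.BOM_UTF8 + b"Python News\n"
print(has_utf8_bom(sample), strip_utf8_bom(sample))
```

When decoding rather than editing, the standard "utf-8-sig" codec skips a leading BOM if one is present.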
msg256665 - (view)	Author: Chris Angelico (Rosuav) *	Date: 2015-12-18 10:02
Another version of detection script attached.
msg256666 - (view)	Author: Serhiy Storchaka (serhiy.storchaka) *	Date: 2015-12-18 10:27
Everything in Modules/_ctypes/libffi/ is third-party code. The ACUTE ACCENT character at the start of .bzrignore was added in b635462a5798 by Benjamin. I think this is just a typo. The patch LGTM (if Victor's comment is addressed).
msg256667 - (view)	Author: Chris Angelico (Rosuav) *	Date: 2015-12-18 10:31
Oops, didn't see Victor's comment. (How do I get notified when someone posts a patch review?) New patch uploaded which does this. Note that Steven D'Aprano has expressed the opposite desire - that non-ASCII text be kept, as it should be acceptable and its presence makes for good toolchain testing. Noseying him in for his input.
msg256668 - (view)	Author: Roundup Robot (python-dev)	Date: 2015-12-18 11:13
New changeset c87b2f61650f by Serhiy Storchaka in branch '3.5':
Issue #25899: Converted non-ASCII characters in docstrings and manpage
https://hg.python.org/cpython/rev/c87b2f61650f

New changeset 1eeb25f08cfd by Serhiy Storchaka in branch 'default':
Issue #25899: Converted non-ASCII characters in docstrings and manpage
https://hg.python.org/cpython/rev/1eeb25f08cfd

New changeset 7b176dafb56b by Serhiy Storchaka in branch '3.5':
Issue #25899: Fixed typo in .bzrignore.
https://hg.python.org/cpython/rev/7b176dafb56b

New changeset 8873f34e2186 by Serhiy Storchaka in branch '2.7':
Issue #25899: Fixed typo in .bzrignore.
https://hg.python.org/cpython/rev/8873f34e2186

New changeset e1d5f645b476 by Serhiy Storchaka in branch 'default':
Issue #25899: Fixed typo in .bzrignore.
https://hg.python.org/cpython/rev/e1d5f645b476
msg256669 - (view)	Author: Serhiy Storchaka (serhiy.storchaka) *	Date: 2015-12-18 11:21
Lib/idlelib/CREDITS.txt and Lib/idlelib/README.txt are read by IDLE. The first is read with the iso-8859-1 encoding, the second with the locale encoding. Both are wrong if the files are in UTF-8.
msg256670 - (view)	Author: Roundup Robot (python-dev)	Date: 2015-12-18 11:24
New changeset 505593490f4c by Serhiy Storchaka in branch 'default':
Issue #25899: Converted Objects/listsort.txt to UTF-8.
https://hg.python.org/cpython/rev/505593490f4c
msg256671 - (view)	Author: Chris Angelico (Rosuav) *	Date: 2015-12-18 11:25
Ah, got it. That definitely settles Idle's CREDITS.txt. Are there any locale encodings where \x92 isn't an apostrophe?
msg256675 - (view)	Author: Serhiy Storchaka (serhiy.storchaka) *	Date: 2015-12-18 12:15
> Are there any locale encodings where \x92 isn't an apostrophe?

Latin1 and all ISO-8859-*. CP437 and perhaps all DOS codepages. KOI8 family. And of course UTF-8.
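The point can be checked directly by decoding the single byte \x92 with a few of Python's standard codecs. This is an illustrative sketch, and `decode_or_none` is a hypothetical helper:

```python
def decode_or_none(raw, encoding):
    """Decode raw bytes with the given codec, or return None if they are invalid."""
    try:
        return raw.decode(encoding)
    except UnicodeDecodeError:
        return None


# cp1252 is a Windows codepage; the others are the families named above.
for enc in ("cp1252", "latin-1", "cp437", "koi8-r", "utf-8"):
    print(enc, ascii(decode_or_none(b"\x92", enc)))
```

Only cp1252 yields U+2019 (a curly apostrophe). Latin-1 gives the C1 control character U+0092, cp437 gives U+00C6, and a lone \x92 is not valid UTF-8 at all.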
msg256677 - (view)	Author: Chris Angelico (Rosuav) *	Date: 2015-12-18 12:18
So Lib/idlelib/README.txt would decode wrongly in anything other than a Windows codepage? Seems a good reason to asciify line 3.
msg256680 - (view)	Author: Serhiy Storchaka (serhiy.storchaka) *	Date: 2015-12-18 12:31
Opened issue25905 for IDLE-related files.

History
Date	User	Action	Args
2022-04-11 14:58:25	admin	set	github: 70087
2015-12-18 12:31:39	serhiy.storchaka	set	status: open -> closed resolution: fixed messages: + msg256680 stage: resolved
2015-12-18 12:18:38	Rosuav	set	messages: + msg256677
2015-12-18 12:15:21	serhiy.storchaka	set	messages: + msg256675
2015-12-18 11:25:12	Rosuav	set	messages: + msg256671
2015-12-18 11:24:46	serhiy.storchaka	set	nosy: + terry.reedy
2015-12-18 11:24:27	python-dev	set	messages: + msg256670
2015-12-18 11:21:11	serhiy.storchaka	set	messages: + msg256669
2015-12-18 11:13:59	python-dev	set	nosy: + python-dev messages: + msg256668
2015-12-18 10:31:25	Rosuav	set	files: + asciify.diff nosy: + steven.daprano messages: + msg256667
2015-12-18 10:27:17	serhiy.storchaka	set	nosy: + benjamin.peterson messages: + msg256666
2015-12-18 10:02:32	Rosuav	set	files: + nonascii.py messages: + msg256665
2015-12-18 10:01:20	Rosuav	set	files: + asciify.diff messages: + msg256664
2015-12-18 09:44:05	serhiy.storchaka	set	messages: + msg256663
2015-12-18 09:40:08	serhiy.storchaka	set	messages: + msg256662
2015-12-18 09:37:04	Rosuav	set	messages: + msg256660
2015-12-18 09:23:16	Rosuav	set	files: + nonascii.py messages: + msg256659
2015-12-18 09:05:51	Rosuav	set	files: + nonascii.py messages: + msg256657
2015-12-18 08:32:05	serhiy.storchaka	set	versions: + Python 2.7, Python 3.5 nosy: + serhiy.storchaka messages: + msg256655 type: behavior
2015-12-18 07:43:12	Rosuav	create