msg256647 - (view) |
Author: Chris Angelico (Rosuav) * |
Date: 2015-12-18 07:43 |
Discussion on python-list led to searching out unnecessary non-ASCII in the stdlib. While there are places where non-ASCII text is good and worthwhile (e.g. in comments identifying people such as Łukasz Langa, Peter Åstrand, Martin v. Löwis, and Gerhard Häring, or code specifically demonstrating or implementing non-ASCII behaviour), there are some instances which are unnecessary. Attached is a patch converting apostrophes, dashes, and one space.
|
msg256655 - (view) |
Author: Serhiy Storchaka (serhiy.storchaka) * |
Date: 2015-12-18 08:32 |
Non-ASCII apostrophes and dashes in docstrings can be considered a bug, since they can cause pydoc or help() to fail.
The patch LGTM.
Could you please provide the script you used to search for non-ASCII sources, Chris?
Are there doubtful non-ASCII inclusions in tests, in C files, or in the documentation (including Misc/NEWS, Misc/HISTORY, etc)?
|
msg256657 - (view) |
Author: Chris Angelico (Rosuav) * |
Date: 2015-12-18 09:05 |
Whoops! Meant to include that as a second attachment. Now attached.
It's a quickly-thrown-together thing and not fully PEP 8 compliant.
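
The attached script is not reproduced in this thread. As a rough illustration of the kind of scan being discussed, a minimal sketch might look like the following. The function name `find_non_ascii` and its `suffix` parameter are made up for this example and are not taken from the attachment.

```python
import os

def find_non_ascii(root, suffix=".py"):
    """Yield (path, line_number, raw_line) for every line under `root`
    that contains at least one byte >= 0x80.

    Hypothetical helper for illustration only; not the attached nonascii.py.
    """
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in sorted(filenames):
            if not name.endswith(suffix):
                continue
            path = os.path.join(dirpath, name)
            # Read as bytes so the scan does not depend on any encoding guess.
            with open(path, "rb") as f:
                for lineno, line in enumerate(f, 1):
                    if any(b >= 0x80 for b in line):
                        yield path, lineno, line.rstrip(b"\r\n")

# Possible usage, run from a checkout:
#     for path, lineno, line in find_non_ascii("Lib"):
#         print(f"{path}:{lineno}: {line!r}")
```

Reading in binary mode keeps the check purely byte-based, so files in unknown or mixed encodings are still reported rather than raising a decode error.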
|
msg256659 - (view) |
Author: Chris Angelico (Rosuav) * |
Date: 2015-12-18 09:23 |
As an alternative to checking only *.py, the second version uses the 'file' command to recognize text files. Run from the cpython top-level directory (rather than Lib/), it finds a large number of additional results, many of which appear to have a UTF-8 BOM. Are they also worth removing?
|
msg256660 - (view) |
Author: Chris Angelico (Rosuav) * |
Date: 2015-12-18 09:37 |
There are non-ASCII dashes and apostrophes in .rst files; are they worth cleaning up?
|
msg256662 - (view) |
Author: Serhiy Storchaka (serhiy.storchaka) * |
Date: 2015-12-18 09:40 |
I think yes, if it is not a Windows-specific file. See also 4796dec0a7d0, 7255af1a1c50, a8568ea83599, 652baf23c368.
|
msg256663 - (view) |
Author: Serhiy Storchaka (serhiy.storchaka) * |
Date: 2015-12-18 09:44 |
In .rst files they are good.
Add | (LC_ALL=C egrep "$(printf '[\x80-\xff]+')";) in the pipe after your script to highlight non-ASCII characters.
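
For anyone without egrep at hand, roughly the same highlighting can be sketched in Python. This is an assumption-laden equivalent, not part of the script under discussion: the name `highlight` and the choice of bold red ANSI escapes are illustrative only.

```python
import re

# Same byte range as the egrep pattern: one or more bytes in 0x80-0xff.
NON_ASCII = re.compile(rb"[\x80-\xff]+")

# ANSI escape sequences for bold red and for reset (an arbitrary choice).
START = b"\x1b[1;31m"
END = b"\x1b[0m"

def highlight(line: bytes) -> bytes:
    """Wrap every run of non-ASCII bytes in `line` with ANSI colour codes."""
    return NON_ASCII.sub(lambda m: START + m.group(0) + END, line)
```

Working on bytes rather than str mirrors the effect of LC_ALL=C, where the pattern matches individual bytes instead of decoded characters.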
|
msg256664 - (view) |
Author: Chris Angelico (Rosuav) * |
Date: 2015-12-18 10:01 |
Misc/NEWS has a UTF-8 BOM. Otherwise, it and Misc/HISTORY look fine (all names and other legit cases). Lib/idlelib/CREDITS.txt and Lib/idlelib/README.txt both have non-UTF8 text in them. I don't understand what's with the first line of .bzrignore, so I'm (pun fully intended) ignoring it. Modules/_ctypes/libffi/ChangeLog has several non-ASCII markers, but it looks to be a historical record, so I'm not sure it should be tampered with. Expanded patch attached.
That egrep needs --color to do its highlighting, if it's not the default.
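
A UTF-8 BOM such as the one reported for Misc/NEWS is the three-byte sequence EF BB BF at the very start of the file. A minimal sketch of checking for it and stripping it, with hypothetical helper names `has_utf8_bom` and `strip_utf8_bom`:

```python
import codecs

def has_utf8_bom(path):
    """Return True if the file at `path` starts with the UTF-8 BOM."""
    with open(path, "rb") as f:
        return f.read(len(codecs.BOM_UTF8)) == codecs.BOM_UTF8

def strip_utf8_bom(data: bytes) -> bytes:
    """Return `data` without a leading UTF-8 BOM, if one is present."""
    if data.startswith(codecs.BOM_UTF8):
        return data[len(codecs.BOM_UTF8):]
    return data
```

When decoding rather than rewriting a file, the "utf-8-sig" codec drops a leading BOM automatically.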
|
msg256665 - (view) |
Author: Chris Angelico (Rosuav) * |
Date: 2015-12-18 10:02 |
Another version of detection script attached.
|
msg256666 - (view) |
Author: Serhiy Storchaka (serhiy.storchaka) * |
Date: 2015-12-18 10:27 |
All in Modules/_ctypes/libffi/ is third-party code.
The ACUTE ACCENT character at the start of .bzrignore was added in b635462a5798 by Benjamin. I think this is just a typo.
The patch LGTM (if it addresses Victor's comment).
|
msg256667 - (view) |
Author: Chris Angelico (Rosuav) * |
Date: 2015-12-18 10:31 |
Oops, didn't see Victor's comment. (How do I get notified when someone posts a patch review?) New patch uploaded which does this.
Note that Steven D'Aprano has expressed the opposite desire - that non-ASCII text be kept, as it should be acceptable and its presence makes for good toolchain testing. Noseying him in for his input.
|
msg256668 - (view) |
Author: Roundup Robot (python-dev) |
Date: 2015-12-18 11:13 |
New changeset c87b2f61650f by Serhiy Storchaka in branch '3.5':
Issue #25899: Converted non-ASCII characters in docstrings and manpage
https://hg.python.org/cpython/rev/c87b2f61650f
New changeset 1eeb25f08cfd by Serhiy Storchaka in branch 'default':
Issue #25899: Converted non-ASCII characters in docstrings and manpage
https://hg.python.org/cpython/rev/1eeb25f08cfd
New changeset 7b176dafb56b by Serhiy Storchaka in branch '3.5':
Issue #25899: Fixed typo in .bzrignore.
https://hg.python.org/cpython/rev/7b176dafb56b
New changeset 8873f34e2186 by Serhiy Storchaka in branch '2.7':
Issue #25899: Fixed typo in .bzrignore.
https://hg.python.org/cpython/rev/8873f34e2186
New changeset e1d5f645b476 by Serhiy Storchaka in branch 'default':
Issue #25899: Fixed typo in .bzrignore.
https://hg.python.org/cpython/rev/e1d5f645b476
|
msg256669 - (view) |
Author: Serhiy Storchaka (serhiy.storchaka) * |
Date: 2015-12-18 11:21 |
Lib/idlelib/CREDITS.txt and Lib/idlelib/README.txt are read by IDLE. The first is read with the iso-8859-1 encoding, the second with the locale encoding. Both are wrong if the files are in UTF-8.
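
To see why reading UTF-8 data as iso-8859-1 is wrong, here is a small sketch using a made-up sample string (not the actual file contents). Every byte is a valid iso-8859-1 character, so the decode never raises; it just silently yields the wrong text.

```python
# Hypothetical sample containing U+2019 RIGHT SINGLE QUOTATION MARK.
text = "don\u2019t"

# In UTF-8 the apostrophe becomes the three bytes E2 80 99.
raw = text.encode("utf-8")

# Decoding those bytes as iso-8859-1 maps each byte to one character,
# so the single apostrophe turns into three unrelated characters.
wrong = raw.decode("iso-8859-1")

# Decoding with the encoding the data was written in restores the text.
right = raw.decode("utf-8")
```

The same mismatch applies to the locale encoding whenever the locale is not UTF-8.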
|
msg256670 - (view) |
Author: Roundup Robot (python-dev) |
Date: 2015-12-18 11:24 |
New changeset 505593490f4c by Serhiy Storchaka in branch 'default':
Issue #25899: Converted Objects/listsort.txt to UTF-8.
https://hg.python.org/cpython/rev/505593490f4c
|
msg256671 - (view) |
Author: Chris Angelico (Rosuav) * |
Date: 2015-12-18 11:25 |
Ah, got it. That definitely settles Idle's CREDITS.txt. Are there any locale encodings where \x92 isn't an apostrophe?
|
msg256675 - (view) |
Author: Serhiy Storchaka (serhiy.storchaka) * |
Date: 2015-12-18 12:15 |
> Are there any locale encodings where \x92 isn't an apostrophe?
Latin1 and all ISO-8859-*. CP437 and perhaps all DOS codepages. KOI8 family. And of course UTF-8.
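
This can be checked by decoding the single byte with a few codecs. The sketch below uses a hypothetical helper, `decode_byte`, that returns None when the byte is not valid in the given encoding; the list of codecs is only a sample.

```python
def decode_byte(byte: bytes, encoding: str):
    """Decode `byte` with `encoding`; return None if it cannot be decoded."""
    try:
        return byte.decode(encoding)
    except UnicodeDecodeError:
        return None

# A sample of codecs: one Windows codepage and several others.
ENCODINGS = ("cp1252", "latin-1", "cp437", "koi8-r", "utf-8")

results = {enc: decode_byte(b"\x92", enc) for enc in ENCODINGS}

# Only the Windows codepage gives U+2019 (right single quotation mark).
apostrophe_encodings = [enc for enc, ch in results.items() if ch == "\u2019"]
```

In Latin-1 the byte is a C1 control character, in CP437 it is a letter, and as a lone byte in UTF-8 it is not decodable at all.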
|
msg256677 - (view) |
Author: Chris Angelico (Rosuav) * |
Date: 2015-12-18 12:18 |
So Lib/idlelib/README.txt would decode wrongly in anything other than a Windows codepage? Seems a good reason to asciify line 3.
|
msg256680 - (view) |
Author: Serhiy Storchaka (serhiy.storchaka) * |
Date: 2015-12-18 12:31 |
Opened issue25905 for IDLE-related files.
|
|
Date | User | Action | Args |
2022-04-11 14:58:25 | admin | set | github: 70087 |
2015-12-18 12:31:39 | serhiy.storchaka | set | status: open -> closed; resolution: fixed; messages: + msg256680; stage: resolved |
2015-12-18 12:18:38 | Rosuav | set | messages: + msg256677 |
2015-12-18 12:15:21 | serhiy.storchaka | set | messages: + msg256675 |
2015-12-18 11:25:12 | Rosuav | set | messages: + msg256671 |
2015-12-18 11:24:46 | serhiy.storchaka | set | nosy: + terry.reedy |
2015-12-18 11:24:27 | python-dev | set | messages: + msg256670 |
2015-12-18 11:21:11 | serhiy.storchaka | set | messages: + msg256669 |
2015-12-18 11:13:59 | python-dev | set | nosy: + python-dev; messages: + msg256668 |
2015-12-18 10:31:25 | Rosuav | set | files: + asciify.diff; nosy: + steven.daprano; messages: + msg256667 |
2015-12-18 10:27:17 | serhiy.storchaka | set | nosy: + benjamin.peterson; messages: + msg256666 |
2015-12-18 10:02:32 | Rosuav | set | files: + nonascii.py; messages: + msg256665 |
2015-12-18 10:01:20 | Rosuav | set | files: + asciify.diff; messages: + msg256664 |
2015-12-18 09:44:05 | serhiy.storchaka | set | messages: + msg256663 |
2015-12-18 09:40:08 | serhiy.storchaka | set | messages: + msg256662 |
2015-12-18 09:37:04 | Rosuav | set | messages: + msg256660 |
2015-12-18 09:23:16 | Rosuav | set | files: + nonascii.py; messages: + msg256659 |
2015-12-18 09:05:51 | Rosuav | set | files: + nonascii.py; messages: + msg256657 |
2015-12-18 08:32:05 | serhiy.storchaka | set | versions: + Python 2.7, Python 3.5; nosy: + serhiy.storchaka; messages: + msg256655; type: behavior |
2015-12-18 07:43:12 | Rosuav | create | |