classification
Title: Add support.NONASCII to test non-ASCII characters
Type:
Stage:
Components: Tests
Versions: Python 3.2, Python 3.3, Python 3.4
process
Status: closed
Resolution: fixed
Dependencies:
Superseder:
Assigned To:
Nosy List: chris.jerdonek, python-dev, serhiy.storchaka, vstinner
Priority: normal
Keywords: patch

Created on 2012-11-05 12:12 by vstinner, last changed 2013-01-03 01:07 by vstinner. This issue is now closed.

Files
File name Uploaded Description Edit
support_non_ascii-2.patch vstinner, 2012-11-05 23:10 review
brute.py vstinner, 2012-11-05 23:12
brute2.py vstinner, 2012-11-06 22:41
Messages (19)
msg174897 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2012-11-05 12:12
Attached patch adds support.NONASCII to have a "portable" non-ASCII character that can be used to test non-ASCII strings. The patch uses it in some existing functions.

I wrote the patch against the default branch; we could start using it as far back as Python 3.2.
msg174900 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2012-11-05 12:26
I think you should ensure that os.fsdecode(os.fsencode(character)) == character.
msg174904 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2012-11-05 12:58
If NONASCII is None, I suggest the following fallback code:

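    import os

    # Scan for the first printable non-ASCII character that
    # round-trips through the filesystem encoding.
    for i in range(0x80, 0xFFFF):
        character = chr(i)
        if character.isprintable():
            try:
                if os.fsdecode(os.fsencode(character)) == character:
                    NONASCII = character
                    break
            except UnicodeError:
                # Not encodable with the filesystem encoding; try the next one.
                pass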
msg174922 - (view) Author: Chris Jerdonek (chris.jerdonek) * (Python committer) Date: 2012-11-05 17:23
+# NONASCII: non-ASCII character encodable by os.fsencode(),
+# or None if there is no such character.
+NONASCII = None

Can you use a name that reflects that this is a specific type of non-ASCII character having a special property (e.g. FS_NONASCII)?  I think "NONASCII" should be reserved for a generic non-ASCII character.  Moreover, there may be other types of non-ASCII characters we can add in the future.
msg174946 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2012-11-05 23:10
> I think you should ensure that os.fsdecode(os.fsencode(character)) == character.

The chosen characters satisfy this property, but it doesn't hurt to add such a check.
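
A minimal sketch of how a fixed candidate list could be combined with the round-trip check discussed here. The helper name `pick_fs_nonascii` and the candidate tuple are assumptions made for illustration; they are not taken from the patch.

```python
import os

# Illustrative candidates only (assumption); the patch's own list is not reproduced here.
CANDIDATES = ("\u00e6", "\u0141", "\u03c6", "\u041a", "\u05d0", "\u00a0", "\u20ac")


def pick_fs_nonascii(candidates=CANDIDATES):
    """Return the first candidate that round-trips through the FS encoding, or None."""
    for character in candidates:
        try:
            if os.fsdecode(os.fsencode(character)) == character:
                return character
        except UnicodeError:
            # Not encodable with the filesystem encoding; try the next candidate.
            continue
    return None


FS_NONASCII = pick_fs_nonascii()
```

Because only a handful of candidates are tried, the cost at import time stays negligible, which is the trade-off against scanning the whole code space.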

> Can you use a name that reflects that this is a specific type
> of non-ASCII character having a special property (e.g. FS_NONASCII)?

Done.
msg174948 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2012-11-05 23:12
> If NONASCII is None, I suggest the following fallback code

I prefer not to "brute force" Unicode because it would slow down every test run, even for tests that do not use FS_NONASCII. I wrote the attached brute.py script to compute an exhaustive list of non-ASCII characters encodable to "any" locale encoding. My list of locale encodings is not complete, but it should be enough for our purpose. The list can be completed later.
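
For illustration, an offline search of this kind might look like the sketch below. It is not the attached brute.py: the encoding list and the helper names `roundtrips` and `first_nonascii` are assumptions.

```python
def roundtrips(character, encoding):
    """True if character survives encode/decode with the surrogateescape handler."""
    try:
        data = character.encode(encoding, "surrogateescape")
        return data.decode(encoding, "surrogateescape") == character
    except UnicodeError:
        return False


def first_nonascii(encoding, limit=0x10000):
    """Return the first printable non-ASCII character usable with encoding, or None."""
    for code in range(0x80, limit):
        character = chr(code)
        if character.isprintable() and roundtrips(character, encoding):
            return character
    return None


# Assumed sample of locale encodings; a real survey would cover many more.
ENCODINGS = ("utf-8", "iso8859_1", "iso8859_15", "cp932", "koi8_r")

for encoding in ENCODINGS:
    print(encoding, repr(first_nonascii(encoding)))
```

Running the scan once, outside the test suite, and hard-coding the results avoids paying for the loop on every import of the support module.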
msg174949 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2012-11-05 23:17
I tested support_non_ascii-2.patch on Windows with cp932 ANSI code page (FS encoding), and on Linux with ASCII, ISO-8859-1, ISO-8859-15 and UTF-8 locale encodings.
msg174959 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2012-11-06 09:48
I tested brute.py with all encodings supported by Python:

No character for encoding cp1006:surrogateescape :-(
No character for encoding cp720:surrogateescape :-(
No character for encoding cp864:surrogateescape :-(
No character for encoding iso8859_3:surrogateescape :-(
No character for encoding iso8859_6:surrogateescape :-(
No character for encoding mac_arabic:surrogateescape :-(
No character for encoding mac_farsi:surrogateescape :-(
msg174961 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2012-11-06 10:20
> I tested brute.py with all encodings supported by Python:

Oh thanks, interesting result. I completed the encoding list and the character list: see brute2.py. I added "joker" characters, U+00A0 and U+20AC, which meet the requirements for most locale encodings.
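
As a rough way to see which of the encodings reported in msg174959 the two joker characters cover, one could run a check like the following. This is a sketch, not brute2.py; the helper name `usable` is an assumption.

```python
JOKERS = ("\u00a0", "\u20ac")  # NO-BREAK SPACE and EURO SIGN

# The encodings reported in msg174959 as having no usable character.
ENCODINGS = ("cp1006", "cp720", "cp864", "iso8859_3", "iso8859_6",
             "mac_arabic", "mac_farsi")


def usable(character, encoding):
    """True if character round-trips through encoding with surrogateescape."""
    try:
        data = character.encode(encoding, "surrogateescape")
        return data.decode(encoding, "surrogateescape") == character
    except UnicodeError:
        return False


for encoding in ENCODINGS:
    found = ["U+{:04X}".format(ord(c)) for c in JOKERS if usable(c, encoding)]
    print(encoding, found or "no joker character")
```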
msg175016 - (view) Author: Roundup Robot (python-dev) (Python triager) Date: 2012-11-06 22:23
New changeset de8cf1ece068 by Victor Stinner in branch 'default':
Issue #16414: Add support.FS_NONASCII and support.TESTFN_NONASCII
http://hg.python.org/cpython/rev/de8cf1ece068
msg175017 - (view) Author: Roundup Robot (python-dev) (Python triager) Date: 2012-11-06 22:33
New changeset 0e9fbdda3c92 by Victor Stinner in branch 'default':
Issue #16414: Fix support.TESTFN_UNDECODABLE and test_genericpath.test_nonascii_abspath()
http://hg.python.org/cpython/rev/0e9fbdda3c92
msg175018 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2012-11-06 22:34
Why did you add the '- ' suffix to TESTFN_NONASCII?
msg175019 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2012-11-06 22:39
I don't see U+00A0 and U+20AC in the changeset.
msg175020 - (view) Author: Roundup Robot (python-dev) (Python triager) Date: 2012-11-06 22:40
New changeset 55710b8c6670 by Victor Stinner in branch 'default':
Issue #16414: Fix typo in support.TESTFN_NONASCII (useless space)
http://hg.python.org/cpython/rev/55710b8c6670
msg175021 - (view) Author: Roundup Robot (python-dev) (Python triager) Date: 2012-11-06 22:43
New changeset 7f90305d9f23 by Victor Stinner in branch 'default':
Issue #16414: Test more characters for support.FS_NONASCII
http://hg.python.org/cpython/rev/7f90305d9f23
msg175025 - (view) Author: Roundup Robot (python-dev) (Python triager) Date: 2012-11-06 23:10
New changeset fce9e892c65d by Victor Stinner in branch 'default':
Issue #16414: Fix test_os on Windows, don't test os.listdir() with undecodable
http://hg.python.org/cpython/rev/fce9e892c65d
msg175026 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2012-11-06 23:12
> Why did you add the '- ' suffix to TESTFN_NONASCII?

Oops, the space was a mistake. I added "-" just for the readability of the generated filename.
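
A small sketch of how such a filename could be built. The values of `TESTFN` and `FS_NONASCII` below are assumed stand-ins, not the ones computed by the support module.

```python
import os

# Assumed stand-in for support.TESTFN.
TESTFN = "@test_{}_tmp".format(os.getpid())

# Assumed value; the real one depends on the filesystem encoding and may be None.
FS_NONASCII = "\u00e6"

# "-" only separates the two parts so the generated name is easier to read.
TESTFN_NONASCII = (TESTFN + "-" + FS_NONASCII) if FS_NONASCII else None
```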

> I don't see U+00A0 and U+20AC in the changeset.

Oh, I forgot to update the patch with the latest results of "brute2.py". It is now fixed.

Thanks for the review!
msg175033 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2012-11-06 23:40
Handling non-ASCII paths is always a pain. I don't plan to backport support.FS_NONASCII to Python 3.3 right now, but I may backport it later.
msg178870 - (view) Author: Roundup Robot (python-dev) (Python triager) Date: 2013-01-03 00:59
New changeset 41658a4fb3cc by Victor Stinner in branch '3.2':
Issue #16218, #16414, #16444: Backport FS_NONASCII, TESTFN_UNDECODABLE,
http://hg.python.org/cpython/rev/41658a4fb3cc

New changeset 4d40c1ce8566 by Victor Stinner in branch '3.3':
(Merge 3.2) Issue #16218, #16414, #16444: Backport FS_NONASCII,
http://hg.python.org/cpython/rev/4d40c1ce8566
History
Date User Action Args
2013-01-03 01:07:33 vstinner set versions: + Python 3.2, Python 3.3
2013-01-03 00:59:48 python-dev set messages: + msg178870
2012-11-06 23:40:40 vstinner set status: open -> closed; resolution: fixed; messages: + msg175033; versions: - Python 3.3
2012-11-06 23:12:22 vstinner set messages: + msg175026
2012-11-06 23:10:13 python-dev set messages: + msg175025
2012-11-06 22:43:05 python-dev set messages: + msg175021
2012-11-06 22:41:18 vstinner set files: + brute2.py
2012-11-06 22:40:15 python-dev set messages: + msg175020
2012-11-06 22:39:58 serhiy.storchaka set messages: + msg175019
2012-11-06 22:34:10 serhiy.storchaka set messages: + msg175018
2012-11-06 22:33:32 python-dev set messages: + msg175017
2012-11-06 22:23:28 python-dev set nosy: + python-dev; messages: + msg175016
2012-11-06 10:20:30 vstinner set messages: + msg174961
2012-11-06 09:48:23 serhiy.storchaka set messages: + msg174959
2012-11-05 23:17:33 vstinner set messages: + msg174949
2012-11-05 23:12:37 vstinner set files: - support_non_ascii.patch
2012-11-05 23:12:29 vstinner set files: + brute.py; messages: + msg174948
2012-11-05 23:10:47 vstinner set files: + support_non_ascii-2.patch; messages: + msg174946
2012-11-05 17:23:31 chris.jerdonek set nosy: + chris.jerdonek; messages: + msg174922
2012-11-05 12:58:22 serhiy.storchaka set messages: + msg174904
2012-11-05 12:26:41 serhiy.storchaka set messages: + msg174900
2012-11-05 12:12:14 vstinner create