Use support.TESTFN_UNDECODABLE on UNIX #60648

Closed
vstinner opened this issue Nov 8, 2012 · 22 comments
Labels
tests (Tests in the Lib/test dir) · topic-unicode · type-feature (A feature request or enhancement)

Comments

@vstinner
Member

vstinner commented Nov 8, 2012

BPO 16444
Nosy @pitrou, @vstinner, @ezio-melotti, @asvetlov, @serhiy-storchaka
Files
  • support_undecodable.patch

Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.

    GitHub fields:

    assignee = None
    closed_at = <Date 2012-12-04.20:53:30.192>
    created_at = <Date 2012-11-08.22:52:14.333>
    labels = ['type-feature', 'tests', 'expert-unicode']
    title = 'Use support.TESTFN_UNDECODABLE on UNIX'
    updated_at = <Date 2013-01-03.01:07:37.018>
    user = 'https://github.com/vstinner'

    bugs.python.org fields:

    activity = <Date 2013-01-03.01:07:37.018>
    actor = 'vstinner'
    assignee = 'none'
    closed = True
    closed_date = <Date 2012-12-04.20:53:30.192>
    closer = 'vstinner'
    components = ['Tests', 'Unicode']
    creation = <Date 2012-11-08.22:52:14.333>
    creator = 'vstinner'
    dependencies = []
    files = ['27928']
    hgrepos = []
    issue_num = 16444
    keywords = ['patch']
    message_count = 22.0
    messages = ['175200', '175201', '175202', '175209', '175221', '175222', '175223', '175271', '175272', '175275', '175291', '175296', '175396', '175399', '175402', '175406', '175413', '175423', '176893', '176955', '176958', '178868']
    nosy_count = 6.0
    nosy_names = ['pitrou', 'vstinner', 'ezio.melotti', 'asvetlov', 'python-dev', 'serhiy.storchaka']
    pr_nums = []
    priority = 'normal'
    resolution = 'fixed'
    stage = 'patch review'
    status = 'closed'
    superseder = None
    type = 'enhancement'
    url = 'https://bugs.python.org/issue16444'
    versions = ['Python 3.2', 'Python 3.3', 'Python 3.4']

    @vstinner
    Member Author

    vstinner commented Nov 8, 2012

    The attached patch changes how support.TESTFN_UNDECODABLE is computed on UNIX: it decodes with the filesystem encoding in *strict* mode instead of using the surrogateescape error handler.

    So we can use support.TESTFN_UNDECODABLE to check whether a function correctly uses the surrogateescape error handler and/or whether it behaves correctly with non-ASCII characters.
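A minimal sketch of the idea, not the patch itself: try a few candidate byte strings against an encoding in strict mode and keep the first one that fails to decode. The function name, the candidate list and the file name prefix below are assumptions made for illustration.

```python
def find_undecodable(candidates, encoding):
    """Return the first candidate that cannot be decoded in strict mode.

    Returns None if every candidate decodes (for example with latin-1).
    """
    for raw in candidates:
        try:
            raw.decode(encoding)  # the default error handler is 'strict'
        except UnicodeDecodeError:
            return raw
    return None


# Hypothetical candidate list, chosen only for this example.
CANDIDATES = (b"\xff", b"\x81\x98")

suffix = find_undecodable(CANDIDATES, "utf-8")
if suffix is not None:
    # Hypothetical file name; a real test would use its own temporary name.
    name = b"@test_tmp_" + suffix
    # Code that honours surrogateescape must round-trip the raw bytes.
    text = name.decode("utf-8", "surrogateescape")
    assert text.encode("utf-8", "surrogateescape") == name
```

With an encoding such as latin-1, where every byte decodes, the helper returns None and a test relying on an undecodable name would have to be skipped.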

    The patch also uses support.TESTFN_UNDECODABLE (only on UNIX) in test_cmd_line_script.test_non_ascii() to check that the fix for bpo-16218 works with the UTF-8 locale encoding as well.

    Please test the patch on UNIX, Windows and Mac OS X.

    We may also use support.TESTFN_UNDECODABLE in test_cmd_line_script.test_non_ascii() on Windows, I will check.

    Windows has some strange behaviour with undecodable characters: some of them are replaced by a character with a similar glyph.

    @vstinner added the tests and topic-unicode labels Nov 8, 2012
    @vstinner
    Member Author

    vstinner commented Nov 8, 2012

    The patch contains two print statements to help debug the patch itself. They must be removed later.

    +print("TESTFN_UNDECODABLE = %a" % TESTFN_UNDECODABLE)
    +print("TESTFN_NONASCII = %a" % TESTFN_NONASCII)

    @vstinner
    Member Author

    vstinner commented Nov 8, 2012

    We may also use support.TESTFN_UNDECODABLE
    in test_cmd_line_script.test_non_ascii() on Windows

    Oh, subprocess doesn't support passing bytes arguments to a program anymore (since Python 3.0).
    http://bugs.python.org/issue4036#msg100376

    So it's better to use TESTFN_NONASCII instead for this test ;-) It confirms that we need two constants, depending on the context. Which one applies depends on the platform and on how the data is read or written: sometimes undecodable characters are supported on any platform (e.g. a base64 encoder), sometimes they are not supported at all (e.g. distutils expects valid metadata), and sometimes it depends on the platform (e.g. this test).

    @vstinner
    Member Author

    vstinner commented Nov 8, 2012

    Please test the patch on UNIX, Windows and Mac OS X.

    The full test suite passes on:

    • Linux with UTF-8 locale encoding
    • Linux with ASCII locale encoding
    • Windows with cp932 ANSI code page
    • Mac OS 10.8 with ASCII locale encoding (and utf-8/surrogateescape for the filesystem encoding) ($LANG, $LC_ALL, $LC_CTYPE are not set)

    @serhiy-storchaka
    Member

    Try b'\x81', b'\x98', b'\xae', b'\xd5', b'\xff'. At least one of them is undecodable in every 1-byte encoding that has any undecodable bytes. Encodings in which each byte is undecodable:

    b'\x81' : shift_jis_2004 shift_jis shift_jisx0213 cp869 cp874 cp932 cp1250 cp1252 cp1253 cp1254 cp1255 cp1257 cp1258
    b'\x98' : shift_jis_2004 shift_jis shift_jisx0213 cp874 cp932 cp1250 cp1251 cp1253 cp1257
    b'\xae' : iso8859-3 iso8859-6 iso8859-7 cp424
    b'\xd5' : iso8859-8 cp856 cp857
    b'\xff' : hp-roman8 iso8859-6 iso8859-7 iso8859-8 iso8859-11 shift_jis_2004 shift_jis shift_jisx0213 tis-620 cp864 cp874 cp1253 cp1255
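A table like this can be reproduced with a short loop. This is a sketch that assumes the codec names below are available in the running Python build; the helper name and the short encoding list are made up for this example.

```python
def undecodable_in(raw, encodings):
    """Return the encodings from `encodings` that reject `raw` in strict mode."""
    rejected = []
    for encoding in encodings:
        try:
            raw.decode(encoding)
        except UnicodeDecodeError:
            rejected.append(encoding)
    return rejected


# A small subset of encodings, picked only to keep the example short.
ENCODINGS = ["latin-1", "cp1251", "cp1252", "iso8859-7"]

for raw in (b"\x81", b"\x98", b"\xae", b"\xff"):
    print(raw, ":", " ".join(undecodable_in(raw, ENCODINGS)))
```

Passing a longer list of codec names extends the survey to more encodings.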

    @serhiy-storchaka
    Member

    Try b'\xed\xb2\x80' and b'\xed\xb4\x80' for UTF-8 (on Unix and Mac OS X).

    b'\xed\xb2\x80' is b'\x80'.decode('utf-8', 'surrogateescape').encode('utf-8', 'surrogatepass').
    b'\xed\xb4\x80' is '\udd00'.encode('utf-8', 'surrogatepass') and '\udd00' can't be encoded with surrogateescape.
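The two sequences can be checked directly. The snippet below only restates the expressions above as assertions, and adds a small helper (its name is an assumption) showing that strict UTF-8 decoding rejects both sequences.

```python
# b'\x80' is not valid UTF-8; surrogateescape maps it to the lone surrogate U+DC80.
escaped = b"\x80".decode("utf-8", "surrogateescape")
assert escaped == "\udc80"

# surrogatepass writes that lone surrogate out as a 3-byte sequence.
first = escaped.encode("utf-8", "surrogatepass")
assert first == b"\xed\xb2\x80"

# U+DD00 lies outside the U+DC80..U+DCFF range that surrogateescape can encode.
second = "\udd00".encode("utf-8", "surrogatepass")
assert second == b"\xed\xb4\x80"
try:
    "\udd00".encode("utf-8", "surrogateescape")
    escape_failed = False
except UnicodeEncodeError:
    escape_failed = True


def strict_utf8_rejects(raw):
    """Return True if `raw` cannot be decoded as UTF-8 in strict mode."""
    try:
        raw.decode("utf-8")
    except UnicodeDecodeError:
        return True
    return False
```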

    @serhiy-storchaka
    Member

    The full test suite passes on:

    It is not enough that the tests pass. They should also fail if the original bug occurs again. Have you tried reintroducing the bugs?

    @vstinner
    Member Author

    It is not enough that the tests pass.

    Right, but I don't want to introduce a regression :-)

    They should also fail if the original bug occurs again. Have you tried reintroducing the bugs?

    test_cmd_line_script.test_non_ascii() comes from issue bpo-16218, changeset 23ebe277e982. I checked this issue: support_undecodable.patch checks for non-regression with the UTF-8 (and ASCII and ISO-8859-1) locale encoding on UNIX.

    test_genericpath.test_non_ascii() comes from issue bpo-3426, whose fix comes from issue bpo-3187, changeset 8a7c930abab6. I don't want to spend time trying the new test against that issue, because changeset 8a7c930abab6 is a major change and I don't see how to revert it just to test the issue. I consider the issue fixed; the new test should not reduce the test coverage, just increase it ;-)

    @python-dev
    Mannequin

    python-dev mannequin commented Nov 10, 2012

    New changeset 6b8a8bc6ba9c by Victor Stinner in branch 'default':
    Issue bpo-16444, bpo-16218: Use TESTFN_UNDECODABLE on UNIX
    http://hg.python.org/cpython/rev/6b8a8bc6ba9c

    @serhiy-storchaka
    Member

    TESTFN_UNDECODABLE is not detected for cp1250, cp1251, cp1252, cp1254, cp1257 and cp1258. Just add b'\x81\x98\xae\xd5\xff'; at least one of these bytes is undecodable in any encoding that has undecodable bytes.

    @pitrou
    Member

    pitrou commented Nov 10, 2012

    I suppose you noticed you broke a bunch of buildbots :)

    @python-dev
    Mannequin

    python-dev mannequin commented Nov 10, 2012

    New changeset 398f8770bf0d by Victor Stinner in branch 'default':
    Issue bpo-16444: disable undecodable characters in test_non_ascii() test until
    http://hg.python.org/cpython/rev/398f8770bf0d

    @vstinner
    Member Author

    TESTFN_UNDECODABLE is not detected for cp1250, cp1251, cp1252, cp1254, cp1257 and cp1258.

    The Python codec and the real codec used by Windows are different: Python fails to decode bytes 0x80-0x9f, but Windows does decode them. I prefer to avoid these bytes so as not to rely too much on the Python codec.

    @serhiy-storchaka
    Member

    These encodings are used not only on Windows.

    @vstinner
    Member Author

    I suppose you noticed you broke a bunch of buildbots :)

    Failures occur on FreeBSD, OpenIndiana and some other buildbots which don't set a locale and so use the "C" locale. main() decodes command line arguments from the locale encoding using _Py_char2wchar(). On these OSes, the "C" locale uses the ISO-8859-1 encoding, but the problem is that nl_langinfo(CODESET) announces ASCII :-/ test_cmd_line.test_undecodable_code() handles this case. Excerpt from a comment:

    # _Py_char2wchar() decoded b'\xff' as '\xff' even if the locale is
    # C and the locale encoding is ASCII. It occurs on FreeBSD, Solaris
    # and Mac OS X.

    Mac OS X is now using UTF-8 to decode the command line arguments.
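To see what a given platform announces, the values mentioned above can be printed side by side. This is only a diagnostic sketch: it reports what Python and nl_langinfo() say, and it cannot show how the C library actually decodes bytes, which is where the mismatch described here comes from. The helper name is an assumption for this example.

```python
import locale
import sys


def announced_encodings():
    """Collect the encodings that Python and the C locale report."""
    info = {
        "filesystem": sys.getfilesystemencoding(),
        "preferred": locale.getpreferredencoding(False),
    }
    # nl_langinfo() and CODESET are not available on every platform (e.g. Windows).
    if hasattr(locale, "nl_langinfo") and hasattr(locale, "CODESET"):
        info["nl_langinfo(CODESET)"] = locale.nl_langinfo(locale.CODESET)
    return info


for key, value in announced_encodings().items():
    print(f"{key}: {value}")
```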

    I just created the issue bpo-16455 to fix FreeBSD and OpenIndiana.

    I propose to close this issue because I consider it fixed (bpo-16455 will re-enable TESTFN_UNDECODABLE in test_cmd_line_script).

    @vstinner
    Member Author

    These encodings are used not only on Windows.

    You can use cpXXX encodings explicitly to read or write a file, but these encodings are not used for sys.getfilesystemencoding() (or sys.stdout.encoding).

    @python-dev
    Mannequin

    python-dev mannequin commented Nov 12, 2012

    New changeset 6017f09ead53 by Victor Stinner in branch '3.3':
    Issue bpo-16218, bpo-16444: Backport improvment on tests for non-ASCII characters
    http://hg.python.org/cpython/rev/6017f09ead53

    @serhiy-storchaka
    Member

    You can use cpXXX encodings explicitly to read or write a file, but these
    encodings are not used for sys.getfilesystemencoding() (or
    sys.stdout.encoding).

    At least CP1251 was used for many Cyrillic locales in the pre-UTF-8 era (I still use it sometimes). Even now, CP1251 is the default encoding for the Byelorussian and Bulgarian locales:

    $ grep CP /usr/share/i18n/SUPPORTED
    be_BY CP1251
    bg_BG CP1251
    ru_RU.CP1251 CP1251
    yi_US CP1255

    @serhiy-storchaka
    Member

    Ping.

    @serhiy-storchaka added the type-feature label Dec 4, 2012
    @python-dev
    Mannequin

    python-dev mannequin commented Dec 4, 2012

    New changeset ed0ff4b3d1c4 by Victor Stinner in branch 'default':
    Issue bpo-16444: test more bytes in support.TESTFN_UNDECODABLE to support more Windows code pages
    http://hg.python.org/cpython/rev/ed0ff4b3d1c4

    @vstinner
    Member Author

    vstinner commented Dec 4, 2012

    Ooook, all remaining issues about undecodable bytes should now be fixed (until someone opens a new one? :-))

    @vstinner closed this as completed Dec 4, 2012
    @python-dev
    Mannequin

    python-dev mannequin commented Jan 3, 2013

    New changeset 41658a4fb3cc by Victor Stinner in branch '3.2':
    Issue bpo-16218, bpo-16414, bpo-16444: Backport FS_NONASCII, TESTFN_UNDECODABLE,
    http://hg.python.org/cpython/rev/41658a4fb3cc

    New changeset 4d40c1ce8566 by Victor Stinner in branch '3.3':
    (Merge 3.2) Issue bpo-16218, bpo-16414, bpo-16444: Backport FS_NONASCII,
    http://hg.python.org/cpython/rev/4d40c1ce8566

    @ezio-melotti transferred this issue from another repository Apr 10, 2022