Issue 9819: TESTFN_UNICODE and TESTFN_UNDECODABLE

This issue has been migrated to GitHub: https://github.com/python/cpython/issues/54028

classification

Title:	TESTFN_UNICODE and TESTFN_UNDECODABLE
Type:		Stage:
Components:	Tests, Unicode	Versions:	Python 3.1, Python 3.2

process

Status:	closed	Resolution:	fixed
Dependencies:		Superseder:
Assigned To:	vstinner	Nosy List:	ocean-city, vstinner
Priority:	normal	Keywords:	patch

Created on 2010-09-10 09:39 by ocean-city, last changed 2022-04-11 14:57 by admin. This issue is now closed.

Files
File name	Uploaded	Description	Edit
find_unencode_filename-2.py	vstinner, 2010-09-10 11:25
unicode_file.patch	vstinner, 2010-09-11 05:56

Messages (13)
msg115989 - (view)	Author: Hirokazu Yamamoto (ocean-city) *	Date: 2010-09-10 09:39
Hello. I noticed that the test suite reports a WARNING every time.

///////////////////////////////////////////////////

E:\python-dev>py3k -m test.regrtest test_os
WARNING: The filename '@test_464_tmp-共有される' CAN be encoded by the filesystem encoding (mbcs). Unicode filename tests may not be effective
(snip)

///////////////////////////////////////////////////

This happens because TESTFN_UNICODE_UNDECODABLE in Lib/test/support.py is decodable in a Japanese environment (cp932). It is easy to make it really undecodable in Japanese by using characters like "\u2661" or "\u2668" (the former is a heart mark, the latter is the "Onsen" hot spring mark). I could remove the warning with this:

TESTFN_UNENCODABLE = TESTFN + "-\u5171\u6709\u3055\u308c\u308b\u2661\u2668"

///////////////////////////////////////////////////

And another issue. This happens only in test_unicode_file:

///////////////////////////////////////////////////

E:\python-dev>py3k -m test.test_unicode_file
Traceback (most recent call last):
  File "e:\python-dev\py3k\lib\test\test_unicode_file.py", line 12, in <module>
    TESTFN_UNICODE.encode(TESTFN_ENCODING)
UnicodeEncodeError: 'mbcs' codec can't encode characters in position 0--1: invalid character

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "e:\python-dev\py3k\lib\runpy.py", line 160, in _run_module_as_main
    "__main__", fname, loader, pkg_name)
  File "e:\python-dev\py3k\lib\runpy.py", line 73, in _run_code
    exec(code, run_globals)
  File "e:\python-dev\py3k\lib\test\test_unicode_file.py", line 16, in <module>
    raise unittest.SkipTest("No Unicode filesystem semantics on this platform.")
unittest.case.SkipTest: No Unicode filesystem semantics on this platform.

///////////////////////////////////////////////////

This happens because TESTFN_UNICODE cannot be encoded in Japanese.

E:\python-dev>py3k
Python 3.2a2+ (py3k:84663M, Sep 10 2010, 13:24:41) [MSC v.1400 32 bit (Intel)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> print("-\xe0\xf2")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'cp932' codec can't encode character '\xe0' in position 1: illegal multibyte sequence

But interestingly, this byte sequence "\xe0\xf2" can be read as a cp932 multibyte character.

E:\python-dev>python
Python 2.6.6 (r266:84297, Aug 24 2010, 18:46:32) [MSC v.1500 32 bit (Intel)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> print "\xe0\xf2"
瑣
>>> "\xe0\xf2".decode("cp932")
u'\u7463'

E:\python-dev>py3k
Python 3.2a2+ (py3k:84663M, Sep 10 2010, 13:24:41) [MSC v.1400 32 bit (Intel)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> print('\u7463')
瑣

I believe this value "\xe0\xf2" came from Python 2.x; maybe "\u7463" should be used here? I'm not sure it can be decoded everywhere using other encodings, though.
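The mismatch described in this message can be reproduced with Python's own cp932 codec, without Windows. The following is a minimal sketch, assuming the Python 2 byte string "\xe0\xf2" was carried over to Python 3 unchanged, where it becomes a str holding the two characters U+00E0 and U+00F2. The helper name can_encode is made up for illustration and is not part of test.support.

```python
def can_encode(text, encoding):
    """Return True if text can be encoded with the given codec (strict mode)."""
    try:
        text.encode(encoding)
    except UnicodeEncodeError:
        return False
    return True


# In Python 3, "\xe0\xf2" is a str of two Latin-1 characters, not two bytes.
py3_literal = "\xe0\xf2"

# The same two byte values, read as one cp932 double-byte character.
as_cp932 = b"\xe0\xf2".decode("cp932")

print(can_encode(py3_literal, "cp932"))  # the str literal is not encodable
print(ascii(as_cp932))                   # the decoded character, escaped
print(can_encode(as_cp932, "cp932"))     # the decoded character round-trips
```

This only exercises Python's cp932 codec. The Windows mbcs codec may behave differently, so treat it as an illustration of the report rather than a reproduction of the test failure.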
msg115991 - (view)	Author: Hirokazu Yamamoto (ocean-city) *	Date: 2010-09-10 09:42
And one little thing. I noticed that the variable name differs between Python 2.x and Python 3.x:

TESTFN_UNICODE_UNDECODEABLE (2.x)
TESTFN_UNICODE_UNDECODABLE (3.x)

I think the 2.x name should be unified with the 3.x name. Thanks.
msg115994 - (view)	Author: STINNER Victor (vstinner) *	Date: 2010-09-10 10:27
> WARNING: The filename '@test_464_tmp-共有される' CAN be encoded
> by (...) cp932

We should find a character that is not encodable in any Windows code page, but is accepted in filenames.

> characters like "\u2661" or "\u2668" (...)

mbcs uses "ANSI" code pages: cp1250..cp1258 and cp874 (and maybe others, because you wrote that your setup uses cp932): http://en.wikipedia.org/wiki/Code_page#Windows_.28ANSI.29_code_pages

I wrote a short script to find an unencodable filename (attached to this issue). Output:

u'\u0301' is encodable to cp1258
u'\u0363' is not encodable to any code page
u'\u2661' is encodable to cp949
u'\u5171' is encodable to cp932, cp936, cp949, cp950

(CODE_PAGES constant of the script might be incomplete)

u'\u2661' is not a good candidate. u'\u0363' looks better. But we can mix different characters to limit the probability that the whole string is encodable. Example:

u'\u2661\u5171' is encodable to cp949
u'\u0301\u0363\u2661\u5171' is not encodable to any code page

> TESTFN_UNICODE_UNDECODEABLE (2.x)

This is a typo fixed by r83987 in py3k.
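For readers without access to the attachment, the kind of scan described in this message can be sketched as follows. This is not the attached find_unencode_filename.py. The CODE_PAGES list and the helper names encodable_in and describe are assumptions made for illustration, and the check uses Python's bundled codecs, which may not match the conversion Windows performs for mbcs.

```python
# Assumed list of code pages; like the script's constant, it may be incomplete.
CODE_PAGES = (
    ["cp%d" % n for n in range(1250, 1259)]
    + ["cp874", "cp932", "cp936", "cp949", "cp950"]
)


def encodable_in(text, code_pages=CODE_PAGES):
    """Return the code pages whose Python codec can encode text strictly."""
    found = []
    for name in code_pages:
        try:
            text.encode(name)
        except UnicodeEncodeError:
            continue
        found.append(name)
    return found


def describe(text):
    """Format the result in the same style as the output quoted above."""
    pages = encodable_in(text)
    if pages:
        return "%a is encodable to %s" % (text, ", ".join(pages))
    return "%a is not encodable to any code page" % (text,)


for candidate in ("\u0301", "\u0363", "\u2661", "\u5171",
                  "\u0301\u0363\u2661\u5171"):
    print(describe(candidate))
```

A string is unencodable in a code page as soon as one of its characters is, which is why mixing characters from different scripts lowers the chance that any single code page accepts the whole name.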
msg115997 - (view)	Author: STINNER Victor (vstinner) *	Date: 2010-09-10 10:45
See also #9820.
msg115999 - (view)	Author: Hirokazu Yamamoto (ocean-city) *	Date: 2010-09-10 10:57
Thank you for the reply.

> u'\u2661' is encodable to cp949

Doh! I can imagine it's difficult to find such a character. ;-)
msg116000 - (view)	Author: Hirokazu Yamamoto (ocean-city) *	Date: 2010-09-10 11:01
I also confirmed that '\u0363' can be used as a filename. The "dir" command cannot print the filename correctly, though.

E:\python-dev\foo のディレクトリ

2010/09/10 19:44 <DIR> .
2010/09/10 19:44 <DIR> ..
2010/09/10 19:44 3 ͣ
1 個のファイル 3 バイト
2 個のディレクトリ 2,788,741,120 バイトの空き領域
msg116007 - (view)	Author: STINNER Victor (vstinner) *	Date: 2010-09-10 11:25
> "dir" command cannot print filename correctly, though.

Who cares? We just have to be able to create a file with a name containing non encodable characters, list the directory, and then remove this evil file.

--

With r84666, Python uses "-\u5171\u6709\u3055\u308c\u308b" suffix for TESTFN_UNENCODABLE. Does it fix the issue on your host?

I attached an improved version of find_unencode_filename.py (with more code pages).

--

> > TESTFN_UNICODE_UNDECODEABLE (2.x)
> This is a typo fixed by r83987 in py3k.

I backported the fix to 2.7 (r84667).
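The create, list, and remove cycle mentioned in this message can be sketched in a few lines. This is an illustration only, not code from the test suite. The function name roundtrip_filename and the sample name "test-\u0363" are assumptions, and the result depends on the platform and filesystem.

```python
import os
import tempfile


def roundtrip_filename(name):
    """Create, list, and remove a file called name in a fresh temp directory.

    Returns True if the name shows up in os.listdir(), False if it does not,
    and None if the platform refuses to create a file with that name.
    """
    with tempfile.TemporaryDirectory() as tmpdir:
        path = os.path.join(tmpdir, name)
        try:
            with open(path, "wb") as f:
                f.write(b"abc")
        except (UnicodeEncodeError, OSError):
            return None
        listed = name in os.listdir(tmpdir)
        os.remove(path)
        return listed


print(roundtrip_filename("test-\u0363"))
```

Whether the console can display the name does not enter into it: the check only needs the name to survive the trip through the filesystem API.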
msg116010 - (view)	Author: Hirokazu Yamamoto (ocean-city) *	Date: 2010-09-10 11:43
Thank you for the fix.

> Who cares? We just have to be able to create a file with a name
> containing non encodable characters, list the directory, and then
> remove this evil file.

I won't. ;-) Sorry, that was not a complaint. I just thought it was interesting to see such a white box in the console, and to see that the file can be opened and deleted correctly.
msg116012 - (view)	Author: STINNER Victor (vstinner) *	Date: 2010-09-10 12:21
> With r84666, Python uses "-\u5171\u6709\u3055\u308c\u308b"
> suffix for TESTFN_UNENCODABLE.

Backported to 3.1 as r84668.

I don't want to patch Python 2.x (its Unicode support is weaker and the code is too different from Python 3), so I am closing the issue.
msg116074 - (view)	Author: Hirokazu Yamamoto (ocean-city) *	Date: 2010-09-11 03:55
How about the TESTFN_UNICODE (test_unicode_file) issue? Should I reopen this entry, or is it invalid?
msg116081 - (view)	Author: STINNER Victor (vstinner) *	Date: 2010-09-11 05:56
> How about TESTFN_UNICODE (test_unicode_file) issue?

  File "e:\python-dev\py3k\lib\test\test_unicode_file.py", line 12, in <module>
    TESTFN_UNICODE.encode(TESTFN_ENCODING)
UnicodeEncodeError: 'mbcs' codec can't encode characters in position 0--1: invalid character

That's because the mbcs encoding is more strict on Windows. But this test is not needed on Windows. Please try the attached patch (unicode_file.patch).
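One way to make the encodability check apply only where it matters is sketched below. This is not the attached unicode_file.patch, whose contents are not shown in this thread. It assumes that os.path.supports_unicode_filenames is a reasonable signal for platforms (such as Windows) that take str filenames natively, and the function name check_unicode_filename is hypothetical.

```python
import os
import sys
import unittest


def check_unicode_filename(name, encoding=None):
    """Raise unittest.SkipTest if name cannot be used as a filename here.

    Assumption: where os.path.supports_unicode_filenames is true, filenames
    are passed as str without encoding, so the encode check is skipped.
    """
    if os.path.supports_unicode_filenames:
        return
    encoding = encoding or sys.getfilesystemencoding()
    try:
        name.encode(encoding)
    except UnicodeError:
        raise unittest.SkipTest(
            "No Unicode filesystem semantics on this platform.")


# Example call with a plain ASCII name, which passes on every platform.
check_unicode_filename("@test_tmp", "ascii")
```

Wrapping the check in a function keeps the skip decision in one place. A test module could call it at import time, the way the traceback above shows the original check running at module level.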
msg116082 - (view)	Author: Hirokazu Yamamoto (ocean-city) *	Date: 2010-09-11 07:23
Thank you, your patch works.

E:\python-dev\py3k>py3k -m test.test_unicode_file
test_directories (__main__.TestUnicodeFiles) ... ok
test_single_files (__main__.TestUnicodeFiles) ... ok

----------------------------------------------------------------------
Ran 2 tests in 0.611s

OK
[69875 refs]
msg116094 - (view)	Author: STINNER Victor (vstinner) *	Date: 2010-09-11 12:53
> Thank you, your patch works.

Ok, patch committed to 3.2 as r84710. Thanks for your feedback.

History
Date	User	Action	Args
2022-04-11 14:57:06	admin	set	github: 54028
2010-09-13 20:39:35	vstinner	set	status: open -> closed resolution: fixed
2010-09-11 12:53:20	vstinner	set	messages: + msg116094
2010-09-11 07:23:40	ocean-city	set	messages: + msg116082
2010-09-11 05:56:51	vstinner	set	files: + unicode_file.patch
2010-09-11 05:56:39	vstinner	set	status: closed -> open keywords: + patch resolution: fixed -> (no value) messages: + msg116081
2010-09-11 03:55:04	ocean-city	set	messages: + msg116074
2010-09-10 12:21:55	vstinner	set	status: open -> closed resolution: fixed messages: + msg116012
2010-09-10 11:43:55	ocean-city	set	messages: + msg116010
2010-09-10 11:25:17	vstinner	set	files: - find_unencode_filename.py
2010-09-10 11:25:02	vstinner	set	files: + find_unencode_filename-2.py messages: + msg116007
2010-09-10 11:01:49	ocean-city	set	messages: + msg116000
2010-09-10 10:57:38	ocean-city	set	messages: + msg115999
2010-09-10 10:45:34	vstinner	set	messages: + msg115997
2010-09-10 10:27:52	vstinner	set	files: + find_unencode_filename.py messages: + msg115994
2010-09-10 09:48:09	amaury.forgeotdarc	set	assignee: vstinner nosy: + vstinner
2010-09-10 09:42:11	ocean-city	set	messages: + msg115991
2010-09-10 09:39:51	ocean-city	create