Message 251731 - Python tracker

Author	martin.panter
Recipients	Arfrever, martin.panter, orsenthil, pitrou, serhiy.storchaka, vstinner
Date	2015-09-27.22:26:29
Message-id	<1443392789.68.0.377916580939.issue25184@psf.upfronthosting.co.za>

Content
I don’t have much to do with Windows, but I understand we don’t support surrogate-escaped bytes there; see <https://docs.python.org/dev/library/os.html#os.fsdecode> and sys.getfilesystemencoding(). I suspect your first patch would have failed on Windows when calling os.fsencode(TESTFN_UNENCODABLE), because Windows apparently cannot represent all possible file names in bytes. The second patch doesn’t call fsencode(), so this shouldn’t be a problem.
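As a rough illustration of what “surrogate-escaped bytes” means, the sketch below uses the "surrogateescape" error handler directly rather than os.fsdecode(), so that it behaves the same on every platform. It assumes a UTF-8 filesystem encoding, and the byte string is a made-up file name, not one taken from either patch.

```python
# Hypothetical file name containing a byte (0xFF) that is not valid UTF-8.
raw_name = b"cpy\xffthon"

# With "surrogateescape", the undecodable byte 0xFF becomes the lone
# surrogate U+DCFF instead of raising UnicodeDecodeError. This mirrors what
# os.fsdecode() does on POSIX when the filesystem encoding is UTF-8.
decoded = raw_name.decode("utf-8", errors="surrogateescape")

# Encoding with the same handler restores the original bytes exactly.
roundtrip = decoded.encode("utf-8", errors="surrogateescape")

print(ascii(decoded))
print(roundtrip == raw_name)
```

The decoded string is only safe to hand back to file system APIs; it is not valid Unicode text for output.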

Your tests do not check that the output is valid Unicode without surrogates. With your first patch applied, when pydoc wrote the HTML to a UTF-8 disk file, I got the error:

  File "/media/disk/home/proj/python/cpy\udcffthon/Lib/pydoc.py", line 2626, in cli
    writedoc(arg)
  File "/media/disk/home/proj/python/cpy\udcffthon/Lib/pydoc.py", line 1659, in writedoc
    file.write(page)
UnicodeEncodeError: 'utf-8' codec can't encode character '\udcff' in position 674: surrogates not allowed
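The failure can be reproduced without pydoc. The following is only a sketch of the kind of check the tests could make: has_surrogates() is a hypothetical helper, and the page string is an invented stand-in for the generated HTML.

```python
def has_surrogates(text):
    """Return True if text contains any code point in U+D800..U+DFFF."""
    return any(0xD800 <= ord(ch) <= 0xDFFF for ch in text)


# Invented stand-in for HTML output that embeds a surrogate-escaped path.
page = "<a href='file:/proj/cpy\udcffthon/Lib/pydoc.py'>pydoc</a>"

# Strict UTF-8 encoding rejects lone surrogates, which is what happens
# when the page is written to a file opened with encoding="utf-8".
try:
    page.encode("utf-8")
except UnicodeEncodeError as exc:
    error_reason = exc.reason
else:
    error_reason = None

print(has_surrogates(page), error_reason)
```

A test along these lines could assert that has_surrogates() is false for whatever the HTML generator returns.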

I have been working on an alternative patch using my IRI (Unicode URLs) proposal for “file:” links, and “surrogatepass” for HTTP links. But I am also trying to fix some related problems with the built-in HTTP server, and the unit tests are a bit tricky.
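To show the general idea of keeping surrogates out of a link, here is a sketch using urllib.parse.quote() with the "surrogatepass" error handler. This is not the alternative patch itself, only an assumption about how such percent-encoding might look, and the path is invented.

```python
from urllib.parse import quote, unquote

# Invented path containing a lone surrogate produced by surrogateescape.
path = "/proj/cpy\udcffthon/Lib/pydoc.py"

# "surrogatepass" lets the UTF-8 codec encode U+DCFF as three bytes,
# which quote() then percent-encodes, giving a pure-ASCII link.
link = quote(path, errors="surrogatepass")

# Decoding with the same handler recovers the original string.
restored = unquote(link, errors="surrogatepass")

print(link)
print(restored == path)
```

The encoded link is plain ASCII, so it can be written to a UTF-8 file without triggering the error above.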
