Issue 765036: Unicode non-characters

This issue tracker has been migrated to GitHub, and is currently read-only.
For more information, see the GitHub FAQs in Python's Developer Guide.

This issue has been migrated to GitHub: https://github.com/python/cpython/issues/38779

Created on 2003-07-03 01:52 by gnosis, last changed 2022-04-10 16:09 by admin. This issue is now closed.

Messages (2)
msg16831	Author: David Mertz (gnosis)	Date: 2003-07-03 01:52
The alleged code points unichr(0xFFFE) and unichr(0xFFFF) are not Unicode characters. This document:

http://www.unicode.org/charts/PDF/UFFF0.pdf

contains:

    Noncharacters
    These codes are intended for process internal uses, but are not permitted for interchange.
    FFFE  <not a character>
          - the value FFFE is guaranteed not to be a Unicode character at all
          - may be used to detect byte order by contrast with FEFF, which is a character
            FEFF zero width no-break space
    FFFF  <not a character>
          - the value FFFF is guaranteed not to be a Unicode character at all

In particular, an XML document that contains such an alleged Unicode entity is not well-formed.

All Unicode-aware versions of Python treat these code points in the same manner as other code points; e.g., both unichr(0xFFFE) and u'\uffff' pass without complaint.

I believe the correct behavior would be for Python to raise an exception, or at least a warning, on access to these spurious characters.
msg16832	Author: Marc-Andre Lemburg (lemburg) *	Date: 2003-07-03 07:07
This is on purpose: you need a way to write programs that write and handle BOMs. If you want your program to raise exceptions for these code points, you can easily implement the required checks.
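
A minimal sketch of the kind of application-level check suggested in the reply might look like the following. It is written for Python 3, where chr() replaces the unichr() discussed in this issue. The names NONCHARACTERS, is_noncharacter, and reject_noncharacters are hypothetical helpers, not part of Python, and the set covers only the two code points named in this issue.

```python
# Hypothetical helpers: not part of the Python standard library.
# Assumption: only the two code points named in this issue are checked.
NONCHARACTERS = frozenset({0xFFFE, 0xFFFF})


def is_noncharacter(code_point):
    """Return True if code_point is one of the noncharacters listed above."""
    return code_point in NONCHARACTERS


def reject_noncharacters(text):
    """Return text unchanged, or raise ValueError at the first noncharacter."""
    for index, char in enumerate(text):
        if is_noncharacter(ord(char)):
            raise ValueError(
                "noncharacter U+%04X at index %d" % (ord(char), index)
            )
    return text


# U+FEFF (the BOM / zero width no-break space) is a real character and passes.
reject_noncharacters("\ufeffhello")

# Python itself still accepts chr(0xFFFF); only the explicit check raises.
try:
    reject_noncharacters("bad" + chr(0xFFFF))
except ValueError as exc:
    print(exc)
```

Keeping the check in application code, rather than in chr() or the string type, leaves programs free to handle byte order marks and process-internal sentinels, which is the rationale given in the reply.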