classification
Title: Unicode non-characters
Type:
Stage:
Components: Unicode
Versions: Python 2.3

process
Status: closed
Resolution: wont fix
Dependencies:
Superseder:
Assigned To: lemburg
Nosy List: gnosis, lemburg
Priority: normal
Keywords:

Created on 2003-07-03 01:52 by gnosis, last changed 2003-07-03 07:07 by lemburg. This issue is now closed.

Messages (2)
msg16831 - Author: David Mertz (gnosis) Date: 2003-07-03 01:52
The alleged code points unichr(0xFFFE) and
unichr(0xFFFF) are not Unicode characters.  This document:

  http://www.unicode.org/charts/PDF/UFFF0.pdf

contains:

  Noncharacters
  These codes are intended for process internal uses, but
  are not permitted for interchange.

  FFFE <not a character>
  * the value FFFE is guaranteed not to be
    a Unicode character at all
  * may be used to detect byte order by
    contrast with FEFF which is a character
    FEFF zero width no-break space

  FFFF <not a character>
  * the value FFFF is guaranteed not to be
    a Unicode character at all

In particular, an XML document that contains such an
alleged Unicode entity is not well-formed.
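
The well-formedness claim can be checked with a short sketch. It is written for Python 3 and the standard-library xml.etree.ElementTree parser, both assumptions, since this report was filed against Python 2.3. The helper name is_well_formed is hypothetical.

```python
import xml.etree.ElementTree as ET


def is_well_formed(document):
    """Return True if the parser accepts the document, False otherwise.

    Hypothetical helper for illustration; it only reports whether
    ElementTree's parser raised ParseError.
    """
    try:
        ET.fromstring(document)
    except ET.ParseError:
        return False
    return True


# U+FEFF is a character, so a reference to it is allowed in content.
bom_reference_ok = is_well_formed("<a>&#xFEFF;</a>")

# U+FFFE and U+FFFF are outside the XML 1.0 Char production.
fffe_reference_ok = is_well_formed("<a>&#xFFFE;</a>")
ffff_reference_ok = is_well_formed("<a>&#xFFFF;</a>")
```

The sketch only covers character references; it says nothing about how a given parser handles the raw code points in encoded input.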

All Unicode-aware versions of Python treat these
code points in the same manner as other code points, e.g.
both unichr(0xFFFE) and u'\uffff' pass without complaint.
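
A minimal sketch of the behavior described above, assuming Python 3, where chr() takes the place of Python 2's unichr() and plain string literals are Unicode. The helper name accepted_without_complaint is hypothetical.

```python
import unicodedata


def accepted_without_complaint(cp):
    """Return True if chr(cp) yields a one-character string for cp.

    Hypothetical helper; chr() is the Python 3 counterpart of the
    unichr() call quoted in this report.
    """
    ch = chr(cp)  # no exception and no warning for U+FFFE or U+FFFF
    return len(ch) == 1 and ord(ch) == cp


results = {cp: accepted_without_complaint(cp) for cp in (0xFFFE, 0xFFFF)}

# The escape form is accepted as well.
literal = "\uffff"

# The Unicode database classifies both code points as unassigned ("Cn"),
# yet the string type stores them like any other code point.
categories = [unicodedata.category(chr(cp)) for cp in (0xFFFE, 0xFFFF)]
```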

I believe the correct behavior would be for Python to
raise an exception, or at least a warning, on access to
these spurious characters.

msg16832 - Author: Marc-Andre Lemburg (lemburg) * (Python committer) Date: 2003-07-03 07:07
This is on purpose: you do need a way to write programs
which write and handle BOMs. If you want your program to
raise exceptions for these code points, you can easily
implement the required checks.
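
One way such application-level checks might look, as a sketch under stated assumptions: Python 3 rather than the Python 2.3 of this report, and a check limited to the two code points named in this issue. NoncharacterError, checked_chr and check_text are hypothetical names, not part of Python.

```python
# Only the two code points discussed in this issue; an application
# could extend this set to other noncharacters it cares about.
NONCHARACTERS = frozenset({0xFFFE, 0xFFFF})


class NoncharacterError(ValueError):
    """Hypothetical exception type; not part of the standard library."""


def checked_chr(cp):
    """Like chr(), but refuse the code points listed in NONCHARACTERS."""
    if cp in NONCHARACTERS:
        raise NoncharacterError("U+%04X is a Unicode noncharacter" % cp)
    return chr(cp)


def check_text(text):
    """Return text unchanged, or raise if it contains a noncharacter."""
    for index, ch in enumerate(text):
        if ord(ch) in NONCHARACTERS:
            raise NoncharacterError(
                "U+%04X at index %d is a Unicode noncharacter" % (ord(ch), index)
            )
    return text


# The BOM itself (U+FEFF) is a character and passes the check.
bom = checked_chr(0xFEFF)
```

Keeping the check in the application leaves chr() and string literals unrestricted, so code that has to write or inspect byte order marks is unaffected.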

History
Date                 User    Action  Args
2003-07-03 01:52:58  gnosis  create