Unicode non-characters #38779

gnosis · 2003-07-03T01:52:58Z

BPO	765036
Nosy	@malemburg

Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.


GitHub fields:

assignee = 'https://github.com/malemburg'
closed_at = <Date 2003-07-03.07:07:22.000>
created_at = <Date 2003-07-03.01:52:58.000>
labels = ['expert-unicode']
title = 'Unicode non-characters'
updated_at = <Date 2003-07-03.07:07:22.000>
user = 'https://bugs.python.org/gnosis'

bugs.python.org fields:

activity = <Date 2003-07-03.07:07:22.000>
actor = 'lemburg'
assignee = 'lemburg'
closed = True
closed_date = None
closer = None
components = ['Unicode']
creation = <Date 2003-07-03.01:52:58.000>
creator = 'gnosis'
dependencies = []
files = []
hgrepos = []
issue_num = 765036
keywords = []
message_count = 2.0
messages = ['16831', '16832']
nosy_count = 2.0
nosy_names = ['lemburg', 'gnosis']
pr_nums = []
priority = 'normal'
resolution = 'wont fix'
stage = None
status = 'closed'
superseder = None
type = None
url = 'https://bugs.python.org/issue765036'
versions = ['Python 2.3']

gnosis · 2003-07-03T01:52:58Z

The alleged code points unichr(0xFFFE) and
unichr(0xFFFF) are not Unicode characters. This document:

http://www.unicode.org/charts/PDF/UFFF0.pdf

Contains:

Noncharacters
These codes are intended for process internal uses, but
are not permitted for interchange.

FFFE <not a character>
• the value FFFE is guaranteed not to be
  a Unicode character at all
• may be used to detect byte order by
  contrast with FEFF which is a character
  FEFF zero width no-break space

FFFF <not a character>
• the value FFFF is guaranteed not to be
  a Unicode character at all
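
The byte-order use mentioned in the chart excerpt can be sketched roughly as follows. This is only an illustration in Python 3 syntax (the report targets Python 2.3), and the helper name `detect_utf16_byte_order` is hypothetical, not something from the report or the standard library.

```python
import codecs

def detect_utf16_byte_order(data: bytes) -> str:
    """Guess the UTF-16 byte order from the first two bytes (hypothetical helper)."""
    prefix = data[:2]
    if prefix == codecs.BOM_UTF16_BE:
        # b'\xfe\xff': the bytes of U+FEFF in big-endian order
        return "big-endian"
    if prefix == codecs.BOM_UTF16_LE:
        # b'\xff\xfe': read as big-endian this would be U+FFFE, which is
        # guaranteed not to be a character, so the data must be little-endian
        return "little-endian"
    return "unknown"

sample_be = codecs.BOM_UTF16_BE + "hi".encode("utf-16-be")
sample_le = codecs.BOM_UTF16_LE + "hi".encode("utf-16-le")
print(detect_utf16_byte_order(sample_be))
print(detect_utf16_byte_order(sample_le))
```

The detection works only because U+FFFE can never appear as a real character, so a leading FF FE pair is unambiguous.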

In particular, an XML document that contains such an
alleged Unicode entity is not well-formed.

All Unicode-aware versions of Python treat these
code points in the same manner as other code points, e.g.
both unichr(0xFFFE) and u'\uffff' pass without complaint.
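
A quick way to observe this behavior, shown here in Python 3 syntax, where `chr` plays the role of Python 2's `unichr` and every `str` literal is Unicode. It is an illustration only, not code from the report.

```python
import unicodedata

for cp in (0xFFFE, 0xFFFF):
    ch = chr(cp)  # no exception is raised for either value
    # The Unicode database reports noncharacters with general category 'Cn'.
    print(hex(cp), len(ch), unicodedata.category(ch))

literal = '\uffff'  # the escape form is accepted as well
print(literal == chr(0xFFFF))
```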

I believe the correct behavior would be for Python to
raise an exception, or at least a warning, on access to
these spurious characters.

malemburg · 2003-07-03T07:07:22Z


This is on purpose: you do need a way to write programs
which write and handle BOMs. If you want your program to
raise exceptions for these code points, you can easily
implement the required checks.
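
One possible shape for such a check, as a minimal sketch in Python 3 syntax. The function names `is_noncharacter` and `reject_noncharacters` are hypothetical. The range test covers all 66 noncharacters defined by the Unicode Standard (U+FDD0..U+FDEF plus the last two code points of every plane), which is an assumption that goes beyond the two values named in this report.

```python
def is_noncharacter(ch: str) -> bool:
    """Return True if ch is a Unicode noncharacter (hypothetical helper)."""
    cp = ord(ch)
    # U+FDD0..U+FDEF, plus U+nFFFE and U+nFFFF in every plane n
    return 0xFDD0 <= cp <= 0xFDEF or (cp & 0xFFFE) == 0xFFFE

def reject_noncharacters(text: str) -> str:
    """Raise ValueError on the first noncharacter, else return text unchanged."""
    for index, ch in enumerate(text):
        if is_noncharacter(ch):
            raise ValueError(
                "noncharacter U+%04X at index %d" % (ord(ch), index)
            )
    return text

print(is_noncharacter('\ufeff'))  # the BOM itself is a real character
print(is_noncharacter('\ufffe'))
```

Keeping the check in application code means that programs which need to handle byte-order marks and their byte-swapped form are unaffected.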

gnosis mannequin closed this as completed Jul 3, 2003

gnosis mannequin assigned malemburg Jul 3, 2003

gnosis mannequin added the topic-unicode label Jul 3, 2003

ezio-melotti transferred this issue from another repository Apr 9, 2022
