Unicode non-characters #38779

Closed
gnosis mannequin opened this issue Jul 3, 2003 · 2 comments

gnosis mannequin commented Jul 3, 2003

BPO 765036
Nosy @malemburg

Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.

GitHub fields:

assignee = 'https://github.com/malemburg'
closed_at = <Date 2003-07-03.07:07:22.000>
created_at = <Date 2003-07-03.01:52:58.000>
labels = ['expert-unicode']
title = 'Unicode non-characters'
updated_at = <Date 2003-07-03.07:07:22.000>
user = 'https://bugs.python.org/gnosis'

bugs.python.org fields:

activity = <Date 2003-07-03.07:07:22.000>
actor = 'lemburg'
assignee = 'lemburg'
closed = True
closed_date = None
closer = None
components = ['Unicode']
creation = <Date 2003-07-03.01:52:58.000>
creator = 'gnosis'
dependencies = []
files = []
hgrepos = []
issue_num = 765036
keywords = []
message_count = 2.0
messages = ['16831', '16832']
nosy_count = 2.0
nosy_names = ['lemburg', 'gnosis']
pr_nums = []
priority = 'normal'
resolution = 'wont fix'
stage = None
status = 'closed'
superseder = None
type = None
url = 'https://bugs.python.org/issue765036'
versions = ['Python 2.3']

gnosis mannequin commented Jul 3, 2003

The alleged codepoints unichr(0xFFFE) and
unichr(0xFFFF) are not unicode characters. This document:

http://www.unicode.org/charts/PDF/UFFF0.pdf

Contains:

Noncharacters
These codes are intended for process internal uses, but
are not permitted for interchange.

FFFE <not a character>
• the value FFFE is guaranteed not to be a Unicode character at all
• may be used to detect byte order by contrast with FEFF which is a character
→ FEFF zero width no-break space

FFFF <not a character>
• the value FFFF is guaranteed not to be a Unicode character at all

In particular, an XML document that contains such an
alleged Unicode entity is not well-formed.
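The well-formedness claim can be checked with the standard library's XML parser. This is a minimal sketch written for Python 3 rather than the Python 2.3 of this report; `is_well_formed` is a hypothetical helper name, and the example uses character references, assuming that is what "entity" means here.

```python
import xml.etree.ElementTree as ET


def is_well_formed(xml_text):
    """Return True if the parser accepts xml_text, False on a parse error."""
    try:
        ET.fromstring(xml_text)
    except ET.ParseError:
        return False
    return True


# U+FEFF is a legal XML character, so a reference to it parses.
print(is_well_formed("<doc>&#xFEFF;</doc>"))
# U+FFFE and U+FFFF fall outside the XML 1.0 Char production.
print(is_well_formed("<doc>&#xFFFE;</doc>"))
print(is_well_formed("<doc>&#xFFFF;</doc>"))
```

The parser rejects the last two documents, while Python itself builds the corresponding strings without complaint.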

All Unicode-aware versions of Python treat these
codepoints in the same manner as other codepoints, e.g.
both unichr(0xFFFE) and u'\uffff' pass without complaint.

I believe the correct behavior would be for Python to
raise an exception, or at least a warning, on access to
these spurious characters.
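The behavior described above can still be observed on Python 3, where `unichr` is spelled `chr`. The sketch below is an illustration under that assumption, not the 2.3-era code; `describe` is a hypothetical helper name.

```python
import unicodedata


def describe(codepoint):
    """Return (length, general category, name) for the given code point."""
    ch = chr(codepoint)  # Python 3 spelling of unichr(codepoint)
    # Noncharacters have no name, so supply a default instead of raising.
    name = unicodedata.name(ch, "<no name>")
    return len(ch), unicodedata.category(ch), name


# Both values are accepted and produce ordinary one-character strings.
for cp in (0xFFFE, 0xFFFF):
    print(f"U+{cp:04X}", describe(cp))

# The escape form is accepted as well.
assert "\uffff" == chr(0xFFFF)
```

The general category reported for both is `Cn` (unassigned), which is the only hint the `unicodedata` module gives that these are not ordinary characters.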

@gnosis gnosis mannequin closed this as completed Jul 3, 2003
@gnosis gnosis mannequin assigned malemburg Jul 3, 2003
@gnosis gnosis mannequin added the topic-unicode label Jul 3, 2003
@malemburg
Member

This is on purpose: you need a way to write programs
that write and handle BOMs. If you want your program to
raise exceptions for these code points, you can easily
implement the required checks yourself.
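Both points in this reply can be sketched in a few lines. The code below is written for Python 3 and is only one possible approach; `check_no_noncharacters` and `detect_utf16_byte_order` are hypothetical helper names, and the check covers only the two code points discussed in this issue.

```python
import codecs

# Only the two code points discussed in this issue (an assumption of this sketch).
NONCHARACTERS = frozenset({0xFFFE, 0xFFFF})


def check_no_noncharacters(text):
    """Return text unchanged, or raise ValueError if it contains U+FFFE or U+FFFF."""
    for index, ch in enumerate(text):
        if ord(ch) in NONCHARACTERS:
            raise ValueError(f"noncharacter U+{ord(ch):04X} at index {index}")
    return text


def detect_utf16_byte_order(data):
    """Guess the UTF-16 byte order from a leading BOM; None if there is no BOM."""
    if data.startswith(codecs.BOM_UTF16_BE):  # b'\xfe\xff'
        return "big"
    if data.startswith(codecs.BOM_UTF16_LE):  # b'\xff\xfe'
        return "little"
    return None


# A BOM written in little-endian order and read back as big-endian shows up
# as U+FFFE, which is why BOM-handling code must be able to hold that value.
bom_bytes = "\ufeff".encode("utf-16-le")
swapped = bom_bytes.decode("utf-16-be")
print(detect_utf16_byte_order(bom_bytes), hex(ord(swapped)))
```

A program that wants strict behavior can call `check_no_noncharacters` at its own input and output boundaries, while byte-order detection code remains free to see U+FFFE.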

@ezio-melotti ezio-melotti transferred this issue from another repository Apr 9, 2022