Message 219809 - Python tracker

➜

This issue tracker has been migrated to GitHub, and is currently read-only.
For more information, see the GitHub FAQs in the Python's Developer Guide.

Author	ncoghlan
Recipients	docs@python, gvanrossum, ncoghlan, pitrou, vstinner
Date	2014-06-05.12:34:07
SpamBayes Score	-1.0
Marked as misclassified	Yes
Message-id	<1401971648.18.0.0439993458759.issue21667@psf.upfronthosting.co.za>
In-reply-to

Content
If someone doesn't understand what "Unicode code point" means, that's going to be the least of their problems when it comes to implementing a conformant Python implementation. We could link to http://unicode.org/glossary/#code_point, but that doesn't really add much beyond "value from 0 to 0x10FFFF". If you try to dive into the formal Unicode spec instead, you end up in a twisty maze of definitions of things that are all closely related, but generally not the same thing (code positions, code units, code spaces, abstract characters, glyphs, graphemes, etc). The main advantage of using the more formal "code point" over the informal "character" is that it discourages people from assuming they know what they are (with the usual mistaken assumption being that Unicode code points correspond directly to glyphs the way ASCII and Extended ASCII printable characters correspond to their glyphs). The rest of the paragraph then provides the mechanical details of the meaningful interpretations of them in Python (as length 1 strings and as numbers in a particular range) and the operations for translating between those two formats (chr and ord). Fair point about the slicing - it may be better to just talk about indexing.

If someone doesn't understand what "Unicode code point" means, that's going to be the least of their problems when it comes to implementing a conformant Python implementation. We could link to http://unicode.org/glossary/#code_point, but that doesn't really add much beyond "value from 0 to 0x10FFFF". If you try to dive into the formal Unicode spec instead, you end up in a twisty maze of definitions of things that are all closely related, but generally not the same thing (code positions, code units, code spaces, abstract characters, glyphs, graphemes, etc).

The main advantage of using the more formal "code point" over the informal "character" is that it discourages people from assuming they know what they are (with the usual mistaken assumption being that Unicode code points correspond directly to glyphs the way ASCII and Extended ASCII printable characters correspond to their glyphs). The rest of the paragraph then provides the mechanical details of the meaningful interpretations of them in Python (as length 1 strings and as numbers in a particular range) and the operations for translating between those two formats (chr and ord).

Fair point about the slicing - it may be better to just talk about indexing.

History
Date	User	Action	Args
2014-06-05 12:34:08	ncoghlan	set	recipients: + ncoghlan, gvanrossum, pitrou, vstinner, docs@python
2014-06-05 12:34:08	ncoghlan	set	messageid: <1401971648.18.0.0439993458759.issue21667@psf.upfronthosting.co.za>
2014-06-05 12:34:08	ncoghlan	link	issue21667 messages
2014-06-05 12:34:07	ncoghlan	create