Issue 4153: Unicode HOWTO up to date?


This issue tracker has been migrated to GitHub, and is currently read-only.
For more information, see the GitHub FAQs in Python's Developer Guide.

This issue has been migrated to GitHub: https://github.com/python/cpython/issues/48403

classification

Title:	Unicode HOWTO up to date?
Type:		Stage:	resolved
Components:	Documentation	Versions:	Python 3.2, Python 3.3

process

Status:	closed	Resolution:	fixed
Dependencies:		Superseder:
Assigned To:	ezio.melotti	Nosy List:	akuchling, belopolsky, cvrebert, eric.araujo, ezio.melotti, georg.brandl, ncoghlan, python-dev, rhettinger, serhiy.storchaka, terry.reedy
Priority:	low	Keywords:	patch

Created on 2008-10-20 18:04 by terry.reedy, last changed 2022-04-11 14:56 by admin. This issue is now closed.

Files
File name	Uploaded	Description	Edit
issue4153.diff	belopolsky, 2010-11-18 16:38		review
issue4153-2.diff	ezio.melotti, 2011-09-01 08:04	Patch against 3.2	review
unicode-howto.txt	akuchling, 2013-06-08 19:46	Revised patch	review
unicode-howto.txt	akuchling, 2013-06-08 22:33		review

Messages (28)
msg74999 - (view)	Author: Terry J. Reedy (terry.reedy) *	Date: 2008-10-20 18:03
The Unicode HOWTO begins with "Warning This HOWTO has not yet been updated for Python 3000’s string object changes." Without reading it in detail, I can see that it has been updated, at least somewhat, and certainly more than I feared from the warning: there is "The String Type Since Python 3.0, the language features a str type that contain Unicode characters", then a section "Converting to Bytes", and a later reference to bytearrays. So perhaps the warning is obsolete and should be removed. Also, the revision history should have at least one more entry for the 3.0 updates, which were certainly entered after 2005.
msg76240 - (view)	Author: Georg Brandl (georg.brandl) *	Date: 2008-11-22 10:27
Thanks for noting this! The most basic changes had been done, but I had to revise some sections for changes. Done in r67338.
msg121444 - (view)	Author: Alexander Belopolsky (belopolsky) *	Date: 2010-11-18 06:12
Reopening because it looks like the fix was reverted in r82301.

"""
This HOWTO discusses Python 2.x’s support for Unicode, and explains various problems that people commonly encounter when trying to work with Unicode. (This HOWTO has not yet been updated to cover the 3.x versions of Python.)
"""

http://docs.python.org/dev/howto/unicode.html
msg121466 - (view)	Author: Alexander Belopolsky (belopolsky) *	Date: 2010-11-18 16:03
The changes added in r82301 are misleading because code examples in this HOWTO have been converted to 3.x. I am attaching a patch that removes "has not yet been updated to cover the 3.x" warning and makes some minor stylistic changes. I have bumped the release version to 1.12, but I would like to remove the revision history which is largely irrelevant.
msg121474 - (view)	Author: Alexander Belopolsky (belopolsky) *	Date: 2010-11-18 16:45
r82301 appears to be a blind merge of r82120 from the trunk. It is fairly obvious that it was not intentional.
msg121488 - (view)	Author: Terry J. Reedy (terry.reedy) *	Date: 2010-11-18 19:41
Thanks for persisting with this. Looking at the patch:

@@ -65,7 +63,7 @@
 goal was to have Unicode contain the alphabets for every single human language.
 It turns out that even 16 bits isn't enough to meet that goal, and the modern
 Unicode specification uses a wider range of codes, 0-1,114,111 (0x10ffff in
-base-16).
+base 16).

I visually parse 0-1,114,111 as 0-1, 114, 111. So I think either the commas should be removed or extra spaces are needed: 0-1114111 or 0 - 1,114,111.

In your recent (and excellent) chr/ord doc patch, you used (or stayed with) 'hexadecimal' versus 'base 16'. Do we have a standard? I think I prefer the former.

-character with value 0x12ca (4810 decimal). The Unicode standard contains a lot
+character with value 0x12ca (4,810 decimal). The Unicode standard contains a lot

I prefer without the added comma.

 >>> b'\x80abc'.decode("utf-8", "replace")
- '\ufffdabc'
+ 'ï¿½abc'

Three replacements (i with diaeresis, upside-down ?, 1/2) for one bad char looks wrong. With IDLE I get '�abc' (? in hexagon, codepoint 65533). Perhaps something just went wrong to patch from your file to my browser window.

@@ -281,10 +279,10 @@
 built-in :func:`ord` function that takes a one-character Unicode string and
 returns the code point value::

You fixed chr/ord doc, need to fix references thereto in this doc.

-point. The ``\U`` escape sequence is similar, but expects 8 hex digits, not 4::
+point. The ``\U`` escape sequence is similar, but expects eight base 16
+digits, not four::

I really think of them as hex or hexadecimal digits, just as 0-9 are decimal, not base 10 digits.

 >>> s = "a\xac\u1234\u20ac\U00008000"
           ^^^^ two-digit hex escape
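The figures and escape sequences quoted in this review can be checked in a Python 3 interpreter. The following is a minimal sketch; the variable names are illustrative and do not come from the patch.

```python
# Upper bound of the Unicode code space, as quoted from the HOWTO.
MAX_CODE_POINT = 0x10FFFF
max_decimal = int(MAX_CODE_POINT)      # "0 through 1,114,111"

# The HOWTO's example character value.
example_value = 0x12CA                 # 4810 in decimal

# chr() and ord() are inverses over the whole range.
round_trip = ord(chr(MAX_CODE_POINT))

# \x takes two hex digits, \u takes four, and \U takes eight.
s = "a\xac\u1234\u20ac\U00008000"
code_points = [ord(c) for c in s]

# chr() rejects values past the end of the range.
try:
    chr(MAX_CODE_POINT + 1)
    out_of_range_rejected = False
except ValueError:
    out_of_range_rejected = True
```

The string `s` has five characters, one per escape plus the leading "a", regardless of how many hex digits each escape uses.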
msg121490 - (view)	Author: Alexander Belopolsky (belopolsky) *	Date: 2010-11-18 20:00
On Thu, Nov 18, 2010 at 2:41 PM, Terry J. Reedy <report@bugs.python.org> wrote:
..
> I visually parse 0-1,114,111 as 0-1, 114, 111. So I think either the commas
> should be removed or extra spaces are needed: 0-1114111 or 0 - 1,114,111.

What about "0 through 1,114,111"?

> you used (or stayed with) 'hexadecimal' versus 'base 16'. Do we have a standard?
> I think I prefer the former.

I prefer 'base 16'. I thought about changing 'hexadecimal' to 'base 16' in chr/ord docs, but decided to leave it because the term 'hexadecimal' is used elsewhere on the same page notably in hex() function description where it is quite appropriate. No, we don't have a standard. I've also seen "base-16" used elsewhere which I really don't like.

> + 'ï¿½abc'
>
> Three replacements (i with diaeresis, upside-down ?, 1/2) for one bad char looks wrong.

That must be UTF-8 misinterpreted as Latin-1. Won't affect the commit.

> With IDLE I get '�abc' (? in hexagon, codepoint 65533). Perhaps something
> just went wrong to patch from your file to my browser window.

Yes. I get the same on the terminal window and that's what it should look like.

> built-in :func:`ord` function that takes a one-character Unicode string and
> returns the code point value::
>
> You fixed chr/ord doc, need to fix references thereto in this doc.
>

I don't understand. I think "one-character Unicode string" is fine here because "Unicode string" means an abstract Unicode string, not :class:`str`.

> -point. The ``\U`` escape sequence is similar, but expects 8 hex digits, not 4::
> +point. The ``\U`` escape sequence is similar, but expects eight base 16
> +digits, not four::
>
> I really think of them as hex or hexadecimal digits, just as 0-9 are decimal, not base 10 digits.
>

I am fine with "hexadecimal" here. I did not like "hex".
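The "UTF-8 misinterpreted as Latin-1" diagnosis can be reproduced in a few lines. This sketch assumes nothing beyond the standard codecs: it shows why one replacement character turns into three characters when the bytes are read with the wrong encoding.

```python
# Decoding an invalid byte with the "replace" handler yields U+FFFD.
decoded = b"\x80abc".decode("utf-8", "replace")

# U+FFFD occupies three bytes in UTF-8 ...
utf8_bytes = "\ufffd".encode("utf-8")

# ... and reading those bytes as Latin-1 gives three separate characters:
# i with diaeresis, inverted question mark, and one half.
mojibake = utf8_bytes.decode("latin-1")

# The damage is reversible as long as no bytes were lost along the way.
repaired = mojibake.encode("latin-1").decode("utf-8")
```

Because Latin-1 maps every byte to exactly one character, the round trip back to the original string is lossless here.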
msg121491 - (view)	Author: Alexander Belopolsky (belopolsky) *	Date: 2010-11-18 20:12
On Thu, Nov 18, 2010 at 3:00 PM, Alexander Belopolsky <report@bugs.python.org> wrote:
..
>> I really think of them as hex or hexadecimal digits, just as 0-9 are decimal, not base 10 digits.
>>
>
> I am fine with "hexadecimal" here. I did not like "hex".

If you think about it, "hexadecimal digit" is a twice oxymoron because both "decimal" and "digit" imply base 10. :-) It does look like the most widely used term, nevertheless.
msg121495 - (view)	Author: Terry J. Reedy (terry.reedy) *	Date: 2010-11-18 20:47
0 through ... is fine with me. Yes, hex numeral would be more accurate than hex digit.
msg121499 - (view)	Author: Raymond Hettinger (rhettinger) *	Date: 2010-11-19 01:02
> Yes, hex numeral would be more accurate than hex digit.

Stick with hex digit. We've used that phraseology for a long time. See string.hexdigits for example. And "hex numeral" just sounds weird -- it makes me do a double-take to see if there was some special implied meaning.
msg121547 - (view)	Author: Alexander Belopolsky (belopolsky) *	Date: 2010-11-19 16:22
Committed in revision 86530. Thanks Terry and Raymond for your comments.

I would like to keep this issue open (at a low priority) because the question in the title is still relevant. There are many new 3.x features that are not covered, such as the surrogateescape error handler. Such topics may or may not be appropriate for a HOWTO.

There are also some stylistic changes that I would like to consider:

1. Replace verbatim URLs with properly formatted hyperlinked titles of the referenced resources.
2. I couldn't figure out who the original author was. With first-person passages, such as "I remember looking at Apple ][ BASIC programs, .." it may be appropriate to list the original author at the top even if the text has been changed by others over the years. At the very least the Acknowledgements section should start with "This article was originally written by X [on an occasion Y.]"
3. Examples should be properly marked up to allow Sphinx to run them and check the output.
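Point 3 concerns Sphinx markup, which is not shown in this thread. As a rough stand-in, the standard library's doctest module performs the same kind of check: it runs the interpreter examples in a piece of text and compares real output with the expected output. The sketch below assumes a hypothetical function, `euro_code_point`, invented purely for illustration.

```python
import doctest


def euro_code_point():
    """Return the code point of the euro sign (hypothetical example function).

    >>> euro_code_point()
    8364
    >>> hex(euro_code_point())
    '0x20ac'
    """
    return ord("\u20ac")


# Parse the docstring and run its examples, comparing real and expected output.
parser = doctest.DocTestParser()
test = parser.get_doctest(
    euro_code_point.__doc__,                  # text containing the examples
    {"euro_code_point": euro_code_point},     # names visible to the examples
    "euro_code_point",                        # label used in failure reports
    None,                                     # no source file name
    0,                                        # starting line number
)
runner = doctest.DocTestRunner(verbose=False)
results = runner.run(test)
```

`results` is a `(failed, attempted)` pair; a documentation build could fail whenever `results.failed` is non-zero.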
msg121548 - (view)	Author: Éric Araujo (eric.araujo) *	Date: 2010-11-19 16:30
Agreed on 1 and 3. Regarding 2, looking at the early history of the file makes me suspect that amk is the author.
msg143310 - (view)	Author: Ezio Melotti (ezio.melotti) *	Date: 2011-09-01 08:04
After the recent discussions on python-dev I went through the Unicode howto and fixed a few things, then I found this issue so I'm attaching the patch here. The patch addresses mostly markup issues, but it also removes the usage of 'byte string'.

A few more things that should be done:
* clarify some more terms (e.g. codepoints, code units, characters, possibly scalar values etc.);
* mention the differences between narrow and wide builds, including:
  - a discussion about the UCS-2/UTF-16 implementation of narrow builds;
  - something about surrogates and surrogate pairs;
  - effects of slicing and indexing on narrow builds;
  - functions/methods that (don't) accept non-BMP chars on narrow builds;
* something about Unicode support in the re module (this probably can wait after the 'regex' inclusion).

Also the codecs doc has a section about Unicode and encodings that might be moved to the howto.
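The surrogate pairs mentioned in the list above are how UTF-16 represents code points outside the Basic Multilingual Plane (BMP). Narrow builds no longer exist as of Python 3.3, so their indexing behaviour cannot be reproduced on a current interpreter. This sketch encodes to UTF-16 explicitly instead, to show the two code units a narrow build would have stored; the chosen character and variable names are illustrative.

```python
import struct

ch = "\U0001F600"                 # a code point outside the BMP

# UTF-16 represents it as a surrogate pair: two 16-bit code units.
utf16 = ch.encode("utf-16-be")
high, low = struct.unpack(">HH", utf16)

# Rebuild the code point from the pair.
rebuilt = 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)

# On Python 3.3 and later a str is indexed by code point, so the length is 1.
length_in_code_points = len(ch)
length_in_utf16_units = len(utf16) // 2
```

The difference between `length_in_code_points` and `length_in_utf16_units` is what made slicing and indexing of non-BMP characters surprising on narrow builds.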
msg143317 - (view)	Author: Ezio Melotti (ezio.melotti) *	Date: 2011-09-01 10:57
I also left a few comments on rietveld about other things that can be improved. Please reply and comment there.
msg143421 - (view)	Author: Éric Araujo (eric.araujo) *	Date: 2011-09-02 17:13
> something about Unicode support in the re module (this probably can
> wait after the 'regex' inclusion).

I’d prefer documentation for the re module now.
msg143422 - (view)	Author: Éric Araujo (eric.araujo) *	Date: 2011-09-02 17:38
> it also removes the usage of 'byte string'.

I see you’ve replaced it with “byte object”. I’m -0, as “byte[s] string” is not ambiguous IMO.
msg143424 - (view)	Author: Ezio Melotti (ezio.melotti) *	Date: 2011-09-02 17:44
There was some discussion a while ago on python-dev about it. AFAIR the outcome was that using "bytes strings" should be avoided because bytes are bytes, and not strings (until they get decoded at least). Using 'string' for both might lead people to think that there are two kinds of strings, bytes and Unicode (like in Python 2), while they should think that there are only Unicode strings and they can be converted to a bytes object (or simply to 'bytes').
msg143426 - (view)	Author: Éric Araujo (eric.araujo) *	Date: 2011-09-02 17:58
Ah, I see: you’re equating “string” with “text string” or “character string”, whereas I read “bytes string” as “finite sequence of bytes”. With this definition, there are two string types in Python 3, it’s just that they’re much more divorced than in 2.x.

> they should think that there are only Unicode strings

I’d say they should think that text processing should only happen with the one type dedicated to text, i.e. str.

> they can be converted to a bytes object (or simply to 'bytes')

Okay, +0 to use only “bytes object” (or “bytes” when it sounds better).
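The separation between str and bytes that this exchange turns on can be seen directly in Python 3. This is a minimal sketch with illustrative variable names.

```python
text = "caf\xe9"                    # str: a sequence of code points
data = text.encode("utf-8")         # bytes: a sequence of integers 0-255

# The two types have different lengths and different element types.
lengths = (len(text), len(data))
first_items = (text[0], data[0])

# Python 3 refuses to mix them implicitly.
try:
    text + data
    mixed_ok = True
except TypeError:
    mixed_ok = False

# Converting back requires an explicit decode.
same_text = data.decode("utf-8") == text
```

Indexing the str gives a one-character str, while indexing the bytes object gives an int, which is part of the case for not calling both of them "strings".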
msg180283 - (view)	Author: Roundup Robot (python-dev)	Date: 2013-01-20 10:19
New changeset 260a9afd999a by Ezio Melotti in branch '3.2':
#4153: update the Unicode howto.
http://hg.python.org/cpython/rev/260a9afd999a

New changeset 572ca3d35c2f by Ezio Melotti in branch '3.3':
#4153: merge with 3.2.
http://hg.python.org/cpython/rev/572ca3d35c2f

New changeset 034e1e076c77 by Ezio Melotti in branch 'default':
#4153: merge with 3.3.
http://hg.python.org/cpython/rev/034e1e076c77
msg180284 - (view)	Author: Ezio Melotti (ezio.melotti) *	Date: 2013-01-20 10:31
I committed the attached patch with some minor modifications, but there are still comments that should be addressed on Rietveld.
msg180738 - (view)	Author: Nick Coghlan (ncoghlan) *	Date: 2013-01-27 03:27
The section in the HOWTO on Python's unicode support also misses the fact that the easiest way to include a Unicode character in a string literal in Python 3 is to include that character in the string literal (since source code is now treated as UTF-8 by default).
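A short sketch of that point, assuming the source file is saved as UTF-8 (the Python 3 default): a character typed directly into a literal is equal to the same character written with an escape.

```python
# Assumes this file is saved as UTF-8, the default source encoding in Python 3.
direct = "é"                                     # character typed directly
escaped = "\xe9"                                 # two-digit hex escape
named = "\N{LATIN SMALL LETTER E WITH ACUTE}"    # escape by Unicode name

all_equal = direct == escaped == named
code_point = ord(direct)
```

All three spellings produce the same one-character string, so the choice between them is a matter of readability and of how much the editing tools can be trusted to preserve the file's encoding.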
msg180820 - (view)	Author: Ezio Melotti (ezio.melotti) *	Date: 2013-01-28 02:16
As discussed in #13997, the HOWTO should be reorganized to start with a basic introduction and then expand on more advanced topics. See also msg180743 for a couple of essays that could be linked as "see also" or integrated in the HOWTO.
msg190820 - (view)	Author: A.M. Kuchling (akuchling) *	Date: 2013-06-08 19:46
Continuing my tour of the howtos, here's a patch making many of the changes discussed here and on issue13997.

Changes made:
* state that Python 3 source encoding is UTF-8, and give examples
* mention surrogateescape in the 'tips and tricks' section, and backslashreplace in the "Python's Unicode Support" section.
* default filesystem encoding is now UTF-8, not ASCII.
* link to Nick Coghlan's and Ned Batchelder's notes/presentations.
* remove revision history
* remove usage of "I think", "I'm not going to", etc.
* update acks section

Things I did not do, though they were suggested:
* Move tip "Software should only work with Unicode strings internally" from the last section to somewhere earlier and more prominent. Perhaps it could go somewhere in the "Python's Unicode Support" section.
* mention codecs.StreamRecoder and StreamReaderWriter (I could put this in 'tips and tricks').
* Examples should be properly marked up to allow sphinx to run them and check the output. (May not be possible.)
* mention unicode support in re module
* clarify some more terms (e.g. codepoints, code units, characters, possibly scalar values etc.) -- I don't see why they matter, since we don't use them.
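The two error handlers named in the list of changes behave as sketched below. The byte string and variable names are assumptions made up for illustration; the handlers themselves are standard.

```python
import sys

raw = b"caf\xe9.txt"      # Latin-1 encoded name; not valid UTF-8

# surrogateescape smuggles each undecodable byte through as a lone surrogate.
name = raw.decode("utf-8", "surrogateescape")
round_tripped = name.encode("utf-8", "surrogateescape")

# backslashreplace substitutes an escape sequence when encoding fails.
escaped = "caf\xe9".encode("ascii", "backslashreplace")

# The encoding used for file names on this system (platform dependent).
fs_encoding = sys.getfilesystemencoding()
```

With surrogateescape the original bytes survive a decode and encode round trip unchanged, which is why it suits file names of unknown encoding.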
msg190835 - (view)	Author: A.M. Kuchling (akuchling) *	Date: 2013-06-08 22:33
Updated version of my patch, which adds two more todo items and handles Ezio's review comments:
* Switch from Greek examples to French, and remove non-Latin-1 characters.
* Change language for bytes.decode to "but supports a few more possible handlers".
* Describe Unicode support in the re module.
* Describe StreamRecoder. I don't see why StreamReaderWriter would need to be mentioned.

I do not intend to do the remaining items on the todo list (clarify some more terms; make it work with doctest).
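The two features added in this revision, Unicode support in the re module and StreamRecoder, can be sketched as follows. The sample text is an assumption chosen to match the switch to French examples; it is not taken from the patch.

```python
import codecs
import io
import re

text = "d\xe9j\xe0 vu"           # "déjà vu"

# For str patterns, \w matches Unicode word characters by default.
unicode_words = re.findall(r"\w+", text)

# re.ASCII restricts \w to [a-zA-Z0-9_].
ascii_words = re.findall(r"\w+", text, flags=re.ASCII)

# codecs.EncodedFile returns a StreamRecoder: here it reads Latin-1 bytes
# from the underlying stream and hands back UTF-8 bytes.
latin1_stream = io.BytesIO(text.encode("latin-1"))
recoder = codecs.EncodedFile(latin1_stream, data_encoding="utf-8",
                             file_encoding="latin-1")
utf8_bytes = recoder.read()
```

A StreamRecoder is useful when data arrives in one encoding and a consumer expects bytes in another, since it avoids decoding and re-encoding by hand.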
msg190841 - (view)	Author: Nick Coghlan (ncoghlan) *	Date: 2013-06-09 04:06
amk's latest patch looks like a very nice improvement to me. One suggested wording tweak for the aside about the simplified history: s/The average Python programmer doesn't need to know the historical details/The precise historical details aren't relevant to understanding how to use Unicode effectively/ (and then continue with "; if you're curious ..." as it does now)
msg191511 - (view)	Author: Roundup Robot (python-dev)	Date: 2013-06-20 13:46
New changeset 1dbbed06a163 by Andrew Kuchling in branch '3.3':
#4153: update Unicode howto for Python 3.3
http://hg.python.org/cpython/rev/1dbbed06a163
msg191513 - (view)	Author: A.M. Kuchling (akuchling) *	Date: 2013-06-20 14:16
As far as I can tell, there are no other outstanding suggestions for howto updates, so I'll now close this item. Feel free to re-open or file a new item if there are further improvements that can be made.
msg191638 - (view)	Author: Serhiy Storchaka (serhiy.storchaka) *	Date: 2013-06-22 09:11
Most of the changes are applicable to Python 2 too. Do you want to backport part of your patch to 2.7?

History
Date	User	Action	Args
2022-04-11 14:56:40	admin	set	github: 48403
2013-06-22 09:11:32	serhiy.storchaka	set	nosy: + serhiy.storchaka messages: + msg191638
2013-06-20 14:16:18	akuchling	set	status: open -> closed resolution: fixed messages: + msg191513 stage: commit review -> resolved
2013-06-20 13:46:27	python-dev	set	messages: + msg191511
2013-06-09 04:06:14	ncoghlan	set	messages: + msg190841
2013-06-08 22:33:21	akuchling	set	files: + unicode-howto.txt messages: + msg190835
2013-06-08 19:46:46	akuchling	set	files: + unicode-howto.txt messages: + msg190820
2013-01-28 02:20:15	ezio.melotti	link	issue13997 superseder
2013-01-28 02:16:25	ezio.melotti	set	messages: + msg180820
2013-01-27 03:27:52	ncoghlan	set	messages: + msg180738
2013-01-27 02:39:05	cvrebert	set	nosy: + cvrebert
2013-01-20 10:31:46	ezio.melotti	set	messages: + msg180284
2013-01-20 10:20:29	python-dev	set	nosy: + python-dev messages: + msg180283
2012-09-26 17:45:54	ezio.melotti	set	assignee: ezio.melotti
2011-09-17 16:38:18	ezio.melotti	set	nosy: + ncoghlan
2011-09-02 17:58:08	eric.araujo	set	messages: + msg143426
2011-09-02 17:44:30	ezio.melotti	set	messages: + msg143424
2011-09-02 17:38:27	eric.araujo	set	messages: + msg143422
2011-09-02 17:13:56	eric.araujo	set	messages: + msg143421
2011-09-01 10:57:45	ezio.melotti	set	messages: + msg143317
2011-09-01 08:04:12	ezio.melotti	set	files: + issue4153-2.diff versions: + Python 3.3 messages: + msg143310 assignee: georg.brandl -> (no value) resolution: fixed -> (no value) stage: commit review
2010-11-19 16:30:39	eric.araujo	set	messages: + msg121548
2010-11-19 16:22:09	belopolsky	set	priority: normal -> low messages: + msg121547
2010-11-19 01:02:47	rhettinger	set	nosy: + rhettinger messages: + msg121499
2010-11-18 20:47:05	terry.reedy	set	messages: + msg121495
2010-11-18 20:12:20	belopolsky	set	messages: + msg121491
2010-11-18 20:00:24	belopolsky	set	messages: + msg121490
2010-11-18 19:42:12	ezio.melotti	set	nosy: + ezio.melotti
2010-11-18 19:41:54	terry.reedy	set	messages: + msg121488
2010-11-18 16:52:37	eric.araujo	set	nosy: + eric.araujo
2010-11-18 16:48:26	belopolsky	set	nosy: + akuchling
2010-11-18 16:45:37	belopolsky	set	messages: + msg121474
2010-11-18 16:38:50	belopolsky	set	files: - issue4153.diff
2010-11-18 16:38:40	belopolsky	set	files: + issue4153.diff
2010-11-18 16:03:45	belopolsky	set	files: + issue4153.diff keywords: + patch messages: + msg121466
2010-11-18 06:12:58	belopolsky	set	status: closed -> open versions: + Python 3.2, - Python 3.0 nosy: + belopolsky messages: + msg121444
2008-11-22 10:27:32	georg.brandl	set	status: open -> closed resolution: fixed messages: + msg76240
2008-10-20 18:04:00	terry.reedy	create