classification
Title: Unicode HOWTO up to date?
Type:
Stage: resolved
Components: Documentation
Versions: Python 3.2, Python 3.3
process
Status: closed
Resolution: fixed
Dependencies:
Superseder:
Assigned To: ezio.melotti
Nosy List: akuchling, belopolsky, cvrebert, eric.araujo, ezio.melotti, georg.brandl, ncoghlan, python-dev, rhettinger, serhiy.storchaka, terry.reedy
Priority: low
Keywords: patch

Created on 2008-10-20 18:04 by terry.reedy, last changed 2022-04-11 14:56 by admin. This issue is now closed.

Files
File name Uploaded Description
issue4153.diff belopolsky, 2010-11-18 16:38
issue4153-2.diff ezio.melotti, 2011-09-01 08:04 Patch against 3.2
unicode-howto.txt akuchling, 2013-06-08 19:46 Revised patch
unicode-howto.txt akuchling, 2013-06-08 22:33
Messages (28)
msg74999 - (view) Author: Terry J. Reedy (terry.reedy) * (Python committer) Date: 2008-10-20 18:03
The Unicode HOWTO begins with 
"Warning This HOWTO has not yet been updated for Python 3000’s string
object changes."

Without reading in detail, it appears it has been updated, at least
somewhat, and certainly more than I feared from the warning.
"The String Type
Since Python 3.0, the language features a str type that contain Unicode
characters"
and then a section "Converting to Bytes" and a later reference to
bytearrays.

So perhaps the warning is obsolete and should be removed.
Also, the revision history should have at least one more entry for the
3.0 updates, which certainly were entered since 2005.
msg76240 - (view) Author: Georg Brandl (georg.brandl) * (Python committer) Date: 2008-11-22 10:27
Thanks for noting this! The most basic changes had been done, but I had
to revise some sections for changes. Done in r67338.
msg121444 - (view) Author: Alexander Belopolsky (belopolsky) * (Python committer) Date: 2010-11-18 06:12
Reopening because it looks like the fix was reverted in r82301.

"""
This HOWTO discusses Python 2.x’s support for Unicode, and explains various problems that people commonly encounter when trying to work with Unicode. (This HOWTO has not yet been updated to cover the 3.x versions of Python.)
""" http://docs.python.org/dev/howto/unicode.html
msg121466 - (view) Author: Alexander Belopolsky (belopolsky) * (Python committer) Date: 2010-11-18 16:03
The changes added in r82301 are misleading because code examples in this HOWTO have been converted to 3.x.  I am attaching a patch that removes "has not yet been updated to cover the 3.x" warning and makes some minor stylistic changes.

I have bumped the release version to 1.12, but I would like to remove the revision history which is largely irrelevant.
msg121474 - (view) Author: Alexander Belopolsky (belopolsky) * (Python committer) Date: 2010-11-18 16:45
r82301 appears to be a blind merge of r82120 from the trunk.  It is fairly obvious that it was not intentional.
msg121488 - (view) Author: Terry J. Reedy (terry.reedy) * (Python committer) Date: 2010-11-18 19:41
Thanks for persisting with this. Looking at the patch:

@@ -65,7 +63,7 @@
 goal was to have Unicode contain the alphabets for every single human language.
 It turns out that even 16 bits isn't enough to meet that goal, and the modern
 Unicode specification uses a wider range of codes, 0-1,114,111 (0x10ffff in
-base-16).
+base 16).

I visually parse 0-1,114,111 as 0-1, 114, 111. So I think either the commas should be removed or extra spaces are needed: 0-1114111 or 0 - 1,114,111. In your recent (and excellent) chr/ord doc patch, you used (or stayed with) 'hexadecimal' versus 'base 16'. Do we have a standard? I *think* I prefer the former.

-character with value 0x12ca (4810 decimal).  The Unicode standard contains a lot
+character with value 0x12ca (4,810 decimal).  The Unicode standard contains a lot

I prefer without the added comma.

     >>> b'\x80abc'.decode("utf-8", "replace")
-    '\ufffdabc'
+    'ï¿½abc'

Three replacements (i with diaeresis, upside-down ?, 1/2) for one bad char looks wrong. With IDLE I get '�abc' (? in hexagon, codepoint 65533). Perhaps something just went wrong to patch from your file to my browser window.

@@ -281,10 +279,10 @@
 built-in :func:`ord` function that takes a one-character Unicode string and
 returns the code point value::

You fixed chr/ord doc, need to fix references thereto in this doc.

-point.  The ``\U`` escape sequence is similar, but expects 8 hex digits, not 4::
+point.  The ``\U`` escape sequence is similar, but expects eight base 16
+digits, not four::

I really think of them as hex or hexadecimal digits, just as 0-9 are decimal, not base 10 digits.


 
     >>> s = "a\xac\u1234\u20ac\U00008000"
               ^^^^ two-digit hex escape
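
A minimal sketch of the escape forms and numeric values discussed in this review, assuming Python 3. The variable names are illustrative and do not come from the patch; the values follow directly from the hexadecimal digits in each escape.

```python
# \x takes two hex digits, \u takes four, and \U takes eight.
s = "a\xac\u1234\u20ac\U00008000"

# ord() maps each one-character string to its code point.
code_points = [ord(c) for c in s]
print(code_points)  # [97, 172, 4660, 8364, 32768]

# Upper end of the Unicode range quoted in the HOWTO (0x10ffff).
max_code_point = 0x10FFFF
print(max_code_point)  # 1114111

# The example character value from the HOWTO text.
print(0x12CA)  # 4810
```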
msg121490 - (view) Author: Alexander Belopolsky (belopolsky) * (Python committer) Date: 2010-11-18 20:00
On Thu, Nov 18, 2010 at 2:41 PM, Terry J. Reedy <report@bugs.python.org> wrote:
..
> I visually parse 0-1,114,111 as 0-1, 114, 111. So I think either the commas
> should be removed or extra spaces are needed: 0-1114111 or 0 - 1,114,111.

What about "0 through 1,114,111"?

> you used (or stayed with) 'hexadecimal' versus 'base 16'. Do we have a standard?
> I *think* I prefer the former.

I prefer 'base 16'.  I thought about changing 'hexadecimal' to 'base
16' in chr/ord docs, but decided to leave it because the term
'hexadecimal' is used elsewhere on the same page notably in hex()
function description where it is quite appropriate.   No, we don't
have a standard.  I've also seen "base-16" used elsewhere which I
really don't like.

> +    'ï¿½abc'
>
> Three replacements (i with diaeresis, upside-down ?, 1/2) for one bad char looks wrong.

That must be UTF-8 misinterpreted as Latin-1.  Won't affect the commit.

> With IDLE I get '�abc' (? in hexagon, codepoint 65533). Perhaps something
> just went wrong to patch from your file to my browser window.

Yes.  I get the same on the terminal window and that's what it should look like.

>  built-in :func:`ord` function that takes a one-character Unicode string and
>  returns the code point value::
>
> You fixed chr/ord doc, need to fix references thereto in this doc.
>

I don't understand.  I think "one-character Unicode string" is fine
here because "Unicode string" means an abstract Unicode string, not
:class:`str`.

> -point.  The ``\U`` escape sequence is similar, but expects 8 hex digits, not 4::
> +point.  The ``\U`` escape sequence is similar, but expects eight base 16
> +digits, not four::
>
> I really think of them as hex or hexadecimal digits, just as 0-9 are decimal, not base 10 digits.
>

I am fine with "hexadecimal" here.  I did not like "hex".
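
The "UTF-8 misinterpreted as Latin-1" diagnosis above can be reproduced with a short sketch, assuming Python 3. The variable names are illustrative only.

```python
# One invalid byte decodes to exactly one U+FFFD REPLACEMENT CHARACTER.
decoded = b"\x80abc".decode("utf-8", "replace")
print(ascii(decoded))  # '\ufffdabc'
print(ord(decoded[0]))  # 65533

# U+FFFD occupies three bytes in UTF-8 ...
utf8_bytes = "\ufffd".encode("utf-8")
print(utf8_bytes)  # b'\xef\xbf\xbd'

# ... and reading those three bytes as Latin-1 yields three unrelated
# characters: i with diaeresis, inverted question mark, one half.
mojibake = utf8_bytes.decode("latin-1")
print(ascii(mojibake))  # '\xef\xbf\xbd'
```

This is why a viewer that assumes Latin-1 shows three characters where the patch contains only one.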
msg121491 - (view) Author: Alexander Belopolsky (belopolsky) * (Python committer) Date: 2010-11-18 20:12
On Thu, Nov 18, 2010 at 3:00 PM, Alexander Belopolsky
<report@bugs.python.org> wrote:
..
>> I really think of them as hex or hexadecimal digits, just as 0-9 are decimal, not base 10 digits.
>>
>
> I am fine with "hexadecimal" here.  I did not like "hex".

If you think about it, "hexadecimal digit" is a twice oxymoron because
both "decimal" and "digit" imply base 10. :-)  It does look like the
most widely used term, nevertheless.
msg121495 - (view) Author: Terry J. Reedy (terry.reedy) * (Python committer) Date: 2010-11-18 20:47
0 through ... is fine with me.

Yes, hex numeral would be more accurate than hex digit.
msg121499 - (view) Author: Raymond Hettinger (rhettinger) * (Python committer) Date: 2010-11-19 01:02
> Yes, hex numeral would be more accurate than hex digit.

Stick with hex digit.   We've used that phraseology for a long time.  See string.hexdigits for example.  And "hex numeral" just sounds weird -- it makes me do a double-take to see if there was some special implied meaning.
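
For reference, a small sketch of the `string.hexdigits` constant mentioned above, assuming Python 3. The helper `is_hex_digits` is hypothetical and exists only for illustration.

```python
import string

# The standard library already uses the "hex digit" wording.
print(string.hexdigits)  # 0123456789abcdefABCDEF


def is_hex_digits(text):
    """Return True if text is non-empty and consists only of hex digits."""
    return bool(text) and all(ch in string.hexdigits for ch in text)


print(is_hex_digits("10ffff"))  # True
print(int("10ffff", 16))  # 1114111
```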
msg121547 - (view) Author: Alexander Belopolsky (belopolsky) * (Python committer) Date: 2010-11-19 16:22
Committed in revision 86530. Thanks Terry and Raymond for your comments.  I would like to keep this issue open (at a low priority) because the question in the title is still relevant.  There are many new 3.x features that are not covered, such as the surrogateescape error handler.  Such topics may or may not be appropriate for a HOWTO.  There are also some stylistic changes that I would like to consider:

1. Replace verbatim URLs with properly formatted hyperlinked titles of the referenced resources.

2. I couldn't figure out who the original author was. With first person passages, such as "I remember looking at Apple ][ BASIC programs, .." it may be appropriate to list the original author at the top even if the text has been changed by others over the years.  At the very least the Acknowledgements section should start with "This article was originally written by X [on an occasion Y.]"

3. Examples should be properly marked up to allow sphinx to run them and check the output.
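
As an illustration of the surrogateescape error handler mentioned above, here is a minimal sketch assuming Python 3. The byte string is an invented example of data that is not valid UTF-8.

```python
# Latin-1 encoded bytes; the lone \xe9 is not valid UTF-8.
raw = b"caf\xe9.txt"

# surrogateescape smuggles each undecodable byte through as a lone
# surrogate in the range U+DC80..U+DCFF instead of raising an error.
text = raw.decode("utf-8", "surrogateescape")
print(ascii(text))  # 'caf\udce9.txt'

# Encoding with the same handler restores the original bytes exactly.
restored = text.encode("utf-8", "surrogateescape")
print(restored == raw)  # True

# Without the handler, the lone surrogate cannot be encoded.
try:
    text.encode("utf-8")
except UnicodeEncodeError:
    strict_encode_fails = True
else:
    strict_encode_fails = False
print(strict_encode_fails)  # True
```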
msg121548 - (view) Author: Éric Araujo (eric.araujo) * (Python committer) Date: 2010-11-19 16:30
Agreed on 1 and 3.  Regarding 2, looking at the early history of the file makes me suspect that amk is the author.
msg143310 - (view) Author: Ezio Melotti (ezio.melotti) * (Python committer) Date: 2011-09-01 08:04
After the recent discussions on python-dev I went through the Unicode howto and fixed a few things, then I found this issue so I'm attaching the patch here.
The patch addresses mostly markup issues, but it also removes the usage of 'byte string'.
A few more things that should be done:
  * clarify some more terms (e.g. codepoints, code units, characters, possibly scalar values etc.);
  * mention the differences between narrow and wide builds, including:
    - a discussion about the UCS-2/UTF-16 implementation of narrow builds;
    - something about surrogates and surrogate pairs;
    - effects of slicing and indexing on narrow builds;
    - functions/methods that (don't) accept non-BMP chars on narrow builds;
  * something about Unicode support in the re module (this probably can wait after the 'regex' inclusion).

Also the codecs doc has a section about Unicode and encodings that might be moved to the howto.
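
The surrogate pair arithmetic behind the narrow-build items above can be sketched as follows. This assumes Python 3.3 or later, where PEP 393 removed the narrow/wide build distinction, so the pair is only visible in the UTF-16 encoding. The helper `to_surrogate_pair` is hypothetical.

```python
def to_surrogate_pair(code_point):
    """Split a non-BMP code point into UTF-16 high and low surrogates."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("code point is inside the BMP or out of range")
    offset = code_point - 0x10000
    high = 0xD800 + (offset >> 10)    # top 10 bits of the offset
    low = 0xDC00 + (offset & 0x3FF)   # bottom 10 bits of the offset
    return high, low


ch = "\U0001F600"
high, low = to_surrogate_pair(ord(ch))
print(hex(high), hex(low))  # 0xd83d 0xde00

# The UTF-16 encoding contains the same two 16-bit code units.
utf16 = ch.encode("utf-16-be")
print(utf16)

# On Python 3.3+ the string itself holds a single code point.
print(len(ch))  # 1
```

On a narrow build of Python 3.2 or earlier, `len(ch)` would have been 2 and indexing would have returned the individual surrogates, which is the behaviour the list above proposes documenting.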
msg143317 - (view) Author: Ezio Melotti (ezio.melotti) * (Python committer) Date: 2011-09-01 10:57
I also left a few comments on rietveld about other things that can be improved.  Please reply and comment there.
msg143421 - (view) Author: Éric Araujo (eric.araujo) * (Python committer) Date: 2011-09-02 17:13
> something about Unicode supports in the re module (this probably can
> wait after the 'regex' inclusion).
I’d prefer documentation for the re module now.
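
A brief sketch of the kind of re module behaviour such documentation would cover, assuming Python 3. The sample text is invented for illustration.

```python
import re

text = "caf\xe9 na\xefve"  # "café naïve"

# For str patterns, \w matches Unicode word characters by default.
unicode_words = re.findall(r"\w+", text)
print(unicode_words)

# With re.ASCII, \w is restricted to [a-zA-Z0-9_], so the accented
# letters split the words.
ascii_words = re.findall(r"\w+", text, flags=re.ASCII)
print(ascii_words)  # ['caf', 'na', 've']
```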
msg143422 - (view) Author: Éric Araujo (eric.araujo) * (Python committer) Date: 2011-09-02 17:38
> it also removes the usage of 'byte string'.
I see you’ve replaced it with “byte object”.  I’m -0, as “byte[s] string” is not ambiguous IMO.
msg143424 - (view) Author: Ezio Melotti (ezio.melotti) * (Python committer) Date: 2011-09-02 17:44
There was some discussion a while ago on python-dev about it.  AFAIR the outcome was that using "bytes *strings*" should be avoided because bytes are bytes, and not strings (until they get decoded at least).  Using 'string' for both might lead people to think that there are two kinds of strings, bytes and Unicode (like in Python 2), while they should think that there are only Unicode strings and they can be converted to a bytes object (or simply to 'bytes').
msg143426 - (view) Author: Éric Araujo (eric.araujo) * (Python committer) Date: 2011-09-02 17:58
Ah, I see: you’re equating “string” with “text string” or “character string”, whereas I read “bytes string” as “finite sequence of bytes”.  With this definition, there *are* two string types in Python 3, it’s just that they’re much more divorced than in 2.x.

> they should think that there are only Unicode strings
I’d say they should think that text processing should only happen with the one type dedicated to text, i.e. str.

> they can be converted to a bytes object (or simply to 'bytes')
Okay, +0 to use only “bytes object” (or “bytes” when it sounds better).
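
The distinction being discussed can be shown with a minimal sketch, assuming Python 3. Variable names are illustrative.

```python
text = "na\xefve"            # str: a sequence of Unicode code points
data = text.encode("utf-8")  # bytes: a sequence of integers 0..255

print(type(text).__name__, type(data).__name__)  # str bytes
print(data)                                      # b'na\xc3\xafve'
print(data[0])                                   # 110, an int, not a str

# Decoding converts the bytes object back to text.
round_trip = data.decode("utf-8")

# The two types do not mix implicitly in Python 3.
try:
    text + data
except TypeError:
    mixing_fails = True
else:
    mixing_fails = False
print(mixing_fails)  # True
```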
msg180283 - (view) Author: Roundup Robot (python-dev) (Python triager) Date: 2013-01-20 10:19
New changeset 260a9afd999a by Ezio Melotti in branch '3.2':
#4153: update the Unicode howto.
http://hg.python.org/cpython/rev/260a9afd999a

New changeset 572ca3d35c2f by Ezio Melotti in branch '3.3':
#4153: merge with 3.2.
http://hg.python.org/cpython/rev/572ca3d35c2f

New changeset 034e1e076c77 by Ezio Melotti in branch 'default':
#4153: merge with 3.3.
http://hg.python.org/cpython/rev/034e1e076c77
msg180284 - (view) Author: Ezio Melotti (ezio.melotti) * (Python committer) Date: 2013-01-20 10:31
I committed the attached patch with some minor modifications, but there are still comments that should be addressed on Rietveld.
msg180738 - (view) Author: Nick Coghlan (ncoghlan) * (Python committer) Date: 2013-01-27 03:27
The section in the HOWTO on Python's unicode support also misses the fact that the easiest way to include a Unicode character in a string literal in Python 3 is to *include that character in the string literal* (since source code is now treated as UTF-8 by default).
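
A minimal sketch of this point, assuming Python 3 and a source file saved as UTF-8 (the default source encoding). All four spellings below produce the same one-character string.

```python
literal = "é"  # the character typed directly into the source
hex_escape = "\xe9"
unicode_escape = "\u00e9"
named_escape = "\N{LATIN SMALL LETTER E WITH ACUTE}"

print(literal == hex_escape == unicode_escape == named_escape)
print(len(literal), hex(ord(hex_escape)))
```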
msg180820 - (view) Author: Ezio Melotti (ezio.melotti) * (Python committer) Date: 2013-01-28 02:16
As discussed in #13997, the HOWTO should be reorganized to start with a basic introduction and then expand on more advanced topics.

See also msg180743 for a couple of essays that could be linked as "see also" or integrated in the HOWTO.
msg190820 - (view) Author: A.M. Kuchling (akuchling) * (Python committer) Date: 2013-06-08 19:46
Continuing my tour of the howtos, here's a patch making many of the changes discussed here and on issue13997.  Changes made:

* state that the Python 3 source encoding is UTF-8, and give examples

* mention surrogateescape in the 'tips and tricks' section, and backslashreplace in the "Python's Unicode Support" section.

* default filesystem encoding is now UTF-8, not ascii.

* link to Nick Coghlan's and Ned Batchelder's notes/presentations.

* remove revision history

* remove usage of "I think", "I'm not going to", etc.

* update acks section

Things I did *not* do, though they were suggested:

* Move tip "Software should only work with Unicode strings internally" from the last section to somewhere earlier and more prominent.  Perhaps it could go somewhere in the "Python's Unicode Support" section.

* mention codecs.StreamRecoder and StreamReaderWriter (I could put this in 'tips and tricks').

* Examples should be properly marked up to allow sphinx to run them and check the output.  (May not be possible.)

* mention unicode support in re module

* clarify some more terms (e.g. codepoints, code units, characters, possibly scalar values etc.) -- I don't see why they matter, since we don't use them.
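
For context on the backslashreplace handler mentioned in the list of changes, here is a minimal sketch assuming Python 3. The sample text is invented and is not taken from the patch.

```python
text = "caf\xe9 \u20ac"  # "café €"

# backslashreplace keeps unencodable characters as readable escapes.
escaped = text.encode("ascii", "backslashreplace")
print(escaped)  # b'caf\\xe9 \\u20ac'

# For comparison, 'replace' discards the information and emits '?'.
replaced = text.encode("ascii", "replace")
print(replaced)  # b'caf? ?'
```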
msg190835 - (view) Author: A.M. Kuchling (akuchling) * (Python committer) Date: 2013-06-08 22:33
Updated version of my patch, which adds two more todo items and handles Ezio's review comments:

* Switch from Greek examples to French, and remove non-Latin-1 characters.

* Change language for bytes.decode to "but supports a few more possible handlers".

* Describe Unicode support in the re module.

* Describe StreamRecoder.  I don't see why StreamReaderWriter would need to be mentioned.

I do not intend to do the remaining items on the todo list (clarify some more terms; make it work with doctest).
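
A minimal sketch of what StreamRecoder does, assuming Python 3. It uses `codecs.EncodedFile`, which returns a `StreamRecoder`, over an in-memory stream; the sample data is invented and this is not the example from the patch.

```python
import codecs
import io

# A binary stream whose contents are Latin-1 encoded.
latin1_stream = io.BytesIO("caf\xe9".encode("latin-1"))

# Wrap it so that reads return the same text re-encoded as UTF-8.
recoder = codecs.EncodedFile(
    latin1_stream, data_encoding="utf-8", file_encoding="latin-1"
)
print(type(recoder).__name__)  # StreamRecoder

utf8_data = recoder.read()
print(utf8_data)  # b'caf\xc3\xa9'
```

The recoder works on bytes at both ends: it decodes from the file encoding and re-encodes to the data encoding, so no str object is handed to the caller.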
msg190841 - (view) Author: Nick Coghlan (ncoghlan) * (Python committer) Date: 2013-06-09 04:06
amk's latest patch looks like a very nice improvement to me.

One suggested wording tweak for the aside about the simplified
history: s/The average Python programmer doesn't need to know the
historical details/The precise historical details aren't relevant to
understanding how to use Unicode effectively/ (and then continue with
"; if you're curious ..." as it does now)
msg191511 - (view) Author: Roundup Robot (python-dev) (Python triager) Date: 2013-06-20 13:46
New changeset 1dbbed06a163 by Andrew Kuchling in branch '3.3':
#4153: update Unicode howto for Python 3.3
http://hg.python.org/cpython/rev/1dbbed06a163
msg191513 - (view) Author: A.M. Kuchling (akuchling) * (Python committer) Date: 2013-06-20 14:16
As far as I can tell, there are no other outstanding suggestions for howto updates, so I'll now close this item.  Feel free to re-open or file a new item if there are further improvements that can be made.
msg191638 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2013-06-22 09:11
Most of the changes are applicable to Python 2 too. Do you want to backport part of your patch to 2.7?
History
Date User Action Args
2022-04-11 14:56:40 admin set github: 48403
2013-06-22 09:11:32 serhiy.storchaka set nosy: + serhiy.storchaka; messages: + msg191638
2013-06-20 14:16:18 akuchling set status: open -> closed; resolution: fixed; messages: + msg191513; stage: commit review -> resolved
2013-06-20 13:46:27 python-dev set messages: + msg191511
2013-06-09 04:06:14 ncoghlan set messages: + msg190841
2013-06-08 22:33:21 akuchling set files: + unicode-howto.txt; messages: + msg190835
2013-06-08 19:46:46 akuchling set files: + unicode-howto.txt; messages: + msg190820
2013-01-28 02:20:15 ezio.melotti link issue13997 superseder
2013-01-28 02:16:25 ezio.melotti set messages: + msg180820
2013-01-27 03:27:52 ncoghlan set messages: + msg180738
2013-01-27 02:39:05 cvrebert set nosy: + cvrebert
2013-01-20 10:31:46 ezio.melotti set messages: + msg180284
2013-01-20 10:20:29 python-dev set nosy: + python-dev; messages: + msg180283
2012-09-26 17:45:54 ezio.melotti set assignee: ezio.melotti
2011-09-17 16:38:18 ezio.melotti set nosy: + ncoghlan
2011-09-02 17:58:08 eric.araujo set messages: + msg143426
2011-09-02 17:44:30 ezio.melotti set messages: + msg143424
2011-09-02 17:38:27 eric.araujo set messages: + msg143422
2011-09-02 17:13:56 eric.araujo set messages: + msg143421
2011-09-01 10:57:45 ezio.melotti set messages: + msg143317
2011-09-01 08:04:12 ezio.melotti set files: + issue4153-2.diff; versions: + Python 3.3; messages: + msg143310; assignee: georg.brandl -> (no value); resolution: fixed -> (no value); stage: commit review
2010-11-19 16:30:39 eric.araujo set messages: + msg121548
2010-11-19 16:22:09 belopolsky set priority: normal -> low; messages: + msg121547
2010-11-19 01:02:47 rhettinger set nosy: + rhettinger; messages: + msg121499
2010-11-18 20:47:05 terry.reedy set messages: + msg121495
2010-11-18 20:12:20 belopolsky set messages: + msg121491
2010-11-18 20:00:24 belopolsky set messages: + msg121490
2010-11-18 19:42:12 ezio.melotti set nosy: + ezio.melotti
2010-11-18 19:41:54 terry.reedy set messages: + msg121488
2010-11-18 16:52:37 eric.araujo set nosy: + eric.araujo
2010-11-18 16:48:26 belopolsky set nosy: + akuchling
2010-11-18 16:45:37 belopolsky set messages: + msg121474
2010-11-18 16:38:50 belopolsky set files: - issue4153.diff
2010-11-18 16:38:40 belopolsky set files: + issue4153.diff
2010-11-18 16:03:45 belopolsky set files: + issue4153.diff; keywords: + patch; messages: + msg121466
2010-11-18 06:12:58 belopolsky set status: closed -> open; versions: + Python 3.2, - Python 3.0; nosy: + belopolsky; messages: + msg121444
2008-11-22 10:27:32 georg.brandl set status: open -> closed; resolution: fixed; messages: + msg76240
2008-10-20 18:04:00 terry.reedy create