msg74999 |
Author: Terry J. Reedy (terry.reedy) * |
Date: 2008-10-20 18:03 |
The Unicode HOWTO begins with
"Warning This HOWTO has not yet been updated for Python 3000’s string
object changes."
Without reading in detail, it appears it has been updated, at least
somewhat, and certainly more than I feared from the warning.
"The String Type
Since Python 3.0, the language features a str type that contain Unicode
characters"
and then a section "Converting to Bytes" and a later reference to
bytearrays.
So perhaps the warning is obsolete and should be removed.
Also, the revision history should have at least one more entry for the
3.0 updates, which certainly were entered since 2005.
|
msg76240 |
Author: Georg Brandl (georg.brandl) * |
Date: 2008-11-22 10:27 |
Thanks for noting this! The most basic changes had been done, but I had
to revise some sections for changes. Done in r67338.
|
msg121444 |
Author: Alexander Belopolsky (belopolsky) * |
Date: 2010-11-18 06:12 |
Reopening because it looks like the fix was reverted in r82301.
"""
This HOWTO discusses Python 2.x’s support for Unicode, and explains various problems that people commonly encounter when trying to work with Unicode. (This HOWTO has not yet been updated to cover the 3.x versions of Python.)
""" http://docs.python.org/dev/howto/unicode.html
|
msg121466 |
Author: Alexander Belopolsky (belopolsky) * |
Date: 2010-11-18 16:03 |
The changes added in r82301 are misleading because code examples in this HOWTO have been converted to 3.x. I am attaching a patch that removes "has not yet been updated to cover the 3.x" warning and makes some minor stylistic changes.
I have bumped the release version to 1.12, but I would like to remove the revision history which is largely irrelevant.
|
msg121474 |
Author: Alexander Belopolsky (belopolsky) * |
Date: 2010-11-18 16:45 |
r82301 appears to be a blind merge of r82120 from the trunk. It is fairly obvious that it was not intentional.
|
msg121488 |
Author: Terry J. Reedy (terry.reedy) * |
Date: 2010-11-18 19:41 |
Thanks for persisting with this. Looking at the patch:
@@ -65,7 +63,7 @@
goal was to have Unicode contain the alphabets for every single human language.
It turns out that even 16 bits isn't enough to meet that goal, and the modern
Unicode specification uses a wider range of codes, 0-1,114,111 (0x10ffff in
-base-16).
+base 16).
I visually parse 0-1,114,111 as 0-1, 114, 111. So I think either the commas should be removed or extra spaces are needed: 0-1114111 or 0 - 1,114,111. In your recent (and excellent) chr/ord doc patch, you used (or stayed with) 'hexadecimal' versus 'base 16'. Do we have a standard? I *think* I prefer the former.
-character with value 0x12ca (4810 decimal). The Unicode standard contains a lot
+character with value 0x12ca (4,810 decimal). The Unicode standard contains a lot
I prefer without the added comma.
>>> b'\x80abc'.decode("utf-8", "replace")
- '\ufffdabc'
+ 'ï¿½abc'
Three replacements (i with diaeresis, upside-down ?, 1/2) for one bad char looks wrong. With IDLE I get '�abc' (? in hexagon, codepoint 65533). Perhaps something just went wrong to patch from your file to my browser window.
@@ -281,10 +279,10 @@
built-in :func:`ord` function that takes a one-character Unicode string and
returns the code point value::
You fixed chr/ord doc, need to fix references thereto in this doc.
-point. The ``\U`` escape sequence is similar, but expects 8 hex digits, not 4::
+point. The ``\U`` escape sequence is similar, but expects eight base 16
+digits, not four::
I really think of them as hex or hexadecimal digits, just as 0-9 are decimal, not base 10 digits.
>>> s = "a\xac\u1234\u20ac\U00008000"
^^^^ two-digit hex escape
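For reference, a minimal sketch of the escape forms and the ord()/chr() round trip touched by these hunks. The variable names are illustrative and not taken from the patch, and the maxunicode value assumes Python 3.3 or later.

```python
import sys

# \x takes two hex digits, \u takes four, \U takes eight.
s = "a\xac\u1234\u20ac\U00008000"

# ord() maps each one-character string to its code point value.
code_points = [ord(ch) for ch in s]
print(code_points)

# chr() is the inverse of ord(), so the string can be rebuilt.
round_trip = "".join(chr(cp) for cp in code_points)
print(round_trip == s)

# Largest valid code point (0x10ffff, i.e. 1114111, on Python 3.3+).
print(hex(sys.maxunicode))
```

The list of code points has five entries because each escape, whatever its width, produces exactly one character.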
|
msg121490 |
Author: Alexander Belopolsky (belopolsky) * |
Date: 2010-11-18 20:00 |
On Thu, Nov 18, 2010 at 2:41 PM, Terry J. Reedy <report@bugs.python.org> wrote:
..
> I visually parse 0-1,114,111 as 0-1, 114, 111. So I think either the commas
> should be removed or extra spaces are needed: 0-1114111 or 0 - 1,114,111.
What about "0 through 1,114,111"?
> you used (or stayed with) 'hexadecimal' versus 'base 16'. Do we have a standard?
> I *think* I prefer the former.
I prefer 'base 16'. I thought about changing 'hexadecimal' to 'base
16' in chr/ord docs, but decided to leave it because the term
'hexadecimal' is used elsewhere on the same page notably in hex()
function description where it is quite appropriate. No, we don't
have a standard. I've also seen "base-16" used elsewhere which I
really don't like.
> + 'ï¿½abc'
>
> Three replacements (i with diaeresis, upside-down ?, 1/2) for one bad char looks wrong.
That must be UTF-8 misinterpreted as Latin-1. Won't affect the commit.
> With IDLE I get '�abc' (? in hexagon, codepoint 65533). Perhaps something
> just went wrong to patch from your file to my browser window.
Yes. I get the same on the terminal window and that's what it should look like.
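The "UTF-8 misinterpreted as Latin-1" effect is easy to reproduce. This is only a sketch with made-up variable names, but it shows why one replacement character turns into three in a browser that guesses the wrong encoding:

```python
# Decoding an invalid byte with the "replace" handler yields U+FFFD.
bad = b'\x80abc'
decoded = bad.decode("utf-8", "replace")
print(ascii(decoded))

# U+FFFD occupies three bytes in UTF-8 ...
utf8_bytes = decoded.encode("utf-8")
print(utf8_bytes)

# ... and reading those bytes as Latin-1 gives three separate characters
# (i with diaeresis, inverted question mark, one half).
mojibake = utf8_bytes.decode("latin-1")
print(ascii(mojibake))
```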
> built-in :func:`ord` function that takes a one-character Unicode string and
> returns the code point value::
>
> You fixed chr/ord doc, need to fix references thereto in this doc.
>
I don't understand. I think "one-character Unicode string" is fine
here because "Unicode string" means an abstract Unicode string, not
:class:`str`.
> -point. The ``\U`` escape sequence is similar, but expects 8 hex digits, not 4::
> +point. The ``\U`` escape sequence is similar, but expects eight base 16
> +digits, not four::
>
> I really think of them as hex or hexadecimal digits, just as 0-9 are decimal, not base 10 digits.
>
I am fine with "hexadecimal" here. I did not like "hex".
|
msg121491 |
Author: Alexander Belopolsky (belopolsky) * |
Date: 2010-11-18 20:12 |
On Thu, Nov 18, 2010 at 3:00 PM, Alexander Belopolsky
<report@bugs.python.org> wrote:
..
>> I really think of them as hex or hexadecimal digits, just as 0-9 are decimal, not base 10 digits.
>>
>
> I am fine with "hexadecimal" here. I did not like "hex".
If you think about it, "hexadecimal digit" is a twice oxymoron because
both "decimal" and "digit" imply base 10. :-) It does look like the
most widely used term, nevertheless.
|
msg121495 |
Author: Terry J. Reedy (terry.reedy) * |
Date: 2010-11-18 20:47 |
0 through ... is fine with me.
Yes, hex numeral would be more accurate than hex digit.
|
msg121499 |
Author: Raymond Hettinger (rhettinger) * |
Date: 2010-11-19 01:02 |
> Yes, hex numeral would be more accurate than hex digit.
Stick with hex digit. We've used that phraseology for a long time. See string.hexdigits for example. And "hex numeral" just sounds weird -- it makes me do a double-take to see if there was some special implied meaning.
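For anyone unfamiliar with the constant mentioned above, a small sketch; is_hex_digits is a hypothetical helper written for illustration, not something in the standard library:

```python
import string

# string.hexdigits holds the characters accepted as hex digits.
print(string.hexdigits)

def is_hex_digits(text):
    """Return True if text is non-empty and made up only of hex digits."""
    return bool(text) and all(ch in string.hexdigits for ch in text)

print(is_hex_digits("10ffff"))
print(is_hex_digits("12g4"))

# int() with base 16 converts such a string to its numeric value.
print(int("10ffff", 16))
```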
|
msg121547 |
Author: Alexander Belopolsky (belopolsky) * |
Date: 2010-11-19 16:22 |
Committed in revision 86530. Thanks Terry and Raymond for your comments. I would like to keep this issue open (at a low priority) because the question in the title is still relevant. There are many new 3.x features that are not covered, such as the surrogateescape error handler. Such topics may or may not be appropriate for a HOWTO. There are also some stylistic changes that I would like to consider:
1. Replace verbatim URLs with properly formatted hyperlinked titles of the referenced resources.
2. I couldn't figure out who the original author was. With first person passages, such as "I remember looking at Apple ][ BASIC programs, .." it may be appropriate to list the original author at the top even if the text has been changed by others over the years. At the very least the Acknowledgements section should start with "This article was originally written by X [on an occasion Y.]"
3. Examples should be properly marked up to allow sphinx to run them and check the output.
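As a rough illustration of what item 3 buys, here is a sketch that runs interactive examples through the standard doctest module and compares the output. It does not show the Sphinx markup itself; SAMPLE is made-up text standing in for a HOWTO example, and check_examples is a hypothetical helper.

```python
import doctest

# Hypothetical sample, standing in for an example from the HOWTO.
SAMPLE = r"""
>>> ord('\u20ac')
8364
>>> chr(8364) == '\u20ac'
True
"""

def check_examples(text, name="howto_sample"):
    """Run the interactive examples in text; return (failed, attempted)."""
    parser = doctest.DocTestParser()
    test = parser.get_doctest(text, {}, name, None, 0)
    runner = doctest.DocTestRunner(verbose=False)
    results = runner.run(test)
    return results.failed, results.attempted

failed, attempted = check_examples(SAMPLE)
print(failed, attempted)
```

An example whose printed output drifts from what the interpreter really produces would show up as a non-zero failure count.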
|
msg121548 |
Author: Éric Araujo (eric.araujo) * |
Date: 2010-11-19 16:30 |
Agreed on 1 and 3. Regarding 2, looking at the early history of the file makes me suspect that amk is the author.
|
msg143310 |
Author: Ezio Melotti (ezio.melotti) * |
Date: 2011-09-01 08:04 |
After the recent discussions on python-dev I went through the Unicode howto and fixed a few things, then I found this issue so I'm attaching the patch here.
The patch addresses mostly markup issues, but it also removes the usage of 'byte string'.
A few more things that should be done:
* clarify some more terms (e.g. codepoints, code units, characters, possibly scalar values etc.);
* mention the differences between narrow and wide builds, including:
- a discussion about the UCS-2/UTF-16 implementation of narrow builds;
- something about surrogates and surrogate pairs;
- effects of slicing and indexing on narrow builds;
- functions/methods that (don't) accept non-BMP chars on narrow builds;
* something about Unicode support in the re module (this probably can wait after the 'regex' inclusion).
Also the codecs doc has a section about Unicode and encodings that might be moved to the howto.
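To make the surrogate items concrete, a sketch of how a non-BMP character maps to a UTF-16 surrogate pair. The chosen character is arbitrary, and the len() result assumes Python 3.3 or later, where the narrow/wide distinction no longer exists; on an older narrow build len() would report 2.

```python
import struct

ch = "\U0001F600"  # an arbitrary code point outside the BMP

# One code point on Python 3.3+; old narrow builds stored two code units.
length = len(ch)
print(length)

# UTF-16 encodes the code point as two 16-bit code units (a surrogate pair).
high, low = struct.unpack(">2H", ch.encode("utf-16-be"))
print(hex(high), hex(low))

# Recombining the pair gives back the original code point.
combined = 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)
print(hex(combined))
```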
|
msg143317 |
Author: Ezio Melotti (ezio.melotti) * |
Date: 2011-09-01 10:57 |
I also left a few comments on rietveld about other things that can be improved. Please reply and comment there.
|
msg143421 |
Author: Éric Araujo (eric.araujo) * |
Date: 2011-09-02 17:13 |
> something about Unicode support in the re module (this probably can
> wait after the 'regex' inclusion).
I’d prefer documentation for the re module now.
|
msg143422 |
Author: Éric Araujo (eric.araujo) * |
Date: 2011-09-02 17:38 |
> it also removes the usage of 'byte string'.
I see you’ve replaced it with “byte object”. I’m -0, as “byte[s] string” is not ambiguous IMO.
|
msg143424 |
Author: Ezio Melotti (ezio.melotti) * |
Date: 2011-09-02 17:44 |
There was some discussion a while ago on python-dev about it. AFAIR the outcome was that using "bytes *strings*" should be avoided because bytes are bytes, and not strings (until they get decoded at least). Using 'string' for both might lead people to think that there are two kinds of strings, bytes and Unicode (like in Python 2) while they should think that there are only Unicode strings and they can be converted to a bytes object (or simply to 'bytes').
|
msg143426 |
Author: Éric Araujo (eric.araujo) * |
Date: 2011-09-02 17:58 |
Ah, I see: you’re equating “string” with “text string” or “character string”, whereas I read “bytes string” as “finite sequence of bytes”. With this definition, there *are* two string types in Python 3, it’s just that they’re much more divorced than in 2.x.
> they should think that there are only Unicode strings
I’d say they should think that text processing should only happen with the one type dedicated to text, i.e. str.
> they can be converted to a bytes object (or simply to 'bytes')
Okay, +0 to use only “bytes object” (or “bytes” when it sounds better).
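Whatever the naming, the separation under discussion looks like this in Python 3. A minimal sketch; the sample word is arbitrary.

```python
text = "caf\u00e9"           # str: a sequence of code points
data = text.encode("utf-8")  # bytes: the encoded form

print(type(text).__name__, len(text))
print(type(data).__name__, len(data))

# Decoding with the same codec recovers the text.
print(data.decode("utf-8") == text)

# Unlike Python 2, the two types do not mix implicitly.
try:
    text + data
except TypeError:
    mixing_raises = True
else:
    mixing_raises = False
print(mixing_raises)
```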
|
msg180283 |
Author: Roundup Robot (python-dev) |
Date: 2013-01-20 10:19 |
New changeset 260a9afd999a by Ezio Melotti in branch '3.2':
#4153: update the Unicode howto.
http://hg.python.org/cpython/rev/260a9afd999a
New changeset 572ca3d35c2f by Ezio Melotti in branch '3.3':
#4153: merge with 3.2.
http://hg.python.org/cpython/rev/572ca3d35c2f
New changeset 034e1e076c77 by Ezio Melotti in branch 'default':
#4153: merge with 3.3.
http://hg.python.org/cpython/rev/034e1e076c77
|
msg180284 |
Author: Ezio Melotti (ezio.melotti) * |
Date: 2013-01-20 10:31 |
I committed the attached patch with some minor modifications, but there are still comments that should be addressed on Rietveld.
|
msg180738 |
Author: Alyssa Coghlan (ncoghlan) * |
Date: 2013-01-27 03:27 |
The section in the HOWTO on Python's unicode support also misses the fact that the easiest way to include a Unicode character in a string literal in Python 3 is to *include that character in the string literal* (since source code is now treated as UTF-8 by default).
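A sketch of the three spellings, assuming the source file is saved as UTF-8 (the Python 3 default source encoding):

```python
# The character typed directly into the literal.
literal = "€"

# The same character written with a four-digit escape.
escaped = "\u20ac"

# The same character written by its Unicode name.
named = "\N{EURO SIGN}"

print(literal == escaped == named)
print(hex(ord(literal)))
```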
|
msg180820 |
Author: Ezio Melotti (ezio.melotti) * |
Date: 2013-01-28 02:16 |
As discussed in #13997, the HOWTO should be reorganized to start with a basic introduction and then expand on more advanced topics.
See also msg180743 for a couple of essays that could be linked as "see also" or integrated in the HOWTO.
|
msg190820 |
Author: A.M. Kuchling (akuchling) * |
Date: 2013-06-08 19:46 |
Continuing my tour of the howtos, here's a patch making many of the changes discussed here and on issue13997. Changes made:
* state that python3 source encoding is UTF-8, and give examples
* mention surrogateescape in the 'tips and tricks' section, and backslashreplace in the "Python's Unicode Support" section.
* default filesystem encoding is now UTF-8, not ascii.
* link to Nick Coghlan's and Ned Batchelder's notes/presentations.
* remove revision history
* remove usage of "I think", "I'm not going to", etc.
* update acks section
Things I did *not* do, though they were suggested:
* Move tip "Software should only work with Unicode strings internally" from the last section to somewhere earlier and more prominent. Perhaps it could go somewhere in the "Python's Unicode Support" section.
* mention codecs.StreamRecoder and StreamReaderWriter (I could put this in 'tips and tricks').
* Examples should be properly marked up to allow sphinx to run them and check the output. (May not be possible.)
* mention unicode support in re module
* clarify some more terms (e.g. codepoints, code units, characters, possibly scalar values etc.) -- I don't see why they matter, since we don't use them.
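For readers who have not met the two error handlers named above, a minimal sketch of their behaviour. The byte string is arbitrary and the variable names are illustrative.

```python
raw = b'\x80abc'  # not valid UTF-8

# surrogateescape smuggles each undecodable byte through as a lone
# surrogate (0x80 becomes U+DC80) ...
text = raw.decode("utf-8", "surrogateescape")
print(ascii(text))

# ... so that encoding with the same handler restores the original bytes.
restored = text.encode("utf-8", "surrogateescape")
print(restored == raw)

# backslashreplace substitutes an escape sequence for characters the
# target encoding cannot represent.
escaped = "a\u20ac".encode("ascii", "backslashreplace")
print(escaped)
```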
|
msg190835 |
Author: A.M. Kuchling (akuchling) * |
Date: 2013-06-08 22:33 |
Updated version of my patch, which adds two more todo items and handles Ezio's review comments:
* Switch from Greek examples to French, and remove non-Latin-1 characters.
* Change language for bytes.decode to "but supports a few more possible handlers".
* Describe Unicode support in the re module.
* Describe StreamRecoder. I don't see why StreamReaderWriter would need to be mentioned.
I do not intend to do the remaining items on the todo list (clarify some more terms; make it work with doctest).
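A sketch of the two additions, using a French sample string in the spirit of the patch. This is not the text of the patch itself; codecs.EncodedFile is used here only as a convenient way to obtain a StreamRecoder.

```python
import codecs
import io
import re

text = "caf\u00e9 cr\u00e8me"

# For str patterns, \w matches Unicode word characters by default.
unicode_words = re.findall(r"\w+", text)
print(unicode_words)

# With re.ASCII, \w is restricted to [a-zA-Z0-9_].
ascii_words = re.findall(r"\w+", text, flags=re.ASCII)
print(ascii_words)

# A StreamRecoder translates between two encodings on the fly: here the
# underlying stream holds Latin-1 bytes and reads return UTF-8 bytes.
raw = io.BytesIO(text.encode("latin-1"))
recoder = codecs.EncodedFile(raw, data_encoding="utf-8",
                             file_encoding="latin-1")
recoded = recoder.read()
print(recoded)
```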
|
msg190841 |
Author: Alyssa Coghlan (ncoghlan) * |
Date: 2013-06-09 04:06 |
amk's latest patch looks like a very nice improvement to me.
One suggested wording tweak for the aside about the simplified
history: s/The average Python programmer doesn't need to know the
historical details/The precise historical details aren't relevant to
understanding how to use Unicode effectively/ (and then continue with
"; if you're curious ..." as it does now)
|
msg191511 |
Author: Roundup Robot (python-dev) |
Date: 2013-06-20 13:46 |
New changeset 1dbbed06a163 by Andrew Kuchling in branch '3.3':
#4153: update Unicode howto for Python 3.3
http://hg.python.org/cpython/rev/1dbbed06a163
|
msg191513 |
Author: A.M. Kuchling (akuchling) * |
Date: 2013-06-20 14:16 |
As far as I can tell, there are no other outstanding suggestions for howto updates, so I'll now close this item. Feel free to re-open or file a new item if there are further improvements that can be made.
|
msg191638 |
Author: Serhiy Storchaka (serhiy.storchaka) * |
Date: 2013-06-22 09:11 |
Most of the changes are applicable to Python 2 too. Do you want to backport part of your patch to 2.7?
|
|
Date | User | Action | Args
2022-04-11 14:56:40 | admin | set | github: 48403
2013-06-22 09:11:32 | serhiy.storchaka | set | nosy: + serhiy.storchaka; messages: + msg191638
2013-06-20 14:16:18 | akuchling | set | status: open -> closed; resolution: fixed; messages: + msg191513; stage: commit review -> resolved
2013-06-20 13:46:27 | python-dev | set | messages: + msg191511
2013-06-09 04:06:14 | ncoghlan | set | messages: + msg190841
2013-06-08 22:33:21 | akuchling | set | files: + unicode-howto.txt; messages: + msg190835
2013-06-08 19:46:46 | akuchling | set | files: + unicode-howto.txt; messages: + msg190820
2013-01-28 02:20:15 | ezio.melotti | link | issue13997 superseder
2013-01-28 02:16:25 | ezio.melotti | set | messages: + msg180820
2013-01-27 03:27:52 | ncoghlan | set | messages: + msg180738
2013-01-27 02:39:05 | cvrebert | set | nosy: + cvrebert
2013-01-20 10:31:46 | ezio.melotti | set | messages: + msg180284
2013-01-20 10:20:29 | python-dev | set | nosy: + python-dev; messages: + msg180283
2012-09-26 17:45:54 | ezio.melotti | set | assignee: ezio.melotti
2011-09-17 16:38:18 | ezio.melotti | set | nosy: + ncoghlan
2011-09-02 17:58:08 | eric.araujo | set | messages: + msg143426
2011-09-02 17:44:30 | ezio.melotti | set | messages: + msg143424
2011-09-02 17:38:27 | eric.araujo | set | messages: + msg143422
2011-09-02 17:13:56 | eric.araujo | set | messages: + msg143421
2011-09-01 10:57:45 | ezio.melotti | set | messages: + msg143317
2011-09-01 08:04:12 | ezio.melotti | set | files: + issue4153-2.diff; versions: + Python 3.3; messages: + msg143310; assignee: georg.brandl -> (no value); resolution: fixed -> (no value); stage: commit review
2010-11-19 16:30:39 | eric.araujo | set | messages: + msg121548
2010-11-19 16:22:09 | belopolsky | set | priority: normal -> low; messages: + msg121547
2010-11-19 01:02:47 | rhettinger | set | nosy: + rhettinger; messages: + msg121499
2010-11-18 20:47:05 | terry.reedy | set | messages: + msg121495
2010-11-18 20:12:20 | belopolsky | set | messages: + msg121491
2010-11-18 20:00:24 | belopolsky | set | messages: + msg121490
2010-11-18 19:42:12 | ezio.melotti | set | nosy: + ezio.melotti
2010-11-18 19:41:54 | terry.reedy | set | messages: + msg121488
2010-11-18 16:52:37 | eric.araujo | set | nosy: + eric.araujo
2010-11-18 16:48:26 | belopolsky | set | nosy: + akuchling
2010-11-18 16:45:37 | belopolsky | set | messages: + msg121474
2010-11-18 16:38:50 | belopolsky | set | files: - issue4153.diff
2010-11-18 16:38:40 | belopolsky | set | files: + issue4153.diff
2010-11-18 16:03:45 | belopolsky | set | files: + issue4153.diff; keywords: + patch; messages: + msg121466
2010-11-18 06:12:58 | belopolsky | set | status: closed -> open; versions: + Python 3.2, - Python 3.0; nosy: + belopolsky; messages: + msg121444
2008-11-22 10:27:32 | georg.brandl | set | status: open -> closed; resolution: fixed; messages: + msg76240
2008-10-20 18:04:00 | terry.reedy | create |