classification
Title: Unicode HOWTO incorrectly states that UTF-8 contains no zero bytes
Type: enhancement Stage: resolved
Components: Documentation Versions: Python 3.8, Python 3.7, Python 2.7
process
Status: closed Resolution: fixed
Dependencies: Superseder:
Assigned To: docs@python Nosy List: asvetlov, docs@python, ezio.melotti, josh.r, mbiggs, serhiy.storchaka
Priority: normal Keywords: easy, patch

Created on 2019-05-04 00:00 by mbiggs, last changed 2019-05-17 11:05 by cheryl.sabella. This issue is now closed.

Pull Requests
URL Status Linked Edit
PR 13111 merged redshiftzero, 2019-05-06 15:08
PR 13188 closed mbiggs, 2019-05-08 11:13
PR 13383 merged miss-islington, 2019-05-17 10:48
Messages (6)
msg341363 - (view) Author: mbiggs (mbiggs) * Date: 2019-05-04 00:00
In the Unicode HOWTO: http://docs.python.org/3.3/howto/unicode.html

It says the following:


"UTF-8 has several convenient properties:
(...)
2. A Unicode string is turned into a sequence of bytes containing no embedded zero bytes. This avoids byte-ordering issues, and means UTF-8 strings can be processed by C functions such as strcpy() and sent through protocols that can’t handle zero bytes."

This is not right. UTF-8 uses the zero byte to represent the Unicode code point U+0000 (the ASCII NULL character). This is a valid character in UTF-8 and is handled just fine by Python's UTF-8 string encoding/decoding.
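
A minimal sketch of the behavior described above, using only the built-in `str.encode` and `bytes.decode`. The variable names are illustrative and not taken from the HOWTO.

```python
# A string that contains U+0000 between two ASCII letters.
text = "a\x00b"

# Standard UTF-8 encodes U+0000 as a single zero byte.
encoded = text.encode("utf-8")
print(encoded)

# The zero byte is present in the encoded form and survives a round trip.
has_zero_byte = b"\x00" in encoded
round_tripped = encoded.decode("utf-8")
print(has_zero_byte, round_tripped == text)
```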
msg341364 - (view) Author: Andrew Svetlov (asvetlov) * (Python committer) Date: 2019-05-04 00:06
This is right for 99.99% of cases: UTF-8 does not use zero bytes to encode any character except an explicit U+0000.

UTF-16, for example, encodes 'a' as b'\xff\xfea\x00'.
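
The contrast can be sketched as follows. This example assumes the explicit "utf-16-le" codec so that the result does not depend on the platform's byte order and carries no BOM; the sample text and the helper `zero_count` are hypothetical.

```python
def zero_count(data: bytes) -> int:
    """Return how many zero bytes appear in data."""
    return data.count(0)

# Sample text without U+0000: ASCII, Latin-1 and one other BMP character.
sample = "héllo, wörld €"

utf8_bytes = sample.encode("utf-8")
utf16_bytes = sample.encode("utf-16-le")  # little-endian, no BOM

# UTF-8 produces zero bytes only for U+0000, so none appear here.
print(zero_count(utf8_bytes))

# UTF-16 pads every code point below U+0100 with a zero high byte.
print(zero_count(utf16_bytes))
```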
msg341414 - (view) Author: mbiggs (mbiggs) * Date: 2019-05-05 01:27
So a correct statement would be "A UTF-8 string is turned into a sequence of bytes that contains embedded zero bytes only where they represent the NULL character (U+0000)."

I think it's important to correct this because the part about processing UTF-8 with C functions like strcpy() was wrong and could cause bugs.
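
One way to see the kind of bug meant here, without writing C, is to look at the encoded bytes through `ctypes.c_char_p`, whose `.value` stops at the first zero byte just as a C string function would. This is only an illustration of NUL-terminated string semantics, not a call to strcpy() itself.

```python
import ctypes

# Valid UTF-8 that contains an embedded U+0000.
encoded = "abc\x00def".encode("utf-8")

# Python keeps the full byte string, including the zero byte.
print(len(encoded))

# Viewed as a NUL-terminated C string, everything after the
# first zero byte is silently dropped.
c_view = ctypes.c_char_p(encoded).value
print(c_view)
```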
msg341418 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2019-05-05 05:54
I agree that the documentation should be updated. Would you mind creating a pull request, mbiggs?

There are UTF-8 variants which guarantee that the encoded text has no zero bytes (see Modified UTF-8), but Python only provides the standard UTF-8 and UTF-8 with BOM.
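
A rough sketch of the zero-byte handling in Modified UTF-8, which encodes U+0000 as the two-byte sequence C0 80. The helpers `encode_nul_free` and `decode_nul_free` are hypothetical and not part of Python. They cover only this one aspect: real Modified UTF-8 also encodes characters above U+FFFF as surrogate pairs, which this sketch rejects rather than implements.

```python
def encode_nul_free(text: str) -> bytes:
    """Encode text as UTF-8, but write U+0000 as b'\\xc0\\x80' (hypothetical helper)."""
    if any(ord(ch) > 0xFFFF for ch in text):
        raise ValueError("characters above U+FFFF are not handled by this sketch")
    # In standard UTF-8 a zero byte can only come from U+0000.
    return text.encode("utf-8").replace(b"\x00", b"\xc0\x80")


def decode_nul_free(data: bytes) -> str:
    """Reverse encode_nul_free (hypothetical helper)."""
    # 0xC0 never occurs in valid standard UTF-8, so this replacement is unambiguous.
    return data.replace(b"\xc0\x80", b"\x00").decode("utf-8")


encoded = encode_nul_free("a\x00b")
print(encoded, b"\x00" in encoded)

# Python's standard UTF-8 decoder rejects the overlong C0 80 form.
try:
    encoded.decode("utf-8")
    strict_decoder_accepts = True
except UnicodeDecodeError:
    strict_decoder_accepts = False
print(strict_decoder_accepts)
```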
msg341477 - (view) Author: Josh Rosenberg (josh.r) * (Python triager) Date: 2019-05-05 22:25
Minor bikeshed: If updating the documentation, refer to U+0000 as "the null character" or "NUL", not "NULL". Using "NULL" allows for confusion with NULL pointers; "the null character" (the name used in the Unicode standard) or "NUL" (the official three-letter abbreviation in ASCII, and in Unicode too, I think) has no such opportunity for confusion.
msg341948 - (view) Author: mbiggs (mbiggs) * Date: 2019-05-08 21:46
Ah, I sent a pull request but didn't realize that redshiftzero already had. Their PR looks good to me.

Thanks for fixing this!
History
Date User Action Args
2019-05-17 11:05:14 cheryl.sabella set status: open -> closed; resolution: fixed; stage: patch review -> resolved
2019-05-17 10:48:34 miss-islington set pull_requests: + pull_request13294
2019-05-08 21:46:59 mbiggs set messages: + msg341948
2019-05-08 11:13:00 mbiggs set pull_requests: + pull_request13102
2019-05-06 15:08:46 redshiftzero set keywords: + patch; stage: needs patch -> patch review; pull_requests: + pull_request13026
2019-05-06 01:44:48 ezio.melotti set nosy: + ezio.melotti; type: enhancement
2019-05-05 22:25:59 josh.r set nosy: + josh.r; messages: + msg341477
2019-05-05 05:54:52 serhiy.storchaka set versions: - Python 3.5, Python 3.6; nosy: + serhiy.storchaka; messages: + msg341418; keywords: + easy; stage: needs patch
2019-05-05 01:27:28 mbiggs set messages: + msg341414
2019-05-04 00:06:45 asvetlov set nosy: + asvetlov; messages: + msg341364
2019-05-04 00:00:17 mbiggs create