Message 214470 - Python tracker


Author	loewis
Recipients	benjamin.peterson, docs@python, eric.araujo, ezio.melotti, gwideman, lemburg, loewis, pitrou, tshepang, vstinner
Date	2014-03-22.12:22:19
Message-id	<1395490940.57.0.455094841821.issue20906@psf.upfronthosting.co.za>

Content

"4. Many Internet standards are defined in terms of textual data"

I believe the author was thinking of the "old" TCP-based protocols (FTP, SMTP, RFC 822, HTTP), which carry their commands/messages as ASCII strings in variable-length records (often terminated by a line end).

I think bringing this up as an argument against UTF-32 is somewhat flawed, for two reasons:
1. Historically, many of these protocols restricted themselves to pure ASCII, so using UTF-8 is as much a protocol violation as using UTF-32.
2. The tricky part in these protocols is often not the risk of embedding NUL, but of embedding CRLF (the bytes 0D 0A might well appear within a character, e.g. MALAYALAM LETTER UU, U+0D0A).
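
Both points can be checked with a small Python sketch. It is only an illustration: the helper names `contains_crlf` and `is_pure_ascii` are made up for this example, and big-endian encodings without a BOM are assumed.

```python
import unicodedata

# MALAYALAM LETTER UU is U+0D0A, so its code point value spells out CR LF.
ch = "\u0d0a"
assert unicodedata.name(ch) == "MALAYALAM LETTER UU"

# Big-endian, BOM-less encodings (an assumption made for this illustration).
utf32 = ch.encode("utf-32-be")
utf16 = ch.encode("utf-16-be")
utf8 = ch.encode("utf-8")


def contains_crlf(data: bytes) -> bool:
    """Return True if the byte string contains the sequence 0D 0A."""
    return b"\r\n" in data


def is_pure_ascii(data: bytes) -> bool:
    """Return True if every byte is in the 7-bit ASCII range."""
    return all(b < 0x80 for b in data)


for label, data in (("UTF-32BE", utf32), ("UTF-16BE", utf16), ("UTF-8", utf8)):
    print(label, data.hex(" "),
          "CRLF inside:", contains_crlf(data),
          "pure ASCII:", is_pure_ascii(data))
```

A line-oriented parser that splits on CRLF would cut the UTF-32BE and UTF-16BE forms of this character in half. The UTF-8 form contains no CRLF, but it does contain bytes above 0x7F, which a strictly ASCII-only protocol does not permit either.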

OTOH, it is a fact that several of these protocols were revised to support Unicode, often by re-interpreting the data as UTF-8 (with MIME being the notable exception that actually allows UTF-32 on the wire if somebody chooses to use it).

History
Date	User	Action	Args
2014-03-22 12:22:20	loewis	set	recipients: + loewis, lemburg, pitrou, vstinner, benjamin.peterson, ezio.melotti, eric.araujo, docs@python, tshepang, gwideman
2014-03-22 12:22:20	loewis	set	messageid: <1395490940.57.0.455094841821.issue20906@psf.upfronthosting.co.za>
2014-03-22 12:22:20	loewis	link	issue20906 messages
2014-03-22 12:22:19	loewis	create