Message 104913 - Python tracker

Author	lemburg
Recipients	Arfrever, ezio.melotti, gregory.p.smith, lemburg, loewis, pitrou, vstinner
Date	2010-05-04.08:51:16
Message-id	<4BDFE002.4000101@egenix.com>
In-reply-to	<4BDF4A06.1020603@v.loewis.de>

Content

Martin v. Löwis wrote:
> 
> Martin v. Löwis <martin@v.loewis.de> added the comment:
> 
>> Your name will end up being partially escaped as surrogate:
>>
>> 'L\udcf6wis'
>>
>> Further processing will fail
> 
> That depends on the further processing, no?
> 
>> Traceback (most recent call last):
>>   File "<stdin>", line 1, in <module>
>> UnicodeEncodeError: 'latin-1' codec can't encode character '\udcf6' in position 1: ordinal not in
>> range(256)
> 
> Where did you get this error from?

The roundup email interface must have eaten the
first line of the traceback:

    >>> _.encode('latin-1')
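
For readers following along, the exchange above can be reproduced with a minimal sketch. The input bytes are an assumption: the thread only shows the escaped result 'L\udcf6wis', so the sketch guesses that the name arrived as Latin-1 bytes and was decoded as UTF-8 with the 'surrogateescape' error handler. The variable names are illustrative only.

```python
# Assumed input: "Löwis" encoded as Latin-1. The byte 0xF6 is not valid UTF-8.
raw = b"L\xf6wis"

# Decoding with 'surrogateescape' maps the undecodable byte 0xF6
# to the lone surrogate U+DCF6 instead of raising an error.
escaped = raw.decode("utf-8", errors="surrogateescape")

# Encoding the escaped string with the default strict error handler
# fails, as in the traceback quoted in this message.
try:
    escaped.encode("latin-1")
    strict_error = None
except UnicodeEncodeError as exc:
    strict_error = exc

# Encoding with the same error handler restores the original bytes.
round_tripped = escaped.encode("utf-8", errors="surrogateescape")
```

Whether the strict failure counts as a problem depends on what the further processing does, which is the point being argued in this thread. The sketch only shows that the original bytes come back when the encoder also uses 'surrogateescape'.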

>> It doesn't work if an application tries to work *with* the data,
>> e.g. tries to convert it
> 
> Converting it to what?
> 
>> parse it
> 
> Parsing will work fine.
> 
>> decode it
> 
> It's a string. You shouldn't decode it.
>
>> The reason is
>> that information included by the use of the 'surrogateescape'
>> error handler is lost along the way and this then causes data
>> corruption.
> 
> And how would that not happen if it was bytes? The problems you describe
> were one of the primary motivations to switch to Unicode: it's *byte*
> strings that have these problems.

Martin, it's obvious that you are not even trying to understand
what I'm saying. That's not a good basis for discussion.

History
Date	User	Action	Args
2010-05-04 08:51:20	lemburg	set	recipients: + lemburg, loewis, gregory.p.smith, pitrou, vstinner, ezio.melotti, Arfrever
2010-05-04 08:51:18	lemburg	link	issue8603 messages
2010-05-04 08:51:16	lemburg	create