Message 98327 - Python tracker

Author	lemburg
Recipients	amaury.forgeotdarc, ezio.melotti, flox, lemburg, r.david.murray, rhansen
Date	2010-01-26.11:48:53
Message-id	<4B5ED6A3.3020607@egenix.com>
In-reply-to	<1264427643.29.0.225238610182.issue7615@psf.upfronthosting.co.za>

Content
Amaury Forgeot d'Arc wrote:
> I feel uneasy to change the default unicode-escape encoding.
> I think that we mix two features here; to transfer a unicode string between two points, programs must agree on where the data ends, and how characters are represented as bytes.
> All codecs including unicode-escape only dealt with byte conversion; (unicode-escape converts everything to printable 7bit ascii);
> these patches want to add a feature related to the "where does the string end" issue, and is only aimed at "python code" containers. Other transports and protocols may choose different delimiters.
> 
> My point is that unicode-escape used to not change printable 7-bit ascii characters, and the patches will change this.
> 
> And actually this will break existing code. It did not take me long to find two examples of programs which embed unicode_escape-encoded text between quotes, and take care themselves of escaping quotes. First example generates javascript code, the second generates SQL statements:
> http://github.com/chriseppstein/pywebmvc/blob/master/src/code/pywebmvc/tools/searchtool.py#L450
> http://gitweb.sabayon.org/?p=entropy.git;a=blob;f=libraries/entropy/db/__init__.py;h=2d818455efa347f35b2e96d787fefd338055d066;hb=HEAD#l6463

Ouch... these codecs should not have been used outside
Python. I wonder why these applications don't use repr(text)
to format the JavaScript/SQL strings.

I guess this is the result of documenting them in
http://docs.python.org/library/codecs.html#standard-encodings

Too bad that the docs actually say "Produce a string that is
suitable as Unicode literal in Python source code." The codecs'
main intent was to *decode* Unicode literals in Python source
code to Unicode objects...

The naming in the examples you mention also suggests that the
programmers used the table from the docs - they use
"unicode_escape" as the codec name, not the standard
"unicode-escape" name which we use throughout the Python
code.
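
As an illustration of the two spellings, here is a minimal sketch, assuming a
current Python 3 interpreter, showing that both names resolve to the same
codec. The variable names are only for illustration.

```python
import codecs

# Codec lookup normalizes the name, so both spellings find the same codec.
hyphen = codecs.lookup("unicode-escape")
underscore = codecs.lookup("unicode_escape")
same_codec = hyphen.name == underscore.name

# Encoding with either spelling gives identical bytes.
sample = "caf\u00e9"
same_output = sample.encode("unicode-escape") == sample.encode("unicode_escape")

print(same_codec, same_output)
```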

The fact that these real-world uses already apply their own
quote escaping suggests that we cannot easily add this to the
existing codecs. It would break those applications, since
they'd end up escaping the quotes twice.
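
The double-escaping problem can be sketched as follows, assuming Python 3. The
"patched" codec output is only simulated here with str.replace(); it is not the
behaviour of any released codec, and the variable names are hypothetical.

```python
text = 'say "hi"\n\u00e9'

# Today's codec leaves printable ASCII (including quotes) untouched.
encoded = text.encode("unicode-escape").decode("ascii")
print(encoded)

# What existing applications do themselves: escape the quotes afterwards.
app_escaped = encoded.replace('"', '\\"')
print(app_escaped)

# Simulation (assumption): a codec that already escaped quotes, followed by
# the application's own escaping step. Each quote now has two backslashes
# in front of it, so the backslash is escaped and the quote is bare again.
simulated_codec_output = encoded.replace('"', '\\"')
double_escaped = simulated_codec_output.replace('"', '\\"')
print(double_escaped)
```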

> This does not prevent the creation of a new codec, let's call it 'python-unicode-escape' [ or 'repr' :-) ]

I think that's a good way to move forward.

Python 3.x comes with a new Unicode repr() format which we could
turn into a new codec: it automatically adds the surrounding quotes,
escapes in-string quotes and backslashes, and also escapes \t, \n
and \r as well as all non-printable code points.
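
A small sketch of that repr() behaviour, assuming Python 3. The sample string
is made up for illustration; ascii() is shown only for contrast, since it also
escapes printable non-ASCII characters.

```python
import ast

# Sample containing a tab, single quotes, a backslash, a printable
# non-ASCII character and a non-printable control character (BEL).
text = "tab\there 'q' \\ \u00e9 \x07"

literal = repr(text)
print(literal)

# The result is a complete literal, so it can be read back safely.
round_tripped = ast.literal_eval(literal)

# ascii() goes further and escapes every non-ASCII code point.
ascii_literal = ascii(text)
print(ascii_literal)
```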

As for naming the new codec, I'd suggest "unicode-repr" since
that's what it implements.
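
To make the proposal concrete, here is a rough sketch of how such a codec
could be registered from pure Python. Everything in it is an assumption made
for illustration: the helper names, the choice of UTF-8 for the byte form,
and the use of ast.literal_eval() for decoding. No "unicode-repr" codec ships
with Python.

```python
import ast
import codecs


def _repr_encode(text, errors="strict"):
    # Assumption: the repr() text is stored as UTF-8 bytes.
    return repr(text).encode("utf-8", errors), len(text)


def _repr_decode(data, errors="strict"):
    literal = bytes(data).decode("utf-8", errors)
    value = ast.literal_eval(literal)
    if not isinstance(value, str):
        raise ValueError("input is not a string literal")
    return value, len(data)


def _search(name):
    # Lookup may pass the name with hyphens or underscores,
    # depending on the Python version, so accept both.
    if name in ("unicode-repr", "unicode_repr"):
        return codecs.CodecInfo(
            encode=_repr_encode, decode=_repr_decode, name="unicode-repr"
        )
    return None


codecs.register(_search)

original = 'a "b"\n'
encoded = original.encode("unicode-repr")
decoded = encoded.decode("unicode-repr")
print(encoded, decoded == original)
```

Unlike unicode-escape, the encoded form here already carries its own quotes,
so callers would not add or escape quotes themselves.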

History
Date	User	Action	Args
2010-01-26 11:48:57	lemburg	set	recipients: + lemburg, amaury.forgeotdarc, ezio.melotti, r.david.murray, flox, rhansen
2010-01-26 11:48:54	lemburg	link	issue7615 messages
2010-01-26 11:48:53	lemburg	create