Message 97385 - Python tracker



Author	rhansen
Recipients	ezio.melotti, lemburg, r.david.murray, rhansen
Date	2010-01-07.22:42:57
SpamBayes Score	1.6098234e-15
Marked as misclassified	No
Message-id	<1262904178.99.0.252194974459.issue7615@psf.upfronthosting.co.za>
In-reply-to

Content

> We'll need a patch that implements single and double quote escaping 
> for unicode_escape and a \uXXXX style escaping of quotes for the 
> raw_unicode_escape encoder.

OK, I'll remove unicode_escape_single_quotes.patch and update unicode_escape_reorg.patch.

> Other changes are not necessary.

Would you please clarify?  There are a few other (minor) bugs that were discovered while writing unicode_escape_reorg.patch that I think should be fixed:
  * the UTF-16 surrogate pair decoding logic could read past the end of the provided Py_UNICODE character array if the last character is a high surrogate (0xD800 through 0xDBFF)
  * _PyString_Resize() will be called on an empty string if the size argument of unicodeescape_string() is 0.  This will raise a SystemError because _PyString_Resize() can only be called if the object's ref count is 1 (even if no resizing is to take place) yet PyString_FromStringAndSize() returns a shared empty string instance if size is 0.
  * it is unclear what unicodeescape_string() should do if size < 0
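
To illustrate the first bullet, here is a minimal Python sketch of surrogate pair combining that checks the bounds before looking ahead. The function name combine_surrogates and the list-of-code-units input are assumptions made for illustration; this is not the C code from unicodeobject.c, and in Python an unchecked lookahead would raise IndexError rather than read past the buffer.

```python
def combine_surrogates(units):
    """Combine UTF-16 code units into code points.

    Hypothetical helper: `units` is a list of integers in range(0x10000).
    """
    out = []
    i = 0
    n = len(units)
    while i < n:
        ch = units[i]
        i += 1
        # Look ahead only if a next unit exists; a trailing high
        # surrogate is passed through unchanged instead of reading units[n].
        if 0xD800 <= ch < 0xDC00 and i < n:
            ch2 = units[i]
            if 0xDC00 <= ch2 <= 0xDFFF:
                ch = (((ch & 0x3FF) << 10) | (ch2 & 0x3FF)) + 0x10000
                i += 1
        out.append(ch)
    return out


# A complete pair is combined; a lone trailing high surrogate is kept as-is.
pair = combine_surrogates([0xD83D, 0xDE00])
lone = combine_surrogates([0x41, 0xD800])
```

The `i < n` test is the whole point of the sketch: without it, the lookahead would touch one element beyond the end whenever the final unit is a high surrogate.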

Beyond those issues, I'm worried about maintainability given the amount of code duplication.  If a bug is found in one of those encoding functions, the other two will likely need the same fix.

> The pickle copy of the codec can be left untouched (both cPickle.c 
> and pickle.py) - it doesn't matter whether quotes are escaped or not 
> in the pickle data stream.

Unfortunately, pickle.py must be modified because it does its own backslash escaping before encoding with the raw_unicode_escape codec.  This means that backslashes would become double escaped and the decoded value would differ (confirmed by running the pickle unit tests).
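
The double escaping can be sketched in Python 3 as follows. Both helpers are assumptions for illustration: pickle_style_preescape imitates the kind of pre-escaping described above, and hypothetical_patched_encode stands in for a raw_unicode_escape encoder that also escapes backslashes. Neither is the actual pickle.py or patch code.

```python
def pickle_style_preescape(s):
    # Assumed stand-in for pickle.py's own escaping: backslash and
    # newline are rewritten as \uXXXX before the codec is applied.
    return s.replace("\\", "\\u005c").replace("\n", "\\u000a")


def hypothetical_patched_encode(s):
    # Assumed stand-in for an encoder that, unlike today's
    # raw_unicode_escape, escapes backslashes itself.
    out = []
    for ch in s:
        cp = ord(ch)
        if ch == "\\":
            out.append("\\u005c")
        elif cp < 256:
            out.append(ch)
        elif cp < 0x10000:
            out.append("\\u%04x" % cp)
        else:
            out.append("\\U%08x" % cp)
    return "".join(out).encode("latin-1")


original = "a\\b"  # three characters: a, backslash, b

# Existing encoder: backslashes pass through, so the round trip works.
ok = pickle_style_preescape(original).encode("raw_unicode_escape")
roundtrip_ok = ok.decode("raw_unicode_escape")

# Backslash-escaping encoder: the pre-escaped backslash is escaped again.
wire = hypothetical_patched_encode(pickle_style_preescape(original))
roundtrip_bad = wire.decode("raw_unicode_escape")
```

With the stand-in encoder the decoded value still contains the literal text \u005c, which is the mismatch that the pickle unit tests would catch.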

The (minor) bugs in PyUnicode_EncodeRawUnicodeEscape() are also present in cPickle.c, so they should probably be fixed as well.

> The codecs' encode direction is not defined anywhere in the 
> documentation, AFAIK, and basically an implementation detail.

I read the escape codec documentation (see the original post) as implying that the encoders can generate eval-able string literals.  I'll add some clarifying statements.
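
As a rough Python 3 illustration of the "eval-able" reading: the sketch below wraps unicode_escape output in single quotes and evaluates it with ast.literal_eval. The helper names are hypothetical, and the quote escaping in as_literal_quotes_escaped is only an assumption about what an encoder with quote escaping might produce.

```python
import ast


def as_literal(text):
    # Wrap the unicode_escape output in single quotes without
    # touching any quote characters inside it.
    body = text.encode("unicode_escape").decode("ascii")
    return "'" + body + "'"


def as_literal_quotes_escaped(text):
    # Same, but additionally backslash-escape single quotes
    # (hypothetical behaviour, not what the codec does today).
    body = text.encode("unicode_escape").decode("ascii")
    return "'" + body.replace("'", "\\'") + "'"


sample = "it's caf\u00e9\n"

try:
    ast.literal_eval(as_literal(sample))
    unescaped_is_evalable = True
except SyntaxError:
    # The embedded quote terminates the literal early.
    unescaped_is_evalable = False

evaluated = ast.literal_eval(as_literal_quotes_escaped(sample))
```

Only the variant with escaped quotes evaluates back to the original string, which is why the encode direction matters if the output is meant to be pasted into source code.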

Thanks for the feedback!

History
Date	User	Action	Args
2010-01-07 22:42:59	rhansen	set	recipients: + rhansen, lemburg, ezio.melotti, r.david.murray
2010-01-07 22:42:58	rhansen	set	messageid: <1262904178.99.0.252194974459.issue7615@psf.upfronthosting.co.za>
2010-01-07 22:42:57	rhansen	link	issue7615 messages
2010-01-07 22:42:57	rhansen	create