Message 225800 - Python tracker

This issue tracker has been migrated to GitHub, and is currently read-only.
For more information, see the GitHub FAQs in the Python's Developer Guide.

Author	ezio.melotti
Recipients	Arfrever, ezio.melotti, ncoghlan, pitrou, r.david.murray, serhiy.storchaka, vstinner
Date	2014-08-24.07:58:20
Message-id	<1408867101.32.0.177426890365.issue18814@psf.upfronthosting.co.za>

Content
I think similar functions should be added in the unicodedata module rather than the string module or as str methods.  If I'm not mistaken this was already proposed in another issue.
In C we already added macros like IS_{HIGH|LOW|}_SURROGATE and possibly others to help deal with surrogates, but AFAIK there's no Python equivalent yet.
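For illustration, a rough sketch of what Python-level counterparts to those macros could look like. The function names are hypothetical (nothing like them exists in unicodedata); only the code point ranges, U+D800..U+DBFF for high and U+DC00..U+DFFF for low surrogates, come from the Unicode standard.

```python
# Hypothetical helpers, not part of any stdlib module.

def is_high_surrogate(ch):
    """True if ch is a high (lead) surrogate, U+D800..U+DBFF."""
    return 0xD800 <= ord(ch) <= 0xDBFF


def is_low_surrogate(ch):
    """True if ch is a low (trail) surrogate, U+DC00..U+DFFF."""
    return 0xDC00 <= ord(ch) <= 0xDFFF


def is_surrogate(ch):
    """True if ch is any surrogate code point, U+D800..U+DFFF."""
    return 0xD800 <= ord(ch) <= 0xDFFF
```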
As for the specific constants/functions/methods you propose, IMHO the name escaped_surrogates is not very clear.  If it's a string of lone surrogates I would just call it unicodedata.surrogates (and .high_surrogates/.low_surrogates).  These can also be used to build one-liners to check if a string contains surrogates and/or to remove them.
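A minimal sketch of the kind of one-liners meant here, assuming the constants are plain strings of lone surrogates. The names surrogates, high_surrogates and low_surrogates are the suggested ones and do not exist in unicodedata today; the two helper functions are made up for the example.

```python
# Assumed shape of the proposed constants: strings of lone surrogates.
high_surrogates = "".join(map(chr, range(0xD800, 0xDC00)))
low_surrogates = "".join(map(chr, range(0xDC00, 0xE000)))
surrogates = high_surrogates + low_surrogates


def contains_surrogates(s):
    # One-liner check: does s share any character with the constant?
    return not set(surrogates).isdisjoint(s)


def remove_surrogates(s):
    # str.translate() deletes every code point that maps to None.
    return s.translate(dict.fromkeys(map(ord, surrogates)))
```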
clean has a very generic name with no hints about surrogates, and its purpose is quite specific.
I'm also not a big fan of redecode.  The equivalent calls to encode/decode are not much longer and more explicit.  Also, having to redecode often indicates that there's an earlier bug that should be fixed instead (if possible).
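To show what "the equivalent calls" look like: the exact signature of the proposed redecode isn't spelled out in this message, so the sketch below only assumes it would re-encode a str with the codec it was wrongly decoded with and then decode the bytes with the right one. The variable names are invented for the example.

```python
# Case 1: UTF-8 bytes were decoded as Latin-1 by mistake.
raw = "café".encode("utf-8")
mojibake = raw.decode("latin-1")
# Explicit round trip: undo the wrong decode, then apply the right one.
fixed = mojibake.encode("latin-1").decode("utf-8")

# Case 2: Latin-1 bytes were decoded as ASCII with surrogateescape,
# so the undecodable byte 0xE9 became the lone surrogate U+DCE9.
escaped = b"caf\xe9".decode("ascii", "surrogateescape")
# Same handler on encode restores the original bytes.
recovered = escaped.encode("ascii", "surrogateescape").decode("latin-1")
```

In both cases the wrong decode is the actual bug; the round trip only papers over it after the fact.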

History
Date	User	Action	Args
2014-08-24 07:58:21	ezio.melotti	set	recipients: + ezio.melotti, ncoghlan, pitrou, vstinner, Arfrever, r.david.murray, serhiy.storchaka
2014-08-24 07:58:21	ezio.melotti	set	messageid: <1408867101.32.0.177426890365.issue18814@psf.upfronthosting.co.za>
2014-08-24 07:58:21	ezio.melotti	link	issue18814 messages
2014-08-24 07:58:20	ezio.melotti	create