classification
Title: re.sub returns str when processing empty unicode string
Type: behavior
Stage:
Components: Regular Expressions
Versions: Python 2.4, Python 2.5
process
Status: closed
Resolution: accepted
Dependencies:
Superseder:
Assigned To: gvanrossum
Nosy List: beda, effbot, gvanrossum, jafo
Priority: low
Keywords:

Created on 2007-09-10 06:37 by beda, last changed 2007-09-17 09:44 by jafo. This issue is now closed.

Files
File name Uploaded Description
sre.diff gvanrossum, 2007-09-10 20:37
sre.diff gvanrossum, 2007-09-10 21:40
Messages (11)
msg55775 - (view) Author: Beda Kosata (beda) Date: 2007-09-10 06:37
While re.sub normally returns unicode strings when processing unicode,
it returns a normal string when dealing with an empty unicode string.

Example:
>>> print type( re.sub( "XX", "", u""))
<type 'str'>
>>> print type( re.sub( "XX", "", u"A"))
<type 'unicode'>

This inconsistency could lead to annoying bugs (at least it did for me :)
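
For readers on a current interpreter, here is a minimal sketch of the same check. It assumes Python 3, where `bytes` and `str` stand in for the 2.x `str` and `unicode`; the helper name `result_type_matches` is hypothetical and not part of the `re` module.

```python
import re

def result_type_matches(pattern, repl, subject):
    """Return True if re.sub gives back the same type as the subject string."""
    return type(re.sub(pattern, repl, subject)) is type(subject)

# Text and bytes, each with an empty and a non-empty subject.
cases = [
    ("XX", "", ""),
    ("XX", "", "A"),
    (b"XX", b"", b""),
    (b"XX", b"", b"A"),
]
for pattern, repl, subject in cases:
    print(repr(subject), result_type_matches(pattern, repl, subject))
```

On Python 3 every case is expected to report True, including the empty subjects that triggered this report on 2.4 and 2.5.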
msg55788 - (view) Author: Guido van Rossum (gvanrossum) * (Python committer) Date: 2007-09-10 17:14
I agree.  I wonder if it should return Unicode as soon as *any* of the
arguments are unicode???
msg55789 - (view) Author: Beda Kosata (beda) Date: 2007-09-10 18:25
I would certainly expect it to return unicode when either the "modified"
string or the replacement are unicode. I don't think that the type of
the replaced string should influence the type of the result.
msg55790 - (view) Author: Guido van Rossum (gvanrossum) * (Python committer) Date: 2007-09-10 18:42
Actually, it already implements the best possible rules, *except* for
the special case of an empty 3rd argument.  (When there are no
substitutions, it normally returns the input unchanged; but somehow an
empty input is handled with a shortcut even before that point.)  It ought
to be a simple fix.
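
The kind of shortcut described here can be modelled in a few lines. This is only an illustrative sketch, not the code touched by the patch: `sub_with_shortcut` and `sub_type_preserving` are hypothetical names, and it assumes Python 3, where `bytes` and `str` stand in for the 2.x `str` and `unicode`.

```python
import re

def sub_with_shortcut(pattern, repl, string):
    """Hypothetical model of the bug: empty input returns a hard-coded literal."""
    if len(string) == 0:
        return ""  # always a native str, whatever type the caller passed in
    return re.sub(pattern, repl, string)

def sub_type_preserving(pattern, repl, string):
    """Same shortcut, but the empty result keeps the type of the input."""
    if len(string) == 0:
        return string[:0]  # an empty slice has the same type as `string`
    return re.sub(pattern, repl, string)

print(type(sub_with_shortcut(b"XX", b"", b"")))
print(type(sub_type_preserving(b"XX", b"", b"")))
```

In this model the first call hands back a `str` for a `bytes` input, mirroring the mismatch in the report, while the second returns `bytes`.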
msg55793 - (view) Author: Guido van Rossum (gvanrossum) * (Python committer) Date: 2007-09-10 20:37
Here's a patch.
msg55797 - (view) Author: Guido van Rossum (gvanrossum) * (Python committer) Date: 2007-09-10 21:40
Here's a better patch that also fixes a few related issues.
msg55798 - (view) Author: Guido van Rossum (gvanrossum) * (Python committer) Date: 2007-09-10 21:40
Fredrik, thoughts?
msg55800 - (view) Author: Fredrik Lundh (effbot) * (Python committer) Date: 2007-09-10 21:54
Looks good to me.  I still subscribe to the idea that
robust code should accept 8-bit *ASCII* strings
anywhere it accepts Unicode (especially when the 8-bit
string is empty), but that's me.

Feel free to check this in (or assign back to you if
you don't have the time).
msg55803 - (view) Author: Fredrik Lundh (effbot) * (Python committer) Date: 2007-09-10 21:56
(is there a way to just add a comment in the new tracker, btw, or is
everything a "change note", even if nothing has changed?)
msg55805 - (view) Author: Guido van Rossum (gvanrossum) * (Python committer) Date: 2007-09-10 22:03
Thanks, Fredrik.
Fixed in 2.6.
Committed revision 58098.
Someone else could backport to 2.5.
Shouldn't be merged into 3.0.
msg55957 - (view) Author: Sean Reifschneider (jafo) * (Python committer) Date: 2007-09-17 09:44
Applied as revision 58179 to 2.5 maintenance branch, passes tests.
History
Date User Action Args
2007-09-17 09:44:15 jafo set status: open -> closed
nosy: + jafo
messages: + msg55957
priority: low
2007-09-10 22:04:08 effbot set messages: - msg55804
2007-09-10 22:03:41 gvanrossum set messages: + msg55805
2007-09-10 22:01:16 effbot set messages: + msg55804
2007-09-10 21:56:40 effbot set messages: + msg55803
2007-09-10 21:54:54 effbot set assignee: effbot -> gvanrossum
resolution: accepted
messages: + msg55800
2007-09-10 21:40:25 gvanrossum set assignee: gvanrossum -> effbot
messages: + msg55798
nosy: + effbot
2007-09-10 21:40:05 gvanrossum set files: + sre.diff
messages: + msg55797
2007-09-10 20:37:41 gvanrossum set files: + sre.diff
assignee: gvanrossum
messages: + msg55793
2007-09-10 18:42:54 gvanrossum set messages: + msg55790
2007-09-10 18:25:30 beda set messages: + msg55789
2007-09-10 17:14:03 gvanrossum set nosy: + gvanrossum
messages: + msg55788
2007-09-10 06:37:18 beda create