Title: Incoherent behavior with umlaut in regular expressions
Type: behavior Stage: resolved
Components: Regular Expressions Versions: Python 2.7
Status: closed Resolution: not a bug
Dependencies: Superseder:
Assigned To: Nosy List: cklein, eryksun, ezio.melotti, mrabarnett, r.david.murray
Priority: normal Keywords:

Created on 2015-08-14 07:07 by cklein, last changed 2015-08-14 16:53 by zach.ware. This issue is now closed.

Messages (4)
msg248560 - (view) Author: Christian Klein (cklein) Date: 2015-08-14 07:07
The Python 2.7 re module does not seem to agree with itself on what counts as a word character:

import re
s = u'f\xfc'
print re.sub('\W', '*', s, re.UNICODE)
print re.findall('\w', s, re.UNICODE)

The call to re.sub replaces the character u'ü' with '*', which implies it is considered a non-word character (\W).
But then re.findall reports it as a word character (\w).

Python 3.4 and Python 3.5 behave correctly, or at least coherently.
(But that's unfortunately not an option for Google App Engine)
msg248561 - (view) Author: Eryk Sun (eryksun) * (Python triager) Date: 2015-08-14 07:43
You're passing re.UNICODE (32) as the value of the count parameter, i.e. the function signature is re.sub(pattern, repl, string, count=0, flags=0).
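As a minimal sketch of the fix implied above, passing the flag by keyword should make the two calls agree. The variable names substituted and found are only illustrative, and the snippet assumes Python 2.7 or later, where re.sub accepts a flags argument:

```python
import re

s = u'f\xfc'  # 'f' followed by u-umlaut, as in the original report

# re.UNICODE is an integer flag with value 32, so passing it as the fourth
# positional argument of re.sub sets count=32 and leaves flags=0.
print(int(re.UNICODE))

# Passing the flag by keyword applies it as a flag in both calls.
substituted = re.sub(r'\W', '*', s, flags=re.UNICODE)
found = re.findall(r'\w', s, flags=re.UNICODE)

print(repr(substituted))  # the umlaut is kept, since it now counts as \w
print(found)              # both characters are reported as word characters
```

For re.findall the third positional argument really is flags, which is why the original findall call behaved as expected while the sub call did not.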
msg248562 - (view) Author: Christian Klein (cklein) Date: 2015-08-14 07:46
Wow, that's very embarrassing. Thank you.
(I tried to get further help before but nobody recognized that stupid mistake)
msg248584 - (view) Author: R. David Murray (r.david.murray) * (Python committer) Date: 2015-08-14 12:52
Don't be embarrassed; a report like this turns up on this tracker about every three or four months.  Unfortunately there's nothing we can do to make the situation better because of backward compatibility concerns.
