classification
Title: test_re.py fails
Type: behavior
Stage:
Components: Tests
Versions: Python 2.5
process
Status: closed
Resolution: not a bug
Dependencies:
Superseder:
Assigned To: loewis
Nosy List: amaury.forgeotdarc, donmez, gvanrossum, loewis
Priority: normal
Keywords:

Created on 2007-12-13 10:15 by donmez, last changed 2007-12-20 21:00 by donmez. This issue is now closed.

Files
File name Uploaded Description
test2.py donmez, 2007-12-19 21:59
test.py donmez, 2007-12-19 22:07
Messages (34)
msg58527 - (view) Author: Ismail Donmez (donmez) * Date: 2007-12-13 10:15
Using Python 2.5, revision 59479 from the release25-maint branch:

[~/python-2.5]> LD_LIBRARY_PATH=/home/cartman/python-2.5: ./python ./Lib/test/test_re.py
test_anyall (__main__.ReTests) ... ok
test_basic_re_sub (__main__.ReTests) ... ok
test_bigcharset (__main__.ReTests) ... ok
test_bug_113254 (__main__.ReTests) ... ok
test_bug_1140 (__main__.ReTests) ... ok
test_bug_114660 (__main__.ReTests) ... ok
test_bug_117612 (__main__.ReTests) ... ok
test_bug_418626 (__main__.ReTests) ... ok
test_bug_448951 (__main__.ReTests) ... ok
test_bug_449000 (__main__.ReTests) ... ok
test_bug_449964 (__main__.ReTests) ... ok
test_bug_462270 (__main__.ReTests) ... ok
test_bug_527371 (__main__.ReTests) ... ok
test_bug_545855 (__main__.ReTests) ... ok
test_bug_581080 (__main__.ReTests) ... ok
test_bug_612074 (__main__.ReTests) ... ok
test_bug_725106 (__main__.ReTests) ... ok
test_bug_725149 (__main__.ReTests) ... ok
test_bug_764548 (__main__.ReTests) ... ok
test_bug_817234 (__main__.ReTests) ... ok
test_bug_926075 (__main__.ReTests) ... ok
test_bug_931848 (__main__.ReTests) ... ok
test_category (__main__.ReTests) ... ok
test_constants (__main__.ReTests) ... ok
test_empty_array (__main__.ReTests) ... ok
test_expand (__main__.ReTests) ... ok
test_finditer (__main__.ReTests) ... ok
test_flags (__main__.ReTests) ... ok
test_getattr (__main__.ReTests) ... ok
test_getlower (__main__.ReTests) ... ok
test_groupdict (__main__.ReTests) ... ok
test_ignore_case (__main__.ReTests) ... ok
test_non_consuming (__main__.ReTests) ... ok
test_not_literal (__main__.ReTests) ... ok
test_pickling (__main__.ReTests) ... ok
test_qualified_re_split (__main__.ReTests) ... ok
test_qualified_re_sub (__main__.ReTests) ... ok
test_re_escape (__main__.ReTests) ... ok
test_re_findall (__main__.ReTests) ... ok
test_re_groupref (__main__.ReTests) ... ok
test_re_groupref_exists (__main__.ReTests) ... ok
test_re_match (__main__.ReTests) ... ok
test_re_split (__main__.ReTests) ... ok
test_re_subn (__main__.ReTests) ... ok
test_repeat_minmax (__main__.ReTests) ... ok
test_scanner (__main__.ReTests) ... ok
test_search_coverage (__main__.ReTests) ... ok
test_search_star_plus (__main__.ReTests) ... ok
test_special_escapes (__main__.ReTests) ... ok
test_sre_character_class_literals (__main__.ReTests) ... ok
test_sre_character_literals (__main__.ReTests) ... ok
test_stack_overflow (__main__.ReTests) ... ok
test_sub_template_numeric_escape (__main__.ReTests) ... ok
test_symbolic_refs (__main__.ReTests) ... ok
test_weakref (__main__.ReTests) ... ok

----------------------------------------------------------------------
Ran 55 tests in 0.194s

OK
Running re_tests test suite
=== Failed incorrectly ('(?u)\\b.\\b', u'\xc4', 0, 'found', u'\xc4')
=== Failed incorrectly ('(?u)\\w', u'\xc4', 0, 'found', u'\xc4')
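
The two failing entries come from the re_tests table, where each entry pairs a pattern with an input string and the expected match. A minimal Python 3 sketch of the same two checks follows. It assumes a default build, where str patterns are matched against Python's own Unicode tables rather than the C library's wctype functions; the helper name run_case is hypothetical.

```python
import re

# The two re_tests entries reported as failing: (pattern, input, expected match).
# In Python 3 the u'' prefix is optional and Unicode matching is the default
# for str patterns, so the (?u) flag is redundant but harmless.
cases = [
    (r'(?u)\b.\b', '\xc4', '\xc4'),
    (r'(?u)\w', '\xc4', '\xc4'),
]

def run_case(pattern, text):
    """Return the matched text, or None if the pattern does not match."""
    m = re.search(pattern, text)
    return m.group(0) if m else None

results = [run_case(pattern, text) for pattern, text, _ in cases]
for (pattern, text, expected), found in zip(cases, results):
    status = 'ok' if found == expected else 'Failed incorrectly'
    print(status, repr(pattern), repr(text), repr(found))
```

Both patterns depend on U+00C4 being classified as a word character, which is the classification that changes when the C library is consulted instead.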
msg58542 - (view) Author: Guido van Rossum (gvanrossum) * (Python committer) Date: 2007-12-13 18:02
Can't reproduce.

Like before, what platform, compiler etc.?  Does using ./configure
--with-pydebug make a difference?  What's the LD_LIBRARY_PATH for?
msg58548 - (view) Author: Ismail Donmez (donmez) * Date: 2007-12-13 18:21
gcc 4.3, Linux 2.6.18, 32-bit.

Without LD_LIBRARY_PATH it would use the system libraries instead of the
freshly compiled ones, which is not what I want.

The configure line used is (sorry, I forgot to specify this before):

--with-fpectl \
--enable-shared \
--enable-ipv6 \
--with-threads \
--enable-unicode=ucs4 \
--with-wctype-functions

--enable-pydebug doesn't help.
msg58553 - (view) Author: Guido van Rossum (gvanrossum) * (Python committer) Date: 2007-12-13 18:28
> Without LD_LIBRARY_PATH it would use the system libraries instead of the
> freshly compiled ones, which is not what I want.

What system libraries?

Does it make a difference if you don't specify either of

--enable-unicode=ucs4 \
--with-wctype-functions

?

Is GCC 4.3 released yet?
msg58556 - (view) Author: Ismail Donmez (donmez) * Date: 2007-12-13 18:50
> What system libraries?

libpython2.5.so.1.0; this is a shared-library build after all.

> Does it make a difference if you don't specify either of
>
> --enable-unicode=ucs4 \
> --with-wctype-functions

Removing --with-wctype-functions fixes the issue.

> Is GCC 4.3 released yet?

Not yet, but soon; it's less buggy than 4.1 and 4.2 at the moment.
msg58559 - (view) Author: Amaury Forgeot d'Arc (amaury.forgeotdarc) * (Python committer) Date: 2007-12-13 19:05
> > Is GCC 4.3 released yet?
>
> Not yet, but soon; it's less buggy than 4.1 and 4.2
> at the moment.

Not quite yet, gcc 4.3 had a big inlining bug that was just corrected
two weeks ago:
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=33434
You may have encountered this bug, or another similar one...
msg58565 - (view) Author: Ismail Donmez (donmez) * Date: 2007-12-13 19:11
> Not quite yet, gcc 4.3 had a big inlining bug that was just corrected
> two weeks ago:
> http://gcc.gnu.org/bugzilla/show_bug.cgi?id=33434
> You may have encountered this bug, or another similar one...

Two weeks ago is too old for me; I am using an SVN snapshot from yesterday :-)
msg58585 - (view) Author: Ismail Donmez (donmez) * Date: 2007-12-13 20:44
In total, removing --with-wctype-functions fixes the following regression tests:

test_codecs 
test_re 
test_ucn 
test_unicodedata
msg58587 - (view) Author: Ismail Donmez (donmez) * Date: 2007-12-13 20:54
Remove test_ucn from the list; it still fails, but that is for another bug
report.
msg58639 - (view) Author: Ismail Donmez (donmez) * Date: 2007-12-14 22:41
Any ideas/comments on how to move forward with this?

Thanks,
ismail
msg58700 - (view) Author: Guido van Rossum (gvanrossum) * (Python committer) Date: 2007-12-17 18:51
Focus on how using --with-wctype-functions changes things and how this
could affect the regex implementation. (I wouldn't be surprised if the
other failing tests were due to the regex bugs.)
msg58824 - (view) Author: Ismail Donmez (donmez) * Date: 2007-12-19 20:13
The Python README says --with-wctype-functions is deprecated and will be
removed in Python 2.6, so I don't think it's worth fixing now. Also, test
failures with --with-wctype-functions seem to be known, according to
Google.

What I wonder is whether removing --with-wctype-functions causes any
regressions under the Turkish locale. I will do some research on that.
msg58825 - (view) Author: Ismail Donmez (donmez) * Date: 2007-12-19 20:21
Indeed, there seem to be regressions:

Python 2.4:

[~]> python
Python 2.4.4 (#1, Oct 23 2007, 11:25:50)
[GCC 3.4.6] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import locale
>>> locale.setlocale(locale.LC_ALL,"")
'tr_TR.UTF-8'
>>> print unicode("ıııı")
ıııı
>>> print unicode("ıııı").upper()
IIII
>>> print unicode("iiiii").upper()
İİİİİ
>>> print unicode("İİİİİ").lower()
iiiii
>>> print unicode("IIIIIII").lower()
ııııııı


Python 2.5 (incorrect):

>>> import locale
>>> locale.setlocale(locale.LC_ALL,"")
'tr_TR.UTF-8'
>>> print unicode("iiiii").upper()
IIIII
>>> print unicode("ıııı").upper()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc4 in position 0: ordinal not in range(128)
>>> print unicode("iiii").upper()
IIII


Looks like wctypes should not be dropped.
msg58826 - (view) Author: Ismail Donmez (donmez) * Date: 2007-12-19 20:26
The situation is even more complicated. The following calls behave
_correctly_ when wctypes is enabled:

>>> print unicode("iiiii").upper()
İİİİİ
>>> print unicode("IIII").lower()
ıııı

The following don't work even if wctypes is enabled:

>>> print unicode("ıııı").upper()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc4 in position 0: ordinal not in range(128)
>>> print unicode("İİİİİ").lower()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc4 in position 0: ordinal not in range(128)

All four of these calls work fine in Python 2.4 when wctypes is enabled.
msg58830 - (view) Author: Guido van Rossum (gvanrossum) * (Python committer) Date: 2007-12-19 21:41
Martin, can you have a look at this?

Cartman, can you produce a unittest for the correct behavior that only
uses ASCII input (using \u.... instead of just typing Turkish characters)?
msg58831 - (view) Author: Ismail Donmez (donmez) * Date: 2007-12-19 21:59
The test works fine when using the \u syntax. You have to call unicode()
with literal Turkish characters to get the error. See the attached test2.py.

With Python 2.4:

[~]> python test2.py
Following should print I
I
Following should print i
i

With Python 2.5 SVN:

[~/python-2.5]> ./python ~/test2.py
Following should print I
Got a unicode decode error
Following should print i
Got a unicode decode error
msg58832 - (view) Author: Ismail Donmez (donmez) * Date: 2007-12-19 22:07
So, in conclusion:

- Enabling wctypes makes Turkish support work with the \u syntax, but
breaks unicode()
- Disabling wctypes breaks Turkish support with \u and/or unicode()

The attached test.py tests the Turkish corner cases of lower()/upper().
The correct output is what Python 2.4 gives:

Following should print I
I
Following should print i
i
Following should print İ
İ
Following should print ı
ı
msg58833 - (view) Author: Guido van Rossum (gvanrossum) * (Python committer) Date: 2007-12-19 22:38
Hm.  The test2.py file, when I download it, contains the two bytes
"\xc4\xb1" in the first unicode() call, and "\xc4\xb0" in the second
one.  This is *always* supposed to produce a UnicodeDecodeError, since
it would use the default encoding which is ASCII.  So I don't understand
how you get this to pass with 2.4 at all.

When you replace the arguments with these hex escapes, does it still
pass for you?  Or does that break it?
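
The decoding rule described here can be sketched in Python 3, where calling bytes.decode with an explicit codec plays the role of Python 2's unicode(s) with its implicit default encoding. This is only an illustration under that assumption; the helper name decode_or_error is hypothetical, and the byte values are the ones quoted above from test2.py.

```python
# The two byte sequences found in test2.py.
dotless_i = b'\xc4\xb1'  # UTF-8 for U+0131 LATIN SMALL LETTER DOTLESS I
dotted_I = b'\xc4\xb0'   # UTF-8 for U+0130 LATIN CAPITAL LETTER I WITH DOT ABOVE

def decode_or_error(data, encoding):
    """Decode data, or return a short description of the decode failure."""
    try:
        return data.decode(encoding)
    except UnicodeDecodeError as exc:
        return 'UnicodeDecodeError: %s' % exc.reason

for raw in (dotless_i, dotted_I):
    # ASCII cannot represent byte 0xc4, so this branch reports an error.
    print(repr(raw), 'as ascii ->', decode_or_error(raw, 'ascii'))
    # With the right codec the same bytes decode to a single character.
    print(repr(raw), 'as utf-8 ->', ascii(decode_or_error(raw, 'utf-8')))
```

With ASCII as the codec both inputs fail on the first byte, which matches the "byte 0xc4 in position 0" message in the tracebacks earlier in the thread.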
msg58834 - (view) Author: Ismail Donmez (donmez) * Date: 2007-12-19 22:42
After replacing the Turkish characters with hex escapes in test2.py,
Python 2.5 still raises UnicodeDecodeError, while Python 2.4 works.
msg58835 - (view) Author: Guido van Rossum (gvanrossum) * (Python committer) Date: 2007-12-19 22:45
> After replacing the Turkish characters with hex escapes in test2.py,
> Python 2.5 still raises UnicodeDecodeError, while Python 2.4 works.

I'm hoping Martin can confirm this, but I suspect that this is due to
a tightening of the rules for converting from 8-bit strings to unicode
strings.

What happens if you change to unicode("....", "utf-8")?
msg58837 - (view) Author: Ismail Donmez (donmez) * Date: 2007-12-19 22:56
OK, that was because we had modified the default encoding in Lib/site.py
to be utf-8. Sorry!

The only problem left is that the last two conversions in test.py give
wrong results when wctypes is disabled, that is:

print u"\u0069".upper()

should give \u0130 (LATIN CAPITAL LETTER I WITH DOT ABOVE)

print u"\u0049".lower()

should give \u0131 (LATIN SMALL LETTER DOTLESS I)

These transformations work fine with python2.5 when
--with-wctype-functions is used.
msg58843 - (view) Author: Guido van Rossum (gvanrossum) * (Python committer) Date: 2007-12-19 23:52
> print u"\u0069".upper()
>
> should give \u0130 (LATIN CAPITAL LETTER I WITH DOT ABOVE)
>
> print u"\u0049".lower()
>
> should give \u0131 (LATIN SMALL LETTER DOTLESS I)
>
> These transformations work fine with python2.5 when
> --with-wctype-functions is used.

I think that is rather a bug in the wctype functions. Those are ASCII
letters 'i' and 'I' and their upper/lower versions are fixed by the
Unicode standard to be the corresponding ASCII letters ('I' and 'i').
The Unicode case conversions are not affected by locale.
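
A short Python 3 sketch of this point: the case mapping of the ASCII letters on a Unicode string stays the same whether or not a Turkish locale is active. It assumes the tr_TR.UTF-8 locale may or may not be installed and falls back to the current locale if it is not; the helper name ascii_case_pair is hypothetical.

```python
import locale

def ascii_case_pair():
    """Upper-case 'i' and lower-case 'I' using the str methods."""
    return 'i'.upper(), 'I'.lower()

before = ascii_case_pair()

saved = locale.setlocale(locale.LC_CTYPE)  # query the current setting
try:
    try:
        locale.setlocale(locale.LC_CTYPE, 'tr_TR.UTF-8')
    except locale.Error:
        # Assumption: the Turkish locale may not be installed; the
        # comparison below is then made without a locale change.
        pass
    after = ascii_case_pair()
finally:
    locale.setlocale(locale.LC_CTYPE, saved)  # restore the original setting

print(before, after)
```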
msg58844 - (view) Author: Ismail Donmez (donmez) * Date: 2007-12-19 23:54
But it should be affected by locale; that's the point of the
locale.setlocale call. This is how libc's wc functions behave.
msg58847 - (view) Author: Guido van Rossum (gvanrossum) * (Python committer) Date: 2007-12-19 23:58
> But it should be affected by locale; that's the point of the
> locale.setlocale call. This is how libc's wc functions behave.

No, the locale should only affect 8-bit string operations, never
unicode operations.
msg58848 - (view) Author: Ismail Donmez (donmez) * Date: 2007-12-20 00:00
OK, then what is the suggested way to get back the Turkish behavior of
upper/lower on i and I?
msg58849 - (view) Author: Guido van Rossum (gvanrossum) * (Python committer) Date: 2007-12-20 00:03
> OK, then what is the suggested way to get back the Turkish behavior of
> upper/lower on i and I?

That's a question for Martin von Loewis. I suppose you could use 8-bit
strings exclusively. Or you could use .translate() with a custom dict.
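
The .translate() suggestion could look roughly like the following Python 3 sketch. The table names and the helpers tr_upper and tr_lower are hypothetical, and only the four dotted/dotless i mappings are handled; everything else is left to the regular locale-independent case methods.

```python
# Turkish-specific mappings applied before the generic case conversion.
# Keys are code points, as required by str.translate.
TR_UPPER = {ord('i'): '\u0130',   # i -> LATIN CAPITAL LETTER I WITH DOT ABOVE
            ord('\u0131'): 'I'}   # dotless i -> I
TR_LOWER = {ord('I'): '\u0131',   # I -> LATIN SMALL LETTER DOTLESS I
            ord('\u0130'): 'i'}   # dotted capital I -> i

def tr_upper(text):
    """Upper-case text using the Turkish rules for i and dotless i."""
    return text.translate(TR_UPPER).upper()

def tr_lower(text):
    """Lower-case text using the Turkish rules for I and dotted capital I."""
    return text.translate(TR_LOWER).lower()

print(ascii(tr_upper('istanbul')))
print(ascii(tr_lower('ISPARTA')))
```

Translating first matters: the generic .lower() would otherwise turn U+0130 into "i" followed by a combining dot, and .upper() would map a plain "i" to "I".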
msg58862 - (view) Author: Martin v. Löwis (loewis) * (Python committer) Date: 2007-12-20 06:05
I think too many issues get mixed in this report. I would like to ignore
all but one issue, but I don't understand what the one issue is that
this report should deal with.

cartman, when you compare Python 2.4 and 2.5, could it be that the 2.4
Python was compiled --with-wctype-functions, and the 2.5 Python
--without-wctype-functions? That would surely explain the difference.

The Unicode lower/upper implementations are, by default, locale-unaware.
That is correct behavior, and by design. If you want locale-dependent
behavior, use 8-bit strings as Guido says.

ISTM that the original report was resolved - the tests don't support
--with-wctype-functions. This is because they assume that
LATIN CAPITAL LETTER A WITH DIAERESIS is a letter - which may not be the
case if the isletter test is locale-specific. If this is to be fixed,
the proper fix would be to just remove the test, which I advise against
- instead, the best behavior that Python should implement is the current
one, i.e. it is a good thing that the test fails
--with-wctype-functions. Perhaps a comment should be attached explaining
the potential breakage.
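
The assumption the tests make can be checked against Python's bundled Unicode database, which is what a default build consults. A minimal Python 3 sketch, with hypothetical variable names:

```python
import unicodedata

ch = '\u00c4'

# Look the character up in Python's own Unicode database; no locale or
# C library classification function is involved here.
name = unicodedata.name(ch)
category = unicodedata.category(ch)
is_letter = category.startswith('L')   # all letter categories begin with 'L'

print(name, category, is_letter, ch.isalpha())
```

A build that delegates this classification to the C library's wctype functions may answer differently depending on the active locale, which is the breakage described above.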
msg58869 - (view) Author: Ismail Donmez (donmez) * Date: 2007-12-20 12:48
Hi Martin,

Actually, the only remaining problem is how I can get wctype
functionality with 8-bit strings; any example is appreciated.

This bug itself is invalid because --with-wctype-functions is
deprecated. But as I said, I just hope removing it doesn't regress
Turkish functionality.
msg58884 - (view) Author: Guido van Rossum (gvanrossum) * (Python committer) Date: 2007-12-20 15:05
Two easy ways to get the functionality using 8-bit strings, assuming
you've already set your locale properly:

(1) If your data is already an 8-bit string (i.e. isinstance(data,
str)), simply use data.upper() or data.lower()

(2) If your data is Unicode (i.e. isinstance(data, unicode)), convert to
8-bit using encode, apply upper/lower, and convert back to unicode. 
E.g. data.encode("Latin-1").upper().decode("Latin-1").  (I don't know
which encoding to use though, so substitute whatever you have for
Latin-1, but don't use UTF-8.)

PS Martin: the 2.4/2.5 differences were caused by Cartman having hacked
his 2.4 installation to change the default encoding.
msg58887 - (view) Author: Ismail Donmez (donmez) * Date: 2007-12-20 15:18
Funnily,

print "iiii".encode("iso-8859-9").decode("iso-8859-9").upper()

works, but

print "iiii".encode("iso-8859-9").upper().decode("iso-8859-9")

does not.
msg58888 - (view) Author: Guido van Rossum (gvanrossum) * (Python committer) Date: 2007-12-20 15:19
> Funnily,
>
> print "iiii".encode("iso-8859-9").decode("iso-8859-9").upper()
>
> works, but
>
> print "iiii".encode("iso-8859-9").upper().decode("iso-8859-9")
>
> does not.

You'll have to debug this yourself.
msg58890 - (view) Author: Ismail Donmez (donmez) * Date: 2007-12-20 15:22
I guess so, I will no longer spam this bug. Thanks for the suggestions.
msg58927 - (view) Author: Martin v. Löwis (loewis) * (Python committer) Date: 2007-12-20 20:45
> print "iiii".encode("iso-8859-9").upper().decode("iso-8859-9")
> does not

Please get your types right. "iiii" is a byte string (in Python 2.x).
encode: unicode -> string
decode: string -> unicode

That you can still apply .encode to a byte string is a bug/pitfall in
Python 2.x, which gets fixed in 3.x (by only supporting .encode on the
unicode type).
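
The type rules can be seen directly in Python 3, where the text type is str and the byte-string type is bytes. A minimal sketch with hypothetical variable names:

```python
text = 'iiii'                            # text (Python 2: unicode)

encoded = text.encode('iso-8859-9')      # encode: text -> bytes
decoded = encoded.decode('iso-8859-9')   # decode: bytes -> text

# Python 3 no longer offers the confusing reverse directions.
bytes_has_encode = hasattr(encoded, 'encode')
str_has_decode = hasattr(text, 'decode')

# The Turkish letters survive a round trip through ISO-8859-9.
turkish = '\u0130\u0131'
round_trip = turkish.encode('iso-8859-9').decode('iso-8859-9')

print(type(encoded).__name__, type(decoded).__name__,
      bytes_has_encode, str_has_decode, round_trip == turkish)
```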
msg58928 - (view) Author: Ismail Donmez (donmez) * Date: 2007-12-20 21:00
I tried

unicode("iii").encode("iso-8859-9").upper()

but it doesn't work. I'll ask on the Python users list. Thanks.
History
Date User Action Args
2007-12-20 21:00:47 donmez set messages: + msg58928
2007-12-20 20:45:38 loewis set messages: + msg58927
2007-12-20 15:22:43 donmez set messages: + msg58890
2007-12-20 15:19:44 gvanrossum set messages: + msg58888
2007-12-20 15:18:29 donmez set messages: + msg58887
2007-12-20 15:05:34 gvanrossum set status: open -> closed; resolution: not a bug; messages: + msg58884
2007-12-20 12:48:57 donmez set messages: + msg58869
2007-12-20 06:05:44 loewis set messages: + msg58862
2007-12-20 00:03:50 gvanrossum set messages: + msg58849
2007-12-20 00:00:51 donmez set messages: + msg58848
2007-12-19 23:58:10 gvanrossum set messages: + msg58847
2007-12-19 23:54:30 donmez set messages: + msg58844
2007-12-19 23:52:39 gvanrossum set messages: + msg58843
2007-12-19 22:56:23 donmez set messages: + msg58837
2007-12-19 22:45:15 gvanrossum set messages: + msg58835
2007-12-19 22:42:56 donmez set messages: + msg58834
2007-12-19 22:38:29 gvanrossum set messages: + msg58833
2007-12-19 22:07:28 donmez set files: + test.py; messages: + msg58832
2007-12-19 21:59:28 donmez set files: + test2.py; messages: + msg58831
2007-12-19 21:41:22 gvanrossum set assignee: loewis; messages: + msg58830; nosy: + loewis
2007-12-19 20:26:45 donmez set messages: + msg58826
2007-12-19 20:21:38 donmez set messages: + msg58825
2007-12-19 20:13:42 donmez set messages: + msg58824
2007-12-17 18:51:21 gvanrossum set messages: + msg58700
2007-12-14 22:41:25 donmez set messages: + msg58639
2007-12-13 20:54:29 donmez set messages: + msg58587
2007-12-13 20:44:41 donmez set messages: + msg58585
2007-12-13 19:11:29 donmez set messages: + msg58565
2007-12-13 19:05:01 amaury.forgeotdarc set nosy: + amaury.forgeotdarc; messages: + msg58559
2007-12-13 18:50:16 donmez set messages: + msg58556
2007-12-13 18:28:20 gvanrossum set messages: + msg58553
2007-12-13 18:21:48 donmez set messages: + msg58548
2007-12-13 18:02:50 gvanrossum set nosy: + gvanrossum; messages: + msg58542
2007-12-13 10:18:15 donmez set type: behavior
2007-12-13 10:15:19 donmez create