Issue1609
Created on 2007-12-13 10:15 by donmez, last changed 2022-04-11 14:56 by admin. This issue is now closed.
Files
File name | Uploaded | Description
---|---|---
test2.py | donmez, 2007-12-19 21:59 |
test.py | donmez, 2007-12-19 22:07 |
Messages (34) | |||
---|---|---|---|
msg58527 - (view) | Author: Ismail Donmez (donmez) * | Date: 2007-12-13 10:15 | |
Using python 2.5 revision 59479 from release25-maint branch,
[~/python-2.5]> LD_LIBRARY_PATH=/home/cartman/python-2.5: ./python ./Lib/test/test_re.py
test_anyall (__main__.ReTests) ... ok
test_basic_re_sub (__main__.ReTests) ... ok
test_bigcharset (__main__.ReTests) ... ok
test_bug_113254 (__main__.ReTests) ... ok
test_bug_1140 (__main__.ReTests) ... ok
test_bug_114660 (__main__.ReTests) ... ok
test_bug_117612 (__main__.ReTests) ... ok
test_bug_418626 (__main__.ReTests) ... ok
test_bug_448951 (__main__.ReTests) ... ok
test_bug_449000 (__main__.ReTests) ... ok
test_bug_449964 (__main__.ReTests) ... ok
test_bug_462270 (__main__.ReTests) ... ok
test_bug_527371 (__main__.ReTests) ... ok
test_bug_545855 (__main__.ReTests) ... ok
test_bug_581080 (__main__.ReTests) ... ok
test_bug_612074 (__main__.ReTests) ... ok
test_bug_725106 (__main__.ReTests) ... ok
test_bug_725149 (__main__.ReTests) ... ok
test_bug_764548 (__main__.ReTests) ... ok
test_bug_817234 (__main__.ReTests) ... ok
test_bug_926075 (__main__.ReTests) ... ok
test_bug_931848 (__main__.ReTests) ... ok
test_category (__main__.ReTests) ... ok
test_constants (__main__.ReTests) ... ok
test_empty_array (__main__.ReTests) ... ok
test_expand (__main__.ReTests) ... ok
test_finditer (__main__.ReTests) ... ok
test_flags (__main__.ReTests) ... ok
test_getattr (__main__.ReTests) ... ok
test_getlower (__main__.ReTests) ... ok
test_groupdict (__main__.ReTests) ... ok
test_ignore_case (__main__.ReTests) ... ok
test_non_consuming (__main__.ReTests) ... ok
test_not_literal (__main__.ReTests) ... ok
test_pickling (__main__.ReTests) ... ok
test_qualified_re_split (__main__.ReTests) ... ok
test_qualified_re_sub (__main__.ReTests) ... ok
test_re_escape (__main__.ReTests) ... ok
test_re_findall (__main__.ReTests) ... ok
test_re_groupref (__main__.ReTests) ... ok
test_re_groupref_exists (__main__.ReTests) ... ok
test_re_match (__main__.ReTests) ... ok
test_re_split (__main__.ReTests) ... ok
test_re_subn (__main__.ReTests) ... ok
test_repeat_minmax (__main__.ReTests) ... ok
test_scanner (__main__.ReTests) ... ok
test_search_coverage (__main__.ReTests) ... ok
test_search_star_plus (__main__.ReTests) ... ok
test_special_escapes (__main__.ReTests) ... ok
test_sre_character_class_literals (__main__.ReTests) ... ok
test_sre_character_literals (__main__.ReTests) ... ok
test_stack_overflow (__main__.ReTests) ... ok
test_sub_template_numeric_escape (__main__.ReTests) ... ok
test_symbolic_refs (__main__.ReTests) ... ok
test_weakref (__main__.ReTests) ... ok
----------------------------------------------------------------------
Ran 55 tests in 0.194s
OK
Running re_tests test suite
=== Failed incorrectly ('(?u)\\b.\\b', u'\xc4', 0, 'found', u'\xc4')
=== Failed incorrectly ('(?u)\\w', u'\xc4', 0, 'found', u'\xc4') |
|||
msg58542 - (view) | Author: Guido van Rossum (gvanrossum) * | Date: 2007-12-13 18:02 | |
Can't reproduce. Like before, what platform, compiler etc.? Does using ./configure --with-pydebug make a difference? What's the LD_LIBRARY_PATH for? |
|||
msg58548 - (view) | Author: Ismail Donmez (donmez) * | Date: 2007-12-13 18:21 | |
gcc 4.3, Linux 2.6.18, 32-bit. Without LD_LIBRARY_PATH it would use the system libraries rather than the freshly compiled ones, which is not what I want. The configure line used is (damn, I forgot to specify this before, sorry):
--with-fpectl \
--enable-shared \
--enable-ipv6 \
--with-threads \
--enable-unicode=ucs4 \
--with-wctype-functions
--enable-pydebug doesn't help. |
|||
msg58553 - (view) | Author: Guido van Rossum (gvanrossum) * | Date: 2007-12-13 18:28 | |
> Without LD_LIBRARY_PATH it would use the system libraries and not the > compiled ones which anyway is not wanted. What system libraries? Does it make a difference if you don't specify either of --enable-unicode=ucs4 \ --with-wctype-functions ? Is GCC 4.3 released yet? |
|||
msg58556 - (view) | Author: Ismail Donmez (donmez) * | Date: 2007-12-13 18:50 | |
> What system libraries?
libpython2.5.so.1.0; this is a shared lib build after all.
> Does it make a difference if you don't specify either of
>
> --enable-unicode=ucs4 \
> --with-wctype-functions
Removing --with-wctype-functions fixes the issue.
> Is GCC 4.3 released yet?
Not yet, but soon; it's less buggy than 4.1 and 4.2 at the moment. |
|||
msg58559 - (view) | Author: Amaury Forgeot d'Arc (amaury.forgeotdarc) * | Date: 2007-12-13 19:05 | |
> > Is GCC 4.3 released yet? > > Not yet but soon, its less buggy compared to 4.1 and 4.2 > at the moment. Not quite yet, gcc 4.3 had a big inlining bug that was just corrected two weeks ago: http://gcc.gnu.org/bugzilla/show_bug.cgi?id=33434 You may have encountered this bug, or another similar one... |
|||
msg58565 - (view) | Author: Ismail Donmez (donmez) * | Date: 2007-12-13 19:11 | |
> Not quite yet, gcc 4.3 had a big inlining bug that was just corrected > two weeks ago: > http://gcc.gnu.org/bugzilla/show_bug.cgi?id=33434 > You may have encountered this bug, or another similar one... Two weeks ago is too old for me, I am using SVN snapshot from yesterday :-) |
|||
msg58585 - (view) | Author: Ismail Donmez (donmez) * | Date: 2007-12-13 20:44 | |
In total, removing --with-wctype-functions fixes the following regression tests: test_codecs, test_re, test_ucn, test_unicodedata |
|||
msg58587 - (view) | Author: Ismail Donmez (donmez) * | Date: 2007-12-13 20:54 | |
Remove test_ucn from the list; it still fails, but that is for another bug report. |
|||
msg58639 - (view) | Author: Ismail Donmez (donmez) * | Date: 2007-12-14 22:41 | |
Any ideas/comments on how to move forward with this? Thanks, ismail |
|||
msg58700 - (view) | Author: Guido van Rossum (gvanrossum) * | Date: 2007-12-17 18:51 | |
Focus on how using --with-wctype-functions changes things and how this could affect the regex implementation. (I wouldn't be surprised if the other failing tests were due to the regex bugs.) |
|||
msg58824 - (view) | Author: Ismail Donmez (donmez) * | Date: 2007-12-19 20:13 | |
The Python README says --with-wctype-functions is deprecated and will be removed in Python 2.6, so I don't think it's worth fixing now. Also, test failures with --with-wctype-functions seem to be known, according to Google. What I wonder is whether removing --with-wctype-functions causes any regressions under the Turkish locale. I will do some research on that. |
|||
msg58825 - (view) | Author: Ismail Donmez (donmez) * | Date: 2007-12-19 20:21 | |
Indeed, there seem to be regressions.
Python 2.4:
[~]> python
Python 2.4.4 (#1, Oct 23 2007, 11:25:50)
[GCC 3.4.6] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import locale
>>> locale.setlocale(locale.LC_ALL,"")
'tr_TR.UTF-8'
>>> print unicode("ıııı")
ıııı
>>> print unicode("ıııı").upper()
IIII
>>> print unicode("iiiii").upper()
İİİİİ
>>> print unicode("İİİİİ").lower()
iiiii
>>> print unicode("IIIIIII").lower()
ııııııı
Python 2.5 (incorrect):
>>> import locale
>>> locale.setlocale(locale.LC_ALL,"")
'tr_TR.UTF-8'
>>> print unicode("iiiii").upper()
IIIII
>>> print unicode("ıııı").upper()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc4 in position 0: ordinal not in range(128)
>>> print unicode("iiii").upper()
IIII
Looks like wctypes should not be dropped. |
|||
msg58826 - (view) | Author: Ismail Donmez (donmez) * | Date: 2007-12-19 20:26 | |
The situation is even more complicated. The following calls behave _correctly_ when wctypes is enabled:
>>> print unicode("iiiii").upper()
İİİİİ
>>> print unicode("IIII").lower()
ıııı
The following don't work even if wctypes is enabled:
>>> print unicode("ıııı").upper()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc4 in position 0: ordinal not in range(128)
>>> print unicode("İİİİİ").lower()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc4 in position 0: ordinal not in range(128)
All four of these calls work fine in Python 2.4 when wctypes is enabled. |
|||
msg58830 - (view) | Author: Guido van Rossum (gvanrossum) * | Date: 2007-12-19 21:41 | |
Martin, can you have a look at this? Cartman, can you produce a unittest for the correct behavior that only uses ASCII input (using \u.... instead of just typing Turkish characters)? |
|||
msg58831 - (view) | Author: Ismail Donmez (donmez) * | Date: 2007-12-19 21:59 | |
The test works fine when using the \u syntax. You have to use unicode() with Turkish characters to get the error. See the attached test2.py.
With Python 2.4:
[~]> python test2.py
Following should print I
I
Following should print i
i
With Python 2.5 SVN:
[~/python-2.5]> ./python ~/test2.py
Following should print I
Got a unicode decode error
Following should print i
Got a unicode decode error |
|||
msg58832 - (view) | Author: Ismail Donmez (donmez) * | Date: 2007-12-19 22:07 | |
So in conclusion:
- Enabling wctypes makes Turkish support work with \u syntax, but breaks unicode()
- Disabling wctypes breaks Turkish support with \u and/or unicode()
The attached test.py tests the Turkish corner cases of lower()/upper(). The correct output is what Python 2.4 gives:
Following should print I
I
Following should print i
i
Following should print İ
İ
Following should print ı
ı |
|||
msg58833 - (view) | Author: Guido van Rossum (gvanrossum) * | Date: 2007-12-19 22:38 | |
Hm. The test2.py file, when I download it, contains the two bytes "\xc4\xb1" in the first unicode() call, and "\xc4\xb0" in the second one. This is *always* supposed to produce a UnicodeDecodeError, since it would use the default encoding which is ASCII. So I don't understand how you get this to pass with 2.4 at all. When you replace the arguments with these hex escapes, does it still pass for you? Or does that break it? |
|||
msg58834 - (view) | Author: Ismail Donmez (donmez) * | Date: 2007-12-19 22:42 | |
Replacing Turkish characters with hex versions in test2.py still results in UnicodeDecodeError and works with python 2.4. |
|||
msg58835 - (view) | Author: Guido van Rossum (gvanrossum) * | Date: 2007-12-19 22:45 | |
> Replacing Turkish characters with hex versions in test2.py still results > in UnicodeDecodeError and works with python 2.4. I'm hoping Martin can confirm this, but I suspect that this is due to a tightening of the rules for converting from 8-bit strings to unicode strings. What happens if you change to unicode("....", "utf-8")? |
|||
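The decoding behaviour discussed in the two messages above can be reproduced with a short sketch. It is written for Python 3, which is an assumption on my part: the thread is about Python 2.x, where `unicode(s)` decoded 8-bit strings with the default (ASCII) codec. The helper `try_decode` is hypothetical and is not part of test2.py.

```python
# The byte pairs Guido found in test2.py. They are the UTF-8 encodings of
# U+0131 (LATIN SMALL LETTER DOTLESS I) and U+0130 (LATIN CAPITAL LETTER I
# WITH DOT ABOVE).
DOTLESS_I_UTF8 = b"\xc4\xb1"
DOTTED_CAPITAL_I_UTF8 = b"\xc4\xb0"


def try_decode(data, encoding):
    """Return the decoded text, or None if the bytes are invalid in `encoding`."""
    try:
        return data.decode(encoding)
    except UnicodeDecodeError:
        return None


if __name__ == "__main__":
    for raw in (DOTLESS_I_UTF8, DOTTED_CAPITAL_I_UTF8):
        # Decoding as ASCII fails because 0xc4 is outside range(128).
        print(repr(raw), "as ascii:", repr(try_decode(raw, "ascii")))
        # Naming the real encoding explicitly succeeds.
        print(repr(raw), "as utf-8:", repr(try_decode(raw, "utf-8")))
```

Passing the encoding explicitly, as suggested with `unicode("....", "utf-8")`, removes the dependence on whatever default encoding the installation happens to use.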
msg58837 - (view) | Author: Ismail Donmez (donmez) * | Date: 2007-12-19 22:56 | |
OK, that was because we had modified the default encoding in Lib/site.py to be utf-8. Sorry! The only problem left is that the last two conversions in test.py give wrong results when wctypes is disabled, that is:
print u"\u0069".upper()
should give \u0130 (LATIN CAPITAL LETTER I WITH DOT ABOVE)
print u"\u0049".lower()
should give \u0131 (LATIN SMALL LETTER DOTLESS I)
These transformations work fine with Python 2.5 when --with-wctype-functions is used. |
|||
msg58843 - (view) | Author: Guido van Rossum (gvanrossum) * | Date: 2007-12-19 23:52 | |
> print u"\u0069".upper() > > should give \u0130 (LATIN CAPITAL LETTER I WITH DOT ABOVE) > > print u"\u0049".lower() > > should give \u0131 (LATIN SMALL LETTER DOTLESS I) > > These transformations work fine with python2.5 when > --with-wctype-functions is used. I think that is rather a bug in the wctype functions. Those are ASCII letters 'i' and 'I' and their upper/lower versions are fixed by the Unicode standard to be the corresponding ASCII letters ('I' and 'i'). The Unicode case conversions are not affected by locale. |
|||
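To illustrate the default Unicode case mappings described in the message above, here is a minimal sketch for Python 3 (an assumption; the thread itself uses Python 2.x `unicode` objects). The helper names `names`, `upper_names` and `lower_names` are made up for this example. In Python 3, `str.upper()` and `str.lower()` do not consult the process locale.

```python
import unicodedata


def names(text):
    """Return the Unicode character name of every code point in `text`."""
    return [unicodedata.name(ch) for ch in text]


def upper_names(ch):
    """Names of the code points produced by the default uppercase mapping."""
    return names(ch.upper())


def lower_names(ch):
    """Names of the code points produced by the default lowercase mapping."""
    return names(ch.lower())


if __name__ == "__main__":
    # ASCII 'i' and 'I' map to each other, regardless of locale.
    print(upper_names("\u0069"))
    print(lower_names("\u0049"))
    # The Turkish-specific letters also have fixed default mappings.
    print(upper_names("\u0131"))
    print(lower_names("\u0130"))
```

Note that the Turkish pairing (i with İ, ı with I) is a language-specific tailoring, which is why it does not appear in these default mappings.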
msg58844 - (view) | Author: Ismail Donmez (donmez) * | Date: 2007-12-19 23:54 | |
But it should be affected by locale; that's the point of the locale.setlocale call. This is how libc's wc functions behave. |
|||
msg58847 - (view) | Author: Guido van Rossum (gvanrossum) * | Date: 2007-12-19 23:58 | |
> But it should be affected by locale, thats the point of locale.setlocale > call. This is how libc's wc functions behave. No, the locale should only affect 8-bit string operations, never unicode operations. |
|||
msg58848 - (view) | Author: Ismail Donmez (donmez) * | Date: 2007-12-20 00:00 | |
Ok then what is the suggested way to get back the Turkish way of doing upper/lower on i & I ? |
|||
msg58849 - (view) | Author: Guido van Rossum (gvanrossum) * | Date: 2007-12-20 00:03 | |
> Ok then what is the suggested way to get back the Turkish way of doing > upper/lower on i & I ? That's a question for Martin von Loewis. I suppose you could use 8-bit strings exclusively. Or you could use .translate() with a custom dict. |
|||
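The `.translate()` suggestion in the message above could look roughly like the following. This is only a sketch under assumptions: it targets Python 3 (`str.maketrans` and `str.translate`), the names `TR_UPPER`, `TR_LOWER`, `turkish_upper` and `turkish_lower` are invented for illustration, and only the two i/I pairs are tailored.

```python
# Hypothetical translation tables: remap only the letters whose Turkish
# casing differs from the default Unicode mapping.
TR_UPPER = str.maketrans({"i": "\u0130", "\u0131": "I"})
TR_LOWER = str.maketrans({"I": "\u0131", "\u0130": "i"})


def turkish_upper(text):
    """Uppercase `text`, mapping i -> U+0130 and U+0131 -> I first."""
    return text.translate(TR_UPPER).upper()


def turkish_lower(text):
    """Lowercase `text`, mapping I -> U+0131 and U+0130 -> i first."""
    return text.translate(TR_LOWER).lower()


if __name__ == "__main__":
    print(turkish_upper("iiii"))
    print(turkish_lower("IIII"))
```

Applying the table before the generic `upper()`/`lower()` call means all other characters still go through the default mappings, so the behaviour does not depend on the process locale.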
msg58862 - (view) | Author: Martin v. Löwis (loewis) * | Date: 2007-12-20 06:05 | |
I think too many issues get mixed in this report. I would like to ignore all but one issue, but I don't understand what the one issue is that this report should deal with.
cartman, when you compare Python 2.4 and 2.5, could it be that the 2.4 Python was compiled --with-wctype-functions, and the 2.5 Python --without-wctype-functions? That would surely explain the difference.
The Unicode lower/upper implementations are, by default, locale-unaware. That is correct behavior, and by design. If you want locale-dependent behavior, use 8-bit strings as Guido says.
ISTM that the original report was resolved - the tests don't support --with-wctype-functions. This is because they assume that they know that LATIN CAPITAL LETTER A WITH DIAERESIS is a letter - which may not be the case if the isletter test is locale-specific. If this is to be fixed, the proper fix would be to just remove the test, which I advise against - instead, the best behavior that Python should implement is the current one, i.e. it is a good thing that the test fails --with-wctype-functions. Perhaps a comment should be attached explaining the potential breakage. |
|||
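The assumption baked into the two failing re_tests entries can be shown with a small sketch. It uses Python 3, where `str` patterns are Unicode-aware by default, so the `(?u)` flag from the original tests is not needed; that is an assumption relative to the Python 2.5 build in this report. The `re.ASCII` flag is used here only as a stand-in for a character classification that does not treat Ä as a letter; it is not how --with-wctype-functions works internally. The helper names are hypothetical.

```python
import re

A_DIAERESIS = "\u00c4"  # LATIN CAPITAL LETTER A WITH DIAERESIS


def is_word_char(ch, flags=0):
    r"""True if the single character `ch` matches \w under `flags`."""
    return re.fullmatch(r"\w", ch, flags) is not None


def bounded_char(text, flags=0):
    r"""Return what \b.\b finds in `text`, or None if there is no match."""
    match = re.search(r"\b.\b", text, flags)
    return match.group(0) if match else None


if __name__ == "__main__":
    # Unicode classification: the character is a word character.
    print(is_word_char(A_DIAERESIS), bounded_char(A_DIAERESIS))
    # ASCII-only classification: it is not, so both patterns fail to match.
    print(is_word_char(A_DIAERESIS, re.ASCII), bounded_char(A_DIAERESIS, re.ASCII))
```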
msg58869 - (view) | Author: Ismail Donmez (donmez) * | Date: 2007-12-20 12:48 | |
Hi Martin, actually the only remaining question is how I can get wctype functionality with 8-bit strings; any example would be appreciated. This bug itself is invalid because --with-wctype-functions is deprecated. But as I said, I just hope removing that doesn't regress Turkish functionality. |
|||
msg58884 - (view) | Author: Guido van Rossum (gvanrossum) * | Date: 2007-12-20 15:05 | |
Two easy ways to get the functionality using 8-bit strings, assuming you've already set your locale properly: (1) If your data is already an 8-bit string (i.e. isinstance(data, str)), simply use data.upper() or data.lower() (2) If your data is Unicode (i.e. isinstance(data, unicode)), convert to 8-bit using encode, apply upper/lower, and convert back to unicode. E.g. data.encode("Latin-1").upper().decode("Latin-1"). (I don't know which encoding to use though -- So substitute whatever you have for Latin-1, but don't use UTF-8.) PS Martin: the 2.4/2.5 differences were caused by Cartman having hacked his 2.4 installation to change the default encoding. |
|||
msg58887 - (view) | Author: Ismail Donmez (donmez) * | Date: 2007-12-20 15:18 | |
Funnily,
print "iiii".encode("iso-8859-9").decode("iso-8859-9").upper()
works, but
print "iiii".encode("iso-8859-9").upper().decode("iso-8859-9")
does not. |
|||
msg58888 - (view) | Author: Guido van Rossum (gvanrossum) * | Date: 2007-12-20 15:19 | |
> Funnily, > > print "iiii".encode("iso-8859-9").decode("iso-8859-9").upper() > > works, but > > print "iiii".encode("iso-8859-9").upper().decode("iso-8859-9") > > not. You'll have to debug this yourself. |
|||
msg58890 - (view) | Author: Ismail Donmez (donmez) * | Date: 2007-12-20 15:22 | |
I guess so, I will no longer spam this bug. Thanks for the suggestions. |
|||
msg58927 - (view) | Author: Martin v. Löwis (loewis) * | Date: 2007-12-20 20:45 | |
> print "iiii".encode("iso-8859-9").upper().decode("iso-8859-9")
> does not
Please get your types right. "iiii" is a byte string (in Python 2.x).
encode: unicode -> string
decode: string -> unicode
That you can still apply .encode to the byte string is a bug/pitfall in Python 2.x, which gets fixed in 3.x (by only supporting .encode on the unicode type). |
|||
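The type rules in the message above, as they look after the 3.x fix it mentions, can be sketched as follows. This assumes Python 3, where the text type is `str` and the byte-string type is `bytes`; the function name `roundtrip` is invented for the example.

```python
def roundtrip(text, encoding="iso-8859-9"):
    """Encode text to bytes and decode it back, returning both results."""
    encoded = text.encode(encoding)     # str -> bytes
    decoded = encoded.decode(encoding)  # bytes -> str
    return encoded, decoded


if __name__ == "__main__":
    encoded, decoded = roundtrip("\u0130\u0131")
    print(type(encoded).__name__, repr(encoded))
    print(type(decoded).__name__, repr(decoded))
    # In Python 3 the pitfall is gone: bytes has no .encode and str has
    # no .decode, so the types cannot be mixed up silently.
    print(hasattr(b"iiii", "encode"), hasattr("iiii", "decode"))
```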
msg58928 - (view) | Author: Ismail Donmez (donmez) * | Date: 2007-12-20 21:00 | |
I tried unicode("iii").encode("iso-8859-9").upper(), which doesn't work. I'll ask on the python users list. Thanks. |
History | |||
---|---|---|---|
Date | User | Action | Args |
2022-04-11 14:56:28 | admin | set | github: 45950 |
2007-12-20 21:00:47 | donmez | set | messages: + msg58928 |
2007-12-20 20:45:38 | loewis | set | messages: + msg58927 |
2007-12-20 15:22:43 | donmez | set | messages: + msg58890 |
2007-12-20 15:19:44 | gvanrossum | set | messages: + msg58888 |
2007-12-20 15:18:29 | donmez | set | messages: + msg58887 |
2007-12-20 15:05:34 | gvanrossum | set | status: open -> closed resolution: not a bug messages: + msg58884 |
2007-12-20 12:48:57 | donmez | set | messages: + msg58869 |
2007-12-20 06:05:44 | loewis | set | messages: + msg58862 |
2007-12-20 00:03:50 | gvanrossum | set | messages: + msg58849 |
2007-12-20 00:00:51 | donmez | set | messages: + msg58848 |
2007-12-19 23:58:10 | gvanrossum | set | messages: + msg58847 |
2007-12-19 23:54:30 | donmez | set | messages: + msg58844 |
2007-12-19 23:52:39 | gvanrossum | set | messages: + msg58843 |
2007-12-19 22:56:23 | donmez | set | messages: + msg58837 |
2007-12-19 22:45:15 | gvanrossum | set | messages: + msg58835 |
2007-12-19 22:42:56 | donmez | set | messages: + msg58834 |
2007-12-19 22:38:29 | gvanrossum | set | messages: + msg58833 |
2007-12-19 22:07:28 | donmez | set | files: + test.py messages: + msg58832 |
2007-12-19 21:59:28 | donmez | set | files: + test2.py messages: + msg58831 |
2007-12-19 21:41:22 | gvanrossum | set | assignee: loewis messages: + msg58830 nosy: + loewis |
2007-12-19 20:26:45 | donmez | set | messages: + msg58826 |
2007-12-19 20:21:38 | donmez | set | messages: + msg58825 |
2007-12-19 20:13:42 | donmez | set | messages: + msg58824 |
2007-12-17 18:51:21 | gvanrossum | set | messages: + msg58700 |
2007-12-14 22:41:25 | donmez | set | messages: + msg58639 |
2007-12-13 20:54:29 | donmez | set | messages: + msg58587 |
2007-12-13 20:44:41 | donmez | set | messages: + msg58585 |
2007-12-13 19:11:29 | donmez | set | messages: + msg58565 |
2007-12-13 19:05:01 | amaury.forgeotdarc | set | nosy: + amaury.forgeotdarc messages: + msg58559 |
2007-12-13 18:50:16 | donmez | set | messages: + msg58556 |
2007-12-13 18:28:20 | gvanrossum | set | messages: + msg58553 |
2007-12-13 18:21:48 | donmez | set | messages: + msg58548 |
2007-12-13 18:02:50 | gvanrossum | set | nosy: + gvanrossum messages: + msg58542 |
2007-12-13 10:18:15 | donmez | set | type: behavior |
2007-12-13 10:15:19 | donmez | create |