Message 142075 - Python tracker


Author	tchrist
Recipients	Arfrever, ezio.melotti, jkloth, mrabarnett, pitrou, r.david.murray, rhettinger, tchrist, terry.reedy
Date	2011-08-14.19:00:19
Message-id	<2651.1313348403@chthon>
In-reply-to	<1313342152.26.0.740730067024.issue12749@psf.upfronthosting.co.za>

Content

Ezio Melotti <report@bugs.python.org> wrote on Sun, 14 Aug 2011 17:15:52 -0000: 

>> You're right: my wide build is not Python3, just Python2.

> And is it failing?  Here the tests pass on the wide builds, on both Python 2 and 3.

Perhaps I am doing something wrong?

    linux% python --version
    Python 2.6.2

    linux% python -c 'import sys; print sys.maxunicode'
    1114111

    linux% cat -n bigrange.py
     1	#!/usr/bin/env python
     2	# -*- coding: UTF-8 -*-
     3	
     4	from __future__ import print_function
     5	from __future__ import unicode_literals
     6	
     7	import re
     8	
     9	flags = re.UNICODE
    10	
    11	if re.search("[a-z]", "c", flags): 
    12	    print("match 1 passed")
    13	else:
    14	    print("match 1 failed")
    15	
    16	if re.search("[𝒜-𝒵]", "𝒞", flags): 
    17	    print("match 2 passed")
    18	else:
    19	    print("match 2 failed")
    20	
    21	if re.search("[\U0001D49C-\U0001D4B5]", "\U0001D49E", flags): 
    22	    print("match 3 passed")
    23	else:
    24	    print("match 3 failed")
    25	
    26	if re.search("[\N{MATHEMATICAL SCRIPT CAPITAL A}-\N{MATHEMATICAL SCRIPT CAPITAL Z}]",
    27	    "\N{MATHEMATICAL SCRIPT CAPITAL C}", flags): 
    28	    print("match 4 passed")
    29	else:
    30	    print("match 4 failed")

    linux% python bigrange.py
    match 1 passed
    Traceback (most recent call last):
      File "bigrange.py", line 16, in <module>
        if re.search("[𝒜-𝒵]", "𝒞", flags):
      File "/usr/lib64/python2.6/re.py", line 142, in search
        return _compile(pattern, flags).search(string)
      File "/usr/lib64/python2.6/re.py", line 245, in _compile
        raise error, v # invalid expression
    sre_constants.error: bad character range
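
For comparison, here is a condensed sketch of the same four checks, assuming Python 3.3 or later, where the narrow/wide build distinction is gone and sys.maxunicode is always 0x10FFFF.  The names CHECKS and run_checks are made up for illustration; they are not part of bigrange.py.

```python
import re
import sys

# Hypothetical table of (label, pattern, subject) mirroring bigrange.py.
CHECKS = [
    ("match 1", "[a-z]", "c"),
    ("match 2", "[𝒜-𝒵]", "𝒞"),
    ("match 3", "[\U0001D49C-\U0001D4B5]", "\U0001D49E"),
    ("match 4",
     "[\N{MATHEMATICAL SCRIPT CAPITAL A}-\N{MATHEMATICAL SCRIPT CAPITAL Z}]",
     "\N{MATHEMATICAL SCRIPT CAPITAL C}"),
]


def run_checks():
    """Return {label: bool} saying whether each range match succeeded."""
    return {label: re.search(pattern, subject) is not None
            for label, pattern, subject in CHECKS}


if __name__ == "__main__":
    # On a narrow build (Python 3.2 and earlier) this would print False.
    print("full code point range:", sys.maxunicode > 0xFFFF)
    for label, ok in run_checks().items():
        print(label, "passed" if ok else "failed")
```

Run as a script, it prints one passed/failed line per check, in the same spirit as the original.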

>> In fact, it's even worse, because it's the stock build on Linux, 
>> which seems on this machine to be 2.6 not 2.7.

> What is worse?  FWIW on my system the default `python` is a 2.7 wide. `python3` is a 3.2 wide.

I meant that it was running 2.6 not 2.7.  

>> I have private builds that are 2.7 and 3.2, but those are both narrow.
>> I do not have a 3.3 build.  Should I?

> 3.3 is the version in development, not released yet.  If you have an
> HG clone of Python you can make a wide build of 3.x with ./configure
> --with-wide-unicode and of 2.7 using ./configure
> --enable-unicode=ucs4.

And Antoine Pitrou <pitrou@free.fr> wrote:

>> I have private builds that are 2.7 and 3.2, but those are both narrow.
>> I do not have a 3.3 build.  Should I?

> I don't know if you *should*. But you can make one easily by passing
> "--with-wide-unicode" to ./configure.

Oh good.  I need to read configure --help more carefully next time.
I have to do some Lucene work this afternoon, so I can let several builds
chug along.  

Is there a way to easily have these co-exist on the same system?  I'm sure
I have to rebuild all C extensions for the new builds, but I wonder what to
do about (for example) /usr/local/lib/python3.2 being able to be only one of
narrow or wide.  Probably I just need to go read the configure stuff more
carefully for alternate paths.  Unsure.

Variant Perl builds can coexist on the same system with some directories
shared and others not, but I often find other systems aren't quite that
flexible, usually requiring their own dedicated trees.  Manpaths can get
tricky, too.

>> I'm remembering why I removed Python2 from my Unicode talk, because
>> of how it made me pull my hair out.  People at the talk wanted to know
>> what I meant, but I didn't have time to go into it.  I think this
>> gets added to the hairpulling list.

> I'm not sure what you are referring to here.

There seem to be many more things to get wrong with Unicode in v2 than in v3.

I don't know how much of this is just my slowness at ramping up the learning
curve, how much is due to historical defaults that don't work well for 
Unicode, and how much is 

Python2:

    re.search(u"[\N{MATHEMATICAL SCRIPT CAPITAL A}-\N{MATHEMATICAL SCRIPT CAPITAL Z}]".encode('utf-8'), 
               u"\N{MATHEMATICAL SCRIPT CAPITAL C}".encode('utf-8'), re.UNICODE)

Python3:

    re.search("[\N{MATHEMATICAL SCRIPT CAPITAL A}-\N{MATHEMATICAL SCRIPT CAPITAL Z}]",
               "\N{MATHEMATICAL SCRIPT CAPITAL C}", re.UNICODE)

The Python2 version is *much* noisier.  

(1) You have to keep remembering to u"..." everything because neither
        # -*- coding: UTF-8 -*-
    nor even
        from __future__ import unicode_literals
    suffices.  

(2) You have to manually encode every string, which is utterly bizarre to me.

(3) Plus you then have to turn around and tell re, "Hey by the way, you know those
    Unicode strings I just passed you?  Those are Unicode strings, you know."
    Like it couldn't tell that already by realizing it got Unicode not byte 
    strings.  So weird.
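
As a sketch of how little ceremony the Python 3 side needs (assuming str patterns and str subjects, not bytes): the names pattern, subject, str_is_unicode and bytes_is_unicode below are only for illustration.

```python
import re

pattern = "[\N{MATHEMATICAL SCRIPT CAPITAL A}-\N{MATHEMATICAL SCRIPT CAPITAL Z}]"
subject = "\N{MATHEMATICAL SCRIPT CAPITAL C}"

# No u"..." prefix, no .encode(), and no re.UNICODE argument.
match = re.search(pattern, subject)

# A compiled str pattern reports re.UNICODE among its flags by default;
# a compiled bytes pattern does not.
str_is_unicode = bool(re.compile(pattern).flags & re.UNICODE)
bytes_is_unicode = bool(re.compile(b"[a-z]").flags & re.UNICODE)

if __name__ == "__main__":
    print("matched:", match is not None)
    print("str pattern has re.UNICODE:", str_is_unicode)
    print("bytes pattern has re.UNICODE:", bytes_is_unicode)
```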

It's a very awkward model.  Compare Perl's

   "\N{MATHEMATICAL SCRIPT CAPITAL C}" =~ /[\N{MATHEMATICAL SCRIPT CAPITAL A}-\N{MATHEMATICAL SCRIPT CAPITAL Z}]/

That's the kind of thing I'm used to.   It knows those are Unicode pattern matches on
Unicode strings with Unicode semantics.  After all, the \N{⋯} always evaluates to
Unicode strings, so the regex engine of course is Unicodey then without being begged.
To do bytewise processing, I would have to manually do all that encoding rigmarole
like you show for Python2.  And I never want to do that, because looking for code units
is way beneath the level of abstraction I strongly prefer to work with.  Code points
are as low as I go, and often not even there, since I often need graphemes or
sometimes even linguistic collating elements (n-graphs), like the <ch> or <ll>
digraphs in traditional Spanish or <dd> or <rh> in Welsh,  or heaven help us the 
<dz> or <sz> digraphs, the <dzs> or <tty> trigraphs, or the <ddsz> tetragraph 
in Hungarian.

    (Yes, Hungarian alone has a tetragraph, and there are no pentagraphs;
     small solace that, though.)
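
To make the levels concrete, here is a rough sketch in Python 3.  The first part counts code units versus code points for one character.  The second part splits a word into n-graphs using only the Hungarian multigraphs named above; MULTIGRAPHS and split_ngraphs are hypothetical names, the list is deliberately incomplete, and the greedy longest-match rule ignores morpheme boundaries that real collation has to respect.

```python
import re

# Code units versus code points for a single non-BMP character.
script_c = "\N{MATHEMATICAL SCRIPT CAPITAL C}"
code_points = len(script_c)                              # one code point
utf16_units = len(script_c.encode("utf-16-le")) // 2     # a surrogate pair
utf8_units = len(script_c.encode("utf-8"))               # four bytes

# Only the multigraphs mentioned in the text; Hungarian has more
# (cs, gy, ly, ny, ty, zs), so this is an assumption-laden subset.
MULTIGRAPHS = ["ddsz", "dzs", "tty", "dz", "sz"]

# Longest alternatives first so that "dzs" wins over "dz"; the trailing
# "." picks up every remaining single character.
_TOKEN = re.compile(
    "|".join(sorted(MULTIGRAPHS, key=len, reverse=True)) + "|.",
    re.DOTALL,
)


def split_ngraphs(word):
    """Split word into multigraphs and single characters, greedily."""
    return _TOKEN.findall(word)


if __name__ == "__main__":
    print(code_points, utf16_units, utf8_units)
    print(split_ngraphs("dzsungel"))
    print(split_ngraphs("szia"))
```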

FWIW, I give Python major kudos for having \N{⋯} available so that people 
no longer have to embed non-ASCII or magic numbers or ugly function
calls all over their code.    

  *  Non-ASCII sometimes has the advantages of legibility but paradoxically
     also sometimes has the disadvantage of illegibility, bizarre as that
     sounds.  It is too easy to be tricked by lookalikey font issues.

        16	if re.search("[𝒜-𝒵]", "𝒞", flags): 

  *  Magic numbers quite simply suck.  Nobody knows what they are.

        21	if re.search("[\U0001D49C-\U0001D4B5]", "\U0001D49E", flags): 

  *  Requiring explicitly coded callouts to a library is at best tedious and
     annoying.  ICU4J's UCharacter and JDK7's Character classes both have
         String  getName(int codePoint)
     but JDK7 has nothing that goes the other way around; for that, ICU4J has
         int     getCharFromName(String name)
     and ICU4C has 
         UChar32 u_charFromName(UCharNameChoice  nameChoice,
                                const char      *name,
                                UErrorCode      *pErrorCode)
     Anybody can see how deathly unwieldy all of that is.
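
For reference, Python's standard library has both directions of that lookup in the unicodedata module.  A small sketch (the variable names are only for illustration):

```python
import unicodedata

# Name -> character, the direction getCharFromName / u_charFromName cover.
ch = unicodedata.lookup("MATHEMATICAL SCRIPT CAPITAL C")

# Character -> name, the direction getName covers.
name = unicodedata.name(ch)

if __name__ == "__main__":
    print("U+%04X" % ord(ch), name)
```

It works, but it is still a function call where \N{⋯} would have been a literal.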

ICU4C's regex library admits \N{⋯} just as Perl and Python do, but that 
class is not available in ICU4J, so you have to JNI to it as Android does.  
This is really much cleaner and clearer for maintenance:

        26	if re.search("[\N{MATHEMATICAL SCRIPT CAPITAL A}-\N{MATHEMATICAL SCRIPT CAPITAL Z}]",
        27	    "\N{MATHEMATICAL SCRIPT CAPITAL C}", flags): 

As far as I know, nothing but Perl and Python allows \N{⋯} in interpolated
literals, even among those few languages that *have* interpolated literals.

One question: If one really must use code point numbers in strings, does Python 
have any clean uniform way to enter them besides having to choose the clunky \uHHHH 
vs \UHHHHHHHH thing?   The goal is to be able to specify any (legal) number of hex
digits without having to zero-pad them, which is especially annoying with Python's \U, since
you usually only need 5 hex digits and only very rarely 6, but the dumb thing makes
you type all 8 of them every time anyway.

You should somehow be able to specify only as many hex digits as you actually need.
Ruby, and now also recent Unicode tech reports like current tr18, tend to use \u{⋯}
for that.  The \x{⋯} flavor is used by Perl strings and regexes, plus also the regexes
in ICU, JDK7, and Matthew's regex library for Python.  
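
Short of a language change, the braced form can be emulated at run time.  A minimal sketch, assuming the input is a raw string that contains \x{⋯} or \u{⋯} escapes; the helper name expand_braced_escapes is made up here and is not anything in the standard library.  Underscores inside the braces are tolerated, as Perl does.

```python
import re

# Matches \x{HEX} or \u{HEX}; underscores are allowed as digit separators.
_BRACED = re.compile(r"\\[xu]\{([0-9A-Fa-f][0-9A-Fa-f_]*)\}")


def expand_braced_escapes(text):
    """Replace each \\x{...} or \\u{...} in text with the code point it names.

    chr() raises ValueError for anything above 0x10FFFF, so out-of-range
    numbers are rejected rather than silently accepted.
    """
    def _replace(match):
        digits = match.group(1).replace("_", "")
        return chr(int(digits, 16))

    return _BRACED.sub(_replace, text)


if __name__ == "__main__":
    print(expand_braced_escapes(r"\x{3B1} \u{1D4E9} \x{10_FFFF}"))
```

This only helps for strings processed at run time; literals in source code still need \uHHHH or \UHHHHHHHH.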

It's just a lot easier, which is why I miss it from regular Python strings.  It
occurs to me that you could add it completely backwards compatibly, since it is
currently a syntax error:

    % python3.2 -c 'print("\x65")'
    e

    % python3.2 -c 'print("\u0065")'
    e

    % python3.2 -c 'print("\u03B1")'
    α

    % python3.2 -c 'print("\U0001D4E9")'
    𝓩

    % python3.2 -c 'print("\u{1D4E9}")'
      File "<string>", line 1
    SyntaxError: (unicode error) 'unicodeescape' codec can't decode bytes in position 0-2: truncated \uXXXX escape
    Exit 1

    % python3.2 -c 'print("\x{1D4E9}")'
      File "<string>", line 1
    SyntaxError: (unicode error) 'unicodeescape' codec can't decode bytes in position 0-2: truncated \xXX escape
    Exit 1
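
The same check can be scripted rather than run by hand.  A sketch, assuming a Python 3 interpreter; is_syntax_error is a hypothetical helper, and the result reflects whatever interpreter runs it, not only 3.2.

```python
def is_syntax_error(source):
    """Return True if compiling source as an expression raises SyntaxError."""
    try:
        compile(source, "<string>", "eval")
    except SyntaxError:
        return True
    return False


# The two braced spellings from the transcript, kept raw so that the
# backslashes reach compile() untouched.
candidates = [r'"\u{1D4E9}"', r'"\x{1D4E9}"']
results = {source: is_syntax_error(source) for source in candidates}

if __name__ == "__main__":
    for source, rejected in results.items():
        print(source, "-> SyntaxError" if rejected else "-> accepted")
```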

Perl only uses \x, not \x AND \u AND \U the way Python does, because 
ahem, it seems a bit silly to have three different ways to do it. :)

    % perl -le 'print "\x9"' | cat -t
    ^I
    % perl -le 'print "\x65"'
    e

    % perl -le 'print "\x{9}"' | cat -t
    ^I
    % perl -le 'print "\x{65}"'
    e
    % perl -le 'print "\x{3B1}"'
    α
    % perl -le 'print "\x{FB03}"'
    ﬃ
    % perl -le 'print "\x{1D4E9}"'
    𝓩
    % perl -le 'print "\x{1FFFF}"   lt "\x{100000}"'
    1
    % perl -le 'print "\x{10_FFFF}" gt "\x{01_FFFF}"'
    1

Thanks for all your generous help and kindly patience.

--tom

History
Date	User	Action	Args
2011-08-14 19:00:22	tchrist	set	recipients: + tchrist, rhettinger, terry.reedy, pitrou, jkloth, ezio.melotti, mrabarnett, Arfrever, r.david.murray
2011-08-14 19:00:20	tchrist	link	issue12749 messages
2011-08-14 19:00:19	tchrist	create