Message 110746 - Python tracker


Author	pitrou
Recipients	BreamoreBoy, adamnelson, ajaksu2, collinwinter, eric.araujo, ezio.melotti, mastrodomenico, mgiuca, nagle, orsenthil, pitrou, vak, varmaa, vstinner
Date	2010-07-19.13:21:58
Message-id	<1279545714.3146.36.camel@localhost.localdomain>
In-reply-to	<1279544007.96.0.781099148111.issue1712522@psf.upfronthosting.co.za>

Content

> Now one of the major goals of Python 2.6/2.7 is to allow the writing
> of code which ports smoothly to Python 3. Unicode support is a major
> issue here.

I understand the argument. But 2.7 is a bugfix branch and shouldn't
receive new features, even backports. If we wanted 2.x to converge
further toward 3.x, we would do a 2.8, which we have decided not to do.

> I don't consider use of Unicode strings in Python 2.7 to be
> "accidental". In my experience with Python 2, pretty much everything
> already works with Unicode strings, and it's best practice to use
> them.

Not true. From the urllib module itself:

$ touch /tmp/hé
$ python -c 'import urllib; urllib.urlretrieve("file:///tmp/hé")'
$ python -c 'import urllib; urllib.urlretrieve(u"file:///tmp/hé")'
Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "/usr/lib64/python2.6/urllib.py", line 93, in urlretrieve
    return _urlopener.retrieve(url, filename, reporthook, data)
  File "/usr/lib64/python2.6/urllib.py", line 225, in retrieve
    url = unwrap(toBytes(url))
  File "/usr/lib64/python2.6/urllib.py", line 1027, in toBytes
    " contains non-ASCII characters")
UnicodeError: URL u'file:///tmp/h\xc3\xa9' contains non-ASCII characters

> Having functions in Python 2.7 which don't accept Unicode (or worse,
> raise random exceptions) runs against best practices for moving to
> Python 3.

There are lots of them, and urllib.quote() isn't an exception:

>>> zlib.compress("ha")
'x\x9c\xcbH\x04\x00\x013\x00\xca'
>>> zlib.compress(u"hà")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode character u'\xe0' in position 1: ordinal not in range(128)

>>> pwd.getpwnam("root")
pwd.struct_passwd(pw_name='root', pw_passwd='x', pw_uid=0, pw_gid=0, pw_gecos='root', pw_dir='/root', pw_shell='/bin/bash')
>>> pwd.getpwnam(u"rooté")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode character u'\xe9' in position 4: ordinal not in range(128)

> In fact, most code written to work with strings naturally works with
> Unicode because unicode strings support the same basic operations.

What should zlib compression of a unicode string result in?
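
To make the question concrete: zlib works on bytes, so compressing text
means choosing an encoding first, and the result depends on that choice.
A minimal sketch, written in Python 3 syntax; the helper name
compress_text is hypothetical and not part of zlib:

```python
import zlib


def compress_text(text, encoding):
    """Hypothetical helper: encode the text explicitly, then compress the bytes."""
    return zlib.compress(text.encode(encoding))


text = u"h\u00e0"  # 'h' followed by a-grave, as in the example above

utf8_blob = compress_text(text, "utf-8")
latin1_blob = compress_text(text, "latin-1")

# Decompressing gives back the encoded bytes, not the original text,
# and those bytes differ between the two encodings.
utf8_bytes = zlib.decompress(utf8_blob)      # 'h' plus two bytes for a-grave
latin1_bytes = zlib.decompress(latin1_blob)  # 'h' plus one byte for a-grave
```

Neither result is more "correct" than the other, which is why an implicit
choice made inside zlib.compress() would be arbitrary.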

> > The original issue is against robotparser, and clearly states a bug
> > (robotparser doesn't work in some cases).
> 
> I don't know why this keeps coming back to robotparser. The original
> bug was not against robotparser; it is called "quote throws exception
> on Unicode URL" and that is the bug. Robotparser was just one
> demonstrative piece of code which failed because of it.

Well, there are two different concerns:
- robotparser fails on certain Web pages, which is a bug (unless the Web
pages are clearly malformed)
- urllib.quote() should accept any kind of unicode string, and perform
appropriate encoding, with an ability to override default encoding
parameters: this is a feature request
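
As a rough sketch of what the second point asks for: encode the text
first, then percent-encode the resulting bytes, and let the caller
override the encoding. The helper name quote_text and its parameters are
hypothetical, and UTF-8 as the default is an assumption made for this
example. It is written against Python 3's urllib.parse, where quote()
accepts bytes as well as text:

```python
from urllib.parse import quote


def quote_text(text, safe="/", encoding="utf-8", errors="strict"):
    """Hypothetical wrapper: encode the text explicitly, then percent-encode the bytes."""
    return quote(text.encode(encoding, errors), safe=safe)


path = u"/tmp/h\u00e9"  # the file name used in the urlretrieve example

default_quoted = quote_text(path)                     # assumed UTF-8 default
latin1_quoted = quote_text(path, encoding="latin-1")  # caller overrides the encoding
```

With the assumed default this gives /tmp/h%C3%A9, and with latin-1 it
gives /tmp/h%E9. The same text maps to two different URLs, which is why
the encoding has to be a parameter rather than a hidden choice.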

The OP himself (John Nagle) said:
“The problem is down inside a library module. "robotparser" is calling
"urllib.quote". One of those two library modules needs to be fixed.”

It seems to imply that the primary concern was robotparser not working.

History
Date	User	Action	Args
2010-07-19 13:22:01	pitrou	set	recipients: + pitrou, collinwinter, varmaa, nagle, orsenthil, vstinner, ajaksu2, ezio.melotti, eric.araujo, mgiuca, mastrodomenico, vak, adamnelson, BreamoreBoy
2010-07-19 13:21:59	pitrou	link	issue1712522 messages
2010-07-19 13:21:58	pitrou	create