Message110746
> Now one of the major goals of Python 2.6/2.7 is to allow the writing
> of code which ports smoothly to Python 3. Unicode support is a major
> issue here.
I understand the argument. But 2.7 is a bugfix branch and shouldn't
receive new features, even backports. If we wanted 2.x to converge
further into 3.x, we would do a 2.8, which we have decided not to do.
> I don't consider use of Unicode strings in Python 2.7 to be
> "accidental". In my experience with Python 2, pretty much everything
> already works with Unicode strings, and it's best practice to use
> them.
Not true. From the urllib module itself:
$ touch /tmp/hé
$ python -c 'import urllib; urllib.urlretrieve("file:///tmp/hé")'
$ python -c 'import urllib; urllib.urlretrieve(u"file:///tmp/hé")'
Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "/usr/lib64/python2.6/urllib.py", line 93, in urlretrieve
    return _urlopener.retrieve(url, filename, reporthook, data)
  File "/usr/lib64/python2.6/urllib.py", line 225, in retrieve
    url = unwrap(toBytes(url))
  File "/usr/lib64/python2.6/urllib.py", line 1027, in toBytes
    " contains non-ASCII characters")
UnicodeError: URL u'file:///tmp/h\xc3\xa9' contains non-ASCII characters
> Having functions in Python 2.7 which don't accept Unicode (or worse,
> raise random exceptions) runs against best practices for moving to
> Python 3.
There are lots of them, and urllib.quote() isn't an exception:
'x\x9c\xcbH\x04\x00\x013\x00\xca'
>>> zlib.compress(u"hà")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode character u'\xe0' in position 1: ordinal not in range(128)
pwd.struct_passwd(pw_name='root', pw_passwd='x', pw_uid=0, pw_gid=0, pw_gecos='root', pw_dir='/root', pw_shell='/bin/bash')
>>> pwd.getpwnam(u"rooté")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode character u'\xe9' in position 4: ordinal not in range(128)
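Both tracebacks suggest the same mechanism: the byte-oriented function tries to encode its unicode argument with the ASCII codec before doing any work. A minimal sketch of that step follows. It is written for Python 3 (an assumption; the sessions above are Python 2.6), and `implicit_ascii` is a hypothetical helper, not part of zlib or pwd.

```python
def implicit_ascii(text):
    """Hypothetical stand-in for the implicit ASCII encoding step."""
    return text.encode("ascii")


# Pure-ASCII text passes through unchanged, which is why such calls
# appear to "work" with unicode strings.
print(implicit_ascii("root"))

try:
    implicit_ascii("rooté")
except UnicodeEncodeError as exc:
    # The codec reports which character failed and where:
    # position 4, matching the pwd.getpwnam() traceback.
    print(exc.encoding, exc.start, repr(exc.object[exc.start]))
```

The failure depends only on the data, not on the type of the argument, so it can stay hidden until the first non-ASCII input arrives.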
> In fact, most code written to work with strings naturally works with
> Unicode because unicode strings support the same basic operations.
What should zlib compression of a unicode string result in?
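To make the question concrete: zlib operates on bytes, so the text has to be encoded first, and the choice of encoding changes the result. A small sketch, written for Python 3 (an assumption; the sessions above are Python 2.6), with UTF-8 and Latin-1 picked purely for illustration:

```python
import zlib

text = "hà"

# Two equally plausible encodings of the same text...
utf8_compressed = zlib.compress(text.encode("utf-8"))
latin1_compressed = zlib.compress(text.encode("latin-1"))

# ...produce different compressed byte streams.
print(utf8_compressed != latin1_compressed)

# Each stream round-trips only with the encoding that produced it.
print(zlib.decompress(utf8_compressed).decode("utf-8") == text)
print(zlib.decompress(latin1_compressed).decode("latin-1") == text)
```

There is no single obvious answer the function could pick on the caller's behalf, which is the point of the question.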
> > The original issue is against robotparser, and clearly states a bug
> > (robotparser doesn't work in some cases).
>
> I don't know why this keeps coming back to robotparser. The original
> bug was not against robotparser; it is called "quote throws exception
> on Unicode URL" and that is the bug. Robotparser was just one
> demonstrative piece of code which failed because of it.
Well, there are two different concerns:
- robotparser fails on certain Web pages, which is a bug (unless the Web
  pages are clearly malformed)
- urllib.quote() should accept any kind of unicode string and perform
  appropriate encoding, with the ability to override the default encoding
  parameters: this is a feature request
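For illustration, the feature request in the second item could look roughly like the sketch below. It is written for Python 3 (an assumption), and `quote_text`, its parameter names, its defaults and its safe-character set are all hypothetical; this is not the urllib.quote() implementation or any proposed patch.

```python
# Characters never percent-encoded in this sketch (an illustrative choice).
_ALWAYS_SAFE = frozenset(
    b"ABCDEFGHIJKLMNOPQRSTUVWXYZ"
    b"abcdefghijklmnopqrstuvwxyz"
    b"0123456789_.-"
)


def quote_text(text, safe="/", encoding="utf-8", errors="strict"):
    """Hypothetical helper: encode text explicitly, then percent-encode."""
    raw = text.encode(encoding, errors)
    keep = _ALWAYS_SAFE | frozenset(safe.encode("ascii"))
    return "".join(
        chr(byte) if byte in keep else "%{:02X}".format(byte)
        for byte in bytearray(raw)
    )


print(quote_text("/tmp/hé"))                      # default encoding
print(quote_text("/tmp/hé", encoding="latin-1"))  # caller overrides it
```

The encoding step is explicit and overridable, so the caller decides how non-ASCII characters map to bytes before any quoting takes place.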
The OP himself (John Nagle) said:
“The problem is down inside a library module. "robotparser" is calling
"urllib.quote". One of those two library modules needs to be fixed.”
It seems to imply that the primary concern was robotparser not working.

Date | User | Action | Args
2010-07-19 13:22:01 | pitrou | set | recipients: + pitrou, collinwinter, varmaa, nagle, orsenthil, vstinner, ajaksu2, ezio.melotti, eric.araujo, mgiuca, mastrodomenico, vak, adamnelson, BreamoreBoy
2010-07-19 13:21:59 | pitrou | link | issue1712522 messages
2010-07-19 13:21:58 | pitrou | create |