Created on 2008-05-23 03:22 by vvro, last changed 2008-05-23 08:32 by lemburg. This issue is now closed.
|msg67214 - (view)||Author: Vasco Rodrigues (vvro)||Date: 2008-05-23 03:22|
The hashing algorithms don't support Unicode. Any Unicode text passed to them is first converted to ASCII and then hashed, and not all strings are convertible to ASCII. Now that Unicode is becoming the default encoding, especially on the web side of Python, where many of these hashing algorithms are used, there should be some kind of Unicode support in them. Example:
    from hashlib import md5
    md5(u'joão')
    UnicodeEncodeError: 'ascii' codec can't encode character u'\xe3' in position 2: ordinal not in range(128)
|msg67215 - (view)||Author: Raymond Hettinger (rhettinger) *||Date: 2008-05-23 03:38|
I don't think this is the right thing to do. The hash algorithms are defined in terms of bytes, but Unicode is abstracted from a byte-level encoding. It doesn't make sense to convert using an arbitrary encoding (such as UTF-8) because someone else might hash the same text using a different encoding. Marc, do you concur?
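To illustrate the point about arbitrary encodings, here is a minimal sketch, written for Python 3 (an assumption; the thread itself concerns 2.x). The three encodings are picked only for illustration. The same text yields a different byte sequence under each, so the MD5 digests differ too.

```python
import hashlib

text = "joão"  # the reporter's example string

# Hash the same text under three encodings (chosen here for illustration only).
digests = {
    encoding: hashlib.md5(text.encode(encoding)).hexdigest()
    for encoding in ("utf-8", "latin-1", "utf-16-le")
}

for encoding, digest in digests.items():
    print(encoding, digest)

# Each encoding produces different bytes, so no two digests agree.
distinct = len(set(digests.values()))
```

Two parties comparing checksums of this text would therefore need to agree on the encoding first.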
|msg67216 - (view)||Author: Vasco Rodrigues (vvro)||Date: 2008-05-23 04:19|
You could just check for unicode strings and do the encode inside the hash function. I understand the byte abstraction, but if you encode a unicode string containing only ASCII characters, you get the same bytes as the ASCII string, so the result will be the same. So I have to do md5(u'joão'.encode("utf-8"))? Wasn't unicode becoming the default? If I do md5(u'john'), it works, and that's a unicode string. It should have told me: no unicode, then...
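The observation about ASCII-only strings can be checked with a short sketch, again assuming Python 3. Pure-ASCII text encodes to identical bytes under ASCII, UTF-8 and Latin-1, which is why the implicit conversion appears to work for u'john'; the non-ASCII string needs an explicitly chosen encoding.

```python
import hashlib

ascii_text = "john"
# ASCII-only text encodes to the same bytes under all three encodings.
ascii_bytes = {ascii_text.encode(enc) for enc in ("ascii", "utf-8", "latin-1")}
same_bytes = len(ascii_bytes) == 1

non_ascii = "joão"
try:
    non_ascii.encode("ascii")
    ascii_ok = True
except UnicodeEncodeError:
    # 'ã' has no ASCII representation, so the conversion fails.
    ascii_ok = False

# With an explicit encoding, hashing works; UTF-8 here is the caller's choice.
explicit_digest = hashlib.md5(non_ascii.encode("utf-8")).hexdigest()
print(same_bytes, ascii_ok, explicit_digest)
```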
|msg67218 - (view)||Author: Raymond Hettinger (rhettinger) *||Date: 2008-05-23 05:49|
Only 2.6 should be marked. This is a feature request for an implicit conversion with a default encoding; it is not a bugfix. FWIW, here's a reference to an earlier discussion: http://mail.python.org/pipermail/python-list/2004-April/258630.html Also, it is unpersuasive that md5() works with u'john'. That is just an artifact of the 2.x series. In 3.0, there is a clearer distinction between bytes and text. Recommending that this be rejected and closed. Without a universally accepted (not just within Python) implicit encoding, there's no way to check that MD5 checksums match for the same unicode text.
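The bytes/text distinction mentioned for 3.0 can be seen in a small sketch, assuming a Python 3 interpreter: hashlib refuses a text string outright instead of converting it implicitly, even when the text is pure ASCII.

```python
import hashlib

try:
    hashlib.md5("john")  # a text (str) object, not bytes
    accepted_text = True
except TypeError as exc:
    # Python 3 asks the caller to encode the text first.
    accepted_text = False
    print("rejected:", exc)

# Passing bytes works; here the caller has made the encoding decision.
digest = hashlib.md5(b"john").hexdigest()
print(digest)
```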
|msg67219 - (view)||Author: Martin v. Löwis (loewis) *||Date: 2008-05-23 05:59|
I'm rejecting this idea, for the reasons already given by others: the same string might have different hash values, depending on which encoding is chosen. Users will have to be explicit when hashing, just as they need to be explicit when they choose a hash algorithm (i.e. md5, sha1, or sha256 - they all do the same thing, but still produce different output). If you want a hash algorithm that abstracts from these details, use the builtin hash function:
    py> hash(u'joão')
    679553179
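One way to be explicit about both choices is sketched below for Python 3. The helper name `text_digest` and its parameters are hypothetical, not part of hashlib; it only wraps `str.encode` and `hashlib.new` so that neither the encoding nor the algorithm is ever implied.

```python
import hashlib


def text_digest(text: str, encoding: str, algorithm: str) -> str:
    """Hypothetical helper: hash text with a caller-chosen encoding and algorithm."""
    data = text.encode(encoding)  # explicit text -> bytes step
    return hashlib.new(algorithm, data).hexdigest()


# Changing either the algorithm or the encoding changes the output.
print(text_digest("joão", "utf-8", "md5"))
print(text_digest("joão", "utf-8", "sha256"))
print(text_digest("joão", "utf-16-le", "md5"))
```

Requiring both arguments, with no defaults, keeps the decision visible at every call site, which matches the rationale given for rejecting an implicit conversion.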
|msg67222 - (view)||Author: Marc-Andre Lemburg (lemburg) *||Date: 2008-05-23 08:32|
On 2008-05-23 05:38, Raymond Hettinger wrote:
> Raymond Hettinger <email@example.com> added the comment:
>
> I don't think this is the right thing to do. The hash algorithms are
> defined in terms of bytes, but Unicode is abstracted from a byte-level
> encoding. It doesn't make sense to convert using an arbitrary
> encoding (such as UTF-8) because someone else might hash the same text
> using a different encoding.
>
> Marc, do you concur?

Yes. While we could fix an encoding to use for converting Unicode to bytes, e.g. UTF-8, you clearly want hash functions to be portable across platforms, programming languages and implementations. Other languages or implementations might choose UTF-16 or some other encoding, so it's not clear which encoding to choose and there doesn't seem to be a standard for this either.

-1 on the idea. Martin already closed and rejected the idea for me.

Thanks,
--
Marc-Andre Lemburg
eGenix.com
|2008-05-23 08:32:50||lemburg||set||messages: + msg67222|
|2008-05-23 05:59:15||loewis||set||status: open -> closed|
nosy: + loewis
messages: + msg67219
|2008-05-23 05:50:34||rhettinger||set||priority: low|
messages: + msg67218
versions: - Python 2.5, Python 2.4
|2008-05-23 04:45:22||vvro||set||versions: + Python 2.5, Python 2.4|
|2008-05-23 04:19:45||vvro||set||messages: + msg67216|
|2008-05-23 03:38:28||rhettinger||set||assignee: lemburg|
messages: + msg67215
nosy: + rhettinger, lemburg
versions: + Python 2.6, - Python 2.5