classification
Title: Unicode support for hashing algorithms
Type: enhancement
Stage:
Components: Unicode
Versions: Python 2.6
process
Status: closed
Resolution: not a bug
Dependencies:
Superseder:
Assigned To: lemburg
Nosy List: lemburg, loewis, rhettinger, vvro
Priority: low
Keywords:

Created on 2008-05-23 03:22 by vvro, last changed 2008-05-23 08:32 by lemburg. This issue is now closed.

Messages (6)
msg67214 - Author: Vasco Rodrigues (vvro) Date: 2008-05-23 03:22
The hashing algorithms don't support Unicode. Any Unicode text given to
them is first converted to ASCII and then hashed, but not all strings
are convertible to ASCII.
Unicode is becoming the default encoding, especially on the web
side of Python, where a lot of these hashing algorithms are used,
so there should be some kind of Unicode support in them.

Example:
from hashlib import md5
md5(u'joão')
UnicodeEncodeError: 'ascii' codec can't encode character u'\xe3' in
position 2: ordinal not in range(128)
msg67215 - Author: Raymond Hettinger (rhettinger) * (Python committer) Date: 2008-05-23 03:38
I don't think this is the right thing to do.  The hash algorithms are 
defined in terms of bytes, but Unicode text is abstracted from any 
byte-level encoding.  It doesn't make sense to convert using an arbitrary 
encoding (such as UTF-8) because someone else might hash the same text 
using a different encoding.
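
A minimal sketch of this point, written for Python 3 (an assumption; the thread itself concerns 2.x). The three encodings are picked only for illustration. The same text maps to different byte sequences, so the digests differ:

```python
import hashlib

text = "joão"  # the same abstract text in every case

# Hash the identical text under three different encodings.
digests = {
    enc: hashlib.md5(text.encode(enc)).hexdigest()
    for enc in ("utf-8", "latin-1", "utf-16-le")
}

for enc, digest in digests.items():
    print(enc, digest)

# Each encoding yields different bytes, hence a different digest.
all_distinct = len(set(digests.values())) == len(digests)
print("all digests distinct:", all_distinct)
```

Two parties comparing checksums of "the same text" only agree if they also agree on the encoding.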

Marc, do you concur?
msg67216 - Author: Vasco Rodrigues (vvro) Date: 2008-05-23 04:19
You could just check for unicode strings and do the encode in
the hash function.
I understand the byte abstraction, but if you encode a
unicode string that contains only ASCII characters, you get the same
bytes as the ASCII string, so the result will be the same.
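
A short sketch of how far that observation holds, using Python 3 syntax as an assumption (the thread targets 2.x). ASCII and UTF-8 agree on ASCII-only text, but that agreement does not extend to every encoding:

```python
import hashlib

ascii_only = "john"

# ASCII and UTF-8 produce identical bytes for ASCII-only text.
same_bytes = ascii_only.encode("ascii") == ascii_only.encode("utf-8")

# UTF-16 produces different bytes even for ASCII-only text,
# so the digest changes as well.
utf8_digest = hashlib.md5(ascii_only.encode("utf-8")).hexdigest()
utf16_digest = hashlib.md5(ascii_only.encode("utf-16-le")).hexdigest()

print("ascii == utf-8 bytes:", same_bytes)
print("utf-8 digest == utf-16 digest:", utf8_digest == utf16_digest)
```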

So I have to do md5(u'joão'.encode("utf-8"))?
Wasn't unicode becoming the default?

If I do md5(u'john'), it works, and that's a unicode string. It should
have told me "no unicode" then...
msg67218 - Author: Raymond Hettinger (rhettinger) * (Python committer) Date: 2008-05-23 05:49
Only 2.6 should be marked.  This is a feature request for an implicit 
conversion with a default encoding; it is not a bugfix.

FWIW, here's a reference to an earlier discussion:
http://mail.python.org/pipermail/python-list/2004-April/258630.html

Also, it is unpersuasive that md5() works with u'john'.  That is just 
an artifact of the 2.x series.  In 3.0, there is a clearer 
distinction between bytes and text.
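
As an illustration of that distinction, here is a small sketch that assumes a Python 3 interpreter; `try_md5` is a hypothetical helper introduced only for this example. Passing text instead of bytes is rejected even when the text is pure ASCII:

```python
import hashlib

def try_md5(value):
    """Return the hex digest, or the name of the exception that was raised."""
    try:
        return hashlib.md5(value).hexdigest()
    except TypeError as exc:
        return type(exc).__name__

text_result = try_md5("john")    # str: rejected, must be encoded first
bytes_result = try_md5(b"john")  # bytes: hashed directly

print(text_result)
print(bytes_result)
```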

Recommending that this be rejected and closed.  Without a universally 
accepted (not just within Python) implicit encoding, there's no way to 
check that MD5 checksums match for the same unicode text.
msg67219 - Author: Martin v. Löwis (loewis) * (Python committer) Date: 2008-05-23 05:59
I'm rejecting this idea, for the reasons already given by others: the
same string might have different hash values, depending on which
encoding is chosen. Users will have to be explicit when hashing, just as
they need to be explicit when they choose a hash algorithm (i.e. md5,
sha1, or sha256 - they all do the same thing, but still produce
different output).
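
One way to picture "being explicit" is the sketch below, in Python 3 syntax (an assumption). `hash_text` is a hypothetical helper, not part of hashlib; it takes both the algorithm and the encoding as required arguments, with no default for either:

```python
import hashlib

def hash_text(text, algorithm, encoding):
    """Hypothetical helper: the caller must name both algorithm and encoding."""
    return hashlib.new(algorithm, text.encode(encoding)).hexdigest()

md5_utf8 = hash_text("joão", "md5", "utf-8")
sha256_utf8 = hash_text("joão", "sha256", "utf-8")
md5_latin1 = hash_text("joão", "md5", "latin-1")

print(md5_utf8)
print(sha256_utf8)
print(md5_latin1)
```

Changing either argument changes the output, which is why neither choice can safely be made implicitly on the caller's behalf.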

If you want a hash algorithm that abstracts from these details, use the
builtin hash function:

py> hash(u'joão')
679553179
msg67222 - Author: Marc-Andre Lemburg (lemburg) * (Python committer) Date: 2008-05-23 08:32
On 2008-05-23 05:38, Raymond Hettinger wrote:
> Raymond Hettinger <rhettinger@users.sourceforge.net> added the comment:
> 
> I don't think this is the right thing to do.  The hash algorithms are 
> defined in terms of bytes, but Unicode text is abstracted from any 
> byte-level encoding.  It doesn't make sense to convert using an arbitrary 
> encoding (such as UTF-8) because someone else might hash the same text 
> using a different encoding.
> 
> Marc, do you concur?

Yes.

While we could fix an encoding to use for converting Unicode to
bytes, e.g. UTF-8, you clearly want hash functions to be portable
across platforms, programming languages and implementations.

Other languages or implementations might choose UTF-16 or some
other encoding, so it's not clear which encoding to choose and
there doesn't seem to be a standard for this either.

-1 on the idea. Martin already closed and rejected the idea for me.

Thanks,
-- 
Marc-Andre Lemburg
eGenix.com
History
Date User Action Args
2008-05-23 08:32:50 lemburg set messages: + msg67222
2008-05-23 05:59:15 loewis set status: open -> closed
nosy: + loewis
resolution: not a bug
messages: + msg67219
2008-05-23 05:50:34 rhettinger set priority: low
messages: + msg67218
versions: - Python 2.5, Python 2.4
2008-05-23 04:45:22 vvro set versions: + Python 2.5, Python 2.4
2008-05-23 04:19:45 vvro set messages: + msg67216
2008-05-23 03:38:28 rhettinger set assignee: lemburg
type: enhancement
messages: + msg67215
nosy: + rhettinger, lemburg
versions: + Python 2.6, - Python 2.5
2008-05-23 03:22:49 vvro create