classification
Title: Cannot use non-ascii letters in disutils if setuptools is used.
Type: behavior
Stage:
Components: Distutils
Versions: Python 2.6
process
Status: closed
Resolution:
Dependencies:
Superseder:
Assigned To:
Nosy List: benjamin.peterson, jwilk, lemburg, loewis, tarek
Priority: normal
Keywords: patch

Created on 2008-04-06 09:47 by tarek, last changed 2010-11-25 23:56 by jwilk. This issue is now closed.

Files
File name  Uploaded  Description
P1-1.0.tar.gz  loewis, 2008-04-08 20:16
distutils.unicode.patch  tarek, 2008-08-25 16:05
distutils-unicode-metadata.patch  lemburg, 2008-08-25 17:03
Messages (30)
msg65028 - (view) Author: Tarek Ziadé (tarek) * (Python committer) Date: 2008-04-06 09:47
If I try to put my name in the Author field as a plain (byte) string,
it will break, because distutils assumes that
the fields are ASCII-encoded strings: it decodes them
into unicode, then encodes them in utf8 to send the data.

See in distutils.command.register.post_to_server :

value = unicode(value).encode("utf-8")


One way to avoid this error is to provide unicode for all fields,
but that fails further along if setuptools is used, because
that package assumes that the fields *are* strings::

self.run_command('egg_info')
...
distutils/dist.py", line 1047, in write_pkg_info
    pkg_info.write('Author: %s\n' % self.get_contact() )
UnicodeEncodeError: 'ascii' codec can't encode character u'\xe9' in
position 18: ordinal not in range(128)

So I guess distutils shouldn't assume that it receives ascii strings
and do a raw unicode() call; it should instead assume that
it receives unicode fields only.


Since many packages out there use strings, I have left a unicode()
call in my patch, together with a warning. 

test provided.
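
The failure reported above can be reproduced without distutils. The sketch below is an illustration only: in Python 2, unicode(value) on a byte string performs an implicit ASCII decode, and the helper (whose name, to_utf8_like_register, is hypothetical) spells that decode out explicitly so the snippet behaves the same on Python 2 and Python 3.

```python
# Sketch of what unicode(value).encode("utf-8") does in Python 2 when
# value is a byte string: an implicit ASCII decode, then a UTF-8 encode.
# The helper name is hypothetical and not part of distutils.

def to_utf8_like_register(value):
    if isinstance(value, bytes):
        value = value.decode("ascii")  # the implicit step in Python 2
    return value.encode("utf-8")

ascii_name = b"Tarek Ziade"
utf8_name = b"Tarek Ziad\xc3\xa9"   # the name as UTF-8 encoded bytes
unicode_name = u"Tarek Ziad\xe9"    # the same name as a Unicode string

# ASCII byte strings and Unicode strings both go through.
print(to_utf8_like_register(ascii_name) == ascii_name)
print(to_utf8_like_register(unicode_name) == utf8_name)

# A non-ASCII byte string fails at the decode step.
try:
    to_utf8_like_register(utf8_name)
except UnicodeDecodeError as exc:
    print("non-ASCII byte string rejected: %s" % exc.reason)
```

Only the third input fails, which matches the report: ASCII byte strings and Unicode objects both survive this code path.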
msg65032 - (view) Author: Martin v. Löwis (loewis) * (Python committer) Date: 2008-04-06 13:21
The official supported way for non-ASCII characters in distutils is to
use Unicode strings. If anything else fails, that's not a bug.

IIUC, in this case, it's setuptools that fails, not distutils. Assuming
I understood correctly, I'm closing this as won't-fix/3rd party.
msg65033 - (view) Author: Tarek Ziadé (tarek) * (Python committer) Date: 2008-04-06 13:43
In that case, distutils should not do a unicode() call over each field
passed before .encode('utf8') is called, because it makes the assumption
that string type can be used.
msg65035 - (view) Author: Martin v. Löwis (loewis) * (Python committer) Date: 2008-04-06 13:59
I don't understand. It is *certainly* allowed to use byte strings for
these data, as long as they are ASCII. The Unicode requirement exists
only for non-ASCII characters, and distutils makes explicit, deliberate
use of the default encoding here (hoping that nobody changed it away
from ASCII).

There are tons of setup.py files out there that use plain byte strings,
and there is no reason to break them, e.g. by mandating that the string
is Unicode already.
msg65038 - (view) Author: Tarek Ziadé (tarek) * (Python committer) Date: 2008-04-06 14:14
ok I see what you mean, thanks for the explanation
msg65040 - (view) Author: Tarek Ziadé (tarek) * (Python committer) Date: 2008-04-06 14:41
oh, hold on, it is more complicated in fact :)

setuptools calls the DistributionMetadata.write_pkg_file()
method to write the .egg-info file.

This method assumes that the metadata fields are strings,
so it is not setuptools' fault.

This code fails the same way:

from distutils.dist import Distribution

dist = Distribution(attrs={'author': u'Mister Café'})
dist.metadata.write_pkg_file(file)  # file: any open, writable file object

 
So I guess the patch needs to be done in
distutils.dist.DistributionMetadata, so that it checks
the type of the field before it runs:

file.write('Author: %s\n' % self.get_contact() ) 

That's what I meant when I said that distutils should
decide whether it works with unicode or str for these fields.

I can write a new patch if you agree on this.
msg65046 - (view) Author: Martin v. Löwis (loewis) * (Python committer) Date: 2008-04-06 17:09
I agree there is a bug in distutils. Before we proceed, I think
distutils-sig needs to be consulted. My proposal would be the one I
suggested earlier: all strings should either be Unicode or ASCII-only
byte strings. This contradicts the documentation that says that none
of the strings must be Unicode, so it would be an incompatible change
(and would indeed likely break packages that currently use UTF-8, and
sdist, but never register).
msg65047 - (view) Author: Martin v. Löwis (loewis) * (Python committer) Date: 2008-04-06 17:17
As a follow-up: for compatibility, it might be possible to support
either Unicode or arbitrary plain strings in write_pkg_file. In 3k, such
support can then be dropped.

As that constitutes a new feature, it shouldn't be applied to 2.5.
msg65069 - (view) Author: Tarek Ziadé (tarek) * (Python committer) Date: 2008-04-07 08:06
ok, I'll summarize this in distutils-sig sometime today.

If we do use Unicode, I think we might need an extra meta-data,
"encoding", that would default to "utf8", and that could be used when
the class needs to serialize the data in a file.
msg65070 - (view) Author: Tarek Ziadé (tarek) * (Python committer) Date: 2008-04-07 08:17
adding a sample patch to show a possible implementation, and to point
the problem out to people
msg65076 - (view) Author: Marc-Andre Lemburg (lemburg) * (Python committer) Date: 2008-04-07 15:51
Note that 

value = unicode(value).encode("utf-8")

will also work if value is already Unicode, so a backwards compatible
fix would be to allow passing in:

 * ASCII encoded strings
 * Unicode objects

for the meta data keyword parameters and then apply unicode() to all the
meta-data arguments.

I don't think that we should support non-ASCII encodings for meta-data 
strings passed to setup().

If setuptools is broken in this respect, it needs to be fixed. Ditto for
other 3rd party tools.
msg65102 - (view) Author: Martin v. Löwis (loewis) * (Python committer) Date: 2008-04-07 19:37
> If we do use Unicode, I think we might need an extra meta-data,
> "encoding", that would default to "utf8", and that could be used when
> the class needs to serialize the data in a file.

I don't think so. Whenever the data is written to a file, the file
format should specify the encoding.

Regards,
Martin
msg65103 - (view) Author: Martin v. Löwis (loewis) * (Python committer) Date: 2008-04-07 19:39
> I don't think that we should support non-ASCII encodings for meta-data 
> strings passed to setup().
> 
> If setuptools is broken in this respect, it needs to be fixed. Ditto for
> other 3rd party tools.

We do need to support non-ASCII files, as distutils didn't previously
even support Unicode strings, and people still wanted to get their names
right. It's not about setuptools, and not about other 3rd party tools.
It's about distutils packages which we need to continue to support.
msg65108 - (view) Author: Marc-Andre Lemburg (lemburg) * (Python committer) Date: 2008-04-07 19:49
Agreed, but any change will target the package authors who can easily
upgrade their packages to use Unicode for e.g. names.

If the change were to address distutils users, we'd have to be a lot
more careful.

In any case, if UTF-8 is the de facto standard used in older packages,
then we should probably use that as a fallback solution if the ASCII
assumption doesn't work out:

try:
    value = unicode(value)
except UnicodeDecodeError:
    value = unicode(value, 'utf-8')
value = value.encode('utf-8')
msg65113 - (view) Author: Martin v. Löwis (loewis) * (Python committer) Date: 2008-04-07 20:07
> Agreed, but any change will target the package authors who can easily
> upgrade their packages to use Unicode for e.g. names.

They can't: that would break their 2.5-and-earlier compatibility.

> If the change were to address distutils users, we'd have to be a lot
> more careful.

We do address distutils users: what else? Why should we be more careful?

> In any case, if UTF-8 is the de facto standard used in older packages,
> then we should probably use that as a fallback solution if the ASCII
> assumption doesn't work out:
> 
> try:
>     value = unicode(value)
> except UnicodeDecodeError:
>     value = unicode(value, 'utf-8')
> value = value.encode('utf-8')

For writing the metadata, we don't need to make any assumptions. We
can just write the bytes as-is. This is how distutils has behaved
for many releases now, and this is how users have been using it.

Of course, we (probably) agree that this is conceptually wrong, as
we won't be able to know what the encoding of the metadata file is,
and we (probably) also agree that the metadata should have the
fixed encoding of UTF-8. However, I don't think we should deliberately
break packages before 3.0 (even if they chose to use some other
encoding); instead, such packages will silently start doing the
right thing with 3.0, when their strings become Unicode strings.
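
The trade-off described above can be sketched as follows. This is an illustration under an assumption: the author name is taken to be Latin-1 encoded, whereas the encoding used by the attached example is not stated in this thread. Writing the bytes as-is never fails, but a reader that expects UTF-8 cannot decode the result.

```python
import io

# Assumption: an author name stored as Latin-1 bytes, that is, neither
# ASCII nor UTF-8.
author = u"Martin v. L\xf6wis".encode("latin-1")

pkg_info = io.BytesIO()                       # stands in for the PKG-INFO file
pkg_info.write(b"Author: " + author + b"\n")  # bytes are copied unchanged

written = pkg_info.getvalue()
print(written == b"Author: Martin v. L\xf6wis\n")


def decodes_as(data, encoding):
    """Return True if data is valid in the given encoding."""
    try:
        data.decode(encoding)
    except UnicodeDecodeError:
        return False
    return True


print(decodes_as(written, "utf-8"))    # a UTF-8 reader cannot decode the file
print(decodes_as(written, "latin-1"))  # a reader that knows the encoding can
```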
msg65118 - (view) Author: Marc-Andre Lemburg (lemburg) * (Python committer) Date: 2008-04-07 20:22
With "distutils users" I'm referring to people that are told to run
"python setup.py install". Changes affecting the way this line behaves
need to be carefully considered.

OTOH, when upgrading a package to a new Python version (and distutils
version), package authors will have to modify their packages anyway, so
it is well possible to ask them to use Unicode strings for meta-information.

Supporting pre-2.6 Python versions is also not much of a problem, since
authors could set up the strings in question to be either Unicode or
8-bit strings depending on the Python version.
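
A minimal sketch of such a version switch, as it might appear near the top of a setup.py. The constant name AUTHOR and the choice of UTF-8 for the 8-bit form are assumptions made for illustration; the name itself is the example used earlier in this thread.

```python
import sys

# Hypothetical setup.py fragment: keep the author as a Unicode string on
# Python 2.6 and later, and fall back to an 8-bit string on older versions.
AUTHOR = u"Mister Caf\xe9"

if sys.version_info < (2, 6):
    # Assumption: older distutils gets UTF-8 encoded bytes.
    AUTHOR = AUTHOR.encode("utf-8")

print(type(AUTHOR).__name__)
```

The value would then be passed as the author argument of setup().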

This change would be really minor (compared to e.g the Py_ssize_t change
;-).

That said, I don't think it's a good idea to make package data more
complicated by allowing multiple encodings. The meta-data file should
have a fixed pre-defined encoding, preferably UTF-8.
msg65158 - (view) Author: Tarek Ziadé (tarek) * (Python committer) Date: 2008-04-08 13:26
> For writing the metadata, we don't need to make any assumptions. We
> can just write the bytes as-is. This is how distutils has behaved
> for many releases now, and this is how users have been using it.

But write_pkg_file will use ascii encoding if we don't indicate it
here:

>>> pkg_info.write('Author: %s\n' % self.get_contact() )

So wouldn't a light fix in write_pkg_file() be sufficient when a
unicode(field) fails, as MAL mentioned, by trying utf8?

>>> try:
...    pkg_info.write('Author: %s\n' % self.get_contact() )
... except UnicodeEncodeError:
...    pkg_info.write('Author: %s\n' % self.get_contact().encode('utf8') ) 


As far as I know, this simple change will not impact people and will
just make it possible to use Unicode. And everything will be fine under
Py3K as it is now.

But I don't know yet how this would impact 3rd party software that reads
the egg-info file. But like MAL said, it will have to get fixed as well.
msg65213 - (view) Author: Martin v. Löwis (loewis) * (Python committer) Date: 2008-04-08 20:16
> But write_pkg_file will use ascii encoding if we don't indicate it
> here:
> 
> >>> pkg_info.write('Author: %s\n' % self.get_contact() )

Why do you say that it uses ascii? It uses whatever encoding the string
returned by get_contact uses. See the attached P1-1.0.tar.gz for an
example. This doesn't use ASCII, and doesn't use UTF-8, and works with
2.4.

> So wouldn't a light fix in write_pkg_file() be sufficient when a
> unicode(field) fails, as MAL mentioned, by trying utf8?
> 
> >>> try:
> ...    pkg_info.write('Author: %s\n' % self.get_contact() )
> ... except UnicodeEncodeError:
> ...    pkg_info.write('Author: %s\n' % self.get_contact().encode('utf8') ) 

That would work - although I fail to see what this has to do with
a failing unicode(field). Instead, it has rather to do with a failing
.write().
msg65214 - (view) Author: Tarek Ziadé (tarek) * (Python committer) Date: 2008-04-08 20:39
> >>> pkg_info.write('Author: %s\n' % self.get_contact() )
> Why do you say that it uses ascii? It uses whatever encoding the string
> returned by get_contact uses. See the attached P1-1.0.tar.gz for an
> example. This doesn't use ASCII, and doesn't use UTF-8, and works with
> 2.4.


This happens of course only when get_contact returns a unicode object.
The write then uses the ascii codec by default. Here's an example:

>>> contact = u'Barnabé'
>>> f = open('/tmp/test', 'w')
>>> f.write('Author: %s' % contact)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode character u'\xe9' in
position 14: ordinal not in range(128)

> That would work - although I fail to see what this has to do with
> a failing unicode(field). Instead, it has rather to do with a failing
> .write().

Absolutely, I was focusing on the write_pkg_file() method, which fails
when the egg-info file is written.
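
The failing write and the UTF-8 fallback discussed above can be sketched in a form that runs on both Python 2 and Python 3. This is an illustration, not the code from any attached patch: the explicit encode("ascii") stands in for the implicit encoding that Python 2's file.write() applies to a unicode argument, and an in-memory buffer stands in for the egg-info file.

```python
import io

contact = u"Barnab\xe9"
line = u"Author: %s\n" % contact

# Python 2's file.write() encodes a unicode argument with the default
# (ASCII) codec; the explicit encode below stands in for that step.
try:
    data = line.encode("ascii")
    failed_at = None
except UnicodeEncodeError as exc:
    failed_at = exc.start        # index of the first non-ASCII character
    data = line.encode("utf-8")  # the fallback discussed above

out = io.BytesIO()               # stands in for the egg-info file
out.write(data)

print(failed_at)
print(repr(out.getvalue()))
```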
msg65650 - (view) Author: Tarek Ziadé (tarek) * (Python committer) Date: 2008-04-20 16:27
I suppose the simplest way to deal with the problem is to force utf-8
encoding for the fields concerned, since this problem will disappear in 3k.
 
Here's a simplified patch that does it, so write_pkg_file behaves as
expected.
msg66518 - (view) Author: Tarek Ziadé (tarek) * (Python committer) Date: 2008-05-10 13:35
I think this should also be fixed in 2.5
msg71936 - (view) Author: Marc-Andre Lemburg (lemburg) * (Python committer) Date: 2008-08-25 15:16
Is this still an issue in 2.6 ?

AFAIK, there have been a few changes both to setuptools and PyPI that
make it easy to just use Unicode objects in the setup() call for
non-ASCII values.
msg71943 - (view) Author: Tarek Ziadé (tarek) * (Python committer) Date: 2008-08-25 16:05
The problem is in distutils code, not in setuptools or PyPI.

As far as I can see, the problem remains in the trunk. It is dead
simple to reproduce: put a unicode name containing a non-ascii character
for the author in a plain setup.py (for example my name ;)).

Here's an up-to-date patch that includes a test that reproduces the problem.
msg71944 - (view) Author: Marc-Andre Lemburg (lemburg) * (Python committer) Date: 2008-08-25 17:03
Here's an updated patch that applies the same logic to all meta-data
fields, instead of just a few. This simplifies the code somewhat.
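
One way to picture "the same logic for all fields" is a single helper through which every header line is written. The sketch below is an assumption-laden illustration, not the attached patch: the names write_field and PKG_INFO_ENCODING are chosen here for the example, and the output goes to a binary in-memory stream so the snippet runs on both Python 2 and Python 3.

```python
import io

PKG_INFO_ENCODING = "utf-8"  # assumed fixed encoding for the metadata file


def write_field(stream, name, value):
    """Write one 'Name: value' header line to a binary stream."""
    if isinstance(value, bytes):
        data = value  # byte strings pass through unchanged
    else:
        # Unicode strings (and other objects) are converted to text,
        # then encoded with the fixed metadata encoding.
        data = (u"%s" % (value,)).encode(PKG_INFO_ENCODING)
    stream.write(name.encode("ascii") + b": " + data + b"\n")


buf = io.BytesIO()
metadata = [
    ("Name", b"P1"),
    ("Version", u"1.0"),
    ("Author", u"Mister Caf\xe9"),
]
for field_name, field_value in metadata:
    write_field(buf, field_name, field_value)

print(repr(buf.getvalue()))
```

Routing every field through one function means the encoding decision is made in a single place instead of once per write call.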

I've tested it with the test you provided and also with eGenix packages
using Unicode author names (ie. my name ;-)).

I guess we need at least one more reviewer to commit this change.
msg71971 - (view) Author: Tarek Ziadé (tarek) * (Python committer) Date: 2008-08-26 08:48
ok I will ask for this on the ML
msg72383 - (view) Author: Marc-Andre Lemburg (lemburg) * (Python committer) Date: 2008-09-03 10:48
Removing Python 2.5 from the version list, since the patch may in some
cases (e.g. using a different encoding than UTF-8) cause problems with
existing setup.py files out there.

The patch is not compatible with Python 3.0 for obvious reasons, but
there shouldn't be any issue for Python 3.0 anyway.

Given that no one has volunteered to review the patch in addition to
Tarek and myself, I think we're good to go.

Tarek, if you're fine with this, please let me know and I'll check in
the patch (together with a note in NEWS).
msg72384 - (view) Author: Tarek Ziadé (tarek) * (Python committer) Date: 2008-09-03 10:58
Sure, sounds fine to me, thanks for the help on this issue
msg72385 - (view) Author: Marc-Andre Lemburg (lemburg) * (Python committer) Date: 2008-09-03 11:28
Checked in as r66181 on trunk.
msg72790 - (view) Author: Benjamin Peterson (benjamin.peterson) * (Python committer) Date: 2008-09-08 21:44
Does this need to be merged into py3k? If so, can someone who handled
this bug do it? I met a few test failures in my attempt...
msg72834 - (view) Author: Marc-Andre Lemburg (lemburg) * (Python committer) Date: 2008-09-09 09:37
On 2008-09-08 23:45, Benjamin Peterson wrote:
> Benjamin Peterson <musiccomposition@gmail.com> added the comment:
> 
> Does this need to be merged into py3k? If so, can someone who handled
> this bug do it? I met a few test failures in my attempt...

As mentioned in the ticket discussion, this does not need to
be forward patched to 3.0.
History
Date  User  Action  Args
2010-11-25 23:56:05  jwilk  set  nosy: + jwilk
2008-09-09 09:37:01  lemburg  set  messages: + msg72834
    title: Cannot use non-ascii letters in disutils if setuptools is used. -> Cannot use non-ascii letters in disutils if setuptools is used.
2008-09-08 21:44:04  benjamin.peterson  set  nosy: + benjamin.peterson
    messages: + msg72790
2008-09-03 11:28:59  lemburg  set  status: open -> closed
    messages: + msg72385
2008-09-03 10:58:04  tarek  set  messages: + msg72384
2008-09-03 10:48:28  lemburg  set  messages: + msg72383
    versions: - Python 2.5
2008-08-26 08:48:48  tarek  set  messages: + msg71971
2008-08-25 17:04:00  lemburg  set  files: + distutils-unicode-metadata.patch
    messages: + msg71944
2008-08-25 16:06:11  tarek  set  files: - distutils.unicode.simplified.patch
2008-08-25 16:06:04  tarek  set  files: - unicode.metadata.patch
2008-08-25 16:06:01  tarek  set  files: - unicode.patch
2008-08-25 16:05:45  tarek  set  files: + distutils.unicode.patch
    messages: + msg71943
2008-08-25 15:16:56  lemburg  set  messages: + msg71936
2008-08-24 22:28:00  nnorwitz  set  type: crash -> behavior
2008-05-10 13:35:03  tarek  set  messages: + msg66518
    versions: + Python 2.5
2008-04-20 16:27:12  tarek  set  files: + distutils.unicode.simplified.patch
    messages: + msg65650
2008-04-12 18:29:44  georg.brandl  link  issue1721241 superseder
2008-04-08 20:39:50  tarek  set  messages: + msg65214
2008-04-08 20:16:06  loewis  set  files: + P1-1.0.tar.gz
    messages: + msg65213
2008-04-08 13:26:16  tarek  set  messages: + msg65158
2008-04-07 20:22:32  lemburg  set  messages: + msg65118
2008-04-07 20:07:40  loewis  set  messages: + msg65113
2008-04-07 19:49:38  lemburg  set  messages: + msg65108
2008-04-07 19:39:26  loewis  set  messages: + msg65103
2008-04-07 19:37:53  loewis  set  messages: + msg65102
2008-04-07 15:51:06  lemburg  set  nosy: + lemburg
    messages: + msg65076
2008-04-07 08:17:05  tarek  set  files: + unicode.metadata.patch
    messages: + msg65070
2008-04-07 08:06:26  tarek  set  messages: + msg65069
2008-04-06 17:17:38  loewis  set  messages: + msg65047
2008-04-06 17:09:05  loewis  set  status: closed -> open
    resolution: wont fix ->
    messages: + msg65046
    versions: + Python 2.6, - 3rd party
2008-04-06 14:41:30  tarek  set  messages: + msg65040
2008-04-06 14:14:12  tarek  set  messages: + msg65038
2008-04-06 13:59:08  loewis  set  messages: + msg65035
2008-04-06 13:43:33  tarek  set  messages: + msg65033
2008-04-06 13:21:05  loewis  set  status: open -> closed
    resolution: wont fix
    messages: + msg65032
    nosy: + loewis
    versions: + 3rd party, - Python 2.6
2008-04-06 10:14:38  tarek  set  files: + unicode.patch
2008-04-06 10:14:28  tarek  set  files: - unicode.patch
2008-04-06 09:47:18  tarek  create