This issue tracker has been migrated to GitHub, and is currently read-only.
For more information, see the GitHub FAQs in the Python's Developer Guide.

classification
Title: Decode __file__ and co_filename to unicode using fs default
Type: Stage:
Components: Interpreter Core Versions: Python 3.0
process
Status: closed Resolution: accepted
Dependencies: Superseder:
Assigned To: gvanrossum Nosy List: alexandre.vassalotti, christian.heimes, gvanrossum
Priority: normal Keywords: patch

Created on 2007-10-12 13:41 by christian.heimes, last changed 2022-04-11 14:56 by admin. This issue is now closed.

Files
File name Uploaded Description Edit
py3k_file_fsenc2.patch christian.heimes, 2007-10-12 13:41
updated_file_fsenc.patch alexandre.vassalotti, 2007-10-13 22:10
updated_file_fsenc-2.patch alexandre.vassalotti, 2007-10-14 01:32
updated_file_fsenc-3.patch alexandre.vassalotti, 2007-10-14 01:42
updated_file_fsenc-4.patch alexandre.vassalotti, 2007-10-14 04:35
updated_file_fsenc-5.patch alexandre.vassalotti, 2007-10-14 16:17
updated_file_fsenc-6.patch christian.heimes, 2007-10-14 17:06
Messages (31)
msg56374 - (view) Author: Christian Heimes (christian.heimes) * (Python committer) Date: 2007-10-12 13:41
I'm sending the patch in for review.
msg56388 - (view) Author: Guido van Rossum (gvanrossum) * (Python committer) Date: 2007-10-13 21:19
Couple of nits:

- You added a removal of hotshot from setup.py to the patch; but that's
been checked in in the mean time.

- Why add an 'errors' argument to the function when it's a fatal error
to use it?

- Using 0 to autodetect the length is scary.  Normally we have two APIs
for that, one ..._FromString and one ...FromStringAndSize.  If you
really don't want that, please use -1, which is at least an illegal value.

- Why is there code in codeobject.c::PyCode_New() that still accepts a
PyString for the filename?

- In that file (and possibly others, I didn't check) your code uses
spaces to indent while the surrounding code uses tabs.  Moreover, your
space indent seems to assume there are 4 spaces to a tab, but all our
code (Python and C) is formatted assuming tabs are 8 spaces.  (The
indent isn't always 8 spaces -- but ASCII TAB characters always are 8,
for us.)

- Why copy the default encoding before mangling it?  With a little extra
care you will only have to copy it once.  Also, consider not mangling at
all, but assuming the encoding comes in a canonical form -- several
other functions assume that, e.g. PyUnicode_Decode() and
PyUnicode_AsEncodedString().

- I haven't run the unit tests yet.  Will be doing that next...
msg56389 - (view) Author: Alexandre Vassalotti (alexandre.vassalotti) * (Python committer) Date: 2007-10-13 22:10
I found a few problems in your patch. In PyCode_New() the type check
for the filename argument is incorrect:

--- Objects/codeobject.c	(revision 58412)
+++ Objects/codeobject.c	(working copy)
@@ -59,7 +59,7 @@
 	    freevars == NULL || !PyTuple_Check(freevars) ||
 	    cellvars == NULL || !PyTuple_Check(cellvars) ||
 	    name == NULL || (!PyString_Check(name) && !PyUnicode_Check(name)) ||
-	    filename == NULL || !PyString_Check(filename) ||
+	    filename == NULL || (!PyString_Check(name) &&
!PyUnicode_Check(name)) ||
 	    lnotab == NULL || !PyString_Check(lnotab) ||
 	    !PyObject_CheckReadBuffer(code)) {
 		PyErr_BadInternalCall();

@@ -260,6 +267,8 @@
 		ourcellvars = PyTuple_New(0);
 	if (ourcellvars == NULL)
 		goto cleanup;
+    filename = PyUnicode_DecodeFSDefault(PyString_AS_STRING(filename),
+                                         0, NULL);


The following is unnecessary and will cause a reference leak:


@@ -260,6 +267,8 @@
 		ourcellvars = PyTuple_New(0);
 	if (ourcellvars == NULL)
 		goto cleanup;
+    filename = PyUnicode_DecodeFSDefault(PyString_AS_STRING(filename),
+                                         0, NULL);
 
 	co = (PyObject *)PyCode_New(argcount, kwonlyargcount,
 				    nlocals, stacksize, flags,


I think the interface of PyUnicode_DecodeFSDefault() could be improved
a bit. The function doesn't use the last argument 'errors', so why is
there? I am not sure if it is useful to keep second argument,
'length', either. So, I believe the function prototype should be
changed to:

    PyObject *PyUnicode_Decode_FSDefault(const char *s);

Another thing that I am not sure about is whether it is correct to
consider ISO-8859-15 the same thing as Latin-1.

Overall, the patch looks good to me and doesn't cause any test to
fail. I attached an updated patch with the above issues fixed.

Thank you, Christian, for the patch. :)
msg56390 - (view) Author: Alexandre Vassalotti (alexandre.vassalotti) * (Python committer) Date: 2007-10-13 22:20
Guido wrote: 
> Why copy the default encoding before mangling it?  With a little
> extra care you will only have to copy it once.  Also, consider not
> mangling at all, but assuming the encoding comes in a canonical form
> -- several other functions assume that, e.g. PyUnicode_Decode() and
> PyUnicode_AsEncodedString().

It is impossible guarantee that Py_FileSystemDefaultEncoding is
normalized, since its value can be set using nl_langinfo(CODESET)
during the bootstrapping process. PyUnicode_Decode() and other
decoding/encoding functions use the codec module, which is not available
during the early bootstrapping process, to normalize the encoding name.
msg56391 - (view) Author: Christian Heimes (christian.heimes) * (Python committer) Date: 2007-10-13 23:27
Guido van Rossum wrote:
> - You added a removal of hotshot from setup.py to the patch; but that's
> been checked in in the mean time.

Oh, the change shouldn't make it into the patch. I guess I forgot a svn
revert on setup.py

> - Why add an 'errors' argument to the function when it's a fatal error
> to use it?

I wanted the signature of the method be equal to the other methods
PyUnicode_Decode*. I copied the FatalError from
*_PyUnicode_AsDefaultEncodedString().

> - Using 0 to autodetect the length is scary.  Normally we have two APIs
> for that, one ..._FromString and one ...FromStringAndSize.  If you
> really don't want that, please use -1, which is at least an illegal value.

Oh right, -1 is *much* better for autodetect than 0. What do you prefer,
a second method or -1 as auto detect?

> - Why is there code in codeobject.c::PyCode_New() that still accepts a
> PyString for the filename?

Because it's my fault that I've overseen it. :/

> - In that file (and possibly others, I didn't check) your code uses
> spaces to indent while the surrounding code uses tabs.  Moreover, your
> space indent seems to assume there are 4 spaces to a tab, but all our
> code (Python and C) is formatted assuming tabs are 8 spaces.  (The
> indent isn't always 8 spaces -- but ASCII TAB characters always are 8,
> for us.)

Some C files like unicodeobject.c are using 4 spaces while other files
are using tabs for indention. My editor may got confused by the mix.
I've manually fixed it in the patch but I may have overseen a line or two.

> - Why copy the default encoding before mangling it?  With a little extra
> care you will only have to copy it once.  Also, consider not mangling at
> all, but assuming the encoding comes in a canonical form -- several
> other functions assume that, e.g. PyUnicode_Decode() and
> PyUnicode_AsEncodedString().

My C is a bit rusty and still need to learn news tricks. I'm trying to
see if I can remove the extra copy without causing a problem.
The other part of your question was already answered by Alexandre. The
aliases map is defined in Python code. It's not available so early in
the boot strapping process.
We'd have to redesign the assignment of co_filename and __file__
completely if we want to use the aliases and other codecs. For example
we could store a PyString at first and redo all names once the codecs
are set up.

Christian
msg56393 - (view) Author: Guido van Rossum (gvanrossum) * (Python committer) Date: 2007-10-14 00:35
On 10/13/07, Christian Heimes <report@bugs.python.org> wrote:
> Guido van Rossum wrote:
> > - Why add an 'errors' argument to the function when it's a fatal error
> > to use it?
>
> I wanted the signature of the method be equal to the other methods
> PyUnicode_Decode*. I copied the FatalError from
> *_PyUnicode_AsDefaultEncodedString().

But that function is a terrible example; it was done that way because
an earlier version of the function *did* allow using the errors
argument and I wanted to make sure to catch all calls that were still
passing an errors value. This doesn't apply here, we're creating a
brand new API.

> > - Using 0 to autodetect the length is scary.  Normally we have two APIs
> > for that, one ..._FromString and one ...FromStringAndSize.  If you
> > really don't want that, please use -1, which is at least an illegal value.
>
> Oh right, -1 is *much* better for autodetect than 0. What do you prefer,
> a second method or -1 as auto detect?

Even better is Alexandre's version: always autodetect. I think we can
assume that filenames are always available as 0-terminated byte
arrays, since that's how all system calls need them.

Anyway, let me know if you want to change something in Alexandre's
version or if you want him to check it in.

Oh. Hm. I still wish that PyCode_New() could just insist that the
filename argument is a PyUnicode instance. Why can't it? Perhaps the
caller should be fixed instead?
msg56394 - (view) Author: Christian Heimes (christian.heimes) * (Python committer) Date: 2007-10-14 00:43
Guido van Rossum wrote:
> But that function is a terrible example; it was done that way because
> an earlier version of the function *did* allow using the errors
> argument and I wanted to make sure to catch all calls that were still
> passing an errors value. This doesn't apply here, we're creating a
> brand new API.

Ahhh! I really didn't know it. I thought that the non functional
arguments have a purpose.

> Anyway, let me know if you want to change something in Alexandre's
> version or if you want him to check it in.

I'm going to create a new patch based on his fixes and your recommendations.

> Oh. Hm. I still wish that PyCode_New() could just insist that the
> filename argument is a PyUnicode instance. Why can't it? Perhaps the
> caller should be fixed instead?

I'll try.
msg56396 - (view) Author: Guido van Rossum (gvanrossum) * (Python committer) Date: 2007-10-14 00:51
> > Oh. Hm. I still wish that PyCode_New() could just insist that the
> > filename argument is a PyUnicode instance. Why can't it? Perhaps the
> > caller should be fixed instead?

> I'll try.

I figured out the problem -- it came from marshalled old code objects. 
If you throw away all .pyc files the problem goes away.  You can also
get rid of the similar checks for the 'name' argument -- this should
just be a PyUnicode too.  A systematic approach to invalidating all the
.pyc files is updating the magic number in import.c.
msg56397 - (view) Author: Christian Heimes (christian.heimes) * (Python committer) Date: 2007-10-14 00:57
Guido van Rossum wrote:
> - Why copy the default encoding before mangling it?  With a little extra
> care you will only have to copy it once. 

Now I remember why I added the strncpy() call plus encoding[31] = '\0'.
I wanted to make sure that the code doesn't break even if the encoding
name is longer than 31 + 1 chars long. I'm aware that it's very unlikely
but I didn't want to take chances. You are allowed to call me paranoid. *g*

Christian
msg56398 - (view) Author: Guido van Rossum (gvanrossum) * (Python committer) Date: 2007-10-14 01:00
Well, you could ensure that by checking that you haven't reached the
end of the mangling buffer. That will have the added advantage that
when the input is something silly like 32 spaces followed by "utf-8"
it will be still be mangled correctly. The slight extra cost of the
check (could be a single pointer compare) is offset by saving a call
to strncpy().

--Guido

On 10/13/07, Christian Heimes <report@bugs.python.org> wrote:
>
> Christian Heimes added the comment:
>
> Guido van Rossum wrote:
> > - Why copy the default encoding before mangling it?  With a little extra
> > care you will only have to copy it once.
>
> Now I remember why I added the strncpy() call plus encoding[31] = '\0'.
> I wanted to make sure that the code doesn't break even if the encoding
> name is longer than 31 + 1 chars long. I'm aware that it's very unlikely
> but I didn't want to take chances. You are allowed to call me paranoid. *g*
>
> Christian
>
> __________________________________
> Tracker <report@bugs.python.org>
> <http://bugs.python.org/issue1272>
> __________________________________
>
msg56401 - (view) Author: Alexandre Vassalotti (alexandre.vassalotti) * (Python committer) Date: 2007-10-14 01:32
Guido wrote:
> I figured out the problem -- it came from marshalled old code objects. 
> If you throw away all .pyc files the problem goes away.  You can also
> get rid of the similar checks for the 'name' argument -- this should
> just be a PyUnicode too.  A systematic approach to invalidating all the
> .pyc files is updating the magic number in import.c.

Done.

I had to remove a few another PyString instances in pyexpat.c and
_ctypes.c. So, here (hopefully) the final version of the patch. 

The changes from the last version are:

   - Correct a typo in of the comments in PyUnicode_DecodeFSDefault
   - Specified in the API doc of PyUnicode_DecodeFSDefault that the
     function take a null-terminated string.
   - Bumped the magic number in import.c
   - Fix PyCode_New calls in _ctypes and pyexpat module.
   - Remove the PyString type check on 'filename' and 'name' in PyCode_New.
   - Remove the unneeded string coercion code from PyCode_New.
msg56402 - (view) Author: Alexandre Vassalotti (alexandre.vassalotti) * (Python committer) Date: 2007-10-14 01:42
> Remove the PyString type check on 'filename' and 'name' in PyCode_New.

Oops. I removed one of the ! the checks by mistake.
msg56404 - (view) Author: Guido van Rossum (gvanrossum) * (Python committer) Date: 2007-10-14 02:15
Crys, is this OK with you?

On 10/13/07, Alexandre Vassalotti <report@bugs.python.org> wrote:
>
> Alexandre Vassalotti added the comment:
>
> Guido wrote:
> > I figured out the problem -- it came from marshalled old code objects.
> > If you throw away all .pyc files the problem goes away.  You can also
> > get rid of the similar checks for the 'name' argument -- this should
> > just be a PyUnicode too.  A systematic approach to invalidating all the
> > .pyc files is updating the magic number in import.c.
>
> Done.
>
> I had to remove a few another PyString instances in pyexpat.c and
> _ctypes.c. So, here (hopefully) the final version of the patch.
>
> The changes from the last version are:
>
>    - Correct a typo in of the comments in PyUnicode_DecodeFSDefault
>    - Specified in the API doc of PyUnicode_DecodeFSDefault that the
>      function take a null-terminated string.
>    - Bumped the magic number in import.c
>    - Fix PyCode_New calls in _ctypes and pyexpat module.
>    - Remove the PyString type check on 'filename' and 'name' in PyCode_New.
>    - Remove the unneeded string coercion code from PyCode_New.
>
> __________________________________
> Tracker <report@bugs.python.org>
> <http://bugs.python.org/issue1272>
> __________________________________
>
msg56405 - (view) Author: Christian Heimes (christian.heimes) * (Python committer) Date: 2007-10-14 03:06
Guido van Rossum wrote:
> Guido van Rossum added the comment:
> 
> Crys, is this OK with you?

Alexandre's mangle loop doesn't do the same job as mine. Chars like _
and - aren't removed from the encoding name and the if clauses don't
catch for example UTF-8 or ISO-8859-1 only UTF8 or ISO8859-1. Also he
has overseen a PyString_Check in the code repr function.

I'm working on a better mangler and I believe that I can make
Py_FilesystemEncoding available much earlier in Py_InitializeEx().

*after some debugging*

I fear that we are on the wrong path. Py_FileSystemEncoding is set
*much* later in the boot strapping process unless its value is hard
coded (Win32 and Apple only). Any attempt to get the right codec or even
a normalized name without the codecs package is going to extremely hard.

We have to get the codecs up and Py_FileSystemEncoding before we can
decode the filenames. :( I think that the problem needs much more
attention and a proper fix.

Christian
msg56406 - (view) Author: Guido van Rossum (gvanrossum) * (Python committer) Date: 2007-10-14 03:35
OK, in the spirit of delegation I'll leave this for you and Alexandre to
work out more.  If you're stuck, post to the list so others can jump in.
msg56408 - (view) Author: Alexandre Vassalotti (alexandre.vassalotti) * (Python committer) Date: 2007-10-14 04:34
Christian wrote:
> Alexandre's mangle loop doesn't do the same job as mine. Chars like _
> and - aren't removed from the encoding name and the if clauses don't
> catch for example UTF-8 or ISO-8859-1 only UTF8 or ISO8859-1. 

That isn't true. My mangler does exactly the same thing as your
original one.

However, I forgot to add Py_CHARMASK to the calls of tolower() and
isalnum() which would cause problems on platforms with signed char.

> Also he has overseen a PyString_Check in the code repr function.

Fixed.

> We have to get the codecs up and Py_FileSystemEncoding before we can
> decode the filenames. :( I think that the problem needs much more
> attention and a proper fix.

Maybe adding a global variable, let's say _Py_Codecs_Ready, could be
used to notify PyUnicode_DecodeFSDefault that it can use
PyUnicode_Decode, instead of relying only on the built-in codecs. That
would be much simpler than changing boostrapping process.

Here yet another updated patch. The changes are the following:

   - Add Py_CHARMASK to tolower() and isalnum() calls in
     PyUnicode_DecodeFSDefault().
   - Use PyUnicode_Check(), instead of PyString_Check(), in
     code_repr().
   - Update comments for co_filename and co_name in PyCodeObject
     struct definition.
   - Fix a PyString_AS_STRING(co->co_name) instance in compile.c
   - Replace %S for %U in PyErr_Format() calls for substituting co_name.

One more thing, frozen.c needs to be updated. The module data contains
a code object with a PyString co_name. However, there is a bug in the
imp module (it doesn't detect the encoding from modelines, which cause
io.TextIOWrapper to crash) that prevents me from generating the data
for frozen.c.
msg56414 - (view) Author: Christian Heimes (christian.heimes) * (Python committer) Date: 2007-10-14 15:05
Alexandre Vassalotti wrote:
> That isn't true. My mangler does exactly the same thing as your
> original one.
> 
> However, I forgot to add Py_CHARMASK to the calls of tolower() and
> isalnum() which would cause problems on platforms with signed char.

I wasn't sure how tolower() reacts on numeric values. The addition of
Py_CHARMASK is a good idea. It's a shame stringobject.c doesn't expose
its non locale version of tolower() and isalnum().

> Maybe adding a global variable, let's say _Py_Codecs_Ready, could be
> used to notify PyUnicode_DecodeFSDefault that it can use
> PyUnicode_Decode, instead of relying only on the built-in codecs. That
> would be much simpler than changing boostrapping process.

It still f... up the file names of Python modules that were loaded
before the codecs are ready to use. What do you think about this
(pseudocode):

#if defined(MS_WINDOWS) && defined(HAVE_USABLE_WCHAR_T)
    return PyUnicode_DecodeMBCS(string, "replace)
#elif defined(__APPLE__)
    return PyUnicode_DecodeUTF8(string, "replace)
#else

* set __file__ and co_filename to a preliminary value with
  PyUnicode_DecodeUTF8(string, "replace")

* Store a pointer and original string in a linked list
   struct Py_FixFileNames {
       struct Py_FixFileNames *next;
       char *string;
       PyObject *unicode;
   };

* After the codecs and Py_FileSystemDefaultEncoding are set up in
  Py_InitializeEx() check the value. If the fs default encoding
  isn't UTF-8 redo all encodings with the correct encoding.

* Free the linked list

* From now on use the codecs package, PyUnicode_Decode() and
  Py_FileSystemDefaultEncoding for every filename

#endif

Add comments to See Python/bltinmodule.c Py_FileSystemDefaultEncoding
and Objects/unicodeobject.c unicode_default_encoding that
PyUnicode_DecodeFSDefault depends on the values.

It ain't beautiful and fast but we definitely get the right file names
after boot strapping and fair file names during boot strapping.

Christian
msg56415 - (view) Author: Guido van Rossum (gvanrossum) * (Python committer) Date: 2007-10-14 15:23
Only a few modules are involved in the bootstrap. The filename is
mostly used to display in the traceback. There is already a fallback
in the traceback-printing code that tries to look through sys.path for
a file matching the module if it can't open the filename from the code
object.

So I suggest to do something much simpler -- if the default encoding
isn't set yet, use ASCII + replace, and leave it at that -- don't
bother fixing these things later.

Another thought: when installed, the built-in modules are compiled to
byte code by compileall.py, which runs late enough so that the
filesystem encoding is known and everything can be done right.

I have a question for Alexandre related to frozen.c -- why is there a
mode line with an encoding involved in freezing hello.py?
msg56417 - (view) Author: Alexandre Vassalotti (alexandre.vassalotti) * (Python committer) Date: 2007-10-14 16:00
> I have a question for Alexandre related to frozen.c -- why is there a
> mode line with an encoding involved in freezing hello.py?

For some reason which I don't know about, freeze.py tries to read all
the modules accessible from sys.path:

    # collect all modules of the program
    dir = os.path.dirname(scriptfile)
    path[0] = dir
    mf = modulefinder.ModuleFinder(path, debug, exclude, replace_paths)

The problem is the imp module, which modulefinder uses, does not
detect the encoding of the files from the mode-line. This causes
TextIOWrapper to crash when it tries to read modules using an encoding
other than ASCII or UTF-8. Here an example:

  >>> import imp
  >>> imp.find_module('heapq')[0].read()
  Traceback (most recent call last):
  ...
  UnicodeDecodeError: 'utf8' codec can't decode bytes in position
  1428-1430: invalid data

I probably should open a seperate issue in the tracker for this,
though.
msg56419 - (view) Author: Alexandre Vassalotti (alexandre.vassalotti) * (Python committer) Date: 2007-10-14 16:17
I thought of another way to implement PyUnicode_DecodeFSDefault. If
Py_FileSystemDefaultEncoding is set, decode with the codecs module,
otherwise use UTF-8 + replace. This works because when
Py_FileSystemDefaultEncoding is initialized at the end of
Py_InitializeEx(), the codecs module is ready to be used. Here's what
it looks like:

PyObject*
PyUnicode_DecodeFSDefault(const char *s)
{
    Py_ssize_t size = (Py_ssize_t)strlen(s);

    /* During the early bootstrapping process, Py_FileSystemDefaultEncoding
       can be undefined. If it is case, decode using UTF-8. The
following assumes
       that Py_FileSystemDefaultEncoding is set to a built-in encoding
during the
       bootstrapping process where the codecs aren't ready yet.
    */
    if (Py_FileSystemDefaultEncoding) {
        return PyUnicode_Decode(s, size,
                                Py_FileSystemDefaultEncoding,
                                "replace");
    }
    else {
        return PyUnicode_DecodeUTF8(s, size, "replace");
    }
}

It is not perfect, since the extra function calls in the codecs module
causes test_profile and test_doctest to fail. However, I think this is
much simpler that the previous versions.
msg56420 - (view) Author: Christian Heimes (christian.heimes) * (Python committer) Date: 2007-10-14 16:22
Alexandre Vassalotti wrote:
> Alexandre Vassalotti added the comment:
> 
> I thought of another way to implement PyUnicode_DecodeFSDefault. If
> Py_FileSystemDefaultEncoding is set, decode with the codecs module,
> otherwise use UTF-8 + replace. This works because when
> Py_FileSystemDefaultEncoding is initialized at the end of
> Py_InitializeEx(), the codecs module is ready to be used. Here's what
> it looks like:

That's almost what I had in mind but with two exceptions for __APPLE__
and MS_WINDOWS. For both OS the value of Py_FileSystemDefaultEncoding is
hard coded to UTF-8 / MBCS. In your code the method would use
PyUnicode_Decode() too early but you can work around the problem with
two #if defined.

Christian
msg56422 - (view) Author: Christian Heimes (christian.heimes) * (Python committer) Date: 2007-10-14 17:06
Changes since updated_file_fsenc-5.patch:

* Fix for hard coded FS default encoding on Apple and Windows
* Added two notes to unicode_default_encoding and
Py_FileSystemDefaultEncoding
msg56425 - (view) Author: Guido van Rossum (gvanrossum) * (Python committer) Date: 2007-10-15 00:13
This looks promising. I'm working on the freeze issue. Once I get that
working I'll check this in. Thanks Alexandre and Christian for all
your hard work!!!
msg56426 - (view) Author: Guido van Rossum (gvanrossum) * (Python committer) Date: 2007-10-15 00:27
> The problem is the imp module, which modulefinder uses, does not
> detect the encoding of the files from the mode-line. This causes
> TextIOWrapper to crash when it tries to read modules using an encoding
> other than ASCII or UTF-8. Here an example:
>
>   >>> import imp
>   >>> imp.find_module('heapq')[0].read()
>   Traceback (most recent call last):
>   ...
>   UnicodeDecodeError: 'utf8' codec can't decode bytes in position
>   1428-1430: invalid data

I can't reproduce this. Can you open a separate issue?
msg56427 - (view) Author: Christian Heimes (christian.heimes) * (Python committer) Date: 2007-10-15 00:34
>>   UnicodeDecodeError: 'utf8' codec can't decode bytes in position
>>   1428-1430: invalid data
> 
> I can't reproduce this. Can you open a separate issue?

It breaks for me with the same error message on Ubuntu Linux, i386,
UCS-4 build and locale de_DE.UTF-8.

(<io.TextIOWrapper object at 0xb7c9158c>,
'/home/heimes/dev/python/py3k/Lib/heapq.py', ('.py', 'U', 1))
>>> imp.find_module('heapq')[0].read()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/heimes/dev/python/py3k/Lib/io.py", line 1224, in read
    res += decoder.decode(self.buffer.read(), True)
  File "/home/heimes/dev/python/py3k/Lib/codecs.py", line 291, in decode
    (result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf8' codec can't decode bytes in position
1428-1430: invalid data
>>> imp.find_module('heapq')[0].readline()
'# -*- coding: Latin-1 -*-\n'
>>> imp.find_module('heapq')[0].encoding
'UTF-8'
msg56428 - (view) Author: Christian Heimes (christian.heimes) * (Python committer) Date: 2007-10-15 00:36
> This looks promising. I'm working on the freeze issue. Once I get that
> working I'll check this in. Thanks Alexandre and Christian for all
> your hard work!!!

You're welcome. Does the patch qualify me for Misc/ACKS? :)

I'm going to work on the basestring patch after you have committed the
changes.

Christian
msg56429 - (view) Author: Christian Heimes (christian.heimes) * (Python committer) Date: 2007-10-15 00:57
I found two minor bugs in the fix. In Modules/posixmodule.c the tmpnam()
and tempnam() methods return a PyString instance. Please change line
5373 and 5431 to use PyUnicode_DecodeFSDefault().

Index: Modules/posixmodule.c
===================================================================
--- Modules/posixmodule.c       (Revision 58461)
+++ Modules/posixmodule.c       (Arbeitskopie)
@@ -5370,7 +5370,7 @@
 #endif
     if (name == NULL)
         return PyErr_NoMemory();
-    result = PyString_FromString(name);
+    result = PyUnicode_DecodeFSDefault(name);
     free(name);
     return result;
 }
@@ -5428,7 +5428,7 @@
        Py_XDECREF(err);
        return NULL;
     }
-    return PyString_FromString(buffer);
+    return PyUnicode_DecodeFSDefault(buffer);
 }
 #endif
msg56430 - (view) Author: Guido van Rossum (gvanrossum) * (Python committer) Date: 2007-10-15 01:31
> > This looks promising. I'm working on the freeze issue. Once I get that
> > working I'll check this in. Thanks Alexandre and Christian for all
> > your hard work!!!
>
> You're welcome. Does the patch qualify me for Misc/ACKS? :)

Yes, and also Alexandre. :-)

> I'm going to work on the basestring patch after you have committed the
> changes.

The commit may have to wait a bit; I'm finding some problems on OSX
when the directory in which I'm building has a unicode character in
it. (Now, svn has severe problems with this as well, and so far it
looks like all the tests run; but still, it's disturbing.)
msg56432 - (view) Author: Guido van Rossum (gvanrossum) * (Python committer) Date: 2007-10-15 02:53
Committed revision 58466.

Fingers crossed.
msg56482 - (view) Author: Alexandre Vassalotti (alexandre.vassalotti) * (Python committer) Date: 2007-10-16 01:22
I wrote in msg56419:
> It is not perfect, since the extra function calls in the codecs module
> causes test_profile and test_doctest to fail.

How this was resolved?
msg56483 - (view) Author: Christian Heimes (christian.heimes) * (Python committer) Date: 2007-10-16 01:25
>> It is not perfect, since the extra function calls in the codecs module
>> causes test_profile and test_doctest to fail.
> 
> How this was resolved?

It's not resolved yet.
History
Date User Action Args
2022-04-11 14:56:27adminsetgithub: 45613
2007-10-16 01:25:31christian.heimessetmessages: + msg56483
2007-10-16 01:22:46alexandre.vassalottisetmessages: + msg56482
2007-10-15 02:53:16gvanrossumsetstatus: open -> closed
resolution: accepted
messages: + msg56432
2007-10-15 01:31:29gvanrossumsetmessages: + msg56430
2007-10-15 00:57:19christian.heimessetmessages: + msg56429
2007-10-15 00:36:23christian.heimessetmessages: + msg56428
2007-10-15 00:34:28christian.heimessetmessages: + msg56427
2007-10-15 00:27:17gvanrossumsetmessages: + msg56426
2007-10-15 00:13:33gvanrossumsetmessages: + msg56425
2007-10-14 17:06:25christian.heimessetfiles: + updated_file_fsenc-6.patch
messages: + msg56422
2007-10-14 16:22:53christian.heimessetmessages: + msg56420
2007-10-14 16:17:39alexandre.vassalottisetfiles: + updated_file_fsenc-5.patch
messages: + msg56419
2007-10-14 16:00:22alexandre.vassalottisetmessages: + msg56417
2007-10-14 15:23:43gvanrossumsetmessages: + msg56415
2007-10-14 15:05:27christian.heimessetmessages: + msg56414
2007-10-14 04:35:06alexandre.vassalottisetfiles: + updated_file_fsenc-4.patch
messages: + msg56408
2007-10-14 03:35:23gvanrossumsetmessages: + msg56406
2007-10-14 03:06:47christian.heimessetmessages: + msg56405
2007-10-14 02:15:19gvanrossumsetmessages: + msg56404
2007-10-14 01:42:29alexandre.vassalottisetfiles: + updated_file_fsenc-3.patch
messages: + msg56402
2007-10-14 01:32:10alexandre.vassalottisetfiles: + updated_file_fsenc-2.patch
messages: + msg56401
2007-10-14 01:02:25gvanrossumlinkissue1264 superseder
2007-10-14 01:00:41gvanrossumsetmessages: + msg56398
2007-10-14 00:57:17christian.heimessetmessages: + msg56397
2007-10-14 00:51:49gvanrossumsetmessages: + msg56396
2007-10-14 00:43:59christian.heimessetmessages: + msg56394
2007-10-14 00:35:54gvanrossumsetmessages: + msg56393
2007-10-13 23:27:30christian.heimessetmessages: + msg56391
2007-10-13 22:20:02alexandre.vassalottisetmessages: + msg56390
2007-10-13 22:10:19alexandre.vassalottisetfiles: + updated_file_fsenc.patch
2007-10-13 22:10:06alexandre.vassalottisetnosy: + alexandre.vassalotti
messages: + msg56389
2007-10-13 21:19:06gvanrossumsetmessages: + msg56388
2007-10-13 20:49:32gvanrossumsetassignee: gvanrossum
nosy: + gvanrossum
2007-10-12 14:09:43loewissetkeywords: + patch
2007-10-12 13:41:38christian.heimescreate