classification
Title: Possible regression with stdlib in zipfile
Type: behavior
Stage: patch review
Components:
Versions: Python 3.2
process
Status: closed
Resolution: fixed
Dependencies:
Superseder:
Assigned To:
Nosy List: amaury.forgeotdarc, georg.brandl, haypo, pitrou, ronaldoussoren
Priority: normal
Keywords: patch

Created on 2011-01-20 13:52 by ronaldoussoren, last changed 2011-01-23 18:42 by ronaldoussoren. This issue is now closed.

Files
File name Uploaded Description
issue10955.patch haypo, 2011-01-20 16:53
issue10955-2.patch haypo, 2011-01-21 13:04
Messages (25)
msg126614 - Author: Ronald Oussoren (ronaldoussoren) (Python committer) Date: 2011-01-20 13:52
I ran into this issue while debugging why py2app doesn't work with python 3.2rc2. The reason seems to be a regression w.r.t. having the stdlib inside a zipfile.

Note that I haven't tested this without going through py2app yet.

py2app basically recreates a minimal sys.prefix that contains just the application python files and a minimal selection of files from the stdlib.

The file structure in the app bundle contains (for python3.2):

    .../Resources/
            lib/
               python32.zip   # Most compiled python files
               python3.2/     # Files that cannot be in the zip
                  lib-dynload # Extensions

This structure works fine with python2.7 (and earlier) and python3.1. With python 3.2rc2 I get a bootstrap error because the filesystem encoding codec cannot be located.

This can be worked around by moving the encodings package and the codecs module from the zipfile to the python3.2 directory. 

That, however, is not good enough: I also have to change the default search path using Py_SetPath. The default path has python32.zip before the python3.2 directory; only when I switch those around does the application load fine.

All of this is on MacOSX 10.6.6 (where the filesystem encoding is UTF-8).

This is a regression because it is no longer possible to have a packaged python application where all python code is inside a zipfile. Some files must be outside of the zipfile to bootstrap the interpreter.
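The search-order point above can be sketched in a few lines of Python. This is only an illustration on a current Python 3, not py2app's actual code: the module names `order_demo_zip_first` and `order_demo_dir_first` and the helper `import_with_order` are hypothetical, and a temporary directory stands in for the app bundle.

```python
import importlib
import os
import sys
import tempfile
import zipfile


def import_with_order(zip_first):
    """Import a module that exists both in a zip and in a directory.

    Returns 'zip' or 'directory' depending on which sys.path entry won.
    """
    # Hypothetical module names; one per ordering to avoid sys.modules reuse.
    name = "order_demo_zip_first" if zip_first else "order_demo_dir_first"
    tmp = tempfile.mkdtemp()
    zip_path = os.path.join(tmp, "python32.zip")
    dir_path = os.path.join(tmp, "python3.2")
    os.mkdir(dir_path)

    with zipfile.ZipFile(zip_path, "w") as zf:
        zf.writestr(name + ".py", "ORIGIN = 'zip'\n")
    with open(os.path.join(dir_path, name + ".py"), "w") as f:
        f.write("ORIGIN = 'directory'\n")

    entries = [zip_path, dir_path] if zip_first else [dir_path, zip_path]
    saved = list(sys.path)
    sys.path[:0] = entries
    try:
        importlib.invalidate_caches()
        return importlib.import_module(name).ORIGIN
    finally:
        sys.path[:] = saved
```

Whichever entry comes first in sys.path supplies the module, which is why swapping python32.zip and the python3.2 directory changes what gets loaded at startup.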
msg126615 - Author: STINNER Victor (haypo) (Python committer) Date: 2011-01-20 13:54
It should be a regression introduced by #8611 or #9425.
msg126618 - Author: STINNER Victor (haypo) (Python committer) Date: 2011-01-20 14:30
zipimport decodes filenames of the archive from cp437 or UTF-8 (depending on a flag in each file entry). Python has a builtin UTF-8 codec, but no cp437 builtin codec. You should try to add encodings/cp437.py to your python3.2/ directory, or to build a ZIP archive with unicode filenames (I don't know how to do that).
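The per-entry flag mentioned above can be inspected with the zipfile module. A minimal sketch, assuming a current Python 3: the helper `utf8_flag_by_name` and the sample entry names are made up for illustration, and 0x800 is general purpose bit 11 of the ZIP format, which marks an entry name as UTF-8.

```python
import io
import zipfile

UTF8_FLAG = 0x800  # general purpose bit 11: the entry name is UTF-8


def utf8_flag_by_name(archive):
    """Map each entry name to True if its UTF-8 flag is set."""
    with zipfile.ZipFile(archive) as zf:
        return {info.filename: bool(info.flag_bits & UTF8_FLAG)
                for info in zf.infolist()}


# Build a small in-memory archive to inspect (hypothetical entry names).
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as zf:
    zf.writestr("encodings/aliases.py", "")
    zf.writestr("données/café.txt", "")
buf.seek(0)

flags = utf8_flag_by_name(buf)
```

Entries whose flag is unset are the ones zipimport has to decode from cp437.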

Call trace:
 - Load the codec of the filesystem encoding
 - Initialize the codec registry
 - Load the codec from python32.zip
 - Load cp437 or UTF-8 codec to decode python32.zip filenames
 - *Bootstrap failure*

Detailed call trace to initialize the codec registry:
 - import encodings (Lib/encodings/__init__.py)
 - import codecs (Load Lib/codecs.py)
 - import encodings.aliases (Load Lib/encodings/aliases.py)

And then the call trace to load UTF-8 codec:
 - import encodings.utf_8 (Lib/encodings/utf_8.py)

Later, initstdio() also loads the Latin-1 codec (import encodings.latin_1, Lib/encodings/latin_1.py).

Python has builtin codecs for the MBCS (filesystem encoding on Windows) and UTF-8 (filesystem encoding on Mac OS X and many other OSes) encodings, but the codec lookup still loads the corresponding encodings module (encodings/xxx.py).
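The dependency of codec lookup on the encodings package can be observed from Python itself. A small sketch on a current Python 3; the variable names are only illustrative.

```python
import codecs
import sys

# Looking up a codec goes through the search function registered by the
# encodings package, which imports the matching encodings.<name> module.
info = codecs.lookup("cp437")
loaded_from_package = "encodings.cp437" in sys.modules
```

If that module lives inside python32.zip, importing it requires zipimport to have already decoded the archive's filenames, which is the circular dependency described above.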
msg126619 - Author: STINNER Victor (haypo) (Python committer) Date: 2011-01-20 14:31
Restore priority to normal: there is a workaround, and a better fix cannot be done before 3.2 final.
msg126627 - Author: STINNER Victor (haypo) (Python committer) Date: 2011-01-20 16:53
The regression was introduced in r85690: use the correct encoding to decode the filename from the ZIP file. Attached patch fixes the bootstrap issue.
msg126635 - Author: Amaury Forgeot d'Arc (amaury.forgeotdarc) (Python committer) Date: 2011-01-20 18:15
About the patch: """Break out of this dependency by assuming that the path to the encodings module is ASCII-only."""

The 'path' here is the entry inside the zip file (and does not include the location of the zip file itself), so the comment is right as long as the Python stdlib only contains ascii names.

But if the zip file contains the stdlib *and* some other custom modules with cp437 names, the whole operation will fail; it can be the case with py2exe applications.
msg126654 - Author: STINNER Victor (haypo) (Python committer) Date: 2011-01-20 22:10
On Thursday, 20 January 2011 at 18:15 +0000, Amaury Forgeot d'Arc wrote:
> But if the zip file contains the stdlib *and* some other custom
> modules with cp437 names, the whole operation will fail; it can be the
> case with py2exe applications.

The ASCII fallback is only used before the codec registry is loaded. I
suppose that you can use non-ASCII module names in the same ZIP file: if
you load them after the codec registry is ready, it should work.

I copied the fix from Objects/unicodeobject.c, which also has a similar
bootstrap "hack" to encode/decode filenames.
msg126715 - Author: Amaury Forgeot d'Arc (amaury.forgeotdarc) (Python committer) Date: 2011-01-21 10:53
No, your change is in the read_directory() function, which reads the whole archive the first time it's used.
msg126716 - Author: STINNER Victor (haypo) (Python committer) Date: 2011-01-21 10:58
> No, your change is in the read_directory() function, 
> which reads the whole archive the first time it's used.

Oh, I thought that read_directory() only reads files one by one.
msg126717 - Author: STINNER Victor (haypo) (Python committer) Date: 2011-01-21 11:08
Ronald Oussoren and Amaury Forgeot d'Arc: do you think that it is an acceptable limitation to only accept ASCII filenames in python32.zip? (not in all ZIP files, just in the file loaded at startup)

All possible solutions:

 a) Only accept ASCII filenames in python32.zip
 b) Only accept ZIP archives using UTF-8 filenames (unicode flag set for all files in the archive). On Linux, I don't know how to create such an archive. I suppose that most ZIP archivers prefer the legacy format (unicode flag unset). But few people produce python32.zip files, maybe only py2exe / pyfreeze developers.
 c) Add encodings/cp437.py to your python3.2/ directory (outside the ZIP file), which can be a problem :-/
 d) Implement cp437 in C

I dislike (c) and (d), but I cannot say if (a) or (b) is better.
msg126720 - Author: Amaury Forgeot d'Arc (amaury.forgeotdarc) (Python committer) Date: 2011-01-21 11:50
What about tools that build one .zip file for all modules, like py2exe?

A cp437 decoder is not so ugly to implement in C. It's just a charmap.
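To illustrate why a charmap decoder is simple, here is a sketch in Python rather than C. It is not the proposed implementation: the names `CP437_TABLE` and `decode_cp437` are hypothetical, and the table is borrowed from Python's existing cp437 codec for brevity, whereas a C version would hard-code the 256 code points.

```python
# Hypothetical table: derived here from Python's own cp437 codec for brevity.
# A C implementation would hard-code these 256 code points instead.
CP437_TABLE = bytes(range(256)).decode("cp437")


def decode_cp437(data):
    """Decode bytes by looking each byte up in the 256-entry table."""
    return "".join(CP437_TABLE[byte] for byte in data)
```

Each byte maps to exactly one character, so decoding is a single table lookup per byte with no state and no error cases.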
msg126721 - Author: STINNER Victor (haypo) (Python committer) Date: 2011-01-21 11:53
Oh, py2app is implemented in Python and uses the zipfile module. So if we can control how the filename is encoded, we can fix py2app to work around this limitation :-)

7zip and WinRAR use the same algorithm as ZipFile._encodeFilename(): try cp437 or use UTF-8. E.g. if a filename contains ∞ (U+221E), it is encoded to UTF-8.

WinZIP encodes all filenames to cp437: ∞ (U+221E) is replaced by 8 (U+0038), ☺ (U+263A) is replaced by... U+0001 !
msg126726 - Author: STINNER Victor (haypo) (Python committer) Date: 2011-01-21 12:04
#10972 has a patch for zipfile to set the filename encoding of a ZipInfo object (to force the encoding to UTF-8).
msg126733 - Author: STINNER Victor (haypo) (Python committer) Date: 2011-01-21 12:45
On Linux, the "zip" command line program (InfoZIP zip program) only sets the unicode flag if it is able to set the locale to "en_US.UTF-8". It could do better: check whether the locale encoding is UTF-8, and only switch to the "en_US.UTF-8" locale if it is not.

The conclusion is that, today, it is very hard to create archives using only UTF-8 names (unicode flag set): only the zip program can do that on Linux.

With issue10955.patch alone: py2app and py2exe will only support ASCII filenames, but at least it fixes this issue :-)

With issue10955.patch + zipfile_unicode.patch (#10972): py2app will support non-ASCII filenames, but py2exe will only support ASCII filenames.

We can fix the bootstrap issue today, and improve zipfile later to support non-ASCII filenames. Anyway, Python 3.2 doesn't support non-ASCII filenames on Windows (#3080), and I plan to fix this in Python 3.3.
msg126736 - Author: STINNER Victor (haypo) (Python committer) Date: 2011-01-21 13:04
Patch version 2: display a more useful error message:

$ python
Fatal Python error: Py_Initialize: Unable to get the locale encoding
NotImplementedError: bootstrap issue: python32.zip contains non-ASCII filenames without the unicode flag
Aborted

Instead of (message without the patch):

$ python
Fatal Python error: Py_Initialize: Unable to get the locale encoding
LookupError: no codec search functions registered: can't find encoding
Aborted
msg126737 - Author: Antoine Pitrou (pitrou) (Python committer) Date: 2011-01-21 13:18
Victor's second patch looks good to me. Georg, is this a release blocker?
msg126740 - Author: Ronald Oussoren (ronaldoussoren) (Python committer) Date: 2011-01-21 14:30
The python32.zip file generated by py2app contains both files from the stdlib and application files.  I cannot avoid having non-ascii filenames when a python package contains data files that have such names.

The patch in Issue10972 would be nice to have, that way py2app can enforce that the zipfile uses UTF-8 names.
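The mixed archive described here can be approximated as follows. This is a sketch on a current Python 3, not py2app's build step: the module name `bundle_demo_app` and the data file name are hypothetical, and the archive is written with the zipfile module, which flags non-ASCII names as UTF-8 on its own.

```python
import importlib
import os
import sys
import tempfile
import zipfile

tmp = tempfile.mkdtemp()
zip_path = os.path.join(tmp, "python32.zip")

# One importable module plus a data file with a non-ASCII name.
with zipfile.ZipFile(zip_path, "w") as zf:
    zf.writestr("bundle_demo_app.py", "GREETING = 'hello'\n")
    zf.writestr("bundle_demo_data/café.txt", "data file")

sys.path.insert(0, zip_path)
try:
    importlib.invalidate_caches()
    module = importlib.import_module("bundle_demo_app")
finally:
    sys.path.remove(zip_path)

greeting = module.GREETING
```

The import succeeds because the only non-ASCII entry carries the UTF-8 flag, so no cp437 decoding is needed to read the archive's directory.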
msg126741 - Author: Antoine Pitrou (pitrou) (Python committer) Date: 2011-01-21 14:32
> The python32.zip file generated by py2app contains both files from the
> stdlib and application files.  I cannot avoid having non-ascii
> filenames when a python package contains data files that have such
> names.

I don't think this is a problem. We are only talking about
bootstrap-time importing of encodings modules. Once the encodings
machinery is initialized, importing non-ascii files should work fine.
msg126742 - Author: Amaury Forgeot d'Arc (amaury.forgeotdarc) (Python committer) Date: 2011-01-21 14:36
> We are only talking about bootstrap-time importing of encodings modules.
Again, the whole zip central directory is loaded on first import. If the zip file contains non-ascii filenames, nothing can be imported.
msg126743 - Author: Antoine Pitrou (pitrou) (Python committer) Date: 2011-01-21 14:45
> Again, the whole zip central directory is loaded on first import. If
> the zip file contains non-ascii filenames, nothing can be imported.

Does it have to be decoded eagerly, though?
msg126744 - Author: STINNER Victor (haypo) (Python committer) Date: 2011-01-21 14:58
> I cannot avoid having non-ascii filenames when a python package
> contains data files that have such names.

Are "data files" Python modules (.py files)? Or can it be anything?
msg126823 - Author: Georg Brandl (georg.brandl) (Python committer) Date: 2011-01-22 08:52
Patch #2 looks innocent enough to me, and is clearly an improvement.
msg126824 - Author: Georg Brandl (georg.brandl) (Python committer) Date: 2011-01-22 08:53
For 3.3, we might want to consider implementing cp437 in C, as a necessary consequence of supporting import from zipfiles.  Shouldn't be so hard, I guess.
msg126829 - Author: STINNER Victor (haypo) (Python committer) Date: 2011-01-22 10:35
georg.brandl> Patch #2 looks innocent enough to me, 
georg.brandl> and is clearly an improvement.

Ok, issue fixed by r88140 (+r88141):

Issue #10955: zipimport uses ASCII encoding instead of cp437 to decode
filenames, at bootstrap, if the codec registry is not ready yet. It is
still possible to have non-ASCII filenames using the Unicode flag
(UTF-8 encoding) for all file entries in the ZIP file.

Oh, by the way, using ASCII at bootstrap is not a regression of Python 3.2: Python 3.1 used the wrong encoding (UTF-8) to decode filenames encoded to cp437. Raising a UnicodeDecodeError is better than decoding with the wrong encoding.
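The decoding rule from the commit message can be sketched in Python. The real change is in zipimport's C code, so this is only a model of it: `decode_entry_name` and its `registry_ready` parameter are hypothetical, and the error raised in the fallback branch reuses the message shown for patch version 2.

```python
UTF8_FLAG = 0x800  # general purpose bit 11 of a ZIP entry


def decode_entry_name(raw, flag_bits, registry_ready):
    """Model of how an entry name is decoded, per the commit message."""
    if flag_bits & UTF8_FLAG:
        # Unicode flag set: UTF-8 is builtin, so this always works.
        return raw.decode("utf-8")
    if registry_ready:
        # Normal case once the encodings package can be imported.
        return raw.decode("cp437")
    try:
        # Bootstrap: assume ASCII rather than guess a wrong encoding.
        return raw.decode("ascii")
    except UnicodeDecodeError:
        raise NotImplementedError(
            "bootstrap issue: python32.zip contains non-ASCII "
            "filenames without the unicode flag")
```

For comparison, the cp437 bytes for "café.py" are `b"caf\x82.py"`, which is not valid UTF-8, so the Python 3.1 behaviour of decoding such names as UTF-8 could not give the right result either.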

> For 3.3, we might want to consider implementing cp437 in C, ...

Since I don't like this solution, I will not open a new issue for that. Feel free to open a new issue if you consider that we need that in Python 3.3.
msg126896 - Author: Ronald Oussoren (ronaldoussoren) (Python committer) Date: 2011-01-23 18:42
Data files can be anything that can be a data file in a setuptools/distribute setup.py file. Note that #10972 isn't necessary when python32.zip is built using the zipfile module: _encodeFilenameFlags uses either ASCII or UTF-8 to encode filenames, and the new zipimport behavior matches that.
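A build tool could check this property before shipping an archive. A minimal sketch, assuming a current Python 3: the helper `bootstrap_unsafe_names` is hypothetical and simply lists entries whose names are non-ASCII but lack the UTF-8 flag, which are the names the bootstrap-time ASCII fallback cannot decode.

```python
import io
import zipfile

UTF8_FLAG = 0x800  # general purpose bit 11: the entry name is UTF-8


def is_ascii(text):
    """Return True if every character of text is ASCII."""
    return all(ord(char) < 128 for char in text)


def bootstrap_unsafe_names(archive):
    """Return entry names that are non-ASCII but lack the UTF-8 flag."""
    with zipfile.ZipFile(archive) as zf:
        return [info.filename for info in zf.infolist()
                if not is_ascii(info.filename)
                and not info.flag_bits & UTF8_FLAG]


# An archive written by the zipfile module itself (hypothetical names).
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as zf:
    zf.writestr("codecs.py", "")
    zf.writestr("données/café.txt", "")
buf.seek(0)

unsafe = bootstrap_unsafe_names(buf)
```

For an archive produced by zipfile the list is empty, which matches the observation that such archives already fit the new zipimport behavior.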

I can confirm that the current HEAD fixes the problems I had in py2app.

BTW, I do consider this a regression because having the stdlib in a zipfile used to work in earlier versions, was obviously intended to work (python32.zip is in the default value for sys.path), and no longer worked.

And last but definitely not least: Thanks for the quick response.
History
Date User Action Args
2011-01-29 19:38:23 ned.deily link issue11065 superseder
2011-01-23 18:42:47 ronaldoussoren set nosy: georg.brandl, ronaldoussoren, amaury.forgeotdarc, pitrou, haypo; messages: + msg126896
2011-01-22 10:35:40 haypo set status: open -> closed; messages: + msg126829; resolution: fixed; nosy: georg.brandl, ronaldoussoren, amaury.forgeotdarc, pitrou, haypo
2011-01-22 08:53:50 georg.brandl set nosy: georg.brandl, ronaldoussoren, amaury.forgeotdarc, pitrou, haypo; messages: + msg126824
2011-01-22 08:52:27 georg.brandl set nosy: georg.brandl, ronaldoussoren, amaury.forgeotdarc, pitrou, haypo; messages: + msg126823
2011-01-21 14:58:07 haypo set nosy: georg.brandl, ronaldoussoren, amaury.forgeotdarc, pitrou, haypo; messages: + msg126744
2011-01-21 14:45:40 pitrou set nosy: georg.brandl, ronaldoussoren, amaury.forgeotdarc, pitrou, haypo; messages: + msg126743
2011-01-21 14:36:26 amaury.forgeotdarc set nosy: georg.brandl, ronaldoussoren, amaury.forgeotdarc, pitrou, haypo; messages: + msg126742
2011-01-21 14:32:24 pitrou set nosy: georg.brandl, ronaldoussoren, amaury.forgeotdarc, pitrou, haypo; messages: + msg126741
2011-01-21 14:30:25 ronaldoussoren set nosy: georg.brandl, ronaldoussoren, amaury.forgeotdarc, pitrou, haypo; messages: + msg126740
2011-01-21 13:18:50 pitrou set nosy: + pitrou; messages: + msg126737; stage: test needed -> patch review
2011-01-21 13:04:45 haypo set files: + issue10955-2.patch; nosy: georg.brandl, ronaldoussoren, amaury.forgeotdarc, haypo; messages: + msg126736
2011-01-21 12:45:04 haypo set nosy: georg.brandl, ronaldoussoren, amaury.forgeotdarc, haypo; messages: + msg126733
2011-01-21 12:04:50 haypo set nosy: georg.brandl, ronaldoussoren, amaury.forgeotdarc, haypo; messages: + msg126726
2011-01-21 11:53:48 haypo set nosy: georg.brandl, ronaldoussoren, amaury.forgeotdarc, haypo; messages: + msg126721
2011-01-21 11:50:18 amaury.forgeotdarc set nosy: georg.brandl, ronaldoussoren, amaury.forgeotdarc, haypo; messages: + msg126720
2011-01-21 11:08:22 haypo set nosy: georg.brandl, ronaldoussoren, amaury.forgeotdarc, haypo; messages: + msg126717
2011-01-21 10:58:56 haypo set nosy: georg.brandl, ronaldoussoren, amaury.forgeotdarc, haypo; messages: + msg126716
2011-01-21 10:53:59 amaury.forgeotdarc set nosy: georg.brandl, ronaldoussoren, amaury.forgeotdarc, haypo; messages: + msg126715
2011-01-20 22:10:46 haypo set nosy: georg.brandl, ronaldoussoren, amaury.forgeotdarc, haypo; messages: + msg126654
2011-01-20 18:15:32 amaury.forgeotdarc set nosy: + amaury.forgeotdarc; messages: + msg126635
2011-01-20 16:53:05 haypo set files: + issue10955.patch; messages: + msg126627; keywords: + patch; nosy: georg.brandl, ronaldoussoren, haypo
2011-01-20 14:31:05 haypo set priority: release blocker -> normal; nosy: georg.brandl, ronaldoussoren, haypo; messages: + msg126619
2011-01-20 14:30:02 haypo set nosy: georg.brandl, ronaldoussoren, haypo; messages: + msg126618
2011-01-20 13:54:40 haypo set priority: normal -> release blocker; nosy: + georg.brandl
2011-01-20 13:54:26 haypo set nosy: + haypo; messages: + msg126615
2011-01-20 13:52:14 ronaldoussoren create