classification
Title: Python 3, ZipFile Bug In Chinese
Type: behavior
Stage:
Components: Library (Lib), Unicode
Versions: Python 3.1
process
Status: closed
Resolution: duplicate
Dependencies:
Superseder:
Assigned To:
Nosy List: amaury.forgeotdarc, georg.brandl, vstinner, yaoyu
Priority: normal
Keywords:

Created on 2011-05-10 07:59 by yaoyu, last changed 2011-05-18 12:01 by vstinner. This issue is now closed.

Files
File name Uploaded Description
test.zip yaoyu, 2011-05-10 07:59
Messages (6)
msg135687 - Author: yaoyu (yaoyu) Date: 2011-05-10 07:59
Python 3, ZipFile Bug In Chinese:
1. Python 3.1.3 cannot extract "复件 test.txt" from test.zip. It prints the name as mojibake and then raises BadZipfile:
╕┤╝■ test.txt
Traceback (most recent call last):
  File "C:\Temp\PythonZipTest\pythonzip.py", line 14, in <module>
    main()
  File "C:\Temp\PythonZipTest\pythonzip.py", line 11, in main
    z.extract(z.namelist()[0])
  File "c:\python31\lib\zipfile.py", line 980, in extract
    return self._extract_member(member, path, pwd)
  File "c:\python31\lib\zipfile.py", line 1023, in _extract_member
    source = self.open(member, pwd=pwd)
  File "c:\python31\lib\zipfile.py", line 928, in open
    % (zinfo.orig_filename, fname))
zipfile.BadZipfile: File name in directory '╕┤╝■ test.txt' and header b'\xb8\xb4\xbc\xfe test.txt' differ.

2. Python 3.2 extracts "复件 test.txt" from test.zip under the wrong name.
  It writes the file as "╕┤╝■ test.txt".

3. Python 2.7.1 handles the same archive correctly.

          2011-05-10
Source Code
######################################################################
#coding=gbk

import zipfile
import os

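def main():
  # Look for test.zip next to this script.
  szTestDir = os.path.dirname(__file__)
  szFile = os.path.join(szTestDir, 'test.zip')
  z = zipfile.ZipFile(szFile)
  # Show the first member name as decoded by zipfile, then try to extract it.
  print(z.namelist()[0])
  z.extract(z.namelist()[0])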

if __name__ == '__main__':
  main()
msg135837 - Author: STINNER Victor (vstinner) * (Python committer) Date: 2011-05-12 14:33
This is a duplicate of #10801, which was fixed in Python 3.2 and later by 33543b4e0e5d. Should we backport the fix to Python 3.1, or can you upgrade to Python 3.2?

Output with Python 3.2: "╕┤╝■ test.txt".
msg135840 - Author: Amaury Forgeot d'Arc (amaury.forgeotdarc) * (Python committer) Date: 2011-05-12 14:48
But according to the initial report, 3.2 does not give the expected behavior either. This zip file actually stores the filename encoded with cp932, which is incorrect according to the specification of the ZIP format (only cp437 and UTF-8 are valid).

See issue10614 for a possible solution: allow users to specify an alternate encoding to handle such invalid files.
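
To illustrate why the name comes out garbled, here is a minimal sketch of the decoding rule described above. It is not the code from zipfile.py: the helper name `decode_member_name` and its `legacy_encoding` parameter are hypothetical, introduced only for illustration. The assumption is that general purpose flag bit 11 (0x800) marks a UTF-8 name and that anything else is treated as cp437.

```python
# Hypothetical helper, not part of the zipfile module.
UTF8_FLAG = 0x800  # general purpose bit 11: the stored name is UTF-8


def decode_member_name(raw, flag_bits, legacy_encoding="cp437"):
    """Decode the raw bytes of a member name.

    raw             -- name bytes as stored in the archive
    flag_bits       -- general purpose flags of the entry
    legacy_encoding -- codec used when the UTF-8 flag is not set
                       (cp437 per the ZIP specification)
    """
    if flag_bits & UTF8_FLAG:
        return raw.decode("utf-8")
    return raw.decode(legacy_encoding)


raw_name = b"\xb8\xb4\xbc\xfe test.txt"  # bytes from the traceback above

# Flag not set, default codec: the mojibake seen in the report.
print(decode_member_name(raw_name, 0))
# Same bytes with a user-supplied codec, as proposed in issue10614.
print(decode_member_name(raw_name, 0, legacy_encoding="gbk"))
```

For reference, Python 3.11 and later accept a `metadata_encoding` argument on `zipfile.ZipFile` when reading, which covers this use case.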
msg135842 - Author: STINNER Victor (vstinner) * (Python committer) Date: 2011-05-12 15:07
Oh, right.

Note: the encoding appears to be GBK, not CP932:

>>> '\u590d\u4ef6'.encode('gbk')
b'\xb8\xb4\xbc\xfe'
>>> '\u590d\u4ef6'.encode('gbk').decode('cp437')
'╕┤╝■'
>>> '\u590d\u4ef6'.encode('cp932')
...
UnicodeEncodeError: 'cp932' codec can't encode character '\u590d' ...
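
The session above also suggests a workaround for names that were already decoded as cp437: reverse that step and decode the bytes as GBK. This is only a sketch. `repair_name` is a hypothetical helper, and the GBK default is an assumption that holds for this particular archive; the ZIP file itself does not record which legacy codec was used.

```python
def repair_name(name, actual_encoding="gbk"):
    """Undo a cp437 mis-decoding of a ZIP member name.

    Returns the name unchanged if it cannot be round-tripped, for
    example when it was a genuine UTF-8 name to begin with.
    """
    try:
        return name.encode("cp437").decode(actual_encoding)
    except UnicodeError:
        return name


# The garbled name is built the same way as in the session above.
garbled = "\u590d\u4ef6 test.txt".encode("gbk").decode("cp437")
print(garbled)
print(repair_name(garbled))
```

A caller could apply this to each entry of `namelist()` and write the member data to the repaired path itself, rather than relying on `extract()` to choose the file name.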
msg136226 - Author: STINNER Victor (vstinner) * (Python committer) Date: 2011-05-18 11:30
See also #4621.
msg136232 - Author: STINNER Victor (vstinner) * (Python committer) Date: 2011-05-18 12:01
This issue is just another instance of issue #10614, so I'm closing it as a duplicate.
History
Date User Action Args
2011-05-18 12:01:36 vstinner set status: open -> closed
resolution: duplicate
messages: + msg136232
2011-05-18 11:30:57 vstinner set messages: + msg136226
2011-05-12 15:07:37 vstinner set messages: + msg135842
2011-05-12 14:48:15 amaury.forgeotdarc set nosy: + amaury.forgeotdarc
messages: + msg135840
2011-05-12 14:35:17 vstinner set nosy: + georg.brandl
2011-05-12 14:34:44 vstinner set components: + Library (Lib), Unicode
2011-05-12 14:33:52 vstinner set messages: + msg135837
versions: - Python 3.2, Python 3.3
2011-05-12 14:22:06 pitrou set nosy: + vstinner
versions: + Python 3.3
2011-05-10 07:59:56 yaoyu create