
classification
Title: UnicodeEncodeError in gzip when filename contains non-ascii
Type: behavior
Stage: resolved
Components: Library (Lib)
Versions: Python 2.7

process
Status: closed
Resolution: fixed
Dependencies:
Superseder:
Assigned To: serhiy.storchaka
Nosy List: jaraco, koobs, lars.gustaebel, nadeem.vawda, python-dev, serhiy.storchaka, terry.reedy
Priority: low
Keywords: patch

Created on 2011-12-26 15:55 by jaraco, last changed 2022-04-11 14:57 by admin. This issue is now closed.

Files
gzip_unicode_filename-2.7.patch (uploaded by serhiy.storchaka, 2014-10-02 19:04)
koobs-freebsd10-build-742.log (uploaded by koobs, 2014-10-13 01:07)
Messages (13)
msg150265 - (view) Author: Jason R. Coombs (jaraco) * (Python committer) Date: 2011-12-26 15:55
While investigating #11638, I encountered another encoding issue related to tarballs. Consider this command:

python -c "import gzip; gzip.GzipFile(u'\xe5rchive', 'w', fileobj=open(u'\xe5rchive', 'wb'))"

When run, it triggers the following traceback:

Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "c:\python\lib\gzip.py", line 127, in __init__
    self._write_gzip_header()
  File "c:\python\lib\gzip.py", line 172, in _write_gzip_header
    self.fileobj.write(fname + '\000')
UnicodeEncodeError: 'ascii' codec can't encode character u'\xe5' in position 0: ordinal not in range(128)

Based on the resolution of #13639, I believe the recommended fix is to handle unicode here much like Python 3 does--specifically, detect unicode, encode to 'latin-1' if possible or leave the filename blank if not.
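
The proposal above could be sketched roughly as follows. This is an illustration only, not the committed patch: the helper name `gzip_header_fname` is hypothetical, and the logic is modeled on how Python 3's gzip module appears to build the header filename (encode to Latin-1, fall back to an empty name).

```python
import os


def gzip_header_fname(filename):
    """Return the bytes to store as the gzip header filename.

    Hypothetical helper: an empty result means the filename
    should be left out of the header entirely.
    """
    fname = os.path.basename(filename)
    if not isinstance(fname, bytes):
        try:
            # Text names are stored as Latin-1 when possible.
            fname = fname.encode('latin-1')
        except UnicodeEncodeError:
            # Not representable in Latin-1: leave the name blank.
            return b''
    # The stored name conventionally drops a trailing '.gz'.
    if fname.endswith(b'.gz'):
        fname = fname[:-3]
    return fname
```

With this sketch, `u'\xe5rchive'` is encodable and would be written to the header, while a name containing a character outside Latin-1 (for example `u'\u20ac'`) would yield an empty name instead of raising UnicodeEncodeError.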
msg150394 - (view) Author: Terry J. Reedy (terry.reedy) * (Python committer) Date: 2011-12-30 21:10
The actual fix in the previous issue, as in Python 3, was to always write the filename, but with errors replaced with '?'.
msg228251 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2014-10-02 19:04
> The actual fix in the previous issue, as in Python 3, was to always write the filename, but with errors replaced with '?'.

The filename is optional in a gzip file. If it can't be encoded to Latin-1, it should simply be omitted.

Here is a patch which backports the solution from Python 3 (accumulated f37016d42729, fb069eafaf89, 8cff949323c9, and e044fa016c85).
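
One way to observe the "filename is optional" behavior is to inspect the flags byte of the header that Python 3's gzip module writes. The sketch below assumes Python 3 and the gzip header layout from RFC 1952 (flags byte at offset 3, FNAME bit 0x08); the function name `header_has_fname` is made up for illustration.

```python
import gzip
import io

FNAME = 0x08  # FLG bit meaning "original file name present" (RFC 1952)


def header_has_fname(filename):
    """Write a small gzip stream in memory and report whether the
    header carries a filename field. Assumes Python 3."""
    buf = io.BytesIO()
    with gzip.GzipFile(filename=filename, mode='wb', fileobj=buf) as f:
        f.write(b'payload')
    # Header layout: 2 magic bytes, 1 method byte, then the flags byte.
    flags = buf.getvalue()[3]
    return bool(flags & FNAME)
```

Under these assumptions, a Latin-1 encodable name such as `u'\xe5rchive'` should set the FNAME bit, while a name that cannot be encoded to Latin-1 should produce a header without a filename rather than an error.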
msg229155 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2014-10-12 14:52
If there are no objections, I'll commit the patch soon.
msg229185 - (view) Author: Terry J. Reedy (terry.reedy) * (Python committer) Date: 2014-10-12 18:20
fine with me
msg229194 - (view) Author: Roundup Robot (python-dev) (Python triager) Date: 2014-10-12 19:25
New changeset 272c78c9c47e by Serhiy Storchaka in branch '2.7':
Issue #13664: GzipFile now supports non-ascii Unicode filenames.
https://hg.python.org/cpython/rev/272c78c9c47e
msg229195 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2014-10-12 19:28
Thank you Terry for the review.
msg229204 - (view) Author: Kubilay Kocak (koobs) (Python triager) Date: 2014-10-13 01:07
This broke a FreeBSD buildbot (koobs-freebsd10); the complete log is attached.
msg229206 - (view) Author: Kubilay Kocak (koobs) (Python triager) Date: 2014-10-13 01:25
koobs@10-STABLE-amd64:~ % locale
LANG=
LC_CTYPE="C"
LC_COLLATE="C"
LC_TIME="C"
LC_NUMERIC="C"
LC_MONETARY="C"
LC_MESSAGES="C"
LC_ALL=
msg229209 - (view) Author: Terry J. Reedy (terry.reedy) * (Python committer) Date: 2014-10-13 02:21
I rechecked, and test_gzip passes on 2.7 Win7. I also checked the revision history, and this is the only gzip or test_gzip patch in several months.
msg229227 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2014-10-13 07:31
Ah, ASCII locale...
msg229228 - (view) Author: Roundup Robot (python-dev) (Python triager) Date: 2014-10-13 07:34
New changeset 7657cc08d29b by Serhiy Storchaka in branch '2.7':
Fixed the test of issue #13664 on platforms without unicode filenames support.
https://hg.python.org/cpython/rev/7657cc08d29b
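
For readers wondering how a test can guard against an ASCII-only locale, here is a minimal sketch. It is not the content of changeset 7657cc08d29b; the helper `fs_can_encode` and the test class are hypothetical, and the check simply asks whether the name is encodable in the filesystem encoding.

```python
import sys
import unittest


def fs_can_encode(name, encoding=None):
    """Return True if `name` can be encoded for use as a filename.

    Hypothetical helper; `encoding` defaults to the interpreter's
    filesystem encoding.
    """
    if encoding is None:
        encoding = sys.getfilesystemencoding() or 'ascii'
    try:
        name.encode(encoding)
    except UnicodeEncodeError:
        return False
    return True


class UnicodeFilenameTest(unittest.TestCase):
    # Skip instead of failing where non-ASCII names are unsupported,
    # e.g. when every locale category is "C".
    @unittest.skipUnless(fs_can_encode(u'\xe5rchive'),
                         'requires non-ASCII filename support')
    def test_name_is_encodable(self):
        self.assertTrue(fs_can_encode(u'\xe5rchive'))
```

With an ASCII filesystem encoding, as the "C" locale output above suggests, the decorated test would be skipped rather than reported as a failure.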
msg229618 - (view) Author: Terry J. Reedy (terry.reedy) * (Python committer) Date: 2014-10-17 23:58
test_gzip passed after this patch.
History
Date  User  Action  Args
2022-04-11 14:57:25  admin  set  github: 57873
2014-10-17 23:58:21  terry.reedy  set  status: open -> closed; messages: + msg229618
2014-10-13 07:34:56  python-dev  set  messages: + msg229228
2014-10-13 07:31:43  serhiy.storchaka  set  messages: + msg229227
2014-10-13 02:21:39  terry.reedy  set  messages: + msg229209
2014-10-13 01:25:17  koobs  set  messages: + msg229206
2014-10-13 01:17:05  ezio.melotti  set  status: closed -> open
2014-10-13 01:07:16  koobs  set  files: + koobs-freebsd10-build-742.log; nosy: + koobs; messages: + msg229204
2014-10-12 19:28:38  serhiy.storchaka  set  status: open -> closed; resolution: fixed; messages: + msg229195; stage: commit review -> resolved
2014-10-12 19:25:48  python-dev  set  nosy: + python-dev; messages: + msg229194
2014-10-12 18:20:36  terry.reedy  set  messages: + msg229185; stage: patch review -> commit review
2014-10-12 14:52:14  serhiy.storchaka  set  messages: + msg229155
2014-10-12 14:37:13  serhiy.storchaka  set  assignee: serhiy.storchaka
2014-10-02 19:04:20  serhiy.storchaka  set  files: + gzip_unicode_filename-2.7.patch; type: behavior; keywords: + patch; nosy: + serhiy.storchaka, nadeem.vawda; messages: + msg228251; stage: patch review
2011-12-30 21:10:25  terry.reedy  set  nosy: + lars.gustaebel, terry.reedy; messages: + msg150394
2011-12-26 15:55:42  jaraco  set  components: + Library (Lib)
2011-12-26 15:55:32  jaraco  create