zipfile: add "unicode" option to force the filename encoding to UTF-8 #55181

Closed
vstinner opened this issue Jan 21, 2011 · 12 comments
Labels
stdlib Python modules in the Lib dir topic-unicode

Comments

@vstinner
Member

BPO 10972
Nosy @amauryfa, @pitrou, @vstinner, @serhiy-storchaka
Superseder
  • bpo-10614: ZipFile: add a filename_encoding argument
Files
  • zipfile_unicode.patch
Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.

    GitHub fields:

    assignee = None
    closed_at = <Date 2017-06-28.01:37:05.118>
    created_at = <Date 2011-01-21.12:00:43.912>
    labels = ['library', 'expert-unicode']
    title = 'zipfile: add "unicode" option to the force the filename encoding to UTF-8'
    updated_at = <Date 2017-06-28.03:58:24.085>
    user = 'https://github.com/vstinner'

    bugs.python.org fields:

    activity = <Date 2017-06-28.03:58:24.085>
    actor = 'serhiy.storchaka'
    assignee = 'none'
    closed = True
    closed_date = <Date 2017-06-28.01:37:05.118>
    closer = 'vstinner'
    components = ['Library (Lib)', 'Unicode']
    creation = <Date 2011-01-21.12:00:43.912>
    creator = 'vstinner'
    dependencies = []
    files = ['20478']
    hgrepos = []
    issue_num = 10972
    keywords = ['patch']
    message_count = 12.0
    messages = ['126724', '126725', '126727', '126731', '126734', '126735', '126745', '126746', '126759', '276182', '297125', '297148']
    nosy_count = 6.0
    nosy_names = ['amaury.forgeotdarc', 'alanmcintyre', 'pitrou', 'vstinner', 'THRlWiTi', 'serhiy.storchaka']
    pr_nums = []
    priority = 'normal'
    resolution = 'duplicate'
    stage = 'resolved'
    status = 'closed'
    superseder = '10614'
    type = None
    url = 'https://bugs.python.org/issue10972'
    versions = ['Python 3.2', 'Python 3.3']

    @vstinner
    Member Author

    ZipInfo._encodeFilename() tries the cp437 encoding and falls back to UTF-8. It is not possible to choose the encoding.

    To work around bpo-10955 (bootstrap issue with python32.zip), it would be nice to be able to create a ZIP file using only UTF-8 filenames.

    The attached patch adds a unicode parameter to ZipFile.write(), ZipFile.writestr() and the ZipInfo constructor.
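
As a rough illustration of the behaviour described in this comment, the sketch below tries cp437 first and falls back to UTF-8, setting general purpose bit 11 (0x800) when UTF-8 is used. The helper `encode_filename` and its `force_utf8` parameter are hypothetical names chosen for this example; they are not the zipfile API and not the contents of the attached patch.

```python
UTF8_FLAG = 0x800  # general purpose bit 11: the stored filename is UTF-8


def encode_filename(name, force_utf8=False):
    """Return (encoded_name, flag_bits) for a ZIP header.

    Hypothetical helper for illustration only; it is neither the
    zipfile implementation nor the attached patch.
    """
    if not force_utf8:
        try:
            # Legacy path: cp437 bytes, UTF-8 flag left clear.
            return name.encode("cp437"), 0
        except UnicodeEncodeError:
            pass  # not representable in cp437, fall through to UTF-8
    return name.encode("utf-8"), UTF8_FLAG


print(encode_filename("readme.txt"))
print(encode_filename("readme.txt", force_utf8=True))
print(encode_filename("\u263a.txt"))
```

With `force_utf8=True`, even a pure ASCII name is stored with the flag set, which is the "only UTF-8 filenames" behaviour requested above.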

    @vstinner vstinner added stdlib Python modules in the Lib dir topic-unicode labels Jan 21, 2011
    @vstinner
    Member Author

    Oh, this patch also fixes a bug: ZipFile._RealGetContents() doesn't keep the unicode flag, so opening a ZIP file and then writing it somewhere else may change the unicode flag if the flag was set but the filename is also encodable to cp437 (e.g. an ASCII filename).

    @vstinner
    Member Author

    7zip and WinRAR use the same algorithm as ZipInfo._encodeFilename(): try cp437, otherwise use UTF-8. E.g. if a filename contains ∞ (U+221E), it is encoded to UTF-8.

    WinZIP encodes all filenames to cp437: ∞ (U+221E) is replaced by 8 (U+0038), ☺ (U+263A) is replaced by... U+0001!

    7zip, WinRAR and WinZIP are all able to decode UTF-8 filenames (they handle the unicode flag correctly).
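
For readers unfamiliar with the unicode flag, here is a minimal sketch of the reader-side rule: if general purpose bit 11 (0x800) is set, the stored name is decoded as UTF-8, otherwise as cp437. `decode_filename` is a hypothetical helper written for this example, not a zipfile function, and it makes no claim about how the tools named above are implemented.

```python
UTF8_FLAG = 0x800  # general purpose bit 11 ("language encoding flag")


def decode_filename(raw, flag_bits):
    """Decode a stored ZIP filename according to the UTF-8 flag.

    Hypothetical helper for illustration, not part of zipfile.
    """
    encoding = "utf-8" if flag_bits & UTF8_FLAG else "cp437"
    return raw.decode(encoding)


stored = "\u221e.txt".encode("utf-8")
print(ascii(decode_filename(stored, UTF8_FLAG)))  # flag honoured
print(ascii(decode_filename(stored, 0)))          # flag ignored: wrong characters
```

A reader that ignores the flag still decodes without error, because every byte is valid cp437, but it produces the wrong characters.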

    @vstinner vstinner changed the title zipfile: add unicode option to the choose filename encoding zipfile: add "unicode" option to the force the filename encoding to UTF-8 Jan 21, 2011
    @pitrou
    Member

    pitrou commented Jan 21, 2011

    What kind of problem are you trying to solve?

    @vstinner
    Member Author

    > What kind of problem are you trying to solve?

    Support non-ASCII filenames in python32.zip (bpo-10955): at bootstrap, Python 3.2 can only use the UTF-8 codec (not cp437).

    But I also suppose that forcing the encoding to UTF-8 gives better Unicode support (when you decompress the archive).

    @pitrou
    Member

    pitrou commented Jan 21, 2011

    > Support non-ASCII filenames in python32.zip (bpo-10955): at bootstrap,
    > Python 3.2 can only use the UTF-8 codec (not cp437).
    >
    > But I also suppose that forcing the encoding to UTF-8 gives better
    > Unicode support (when you decompress the archive).

    The question is, rather, why you need an external flag for that.

    @vstinner
    Member Author

    > The question is, rather, why you need an external flag for that.

    Because I don't want to change the default encoding. I am not sure that all applications support UTF-8 encodings.

    But if you control your environment, forcing UTF-8 encoding should improve your Unicode support.

    @pitrou
    Member

    pitrou commented Jan 21, 2011

    > > The question is, rather, why you need an external flag for that.
    >
    > Because I don't want to change the default encoding. I am not sure
    > that all applications support UTF-8 encodings.

    If this is a ZIP standard flag, why should we care about applications
    which don't support it? Should we add other flags to disable other
    features out of fear that other applications might not support them
    either?

    > But if you control your environment, forcing UTF-8 encoding should
    > improve your Unicode support.

    How is a random user supposed to know if their tools support UTF-8
    encoding? It's not like everyone is an expert in ZIP files. This is the
    kind of situation where asking the user to make a choice is more
    confusing than helpful. When adding the flag, you not only complicate
    the API, but you have to support this flag for the rest of your life
    (well, almost :-)).

    We could instead use utf-8 by default for all non-ascii filenames (and
    *perhaps* have a separate force_cp437 flag, but default it to False).

    @amauryfa
    Member

    This looks similar to bpo-10614

    @serhiy-storchaka
    Member

    Now UTF-8 is used for non-ASCII names. Can this issue be closed as outdated?
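
One way to check this on a given interpreter is to write a small in-memory archive and read back bit 11 (0x800) of each entry's `flag_bits`. This is a sketch assuming a recent CPython 3; the entry names are arbitrary examples.

```python
import io
import zipfile

UTF8_FLAG = 0x800  # general purpose bit 11: the stored filename is UTF-8

# Write one ASCII-named and one non-ASCII-named entry to memory.
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as zf:
    zf.writestr("plain.txt", b"a")
    zf.writestr("smile_\u263a.txt", b"b")

# Read the archive back and record whether each entry has the flag set.
with zipfile.ZipFile(buf) as zf:
    flags = {
        info.filename: bool(info.flag_bits & UTF8_FLAG)
        for info in zf.infolist()
    }

for name, is_utf8 in flags.items():
    print(ascii(name), is_utf8)
```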

    @vstinner
    Member Author

    > This looks similar to bpo-10614

    Right. Let's focus on that one which has a better design. "unicode" means everything and nothing. It's more reliable to specify an encoding.

    @serhiy-storchaka
    Member

    See also bpo-28080.

    @ezio-melotti ezio-melotti transferred this issue from another repository Apr 10, 2022