tarfile: use surrogates for undecode fields #52637

Closed
vstinner opened this issue Apr 13, 2010 · 8 comments
Labels
stdlib Python modules in the Lib dir topic-unicode

Comments

@vstinner
Member

BPO 8390
Nosy @loewis, @gustaebel, @vstinner
Files
  • tarfile_surrogates.patch
  • tarfile_surrogates.2.diff
  • Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.


    GitHub fields:

    assignee = None
    closed_at = <Date 2010-05-07.00:18:51.422>
    created_at = <Date 2010-04-13.23:53:14.225>
    labels = ['library', 'expert-unicode']
    title = 'tarfile: use surrogates for undecode fields'
    updated_at = <Date 2010-05-07.00:18:51.414>
    user = 'https://github.com/vstinner'

    bugs.python.org fields:

    activity = <Date 2010-05-07.00:18:51.414>
    actor = 'vstinner'
    assignee = 'none'
    closed = True
    closed_date = <Date 2010-05-07.00:18:51.422>
    closer = 'vstinner'
    components = ['Library (Lib)', 'Unicode']
    creation = <Date 2010-04-13.23:53:14.225>
    creator = 'vstinner'
    dependencies = []
    files = ['16917', '17227']
    hgrepos = []
    issue_num = 8390
    keywords = ['patch']
    message_count = 8.0
    messages = ['103099', '104606', '104867', '104870', '104872', '104873', '105085', '105096']
    nosy_count = 3.0
    nosy_names = ['loewis', 'lars.gustaebel', 'vstinner']
    pr_nums = []
    priority = 'normal'
    resolution = 'fixed'
    stage = None
    status = 'closed'
    superseder = None
    type = None
    url = 'https://bugs.python.org/issue8390'
    versions = ['Python 3.1', 'Python 3.2']

    @vstinner
    Member Author

    When reading a tar archive, tarfile decodes fields using the "replace" error handler by default. As a result, information is lost whenever a field contains an undecodable byte.

    Since PEP 383, undecodable filenames are represented using surrogates in Python 3. I think it's a good idea to use surrogates for tar as well, because undecodable data in a tar archive is a common problem (see the Unicode section of the tarfile documentation).
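The difference between the two error handlers can be seen with the codecs alone, without tarfile. The following is a minimal sketch; the byte string `b"caf\xe9.txt"` is a made-up example of a Latin-1 encoded name being decoded as UTF-8.

```python
# Hypothetical field content: a Latin-1 encoded name. The byte 0xE9 is not
# valid UTF-8 on its own.
raw = b"caf\xe9.txt"

# "replace" substitutes U+FFFD, so the original byte cannot be recovered.
replaced = raw.decode("utf-8", "replace")

# "surrogateescape" (PEP 383) maps the bad byte 0xE9 to the lone surrogate U+DCE9.
escaped = raw.decode("utf-8", "surrogateescape")

# Encoding with the same handler restores the original bytes exactly.
roundtrip = escaped.encode("utf-8", "surrogateescape")

print(ascii(replaced))   # 'caf\ufffd.txt'
print(ascii(escaped))    # 'caf\udce9.txt'
print(roundtrip == raw)  # True
```

`ascii()` is used for printing because a string containing lone surrogates generally cannot be written to a UTF-8 terminal directly.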

    @vstinner vstinner added stdlib Python modules in the Lib dir topic-unicode labels Apr 13, 2010
    @vstinner
    Member Author

    lars: Do you have an opinion about this suggestion?

    @gustaebel
    Mannequin

    gustaebel mannequin commented May 3, 2010

    Yes, I will soon have ;-) Please give me a few days...

    @vstinner
    Member Author

    vstinner commented May 3, 2010

    Storing fields as bytes might be a better fix, but it would break compatibility, and Unicode is practical in Python 3.

    @loewis
    Mannequin

    loewis mannequin commented May 3, 2010

    I think it is helpful to read the pax specification here:

    http://www.opengroup.org/onlinepubs/009695399/utilities/pax.html

    pax defines (IIUC) that all strings in a pax-compliant tar file are UTF-8 encoded. For the "invalid" option, they offer the alternatives bypass, rename, UTF-8, and write. It may be useful to provide the same options, in some form.

    @vstinner
    Member Author

    vstinner commented May 3, 2010

    My patch changes test_uname_unicode() of test_tarfile for the GNU and ustar formats (but not PAX). In GNU and ustar formats, the fields can be encoded in any encoding, and may contain invalid byte sequences.
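To illustrate the point about ustar fields holding arbitrary bytes, here is a small sketch that builds an in-memory ustar archive whose member name contains a raw 0xE9 byte and then reads it back with each error handler. The helper names `make_archive` and `read_names` and the member name are made up for this example; only `tarfile.open()` with its `format`, `encoding` and `errors` arguments comes from the standard library.

```python
import io
import tarfile

def make_archive(fmt):
    """Build an in-memory archive whose member name holds a raw 0xE9 byte."""
    buf = io.BytesIO()
    # Writing with surrogateescape turns U+DCE9 back into the single byte 0xE9.
    with tarfile.open(fileobj=buf, mode="w", format=fmt,
                      encoding="utf-8", errors="surrogateescape") as tar:
        tar.addfile(tarfile.TarInfo(name="caf\udce9.txt"))  # empty member
    return buf.getvalue()

def read_names(data, errors):
    """Return member names decoded as UTF-8 with the given error handler."""
    with tarfile.open(fileobj=io.BytesIO(data), mode="r:",
                      encoding="utf-8", errors=errors) as tar:
        return tar.getnames()

data = make_archive(tarfile.USTAR_FORMAT)

# With "replace" the bad byte becomes U+FFFD and the original is lost.
print(ascii(read_names(data, "replace")))          # ['caf\ufffd.txt']
# With "surrogateescape" the byte survives as the surrogate U+DCE9.
print(ascii(read_names(data, "surrogateescape")))  # ['caf\udce9.txt']
```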

    @gustaebel
    Mannequin

    gustaebel mannequin commented May 5, 2010

    I think it is a good suggestion to use "surrogateescape" as the default, because (I hope) it produces the fewest errors and is the best choice if tarfile is used in connection with Python's filesystem calls.

    • When reading tar headers, undecodable chars in filenames end up as surrogates. This way no information is lost. In principle tarfile is merely a gateway to a filesystem inside an archive, so it feels natural if it treats filenames the same as Python's filesystem calls.

    • When writing tar headers, filenames with surrogate chars (e.g. from os.listdir()) will be converted back to bytes in the header (in case of gnu and ustar formats). Filenames will remain unchanged, which is exactly what one would expect.

    • When writing pax headers, filenames with surrogates will raise a UnicodeError because we may only use strict utf-8 inside a pax header. This is no different from the status quo.
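The gnu/ustar writing case from the list above can be sketched as follows. It assumes a name that already contains a surrogate, as a filesystem call might return on a UTF-8 system (the name itself is hypothetical), and inspects the raw header to confirm that the original byte is written back. The pax case is deliberately left out of this sketch.

```python
import io
import tarfile

# A name with an escaped byte, as a filesystem call might return it on a
# UTF-8 system when the on-disk name contains the stray byte 0xE9.
name = b"caf\xe9.txt".decode("utf-8", "surrogateescape")

buf = io.BytesIO()
with tarfile.open(fileobj=buf, mode="w", format=tarfile.GNU_FORMAT,
                  encoding="utf-8", errors="surrogateescape") as tar:
    tar.addfile(tarfile.TarInfo(name=name))  # empty member, no file data

# The name field occupies the first 100 bytes of the 512-byte header block
# and is padded with NUL bytes.
header_name = buf.getvalue()[:100].rstrip(b"\0")
print(header_name)  # b'caf\xe9.txt'
```

The header holds the single byte 0xE9 rather than an encoded surrogate, so an archive read and rewritten with the same handler keeps its filename bytes unchanged.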

    @martin: As I understand it, the pax "invalid"-option is supposed to deal with the case when strings from a pax header are not representable in the user's encoding. In tarfile's case we don't have this problem when reading the archive until we try to extract it.

    Unfortunately, POSIX says nothing about how to store bad filenames in a pax archive. tarfile raises an error. GNU tar fails silently, it just puts the unchanged original filename into the pax header without converting it to utf-8, thus violating the standard.

    @vstinner
    Member Author

    vstinner commented May 5, 2010

    Thank you for your review. I committed the patch as r80824 (I fixed the documentation, :versionadded => :versionchanged), blocked as r80825 (3.2).

    --

    > Unfortunately, POSIX says nothing about how to store bad filenames in
    > a pax archive. tarfile raises an error. GNU tar fails silently,
    > it just puts the unchanged original filename into the pax header
    > without converting it to utf-8, thus violating the standard.

    Right. I opened a new issue about that: bpo-8333. I consider that it's a different problem.

    @vstinner vstinner closed this as completed May 7, 2010
    @ezio-melotti ezio-melotti transferred this issue from another repository Apr 10, 2022