Fix handling escape characters in HZ codec #74189

Closed
animalize mannequin opened this issue Apr 6, 2017 · 10 comments
Labels
3.7 (EOL) end of life topic-unicode type-bug An unexpected behavior, bug, or error

Comments

@animalize
Mannequin

animalize mannequin commented Apr 6, 2017

BPO 30003
Nosy @terryjreedy, @vstinner, @ezio-melotti, @animalize, @zhangyangyu
PRs
  • bpo-30003: Fix handling escape characters in HZ codec #1556
  • [3.5] bpo-30003: Fix handling escape characters in HZ codec #1718
  • [3.6] bpo-30003: Fix handling escape characters in HZ codec #1719
  • [2.7] bpo-30003: Fix handling escape characters in HZ codec #1720
  Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.


    GitHub fields:

    assignee = None
    closed_at = <Date 2017-05-22.17:04:55.233>
    created_at = <Date 2017-04-06.03:42:17.122>
    labels = ['type-bug', '3.7', 'expert-unicode']
    title = 'Fix handling escape characters in HZ codec'
    updated_at = <Date 2017-05-22.17:04:55.232>
    user = 'https://github.com/animalize'

    bugs.python.org fields:

    activity = <Date 2017-05-22.17:04:55.232>
    actor = 'xiang.zhang'
    assignee = 'none'
    closed = True
    closed_date = <Date 2017-05-22.17:04:55.233>
    closer = 'xiang.zhang'
    components = ['Unicode']
    creation = <Date 2017-04-06.03:42:17.122>
    creator = 'malin'
    dependencies = []
    files = []
    hgrepos = []
    issue_num = 30003
    keywords = []
    message_count = 10.0
    messages = ['291207', '291214', '291216', '291298', '291312', '291315', '294150', '294161', '294162', '294163']
    nosy_count = 5.0
    nosy_names = ['terry.reedy', 'vstinner', 'ezio.melotti', 'malin', 'xiang.zhang']
    pr_nums = ['1556', '1718', '1719', '1720']
    priority = 'normal'
    resolution = 'fixed'
    stage = 'resolved'
    status = 'closed'
    superseder = None
    type = 'behavior'
    url = 'https://bugs.python.org/issue30003'
    versions = ['Python 2.7', 'Python 3.5', 'Python 3.6', 'Python 3.7']


    hz is a Simplified Chinese codec, available in Python since around 2004.

    However, the hz encoder has a serious bug: it fails to escape ~.
    >>> 'hi~'.encode('hz')
    b'hi~'    # the correct output should be b'hi~~'
    
    As a result, we can't finish a roundtrip:
    >>> b'hi~'.decode('hz')
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
    UnicodeDecodeError: 'hz' codec can't decode byte 0x7e in position 2: incomplete multibyte
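The escaping rule at issue can be sketched in a few lines. This is a minimal illustration, assuming ASCII-only input and the RFC 1843 convention that a literal `~` is written as `~~`; the helper names `escape_tilde` and `roundtrips` are hypothetical and are not part of CPython.

```python
def escape_tilde(ascii_text: str) -> bytes:
    """Encode ASCII-only text as HZ by doubling each '~' (hypothetical helper)."""
    return ascii_text.encode("ascii").replace(b"~", b"~~")


def roundtrips(text: str) -> bool:
    """Report whether the built-in hz codec can encode and decode text losslessly."""
    try:
        return text.encode("hz").decode("hz") == text
    except UnicodeError:
        # An unescaped '~' in the encoder output makes the decoder fail.
        return False


if __name__ == "__main__":
    print(escape_tilde("hi~"))               # manually escaped bytes
    print(escape_tilde("hi~").decode("hz"))  # the decoder accepts the escaped form
    print(roundtrips("hi~"))                 # result depends on the Python version in use
```

On an interpreter that still has the encoder bug, `roundtrips("hi~")` would be expected to return False, while the manually escaped bytes decode correctly either way.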

    In all these years, no one has reported this bug, so I think it is pretty safe to remove the hz codec.

    FYI:
    The HZ codec is a 7-bit wrapper for GB2312 that was formerly commonly used in email and USENET postings. It was designed in 1989 by Fung Fung Lee and subsequently codified in 1995 as RFC 1843.

    It was popular on USENET networks, which in the late 1980s and early 1990s generally did not allow transmission of 8-bit characters or escape characters.

    https://en.wikipedia.org/wiki/HZ_(character_encoding)
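To make the "7-bit wrapper for GB2312" description concrete, here is a small sketch using Python's built-in `hz` and `gb2312` codecs. The function name `hz_summary` and its dictionary keys are made up for illustration, and the comparison assumes the input consists only of double-byte GB2312 characters.

```python
def hz_summary(text: str) -> dict:
    """Compare the HZ and GB2312 encodings of text (illustrative helper)."""
    hz_bytes = text.encode("hz")
    gb_bytes = text.encode("gb2312")
    return {
        "hz": hz_bytes,
        # HZ output should contain only 7-bit bytes.
        "seven_bit": all(byte < 0x80 for byte in hz_bytes),
        # GB2312 bytes with the high bit cleared; HZ carries these between ~{ and ~}.
        "gb2312_low7": bytes(byte & 0x7F for byte in gb_bytes),
        "roundtrip": hz_bytes.decode("hz") == text,
    }


if __name__ == "__main__":
    for key, value in hz_summary("\u4e2d\u6587").items():
        print(key, value)
```

The printed `hz` value should show the `gb2312_low7` payload wrapped in the `~{` and `~}` mode-switch sequences.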

    Do other languages have an hz codec?
    Java 8: no [1]
    .NET: yes [2]
    PHP: yes [3]
    Perl: yes [4]

    [1] http://docs.oracle.com/javase/8/docs/technotes/guides/intl/encoding.doc.html
    [2] https://msdn.microsoft.com/en-us/library/system.text.encoding(v=vs.110).aspx
    [3] http://php.net/manual/en/mbstring.supported-encodings.php
    [4] http://perldoc.perl.org/Encode/CN.html

    @animalize animalize mannequin added 3.7 (EOL) end of life topic-unicode type-bug An unexpected behavior, bug, or error labels Apr 6, 2017
    @vstinner
    Member

    vstinner commented Apr 6, 2017

    Can't we fix the bug instead of removing the whole codec? Or do you know other bugs?

    The bug is only in the encoder part, right? I see a unit test for '~' in the hz decoder.

    @animalize
    Mannequin Author

    animalize mannequin commented Apr 6, 2017

    I tried to fix this two years ago; here is the patch (not merged):
    http://bugs.python.org/review/24117/diff/14803/Modules/cjkcodecs/_codecs_cn.c

    But later I came to see this as a good opportunity to remove the codec: such a serious bug going unreported indicates that almost no one is using it, whereas fixing it makes it possible that someone will start using it in the future.
    So I suggest we don't fix it; just remove it or leave it as is.

    hz is outdated; searching the internet, I find almost no one talking about it.

    "Or do you know other bugs?"
    It has another small bug in the decoder, in the state switching, but it is trivial and is also fixed in the patch.

    @terryjreedy
    Member

    We seldom just remove things; we usually deprecate in the doc and if possible, issue a runtime warning.

    This is probably not the only obsolete codec. There should be a uniform policy for deprecation and removal, if ever. But for any codec, there might be archives, even if the codec is not used for new files.

    If the codec is buggy, I think it should be fixed. But you yourself closed bpo-24117, suggesting that you did not believe that the patches should be applied.

    @vstinner
    Member

    vstinner commented Apr 8, 2017

    "But for any codec, there might be archives, even if the codec is not
    used for new files."

    The bug is in the encoder. The codec is still usable to *decode*
    files. So maybe a few people use it but didn't notice the encoder bug?
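As an illustration of the decode-only use case for archived data, the sketch below feeds HZ bytes to Python's incremental decoder in chunks, the way a file might be read. The function name `decode_hz_chunks` and the sample bytes are assumptions made for this example, not taken from any real archive.

```python
import codecs


def decode_hz_chunks(chunks):
    """Decode an iterable of HZ-encoded byte chunks into one string (illustrative)."""
    decoder = codecs.getincrementaldecoder("hz")()
    # The decoder keeps the ASCII/GB mode between calls, so a chunk may end in GB mode.
    parts = [decoder.decode(chunk) for chunk in chunks]
    # A final call flushes the decoder and raises if the input ended mid-character.
    parts.append(decoder.decode(b"", final=True))
    return "".join(parts)


if __name__ == "__main__":
    print(decode_hz_chunks([b"~{VP", b"ND~} ok"]))
```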

    @animalize
    Mannequin Author

    animalize mannequin commented Apr 8, 2017

    My subjective feeling is that probably no old archives still exist, but I can't be certain. That's why I suggest removing it, or at least not fixing it.

    Let's slow down: this bug has existed for over a decade, so we don't need to solve it at once.

    I closed bpo-24117, because it became a soup of small issues, so I split it into individual issues (such as this issue).

    @zhangyangyu zhangyangyu changed the title from "Remove hz codec" to "Fix handling escape characters in HZ codec" on May 12, 2017
    @zhangyangyu
    Member

    New changeset 89a5e03 by Xiang Zhang in branch 'master':
    bpo-30003: Fix handling escape characters in HZ codec (bpo-1556)
    89a5e03

    @zhangyangyu
    Member

    New changeset 65440f8 by Xiang Zhang in branch '3.5':
    bpo-30003: Fix handling escape characters in HZ codec (bpo-1556) (bpo-1718)
    65440f8

    @zhangyangyu
    Member

    New changeset 54af41d by Xiang Zhang in branch '3.6':
    bpo-30003: Fix handling escape characters in HZ codec (bpo-1556) (bpo-1719)
    54af41d

    @zhangyangyu
    Member

    New changeset 6e1b832 by Xiang Zhang in branch '2.7':
    bpo-30003: Fix handling escape characters in HZ codec (bpo-1720) (bpo-1556)
    6e1b832

    @ezio-melotti ezio-melotti transferred this issue from another repository Apr 10, 2022