Fix handling escape characters in HZ codec #74189
Comments
hz is a Simplified Chinese codec, available in Python since around 2004. However, the hz encoder has a serious bug: it forgets to escape ~
>>> 'hi~'.encode('hz')
b'hi~' # the correct output should be b'hi~~'
As a result, we can't finish a roundtrip:
>>> b'hi~'.decode('hz')
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'hz' codec can't decode byte 0x7e in position 2: incomplete multibyte sequence
In all these years, no one has reported this bug, so I think it's pretty safe to remove the hz codec.
FYI: HZ was popular on USENET, which in the late 1980s and early 1990s generally did not allow transmission of 8-bit characters or escape characters. https://en.wikipedia.org/wiki/HZ_(character_encoding)
Do other languages have an hz codec?
[1] http://docs.oracle.com/javase/8/docs/technotes/guides/intl/encoding.doc.html
Can't we fix the bug instead of removing the whole codec? Or do you know of other bugs? The bug is only in the encoder, right? I see a unit test for '~' in the hz decoder.
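To make the encoder/decoder asymmetry concrete, here is a minimal sketch. It assumes ASCII-only input, where escaping amounts to doubling each `~`. The helper names `hz_escape_ascii` and `roundtrips` are hypothetical, introduced only for illustration; this is not the patch attached to the issue.

```python
def hz_escape_ascii(text: str) -> bytes:
    """Hypothetical helper: HZ-encode ASCII-only text by doubling each '~'.

    Assumption: the input contains no Chinese characters, so no
    '~{' / '~}' mode switches are needed.
    """
    raw = text.encode("ascii")  # raises UnicodeEncodeError on non-ASCII input
    return raw.replace(b"~", b"~~")


def roundtrips(text: str) -> bool:
    """Report whether the built-in 'hz' codec round-trips `text` unchanged."""
    try:
        return text.encode("hz").decode("hz") == text
    except UnicodeError:
        # Interpreters with the encoder bug emit a lone '~',
        # which the decoder then rejects.
        return False


escaped = hz_escape_ascii("hi~")
decoded = escaped.decode("hz")  # the decoder already understands '~~'
print(escaped, decoded)
print(roundtrips("hi~"))  # result depends on whether the encoder is fixed
```

The decoder accepting the manually escaped bytes suggests that only the encoding direction is affected, which matches the observation about the existing decoder test.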
I tried to fix this two years ago; here is the patch (not merged): But later I thought it was a good opportunity to remove this codec, since such a serious bug indicates that almost no one is using it. Fixing it would create the possibility that someone will use it in the future. hz is outdated; searching the internet, almost no one is talking about it.
We seldom just remove things; we usually deprecate in the docs and, if possible, issue a runtime warning. This is probably not the only obsolete codec. There should be a uniform policy for deprecation and removal, if ever. But for any codec, there might be archives, even if the codec is not used for new files. If the codec is buggy, I think it should be fixed. But you yourself closed bpo-24117, suggesting that you did not believe that the patches should be applied.
"But for any codec, there might be archives, even if the codec is not used for new files."
The bug is in the encoder. The codec is still usable to *decode*
My subjective feeling is that probably no old archives still exist, but I can't assert that. That's why I suggest removing it, or at least not fixing it. Ah, let's slow down the pace: this bug has existed for over a decade, so we don't need to solve it at once. I closed bpo-24117 because it had become a soup of small issues, so I split it into individual issues (such as this one).