Author terry.reedy
Recipients cdqzzy, ezio.melotti, hyeshik.chang, lemburg, terry.reedy, vstinner
Date 2011-05-11.20:24:36
Message-id <1305145536.69.0.482267952747.issue12057@psf.upfronthosting.co.za>
In-reply-to
Content
Reading http://tools.ietf.org/html/rfc1843 suggests that there is no HZ pair in cjkencodings.py because HZ is not a CJK encoding in its own right. Instead, it is a formatter or meta-encoding for intermixing ascii codes and GB2312(-80) codes. (I assume the '-80' suffix means the 1980 version.)

In a bytes environment, I believe a strict HZ decoder would simply separate the input bytes into alternating ascii and GB bytes by splitting on the shift chars, changing '~~' to '~', and deleting '~\n' (2 chars). So it would need a special-case test. Python shifts between ascii and GB2312 decoders to produce a unicode stream. Because the line-continuation markers are deleted, the codec is not one-to-one: decoding and then re-encoding need not reproduce the original bytes. A test sentence should contain both a line continuation and an encoded ~.

>>> hz=b'''\
This ASCII sentence has a tilde: ~~.
The next sentence is in GB.~{<:Ky2;S{#,~}~
~{NpJ)l6HK!#~}Bye.'''
>>> hz
b'This ASCII sentence has a tilde: ~~.\nThe next sentence is in GB.~{<:Ky2;S{#,~}~\n~{NpJ)l6HK!#~}Bye.'
>>> HZ = hz.decode('HZ')
>>> HZ
'This ASCII sentence has a tilde: ~.\nThe next sentence is in GB.己所不欲,勿施於人。Bye.'
# second '\n' deleted
>>> HZ.encode('HZ')
b'This ASCII sentence has a tilde: ~.\nThe next sentence is in GB.~{<:Ky2;S{#,NpJ)l6HK!#~}Bye.'
# no '~}~\n~{' in the middle of GB codes.
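
The strict decoding described earlier (split on the shift sequences, turn '~~' into '~', drop '~\n') can be sketched in pure Python. This is only an illustration under my reading of RFC 1843, not how the built-in codec is implemented; the name hz_decode_strict is hypothetical, and setting the high bit before handing the pairs to the gb2312 codec is an assumption about how GB pairs map onto that codec.

```python
def hz_decode_strict(data: bytes) -> str:
    """Illustrative strict HZ (RFC 1843) decoder; not the stdlib implementation."""
    out = []            # decoded text pieces, in order
    gb = bytearray()    # pending GB pairs, high bit already set
    in_gb = False       # False: ASCII mode, True: GB mode
    i = 0
    while i < len(data):
        pair = data[i:i + 2]
        if in_gb:
            if pair == b'~}':
                # Shift back to ASCII: flush the collected GB bytes.
                out.append(bytes(gb).decode('gb2312'))
                gb.clear()
                in_gb = False
            else:
                # RFC 1843: first byte $21-$77, second byte $21-$7E.
                if (len(pair) < 2 or not 0x21 <= pair[0] <= 0x77
                        or not 0x21 <= pair[1] <= 0x7E):
                    raise ValueError('invalid GB pair at offset %d' % i)
                gb += bytes(b | 0x80 for b in pair)
            i += 2
        elif pair[:1] == b'~':
            if pair == b'~~':
                out.append('~')      # escaped tilde
            elif pair == b'~{':
                in_gb = True         # shift to GB mode
            elif pair == b'~\n':
                pass                 # line continuation: both bytes deleted
            else:
                raise ValueError('invalid escape at offset %d' % i)
            i += 2
        else:
            out.append(pair[:1].decode('ascii'))
            i += 1
    if in_gb:
        raise ValueError('input ends in GB mode')
    return ''.join(out)


hz = (b'This ASCII sentence has a tilde: ~~.\n'
      b'The next sentence is in GB.~{<:Ky2;S{#,~}~\n~{NpJ)l6HK!#~}Bye.')
# The sketch should agree with the built-in codec on the sample above.
assert hz_decode_strict(hz) == hz.decode('hz')
```

Being strict, the sketch raises ValueError on anything the RFC does not allow, such as a lone '~' or input that ends while still in GB mode.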

I believe hz and u8=HZ.encode() should work as a test pair for the working of the hz parser itself:
>>> u8 = HZ.encode()
>>> u8
b'This ASCII sentence has a tilde: ~.\nThe next sentence is in GB.\xe5\xb7\xb1\xe6\x89\x80\xe4\xb8\x8d\xe6\xac\xb2\xef\xbc\x8c\xe5\x8b\xbf\xe6\x96\xbd\xe6\x96\xbc\xe4\xba\xba\xe3\x80\x82Bye.'
>>> u8.decode() == hz.decode('HZ')
True

However, I have no idea what the hz codec is doing with the shifted byte pairs between '~{' and '~}'. All the gb codecs decode b'<:Ky2;S{#,NpJ)l6HK!#' to '<:Ky2;S{#,NpJ)l6HK!#' (i.e., ascii chars to the same unicode chars). And they encode '己所不欲,勿施於人。' to bytes with the high bit set.

I figured it out. The 1995 rfc says "A GB (GB1 and GB2) code is a two byte code, where the first byte is in the range $21-$77 (hexadecimal), and the second byte is in the range $21-$7E." This was in the days of 7-bit bytes, at least for safe transmission. Now that we use 8-bit bytes nearly everywhere, the gb specs have probably been updated since 1980. This makes hz rather obsolete, since high-bit-unset ascii codes and high-bit-set gb codes can be mixed without the hz wrapping. In any case, Python's gb codecs act this way. So the hz codec is setting and unsetting the high bit when passing bytes to and from the gb codec (assuming it does not use a modified version internally).
>>> hhz = [c - 128 for c in '己所不欲,勿施於人。'.encode('GB2312')]
>>> bytes(hhz)
b'<:Ky2;S{#,NpJ)l6HK!#'

Perhaps there should be a separate test like the above to be sure that hz really uses GB2312-80, as specified.