classification
Title: HZ codec has no test
Type:
Stage:
Components: Library (Lib), Tests, Unicode
Versions: Python 3.1, Python 3.2, Python 3.3, Python 2.7
process
Status: closed
Resolution: fixed
Dependencies: 12100
Superseder:
Assigned To:
Nosy List: cdqzzy, ezio.melotti, hyeshik.chang, lemburg, python-dev, r.david.murray, terry.reedy, vstinner
Priority: normal
Keywords: patch

Created on 2011-05-11 13:01 by vstinner, last changed 2011-05-30 22:05 by vstinner. This issue is now closed.

Files
File name Uploaded Description
convert_cjkencodings.py vstinner, 2011-05-11 13:18
cjkencodings.patch vstinner, 2011-05-11 13:27
cjkencodings_dir.patch vstinner, 2011-05-24 21:50
iso2022_tests.patch vstinner, 2011-05-24 23:11
Messages (28)
msg135773 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2011-05-11 13:01
All CJK codecs have tests except the Chinese HZ codec; I don't know why.

But to add a test, I need to add data to Lib/test/cjkencodings_test.py and the format of this file is not documented. It is not too difficult to understand the format by reading the code of the tests, but it's hard to maintain these tests (add more tests or change a test).

I need tests to be able to patch the codec to fix #12016.

My plan is to:

 - Change Lib/test/cjkencodings_test.py format: use two files for each encoding (one in the tested encoding, one in UTF-8)
 - Add tests to the HZ codec
 - Close this issue
 - Fix #12016
msg135775 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2011-05-11 13:27
convert_cjkencodings.py is a script that replaces Lib/test/cjkencodings_test.py with a Lib/test/cjkencodings/ directory:
---
big5hkscs.txt
big5hkscs-utf8.txt
big5.txt
big5-utf8.txt
cp949.txt
cp949-utf8.txt
euc_jisx0213.txt
euc_jisx0213-utf8.txt
euc_jp.txt
euc_jp-utf8.txt
euc_kr.txt
euc_kr-utf8.txt
gb18030.txt
gb18030-utf8.txt
gb2312.txt
gb2312-utf8.txt
gbk.txt
gbk-utf8.txt
johab.txt
johab-utf8.txt
shift_jis.txt
shift_jis-utf8.txt
shift_jisx0213.txt
shift_jisx0213-utf8.txt
---

cjkencodings.patch fixes Lib/test/test_multibytecodec_support.py to use the directory.
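As a rough illustration of the layout listed above, a test helper could read one pair of files per encoding. This is only a sketch: `load_teststring` and `write_demo_pair` are hypothetical names, not the functions in the attached patch, and the demo writes its own throwaway files instead of using the real test data.

```python
import os
import tempfile


def load_teststring(directory, name):
    """Return (encoded_bytes, utf8_bytes) read from <name>.txt and <name>-utf8.txt."""
    with open(os.path.join(directory, name + ".txt"), "rb") as f:
        encoded = f.read()
    with open(os.path.join(directory, name + "-utf8.txt"), "rb") as f:
        utf8 = f.read()
    return encoded, utf8


def write_demo_pair(directory, name, text):
    """Write a demo pair for `text` (a stand-in for the real test data)."""
    with open(os.path.join(directory, name + ".txt"), "wb") as f:
        f.write(text.encode(name))
    with open(os.path.join(directory, name + "-utf8.txt"), "wb") as f:
        f.write(text.encode("utf-8"))


with tempfile.TemporaryDirectory() as demo_dir:
    write_demo_pair(demo_dir, "gb2312", "\u4e2d\u6587")
    encoded, utf8 = load_teststring(demo_dir, "gb2312")
    # Both files must describe the same text.
    pair_matches = encoded.decode("gb2312") == utf8.decode("utf-8")
```

Files are opened in binary mode so that no newline or encoding translation touches the test data.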
msg135777 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2011-05-11 13:38
New files should be marked as binary in Mercurial: add "Lib/test/cjkencodings/* = BIN" in .hgeol.
msg135785 - (view) Author: Terry J. Reedy (terry.reedy) * (Python committer) Date: 2011-05-11 17:05
Looking at cjkencodings.py the format is pretty clear. The file consists of one statement that creates one dict that maps encoding names to a pair of (encoded) byte strings. The bytes literals are entirely hex escapes, with a maximum of 16 per chunk (line). From the usage you deduced that the first is encoded with the named encoding and the second is encoded with utf-8. (For anyone wondering, a separate utf-8 string is needed for each encoding because each of the other encodings is limited to a different subset of unicode chars.)

So I am not completely convinced that pulling the file apart is a complete win. Another entry could be added (the file is formatted with that possibility in mind), but it would certainly be much easier if the original formatting program were available. I do have a couple of questions.

1. Did one of us create the test strings (if so, how) or do they come from an authoritative source (like the unicode site) that created and checked them with their reference implementations. If so, the missing pair *is* a puzzle. Anyway, if so, is there any possibility that we would need to get new test strings from that source? Or are the limitations of these coding definitely fixed.

2. If you create a test file for hz codec with the hz codec, how do we know it is correct? It would only serve to detect changes in the future.
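For readers who have not seen the old file, the dict format described at the top of this message can be regenerated with a few lines. This is a sketch under assumptions: the original formatting program is not available, so `bytes_literal_lines`, `format_entry` and the dict name `teststring` are illustrative choices, not the original tool.

```python
def bytes_literal_lines(data, width=16):
    """Render data as bytes literals made of hex escapes, `width` bytes per line."""
    chunks = [data[i:i + width] for i in range(0, len(data), width)] or [b""]
    return ['b"' + "".join("\\x%02x" % byte for byte in chunk) + '"'
            for chunk in chunks]


def format_entry(name, encoded, utf8):
    """Render one dict entry: name -> (encoded bytes, utf-8 bytes)."""
    first = "\n".join(bytes_literal_lines(encoded))
    second = "\n".join(bytes_literal_lines(utf8))
    # Adjacent literals concatenate, so only one comma separates the pair.
    return "%r: (\n%s,\n%s),\n" % (name, first, second)


text = "\u4e2d\u6587" * 10
source = ("teststring = {\n"
          + format_entry("gb2312", text.encode("gb2312"), text.encode("utf-8"))
          + "}\n")

# Evaluate the generated source to confirm it round-trips.
namespace = {}
exec(source, namespace)
```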
msg135787 - (view) Author: Marc-Andre Lemburg (lemburg) * (Python committer) Date: 2011-05-11 17:27
Terry J. Reedy wrote:
> 
> Terry J. Reedy <tjreedy@udel.edu> added the comment:
> 
> Looking at cjkencodings.py the format is pretty clear. The file consists of one statement that creates one dict that maps encoding names to a pair of (encoded) byte strings. The bytes literals are entirely hex escapes, with a maximum of 16 per chunk (line). From the usage you deduced that the first is encoded with the named encoding and the second is encoded with utf-8. (For anyone wondering, a separate utf-8 string is needed for each encoding because each of the other encodings is limited to a different subset of unicode chars.)
> 
> So I am not completely convinced that pulling the file apart is a complete win. Another entry could be added (the file is formatted with that possibility in mind), but it would certainly be much easier if the original formatting program were available. I do have a couple of questions.
> 
> 1. Did one of us create the test strings (if so, how) or do they come from an authoritative source (like the unicode site) that created and checked them with their reference implementations. If so, the missing pair *is* a puzzle. Anyway, if so, is there any possibility that we would need to get new test strings from that source? Or are the limitations of these coding definitely fixed.
> 
> 2. If you create a test file for hz codec with the hz codec, how do we know it is correct? It would only serve to detect changes in the future.

Victor, could you please contact Hye-Shik Chang <perky@FreeBSD.org>
before making significant changes to the test suite.

Wouldn't it be better to just use example strings from the RFC and
keep the design as it is ?

http://tools.ietf.org/html/rfc1843
msg135789 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2011-05-11 18:37
> Looking at cjkencodings.py the format is pretty clear. The file
> consists of one statement that creates one dict that maps encoding
> names to a pair of (encoded) byte strings. The bytes literals are
> entirely hex escapes, with a maximum of 16 per chunk (line). From the
> usage you deduced that the first is encoded with the named encoding and
> the second is encoded with utf-8. (For anyone wondering, a separate utf-8
> string is needed for each encoding because each of the other encodings is
> limited to a different subset of unicode chars.)
> 
> So I am not completely convinced that pulling the file apart is a
> complete win. Another entry could be added (the file is formatted with
> that possibility in mind), but it would certainly be much easier if
> the original formatting program were available.

With classic plain text files you don't need tools to convert a test
case. You can use your text editor, or command line tools like iconv,
to modify an existing testcase or add a new one.

Example:

$ iconv -f utf-8 Lib/test/cjkencodings/gb18030-utf8.txt -t gb18030 -o Lib/test/cjkencodings/gb18030-2.txt
$ md5sum Lib/test/cjkencodings/gb18030-2.txt Lib/test/cjkencodings/gb18030.txt
f8469bf751a9239a1038217e69d82532  Lib/test/cjkencodings/gb18030-2.txt
f8469bf751a9239a1038217e69d82532  Lib/test/cjkencodings/gb18030.txt

(Cool, iconv gives the same result :-))
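The same consistency check can be sketched in Python. Note the limitation: this version re-encodes with Python's own codec, so unlike the iconv run it is not an independent implementation and only confirms that the two halves of a pair agree. `pair_is_consistent` is a hypothetical helper name, and SHA-256 is used here in place of md5sum.

```python
import hashlib


def pair_is_consistent(encoded, utf8, encoding):
    """Re-encode the UTF-8 half and compare digests with the encoded half."""
    reencoded = utf8.decode("utf-8").encode(encoding)
    return (hashlib.sha256(reencoded).hexdigest()
            == hashlib.sha256(encoded).hexdigest())


sample = "\u4e2d\u6587"
consistent = pair_is_consistent(sample.encode("gb18030"),
                                sample.encode("utf-8"), "gb18030")
# A pair whose halves disagree is reported as inconsistent.
mismatch = pair_is_consistent(b"abc", "abd".encode("utf-8"), "gb18030")
```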

> 1. Did one of us create the test strings (if so, how) or do they come
> from an authoritative source (like the unicode site) that created and
> checked them with their reference implementations.

Each encoding uses a different text, I don't know why. It's difficult to
see this fact by reading hexadecimal codes...

> Anyway, if so, is there any possibility that we would need to get new
> test strings from that source? Or are the limitations of these coding
> definitely fixed.

I don't understand why different texts are used. Why not just use the
same original text for all testcases? One reason can be that some
encodings (e.g. ISO 2022) use escape sequences to change the current
encoding. Or maybe because the characters are different (Chinese vs
Japanese characters?).

Anyway, we can use multiple testcases for each encoding.

> 2. If you create a test file for hz codec with the hz codec, how do we
> know it is correct? It would only serve to detect changes in the
> future.

We can use a codec other than Python's. The iconv command line
program doesn't know the "HZ" encoding (but it knows a lot of other
encodings).
msg135790 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2011-05-11 18:38
On Wednesday, 11 May 2011 at 17:27 +0000, Marc-Andre Lemburg wrote:
> Victor, could you please contact Hye-Shik Chang <perky@FreeBSD.org>
> before making significant changes to the test suite.

Good idea, done.

> Wouldn't it be better to just use example strings from the RFC and
> keep the design as it is ?
> 
> http://tools.ietf.org/html/rfc1843

Nice, this RFC contains some useful examples.
msg135792 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2011-05-11 18:43
Lib/test/cjkencodings_test.py was created when the CJK codecs were introduced in Python: changeset 31386 by Hye-Shik Chang <hyeshik@gmail.com>.

"Add CJK codecs support as discussed on python-dev. (SF #873597)

Several style fixes are suggested by Martin v. Loewis and
Marc-Andre Lemburg. Thanks!"
msg135801 - (view) Author: Terry J. Reedy (terry.reedy) * (Python committer) Date: 2011-05-11 20:24
Reading http://tools.ietf.org/html/rfc1843 suggests that the reason that there is no HZ pair in cjkencodings.py is that it is not a cjkencoding. Instead it is a formatter or meta-encoding for intermixing ascii codes and GB2312(-80) codes. (I assume the '-80' suffix means the 1980 version.)

In a bytes environment, I believe a strict HZ decoder would simply separate the input bytes into alternating ascii and GB bytes by splitting on the shift chars, changing '~~' to '~', and deleting '~\n' (2 chars). So it would need a special-case test. Python shifts between ascii and GB2312 decoders to produce a unicode stream. Because of the deletion of line-continuation markers, the codec is not 1 to 1. A test sentence should contain both that and an encoded ~.

>>> hz=b'''\
This ASCII sentence has a tilde: ~~.
The next sentence is in GB.~{<:Ky2;S{#,~}~
~{NpJ)l6HK!#~}Bye.'''
>>> hz
b'This ASCII sentence has a tilde: ~~.\nThe next sentence is in GB.~{<:Ky2;S{#,~}~\n~{NpJ)l6HK!#~}Bye.'
>>> HZ = hz.decode('HZ')
>>> HZ
'This ASCII sentence has a tilde: ~.\nThe next sentence is in GB.己所不欲，勿施於人。Bye.'
# second '\n' deleted
>>> HZ.encode('HZ')
b'This ASCII sentence has a tilde: ~.\nThe next sentence is in GB.~{<:Ky2;S{#,NpJ)l6HK!#~}Bye.'
# no '~}~\n~{' in the middle of GB codes.

I believe hz and u8=HZ.encode() should work as a test pair for the working of the hz parser itself:
>>> u8 = HZ.encode()
>>> u8
b'This ASCII sentence has a tilde: ~.\nThe next sentence is in GB.\xe5\xb7\xb1\xe6\x89\x80\xe4\xb8\x8d\xe6\xac\xb2\xef\xbc\x8c\xe5\x8b\xbf\xe6\x96\xbd\xe6\x96\xbc\xe4\xba\xba\xe3\x80\x82Bye.'
>>> u8.decode() == hz.decode('HZ')
True
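The strict splitting described earlier in this message can be sketched directly. This is an illustration only, not the C implementation of Python's codec: `split_hz` and `decode_runs` are hypothetical names, and a lone '~' followed by any other byte is passed through here, whereas a strict decoder would reject it.

```python
def split_hz(data):
    """Split HZ bytes into (mode, payload) runs; mode is 'ascii' or 'gb'."""
    runs, buf, mode, i = [], bytearray(), "ascii", 0
    while i < len(data):
        pair = data[i:i + 2]
        if mode == "ascii":
            if pair == b"~~":            # escaped tilde
                buf += b"~"
                i += 2
            elif pair == b"~\n":         # line continuation: dropped
                i += 2
            elif pair == b"~{":          # shift to GB mode
                if buf:
                    runs.append(("ascii", bytes(buf)))
                buf, mode = bytearray(), "gb"
                i += 2
            else:
                buf.append(data[i])
                i += 1
        elif pair == b"~}":              # shift back to ASCII mode
            if buf:
                runs.append(("gb", bytes(buf)))
            buf, mode = bytearray(), "ascii"
            i += 2
        else:                            # GB codes are consumed two bytes at a time
            buf += pair
            i += 2
    if buf:
        runs.append((mode, bytes(buf)))
    return runs


def decode_runs(runs):
    """Decode runs: GB payloads get the high bit set, then go through gb2312."""
    parts = []
    for mode, payload in runs:
        if mode == "gb":
            parts.append(bytes(b | 0x80 for b in payload).decode("gb2312"))
        else:
            parts.append(payload.decode("ascii"))
    return "".join(parts)


hz = (b"This ASCII sentence has a tilde: ~~.\n"
      b"The next sentence is in GB.~{<:Ky2;S{#,~}~\n~{NpJ)l6HK!#~}Bye.")
runs = split_hz(hz)
text = decode_runs(runs)
```

Because the line continuation sits between two GB runs, the splitter yields two consecutive 'gb' runs for this input; joining them reproduces the single Chinese sentence.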

However, I have no idea what the hz codec is doing with the shifted byte pairs between '~{' and '~}'. All the gb codecs decode b'<:Ky2;S{#,NpJ)l6HK!#' to '<:Ky2;S{#,NpJ)l6HK!#' (i.e., ascii chars to the same unicode chars). And they encode '己所不欲，勿施於人。' to bytes with the high bit set.

I figured it out. The 1995 RFC says "A GB (GB1 and GB2) code is a two byte code, where the first byte is in the range $21-$77 (hexadecimal), and the second byte is in the range $21-$7E." This was in the days of 7-bit bytes, at least for safe transmission. Now that we use 8-bit bytes nearly everywhere, the gb specs have probably been updated since 1980. This makes hz rather obsolete, since high-bit-unset ascii codes and high-bit-set gb codes can be mixed without the hz wrapping. In any case, Python's gb codecs act this way. So the hz codec is setting and unsetting the high bit when passing bytes to and from the gb codec (assuming it does not use a modified version internally).
>>> hhz = [c - 128 for c in '己所不欲，勿施於人。'.encode('GB2312')]
>>> bytes(hhz)
b'<:Ky2;S{#,NpJ)l6HK!#'

Perhaps there should be a separate test like the above to be sure that hz really uses GB2312-80, as specified.
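Such a separate test might look like the following sketch, which assumes nothing beyond the standard gb2312 and hz codecs. The variable names are illustrative, and the fullwidth comma is written as an escape (\uff0c) so it cannot be confused with an ASCII comma.

```python
# The RFC 1843 sample sentence; \uff0c is the fullwidth comma.
text = "己所不欲\uff0c勿施於人。"

gb_bytes = text.encode("gb2312")
# Clearing the high bit of every GB2312 byte should give the HZ payload.
seven_bit = bytes(b & 0x7F for b in gb_bytes)

# Wrapping the payload in the shift sequences should decode back to the text.
wrapped = b"~{" + seven_bit + b"~}"
decoded = wrapped.decode("hz")
```

If the hz codec used a character set other than GB2312 internally, `decoded` would differ from `text` for at least some inputs.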
msg135813 - (view) Author: Hyeshik Chang (hyeshik.chang) * (Python committer) Date: 2011-05-12 00:09
Hello, everyone!

The reason I chose to encode the test strings in Python source code was that I wanted them to be treated as text files, which are trackable in CVS or Subversion, and to keep Python source code free of non-ASCII characters. Now that I no longer feel the need for "text file" status, STINNER's suggestion works for me.

Actually, all the "stateful" encodings supported by cjkcodecs lack adequate tests. (There are seven more iso-2022 stateful encodings in addition to hz in Python.) "cjkencodings_test.py" is used for random chunk coding tests, and most stateful encodings are not compatible with random chunk coding. For those reasons, I didn't include test strings for them there. But they apparently still need appropriate simple string coding and stream coding tests.
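To give an idea of what a chunk coding test exercises, here is a minimal sketch that feeds randomly sized chunks to an incremental decoder and compares the result with the original text. It is not the code in test_multibytecodec_support.py; `decode_in_chunks` is a hypothetical helper, and euc_kr is used only as an example of a stateless encoding.

```python
import codecs
import random


def decode_in_chunks(data, encoding, rng, max_chunk=4):
    """Decode data by feeding randomly sized chunks to an incremental decoder."""
    decoder = codecs.getincrementaldecoder(encoding)()
    parts = []
    pos = 0
    while pos < len(data):
        size = rng.randint(1, max_chunk)
        # A chunk may end in the middle of a multibyte character; the
        # decoder keeps the pending byte until the next call.
        parts.append(decoder.decode(data[pos:pos + size]))
        pos += size
    parts.append(decoder.decode(b"", final=True))
    return "".join(parts)


sample = "\ud55c\uae00 text \ud14c\uc2a4\ud2b8"
encoded = sample.encode("euc_kr")
result = decode_in_chunks(encoded, "euc_kr", random.Random(0))
```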

STINNER Victor wrote:
> I don't understand why different texts are used. Why not just use the
> same original text for all testcases? One reason can be that some
> encodings (e.g. ISO 2022) use escape sequences to change the current
> encoding. Or maybe because the characters are different (Chinese vs
> Japanese characters?).

Almost every encoding in cjkcodecs has a different set of characters. They support different languages (Chinese, Japanese, Korean), different scripts (Hanja, Kanji, Traditional and Simplified Chinese), different standards (johab and KS X 1001 in Korean), and different versions/variants (JIS X 0201 and JIS X 0213 in Japanese). Strikingly, one of them, gb18030, is actually a "superset" of Unicode so far.
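The difference in repertoires is easy to observe from Python. The sketch below uses a hypothetical helper, `can_encode`, and one Hangul syllable picked for illustration; it simply records which codecs accept that character.

```python
def can_encode(text, encoding):
    """Return True if every character of text exists in the encoding."""
    try:
        text.encode(encoding)
    except UnicodeEncodeError:
        return False
    return True


hangul = "\ud55c"  # a Hangul syllable, present in the Korean character sets
coverage = {name: can_encode(hangul, name)
            for name in ("euc_kr", "cp949", "gb2312", "gb18030")}
```

A single shared test text would therefore have to be restricted to the characters common to every codec, which is why each encoding has its own UTF-8 counterpart.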


Terry J. Reedy wrote:
> Perhaps there should be a separate test like the above to be sure that hz really uses GB2312-80, as specified.

You're right.


By the way, my previous e-mail address <perky@FreeBSD.org> isn't reachable anymore; please write to <hyeshik@gmail.com> when needed.
msg135839 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2011-05-12 14:38
> I wanted for them to be treated as text files which are trackable
> in CVS or subversion and to keep Python source codes free of any
> non-ASCII characters

Mercurial supports binary files; I plan to mark the CJK testcases as binary using .hgeol.
msg136099 - (view) Author: Roundup Robot (python-dev) (Python triager) Date: 2011-05-16 14:50
New changeset 16503022c4b8 by Victor Stinner in branch '3.1':
Issue #12057: Convert CJK encoding testcase BLOB into multiple text files
http://hg.python.org/cpython/rev/16503022c4b8

New changeset 370db8da308f by Victor Stinner in branch '3.2':
(Merge 3.1) Issue #12057: Convert CJK encoding testcase BLOB into multiple text
http://hg.python.org/cpython/rev/370db8da308f

New changeset e7daf2acc3a7 by Victor Stinner in branch 'default':
(Merge 3.2) Issue #12057: Convert CJK encoding testcase BLOB into multiple text
http://hg.python.org/cpython/rev/e7daf2acc3a7
msg136104 - (view) Author: Roundup Robot (python-dev) (Python triager) Date: 2011-05-16 15:16
New changeset 1bd697cdd210 by Victor Stinner in branch '2.7':
Issue #12057: Convert CJK encoding testcase BLOB into multiple text files
http://hg.python.org/cpython/rev/1bd697cdd210
msg136106 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2011-05-16 15:23
Oh, I specified the wrong issue number in my last 3 commits: the test_linecache failure is related to this issue.

New changeset 9a4d4506680a by Victor Stinner in branch '3.1':
Issue #11614: Fix test_linecache, cjkencodings_test.py doesn't exist anymore
http://hg.python.org/cpython/rev/9a4d4506680a

New changeset 43cbfacae463 by Victor Stinner in branch '3.2':
(Merge 3.1) Issue #11614: Fix test_linecache, cjkencodings_test.py doesn't
http://hg.python.org/cpython/rev/43cbfacae463

New changeset 06473da99270 by Victor Stinner in branch 'default':
(Merge 3.2) Issue #11614: Fix test_linecache, cjkencodings_test.py doesn't
http://hg.python.org/cpython/rev/06473da99270
msg136159 - (view) Author: Roundup Robot (python-dev) (Python triager) Date: 2011-05-17 13:50
New changeset 83f4c270b27d by Victor Stinner in branch '2.7':
Issue #12057: Fix .hgeol and test_multibytecodec_support for the conversion of
http://hg.python.org/cpython/rev/83f4c270b27d
msg136195 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2011-05-17 23:12
ISO 2022 encodings don't have tests either: test_multibytecodec doesn't test these encodings directly; it is a "Unit test for multibytecodec itself". We may also add tests specific to ISO 2022 encodings:

 - iso2022_kr
 - iso2022_jp
 - iso2022_jp_1
 - iso2022_jp_2
 - iso2022_jp_2004
 - iso2022_jp_3
 - iso2022_jp_ext

While trying to write tests for the HZ encoding, I found a bug in CJK multibyte encodings => #12100, "Incremental encoders of CJK codecs reset the codec at each call to encode()".
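The situation described in #12100 can be explored with an incremental encoder that is fed one character per call. This is only a sketch: the exact bytes returned by each call depend on the Python version and on whether #12100 is fixed there, so no per-call output is claimed; only the round trip of the joined result is checked.

```python
import codecs

encoder = codecs.getincrementalencoder("hz")()

# Encode two Chinese characters in separate calls, then flush the encoder.
pieces = [
    encoder.encode("\u4e2d"),
    encoder.encode("\u6587"),
    encoder.encode("", final=True),
]
joined = b"".join(pieces)

# Whatever shift sequences were emitted, the stream must decode to the input.
roundtrip = joined.decode("hz")
```

Printing `pieces` shows whether the shift sequences '~{' and '~}' are repeated around every call or emitted only once for the whole stream.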
msg136234 - (view) Author: R. David Murray (r.david.murray) * (Python committer) Date: 2011-05-18 12:04
Haypo, since you've created a new directory there are makefile (and PC build file, I think) updates that will need to be made.  (This should be documented in the dev guide if it isn't already.)
msg136417 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2011-05-21 00:33
I think that issue #12100 should be fixed (wontfix/fixed) before this one.
msg136798 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2011-05-24 21:50
> Haypo, since you've created a new directory there are makefile 
> (and PC build file, I think) updates that will need to be made. 

Can you review attached cjkencodings_dir.patch?

> (This should be documented in the dev guide if it isn't already.)

Do you mean that the cjkencodings directory should be documented? (in setup.rst? subdirectories are not listed) Or the process of adding a new directory?
msg136800 - (view) Author: Terry J. Reedy (terry.reedy) * (Python committer) Date: 2011-05-24 22:14
I presume and hope David meant the process, as I would have no idea how to add a directory. And David did not seem completely sure.
msg136801 - (view) Author: Roundup Robot (python-dev) (Python triager) Date: 2011-05-24 22:17
New changeset 10b23f1c8cb6 by Victor Stinner in branch '3.1':
Issue #12057: Add tests for the HZ encoding
http://hg.python.org/cpython/rev/10b23f1c8cb6

New changeset 3368d4a04e52 by Victor Stinner in branch '3.2':
(Merge 3.1) Issue #12057: Add tests for the HZ encoding
http://hg.python.org/cpython/rev/3368d4a04e52

New changeset 06c44a518d0b by Victor Stinner in branch 'default':
(Merge 3.2) Issue #12057: Add tests for the HZ encoding
http://hg.python.org/cpython/rev/06c44a518d0b
msg136802 - (view) Author: Roundup Robot (python-dev) (Python triager) Date: 2011-05-24 22:19
New changeset 3c724c3eaed7 by Victor Stinner in branch '2.7':
Issue #12057: Add tests for the HZ encoding
http://hg.python.org/cpython/rev/3c724c3eaed7
msg136803 - (view) Author: R. David Murray (r.david.murray) * (Python committer) Date: 2011-05-24 22:23
Looks good to me.  And I meant documenting the process for adding a directory.
msg136805 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2011-05-24 23:11
iso2022_tests.patch: add some tests for ISO 2022 encodings:
 - testcase for iso2022_jp and iso2022_kr; iso2022_jp_2 reuses the iso2022_jp testcase
 - test some invalid byte sequences
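A test for invalid byte sequences can be sketched as follows. The input b'ab\xffcd' is chosen for illustration (a byte with the high bit set is not valid in a 7-bit ISO 2022 stream), and `decodes_strictly` is a hypothetical helper, not a function from the patch.

```python
def decodes_strictly(data, encoding):
    """Return True if data decodes without error under the strict handler."""
    try:
        data.decode(encoding)
    except UnicodeDecodeError:
        return False
    return True


invalid = b"ab\xffcd"
strict_ok = decodes_strictly(invalid, "iso2022_jp")

# With the 'replace' handler the bad byte becomes U+FFFD instead of raising.
replaced = invalid.decode("iso2022_jp", "replace")
```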
msg136806 - (view) Author: Roundup Robot (python-dev) (Python triager) Date: 2011-05-24 23:17
New changeset a024183e046f by Victor Stinner in branch '3.1':
Issue #12057: Add cjkencodings directory to the Makefile and Tools/msi/msi.py
http://hg.python.org/cpython/rev/a024183e046f

New changeset 4289cc96835e by Victor Stinner in branch '3.2':
(Merge 3.1) Issue #12057: Add cjkencodings directory to the Makefile and
http://hg.python.org/cpython/rev/4289cc96835e

New changeset b2b0cae86f56 by Victor Stinner in branch 'default':
(Merge 3.2) Issue #12057: Add cjkencodings directory to the Makefile and
http://hg.python.org/cpython/rev/b2b0cae86f56
msg136807 - (view) Author: Roundup Robot (python-dev) (Python triager) Date: 2011-05-24 23:21
New changeset 8ba0192a0eb1 by Victor Stinner in branch '2.7':
Issue #12057: Add cjkencodings directory to the Makefile and Tools/msi/msi.py
http://hg.python.org/cpython/rev/8ba0192a0eb1
msg137338 - (view) Author: Roundup Robot (python-dev) (Python triager) Date: 2011-05-30 22:03
New changeset 6c6923a406df by Victor Stinner in branch '2.7':
Issue #12057: Add tests for ISO 2022 codecs
http://hg.python.org/cpython/rev/6c6923a406df

New changeset 2a313ceaf17c by Victor Stinner in branch '3.2':
Issue #12057: Add tests for ISO 2022 codecs
http://hg.python.org/cpython/rev/2a313ceaf17c

New changeset 1a9ccb5bef27 by Victor Stinner in branch 'default':
(Merge 3.2) Issue #12057: Add tests for ISO 2022 codecs
http://hg.python.org/cpython/rev/1a9ccb5bef27
msg137339 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2011-05-30 22:05
We now have tests for some ISO 2022 codecs and the HZ codec, it's much better!
History
Date User Action Args
2011-05-30 22:05:13 vstinner set status: open -> closed; resolution: fixed; messages: + msg137339
2011-05-30 22:03:17 python-dev set messages: + msg137338
2011-05-24 23:21:10 python-dev set messages: + msg136807
2011-05-24 23:17:47 python-dev set messages: + msg136806
2011-05-24 23:11:40 vstinner set files: + iso2022_tests.patch; messages: + msg136805
2011-05-24 22:23:49 r.david.murray set messages: + msg136803
2011-05-24 22:19:34 python-dev set messages: + msg136802
2011-05-24 22:17:45 python-dev set messages: + msg136801
2011-05-24 22:14:24 terry.reedy set messages: + msg136800
2011-05-24 21:50:54 vstinner set files: + cjkencodings_dir.patch; messages: + msg136798
2011-05-21 00:33:45 vstinner set dependencies: + Incremental encoders of CJK codecs reset the codec at each call to encode(); messages: + msg136417
2011-05-18 12:04:32 r.david.murray set nosy: + r.david.murray; messages: + msg136234
2011-05-17 23:12:40 vstinner set messages: + msg136195
2011-05-17 13:50:16 python-dev set messages: + msg136159
2011-05-16 15:23:08 vstinner set messages: + msg136106
2011-05-16 15:16:29 python-dev set messages: + msg136104
2011-05-16 14:50:36 python-dev set nosy: + python-dev; messages: + msg136099
2011-05-12 14:38:31 vstinner set messages: + msg135839
2011-05-12 00:09:59 hyeshik.chang set messages: + msg135813
2011-05-11 20:27:58 terry.reedy set messages: - msg135802
2011-05-11 20:26:31 terry.reedy set messages: + msg135802
2011-05-11 20:24:36 terry.reedy set messages: + msg135801
2011-05-11 19:02:24 vstinner set nosy: + hyeshik.chang
2011-05-11 18:43:49 vstinner set messages: + msg135792
2011-05-11 18:38:41 vstinner set messages: + msg135790
2011-05-11 18:37:56 vstinner set messages: + msg135789
2011-05-11 17:27:43 lemburg set messages: + msg135787
2011-05-11 17:05:28 terry.reedy set messages: + msg135785; components: + Tests
2011-05-11 13:38:03 vstinner set messages: + msg135777
2011-05-11 13:32:55 vstinner link issue12016 dependencies
2011-05-11 13:32:46 vstinner set dependencies: - HZ codec has no test
2011-05-11 13:32:46 vstinner unlink issue12057 dependencies
2011-05-11 13:32:34 vstinner set dependencies: + HZ codec has no test
2011-05-11 13:32:34 vstinner link issue12057 dependencies
2011-05-11 13:27:41 vstinner set messages: + msg135775
2011-05-11 13:27:31 vstinner set files: + cjkencodings.patch
2011-05-11 13:27:17 vstinner set files: - cjkencodings.patch
2011-05-11 13:18:58 vstinner set files: + cjkencodings.patch; keywords: + patch
2011-05-11 13:18:38 vstinner set files: + convert_cjkencodings.py
2011-05-11 13:01:57 vstinner create