Message 135789 - Python tracker



Author	vstinner
Recipients	cdqzzy, ezio.melotti, lemburg, terry.reedy, vstinner
Date	2011-05-11.18:37:56
SpamBayes Score	2.7755576e-16
Marked as misclassified	No
Message-id	<1305139071.12577.14.camel@marge>
In-reply-to	<1305133530.72.0.884287770563.issue12057@psf.upfronthosting.co.za>

Content

> Looking at cjkencodings.py the format is pretty clear. The file
> consists of one statement that creates one dict that maps encoding
> names to a pair of (encoded) byte strings. The bytes literals are
> entirely hex escapes, with a maximum of 16 per chunk (line). From the
> usage you deduced that the first is encoded with named encoding and
> the second encoded with utf-8. (For anyone wondering, a separate utf-8
> strings is needed for each encoding because each other encoding is
> limited to a different subset of unicode chars.)
> 
> So I am not completely convinced that pulling the file apart is a
> complete win. Another entry could be added (the file is formatted with
> that possibility in mind), but it would certainly be much easier if
> the original formatting program were available.

With classic plain text files, you don't need special tools to convert a
test case. You can use your text editor, or command line tools like
iconv, to modify an existing testcase or add a new one.

Example:

$ iconv -f utf-8 Lib/test/cjkencodings/gb18030-utf8.txt -t gb18030 -o Lib/test/cjkencodings/gb18030-2.txt
$ md5sum Lib/test/cjkencodings/gb18030-2.txt Lib/test/cjkencodings/gb18030.txt
f8469bf751a9239a1038217e69d82532  Lib/test/cjkencodings/gb18030-2.txt
f8469bf751a9239a1038217e69d82532  Lib/test/cjkencodings/gb18030.txt

(Cool, iconv gives the same result :-))
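
The same check can be sketched in Python, assuming a UTF-8 reference file and an encoded file stored side by side as in the commands above. The helper name `matches_reference` and the sample file names are hypothetical, and the comparison is done on raw bytes rather than on MD5 digests.

```python
import tempfile
from pathlib import Path


def matches_reference(utf8_path, encoded_path, encoding):
    """Return True if re-encoding the UTF-8 file gives the encoded file's bytes."""
    text = Path(utf8_path).read_bytes().decode("utf-8")
    return text.encode(encoding) == Path(encoded_path).read_bytes()


if __name__ == "__main__":
    # Hypothetical sample files, not the real Lib/test/cjkencodings data.
    with tempfile.TemporaryDirectory() as tmp:
        utf8_file = Path(tmp) / "sample-utf8.txt"
        encoded_file = Path(tmp) / "sample.txt"
        text = "\u4e2d\u6587 test\n"
        utf8_file.write_bytes(text.encode("utf-8"))
        encoded_file.write_bytes(text.encode("gb18030"))
        print(matches_reference(utf8_file, encoded_file, "gb18030"))
```

This compares Python's codec with itself unless the encoded file was produced by another tool such as iconv, so it is most useful on files generated externally.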

> 1. Did one of us create the test strings (if so, how) or do they come
> from an authoritative source (like the unicode site) that created and
> checked them with their reference implementations.

Each encoding uses a different text; I don't know why. This is difficult
to see when reading hexadecimal codes...
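
To make the texts readable, the byte strings can simply be decoded. The sketch below assumes a dict shaped like the one described in the quoted message (encoding name mapped to a pair of encoded bytes and UTF-8 bytes). The helper name `decode_entries` and the sample data are hypothetical and are not taken from cjkencodings.py.

```python
def decode_entries(entries):
    """Map each encoding name to its decoded text.

    Raises ValueError if the encoded bytes and the UTF-8 reference
    do not decode to the same text.
    """
    texts = {}
    for name, (encoded, utf8_reference) in entries.items():
        text = utf8_reference.decode("utf-8")
        if encoded.decode(name) != text:
            raise ValueError("mismatch for %s" % name)
        texts[name] = text
    return texts


if __name__ == "__main__":
    # Hypothetical sample data: a different text for each encoding.
    chinese = "\u4e2d\u6587"
    japanese = "\u65e5\u672c\u8a9e"
    samples = {
        "gb2312": (chinese.encode("gb2312"), chinese.encode("utf-8")),
        "shift_jis": (japanese.encode("shift_jis"), japanese.encode("utf-8")),
    }
    for name, text in decode_entries(samples).items():
        print(name, text)
```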

> Anyway, if so, is there any possibility that we would need to get new
> test strings from that source? Or are the limitations of these coding
> definitely fixed.

I don't understand why different texts are used. Why not just use the
same original text for all testcases? One reason may be that some
encodings (e.g. ISO 2022) use escape sequences to change the current
encoding. Or maybe it is because the characters are different (Chinese vs
Japanese characters?).

Anyway, we can use multiple testcases for each encoding.

> 2. If you create a test file for hz codec with the hz codec, how do we
> know it is correct? It would only serve to detect changes in the
> future.

We can use a codec other than Python's. The iconv command line
program doesn't know the "HZ" encoding (but it knows a lot of other
encodings).
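
As one possible cross-check when no external tool supports HZ, the sketch below hand-decodes the HZ framing (`~{` and `~}` shift sequences, `~~` for a literal tilde, `~` plus newline as a line continuation) and compares the result with Python's "hz" codec. The function name `decode_hz_by_hand` is hypothetical. The check is only partly independent, because the character mapping still comes from Python's gb2312 codec.

```python
def decode_hz_by_hand(data):
    """Decode HZ bytes by interpreting the framing manually.

    Two-byte GB sequences are mapped through the gb2312 codec
    after setting the high bit of each byte.
    """
    out = []
    i = 0
    in_gb = False
    while i < len(data):
        if in_gb:
            pair = data[i:i + 2]
            if pair == b"~}":
                in_gb = False  # shift back to ASCII mode
            else:
                raw = bytes(b | 0x80 for b in pair)
                out.append(raw.decode("gb2312"))
            i += 2
        elif data[i:i + 1] == b"~":
            nxt = data[i + 1:i + 2]
            if nxt == b"{":
                in_gb = True  # shift into two-byte GB mode
            elif nxt == b"~":
                out.append("~")  # escaped literal tilde
            elif nxt != b"\n":  # "~\n" is a line continuation
                raise ValueError("invalid HZ escape at offset %d" % i)
            i += 2
        else:
            out.append(data[i:i + 1].decode("ascii"))
            i += 1
    return "".join(out)


if __name__ == "__main__":
    text = "abc\u4e2d\u6587def"
    encoded = text.encode("hz")
    # Compare the hand-written decoder with Python's own hz codec.
    print(decode_hz_by_hand(encoded) == encoded.decode("hz") == text)
```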

History
Date	User	Action	Args
2011-05-11 18:38:09	vstinner	set	recipients: + vstinner, lemburg, terry.reedy, ezio.melotti, cdqzzy
2011-05-11 18:37:56	vstinner	link	issue12057 messages
2011-05-11 18:37:56	vstinner	create