Author vstinner
Recipients cdqzzy, ezio.melotti, lemburg, terry.reedy, vstinner
Date 2011-05-11.18:37:56
Message-id <1305139071.12577.14.camel@marge>
In-reply-to <1305133530.72.0.884287770563.issue12057@psf.upfronthosting.co.za>
Content
> Looking at cjkencodings.py the format is pretty clear. The file
> consists of one statement that creates one dict that maps encoding
> names to a pair of (encoded) byte strings. The bytes literals are
> entirely hex escapes, with a maximum of 16 per chunk (line). From the
> usage you deduced that the first is encoded with named encoding and
> the second encoded with utf-8. (For anyone wondering, a separate utf-8
> string is needed for each encoding because each other encoding is
> limited to a different subset of unicode chars.)
> 
> So I am not completely convinced that pulling the file apart is a
> complete win. Another entry could be added (the file is formatted with
> that possibility in mind), but it would certainly be much easier if
> the original formatting program were available.

With plain text files you don't need a special tool to convert a test
case. A text editor and command-line tools such as iconv are enough to
modify an existing test case or add a new one.

Example:

$ iconv -f utf-8 Lib/test/cjkencodings/gb18030-utf8.txt -t gb18030 \
    -o Lib/test/cjkencodings/gb18030-2.txt
$ md5sum Lib/test/cjkencodings/gb18030-2.txt \
    Lib/test/cjkencodings/gb18030.txt
f8469bf751a9239a1038217e69d82532  Lib/test/cjkencodings/gb18030-2.txt
f8469bf751a9239a1038217e69d82532  Lib/test/cjkencodings/gb18030.txt

(Cool, iconv gives the same result :-))
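
The same check can be done from Python. This is only a minimal sketch: the helper names reencode() and matches_reference() are made up for illustration, and it assumes the UTF-8 file and the encoded file hold the same text.

```python
from pathlib import Path


def reencode(utf8_data: bytes, encoding: str) -> bytes:
    """Decode UTF-8 bytes and re-encode them with the named codec."""
    return utf8_data.decode("utf-8").encode(encoding)


def matches_reference(utf8_path, encoded_path, encoding: str) -> bool:
    """Return True if re-encoding utf8_path reproduces encoded_path exactly."""
    produced = reencode(Path(utf8_path).read_bytes(), encoding)
    expected = Path(encoded_path).read_bytes()
    # Comparing the bytes directly is equivalent to comparing checksums.
    return produced == expected


# Hypothetical usage with the files from the example above:
# matches_reference("Lib/test/cjkencodings/gb18030-utf8.txt",
#                   "Lib/test/cjkencodings/gb18030.txt", "gb18030")
```

Note that this uses Python's own codec, so unlike iconv it is not an independent check of the codec itself.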

> 1. Did one of us create the test strings (if so, how) or do they come
> from an authoritative source (like the unicode site) that created and
> checked them with their reference implementations.

Each encoding uses a different text; I don't know why. That is difficult
to see when all you have is hexadecimal escapes...
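
To see what a pair of byte strings actually contains, decoding both sides is enough. A small sketch, where decode_pair() is a hypothetical helper and the sample bytes are my own two-character example, not taken from cjkencodings.py:

```python
def decode_pair(encoding: str, encoded: bytes, utf8_data: bytes):
    """Decode an (encoded, utf-8) pair into two readable strings."""
    return encoded.decode(encoding), utf8_data.decode("utf-8")


# Illustrative sample only: the two characters U+4E2D U+6587.
text_a, text_b = decode_pair(
    "gb18030",
    b"\xd6\xd0\xce\xc4",          # gb18030 bytes
    b"\xe4\xb8\xad\xe6\x96\x87",  # utf-8 bytes
)
print(text_a == text_b)
```

Once decoded, the texts of two encodings can be compared directly instead of comparing hex dumps.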

> Anyway, if so, is there any possibility that we would need to get new
> test strings from that source? Or are the limitations of these coding
> definitely fixed.

I don't understand why different texts are used. Why not just use the
same original text for all test cases? One reason may be that some
encodings (e.g. ISO 2022) use escape sequences to change the current
character set. Or maybe the characters themselves differ (Chinese vs
Japanese characters?).
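
Both guesses are easy to check from Python. A rough sketch, assuming the standard iso2022_jp, euc_jp, euc_kr and shift_jis codecs; the two helper names are made up for illustration:

```python
def uses_escape_sequences(text: str, encoding: str) -> bool:
    """Return True if the encoded form contains an ESC (0x1B) byte."""
    return b"\x1b" in text.encode(encoding)


def can_encode(text: str, encoding: str) -> bool:
    """Return True if every character of text exists in the encoding."""
    try:
        text.encode(encoding)
    except UnicodeEncodeError:
        return False
    return True


# ISO 2022 codecs are stateful and switch character sets with escapes.
print(uses_escape_sequences("日本語", "iso2022_jp"))
print(uses_escape_sequences("日本語", "euc_jp"))

# Hangul is covered by a Korean codec but not by a Japanese one.
print(can_encode("한국어", "euc_kr"))
print(can_encode("한국어", "shift_jis"))
```

So a single shared text would have to be restricted to the characters that all the codecs have in common.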

Anyway, we can use multiple testcases for each encoding.

> 2. If you create a test file for hz codec with the hz codec, how do we
> know it is correct? It would only serve to detect changes in the
> future.

We can use a codec other than Python's own. The iconv command-line
program doesn't know the "HZ" encoding, though (it does know a lot of
other encodings).
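
Without an external HZ codec, one partial cross-check is to derive the HZ bytes from the gb2312 codec. This is only a sketch based on my reading of RFC 1843 (GB mode is entered with "~{", left with "~}", and each GB2312 byte is sent with its high bit cleared). hz_from_gb2312() is a hypothetical helper and only handles non-empty text made entirely of GB2312 double-byte characters.

```python
def hz_from_gb2312(text: str) -> bytes:
    """Build HZ bytes from the gb2312 codec, bypassing the hz codec.

    Assumption: text is non-empty and contains only GB2312
    double-byte characters (no ASCII, no '~').
    """
    gb = text.encode("gb2312")
    if not gb or any(b < 0x80 for b in gb):
        raise ValueError("only non-empty GB2312 double-byte text is supported")
    # Clear the high bit of every byte and wrap the run in ~{ ... ~}.
    return b"~{" + bytes(b & 0x7F for b in gb) + b"~}"


sample = "中文"
print(sample.encode("hz") == hz_from_gb2312(sample))
```

This still relies on Python's gb2312 codec, so it detects bugs in the HZ framing only, not in the underlying mapping table.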
History
Date                 User      Action         Args
2011-05-11 18:38:09  vstinner  setrecipients  + vstinner, lemburg, terry.reedy, ezio.melotti, cdqzzy
2011-05-11 18:37:56  vstinner  link           issue12057 messages
2011-05-11 18:37:56  vstinner  create