Issue 10552: Tools/unicode/gencodec.py error


This issue tracker has been migrated to GitHub, and is currently read-only.
For more information, see the GitHub FAQs in Python's Developer Guide.

This issue has been migrated to GitHub: https://github.com/python/cpython/issues/54761

classification

Title:	Tools/unicode/gencodec.py error
Type:	behavior	Stage:	needs patch
Components:	Demos and Tools, macOS, Unicode	Versions:	Python 3.2, Python 3.4

process

Status:	open	Resolution:
Dependencies:		Superseder:
Assigned To:		Nosy List:	belopolsky, ezio.melotti, hynek, iritkatriel, lemburg, loewis, martin.panter, ned.deily, ronaldoussoren
Priority:	low	Keywords:	patch

Created on 2010-11-27 20:29 by belopolsky, last changed 2022-04-11 14:57 by admin.

Files
File name	Uploaded	Description	Edit
issue10552.diff	belopolsky, 2010-11-27 21:15		review
issue10552a.diff	belopolsky, 2010-11-29 18:36		review
10552-remove-apple-files.txt	akuchling, 2013-11-10 18:24	Remove problematic mapping files before parsing
10552-remove-apple-files-v2.txt	martin.panter, 2015-01-13 05:57		review

Messages (15)
msg122549 - (view)	Author: Alexander Belopolsky (belopolsky) *	Date: 2010-11-27 20:29
$ ../../python.exe gencodec.py MAPPINGS/VENDORS/MISC/ build/
converting APL-ISO-IR-68.TXT to build/apl_iso_ir_68.py and build/apl_iso_ir_68.mapping
converting ATARIST.TXT to build/atarist.py and build/atarist.mapping
converting CP1006.TXT to build/cp1006.py and build/cp1006.mapping
converting CP424.TXT to build/cp424.py and build/cp424.mapping
Traceback (most recent call last):
  File "gencodec.py", line 421, in <module>
    convertdir(*sys.argv[1:])
  File "gencodec.py", line 391, in convertdir
    pymap(mappathname, map, dirprefix + codefile,name,comments)
  File "gencodec.py", line 355, in pymap
    code = codegen(name,map,encodingname,comments)
  File "gencodec.py", line 268, in codegen
    precisions=(4, 2))
  File "gencodec.py", line 152, in python_mapdef_code
    mappings = sorted(map.items())
TypeError: unorderable types: NoneType() < int()

It does appear to have been updated for 3.x:

$ python2.7 gencodec.py MAPPINGS/VENDORS/MISC/ build/
Traceback (most recent call last):
  File "gencodec.py", line 35, in <module>
    UNI_UNDEFINED = chr(0xFFFE)
ValueError: chr() arg not in range(256)
msg122559 - (view)	Author: Alexander Belopolsky (belopolsky) *	Date: 2010-11-27 21:15
Attached patch addresses the issue by using -1 instead of None for missing codes. Comparison of generated encoding files to those in Lib/encodings shows only whitespace changes except one which appears to be a change on the unicode.org side:

diff -b build/koi8_u.py ../../Lib/encodings/koi8_u.py
1c1
< """ Python Character Mapping Codec koi8_u generated from 'MAPPINGS/VENDORS/MISC/KOI8-U.TXT' with gencodec.py.
---
> """ Python Character Mapping Codec koi8_u generated from 'python-mappings/KOI8-U.TXT' with gencodec.py.
221c221
< '\u0491' # 0xAD -> CYRILLIC SMALL LETTER GHE WITH UPTURN
---
> '\u0491' # 0xAD -> CYRILLIC SMALL LETTER UKRAINIAN GHE WITH UPTURN
237c237
< '\u0490' # 0xBD -> CYRILLIC CAPITAL LETTER GHE WITH UPTURN
---
> '\u0490' # 0xBD -> CYRILLIC CAPITAL LETTER UKRAINIAN GHE WITH UPTURN
308d307
<
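A minimal sketch of the failure and of the sentinel approach described in this message, assuming a toy list of code points rather than the real map built by gencodec.py. The helper names sort_codes and replace_missing are hypothetical; only the idea of substituting -1 for None comes from the message.

```python
# Python 3 refuses to order None against int, which is what sorted() hit
# in the traceback. Replacing None with an integer placeholder keeps every
# element comparable. The data and helper names here are illustrative only.

def sort_codes(codes):
    """Sort a list of code points; raises TypeError if None is mixed in."""
    return sorted(codes)

def replace_missing(codes, placeholder=-1):
    """Return a copy of codes with None swapped for an integer placeholder."""
    return [placeholder if code is None else code for code in codes]

codes_with_none = [0x0041, None, 0x0042]

try:
    sort_codes(codes_with_none)
    none_sort_failed = False
except TypeError:
    none_sort_failed = True

# With the placeholder in place, the list sorts without error.
fixed = sort_codes(replace_missing(codes_with_none))
```

The placeholder has to be a value that can never be a real code point, which is why a negative integer is a natural choice.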
msg122565 - (view)	Author: Marc-Andre Lemburg (lemburg) *	Date: 2010-11-27 22:09
Alexander Belopolsky wrote:
>
> Alexander Belopolsky <belopolsky@users.sourceforge.net> added the comment:
>
> Attached patch addresses the issue by using -1 instead of None for missing codes. Comparison of generated encoding files to those in Lib/encodings shows only whitespace changes except one which appears to be a change on the unicode.org side:

Please use a global constant instead of the literal -1, e.g. MISSING_CODE. Thanks.

> diff -b build/koi8_u.py ../../Lib/encodings/koi8_u.py
> 1c1
> < """ Python Character Mapping Codec koi8_u generated from 'MAPPINGS/VENDORS/MISC/KOI8-U.TXT' with gencodec.py.
> ---
>> """ Python Character Mapping Codec koi8_u generated from 'python-mappings/KOI8-U.TXT' with gencodec.py.
> 221c221
> < '\u0491' # 0xAD -> CYRILLIC SMALL LETTER GHE WITH UPTURN
> ---
>> '\u0491' # 0xAD -> CYRILLIC SMALL LETTER UKRAINIAN GHE WITH UPTURN
> 237c237
> < '\u0490' # 0xBD -> CYRILLIC CAPITAL LETTER GHE WITH UPTURN
> ---
>> '\u0490' # 0xBD -> CYRILLIC CAPITAL LETTER UKRAINIAN GHE WITH UPTURN
> 308d307
> <

That's just a comment and doesn't change the semantics of the codec.
msg122585 - (view)	Author: Alexander Belopolsky (belopolsky) *	Date: 2010-11-27 23:02
Attached patch uses MISSING_CODE as Marc suggested. There are still errors, apparently because parsecodes() may return either an int or a tuple. I think only mac encodings are affected, so I would like to commit the current patch before tackling this issue.

$ ../../python.exe gencodec.py MAPPINGS/VENDORS/APPLE/ build/ mac_
converting ARABIC.TXT to build/mac_arabic.py and build/mac_arabic.mapping
converting CELTIC.TXT to build/mac_celtic.py and build/mac_celtic.mapping
converting CENTEURO.TXT to build/mac_centeuro.py and build/mac_centeuro.mapping
converting CHINSIMP.TXT to build/mac_chinsimp.py and build/mac_chinsimp.mapping
Traceback (most recent call last):
  File "gencodec.py", line 424, in <module>
    convertdir(*sys.argv[1:])
  File "gencodec.py", line 394, in convertdir
    pymap(mappathname, map, dirprefix + codefile,name,comments)
  File "gencodec.py", line 358, in pymap
    code = codegen(name,map,encodingname,comments)
  File "gencodec.py", line 271, in codegen
    precisions=(4, 2))
  File "gencodec.py", line 155, in python_mapdef_code
    mappings = sorted(map.items())
TypeError: unorderable types: tuple() < int()
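The second error has the same shape as the first: a list that mixes ints and tuples cannot be sorted on Python 3. The sketch below reproduces that on toy data and shows one possible workaround, a sort key that wraps bare ints in one-element tuples. The as_tuple helper is hypothetical; it is not part of gencodec.py or of any patch attached to this issue.

```python
# Illustrative data only: two single code points and one code sequence,
# standing in for the int-or-tuple values the message attributes to parsecodes().

def as_tuple(code):
    """Normalize a code to a tuple so ints and sequences compare uniformly."""
    return code if isinstance(code, tuple) else (code,)

codes = [0x00A9, (0x0041, 0x0300), 0x0020]

try:
    sorted(codes)
    mixed_sort_failed = False
except TypeError:
    mixed_sort_failed = True

# Sorting by the normalized key leaves the original values untouched.
ordered = sorted(codes, key=as_tuple)
```

Using a key function rather than rewriting the values means the stored codes keep their original types, so any downstream code that distinguishes ints from tuples is unaffected.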
msg122586 - (view)	Author: Alexander Belopolsky (belopolsky) *	Date: 2010-11-27 23:03
Please ignore Makefile changes in the patch.
msg122829 - (view)	Author: Alexander Belopolsky (belopolsky) *	Date: 2010-11-29 16:57
Martin, I believe you were the last to update the unicode database. (See r85371.) Did you use python2.x to generate it, or do you have your own private copy of these tools? I noticed that genwincodecs.bat refers to c:\python26\python in the 2.7 branch and c:\python30\python in py3k. Could this be an indication that these tools are out of date? What is the plan for maintaining these tools? Should fixes be done in 2.7 and 3.x be generated by 2to3? Or should fixes go to py3k and be backported to 2.7 when they don't add new features?
msg122837 - (view)	Author: Marc-Andre Lemburg (lemburg) *	Date: 2010-11-29 18:21
gencodec.py is only rarely used, namely when adding new codecs based on Unicode mapping files. It is not run regularly on the files from ftp.unicode.org and only updated on demand. AFAIK, it was last used on Python2 and never on Python3, hence the errors you find with it.

BTW: You appear to have a comma appended to the constant, that doesn't belong there:

+# Placeholder for a missing codepoint
+MISSING_CODE = -1,
+

Perhaps that's causing the second error you are seeing.
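For reference, a short sketch of what the stray comma does in Python: it turns the right-hand side into a one-element tuple instead of an int. The variable names are illustrative.

```python
# A trailing comma builds a tuple, even without parentheses.
with_comma = -1,
without_comma = -1

comma_is_tuple = isinstance(with_comma, tuple)
plain_is_int = isinstance(without_comma, int)
```

A tuple-valued placeholder would compare differently from integer code points, so the comma is worth ruling out whenever a "tuple() < int()" error shows up.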
msg122842 - (view)	Author: Alexander Belopolsky (belopolsky) *	Date: 2010-11-29 18:36
On Mon, Nov 29, 2010 at 1:21 PM, Marc-Andre Lemburg <report@bugs.python.org> wrote:
..
> BTW: You appear to have a comma appended to the constant, that doesn't
> belong there:
>
> +# Placeholder for a missing codepoint
> +MISSING_CODE = -1,
> +
>
> Perhaps that's causing the second error you are seeing.

No, that comma was a left-over from the attempt to fix the mac_chinsimp error. The trace that I reported was generated with MISSING_CODE = -1. I am replacing the patch.

Is it ok to commit a partial fix? It may take longer to fix the mac error.
msg122843 - (view)	Author: Marc-Andre Lemburg (lemburg) *	Date: 2010-11-29 18:37
Alexander Belopolsky wrote:
>
> Alexander Belopolsky <belopolsky@users.sourceforge.net> added the comment:
>
> On Mon, Nov 29, 2010 at 1:21 PM, Marc-Andre Lemburg
> <report@bugs.python.org> wrote:
> ..
>> BTW: You appear to have a comma appended to the constant, that doesn't
>> belong there:
>>
>> +# Placeholder for a missing codepoint
>> +MISSING_CODE = -1,
>> +
>>
>> Perhaps that's causing the second error you are seeing.
>
> No, that comma was a left-over from the attempt to fix the
> mac_chinsimp error. The trace that I reported was generated with
> MISSING_CODE = -1. I am replacing the patch.
>
> Is it ok to commit a partial fix? It may take longer to fix the mac error.

Sure, we won't need that script anytime soon and if we do, we can just as well use the Python2 version.
msg122850 - (view)	Author: Alexander Belopolsky (belopolsky) *	Date: 2010-11-29 18:52
On Mon, Nov 29, 2010 at 1:38 PM, Marc-Andre Lemburg <report@bugs.python.org> wrote:
..
> Sure, we won't need that script anytime soon and if we do, we
> can just as well use the Python2 version.

That may not be true. I compared 2.7 and py3k versions and the latter has some new features:

* unidata_version changed from 5.2.0 to 6.0.0
* Unihan data is read from zip file
* added processing of DerivedCoreProperties

These changes don't affect gencodec.py, but it may be inconvenient to run makeunicodedata.py and gencodec.py using different versions of Python.

I'll check that all non-mac encodings are correctly generated before committing.
msg122858 - (view)	Author: Martin v. Löwis (loewis) *	Date: 2010-11-29 19:48
> These changes don't affect gencodec.py, but it may be inconvenient to
> run makeunicodedata.py and gencodec.py using different versions of
> Python.

As MAL explains: these are completely unrelated, independent tools, and gencodec isn't run more than once per decade (or so). I only ever run makeunicodedata, and I have been using Python 3 to run it.

The mappings are not supposed to ever change once produced. In particular, new versions of Unicode cannot affect them, since the existing characters all map fine to existing code points, which will not change their meaning per Unicode stability criteria.
msg122916 - (view)	Author: Alexander Belopolsky (belopolsky) *	Date: 2010-11-30 16:57
Committed in revision 86891. Keeping open to address Mac issue.
msg202543 - (view)	Author: A.M. Kuchling (akuchling) *	Date: 2013-11-10 18:24
For the Mac issue, we could just delete the mapping files before processing them. I've attached a patch that modifies the Makefile.
msg233902 - (view)	Author: Martin Panter (martin.panter) *	Date: 2015-01-13 05:57
Here is a new version of Kuchling’s patch. I restored some mapping files which do not give any errors (including the mac_turkish codec, which is actually documented), and removed both readme files.
msg406955 - (view)	Author: Irit Katriel (iritkatriel) *	Date: 2021-11-24 19:59
I don't think Martin's patch has been applied. Is it needed?

History
Date	User	Action	Args
2022-04-11 14:57:09	admin	set	github: 54761
2021-11-24 21:35:33	vstinner	set	nosy: - vstinner
2021-11-24 19:59:51	iritkatriel	set	nosy: + iritkatriel messages: + msg406955
2015-01-13 05:57:30	martin.panter	set	files: + 10552-remove-apple-files-v2.txt versions: + Python 3.4 nosy: + martin.panter, vstinner messages: + msg233902 components: + Unicode
2014-12-31 16:22:37	akuchling	set	nosy: - akuchling
2014-06-29 23:08:51	belopolsky	set	nosy: + ronaldoussoren, ned.deily, hynek
2014-06-29 23:07:44	belopolsky	set	assignee: belopolsky ->
2013-11-10 18:24:50	akuchling	set	files: + 10552-remove-apple-files.txt nosy: + akuchling messages: + msg202543
2010-12-30 22:14:16	georg.brandl	unlink	issue7962 dependencies
2010-11-30 16:57:48	belopolsky	set	nosy: lemburg, loewis, belopolsky, ezio.melotti messages: + msg122916 priority: normal -> low assignee: belopolsky components: + macOS stage: commit review -> needs patch
2010-11-29 20:22:31	belopolsky	unlink	issue10575 dependencies
2010-11-29 19:48:38	loewis	set	messages: + msg122858
2010-11-29 18:52:32	belopolsky	set	messages: + msg122850
2010-11-29 18:37:58	lemburg	set	messages: + msg122843
2010-11-29 18:36:58	belopolsky	set	files: - issue10552a.diff
2010-11-29 18:36:46	belopolsky	set	files: + issue10552a.diff messages: + msg122842
2010-11-29 18:21:55	lemburg	set	messages: + msg122837
2010-11-29 16:57:45	belopolsky	set	messages: + msg122829
2010-11-29 16:45:33	belopolsky	link	issue10575 dependencies
2010-11-27 23:03:04	belopolsky	set	messages: + msg122586
2010-11-27 23:02:25	belopolsky	set	files: + issue10552a.diff messages: + msg122585 stage: commit review
2010-11-27 22:16:02	ezio.melotti	set	nosy: + ezio.melotti
2010-11-27 22:09:48	lemburg	set	messages: + msg122565
2010-11-27 21:15:09	belopolsky	set	files: + issue10552.diff nosy: + loewis messages: + msg122559 keywords: + patch
2010-11-27 20:31:17	belopolsky	link	issue7962 dependencies
2010-11-27 20:29:09	belopolsky	create