This issue tracker has been migrated to GitHub, and is currently read-only.
For more information, see the GitHub FAQs in Python's Developer Guide.

classification
Title: Tools/unicode/gencodec.py error
Type: behavior Stage: needs patch
Components: Demos and Tools, macOS, Unicode Versions: Python 3.2, Python 3.4
process
Status: open Resolution:
Dependencies: Superseder:
Assigned To: Nosy List: belopolsky, ezio.melotti, hynek, iritkatriel, lemburg, loewis, martin.panter, ned.deily, ronaldoussoren
Priority: low Keywords: patch

Created on 2010-11-27 20:29 by belopolsky, last changed 2022-04-11 14:57 by admin.

Files
File name Uploaded Description
issue10552.diff belopolsky, 2010-11-27 21:15
issue10552a.diff belopolsky, 2010-11-29 18:36
10552-remove-apple-files.txt akuchling, 2013-11-10 18:24 Remove problematic mapping files before parsing
10552-remove-apple-files-v2.txt martin.panter, 2015-01-13 05:57
Messages (15)
msg122549 - Author: Alexander Belopolsky (belopolsky) (Python committer) Date: 2010-11-27 20:29
$ ../../python.exe gencodec.py MAPPINGS/VENDORS/MISC/ build/
converting APL-ISO-IR-68.TXT to build/apl_iso_ir_68.py and build/apl_iso_ir_68.mapping
converting ATARIST.TXT to build/atarist.py and build/atarist.mapping
converting CP1006.TXT to build/cp1006.py and build/cp1006.mapping
converting CP424.TXT to build/cp424.py and build/cp424.mapping
Traceback (most recent call last):
  File "gencodec.py", line 421, in <module>
    convertdir(*sys.argv[1:])
  File "gencodec.py", line 391, in convertdir
    pymap(mappathname, map, dirprefix + codefile,name,comments)
  File "gencodec.py", line 355, in pymap
    code = codegen(name,map,encodingname,comments)
  File "gencodec.py", line 268, in codegen
    precisions=(4, 2))
  File "gencodec.py", line 152, in python_mapdef_code
    mappings = sorted(map.items())
TypeError: unorderable types: NoneType() < int()

The script does appear to have been updated for 3.x, since it fails under 2.7:

$ python2.7 gencodec.py MAPPINGS/VENDORS/MISC/ build/
Traceback (most recent call last):
  File "gencodec.py", line 35, in <module>
    UNI_UNDEFINED = chr(0xFFFE)
ValueError: chr() arg not in range(256)
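
The 2.7 failure is consistent with the script targeting 3.x: in Python 3, chr() accepts any code point up to 0x10FFFF, whereas Python 2's chr() is limited to range(256) (unichr() was the Python 2 equivalent). A quick check, meant to be run under Python 3 only:

```python
# Valid in Python 3; the same call raises ValueError under Python 2.
UNI_UNDEFINED = chr(0xFFFE)

# The result is a single-character string holding that code point.
assert len(UNI_UNDEFINED) == 1
assert ord(UNI_UNDEFINED) == 0xFFFE
```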
msg122559 - Author: Alexander Belopolsky (belopolsky) (Python committer) Date: 2010-11-27 21:15
Attached patch addresses the issue by using -1 instead of None for missing codes.  Comparison of generated encoding files to those in Lib/encodings shows only whitespace changes except one which appears to be a change on the unicode.org side:


diff -b build/koi8_u.py ../../Lib/encodings/koi8_u.py
1c1
< """ Python Character Mapping Codec koi8_u generated from 'MAPPINGS/VENDORS/MISC/KOI8-U.TXT' with gencodec.py.
---
> """ Python Character Mapping Codec koi8_u generated from 'python-mappings/KOI8-U.TXT' with gencodec.py.
221c221
<     '\u0491'    #  0xAD -> CYRILLIC SMALL LETTER GHE WITH UPTURN
---
>     '\u0491'   #  0xAD -> CYRILLIC SMALL LETTER UKRAINIAN GHE WITH UPTURN
237c237
<     '\u0490'    #  0xBD -> CYRILLIC CAPITAL LETTER GHE WITH UPTURN
---
>     '\u0490'   #  0xBD -> CYRILLIC CAPITAL LETTER UKRAINIAN GHE WITH UPTURN
308d307
<
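
A minimal sketch of the original failure and of the -1 workaround, assuming the dictionary being sorted ends up with None among its keys for missing codes. The dictionaries below are made up for illustration and are not taken from gencodec.py:

```python
# Hypothetical map contents; the real data comes from the mapping files.
with_none = {0x41: 0x41, None: 0x80, 0x42: 0x42}

try:
    sorted(with_none.items())
    failed = False
except TypeError:
    # Python 3 refuses to order None against int.
    failed = True

# Using -1 for the missing code keeps every key an int, so sorting works.
with_sentinel = {0x41: 0x41, -1: 0x80, 0x42: 0x42}
ordered = sorted(with_sentinel.items())

print(failed, ordered)
```

Python 2 ordered None before any integer, which is presumably why the same sort did not fail there.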
msg122565 - Author: Marc-Andre Lemburg (lemburg) (Python committer) Date: 2010-11-27 22:09
Alexander Belopolsky wrote:
> 
> Alexander Belopolsky <belopolsky@users.sourceforge.net> added the comment:
> 
> Attached patch addresses the issue by using -1 instead of None for missing codes.  Comparison of generated encoding files to those in Lib/encodings shows only whitespace changes except one which appears to be a change on the unicode.org side:

Please use a global constant instead of the literal -1, e.g. MISSING_CODE.
Thanks.

> diff -b build/koi8_u.py ../../Lib/encodings/koi8_u.py
> 1c1
> < """ Python Character Mapping Codec koi8_u generated from 'MAPPINGS/VENDORS/MISC/KOI8-U.TXT' with gencodec.py.
> ---
> > """ Python Character Mapping Codec koi8_u generated from 'python-mappings/KOI8-U.TXT' with gencodec.py.
> 221c221
> <     '\u0491'    #  0xAD -> CYRILLIC SMALL LETTER GHE WITH UPTURN
> ---
> >     '\u0491'   #  0xAD -> CYRILLIC SMALL LETTER UKRAINIAN GHE WITH UPTURN
> 237c237
> <     '\u0490'    #  0xBD -> CYRILLIC CAPITAL LETTER GHE WITH UPTURN
> ---
> >     '\u0490'   #  0xBD -> CYRILLIC CAPITAL LETTER UKRAINIAN GHE WITH UPTURN
> 308d307
> <

That's just a comment and doesn't change the semantics of the codec.
msg122585 - Author: Alexander Belopolsky (belopolsky) (Python committer) Date: 2010-11-27 23:02
Attached patch uses MISSING_CODE as Marc suggested.  There are still errors, apparently because parsecodes() may return either an int or a tuple.  I think only Mac encodings are affected, so I would like to commit the current patch before tackling this issue.

$ ../../python.exe  gencodec.py MAPPINGS/VENDORS/APPLE/ build/ mac_
converting ARABIC.TXT to build/mac_arabic.py and build/mac_arabic.mapping
converting CELTIC.TXT to build/mac_celtic.py and build/mac_celtic.mapping
converting CENTEURO.TXT to build/mac_centeuro.py and build/mac_centeuro.mapping
converting CHINSIMP.TXT to build/mac_chinsimp.py and build/mac_chinsimp.mapping
Traceback (most recent call last):
  File "gencodec.py", line 424, in <module>
    convertdir(*sys.argv[1:])
  File "gencodec.py", line 394, in convertdir
    pymap(mappathname, map, dirprefix + codefile,name,comments)
  File "gencodec.py", line 358, in pymap
    code = codegen(name,map,encodingname,comments)
  File "gencodec.py", line 271, in codegen
    precisions=(4, 2))
  File "gencodec.py", line 155, in python_mapdef_code
    mappings = sorted(map.items())
TypeError: unorderable types: tuple() < int()
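
The second failure can be reproduced in isolation. This is only a sketch, assuming some keys are tuples of code points while others are plain ints, as the remark about parsecodes() suggests. The `sort_key` helper is hypothetical and is not part of any attached patch:

```python
# Hypothetical keys: single code points and one multi-code-point sequence.
mixed = {0x41: "A", (0x42, 0x301): "B + combining acute", 0x43: "C"}


def sort_key(item):
    """Normalize int keys to 1-tuples so that all keys compare as tuples."""
    key = item[0]
    return key if isinstance(key, tuple) else (key,)


try:
    sorted(mixed.items())
    failed = False
except TypeError:
    # Python 3 cannot order a tuple against an int.
    failed = True

# With a normalizing key function the same items sort without error.
ordered = sorted(mixed.items(), key=sort_key)
print(failed, [key for key, _ in ordered])
```

Whether such an ordering is appropriate for the generated codec source is a separate question; the sketch only shows one way to make the sort itself well defined.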
msg122586 - Author: Alexander Belopolsky (belopolsky) (Python committer) Date: 2010-11-27 23:03
Please ignore Makefile changes in the patch.
msg122829 - Author: Alexander Belopolsky (belopolsky) (Python committer) Date: 2010-11-29 16:57
Martin,

I believe you were the last to update the unicode database. (See r85371.)  Did you use python2.x to generate it, or do you have your own private copy of these tools?

I noticed that genwincodecs.bat refers to c:\python26\python in 2.7 branch and c:\python30\python in py3k.  Could this be an indication that these tools are out of date?

What is the plan for maintaining these tools?  Should fixes be done in 2.7 and the 3.x version be generated by 2to3? Or should fixes go to py3k and be backported to 2.7 when they don't add new features?
msg122837 - Author: Marc-Andre Lemburg (lemburg) (Python committer) Date: 2010-11-29 18:21
gencodec.py is only rarely used, namely when adding new codecs based
on Unicode mapping files.

It is not run regularly on the files from ftp.unicode.org and only
updated on demand.

AFAIK, it was last used on Python2 and never on Python3, hence the
errors you find with it.

BTW: You appear to have a comma appended to the constant, that doesn't
belong there:

+# Placeholder for a missing codepoint
+MISSING_CODE = -1,
+

Perhaps that's causing the second error you are seeing.
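
For reference, a trailing comma after a value makes Python build a one-element tuple, so the constant as written in the quoted patch would not be an int. A minimal illustration (`FIXED_CODE` is a name introduced here only for comparison):

```python
MISSING_CODE = -1,   # trailing comma: this is the tuple (-1,)
FIXED_CODE = -1      # without the comma: the int -1

print(type(MISSING_CODE).__name__, type(FIXED_CODE).__name__)
```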
msg122842 - Author: Alexander Belopolsky (belopolsky) (Python committer) Date: 2010-11-29 18:36
On Mon, Nov 29, 2010 at 1:21 PM, Marc-Andre Lemburg
<report@bugs.python.org> wrote:
..
> BTW: You appear to have a comma appended to the constant, that doesn't
> belong there:
>
> +# Placeholder for a missing codepoint
> +MISSING_CODE = -1,
> +
>
> Perhaps that's causing the second error you are seeing.

No, that comma was a left-over from the attempt to fix the
mac_chinsimp error.  The trace that I reported was generated with
MISSING_CODE = -1.   I am replacing the patch.

Is it ok to commit a partial fix?  It may take longer to fix the mac error.
msg122843 - Author: Marc-Andre Lemburg (lemburg) (Python committer) Date: 2010-11-29 18:37
Alexander Belopolsky wrote:
> 
> Alexander Belopolsky <belopolsky@users.sourceforge.net> added the comment:
> 
> On Mon, Nov 29, 2010 at 1:21 PM, Marc-Andre Lemburg
> <report@bugs.python.org> wrote:
> ..
>> BTW: You appear to have a comma appended to the constant, that doesn't
>> belong there:
>>
>> +# Placeholder for a missing codepoint
>> +MISSING_CODE = -1,
>> +
>>
>> Perhaps that's causing the second error you are seeing.
> 
> No, that comma was a left-over from the attempt to fix the
> mac_chinsimp error.  The trace that I reported was generated with
> MISSING_CODE = -1.   I am replacing the patch.
> 
> Is it ok to commit a partial fix?  It may take longer to fix the mac error.

Sure, we won't need that script anytime soon and if we do, we
can just as well use the Python2 version.
msg122850 - Author: Alexander Belopolsky (belopolsky) (Python committer) Date: 2010-11-29 18:52
On Mon, Nov 29, 2010 at 1:38 PM, Marc-Andre Lemburg
<report@bugs.python.org> wrote:
..
> Sure, we won't need that script anytime soon and if we do, we
> can just as well use the Python2 version.

That may not be true.  I compared 2.7 and py3k versions and the latter
has some new features:

* unidata_version changed from 5.2.0 to 6.0.0
* Unihan data is read from zip file
* added processing of DerivedCoreProperties

These changes don't affect gencodec.py, but it may be inconvenient to
run makeunicodedata.py and gencodec.py using different versions of
Python.

I'll check that all non-mac encodings are correctly generated before committing.
msg122858 - Author: Martin v. Löwis (loewis) (Python committer) Date: 2010-11-29 19:48
> These changes don't affect gencodec.py, but it may be inconvenient to
> run makeunicodedata.py and gencodec.py using different versions of
> Python.

As MAL explains: these are completely unrelated, independent tools,
and gencodec isn't run more than once per decade (or so). I only ever
run makeunicodedata, and I have been using Python 3 to run it.

The mappings are not supposed to ever change once produced. In
particular, new versions of Unicode cannot affect them, since the
existing characters all map fine to existing code points, which will
not change their meaning per Unicode stability criteria.
msg122916 - Author: Alexander Belopolsky (belopolsky) (Python committer) Date: 2010-11-30 16:57
Committed in revision 86891.  Keeping open to address Mac issue.
msg202543 - Author: A.M. Kuchling (akuchling) (Python committer) Date: 2013-11-10 18:24
For the Mac issue, we could just delete the mapping files before processing them.  I've attached a patch that modifies the Makefile.
msg233902 - Author: Martin Panter (martin.panter) (Python committer) Date: 2015-01-13 05:57
Here is a new version of Kuchling’s patch. I restored some mapping files which do not give any errors (including the mac_turkish codec, which is actually documented), and removed both readme files.
msg406955 - Author: Irit Katriel (iritkatriel) (Python committer) Date: 2021-11-24 19:59
I don't think Martin's patch has been applied. Is it needed?
History
Date User Action Args
2022-04-11 14:57:09 admin set github: 54761
2021-11-24 21:35:33 vstinner set nosy: - vstinner
2021-11-24 19:59:51 iritkatriel set nosy: + iritkatriel; messages: + msg406955
2015-01-13 05:57:30 martin.panter set files: + 10552-remove-apple-files-v2.txt; versions: + Python 3.4; nosy: + martin.panter, vstinner; messages: + msg233902; components: + Unicode
2014-12-31 16:22:37 akuchling set nosy: - akuchling
2014-06-29 23:08:51 belopolsky set nosy: + ronaldoussoren, ned.deily, hynek
2014-06-29 23:07:44 belopolsky set assignee: belopolsky ->
2013-11-10 18:24:50 akuchling set files: + 10552-remove-apple-files.txt; nosy: + akuchling; messages: + msg202543
2010-12-30 22:14:16 georg.brandl unlink issue7962 dependencies
2010-11-30 16:57:48 belopolsky set nosy: lemburg, loewis, belopolsky, ezio.melotti; messages: + msg122916; priority: normal -> low; assignee: belopolsky; components: + macOS; stage: commit review -> needs patch
2010-11-29 20:22:31 belopolsky unlink issue10575 dependencies
2010-11-29 19:48:38 loewis set messages: + msg122858
2010-11-29 18:52:32 belopolsky set messages: + msg122850
2010-11-29 18:37:58 lemburg set messages: + msg122843
2010-11-29 18:36:58 belopolsky set files: - issue10552a.diff
2010-11-29 18:36:46 belopolsky set files: + issue10552a.diff; messages: + msg122842
2010-11-29 18:21:55 lemburg set messages: + msg122837
2010-11-29 16:57:45 belopolsky set messages: + msg122829
2010-11-29 16:45:33 belopolsky link issue10575 dependencies
2010-11-27 23:03:04 belopolsky set messages: + msg122586
2010-11-27 23:02:25 belopolsky set files: + issue10552a.diff; messages: + msg122585; stage: commit review
2010-11-27 22:16:02 ezio.melotti set nosy: + ezio.melotti
2010-11-27 22:09:48 lemburg set messages: + msg122565
2010-11-27 21:15:09 belopolsky set files: + issue10552.diff; nosy: + loewis; messages: + msg122559; keywords: + patch
2010-11-27 20:31:17 belopolsky link issue7962 dependencies
2010-11-27 20:29:09 belopolsky create