
classification
Title: iso6937 encoding missing
Type: enhancement
Stage: resolved
Components: Library (Lib)
Versions: Python 3.7
process
Status: closed
Resolution: postponed
Dependencies:
Superseder:
Assigned To: serhiy.storchaka
Nosy List: John Helour, koffie, lemburg, loewis, mdk, serhiy.storchaka, vstinner, xiang.zhang
Priority: low
Keywords:

Created on 2015-05-31 13:20 by John Helour, last changed 2022-04-11 14:58 by admin. This issue is now closed.

Files
File name | Uploaded | Description
iso6937.py | John Helour, 2015-06-05 18:10 | New codec for the iso6937 encoding (Python 3 version), full charset
check_iso6937.py | mdk, 2016-11-13 22:11 |
iso6937.py | John Helour, 2016-11-26 15:43 | Newer codec for the iso6937 encoding, PEP8 compliant, added missing codepoints, utf-8 to \uXXXX rewritten, increased range of the encoding map
check_iso6937.py | John Helour, 2016-12-04 12:56 | Check encoding and decoding
iso6937.py | John Helour, 2016-12-04 12:59 | Newer codec for the iso6937 encoding, performance issue resolved, more info on error added
Messages (32)
msg244538 - (view) Author: John Helour (John Helour) * Date: 2015-05-31 13:20
Please add an encoding for the iso6937 charset. Many set-top boxes (DVB-T/S) and related devices use it for displaying EPG, videotext, etc.

I wrote the encoding/decoding conversion codec some years ago (please look at the attached file).
msg244540 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2015-05-31 14:31
A new encoding can only be added in a new Python release (3.6).
msg244576 - (view) Author: John Helour (John Helour) * Date: 2015-06-01 11:20
I've rewritten the iso6937 codec for Python 3.

Could someone check it please?
msg280720 - (view) Author: Julien Palard (mdk) * (Python committer) Date: 2016-11-13 22:11
Hi John, thanks for your contribution,

Looks like your implementation is missing some codepoints, like "\t":

    >>> print("\t".encode(encoding='iso6937'))                                                                                     
    [...]
    UnicodeError: encoding with 'iso6937' codec failed (UnicodeError: Unacceptable utf-8 character)

Probably due to the "range(0x20, "…, why `0x20`?

You also have problems decoding multibyte sequences, as you have no `else: … result += chr(c[0])` branch in this case. So typically decoding `\xc2\x20` will raise a `KeyError`, as `\x20` is _not_ in your decoding table.

Also, please make your contribution conform to PEP 8: you're missing spaces after commas and you sometimes indent with 8 spaces instead of 4.

I implemented a simple checker based on glibc localedata. It clearly shows your decoding problems step by step, and should be easy to extend to check your encoding function too; see the attachment. It uses the ISO6937 file typically found in the Debian locales package or in an 'apt-get source glibc'.
msg280741 - (view) Author: Marc-Andre Lemburg (lemburg) * (Python committer) Date: 2016-11-14 08:53
Another comment about coding style: please use \uXXXX hex code representations for the decoding map. The stdlib source code is normally kept ASCII compatible and, for codecs, the Unicode code point numbers make it easier to check the mappings for correctness.

Thanks.

PS: You will also have to sign a contributor agreement: https://www.python.org/psf/contrib/
msg280759 - (view) Author: Marc-Andre Lemburg (lemburg) * (Python committer) Date: 2016-11-14 11:54
Just as reference, here's the wikipedia page for the encoding:

https://en.wikipedia.org/wiki/ISO/IEC_6937

and this is the ISO document (as preview):

http://webstore.iec.ch/preview/info_isoiec6937%7Bed3.0%7Den.pdf

(from the German wikipedia page).
msg280761 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2016-11-14 12:03
iso6937.py:

> # from utf-8 to iso6937
> def iso6937_encode(input,errors,encoding_map):

Wait, is this code for Python 3? Decode from UTF-8 and encode to ISO-6937 in the same function seems strange to me.

I expected that the codec only implements two functions: encode text (unicode) to ISO-6937 (bytes), decode bytes from ISO-6937 to text.

Since the encoding is non trivial (multibyte), if we decide to add it, I suggest to require unit tests. I would like to see unit tests on multibyte strings, to check how the error handler is handled.

--

In general, I would prefer to not embed too many codecs in Python, it has a little cost to maintain these codecs.

My rule is more to only add encodings used (in practice) as locale encodings.
msg280765 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2016-11-14 12:27
I think the encoder can just use codecs.charmap_encode(). It seems the decoder could be simpler too.
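As a rough illustration of what the charmap helpers do for a plain single-byte encoding, here is a minimal sketch. The table is a toy one (identity for all 256 bytes except one arbitrary slot) and is not the real ISO 6937 table; `decoding_table` and `encoding_table` are names chosen for this example.

```python
import codecs

# Toy 256-entry decoding table: identity for every byte except 0xA4,
# which is mapped to U+20AC purely for demonstration purposes.
# This is NOT the ISO 6937 table.
identity = "".join(chr(i) for i in range(256))
decoding_table = identity[:0xA4] + "\u20ac" + identity[0xA5:]

# charmap_build() derives the reverse (encoding) map from the table.
encoding_table = codecs.charmap_build(decoding_table)

# Both helpers return a (result, length_consumed) tuple.
encoded, consumed = codecs.charmap_encode("a\u20ac", "strict", encoding_table)
decoded, used = codecs.charmap_decode(encoded, "strict", decoding_table)

print(encoded, consumed)
print(decoded == "a\u20ac", used)
```

Each input byte is looked up on its own here, which is why a codec with two-byte sequences needs additional logic on the decoding side.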

Would be nice to generate the ISO 6937 encoding file from external data (e.g. from glibc localedata) like they are generated for other encodings. Take Tools/unicode/ files as a pattern.

Tests are required.

A number of lists of encodings should be updated: Doc/library/codecs.rst, Lib/encodings/aliases.py, Lib/locale.py, Lib/test/test_unicode.py, Lib/test/test_codecs.py, Lib/test/test_xml_etree.py.
msg280770 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2016-11-14 13:05
@Serhiy: Do you think that the encoding is popular enough to pay the
price of its maintenance?

It's already possible to register manually a new encoding in an
application. It was even already possible in Python 2.7 (and older).
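
For reference, registering a codec from application code can look roughly like the sketch below. The codec name `toy6937` and the three functions are hypothetical and only handle ASCII; a real ISO 6937 codec would plug its own encode and decode functions into the same `codecs.CodecInfo` structure.

```python
import codecs

def toy_encode(text, errors="strict"):
    # Hypothetical encoder: ASCII only in this sketch.
    return text.encode("ascii", errors), len(text)

def toy_decode(data, errors="strict"):
    # Hypothetical decoder: ASCII only in this sketch.
    data = bytes(data)
    return data.decode("ascii", errors), len(data)

def toy_search(name):
    # Search functions receive the normalized codec name and return
    # a CodecInfo, or None if they do not know the encoding.
    if name == "toy6937":
        return codecs.CodecInfo(name="toy6937", encode=toy_encode, decode=toy_decode)
    return None

codecs.register(toy_search)

print("EPG".encode("toy6937"))
print(b"EPG".decode("toy6937"))
```

Once the search function is registered, `str.encode()` and `bytes.decode()` accept the new name like any built-in encoding.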
msg280771 - (view) Author: Julien Palard (mdk) * (Python committer) Date: 2016-11-14 13:08
@Serhiy @haypo: Popular enough or not, it may start as a lib on PyPI; we'll see its usage from there.
msg280773 - (view) Author: Marc-Andre Lemburg (lemburg) * (Python committer) Date: 2016-11-14 13:11
On 14.11.2016 13:03, STINNER Victor wrote:
> 
> STINNER Victor added the comment:
> 
> iso6937.py:
> 
>> # from utf-8 to iso6937
>> def iso6937_encode(input,errors,encoding_map):
> 
> Wait, is this code for Python 3? Decode from UTF-8 and encode to ISO-6937 in the same function seems strange to me.

The patch shows the file as UTF-8. In reality, it is decoding from
Unicode strings.

> I expected that the codec only implements two functions: encode text (unicode) to ISO-6937 (bytes), decode bytes from ISO-6937 to text.
> 
> Since the encoding is non trivial (multibyte), if we decide to add it, I suggest to require unit tests. I would like to see unit tests on multibyte strings, to check how the error handler is handled.

+1

> In general, I would prefer to not embed too many codecs in Python, it has a little cost to maintain these codecs.
> 
> My rule is more to only add encodings used (in practice) as locale encodings.

This encoding is used in EPG data of various DVB television
formats. As such it is in active use (even though it is very old).

I think "active use" is a better approach to restricting
ourselves to only locale encodings, since the latter are
slowly converging towards UTF-8 :-)

BTW: Once a charmap style codec is written, there is little
change, so the maintenance is minimal. Codecs which include
more active logic such as this one are different, of course,
and therefore may potentially add more maintenance burden.
msg280779 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2016-11-14 15:02
> My rule is more to only add encodings used (in practice) as locale encodings.

Just for reference: issue19459, issue21081, issue22679, issue20087.

> @Serhiy: Do you think that the encoding is popular enough to pay the
price of its maintenance?

Yes, it seems to me that the encoding is still in use. I found questions about decoding from ISO 6937 and implementations for different programming languages.
msg280783 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2016-11-14 15:39
Ok. I'm now waiting for a simpler patch reusing existing charmap
functions to see the complexity of the codec ;-)
msg281746 - (view) Author: John Helour (John Helour) * Date: 2016-11-25 22:22
PEP8 compliant, added missing codepoints, utf-8 to \uXXXX rewritten
msg281748 - (view) Author: John Helour (John Helour) * Date: 2016-11-25 22:35
@mdk

Big thanks for the checker.

>Looks like your implementation is missing some codepoints, like "\t":
>
>    >>> print("\t".encode(encoding='iso6937'))
>    [...]
>    UnicodeError: encoding with 'iso6937' codec failed (UnicodeError: Unacceptable utf-8 character)

The '\t' character is undefined in the iso6937 table, like all characters in the range 0x00 - 0x1F. I don't know how to handle such input for conversion.
msg281774 - (view) Author: Julien Palard (mdk) * (Python committer) Date: 2016-11-26 13:21
According to https://webstore.iec.ch/preview/info_isoiec6937%7Bed3.0%7Den.pdf:

> NOTE: The shaded positions 00/00 to 01/15 and 07/15 to 09/15 are outside the scope of this International Standard.

So it's clear to me that they are not undefined, they are just described elsewhere.

According to https://en.wikipedia.org/wiki/ISO/IEC_6937:
>ISO/IEC 6937:2001, [...] is a multibyte extension of ASCII

Also, the glibc charmap for ISO_6937 defines them:

$ head -n 20 localedata/charmaps/ISO_6937
<code_set_name> ISO_6937
<comment_char> %
<escape_char> /
% version: 1.0
%  source: ECMA registry and ISO/IEC 6937:1992

% alias ISO-IR-156
% alias ISO_6937:1992
% alias ISO6937
CHARMAP
<U0000>     /x00         NULL (NUL)
<U0001>     /x01         START OF HEADING (SOH)
<U0002>     /x02         START OF TEXT (STX)
<U0003>     /x03         END OF TEXT (ETX)
<U0004>     /x04         END OF TRANSMISSION (EOT)
<U0005>     /x05         ENQUIRY (ENQ)
<U0006>     /x06         ACKNOWLEDGE (ACK)
<U0007>     /x07         BELL (BEL)
<U0008>     /x08         BACKSPACE (BS)
<U0009>     /x09         CHARACTER TABULATION (HT)

Finally, if we're not implementing this range, this means we have _no_ way to encode a new line, which looks highly strange to me, newline being a commonly used character.

But I found _no_ line in the whole ISO/IEC6937 about its ASCII inheritance, I may have just missed it.
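
The charmap lines quoted above could be turned into a decoding table along the lines of the sketch below. `parse_charmap` is a name chosen for this example, and accepting several consecutive `/xHH` escapes for multi-byte entries is an assumption about the file format, since the excerpt only shows single-byte entries.

```python
import re

# Matches lines such as "<U0009>     /x09         CHARACTER TABULATION (HT)".
# One or more "/xHH" groups are accepted, on the ASSUMPTION that multi-byte
# entries are written as consecutive "/xHH" escapes.
_ENTRY = re.compile(r"^<U([0-9A-Fa-f]{4,8})>\s+((?:/x[0-9A-Fa-f]{2})+)")
_HEX = re.compile(r"/x([0-9A-Fa-f]{2})")

def parse_charmap(lines):
    """Return a dict mapping encoded bytes to the decoded character."""
    table = {}
    for line in lines:
        match = _ENTRY.match(line)
        if match is None:
            continue  # header field, comment, or the CHARMAP marker
        char = chr(int(match.group(1), 16))
        raw = bytes(int(h, 16) for h in _HEX.findall(match.group(2)))
        table[raw] = char
    return table

# Sample lines copied from the excerpt in this message.
sample = [
    "<code_set_name> ISO_6937",
    "% alias ISO6937",
    "CHARMAP",
    "<U0000>     /x00         NULL (NUL)",
    "<U0009>     /x09         CHARACTER TABULATION (HT)",
]
decoding = parse_charmap(sample)
print(decoding[b"\x09"] == "\t")
```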
msg281780 - (view) Author: John Helour (John Helour) * Date: 2016-11-26 15:41
If I take the ISO_6937 file as a template for encoding table then
increasing the range 0x20..0x7f to 0x00..0xA0 is the simplest solution.
msg281781 - (view) Author: John Helour (John Helour) * Date: 2016-11-26 15:43
If I take the ISO_6937 file as a template for encoding table then
increasing the range 0x20..0x7f to 0x00..0xA0 is the simplest solution.
msg281869 - (view) Author: Marc-Andre Lemburg (lemburg) * (Python committer) Date: 2016-11-28 12:40
The codec code has a few (performance) issues:

 * nonspacing_diacritical_marks should be a set for fast lookup
 * ord(c) in range(0x00, 0xA0) should be rewritten using < and >=
 * result += bytes([ord(c)]) has quadratic timing (it copies
   the whole bytes string for every single operation); better
   use a bytearray and convert this to bytes in one final step
 * the error messages should include more useful information
   about the cause and location of the error, instead of just
   UnicodeError("Unacceptable unicode character") and
   raise KeyError

Please also check whether it's not possible to reuse the charmap codec
functions we have. Thanks.
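
The points above can be sketched as follows. This is not the attached iso6937.py: all names are hypothetical, the set only lists the two diacritical bytes mentioned in this thread, and the function only covers the identity range 0x00..0x9F.

```python
# Hypothetical subset: only the two lead bytes mentioned in this thread.
# A frozenset gives constant-time membership tests.
NONSPACING_BYTES = frozenset({0xC1, 0xC2})

def starts_two_byte_sequence(byte):
    """Return True if the byte is a diacritical lead byte in this sketch."""
    return byte in NONSPACING_BYTES

def encode_identity_range(text):
    """Encode characters below U+00A0 one-to-one; reject everything else."""
    out = bytearray()  # grows in place, no full copy per character
    for index, char in enumerate(text):
        code = ord(char)
        if 0x00 <= code < 0xA0:  # plain comparisons instead of range()
            out.append(code)
        else:
            # Report which character failed and where it is located.
            raise UnicodeEncodeError(
                "iso6937", text, index, index + 1,
                "character outside the identity range of this sketch")
    return bytes(out)  # single conversion at the end

print(encode_identity_range("a\tb"))
```

Raising `UnicodeEncodeError` with the offending position gives callers the same diagnostics they get from the built-in codecs.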
msg282048 - (view) Author: John Helour (John Helour) * Date: 2016-11-29 21:32
> Please also check whether it's not possible to reuse the charmap codec functions we have
I've found nothing useful; maybe you (as the author) can find something that improves code readability or performance.

Please look at the newest codec version, particularly on line:

tmp += bytearray(encoding_map[c], 'latin1', 'ignore')

It is about extended ascii inheritance. Is it reliable and fast enough?
msg282084 - (view) Author: John Helour (John Helour) * Date: 2016-11-30 14:46
Please ignore my previous question about:
tmp += bytearray(encoding_map[c], 'latin1', 'ignore')

The latest version doesn't need such encoding ...
msg282338 - (view) Author: John Helour (John Helour) * Date: 2016-12-04 13:17
Performance issue resolved, more info on error added.

I've checked encoding and decoding on two UTF-8 text files of ~3 MiB each. Except for the initial BOM (may I ignore it?), everything seems to work OK.
msg282351 - (view) Author: Julien Palard (mdk) * (Python committer) Date: 2016-12-04 16:49
LGTM. For me it's time to release it as a package on PyPI to check the adoption rate and see if it's worth adding to Python, and maybe close this issue.
msg288144 - (view) Author: Julien Palard (mdk) * (Python committer) Date: 2017-02-19 15:59
John: You should probably package this as a pip-installable module alongside a git repository, at least to measure the number of interested people and to get some feedback / contributions.
msg293745 - (view) Author: Xiang Zhang (xiang.zhang) * (Python committer) Date: 2017-05-16 03:25
Would you mind converting this patch to a Github PR John?
msg341580 - (view) Author: Julien Palard (mdk) * (Python committer) Date: 2019-05-06 18:08
For the moment, I'm closing this issue: as there's no activity on it, I suspect it may not be that useful.

I may be wrong, so if someone actually needs this, don't hesitate either to publish it as a package on PyPI (it should probably go there anyway) or to reopen the issue.
msg396381 - (view) Author: Maarten Derickx (koffie) Date: 2021-06-23 06:22
Is there any way to contact John Helour? I would still very much like to put this package on github and pypi. And would like to ask him permission for licensing. Or is there some standard open source license under which all code uploaded to https://bugs.python.org/ can automatically be distributed?

https://www.python.org/about/legal/ seems to indicate so, but doesn't mention an explicit license just the things you can do with it.
msg396384 - (view) Author: Marc-Andre Lemburg (lemburg) * (Python committer) Date: 2021-06-23 08:08
Maarten, the code posted on bugs is copyrighted by the person who wrote it. We can only accept it for inclusion in Python after the CLA has been signed, since then we are allowed to relicense it.

As a result you can only take John's code and post it elsewhere, if John permits you to do so, since the files don't include a license.

Note: Creating a character map based codec is not hard using gencodec.py from the Tools/unicode/ dir and perhaps some added extra logic.
msg396505 - (view) Author: Maarten Derickx (koffie) Date: 2021-06-24 17:18
Hi Marc-Andre Lemburg,

Thanks for your reply. I tried using gencodec.py as could be downloaded from https://github.com/python/cpython/blob/main/Tools/unicode/gencodec.py as you mentioned. However the code in gencodec.py seems to be in a much worse shape than the iso6937.py attached here. The code in gencodec relies on being able to compare integers with tuples. This is caused by the lines:

mappings = sorted(map)

hinting that this code has never been run using python 3.
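
A minimal sketch of that failure mode and of one possible sort key, using made-up data rather than the actual contents of gencodec.py's map (`keys` and `as_tuple` are names chosen for this example):

```python
# Hypothetical data: a collection that mixes plain ints with tuples of ints.
keys = [0x41, (0xC1, 0x41), 0x20]

try:
    sorted(keys)
except TypeError as exc:
    # Python 3 refuses to order an int against a tuple.
    print("TypeError:", exc)

def as_tuple(key):
    """Normalize every key to a tuple so all comparisons are tuple-to-tuple."""
    return key if isinstance(key, tuple) else (key,)

print(sorted(keys, key=as_tuple))
```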

Providing a decent sort key solves this issue. But after that, other issues pop up. For example, there seem to be problems with items in the mapping that have MISSING_CODE: they are not handled appropriately, resulting in things like:

    0x80: 0x-001

showing up in the generated code.

And then there is the issue that python_mapdef_code has as a side effect that it does 'del map["IDENTITY"]' causing "'IDENTITY' in map" in python_tabledef_code to always evaluate to False even when it should evaluate to True.

The problems above can be observed by just running gencodec.py on https://unicode.org/Public/MAPPINGS/VENDORS/APPLE/SYMBOL.TXT .

If gencodec.py was a trustworthy and well maintained piece of code, I would happily use it. However at the moment I don't see it as a valid option since debugging gencodec.py would cost me at least as much time as just writing its output myself instead of generating it. Additionally https://unicode.org/ doesn't seem to provide a mapping file for iso6937.

I do agree that using codecs.charmap_encode and codecs.charmap_decode is a much better solution than the one in iso6937.py. But I don't understand gencodec.py well enough to actually fix it.
msg396724 - (view) Author: Maarten Derickx (koffie) Date: 2021-06-29 13:12
The route via gencodec or more generally via charmap_encode and charmap_decode seems to be one that is not possible without some low level CPython code adjustments. The reason for this is that charmap_encode and charmap_decode only seem to support mappings where a single encoded byte corresponds to multiple unicode points.

However iso6937 is a mixed length encoding, meaning in this specific case that unicode characters sometimes need to be encoded as a single byte and sometimes with two bytes.

For example chr(0x00c0) should be encoded as b'\xc1' + b'A'.
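
One way to sketch such a two-byte encoding in pure Python is to decompose the character and emit the diacritical byte before the base letter. The table below is an illustrative subset only: 0xC1 for the grave accent follows the example above, while 0xC2 for the acute accent is an assumption taken from the published code table. `MARK_TO_BYTE` and `encode_char` are hypothetical names.

```python
import unicodedata

# Illustrative subset: Unicode combining mark -> ISO 6937 diacritical byte.
# 0xC1 (grave) follows the example in this thread; 0xC2 (acute) is an
# assumption, not something stated in the thread.
MARK_TO_BYTE = {"\u0300": 0xC1, "\u0301": 0xC2}

def encode_char(char):
    """Encode one character as one or two bytes, within this sketch's subset."""
    code = ord(char)
    if code < 0x80:
        return bytes([code])  # ASCII maps to itself
    # NFD splits e.g. U+00C0 into "A" followed by U+0300.
    decomposed = unicodedata.normalize("NFD", char)
    if (len(decomposed) == 2
            and ord(decomposed[0]) < 0x80
            and decomposed[1] in MARK_TO_BYTE):
        # The diacritical byte comes first, then the base letter.
        return bytes([MARK_TO_BYTE[decomposed[1]], ord(decomposed[0])])
    raise UnicodeEncodeError("iso6937", char, 0, 1, "not covered by this sketch")

print(encode_char(chr(0x00C0)))
```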
msg396737 - (view) Author: Marc-Andre Lemburg (lemburg) * (Python committer) Date: 2021-06-29 14:41
Right, the charmap codec was built with the Unicode Consortium mappings in mind.

However, you may have some luck decoding the two byte chars in ISO 6937 using combining code points in Unicode. With some extra post processing you could also normalize the output into single code points.
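
A rough sketch of that idea: map each diacritical byte to a combining code point placed after the base letter, then normalize to NFC as the post-processing step. The two-entry table and the names `BYTE_TO_MARK` and `decode_sketch` are assumptions for illustration, not a full ISO 6937 decoder.

```python
import unicodedata

# Illustrative subset: ISO 6937 diacritical byte -> Unicode combining mark.
# Only the two lead bytes mentioned in this thread are listed; the meaning
# of 0xC2 (acute) is an assumption.
BYTE_TO_MARK = {0xC1: "\u0300", 0xC2: "\u0301"}

def decode_sketch(data):
    """Decode ASCII plus the two-byte sequences covered by BYTE_TO_MARK."""
    out = []
    i = 0
    while i < len(data):
        byte = data[i]
        if byte in BYTE_TO_MARK:
            if i + 1 >= len(data) or data[i + 1] >= 0x80:
                raise UnicodeDecodeError(
                    "iso6937", data, i, i + 1, "incomplete two-byte sequence")
            # Unicode wants the base letter first, then the combining mark.
            out.append(chr(data[i + 1]) + BYTE_TO_MARK[byte])
            i += 2
        elif byte < 0x80:
            out.append(chr(byte))
            i += 1
        else:
            raise UnicodeDecodeError(
                "iso6937", data, i, i + 1, "byte not covered by this sketch")
    # Post-processing: compose base + mark into a single code point if one exists.
    return unicodedata.normalize("NFC", "".join(out))

print(decode_sketch(b"\xc1A") == "\u00c0")
```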

If I find time, I may have a look at gencodec.py again and update it to more modern interfaces. I've long given up maintenance of Unicode in Python and only try to help by giving some guidance based on the original implementation design.
msg396743 - (view) Author: Maarten Derickx (koffie) Date: 2021-06-29 15:07
Hi Marc-Andre Lemburg,

Thanks for your responses and guidance. At least your pointers to charmap_encode and charmap_decode helped, since it shows at least what the general idea is on how to deal with these types of encodings.

In the meantime I have had some success. I wrote some Python code that creates character mappings based on the table in http://webstore.iec.ch/preview/info_isoiec6937%7Bed3.0%7Den.pdf so that we can be sure that there are no human errors in generating the mappings.

I think my next step is to write pure Python versions of charmap_encode and charmap_decode that can handle the general case of multi-byte encodings to Unicode. This won't be as fast as using the builtins written in C, but at least it gives maintainable and hopefully reusable code.

Maybe later the C implementation can be updated as well.
History
Date | User | Action | Args
2022-04-11 14:58:17 | admin | set | github: 68527
2021-06-29 15:07:49 | koffie | set | messages: + msg396743
2021-06-29 14:41:53 | lemburg | set | messages: + msg396737
2021-06-29 13:12:35 | koffie | set | messages: + msg396724
2021-06-24 17:18:21 | koffie | set | messages: + msg396505
2021-06-23 08:08:57 | lemburg | set | messages: + msg396384
2021-06-23 06:22:13 | koffie | set | nosy: + koffie; messages: + msg396381
2019-05-06 18:08:20 | mdk | set | status: open -> closed; resolution: postponed; messages: + msg341580; stage: patch review -> resolved
2017-05-16 03:25:45 | xiang.zhang | set | messages: + msg293745; stage: needs patch -> patch review
2017-02-19 15:59:52 | mdk | set | messages: + msg288144
2016-12-04 16:49:29 | mdk | set | messages: + msg282351
2016-12-04 13:17:35 | John Helour | set | messages: + msg282338
2016-12-04 13:15:09 | serhiy.storchaka | set | priority: normal -> low; assignee: serhiy.storchaka
2016-12-04 12:59:05 | John Helour | set | files: + iso6937.py
2016-12-04 12:57:07 | John Helour | set | files: - iso6937.py
2016-12-04 12:56:50 | John Helour | set | files: + check_iso6937.py
2016-12-03 19:36:10 | John Helour | set | files: + iso6937.py
2016-12-03 19:34:46 | John Helour | set | files: - iso6937.py
2016-11-30 14:46:18 | John Helour | set | files: + iso6937.py; messages: + msg282084
2016-11-30 14:38:49 | John Helour | set | files: - iso6937.py
2016-11-30 14:24:21 | John Helour | set | files: + iso6937.py
2016-11-30 14:23:19 | John Helour | set | files: - iso6937.py
2016-11-30 14:22:36 | John Helour | set | files: + iso6937.py
2016-11-30 14:19:38 | John Helour | set | files: - iso6937.py
2016-11-29 21:32:36 | John Helour | set | files: + iso6937.py; messages: + msg282048
2016-11-28 12:40:36 | lemburg | set | messages: + msg281869
2016-11-26 15:43:31 | John Helour | set | files: + iso6937.py; messages: + msg281781
2016-11-26 15:41:43 | John Helour | set | messages: + msg281780
2016-11-26 15:38:32 | John Helour | set | files: - iso6937.py
2016-11-26 13:21:30 | mdk | set | messages: + msg281774
2016-11-25 22:35:21 | John Helour | set | messages: + msg281748
2016-11-25 22:22:35 | John Helour | set | files: + iso6937.py; messages: + msg281746
2016-11-14 15:39:07 | vstinner | set | messages: + msg280783
2016-11-14 15:02:54 | serhiy.storchaka | set | messages: + msg280779
2016-11-14 13:11:10 | lemburg | set | messages: + msg280773
2016-11-14 13:08:07 | mdk | set | messages: + msg280771
2016-11-14 13:05:58 | vstinner | set | messages: + msg280770
2016-11-14 12:27:32 | serhiy.storchaka | set | stage: needs patch; messages: + msg280765; versions: + Python 3.7, - Python 3.6
2016-11-14 12:03:26 | vstinner | set | nosy: + vstinner; messages: + msg280761
2016-11-14 11:54:54 | lemburg | set | messages: + msg280759
2016-11-14 08:53:33 | lemburg | set | messages: + msg280741
2016-11-14 02:12:46 | xiang.zhang | set | nosy: + xiang.zhang
2016-11-13 22:11:01 | mdk | set | files: + check_iso6937.py; nosy: + mdk; messages: + msg280720
2015-06-18 09:26:40 | John Helour | set | files: - iso6937.py
2015-06-05 18:10:54 | John Helour | set | files: + iso6937.py
2015-06-05 18:09:23 | John Helour | set | files: - iso6937.py
2015-06-05 08:45:49 | John Helour | set | files: - iso6937.py
2015-06-05 08:44:33 | John Helour | set | files: - iso6937.py
2015-06-05 08:44:16 | John Helour | set | files: + iso6937.py
2015-06-05 08:36:28 | John Helour | set | files: + iso6937.py
2015-06-01 11:20:10 | John Helour | set | files: + iso6937.py; messages: + msg244576
2015-05-31 14:31:35 | serhiy.storchaka | set | nosy: + loewis, serhiy.storchaka, lemburg; messages: + msg244540; versions: + Python 3.6, - Python 2.7
2015-05-31 13:20:03 | John Helour | create |