classification
Title: IA5 Encoding should be in the default encodings
Type: enhancement Stage:
Components: Unicode Versions: Python 3.1, Python 2.7
process
Status: closed Resolution: rejected
Dependencies: Superseder:
Assigned To: Nosy List: amaury.forgeotdarc, lemburg, loewis, pascal.bach
Priority: normal Keywords:

Created on 2008-08-22 16:26 by pascal.bach, last changed 2008-08-25 15:31 by pascal.bach. This issue is now closed.

Files
File name Uploaded Description Edit
ia5.py pascal.bach, 2008-08-22 16:26 File which implements the Python .encode/.decode methods
Messages (8)
msg71755 - (view) Author: Pascal Bach (pascal.bach) Date: 2008-08-22 16:26
This encoding is used in the GSM standard; it is a 7-bit encoding similar
to ASCII.
The encoding definition is found in:
Short Message Service Centre EMI - UCP Interface 4.6 Specification (p. 79)
as well as in: 
[3GPP 23.038] 3GPP TS 23.038 Alphabets and language-specific information.

I think this encoding would be useful for other GSM specific use cases.
msg71771 - (view) Author: Amaury Forgeot d'Arc (amaury.forgeotdarc) * (Python committer) Date: 2008-08-22 19:20
The provided file does not work for "EXTENSION" characters:

>>> import ia5
>>> u"[a]".encode("ia5")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "ia5.py", line 18, in encode
    return codecs.charmap_encode(input,errors,encoding_map)
TypeError: character mapping must be in range(256)

I doubt this can be achieved with just a charmap. You will have to roll
your own incremental stateful decoder.
Are you willing to do it?
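
A minimal sketch of why a plain charmap is not enough, written in Python 3 syntax (the session above is Python 2). In the GSM alphabet an "extension" character such as "[" is sent as two codes: the escape 0x1B followed by a second code. The function name gsm_encode and the codec name "gsm-sketch" are hypothetical, and the tables are a small illustrative subset, not the full alphabet from 3GPP TS 23.038.

```python
ESC = 0x1B  # escape code that introduces an extension character

# Illustrative subset only; NOT the complete GSM default alphabet.
BASIC = {"@": 0x00, "$": 0x02, "\n": 0x0A, "\r": 0x0D, " ": 0x20}
# Letters and digits keep their ASCII code points in this alphabet.
for ch in "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789":
    BASIC[ch] = ord(ch)

# Extension characters: each is emitted as ESC followed by the code below.
EXTENSION = {"[": 0x3C, "]": 0x3E, "{": 0x28, "}": 0x29}


def gsm_encode(text, errors="strict"):
    """Stateless encoder returning (bytes, number of characters consumed).

    Only strict error handling is implemented in this sketch.
    """
    out = bytearray()
    for pos, ch in enumerate(text):
        if ch in BASIC:
            out.append(BASIC[ch])
        elif ch in EXTENSION:
            # One character becomes two codes, which a one-to-one
            # character table cannot express.
            out.append(ESC)
            out.append(EXTENSION[ch])
        else:
            raise UnicodeEncodeError(
                "gsm-sketch", text, pos, pos + 1,
                "character not in the partial table")
    return bytes(out), len(text)
```

With these assumptions, gsm_encode("[a]") produces five code values for three characters.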
msg71776 - (view) Author: Pascal Bach (pascal.bach) Date: 2008-08-22 20:49
Well I have seen the problem. 

I'm willing to do this to improve python, but I don't know exactly how
to do it.

I looked at how utf-8 and utf-7 are done but I didn't exactly
understand them. Are they based on C code?

Is there an example of how this needs to be done? It would be nice if you
could give me some help on where to start.
msg71803 - (view) Author: Amaury Forgeot d'Arc (amaury.forgeotdarc) * (Python committer) Date: 2008-08-23 09:06
You could start with utf_8.py, and of course replace the calls to
codecs.utf_8_encode and codecs.utf_8_decode.

- your "ia5_encode" follows this interface:
http://docs.python.org/dev/library/codecs.html#codecs.Codec.encode

- your "ia5_decode" has the signature:
    def ia5_decode(input, errors='strict', final=False)
and returns a tuple (output object, length consumed).
See
http://docs.python.org/dev/library/codecs.html#codecs.IncrementalDecoder.decode
for an explanation of the final parameter; 
in particular, if the input is a single 0x1B,
- it will return ('', 0) if final is False
- and raise UnicodeDecodeError("unexpected end of data") if final is True
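
The behaviour described above can be sketched as follows, in Python 3 syntax (bytes in, str out). This is an illustration under stated assumptions, not the attached ia5.py: the tables are a small subset of the GSM alphabet, and the codec name "ia5-sketch" is hypothetical. The subclass at the end follows the pattern used by utf_8.py, where codecs.BufferedIncrementalDecoder keeps the unconsumed bytes between calls.

```python
import codecs

ESC = 0x1B  # escape code that introduces an extension character

# Illustrative subset only; NOT the complete GSM default alphabet.
BASIC = {0x00: "@", 0x02: "$", 0x0A: "\n", 0x0D: "\r", 0x20: " "}
for code in (list(range(0x30, 0x3A)) + list(range(0x41, 0x5B))
             + list(range(0x61, 0x7B))):
    BASIC[code] = chr(code)  # digits and letters match ASCII
EXTENSION = {0x3C: "[", 0x3E: "]", 0x28: "{", 0x29: "}"}


def ia5_decode(input, errors="strict", final=False):
    """Return (decoded text, number of bytes consumed). Strict mode only."""
    data = bytes(input)
    chars = []
    pos = 0
    while pos < len(data):
        code = data[pos]
        if code == ESC:
            if pos + 1 >= len(data):
                if not final:
                    break  # leave the lone ESC unconsumed, wait for more
                raise UnicodeDecodeError(
                    "ia5-sketch", data, pos, pos + 1,
                    "unexpected end of data")
            if data[pos + 1] not in EXTENSION:
                raise UnicodeDecodeError(
                    "ia5-sketch", data, pos, pos + 2,
                    "extension code not in the partial table")
            chars.append(EXTENSION[data[pos + 1]])
            pos += 2
        elif code in BASIC:
            chars.append(BASIC[code])
            pos += 1
        else:
            raise UnicodeDecodeError(
                "ia5-sketch", data, pos, pos + 1,
                "code not in the partial table")
    return "".join(chars), pos


class IncrementalDecoder(codecs.BufferedIncrementalDecoder):
    # staticmethod keeps the plain function from being bound to self.
    _buffer_decode = staticmethod(ia5_decode)
```

Feeding the escape and the following code in two separate decode() calls then yields an empty string first and "[" second.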
msg71845 - (view) Author: Pascal Bach (pascal.bach) Date: 2008-08-24 17:38
I have looked at utf_8.py and I think I know how to implement the
incremental de/encoder. But I don't understand the codecs.register()
function. Do I have to provide stateless, stateful and streamwriter at
the same time? 
If I implement IncrementalEncoder and IncrementalDecoder can I just give
those two to codecs.register()?

Thank you for your help.
msg71887 - (view) Author: Martin v. Löwis (loewis) * (Python committer) Date: 2008-08-24 21:52
I don't think this codec should be named IA-5. IA-5 is specified in
ITU-T Rec. T.50 (International Alphabet No. 5), recently renamed to
"International Reference Alphabet", and it does *not* specify that the
characters 0..31 are printable. Instead, IA5 is identical to ISO 646
(i.e. allowing for national variants), and the International Reference
Version of IA5 (e.g. as used in ASN.1 IA5String) is identical to US-ASCII.

If GSM uses a modified version of this, it should receive a separate
name. If you were looking at section 2 (Structure of EMI messages), what
makes you think that this specification calls the encoding "IA5"? In my
copy, it says:

# Alphanumeric characters are encoded as two numeric IA5 characters,
# the higher 3 bits (0..7) first, the lower 4 bits (0..F) thereafter,
# according to the following table.

So it *uses* IA5 to hex-encode the encoding. To achieve that, one would
have to write

  text.encode("emi-section-2").encode("hex")

[Notice that the "hex" codec already uses IA-5]
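
A rough Python 3 equivalent of the second step, as a sketch: str.encode("hex") only exists in Python 2, so bytes.hex() is used here instead. The function names emi_hex and emi_unhex are hypothetical, and the upper-case digits are an assumption based on the "0..F" wording in the quoted passage.

```python
def emi_hex(codes):
    """Write each code value as two hex digits, high bits first."""
    return bytes(codes).hex().upper()


def emi_unhex(digits):
    """Inverse of emi_hex: turn pairs of hex digits back into code values."""
    return bytes.fromhex(digits)
```

For example, the two code values 0x1B and 0x3C become the four characters "1B3C".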

In any case, I don't think this is general enough to deserve inclusion
into the standard library. The codec system is designed to be flexible
enough to support additional codecs outside the core.
msg71934 - (view) Author: Marc-Andre Lemburg (lemburg) * (Python committer) Date: 2008-08-25 15:10
I think what you're after is the encoding used in SMS messages:

http://en.wikipedia.org/wiki/Short_message_service

Here's an old discussion about this codec:

http://mail.python.org/pipermail/python-list/2002-October/167267.html
http://mail.python.org/pipermail/python-list/2002-October/167271.html

Note that nowadays, SMSCs and interface software such as Kannel
typically accept UTF-16 data just fine, so the need for such a codec in
Python is minimal.

I agree with Martin that the stdlib is not the right place for such a
codec. It's easy to write your own codec package and have your
application register this package at startup time using codecs.register().
msg71939 - (view) Author: Pascal Bach (pascal.bach) Date: 2008-08-25 15:31
I currently use the codec in my ucplib already and this is not a
problem. I just thought that it might be useful for somebody else. But
maybe it is too use-case specific.
If this codec is not of general interest I think this report can be closed.
History
Date User Action Args
2008-08-25 15:31:47 pascal.bach set messages: + msg71939
2008-08-25 15:11:00 lemburg set status: open -> closed
nosy: + lemburg
resolution: rejected
messages: + msg71934
2008-08-24 21:52:17 loewis set nosy: + loewis
messages: + msg71887
2008-08-24 19:05:14 pitrou set priority: normal
versions: + Python 3.1, Python 2.7, - Python 2.5
2008-08-24 17:38:11 pascal.bach set messages: + msg71845
2008-08-23 09:06:23 amaury.forgeotdarc set messages: + msg71803
2008-08-22 20:49:30 pascal.bach set messages: + msg71776
2008-08-22 19:20:25 amaury.forgeotdarc set nosy: + amaury.forgeotdarc
messages: + msg71771
2008-08-22 16:26:46 pascal.bach create