IA5 Encoding should be in the default encodings #47899

pascalbach · 2008-08-22T16:26:46Z

BPO	3649
Nosy	@malemburg, @loewis, @amauryfa
Files	ia5.py: File which implements the Python .encode/.decode methods

Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.

GitHub fields:

assignee = None
closed_at = <Date 2008-08-25.15:11:00.890>
created_at = <Date 2008-08-22.16:26:46.167>
labels = ['type-feature', 'expert-unicode']
title = 'IA5 Encoding should be in the default encodings'
updated_at = <Date 2008-08-25.15:31:47.296>
user = 'https://bugs.python.org/pascalbach'

bugs.python.org fields:

activity = <Date 2008-08-25.15:31:47.296>
actor = 'pascal.bach'
assignee = 'none'
closed = True
closed_date = <Date 2008-08-25.15:11:00.890>
closer = 'lemburg'
components = ['Unicode']
creation = <Date 2008-08-22.16:26:46.167>
creator = 'pascal.bach'
dependencies = []
files = ['11214']
hgrepos = []
issue_num = 3649
keywords = []
message_count = 8.0
messages = ['71755', '71771', '71776', '71803', '71845', '71887', '71934', '71939']
nosy_count = 4.0
nosy_names = ['lemburg', 'loewis', 'amaury.forgeotdarc', 'pascal.bach']
pr_nums = []
priority = 'normal'
resolution = 'rejected'
stage = None
status = 'closed'
superseder = None
type = 'enhancement'
url = 'https://bugs.python.org/issue3649'
versions = ['Python 3.1', 'Python 2.7']

pascalbach · 2008-08-22T16:26:43Z

This encoding is used in the GSM standard. It is a 7-bit encoding similar
to ASCII.
The encoding definition is found in:
Short Message Service Centre EMI - UCP Interface 4.6 Specification (p. 79)
as well as in:
[3GPP 23.038] 3GPP TS 23.038 Alphabets and language-specific information.

I think this encoding would be useful for other GSM-specific use cases.

amauryfa · 2008-08-22T19:20:25Z

The provided file does not work for "EXTENSION" characters:

>>> import ia5
>>> u"[a]".encode("ia5")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "ia5.py", line 18, in encode
    return codecs.charmap_encode(input,errors,encoding_map)
TypeError: character mapping must be in range(256)

I doubt this can be achieved with just a charmap. You will have to roll
your own incremental stateful decoder.
Are you willing to do it?
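
For illustration, here is a minimal Python 3 sketch of an encoder that handles extension characters by writing the escape byte 0x1B followed by a second code. The function name gsm_encode, the codec name "gsm-sketch" and the two deliberately tiny tables are assumptions made for this example; they are not taken from the attached ia5.py, and a real codec would need the full alphabet from 3GPP TS 23.038.

```python
# Hypothetical, partial tables: only a handful of code points are listed.
BASIC = {"@": 0x00, " ": 0x20, "A": 0x41, "a": 0x61}
EXTENSION = {"[": 0x3C, "]": 0x3E, "{": 0x28, "}": 0x29}
ESCAPE = 0x1B


def gsm_encode(text, errors="strict"):
    """Return (bytes, number of characters consumed), like Codec.encode."""
    out = bytearray()
    for pos, char in enumerate(text):
        if char in BASIC:
            out.append(BASIC[char])
        elif char in EXTENSION:
            # One character becomes two bytes: the escape, then the code.
            out.append(ESCAPE)
            out.append(EXTENSION[char])
        elif errors == "strict":
            raise UnicodeEncodeError("gsm-sketch", text, pos, pos + 1,
                                     "character not in sketch tables")
        elif errors == "replace":
            out.append(0x3F)  # "?" in the basic table
        # Any other handler is treated as "ignore" in this sketch.
    return bytes(out), len(text)


print(gsm_encode("[a]"))
```

Because one input character can produce two output bytes, the mapping is written as explicit branches rather than as a single character-to-byte table.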

pascalbach · 2008-08-22T20:49:29Z

Well, I see the problem.

I'm willing to do this to improve Python, but I don't know exactly how
to do it.

I looked at how utf-8 and utf-7 are done, but I didn't fully
understand them. Are they based on C code?

Is there an example of how this needs to be done? It would be nice if you
could give me some pointers on where to start.

amauryfa · 2008-08-23T09:06:22Z

You could start with utf_8.py, and of course replace the calls to
codecs.utf_8_encode and codecs.utf_8_decode.

Your "ia5_encode" follows this interface:
http://docs.python.org/dev/library/codecs.html#codecs.Codec.encode
Your "ia5_decode" has the signature:
def ia5_decode(input, errors='strict', final=False)
and returns a tuple (output object, length consumed).
See
http://docs.python.org/dev/library/codecs.html#codecs.IncrementalDecoder.decode
for an explanation of the final parameter.
In particular, if the input is a single 0x1B,
it will return ('', 0) if final is False
and raise UnicodeDecodeError("unexpected end of data") if final is True.
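
The behaviour of the final parameter described above could look roughly like this in Python 3. This is only a sketch: the name gsm_decode, the codec name "gsm-sketch" and the partial tables are assumptions for the example, and only "strict" and "replace" get real handling.

```python
# Hypothetical, partial tables: only a handful of code points are listed.
BASIC = {0x00: "@", 0x20: " ", 0x41: "A", 0x61: "a"}
EXTENSION = {0x3C: "[", 0x3E: "]", 0x28: "{", 0x29: "}"}
ESCAPE = 0x1B


def gsm_decode(data, errors="strict", final=False):
    """Return (text, number of bytes consumed)."""
    data = bytes(data)
    out = []
    pos = 0
    while pos < len(data):
        if data[pos] == ESCAPE:
            if pos + 1 == len(data):
                if final:
                    raise UnicodeDecodeError("gsm-sketch", data, pos, pos + 1,
                                             "unexpected end of data")
                break  # leave the lone escape unconsumed for the next call
            table, code, width = EXTENSION, data[pos + 1], 2
        else:
            table, code, width = BASIC, data[pos], 1
        if code in table:
            out.append(table[code])
        elif errors == "strict":
            raise UnicodeDecodeError("gsm-sketch", data, pos, pos + width,
                                     "byte not in sketch tables")
        elif errors == "replace":
            out.append("\ufffd")
        # Any other handler is treated as "ignore" in this sketch.
        pos += width
    return "".join(out), pos


print(gsm_decode(b"\x1b"))                    # nothing consumed yet
print(gsm_decode(b"\x1b<a\x1b>", final=True))
```

The consumed length in the return value is what lets a caller keep the trailing 0x1B and prepend it to the next chunk.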

pascalbach · 2008-08-24T17:38:10Z

I have looked at utf_8.py and I think I know how to implement the
incremental de/encoder. But I don't understand the codecs.register()
function. Do I have to provide stateless, stateful and streamwriter at
the same time?
If I implement IncrementalEncoder and IncrementalDecoder can I just give
those two to codecs.register()?

Thank you for your help.
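
As a rough illustration of the registration question: codecs.register() takes a search function, and that function returns a codecs.CodecInfo whose encode and decode arguments are required while the incremental and stream classes are optional. The sketch below targets Python 3; the codec name "gsm-sketch" and the helper names are assumptions, and plain ASCII stands in for the real conversion to keep the example short.

```python
import codecs


def sketch_encode(text, errors="strict"):
    # Stand-in for the real encoder: plain ASCII keeps the example short.
    return codecs.ascii_encode(text, errors)


def sketch_decode(data, errors="strict", final=False):
    # Stand-in for the real decoder; a stateful one would honour `final`.
    return codecs.ascii_decode(data, errors)


class IncrementalEncoder(codecs.IncrementalEncoder):
    def encode(self, input, final=False):
        return sketch_encode(input, self.errors)[0]


class IncrementalDecoder(codecs.BufferedIncrementalDecoder):
    # BufferedIncrementalDecoder keeps unconsumed bytes between calls and
    # hands them back to _buffer_decode(input, errors, final).
    _buffer_decode = staticmethod(sketch_decode)


def search(name):
    # Lookup names arrive lowercased; newer versions also turn "-" into "_".
    if name.replace("-", "_") != "gsm_sketch":
        return None
    return codecs.CodecInfo(
        name="gsm-sketch",
        encode=sketch_encode,
        decode=lambda data, errors="strict": sketch_decode(data, errors, True),
        incrementalencoder=IncrementalEncoder,
        incrementaldecoder=IncrementalDecoder,
    )


codecs.register(search)
print("abc".encode("gsm-sketch"), b"abc".decode("gsm-sketch"))
```

The stream reader and writer are left out here; they would only be needed for stream-based use such as codecs.open().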

loewis · 2008-08-24T21:52:16Z

I don't think this codec should be named IA-5. IA-5 is specified in
ITU-T Rec. T.50 (International Alphabet No. 5), recently renamed to
"International Reference Alphabet", and it does *not* specify that the
characters 0..31 are printable. Instead, IA5 is identical to ISO 646
(i.e. allowing for national variants), and the International Reference
Version of IA5 (e.g. as used in ASN.1 IA5String) is identical to US-ASCII.

If GSM uses a modified version of this, it should receive a separate
name. If you were looking at section 2 (Structure of EMI messages), what
makes you think that this specification calls the encoding "IA5"? In my
copy, it says:

# Alphanumeric characters are encoded as two numeric IA5 characters,
# the higher 3 bits (0..7) first, the lower 4 bits (0..F) thereafter,
# according to the following table.

So it *uses* IA5 to hex-encode the encoding. To achieve that, one would
have to write

text.encode("emi-section-2").encode("hex")

[Notice that the "hex" codec already uses IA-5]
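
In Python 3 the bytes type has no .encode("hex"), so the second step can be sketched with binascii instead. The helper names emi_hex and emi_unhex are made up for this example, the input is assumed to be the already-encoded 7-bit bytes, and using uppercase digits is an assumption based on the "0..F" wording quoted above.

```python
import binascii


def emi_hex(raw):
    """Write each byte as two hex digits, high nibble first."""
    return binascii.hexlify(raw).upper().decode("ascii")


def emi_unhex(digits):
    """Inverse of emi_hex: turn pairs of hex digits back into bytes."""
    return binascii.unhexlify(digits.encode("ascii"))


print(emi_hex(b"\x1b<a\x1b>"))
```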

In any case, I don't think this is general enough to deserve inclusion
in the standard library. The codec system is designed to be flexible
enough to support additional codecs outside the core.

malemburg · 2008-08-25T15:11:00Z

I think what you're after is the encoding used in SMS messages:

http://en.wikipedia.org/wiki/Short_message_service

Here's an old discussion about this codec:

http://mail.python.org/pipermail/python-list/2002-October/167267.html
http://mail.python.org/pipermail/python-list/2002-October/167271.html

Note that nowadays, SMSCs and interface software such as Kannel
typically accept UTF-16 data just fine, so the need for such a codec in
Python is minimal.

I agree with Martin that the stdlib is not the right place for such a
codec. It's easy to write your own codec package and have your
application register this package at startup time using codecs.register().

pascalbach · 2008-08-25T15:31:47Z

I already use the codec in my ucplib, and this is not a
problem. I just thought that it might be useful for somebody else. But
maybe it is too use-case specific.
If this codec is not of general interest, I think this report can be closed.

pascalbach mannequin added topic-unicode type-feature A feature request or enhancement labels Aug 22, 2008

malemburg closed this as completed Aug 25, 2008

ezio-melotti transferred this issue from another repository Apr 10, 2022
