
Author serhiy.storchaka
Recipients amaury.forgeotdarc, doerwalter, lemburg, scoder, serhiy.storchaka, vstinner
Date 2013-11-22.22:53:40
Message-id <1385160820.77.0.636045011248.issue18059@psf.upfronthosting.co.za>
In-reply-to
Content
> I'm not sure that multibyte encodings other than UTF-8 are used in the world.

I don't use any of them, but I have heard that some are still widely used.

This issue was prompted by issue13612. See also the related issue15877.

> pyexpat_encoding_create() looks like an heuristic. How many multibyte codecs can be used with your patch?

All codecs that expat can support. Expat states the following requirements:

"""
   1. Every ASCII character that can appear in a well-formed XML document,
      other than the characters

      $@\^`{}~

      must be represented by a single byte, and that byte must be the
      same byte that represents that character in ASCII.

   2. No character may require more than 4 bytes to encode.

   3. All characters encoded must have Unicode scalar values <=
      0xFFFF, (i.e., characters that would be encoded by surrogates in
      UTF-16 are  not allowed).  Note that this restriction doesn't
      apply to the built-in support for UTF-8 and UTF-16.

   4. No Unicode character may be encoded by more than one distinct
      sequence of bytes.
"""

14 Python encodings satisfy these criteria: big5, big5hkscs, cp932, cp949, cp950, euc-jp, euc-jis-2004, euc-jisx0213, gb2312, gbk, johab, shift-jis, shift-jis-2004, shift-jisx0213.
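
For illustration, here is a rough sketch of how the first two criteria could be probed from Python. It is not the code from the patches, and the function names are made up for this example. It leaves DEL (0x7F) out of the character set, and it does not check criteria 3 and 4, so passing it does not mean that a codec is acceptable to expat.

```python
# Characters that the expat text exempts from criterion 1.
EXEMPT = set("$@\\^`{}~")


def xml_ascii_chars():
    """ASCII characters allowed in XML 1.0, minus the exempt set.

    Assumption of this sketch: TAB, LF, CR and 0x20..0x7E are probed;
    DEL (0x7F) is deliberately left out.
    """
    allowed = [0x09, 0x0A, 0x0D] + list(range(0x20, 0x7F))
    return [chr(c) for c in allowed if chr(c) not in EXEMPT]


def is_ascii_transparent(encoding):
    """Criterion 1: each probed character is one byte, equal to its ASCII byte."""
    for ch in xml_ascii_chars():
        expected = bytes([ord(ch)])
        try:
            if ch.encode(encoding) != expected:
                return False
            if expected.decode(encoding) != ch:
                return False
        except UnicodeError:
            return False
    return True


def max_bmp_encoded_length(encoding):
    """Longest byte sequence produced for any single encodable BMP character."""
    longest = 0
    for cp in range(0x10000):
        try:
            longest = max(longest, len(chr(cp).encode(encoding)))
        except UnicodeError:
            continue  # not encodable in this codec (includes surrogates)
    return longest


def passes_first_two_criteria(encoding):
    """Criteria 1 and 2 only; criteria 3 and 4 are NOT checked here."""
    return is_ascii_transparent(encoding) and max_bmp_encoded_length(encoding) <= 4
```

A call such as `passes_first_two_criteria("gbk")` runs both probes for one codec. Each character is encoded on its own, so a stateful codec pays its full escape-sequence cost on every call.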

> A whitelist of multibyte codecs may be less reliable. What do you think?

pyexpat_multibyte_encodings_4.patch takes this approach. It hardcodes a list of supported encodings with the minimal required tables.

pyexpat_multibyte_encodings_5.patch supports any encoding that satisfies the expat criteria and builds all the needed data on first access (tens of kilobytes). After this expensive start, it works much faster than the previous patch.
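
The "build on first access" idea can be sketched in Python as follows. This only illustrates the caching pattern, not the data structures the patch actually builds. The function name and the use of -1 as a marker are assumptions of this example.

```python
from functools import lru_cache


@lru_cache(maxsize=None)
def single_byte_map(encoding):
    """Return a 256-entry tuple: the code point for bytes that decode on
    their own to one character, -1 otherwise (hypothetical marker).

    The table is computed on the first call for each encoding and then
    served from the cache.
    """
    table = []
    for b in range(256):
        try:
            ch = bytes([b]).decode(encoding)
        except UnicodeError:
            table.append(-1)  # not a complete character by itself
        else:
            table.append(ord(ch) if len(ch) == 1 else -1)
    return tuple(table)
```

The first call for a given encoding pays for 256 decode attempts. Later calls return the cached tuple.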
History
Date                 User              Action  Args
2013-11-22 22:53:40  serhiy.storchaka  set     recipients: + serhiy.storchaka, lemburg, doerwalter, amaury.forgeotdarc, scoder, vstinner
2013-11-22 22:53:40  serhiy.storchaka  set     messageid: <1385160820.77.0.636045011248.issue18059@psf.upfronthosting.co.za>
2013-11-22 22:53:40  serhiy.storchaka  link    issue18059 messages
2013-11-22 22:53:40  serhiy.storchaka  create