
Author malin
Recipients ezio.melotti, malin, vstinner
Date 2017-04-05.03:50:02
This issue is split from issue24117. That issue became a soup of small issues, so I'm going to close it.

For a 4-byte GB18030 sequence, the legal byte ranges are:
0x81-0xFE for the 1st byte
0x30-0x39 for the 2nd byte
0x81-0xFE for the 3rd byte
0x30-0x39 for the 4th byte
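
The four ranges above can be written as a small check. This is a minimal sketch, not the codec's actual code; the function name `is_legal_four_byte` is hypothetical, and it tests byte ranges only, not whether the sequence is assigned to a code point.

```python
def is_legal_four_byte(seq: bytes) -> bool:
    """Return True if seq lies inside the 4-byte GB18030 byte ranges.

    Hypothetical helper for illustration; it only checks the ranges
    listed in this report.
    """
    if len(seq) != 4:
        return False
    b1, b2, b3, b4 = seq
    return (
        0x81 <= b1 <= 0xFE      # 1st byte
        and 0x30 <= b2 <= 0x39  # 2nd byte
        and 0x81 <= b3 <= 0xFE  # 3rd byte (upper bound matters here)
        and 0x30 <= b4 <= 0x39  # 4th byte
    )
```

With this check, b'\x81\x31\x81\x30' passes and b'\x81\x30\xFF\x30' is rejected because its 3rd byte is above 0xFE.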
GB18030 standard:

The current code does not check the upper bound 0xFE for the 1st and 3rd bytes.
As a result, 8630 illegal 4-byte sequences can be decoded by the GB18030 codec. Here is an example:

# legal sequence b'\x81\x31\x81\x30' is decoded to U+060A, it's fine.
uchar = b'\x81\x31\x81\x30'.decode('gb18030')

# illegal sequence b'\x81\x30\xFF\x30' can be decoded to U+060A as well, this should not happen.
uchar = b'\x81\x30\xFF\x30'.decode('gb18030')
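
One way to see why the two sequences collide is to compute the linear index that 4-byte GB18030 sequences are commonly mapped through. The sketch below assumes the usual base-10/base-126 arithmetic with no upper-bound check on the 3rd byte; the helper name `linear_index` is hypothetical and this is not the codec's C code.

```python
def linear_index(seq: bytes) -> int:
    """Map a 4-byte sequence to a linear index without range checks.

    Hypothetical helper: bytes 2 and 4 have 10 legal values each,
    bytes 1 and 3 have 126 legal values each.
    """
    b1, b2, b3, b4 = seq
    return (((b1 - 0x81) * 10 + (b2 - 0x30)) * 126 + (b3 - 0x81)) * 10 + (b4 - 0x30)


legal = linear_index(b'\x81\x31\x81\x30')
illegal = linear_index(b'\x81\x30\xFF\x30')

# 0xFF - 0x81 == 126, so the out-of-range 3rd byte carries over into
# the 2nd-byte position and both sequences get the same index, 1260.
print(legal, illegal)
```

Under that assumption, rejecting 3rd bytes above 0xFE before the arithmetic is what keeps the mapping one-to-one.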
Date User Action Args
2017-04-05 03:50:03 malin set recipients: + malin, vstinner, ezio.melotti
2017-04-05 03:50:03 malin set messageid: <>
2017-04-05 03:50:03 malin link issue29990 messages
2017-04-05 03:50:02 malin create