Title: Range checking in GB18030 decoder
Type: behavior Stage: needs patch
Components: Unicode Versions: Python 3.7, Python 3.6, Python 3.5, Python 2.7
Status: open Resolution:
Dependencies: Superseder:
Assigned To: Nosy List: Ma Lin, ezio.melotti, haypo, xiang.zhang
Priority: normal Keywords:

Created on 2017-04-05 03:50 by Ma Lin, last changed 2017-04-14 17:21 by Mariatta.

Pull Requests
URL Status Linked Edit
PR 994 closed python-dev, 2017-04-05 03:53
PR 999 closed Ma Lin, 2017-04-05 06:28
Messages (6)
msg291153 - (view) Author: Ma Lin (Ma Lin) * Date: 2017-04-05 03:50
This issue is split from issue24117, that issue became a soup of small issues, so I'm going to close it.

For 4-byte GB18030 sequence, the legal range is:
0x81-0xFE for the 1st byte
0x30-0x39 for the 2nd byte
0x81-0xFE for the 3rd byte
0x30-0x39 for the 4th byte
GB18030 standard:

The current code forgets to check 0xFE for the 1st and 3rd byte.
Therefore, there are 8630 illegal 4-byte sequences can be decoded by GB18030 codec, here is an example:

# legal sequence b'\x81\x31\x81\x30' is decoded to U+060A, it's fine.
uchar = b'\x81\x31\x81\x30'.decode('gb18030')

# illegal sequence 0x8130FF30 can be decoded to U+060A as well, this should not happen.
uchar = b'\x81\x30\xFF\x30'  .decode('gb18030')
msg291205 - (view) Author: Xiang Zhang (xiang.zhang) * (Python committer) Date: 2017-04-06 03:23
The table in wikipedia is somewhat complex. I find and the table in it is same as (except 0x80) but in English. I agree with Ma Lin bytes sequences like b'\x81\x30\xFF\x30' are invalid.

For current implementation, you could see:

>>> invalid = b'\x81\x30\xff\x30'
>>> invalid.decode('gb18030').encode('gb18030') == invalid
msg291208 - (view) Author: Ma Lin (Ma Lin) * Date: 2017-04-06 04:18
> except 0x80 (€)

I suppose the English edition is not the final release of GB18030-2000.

At the end of official Chinese edition of GB18030-2005, listed the difference between GB18030-2000 and GB18030-2005 clearly, it doesn't mention 0x80 (€), so GB18030-2000 should not has 0x80 as well.

Why 0x80 (€) appear in English edition?
I searched on Google, this topic said 0x80 appears in *draft* of GB18030-2000.
So maybe the English edition is a translation of GB18030-2000 draft, this logic seems ok.

Anyway, 0x80 is another story, not conflict with this issue.
Zhang, do I need to make PR for 3.6/3.5/2.7 respectively?
msg291210 - (view) Author: Xiang Zhang (xiang.zhang) * (Python committer) Date: 2017-04-06 05:05
Yes, 0x80 doesn't matter here.

It's nice to make the backporting PRs. But let's wait some time for ezio and haypo's comments and reviews. Get the master PR merged first and then continue on backporting. :-)
msg291213 - (view) Author: STINNER Victor (haypo) * (Python committer) Date: 2017-04-06 06:54
An incorrect implementation of a decoder might lead to security vulnerabilities:

*But* UTF-8 decoder of Python 2 is *not* strict and nobody complained.

I suggest that, once the changed is merged in master, backport the fix to 3.6 and 3.5.

But I'm not sure that it's worth it to backport it to 2.7? Is there a risk to break an application?
msg291215 - (view) Author: Ma Lin (Ma Lin) * Date: 2017-04-06 07:31
This is a very trivial bug, it's hard to imagine a scene that someone trying to decode those 8630 illegal 4-byte sequences with GB18030 decoder.
And I think this bug can't lead to security vulnerabilities.

As far as I can see, GB2312/GBK/GB18030 codecs are bugfree except this bug, of course maybe I'm wrong.
Date User Action Args
2017-04-14 17:21:39Mariattasetstage: patch review -> needs patch
2017-04-06 07:31:53Ma Linsetmessages: + msg291215
2017-04-06 06:54:09hayposetmessages: + msg291213
2017-04-06 05:05:02xiang.zhangsetmessages: + msg291210
2017-04-06 04:18:36Ma Linsetmessages: + msg291208
2017-04-06 03:24:45xiang.zhangsetstage: patch review
versions: + Python 2.7, Python 3.5, Python 3.6
2017-04-06 03:23:12xiang.zhangsetmessages: + msg291205
2017-04-05 06:28:48Ma Linsetpull_requests: + pull_request1171
2017-04-05 03:56:05xiang.zhangsetnosy: + xiang.zhang
2017-04-05 03:53:59python-devsetpull_requests: + pull_request1168
2017-04-05 03:50:03Ma Lincreate