This issue tracker has been migrated to GitHub, and is currently read-only.
For more information, see the GitHub FAQs in the Python's Developer Guide.

classification
Title: Range checking in GB18030 decoder
Type: behavior Stage: resolved
Components: Unicode Versions: Python 3.7, Python 3.6, Python 3.5, Python 2.7
process
Status: closed Resolution: fixed
Dependencies: Superseder:
Assigned To: Nosy List: ezio.melotti, malin, vstinner, xiang.zhang
Priority: normal Keywords:

Created on 2017-04-05 03:50 by malin, last changed 2022-04-11 14:58 by admin. This issue is now closed.

Pull Requests
URL Status Linked Edit
PR 994 closed python-dev, 2017-04-05 03:53
PR 999 closed malin, 2017-04-05 06:28
PR 1495 merged xiang.zhang, 2017-05-08 03:25
PR 1507 merged xiang.zhang, 2017-05-09 03:46
PR 1508 merged xiang.zhang, 2017-05-09 03:47
PR 1509 merged xiang.zhang, 2017-05-09 04:03
Messages (11)
msg291153 - (view) Author: Ma Lin (malin) * Date: 2017-04-05 03:50
This issue is split from issue24117, that issue became a soup of small issues, so I'm going to close it.

For 4-byte GB18030 sequence, the legal range is:
0x81-0xFE for the 1st byte
0x30-0x39 for the 2nd byte
0x81-0xFE for the 3rd byte
0x30-0x39 for the 4th byte
GB18030 standard:
https://en.wikipedia.org/wiki/GB_18030
https://pan.baidu.com/share/link?shareid=2606985291&uk=3341026630

The current code forgets to check 0xFE for the 1st and 3rd byte.
Therefore, there are 8630 illegal 4-byte sequences can be decoded by GB18030 codec, here is an example:

# legal sequence b'\x81\x31\x81\x30' is decoded to U+060A, it's fine.
uchar = b'\x81\x31\x81\x30'.decode('gb18030')
print(hex(ord(uchar)))

# illegal sequence 0x8130FF30 can be decoded to U+060A as well, this should not happen.
uchar = b'\x81\x30\xFF\x30'  .decode('gb18030')
print(hex(ord(uchar)))
msg291205 - (view) Author: Xiang Zhang (xiang.zhang) * (Python committer) Date: 2017-04-06 03:23
The table in wikipedia is somewhat complex. I find ftp://ftp.software.ibm.com/software/globalization/documents/gb18030m.pdf and the table in it is same as https://pan.baidu.com/share/link?shareid=2606985291&uk=3341026630 (except 0x80) but in English. I agree with Ma Lin bytes sequences like b'\x81\x30\xFF\x30' are invalid.

For current implementation, you could see:

>>> invalid = b'\x81\x30\xff\x30'
>>> invalid.decode('gb18030').encode('gb18030') == invalid
False
msg291208 - (view) Author: Ma Lin (malin) * Date: 2017-04-06 04:18
> except 0x80 (€)

I suppose the English edition is not the final release of GB18030-2000.

At the end of official Chinese edition of GB18030-2005, listed the difference between GB18030-2000 and GB18030-2005 clearly, it doesn't mention 0x80 (€), so GB18030-2000 should not has 0x80 as well.

Why 0x80 (€) appear in English edition?
I searched on Google, this topic said 0x80 appears in *draft* of GB18030-2000.
http://www.pkucn.com/thread-304395-1-1.html
So maybe the English edition is a translation of GB18030-2000 draft, this logic seems ok.

Anyway, 0x80 is another story, not conflict with this issue.
Zhang, do I need to make PR for 3.6/3.5/2.7 respectively?
msg291210 - (view) Author: Xiang Zhang (xiang.zhang) * (Python committer) Date: 2017-04-06 05:05
Yes, 0x80 doesn't matter here.

It's nice to make the backporting PRs. But let's wait some time for ezio and haypo's comments and reviews. Get the master PR merged first and then continue on backporting. :-)
msg291213 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2017-04-06 06:54
An incorrect implementation of a decoder might lead to security vulnerabilities:
http://unicodebook.readthedocs.io/issues.html#security-vulnerabilities

*But* UTF-8 decoder of Python 2 is *not* strict and nobody complained.

I suggest that, once the changed is merged in master, backport the fix to 3.6 and 3.5.

But I'm not sure that it's worth it to backport it to 2.7? Is there a risk to break an application?
msg291215 - (view) Author: Ma Lin (malin) * Date: 2017-04-06 07:31
This is a very trivial bug, it's hard to imagine a scene that someone trying to decode those 8630 illegal 4-byte sequences with GB18030 decoder.
And I think this bug can't lead to security vulnerabilities.

As far as I can see, GB2312/GBK/GB18030 codecs are bugfree except this bug, of course maybe I'm wrong.
msg293275 - (view) Author: Xiang Zhang (xiang.zhang) * (Python committer) Date: 2017-05-09 03:38
New changeset 9da408d15bdef624a5632182cb4edf98001fa82f by Xiang Zhang in branch 'master':
bpo-29990: Fix range checking in GB18030 decoder (#1495)
https://github.com/python/cpython/commit/9da408d15bdef624a5632182cb4edf98001fa82f
msg293276 - (view) Author: Xiang Zhang (xiang.zhang) * (Python committer) Date: 2017-05-09 04:16
New changeset 72e1b61da0920c5607481304879e039b63e2a3d5 by Xiang Zhang in branch '3.6':
bpo-29990: Fix range checking in GB18030 decoder (#1495) (#1507)
https://github.com/python/cpython/commit/72e1b61da0920c5607481304879e039b63e2a3d5
msg293277 - (view) Author: Xiang Zhang (xiang.zhang) * (Python committer) Date: 2017-05-09 04:17
New changeset f5f7870d9322b46ab87c45b2c4c46f6b10ecbd70 by Xiang Zhang in branch '3.5':
bpo-29990: Fix range checking in GB18030 decoder (#1495) (#1508)
https://github.com/python/cpython/commit/f5f7870d9322b46ab87c45b2c4c46f6b10ecbd70
msg293278 - (view) Author: Xiang Zhang (xiang.zhang) * (Python committer) Date: 2017-05-09 04:18
New changeset 4e7457b85316e6591a4f0c3a4d0807bfdf7a2bea by Xiang Zhang in branch '2.7':
bpo-29990: Fix range checking in GB18030 decoder (#1509)
https://github.com/python/cpython/commit/4e7457b85316e6591a4f0c3a4d0807bfdf7a2bea
msg293279 - (view) Author: Xiang Zhang (xiang.zhang) * (Python committer) Date: 2017-05-09 04:22
Thanks Ma Lin for finding the problem! Don't know why you close the PR but anyway, we solve it finally.
History
Date User Action Args
2022-04-11 14:58:44adminsetgithub: 74176
2017-05-09 04:22:14xiang.zhangsetmessages: + msg293279
2017-05-09 04:19:26xiang.zhangsetstatus: open -> closed
resolution: fixed
stage: needs patch -> resolved
2017-05-09 04:18:58xiang.zhangsetmessages: + msg293278
2017-05-09 04:17:11xiang.zhangsetmessages: + msg293277
2017-05-09 04:16:52xiang.zhangsetmessages: + msg293276
2017-05-09 04:03:55xiang.zhangsetpull_requests: + pull_request1611
2017-05-09 03:47:37xiang.zhangsetpull_requests: + pull_request1610
2017-05-09 03:46:14xiang.zhangsetpull_requests: + pull_request1609
2017-05-09 03:38:35xiang.zhangsetmessages: + msg293275
2017-05-08 03:25:20xiang.zhangsetpull_requests: + pull_request1597
2017-04-14 17:21:39Mariattasetstage: patch review -> needs patch
2017-04-06 07:31:53malinsetmessages: + msg291215
2017-04-06 06:54:09vstinnersetmessages: + msg291213
2017-04-06 05:05:02xiang.zhangsetmessages: + msg291210
2017-04-06 04:18:36malinsetmessages: + msg291208
2017-04-06 03:24:45xiang.zhangsetstage: patch review
versions: + Python 2.7, Python 3.5, Python 3.6
2017-04-06 03:23:12xiang.zhangsetmessages: + msg291205
2017-04-05 06:28:48malinsetpull_requests: + pull_request1171
2017-04-05 03:56:05xiang.zhangsetnosy: + xiang.zhang
2017-04-05 03:53:59python-devsetpull_requests: + pull_request1168
2017-04-05 03:50:03malincreate