Issue 23997: unicodedata_UCD_lookup() has theoretical buffer overflow

This issue has been migrated to GitHub: https://github.com/python/cpython/issues/68185

classification

Title:	unicodedata_UCD_lookup() has theoretical buffer overflow
Type:	behavior	Stage:	patch review
Components:	Extension Modules	Versions:	Python 3.6, Python 3.5, Python 2.7

process

Status:	open	Resolution:
Dependencies:		Superseder:
Assigned To:		Nosy List:	benjamin.peterson, christian.heimes, ezio.melotti, lemburg, pitrou, serhiy.storchaka, vstinner
Priority:	normal	Keywords:	patch

Created on 2015-04-18 22:32 by christian.heimes, last changed 2022-04-11 14:58 by admin.

Files
File name	Uploaded
unicode_name_maxlen.patch	christian.heimes, 2015-04-18 22:32
unicode_name_maxlen_trunc.patch	serhiy.storchaka, 2015-12-19 23:12

Messages (2)
msg241461	Author: Christian Heimes (christian.heimes)	Date: 2015-04-18 22:32
Coverity has found a potential buffer overflow in the unicodedata module. The lookup() function calls _getcode(), which in turn calls _cmpname(). _cmpname() copies data into a fixed-size buffer of length NAME_MAXLEN. Neither lookup() nor _getcode() limits name_length to NAME_MAXLEN, so the buffer could theoretically overflow. In practice the overflow can't be exploited, because Tools/unicode/makeunicodedata.py already limits the maximum name length. I'd still like to fix the bug because it is low-hanging fruit. In most versions of Python the code already checks that name_length fits in INT_MAX.

CID 1295028 (#1 of 1): Out-of-bounds access (OVERRUN)
overrun-call: Overrunning callee's array of size 256 by passing argument (int)name_length (which evaluates to 2147483647) in call to _getcode
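
The missing length check described in this message can be modeled from the Python side. This is only a sketch of the idea, not the contents of unicode_name_maxlen.patch: `guarded_lookup` is a hypothetical wrapper, and `NAME_MAXLEN = 256` is an assumption taken from the array size quoted in the Coverity report.

```python
import unicodedata

# Assumption: 256 is the buffer size quoted in the Coverity report.
NAME_MAXLEN = 256


def guarded_lookup(name: str) -> str:
    """Hypothetical wrapper: reject over-long names before the real lookup."""
    # The C code works on the UTF-8 encoded name, so measure bytes, not code points.
    encoded = name.encode("utf-8")
    if len(encoded) > NAME_MAXLEN:
        # A name longer than the buffer cannot match any stored name,
        # so fail early instead of passing the length further down.
        raise KeyError("undefined character name (too long)")
    return unicodedata.lookup(name)
```

With this guard, `guarded_lookup("LATIN SMALL LETTER A")` still returns the character, while a 1000-character name raises KeyError without its length ever reaching the comparison routine.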
msg256744	Author: Serhiy Storchaka (serhiy.storchaka)	Date: 2015-12-19 23:12
Currently the error message virtually always contains the name (unless its UTF-8 representation is longer than INT_MAX). With unicode_name_maxlen.patch the message no longer contains names that are a few hundred, or even a few tens of, characters long. The proposed patch makes the error message always contain the name, but truncated to NAME_MAXLEN bytes.

>>> name = ''.join(map(chr, range(0x2c80, 0x2ce4)))
>>> unicodedata.lookup(name)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
KeyError: "undefined character name 'ⲀⲁⲂⲃⲄⲅⲆⲇⲈⲉⲊⲋⲌⲍⲎⲏⲐⲑⲒⲓⲔⲕⲖⲗⲘⲙⲚⲛⲜⲝⲞⲟⲠⲡⲢⲣⲤⲥⲦⲧⲨⲩⲪⲫⲬⲭⲮⲯⲰⲱⲲⲳⲴⲵⲶⲷⲸⲹⲺⲻⲼⲽⲾⲿⳀⳁⳂⳃⳄⳅⳆⳇⳈⳉⳊⳋⳌⳍⳎⳏⳐⳑⳒⳓⳔ�...'"
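
The trailing replacement character in that output follows from truncating by bytes rather than by characters. The sketch below reproduces the arithmetic in pure Python; it assumes NAME_MAXLEN is 256 (the array size in the Coverity report) and does not claim to mirror how the patch builds the message in C.

```python
# Assumption: 256 is the buffer size quoted in the Coverity report.
NAME_MAXLEN = 256

# Same name as in the session above: 100 Coptic code points.
name = "".join(map(chr, range(0x2C80, 0x2CE4)))

# Each of these code points takes 3 bytes in UTF-8, so 300 bytes in total.
encoded = name.encode("utf-8")

# Cutting at 256 bytes keeps 85 whole characters (255 bytes) plus one
# dangling lead byte of the 86th character.
truncated = encoded[:NAME_MAXLEN]

# Decoding with errors="replace" turns the dangling byte into U+FFFD.
shown = truncated.decode("utf-8", errors="replace")

print(len(encoded), len(shown), repr(shown[-2:]))
```

The last intact character is U+2CD4, which matches the final Coptic letter visible in the KeyError message before the replacement character.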

History
Date	User	Action	Args
2022-04-11 14:58:15	admin	set	github: 68185
2015-12-19 23:12:32	serhiy.storchaka	set	files: + unicode_name_maxlen_trunc.patch messages: + msg256744 components: + Extension Modules versions: + Python 3.6, - Python 3.3, Python 3.4
2015-04-18 22:32:38	christian.heimes	create