This issue tracker has been migrated to GitHub, and is currently read-only.
For more information, see the GitHub FAQs in the Python's Developer Guide.

classification
Title: Potential memory leak in normalizestring()
Type: resource usage Stage: resolved
Components: Interpreter Core Versions: Python 3.8, Python 3.7, Python 3.6
process
Status: closed Resolution: fixed
Dependencies: Superseder:
Assigned To: Nosy List: josh.r, methane, miss-islington
Priority: normal Keywords: patch

Created on 2018-04-05 12:34 by methane, last changed 2022-04-11 14:58 by admin. This issue is now closed.

Pull Requests
URL Status Linked Edit
PR 6386 merged methane, 2018-04-05 13:21
PR 6393 merged miss-islington, 2018-04-06 06:51
PR 6394 merged miss-islington, 2018-04-06 06:52
Messages (5)
msg314986 - (view) Author: Josh Rosenberg (josh.r) * (Python triager) Date: 2018-04-05 14:15
Patch is good, but while we're at it, is there any reason why this multi-allocation design was even used? It PyMem_Mallocs a buffer, makes a C-style string in it, then uses PyUnicode_FromString to convert C-style string to Python str.

Seems like the correct approach would be to just use PyUnicode_New to preallocate the final string buffer up front, then pull out the internal buffer with PyUnicode_1BYTE_DATA and populate that directly, saving a pointless allocation/deallocation, which also means the failure case means no cleanup needed at all, while barely changing the code (aside from removing the need to explicitly NUL terminate).

Only reason I can see to avoid this would be if the codec names could contain arbitrary Unicode encoded as UTF-8 (and therefore strlen wouldn't tell you the final length in Unicode ordinals), but I'm pretty sure that's not the case (if it is, we're not normalizing properly, since we only lower case ASCII). If Unicode codec names need to be handled, there are other options, though the easy savings go away.
msg315010 - (view) Author: Inada Naoki (methane) * (Python committer) Date: 2018-04-06 06:46
> Only reason I can see to avoid this would be if the codec names could contain arbitrary Unicode encoded as UTF-8 (and therefore strlen wouldn't tell you the final length in Unicode ordinals), but I'm pretty sure that's not the case (if it is, we're not normalizing properly, since we only lower case ASCII). If Unicode codec names need to be handled, there are other options, though the easy savings go away.

Maybe, we can add "encoding name must be ascii" restriction in future version (3.8+).
But for now, I want to avoid any potential backward incompatibility.
msg315011 - (view) Author: Inada Naoki (methane) * (Python committer) Date: 2018-04-06 06:51
New changeset 0c1c4563a65ac451021d927058e4f25013934eb2 by INADA Naoki in branch 'master':
bpo-33231: Fix potential leak in normalizestring() (GH-6386)
https://github.com/python/cpython/commit/0c1c4563a65ac451021d927058e4f25013934eb2
msg315012 - (view) Author: miss-islington (miss-islington) Date: 2018-04-06 07:12
New changeset 64421d9237e33725e3c2916cdf2b6d6da1751c2a by Miss Islington (bot) in branch '3.7':
bpo-33231: Fix potential leak in normalizestring() (GH-6386)
https://github.com/python/cpython/commit/64421d9237e33725e3c2916cdf2b6d6da1751c2a
msg315013 - (view) Author: miss-islington (miss-islington) Date: 2018-04-06 07:37
New changeset 2350a4765265158072bf7ad9f04402406d3d1ada by Miss Islington (bot) in branch '3.6':
bpo-33231: Fix potential leak in normalizestring() (GH-6386)
https://github.com/python/cpython/commit/2350a4765265158072bf7ad9f04402406d3d1ada
History
Date User Action Args
2022-04-11 14:58:59adminsetgithub: 77412
2018-04-06 07:37:49methanesetstatus: open -> closed
stage: patch review -> resolved
2018-04-06 07:37:29methanesetresolution: fixed
2018-04-06 07:37:06miss-islingtonsetmessages: + msg315013
2018-04-06 07:12:40miss-islingtonsetnosy: + miss-islington
messages: + msg315012
2018-04-06 06:52:34miss-islingtonsetpull_requests: + pull_request6102
2018-04-06 06:51:48miss-islingtonsetpull_requests: + pull_request6101
2018-04-06 06:51:29methanesetmessages: + msg315011
2018-04-06 06:46:10methanesetmessages: + msg315010
2018-04-05 14:15:19josh.rsetnosy: + josh.r
messages: + msg314986
2018-04-05 13:21:58methanesetkeywords: + patch
stage: patch review
pull_requests: + pull_request6095
2018-04-05 12:34:19methanecreate