Author vstinner
Recipients vstinner
Date 2014-10-15.20:30:44
Message-id <1413405045.21.0.992047718976.issue22649@psf.upfronthosting.co.za>
In-reply-to
Content
The case_operation() function in Objects/unicodeobject.c implements the case operations: lower, upper, casefold, etc.

Currently, the function uses a buffer of Py_UCS4 and overallocates the buffer by 300%. It sizes the buffer for the worst case: every character replaced by 3 characters.
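To put a number on that worst case, here is a rough Python sketch. It assumes Py_UCS4 is 4 bytes wide, and worst_case_buffer_bytes() is a hypothetical helper for illustration, not a CPython function.

```python
PY_UCS4_SIZE = 4  # assumption: Py_UCS4 is a 32-bit code unit


def worst_case_buffer_bytes(s):
    """Size in bytes of a temporary buffer holding 3 Py_UCS4 per input character."""
    return 3 * len(s) * PY_UCS4_SIZE


text = "hello world"
print(worst_case_buffer_bytes(text))  # bytes reserved for the temporary buffer
print(len(text.upper()))              # code points actually produced

# One character that really does expand to 3 code points when uppercased:
# U+1FD2 -> U+0399 U+0308 U+0300
print(len("\u1fd2".upper()))
```

For plain ASCII input the temporary buffer is far larger than the result, which is the cost the proposal tries to avoid.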

I propose using the _PyUnicodeWriter API to optimize the most common case: each character is replaced by exactly one other character, and the output string uses the same Unicode kind (UCS1, UCS2 or UCS4).
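A small Python sketch of what "the most common case" means. kind() and is_common_case() are hypothetical helpers written for this note. They look at whole strings and only approximate the storage kind from the largest code point, whereas the C code works character by character.

```python
def kind(s):
    """Approximate the storage width CPython would pick, in bytes per character."""
    highest = max(map(ord, s), default=0)
    if highest < 0x100:
        return 1  # UCS1 (ASCII and Latin-1)
    if highest < 0x10000:
        return 2  # UCS2
    return 4      # UCS4


def is_common_case(s, op):
    """True if op keeps both the length and the kind of s unchanged."""
    result = op(s)
    return len(result) == len(s) and kind(result) == kind(s)


print(is_common_case("Hello", str.upper))         # one-to-one, same kind
print(is_common_case("stra\u00dfe", str.upper))   # U+00DF expands to 'SS'
print(is_common_case("\u00ff", str.upper))        # U+00FF -> U+0178 widens the kind
```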

The patch preallocates the writer using the kind of the input string, but in some cases the result uses a lower kind (e.g. latin1 => ASCII). "Special" characters taking the slow path in the unit tests:

- test_capitalize: 'ﬁnnish' => 'FInnish' (ascii)
- test_casefold: 'ß' => 'ss', 'ﬁ' => 'fi'
- test_swapcase: 'ﬁ' => 'FI', 'ß' => 'SS'
- test_title: 'ﬁNNISH' => 'Finnish'
- test_upper: 'ﬁ' => 'FI', 'ß' => 'SS'
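The narrowing in these cases can be checked from Python. This sketch assumes the 'fi' inputs in the list are the U+FB01 ligature and writes every input as an escape so the code point is unambiguous. The exact result strings can differ between Python versions, so only the "result is pure ASCII" property is shown.

```python
# (input, operation) pairs taken from the list above; inputs are non-ASCII.
cases = [
    ("\ufb01nnish", str.capitalize),
    ("\u00df", str.casefold),
    ("\ufb01", str.casefold),
    ("\ufb01", str.swapcase),
    ("\u00df", str.swapcase),
    ("\ufb01NNISH", str.title),
    ("\ufb01", str.upper),
    ("\u00df", str.upper),
]

for text, op in cases:
    result = op(text)
    highest_in = max(map(ord, text))
    # str.isascii() requires Python 3.7 or later.
    print(f"{op.__name__}: max input U+{highest_in:04X} -> {result!r}, "
          f"ascii={result.isascii()}")
```

A writer preallocated with the input's kind is therefore wider than needed for these results, and the final string has to be narrowed.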

The writer only uses overallocation if a character is replaced by more than one character. Bad cases where the length changes:

- test_capitalize: 'ῳῳῼῼ' => 'ΩΙῳῳῳ', 'hİ' => 'Hi̇', 'ῒİ' => 'Ϊ̀i̇', 'ﬁnnish' => 'FInnish'
- test_casefold: 'ß' => 'ss', 'ﬁ' => 'fi'
- test_lower: 'İ' => 'i̇'
- test_swapcase: 'ﬁ' => 'FI', 'İ' => 'i̇', 'ß' => 'SS', 'ῒ' => 'Ϊ̀'
- test_title: 'ﬁNNISH' => 'Finnish'
- test_upper: 'ﬁ' => 'FI', 'ß' => 'SS', 'ῒ' => 'Ϊ̀'
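The allocation strategy described above (preallocate from the input length, switch on overallocation only once a character expands) can be modelled roughly in Python. simulate_writer() is a hypothetical model, not the _PyUnicodeWriter API. The 25% growth margin is an assumption chosen for illustration, and applying the operation one character at a time ignores context-dependent rules.

```python
def simulate_writer(text, op):
    """Model buffer growth; returns (characters written, final capacity, resizes)."""
    capacity = len(text)   # preallocated from the input length
    used = 0
    overallocate = False   # off until a character expands
    resizes = 0
    for ch in text:
        piece = op(ch)
        if len(piece) > 1:
            overallocate = True
        needed = used + len(piece)
        if needed > capacity:
            # Assumed growth policy: exact size, plus 25% once overallocating.
            capacity = needed + (needed // 4 if overallocate else 0)
            resizes += 1
        used = needed
    return used, capacity, resizes


print(simulate_writer("hello", str.upper))        # no expansion, no resize
print(simulate_writer("stra\u00dfe", str.upper))  # U+00DF -> 'SS' forces one resize
```

In this model, a string with no expanding characters never grows past its preallocated size, so the common case pays nothing for the rare one.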
History
Date                 User      Action  Args
2014-10-15 20:30:45  vstinner  set     recipients: + vstinner
2014-10-15 20:30:45  vstinner  set     messageid: <1413405045.21.0.992047718976.issue22649@psf.upfronthosting.co.za>
2014-10-15 20:30:45  vstinner  link    issue22649 messages
2014-10-15 20:30:44  vstinner  create