Message 229497 - Python tracker



Author	vstinner
Recipients	vstinner
Date	2014-10-15.20:30:44
SpamBayes Score	-1.0
Marked as misclassified	Yes
Message-id	<1413405045.21.0.992047718976.issue22649@psf.upfronthosting.co.za>
In-reply-to

Content

The case_operation() in Objects/unicodeobject.c is used for case operations: lower, upper, casefold, etc.

Currently, the function uses a buffer of Py_UCS4 and overallocates it to 300% of the input length. This covers the worst case: one character replaced with 3 characters.
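
To put rough numbers on that worst-case sizing, here is a small sketch. It assumes Py_UCS4 is 4 bytes wide and that the temporary buffer holds 3 code points per input character; the helper name worst_case_buffer_bytes is made up for illustration and is not part of CPython.

```python
# Assumptions for illustration: Py_UCS4 is 4 bytes wide and the temporary
# buffer reserves 3 code points for every input character.
PY_UCS4_SIZE = 4
WORST_CASE_EXPANSION = 3


def worst_case_buffer_bytes(length):
    """Size in bytes of a worst-case Py_UCS4 buffer for `length` characters."""
    return length * WORST_CASE_EXPANSION * PY_UCS4_SIZE


text = "a" * 1000
print(worst_case_buffer_bytes(len(text)))  # 12000 bytes reserved
print(len(text.upper()))                   # 1000 characters actually produced

# Characters whose uppercase form is longer than one character.
for ch in ("\u00df", "\ufb03", "\u0390"):
    mapped = ch.upper()
    print(ascii(ch), "=>", ascii(mapped), "length", len(mapped))
```

For a pure ASCII input, the result needs one byte per character, so most of the temporary buffer is never used.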

I propose to use the _PyUnicodeWriter API to optimize the most common case: each character is replaced by exactly one other character, and the output string uses the same Unicode kind (UCS1, UCS2 or UCS4).
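
The idea can be modelled in pure Python. This is only a toy model, not the patch and not the _PyUnicodeWriter API: the function name upper_with_writer, the list used as a buffer, and the 25% growth margin are all assumptions chosen for illustration.

```python
def upper_with_writer(text):
    """Uppercase `text` with a buffer preallocated to exactly len(text).

    Returns (result, overallocated). `overallocated` is True only if some
    character was replaced by more than one character.
    """
    buf = [""] * len(text)      # exact-size preallocation (fast path)
    pos = 0
    remaining = len(text)
    overallocated = False
    for ch in text:
        mapped = ch.upper()     # one input character, 1 to 3 output characters
        remaining -= 1
        # Slots needed for this mapping plus one slot per unread character.
        required = pos + len(mapped) + remaining
        if required > len(buf):
            # Slow path: grow with some headroom (margin is arbitrary here).
            overallocated = True
            buf.extend([""] * (required - len(buf) + required // 4))
        for out_ch in mapped:
            buf[pos] = out_ch
            pos += 1
    return "".join(buf[:pos]), overallocated


print(upper_with_writer("abc"))          # fast path only
print(upper_with_writer("stra\u00dfe"))  # 'ß' forces the slow path
```

In this model the buffer is resized only after the first expanding character, so strings with one-to-one mappings never pay for the worst case.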

The patch preallocates the writer using the kind of the input string, but in some cases the result uses a lower kind (e.g. latin1 => ASCII). "Special" characters from the unit tests that take the slow path:

- test_capitalize: 'ﬁnnish' => 'FInnish' (ascii)
- test_casefold: 'ß' => 'ss', 'ﬁ' => 'fi'
- test_swapcase: 'ﬁ' => 'FI', 'ß' => 'SS'
- test_title: 'ﬁNNISH' => 'Finnish'
- test_upper: 'ﬁ' => 'FI', 'ß' => 'SS'
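
The kind change can be observed from Python by looking at the highest code point of a string. The helper unicode_kind below is a hypothetical illustration of the kind boundaries, not a CPython function.

```python
def unicode_kind(text):
    """Name the narrowest storage kind able to hold every character of `text`."""
    highest = max(map(ord, text), default=0)
    if highest < 0x80:
        return "ASCII"
    if highest < 0x100:
        return "UCS1 (latin1)"
    if highest < 0x10000:
        return "UCS2"
    return "UCS4"


for source, result in (("\u00df", "\u00df".casefold()),
                       ("\ufb01", "\ufb01".upper())):
    print(ascii(source), unicode_kind(source), "=>",
          ascii(result), unicode_kind(result))
```

Both inputs produce an ASCII result, so a writer preallocated with the input kind ends up wider than the final string requires.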

The writer only uses overallocation if a character is replaced by more than one character. Bad cases where the length changes:

- test_capitalize: 'ῳῳῼῼ' => 'ΩΙῳῳῳ', 'hİ' => 'Hi̇', 'ῒİ' => 'Ϊ̀i̇', 'ﬁnnish' => 'FInnish'
- test_casefold: 'ß' => 'ss', 'ﬁ' => 'fi'
- test_lower: 'İ' => 'i̇'
- test_swapcase: 'ﬁ' => 'FI', 'İ' => 'i̇', 'ß' => 'SS', 'ῒ' => 'Ϊ̀'
- test_title: 'ﬁNNISH' => 'Finnish'
- test_upper: 'ﬁ' => 'FI', 'ß' => 'SS', 'ῒ' => 'Ϊ̀'
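
A few of the cases above can be checked directly with the str methods. The selection below is a sketch that leaves out the capitalize cases, since their exact output may depend on the Python version.

```python
# (operation name, input, method) triples taken from the list above.
cases = [
    ("casefold", "\u00df", str.casefold),      # 'ß' => 'ss'
    ("casefold", "\ufb01", str.casefold),      # 'ﬁ' => 'fi'
    ("lower", "\u0130", str.lower),            # 'İ' => 'i' + combining dot above
    ("swapcase", "\u1fd2", str.swapcase),      # 'ῒ' => three code points
    ("title", "\ufb01NNISH", str.title),       # 'ﬁNNISH' => 'Finnish'
    ("upper", "\u1fd2", str.upper),            # 'ῒ' => three code points
]

for name, text, operation in cases:
    result = operation(text)
    print(f"{name}: {ascii(text)} => {ascii(result)} "
          f"(length {len(text)} -> {len(result)})")
```

Every case here produces a result longer than its input, which is the situation where the writer has to fall back to overallocation.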

History
Date	User	Action	Args
2014-10-15 20:30:45	vstinner	set	recipients: + vstinner
2014-10-15 20:30:45	vstinner	set	messageid: <1413405045.21.0.992047718976.issue22649@psf.upfronthosting.co.za>
2014-10-15 20:30:45	vstinner	link	issue22649 messages
2014-10-15 20:30:44	vstinner	create