
Author: vstinner
Recipients: loewis, serhiy.storchaka, vstinner
Date: 2014-09-03.07:10:35
Content
> Won't this cause a performance regression? When we rarely use the wchar_t-based API, it seems good to cache the encoded value.

Yes, it will be slower. But I prefer slower code with a lower memory footprint. On UNIX, I don't think that anyone will notice the difference.

My concern is that the cache is never released: if the conversion is only needed once at startup, the memory stays allocated until Python exits, which is wasteful.
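
Here is a minimal sketch (not from the original message) contrasting the two conversion styles being compared; the helper name use_wide_string is made up for illustration:

#include <Python.h>

/* Hypothetical helper: contrast the cached and the caller-freed conversions. */
static int
use_wide_string(PyObject *unicode)
{
    /* Cached: the wchar_t* buffer is stored in the str object itself and is
       only released when the object is destroyed (deprecated since 3.3). */
    wchar_t *cached = PyUnicode_AsUnicode(unicode);
    if (cached == NULL)
        return -1;

    /* Uncached: a fresh buffer is allocated on every call; the caller frees
       it, so no memory stays around until Python exits. */
    wchar_t *copy = PyUnicode_AsWideCharString(unicode, NULL);
    if (copy == NULL)
        return -1;
    /* ... use the buffer ... */
    PyMem_Free(copy);
    return 0;
}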

On Windows, converting to wchar_t* is common because Python uses the Windows wide character API (the "W" functions, as opposed to the "A" ANSI code page functions). For example, most filesystem accesses use the wchar_t* type.
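
For illustration, a sketch of that usual pattern (assuming a Windows build; the function name get_file_attributes is hypothetical):

#include <Python.h>
#include <windows.h>

/* Hypothetical module function: pass a Python str to a "W" Win32 call. */
static PyObject *
get_file_attributes(PyObject *self, PyObject *arg)
{
    wchar_t *path = PyUnicode_AsWideCharString(arg, NULL);
    if (path == NULL)
        return NULL;
    /* GetFileAttributesW() is the wide ("W") variant: it takes wchar_t*. */
    DWORD attrs = GetFileAttributesW(path);
    PyMem_Free(path);
    if (attrs == INVALID_FILE_ATTRIBUTES)
        return PyErr_SetFromWindowsErr(0);
    return PyLong_FromUnsignedLong(attrs);
}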

In Python < 3.3, Python was compiled in narrow mode (on Windows), so Unicode strings already stored their characters as wchar_t* internally. Since Python 3.3, Python uses a more compact representation (PEP 393). The wchar_t* representation can share the Unicode data only if sizeof(wchar_t) == kind, where the kind is 1, 2 or 4 bytes per character. Examples: "\u20ac" on Windows (16-bit wchar_t) or "\U0010ffff" on Linux (32-bit wchar_t).
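
A small sketch of that sharing condition (the helper name can_share_wstr is hypothetical):

#include <Python.h>

/* Hypothetical helper: return 1 if a wchar_t* view could reuse the string's
   own PEP 393 buffer, 0 if a converted copy would be needed, -1 on error. */
static int
can_share_wstr(PyObject *unicode)
{
    if (PyUnicode_READY(unicode) < 0)
        return -1;
    int kind = PyUnicode_KIND(unicode);      /* 1, 2 or 4 bytes per character */
    return (size_t)kind == sizeof(wchar_t);  /* 2 on Windows, 4 on most Unix */
}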