Message 215041 - Python tracker


Author	progfou
Recipients	ezio.melotti, lemburg, progfou, vstinner
Date	2014-03-28.12:41:01
Message-id	<1396010463.51.0.766928002364.issue21081@psf.upfronthosting.co.za>
In-reply-to

Content

> * Please provide some background information how widely the encoding is used. I get less than 1000 hits in Google when looking for "TCVN 5712:1993".

Here is the background for the need for this encoding.

The recent laws[0] in Vietnam have set TCVN 6909:2001 (Unicode based) as the standard encoding everybody should use. Still, more than 30 old Vietnamese encodings were in use for decades before that, and some of them are still used (it takes time for people to accept the change and for technicians to do what is required to change technology). Among them, TCVN 5712:1993 was (and is) mostly used in the North of Vietnam, and VNI (a private company's encoding) in the South.

Worse than that, these old encodings use the C0 range (the control-character positions) to store some Vietnamese letters (especially 'ư', one of the most used letters in the language), with the very unpleasant consequence that some software (like OpenOffice/LibreOffice) is unable to render the texts correctly, even with the correct fonts. Since this was a showstopper for Free Software adoption in Vietnam, I decided at that time to create a tool[1][2] to help convert from these old encodings to Unicode. The project was then endorsed by the Ministry of Science and Technology of Vietnam, which asked me to develop it further[3].
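
To make the C0 problem concrete, here is a minimal heuristic sketch that reports control bytes (0x00-0x1F) which ordinary text would not normally contain. The function name, the set of "legitimate" controls and the sample bytes are assumptions made for illustration; they are not taken from any TCVN table or from the converter mentioned above.

```python
# Heuristic sketch: find C0 control bytes that plain text would not normally contain.
# Treating tab, line feed and carriage return as legitimate is an assumption.
LEGITIMATE_CONTROLS = {0x09, 0x0A, 0x0D}


def unexpected_c0_bytes(data):
    """Return the sorted, distinct C0 bytes (0x00-0x1F) found in data,
    ignoring the usual whitespace controls."""
    return sorted({b for b in data if b < 0x20 and b not in LEGITIMATE_CONTROLS})


# Made-up sample bytes, not real TCVN data.
sample = b"plain text\r\n" + bytes([0x01, 0x02, 0x01])
print(unexpected_c0_bytes(sample))
```

A non-empty result only hints that a file may use an encoding that repurposes the C0 range; it does not identify which encoding.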

Even if these old encodings are, hopefully, no longer the most widely used in Vietnam, there are still plenty of old documents (sorry, I can't be more precise about the volume of administrative or private documents) that need to be read, modified or, best, converted to Unicode; this is where the encodings are needed. Today, whenever Vietnamese users (and Laotian users; I'll come back to this in another bug report) want to use OpenOffice/LibreOffice and still be able to open their old documents, they have to install this Python extension.

I foresee that there will be not only plain documents to convert but also databases and other kinds of data storage. This is where Python has a great opportunity to become the tool of choice.
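
A rough sketch of what such a conversion could look like once a codec is available under some name. The helper names are hypothetical, and `cp1258` is used only as a stand-in because it already ships with Python; it is not the encoding proposed in this issue.

```python
from pathlib import Path


def to_utf8(data, source_encoding):
    """Decode legacy-encoded bytes and re-encode them as UTF-8."""
    return data.decode(source_encoding).encode("utf-8")


def convert_file(src, dst, source_encoding):
    """Convert one legacy-encoded text file to a UTF-8 file (hypothetical helper)."""
    Path(dst).write_bytes(to_utf8(Path(src).read_bytes(), source_encoding))


# Stand-in codec: cp1258 is used here only because Python already provides it.
legacy = "ư".encode("cp1258")
print(to_utf8(legacy, "cp1258").decode("utf-8"))
```

The same `to_utf8` step would apply to text columns read from a database, provided the stored bytes are fetched undecoded.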

[0] http://thuvienphapluat.vn/archive/Quyet-dinh-72-2002-QD-TTg-thong-nhat-dung-bo-ma-ky-tu-chu-Viet-TCVN-6909-2001-trao-doi-thong-tin-dien-tu-giua-to-chuc-dang-nha-nuoc-vb49528.aspx
[1] http://wiki.hanoilug.org/projects:ovniconv
[2] http://extensions.services.openoffice.org/project/ovniconv
[3] http://extensions.services.openoffice.org/en/project/b2uconverter


> Now, the encoding was a standard in Vietnam, but it has been updated in 1999 to TCVN 5712:1999.

I have to admit I missed this one. It may explain the differences I saw when I reverse engineered the TCVN encoding by studying the documents Vietnamese users provided to me. I will check this and come back with more details.

> There's also an encoding called VSCII.

VSCII is the same as TCVN 5712:1993.

This page contains interesting information about these encodings: http://www.informatik.uni-leipzig.de/~duc/software/misc/tcvn.txt


> * In the file you write "kind of TCVN 5712:1993 VN3 with CP1252 additions". This won't work, since we can only accept codecs which are based on set standards.

I can understand that, and I'll do my best to check whether it's really based on one of the TCVN standards, be it 5712:1993 or 5712:1999. Still, after years of usage, I know for certain that it's exactly the encoding we need (for the northern part of Vietnam at least).


> It would be better to provide a link to an official Unicode character set mapping table and then use the gencodec.py script on this table.

I saw a reference to this processing tool in the encodings provided with Python and tried to find a Unicode mapping table on the Unicode website, but have not succeeded so far. I'll try harder.
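
For readers unfamiliar with the approach, the sketch below shows roughly how a mapping table can be turned into the kind of charmap codec tables that Python's 8-bit encodings use. It is not gencodec.py itself. The table excerpt follows the usual "0xXX 0xXXXX # name" layout of unicode.org mapping files, but the byte positions are placeholders chosen for illustration, not real TCVN 5712 values, and the helper names are hypothetical.

```python
import codecs

# Placeholder excerpt in the unicode.org mapping-file layout.
# The byte positions are NOT real TCVN 5712 values.
SAMPLE_TABLE = """\
# byte  unicode  name
0x41    0x0041   # LATIN CAPITAL LETTER A
0xA1    0x0103   # LATIN SMALL LETTER A WITH BREVE (placeholder position)
0xA2    0x01B0   # LATIN SMALL LETTER U WITH HORN (placeholder position)
"""


def build_decoding_table(table_text):
    """Build a 256-character decoding table from mapping-file text."""
    # ASCII maps to itself; U+FFFE marks "undefined" in charmap codecs.
    table = [chr(i) for i in range(128)] + ["\ufffe"] * 128
    for line in table_text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments and blank lines
        if not line:
            continue
        byte_hex, code_hex = line.split()[:2]
        table[int(byte_hex, 16)] = chr(int(code_hex, 16))
    return "".join(table)


DECODING_TABLE = build_decoding_table(SAMPLE_TABLE)
ENCODING_TABLE = codecs.charmap_build(DECODING_TABLE)


def decode(data, errors="strict"):
    return codecs.charmap_decode(data, errors, DECODING_TABLE)[0]


def encode(text, errors="strict"):
    return codecs.charmap_encode(text, errors, ENCODING_TABLE)[0]


print(decode(b"A\xa1\xa2"))
```

Bytes that the table leaves undefined raise `UnicodeDecodeError` under the default `strict` error handler, which is why an authoritative mapping table matters more than the code around it.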


> * For Vietnamese, Python already provides cp1258 - how much is this encoding used in comparison to e.g. TCVN 5712:1993 ?

To type Vietnamese efficiently, you need keyboard input software (Vietkey and Unikey being the most used). Microsoft tried to create a dedicated Vietnamese encoding (cp1258) and keyboard layout, but I have never seen or heard of its adoption anywhere. Knowing the way Vietnamese users use their computers, I would say it has probably never been in real use.


> * Vietnamese encodings: http://www.panl10n.net/english/outputs/Survey/Vietnamese.pdf

This sentence lists the most used old encodings in Vietnam: “On the Linux platform, fonts based on Unicode [6], TCVN, VNI and VPS [7] encodings can be adequately used to input Vietnamese text.”

These are the most used not only on Linux (in fact, on Linux we have to use Unicode, mostly because of the problem I explained above) but also on Windows. I don't know the situation on Mac OS or other operating systems, though.

My goal is to add these encodings to Python, to help Vietnam make its move to Unicode.


> * East Asian encodings: http://www.unicode.org/iuc/iuc15/tb1/slides.pdf

This document says: “Context is critical—Unicode is considered the “newer” character set in the context of this talk.” It was written with the goal of positioning Unicode as a replacement for all the charsets it already covers, which then become obsolete. So, of course, from this point of view, all 8-bit Vietnamese charsets are obsolete. But that doesn't mean they are no longer in use, not at all!

History
Date	User	Action	Args
2014-03-28 12:41:03	progfou	set	recipients: + progfou, lemburg, vstinner, ezio.melotti
2014-03-28 12:41:03	progfou	set	messageid: <1396010463.51.0.766928002364.issue21081@psf.upfronthosting.co.za>
2014-03-28 12:41:03	progfou	link	issue21081 messages
2014-03-28 12:41:01	progfou	create