Look at all the available files in the encodes_chars directory.
The all.encodes.common.chars.txt file contains the 1066 compound characters that are common to all 25 encodings.
So if your input text consists only of compound characters listed in all.encodes.common.chars.txt, then auto2unicode will fail. :-(
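The reason for this failure can be illustrated with a minimal sketch. It assumes that detection works by looking for characters that belong to exactly one encode; the table `UNIQUE_CHARS`, the encode names in it and the function `detect_encode` are hypothetical placeholders, not the actual auto2unicode implementation.

```python
# Hypothetical stand-in data: the real unique characters live in the
# encode_name.unique.chars.txt files and are not reproduced here.
UNIQUE_CHARS = {
    "encode_a": {"@1", "@2"},
    "encode_b": {"#9"},
    "encode_c": set(),  # an encode with no unique characters at all
}


def detect_encode(text, unique_chars_by_encode):
    """Return the first encode with a unique character in text, else None."""
    for encode in sorted(unique_chars_by_encode):
        for char in unique_chars_by_encode[encode]:
            if char in text:
                return encode
    return None


print(detect_encode("xy @2 z", UNIQUE_CHARS))  # encode_a
print(detect_encode("xy z", UNIQUE_CHARS))     # None
```

In this sketch, text made up only of shared characters gives the detector nothing to match, so it returns None, and an encode with an empty set can never be returned.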
Look at the files dinamani2utf8.unique.chars.txt, nakkeeran2utf8.unique.chars.txt, murasoli2utf8.unique.chars.txt and webulagam2utf8.unique.chars.txt. They are completely empty!
It seems the characters of these four encodes fall entirely within the 1066 common compound characters in all.encodes.common.chars.txt.
So there is a zero % chance of identifying the dinamani, nakkeeran, murasoli and webulagam encodes by using the auto2unicode function.
Look at the file tam2utf8.unique.chars.txt: it has only one unique compound character.
So there is roughly a (1/1066) x 100 ≈ 0.094 % chance of identifying the tam encode by using the auto2unicode function.
Tip : If you want your input text's encode to be detected automatically, make sure that at least one compound character from the encode_name.unique.chars.txt file is present in your input text. Find your encode's unique compound characters in the encodes_chars directory.
S.No | Encodes | Unique Chars |
---|---|---|
1 | Anjal | 190 |
2 | Bamini | 117 |
3 | Boomi | 100 |
4 | Diacritic | 201 |
5 | Dinakaran | 87 |
6 | Dinamani | 0 |
7 | Dinathanthy | 170 |
8 | Indoweb | 230 |
9 | Kavipriya | 39 |
10 | Koeln | 104 |
11 | Libi | 14 |
12 | Murasoli | 0 |
13 | Mylai | 188 |
14 | Nakkeeran | 0 |
15 | OldVikatan | 92 |
16 | Pallavar | 163 |
17 | Roman | 342 |
18 | Shreelipi | 15 |
19 | Softview | 31 |
20 | Tab | 68 |
21 | Tace | 312 |
22 | Tam | 1 |
23 | Tscii | 211 |
24 | Vanavil | 25 |
25 | Webulagam | 0 |
S.No | Common Compound Characters | Count |
---|---|---|
1 | common characters in all encodes | 1066 |
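As a rough illustration, the same (unique / common) x 100 estimate used above for tam can be applied to other rows of the table. The counts below are copied from the table; the function name `detection_estimate` is hypothetical, and the result is only the document's own back-of-the-envelope figure, not a measured detection rate.

```python
COMMON_COUNT = 1066  # common compound characters, from the table above

# A few rows copied from the unique-character table above.
UNIQUE_COUNTS = {"Roman": 342, "Tscii": 211, "Libi": 14, "Tam": 1, "Dinamani": 0}


def detection_estimate(unique_count, common_count=COMMON_COUNT):
    """Rough percentage estimate: (unique / common) x 100."""
    return unique_count / float(common_count) * 100


# Print the encodes from most to fewest unique characters.
for name, count in sorted(UNIQUE_COUNTS.items(), key=lambda item: -item[1]):
    print("%-10s %6.2f %%" % (name, detection_estimate(count)))
```

Encodes with a count of 0 always come out at 0 %, which matches the statement that they cannot be identified.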
#eg 1:#
>>> text = """¾¢ÕÅûÙÅ÷
«ÕǢ ¾¢ÕìÌÈû """
>>> uni = tscii2unicode(text)
>>> print uni
திருவள்ளுவர் அருளிய திருக்குறள்
The Unicode text for the above TSCII characters is shown above.
#eg 2:#
>>> text = """¾¢ÕÅûÙÅ÷
«ÕǢ ¾¢ÕìÌÈû �"""
>>> uni = auto2unicode(text)
>>> print uni
"Whola! found encode : tscii2utf8"
It prints the above message and returns the converted Unicode characters.
திருவள்ளுவர் அருளிய திருக்குறள்
The Unicode text for the above TSCII characters is shown above.
#eg 3:#
>>> text = """ù£tPùP\[tI
è£n\[nwh ùSô ªþ£ ùaô """
>>> uni = auto2unicode(text)
>>> print uni
"Sorry, couldn't find encode :-(
Need more words to find unique encode out side of 916 common compound characters"
It prints the above message and returns None.
Look closely at the input and output characters of eg 2 and eg 3.
In eg 2, the input text has at least one compound character from tscii2utf8.unique.chars.txt.
But in eg 3, the input consists only of compound characters from all.encodes.common.chars.txt, so the correct encode of the input text could not be found!
For other encodes, users need to look at the files in the encodes_chars directory, identify at least one unique character and insert it into the input text, so that the auto2unicode function can identify the encode for you!
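The advice above can be sketched as a small helper. This is only an illustration: the function `ensure_detectable` and the set `TOY_UNIQUE` are hypothetical, and the real unique characters must be taken from the relevant encode_name.unique.chars.txt file.

```python
def ensure_detectable(text, unique_chars):
    """Return text unchanged if it already contains a unique character,
    otherwise append one so that detection has something to match."""
    if not unique_chars:
        raise ValueError("this encode has no unique characters to insert")
    if any(char in text for char in unique_chars):
        return text
    hint = sorted(unique_chars)[0]  # any unique character would do
    return text + " " + hint


# Hypothetical placeholder characters, not taken from a real encode.
TOY_UNIQUE = {"@1", "@2"}

print(ensure_detectable("xy z", TOY_UNIQUE))     # xy z @1
print(ensure_detectable("xy @2 z", TOY_UNIQUE))  # xy @2 z
```

Note that an inserted character becomes part of the input, so it will also appear in the converted output and may need to be removed afterwards.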
#eg 4:#
The code below shows an example of the reverse engine, i.e. unicode2encode.
>>> tscii = """¾¢ÕÅûÙÅ÷ «ÕǢ ¾¢ÕìÌÈû """
>>> uni_1 = tscii2unicode(tscii)
>>> tscii_from_uni = unicode2tscii(uni_1)
>>> uni_2 = tscii2unicode(tscii_from_uni)
>>> print "Tscii original input :", tscii
>>> print "From tscii to unicode :", uni_1
>>> print "From unicode to tscii :", tscii_from_uni
>>> print "Again back to unicode from above tscii :", uni_2
Output of the above snippet, which converts in both directions:
Tscii original input : ¾¢ÕÅûÙÅ÷ «ÕǢ ¾¢ÕìÌÈû
From tscii to unicode : திருவள்ளுவர் அருளிய திருக்குறள்
From unicode to tscii : ¾¢ÕÅûÙÅ÷ «ÕǢ ¾¢ÕìÌÈû
Again back to unicode from above tscii : திருவள்ளுவர் அருளிய திருக்குறள்
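The round trip shown in eg 4 can also be checked programmatically. The sketch below uses a toy two-character mapping in place of a real encode, so `toy2unicode`, `unicode2toy` and `round_trips` are hypothetical names; with the library, the pair tscii2unicode and unicode2tscii from eg 4 would take their place.

```python
# Toy two-way table standing in for a real encode <-> Unicode mapping.
TOY_TO_UNICODE = {"a": "\u0b85", "k": "\u0b95"}
UNICODE_TO_TOY = dict((uni, enc) for enc, uni in TOY_TO_UNICODE.items())


def toy2unicode(text):
    """Convert toy-encoded text to Unicode, leaving unknown characters as-is."""
    return "".join(TOY_TO_UNICODE.get(ch, ch) for ch in text)


def unicode2toy(text):
    """Convert Unicode text back to the toy encode."""
    return "".join(UNICODE_TO_TOY.get(ch, ch) for ch in text)


def round_trips(encoded, to_unicode, from_unicode):
    """True if encode -> Unicode -> encode -> Unicode gives stable results."""
    uni_1 = to_unicode(encoded)
    back = from_unicode(uni_1)
    uni_2 = to_unicode(back)
    return back == encoded and uni_1 == uni_2


print(round_trips("ak ka", toy2unicode, unicode2toy))  # True
```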
#eg 5:#
The code below shows an example of the auto reverse engine, i.e. unicode2auto.
>>> uni_1 = """திருவள்ளுவர் அருளிய திருக்குறள் """
>>> tscii = unicode2tscii(uni_1)
>>> tscii_sample = tscii.split(' ')[0]
>>> tscii_from_auto = unicode2auto(uni_1, tscii_sample)
>>> uni_2 = auto2unicode(tscii_from_auto)
>>> print "Unicode original input :", uni_1
>>> print "From unicode to tscii :", tscii
>>> print "From unicode to tscii by auto function :", tscii_from_auto
>>> print "Again back to unicode from above tscii by auto function:", uni_2
Output of the above snippet, which converts in both directions by using the auto functions:
Whola! found encode : tscii2utf8
Whola! found encode : tscii2utf8
Unicode original input : திருவள்ளுவர் அருளிய திருக்குறள்
From unicode to tscii : ¾¢ÕÅûÙÅ÷ «ÕǢ ¾¢ÕìÌÈû
From unicode to tscii by auto function : ¾¢ÕÅûÙÅ÷ «ÕǢ ¾¢ÕìÌÈû
Again back to unicode from above tscii by auto function: திருவள்ளுவர் அருளிய திருக்குறள்
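One way to picture what eg 5 does is that the encoded sample is used only to pick the target encode, and the Unicode text is then converted to that encode. The sketch below is an assumption about the general idea, not the library's code: `unicode_to_auto`, `TOY_UNIQUE` and `TOY_ENCODERS` are hypothetical, and upper/lower case stand in for real encodes.

```python
def unicode_to_auto(unicode_text, encoded_sample, unique_chars_by_encode, encoders):
    """Pick the encode whose unique character occurs in the sample, then
    convert unicode_text with that encode's converter (None if no match)."""
    for encode in sorted(unique_chars_by_encode):
        chars = unique_chars_by_encode[encode]
        if any(char in encoded_sample for char in chars):
            return encoders[encode](unicode_text)
    return None


# Toy stand-ins: two fake "encodes" and their converters.
TOY_UNIQUE = {"upper": {"A"}, "lower": {"a"}}
TOY_ENCODERS = {"upper": str.upper, "lower": str.lower}

print(unicode_to_auto("Hello", "xAx", TOY_UNIQUE, TOY_ENCODERS))  # HELLO
print(unicode_to_auto("Hello", "123", TOY_UNIQUE, TOY_ENCODERS))  # None
```

As with auto2unicode, a sample that contains no unique character gives the sketch nothing to match, so it returns None.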
#eg 6:#
Click on the eg 6 header to view a big-file encode-to-Unicode conversion; its inputs and outputs are stored here.
Regards,
Arulalan.T
Date : 09.08.2014