See all the available files of the encodes_chars directory here.

The all.encodes.common.chars.txt file contains the 1066 compound characters that are common among all 25 encodings.

So if your input text falls entirely within the compound characters of all.encodes.common.chars.txt, then auto2unicode will fail. :-(

Look at the files dinamani2utf8.unique.chars.txt, nakkeeran2utf8.unique.chars.txt, murasoli2utf8.unique.chars.txt and webulagam2utf8.unique.chars.txt. They are completely empty!

It seems the characters of these four encodes fall entirely within the 1066 common compound characters of all.encodes.common.chars.txt.

So there is a 0 % chance of identifying the dinamani, nakkeeran, murasoli and webulagam encodes by using the auto2unicode function.

Look at the file tam2utf8.unique.chars.txt: it has only one unique compound character.

So there is only a (1/1066) x 100 ≈ 0.094 % chance of identifying the tam encode by using the auto2unicode function.
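The split between "unique" and "common" characters described above can be sketched with plain set operations. This is only an illustration under stated assumptions: the table names and characters below are invented toy data, the helper `split_unique_and_common` is hypothetical, and a character is assumed to be "common" as soon as more than one encode contains it. The real files in encodes_chars may have been produced differently.

```python
from collections import Counter

# Toy stand-ins for the per-encode character tables (invented data,
# not the real contents of the encodes_chars files).
toy_encodes = {
    "enc_a": {"x1", "x2", "a9"},
    "enc_b": {"x1", "x2", "b7", "b8"},
    "enc_c": {"x1", "x2"},
}


def split_unique_and_common(encodes):
    """Return (unique_by_encode, common) for a dict of name -> set of chars."""
    # Count how many encodes contain each character.
    counts = Counter(ch for chars in encodes.values() for ch in chars)
    # Assumption: a character is "common" when more than one encode has it.
    common = {ch for ch, n in counts.items() if n > 1}
    # Whatever is left in an encode after removing the common ones is unique.
    unique_by_encode = {name: chars - common for name, chars in encodes.items()}
    return unique_by_encode, common


unique_by_encode, common = split_unique_and_common(toy_encodes)
for name in sorted(unique_by_encode):
    print(name, sorted(unique_by_encode[name]))
print("common", sorted(common))
```

In this toy data `enc_c` ends up with an empty unique set, which mirrors the situation of the encodes whose unique.chars.txt files are empty: nothing in their text can distinguish them from the others.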

Tip : If you want the encode of your input text to be detected automatically, make sure that at least one compound character from the encode_name.unique.chars.txt file is present in your input text. Find your encode's unique compound characters here.

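| S.No | Encodes | Unique Chars |
|------|---------|--------------|
| 1 | Anjal | 190 |
| 2 | Bamini | 117 |
| 3 | Boomi | 100 |
| 4 | Diacritic | 201 |
| 5 | Dinakaran | 87 |
| 6 | Dinamani | 0 |
| 7 | Dinathanthy | 170 |
| 8 | Indoweb | 230 |
| 9 | Kavipriya | 39 |
| 10 | Koeln | 104 |
| 11 | Libi | 14 |
| 12 | Murasoli | 0 |
| 13 | Mylai | 188 |
| 14 | Nakkeeran | 0 |
| 15 | OldVikatan | 92 |
| 16 | Pallavar | 163 |
| 17 | Roman | 342 |
| 18 | Shreelipi | 15 |
| 19 | Softview | 31 |
| 20 | Tab | 68 |
| 21 | Tace | 312 |
| 22 | Tam | 1 |
| 23 | Tscii | 211 |
| 24 | Vanavil | 25 |
| 25 | Webulagam | 0 |

| S.No | Common Compound Characters | Count |
|------|----------------------------|-------|
| 1 | common characters in all encodes | 1066 |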

#eg 1:#

>>> text = """¾¢ÕÅûÙÅ÷ 
«ÕǢ ¾¢ÕìÌÈû  """

>>> uni = tscii2unicode(text)
>>> print uni

திருவள்ளுவர் அருளிய திருக்குறள்

The Unicode text for the TSCII input is shown above.

#eg 2:#

>>> text = """¾¢ÕÅûÙÅ÷ 
«ÕǢ ¾¢ÕìÌÈû  """

>>> uni = auto2unicode(text)
>>> print uni

"Whola! found encode : tscii2utf8"

It prints the above message and returns the converted Unicode characters.

திருவள்ளுவர் அருளிய திருக்குறள்

The Unicode text for the TSCII input is shown above.

#eg 3:#

>>> text = """ù£tPùP\[tI
è£n\[nwh ùSô ªþ£ ùaô """

>>> uni = auto2unicode(text)
>>> print uni

"Sorry, couldn't find encode :-(

Need more words to find unique encode out side of 916 common compound characters"

It prints the above message and returns None.

Look closely at the input and output characters of eg 2 and eg 3.

In eg 2, the input text has at least one compound character from tscii2utf8.unique.chars.txt.

But in eg 3, the input falls entirely within the compound characters of all.encodes.common.chars.txt, so the correct encode of the input text could not be found!

For other encodes, look at the files in the encodes_chars directory, identify at least one unique character, and insert it in the input text so that the auto2unicode function can identify the encode for you!
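The difference between eg 2 and eg 3 can be illustrated with a small sketch of detection by unique characters. This is not the actual implementation of auto2unicode; `detect_encode`, the encode names and the marker characters are all hypothetical and chosen only to show why text made purely of common characters cannot be identified.

```python
def detect_encode(text, unique_by_encode):
    """Return the first encode (in name order) with a unique char in text, else None."""
    for name in sorted(unique_by_encode):
        # An encode matches as soon as one of its unique characters occurs.
        if any(ch in text for ch in unique_by_encode[name]):
            return name
    # Only common characters were seen, so no encode can be singled out.
    return None


# Invented toy data: each encode owns a few marker characters.
toy_unique = {
    "enc_a": {"@"},
    "enc_b": {"#", "$"},
    "enc_c": set(),  # no unique characters, so it can never be detected
}

print(detect_encode("xy#z", toy_unique))  # contains "#", unique to enc_b
print(detect_encode("xyz", toy_unique))   # no unique character present
```

The first call finds `enc_b`, while the second returns None, which corresponds to the "couldn't find encode" case of eg 3.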

#eg 4:#

The code below shows an example of the reverse engine, i.e. unicode2encode.

>>> tscii = """¾¢ÕÅûÙÅ÷ «ÕǢ ¾¢ÕìÌÈû  """
>>> uni_1 = tscii2unicode(tscii)
>>> tscii_from_uni = unicode2tscii(uni_1)
>>> uni_2 = tscii2unicode(tscii_from_uni)

>>> print "Tscii original input :", tscii
>>> print "From tscii to unicode :", uni_1 
>>> print "From unicode to tscii :", tscii_from_uni
>>> print "Again back to unicode from above tscii :", uni_2

Output of the above snippet, which converts in both directions:

Tscii original input : ¾¢ÕÅûÙÅ÷ «ÕǢ ¾¢ÕìÌÈû

From tscii to unicode : திருவள்ளுவர் அருளிய திருக்குறள்

From unicode to tscii : ¾¢ÕÅûÙÅ÷ «ÕǢ ¾¢ÕìÌÈû

Again back to unicode from above tscii : திருவள்ளுவர் அருளிய திருக்குறள்
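One way to picture a reverse engine is to invert the forward lookup table and replace the longest matching entries first. The sketch below assumes a one-to-one table and uses an invented three-entry toy table, not the real TSCII table; `convert` is a hypothetical helper and does not describe how unicode2tscii is actually written.

```python
import re

# Invented toy table: encoded token -> Unicode text (not the real TSCII table).
toy_encode2uni = {"q": "த", "qw": "தி", "e": "ர"}


def convert(text, table):
    """Replace every key of `table` found in `text`, preferring longer keys."""
    if not table:
        return text
    # Longer keys go first so that "qw" wins over its prefix "q".
    keys = sorted(table, key=len, reverse=True)
    pattern = re.compile("|".join(re.escape(k) for k in keys))
    return pattern.sub(lambda m: table[m.group(0)], text)


# The reverse table is the forward table with keys and values swapped.
# This only works when no two encoded tokens map to the same Unicode text.
toy_uni2encode = {uni: enc for enc, uni in toy_encode2uni.items()}

encoded = "qw e q"
uni = convert(encoded, toy_encode2uni)
back = convert(uni, toy_uni2encode)
print(uni)
print(back == encoded)
```

With a one-to-one table the round trip returns the original encoded text, which is the same property eg 4 demonstrates with tscii2unicode and unicode2tscii.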

#eg 5:# The code below shows an example of the auto reverse engine, i.e. unicode2auto

>>> uni_1 = """திருவள்ளுவர் அருளிய திருக்குறள்    """
>>> tscii = unicode2tscii(uni_1)
>>> tscii_sample = tscii.split(' ')[0]
>>> tscii_from_auto = unicode2auto(uni_1, tscii_sample)
>>> uni_2 = auto2unicode(tscii_from_auto)

>>> print "Unicode original input :", uni_1
>>> print "From unicode to tscii :", tscii  
>>> print "From unicode to tscii by auto function :", tscii_from_auto
>>> print "Again back to unicode from above tscii by auto function:", uni_2

Output of the above snippet, which converts in both directions using the auto functions:

Whola! found encode : tscii2utf8

Whola! found encode : tscii2utf8

Unicode original input : திருவள்ளுவர் அருளிய திருக்குறள்

From unicode to tscii : ¾¢ÕÅûÙÅ÷ «ÕǢ ¾¢ÕìÌÈû

From unicode to tscii by auto function : ¾¢ÕÅûÙÅ÷ «ÕǢ ¾¢ÕìÌÈû

Again back to unicode from above tscii by auto function: திருவள்ளுவர் அருளிய திருக்குறள்

#eg 6:#

Click on the eg 6 header to view the conversion of a big file from an encode to Unicode; its inputs and outputs are stored here.


Regards,

Arulalan.T

Date : 09.08.2014