forked from x1angli/cvt2utf
Converts text files or source code files into UTF-8 encoding
============================================================
| This lightweight tool converts files in non-UTF encodings (such as GB2312,
  GBK, or BIG5) to UTF-8.
| It can be run from the command line (CLI) or imported into
  other Python code.
Installation
------------
Automatic Installation (recommended)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
#. Make sure Python 3, along with pip, is properly installed.
#. In your CLI, execute ``pip install convert2utf``
Manual Installation (for developers only)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
#. Make sure Python 3 is properly installed.
#. Clone this project, or just download the .zip file from github.com
and unarchive it
#. Open a command line (CLI) and change into the local folder
#. Set up a Python virtual environment with ``virtualenv ...`` or
``python -m venv ...``
#. Run: ``pip install -r requirements.txt``
Usage
-----
There is only one mandatory argument: filename, where you can specify
the directory or file name.
- **Batch mode**: Pass in a directory as the input, and all text
files underneath it that meet the criteria will be converted to
UTF-8.
- **Single file mode**: If the input argument is an individual file,
it is converted directly to UTF-8.
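
Batch mode can be pictured as a recursive directory walk that filters files by extension. The sketch below is only an illustration of that idea, not the tool's actual code: the helper name ``find_candidates`` is hypothetical, and the default of ``txt`` is assumed from the first example in this README.

```python
import os
import tempfile


def find_candidates(base_dir, exts=("txt",)):
    """Yield paths under base_dir whose extension is in exts.

    Hypothetical helper for illustration; not part of cvt2utf.
    """
    wanted = {"." + e.lower().lstrip(".") for e in exts}
    for root, _dirs, files in os.walk(base_dir):
        for name in sorted(files):
            # Compare extensions case-insensitively (.TXT matches txt).
            if os.path.splitext(name)[1].lower() in wanted:
                yield os.path.join(root, name)


# Small demonstration on a throwaway directory tree.
with tempfile.TemporaryDirectory() as base:
    os.makedirs(os.path.join(base, "sub"))
    for rel in ("a.txt", "b.csv", os.path.join("sub", "c.TXT")):
        with open(os.path.join(base, rel), "wb") as fh:
            fh.write(b"demo")
    found = sorted(os.path.relpath(p, base) for p in find_candidates(base))
    found_csv = [os.path.basename(p) for p in find_candidates(base, ["csv"])]
```

Here ``found`` holds the two text files (including the one in the subdirectory), while ``b.csv`` is only picked up when ``csv`` is passed explicitly.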
**Examples:**
- Change all .txt files to UTF-8 encoding.
Any byte-order marks (BOMs, also called "signatures") in existing
UTF-8 files will be removed.
``python cvt2utf.py "D:\mynotebook"``
Afterwards, you can use any text editor (e.g. `Notepad++
<https://notepad-plus-plus.org/>`_) to verify that the text files underneath
the specified folder have been converted to UTF-8.
- Change all .csv files to UTF-8 encoding, keeping the BOM because some
applications (such as Microsoft Excel) rely on it.
``python cvt2utf.py "D:\mynotebook" --exts csv --keepbom``
- Convert all .php, .js, .java, and .py files to UTF-8 encoding.
All BOMs are removed as well. They are a real nuisance in
source code files!
``python cvt2utf.py "D:\workspace" --exts php js java py``
- After manually verifying that the new UTF-8 files are correct, you can
remove all .bak files.
``python cvt2utf.py "D:\workspace" --cleanbak``
- Alternatively, if you trust Python's built-in encoding
and decoding, you can convert files without creating backups.
Do **NOT** use this option unless you know what you are doing.
``python cvt2utf.py "D:\workspace" --overwrite``
- Convert an individual file.
``python cvt2utf.py "D:\workspace\a.txt"``
- Show help information
``python cvt2utf.py -h``
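
To make the effect of the default behaviour and of ``--keepbom`` concrete, here is a minimal sketch of re-encoding bytes as UTF-8. It is not the tool's implementation: the helper ``to_utf8`` and its ``keep_bom`` parameter are hypothetical names, and the source encoding is assumed to be known in advance, whereas a real converter has to determine it somehow.

```python
import codecs


def to_utf8(raw, source_encoding, keep_bom=False):
    """Re-encode raw bytes as UTF-8 (hypothetical helper, not the tool's API)."""
    text = raw.decode(source_encoding)
    # A BOM decodes to a leading U+FEFF; drop it so it is never duplicated.
    if text.startswith("\ufeff"):
        text = text[1:]
    body = text.encode("utf-8")
    # Re-attach the 3-byte UTF-8 signature only when asked to.
    return codecs.BOM_UTF8 + body if keep_bom else body


gbk_bytes = "你好".encode("gbk")
plain = to_utf8(gbk_bytes, "gbk")                   # UTF-8 without BOM
with_bom = to_utf8(gbk_bytes, "gbk", keep_bom=True)  # UTF-8 with BOM
stripped = to_utf8(with_bom, "utf-8")               # BOM removed again
```

Running an existing UTF-8 file with a BOM through the same function yields the BOM-free bytes, which mirrors the "BOMs will be removed" behaviour described in the first example.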
(Linux only) Directly run the program
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Sometimes you may want to run the program without specifying the Python
interpreter, for example:
::
./cvt2utf.py ~/mynotebooks
(Note that the leading ``python`` command is omitted here.)
To achieve this, first grant execute permission on the Python
script (skip this step if it already has execute
permission):
::
sudo chmod +x ./cvt2utf.py
Then activate the virtual environment:
::
. venv/bin/activate
If all dependencies are already installed in your
default Python environment, or you have already activated the virtual
environment, you can skip this step.
Then make sure the dependencies are installed:
::
pip install -r requirements.txt
Finally, execute the file (you can add command-line arguments here):
::
./cvt2utf.py ~/the/base/dir
You might want to use an absolute path to this program if you are running
it from an arbitrary working directory.
(For developers) Programmatically use this Python module
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Python programmers who want to use this module can do so as follows:
::
>>> from cvt2utf import Convert2Utf8
>>> cvt2utf = Convert2Utf8(['php', 'css', 'htm', 'html', 'js'], False, False)
>>> cvt2utf.run('D:\\workspace')
>>> cvt2utf.run('D:\\another\\folder')
| Note: the constructor ``Convert2Utf8()`` takes 3 arguments: the extension
  list, the switch to keep the BOM, and the switch for direct-overwriting mode.
| These arguments behave the same as their command-line counterparts.
Miscellaneous
-------------
By default, the converted output files will **NOT** contain a BOM
(byte order mark). To learn what a BOM is and what it implies,
see https://en.wikipedia.org/wiki/Byte_order_mark
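
For a quick check of whether some bytes start with the UTF-8 BOM, the standard library's ``codecs.BOM_UTF8`` constant and the ``utf-8-sig`` codec are enough. The helper name ``has_utf8_bom`` below is hypothetical and only serves the illustration.

```python
import codecs


def has_utf8_bom(data):
    """Return True if data starts with the 3-byte UTF-8 signature EF BB BF."""
    return data.startswith(codecs.BOM_UTF8)


sample = "héllo".encode("utf-8")      # plain UTF-8, no BOM
signed = "héllo".encode("utf-8-sig")  # the utf-8-sig codec prepends the BOM

# Decoding with utf-8-sig drops the BOM; plain utf-8 keeps it as U+FEFF.
clean_text = signed.decode("utf-8-sig")
leaky_text = signed.decode("utf-8")
```

The difference between ``clean_text`` and ``leaky_text`` is the invisible ``U+FEFF`` character that tends to confuse tools which do not expect a BOM.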
FAQ
---
Why do we choose UTF-8 among all charsets?
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
**A**: For i18n, UTF-8 is widespread. It is the de facto standard for
non-English text.
Compared with UTF-16, UTF-8 is usually more compact while still representing text "with full
fidelity". It also doesn't suffer from the endianness issue of UTF-16.
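
The size and endianness points can be checked directly in Python. This is a small illustration using arbitrary sample strings; note that "usually more compact" depends on the text, since CJK characters take three bytes in UTF-8 but two in UTF-16.

```python
ascii_text = "hello, world"
cjk_text = "你好"

# ASCII-heavy text: one byte per character in UTF-8, two in UTF-16.
ascii_utf8 = ascii_text.encode("utf-8")
ascii_utf16 = ascii_text.encode("utf-16-le")

# CJK text: three bytes per character in UTF-8, two in UTF-16.
cjk_utf8 = cjk_text.encode("utf-8")
cjk_utf16 = cjk_text.encode("utf-16-le")

# UTF-16 has two byte orders that yield different bytes for the same text;
# UTF-8 has a single byte order, so this ambiguity cannot arise.
little = "A".encode("utf-16-le")
big = "A".encode("utf-16-be")
```

For source code and most markup, which are dominated by ASCII characters, the UTF-8 form is the smaller one.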
Why do we need this tool?
^^^^^^^^^^^^^^^^^^^^^^^^^
**A**: Indeed, there are plenty of text editors out there (such as
Notepad++) that handle various text encodings very well. Yet
for **batch conversion** we need this Python script. This
script is also written for educational purposes -- developers can read
it to get an idea of how to handle text encoding.
Why should we remove BOMs (byte order mark) rather than add them?
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
**A**: Most compilers and interpreters handle UTF-8 source code
files very well, provided that those files are encoded **without** a BOM. Some
compilers/interpreters might fail or give unexpected output whenever a BOM
is present. For this reason, I strongly advise removing the BOM
whenever we use UTF-8 encoding.
Side note: of course, there are certain situations where BOMs are
preferred. (For example, Microsoft Excel cannot correctly parse UTF-8 CSV files
with international characters unless they have a BOM.) Such situations are rare.
Overall, the case for removing BOMs outweighs other concerns.
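
As one concrete case of a parser tripping over a BOM, Python's own ``json`` module rejects a string that begins with ``U+FEFF``. The payload below is made up for the illustration.

```python
import json

payload = '{"name": "cvt2utf"}'
with_bom = "\ufeff" + payload  # what you get after decoding BOM-prefixed bytes as plain UTF-8

try:
    json.loads(with_bom)
    bom_rejected = False
except ValueError:
    # json.JSONDecodeError is a subclass of ValueError.
    bom_rejected = True

# The same payload without the BOM parses fine.
parsed = json.loads(payload)
```

Stripping the BOM before parsing (for instance by decoding with ``utf-8-sig``) avoids the error.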
Shall we trust this program?
^^^^^^^^^^^^^^^^^^^^^^^^^^^^
**A**: This code is still in its "beta" phase. We are striving to
deliver highly reliable solutions to our users. You might be aware that
Python's built-in encoding/decoding mechanism may not be very reliable.
For that reason, we suggest users create backups, either manually
by duplicating the file/directory, or automatically through our package
(remember, the backup feature is bypassed by the
``--overwrite`` switch).
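
If you prefer to take backups yourself, the idea can be sketched as "copy first, then convert in place". This is an assumption-laden illustration rather than the package's own backup logic: the helper ``convert_with_backup`` is hypothetical, the backup name is simply the original path plus ``.bak``, and the source encoding is assumed to be known.

```python
import os
import shutil
import tempfile


def convert_with_backup(path, source_encoding):
    """Copy path to path + '.bak', then rewrite path as UTF-8.

    Hypothetical helper; not part of cvt2utf.
    """
    backup = path + ".bak"
    shutil.copy2(path, backup)  # keep the untouched original bytes
    with open(path, "rb") as fh:
        text = fh.read().decode(source_encoding)
    with open(path, "wb") as fh:
        fh.write(text.encode("utf-8"))
    return backup


# Demonstration on a temporary Big5-encoded file.
with tempfile.TemporaryDirectory() as tmp:
    target = os.path.join(tmp, "note.txt")
    with open(target, "wb") as fh:
        fh.write("中文".encode("big5"))
    backup_path = convert_with_backup(target, "big5")
    with open(target, "rb") as fh:
        converted = fh.read()
    with open(backup_path, "rb") as fh:
        original = fh.read()
```

Comparing the decoded backup with the decoded result is a cheap way to confirm the conversion before deleting the ``.bak`` file.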