Update to 0.0.2
alanshi committed Feb 9, 2023
1 parent 1093a2a commit c89a945
Showing 6 changed files with 100 additions and 9 deletions.
3 changes: 2 additions & 1 deletion .gitignore
@@ -2,4 +2,5 @@ build
*.spec
venv
.DS_Store
.idea/
.idea/
charset_mnbvc.egg-info
21 changes: 21 additions & 0 deletions LICENSE
@@ -0,0 +1,21 @@
MIT License

Copyright (c) 2023 Alan Shi

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.
57 changes: 49 additions & 8 deletions README.md
@@ -6,12 +6,12 @@
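2. Try to decode the content with five encodings: ```utf_8```, ```utf_16```, ```gb18030```, ```gb2312```, ```big5```
3. Run a regular-expression match for Chinese strings and common Chinese characters against each decoded result; a result that matches indicates that the encoding fits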
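
The decode-then-match steps above can be sketched roughly as follows. This is a minimal illustration and not the code in `charset_mnbvc`: the function name `guess_encodings`, the `CHINESE_RUN` pattern, and the choice to require two consecutive CJK characters are all assumptions made for the example.

```python
import re
from typing import List

# Candidate codecs, in the order listed above.
CANDIDATE_ENCODINGS = ["utf_8", "utf_16", "gb18030", "gb2312", "big5"]

# Hypothetical pattern: a run of at least two CJK Unified Ideographs.
# The project's real patterns (Chinese strings plus common characters) may differ.
CHINESE_RUN = re.compile(r"[\u4e00-\u9fff]{2,}")


def guess_encodings(data: bytes) -> List[str]:
    """Return every candidate codec that decodes `data` and yields Chinese text."""
    matches = []
    for name in CANDIDATE_ENCODINGS:
        try:
            text = data.decode(name)
        except UnicodeDecodeError:
            # The bytes are not valid in this codec; try the next one.
            continue
        if CHINESE_RUN.search(text):
            matches.append(name)
    return matches


if __name__ == "__main__":
    sample = "中文编码检测".encode("gb18030")
    print(guess_encodings(sample))
```

More than one codec can pass this check for the same bytes (for instance, `gb2312` text is also valid `gb18030`), so a real detector needs some rule for choosing among the candidates that match.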

#### Installing the module
```
pip install charset-mnbvc
```

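#### Usage
* chinese_charset_detect.py -i <inputDirectory> (inputDirectory is the directory to scan)
* The dist directory contains an executable for macOS; a Windows build has not been packaged yet, and help compiling one is welcome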

#### Calling the module
##### Get the encoding of every file in a folder
```
from charset_mnbvc import api
@@ -36,15 +36,13 @@ print(f"文件名: {file_path}, 编码: {coding_name}")
```


#### Example using the executable:
#### Complete usage example:
```
./dist/chinese_charset_detect -i tests
or
python chinese_charset_detect.py -i tests
```

```
#### Test results:
```
文件名: tests/.DS_Store, 编码: unknow
文件名: tests/fixtures/test4.txt, 编码: gb18030
文件名: tests/fixtures/1045.txt, 编码: gb18030
@@ -56,4 +54,47 @@ python chinese_charset_detect.py -i tests
总文件数: 8
总耗时长: 0.5920612812042236
```

chinese_charset_detect.py
```
import time
import sys
import getopt

from charset_mnbvc import api

# Usage text; the Chinese part means "inputDirectory is the directory to scan".
USAGE = 'chinese_charset_detect.py -i <inputDirectory> inputDirectory为需要检测的目录'


def main(argv):
    ifolder_path = ""
    try:
        opts, args = getopt.getopt(argv, "hi:o:", ["ifolder_path="])
    except getopt.GetoptError:
        print(USAGE)
        sys.exit(2)
    for opt, arg in opts:
        if opt == '-h':
            print(USAGE)
            sys.exit()
        elif opt in ("-i", "--ifolder_path"):
            ifolder_path = arg

    start = time.time()
    file_count, results = api.from_dir(
        folder_path=ifolder_path,
    )
    # result[0] is the file path, result[1] the detected encoding.
    for result in results:
        print(f"文件名: {result[0]}, 编码: {result[1]}")  # "File name: ..., encoding: ..."
    print(f"总文件数: {file_count}")  # total number of files
    end = time.time()
    print(f"总耗时长: {end - start}")  # total elapsed time in seconds


if __name__ == "__main__":
    try:
        main(sys.argv[1:])
    except Exception:
        print(USAGE)
        sys.exit(2)
```
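
As an aside, the same `-i` / `--ifolder_path` option could also be parsed with the standard-library `argparse` module, which generates the `-h` help text automatically. The sketch below swaps `getopt` for `argparse` and is not part of this commit; `build_parser` is a hypothetical helper name, and making the option required is an assumption.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser that mirrors the script's -i / --ifolder_path option."""
    parser = argparse.ArgumentParser(
        prog="chinese_charset_detect.py",
        description="Detect the encoding of every file in a directory.",
    )
    parser.add_argument(
        "-i",
        "--ifolder_path",
        required=True,  # assumption: the script is not useful without a directory
        metavar="inputDirectory",
        help="directory to scan",
    )
    return parser


if __name__ == "__main__":
    # Parse a fixed argument list so the example runs without command-line input.
    args = build_parser().parse_args(["-i", "tests"])
    print(args.ifolder_path)
```

The parsed value would then be passed on as `api.from_dir(folder_path=args.ifolder_path)`, as in the script above.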
Binary file added dist/charset_mnbvc-0.0.2.tar.gz
Binary file not shown.
Binary file removed dist/chinese_charset_detect
Binary file not shown.
28 changes: 28 additions & 0 deletions setup.py
@@ -0,0 +1,28 @@
from setuptools import setup, find_packages
import pathlib

here = pathlib.Path(__file__).parent.resolve()

long_description = (here / "README.md").read_text(encoding="utf-8")


setup(
    name="charset_mnbvc",
    version="0.0.2",
    # English: "This project aims to quickly detect the encoding of large numbers
    # of text files, to support data cleaning for the mnbvc corpus project."
    description="本项目旨在对大量文本文件进行快速编码检测以辅助mnbvc语料集项目的数据清洗工作",
    url="https://github.com/alanshi/charset_mnbvc",
    author="Alan Shi",
    author_email="[email protected]",
    long_description=long_description,
    long_description_content_type="text/markdown",
    classifiers=[
        "License :: OSI Approved :: MIT License",
        "Programming Language :: Python :: 3"
    ],
    packages=['charset_mnbvc'],
    python_requires=">=3.7",
    project_urls={  # Optional
        "Bug Reports": "https://github.com/alanshi/charset_mnbvc/issues",
        "Source": "https://github.com/alanshi/charset_mnbvc/",
    },
)
