migrate from elit
imgarylai committed Nov 30, 2018
0 parents, commit 7ad0d9b
Showing 14 changed files with 1,440 additions and 0 deletions.
180 changes: 180 additions & 0 deletions .gitignore
@@ -0,0 +1,180 @@
*.DS_Store
.AppleDouble
.LSOverride

# Icon must end with two \r
Icon


# Thumbnails
._*

# Files that might appear in the root of a volume
.DocumentRevisions-V100
.fseventsd
.Spotlight-V100
.TemporaryItems
.Trashes
.VolumeIcon.icns
.com.apple.timemachine.donotpresent

# Directories potentially created on remote AFP share
.AppleDB
.AppleDesktop
Network Trash Folder
Temporary Items
.apdisk

# Xcode
#
# gitignore contributors: remember to update Global/Xcode.gitignore, Objective-C.gitignore & Swift.gitignore

## Build generated
build/
DerivedData/

## Various settings
*.pbxuser
!default.pbxuser
*.mode1v3
!default.mode1v3
*.mode2v3
!default.mode2v3
*.perspectivev3
!default.perspectivev3
xcuserdata/

## Other
*.moved-aside
*.xccheckout
*.xcscmblueprint

# Byte-compiled / optimized / DLL files
__pycache__/
*.py[cod]
*$py.class

# C extensions
*.so

# Distribution / packaging
.Python
build/
develop-eggs/
dist/
downloads/
eggs/
.eggs/
lib/
lib64/
parts/
sdist/
var/
wheels/
*.egg-info/
.installed.cfg
*.egg

# PyInstaller
# Usually these files are written by a python script from a template
# before PyInstaller builds the exe, so as to inject date/other infos into it.
*.manifest
*.spec

# Installer logs
pip-log.txt
pip-delete-this-directory.txt

# Unit test / coverage reports
htmlcov/
.tox/
.coverage
.coverage.*
.cache
nosetests.xml
coverage.xml
*.cover
.hypothesis/

# Translations
*.mo
*.pot

# Django stuff:
*.log
local_settings.py

# Flask stuff:
instance/
.webassets-cache

# Scrapy stuff:
.scrapy

# Sphinx documentation
docs/_build/

# PyBuilder
target/

# Jupyter Notebook
.ipynb_checkpoints

# pyenv
.python-version

# celery beat schedule file
celerybeat-schedule

# SageMath parsed files
*.sage.py

# Environments
.env
.venv
env/
venv/
ENV/

# Spyder project settings
.spyderproject
.spyproject

# Rope project settings
.ropeproject

# mkdocs documentation
/site

# mypy
.mypy_cache/

.idea/
.vscode/

corpus.friends+nyt+wiki+amazon.fasttext.skip.d100.bin
nyt-wiki.min10.pos-amb.bin
wsj-pos.dev.gold.tsv
wsj-pos.trn.gold.tsv

# cpp
cmake-build-debug/

elit/version.py
tmp/
resources/*

.python-version

# Ignore large files.
data/*
result/*
elit/dat/*
/.pytest_cache/
# Corpus and models are placed under data
data
*.gln
*.pkl
*.bin
*.p
*.params
1 change: 1 addition & 0 deletions readme.md
@@ -0,0 +1 @@
# Tokenizer
1 change: 1 addition & 0 deletions requirements.txt
@@ -0,0 +1 @@
elit==0.1.26
19 changes: 19 additions & 0 deletions tokenizer/__init__.py
@@ -0,0 +1,19 @@
# ========================================================================
# Copyright 2018 ELIT
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# ========================================================================

__author__ = "Gary Lai"

from .tokenizer import Tokenizer, EnglishTokenizer, SpaceTokenizer
17 changes: 17 additions & 0 deletions tokenizer/resources/__init__.py
@@ -0,0 +1,17 @@
# ========================================================================
# Copyright 2018 ELIT
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# ========================================================================

__author__ = "Gary Lai"
159 changes: 159 additions & 0 deletions tokenizer/resources/english_abbreviation_period.txt
@@ -0,0 +1,159 @@
adm
al
alt
aly
anx
apr
assn
assns
atty
attys
aug
ave
avn
bancorp
bhd
blvd
boul
brig
bro
bros
byp
capt
capts
cir
cmdr
cmdrs
co
col
cols
comdr
comdrs
comm
comms
comp
cor
corp
corps
cos
cpl
cpls
cpt
cpts
cwt
dea
dec
dept
depts
det
div
dr
drc
drs
eg
elec
esq
est
ests
etc
exp
ext
exts
feb
fig
fr
fri
ft
fur
gal
gen
gens
gov
govs
hon
hrs
ibid
ie
inc
inst
intl
jan
jr
jul
jun
lb
lbs
lea
lib
lieut
lieuts
ln
lt
ltd
lts
maj
majs
mar
messrs
mfg
mi
miss
mon
mr
mrs
ms
mt
mtg
mus
natl
nov
oct
oz
pfc
ph.d
phd
plc
pres
prof
profs
prop
pte
pty
pvt
pvts
qtr
rd
rep
reps
rev
revs
sci
sen
sens
sep
sept
ser
sgt
sgts
spc
sq
squ
sr
st
sta
stat
ste
str
supt
supts
sys
thu
thurs
tue
tues
univ
viz
vol
vs
wed
wrt
7 changes: 7 additions & 0 deletions tokenizer/resources/english_apostrophe_front.txt
@@ -0,0 +1,7 @@
bout
cause
em
fraid
nother
tis
twas
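
The part of this commit shown here does not include `tokenizer/tokenizer.py`, so how the resource lists above are actually consumed is not visible. As a rough illustration only, the sketch below assumes each resource file holds one lowercase entry per line and shows one plausible way such lists could guide token splitting. The function names (`load_word_set`, `split_final_period`, `split_front_apostrophe`) and both rules are hypothetical and are not taken from the repository.

```python
from typing import Iterable, List, Set


def load_word_set(lines: Iterable[str]) -> Set[str]:
    """Build a lookup set from a one-entry-per-line resource (assumed format)."""
    return {line.strip().lower() for line in lines if line.strip()}


# Small samples copied from the two resource files in this commit.
ABBREVIATION_PERIOD = load_word_set(["dr", "etc", "inc", "mr", "ph.d", "vs"])
APOSTROPHE_FRONT = load_word_set(["bout", "cause", "em", "tis", "twas"])


def split_final_period(token: str) -> List[str]:
    """Hypothetical rule: keep a trailing period attached to a listed abbreviation."""
    if token.endswith(".") and len(token) > 1:
        stem = token[:-1]
        if stem.lower() in ABBREVIATION_PERIOD:
            return [token]      # "Dr." stays one token
        return [stem, "."]      # an ordinary word loses its final period
    return [token]


def split_front_apostrophe(token: str) -> List[str]:
    """Hypothetical rule: keep a leading apostrophe on listed contractions."""
    if token.startswith("'") and len(token) > 1:
        rest = token[1:]
        if rest.lower() in APOSTROPHE_FRONT:
            return [token]      # "'tis" stays one token
        return ["'", rest]      # otherwise treat the apostrophe as a quote mark
    return [token]
```

With real files, `load_word_set(open(path, encoding="utf-8"))` would give the same kind of set; lowercasing before lookup lets the all-lowercase lists match capitalized input such as `Dr.` or `Ph.D.`.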