GitHub - gawati/gawati-tagit: NLP based automatic tagger for legal text

gawati-tagit is a tool to generate tags for a given AKN fulltext document.

Dependencies

Python3
Perl

Install

Set up and activate a Python3 virtual environment, then install the package in editable mode:

$ pip install -e .

Start a Python3 prompt and download the NLTK punkt tokenizer data:

>>> import nltk
>>> nltk.download('punkt')
>>> exit(0)

Run

$ export FLASK_APP=tagit
$ flask run --port=5001

To turn on development features, set the FLASK_ENV environment variable before running:

$ export FLASK_ENV=development

Build & Distribution

Version is maintained in setup.py.
python setup.py sdist will create a development package with “.dev” and the current date appended.
python setup.py release sdist will create a release package with only the version.
To learn more about the deploy process referenced, read this

The flask app can be used to upload an AKN document and generate tags for it. File types allowed: .xml and .txt.

You may also use tagger.py to:

- Bulk convert XML files in a folder to clean text. The clean text files are written to data/akn_text.
  - This conversion is meant for AKN fulltext documents.
  - Expects the language of the document to be the penultimate part of the folder structure, filename, or IRI.
    Example folder structure or IRI: akn_ft/mg/act/1940-03-16/gn_no_48-1940/eng@/main.xml
    Example filename: akn_mg_act_1940-03-16_gn_no_48-1940_eng_main.xml
  - The clean text files are placed in their respective language folders.
- Train a tf-idf (term frequency - inverse document frequency) model with the given corpus and language. This generates
  - a dictionary of words (vocabulary), saved as tagit.dict.<lang>
  - a model, saved as tagit.model.<lang>
  Training assumes that data/akn_text, containing clean text files, is present.
- Generate tags for a given document using the language-specific model. This also updates the dictionary with the new document.
  - Expects the language of the document to be the penultimate part of the filename or IRI. See examples above.
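The "penultimate part" convention can be illustrated with a short sketch. This is not the logic from tagger.py, which may differ. The function name `guess_language` is hypothetical, and stripping the trailing "@" seen in the IRI example is an assumption.

```python
import re


def guess_language(name):
    """Return the penultimate part of a folder path, filename, or IRI.

    Hypothetical helper for illustration only. Parts are separated by
    "/" (folder structure or IRI) or "_" (flattened filename).
    """
    parts = [part for part in re.split(r"[/_]", name) if part]
    if len(parts) < 2:
        raise ValueError(f"cannot find a language part in {name!r}")
    # Assumption: the "@" suffix in IRIs such as "eng@" is not part of the code.
    return parts[-2].rstrip("@")


print(guess_language("akn_ft/mg/act/1940-03-16/gn_no_48-1940/eng@/main.xml"))
print(guess_language("akn_mg_act_1940-03-16_gn_no_48-1940_eng_main.xml"))
```

Both example names from the list above resolve to `eng` under these assumptions.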

For instructions, run

$ python -m tagit.tagger --help

IMPORTANT: If tagit.model.<lang> isn't present, train one using the above script before using the /api/tag API.
The .model and .dict files are metadata generated by the tagger for a sample dataset. You may regenerate them for your own dataset.

Tag Document API

Request URL: http://localhost:5001/api/tag
Method: POST
Content-Type: multipart/form-data
Request body: file: <The binary data contents of the file>
Response: On success, 200 with response json { tags: <list of tags>}.
On error, 400 Bad Request with response json { error: <error message>}.

IMPORTANT: Expects the language of the document to be the penultimate part of the filename. See examples above.
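A minimal client sketch for the request described above, using only the Python standard library. The helper names `build_multipart` and `tag_document` are hypothetical, and the part's `application/octet-stream` content type is an assumption. The URL, method, `file` field, and `tags` response key come from the description above.

```python
import json
import urllib.request
import uuid
from pathlib import Path

API_URL = "http://localhost:5001/api/tag"  # request URL from the section above


def build_multipart(field, filename, data):
    """Encode one file as a multipart/form-data body.

    Returns (body_bytes, content_type_header_value).
    """
    boundary = uuid.uuid4().hex
    head = (
        f"--{boundary}\r\n"
        f'Content-Disposition: form-data; name="{field}"; filename="{filename}"\r\n'
        "Content-Type: application/octet-stream\r\n\r\n"
    ).encode("utf-8")
    tail = f"\r\n--{boundary}--\r\n".encode("utf-8")
    return head + data + tail, f"multipart/form-data; boundary={boundary}"


def tag_document(path, url=API_URL):
    """POST a .xml or .txt file to the tag API and return the list of tags."""
    path = Path(path)
    # The original filename is sent as-is because the service reads the
    # language from its penultimate part.
    body, content_type = build_multipart("file", path.name, path.read_bytes())
    request = urllib.request.Request(
        url, data=body, headers={"Content-Type": content_type}, method="POST"
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)["tags"]
```

With the Flask app running, `tag_document("akn_mg_act_1940-03-16_gn_no_48-1940_eng_main.xml")` would return the tag list. A 400 response surfaces as `urllib.error.HTTPError`, whose body carries the error json.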

Notes:

A comparison of tf-idf, doc2vec and fasttext:

With tf-idf, each word in a document gets a weight, so weights map directly back to words. We return the 10 words with the highest weights as tags. With doc2vec, by contrast, each document becomes a fixed-size vector, and we haven't found a way to map those vectors back to words.
doc2vec seems more suitable for clustering similar documents. tf-idf seems more suitable for generating tags for documents.
fasttext also seems mostly suitable for similarity (nearest neighbour) applications. Moreover, the vectors are meant to be generated for the entire text (and not per document) so that context is understood.
fasttext has published pre-trained English word vectors and pre-trained models for 157 languages, but it is unclear how to use them here.
Need to verify how Gensim performs for other languages.
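To make the tf-idf point concrete, here is a small standard-library sketch of picking the highest-weighted words as tags. It is not the project's code, which appears to rely on Gensim. The function name `tfidf_tags`, the unnormalized weights, and the base-2 logarithm are assumptions chosen for illustration.

```python
import math
from collections import Counter


def tfidf_tags(docs, target_index, top_n=10):
    """Return up to top_n words of docs[target_index], ranked by tf-idf weight.

    docs is a list of token lists (one list per document).
    """
    n_docs = len(docs)
    # Document frequency: in how many documents does each word occur?
    doc_freq = Counter()
    for tokens in docs:
        doc_freq.update(set(tokens))
    # Term frequency within the target document.
    term_freq = Counter(docs[target_index])
    weights = {
        word: count * math.log2(n_docs / doc_freq[word])
        for word, count in term_freq.items()
    }
    # Highest weight first; ties are broken alphabetically for stable output.
    ranked = sorted(weights.items(), key=lambda item: (-item[1], item[0]))
    # Words that occur in every document get weight 0 and are dropped.
    return [word for word, weight in ranked[:top_n] if weight > 0]


corpus = [
    ["court", "act", "land", "land"],
    ["court", "act", "tax"],
    ["court", "tax", "tax"],
]
print(tfidf_tags(corpus, 0))  # prints ['land', 'act']
```

Here "court" appears in every document, so its weight is zero and it is never returned as a tag, while "land" is both frequent in the first document and rare in the corpus.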

In the gawati-editor context

The service would work as follows as part of the gawati-editor workflow:

Create a new AKN document
Add attachments
For each attachment, click extract text. This converts pdf to fulltext.
Then, the above /api/tag API is called to generate tags for the attachment. These tags get saved in the respective fulltext file.
When you refresh tags for the main AKN document:
- it gathers all the tags from the fulltext files of the attachments.
- it scans the main AKN metadata for all showAs text.
- all the showAs texts are prefixed to the list of tags.
Note that we never run the main AKN metadata through this service.
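The refresh step can be sketched as a simple merge. The function name `refresh_tags` and its inputs are hypothetical. The real editor reads these values from the AKN metadata and the fulltext files, and whether it removes duplicates is not stated here, so this sketch does not.

```python
def refresh_tags(show_as_texts, attachment_tags):
    """Merge tags for the main AKN document.

    show_as_texts: showAs strings found in the main AKN metadata.
    attachment_tags: one list of tags per attachment fulltext file.
    """
    merged = list(show_as_texts)  # showAs texts are prefixed to the list
    for tags in attachment_tags:
        merged.extend(tags)  # followed by the tags of each attachment, in order
    return merged


print(refresh_tags(["Land Act"], [["land", "tenure"], ["tax"]]))
# prints ['Land Act', 'land', 'tenure', 'tax']
```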

Deploy

Activate the Python3 virtual environment.

Install gunicorn

$ pip install gunicorn

Set the log paths in gunicorn.conf
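As an illustration of the log settings, a gunicorn configuration file is plain Python, and `accesslog`, `errorlog`, and `loglevel` are standard gunicorn settings. The paths below are assumptions. The repository's own gunicorn.conf may use different values.

```python
# gunicorn.conf (sketch; the paths are assumptions, adjust to your deployment)

# Where gunicorn writes one line per HTTP request.
accesslog = "logs/gunicorn-access.log"

# Where gunicorn writes startup messages, errors, and worker tracebacks.
errorlog = "logs/gunicorn-error.log"

# Verbosity of the error log.
loglevel = "info"
```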
Configure apache (apache.conf)

<Location "/path/to/gawati-tagit">
    ProxyPass "http://127.0.0.1:5001/"
    ProxyPassReverse "http://127.0.0.1:5001/"
</Location>

Run gunicorn

$ gunicorn -c gunicorn.conf -b 0.0.0.0:5001 tagit:app

Check the app in the browser at http://your-ip-here/path/to/gawati-tagit
