gawati-tagit is a tool to generate tags for a given AKN fulltext document.
- Python3
- Perl
Set up and activate a Python3 virtual environment. Then,
$ pip install -e .
Start a Python3 prompt and download the NLTK punkt tokenizer data:
>>> import nltk
>>> nltk.download('punkt')
>>> exit(0)
$ export FLASK_APP=tagit
$ flask run --port=5001
To turn on development features, set the env variable before running.
$ export FLASK_ENV=development
Version is maintained in setup.py.
python setup.py sdist
will create a development package with “.dev” and the current date appended.
python setup.py release sdist
will create a release package with only the version.
To learn more about the deploy process referenced, read this
The flask app can be used to upload an AKN document and generate tags for it.
File types allowed: .xml and .txt.
You may also use tagger.py to:
- Bulk convert xml files in a folder to clean text. The output clean text files get written to data/akn_text.
  - This conversion is meant for AKN fulltext documents.
  - Expects the language of the document to be the penultimate part of the folder structure or filename or iri.
    Example folder structure or iri: akn_ft/mg/act/1940-03-16/gn_no_48-1940/eng@/main.xml
    Example filename: akn_mg_act_1940-03-16_gn_no_48-1940_eng_main.xml
  - Outputs clean text files placed in their respective language folders.
- Train a tf-idf (term frequency - inverse document frequency) model with the given corpus and language. This generates
  - a dictionary of words (vocabulary) that gets saved as tagit.dict.<lang>
  - a model that gets saved as tagit.model.<lang>
  Training assumes data/akn_text containing clean text files is present.
- Generate tags for a given document using the language specific model. It also updates the dictionary with the new document.
  - Expects the language of the document to be the penultimate part of the filename or iri. See examples above.
For instructions, run
$ python -m tagit.tagger --help
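The language convention used by the tagger (penultimate part of the folder structure, iri or filename) can be illustrated with a short sketch. The helper name language_of is hypothetical and is not taken from tagger.py. It assumes that a name containing "/" is a folder structure or iri, and that anything else is an underscore-separated filename.

```python
def language_of(name: str) -> str:
    """Return the penultimate part of an iri/folder path or of a filename.

    Hypothetical helper for illustration; tagger.py may do this differently.
    """
    parts = [part for part in name.split("/") if part]
    if len(parts) >= 2:
        # Folder structure or iri: .../eng@/main.xml -> "eng"
        return parts[-2].rstrip("@")
    pieces = parts[0].split("_") if parts else []
    if len(pieces) < 2:
        raise ValueError(f"cannot find a language part in {name!r}")
    # Filename: ..._eng_main.xml -> "eng"
    return pieces[-2]


if __name__ == "__main__":
    print(language_of("akn_ft/mg/act/1940-03-16/gn_no_48-1940/eng@/main.xml"))
    print(language_of("akn_mg_act_1940-03-16_gn_no_48-1940_eng_main.xml"))
```

Both example names from the list above resolve to eng, which is the suffix expected on tagit.dict.<lang> and tagit.model.<lang>.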
IMPORTANT: If tagit.model.<lang> isn't present, train one using the above script before using the /api/tag API.
The .model and .dict files are metadata generated by the tagger for a sample dataset. You may regenerate them for your own dataset.
Request URL: http://localhost:5001/api/tag
Method: POST
Content-Type: multipart/form-data
Request body: file: <The binary data contents of the file>
Response: On success, 200 with response json { tags: <list of tags> }.
On error, 400 Bad Request with response json { error: <error message> }.
IMPORTANT: Expects the language of the document to be the penultimate part of the filename. See examples above.
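A minimal client sketch for the request described above, using only the Python standard library. The form field name file, the URL and the tags key come from the description. The function names, the generated boundary and the application/octet-stream part type are assumptions made for illustration.

```python
import json
import urllib.request
import uuid
from pathlib import Path

API_URL = "http://localhost:5001/api/tag"  # as given in the request description


def build_multipart(filename: str, data: bytes):
    """Encode one file under the form field "file" as multipart/form-data."""
    boundary = uuid.uuid4().hex
    head = (
        f"--{boundary}\r\n"
        f'Content-Disposition: form-data; name="file"; filename="{filename}"\r\n'
        "Content-Type: application/octet-stream\r\n\r\n"
    ).encode("utf-8")
    tail = f"\r\n--{boundary}--\r\n".encode("utf-8")
    return head + data + tail, f"multipart/form-data; boundary={boundary}"


def request_tags(path: str, url: str = API_URL):
    """POST the file and return the "tags" list from the JSON response.

    urlopen raises urllib.error.HTTPError for the 400 Bad Request case.
    """
    source = Path(path)
    body, content_type = build_multipart(source.name, source.read_bytes())
    request = urllib.request.Request(
        url, data=body, headers={"Content-Type": content_type}, method="POST"
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)["tags"]
```

Calling request_tags("akn_mg_act_1940-03-16_gn_no_48-1940_eng_main.xml") sends the original filename along with the upload, so the service can read the language from it.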
A comparison of tf-idf, doc2vec and fasttext:
- With tf-idf, we can get a mapping of weighted vectors to words. We return the 10 tags with the highest weights. With doc2vec, by contrast, we get vectors of a fixed size, and we haven't found a way to map those vectors back to words.
- doc2vec seems more suitable for clustering similar documents. tf-idf seems more suitable for generating tags for documents.
- fasttext also seems mostly suitable for similarity (nearest neighbour) applications. Moreover, the vectors are meant to be generated for the entire text (and not per document) so that context is understood.
- fasttext has published pre-trained English word vectors and pre-trained models for 157 languages. It is unclear how to use them here.
- Need to verify how Gensim performs for other languages.
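The tf-idf approach from the comparison above (weight each word, return the 10 with the highest weights) can be sketched in plain Python. This is an illustration of the technique only, not the project's model code: the regex tokenizer handles ASCII English only, the function names are hypothetical, and the weight used is raw term count times log2(N / document frequency).

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and keep alphabetic runs (a crude stand-in tokenizer)."""
    return re.findall(r"[a-z]+", text.lower())


def top_tags(document, corpus, n=10):
    """Return up to n terms of `document` with the highest tf-idf weight."""
    tokenized = [tokenize(text) for text in corpus]
    doc_freq = Counter()
    for tokens in tokenized:
        doc_freq.update(set(tokens))  # count each term once per document

    term_freq = Counter(tokenize(document))
    total_docs = len(tokenized)
    weights = {}
    for term, count in term_freq.items():
        if doc_freq[term] == 0:
            continue  # term unseen in the corpus: no idf available
        weights[term] = count * math.log2(total_docs / doc_freq[term])

    # Highest weight first; ties broken alphabetically for stable output.
    ranked = sorted(weights.items(), key=lambda item: (-item[1], item[0]))
    return [term for term, weight in ranked[:n] if weight > 0]


if __name__ == "__main__":
    corpus = [
        "the act regulates fishing",
        "the act regulates mining",
        "the act regulates forests",
    ]
    print(top_tags(corpus[0], corpus))
```

Words that occur in every document of the corpus get a weight of zero and are never returned as tags, which is why a representative training corpus matters.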
The service would work as follows as part of the gawati-editor workflow:
- Create a new AKN document
- Add attachments
- For each attachment, click extract text. This converts pdf to fulltext. Then, the above /api/tag API is called to generate tags for the attachment. These tags get saved in the respective fulltext file.
- When you refresh tags for the main AKN document:
  - it gathers all the tags from the fulltext files of the attachments.
  - it scans the main AKN metadata for all showAs text.
  - all the showAs texts are prefixed to the list of tags.
- Note that we never run the main AKN metadata through this service.
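The refresh step in the workflow above amounts to a simple merge. The sketch below is an illustration under assumptions: the function name refresh_tags is hypothetical, and dropping duplicates while keeping the first occurrence is an added assumption that the workflow description does not state.

```python
def refresh_tags(show_as_texts, attachment_tags):
    """Prefix the showAs texts to the tags gathered from all attachments.

    show_as_texts: showAs strings found in the main AKN metadata.
    attachment_tags: one list of tags per attachment fulltext file.
    """
    merged = list(show_as_texts)
    for tags in attachment_tags:
        merged.extend(tags)
    # Assumption: duplicates are dropped, keeping the first occurrence.
    return list(dict.fromkeys(merged))


if __name__ == "__main__":
    print(refresh_tags(
        ["Fisheries Act", "Madagascar"],
        [["fishing", "licence"], ["licence", "vessel"]],
    ))
```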
Activate the Python3 virtual environment.
- Install gunicorn
$ pip install gunicorn
- Set the log paths in gunicorn.conf
- Configure apache (apache.conf)
<Location "/path/to/gawati-tagit">
ProxyPass "http://127.0.0.1:5001/"
ProxyPassReverse "http://127.0.0.1:5001/"
</Location>
- Run gunicorn
$ gunicorn -c gunicorn.conf -b 0.0.0.0:5001 tagit:app
Check the app in the browser at http://your-ip-here/path/to/gawati-tagit
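For the log paths step above, a gunicorn config file is written in Python syntax. The fragment below is a sketch, not the project's actual gunicorn.conf: accesslog, errorlog and loglevel are standard gunicorn settings, while the directory is a hypothetical placeholder to replace with your own.

```python
# Sketch of the logging part of gunicorn.conf.
# The /var/log/gawati-tagit directory is a hypothetical placeholder;
# it must exist and be writable by the user running gunicorn.
accesslog = "/var/log/gawati-tagit/access.log"
errorlog = "/var/log/gawati-tagit/error.log"
loglevel = "info"
```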