gawati/gawati-tagit

gawati-tagit is a tool to generate tags for a given AKN fulltext document.

Dependencies

  • Python3
  • Perl

Install

Set up and activate a Python3 virtual environment. Then,

$ pip install -e .

Start a Python3 prompt and download the NLTK punkt tokenizer data:

>>> import nltk
>>> nltk.download('punkt')
>>> exit(0)

Run

$ export FLASK_APP=tagit
$ flask run --port=5001

To turn on development features, set the FLASK_ENV environment variable before running flask run.

$ export FLASK_ENV=development

Build & Distribution

Version is maintained in setup.py.
python setup.py sdist will create a development package with “.dev” and the current date appended.
python setup.py release sdist will create a release package with only the version.
To learn more about the deploy process referenced, read this

The flask app can be used to upload an AKN document and generate tags for it. File types allowed: .xml and .txt.

You may also use tagger.py to:

  1. Bulk convert xml files in a folder to clean text. The output clean text files get written to data/akn_text.
    • This conversion is meant for akn fulltext documents.
    • Expects the language of the document to be the penultimate part of the folder structure or filename or iri.
      Example folder structure or iri: akn_ft/mg/act/1940-03-16/gn_no_48-1940/eng@/main.xml
      Example filename: akn_mg_act_1940-03-16_gn_no_48-1940_eng_main.xml
    • Outputs clean text files placed in their respective language folders.
  2. Train a tf-idf (term frequency - inverse document frequency) model with the given corpus and language. This generates
    • a dictionary of words (vocabulary) that gets saved as tagit.dict.<lang>
    • a model that gets saved as tagit.model.<lang>
      Training assumes data/akn_text containing clean text files is present.
  3. Generate tags for a given document using the language-specific model. It also updates the dictionary with the new document.
    • Expects the language of the document to be the penultimate part of the filename or iri. See examples above.
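
As an illustration of the "penultimate part" convention above, here is a minimal Python sketch. The function name language_from_name is hypothetical and is not taken from tagger.py; it also assumes that trailing markers such as the "@" in "eng@" should be stripped.

```python
import re


def language_from_name(name):
    """Return the penultimate part of a folder path, IRI, or flat filename."""
    # Folder structures and IRIs are split on "/".
    parts = [p for p in name.split("/") if p]
    if len(parts) < 2:
        # Flat filenames are assumed to use "_" as the separator.
        parts = name.split("_")
    if len(parts) < 2:
        raise ValueError(f"cannot find a language part in {name!r}")
    # Assumption: keep only letters, so "eng@" becomes "eng".
    return re.sub(r"[^a-z]", "", parts[-2].lower())


iri = "akn_ft/mg/act/1940-03-16/gn_no_48-1940/eng@/main.xml"
filename = "akn_mg_act_1940-03-16_gn_no_48-1940_eng_main.xml"
print(language_from_name(iri), language_from_name(filename))
```

Both example names from the list resolve to the same language code, which is what selects tagit.dict.<lang> and tagit.model.<lang>.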

For instructions, run

$ python -m tagit.tagger --help

IMPORTANT: If tagit.model.<lang> isn't present, ensure you train one using the above script, before using the /tag API.
The .model and .dict files are metadata generated by the tagger for a sample dataset. You may regenerate them for your own dataset.

Tag Document API

Request URL: http://localhost:5001/api/tag
Method: POST
Content-Type: multipart/form-data
Request body: file: <The binary data contents of the file>
Response: On success, 200 with response json { tags: <list of tags>}.
On error, 400 Bad Request with response json { error: <error message>}.

IMPORTANT: Expects the language of the document to be the penultimate part of the filename. See examples above.
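
A call to this API from Python could look roughly like the sketch below. It uses only the standard library; the helper names build_multipart and tag_document are hypothetical, and the form field name "file" is taken from the request body description above.

```python
import json
import urllib.request
import uuid
from pathlib import Path


def build_multipart(field, filename, payload):
    """Encode one file as a multipart/form-data body; returns (body, content_type)."""
    boundary = uuid.uuid4().hex
    head = (
        f"--{boundary}\r\n"
        f'Content-Disposition: form-data; name="{field}"; filename="{filename}"\r\n'
        "Content-Type: application/octet-stream\r\n\r\n"
    ).encode("utf-8")
    tail = f"\r\n--{boundary}--\r\n".encode("utf-8")
    return head + payload + tail, f"multipart/form-data; boundary={boundary}"


def tag_document(path, url="http://localhost:5001/api/tag"):
    """POST the file to the tag API and return the list of tags."""
    p = Path(path)
    # The filename is sent as-is because the service reads the language from it.
    body, content_type = build_multipart("file", p.name, p.read_bytes())
    req = urllib.request.Request(
        url, data=body, method="POST", headers={"Content-Type": content_type}
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["tags"]
```

With the service running, tag_document("akn_mg_act_1940-03-16_gn_no_48-1940_eng_main.xml") should return the tag list. On a 400 response, urlopen raises urllib.error.HTTPError, and the error message can be read from the exception's body.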

Notes:

A comparison of tf-idf, doc2vec and fasttext:

  1. With tf-idf, each word maps to a weight, so we can return the 10 tags with the highest weights. With doc2vec, by contrast, we get vectors of a fixed size, and we haven't found a way to map those vectors back to words.

  2. doc2vec seems more suitable for clustering similar documents. tf-idf seems more suitable for generating tags for documents.

  3. fasttext also seems mostly suitable for similarity (nearest neighbour) applications. Moreover, the vectors are meant to be generated for the entire text (and not per document) so that context is understood.

  4. fasttext has published pre-trained English word vectors and pre-trained models for 157 languages. Unclear how to use them.

  5. Need to verify how Gensim performs for other languages.
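
To make note 1 concrete, here is a minimal standard-library sketch of picking the highest-weighted words by tf-idf. It is not the project's implementation: the tokenizer is a naive stand-in, the smoothed idf formula is an assumption, and the names tokenize and top_tags are hypothetical.

```python
import math
from collections import Counter


def tokenize(text):
    # Naive stand-in tokenizer: lowercase, strip punctuation, keep alphabetic words.
    words = (t.strip(".,;:()\"'").lower() for t in text.split())
    return [w for w in words if w.isalpha()]


def top_tags(doc, corpus, n=10):
    """Return the n highest-weighted words of doc, scored by tf-idf against corpus."""
    docs = [tokenize(d) for d in corpus]
    # Document frequency: in how many corpus documents each word appears.
    df = Counter(word for d in docs for word in set(d))
    tf = Counter(tokenize(doc))
    total = len(docs)
    weights = {
        # Smoothed idf (an assumption) so words unseen in the corpus stay finite.
        word: count * math.log((1 + total) / (1 + df[word]))
        for word, count in tf.items()
    }
    return sorted(weights, key=weights.get, reverse=True)[:n]


corpus = [
    "the court shall hear the appeal",
    "the minister shall publish the notice",
    "the court may dismiss the appeal",
]
print(top_tags("the tribunal shall hear the tribunal petition", corpus, n=2))
```

Words that occur in every corpus document, such as "the", get a weight of zero here, while words that are frequent in the document and rare in the corpus rise to the top.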

In the gawati-editor context

The service would work as follows as part of the gawati-editor workflow:

  1. Create a new AKN document
  2. Add attachments
  3. For each attachment, click extract text. This converts pdf to fulltext.
    Then, the above /api/tag API is called to generate tags for the attachment. These tags get saved in the respective fulltext file.
  4. When you refresh tags for the main AKN document:
    • it gathers all the tags from the fulltext files of the attachments.
    • it scans the main AKN metadata for all showAs text.
    • all the showAs texts are prefixed to the list of tags.
  5. Note that we never run the main AKN metadata through this service.
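
The tag refresh in step 4 can be sketched as follows. This is only an illustration of the ordering described above, not gawati-editor code: the function name refresh_tags is hypothetical, and dropping duplicate tags is an assumption.

```python
def refresh_tags(show_as_texts, attachment_tags):
    """Combine showAs texts and attachment tags, with the showAs texts first."""
    merged = list(show_as_texts)
    for tags in attachment_tags:
        merged.extend(tags)
    # Assumption: duplicates are dropped, keeping the first occurrence and its order.
    return list(dict.fromkeys(merged))


print(refresh_tags(["Act", "Madagascar"], [["court", "appeal"], ["appeal", "notice"]]))
```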

Deploy

Activate the Python3 virtual environment.

  1. Install gunicorn
$ pip install gunicorn
  2. Set the log paths in gunicorn.conf

  3. Configure apache (apache.conf)

<Location "/path/to/gawati-tagit">
    ProxyPass "http://127.0.0.1:5001/"
    ProxyPassReverse "http://127.0.0.1:5001/"
</Location>
  4. Run gunicorn
$ gunicorn -c gunicorn.conf -b 0.0.0.0:5001 tagit:app

Check the app in the browser, http://your-ip-here/path/to/gawati-tagit
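
For the log paths step, a gunicorn config file is plain Python, so the settings could look roughly like this. The paths below are hypothetical placeholders and not the values shipped in this repository; accesslog, errorlog and loglevel are standard gunicorn settings.

```python
# gunicorn.conf (sketch; replace the hypothetical paths with your own)
accesslog = "/var/log/gawati-tagit/access.log"  # one line per HTTP request
errorlog = "/var/log/gawati-tagit/error.log"    # gunicorn and application errors
loglevel = "info"
```

The log directory must exist and be writable by the user that runs gunicorn.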

About

NLP based automatic tagger for legal text
