Low-level NLP Tools for Magahi and Bhojpuri Shared Task

Task Description

=================

The task is to develop low-level NLP tools for Magahi and Bhojpuri. Both Magahi and Bhojpuri are Eastern Indo-Aryan languages spoken largely in the Eastern states of Bihar, Jharkhand and Uttar Pradesh in India. These languages are part of what is considered a dialect continuum running from the Eastern part of India to its Western part and consisting of approximately 50 languages / varieties. Hindi, the official language of India, is part of the same continuum, and as such these languages are closely related to each other. However, despite this similarity, they diverge considerably in lexicon as well as morphological make-up. As a result, most of the tools developed for Hindi do not perform very well on the other languages. For this task, we are providing small annotated datasets for Magahi and Bhojpuri in order to develop a part-of-speech tagger and a morphological analyser for these languages. The datasets are annotated with part-of-speech categories and morphological features from the Universal Dependencies tagset.

Sub tasks

===========

The task has 2 sub-tasks: (a) a POS tagger for each language, and (b) a Number, Gender, Person, Tense, Aspect, Honorificity and Case relation analyser for each language.
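
As a rough illustration of what the morphological sub-task works with, the sketch below parses a Universal Dependencies style feature string (`Name=Value` pairs joined by `|`) and keeps only the categories named above. The feature names in `TARGET_FEATURES` are assumptions based on common UD conventions (for example `Polite` for honorificity); the names actually used in the shared task data may differ.

```python
# Assumed feature names, following common Universal Dependencies conventions.
# The shared task data may use different names (e.g. for honorificity).
TARGET_FEATURES = ("Number", "Gender", "Person", "Tense", "Aspect", "Polite", "Case")


def parse_feats(feats):
    """Turn a FEATS string such as 'Case=Nom|Number=Sing' into a dict.

    An underscore (or empty string) means the token has no features.
    """
    if not feats or feats == "_":
        return {}
    parsed = {}
    for pair in feats.split("|"):
        name, _, value = pair.partition("=")
        parsed[name] = value
    return parsed


def target_values(feats):
    """Return one value per target feature, using '_' where it is absent."""
    parsed = parse_feats(feats)
    return {name: parsed.get(name, "_") for name in TARGET_FEATURES}


if __name__ == "__main__":
    print(target_values("Case=Nom|Gender=Masc|Number=Sing|Person=3"))
```

Treating each category as its own label (with `_` for "not marked") is only one possible framing; predicting the full feature bundle as a single label is another.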

Data

========

We will provide 5,000 annotated sentences (in CoNLL-U format) for each of the 2 languages. In addition, participants are encouraged to use the Hindi dataset available from the Universal Dependencies project. They are also free to use any other dataset, as long as it is freely available for research.
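
For readers new to the format, here is a minimal sketch of reading CoNLL-U data using only the standard library. It relies on the standard CoNLL-U layout (ten tab-separated columns, `#` comment lines, blank lines between sentences) and keeps only the FORM, UPOS and FEATS columns. The function name `read_conllu` and the placeholder tokens `w1`, `w2` are illustrative and do not come from the task data.

```python
def read_conllu(lines):
    """Yield sentences as lists of (form, upos, feats) tuples."""
    sentence = []
    for line in lines:
        line = line.rstrip("\r\n")
        if not line:
            # A blank line closes the current sentence.
            if sentence:
                yield sentence
                sentence = []
            continue
        if line.startswith("#"):
            # Sentence-level comments such as "# sent_id = ..." are skipped.
            continue
        cols = line.split("\t")
        # Skip multiword token ranges (e.g. "1-2") and empty nodes (e.g. "1.1").
        if "-" in cols[0] or "." in cols[0]:
            continue
        # Columns: ID FORM LEMMA UPOS XPOS FEATS HEAD DEPREL DEPS MISC
        sentence.append((cols[1], cols[3], cols[5]))
    if sentence:
        yield sentence


# Placeholder sentence; "w1" and "w2" stand in for real word forms.
SAMPLE = (
    "# sent_id = demo-1\n"
    "1\tw1\t_\tNOUN\t_\tNumber=Sing\t_\t_\t_\t_\n"
    "2\tw2\t_\tVERB\t_\tTense=Pres\t_\t_\t_\t_\n"
    "\n"
)

if __name__ == "__main__":
    for sent in read_conllu(SAMPLE.splitlines()):
        print(sent)
```

To read a file instead, pass an open file handle, for example `read_conllu(open(path, encoding="utf-8"))`, where `path` is wherever the released data is stored.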

Evaluation Procedure

=========================

Teams will be evaluated and ranked using macro-averaged F1 scores.
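
For reference, macro-averaged F1 computes an F1 score for each label separately and then takes the unweighted mean, so rare tags count as much as frequent ones. The sketch below is not the organisers' scorer. In particular, averaging over every label that appears in either the gold or the predicted tags is an assumption, since the exact label set is not specified here.

```python
from collections import Counter


def macro_f1(gold, predicted):
    """Unweighted mean of per-label F1 over labels seen in gold or predicted."""
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must have the same length")
    tp, fp, fn = Counter(), Counter(), Counter()
    for g, p in zip(gold, predicted):
        if g == p:
            tp[g] += 1
        else:
            fp[p] += 1  # p was predicted but is wrong
            fn[g] += 1  # g was missed
    labels = set(gold) | set(predicted)
    if not labels:
        return 0.0
    scores = []
    for label in labels:
        # F1 = 2*TP / (2*TP + FP + FN)
        denom = 2 * tp[label] + fp[label] + fn[label]
        scores.append(2 * tp[label] / denom if denom else 0.0)
    return sum(scores) / len(scores)


if __name__ == "__main__":
    gold = ["NOUN", "VERB", "NOUN", "ADJ"]
    pred = ["NOUN", "NOUN", "NOUN", "ADJ"]
    print(round(macro_f1(gold, pred), 3))
```

In this toy example NOUN scores 0.8, VERB scores 0.0 and ADJ scores 1.0, so the macro average is 0.6 even though 3 of the 4 tokens are tagged correctly.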

Baseline

=============

A simple probabilistic baseline, which assigns the most frequent tag to each token, will be provided by the organisers.
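
One common reading of such a baseline is sketched below: each word form receives the tag it carried most often in the training data, and unseen forms fall back to the overall most frequent tag. This is an illustration under that assumption, not the organisers' implementation. The function names and the placeholder tokens are hypothetical.

```python
from collections import Counter, defaultdict


def train_baseline(tagged_sentences):
    """Build a form -> most frequent tag lexicon plus a global fallback tag."""
    per_form = defaultdict(Counter)
    overall = Counter()
    for sentence in tagged_sentences:
        for form, tag in sentence:
            per_form[form][tag] += 1
            overall[tag] += 1
    lexicon = {form: counts.most_common(1)[0][0] for form, counts in per_form.items()}
    fallback = overall.most_common(1)[0][0] if overall else None
    return lexicon, fallback


def tag_sentence(forms, lexicon, fallback):
    """Tag each form from the lexicon, using the fallback for unseen forms."""
    return [lexicon.get(form, fallback) for form in forms]


if __name__ == "__main__":
    # Placeholder training data; "w1".."w3" stand in for real word forms.
    train = [
        [("w1", "NOUN"), ("w2", "VERB")],
        [("w1", "NOUN"), ("w3", "NOUN")],
        [("w1", "VERB")],
    ]
    lexicon, fallback = train_baseline(train)
    print(tag_sentence(["w1", "w2", "unseen"], lexicon, fallback))
```

The same idea carries over to the morphological sub-task by replacing the POS tag with a feature value or a full feature bundle.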

Important Dates

====================

The training dataset will be made available by 15th April, 2019. Other deadlines are as per the workshop schedule.

Results

============

Results will be made available as per the workshop schedule.

Paper submission

=====================

Paper submission instructions will be the same as those for the workshop.

Query

If you have any queries regarding this task, please raise an [issue](https://github.com/shashwatup9k/nsurl-2019/issues).
=== Machine-readable metadata (DO NOT REMOVE!) =====================================================
Data available since: Low-level NLP Tools for Magahi and Bhojpuri Shared Task-2019
License: CC BY-NC-SA 4.0
=======
Includes text: yes
Shared Task Organisers: Kumar; Ritesh and Ojha, Atul Kr.
Contributor/©holder: Panlingua Language Processing LLP, N. Delhi, India and KMI-Linguistics, Dr. Bhimrao Ambedkar University, Agra
=======================================================================================================