Distantly/Weakly Labeled NER Data

We release five distantly/weakly labeled NER datasets:

|              | CoNLL03      | Tweet   | OntoNotes5.0 | Webpage | Wikigold |
| Entity Types | 4            | 10      | 18           | 4       | 4        |
| Origin       | Easy to find | WNUT-16 | LDC2013T19   | CogComp | GitHub   |

Format

The data largely follows the CoNLL format: https://simpletransformers.ai/docs/ner-data-formats/#text-file-in-conll-format

Only the fields “str_words” and “tag” are used in this repo; the remaining fields belong to other projects. Each “tag” value is the index of a label, and the mapping between labels and indices is defined in “tag_to_id.json”.

You can see how we use these files in data_utils.py.
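
As a rough illustration of the label mapping described above, the sketch below decodes the “tag” indices of one record back into label names. The record layout (a dict holding parallel “str_words” and “tag” lists), the example labels, and the helper names `invert_tag_map` and `decode_record` are assumptions made for this example, not code taken from data_utils.py.

```python
import json


def invert_tag_map(tag_to_id):
    """Build an index -> label lookup from a label -> index mapping."""
    return {idx: label for label, idx in tag_to_id.items()}


def decode_record(record, id_to_tag):
    """Pair each token in 'str_words' with the label named by its 'tag' index."""
    return [(word, id_to_tag[idx])
            for word, idx in zip(record["str_words"], record["tag"])]


# Hypothetical stand-in for the contents of tag_to_id.json; the real file
# may define different labels and indices.
tag_to_id = json.loads('{"O": 0, "B-PER": 1, "I-PER": 2}')

# Hypothetical record; any fields other than "str_words" and "tag" are ignored.
record = {
    "str_words": ["John", "Smith", "left"],
    "tag": [1, 2, 0],
    "other_field": None,
}

id_to_tag = invert_tag_map(tag_to_id)
decoded = decode_record(record, id_to_tag)
print(decoded)
```

With the real files, the inline mapping would be replaced by `json.load` on “tag_to_id.json”, and the record would come from one of the dataset files.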

References

  • [CoNLL03] Erik F. Tjong Kim Sang and Fien De Meulder. "Introduction to the CoNLL-2003 shared task: Language-independent named entity recognition." In Proceedings of the Seventh Conference on Natural Language Learning at HLT-NAACL 2003.
  • [Tweet] Ritter, Alan, Sam Clark, and Oren Etzioni. "Named entity recognition in tweets: an experimental study." Proceedings of the Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics, 2011.
  • [OntoNotes5.0] Weischedel, Ralph, Martha Palmer, Mitchell Marcus, Eduard Hovy, Sameer Pradhan, Lance Ramshaw, Nianwen Xue et al. "OntoNotes Release 5.0 LDC2013T19." Linguistic Data Consortium, Philadelphia, PA (2013).
  • [Webpage] Ratinov, Lev, and Dan Roth. "Design challenges and misconceptions in named entity recognition." Proceedings of the Thirteenth Conference on Computational Natural Language Learning (CoNLL-2009). 2009.
  • [Wikigold] Balasuriya, Dominic, et al. "Named entity recognition in wikipedia." Proceedings of the 2009 Workshop on The People’s Web Meets NLP: Collaboratively Constructed Semantic Resources (People’s Web). 2009.