
Steps to utilize NeuroNER for other languages #30

Open · sooheon opened this issue Jul 4, 2017 · 10 comments

sooheon commented Jul 4, 2017

It appears that BRAT, at least, is pretty language-agnostic. The English-specific parts of NeuroNER (afaict) are the recommended glove.6B.100d word vectors and all of the spaCy-related tokenizing code, which is used to translate BRAT format into CoNLL format (correct?)

Am I correct that if I:

  1. Supply Korean word vectors in /data/word_vectors
  2. Supply CoNLL-formatted train, valid, and test data, produced from BRAT-labeled Korean text run through my own tokenizer

I will be able to train and use NeuroNER for Korean text?
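
For reference, NeuroNER's dataset reader takes the token from the first column of each line and the tag from the last column, with blank lines separating sentences (CoNLL-2003 style), so step 2 amounts to producing files along these lines (the Korean sentence and tags below are made up for illustration):

```
삼성전자 B-ORG
가 O
서울 B-LOC
에서 O
신제품 O
을 O
발표했다 O
. O
```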

Franck-Dernoncourt (Owner) commented Jul 4, 2017 via email

Gregory-Howard (Contributor) commented Jul 6, 2017

Hi (I'm the guy who uses NeuroNER in French)!
Those two steps are right, but you also need spaCy (or NLTK) working in Korean.
To explain a bit more for spaCy:
You need a spaCy Korean model, which consists of a tokenizer and a POS-tagging model.
Someone asked exactly this question: explosion/spaCy#929
Then you will have to change spacylanguage in parameters.ini.
I hope that's clear; if not, feel free to ask.

In short, for a spaCy-supported language X: get a spaCy model for X, then set spacylanguage = X in parameters.ini (see the sketch below).
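
A minimal sketch of what that model buys you, mirroring the `document.sents` loop in brat_to_conll.py ('ko' is a hypothetical model name here; the hard part is that such a model must actually exist and be installed):

```python
import spacy

# spacy.blank('ko') would give a bare tokenizer, but no sentence boundaries
# or POS tags; the brat_to_conll.py path needs a full model for .sents.
nlp = spacy.load('ko')  # hypothetical Korean model

document = nlp('...some Korean text...')
for span in document.sents:           # sentence segmentation: needs a model
    for token in span:
        print(token.text, token.idx)  # token text + character offset, as CoNLL needs
```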

sooheon (Author) commented Jul 6, 2017

Thanks for the additional detail! That looks perfectly doable.

@ersinyar

I don't understand what exactly spaCy (or NLTK) does in NeuroNER. I think spaCy is used as a tokenizer. Do we need a language-specific tokenizer? And why do we need a POS-tagging model? Can't we just use NLTK for tokenization?

@Gregory-Howard (Contributor)

spaCy is used in this file: https://github.com/Franck-Dernoncourt/NeuroNER/blob/master/src/brat_to_conll.py#L20
The problem is the line `for span in document.sents:`, since this method needs a full model to work.
I think if we transform the code a bit, we might need only a tokenizer.
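
As a rough sketch of that transformation, NLTK-based (this assumes NLTK's `punkt` sentence tokenizer is installed; the function name and output shape just mirror the spaCy helper in brat_to_conll.py, this is not code from the repo):

```python
import nltk  # requires: nltk.download('punkt')

def get_sentences_and_tokens_from_nltk(text):
    # Tokenizer-only: no POS tagger or parser needed, unlike document.sents.
    sentences = []
    cursor = 0
    for sentence in nltk.sent_tokenize(text):
        tokens = []
        for token_text in nltk.word_tokenize(sentence):
            # word_tokenize can rewrite some characters (e.g. quotes),
            # so skip any token we cannot locate in the original text.
            start = text.find(token_text, cursor)
            if start == -1:
                continue
            tokens.append({'text': token_text,
                           'start': start,
                           'end': start + len(token_text)})
            cursor = start + len(token_text)
        sentences.append(tokens)
    return sentences
```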

@Killthebug

Hey all! I'm trying to get NeuroNER to work for some Hindi data, but from what I understand spaCy does not support Hindi.

Would you recommend I use NLTK instead? From what I gather, spaCy (or NLTK) is primarily used for sentence splitting and tokenizing here: https://github.com/Franck-Dernoncourt/NeuroNER/blob/master/src/brat_to_conll.py#L20

@svanhvitlilja

Hi! As you seem to be the people who have the most experience in using NeuroNER for languages other than English, could I please ask you to take a look at my query regarding Icelandic?

Unfortunately spaCy, Stanford, and NLTK don't support Icelandic, so we need to find a way to use NeuroNER by relying on the NLP tools that are available for Icelandic.
Thanks a lot!
Issue: #126

@Peacelover01

Can we use the NeuroNER model for the Urdu language? spaCy doesn't support Urdu. Also, can we use other word embeddings, like Facebook's fastText?

@svanhvitlilja

You can use your own tokenizer and bypass spaCy by changing a few lines in the source code; we did that for Icelandic. I can give you some pointers if you want.
Don't know about the other embeddings, would like to know :)
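
For what it's worth on the embeddings question: NeuroNER's token_pretrained_embedding_filepath points at a plain text file of "token v1 v2 …" lines (the GloVe layout), and fastText's .vec files use the same layout plus a "vocab_size dimension" header line, so stripping the header should be enough to adapt one. A sketch, with hypothetical file names:

```python
# Strip the fastText header so the file matches the GloVe-style layout
# NeuroNER expects; file names are placeholders.
with open('cc.ur.300.vec', encoding='utf-8') as src, \
        open('urdu_word_vectors.txt', 'w', encoding='utf-8') as dst:
    next(src)  # first line of a fastText .vec file is "vocab_size dimension"
    for line in src:
        dst.write(line)
```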

@Peacelover01

Thank you @svanhviti16 for your reply. Pointers would be highly appreciated.
