Steps to utilize NeuroNER for other languages #30
Comments
Correct! Note that providing word vectors is optional (it's typically
better if you have some), and that I haven't tested NeuroNER with languages
other than English. I know someone successfully used it for French (after an
encoding-fix PR :)), and someone was supposed to try it with Bengali, but I
haven't heard back from him.
…On Jul 3, 2017 9:49 PM, "Sooheon Kim" ***@***.***> wrote:
It appears that brat, at least, is pretty language-agnostic. The English-specific
parts of NeuroNER (afaict) are the recommended glove.6B.100d
word vectors and all of the spaCy-related tokenizing code, which is used
to translate brat format into CoNLL format (correct?)
Am I correct that if I:
1. Supply Korean word vectors in /data/word_vectors
2. Supply CoNLL-formatted train, valid, and test data using brat-labeled
Korean text which I run through my own tokenizer
I will be able to train and use NeuroNER for Korean text?
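The two numbered steps in the quoted question can be sketched concretely. The example below is a minimal, hypothetical illustration: it stands in a naive whitespace tokenizer for a real Korean tokenizer and emits one `token doc-id start end tag` line per token, with a blank line between sentences. That column layout is my reading of NeuroNER's `brat_to_conll.py` output; double-check it against the English examples shipped with the repo before relying on it.

```python
import re

def bio_tag(start, end, entities):
    # Return a B-/I-/O tag for a token span, given (ent_start, ent_end, label)
    # character spans taken from a brat .ann file.
    for ent_start, ent_end, label in entities:
        if start >= ent_start and end <= ent_end:
            return ("B-" if start == ent_start else "I-") + label
    return "O"

def text_to_conll(text, entities, doc_id="doc_001"):
    # Whitespace tokenization is only a placeholder here -- swap in a real
    # Korean tokenizer that preserves character offsets.
    lines = []
    offset = 0
    for sentence in text.split("\n"):
        for match in re.finditer(r"\S+", sentence):
            start = offset + match.start()
            end = offset + match.end()
            tag = bio_tag(start, end, entities)
            lines.append(f"{match.group()} {doc_id} {start} {end} {tag}")
        lines.append("")              # blank line separates sentences
        offset += len(sentence) + 1   # +1 for the newline character
    return "\n".join(lines)

# Toy example: "서울" (Seoul) annotated as a location, characters 0-2.
print(text_to_conll("서울 에서 만나요", [(0, 2, "LOC")]))
```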
Hi (I'm the guy who uses NeuroNER in French)! Steps for a (spaCy) language X:
Thanks for the additional detail! That looks perfectly doable.
I don't understand what exactly spaCy (or NLTK) does in NeuroNER. I think spaCy is used as a tokenizer. Do we need a language-specific tokenizer? And also, why do we need a POS tagging model? Can't we just use NLTK for tokenization?
spaCy is used in this file: https://github.com/Franck-Dernoncourt/NeuroNER/blob/master/src/brat_to_conll.py#L20
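To make that dependency concrete: in `brat_to_conll.py`, spaCy is only used to split text into sentences and tokens with character offsets. A drop-in replacement just has to return the same shape of data. The sketch below assumes that shape is a list of sentences, each a list of dicts with `text`, `start`, and `end` keys — verify the key names against the spaCy-based function in that file before swapping this in.

```python
import re

def get_sentences_and_tokens_naive(text):
    """Minimal stand-in for NeuroNER's spaCy-based sentence/token splitter.

    Splits sentences on newlines and tokens on whitespace, keeping
    character offsets into the original text. Any real tokenizer for
    your language works here, as long as it yields the same shape.
    """
    sentences = []
    offset = 0
    for line in text.split("\n"):
        tokens = [
            {"text": m.group(), "start": offset + m.start(), "end": offset + m.end()}
            for m in re.finditer(r"\S+", line)
        ]
        if tokens:
            sentences.append(tokens)
        offset += len(line) + 1  # account for the newline character
    return sentences
```

The offsets matter because NeuroNER aligns tokens back to the brat `.ann` entity spans by character position, so a tokenizer that normalizes or drops characters would break that alignment.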
Hey all! I'm trying to get NeuroNER to work for some Hindi data, but from what I understand spaCy does not support Hindi. Would you recommend I use NLTK instead? From what I gather, spaCy (or NLTK) is primarily used for sentence splitting and tokenizing here: https://github.com/Franck-Dernoncourt/NeuroNER/blob/master/src/brat_to_conll.py#L20
Hi! As you seem to be the people who have the most experience in using NeuroNER for languages other than English, could I please ask you to take a look at my query regarding Icelandic? Unfortunately spaCy, Stanford, and NLTK don't support Icelandic, so we need to find a way to use NeuroNER by relying on available NLP tools for Icelandic.
Can we use the NeuroNER model for the Urdu language? spaCy doesn't support Urdu. Also, can we use other word embeddings, like Facebook's fastText?
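On the fastText half of the question: NeuroNER reads pretrained embeddings from a plain-text file in the GloVe layout (`token v1 v2 … vn`, one line per token, referenced via the `token_pretrained_embedding_filepath` parameter). fastText `.vec` files use the same layout except for a one-line `count dim` header, so stripping that header should be all the conversion needed — this is an assumption worth verifying against the loader in the NeuroNER source.

```python
def fasttext_vec_to_glove(src_path, dst_path):
    """Drop the 'count dim' header line from a fastText .vec file so it
    matches the GloVe-style text format NeuroNER expects."""
    with open(src_path, encoding="utf-8") as src, \
         open(dst_path, "w", encoding="utf-8") as dst:
        src.readline()        # e.g. "2000000 300" -- discarded
        for line in src:      # remaining lines are already "token v1 v2 ..."
            dst.write(line)
```

Point `token_pretrained_embedding_filepath` in `parameters.ini` at the converted file.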
You can use your own tokenizer and bypass spaCy by changing a few lines in the source code; we did that for Icelandic. I can give you some pointers if you want.
Thank you @svanhviti16 for your reply. It will be highly appreciated.