
Steps to utilize NeuroNER for other languages #30

Open · sooheon opened this issue Jul 4, 2017 · 10 comments

sooheon commented Jul 4, 2017

It appears that BRAT, at least, is pretty language-agnostic. The English-specific parts of NeuroNER (afaict) are the recommended glove.6B.100d word vectors and all of the spaCy-related tokenizing code, which is used to translate BRAT format into CoNLL format (correct?)

Am I correct that if I:

  1. Supply Korean word vectors in /data/word_vectors
  2. Supply CoNLL-formatted train, valid, and test data, produced from BRAT-labeled Korean text run through my own tokenizer

I will be able to train and use NeuroNER for Korean text?
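
For reference, NeuroNER's dataset reader takes the token from the first column of each line and the tag from the last column, with blank lines separating sentences (CoNLL-2003 style), so step 2 amounts to producing files along these lines (the Korean sentence and tags below are made up for illustration):

```
삼성전자 B-ORG
가 O
서울 B-LOC
에서 O
신제품 O
을 O
발표했다 O
. O
```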

Franck-Dernoncourt (Owner) commented Jul 4, 2017 via email

Gregory-Howard (Contributor) commented Jul 6, 2017

Hi (I'm the guy who uses NeuroNER in French)!
Those two steps are right, but you also need spaCy (or NLTK) working in Korean.
To explain a bit more for spaCy:
You need a spaCy Korean model, which consists of a tokenizer and a POS-tagging model.
Someone asked exactly this question: explosion/spaCy#929
Then you will have to change spacylanguage in parameters.ini.
I hope that's clear; if not, feel free to ask.

In short, for a spaCy-supported language X: get a spaCy model for X, then set spacylanguage = X in parameters.ini (see the sketch below).
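
A minimal sketch of what that model buys you, mirroring the `document.sents` loop in brat_to_conll.py ('ko' is a hypothetical model name here; the hard part is that such a model must actually exist and be installed):

```python
import spacy

# spacy.blank('ko') would give a bare tokenizer, but no sentence boundaries
# or POS tags; the brat_to_conll.py path needs a full model for .sents.
nlp = spacy.load('ko')  # hypothetical Korean model

document = nlp('...some Korean text...')
for span in document.sents:           # sentence segmentation: needs a model
    for token in span:
        print(token.text, token.idx)  # token text + character offset, as CoNLL needs
```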

sooheon (Author) commented Jul 6, 2017

Thanks for the additional detail! That looks perfectly doable.

@ersinyar

I don't understand what exactly spaCy (or NLTK) does in NeuroNER. I think spaCy is used as a tokenizer. Do we need a language-specific tokenizer? And why do we need a POS-tagging model? Can't we just use NLTK for tokenization?

@Gregory-Howard (Contributor)

spaCy is used in this file: https://github.com/Franck-Dernoncourt/NeuroNER/blob/master/src/brat_to_conll.py#L20
The problem is the line `for span in document.sents:`, since this method needs a full model to work.
I think if we transform the code a bit, we might need only a tokenizer.
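
As a rough sketch of that transformation, NLTK-based (this assumes NLTK's `punkt` sentence tokenizer is installed; the function name and output shape just mirror the spaCy helper in brat_to_conll.py, this is not code from the repo):

```python
import nltk  # requires: nltk.download('punkt')

def get_sentences_and_tokens_from_nltk(text):
    # Tokenizer-only: no POS tagger or parser needed, unlike document.sents.
    sentences = []
    cursor = 0
    for sentence in nltk.sent_tokenize(text):
        tokens = []
        for token_text in nltk.word_tokenize(sentence):
            # word_tokenize can rewrite some characters (e.g. quotes),
            # so skip any token we cannot locate in the original text.
            start = text.find(token_text, cursor)
            if start == -1:
                continue
            tokens.append({'text': token_text,
                           'start': start,
                           'end': start + len(token_text)})
            cursor = start + len(token_text)
        sentences.append(tokens)
    return sentences
```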

@Killthebug

Hey all! I'm trying to get NeuroNER to work for some Hindi data, but from what I understand spaCy does not support Hindi.

Would you recommend I use NLTK instead? From what I gather, spaCy (or NLTK) is primarily used for sentence splitting and tokenizing here: https://github.com/Franck-Dernoncourt/NeuroNER/blob/master/src/brat_to_conll.py#L20

@svanhvitlilja

Hi! As you seem to be the people who have the most experience in using NeuroNER for languages other than English, could I please ask you to take a look at my query regarding Icelandic?

Unfortunately spaCy, Stanford, and NLTK don't support Icelandic, so we need to find a way to use NeuroNER by relying on the NLP tools that are available for Icelandic.
Thanks a lot!
Issue: #126

@Peacelover01

Can we use the NeuroNER model for the Urdu language? spaCy doesn't support Urdu. Also, can we use other word embeddings, like Facebook's fastText?

@svanhvitlilja

You can use your own tokenizer and bypass spaCy by changing a few lines in the source code; we did that for Icelandic. I can give you some pointers if you want.
Don't know about the other embeddings, would like to know :)
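
For what it's worth on the embeddings question: NeuroNER's token_pretrained_embedding_filepath points at a plain text file of "token v1 v2 …" lines (the GloVe layout), and fastText's .vec files use the same layout plus a "vocab_size dimension" header line, so stripping the header should be enough to adapt one. A sketch, with hypothetical file names:

```python
# Strip the fastText header so the file matches the GloVe-style layout
# NeuroNER expects; file names are placeholders.
with open('cc.ur.300.vec', encoding='utf-8') as src, \
        open('urdu_word_vectors.txt', 'w', encoding='utf-8') as dst:
    next(src)  # first line of a fastText .vec file is "vocab_size dimension"
    for line in src:
        dst.write(line)
```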

@Peacelover01

Thank you @svanhviti16 for your reply. Pointers would be highly appreciated.
