Feature/Pretokenized and presegmented text #19
Conversation
spacy_udpipe/udpipe.py
Outdated
except StopIteration:
    return False
tokens = line.split("\t") + [str(None)]  # EOS token
for i, (token, next_token) in enumerate(zip(tokens[:-1], tokens[1:])):
nitpick: this looks like a nice functional solution, but I'd prefer a simpler one using i to check tokens[i + 1], as I would guess it is faster. Also, you have word.id commented out.
This is more Pythonic according to Ch 1, Item 8. Why would it be faster with explicit indexing?
Can't say I tested it, but it just seems like fewer operations are involved with explicit indexing (I'm not saying the difference is really big). I'm just thinking that indexing list elements is faster than creating two iterable objects.
I don't agree that your case here suits what you reference. It names two or more iterables. Here you have one list (whose length you know), which you then use as two iterables.
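The two styles being debated above can be put side by side in a small sketch; the function names here are illustrative, not taken from the project code:

```python
# Two equivalent ways to pair each token with its successor.
def pairs_zip(tokens):
    # Functional style from the diff: zip the list against itself shifted by one.
    return [(tok, nxt) for tok, nxt in zip(tokens[:-1], tokens[1:])]

def pairs_index(tokens):
    # Reviewer's suggestion: look ahead with explicit indexing, tokens[i + 1].
    return [(tokens[i], tokens[i + 1]) for i in range(len(tokens) - 1)]

tokens = ["I", "like", "spaCy", str(None)]  # str(None) as the EOS marker
assert pairs_zip(tokens) == pairs_index(tokens)
```

Both produce the same pairs; the difference is purely stylistic (and, at most, a marginal constant-factor cost either way).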
from .utils import get_path

class PretokenizedInputFormat(object):
I'd suggest adding tokenizer to the name.
As the short doc says, this is just a dummy tokenizer. From v2.0 of spaCy, Tokenizer.tokens_from_list has been deprecated and Doc.__init__ is the new method to create a Doc from pretokenized text. However, this early in the pipeline, not enough info is available to justify creating a Doc, so PretokenizedInputFormat used as a dummy tokenizer is imho the most elegant solution, because it also follows the API of other UDPipe Tokenizers.
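The dummy-tokenizer idea described above can be sketched in pure Python; the class and method names here are illustrative stand-ins, since the real PretokenizedInputFormat plugs into UDPipe's C++ Tokenizer API rather than this simplified interface:

```python
# Sketch: pretokenized sentences are serialized as tab-separated tokens,
# one sentence per line, then read back one sentence at a time.
class DummyPretokenizedReader:
    def __init__(self, text):
        self.lines = iter(text.split("\n"))

    def next_sentence(self):
        # Return the next sentence's tokens, or None when input is exhausted.
        try:
            line = next(self.lines)
        except StopIteration:
            return None
        return line.split("\t")

sents = [["Hello", "world"], ["Bye"]]
text = "\n".join("\t".join(sent) for sent in sents)
reader = DummyPretokenizedReader(text)
assert reader.next_sentence() == ["Hello", "world"]
assert reader.next_sentence() == ["Bye"]
assert reader.next_sentence() is None
```

This mirrors only the text handling (the "\n"/"\t" encoding visible in the diff below), not the Sentence-building side of the real class.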
elif isinstance(text, list):
    if isinstance(text[0], list):
        text = "\n".join("\t".join(sent) for sent in text)
    tokenizer = PretokenizedInputFormat()
perhaps allow for users to specify their own tokenizer
Again, this is a 'fake' tokenizer to comply with the rest of the code (real tokenizers).
spacy_udpipe/udpipe.py
Outdated
)
)
if not tokenizer:
    raise Exception("The model does not have a tokenizer")
Add what has been passed to the method to make it easier to debug.
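One way to act on this suggestion is to include the offending input in the exception message; the helper name and signature below are hypothetical, for illustration only:

```python
# Sketch: fail with a message that echoes what was passed in,
# so the caller can see which input triggered the error.
def require_tokenizer(tokenizer, text):
    if not tokenizer:
        raise Exception(
            f"The model does not have a tokenizer, "
            f"so it cannot process input: {text!r}"
        )
    return tokenizer
```

Including the repr of the input (truncated if large) makes the traceback self-explanatory during debugging.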
""" | ||
self.lines = iter(text.split("\n")) | ||
|
||
def nextSentence(self, sentence: Sentence, _: ProcessingError) -> bool: |
Why do you have this ProcessingError parameter?
Custom PretokenizedInputFormat complies with the UDPipe Tokenizer API to enable easy plug-and-play with the rest of the code.
    tokenizer = self.model.newTokenizer(self.model.DEFAULT)
elif isinstance(text, list):
    if isinstance(text[0], list):
        text = "\n".join("\t".join(sent) for sent in text)
Although this REALLY shouldn't happen, you could detect whether there is a '\t' or '\n' in one of the words in the text and give a warning, or mention this in the docs.
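The suggested check could look roughly like this; the function name is illustrative, not part of the project:

```python
import warnings

# Sketch: warn if a pretokenized word contains the characters that the
# "\n"/"\t" serialization uses as sentence and token separators, since
# such a word would be silently split when the text is parsed back.
def validate_words(sentences):
    for sent in sentences:
        for word in sent:
            if "\t" in word or "\n" in word:
                warnings.warn(
                    f"Word {word!r} contains a separator character "
                    "and will be split incorrectly."
                )
```

A warning (rather than an error) matches the reviewer's framing that this "REALLY shouldn't happen" in practice.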
sentences = []

sentence = Sentence()
while input_format.nextSentence(sentence, error):
perhaps a cleaner solution would be to yield the sentence in each iteration. That is much more in line with what typical iterators do, and then you can simply have a for loop instead.
A valid point, however, these lines actually call C++ code via Python bindings.
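The generator the reviewer has in mind could be a thin wrapper over the nextSentence-style API; the helper and the FakeFormat test double below are illustrative, since the real calls go through UDPipe's C++ bindings:

```python
# Sketch: wrap the nextSentence(sentence, error) protocol in a generator,
# so callers can use a plain for loop instead of a while loop.
def iter_sentences(input_format, make_sentence, error=None):
    while True:
        sentence = make_sentence()
        if not input_format.nextSentence(sentence, error):
            break
        yield sentence

# A stand-in for the real UDPipe input format, for demonstration only.
class FakeFormat:
    def __init__(self, n):
        self.n = n
    def nextSentence(self, sentence, error):
        if self.n == 0:
            return False
        self.n -= 1
        sentence.append("tok")
        return True

sents = list(iter_sentences(FakeFormat(3), list))
assert len(sents) == 3
```

Since the loop body just forwards to the bindings, wrapping it costs little and keeps the calling code idiomatic.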
Great stuff, thanks!! Any reason these are tagged as pre-releases? I think they are full releases on PyPI, right?
You are welcome!
UDPipe bindings to udpipe.py