
Feature/Pretokenized and presegmented text #19

Merged · 3 commits · May 9, 2020

Conversation

asajatovic (Collaborator)

@asajatovic asajatovic added the enhancement New feature or request label May 7, 2020
@asajatovic asajatovic requested review from FilipBolt and mttk May 7, 2020 15:02
@asajatovic asajatovic self-assigned this May 7, 2020
except StopIteration:
return False
tokens = line.split("\t") + [str(None)] # EOS token
for i, (token, next_token) in enumerate(zip(tokens[:-1], tokens[1:])):
Collaborator:

Nitpick: this looks like a nice functional solution, but I'd prefer a simpler one using i to index tokens[i + 1], as I would guess it is faster. Also, you have word.id commented out.

asajatovic (Collaborator, Author):

This is more Pythonic according to Ch 1, Item 8. Why would it be faster with explicit indexing?

FilipBolt (Collaborator), May 9, 2020:

Can't say I tested it, but it seems like fewer operations are involved with explicit indexing (I'm not saying the difference is really big). I'm just thinking that indexing list elements is faster than creating two iterable objects.

I don't agree that your case here suits what you reference. That recipe names two or more iterables. Here you have one list (whose length you know) which you then use as two iterables.
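To make the two options concrete, here is an illustrative sketch (not the PR's code) of the pairwise iteration being discussed, in both styles:

```python
def pairs_zip(tokens):
    # Pythonic pairwise iteration: zip over two slices of the same list
    return list(zip(tokens[:-1], tokens[1:]))

def pairs_index(tokens):
    # explicit indexing: walks the list by position instead of
    # building two slice views to zip together
    return [(tokens[i], tokens[i + 1]) for i in range(len(tokens) - 1)]
```

Both produce the same (token, next_token) pairs; any real speed difference between them would need measuring, e.g. with timeit.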

from .utils import get_path


class PretokenizedInputFormat(object):
Collaborator:

I'd suggest adding tokenizer to the name.

asajatovic (Collaborator, Author):

As the short docstring says, this is just a dummy tokenizer. As of spaCy v2.0, Tokenizer.tokens_from_list is deprecated and Doc.__init__ is the new way to create a Doc from pretokenized text. However, this early in the pipeline there is not enough information to justify creating a Doc, so PretokenizedInputFormat used as a dummy tokenizer is, in my opinion, the most elegant solution; it also follows the API of the other UDPipe tokenizers.

elif isinstance(text, list):
if isinstance(text[0], list):
text = "\n".join("\t".join(sent) for sent in text)
tokenizer = PretokenizedInputFormat()
Collaborator:

Perhaps allow users to specify their own tokenizer.

asajatovic (Collaborator, Author):

Again, this is a 'fake' tokenizer to comply with the rest of the code (real tokenizers).

)
)
if not tokenizer:
raise Exception("The model does not have a tokenizer")
Collaborator:

Add what has been passed to the method to the error message, to make it easier to debug.
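A sketch of that suggestion with hypothetical names (get_tokenizer, model, and text are stand-ins, not the PR's actual code):

```python
def get_tokenizer(model, text):
    # hypothetical helper: fail with the offending input in the message,
    # so the caller can see what was passed when no tokenizer exists
    tokenizer = model.newTokenizer(model.DEFAULT)
    if not tokenizer:
        raise Exception(
            f"The model does not have a tokenizer; input was: {text!r}"
        )
    return tokenizer
```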

"""
self.lines = iter(text.split("\n"))

def nextSentence(self, sentence: Sentence, _: ProcessingError) -> bool:
Collaborator:

Why do you have this ProcessingError parameter?

asajatovic (Collaborator, Author):

The custom PretokenizedInputFormat complies with the UDPipe Tokenizer API, which enables easy plug-and-play with the rest of the code.
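A pure-Python sketch of the idea (the class and method names here are illustrative stand-ins, not the PR's exact code): a "tokenizer" that performs no tokenization and only replays pre-split text through a nextSentence-style interface:

```python
class DummyPretokenizedInput:
    """Illustrative stand-in: one sentence per line, tokens joined by tabs."""

    def __init__(self, text):
        self.lines = iter(text.split("\n"))

    def next_sentence(self):
        # mirrors the nextSentence contract: hand back the next
        # sentence's tokens, or None when the input is exhausted
        try:
            line = next(self.lines)
        except StopIteration:
            return None
        return line.split("\t")
```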

tokenizer = self.model.newTokenizer(self.model.DEFAULT)
elif isinstance(text, list):
if isinstance(text[0], list):
text = "\n".join("\t".join(sent) for sent in text)
Collaborator:

Although this REALLY shouldn't happen, you could detect whether there is a '\t' or '\n' in one of the words in the text and emit a warning, or note this in the docs.
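A sketch of that validation (join_pretokenized is a hypothetical helper, not code from the PR):

```python
import warnings

def join_pretokenized(sentences):
    # sentences: list of token lists; a tab or newline inside a token
    # would corrupt the tab/newline-delimited format, so warn first
    for sent in sentences:
        for token in sent:
            if "\t" in token or "\n" in token:
                warnings.warn(
                    f"token {token!r} contains a tab or newline and will "
                    "be split incorrectly"
                )
    return "\n".join("\t".join(sent) for sent in sentences)
```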

sentences = []

sentence = Sentence()
while input_format.nextSentence(sentence, error):
Collaborator:

Perhaps a cleaner solution would be to yield the sentence in each iteration. That is much more in line with what typical iterators do, and then you can simply have a for loop instead.

asajatovic (Collaborator, Author):

A valid point; however, these lines actually call C++ code via Python bindings.
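For illustration, the suggested generator could look like this sketch (iter_sentences and make_sentence are hypothetical names; the fake below stands in for the C++-backed InputFormat object):

```python
def iter_sentences(input_format, make_sentence, error=None):
    # hypothetical wrapper: yield each parsed sentence as it is read,
    # instead of accumulating them all in a list first
    while True:
        sentence = make_sentence()
        if not input_format.nextSentence(sentence, error):
            break
        yield sentence
```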


BramVanroy (Contributor):

Great stuff, thanks! Any reason these are tagged as pre-releases? I think they are full releases on PyPI, right?

asajatovic (Collaborator, Author):

> Great stuff, thanks! Any reason these are tagged as pre-releases? I think they are full releases on PyPI, right?

You are welcome!
The reason for tagging them as pre-releases is that they are not quite ready to be used in a production environment, and pip ignores pre-releases by default.

@asajatovic asajatovic deleted the feature/tokenization branch August 19, 2020 11:11
Labels: enhancement (New feature or request)
3 participants