
Feature/Pretokenized and presegmented text #19

Merged · 3 commits · May 9, 2020

Conversation

asajatovic (Collaborator)

@asajatovic asajatovic added the enhancement New feature or request label May 7, 2020
@asajatovic asajatovic requested review from FilipBolt and mttk May 7, 2020 15:02
@asajatovic asajatovic self-assigned this May 7, 2020
except StopIteration:
return False
tokens = line.split("\t") + [str(None)] # EOS token
for i, (token, next_token) in enumerate(zip(tokens[:-1], tokens[1:])):
Collaborator:

Nitpick: this looks like a nice functional solution, but I'd prefer a simpler one using i to index tokens[i + 1], as I would guess it is faster. Also, you have word.id commented out.

asajatovic (Collaborator, Author):

This is more Pythonic according to Ch 1, Item 8. Why would it be faster with explicit indexing?

FilipBolt (Collaborator), May 9, 2020:

Can't say I tested it, but it seems like fewer operations are involved with explicit indexing (I'm not saying the difference is really big). I'm just thinking that indexing list elements is faster than creating two iterable objects.

I don't agree that your case here suits what you reference. That recipe names two or more iterables. Here you have one list (whose length you know) which you then use as two iterables.
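To make the two options concrete, here is an illustrative sketch (not the PR's code) of the pairwise iteration being discussed, in both styles:

```python
def pairs_zip(tokens):
    # Pythonic pairwise iteration: zip over two slices of the same list
    return list(zip(tokens[:-1], tokens[1:]))

def pairs_index(tokens):
    # explicit indexing: walks the list by position instead of
    # building two slice views to zip together
    return [(tokens[i], tokens[i + 1]) for i in range(len(tokens) - 1)]
```

Both produce the same (token, next_token) pairs; any real speed difference between them would need measuring, e.g. with timeit.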

from .utils import get_path


class PretokenizedInputFormat(object):
Collaborator:

I'd suggest adding tokenizer to the name.

asajatovic (Collaborator, Author):

As the short docstring says, this is just a dummy tokenizer. As of spaCy v2.0, Tokenizer.tokens_from_list is deprecated and Doc.__init__ is the new way to create a Doc from pretokenized text. However, this early in the pipeline there is not enough information to justify creating a Doc, so PretokenizedInputFormat used as a dummy tokenizer is, in my opinion, the most elegant solution; it also follows the API of the other UDPipe tokenizers.

elif isinstance(text, list):
if isinstance(text[0], list):
text = "\n".join("\t".join(sent) for sent in text)
tokenizer = PretokenizedInputFormat()
Collaborator:

Perhaps allow users to specify their own tokenizer.

asajatovic (Collaborator, Author):

Again, this is a 'fake' tokenizer to comply with the rest of the code (real tokenizers).

)
)
if not tokenizer:
raise Exception("The model does not have a tokenizer")
Collaborator:

Add what has been passed to the method to the error message, to make it easier to debug.
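A sketch of that suggestion with hypothetical names (get_tokenizer, model, and text are stand-ins, not the PR's actual code):

```python
def get_tokenizer(model, text):
    # hypothetical helper: fail with the offending input in the message,
    # so the caller can see what was passed when no tokenizer exists
    tokenizer = model.newTokenizer(model.DEFAULT)
    if not tokenizer:
        raise Exception(
            f"The model does not have a tokenizer; input was: {text!r}"
        )
    return tokenizer
```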

"""
self.lines = iter(text.split("\n"))

def nextSentence(self, sentence: Sentence, _: ProcessingError) -> bool:
Collaborator:

Why do you have this ProcessingError parameter?

asajatovic (Collaborator, Author):

The custom PretokenizedInputFormat complies with the UDPipe Tokenizer API, which enables easy plug-and-play with the rest of the code.
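A pure-Python sketch of the idea (the class and method names here are illustrative stand-ins, not the PR's exact code): a "tokenizer" that performs no tokenization and only replays pre-split text through a nextSentence-style interface:

```python
class DummyPretokenizedInput:
    """Illustrative stand-in: one sentence per line, tokens joined by tabs."""

    def __init__(self, text):
        self.lines = iter(text.split("\n"))

    def next_sentence(self):
        # mirrors the nextSentence contract: hand back the next
        # sentence's tokens, or None when the input is exhausted
        try:
            line = next(self.lines)
        except StopIteration:
            return None
        return line.split("\t")
```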

tokenizer = self.model.newTokenizer(self.model.DEFAULT)
elif isinstance(text, list):
if isinstance(text[0], list):
text = "\n".join("\t".join(sent) for sent in text)
Collaborator:

Although this REALLY shouldn't happen, you could detect whether there is a '\t' or '\n' in one of the words in the text and emit a warning, or note this in the docs.
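A sketch of that validation (join_pretokenized is a hypothetical helper, not code from the PR):

```python
import warnings

def join_pretokenized(sentences):
    # sentences: list of token lists; a tab or newline inside a token
    # would corrupt the tab/newline-delimited format, so warn first
    for sent in sentences:
        for token in sent:
            if "\t" in token or "\n" in token:
                warnings.warn(
                    f"token {token!r} contains a tab or newline and will "
                    "be split incorrectly"
                )
    return "\n".join("\t".join(sent) for sent in sentences)
```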

sentences = []

sentence = Sentence()
while input_format.nextSentence(sentence, error):
Collaborator:

Perhaps a cleaner solution would be to yield the sentence in each iteration. That is much more in line with what typical iterators do, and then you can simply have a for loop instead.

asajatovic (Collaborator, Author):

A valid point; however, these lines actually call C++ code via Python bindings.
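For illustration, the suggested generator could look like this sketch (iter_sentences and make_sentence are hypothetical names; the fake below stands in for the C++-backed InputFormat object):

```python
def iter_sentences(input_format, make_sentence, error=None):
    # hypothetical wrapper: yield each parsed sentence as it is read,
    # instead of accumulating them all in a list first
    while True:
        sentence = make_sentence()
        if not input_format.nextSentence(sentence, error):
            break
        yield sentence
```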


BramVanroy (Contributor):

Great stuff, thanks! Any reason these are tagged as pre-releases? I think they are full releases on PyPI, right?

asajatovic (Collaborator, Author):

> Great stuff, thanks! Any reason these are tagged as pre-releases? I think they are full releases on PyPI, right?

You are welcome!
The reason for tagging them as pre-releases is that they are not quite ready to be used in a production environment, and pip ignores pre-releases by default.

@asajatovic asajatovic deleted the feature/tokenization branch August 19, 2020 11:11
Labels: enhancement (New feature or request)
3 participants