Efficient workflow to decode sentences of corpus #176

Closed

ablaette opened this issue Nov 19, 2020 · 3 comments

Comments

@ablaette
Collaborator

Getting a list of sentences would be possible with something like corpus() %>% split() %>% get_token_stream(), but this may not be very efficient (a spelled-out sketch of this baseline follows the two snippets below). A nicer and more efficient workflow might look as follows:

corpus("REUTERS") %>%
  regions(s_attribute = "s") %>%
  get_token_stream(split = TRUE)

Or, alternatively:

corpus("REUTERS") %>%
  segment(s_attribute = "s") %>%
  get_token_stream(split = TRUE)
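
For reference, the baseline mentioned at the top, spelled out as a runnable sketch (the s_attribute argument of split(), the explicit p_attribute, and the availability of a sentence annotation "s" in the sample corpus are assumptions on my side):

library(polmineR)
use("polmineR")  # activate the sample corpora shipped with the package

# Split the corpus into one subcorpus per sentence, then decode each subcorpus.
# Creating this (potentially huge) bundle of objects is what makes it slow.
sentences <- corpus("REUTERS") %>%
  split(s_attribute = "s") %>%
  get_token_stream(p_attribute = "word")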
@ablaette
Collaborator Author

A new polmineR version on the dev branch (v0.8.5.9011) now includes an implementation of the first option sketched above. Generally speaking, the implementation is much faster than the original approach, and performance is satisfactory. The remaining bottleneck is a cut() call; I do not yet see how to speed that up. But as I said, the performance gain is already significant.
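
A rough way to compare the two approaches yourself (an illustrative sketch, not the benchmark I ran; it assumes the dev version and a sentence annotation "s"):

# naive approach: split into a bundle of subcorpora, then decode each one
system.time(
  a <- corpus("REUTERS") %>%
    split(s_attribute = "s") %>%
    get_token_stream(p_attribute = "word")
)

# new approach: derive the regions first, then decode once and split the token stream
system.time(
  b <- corpus("REUTERS") %>%
    regions(s_attribute = "s") %>%
    get_token_stream(p_attribute = "word", split = TRUE)
)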

@ChristophLeonhardt
Contributor

This is a fine solution. However, it does not yet work with subcorpora: the entire corpus is split even if you insert a subset() into the pipe before regions().
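
Something like the following illustrates what I mean (the s-attribute and value used for subsetting are just an example):

sc <- corpus("REUTERS") %>%
  subset(grepl("saudi-arabia", places))

sentences <- sc %>%
  regions(s_attribute = "s") %>%
  get_token_stream(p_attribute = "word", split = TRUE)

# currently this is the sentence count of the full corpus, not of the subcorpus
length(sentences)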

@ablaette
Collaborator Author

Good point. I have now implemented a regions() method for subcorpus objects. This is an example I played with:

x <- corpus("GERMAPARL_PARLACLARIN_III") %>%
  subset(year == "2017") %>%
  regions(s_attribute = "s") %>%
  get_token_stream(split = TRUE)

Performance is ok, I think.
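
A few quick sanity checks on the result (assuming x is the list of character vectors returned above):

class(x)      # expected: a list with one character vector per sentence
length(x)     # number of sentences in the 2017 subcorpus
head(x[[1]])  # first tokens of the first sentence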

ablaette pushed a commit that referenced this issue Apr 25, 2022