Efficient workflow to decode sentences of corpus #176

Closed

ablaette opened this issue Nov 19, 2020 · 3 comments

Comments

@ablaette
Collaborator

Getting a list of sentences would be possible with something like corpus() %>% split() %>% get_token_stream(), but this may not be very efficient (a spelled-out sketch of this baseline follows the two snippets below). A nicer and more efficient workflow might look as follows:

corpus("REUTERS") %>%
  regions(s_attribute = "s") %>%
  get_token_stream(split = TRUE)

Or, alternatively:

corpus("REUTERS") %>%
  segment(s_attribute = "s") %>%
  get_token_stream(split = TRUE)
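
For reference, the baseline mentioned at the top, spelled out as a runnable sketch (the s_attribute argument of split(), the explicit p_attribute, and the availability of a sentence annotation "s" in the sample corpus are assumptions on my side):

library(polmineR)
use("polmineR")  # activate the sample corpora shipped with the package

# Split the corpus into one subcorpus per sentence, then decode each subcorpus.
# Creating this (potentially huge) bundle of objects is what makes it slow.
sentences <- corpus("REUTERS") %>%
  split(s_attribute = "s") %>%
  get_token_stream(p_attribute = "word")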
@ablaette
Collaborator Author

A new polmineR version on the dev branch (v0.8.5.9011) now includes an implementation of the first option sketched above. Generally speaking, the implementation is much faster than the original approach, and performance is satisfactory. The remaining bottleneck is a cut() call; I do not yet see how to speed that up. But as I said, the performance gain is already significant.
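
A rough way to compare the two approaches yourself (an illustrative sketch, not the benchmark I ran; it assumes the dev version and a sentence annotation "s"):

# naive approach: split into a bundle of subcorpora, then decode each one
system.time(
  a <- corpus("REUTERS") %>%
    split(s_attribute = "s") %>%
    get_token_stream(p_attribute = "word")
)

# new approach: derive the regions first, then decode once and split the token stream
system.time(
  b <- corpus("REUTERS") %>%
    regions(s_attribute = "s") %>%
    get_token_stream(p_attribute = "word", split = TRUE)
)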

@ChristophLeonhardt
Contributor

This is a fine solution. However, it does not yet work with subcorpora: the entire corpus is split even if you insert a subset() into the pipe before regions().
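
Something like the following illustrates what I mean (the s-attribute and value used for subsetting are just an example):

sc <- corpus("REUTERS") %>%
  subset(grepl("saudi-arabia", places))

sentences <- sc %>%
  regions(s_attribute = "s") %>%
  get_token_stream(p_attribute = "word", split = TRUE)

# currently this is the sentence count of the full corpus, not of the subcorpus
length(sentences)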

@ablaette
Collaborator Author

Good point. I have now implemented a regions() method for subcorpus objects. This is an example I played with:

x <- corpus("GERMAPARL_PARLACLARIN_III") %>%
  subset(year == "2017") %>%
  regions(s_attribute = "s") %>%
  get_token_stream(split = TRUE)

Performance is ok, I think.
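
A few quick sanity checks on the result (assuming x is the list of character vectors returned above):

class(x)      # expected: a list with one character vector per sentence
length(x)     # number of sentences in the 2017 subcorpus
head(x[[1]])  # first tokens of the first sentence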

ablaette pushed a commit that referenced this issue Apr 25, 2022