Error when converting to Term-Document Matrix (when p_attribute !="word") #105

Closed
cgnguyen opened this issue Oct 11, 2019 · 1 comment

Comments

@cgnguyen

I am getting an error when I try to convert the GERMAPARL data into a document-term matrix.

The command performs as expected when p_attribute = "word":

temp <- polmineR::as.DocumentTermMatrix("GERMAPARL", p_attribute = "word", s_attribute = "parliamentary_group", verbose = TRUE)

However, it fails when I set p_attribute to either "lemma" or "pos".

For example, running

tdm <- polmineR::as.TermDocumentMatrix("GERMAPARL", p_attribute = "lemma", s_attribute = "party")

produces the following error message:

Error in simple_triplet_matrix(i = countDT[["doc_id"]], j = countDT[["new_token_id"]], : 'nrow, ncol' invalid

@ablaette
Collaborator

Thanks a lot for reporting this issue and offering the example. I could easily reproduce the error in both cases, irrespective of how the argument p_attribute is defined. So this is consistently buggy behavior.

A new polmineR version (v0.7.11.9045) addresses the issue. You can install it from the development branch of the GitHub repository:

devtools::install_github("PolMine/polmineR", ref = "dev")

To explain: The cause of the error we saw was that the s-attribute "party" does not have a value for all speeches. You will see this when calling:

s_attributes("GERMAPARL")

So there are empty strings ("") among the values. Certainly, we can debate whether the value should rather be "parteilos" in the GermaParl corpus. Be that as it may, the bug we see should not occur.

It was caused by a reindexing step that is needed for the method to work when values are thrown out because additional values for s-attributes are provided. The procedure I had implemented was not robust when an empty string ("") had to be reindexed, which resulted in the error we saw when creating the simple_triplet_matrix.

I now rely on temporarily creating a factor, which achieves the same result but is more robust and (somewhat) faster than before.
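To illustrate why a factor is the more robust choice, here is a minimal base-R sketch. It is not the actual polmineR code: the toy vector `party` and the names `lookup`, `fragile_id` and `robust_id` are hypothetical, and the named-vector lookup is only one assumed way in which a reindexing can trip over an empty string.

```r
# Hypothetical toy data: one s-attribute value per row; "" marks a speech
# without a party (as described above for GERMAPARL).
party <- c("CDU", "", "SPD", "", "CDU", "GRUENE")

# Fragile reindexing (an assumption for illustration, not polmineR's code):
# map each value to a new index via a named vector. When subsetting by
# name in R, an empty string never matches, so "" yields NA.
lookup <- setNames(seq_along(unique(party)), unique(party))
fragile_id <- unname(lookup[party])

# Factor-based reindexing: every distinct value, including "", becomes a
# level, so the integer codes are complete.
party_factor <- factor(party)
robust_id <- as.integer(party_factor)

anyNA(fragile_id)     # are there missing indices in the lookup approach?
anyNA(robust_id)      # ... and in the factor approach?
nlevels(party_factor) # number of distinct values, "" included
```

If indices with NA values like `fragile_id` were passed on to build a sparse matrix, the matrix dimensions could not be derived reliably, which would be in line with the kind of error reported above. The factor codes do not have this problem.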

I had been thinking about this before, yet the issue you report was a good incentive to finally do this. Thanks! Let me know whether everything works now.

@PolMine PolMine closed this as completed Feb 29, 2020