Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Change Dataset interface to support sparse matrix. #165

Merged
merged 7 commits into from
Jul 3, 2019

Conversation

eugene-yang
Copy link
Contributor

@eugene-yang eugene-yang commented Jun 30, 2019

Changed the libact.base.dataset.Dataset interface to support sparse matrix as X.
The interfaces for get_entries, get_labeled_entries and get_unlabled_entries are changed.
Since most of the usage of these methods are getting the list of tuple and zip them back to the a feature matrix and list of labels, directly change the interface to output in this format would benefit both using and storing the data.

This would also directly support scipy.sparse.csr_matrix since the zipping during the initialization is removed.
The interface of Dataset.data[] is still implemented via __getitem__ magic method to support some of the use case that involve direct access to the entries.

@codecov-io
Copy link

codecov-io commented Jun 30, 2019

Codecov Report

Merging #165 into master will decrease coverage by <.01%.
The diff coverage is 98.52%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master     #165      +/-   ##
==========================================
- Coverage   89.46%   89.46%   -0.01%     
==========================================
  Files          37       37              
  Lines        1557     1566       +9     
==========================================
+ Hits         1393     1401       +8     
- Misses        164      165       +1
Impacted Files Coverage Δ
...ery_strategies/multiclass/hierarchical_sampling.py 95.85% <100%> (ø) ⬆️
...query_strategies/multilabel/binary_minimization.py 100% <100%> (ø) ⬆️
...ct/query_strategies/active_learning_by_learning.py 85.71% <100%> (ø) ⬆️
...ltilabel/cost_sensitive_reference_pair_encoding.py 92.1% <100%> (ø) ⬆️
libact/query_strategies/variance_reduction.py 68.88% <100%> (-1.33%) ⬇️
libact/labelers/ideal_labeler.py 100% <100%> (ø) ⬆️
libact/query_strategies/hintsvm.py 93.02% <100%> (ø) ⬆️
...es/multilabel/multilabel_with_auxiliary_learner.py 89.58% <100%> (ø) ⬆️
libact/models/multilabel/dummy_clf.py 94.11% <100%> (ø) ⬆️
.../multiclass/active_learning_with_cost_embedding.py 100% <100%> (ø) ⬆️
... and 10 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update d86b7b8...776ee7b. Read the comment docs.

libact/base/dataset.py Outdated Show resolved Hide resolved
@yangarbiter
Copy link
Collaborator

yangarbiter commented Jul 1, 2019

The changes looks good to my.
Thanks for the contribution.
Just fix the coding style (mainly about whitespaces) and I think it's be ready to merge.

@eugene-yang
Copy link
Contributor Author

@yangarbiter Do you want me to fix them?
Like the whitespaces around the brackets?

@yangarbiter
Copy link
Collaborator

Yes, please fix them
For example lb = lbr.label( trn_ds.data[ask_id][0] ) should be lb = lbr.label(trn_ds.data[ask_id][0])
I didn't mark all of them.
But try to comply with google's style guide (https://google.github.io/styleguide/pyguide.html#36-whitespace)
Thanks.

@yangarbiter
Copy link
Collaborator

yangarbiter commented Jul 3, 2019

Last two questions and I'll merge.
Thanks for the hard work @eugene-yang .

examples/albl_plot.py Outdated Show resolved Hide resolved
examples/plot.py Outdated Show resolved Hide resolved
libact/base/dataset.py Outdated Show resolved Hide resolved
libact/base/dataset.py Outdated Show resolved Hide resolved
libact/base/dataset.py Outdated Show resolved Hide resolved
"""
return list(filter(lambda entry: entry[1] is not None, self.data))
return self._X[self.get_labeled_mask()], self._y[self.get_labeled_mask()].tolist()
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we need to make y into a list, can we just keep it numpy array?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We actually need this.
For some multi-label code, I think they are taking advantage of the nested structure of out output. I removed it ran against the unit test and got back with a lot of fails. So I would suggest just keep it as a list. And I don't think the performance would not be affected too much.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, I think we can merge first and I'll look into this later.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cool. I am starting to adapt libact to my own research experiments. So might start opening more full requests for improvements in the future :)

libact/query_strategies/quire.py Outdated Show resolved Hide resolved
@yangarbiter yangarbiter merged commit 33975f8 into ntucllab:master Jul 3, 2019
@eugene-yang eugene-yang deleted the sparse-dataset branch July 3, 2019 23:22
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants