Feature/pubfig83 #46
base: master
Conversation
@zstone, your comments would be great as well.
```python
p = rng.permutation(len(remainder))
if 'Train%d' % _ind not in splits:
    splits['Train%d' % _ind] = []
splits['Train%d' % _ind].extend(remainder[p[:80]].copy())
```
Any reason to have hardcoded values here (e.g. 90, 10, 80)?
I originally preferred the variable ntest, ntrain approach, like you suggest. On the other hand, it might be good to have "standard" splits; I've occasionally heard @zstone and @davidcox talking about standard 90/10 splits. Of course, we COULD make them variables and just set defaults. I think it depends on how the creators of this dataset intended it to be used -- @npinto, you and @zstone worked on this together (?), so if you think giving people guidance that suggests non-standard-sized splits are "OK", then that seems fine to me.
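For concreteness, the "variables with standard defaults" option could look something like the sketch below. The function name, argument names, and defaults are all hypothetical, not the dataset's actual API; it just parameterizes the hardcoded 90/10/80 values while keeping the 90/10 behavior as the default:

```python
import numpy as np


def make_splits(n_images, n_splits=10, train_frac=0.9, seed=0):
    """Hypothetical sketch: parameterized splits with 'standard' 90/10 defaults.

    Returns a dict mapping 'Train<i>' / 'Test<i>' to lists of image indices.
    """
    rng = np.random.RandomState(seed)
    n_train = int(round(train_frac * n_images))
    splits = {}
    for ind in range(n_splits):
        # Fresh permutation per split, as in the code under review.
        p = rng.permutation(n_images)
        splits['Train%d' % ind] = p[:n_train].tolist()
        splits['Test%d' % ind] = p[n_train:].tolist()
    return splits


# Standard 90/10 splits by default; callers can override train_frac.
splits = make_splits(100)
```

Non-standard splits would then just be `make_splits(100, train_frac=0.8)`.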
I support the creation of a set of standard splits of any dataset, but I would favor canonizing a minimal, language-agnostic materialization of said standard splits rather than defining them implicitly by code in a specific language that must be run correctly in a specific environment to regenerate the right splits. Concretely, I would advocate looking for the simplest-possible JSON format that would completely specify, say, ten 90/10 splits with reference to relative image paths within a canonical archive of a dataset.
That's just my opinion, though, and it may not fit the way scikit-data already works!
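One possible shape for such a minimal, language-agnostic materialization (the keys and the example paths below are purely illustrative, not a proposed standard) is a single JSON document listing each split's train and test image paths, relative to the canonical archive root:

```python
import json

# Hypothetical minimal JSON format for standard splits: one object per
# split, each listing relative image paths within the canonical archive.
# Paths and structure here are illustrative only.
splits = {
    "splits": [
        {
            "train": ["images/person_a/0001.jpg"],  # ~90% of the paths
            "test": ["images/person_a/0091.jpg"],   # the remaining ~10%
        },
        # ... nine more split objects for ten 90/10 splits
    ]
}

# Any language with a JSON parser can regenerate identical splits.
text = json.dumps(splits, indent=2, sort_keys=True)
loaded = json.loads(text)
```

The point of the round-trip is that the splits are defined by the data file itself, not by re-running Python code in a particular environment.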
@zstone I'm not sure it's obvious how to use JSON inside scikit-data in a natural way. However, could you comment on the split scheme currently proposed in the code? I just added support for non-standard splits. I put the split-related data in the dataset init, because we might want the split data to travel with the dataset instance. The current code has the idea that to change splits, you have to re-instantiate a new instance (or, I guess, monkey around inside the code; probably the various attributes should be private). What do you think?
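The "re-instantiate to change splits" idea described above might be sketched as follows; the class and attribute names are hypothetical stand-ins, not the actual scikit-data code:

```python
class Dataset(object):
    """Hypothetical sketch: split parameters are fixed per instance.

    Split-related data lives in __init__ so it travels with the instance;
    attributes are private to discourage mutating splits after creation.
    """

    def __init__(self, n_train=90, n_test=10, seed=0):
        self._n_train = n_train
        self._n_test = n_test
        self._seed = seed


default = Dataset()                        # standard 90/10 splits
custom = Dataset(n_train=80, n_test=20)   # non-standard sizes => new instance
```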
@yamins81 Coming back to this after forever -- is this the branch of pubfig83 that went into our ICML2013 paper? If not, could you point me to that code? I'd like to merge it in (and I'm sorry for not doing it a long time ago!).
standardized splits and tests -- @jaberg, @npinto, your comments would be great
Comments on how things are done are in the test_pubfig83 file.