cognoml refactor part 1 #3
…asic testing. Tested results on dummy data successfully
I got a notification that said… and I was like who is badeeep. Congrats on this remarkable start to your commit hash. Will try to get to reviewing this in the next couple of days.
1.) Raise a new AttributeError rather than just printing
2.) Just refer to `self.json_sanitize`
@jessept, thanks for this monster pull request. Especially for the great documentation you've added to functions. I am slightly worried that the migration to classes will harm modularity and create a higher development overhead. However, I do also see advantages. What are you planning for part 2 of this? Just want to make sure I understand the long-term strategy before I do line-by-line review of this PR.
I opted to pickle the resulting data frames. These take ~4 seconds to read in rather than 3 minutes.
Awesome, really impressed by that speedup (didn't expect such an improvement). This makes cognoma/cancer-data#9 a non-issue. So +1 for loading and pickling data upon first download.
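The load-then-pickle pattern discussed above can be sketched roughly as follows. This is a minimal illustration, not the actual cognoml code; the helper name and paths are hypothetical:

```python
import os
import pandas as pd

def read_or_pickle(pkl_path, slow_loader):
    """Return a cached DataFrame if a pickle exists; otherwise run the
    slow loader (e.g. a figshare download + parse) once and cache it."""
    if os.path.exists(pkl_path):
        # fast path: seconds instead of minutes
        return pd.read_pickle(pkl_path)
    df = slow_loader()
    df.to_pickle(pkl_path)
    return df
```

On the first call the slow loader runs and the pickle is written; every later call hits the fast path.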
The data download process is now separate from the classify process and is more flexible/readable. All data cleaning/downloading/etc. is handled in the CognomlData class.
Agree that it makes sense to separate downloading from the classify process. However, I'm not crazy about using a single class for downloading and data management. I think it could prematurely restrict our flexibility. Basically, many of the functions in cognoml/data.py lose their modularity when they become methods of CognomlData objects. I'd prefer to keep functions that may have general applicability outside of the class. @awm33, what is your opinion on heavily relying on classes for our data management?
```python
def get_results(self):
    pipeline = self.pipeline
    json_sanitize = self.json_sanitize
```
Why not just refer to `self.json_sanitize` for its single usage below?
```python
try:
    predict_df = pd.DataFrame({'sample_id': x.index, 'predicted_status': pipeline.predict(x)})
except AttributeError:
    print("Pipeline {} does not have a predict method".format(pipeline))
```
This should rethrow the error -- you can't continue without predictions, and `predict_df` won't exist.
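A minimal sketch of the suggested fix (the helper name is hypothetical): print the diagnostic, then re-raise so callers never proceed with a missing `predict_df`:

```python
import pandas as pd

def get_predictions(pipeline, x):
    try:
        predicted = pipeline.predict(x)
    except AttributeError:
        print("Pipeline {} does not have a predict method".format(pipeline))
        raise  # rethrow: predict_df cannot exist without predictions
    return pd.DataFrame({'sample_id': x.index, 'predicted_status': predicted})
```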
```python
x_test = self.x_test
x = self.X
obs_df = self.obs_df
obs_df = pd.DataFrame({'sample_id': obs_df.index, 'status': obs_df.values})
```
Worried about non-deterministic column ordering when creating dataframes from dictionaries. Can you make the ordering explicit? Either:

```python
obs_df = pd.DataFrame({'sample_id': obs_df.index, 'status': obs_df.values})[['sample_id', 'status']]
```

Or:

```python
obs_df = pd.DataFrame.from_items([
    ('sample_id', obs_df.index),
    ('status', obs_df.values),
])
```
I really wish python had a slick way of instantiating OrderedDicts ):
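A third option for making the ordering explicit is the `columns=` argument of the DataFrame constructor, which both selects and orders the columns. A minimal sketch with made-up sample IDs:

```python
import pandas as pd

obs = pd.Series({'TCGA-ZF-AA4X-01': 0, 'TCGA-22-5491-01': 1})

# columns= pins the column order regardless of dict iteration order
obs_df = pd.DataFrame(
    {'sample_id': obs.index, 'status': obs.values},
    columns=['sample_id', 'status'],
)
print(list(obs_df.columns))  # ['sample_id', 'status']
```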
Regarding modularity of data management, @gwaygenomics will likely want to use the …
@dhimmel @jessept If we use classes, perhaps we should create a base class that establishes an interface, say DataSource, then inherit from that base class. Like FigshareDataSource, S3DataSource, etc. As long as the functions/properties that classify needs are named the same and provide the same end result, it should work.
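That interface idea could look roughly like the sketch below. The class and method names are illustrative, not existing cognoma code:

```python
from abc import ABC, abstractmethod

class DataSource(ABC):
    """Interface the classify step relies on, regardless of backend."""

    @abstractmethod
    def load_expression(self):
        """Return the samples-by-genes feature matrix."""

class FigshareDataSource(DataSource):
    def load_expression(self):
        # hypothetical: download from figshare on first use,
        # then read the locally pickled frame
        return "expression-from-figshare"

class S3DataSource(DataSource):
    def load_expression(self):
        # hypothetical: fetch a pickle directly from S3,
        # accessed within the data center
        return "expression-from-s3"

def classify(source: DataSource):
    # classify only depends on the DataSource interface
    return source.load_expression()
```

As long as every subclass provides the same properties and end result, the classify code never needs to know which backend it is talking to.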
Maybe we should store it somewhere as a pickle. For at least production, we could store it as a pickle file on S3, then it would be in a more direct data format. It would be accessed within the data center rather than through figshare.
@dhimmel I see your concerns. I really like @awm33's idea of creating a base class and having other processes inherit from there. We can also build with standalone functions; the classes just make it a bit easier for me to read/navigate. @dhimmel in terms of modularity of data management, this is one of the core concepts of the analysis.py refactor. The idea with refactoring analysis.py is that we establish the same design patterns as scikit-learn (design and predictor matrices, fit and predict methods, etc.). We can put in relatively general data frames (X and y) and still get Cognoml-specific results. I wholeheartedly agree that we should try to make the data import/cleaning as modular as we can, and this is meant to be a first step in that direction. The code feedback given thus far has been extremely helpful.

My plans for part 2: obviously performance can be a bit of a daunting task, especially before an MVP, but I think it's worth focusing on the low-hanging fruit. My largest worry is having a great MVP that takes > 5 minutes to run every time, and having practitioners take a look at it and ultimately dismiss it because it takes too long to run. Let's discuss briefly in tomorrow's meeting; I think we can make quite a bit of progress there.
My philosophy was that scikit-learn made the classes so we don't have to. But I'm fine with giving it a try. I also like the superclass idea for data reading. So go ahead and continue with your preferred architecture. Let's try to merge this sooner rather than later so we can start building off of it. I still need to do a little more review and actually test the code. See y'all tonight.
Per discussion yesterday: these 2 should be enough to correct existing issues with the pull request. Any additional work should be part of a separate request.
…main, updated analysis to only look at correct data, re-wrote data to get correct json-formatted data from either github or front-end processes.
@dhimmel Newest commit to pull request fixes 1.) and 2.) above.

@jessept I'm on it. You may enjoy some of these markdown features. For example, in your above comment you should actually tag the issues like #5 and #6.

@jessept can you delete the legacy code that's no longer needed?
```python
y_test = y.head(5000)
x_test = pd.DataFrame(x[x.index.isin(list(y_test.index))])
classifier = CognomlClassifier(x_test, y_test)
a = CognomlData(mutations_json_url='https://github.com/cognoma/machine-learning/raw/876b8131bab46878cb49ae7243e459ec0acd2b47/data/api/hippo-input.json')
```
How would I pass this info from the task? I think being able to pass sample_id and mutation_status should be an option. Being able to pass them as a key/value (ex `...,"TCGA-ZF-AA4X-01":0,...`) or a table (ex `...],["TCGA-ZF-AA4X-01",0],[...`) would be nice.
I somehow commented on an outdated diff. Pasting it here: How would I pass this info from the task? I think being able to pass sample_id and mutation_status should be an option. Being able to pass them as a key/value (ex …
```python
pipeline = self.pipeline
x = self.X
try:
    predict_df = pd.DataFrame({'sample_id': x.index, 'predicted_status': pipeline.predict(x)})
```
Affected by #6
Before the pull request, cognoml returned predictions for all samples, not just selected samples. See the commit that added this functionality. The goal was to potentially show researchers predictions for samples that they didn't fit their model on. Can we look into restoring this functionality before merging this PR? Note the original commit message:

> Unselected observations (samples in the dataset that were not selected by the user) are now returned. These observations receive predictions but are missing (-1 encoded) for fields such as `testing` and `status`.
```python
obs_test_df = obs_df.query("testing == 1")
dimensions = collections.OrderedDict()
dimensions['observations_selected'] = sum(obs_df.selected == 1)
dimensions['observations_unselected'] = sum(obs_df.selected == 0)
```
The resulting JSON has:

```json
"observations_unselected": 0,
```

rather than:

```json
"observations_unselected": 2264,
```

See this comment in the original code:

```python
# obs_df switches to containing non-selected samples
```
This has been implemented in latest pull request b3427c7
2.) restored previous functionality for prediction of entire data set rather than just requested sample ids 3.) Added additional OrderedDicts where necessary to preserve json order
@awm33 could you talk a bit more about what you mean? I am assuming that the incoming data will look similar to what exists in the main.py script, an API call that has a json form with the sample_id and mutation status columns. It would be very straightforward to implement a change that would accept both dictionaries and tables, I just want to understand how this flexibility would be helpful. Does it help you test?
@awm33 the added complexity to support multiple ways of inputting the sample/mutation information is not worth it IMO. It's trivial to convert between the three formats you mention -- why not just convert before calling the function? I'm very open to changing the names to …

Your thoughts here could help my opinion "evolve".
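For illustration, converting either proposed input format into one canonical pandas Series really is a one-liner (sample IDs made up):

```python
import pandas as pd

# the same sample/mutation information in the two proposed formats
as_mapping = {"TCGA-ZF-AA4X-01": 0, "TCGA-22-5491-01": 1}    # key/value form
as_table = [["TCGA-ZF-AA4X-01", 0], ["TCGA-22-5491-01", 1]]  # table form

# both collapse to the same Series before calling into cognoml
s_map = pd.Series(as_mapping, name="mutation_status")
s_tab = pd.Series(dict(as_table), name="mutation_status")

assert s_map.sort_index().equals(s_tab.sort_index())
```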
Diff for hippo JSON output highlighted the bug:

```diff
< "positive_prevalence": 0.27846,
---
> "positive_prevalence": -0.11771,
```
@jessept congrats on this monumental pull request. I made one small change (8cfa0b9). After this change, this pull request yields the same JSON as previously. Now that we have this first major refactoring in place, we should aim for smaller, more modular pull requests, so we can keep development speedy. Cheers! 🍾
@jessept you will want to use these guidelines to keep your fork synced. You will want to make sure your local master branch is never ahead of the …
@dhimmel thanks for your patience/persistence helping push this through, especially the code review. Will do on the fork syncing going forward, smaller pull requests ahead!
@jessept Sorry for getting back late here, but maybe we can address making changes for the worker in another PR. A given classifier task has a list of entrezids and disease types. The worker code will query for any samples that match the list of disease types and join that to the mutations table. The result will not be in a [sample_id, mutation_status] form, so the worker needs to transform it into that form and pass it to the cognoml code. So what I am looking for is a way to pass that directly somehow.

@dhimmel I was giving those as examples for passing the data. So one of those formats, not all of them.
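The worker-side transform described above could be as simple as selecting two columns from the joined query result. A sketch with made-up data and hypothetical column names:

```python
import pandas as pd

# hypothetical joined result: samples matching the requested disease
# types, joined to the mutations table
joined = pd.DataFrame({
    "sample_id": ["TCGA-A", "TCGA-B", "TCGA-C"],
    "acronym": ["BLCA", "BLCA", "LUAD"],
    "mutation_status": [1, 0, 1],
})

# reshape into the [sample_id, mutation_status] form cognoml expects
pairs = joined[["sample_id", "mutation_status"]].values.tolist()
print(pairs)  # [['TCGA-A', 1], ['TCGA-B', 0], ['TCGA-C', 1]]
```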
This is meant as a first step in making the ML code for cognoma both easily replicable and modular.
Big changes:
Performance changes: