
cognoml refactor part 1 #3

Merged
merged 9 commits into from
Nov 10, 2016

Conversation

@jessept (Collaborator) commented Oct 27, 2016

This is meant as a first step in making the ML code for cognoma both easily replicable and modular.

Big changes:

  1. Data - The data download process is now separate from the classify process and is more flexible/readable. All data cleaning/downloading/etc. is handled in the CognomlData class.
  2. Analysis - This process is now a little cleaner but could still use quite a bit of work. The goal with the CognomlClassifier is to fit the scikit-learn design pattern of fit/predict/etc. I also wanted to abstract out the Pipeline method so that it is easier to explore/iterate with other pipelines and just be able to "plug and play" going forward.
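The "plug and play" idea above could be sketched as follows; the constructor signature, the `default_pipeline`, and the method bodies here are illustrative assumptions, not the PR's actual code:

```python
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Illustrative default; any object exposing fit/predict can be swapped in.
default_pipeline = Pipeline([
    ('scale', StandardScaler()),
    ('classify', LogisticRegression()),
])

class CognomlClassifier:
    """Sketch of a classifier that accepts an arbitrary pipeline."""
    def __init__(self, x, y, pipeline=None):
        self.x = x
        self.y = y
        self.pipeline = pipeline if pipeline is not None else default_pipeline

    def fit(self):
        self.pipeline.fit(self.x, self.y)
        return self

    def predict(self, x):
        return self.pipeline.predict(x)
```

Swapping in another pipeline would then only require passing a different `pipeline` argument.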

Performance changes:

  1. Data input - Instead of re-reading the .tsv files into a data frame (which can take ~3 minutes), I opted to pickle the resulting data frames. These take ~4 seconds to read in rather than 3 minutes.
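The caching pattern described above can be sketched like this (the helper name and file paths are hypothetical, not cognoml's actual API):

```python
import os

import pandas as pd

def load_frame(tsv_path, pickle_path):
    """Load a data frame, caching it as a pickle after the first read.

    Hypothetical helper illustrating the speedup described above: the
    slow TSV parse happens once; later runs hit the fast pickle.
    """
    if os.path.exists(pickle_path):
        return pd.read_pickle(pickle_path)  # fast path (~seconds)
    df = pd.read_csv(tsv_path, sep='\t', index_col=0)  # slow path (~minutes)
    df.to_pickle(pickle_path)
    return df
```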

@dhimmel (Member) commented Oct 27, 2016

I got a notification that said

badeeeb added docstrings for classifier class

and I was like who is badeeep. Congrats on this remarkable start to your commit hash.

Will try to get to reviewing this in the next couple of days.

1.) Raise new AttributeError rather than just printing
2.) Just refer to self.json_sanitize
@dhimmel (Member) left a comment

@jessept, thanks for this monster pull request. Especially for the great documentation you've added to functions. I am slightly worried that the migration to classes will harm modularity and create a higher development overhead. However, I do also see advantages. What are you planning for part 2 of this? Just want to make sure I understand the long-term strategy before I do line-by-line review of this PR.

I opted to pickle the resulting data frames. These take ~4 seconds to read in rather than 3 minutes.

Awesome, really impressed by that speedup (didn't expect such an improvement). This makes cognoma/cancer-data#9 a non-issue. So +1 for loading and pickling data upon first download.

The data download process is now separate from the classify process and is more flexible/readable. All data cleaning/downloading/etc. is handled in the CognomlData class.

Agree that it makes sense to separate downloading from the classify process. However, I'm not crazy about using a single class for downloading and data management. I think it could prematurely restrict our flexibility. Basically, many of the functions in cognoml/data.py lose their modularity when they become methods of CognomlData objects. I'd prefer to keep functions that may have general applicability outside of the class. @awm33, what is your opinion on heavily relying on classes for our data management?
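One way to get both properties, sketched with hypothetical names: keep general-purpose functions at module level (importable and testable on their own) and let the class merely orchestrate them.

```python
import pandas as pd

# Module-level function: usable outside the class, easy to unit test.
def drop_low_variance_genes(expr_df, min_variance=0.0):
    """Remove gene columns whose variance does not exceed min_variance."""
    return expr_df.loc[:, expr_df.var() > min_variance]

class CognomlData:
    """Thin orchestrator that delegates to the module-level functions.

    Sketch only; the real class in cognoml/data.py does more.
    """
    def __init__(self, expr_df):
        self.expr_df = expr_df

    def clean(self):
        self.expr_df = drop_low_variance_genes(self.expr_df)
        return self.expr_df
```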


def get_results(self):
pipeline = self.pipeline
json_sanitize = self.json_sanitize
@dhimmel (Member):

Why not just refer to self.json_sanitize for its single usage below?

try:
    predict_df = pd.DataFrame({'sample_id': x.index, 'predicted_status': pipeline.predict(x)})
except AttributeError:
    print("Pipeline {} does not have a predict method".format(pipeline))
@dhimmel (Member):

This should rethrow the error -- you can't continue without predictions and predict_df won't exist
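A sketch of the suggested fix (the function name is illustrative): print the diagnostic if you must, then re-raise so the failure propagates instead of leaving predict_df undefined.

```python
import pandas as pd

def predict_samples(pipeline, x):
    """Build the predictions frame, re-raising if the pipeline cannot predict."""
    try:
        predicted = pipeline.predict(x)
    except AttributeError:
        print("Pipeline {} does not have a predict method".format(pipeline))
        raise  # re-raise: predict_df cannot be built without predictions
    return pd.DataFrame(
        {'sample_id': x.index, 'predicted_status': predicted},
        columns=['sample_id', 'predicted_status'],
    )
```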

x_test = self.x_test
x = self.X
obs_df = self.obs_df
obs_df = pd.DataFrame({'sample_id': obs_df.index, 'status': obs_df.values})
@dhimmel (Member):

Worried about non-deterministic column ordering when creating dataframes from dictionaries. Can you make the ordering explicit? Either:

obs_df = pd.DataFrame({'sample_id': obs_df.index, 'status': obs_df.values})[['sample_id', 'status']]

Or

obs_df = pd.DataFrame.from_items([
    ('sample_id', obs_df.index),
    ('status', obs_df.values),
])

I really wish python had a slick way of instantiating OrderedDicts ):
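For what it's worth, the `columns=` argument to the DataFrame constructor gives the same explicit ordering as either alternative above (and in pandas releases since this discussion, `DataFrame.from_items` was removed):

```python
import pandas as pd

obs = pd.Series([1, 0], index=['s1', 's2'], name='status')
# columns= selects and orders the dict keys deterministically.
obs_df = pd.DataFrame(
    {'sample_id': obs.index, 'status': obs.values},
    columns=['sample_id', 'status'],
)
```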

@dhimmel (Member) commented Oct 31, 2016

Regarding modularity of data management, @gwaygenomics will likely want to use the cognoml package using data that does not come from figshare. So we want an architecture that allows modular swapping of input data.

@awm33 (Member) commented Oct 31, 2016

@dhimmel @jessept If we use classes, perhaps we should create a base class that establishes an interface, say DataSource, then inherit from that base class. Like FigshareDataSource, S3DataSource, etc. As long as the functions/properties that classify needs are named the same and provide the same end result, it should work.
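The base-class idea above might look like this; the class names come from the comment, but the single `load_expression` method and the stub bodies are assumptions for illustration:

```python
import abc

import pandas as pd

class DataSource(abc.ABC):
    """Interface every data source must satisfy."""
    @abc.abstractmethod
    def load_expression(self):
        """Return a samples-by-genes expression DataFrame."""

class FigshareDataSource(DataSource):
    def load_expression(self):
        # A real implementation would download from figshare; stubbed here.
        return pd.DataFrame({'gene_a': [1.0]}, index=['sample_1'])

class S3DataSource(DataSource):
    def load_expression(self):
        # A real implementation would read a pickle from S3; stubbed here.
        return pd.DataFrame({'gene_a': [2.0]}, index=['sample_1'])
```

Classify code would then depend only on the `DataSource` interface, not on any concrete source.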

I opted to pickle the resulting data frames. These take ~4 seconds to read in rather than 3 minutes.

Maybe we should store it somewhere as a pickle. For production at least, we could store it as a pickle file on S3; then it would be in a more direct data format and accessed within the data center rather than through figshare.

@jessept (Collaborator, author) commented Oct 31, 2016

@dhimmel I see your concerns. I really like @awm33's idea of creating a base class and having other processes inherit from there. We can also build with standalone functions; the classes just make it a bit easier for me to read/navigate.

@dhimmel in terms of modularity of data management, this is one of the core concepts of the analysis.py refactor. The idea with refactoring analysis.py is that we establish the same design patterns as scikit-learn (design and predictor matrices, fit and predict methods, etc.). We can put in relatively general data frames (X and y) and still get Cognoml-specific results. I wholeheartedly agree that we should try to make the data import/cleaning as modular as we can, and this is meant to be a first step in that direction. The code feedback given thus far has been extremely helpful.

My plans for part 2:
1.) Testing. I want to use an existing, well-defined dummy data set to do some integration testing, and I also want to write up unit test cases for each of the functions to make it easier for others to contribute without needing as much background. These also get us a step closer to continuous deployment.
2.) Performance. It looks like a lot of the genes take quite a while to fit. I think if we are going to pickle existing data results, we should think about caching fitting results too. That way, if users of the MVP end up fitting many of the same genes, we can really improve their experience (>20x performance gain using prefitted results) and avoid wasting unneeded compute time in our cloud environment.

Obviously performance can be a bit of a daunting task, especially before an MVP, but I think it's worth focusing on the low-hanging fruit. My largest worry is having a great MVP that takes > 5 minutes to run every time, and having practitioners take a look at it and ultimately dismiss it because it takes too long to run. Let's discuss briefly in tomorrow's meeting, I think we can make quite a bit of progress there.
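The fit-caching idea from part 2 above could be sketched like this; `cached_fit`, its arguments, and the key scheme are hypothetical, and a real cache would also key on the pipeline configuration and data version:

```python
import hashlib
import os
import pickle

def cached_fit(cache_dir, gene_ids, fit_fn):
    """Fit a model for a gene set, reusing a pickled result when available."""
    # Sorted ids make the key order-independent.
    key = hashlib.sha1(','.join(sorted(gene_ids)).encode()).hexdigest()
    path = os.path.join(cache_dir, key + '.pkl')
    if os.path.exists(path):
        with open(path, 'rb') as f:
            return pickle.load(f)  # cache hit: skip the expensive fit
    result = fit_fn(gene_ids)
    with open(path, 'wb') as f:
        pickle.dump(result, f)
    return result
```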

@dhimmel (Member) commented Nov 1, 2016

The idea with refactoring analysis.py is that we establish the same design patterns as scikit-learn (design and predictor matrices, fit and predict methods, etc.).

My philosophy was that scikit-learn made the classes so we don't have to. But I'm fine with giving it a try. I also like the superclass idea for data reading. So go ahead and continue with your preferred architecture. Let's try to merge this sooner rather than later as we can start building off of it.

I still need to do a little more review and actually test the code. See y'all tonight.

@jessept (Collaborator, author) commented Nov 2, 2016

Per discussion yesterday:
1.) I misunderstood how mutation data is being read into the process. I will correct the current CognomlData class to reflect this. (Issue 5)
2.) I will update the get_results method to reflect an OrderedDict where necessary (Issue 6)

These 2 should be enough to correct existing issues with the pull request. Any additional work should be part of a separate request.

…main, updated analysis to only look at correct data, re-wrote data to get correct json-formatted data from either github or front-end processes.
@jessept (Collaborator, author) commented Nov 2, 2016

@dhimmel Newest commit to pull request fixes 1.) and 2.) above.

@dhimmel (Member) commented Nov 2, 2016

@jessept I'm on it. You may enjoy some of these markdown features. For example, in your above comment you should actually tag the issues like #5 and #6.

@dhimmel (Member) commented Nov 2, 2016

@jessept can you delete the legacy code that's no longer needed?

y_test = y.head(5000)
x_test = pd.DataFrame(x[x.index.isin(list(y_test.index))])
classifier = CognomlClassifier(x_test, y_test)
a = CognomlData(mutations_json_url='https://github.com/cognoma/machine-learning/raw/876b8131bab46878cb49ae7243e459ec0acd2b47/data/api/hippo-input.json')
@awm33 (Member):

How would I pass this info from the task? I think being able to pass sample_id and mutation_status should be an option. Being able to pass them as a key/value (ex ...,"TCGA-ZF-AA4X-01":0,... or a table (ex ...],["TCGA-ZF-AA4X-01",0],[...) would be nice.

@awm33 (Member) commented Nov 3, 2016

I somehow commented on outdated diff. Pasting it here

How would I pass this info from the task? I think being able to pass sample_id and mutation_status should be an option. Being able to pass them as a key/value (ex ...,"TCGA-ZF-AA4X-01":0,... or a table (ex ...],["TCGA-ZF-AA4X-01",0],[...) would be nice.

pipeline = self.pipeline
x = self.X
try:
    predict_df = pd.DataFrame({'sample_id': x.index, 'predicted_status': pipeline.predict(x)})
(Member):

Affected by #6

@dhimmel (Member) left a comment

Before the pull request, cognoml returned predictions for all samples, not just selected samples. See the commit that added this functionality. The goal was to potentially show researchers predictions for samples that they didn't fit their model on. Can we look into restoring this functionality before merging this PR? Note the original commit message:

Unselected observations (samples in the dataset that were not selected by the user) are now returned. These observations receive predictions but are missing (-1 encoded) for fields such as testing and status.
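The behavior quoted above could be restored along these lines; the helper name and the exact -1-encoded columns are illustrative assumptions:

```python
import pandas as pd

def add_unselected(obs_df, all_sample_ids):
    """Append unselected samples with -1 placeholders for user-supplied fields."""
    unselected = sorted(set(all_sample_ids) - set(obs_df['sample_id']))
    extra = pd.DataFrame(
        {'sample_id': unselected, 'status': -1, 'testing': -1},
        columns=['sample_id', 'status', 'testing'],
    )
    # Unselected samples still receive model predictions downstream.
    return pd.concat([obs_df, extra], ignore_index=True)
```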

obs_test_df = obs_df.query("testing == 1")
dimensions = collections.OrderedDict()
dimensions['observations_selected'] = sum(obs_df.selected == 1)
dimensions['observations_unselected'] = sum(obs_df.selected == 0)
@dhimmel (Member):

The resulting JSON has:

    "observations_unselected": 0,

Rather than:

    "observations_unselected": 2264,

See this comment in original code:

# obs_df switches to containing non-selected samples

@jessept (Collaborator, author):

This has been implemented in the latest commit to the pull request, b3427c7

2.) restored previous functionality for prediction of entire data set rather than just requested sample ids
3.) Added additional OrderedDicts where necessary to preserve json order
@jessept (Collaborator, author) commented Nov 8, 2016

@awm33 could you talk a bit more about what you mean? I am assuming that the incoming data will look similar to what exists in the main.py script: an API call that has a JSON form with the sample_id and mutation_status columns.

It would be very straightforward to implement a change that would accept both dictionaries and tables, I just want to understand how this flexibility would be helpful. Does it help you test?

@dhimmel (Member) commented Nov 8, 2016

How would I pass this info from the task? I think being able to pass sample_id and mutation_status should be an option. Being able to pass than as a key/value (ex ...,"TCGA-ZF-AA4X-01":0,... or a table (ex ...],["TCGA-ZF-AA4X-01",0],[...) would be nice.

@awm33 the added complexity to support multiple ways of inputting the sample/mutation information is not worth it IMO. It's trivial to convert between the three formats you mention -- why not just convert before calling the function?

I'm very open to changing the names to sample_ids and mutation_statuses if that clarifies things. Or changing the function to take a dict like sample_to_status. But I'm not convinced the added complexity of supporting duplicate ways of providing the same information is warranted.
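The "trivial to convert" claim above holds up: each of the three shapes mentioned is a one-liner away from the others (sample ids below are taken from the comment; the variable names are illustrative).

```python
# Table form: list of (sample_id, mutation_status) pairs.
pairs = [('TCGA-ZF-AA4X-01', 0), ('TCGA-22-5472-01', 1)]

# Key/value form.
mapping = dict(pairs)

# Parallel-lists form.
sample_ids = [s for s, _ in pairs]
mutation_statuses = [m for _, m in pairs]

# And back again: parallel lists -> key/value.
assert dict(zip(sample_ids, mutation_statuses)) == mapping
```

Since conversion is this cheap, the caller can normalize to a single canonical format before invoking cognoml.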

I just want to understand how this flexibility would be helpful. Does it help you test?

Your thoughts here could help my opinion "evolve".

Diff for hippo JSON output highlighted the bug:

<     "positive_prevalence": 0.27846,
---
>     "positive_prevalence": -0.11771,
@dhimmel dhimmel merged commit 459e36f into cognoma:master Nov 10, 2016
@dhimmel (Member) commented Nov 10, 2016

@jessept congrats on this monumental pull request. I made one small change (8cfa0b9). After this change, this pull request yields the same JSON as previously. Now that we have this first major refactoring in place, we should aim for smaller, more modular pull requests, so we can keep development speedy. Cheers! 🍾

@dhimmel (Member) commented Nov 10, 2016

@jessept you will want to use these guidelines to keep your fork synced. You will want to make sure your local master branch is never ahead of the cognoma master. Make changes and pull requests on branches.

@jessept (Collaborator, author) commented Nov 10, 2016

@dhimmel thanks for your patience/persistence helping push this through, especially the code review. Will do on the fork syncing going forward, smaller pull requests ahead!

@awm33 (Member) commented Nov 11, 2016

@jessept Sorry for getting back to this late, but maybe we can address making changes for the worker in another PR.

A given classifier task has a list of entrez ids and disease types. The worker code will query for any samples that match the list of disease types and join that to the mutations table. The result will not be in a [sample_id, mutation_status] form, so the worker needs to transform it into that form and pass it to the cognoml code.

So what I am looking for is a way to pass that directly somehow.
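The worker-side transform described above might look like this; the row dict keys and the helper name are assumptions, and a sample is treated as mutation-positive if any joined row reports a mutation in the task's entrez gene ids:

```python
def to_task_input(query_rows):
    """Collapse joined query rows into sorted (sample_id, mutation_status) pairs."""
    status = {}
    for row in query_rows:
        sample_id = row['sample_id']
        # Any mutated row for a sample marks that sample as positive.
        status[sample_id] = max(status.get(sample_id, 0), int(row['mutated']))
    return sorted(status.items())
```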

@dhimmel I was giving those as examples for passing the data. So one of those formats, not all of them.
