
cognoml refactor part 1 #3

Merged
merged 9 commits into from
Nov 10, 2016

Conversation

@jessept (Collaborator) commented Oct 27, 2016

This is meant as a first step in making the ML code for cognoma both easily replicable and modular.

Big changes:

  1. Data - The data download process is now separate from the classify process and is more flexible/readable. All data cleaning/downloading/etc. is handled in the CognomlData class.
  2. Analysis - This process is now a little cleaner but could still use quite a bit of work. The goal with the CognomlClassifier is to fit the scikit-learn design pattern of fit/predict/etc. I also wanted to abstract out the Pipeline method so that it is easier to explore/iterate with other pipelines and just be able to "plug and play" going forward.
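The "plug and play" idea above could be sketched as follows; the constructor signature, the `default_pipeline`, and the method bodies here are illustrative assumptions, not the PR's actual code:

```python
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Illustrative default; any object exposing fit/predict can be swapped in.
default_pipeline = Pipeline([
    ('scale', StandardScaler()),
    ('classify', LogisticRegression()),
])

class CognomlClassifier:
    """Sketch of a classifier that accepts an arbitrary pipeline."""
    def __init__(self, x, y, pipeline=None):
        self.x = x
        self.y = y
        self.pipeline = pipeline if pipeline is not None else default_pipeline

    def fit(self):
        self.pipeline.fit(self.x, self.y)
        return self

    def predict(self, x):
        return self.pipeline.predict(x)
```

Swapping in another pipeline would then only require passing a different `pipeline` argument.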

Performance changes:

  1. Data input - Instead of re-reading the .tsv files into a data frame (which can take ~3 minutes), I opted to pickle the resulting data frames. These take ~4 seconds to read in rather than 3 minutes.
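The caching pattern described above can be sketched like this (the helper name and file paths are hypothetical, not cognoml's actual API):

```python
import os

import pandas as pd

def load_frame(tsv_path, pickle_path):
    """Load a data frame, caching it as a pickle after the first read.

    Hypothetical helper illustrating the speedup described above: the
    slow TSV parse happens once; later runs hit the fast pickle.
    """
    if os.path.exists(pickle_path):
        return pd.read_pickle(pickle_path)  # fast path (~seconds)
    df = pd.read_csv(tsv_path, sep='\t', index_col=0)  # slow path (~minutes)
    df.to_pickle(pickle_path)
    return df
```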

@dhimmel (Member) commented Oct 27, 2016

I got a notification that said

badeeeb added docstrings for classifier class

and I was like who is badeeep. Congrats on this remarkable start to your commit hash.

Will try to get to reviewing this in the next couple of days.

1.) Raise new AttributeError rather than just printing
2.) Just refer to self.json_sanitize
@dhimmel (Member) left a comment

@jessept, thanks for this monster pull request. Especially for the great documentation you've added to functions. I am slightly worried that the migration to classes will harm modularity and create a higher development overhead. However, I do also see advantages. What are you planning for part 2 of this? Just want to make sure I understand the long-term strategy before I do line-by-line review of this PR.

I opted to pickle the resulting data frames. These take ~4 seconds to read in rather than 3 minutes.

Awesome, really impressed by that speedup (didn't expect such an improvement). This makes cognoma/cancer-data#9 a non-issue. So +1 for loading and pickling data upon first download.

The data download process is now separate from the classify process and is more flexible/readable. All data cleaning/downloading/etc. is handled in the CognomlData class.

Agree that it makes sense to separate downloading from the classify process. However, I'm not crazy about using a single class for downloading and data management. I think it could prematurely restrict our flexibility. Basically, many of the functions in cognoml/data.py lose their modularity when they become methods of CognomlData objects. I'd prefer to keep functions that may have general applicability outside of the class. @awm33, what is your opinion on heavily relying on classes for our data management?
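One way to get both properties, sketched with hypothetical names: keep general-purpose functions at module level (importable and testable on their own) and let the class merely orchestrate them.

```python
import pandas as pd

# Module-level function: usable outside the class, easy to unit test.
def drop_low_variance_genes(expr_df, min_variance=0.0):
    """Remove gene columns whose variance does not exceed min_variance."""
    return expr_df.loc[:, expr_df.var() > min_variance]

class CognomlData:
    """Thin orchestrator that delegates to the module-level functions.

    Sketch only; the real class in cognoml/data.py does more.
    """
    def __init__(self, expr_df):
        self.expr_df = expr_df

    def clean(self):
        self.expr_df = drop_low_variance_genes(self.expr_df)
        return self.expr_df
```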


def get_results(self):
pipeline = self.pipeline
json_sanitize = self.json_sanitize
@dhimmel (Member):

Why not just refer to self.json_sanitize for its single usage below?

try:
    predict_df = pd.DataFrame({'sample_id': x.index, 'predicted_status': pipeline.predict(x)})
except AttributeError:
    print("Pipeline {} does not have a predict method".format(pipeline))
@dhimmel (Member):

This should rethrow the error -- you can't continue without predictions and predict_df won't exist
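A sketch of the suggested fix (the function name is illustrative): print the diagnostic if you must, then re-raise so the failure propagates instead of leaving predict_df undefined.

```python
import pandas as pd

def predict_samples(pipeline, x):
    """Build the predictions frame, re-raising if the pipeline cannot predict."""
    try:
        predicted = pipeline.predict(x)
    except AttributeError:
        print("Pipeline {} does not have a predict method".format(pipeline))
        raise  # re-raise: predict_df cannot be built without predictions
    return pd.DataFrame(
        {'sample_id': x.index, 'predicted_status': predicted},
        columns=['sample_id', 'predicted_status'],
    )
```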

x_test = self.x_test
x = self.X
obs_df = self.obs_df
obs_df = pd.DataFrame({'sample_id': obs_df.index, 'status': obs_df.values})
@dhimmel (Member):

Worried about non-deterministic column ordering when creating dataframes from dictionaries. Can you make the ordering explicit? Either:

obs_df = pd.DataFrame({'sample_id': obs_df.index, 'status': obs_df.values})[['sample_id', 'status']]

Or

obs_df = pd.DataFrame.from_items([
    ('sample_id', obs_df.index),
    ('status', obs_df.values),
])

I really wish python had a slick way of instantiating OrderedDicts ):
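For what it's worth, the `columns=` argument to the DataFrame constructor gives the same explicit ordering as either alternative above (and in pandas releases since this discussion, `DataFrame.from_items` was removed):

```python
import pandas as pd

obs = pd.Series([1, 0], index=['s1', 's2'], name='status')
# columns= selects and orders the dict keys deterministically.
obs_df = pd.DataFrame(
    {'sample_id': obs.index, 'status': obs.values},
    columns=['sample_id', 'status'],
)
```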

@dhimmel (Member) commented Oct 31, 2016

Regarding modularity of data management, @gwaygenomics will likely want to use the cognoml package using data that does not come from figshare. So we want an architecture that allows modular swapping of input data.

@awm33 (Member) commented Oct 31, 2016

@dhimmel @jessept If we use classes, perhaps we should create a base class that establishes an interface, say DataSource, then inherit from that base class. Like FigshareDataSource, S3DataSource, etc. As long as the functions/properties that classify needs are named the same and provide the same end result, it should work.
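The base-class idea above might look like this; the class names come from the comment, but the single `load_expression` method and the stub bodies are assumptions for illustration:

```python
import abc

import pandas as pd

class DataSource(abc.ABC):
    """Interface every data source must satisfy."""
    @abc.abstractmethod
    def load_expression(self):
        """Return a samples-by-genes expression DataFrame."""

class FigshareDataSource(DataSource):
    def load_expression(self):
        # A real implementation would download from figshare; stubbed here.
        return pd.DataFrame({'gene_a': [1.0]}, index=['sample_1'])

class S3DataSource(DataSource):
    def load_expression(self):
        # A real implementation would read a pickle from S3; stubbed here.
        return pd.DataFrame({'gene_a': [2.0]}, index=['sample_1'])
```

Classify code would then depend only on the `DataSource` interface, not on any concrete source.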

I opted to pickle the resulting data frames. These take ~4 seconds to read in rather than 3 minutes.

Maybe we should store it somewhere as a pickle. For production at least, we could store it as a pickle file on S3; then it would be in a more direct data format and accessed within the data center rather than through figshare.

@jessept (Collaborator, author) commented Oct 31, 2016

@dhimmel I see your concerns. I really like @awm33's idea of creating a base class and having other processes inherit from there. We can also build with standalone functions; the classes just make it a bit easier for me to read/navigate.

@dhimmel in terms of modularity of data management, this is one of the core concepts of the analysis.py refactor. The idea with refactoring analysis.py is that we establish the same design patterns as scikit-learn (design and predictor matrices, fit and predict methods, etc.). We can put in relatively general data frames (X and y) and still get Cognoml-specific results. I wholeheartedly agree that we should try to make the data import/cleaning as modular as we can, and this is meant to be a first step in that direction. The code feedback given thus far has been extremely helpful.

My plans for part 2:
1.) Testing. I want to use an existing, well-defined dummy data set to do some integration testing, and I also want to write up unit test cases for each of the functions to make it easier for others to contribute without needing as much background. These also get us a step closer to continuous deployment.
2.) Performance. It looks like a lot of the genes take quite a while to fit. I think if we are going to pickle existing data results, we should think about caching fitting results too. That way, if users of the MVP end up fitting many of the same genes, we can really improve their experience (>20x performance gain using prefitted results) and avoid wasting unneeded compute time in our cloud environment.

Obviously performance can be a bit of a daunting task, especially before an MVP, but I think it's worth focusing on the low-hanging fruit. My largest worry is having a great MVP that takes > 5 minutes to run every time, and having practitioners take a look at it and ultimately dismiss it because it takes too long to run. Let's discuss briefly in tomorrow's meeting, I think we can make quite a bit of progress there.
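The fit-caching idea from part 2 above could be sketched like this; `cached_fit`, its arguments, and the key scheme are hypothetical, and a real cache would also key on the pipeline configuration and data version:

```python
import hashlib
import os
import pickle

def cached_fit(cache_dir, gene_ids, fit_fn):
    """Fit a model for a gene set, reusing a pickled result when available."""
    # Sorted ids make the key order-independent.
    key = hashlib.sha1(','.join(sorted(gene_ids)).encode()).hexdigest()
    path = os.path.join(cache_dir, key + '.pkl')
    if os.path.exists(path):
        with open(path, 'rb') as f:
            return pickle.load(f)  # cache hit: skip the expensive fit
    result = fit_fn(gene_ids)
    with open(path, 'wb') as f:
        pickle.dump(result, f)
    return result
```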

@dhimmel (Member) commented Nov 1, 2016

The idea with refactoring analysis.py is that we establish the same design patterns as scikit-learn (design and predictor matrices, fit and predict methods, etc.).

My philosophy was that scikit-learn made the classes so we don't have to. But I'm fine with giving it a try. I also like the superclass idea for data reading. So go ahead and continue with your preferred architecture. Let's try to merge this sooner rather than later as we can start building off of it.

I still need to do a little more review and actually test the code. See y'all tonight.

@jessept (Collaborator, author) commented Nov 2, 2016

Per discussion yesterday:
1.) I misunderstood how mutation data is being read into the process. I will correct the current CognomlData class to reflect this. (Issue 5)
2.) I will update the get_results method to reflect an OrderedDict where necessary (Issue 6)

These 2 should be enough to correct existing issues with the pull request. Any additional work should be part of a separate request.

…main, updated analysis to only look at correct data, re-wrote data to get correct json-formatted data from either github or front-end processes.
@jessept (Collaborator, author) commented Nov 2, 2016

@dhimmel Newest commit to pull request fixes 1.) and 2.) above.

@dhimmel (Member) commented Nov 2, 2016

@jessept I'm on it. You may enjoy some of these markdown features. For example, in your above comment you should actually tag the issues like #5 and #6.

@dhimmel (Member) commented Nov 2, 2016

@jessept can you delete the legacy code that's no longer needed?

y_test = y.head(5000)
x_test = pd.DataFrame(x[x.index.isin(list(y_test.index))])
classifier = CognomlClassifier(x_test, y_test)
a = CognomlData(mutations_json_url='https://github.com/cognoma/machine-learning/raw/876b8131bab46878cb49ae7243e459ec0acd2b47/data/api/hippo-input.json')
@awm33 (Member):

How would I pass this info from the task? I think being able to pass sample_id and mutation_status should be an option. Being able to pass them as a key/value (ex ...,"TCGA-ZF-AA4X-01":0,... or a table (ex ...],["TCGA-ZF-AA4X-01",0],[...) would be nice.

@awm33 (Member) commented Nov 3, 2016

I somehow commented on outdated diff. Pasting it here

How would I pass this info from the task? I think being able to pass sample_id and mutation_status should be an option. Being able to pass them as a key/value (ex ...,"TCGA-ZF-AA4X-01":0,... or a table (ex ...],["TCGA-ZF-AA4X-01",0],[...) would be nice.

pipeline = self.pipeline
x = self.X
try:
    predict_df = pd.DataFrame({'sample_id': x.index, 'predicted_status': pipeline.predict(x)})
(Member):

Affected by #6

@dhimmel (Member) left a comment

Before the pull request, cognoml returned predictions for all samples, not just selected samples. See the commit that added this functionality. The goal was to potentially show researchers predictions for samples that they didn't fit their model on. Can we look into restoring this functionality before merging this PR? Note the original commit message:

Unselected observations (samples in the dataset that were not selected by the user) are now returned. These observations receive predictions but are missing (-1 encoded) for fields such as testing and status.
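The behavior quoted above could be restored along these lines; the helper name and the exact -1-encoded columns are illustrative assumptions:

```python
import pandas as pd

def add_unselected(obs_df, all_sample_ids):
    """Append unselected samples with -1 placeholders for user-supplied fields."""
    unselected = sorted(set(all_sample_ids) - set(obs_df['sample_id']))
    extra = pd.DataFrame(
        {'sample_id': unselected, 'status': -1, 'testing': -1},
        columns=['sample_id', 'status', 'testing'],
    )
    # Unselected samples still receive model predictions downstream.
    return pd.concat([obs_df, extra], ignore_index=True)
```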

obs_test_df = obs_df.query("testing == 1")
dimensions = collections.OrderedDict()
dimensions['observations_selected'] = sum(obs_df.selected == 1)
dimensions['observations_unselected'] = sum(obs_df.selected == 0)
@dhimmel (Member):

The resulting JSON has:

    "observations_unselected": 0,

Rather than:

    "observations_unselected": 2264,

See this comment in original code:

# obs_df switches to containing non-selected samples

@jessept (Collaborator, author):

This has been implemented in the latest commit to the pull request, b3427c7

2.) restored previous functionality for prediction of entire data set rather than just requested sample ids
3.) Added additional OrderedDicts where necessary to preserve json order
@jessept (Collaborator, author) commented Nov 8, 2016

@awm33 could you talk a bit more about what you mean? I am assuming that the incoming data will look similar to what exists in the main.py script: an API call that has a JSON form with the sample_id and mutation_status columns.

It would be very straightforward to implement a change that would accept both dictionaries and tables, I just want to understand how this flexibility would be helpful. Does it help you test?

@dhimmel (Member) commented Nov 8, 2016

How would I pass this info from the task? I think being able to pass sample_id and mutation_status should be an option. Being able to pass than as a key/value (ex ...,"TCGA-ZF-AA4X-01":0,... or a table (ex ...],["TCGA-ZF-AA4X-01",0],[...) would be nice.

@awm33 the added complexity to support multiple ways of inputting the sample/mutation information is not worth it IMO. It's trivial to convert between the three formats you mention -- why not just convert before calling the function?

I'm very open to changing the names to sample_ids and mutation_statuses if that clarifies things. Or changing the function to take a dict like sample_to_status. But I'm not convinced the added complexity of supporting duplicate ways of providing the same information is warranted.
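The "trivial to convert" claim above holds up: each of the three shapes mentioned is a one-liner away from the others (sample ids below are taken from the comment; the variable names are illustrative).

```python
# Table form: list of (sample_id, mutation_status) pairs.
pairs = [('TCGA-ZF-AA4X-01', 0), ('TCGA-22-5472-01', 1)]

# Key/value form.
mapping = dict(pairs)

# Parallel-lists form.
sample_ids = [s for s, _ in pairs]
mutation_statuses = [m for _, m in pairs]

# And back again: parallel lists -> key/value.
assert dict(zip(sample_ids, mutation_statuses)) == mapping
```

Since conversion is this cheap, the caller can normalize to a single canonical format before invoking cognoml.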

I just want to understand how this flexibility would be helpful. Does it help you test?

Your thoughts here could help my opinion "evolve".

Diff for hippo JSON output highlighted the bug:

<     "positive_prevalence": 0.27846,
---
>     "positive_prevalence": -0.11771,
@dhimmel dhimmel merged commit 459e36f into cognoma:master Nov 10, 2016
@dhimmel (Member) commented Nov 10, 2016

@jessept congrats on this monumental pull request. I made one small change (8cfa0b9). After this change, this pull request yields the same JSON as previously. Now that we have this first major refactoring in place, we should aim for smaller, more modular pull requests, so we can keep development speedy. Cheers! 🍾

@dhimmel (Member) commented Nov 10, 2016

@jessept you will want to use these guidelines to keep your fork synced. You will want to make sure your local master branch is never ahead of the cognoma master. Make changes and pull requests on branches.

@jessept (Collaborator, author) commented Nov 10, 2016

@dhimmel thanks for your patience/persistence helping push this through, especially the code review. Will do on the fork syncing going forward, smaller pull requests ahead!

@awm33 (Member) commented Nov 11, 2016

@jessept Sorry for getting back to this late, but maybe we can address making changes for the worker in another PR.

A given classifier task has a list of entrez ids and disease types. The worker code will query for any samples that match the list of disease types and join that to the mutations table. The result will not be in a [sample_id, mutation_status] form, so the worker needs to transform it into that form and pass it to the cognoml code.

So what I am looking for is a way to pass that directly somehow.
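The worker-side transform described above might look like this; the row dict keys and the helper name are assumptions, and a sample is treated as mutation-positive if any joined row reports a mutation in the task's entrez gene ids:

```python
def to_task_input(query_rows):
    """Collapse joined query rows into sorted (sample_id, mutation_status) pairs."""
    status = {}
    for row in query_rows:
        sample_id = row['sample_id']
        # Any mutated row for a sample marks that sample as positive.
        status[sample_id] = max(status.get(sample_id, 0), int(row['mutated']))
    return sorted(status.items())
```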

@dhimmel I was giving those as examples for passing the data. So one of those formats, not all of them.
