-
Notifications
You must be signed in to change notification settings - Fork 147
Feature: Implement classification algorithm #2
Comments
Not sure if this is a dumb question, but given that there will probably be multiple models in the end, how do you intend to separate them? Would it make sense to to further divide |
@reiinakano Not a dumb question at all, and in fact having a "model zoo" was considered. It's not out of the question as the project moves forward, but until we have something to go from the current goal is just to have a best-available frozen model (+ its preprocessing and training pipeline) for each task. |
Thanks for the answer. My concern is that you're considering a lot of different algorithms from the get-go (as seen in #18 - #28). What happens when multiple people decide working on different algorithms? Personally I don't think it would be a terrible idea to separate them into subpackages from the start to facilitate parallel development. As long as they have the same interface, switching to the best-available model is a matter of changing the subpackage name in calls. Also, if the best model moving forward is an ensemble of the algorithms (as is virtually always the case), they're all already nicely separated for easy integration into an ensemble. |
Agreed that could be a useful refactoring once the pain point exists. If you're interested in working on this bit of architecture I'd encourage you to keep alert for the right time (could be soon) and propose the change as a refactoring; at the moment it's premature. |
As it was already mentioned: different architectures require different preprocessing methods. Whether it will be a good decision if the input of the feature will include the information about preprocessing function either? I hope it'll be a better to open a new issue dedicated to the preprocessing step. In this way, model may be fed along with a parameters' dictionary. The latter will be used in order to create a preprocessing instance. I don't open the issue myself, because I'm not highly confident in how the expected behaviour should look like. Newertheless I'll be glad to implement such features. |
@vessemer Thanks for the thinking around this. I have a couple questions just to make sure I understand!
You mean should we be passing more information from the interface part back to the prediction part? Or something else?
Sorry, not sure I understand the question? And seems like you probably already grabbed this but as a general observation, our thought is that we'd like to first see one solid implementation even if it will need to be refactored to generalize later on. We're afraid of being "achitecture astronauts" and making things too complicated before everyone has a common point of reference. |
@isms thanks for the quick response. I mean that prior to any model evaluation some preprocessing should take place over the raw DICOM data. I agree with your concern regarding cumbersome architecture at the starting point.
Right now for me, it is not clear what is meant by the 'model': whether it is a model object from some ML backend such as TensorFlow or Theano, or it's some essence of higher level with its own preprocessing method. |
Hi @vessemer, I see you've already added some preprocessing logic in your PR, but #58 may give you a bit more context. Essentially, each competition implementation has it's own set of preprocessing steps. During the integration of any individual implementation, its associated preprocessing (and other) steps should be decoupled so that they work together as modular pieces of a functional whole. As @isms said, in the future, we may go down the path of mixing and matching the best pieces of various implementations and/or allow for an ensemble of implementations to work together.
You're on the right track with your initial thought. Serialized models will live in the |
Hi @reubano, as it was mentioned in the technical details from the issue #2:
The |
@vessemer thanks for pointing that out! The file structure changed since this issues was first created. The new base path is |
@reubano, I've noted that the major part of algorithms aimed at nodule candidates' classification has similar pipelines for preprocessing steps. In general, these pipelines can be separated into two stages: global- (lungs-) and local- (nodule-) levels' operations. It's important to uncouple these stages, i.e. it is neither unnecessary nor consistently to apply zero-mean normalization at the local stage. Moreover, for the local-stage, it is common to apply augmentation on the fly (even at a prediction time). Thus I've introduced the pre-processing class which operates at the global level, along with the possibility to apply it while reading medical files. Related to the data-generators at the local level, it's also quite common to use trivial spatial augmentation, but then the patch approached to the desired shape. The latter transformations may highly deviate from model to model, hence I guess it will be better if we will wrap the |
Merged in #76 |
Overview
Currently, there is a just a placeholder in the algorithm that classifies nodules in scans. Nodules are areas of interest that might be cancerous. We need to adapt the Data Science Bowl algorithms to predict
P(cancer)
for a given set of centroids for nodules.Expected Behavior
Given a model trained to perform this task, a DICOM image, and the nodule centroids, return the
P(cancer)
for each nodule.Design doc reference:
Jobs to be done > Annotate > Prediction service
Technical details
prediction/src/algorithms/classify/trained_model/predict
method.prediction/src/algorithms/classify/src/
folderprediction/src/algorithms/classify/assets/
folder usinggit-lfs
Out of scope
This feature is a first-pass at getting a model that completes the task with the defined input and output. We are not yet judging the model based on its accuracy or computational performance.
Acceptance criteria
NOTE: All PRs must follow the standard PR checklist.
The text was updated successfully, but these errors were encountered: