Request of clarifications about processed data in cpg0019 #360

Open
jasperhyp opened this issue Jan 1, 2025 · 5 comments

Comments

@jasperhyp

jasperhyp commented Jan 1, 2025

Hi! Thanks for creating this great resource. I was aware that the processed datasets used in this study have been uploaded to CPG as indicated here -- thanks for sharing!

I was wondering if you could clarify whether the processed dataset contains only the training split and not the validation split (so that it is not, e.g., the full processed cpg0012 dataset), since the parent folder is named training_images (e.g. cpg0019-moshkov-deepprofiler/broad/training_images/BBBC036/). Also, there seem to be far fewer folders in the processed BBBC036 than in the original cpg0012 images as in here. For example, 24277 is not in the processed version of BBBC036/CDRP, and even in 24278 many subfolders are missing from the processed dataset compared with the raw dataset. Could you please clarify? (See third comment.) I would also appreciate it if you could suggest possible ways to acquire the full processed datasets. If they are not readily available, could you please point me towards the script/notebook that would generate the processed images from the raw ones?
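
To make the folder comparison concrete, here is a minimal sketch of how the two listings could be diffed. It assumes the listings are saved text output of `aws s3 ls <prefix>/`, where sub-folders appear as `PRE name/` lines. The function names `parse_prefixes` and `missing_folders` are hypothetical, and the sample listings are illustrative only (they mirror the 24277/24278 example above rather than the real bucket contents).

```python
def parse_prefixes(listing: str) -> set[str]:
    """Extract folder names from `aws s3 ls` output lines of the form 'PRE name/'."""
    prefixes = set()
    for line in listing.splitlines():
        parts = line.split()
        # Folder entries have exactly two fields: the marker 'PRE' and the name.
        if len(parts) == 2 and parts[0] == "PRE":
            prefixes.add(parts[1].rstrip("/"))
    return prefixes


def missing_folders(raw_listing: str, processed_listing: str) -> list[str]:
    """Return folders present in the raw listing but absent from the processed one."""
    return sorted(parse_prefixes(raw_listing) - parse_prefixes(processed_listing))


# Illustrative listings only; real ones would come from `aws s3 ls <prefix>/`.
raw = """
                           PRE 24277/
                           PRE 24278/
"""
processed = """
                           PRE 24278/
"""
print(missing_folders(raw, processed))  # prints ['24277']
```

The same comparison can be repeated one level down (e.g. inside 24278) to see which subfolders are absent.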

Thank you very much in advance! Happy New Year!

@jasperhyp
Author

In particular, perhaps this is related?

@jasperhyp
Author

jasperhyp commented Jan 5, 2025

Oops, I realized that the training data is composed of subsets of those datasets, as stated in the paper:

We selected 348 treatments from BBBC022 (strongest 35%), 354 from BBBC036 (strongest 23%) and 47 treatments from BBBC037 (strongest 23%). We complemented these treatments with the corresponding replicates in the LINCS and BBBC043 datasets, and added 7 new compounds and 32 new gene overexpression perturbations, resulting in 488 treatments in total (Fig. 4).

Still, could you please suggest possible ways to acquire the full processed datasets? If that's not readily available, could you please kindly point me towards the script/notebook that would generate the processed images from raw ones? Thanks!

@Arkkienkeli
Member

Hi @jasperhyp, the datasets were processed with the compression pipeline of DeepProfiler (full images), and single cells were then extracted from those compressed images. The full datasets can be found in the corresponding CPG/BBBC entries.
For BBBC037, BBBC022, and BBBC036 you can reuse the shared metadata in s3://cellpainting-gallery/cpg0019-moshkov-deepprofiler/broad/workspace_dl/metadata/. For the other two datasets you would need to prepare the metadata for preprocessing on your own. To avoid unnecessary preprocessing of LINCS and LUAD, you can download only the plates that are mentioned in s3://cellpainting-gallery/cpg0019-moshkov-deepprofiler/broad/workspace_dl/metadata/sc-metadata.csv

We did not share this intermediate data.
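
As a rough illustration of restricting downloads to the plates listed in sc-metadata.csv, the sketch below collects the distinct plate identifiers from a local copy of that file. The column name `Metadata_Plate` is an assumption (check the actual header of the file), `plates_to_download` is a hypothetical helper, and the sample rows are made up.

```python
import csv
import io


def plates_to_download(csv_text: str, plate_column: str = "Metadata_Plate") -> list[str]:
    """Return the sorted, de-duplicated plate identifiers found in a metadata CSV."""
    reader = csv.DictReader(io.StringIO(csv_text))
    if plate_column not in (reader.fieldnames or []):
        # The column name is an assumption, so fail loudly if it is wrong.
        raise KeyError(f"column {plate_column!r} not found in {reader.fieldnames}")
    return sorted({row[plate_column] for row in reader if row[plate_column]})


# Made-up rows standing in for the real sc-metadata.csv contents.
sample = "Metadata_Plate,Metadata_Well\nP2,A01\nP1,A01\nP2,A02\n"
print(plates_to_download(sample))  # prints ['P1', 'P2']
```

The resulting list could then drive one download per plate folder instead of syncing a whole dataset.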

@jasperhyp
Author

jasperhyp commented Jan 8, 2025

Hi @Arkkienkeli, thank you for the clarification! Currently, I am primarily interested in the BBBC036 dataset.

If I understand correctly, there are no single-cell crops for the full BBBC036 dataset, and these are the processing steps that convert the original images to processed single-cell crops as in cpg0019-moshkov-deepprofiler/broad/training_images/BBBC036/:

  1. Full image --> compressed & illumination-corrected image (python3 deepprofiler --root=/home/ubuntu/project/ --config filename.json prepare)
  2. Full image --> single-cell nuclei locations (requires CellProfiler to generate)
  3. Compressed image + single-cell nuclei locations --> single-cell crops (python deepprofiler --root=/path/deepprofiler_project/ --config=config.json --metadata=metadata.csv --single-cells=sample --gpu 0 export-sc as you suggested in this issue)
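
The two DeepProfiler invocations in steps 1 and 3 can be sketched as argument lists, which makes it easier to script them per dataset. This only mirrors the flags exactly as quoted in the steps above and does not confirm that they are the right ones. The helper names `prepare_command` and `export_sc_command` are hypothetical, the paths are placeholders, and step 2 (CellProfiler) is not covered.

```python
import shlex


def prepare_command(root: str, config: str) -> list[str]:
    """Step 1: compression and illumination correction, as quoted in the thread."""
    return ["python3", "deepprofiler", f"--root={root}", "--config", config, "prepare"]


def export_sc_command(root: str, config: str, metadata: str, gpu: str = "0") -> list[str]:
    """Step 3: single-cell crop export, as quoted in the thread."""
    return [
        "python", "deepprofiler",
        f"--root={root}",
        f"--config={config}",
        f"--metadata={metadata}",
        "--single-cells=sample",
        "--gpu", gpu,
        "export-sc",
    ]


# Print the commands instead of running them; pass a list to subprocess.run to execute.
for cmd in (
    prepare_command("/home/ubuntu/project/", "filename.json"),
    export_sc_command("/path/deepprofiler_project/", "config.json", "metadata.csv"),
):
    print(shlex.join(cmd))
```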

Could you confirm if the above understanding is correct?

If this is the case, I believe the following are needed to generate single-cell crops for the full BBBC036 dataset beyond the raw images in cellpainting-gallery/cpg0012-wawer-bioactivecompoundprofiling/broad/images/CDRP/images:

  1. Metadata: Thank you for pointing out the shared metadata in s3://cellpainting-gallery/cpg0019-moshkov-deepprofiler/broad/workspace_dl/metadata/. I just checked and it contains a total of 2238 compounds, which is smaller than the complete set of 30412 compounds (as indicated by the associated CellProfiler features). Could you please clarify if I need to generate the full metadata, following this file?
  2. X, Y locations of single-cell nuclei: These do not seem to be available in either the cpg0012 repo or here. Could you kindly share a potential pipeline configuration for this (I assume the metadata will also be generated along the way)? It seems to be standard and straightforward, but I am wary of using a different set of parameters.

Edit: I am now considering using random crops instead of single-cell crops, since they require less preprocessing effort. In that case, only Step 1 (compression + illumination correction) needs to be run. Could you kindly point me towards the correct JSON configuration to use for --config?
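
For reference, the random-crop idea could look roughly like the sketch below, which only samples crop boxes and leaves image loading aside. It is not part of DeepProfiler; `random_crop_boxes` is a hypothetical helper, and the image and crop sizes are placeholder values rather than the real dimensions of the compressed images.

```python
import random


def random_crop_boxes(height: int, width: int, crop_size: int,
                      n_crops: int, seed: int = 0) -> list[tuple[int, int, int, int]]:
    """Sample square crop boxes (top, left, bottom, right) that lie fully inside the image."""
    if crop_size > height or crop_size > width:
        raise ValueError("crop_size must not exceed the image dimensions")
    rng = random.Random(seed)  # seeded so the same crops can be regenerated
    boxes = []
    for _ in range(n_crops):
        # randint is inclusive on both ends, so the largest offset still fits.
        top = rng.randint(0, height - crop_size)
        left = rng.randint(0, width - crop_size)
        boxes.append((top, left, top + crop_size, left + crop_size))
    return boxes


# Placeholder sizes for illustration only.
for box in random_crop_boxes(height=1000, width=1000, crop_size=128, n_crops=3):
    print(box)
```

Each box can be applied to every channel of the same field of view so that the channels stay aligned.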

Thank you very much for your time and effort in sharing all these!

@jamila-griffith

It would be great if RunDeepProfiler could be made functional so that it runs in the same pipeline as CellProfiler.
