Request of clarifications about processed data in `cpg0019` #360

jasperhyp · 2025-01-01T19:44:20Z

Hi! Thanks for creating this great resource. I was aware that the processed datasets used in this study have been uploaded to CPG as indicated here -- thanks for sharing!

Could you kindly clarify whether the processed dataset contains only the training split and not the validation split (so that it is not the full processed dataset, e.g. cpg0012)? I ask because the parent folder is named training_images (e.g. cpg0019-moshkov-deepprofiler/broad/training_images/BBBC036/). Also, there seem to be far fewer folders in the processed BBBC036 than in the original cpg0012 images, as seen here. For example, 24277 is not in the processed version of BBBC036/CDRP, and even in 24278 many subfolders that exist in the raw dataset are missing from the processed dataset. Could you please clarify? (See third comment.) I would also appreciate it if you could suggest possible ways to acquire the full processed datasets. If those are not readily available, could you please point me towards the script/notebook that would generate the processed images from the raw ones?
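The folder comparison described above (which plate folders exist in the raw images but not in the processed ones) can be sketched as a set difference over two object-key listings. This is only an illustration: the `raw/` and `processed/` prefixes and the file names are hypothetical placeholders, and the key lists would have to come from an actual listing of the two buckets.

```python
def plate_folders(keys, prefix):
    """Return the set of first-level folder names found under `prefix`."""
    plates = set()
    for key in keys:
        if not key.startswith(prefix):
            continue
        parts = key[len(prefix):].lstrip("/").split("/")
        # A key sitting directly under the prefix has no subfolder; skip it.
        if len(parts) > 1 and parts[0]:
            plates.add(parts[0])
    return plates


def missing_plates(raw_keys, raw_prefix, processed_keys, processed_prefix):
    """Plate folders present in the raw listing but absent from the processed one."""
    raw = plate_folders(raw_keys, raw_prefix)
    processed = plate_folders(processed_keys, processed_prefix)
    return sorted(raw - processed)


# Hypothetical listings, for illustration only.
raw_keys = ["raw/24277/a.tif", "raw/24278/b.tif"]
processed_keys = ["processed/24278/b.png"]
print(missing_plates(raw_keys, "raw/", processed_keys, "processed/"))
```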

Thank you very much in advance! Happy New Year!

jasperhyp · 2025-01-02T02:05:42Z

In particular, perhaps this is related?

jasperhyp · 2025-01-05T19:08:40Z

Oops, I realized that the training data is composed of subsets of those datasets, as stated in the paper:

> We selected 348 treatments from BBBC022 (strongest 35%), 354 from BBBC036 (strongest 23%) and 47 treatments from BBBC037 (strongest 23%). We complemented these treatments with the corresponding replicates in the LINCS and BBBC043 datasets, and added 7 new compounds and 32 new gene overexpression perturbations, resulting in 488 treatments in total (Fig. 4).

Still, could you please suggest possible ways to acquire the full processed datasets? If that's not readily available, could you please kindly point me towards the script/notebook that would generate the processed images from raw ones? Thanks!

Arkkienkeli · 2025-01-06T08:15:09Z

Hi @jasperhyp, the datasets were processed with DeepProfiler's compression pipeline (full images), and single cells were then extracted from those compressed images. The full datasets can be found in the corresponding CPG/BBBC entries.
For BBBC037, BBBC022, and BBBC036 you can reuse the shared metadata in s3://cellpainting-gallery/cpg0019-moshkov-deepprofiler/broad/workspace_dl/metadata/. For the other two datasets you would need to prepare the metadata for preprocessing on your own. To avoid unnecessary preprocessing of LINCS and LUAD, you can download only the plates that are mentioned in s3://cellpainting-gallery/cpg0019-moshkov-deepprofiler/broad/workspace_dl/metadata/sc-metadata.csv

We did not share this intermediate data.
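As a rough sketch of the plate-filtering suggestion above, the snippet below collects the distinct plate identifiers from a metadata CSV so that only those plates need to be downloaded. The column name `Metadata_Plate` is an assumption and should be checked against the actual header of sc-metadata.csv; the sample rows are made up for illustration.

```python
import csv
import io


def plates_in_metadata(csv_text, plate_column="Metadata_Plate"):
    """Return the sorted, de-duplicated plate identifiers listed in a metadata CSV.

    `plate_column` is an assumed column name, not confirmed by the thread.
    """
    reader = csv.DictReader(io.StringIO(csv_text))
    if reader.fieldnames is None or plate_column not in reader.fieldnames:
        raise KeyError(f"column {plate_column!r} not found in CSV header")
    return sorted({row[plate_column] for row in reader if row[plate_column]})


# Made-up sample rows; the real file would be read from disk after downloading it.
sample = "Metadata_Plate,Metadata_Well\n24278,A01\n24278,A02\n24279,A01\n"
print(plates_in_metadata(sample))
```

The resulting list could then drive a per-plate download of the raw images, so that plates absent from the metadata are never fetched.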

jasperhyp · 2025-01-08T03:26:48Z

Hi @Arkkienkeli , Thank you for the clarification! Currently, I am primarily interested in the BBBC036 dataset.

If I understand correctly, there are no single-cell crops for the full BBBC036 dataset, and these are the processing steps that convert the original images into processed single-cell crops like those in cpg0019-moshkov-deepprofiler/broad/training_images/BBBC036/:

1. Full image --> compressed & illumination-corrected image (`python3 deepprofiler --root=/home/ubuntu/project/ --config filename.json prepare`)
2. Full image --> single-cell nuclei locations (requires CellProfiler to generate)
3. Compressed image + single-cell nuclei locations --> single-cell crops (`python deepprofiler --root=/path/deepprofiler_project/ --config=config.json --metadata=metadata.csv --single-cells=sample --gpu 0 export-sc`, as you suggested in this issue)
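For scripting, the two DeepProfiler invocations quoted in the steps above could be assembled as argument lists. This is a minimal sketch that only mirrors the flags exactly as written in this thread; the helper names are hypothetical, the paths are the placeholder values from the quoted commands, and the CellProfiler step that produces nuclei locations is not covered.

```python
import shlex


def prepare_command(root, config):
    """Argument list for the compression + illumination-correction step."""
    return ["python3", "deepprofiler", f"--root={root}", "--config", config, "prepare"]


def export_sc_command(root, config, metadata, gpu="0"):
    """Argument list for the single-cell export step."""
    return [
        "python", "deepprofiler",
        f"--root={root}",
        f"--config={config}",
        f"--metadata={metadata}",
        "--single-cells=sample",
        "--gpu", gpu,
        "export-sc",
    ]


# Print the commands instead of running them, so they can be inspected first.
print(shlex.join(prepare_command("/home/ubuntu/project/", "filename.json")))
print(shlex.join(export_sc_command("/path/deepprofiler_project/", "config.json", "metadata.csv")))
```

Each list could be handed to `subprocess.run(..., check=True)` once the paths point at a real project.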

Could you confirm if the above understanding is correct?

If this is the case, I believe the following are needed to generate single-cell crops for the full BBBC036 dataset beyond the raw images in cellpainting-gallery/cpg0012-wawer-bioactivecompoundprofiling/broad/images/CDRP/images:

- Metadata: Thank you for pointing out the shared metadata in s3://cellpainting-gallery/cpg0019-moshkov-deepprofiler/broad/workspace_dl/metadata/. I just checked, and it contains a total of 2238 compounds, which is fewer than the complete set of 30412 compounds (as indicated by the associated CellProfiler features). Could you please clarify whether I need to generate the full metadata, following this file?
- X, Y locations of single-cell nuclei: These do not seem to be available in either the cpg0012 repo or here. Could you kindly share a pipeline configuration for this (I assume the metadata will also be generated along the way)? It seems standard and straightforward, but I am wary of using a different set of parameters.

Edit: I am now considering using random crops instead of single-cell crops as it requires less effort in preprocessing. In that case, only Step 1 (compression + illumination correction) needs to be run. Could you kindly point me towards the correct json configuration to use for --config?
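To make the random-crop idea concrete, here is a minimal sketch that samples square crop boxes uniformly within an image. It is not DeepProfiler's own cropping logic; the function name, image size, and crop size are all hypothetical choices for illustration.

```python
import random


def random_crop_boxes(height, width, crop_size, n, seed=None):
    """Sample `n` square crop boxes as (top, left, bottom, right) tuples.

    Every box lies fully inside an image of shape (height, width).
    """
    if crop_size > height or crop_size > width:
        raise ValueError("crop_size must fit inside the image")
    rng = random.Random(seed)
    boxes = []
    for _ in range(n):
        # randint is inclusive on both ends, so the largest offset still fits.
        top = rng.randint(0, height - crop_size)
        left = rng.randint(0, width - crop_size)
        boxes.append((top, left, top + crop_size, left + crop_size))
    return boxes


# Hypothetical image and crop dimensions.
for box in random_crop_boxes(height=520, width=696, crop_size=128, n=3, seed=0):
    print(box)
```

Fixing `seed` keeps the sampled boxes reproducible across runs, which helps when the same crops need to be taken from every channel of a field of view.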

Thank you very much for your time and effort in sharing all these!

jamila-griffith · 2025-01-10T03:21:34Z

It would be great if RunDeepProfiler could be made functional so that it runs in the same pipeline as CellProfiler.

jasperhyp mentioned this issue Jan 1, 2025

Issue on page /machine_learning.html broadinstitute/cellpainting-gallery#107

Closed
