feat: constant memory dataset DownloadValidate #7398

Draft: wants to merge 17 commits into base: main

nayib-jose-gloria (Contributor) commented Dec 11, 2024

Reason for Change

Changes

  • Replace the chunked matrix traversal in extract_metadata with a call to count_matrix_nonzero in cellxgene_schema, which does the same thing
  • When populating the dataset citation, update dataset.uns in place using h5py rather than re-opening and rewriting the entire AnnData object
  • Use the memory-efficient read_h5ad call from cellxgene_schema in extract_metadata
  • Update the Validate batch jobs to use a constant memory allocation
  • Delete ProcessDownload (its more complex logic is no longer required with a constant memory allocation) and instead incorporate the URI -> S3 upload into the ProcessValidate step
  • Update the SFN infra to remove the obviated steps (Download, RegisterJobDefinition, DeregisterJobDefinition)
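
For reviewers unfamiliar with the pattern, here is a minimal sketch of what a chunked nonzero count looks like and why it keeps peak memory roughly constant. This is not the cellxgene_schema implementation of count_matrix_nonzero; the helper name `count_nonzero_chunked` and the `chunk_rows` parameter are hypothetical and chosen for illustration only. It assumes the matrix exposes `.shape` and supports row slicing, as a NumPy array or an h5py dataset does.

```python
import numpy as np


def count_nonzero_chunked(matrix, chunk_rows=10_000):
    """Count nonzero entries by visiting a fixed number of rows at a time.

    Hypothetical helper for illustration; NOT the cellxgene_schema function.
    `matrix` must expose `.shape` and support row slicing (for example a
    NumPy array or an h5py dataset).
    """
    if chunk_rows < 1:
        raise ValueError("chunk_rows must be positive")

    total = 0
    n_rows = matrix.shape[0]
    for start in range(0, n_rows, chunk_rows):
        # Only this slice of rows is materialized in memory at once,
        # so peak usage depends on chunk_rows, not on the matrix size.
        chunk = np.asarray(matrix[start:start + chunk_rows])
        total += int(np.count_nonzero(chunk))
    return total


# Small usage example with a dense in-memory array.
dense = np.array([[0, 1, 0], [2, 0, 3], [0, 0, 0], [4, 0, 0]])
result = count_nonzero_chunked(dense, chunk_rows=2)
```

Because each iteration holds at most `chunk_rows` rows, memory use is bounded by the chunk size rather than by the dataset, which is the property the constant-memory Validate job relies on.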

Testing steps

  • Local testing with large datasets
  • TODO: rdev testing

Notes for Reviewer

  • TODO: determine the constant memory value to set, based on testing
  • TODO: fix tests as needed

Deployment Summary
