MAST: Add flat option to Observations.download_products #2511

jdavies-st · 2022-09-08T07:49:28Z

This adds a flat boolean option to Observations.download_products() that does not make any subdirectories for the downloads. Default is turned off, i.e. False.

If set to True, and download_dir is specified, it will put all files into download_dir without subdirectories.

If set to True and download_dir is not specified, it will put files in the current working directory, again with no subdirectories.

Setting flat=True has no effect when curl_flag=True, as the bash script is server-side generated at MAST.

The default of False retains the current behavior and puts files into the standard directory structure of mastDownload/ <obs_collection> / <obs_id> /.

Resolves #2500.

codecov · 2022-09-08T09:25:16Z

Codecov Report

Merging #2511 (279c683) into main (c9972e0) will decrease coverage by 0.00%.
The diff coverage is 78.57%.

@@            Coverage Diff             @@
##             main    #2511      +/-   ##
==========================================
- Coverage   62.98%   62.98%   -0.01%     
==========================================
  Files         133      133              
  Lines       17308    17314       +6     
==========================================
+ Hits        10902    10905       +3     
- Misses       6406     6409       +3

Impacted Files	Coverage Δ
astroquery/mast/observations.py	`76.01% <78.57%> (-0.66%)`	⬇️

📣 We’re building smart automated test selection to slash your CI/CD build times. Learn more

eerovaher · 2022-09-08T13:27:57Z

astroquery/mast/observations.py

+            if not flat:
+                local_path = os.path.join(base_dir, data_product['obs_collection'], data_product['obs_id'])
+                if not os.path.exists(local_path):
+                    os.makedirs(local_path)
+            else:
+                local_path = base_dir


Have you considered using Path from pathlib?

Suggested change

if not flat:

local_path = os.path.join(base_dir, data_product['obs_collection'], data_product['obs_id'])

if not os.path.exists(local_path):

os.makedirs(local_path)

else:

local_path = base_dir

if flat:

local_path = Path(base_dir)

else:

local_path = Path(base_dir, data_product['obs_collection'], data_product['obs_id'])

local_path.mkdir(parents=True, exist_ok=True)

I did consider, I just didn't know what the downstream effects of that would be, given that all paths in the module use the older os.path routines, as do most in the core library. You'll notice in the test I did use the pytest tmp_path fixture, which supplies a temporary Path object as location to write the temporary downloaded file. And it works just fine. So it should work, but maybe it would be good to do a consistent cleanup throughout the module?

I did fix

https://github.com/astropy/astroquery/pull/2511/files#diff-55036b43100098d3295a04581bde036bd91d0371ed055fb0dabd0d8ddbfb6b5bL738

base_dir = download_dir.rstrip('/') + "/mastDownload"

which I believe was probably a bug, as that cannot work with a Windows filesystem. But I held off doing anymore changes that were not focused on adding this new feature. Happy to do that in a subsequent PR so as not to detract from this one.

Yes, it would be better to be consistent throughout the module and switching to pathlib sounds like something that deserves a separate pull request.

I would rather not mix the two without a good reason, and even pathlib's own documentation refers that os.path can be used. That being said I would merge a PR if it consistently changes all os.path to pathlib in the package (but if so, I would still not like to see / operators as those may be misleading during quick code reads (and because we still get a lot of contributions doing raw sting manipulations and additions using '/' in them.

astroquery/mast/tests/test_mast_remote.py

bsipocz

There is a possibly relevant test failure (in fact likely introduced in the previous PRs), could you please have a look at it, too?

======================================================= FAILURES =======================================================
_______________________________ TestMast.test_observations_get_cloud_uris_no_duplicates ________________________________

self = <astroquery.mast.tests.test_mast_remote.TestMast object at 0x117450c70>
msa_product_table = <Table masked=True length=6>
 obsID   obs_collection dataproduct_type             obs_id             ...  size  parent...
87601178           JWST            image jw02736008001_03101_00004_nrs2 ... 567360     87602009     PUBLIC           1

    def test_observations_get_cloud_uris_no_duplicates(self, msa_product_table):
    
        # Get a product list with 6 duplicate JWST MSA config files
        products = msa_product_table
    
        # enable access to public AWS S3 bucket
        mast.Observations.enable_cloud_dataset()
    
        # Check duplicate cloud URIs as well
>       uris = mast.Observations.get_cloud_uris(products)

astroquery/mast/tests/test_mast_remote.py:347: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
astroquery/mast/observations.py:791: in get_cloud_uris
    return self._cloud_connection.get_cloud_uri_list(data_products, include_bucket, full_url)
astroquery/mast/cloud.py:161: in get_cloud_uri_list
    return [self.get_cloud_uri(product, include_bucket, full_url) for product in data_products]
astroquery/mast/cloud.py:161: in <listcomp>
    return [self.get_cloud_uri(product, include_bucket, full_url) for product in data_products]
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

self = <astroquery.mast.cloud.CloudAccess object at 0x1174632b0>
data_product = <Row index=0>
 obsID   obs_collection dataproduct_type             obs_id             description  type               ...                 -- CALJWST         --        2736 jw02736008001_01_msa.fits 567360     87602009     PUBLIC           1
include_bucket = True, full_url = False

    def get_cloud_uri(self, data_product, include_bucket=True, full_url=False):
        """
        For a given data product, returns the associated cloud URI.
        If the product is from a mission that does not support cloud access an
        exception is raised. If the mission is supported but the product
        cannot be found in the cloud, the returned path is None.
    
        Parameters
        ----------
        data_product : `~astropy.table.Row`
            Product to be converted into cloud data uri.
        include_bucket : bool
            Default True. When false returns the path of the file relative to the
            top level cloud storage location.
            Must be set to False when using the full_url argument.
        full_url : bool
            Default False. Return an HTTP fetchable url instead of a cloud uri.
            Must set include_bucket to False to use this option.
    
        Returns
        -------
        response : str or None
            Cloud URI generated from the data product. If the product cannot be
            found in the cloud, None is returned.
        """
    
        s3_client = self.boto3.client('s3', config=self.config)
    
        path = utils.mast_relative_path(data_product["dataURI"])
        if path is None:
            raise InvalidQueryError("Malformed data uri {}".format(data_product['dataURI']))
    
        if 'galex' in path:
            path = path.lstrip("/mast/")
        else:
            path = path.lstrip("/")
    
        try:
            s3_client.head_object(Bucket=self.pubdata_bucket, Key=path)
            if include_bucket:
                path = "s3://{}/{}".format(self.pubdata_bucket, path)
            elif full_url:
                path = "http://s3.amazonaws.com/{}/{}".format(self.pubdata_bucket, path)
            return path
        except self.botocore.exceptions.ClientError as e:
            if e.response['Error']['Code'] != "404":
                raise
    
>       warnings.warn("Unable to locate file {}.".format(data_product['productFilename']), NoResultsWarning)
E       astroquery.exceptions.NoResultsWarning: Unable to locate file jw02736008001_01_msa.fits.

astroquery/mast/cloud.py:135: NoResultsWarning

bsipocz · 2022-09-08T15:56:30Z

astroquery/mast/observations.py

@@ -584,7 +584,7 @@ def download_file(self, uri, *, local_path=None, base_url=None, cache=True, clou

        return status, msg, url

-    def _download_files(self, products, base_dir, *, cache=True, cloud_only=False,):
+    def _download_files(self, products, base_dir, flat=False, *, cache=True, cloud_only=False,):


just being extra nitpicky/pedantic as this is not public api, please add the kwarg after the *

Ah, right. Thanks. It would be nice to eventually expunge that * and be specific about what args are accepted, but I suspect that's a separate issue.

I'm not sure I understand what you mean, the * is doing exactly that enforcement of being specific.

I meant that it would be better for documentation (autodoc) to specify all the args that * allows, but I guess that varies with dataset as the allowable filters vary? No matter, not super relevant this PR. And it's a private method, so not that important for documenting the API.

I may still misunderstand you, there is no args in *, it's the syntax to enforce keyword-only arguments (not to be the same as when it says *args or similar which would allow additional positional arguments to be passed.)

https://peps.python.org/pep-3102/

* limits the number of positional arguments, see https://peps.python.org/pep-3102/

Ah, thank you! I didn't know about this fun feature. 😄

One of my favourites, and would really love to see that we actually get to #1746 (mast is already done thanks to Jenny)

jdavies-st · 2022-09-09T09:39:07Z

@bsipocz That test failure is real and peculiar, as I recall it passing when I developed the test. I see that it does fail now, even in the refactored form. And I can't get it to pass. That said, it's not required that it finds the file on S3, only that it removes duplicates, so I've told it to ignore the NoResultsWarning. The test is only failing because warnings are turned into exceptions.

The AWS S3 stuff is a bit opaque to me I'm afraid. If you have any suggestions on a better way test, I'd be happy to hear.

Anyway, test now passes, though it may fail if it suddenly starts finding that file in S3, which means something like with pytest.warns(None): might be the better context manager.

jdavies-st · 2022-09-09T10:17:08Z

Btw, this PR does not implement flat downloads when curl_flag=True is used. This is because the curl script that is downloaded is generated server side at MAST via MAST's bundle API, and astroquery.mast has no control over how this is done. Bummer.

So that is a current limitation.

eerovaher · 2022-09-09T12:58:12Z

There should be a warning if flat and curl_flag are both True.

jdavies-st · 2022-09-09T15:27:00Z

There should be a warning if flat and curl_flag are both True.

Yes, agree. Will add.

jdavies-st · 2022-09-09T15:55:55Z

astroquery/mast/observations.py

@@ -642,8 +648,8 @@ def _download_curl_script(self, products, out_dir):
        """

        url_list = [("uri", url) for url in products['dataURI']]
-        download_file = "mastDownload_" + time.strftime("%Y%m%d%H%M%S")
-        local_path = os.path.join(out_dir.rstrip('/'), download_file + ".sh")


The .rstrip("/") here is unnecessary and is actually a bug if the path is a Path object, something writing the test caught.

bsipocz · 2022-09-10T01:28:10Z

Thank you @jdavies-st!

jdavies-st added 2 commits September 8, 2022 11:05

Add flat option for now subdirs in Observation.download_products

8a50015

Add test for flat MAST downloads; consolidate test setup in fixture

5ac3501

keflavich approved these changes Sep 8, 2022

View reviewed changes

jdavies-st force-pushed the mast-no-subdirs branch from 4933462 to 5ac3501 Compare September 8, 2022 09:19

Update CHANGES.rst

ed97edb

eerovaher reviewed Sep 8, 2022

View reviewed changes

astroquery/mast/tests/test_mast_remote.py Outdated Show resolved Hide resolved

bsipocz added the mast label Sep 8, 2022

bsipocz added this to the v0.4.7 milestone Sep 8, 2022

bsipocz requested changes Sep 8, 2022

View reviewed changes

jdavies-st added 2 commits September 9, 2022 11:36

Fix arg/kwarg order in _download_files() call signature

dd3a352

Fix tests

1ec8393

Update docstring for curl_flag=True case

9282366

Warn when flat=True and curl_flag=True

279c683

jdavies-st commented Sep 9, 2022

View reviewed changes

bsipocz approved these changes Sep 10, 2022

View reviewed changes

bsipocz merged commit fae02b8 into astropy:main Sep 10, 2022

jdavies-st deleted the mast-no-subdirs branch September 10, 2022 06:30

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

MAST: Add flat option to Observations.download_products #2511

MAST: Add flat option to Observations.download_products #2511

jdavies-st commented Sep 8, 2022 •

edited

Loading

codecov bot commented Sep 8, 2022 •

edited

Loading

eerovaher Sep 8, 2022

jdavies-st Sep 8, 2022

eerovaher Sep 8, 2022

bsipocz Sep 8, 2022

bsipocz left a comment

bsipocz Sep 8, 2022

jdavies-st Sep 9, 2022

bsipocz Sep 9, 2022 •

edited

Loading

jdavies-st Sep 9, 2022

bsipocz Sep 9, 2022

eerovaher Sep 9, 2022

jdavies-st Sep 9, 2022

bsipocz Sep 9, 2022

jdavies-st commented Sep 9, 2022

jdavies-st commented Sep 9, 2022

eerovaher commented Sep 9, 2022

jdavies-st commented Sep 9, 2022

jdavies-st Sep 9, 2022

bsipocz commented Sep 10, 2022

MAST: Add flat option to Observations.download_products #2511

MAST: Add flat option to Observations.download_products #2511

Conversation

jdavies-st commented Sep 8, 2022 • edited Loading

codecov bot commented Sep 8, 2022 • edited Loading

Codecov Report

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

bsipocz left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

bsipocz Sep 9, 2022 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

jdavies-st commented Sep 9, 2022

jdavies-st commented Sep 9, 2022

eerovaher commented Sep 9, 2022

jdavies-st commented Sep 9, 2022

Choose a reason for hiding this comment

bsipocz commented Sep 10, 2022

jdavies-st commented Sep 8, 2022 •

edited

Loading

codecov bot commented Sep 8, 2022 •

edited

Loading

bsipocz Sep 9, 2022 •

edited

Loading