-
-
Notifications
You must be signed in to change notification settings - Fork 408
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
MAST: Add flat option to Observations.download_products #2511
Conversation
4933462
to
5ac3501
Compare
Codecov Report
@@ Coverage Diff @@
## main #2511 +/- ##
==========================================
- Coverage 62.98% 62.98% -0.01%
==========================================
Files 133 133
Lines 17308 17314 +6
==========================================
+ Hits 10902 10905 +3
- Misses 6406 6409 +3
📣 We’re building smart automated test selection to slash your CI/CD build times. Learn more |
if not flat: | ||
local_path = os.path.join(base_dir, data_product['obs_collection'], data_product['obs_id']) | ||
if not os.path.exists(local_path): | ||
os.makedirs(local_path) | ||
else: | ||
local_path = base_dir |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Have you considered using Path
from pathlib
?
if not flat: | |
local_path = os.path.join(base_dir, data_product['obs_collection'], data_product['obs_id']) | |
if not os.path.exists(local_path): | |
os.makedirs(local_path) | |
else: | |
local_path = base_dir | |
if flat: | |
local_path = Path(base_dir) | |
else: | |
local_path = Path(base_dir, data_product['obs_collection'], data_product['obs_id']) | |
local_path.mkdir(parents=True, exist_ok=True) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I did consider, I just didn't know what the downstream effects of that would be, given that all paths in the module use the older os.path
routines, as do most in the core library. You'll notice in the test I did use the pytest tmp_path
fixture, which supplies a temporary Path
object as location to write the temporary downloaded file. And it works just fine. So it should work, but maybe it would be good to do a consistent cleanup throughout the module?
I did fix
base_dir = download_dir.rstrip('/') + "/mastDownload"
which I believe was probably a bug, as that cannot work with a Windows filesystem. But I held off doing anymore changes that were not focused on adding this new feature. Happy to do that in a subsequent PR so as not to detract from this one.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, it would be better to be consistent throughout the module and switching to pathlib
sounds like something that deserves a separate pull request.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would rather not mix the two without a good reason, and even pathlib
's own documentation refers that os.path
can be used. That being said I would merge a PR if it consistently changes all os.path to pathlib in the package (but if so, I would still not like to see /
operators as those may be misleading during quick code reads (and because we still get a lot of contributions doing raw sting manipulations and additions using '/'
in them.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There is a possibly relevant test failure (in fact likely introduced in the previous PRs), could you please have a look at it, too?
======================================================= FAILURES =======================================================
_______________________________ TestMast.test_observations_get_cloud_uris_no_duplicates ________________________________
self = <astroquery.mast.tests.test_mast_remote.TestMast object at 0x117450c70>
msa_product_table = <Table masked=True length=6>
obsID obs_collection dataproduct_type obs_id ... size parent...
87601178 JWST image jw02736008001_03101_00004_nrs2 ... 567360 87602009 PUBLIC 1
def test_observations_get_cloud_uris_no_duplicates(self, msa_product_table):
# Get a product list with 6 duplicate JWST MSA config files
products = msa_product_table
# enable access to public AWS S3 bucket
mast.Observations.enable_cloud_dataset()
# Check duplicate cloud URIs as well
> uris = mast.Observations.get_cloud_uris(products)
astroquery/mast/tests/test_mast_remote.py:347:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
astroquery/mast/observations.py:791: in get_cloud_uris
return self._cloud_connection.get_cloud_uri_list(data_products, include_bucket, full_url)
astroquery/mast/cloud.py:161: in get_cloud_uri_list
return [self.get_cloud_uri(product, include_bucket, full_url) for product in data_products]
astroquery/mast/cloud.py:161: in <listcomp>
return [self.get_cloud_uri(product, include_bucket, full_url) for product in data_products]
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
self = <astroquery.mast.cloud.CloudAccess object at 0x1174632b0>
data_product = <Row index=0>
obsID obs_collection dataproduct_type obs_id description type ... -- CALJWST -- 2736 jw02736008001_01_msa.fits 567360 87602009 PUBLIC 1
include_bucket = True, full_url = False
def get_cloud_uri(self, data_product, include_bucket=True, full_url=False):
"""
For a given data product, returns the associated cloud URI.
If the product is from a mission that does not support cloud access an
exception is raised. If the mission is supported but the product
cannot be found in the cloud, the returned path is None.
Parameters
----------
data_product : `~astropy.table.Row`
Product to be converted into cloud data uri.
include_bucket : bool
Default True. When false returns the path of the file relative to the
top level cloud storage location.
Must be set to False when using the full_url argument.
full_url : bool
Default False. Return an HTTP fetchable url instead of a cloud uri.
Must set include_bucket to False to use this option.
Returns
-------
response : str or None
Cloud URI generated from the data product. If the product cannot be
found in the cloud, None is returned.
"""
s3_client = self.boto3.client('s3', config=self.config)
path = utils.mast_relative_path(data_product["dataURI"])
if path is None:
raise InvalidQueryError("Malformed data uri {}".format(data_product['dataURI']))
if 'galex' in path:
path = path.lstrip("/mast/")
else:
path = path.lstrip("/")
try:
s3_client.head_object(Bucket=self.pubdata_bucket, Key=path)
if include_bucket:
path = "s3://{}/{}".format(self.pubdata_bucket, path)
elif full_url:
path = "http://s3.amazonaws.com/{}/{}".format(self.pubdata_bucket, path)
return path
except self.botocore.exceptions.ClientError as e:
if e.response['Error']['Code'] != "404":
raise
> warnings.warn("Unable to locate file {}.".format(data_product['productFilename']), NoResultsWarning)
E astroquery.exceptions.NoResultsWarning: Unable to locate file jw02736008001_01_msa.fits.
astroquery/mast/cloud.py:135: NoResultsWarning
astroquery/mast/observations.py
Outdated
@@ -584,7 +584,7 @@ def download_file(self, uri, *, local_path=None, base_url=None, cache=True, clou | |||
|
|||
return status, msg, url | |||
|
|||
def _download_files(self, products, base_dir, *, cache=True, cloud_only=False,): | |||
def _download_files(self, products, base_dir, flat=False, *, cache=True, cloud_only=False,): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
just being extra nitpicky/pedantic as this is not public api, please add the kwarg after the *
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah, right. Thanks. It would be nice to eventually expunge that *
and be specific about what args are accepted, but I suspect that's a separate issue.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure I understand what you mean, the * is doing exactly that enforcement of being specific.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I meant that it would be better for documentation (autodoc) to specify all the args that *
allows, but I guess that varies with dataset as the allowable filters vary? No matter, not super relevant this PR. And it's a private method, so not that important for documenting the API.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I may still misunderstand you, there is no args in *
, it's the syntax to enforce keyword-only arguments (not to be the same as when it says *args
or similar which would allow additional positional arguments to be passed.)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
*
limits the number of positional arguments, see https://peps.python.org/pep-3102/
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah, thank you! I didn't know about this fun feature. 😄
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
One of my favourites, and would really love to see that we actually get to #1746 (mast is already done thanks to Jenny)
@bsipocz That test failure is real and peculiar, as I recall it passing when I developed the test. I see that it does fail now, even in the refactored form. And I can't get it to pass. That said, it's not required that it finds the file on S3, only that it removes duplicates, so I've told it to ignore the The AWS S3 stuff is a bit opaque to me I'm afraid. If you have any suggestions on a better way test, I'd be happy to hear. Anyway, test now passes, though it may fail if it suddenly starts finding that file in S3, which means something like |
Btw, this PR does not implement flat downloads when So that is a current limitation. |
There should be a warning if |
Yes, agree. Will add. |
@@ -642,8 +648,8 @@ def _download_curl_script(self, products, out_dir): | |||
""" | |||
|
|||
url_list = [("uri", url) for url in products['dataURI']] | |||
download_file = "mastDownload_" + time.strftime("%Y%m%d%H%M%S") | |||
local_path = os.path.join(out_dir.rstrip('/'), download_file + ".sh") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The .rstrip("/")
here is unnecessary and is actually a bug if the path is a Path
object, something writing the test caught.
Thank you @jdavies-st! |
This adds a
flat
boolean option toObservations.download_products()
that does not make any subdirectories for the downloads. Default is turned off, i.e.False
.If set to
True
, anddownload_dir
is specified, it will put all files into download_dir without subdirectories.If set to
True
anddownload_dir
is not specified, it will put files in the current working directory, again with no subdirectories.Setting
flat=True
has no effect whencurl_flag=True
, as the bash script is server-side generated at MAST.The default of
False
retains the current behavior and puts files into the standard directory structure ofmastDownload/ <obs_collection> / <obs_id> /
.Resolves #2500.