
streaming from OPeNDAP? #24

Closed
HeatherSavoy-USDA opened this issue Jan 7, 2022 · 13 comments
Labels
enhancement New feature or request

Comments

@HeatherSavoy-USDA
Collaborator

I've been looking into the options for streaming MODIS, particularly focusing on methods that could also be useful for other products. OPeNDAP seems like a potential option for NASA Earthdata. I've been looking at the LP DAAC Hyrax server, but there seem to be other products/servers of interest for our group. There is a Python package, pydap, for accessing these kinds of servers. One potential issue is the requirement to authenticate with an Earthdata login (we would need to review the user agreement and decide which account would make the requests), but pydap supports the authentication process if that is OK.

Spatial and temporal subsets can be made, and metadata queried, though the formats may vary a bit depending on which server you are talking to.

A warning, though: I've only dabbled in reading about this option and have not yet successfully run my test of extracting a few timesteps of NDVI covering the Jornada. I can use basic curl commands to handle authentication and request the intended spatial extent from the intended NDVI product, but I'm not sure about the temporal part (I am pulling a few timesteps, but I just picked the first 4), and I haven't successfully read in the result. I was trying in R since that's what I know, but I can't tell whether it's an R ncdf4 issue or not that I can view the layers I downloaded but the geographic info is missing. I tried the pydap package, and I can make some example code download data from another server, but not for my MODIS/Hyrax test.
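For reference, the "temporal part" is just the first hyperslab in the OPeNDAP constraint expression appended to the dataset URL. A minimal sketch of how such a subset URL can be assembled in Python (the helper function here is hypothetical; the base URL, variable name, and indices are the ones from my test):

```python
# Sketch: build an OPeNDAP subset URL of the form
# <dataset>?<variable>[t0:tstep:t1][y0:ystep:y1][x0:xstep:x1]
# where each bracketed triple is a start:stride:stop index range.

def opendap_subset_url(base_url, variable, slices):
    """slices: list of (start, stride, stop) index triples, one per dimension."""
    constraint = ''.join(f'[{a}:{b}:{c}]' for a, b, c in slices)
    return f'{base_url}?{variable}{constraint}'

url = opendap_subset_url(
    'http://opendap.cr.usgs.gov/opendap/hyrax/MOD13Q1.061/h09v05.ncml.nc4',
    '_250m_16_days_NDVI',
    [(1, 1, 3),        # time: first few timesteps
     (4000, 1, 4799),  # y index range
     (4000, 1, 4799)]  # x index range
)
```

The first triple is the temporal subset; swapping in different start/stop indices selects different timesteps.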

At this point, I'd need other people's input on whether this is worth continuing to explore as we try to include streaming of popular datasets that are too big to store on SCINet.

I found this LP DAAC webinar as a helpful starting point.

@HeatherSavoy-USDA added the enhancement (New feature or request) label Jan 7, 2022
@HeatherSavoy-USDA
Collaborator Author

HeatherSavoy-USDA commented Jan 10, 2022

I've been working a bit more on an example of using this API.

Since our code is already in Python and pydap is supposed to play nicely with xarray, I focused on figuring out the example using these tools. I spent more time than I'd like to admit trying to get a Python version with all of the dependencies working on my computer, but ultimately gave up and started working on CERES. The Python 3.6 version available on CERES seems to work so far.

What is currently working on CERES:

  • establishing a session using my Earthdata login with the setup_session function from pydap.cas.urs (this is pydap's NASA Earthdata-specific support) and xarray.backends.PydapDataStore.open
  • pointing to a specific dataset on a NASA server that I got from reading a github issue (see below)
  • reading a subset of the data from that dataset with xarray

What is NOT currently working on CERES:

  • Although I got the dataset above to read in, I get an error back from the server (302 Found, The document has moved) if I try to read in the dataset I want (see below). I'm not sure why it works for one and not the other, but I can download the dataset I want just using curl (on my machine and on CERES) as described here, so I don't think it's an authorized-apps issue in the Earthdata login.

What still needs to be figured out (besides the things that don't currently work...):

Code snippet of what's working on CERES at the moment:

import numpy as np
import pandas as pd
import xarray as xr
import rioxarray
from pydap.cas.urs import setup_session

# What I want but haven't gotten to work (with or without the subsetting) - but it works with curl
#ds_url = 'http://opendap.cr.usgs.gov/opendap/hyrax/MOD13Q1.061/h09v05.ncml.nc4?_250m_16_days_NDVI[1:1:3][4000:1:4799][4000:1:4799]'

# This one works!
ds_url = 'https://goldsmr5.gesdisc.eosdis.nasa.gov:443/opendap/MERRA2_MONTHLY/M2IMNPANA.5.12.4/2019/MERRA2_400.instM_3d_ana_Np.201901.nc4'

session = setup_session(<my_username>, <my_password>, check_url=ds_url)
store = xr.backends.PydapDataStore.open(ds_url, session=session)

ds = xr.open_dataset(store)

surface_pressure = ds["PS"]
surface_pressure_box = surface_pressure.sel(lat=slice(32,33),lon=slice(-106,-105))

@haitao-git
Collaborator

I had to install and import lxml to get the line
session = setup_session(<my_username>, <my_password>, check_url=ds_url)
to work,

but I still got an error on the next line:
store = xr.backends.PydapDataStore.open(ds_url, session=session)

Exception has occurred: SSLCertVerificationError
[SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: certificate has expired (_ssl.c:1129)
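As an aside, a "certificate has expired" failure can come from an outdated CA bundle on the client as well as from the server itself. A minimal stdlib sketch of the two verification modes, purely for diagnosing which side is at fault (disabling verification is a debugging probe, not a fix):

```python
import ssl

# The default context verifies certificates; this is the mode in which
# an expired certificate (server-side or in the local CA bundle) raises
# SSLCertVerificationError.
strict = ssl.create_default_context()
assert strict.verify_mode == ssl.CERT_REQUIRED

# An unverified context skips certificate checks entirely. If a request
# succeeds with this but fails with the default context, the problem is
# verification-related (e.g. an expired or missing CA certificate).
unverified = ssl._create_unverified_context()
assert unverified.verify_mode == ssl.CERT_NONE
```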

@haitao-git
Collaborator

I tried the sample code on the Ceres HPC:

no "SSL: certificate" problem when getting data from
https://goldsmr5.gesdisc.eosdis.nasa.gov

but still a "302 Found" error when getting data from
http://opendap.cr.usgs.gov/opendap/hyrax

@haitao-git
Collaborator

haitao-git commented Jan 28, 2022

I tried the cookie/.netrc method from the link
https://wiki.earthdata.nasa.gov/display/EL/How+To+Access+Data+With+PyDAP

I created a .netrc file in my home directory, changed the code from Python 2 to Python 3, and made a small modification to the http import.

The open_url() call for 'https://goldsmr4.gesdisc.eosdis.nasa.gov' worked,

but there is still a "302 Found" error at open_url() when opening data from 'http://opendap.cr.usgs.gov/opendap/hyrax'

# https://wiki.earthdata.nasa.gov/display/EL/How+To+Access+Data+With+PyDAP

# from pydap.util.urs import install_basic_client
# install_basic_client()
# from pydap.client import open_url
# dataset = open_url('https://goldsmr4.gesdisc.eosdis.nasa.gov/opendap/MERRA2/M2T1NXSLV.5.12.4/2016/06/MERRA2_400.tavg1_2d_slv_Nx.20160601.nc4')
 
# # If the module is not included, you can manually add it to your code as follows:
# # BEGIN BASIC AUTH MODULE CODE (Comments removed)

# create .netrc file in home dir
# machine urs.earthdata.nasa.gov
# 	login xxxxxx
# 	password xxxxxxxxxxxx

# machine uat.urs.earthdata.nasa.gov
# 	login xxxxxx
# 	password xxxxxxxxxxx   

# change permission of the file .netrc
# chmod og-rw .netrc 

import http.cookiejar as cookielib

#import cookielib # only for python 2
import netrc
import urllib.request as urllib2

# import urllib2 # only for python 2
import re
 
import pydap.lib
from pydap.exceptions import ClientError
 
import logging
 
log = logging.getLogger(__name__)
log.setLevel(logging.DEBUG)
 
# Set the debug level for urllib2.
debuglevel=1
 
def install_basic_client(uri='', user='', passwd='', use_netrc=True):
    # Create special opener with support for Cookies
    cj = cookielib.CookieJar()
    
    # Create the password manager and load with the credentials using 
    pwMgr = urllib2.HTTPPasswordMgrWithDefaultRealm()
 
    # Get passwords from the .netrc file unless use_netrc is False
    if use_netrc:
        logins = netrc.netrc()
        accounts = logins.hosts # a dict of hosts and tuples
        # for host, info in accounts.iteritems():
        for host, info in accounts.items():
            login, account, password = info
            log.debug('Host: %s; login: %s; account: %s; password: %s' % (host, login, account, password))
            pwMgr.add_password(None, host, login, password)
        
    if uri and user and passwd:
        pwMgr.add_password(None, uri, user, passwd)
    
    opener = urllib2.build_opener(urllib2.HTTPBasicAuthHandler(pwMgr),
                                  urllib2.HTTPCookieProcessor(cj))
    
    opener.addheaders = [('User-agent', 'pydap/EL')]
 
    urllib2.install_opener(opener)
 
    def new_request(url):
        if url[-1] == '&': url = url[:-1]  # identity check ('is') replaced with equality
        log.debug('Opening %s (install_basic_client)' % url)
        r = urllib2.urlopen(url)

        # r.headers.dict is Python 2 only; build a lowercase-keyed dict
        # so the "content-description" lookup below still works.
        resp = {k.lower(): v for k, v in r.headers.items()}
        resp['status'] = str(r.code)
        data = r.read()

        # When an error is returned, we parse the error message from the
        # server and return it in a ``ClientError`` exception.
        if resp.get("content-description") == "dods_error":
            # data is bytes in Python 3, so decode before the regex search
            m = re.search(r'code = (?P<code>\d+);\s*message = "(?P<msg>.*)"',
                    data.decode('utf-8', errors='replace'),
                    re.DOTALL | re.MULTILINE)
            msg = 'Server error %(code)s: "%(msg)s"' % m.groupdict()
            raise ClientError(msg)

        return resp, data
 
    #from pydap.util import http
    import http
    http.request = new_request
    
# END BASIC AUTH MODULE CODE
 
install_basic_client()
from pydap.client import open_url

# working line
dataset = open_url('https://goldsmr4.gesdisc.eosdis.nasa.gov/opendap/MERRA2/M2T1NXSLV.5.12.4/2016/06/MERRA2_400.tavg1_2d_slv_Nx.20160601.nc4')

# 302 Found error, 
# dataset = open_url('https://opendap.cr.usgs.gov/opendap/hyrax/MOD13Q1.061/h09v05.ncml')

# 302 Found error, 
# dataset = open_url('https://opendap.cr.usgs.gov/opendap/hyrax/MOD13Q1.061/h09v05.ncml?_250m_16_days_NDVI[1:1:3][4000:1:4799][4000:1:4799]')

# 302 Found error, 
#dataset = open_url('http://opendap.cr.usgs.gov/opendap/hyrax/MOD13Q1.061/h09v05.ncml?_250m_16_days_NDVI[1:1:3][4000:1:4799][4000:1:4799]')

@haitao-git
Collaborator

haitao-git commented Jan 28, 2022

I thought I found a good sample/tutorial that we can borrow some code from:

https://github.com/ornldaac/daymet-python-opendap-xarray

https://www.youtube.com/watch?v=0jCKiqrcYaU

but I got stuck at the line
# Using pydap's open_url
thredds_ds = open_url(granule_dap)

with the error
UnicodeDecodeError: 'ascii' codec can't decode byte 0x8b in position 1: ordinal not in range(128)

related pydap github issue: pydap/pydap#196
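For what it's worth, byte 0x8b at position 1 is the second byte of the gzip magic number (0x1f 0x8b), which suggests the server returned a gzip-compressed response that was then decoded as ASCII text instead of being decompressed first. A quick stdlib check of that interpretation:

```python
import gzip

# Compress a sample payload the way an HTTP server with
# Content-Encoding: gzip would.
payload = gzip.compress(b'example OPeNDAP response')

# gzip streams start with the magic bytes 0x1f 0x8b; byte 0x8b at
# position 1 is exactly what an ASCII decoder chokes on.
assert payload[0] == 0x1f and payload[1] == 0x8b

# Decompressing first recovers the original text.
assert gzip.decompress(payload) == b'example OPeNDAP response'
```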

@haitao-git
Collaborator

It looks like the error "UnicodeDecodeError: 'ascii' codec can't decode byte 0x8b in position 1: ordinal not in range(128)" can be solved by pydap/pydap#126:

change the file net.py in the pydap package according to
pydap/pydap@5a14671

then restart the Python kernel.

I'm stuck again on the line of code:
rioxarray.open_rasterio

ValueError: unable to decode time units 'days since 2010-01-01 12:00:00' with "calendar 'proleptic_gregorian'". Try opening your dataset with decode_times=False or installing cftime if it is not installed.

@haitao-git
Collaborator

I tried creating a get_session function to solve the 302 Found / redirection issue with the Hyrax server login, as described in pydap/pydap#188. No luck.

@haitao-git
Collaborator

I played with the AppEEARS API for a while:
https://git.earthdata.nasa.gov/projects/LPDUR/repos/appeears-api-getting-started/browse/AppEEARS_API_Area.ipynb

MOD13Q1.061 is one of the datasets supported by AppEEARS.

@HeatherSavoy-USDA
Collaborator Author

HeatherSavoy-USDA commented Feb 2, 2022

Good news: pydap 3.3 was released yesterday. Using Python 3.6.6 on CERES, I upgraded pydap in my home directory using python3 -m pip install --user git+https://github.com/pydap/pydap. I could run the Daymet tutorial above without having to modify the pydap source code. When I tried our desired MODIS dataset, I at least now get a more useful error message to accompany the 302 Found message:

This is redirect error. These should not usually raise an error in pydap beacuse redirects are handled implicitly. If it failed it is likely due to a circular redirect.

Which I think confirms the issue is as discussed in pydap/pydap#188.

So I'm currently inclined to consider pydap useful for daily Daymet or for datasets available from NASA GES DISC, but maybe use the AppEEARS API for MODIS initially.

@HeatherSavoy-USDA
Collaborator Author

I tried the solution suggested in pydap/pydap#188 too, with no luck either. I keep getting the redirect error message.

I also ran the AppEEARS tutorial linked above and it's been stuck in pending status for over an hour. But there is a

AρρEEARS is experiencing technical issues that are delaying request processing.

banner on their website, so it may not be a representative processing time. @haitao-nmsu said it only took a couple of minutes to run the tutorial. I'll check again tomorrow.

@HeatherSavoy-USDA
Collaborator Author

Alternatively, we could also start with the ORNL DAAC MODIS product. The Daymet tutorial above can be generalized to at least the other ORNL DAAC products on their THREDDS server. Based on a quick edit to the tutorial code, I'm able to access their MODIS product found here. The tutorial also shows how the NASA CMR (mentioned above) can be queried to search dataset metadata. Next, I'll see if I can figure out a generalized approach leveraging CMR and the ORNL DAAC THREDDS server (other pydap-compatible servers later) to include these datasets in gcdl.

@HeatherSavoy-USDA
Collaborator Author

HeatherSavoy-USDA commented Feb 4, 2022

Things that the daymet opendap tutorial hard-coded, but I think we could generalize:

  1. They had the dataset DOI and used CMR to translate it to a collection/concept id. I think we could let the user browse the datasets available on the supported server(s) via CMR, and obtain the collection id (and other metadata) that way:
# import requests

cmr_url = 'https://cmr.earthdata.nasa.gov/search/'
ornl_search = cmr_url + 'collections.json?pretty=true&provider=ORNL_DAAC&page_size=2000&keyword=thredds'

user_keywords = ['NDVI','MODIS']

response = requests.get(ornl_search + '%20' + '%20'.join(user_keywords))
collections = response.json()['feed']['entry']

# Display dataset titles to user
product_titles = [c['title'] for c in collections] 
# Set user selection
title_id = 0 

my_collection = [c for c in collections if c['title'] == product_titles[title_id]][0]

Or, maybe for the pilot, just have catalog entries for tested datasets that include the collection id. This collection entry also contains the temporal and spatial coverage of the dataset.

  2. They had the URL for the OPeNDAP service for their dataset. This could be determined programmatically, though. You can get the THREDDS catalog entry from CMR:
thredds_catalog = [url['href'] for url in my_collection['links'] if 'thredds' in url['href']][0]

The catalog can then be retrieved as XML. I just haven't figured out how to parse the XML yet. But within the XML, there is a service section that tells you how to point to the OPeNDAP service: <service name="odap" serviceType="OPENDAP" base="/thredds/dodsC/"/>. Using essentially the same granule search as the tutorial, the OPeNDAP URLs for the relevant files can all be built:

# import re

granulesearch = cmr_url + 'granules.json?collection_concept_id=' + my_collection['id'] + \
                '&page_size=1000'  

response = requests.get(granulesearch)
granules = response.json()['feed']['entry'] 
granule_names = []  
for g in granules:
    granule_name = g['title'] 
    granule_names.append(granule_name)


thredds_url = 'https://thredds.daac.ornl.gov'
service_base = '/thredds/dodsC/'  #from catalog xml
dataset_base = re.compile('catalog/(.*)catalog').findall(thredds_catalog)[0]

granule_dap = thredds_url + service_base + dataset_base + re.sub('^[A-Za-z_]+.','',granule_names[0])

Note, the granule search can be constrained further, e.g. with spatial bounds.
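For the XML-parsing step mentioned above, the stdlib ElementTree module may be enough. A sketch against a minimal inline catalog (the sample XML here is made up, but the service element matches the one quoted from the real catalog; THREDDS catalogs use an XML namespace, so the lookup matches on the tag's local name):

```python
import xml.etree.ElementTree as ET

# Minimal stand-in for a THREDDS catalog; real catalogs would be
# fetched over HTTP and carry the same namespace declaration.
catalog_xml = '''<?xml version="1.0"?>
<catalog xmlns="http://www.unidata.ucar.edu/namespaces/thredds/InvCatalog/v1.0">
  <service name="odap" serviceType="OPENDAP" base="/thredds/dodsC/"/>
  <service name="http" serviceType="HTTPServer" base="/thredds/fileServer/"/>
</catalog>'''

root = ET.fromstring(catalog_xml)

# Namespaced tags look like '{uri}service', so match on the local name
# and pick out the OPeNDAP service's base path.
opendap_base = next(
    el.get('base')
    for el in root.iter()
    if el.tag.endswith('service') and el.get('serviceType') == 'OPENDAP'
)
```

The extracted base ('/thredds/dodsC/') is the service_base value hard-coded in the snippet above, so this would let us drop that hard-coding.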

  3. They had the variable names and CRS info. These can be retrieved from the first granule after it is opened with xarray and pydap.
# import xarray as xr
# from pydap.client import open_url

thredds_ds = open_url(granule_dap) 
ds = xr.open_dataset(xr.backends.PydapDataStore(thredds_ds), decode_coords="all")

# variables
ds.data_vars

# crs - get name from grid_mapping
# (assumes `from pyproj import CRS`)
crs = CRS(ds.lambert_azimuthal_equal_area.spatial_ref)

With all of those pieces, you can use xarray to do spatial and temporal subsetting and bring down the data subsets matching our user requests. It should be extendable to any dataset from the ORNL DAAC THREDDS server (e.g. daily Daymet) or other NASA data centers with THREDDS servers offering the OPeNDAP service (assuming no authentication redirect issues).

So the next questions are:

  1. How much of this is for the pilot milestone?
  2. How does this work in gcdl? Basically abstractions and implementation for remote dataset support #25

@HeatherSavoy-USDA
Collaborator Author

HeatherSavoy-USDA commented Mar 7, 2022

I'm going to close this issue since 474d2fb has functional OPeNDAP usage except for the roadblocks contained in #25, which are not OPeNDAP specific.
