Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Runtime network loading of FIELDS_JSON_URL and mixed type enum #400

Closed
schwehr opened this issue Jun 2, 2021 · 9 comments · Fixed by #1045
Closed

Runtime network loading of FIELDS_JSON_URL and mixed type enum #400

schwehr opened this issue Jun 2, 2021 · 9 comments · Fixed by #1045
Milestone

Comments

@schwehr
Copy link
Collaborator

schwehr commented Jun 2, 2021

@volaya @duckontheweb @lossyrob and #264

Introduced here pystac/summaries.py#L40, FIELDS_JSON_URL causes the default path to go load the field definitions from a remote file. I am not a fan of having an additional code path (there was already the validation schema fetches) outside of testing:

  • That file is not versioned with pystac
  • It incurs a runtime cost of fetching 24089 bytes across a network. For some folks this might be an issue.
  • It's another code path that I have to patch out for hermetic testing
  • It blocks using using this code when offline or sandboxed
  • If something is wrong with that file, it's not obvious how to approach fixing it
  • What if the file moves?

I would much rather have a script to rebuild a python file that contains just the resulting table that is only the parts used here.

Currently it's this:

FIELDS_JSON_URL = ( 
   "https://cdn.jsdelivr.net/npm/@radiantearth/stac-fields/fields-normalized.json"
)

@lru_cache(maxsize=None)
def _get_fields_json(url: str) -> Dict[str, Any]:
    return pystac.StacIO.default().read_json(url)

# ...

class Summarizer:
    # ...
    def __init__(self, fields: Optional[str] = None):
        fieldspath = fields or FIELDS_JSON_URL
        try:
            jsonfields = _get_fields_json(fieldspath)
        except:
            if fields is None:
                raise Exception(
                    "Could not read fields definition file at "
                    f"{fields} or it is invalid.\n"
                    "Try using a local fields definition file."
                )
            else:
                raise
        self._set_field_definitions(jsonfields)

    def _set_field_definitions(self, fields: Dict[str, Any]) -> None:
        self.summaryfields: Dict[str, SummaryStrategy] = {}
        for name, desc in fields["metadata"].items():
            if isinstance(desc, dict):
                strategy_value = desc.get("summary", True)
                try:
                    strategy: SummaryStrategy = SummaryStrategy(strategy_value)
                except ValueError:
                    strategy = SummaryStrategy.DEFAULT
                if strategy != SummaryStrategy.DONT_SUMMARIZE:
                    self.summaryfields[name] = strategy
            else:
                self.summaryfields[name] = SummaryStrategy.DEFAULT

Additionally, having an enum of mixed types is likely to bite us down the road. An importer should translate all of those strategies into single character strings. Especially with True vrs 3 other strings that eval to True. It's not obvious here what the default even is.

class SummaryStrategy(Enum):
    ARRAY = "v"
    RANGE = "r"
    SCHEMA = "s"
    DONT_SUMMARIZE = False
    DEFAULT = True

Also, the above uses DONT for don't. That's hard on some users. Please use DO_NOT_SUMMARIZE or something else more readable. I generally suggest avoiding contractions in code even when it is just in comments.

@duckontheweb
Copy link
Contributor

@volaya @lossyrob Apologies for not thinking through this more during the PR review, but I think I'm in agreement that we should use a fields.json file that is included with this library by default. Using the remote file means that there's a chance that it could be updated in a way that either breaks or changes functionality in this library (e.g. if it is updated for a new extension version that we have not yet added support for).

I think we are also starting to get more users for whom internet connectivity can be an issue. It would be good to allow them to do as much work as possible offline and to make it very transparent where we are making network requests in our code to make debugging easier.

@lossyrob
Copy link
Member

Using the remote file means that there's a chance that it could be updated in a way that either breaks or changes functionality in this library

I thought we were pointing to an immutable release of the file? If not, I think we should ask the project to publish releases and point to a specific released version, as we are with schemas.

We're already pulling in json schemas for the validation functionality, so I don't think we'd be buying too much in the case of connectivity by pulling in this file and maintaining it separately - which would increase maintenance burden.

I do think being clear about when network access occurs and keeping it out of core functionality is a good practice. Also this is a single file as opposed to the linked set of files for json schemas, so it wouldn't be too bad to bake the released json file into the library. But I'm not sure the duplication and distribution of external files is a good pattern to start using generally.

@duckontheweb
Copy link
Contributor

I thought we were pointing to an immutable release of the file? If not, I think we should ask the project to publish releases and point to a specific released version, as we are with schemas.

The fields schema URL is https://cdn.jsdelivr.net/npm/@radiantearth/stac-fields/fields-normalized.json, which doesn't appear to be versioned. I can open an issue in that repo to request versioned releases. I can also ask for a minified version to cut down on the file size.

We're already pulling in json schemas for the validation functionality, so I don't think we'd be buying too much in the case of connectivity by pulling in this file and maintaining it separately - which would increase maintenance burden.

We are, and in that case it would be totally unreasonable to try to maintain those files within PySTAC. As you mentioned, it wouldn't be as much of a burden here since it's just one file, but I agree it's not ideal to replicate an external file. I'm not sure there's a clear best path here, both options seem less than ideal for different reasons. I think versioning and minifying the schema would make me +0 on either approach.

One other possibility would be to create a local file cache for these requests. This would only work if the fields schema was versioned, but the code could check that the cache version exists and matches the version in the URL and skip the network request if it does. We might also consider using this option for validation to help folks with poor connectivity. Downsides of this are the added complexity and increased local storage, but maybe we could make this optional?

@m-mohr
Copy link
Contributor

m-mohr commented Jun 14, 2021

Maybe I did not understand the PR correctly, but wasn't it implemented in a way that it's simple to replace the online version with a local one?

fields.json will be versioned soon, see stac-utils/stac-fields#12, but I need to reach 1.0.0 first.

Are the white spaces in the JSON file really an issue, @schwehr?

@schwehr
Copy link
Collaborator Author

schwehr commented Jun 14, 2021

I don't care about empty spaces. I do care that the version for this file doesn't move with pystac and the pystac likes to do surprise network access. I was only thinking that keeping the essential parts makes it simpler for folks to follow, but having a full copy in git seems reasonable.

I currently patch out the code that does network access in my local copy of pystac. The idea of doing a fetch by hand of the file and patching that into my repo isn't appealing.

@duckontheweb
Copy link
Contributor

  • It incurs a runtime cost of fetching 24089 bytes across a network. For some folks this might be an issue.

@m-mohr @schwehr My suggestion of minifying the JSON file was based on this point in the original issue description and the fact that removing whitespace reduces the file size from 24089 bytes to 12319 bytes. If that is not a concern, then I do not see a need for removing whitespace and I can strike that from stac-utils/stac-fields#12.

@m-mohr
Copy link
Contributor

m-mohr commented Jun 15, 2021

Back to the loading: Didn't the PR had some code that allowed to override the fields.json with a custom one? In that case @schwehr could just set that and no request would be made and keeping the file up to date would need to be achieved on the user side, which should then avoid surprises. That's how I understood it...

@schwehr
Copy link
Collaborator Author

schwehr commented Jun 15, 2021

Back to the loading: Didn't the PR had some code that allowed to override the fields.json with a custom one? In that case @schwehr could just set that and no request would be made and keeping the file up to date would need to be achieved on the user side, which should then avoid surprises. That's how I understood it...

There are a number of issues not having the file in the pystac gut repo. To pick just one:

This doesn't help someone who is trying bisect a bug. If they use even a local cache of the file, they will then have pystac behavior that doesn't change with where they are in the git history. If the bug or testing behavior ends up being dependent on that file, bisect tools may be unable to find when the bug started.

@gadomski gadomski added this to the 1.8 milestone Jan 31, 2023
@jsignell
Copy link
Member

It would make sense to me to pull this json blob in on release and then package it up as part of pystac. If that sounds like a reasonable approach then I can take a crack at implementation.

@gadomski gadomski linked a pull request Mar 15, 2023 that will close this issue
5 tasks
@gadomski gadomski modified the milestones: 1.8, 1.7.1 Mar 17, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

6 participants