diff --git a/CHANGES.rst b/CHANGES.rst index ced6e1a6c..eac10c639 100644 --- a/CHANGES.rst +++ b/CHANGES.rst @@ -1,5 +1,8 @@ 2.6 (unreleased) ------------------- +---------------- + +- Add a developer overview document to help understand how ASDF works + internally. Still a work in progress. [#730] - Add ``package`` property to extension metadata, and deprecate use of ``software``. [#728] diff --git a/docs/asdf/developer_overview.rst b/docs/asdf/developer_overview.rst new file mode 100644 index 000000000..4072c0d60 --- /dev/null +++ b/docs/asdf/developer_overview.rst @@ -0,0 +1,352 @@ +High level overview of the basic ASDF library +============================================= + +This document is an attempt to make it easier to understand the design and +workings of the python asdf library for those unfamiliar with it. This is +expected to grow organically so at the moment it should not be considered +complete or comprehensive. + +Understanding the design is complicated by the fact that the library +effectively inserts custom methods or classes into the objects that +the pyyaml and jsonschema libraries use. Understanding what is going on +thus means having some understanding of the relevant parts of the +internals of both of those libraries. This overview will try to provide +a small amount of context for these packages to illuminate how the code +in asdf interacts with them. + +There are at least two ways of outlining the design. One is to give high level +overviews of the various modules and how they interact with other modules. The +other is to illustrate how code is actually invoked in common operations, this +often being much more informative on a practical level (at least some find that to +be the case). This document will attempt to do both. + +We will start with a high-level review of concepts and terms and point to where +these are handled in the asdf modules. 
+ +Because of the complexity, this initial design overview will focus on issues of +validation and tree construction when reading. + +Some terminology and definitions +-------------------------------- + +**URI vs URL (Uniform Resource Identifier)**. A URI is distinguished from a URL +(Uniform Resource Locator) primarily in that a URI is a mechanism for a unique +name that follows a particular syntax, but itself may not indicate where the +resource is. Generally URLs are expected to be used on the web for the HTTP +protocol, though for asdf, this isn't necessarily the case as mentioned next. + +**Resolver:** Tools to map URIs and tags into actual locations of schema files, +which may be local directories (the usual approach) or an actual URL for +retrieval over the network. This is more complicated than it may seem for +reasons explained later. + +**Validator:** Tool to confirm that the YAML conforms to the schemas that +apply. A lot goes on in this area and it is pretty complex in the +implementation. + +**Tree building:** The YAML content is built into a tree in two stages. The YAML +parser converts the raw YAML into a custom Python structure. It is that +structure that is validated. Then, if no errors are found, the tree is +converted into a tree where tagged nodes get converted into corresponding Python +objects (usually; an option exists to prevent this from happening, which is +useful for some applications), e.g., WCS objects or numpy arrays (well, not +quite that simply for numpy arrays). + +The above is a simplified view of what happens when an ASDF file is read. + +Most of the resolver tools and code are in ``resolver.py`` (but not all). + +Most of the validation code is in ``schema.py``. + +The code that builds the trees is spread in many places: ``tagged.py``, +``treeutil.py``, ``types.py`` as well as all the extension code that supplies +code to handle the tags within (and often the associated schemas). 
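The two-stage tree building described above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions, not the asdf implementation: the ``TaggedDict`` class here is a stand-in written for this example, and the ``CONVERTERS`` table, the ``to_custom_tree`` function and the ``tag:example.org/fraction-1.0.0`` tag are all hypothetical names. In asdf itself the lookup goes through the extension and type index machinery.

```python
from fractions import Fraction


class TaggedDict(dict):
    """Stand-in for a tagged node: a dict that remembers its YAML tag."""

    def __init__(self, data=None, tag=None):
        super().__init__(data or {})
        self._tag = tag


# Hypothetical tag -> converter table (an assumption for this sketch).
CONVERTERS = {
    "tag:example.org/fraction-1.0.0": lambda node: Fraction(
        node["numerator"], node["denominator"]
    ),
}


def to_custom_tree(node):
    """Second stage: convert children first, then the node itself."""
    if isinstance(node, dict):
        converted = {key: to_custom_tree(value) for key, value in node.items()}
        tag = getattr(node, "_tag", None)
        if tag in CONVERTERS:
            return CONVERTERS[tag](converted)
        return converted
    if isinstance(node, list):
        return [to_custom_tree(item) for item in node]
    return node


# First stage result: only dicts, lists and scalars, some carrying a tag.
# This is the structure that would be validated.
tagged_tree = TaggedDict(
    {
        "ratio": TaggedDict(
            {"numerator": 3, "denominator": 4},
            tag="tag:example.org/fraction-1.0.0",
        ),
        "samples": [1, 2, 3],
    }
)

# Second stage: tagged nodes are replaced by Python objects.
custom_tree = to_custom_tree(tagged_tree)
```

The traversal starts at the top of the tree, but because the children are converted before their parent, the objects are created from the bottom up.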
+ +A note on the location of schemas and tag code: there is an inherent tension +here, since schemas should be language agnostic and, in that view, not +bundled with specific language library code. But currently nearly all of the +implementation is in Python, so while the long-term goal is to keep them +separate, it is more convenient to keep them together for now. You will see +cases where they are separate and some where they are bundled. + +Actions that happen when ASDF is imported +----------------------------------------- + +The entry points for all asdf extensions are obtained in ``extension.py`` (by +the class ``_DefaultExtensions``) which is instantiated at the end of the module +as ``default_extensions``, but the entry points are only found when +``default_extensions.extensions`` is accessed (it's a property). + +The effect of this is to load all the specified entry point classes for all the +extensions that have registered through the entry point mechanism. (See +https://packaging.python.org/specifications/entry-points/.) The list of classes +so loaded is what ``default_extensions.extensions`` returns along with all the +built-in extensions that are part of ASDF. + +When an ``AsdfFile`` class is instantiated, one thing that happens in +``__init__`` is that ``self._process_extensions()`` is called with an empty +list. That results in ``default_extensions.extension_list`` being accessed, +which then results in ``extension.AsdfExtensionList`` being instantiated with +the created extensions property. + +This class populates the ``tag_mapping`` and ``url_mapping`` lists and the +validators dictionary, as well as populating the ``_type_index`` attribute with +the ``AsdfType`` subclasses defined in the extensions. + +As a last step, the ``tag_mapping`` and ``url_mapping`` methods are generated +from ``resolver.Resolver`` with the initial lists. These lists consist of +2-tuples. 
In the first case, each tuple is a mechanism to map the tag string to a url +string, typically with an expected prefix or suffix to the tag (suffix is +typical) so that, given a full tag, it generates a url that includes the suffix. +This permits one mapping to cover many tag variants. (The details of the mapping +machinery with examples are given in a later section since understanding this is +essential to defining new tags and corresponding schemas.) + +The URL mapping works in a similar way, except that it consists of 2-tuples +where the first element is the common elements of the url, and the second part +maps it to an actual location (url or file path). Again the second part may +include a placeholder for the suffix or prefix, and code to generate the path +to the schema file. + +The use of the resolver object turns these lists into functions so that, +when supplied with input that matches something in the list, they give the +corresponding output. + +Outline of how an ASDF file is opened and read into the corresponding Python +object. +------------------------------------------------------------------------------------ + +The starting point can be found in ``asdf.py`` essentially through the following +chain (many calls and steps left out to keep it simpler to follow). + +When ``asdf.open("myasdffile.asdf")`` is called, it is aliased to +``asdf.open_asdf`` which first creates an instance of ``asdf.AsdfFile`` (let's +call the instance ``af``), then calls ``af._open_impl()`` and then +``af._open_asdf``. That invokes a call to ``generic_io.get_file()``. + +``generic_io.py`` basically contains code to handle all the variants of I/O +possible (files, streaming, http access, etc.). In this case it returns a +``RealFile`` instance that wraps a local file system file. + +Next the file is examined to see if it is an ASDF file (first by examining the +first few lines in the header). 
If it passes those checks, the header (yaml) +section of the file is extracted through a proxy mechanism that signals an end +of file when the end of the yaml is reached, but otherwise looks like a file +object. + +The yaml parsing phase described below normally returns a "tagged_tree". That is +(somewhat simplified), it returns the data structure that yaml would normally +return without any object conversion (i.e., all nodes are either dicts, lists, +or scalar values), except that they are objects that now support a tag attribute +that indicates if a tag was associated with that node and what the tag was. + +This reader object is passed to the yaml parser by calling +``yamlutil.load_tree``. A simple explanation for what goes on here is necessary +to understand how this all works. Yaml supports various kinds of loaders. For +security reasons, the "safe" loader is used (note that both C and python +versions are supported through an indirection of the ``_yaml_base_loader`` +defined at the beginning of that module that determines whether the C version is +available). The loaders are recursive mechanisms that build the tree structure. +Note that ``yamlutil.load_tree`` creates a temporary subclass of ``AsdfLoader`` +and attaches a reference to the AsdfFile instance as the ``.ctx`` attribute of +that temporary subclass. + +One of the hooks that pyyaml supplies is the ability to override the method +``construct_object``. That's what the class ``yamlutil.AsdfLoader`` does. pyyaml +calls this method at each node in the tree to see if anything special should be +done. One could perform conversion to predefined objects here, but instead it +does the following: if the ``node.tag`` attribute is handled by yaml itself +(examples?), it calls that constructor, which returns the type yaml converts it +to. Otherwise: + + - It converts the node to the type indicated (dict, list, or scalar type) by +   yaml for that node. 
+ - It obtains the appropriate tag class (an AsdfType +   subclass) from the AsdfFile instance (using ``ctx.type_index.fix_yaml_tag`` +   to deal with version issues to match the most appropriate tag class). + - It wraps all the node alternatives in a special asdf ``Tagged`` class instance +   variant where that object contains a ``._tag`` attribute that is a reference to +   the corresponding Tag class. + +The loading process returns a tree of these Tagged object instances. This +tagged_tree is then returned to the ``af`` instance (still running the +``_open_asdf()`` method), and the tree is passed to the ``_validate()`` method. +(This is the major reason that the tree isn't directly converted to an object +tree, since jsonschema would not be able to use the final object tree for +validation, besides issues related to the fact that things that don't validate +may not be convertible to the designated object.) + +The validate machinery is a bit confusing since there are essentially two basic +approaches to how validation is done. One type of validation is for validation +of schema files themselves, and the other for schemas for tags. + +The schema.py file is fairly involved and the details are covered elsewhere. +When the validator machinery is constructed, it uses the fundamental validation +files (schemas). But this doesn't handle the fact that the file being validated +is yaml, not json, and that there are items in yaml not part of json, so special +handling is needed. The way it is handled is through an internal mechanism of +the jsonschema library. There is a method that jsonschema calls recursively for +a validator and it is called iter_errors. The subclass of the jsonschema +validator class is defined as schema.ASDFValidator and this method is overridden +in this class. Despite its name, its primary purpose is to validate the special +features that yaml has, namely applying schemas associated with tags (this is +not part of the normal jsonschema scheme [ahem]). 
It is in this method that it +looks for a tag for a node and, if it exists and is in the tag_index, loads the +appropriate schema and applies it to the node. (jsonschemas are normally only +associated with a whole json entity rather than specific nodes.) While the +purpose of this method is to iteratively handle errors that jsonschema detects, +it has essentially been repurposed as the means of injecting the handling of tag +schemas. + +In order to prevent repeated loading of the same schema, the lru caching scheme +is used (from functools in the standard library) where the last n cached schemas +are saved (details of how this works were recently changed to prevent a serious +memory leak). + +In any event, a lot is going on behind the scenes in validation and it deserves +its own description elsewhere. + +After validation, the tagged tree is then passed to +yamlutil.tagged_tree_to_custom_tree() where the nodes in the tree that have +special tag code are converted into the appropriate Python objects that the +base asdf and extensions are aware of. This is accomplished by that function +defining a walker "callback" function (defined within that function so as to pick +up the af object implicitly). The function then passes the callback walker to +treeutil.walk_and_modify() where the tree will be traversed recursively, applying +the tag code associated with the tag to the more primitive tree representation and +replacing such nodes with Python objects. The tree traversal starts from the +top, but the objects are created from the bottom up due to recursion. + +The result is what af.tree is set to, after doing another tree traversal looking +for special type hooks for each node. It isn't clear if there is yet any use of that +feature. + +Not quite that simple +--------------------- + +Outline of schema.py +-------------------- + +This module is somewhat confusing due to the many functions and methods with +some variant of validate in their name. 
This will try to make clear what they do +(a renaming of these may be in order). + +Here is a list of the functions/classes in ``schema.py`` and their purpose and +where they sit in the order of things. + +default_ext_resolver + +**_type_to_tag:** Handles mapping python types to yaml_tags, with the addition +of support for OrderedDicts. + +The next 5 functions are put in the ``YAML_VALIDATORS`` dictionary to ultimately +be used by ``_create_validator`` to create the json validator object. + +------ + +**validate_tag:** Obtain the relevant tag for the supplied instance (either +built-ins or custom objects) and check that it matches the tag supplied to the +function. + +**validate_propertyOrder:** Not really a validator but rather a trick to +indicate that properties should retain their order. + +**validate_flowStyle:** Not really a validator but rather a trick to store +what style to use to write the elements (for yaml objects and arrays). + +**validate_style:** Not really a validator but rather a trick to store info +on what style to use to write the string. + +**validate_type:** Used to deal with date strings. + +(It may make sense to rename the above to be more descriptive of the action than where +they are stuck in the validation machinery; e.g., ``set_propertyOrder``.) + +**validate_fill_default:** Set the default values for all properties that have a +subschema that defines a default. Called indirectly in ``fill_defaults``. + +**validate_remove_default:** Does the opposite; removes all properties where the +value equals the subschema default. Called indirectly in ``remove_defaults``. (For +this and the above, validate in the name mostly confuses although it is used by +the json validator.) + +[these could be renamed as well since they do more than validate] + + +**_create_validator:** Creates an ``ASDFValidator`` class on the fly that uses +the ``jsonschema.validators`` class created. 
This ``ASDFValidator`` class +overrides the ``iter_errors`` method that is used to handle yaml tag cases +(using the ``._tag`` attribute of the node to obtain the corresponding schema +for that tag; e.g., it calls ``load_schema`` to obtain the right schema when +called for each node in the jsonschema machinery). What isn't clear to me is why +this is done on the fly and at least cached since it really only handles two +variants of calls (basically which JSONSCHEMA version is to be used). Otherwise +it doesn't appear to vary except for that. Admittedly, this is only created at +the top level. This is called by ``get_validator``. + +**class OrderedLoader:** Inherits from the ``_yaml_base_loader``, but otherwise +does nothing new in the definition. But the following code defines +``construct_mapping``, and then adds it as a method. + +**construct_mapping:** Defined outside the ``OrderedLoader`` class but to be +added to the ``OrderedLoader`` class by use of the base class add_constructor +method. This function flattens the mapping and returns an ``OrderedDict`` of the +property attributes (This needs some deep understanding of how the yaml parser +actually works, which is not covered here. Apparently mappings can be +represented as nested trees as the yaml is originally parsed. Or something like +that.) + +**_load_schema:** Loads json or yaml schemas (using the ``OrderedLoader``). + +**_make_schema_loader:** Defines the function load_schema using the provided +resolver and _load_schema. + +**_make_resolver:** Sets the schema loader for http, https, file, tag using a +dictionary where these access methods are the keys and the schema loader +returning only the schema (and not the uri). These all appear to use the same +schema loader. + +**_load_draft4_metaschema:** + +**load_custom_schema:** Deals with custom schemas. + +**load_schema:** Loads a schema from the specified location (this is cached). +Called for every tag encountered (uses resolver machinery). 
Most of the +complexity is in resolving json references. Calls ``_make_schema_loader``, +``resolver``, ``reference.resolve_fragment``, and ``load_schema``. + +**get_validator:** Calls ``_create_validator``. Is called by validate to return +the created validator. + +**validate_large_literals:** Ensures tree has no large literals (raises error if +it does). + +**validate:** Uses ``get_validator`` to get a validator object and then calls +its validate method, and validates any large literals using +``validate_large_literals``. + +**fill_defaults:** Inserts missing attributes with their default values. + +**remove_defaults:** Where the tree has attributes with value equal to the +default, strip the attribute. + +**check_schema:** Checks schema against the metaschema. + +--------------- + +**Illustration of where these are called:** + +``af._open_asdf`` calls ``af.validate`` which calls ``af._validate`` which then +calls ``schema.validate`` with the tagged tree as the first argument (it can be +called again if there is a custom schema). + +**in schema.py** + +``validate -> get_validator -> _create_validator`` (returns ``ASDFValidator``). +There are two levels of validation: those passed to the json_validation +machinery for the schemas themselves, and those that the tag machinery triggers +when the jsonschema validator calls through ``iter_errors``. The first level +handles all the tricks at the top. The ``ASDFValidator`` uses ``load_schema`` +which in turn calls ``_make_schema_loader``, then ``_load_schema``. +``_load_schema`` uses the ``OrderedLoader`` to load the schemas. + +Got that? diff --git a/docs/index.rst b/docs/index.rst index 8417295de..c15b52972 100644 --- a/docs/index.rst +++ b/docs/index.rst @@ -45,6 +45,17 @@ API Documentation asdf/user_api asdf/developer_api +Developer Overview +================== + +Currently a work in progress. Intended to give an overview of how the various +parts of ASDF interact and which modules do what and how. + +.. 
toctree:: + :maxdepth: 1 + + asdf/developer_overview + Contributing and reporting issues =================================