
ASDF Design Overview


This page provides a high-level summary of the current design of the asdf package. It is intended to serve as a guide for future maintainers.

Any ASDF implementation is responsible for many aspects of parsing and creating ASDF files. When reading a file, the following high-level actions must occur:

  • Find and verify ASDF-specific header information
  • Parse YAML tree
  • Find and load core schema
  • Validate the YAML content using available schemas
  • Resolve schema references
  • Find and parse the block index if it exists
  • Find block segment if it exists
  • Parse block headers and load blocks on request
  • Properly handle and close all IO resources (including memory maps) when no longer needed
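
In the Python implementation, all of these steps are driven by a single call to asdf.open. As a rough orientation (the file name here is hypothetical), a typical read looks like this:

import asdf

# Minimal sketch: asdf.open verifies the header, parses and validates the
# YAML tree, and registers the binary blocks without reading their data.
with asdf.open("example.asdf") as af:
    print(af.tree)            # the parsed (and validated) tree
    data = af.tree["data"]    # block-backed arrays are loaded lazily on access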

File Parsing

asdf uses the generic_io submodule to provide an abstraction layer over various IO resources (e.g. a file on disk, a network resource, an IO stream). Files are opened using the generic_io.get_file function, which returns a GenericFile instance. This object can then be used to read the contents of the file, which are returned as a bytes object.
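
A minimal sketch of this layer (the file name is hypothetical): get_file wraps a path, URL, or already-open file object in a GenericFile that exposes a uniform file-like interface.

from asdf import generic_io

# Open a local file for reading; get_file also accepts URLs and open file objects.
fd = generic_io.get_file("example.asdf", mode="r")
try:
    magic = fd.read(5)    # read the first few bytes, e.g. to check the header magic
finally:
    fd.close()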

asdf checks for the presence of the header magic value at the beginning of the file. If this can't be found, then parsing is aborted. It also attempts to determine the file format version based on the header comment lines.

        header_line = fd.read_until(b'\r?\n', 2, "newline", include=True)
        self._file_format_version = cls._parse_header_line(header_line)
        self.version = self._file_format_version

(link)
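
For orientation, the first lines of a typical ASDF file look roughly like the following (version numbers vary). The header comments carry the file format version and the ASDF Standard version, and the %YAML directive marks the start of the tree:

#ASDF 1.0.0
#ASDF_STANDARD 1.3.0
%YAML 1.1
%TAG ! tag:stsci.edu:asdf/
--- !core/asdf-1.1.0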

It then looks for the beginning of the YAML content. It parses this content and creates the tree (more details on this in the next section).

        yaml_token = fd.read(4)
        tree = {}
        has_blocks = False
        if yaml_token == b'%YAM':
            reader = fd.reader_until(
                constants.YAML_END_MARKER_REGEX, 7, 'End of YAML marker',
                include=True, initial_content=yaml_token)

            # For testing: just return the raw YAML content
            if _get_yaml_content:
                yaml_content = reader.read()
                fd.close()
                return yaml_content

            # We parse the YAML content into basic data structures
            # now, but we don't do anything special with it until
            # after the blocks have been read
            tree = yamlutil.load_tree(reader, self, self._ignore_version_mismatch)
            has_blocks = fd.seek_until(constants.BLOCK_MAGIC, 4, include=True)
        elif yaml_token == constants.BLOCK_MAGIC:
            has_blocks = True
        elif yaml_token != b'':
            raise IOError("ASDF file appears to contain garbage after header.")

(link)
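
The magic tokens referenced in this excerpt are defined in asdf.constants. Per the ASDF standard they are approximately the following (the regex shown is a paraphrase, not the exact source definition):

# Approximate values from asdf.constants:
ASDF_MAGIC = b'#ASDF'          # every ASDF file must begin with this token
BLOCK_MAGIC = b'\xd3BLK'       # marks the start of each binary block
YAML_END_MARKER_REGEX = br'\r?\n\.\.\.((\r?\n)|$)'  # YAML document end marker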

After parsing the YAML content, asdf looks to see whether any binary data blocks are present, and whether a block index is present. It does not load the data blocks yet, however.

        if has_blocks:
            self._blocks.read_internal_blocks(
                fd, past_magic=True, validate_checksums=validate_checksums)
            self._blocks.read_block_index(fd, self)

(link)
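
If present, the block index sits at the very end of the file. Per the ASDF standard it is a small YAML document listing the byte offset of each block, prefixed by its own magic comment; it looks roughly like this (offsets are illustrative):

#ASDF BLOCK INDEX
%YAML 1.1
---
- 10245
- 268543
...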

YAML Parsing

asdf uses a standard YAML implementation (pyyaml) for reading and writing the metadata tree. However, it implements custom dumper and loader classes in order to enable the tagging of custom types. The dumper and loader are fairly straightforward: they override the represent_data and construct_object methods, respectively.

class AsdfDumper(_yaml_base_dumper):
    """
    A specialized YAML dumper that understands "tagged basic Python
    data types" as implemented in the `tagged` module.
    """
    def __init__(self, *args, **kwargs):
        kwargs['default_flow_style'] = None
        super().__init__(*args, **kwargs)

    def represent_data(self, data):
        node = super(AsdfDumper, self).represent_data(data)

        tag_name = getattr(data, '_tag', None)
        if tag_name is not None:
            node.tag = tag_name

        return node

(link)

class AsdfLoader(_yaml_base_loader):
    """
    A specialized YAML loader that can construct "tagged basic Python
    data types" as implemented in the `tagged` module.
    """
    ignore_version_mismatch = False

    def construct_object(self, node, deep=False):
        tag = node.tag
        if node.tag in self.yaml_constructors:
            return super(AsdfLoader, self).construct_object(node, deep=False)
        data = _yaml_to_base_type(node, self)
        tag = self.ctx.type_index.fix_yaml_tag(
            self.ctx, tag, self.ignore_version_mismatch)
        data = tagged.tag_object(tag, data)
        return data

(link)

When reading the YAML tree, custom types are not immediately converted. Instead, each basic node in the parsed YAML tree (consisting of scalar types, strings, lists, and dicts) is tagged by adding an attribute to the node. This is done by asdf.tagged.tag_object:

def tag_object(tag, instance, ctx=None):
    """
    Tag an object by wrapping it in a ``Tagged`` instance.
    """
    if isinstance(instance, Tagged):
        instance._tag = tag
    elif isinstance(instance, dict):
        instance = TaggedDict(instance, tag)
    elif isinstance(instance, list):
        instance = TaggedList(instance, tag)
    elif isinstance(instance, str):
        instance = TaggedString(instance)
        instance._tag = tag
    else:
        from . import AsdfFile, yamlutil
        if ctx is None:
            ctx = AsdfFile()
        try:
            instance = yamlutil.custom_tree_to_tagged_tree(instance, ctx)
        except TypeError:
            raise TypeError("Don't know how to tag a {0}".format(type(instance)))
        instance._tag = tag
    return instance

(link)
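
As a rough illustration (the tag string and content below are made up), wrapping a plain dict this way preserves both the original mapping behavior and the tag:

from asdf import tagged

# Hypothetical tag and content, purely for illustration.
node = tagged.tag_object("tag:stsci.edu:asdf/core/example-1.0.0", {"a": 1})

node["a"]    # still behaves like the underlying dict
node._tag    # the tag is carried along for later validation and conversion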

Before actually converting the tagged YAML tree to a tree containing custom types, asdf performs schema validation.

Schema Validation