dharma.tree

XML tree representation.

Node types are: Tree, Tag, Comment, String, Instruction. Attributes are not represented as nodes, even though the XPath data model treats them as such, because a node-based representation of attributes would be awkward to use in Python.

All node types derive from an abstract base class Node. Tree and Tag nodes derive from a Branch abstract class, which itself derives from Node. There is no inheritance relationship between concrete node types. For instance, Comment is not a subclass of String, unlike in bs4. Thus, to check whether a node is of a given concrete type T, isinstance(node, T) is sufficient. Likewise, to check whether a node is a branch or a leaf, isinstance(node, Branch) is sufficient.
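The hierarchy described above can be pictured with the following stand-in classes. This is only a sketch: the class bodies are empty, the string-like base classes of String and Comment are left out, and the helper kind() is hypothetical rather than part of the module.

```python
# Stand-in classes that mirror the documented hierarchy (bodies omitted).
class Node:
    """Abstract base of every node type."""

class Branch(Node, list):
    """Abstract base of nodes that can hold children."""

class Tree(Branch): pass
class Tag(Branch): pass
class String(Node): pass
class Comment(Node): pass
class Instruction(Node): pass

def kind(node):
    """Hypothetical helper: tell branches from leaves."""
    return "branch" if isinstance(node, Branch) else "leaf"

print(kind(Tag()), kind(Comment()))    # branch leaf
print(isinstance(Comment(), String))   # False: no link between concrete types
```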

When parsing documents and when modifying attributes, we always normalize spaces in attributes: we replace all sequences of whitespace characters with " " and we trim whitespace from both sides.
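The normalization rule can be sketched as follows. The function name normalize_attribute is hypothetical, and the sketch assumes that "whitespace" means the four XML whitespace characters (space, tab, carriage return, line feed); the module may use a broader definition.

```python
import re

# Assumption: only the four XML whitespace characters count as whitespace.
_XML_SPACE = re.compile(r"[ \t\r\n]+")

def normalize_attribute(value: str) -> str:
    """Collapse each whitespace run to one space, then trim both ends."""
    return _XML_SPACE.sub(" ", value).strip(" ")

print(repr(normalize_attribute("  first \n\t second  ")))  # 'first second'
```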

For simplicity, we do not deal with XML namespaces at all. We just remove namespace prefixes from both element and attribute names. Thus, <xsl:template> becomes <template>, and <foo xml:lang="eng"> becomes <foo lang="eng">. This means that we cannot deal with documents where namespaces are significant. It also means that we cannot faithfully serialize XML documents that originally used namespaces.
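As an illustration of what removing prefixes amounts to, here is a minimal sketch. The function strip_prefix is hypothetical and only shows the effect on names; it says nothing about how the parser does this internally.

```python
def strip_prefix(qname: str) -> str:
    """Drop the namespace prefix of a qualified name, if it has one."""
    # rpartition returns ("", "", qname) when there is no colon.
    return qname.rpartition(":")[2]

print(strip_prefix("xsl:template"))  # template
print(strip_prefix("xml:lang"))      # lang
print(strip_prefix("foo"))           # foo
```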

We use XPath expressions for searching and matching, but only support a small subset of the language. Most notably, it is only possible to select Tag and Tree nodes. Other types of nodes, attributes in particular, can only be used in predicates, as in foo[@bar]. We also do not support expressions that index node sets in some way: testing the position of a node in a node set or evaluating the length of a node set is not possible.

XPath expressions can use the following functions:

glob(pattern[, text])

Checks if text matches the given glob pattern. If text is not given, it defaults to the node's text contents.

regex(pattern[, text])

Like glob, but for regular expressions. Matching is unanchored, so use ^ and $ to match a full string.
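The difference between the two matching functions above can be illustrated outside of XPath with the standard library. This sketch assumes that glob behaves like fnmatch-style, case-sensitive matching against the whole text and that regex behaves like re.search; the actual functions may differ in details.

```python
import fnmatch
import re

def glob(pattern: str, text: str) -> bool:
    """Whole-string glob match (assumed fnmatch semantics)."""
    return fnmatch.fnmatchcase(text, pattern)

def regex(pattern: str, text: str) -> bool:
    """Unanchored regular expression match."""
    return re.search(pattern, text) is not None

print(glob("chap*", "chapter"))   # True
print(regex("apt", "chapter"))    # True: matches anywhere in the text
print(regex("^apt$", "chapter"))  # False: anchored to the full string
```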

lang(), mixed(), empty(), plain(), errors(), name()

Each of these functions returns the value of the corresponding Node attribute.

To evaluate an expression, we first convert it to straightforward Python source code, then compile the result, and finally run the code. Compiled expressions are saved in a global table and are systematically reused. There is no cache eviction policy for now.
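The translate, compile, and reuse cycle can be sketched on a toy scale. Everything here is hypothetical: the names to_python and compile_expr, the representation of nodes as plain dicts, and the tiny grammar (only name and name[@attr]) are assumptions made for illustration, not the module's actual translator.

```python
import re

_COMPILED = {}  # global table: expression text -> compiled function
_TOY_GRAMMAR = re.compile(r"(\w+)(?:\[@(\w+)\])?")

def to_python(expr: str) -> str:
    """Translate a toy expression into Python source for a lambda."""
    m = _TOY_GRAMMAR.fullmatch(expr)
    if m is None:
        raise ValueError(f"unsupported expression: {expr!r}")
    name, attr = m.groups()
    source = f"node['name'] == {name!r}"
    if attr is not None:
        source += f" and {attr!r} in node['attrs']"
    return f"lambda node: {source}"

def compile_expr(expr: str):
    """Return the compiled matcher for expr, reusing it when possible."""
    func = _COMPILED.get(expr)
    if func is None:
        code = compile(to_python(expr), "<xpath>", "eval")
        func = _COMPILED[expr] = eval(code)
    return func

matcher = compile_expr("foo[@bar]")
print(matcher({"name": "foo", "attrs": {"bar": "1"}}))  # True
print(compile_expr("foo[@bar]") is matcher)             # True: reused
```

Since nothing is ever removed from the table in this sketch, its size grows with the number of distinct expressions evaluated.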

Location

Represents the location of a node in an XML file. Fields are:

start: byte index of the start of the node
end: byte index of the end of the node
line: line number (one-based)
column: column number (one-based)
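To make the four fields concrete, here is a sketch that derives them from a byte offset. The named tuple and the helper span_location are hypothetical. The sketch also assumes that end is exclusive and that columns are counted in bytes, neither of which is stated in this documentation.

```python
from typing import NamedTuple

class Location(NamedTuple):
    """Stand-in with the four documented fields."""
    start: int
    end: int
    line: int
    column: int

def span_location(data: bytes, start: int, end: int) -> Location:
    """Compute one-based line and column for the byte offset start."""
    before = data[:start]
    line = before.count(b"\n") + 1
    # rfind returns -1 when there is no earlier newline, so line_start is 0.
    line_start = before.rfind(b"\n") + 1
    return Location(start, end, line, start - line_start + 1)

data = b"<a>\n  <b/>\n</a>"
start = data.index(b"<b/>")
loc = span_location(data, start, start + len(b"<b/>"))
print(loc.line, loc.column)      # 2 3
print(data[loc.start:loc.end])   # b'<b/>'
```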

parse_string

def parse_string(source, path=None)

Parse an XML string into a Tree. If path is given, it is used as the filename in error messages and is accessible through the file attribute.

parse

def parse(file, path=None)

Parse an XML file into a Tree. The file argument can either be a file-like object or a string that indicates the file's path. The path argument can be used to indicate the file's path, for error messages. If it is not given, the path of the file is deduced from file, if possible.

Node Objects

class Node()

assigned_lang

Language assigned by the user.

inferred_lang

Actual language, inferred by bubbling up the language of child elements.

tree

@property
def tree()

The Tree this node belongs to. If the node is a Tree, this returns the Tree itself.

file

@property
def file()

Path of the XML file this subtree was constructed from.

location

Location object that indicates the boundaries of the subtree in the XML source it was constructed from. If the subtree does not come from an XML file, location is None. Likewise, location is set to None if the subtree is modified in some way, extracted from the original tree, or copied from it.

parent

@property
def parent()

Parent node. All nodes have a parent, except Tree nodes, whose parent is None.

path

@property
def path()

The path of this node. See the locate method.

mixed

@property
def mixed()

Whether this node has both Tag and non-blank String children. This can only be called on Branch nodes.

empty

@property
def empty()

True if this node has neither Tag children nor non-blank String children. This can only be called on Branch nodes.

plain

@property
def plain()

True if this node is a String, or if it is a Branch that has no children or only String children (discounting comments and processing instructions).
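The three predicates mixed, empty, and plain can be compared side by side on a toy model of a branch. In this sketch, children are (kind, value) pairs and the functions take the list of children directly; both choices are assumptions made for illustration and do not reflect how the properties are implemented.

```python
def _has_tag(children):
    return any(kind == "tag" for kind, _ in children)

def _has_text(children):
    # A String child counts only if it is not blank.
    return any(kind == "string" and value.strip() != ""
               for kind, value in children)

def mixed(children):
    return _has_tag(children) and _has_text(children)

def empty(children):
    return not _has_tag(children) and not _has_text(children)

def plain(children):
    # Comments and processing instructions are ignored; only tags matter.
    return not _has_tag(children)

para = [("string", "Hello "), ("tag", "b"), ("string", "!")]
blank = [("string", "\n  "), ("comment", "note")]
print(mixed(para), empty(para), plain(para))     # True False False
print(mixed(blank), empty(blank), plain(blank))  # False True True
```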

source

@property
def source()

XML source of this subtree, as it appears in the file it was parsed from, as a string. If the location member is None, source will also be None.

byte_source

@property
def byte_source()

XML source of this subtree, in bytes, encoded as UTF-8.

root

@property
def root()

The root Tag node of the Tree this subtree belongs to.

stuck_child

def stuck_child()

Returns the first Tag child of this node, if it has one and if no non-blank text precedes it. Can only be called on Branch nodes.

stuck_following_sibling

def stuck_following_sibling()

Returns the first following Tag sibling of this node, if it has one and if no non-blank text lies between the two. Can only be called on Tag nodes.

delete

def delete()

Removes this node and all its descendants from the tree. Returns the removed subtree.

locate

def locate(path)

Finds the node that matches the given XPath expression. This only works for basic expressions of the form /, /foo[1], /foo[1]/bar[5], etc. The path of a node is given by its path attribute.
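The shape of these basic paths can be made explicit with a small parser. The function parse_path is hypothetical and only splits a path into (name, position) steps; it does not walk a tree and is not the method's actual implementation.

```python
import re

# One step looks like /name[position].
_STEP = re.compile(r"/([^/\[\]]+)\[(\d+)\]")

def parse_path(path: str):
    """Split a basic path into a list of (name, position) steps."""
    if not path.startswith("/"):
        raise ValueError(f"not a basic path: {path!r}")
    if path == "/":
        return []  # the document itself, no steps
    steps, pos = [], 0
    while pos < len(path):
        m = _STEP.match(path, pos)
        if m is None:
            raise ValueError(f"not a basic path: {path!r}")
        steps.append((m.group(1), int(m.group(2))))
        pos = m.end()
    return steps

print(parse_path("/foo[1]/bar[5]"))  # [('foo', 1), ('bar', 5)]
print(parse_path("/"))               # []
```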

find

def find(path)

Finds nodes that match the given XPath expression. Returns a list of matching nodes.

first

def first(path)

Like the find method, but returns only the first matching node, or None if there is no match.

match_func

@staticmethod
def match_func(path)

Returns a function that matches the given path if called on a Node object. See the documentation of Node.matches().

matches

def matches(path)

Checks if this node matches the given XPath expression. Returns a boolean.

The expression is evaluated like an XSLT pattern. For details, see the XSLT 1.0 standard, under § 5.2 Patterns.

children

def children()

Returns a list of Tag children of this node.

replace_with

def replace_with(other)

Removes this node and its descendants from the tree, and puts another node in its place. Returns the removed subtree.

text

def text(space="default")

Returns the text contents of this subtree. By default, whitespace is normalized as with XPath's normalize-space(); to prevent this, pass space="preserve".
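The effect of the space parameter can be illustrated with the standard library's xml.etree.ElementTree. This is a sketch of the behavior only; the function below is a hypothetical stand-in and does not use this module's node classes.

```python
import xml.etree.ElementTree as ET

def text(element, space="default"):
    """Concatenate all text of a subtree, normalizing spaces by default."""
    raw = "".join(element.itertext())
    if space == "preserve":
        return raw
    # Same effect as XPath normalize-space(): collapse runs and trim.
    return " ".join(raw.split())

para = ET.fromstring("<p>  Hello\n <b>big</b>   world </p>")
print(repr(text(para)))                    # 'Hello big world'
print(repr(text(para, space="preserve")))  # '  Hello\n big   world '
```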

xml

def xml(strip_comments=False,
        strip_instructions=False,
        html=False,
        color=False)

Returns an XML representation of this subtree.

If html is true, the result will be escaped, for inclusion in an HTML file. If color is true, the result will be colorized, either through CSS classes (if html is true), or with ANSI escape codes (otherwise).

copy

def copy()

Makes a copy of this subtree. The returned object holds no reference to the original. It is bound to a new Tree.

unwrap

def unwrap()

Removes this node from the tree but leaves its descendants in place. Returns the detached node.

This cannot be called on a Tree node. Also note that unwrapping the root Tag node of a Tree might yield an invalid XML document that contains several roots.

coalesce

def coalesce()

Coalesces adjacent string nodes and removes empty string nodes from this subtree. Has no effect on leaf nodes. In particular, if this is called on an empty String node, this node will not be removed from the tree.
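The merging rule can be sketched on a flat list of children. In this simplified, hypothetical model, text nodes are plain str values, any other object stands for a non-text node, and a new list is returned; the real method works in place on a whole subtree.

```python
def coalesce(children):
    """Merge adjacent strings and drop empty ones (single level only)."""
    result = []
    for child in children:
        if isinstance(child, str):
            if child == "":
                continue  # empty text nodes are removed
            if result and isinstance(result[-1], str):
                result[-1] += child  # merge with the preceding text node
                continue
        result.append(child)
    return result

tag = ("tag", "b")
print(coalesce(["a", "", "b", tag, "", "c"]))  # ['ab', ('tag', 'b'), 'c']
```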

Branch Objects

class Branch(Node, list)

Base class for non-leaf nodes, namely Tree nodes and Tag nodes.

Branches are represented as lists of nodes. They support most list operations. Those that are not implemented will raise an exception if called.

Tree Objects

class Tree(Branch)

Tree represents the XML document proper. It must contain a single tag node and optionally comments and processing instructions. Tree objects constructed from files also hold blank String nodes for new lines, etc.

Tag Objects

class Tag(Branch)

Represents element nodes.

Tag objects have both a list-like and a dict-like interface.

When they are indexed with integers, the children of the node are accessed. When they are indexed with strings, XML attributes are accessed. Indexing an attribute that the Tag does not possess is not treated as an error, and returns the empty string.

The in operator also takes types into account: if a Node is given, it checks whether this node is a child of the Tag. Otherwise, it assumes the argument is an attribute name, and checks whether the Tag bears this attribute.

Iterating over a Tag node yields the tag's children. The methods keys(), values() and items() can be used for iterating over attributes.

__init__

def __init__(name, *attributes_iter, **attributes)

The argument name is the name of the node as a string, e.g. "html". This argument can be followed by a single positional argument attributes_iter. If given, it must be an iterator that returns tuples of the form (key, value), or a dict subclass. Attributes can also be passed as keyword arguments with **attributes.

Attribute ordering is preserved for attributes passed through attributes_iter, which is the reason this parameter exists. New attributes created manually with e.g. node["attr"] = "foo" are added at the end of the attributes list. (We use an OrderedDict under the hood.)
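The dual list-like and dict-like interface can be pictured with a small stand-in class. SketchTag is hypothetical and covers only indexing, the in operator, and keys(); it uses a plain dict (which preserves insertion order) instead of an OrderedDict and tells attribute names from nodes by checking for str, which is an assumption of this sketch.

```python
class SketchTag(list):
    """Toy element: children live in the list, attributes in a dict."""

    def __init__(self, name, *attributes_iter, **attributes):
        super().__init__()
        self.name = name
        self.attrs = {}
        for pairs in attributes_iter:
            self.attrs.update(pairs)  # accepts (key, value) pairs or a dict
        self.attrs.update(attributes)

    def __getitem__(self, key):
        if isinstance(key, str):
            return self.attrs.get(key, "")  # missing attribute: empty string
        return super().__getitem__(key)

    def __setitem__(self, key, value):
        if isinstance(key, str):
            self.attrs[key] = value  # new attributes go at the end
        else:
            super().__setitem__(key, value)

    def __contains__(self, item):
        if isinstance(item, str):
            return item in self.attrs
        return any(child is item for child in self)

    def keys(self):
        return self.attrs.keys()

link = SketchTag("a", [("href", "x.html")], id="top")
child = SketchTag("b")
link.append(child)
link["title"] = "Home"
print(link[0] is child, link["href"], repr(link["missing"]))  # True x.html ''
print("id" in link, child in link, list(link.keys()))
```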

String Objects

class String(Node, collections.UserString)

Represents a text node.

String nodes behave like normal str objects, but they can also be edited in-place with the following methods.

clear

def clear()

Sets this String to the empty string.

append

def append(data)

Adds text at the end of this String.

prepend

def prepend(data)

Adds text at the beginning of this String.

insert

def insert(index, data)

Adds text at the given index of this String.
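In-place editing of a string-like object is possible because collections.UserString keeps its contents in a mutable data attribute. The class below is a hypothetical stand-in that shows one way the four methods could behave; it is not the module's String class.

```python
import collections

class EditableString(collections.UserString):
    """Toy text node: behaves like str but can be edited in place."""

    def clear(self):
        self.data = ""

    def append(self, data):
        self.data += str(data)

    def prepend(self, data):
        self.data = str(data) + self.data

    def insert(self, index, data):
        self.data = self.data[:index] + str(data) + self.data[index:]

s = EditableString("bar")
s.append("!")
s.prepend("foo")
s.insert(3, " ")
print(s, s.upper())  # foo bar! FOO BAR!
```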

Comment Objects

class Comment(Node, collections.UserString)

Represents a comment.

Comment nodes behave like strings.

Instruction Objects

class Instruction(Node)

Represents a processing instruction.

Initial XML declarations, e.g. <?xml version="1.0"?>, are also represented as processing instructions.

Error Objects

class Error(Exception)

Raised for parsing errors, that is, for malformed XML files. Schema errors do not raise exceptions.
